This article walks through a few common machine learning algorithms for readers who want to learn the key machine learning concepts using freely available materials and resources along the way. The prime objective of this outline is to help you wade through the numerous free options that are available. There are many, to be sure, but which are the best? Which complement one another? What is the best order in which to use selected resources?

Common machine learning algorithms include:

Decision tree

SVM

Naive Bayes

KNN

K-Means

Random forest

Below, these common machine learning algorithms are briefly explained, with Python and R code.

Decision Tree

This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes (independent variables) to make the groups as distinct as possible.
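To make "as distinct as possible" concrete, here is a minimal sketch of one common way to score a candidate split: Gini impurity, the criterion used by default in the scikit-learn code that follows. The function names and the toy labels are illustrative assumptions, not part of any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Hypothetical toy labels: two candidate splits of the same 8 samples
mixed_split = split_impurity(["yes", "no", "yes", "no"], ["yes", "no", "yes", "no"])
clean_split = split_impurity(["yes", "yes", "yes", "yes"], ["no", "no", "no", "no"])

# The split with the lower weighted impurity yields the more homogeneous groups
print(mixed_split, clean_split)
```

A tree-growing procedure would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.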

Python code:

# Import library
# Import other necessary libraries like pandas, numpy...
from sklearn import tree
# Assumes you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create tree object
# For classification, the criterion can be 'gini' or 'entropy' (information gain); the default is 'gini'
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor()  # for regression
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

R code:

library(rpart)
x <- cbind(x_train, y_train)
# Grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

SVM (Support Vector Machine)

This is a classification method. In this algorithm, we plot each data item as a point in an n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features, like height and hair length of an individual, we’d first plot these two variables in a two-dimensional space where each point has two coordinates. The points that end up closest to the dividing boundary are known as support vectors.

Now, we will find a line that splits the data between the two differently classified groups. The best choice is the line whose distance to the closest points in each of the two groups is as large as possible; this distance is called the margin.
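As a rough illustration of "farthest from the closest points", the sketch below compares two candidate lines on made-up 2-D data by computing the smallest distance from any point to each line. It only scores given lines; it does not search for the best one, which is what an SVM solver does. The function name and the data are assumptions for illustration.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    labels are +1 or -1; a positive result means every point lies on the
    correct side of the line.
    """
    norm = math.hypot(*w)
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D data: two groups separated along the first feature
points = [(1.0, 1.0), (2.0, 3.0), (6.0, 2.0), (7.0, 4.0)]
labels = [-1, -1, 1, 1]

# Two candidate vertical lines, x = 4 and x = 3; both separate the groups
margin_a = geometric_margin((1.0, 0.0), -4.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), -3.0, points, labels)

# The line with the larger margin is the preferred separator
print(margin_a, margin_b)
```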

Python code:

# Import library
from sklearn import svm
# Assumes you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create SVM classification object
# Various options are associated with it; this is the simple default for classification.
# See the scikit-learn documentation for more detail.
model = svm.SVC()
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

R code:

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

Naive Bayes

This is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about three inches in diameter. Even if these features depend on each other or on the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

The Naive Bayes model is easy to build and is particularly useful for very large datasets. Despite its simplicity, Naive Bayes can sometimes outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability: P(c|x) from P(c), P(x) and P(x|c).

P(c|x) is the posterior probability of class (target) given predictor (attribute).

P(c) is the prior probability of class.

P(x|c) is the likelihood which is the probability of predictor given class.

P(x) is the prior probability of predictor.
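These four quantities combine as P(c|x) = P(x|c) * P(c) / P(x). Below is a minimal worked sketch of that calculation; the fruit counts are made-up numbers chosen only to make the arithmetic easy to follow.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x

# Hypothetical counts: 100 fruits, 40 are apples, 50 are red,
# and 30 of the apples are red.
p_apple = 40 / 100            # P(c): prior probability of the class
p_red_given_apple = 30 / 40   # P(x|c): likelihood of the predictor given the class
p_red = 50 / 100              # P(x): prior probability of the predictor

# P(c|x): probability that a fruit is an apple given that it is red
p_apple_given_red = posterior(p_apple, p_red_given_apple, p_red)
print(p_apple_given_red)
```

The result agrees with counting directly: 30 of the 50 red fruits are apples.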

Python code:

# Import library
from sklearn.naive_bayes import GaussianNB
# Assumes you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create Gaussian Naive Bayes object
# Other variants exist for other feature distributions, such as multinomial and Bernoulli Naive Bayes
model = GaussianNB()
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code:

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

KNN (K-Nearest Neighbors)

This can be used for both classification and regression problems. However, it is more widely used in classification problems in the ML industry. K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their K neighbors. A new case is assigned to the class that is most common among its K nearest neighbors, as measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three functions are used for continuous variables and Hamming is used for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing KNN modeling.
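The voting rule and the distance functions can be sketched in a few lines of plain Python. This is a minimal illustration under assumed toy data, not how an optimized library implements it; all function names here are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def hamming(a, b):
    """Number of positions where two categorical vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def knn_predict(train_points, train_labels, query, k, distance=euclidean):
    """Label the query by majority vote among its k nearest training cases."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(train_points, train_labels, (2, 2), k=3)
near_second_group = knn_predict(train_points, train_labels, (7, 7), k=1)
print(near_first_group, near_second_group)
```

Note that every prediction sorts the whole training set, which is one reason KNN becomes expensive on large data.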

KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have no information, you might like to find out about their close friends and the circles they move in to gain access to their information!

Things to consider before selecting KNN:

KNN is computationally expensive.

Variables should be normalized, or else higher-range variables can bias it.

Invest more effort in the pre-processing stage, such as outlier and noise removal, before applying KNN.
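To illustrate the normalization point, here is a minimal sketch of min-max scaling, one common way (among several) to put features on a comparable range before computing distances. The feature values are made up and the function name is an assumption.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no distance information
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical features on very different scales
income = [30000, 45000, 60000, 120000]
age = [25, 35, 45, 55]

scaled_income = min_max_scale(income)
scaled_age = min_max_scale(age)

# After scaling, both features span 0..1, so neither dominates the distance
print(scaled_income, scaled_age)
```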

Python code:

# Import library
from sklearn.neighbors import KNeighborsClassifier
# Assumes you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code:

library(class)
# knn() has no separate fitting step: it classifies x_test directly from the training data
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

K-Means

This is a type of unsupervised algorithm that solves clustering problems. Its procedure follows a simple and easy way to partition a given dataset into a certain number of clusters (assume K clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to the other clusters.

Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters/populations are present!

How K-means forms a cluster:

1. K-means picks K points, one for each cluster, known as centroids.
2. Each data point forms a cluster with its closest centroid, i.e. K clusters.
3. Find the centroid of each cluster based on the existing cluster members. Here, we have new centroids.
4. As we have new centroids, repeat Steps 2 and 3. Find the closest distance for each data point from the new centroids and associate it with the new K clusters. Repeat this process until convergence occurs, i.e. the centroids do not change.
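The loop described above can be sketched in plain Python. This is a minimal illustration, assuming the starting centroids are supplied by the caller (real implementations usually pick them randomly or with a smarter seeding scheme); the function names and data are hypothetical.

```python
def closest(point, centroids):
    """Index of the centroid nearest to the point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)  # starting centroids, one per cluster
    for _ in range(max_iter):
        # Assign every point to its closest centroid
        labels = [closest(p, centroids) for p in points]
        # Recompute each centroid from its current members
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centroids.append(mean(members) if members else old)
        # Stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data with two visible groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, initial_centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids, labels)
```

An empty cluster simply keeps its old centroid here; that is one simple design choice among several.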

How to Determine the Value of K

In K-means, we have clusters and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster is the within-cluster sum of squares for that cluster. When the sum of squares values for all the clusters are added together, the result is the total within-cluster sum of squares for the cluster solution.

We know that as the number of clusters increases, this value keeps decreasing. But if you plot the result, you may see that the sum of squared distances decreases sharply up to some value of K, and then much more slowly after that. That point marks the optimum number of clusters.
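Here is a minimal sketch of that procedure on one-dimensional toy data: run K-means for several values of K and record the total within-cluster sum of squares each time. The tiny 1-D K-means and its deterministic starting points are assumptions made to keep the example self-contained and repeatable.

```python
def k_means_1d(values, k, max_iter=100):
    """Tiny 1-D K-means with a deterministic start; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: start from evenly spaced points of the sorted data
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda i: abs(v - centroids[i])) for v in values]
        updated = []
        for i in range(k):
            members = [v for v, label in zip(values, labels) if label == i]
            updated.append(sum(members) / len(members) if members else centroids[i])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def total_within_ss(values, centroids, labels):
    """Sum over all points of the squared distance to the assigned centroid."""
    return sum((v - centroids[label]) ** 2 for v, label in zip(values, labels))

# Hypothetical data with three visible groups
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]

curve = {}
for k in range(1, 6):
    centroids, labels = k_means_1d(values, k)
    curve[k] = total_within_ss(values, centroids, labels)

# Plotting curve against k would show where the decrease flattens out
print(curve)
```

On this toy data, which has three visible groups, the value drops steeply up to K = 3 and only slightly afterwards.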

Python code:

# Import library
from sklearn.cluster import KMeans
# Assumes you have X (attributes) for the training data set and x_test (attributes) of the test data set
# Create KMeans clustering object
model = KMeans(n_clusters=3, random_state=0)
# Fit the model using the training set
model.fit(X)
# Predict the cluster of each test point
predicted = model.predict(x_test)

R code:

library(cluster)
fit <- kmeans(X, 3)  # 3 cluster solution

Random Forest

Random forest is a trademark term for an ensemble of decision trees. In a random forest, we have a collection of decision trees known as a forest. To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification with the most votes (over all the trees in the forest).

Each tree is planted and grown as follows:

1. If the number of cases in the training set is N, then a sample of N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest's growth.
3. Each tree is grown to the largest extent possible. There is no pruning.
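The two sources of randomness and the final vote can be sketched with the Python standard library. This is a minimal illustration of the sampling and voting steps only; it does not grow any trees, and the function names and class labels are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_cases, rng):
    """Draw n_cases row indices at random, with replacement."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]

def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices out of n_features for one node split."""
    return rng.sample(range(n_features), m)

def forest_vote(tree_predictions):
    """Return the class that received the most votes across the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Training rows for one tree: N = 10 draws, so some rows repeat and some are left out
rows = bootstrap_indices(n_cases=10, rng=rng)

# Variables considered at one node: m = 3 out of M = 8
features = candidate_features(n_features=8, m=3, rng=rng)
print(rows, features)

# Hypothetical predictions from three trees for one new object
winner = forest_vote(["setosa", "versicolor", "setosa"])
print(winner)
```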

Python code:

# Import library
from sklearn.ensemble import RandomForestClassifier
# Assumes you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R code:

library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

That's all for this time!