Imagine we have a set of labeled and unlabeled data, and we want to build a classifier that takes the unlabeled data as input and outputs a label for each point. In this situation, we need a classification model that learns from the already-labeled data (the training data). We then use that model to predict labels for the unlabeled data (the test data).

This type of machine learning is called supervised learning, which we can define as feeding labeled data into a machine learning algorithm. In doing so, we’re showing the algorithm which groups exist and which data belong to which groups.

There are many supervised learning models. Examples include Support Vector Machines (SVM), logistic regression, decision trees, factorization machines, random forests, and K-Nearest Neighbors (KNN), which is the focus of this article.

KNN is a non-parametric technique. For classification it uses k, the number of nearest neighbors to consult, to assign a data point to a class. It works through the following steps.

First, it calculates the distance between the point to classify and every point in the training data. Second, it finds the k training points that are closest based on those distances. Finally, it assigns the class held by the majority of those k neighbors.
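The three steps can be sketched directly in numpy. This is a minimal illustration, not how scikit-learn implements KNN internally: the function name `knn_predict`, the choice of Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)

    # Step 1: Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)

    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]

    # Step 3: majority vote among the labels of those k neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters labeled "black" and "red".
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]]
y_train = ["black", "black", "black", "red", "red", "red"]

print(knn_predict(X_train, y_train, query=[2.0, 2.0], k=3))  # prints "black"
```

The query point (2, 2) sits inside the "black" cluster, so all three of its nearest neighbors vote "black".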

K is a positive integer that you choose. If k is 1, the point is simply assigned to the class of its single nearest neighbor. The choice of k is very important in KNN: a larger k reduces the effect of noise, but a k that is too large can blur the boundaries between classes. To choose a good k, you can use scikit-learn’s GridSearchCV, which performs an exhaustive search over specified parameter values.
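As a rough sketch of that search, the snippet below tries every k from 1 to 15 with 5-fold cross-validation. The iris dataset, the range of k, and the number of folds are assumptions chosen for illustration; substitute your own labeled data and grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the built-in iris dataset stands in for your own labeled data.
X, y = load_iris(return_X_y=True)

# Candidate values of k to try (an arbitrary illustrative range).
param_grid = {"n_neighbors": list(range(1, 16))}

# Fit one KNN model per candidate k and score each with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)           # the k with the best mean validation accuracy
print(round(search.best_score_, 3))  # that mean accuracy
```

The best k reported here depends on the dataset and on how the folds are split, so treat it as a starting point and not as a universal setting.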

In the above plot, black and red points represent two different classes of data. We need to classify our blue point as either red or black. If k = 1, KNN picks the single nearest point and assigns the blue point to that point’s class. If k > 1, a majority vote among the k nearest points decides the class.
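To see how the answer can change with k, here is a small sketch using scikit-learn’s `KNeighborsClassifier`. The coordinates are made up for illustration and are not the points from the plot: one red point lies very close to the blue point, while the next-closest points are black.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: one red point right next to the origin,
# two red points far away, and three black points at moderate distance.
X_train = np.array(
    [[0.1, 0.0], [3.0, 3.0], [3.0, 4.0], [0.5, 0.0], [0.0, 0.6], [-0.7, 0.0]]
)
y_train = np.array(["red", "red", "red", "black", "black", "black"])

# The "blue" point we want to classify.
blue_point = np.array([[0.0, 0.0]])

predictions = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    predictions[k] = model.predict(blue_point)[0]
    print(k, predictions[k])
# prints:
# 1 red
# 3 black
```

With k = 1 the single closest point is red, so the blue point is labeled red. With k = 3 the neighbors are one red and two black points, so the majority vote flips the label to black.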

We’re going to work through a practical example using Python’s scikit-learn. To follow along, we need to install pandas, which we’ll use to work with dataframes, and numpy, which will help us work with arrays. Finally, we’ll install scikit-learn, a machine learning package for Python that provides algorithms like KNN.
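Once the packages are installed, a quick way to confirm the setup is to import each one and print its version. This is only a sanity check, and it assumes the packages were installed into the same environment that runs the script.

```python
# Assumes the three packages are already installed, for example with:
#   pip install pandas numpy scikit-learn
import numpy as np
import pandas as pd
import sklearn

# If any import above fails, that package is missing from this environment.
print("pandas", pd.__version__)
print("numpy", np.__version__)
print("scikit-learn", sklearn.__version__)
```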