K Nearest Neighbor (KNN) algorithms are conceptually among the simplest machine learning models to understand. For this reason they are also a great entry point for those new to machine learning. Support for KNN Regression was recently added to lurn — a machine learning gem for ruby. This post will give a brief introduction to KNN Regression and how to build a model using lurn.

This article is written for those new to regression and machine learning. If you are already familiar with KNN regression and want to know how to implement it with lurn you can skip to the last section.

About KNN Regression

Regression models are used to predict the value of a single variable, called the target variable. This prediction is based on the value of one or more other variables from the same observation, called predictors. For instance regression models could be used to predict a person’s income based on age and level of education. In this example the target is the person’s income and the predictors are age and eduction. In order to train, a model to make a prediction you have to provide it with a data set where both the predictors and the target are known. This is commonly known as the training data set.

KNN regression is a specific type of regression model which represents each observation in the training set as a point in an n-dimensional space where n is the number of variables in the predictors. In the income example above the data has 2 dimensions — age and education. To predict the income of a new person a KNN model would find a sample of people in the training set that are most similar to the new person and calculate the average of their income.

Walking through a real example might make this more clear. The scatter plot below shows the sepal length and width of each plant in the iris data set — a small database of iris plants. If you hover over any of the points you can see the petal length for that plant as well. For those interested the sepal of a flower is the leafy supporting structure at the base of the flower. More info here.

Using this as a training set, KNN Regression can be applied to predict the petal length of new plants. Suppose a new observation is found with a sepal length of 6.1 and sepal width 3.3. To estimate its petal length with knn regression (with k = 5) find the 5 training observations that are most similar to the newly observed plant and average their petal lengths. The plot below shows the new observation in red with its 5 nearest neighbors highlighted in green. Averaging the neighbors’ petal lengths, the estimated value for the new observation would be (5.6 + 6.0 + 5.4 + 4.5 + 4.8) / 5= 5.26.

Most real world examples will involve more than two predictors and are more difficult to visualize but the concept is the same.

Using KNN Regression in Ruby with lurn

The remainder of this article will walk through building a KNN regression model in lurn for predicting the fuel efficiency of a car. This will include setting up lurn, downloading the training data set, training the model, and finally making a prediction for a new car.

Setup

First you’ll need to install lurn. As of this writing KNN regression has not been released to rubygems so you’ll need to install from github. You can either:

Add it to your gemfile:



gem 'lurn', github: 'dansbits/lurn' # add to your gemfile bundle install # run in the terminal

Or install it manually

git clone git@github.com:dansbits/lurn.git

cd lurn

gem build lurn

gem install lurn-0.1.1.gem

This example will also use the daru gem so install it if you haven’t already.

gem install daru

The Automobile Data Set

The automobile dataset available here contains 205 records describing cars imported to the U.S. in 1985. The target variable in this example will be the city mpg and a subset of the other variables will be used as predictors. The code below will download the data set into a data frame, remove missing data, and subset the data to the variables needed for the model.

Run the above code in a ruby console or jupyter notebook and take a look at the resulting predictors and city_mpg variables to see what the input data looks like.

Training The Model

Training the model is as simple as initializing it with a K value and passing it arrays of predictors and target values. The below code is all you need to have a working model.

To predict the city fuel efficiency of a new vehicle create an array containing the same predictor variables used to train the model and pass it to the predict method. See the example below.

That’s it! You should now have all the tools you need to build your own KNN models.

Additional Resources

More on KNN models

lurn homepage