Linear path

In this article, we will first discuss linear regression: what it is all about and how to do it in Python. We’ll then look at a technique for locally smoothing our estimates to better fit the data, i.e. using locally weighted linear regression (LWLR) to overcome underfitting.

Get the full code on GitHub

What is Linear Regression (LR)?

Let’s first understand what regression is. Regression is a kind of supervised learning where we have a target variable, or something we want to predict. The difference between regression and classification is that in regression our target variable is numeric and continuous.

LR is used for finding a linear relationship between the target and one or more predictors. The variable we are predicting is called the criterion variable (also the outcome variable, endogenous variable, or regressand), while the variable(s) we base our predictions on are called the predictor variables (also exogenous variables, or regressors). There are two types of linear regression: simple LR and multiple LR. When there is only one predictor variable, the method is called simple LR. The plot of a simple LR always forms a straight line.

Some properties of LR are:

Pros: Easy to interpret results, computationally inexpensive

Cons: Poorly models nonlinear data

Works with: Numeric values, nominal values

Finding best-fit lines with LR

When using regression, our main goal is to predict a numeric target value. One way to do this is to write out an equation for the target value with respect to the inputs. Let's assume we want to forecast our energy based on the quantity of food we eat and water we drink. One possible equation is:

Energy = 0.0015*food - 0.99*water

This is known as a regression equation. The values 0.0015 and -0.99 are known as regression weights. The process of finding these regression weights is called regression.

Linear regression means you can add up the inputs multiplied by some constants to get the output. There’s another type of regression, called nonlinear regression, in which this isn’t true; the output may be a function of the inputs multiplied together, for example:

Energy = 0.0015*food/water

With that in mind, we will now find the best-fit line for our data. Let’s load the data:

Load data from a file

The loadDataSet() function opens a text file with tab-delimited values and assumes the last value is the target value.
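(The embedded snippet is not reproduced here; the following is a minimal sketch of what loadDataSet() might look like, assuming a tab-delimited text file whose last column is the target value. The snippets below access data.dataMat and data.labelMat, suggesting the loaded arrays are held on a small data object; this sketch simply returns them.)

import numpy as np

def loadDataSet(fileName):
    # Each line is tab-delimited; the last value is the target
    dataMat, labelMat = [], []
    with open(fileName) as f:
        for line in f:
            curLine = line.strip().split('\t')
            dataMat.append([float(v) for v in curLine[:-1]])  # features
            labelMat.append(float(curLine[-1]))               # target value
    return dataMat, labelMat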

To visualize the data, we will use the following Python code to plot its distribution:

import numpy as np
import matplotlib.pyplot as plt

# Convert arrays to matrices
xMat = np.mat(data.dataMat)
yMat = np.mat(data.labelMat)

# Plot the raw data
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
ax.scatter(xMat[:,1].flatten().A[0], yMat.T[:,0].flatten().A[0])
plt.show()

Plot showing the linearly distributed data

Assume our input data is in the matrix X, and our regression weights in the vector w. For a given piece of data X1, our predicted value is given by:

yHat1 = X1^T * w

We have Xs and ys, but how can we find the ws? One way is to find the ws that minimize the error. We define error as the difference between the predicted y and the actual y. Using just the error would allow positive and negative values to cancel out, so we use the squared error:

sum over i of (y_i - x_i^T * w)^2

Writing this in matrix notation, we have:

(y - X*w)^T * (y - X*w)

If we take the derivative of this with respect to w, we’ll get:

-2 * X^T * (y - X*w)

We can set this to zero and solve for w to get the following final equation:

wHat = (X^T * X)^-1 * X^T * y

NB: The final equation involves a matrix inverse, so we must first check that the inverse exists before using it, or we will get an error.

Let's look at how the final equation can be implemented in Python.
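(The embedded snippet is not reproduced here; the following is a sketch of standRegres(), consistent with how it is called later in this article. It implements the final equation directly with NumPy matrices.)

import numpy as np

def standRegres(xArr, yArr):
    # Solve the normal equation: w = (X^T X)^-1 X^T y
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    # A zero determinant means xTx is singular and has no inverse
    if np.linalg.det(xTx) == 0.0:
        print('This matrix is singular, cannot do inverse')
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws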

We have used np.linalg.det(m) to find out if the matrix has an inverse: if the determinant equals zero, the matrix is singular and has no inverse.

Plot and show best-fit line

So far we have loaded our data and implemented linear regression based on the formula derived above. Now we will use it to show the best-fit line on the plot.

ws = standRegres(data.dataMat, data.labelMat)
xMat = np.mat(data.dataMat)
yMat = np.mat(data.labelMat)

# Sort the points so the line is drawn in order
xCopy = xMat.copy()
xCopy.sort(0)

# Our predicted values yHat using the weights (ws)
yHat = xCopy * ws

# Plot the data and the best-fit line
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:,1].flatten().A[0], yMat.T[:,0].flatten().A[0])
ax.plot(xCopy[:,1], yHat)
plt.show()

Plot showing best-fit line

Check the correlation coefficient

To calculate how well the predicted value yHat matches our actual data y, we check the correlation between the two series.

# Get the predicted values (unsorted)
yHat = xMat * ws

# Transpose yHat so we have both vectors as row vectors
>>> np.corrcoef(yHat.T, yMat)
array([[ 1.        ,  0.98647356],
       [ 0.98647356,  1.        ]])

np.corrcoef() finds the correlation coefficients between two series (matrices), telling us how similar they are.

From the above results, the elements on the diagonal are 1.0 because the correlation of yMat with itself is perfect, while the correlation between yHat and yMat is 0.98. Hence our model has done well in its prediction.

Locally Weighted Linear Regression (LWLR)

Linear regression has one problem: it tends to underfit the data. It gives the lowest mean-squared error among unbiased estimators, but with underfitting we aren’t getting the best predictions.

One way to reduce the mean-squared error is a technique known as LWLR. With LWLR, we give a weight to the data points near our data point of interest; then we compute a least-squares regression. The formula now becomes:

wHat = (X^T * W * X)^-1 * X^T * W * y

W here is a matrix used to weight the data points. LWLR uses a kernel, similar to the ones in SVMs, to weight nearby points more heavily than other points. The most common kernel to use is a Gaussian, which assigns a weight given by:

w(i,i) = exp(-|x_i - x|^2 / (2 * k^2))

From the formula above, the closer a data point x_i is to the point of interest x, the larger w(i,i) will be. We also see a constant k, a user-defined value that determines how much to weight nearby points, i.e. how quickly the decay happens. This is the only parameter we have to worry about with LWLR.

Pros

With a suitable k value, we can get a best fit for our data, free from overfitting and underfitting

Cons

It involves a lot of computation: you must use the entire dataset to find each single estimate

In Python code, we have the following:
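(The embedded snippet is not reproduced here; the following is a sketch of lwlr() and lwlrTest(), consistent with the description below and with how they are called in the plotting code further down.)

import numpy as np

def lwlr(testPoint, xArr, yArr, k=1.0):
    # Estimate the target at testPoint with locally weighted linear regression
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    # Diagonal weights matrix: one diagonal element per data point
    weights = np.mat(np.eye(m))
    for j in range(m):
        diffMat = testPoint - xMat[j, :]
        # Gaussian kernel: the weight decays exponentially with distance;
        # k controls how quickly the decay happens
        weights[j, j] = np.exp(diffMat * diffMat.T / (-2.0 * k ** 2))
    xTx = xMat.T * (weights * xMat)
    if np.linalg.det(xTx) == 0.0:
        print('This matrix is singular, cannot do inverse')
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws

def lwlrTest(testArr, xArr, yArr, k=1.0):
    # Call lwlr() for every point in the dataset
    testMat = np.mat(testArr)
    m = np.shape(testMat)[0]
    yHat = np.zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testMat[i], xArr, yArr, k)
    return yHat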

The function lwlr() creates matrices from the input data, then creates a diagonal weights matrix called weights. The weights matrix is a square matrix with as many diagonal elements as data points. The function then iterates over all of the data points and computes a weight that decays exponentially as you move away from testPoint; the input k controls how quickly the decay happens. After populating the weights matrix, we find the estimate for testPoint similarly to the function standRegres().

lwlrTest() calls lwlr() for every point in the dataset.

We said earlier that with a suitable k value we can get a best fit for our data, free from overfitting and underfitting. To that end, we will test 3 values of k (1.0, 0.01, and 0.003) and see which best fits our data.

We will use the following code to plot the best fit:

# Find the estimate yHat for all data points, with k = 0.01
k = 0.01
xMat = np.mat(data.dataMat)
yHat = lwlrTest(data.dataMat, data.dataMat, data.labelMat, k)

# The plot needs the data to be sorted, so we sort a copy of xMat
srtInd = xMat[:,1].argsort(0)
xSort = xMat.copy()
xSort.sort(0)

# Plot the fitted curve over the scattered data
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(xSort[:,1], yHat[srtInd])
ax.scatter(xMat[:,1].flatten().A[0], np.mat(data.labelMat).T.flatten().A[0], s=2, c='red')
plt.show()

Case k=1.0

With k=1.0, nothing changes; the model still underfits the data.

Case k=0.003

With k=0.003, our model overfits the data.

Case k=0.01

With k=0.01, we get the best-fit line, free from overfitting and underfitting.

End!

We saw how to find a best-fit line free from underfitting and overfitting using the LWLR method. However, as mentioned above, one problem with LWLR is that it involves numerous computations.

I hope you enjoyed reading this article. If you have suggestions regarding methods that work better than LWLR, or anything I missed, your comments are welcome.