This article discusses the basics of linear regression and its implementation in the Python programming language.

Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.

Note: In this article, we refer to the dependent variable as the response and the independent variables as features, for simplicity.


To provide a basic understanding of linear regression, we start with its most basic version: simple linear regression.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.

It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

Let us consider a dataset where we have a value of the response y for every feature x:

x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
y | 1 | 3 | 2 | 5 | 7 | 8 | 8 | 9 | 10 | 12

For generality, we define:

x as the feature vector, i.e. x = [x_1, x_2, …, x_n],

y as the response vector, i.e. y = [y_1, y_2, …, y_n]

for n observations (in the above example, n = 10).
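These vectors can be written as NumPy arrays; here is a minimal sketch, using the same values as the full implementation later in this article:

import numpy as np

# feature vector and response vector for the n = 10 observations above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])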

A scatter plot of the above dataset looks like this:

Now, the task is to find the line that best fits the above scatter plot, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).
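As a minimal sketch, such a scatter plot can be drawn with matplotlib, reusing the x and y arrays defined above (the colour and marker settings are arbitrary choices):

import matplotlib.pyplot as plt

# scatter plot of the raw observations
plt.scatter(x, y, color="m", marker="o", s=30)
plt.xlabel('x')
plt.ylabel('y')
plt.show()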

This line is called the regression line.

The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i
Here,

h(x_i) represents the predicted response value for the ith observation, and

b_0 and b_1 are regression coefficients, representing the y-intercept and the slope of the regression line respectively.

To create our model, we must “learn” or estimate the values of regression coefficients b_0 and b_1. And once we’ve estimated these coefficients, we can use the model to predict responses!
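As a small illustrative sketch, prediction with a fitted model is just evaluating the line h(x) = b_0 + b_1 * x; the coefficient values below are placeholders, not the estimates derived later in this article:

# hypothetical coefficient values, for illustration only
b_0, b_1 = 0.0, 1.5

def predict(x_new, b_0, b_1):
    # h(x) = b_0 + b_1 * x
    return b_0 + b_1 * x_new

print(predict(4.5, b_0, b_1))  # prints 6.75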

In this article, we are going to use the Least Squares technique.

Now consider:

y_i = h(x_i) + e_i, i.e. e_i = y_i - h(x_i)

Here, e_i is the residual error in the ith observation.

So, our aim is to minimize the total residual error.



We define the squared error or cost function, J, as:

J(b_0, b_1) = (1 / (2n)) * Σ e_i²   (summing over i = 1 to n)

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum.
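As an illustrative sketch, J can be evaluated for any candidate pair of coefficients directly from this definition, reusing the NumPy arrays x and y defined above:

def cost(b_0, b_1, x, y):
    # residuals e_i = y_i - h(x_i)
    e = y - (b_0 + b_1 * x)
    n = np.size(x)
    # J(b_0, b_1) = (1 / (2n)) * sum of squared residuals
    return np.sum(e ** 2) / (2 * n)

Trying cost(...) with a few candidate pairs shows how J penalizes lines that fit the points poorly; least squares finds the pair that makes this value smallest.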

Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx

b_0 = m_y - b_1 * m_x

where m_x and m_y are the means of the x and y vectors, SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i - m_x)(y_i - m_y) = Σ x_i y_i - n * m_x * m_y
and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i - m_x)² = Σ x_i² - n * m_x²
Note: The complete derivation for finding the least squares estimates in simple linear regression can be found here.

Given below is the Python implementation of the above technique on our small dataset:



import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations
    n = np.size(x)
    # means of the x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)
    # sum of cross-deviations of y and x, and sum of squared deviations of x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plot the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1] * x
    # plot the regression line
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimate the coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plot the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

The output of the above piece of code is:

Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
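As a sanity check, NumPy's built-in least-squares fit can be used to confirm these estimates; np.polyfit with degree 1 fits the same straight line and returns its coefficients with the slope first:

# degree-1 least-squares fit; returns [slope, intercept], i.e. [b_1, b_0]
b_1, b_0 = np.polyfit(x, y, 1)
print(b_0, b_1)  # should agree with the values printed above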



And the graph obtained looks like this: a scatter plot of the observations (magenta points) together with the fitted regression line (green).



