Multiple Linear Regression: How It Works (Python Implementation)

Multiple linear regression

Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to observed data.

Clearly, it is nothing but an extension of simple linear regression.



Consider a dataset with p features (or independent variables) and one response (or dependent variable).

Also, the dataset contains n rows/observations.

We define:

X (feature matrix) = a matrix of size n × p where x_{ij} denotes the value of the jth feature for the ith observation.

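So,

X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}

and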

y (response vector) = a vector of size n where y_{i} denotes the value of response for ith observation.

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
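As a small illustration of these definitions, here is a minimal NumPy sketch. The numbers are made up for the example and are not taken from any real dataset; note that NumPy indexes from 0, so x_{ij} corresponds to X[i-1, j-1].

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])            # feature matrix, shape (n, p)
y = np.array([3.1, 3.9, 6.2, 9.8])    # response vector, shape (n,)

n, p = X.shape
print(n, p)       # 4 2
print(X[2, 1])    # 2nd feature of the 3rd observation: 1.5
```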





The regression line for p features is represented as:

h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}

where h(x_i) is the predicted response value for the ith observation and b_0, b_1, …, b_p are the regression coefficients.
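To make the formula concrete, the sketch below evaluates h(x_i) for a single observation with p = 3. The coefficient and feature values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

b0 = 1.0                            # intercept b_0 (hypothetical)
b = np.array([2.0, -0.5, 0.25])     # b_1, b_2, b_3 (hypothetical)
x_i = np.array([3.0, 4.0, 8.0])     # features x_{i1}, x_{i2}, x_{i3}

# h(x_i) = b_0 + b_1*x_{i1} + b_2*x_{i2} + b_3*x_{i3}
h_xi = b0 + np.dot(b, x_i)
print(h_xi)   # 1 + 6 - 2 + 2 = 7.0
```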

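Also, we can write:

y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + e_i

or

y_i = h(x_i) + e_i \implies e_i = y_i - h(x_i)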

where e_i represents the residual error in the ith observation.
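The residuals can be computed for all observations at once. This is a minimal sketch with hypothetical data and coefficients; it also computes the sum of squared residuals, which is the quantity that the least squares method minimizes.

```python
import numpy as np

# Hypothetical data: 3 observations, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([4.5, 5.0, 8.5])

b0 = 1.0                    # hypothetical intercept b_0
b = np.array([2.0, 0.5])    # hypothetical b_1, b_2

h = b0 + X @ b              # predicted responses h(x_i)
e = y - h                   # residuals e_i = y_i - h(x_i)
rss = np.sum(e ** 2)        # total squared residual error

print(e)      # [0.5 0.  1. ]
print(rss)    # 1.25
```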

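We can generalize our linear model a little bit more by representing feature matrix X with an extra leading column of ones, which multiplies the intercept b_0:

X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}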

So now, the linear model can be expressed in terms of matrices as:

y = Xb + e

where,

b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}

and

e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}
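In code, the matrix form relies on prepending a column of ones to the feature matrix so that the intercept b_0 is handled like any other coefficient. The sketch below uses hypothetical values; the name X_aug is introduced here only to distinguish the extended matrix from the raw features.

```python
import numpy as np

# Raw features, shape (n, p) = (3, 2). Values are hypothetical.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0]])

# Prepend a column of ones: shape becomes (n, p + 1).
X_aug = np.column_stack([np.ones(features.shape[0]), features])

b = np.array([1.0, 2.0, 0.5])    # [b_0, b_1, b_2], hypothetical
e = np.array([0.5, 0.0, 1.0])    # residual vector, hypothetical

# Matrix form of the model: y = Xb + e
y = X_aug @ b + e
print(X_aug.shape)   # (3, 3)
print(y)             # [4.5 5.  8.5]
```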

Now, we determine an estimate of b, i.e. b’, using the least squares method.

As already explained, the least squares method determines the b’ for which the total residual error (the sum of squared residuals) is minimized.

We present the result directly here:

b’ = (X’X)^{-1} X’y

where ’ applied to a matrix represents its transpose, while -1 represents the matrix inverse.
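The least squares estimate can be computed with NumPy as in the sketch below, which uses synthetic data with known (hypothetical) coefficients. Instead of forming the inverse explicitly, it solves the equivalent linear system (X’X) b’ = X’y, which is usually more stable numerically, and it cross-checks the result against np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Synthetic features plus the leading column of ones for the intercept.
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])

# Hypothetical "true" coefficients [b_0, b_1, b_2, b_3] and noisy responses.
b_true = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ b_true + 0.1 * rng.normal(size=n)

# b' = (X'X)^{-1} X'y, computed by solving (X'X) b' = X'y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_hat)
print(np.allclose(b_hat, b_lstsq))
```

With little noise and 100 observations, the estimate should land close to the coefficients used to generate the data.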

Knowing the least squares estimates, b’, the multiple linear regression model can now be estimated as:

y’ = Xb’

where y’ is the estimated response vector.
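A short sketch of this last step, on a tiny hypothetical dataset with one feature plus the column of ones. It also checks a standard property of the least squares fit: the residuals y – y’ are orthogonal to every column of X.

```python
import numpy as np

# Column of ones plus one feature; values are hypothetical.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 4.1, 5.9, 8.0])

b_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate b'
y_hat = X @ b_hat                           # estimated response vector y'
residuals = y - y_hat

print(b_hat)               # approximately [0.05 1.98]
print(X.T @ residuals)     # approximately [0. 0.]
```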

Note: The complete derivation for obtaining least squares estimates in multiple linear regression can be found here.

Given below is the implementation of multiple linear regression techniques on the Boston house pricing dataset using Scikit-learn.



import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# load the Boston housing dataset
# (load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston = datasets.load_boston(return_X_y=False)

# feature matrix (X) and response vector (y)
X = boston.data
y = boston.target

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# create a linear regression object and train it on the training set
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients: \n', reg.coef_)

# score on the test set (1 means perfect prediction)
print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot residual errors (predicted minus actual) against predicted values
plt.style.use('fivethirtyeight')

# residual errors in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train, color="green", s=10, label='Train data')

# residual errors in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test, color="blue", s=10, label='Test data')

# line for zero residual error
plt.hlines(y=0, xmin=0, xmax=50, linewidth=2)

plt.legend(loc='upper right')
plt.title("Residual errors")
plt.show()





The output of the above program looks like this:





Coefficients:
[ -8.80740828e-02  6.72507352e-02  5.10280463e-02  2.18879172e+00
  -1.72283734e+01  3.62985243e+00  2.13933641e-03 -1.36531300e+00
   2.88788067e-01 -1.22618657e-02 -8.36014969e-01  9.53058061e-03
  -5.05036163e-01]
Variance score: 0.720898784611





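and the residual error plot looks like this:

[Figure: residual errors (predicted minus actual) plotted against predicted values, for train and test data]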









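In the above example, the value printed as the variance score comes from reg.score, which returns the coefficient of determination R^2. It is closely related to the Explained Variance Score.

We define:

explained_variance_score = 1 – Var{y – y’}/Var{y}

where y’ is the estimated target output, y the corresponding (correct) target output, and Var is variance, the square of the standard deviation. R^2 equals this quantity whenever the residuals y – y’ average to zero.

The best possible score is 1.0; lower values are worse.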
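The formula above can be computed directly with NumPy. The sketch below also computes the coefficient of determination R^2, which is what scikit-learn's LinearRegression.score returns, so the two can be compared. The data is hypothetical; the two scores agree when the residuals average to zero and differ when the predictions are shifted by a constant.

```python
import numpy as np

def explained_variance(y, y_pred):
    # 1 - Var{y - y'} / Var{y}
    return 1.0 - np.var(y - y_pred) / np.var(y)

def r_squared(y, y_pred):
    # 1 - (sum of squared residuals) / (total sum of squares)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])   # residuals average to zero
y_shifted = y_pred + 1.0                  # same predictions, biased by +1

print(explained_variance(y, y_pred), r_squared(y, y_pred))
print(explained_variance(y, y_shifted), r_squared(y, y_shifted))
```

The explained variance ignores the constant shift, while R^2 penalizes it.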

Assumptions

Given below are the basic assumptions that a linear regression model makes regarding a dataset on which it is applied:

1. Linear relationship: The relationship between the response and the feature variables should be linear. The linearity assumption can be tested using scatter plots. As shown below, the 1st figure represents linearly related variables whereas the variables in the 2nd and 3rd figures are most likely non-linear. So, the 1st figure will give better predictions using linear regression.

2. Little or no multicollinearity: It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other, for example when one feature is strongly correlated with another.

3. Little or no autocorrelation: Another assumption is that there is little or no autocorrelation in the data. Autocorrelation occurs when the residual errors are not independent of each other. You can refer here for more insight into this topic.

4. Homoscedasticity: Homoscedasticity describes a situation in which the variance of the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. As shown below, figure 1 has homoscedasticity while figure 2 has heteroscedasticity.
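Two of these assumptions can be checked numerically with a few lines of NumPy. The sketch below is only illustrative and uses synthetic data: it inspects the feature correlation matrix as a rough multicollinearity check, and computes the Durbin-Watson statistic (values near 2 suggest little first-order autocorrelation, values near 0 suggest strong positive autocorrelation) as an autocorrelation check. The helper name durbin_watson is defined here and is not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Multicollinearity: x2 is almost a scaled copy of x1, x3 is unrelated.
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)   # 3 x 3 feature correlation matrix
print(np.round(corr, 2))              # large off-diagonal entry for (x1, x2)

def durbin_watson(residuals):
    # sum of squared successive differences / sum of squared residuals
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

# Autocorrelation: independent residuals vs. a random walk.
independent = rng.normal(size=n)
correlated = np.cumsum(independent)

print(durbin_watson(independent))   # close to 2
print(durbin_watson(correlated))    # close to 0
```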

As we reach the end of this article, we discuss some applications of linear regression below.

Applications:

1. Trend lines: A trend line represents the variation in some quantitative data with the passage of time (like GDP, oil prices, etc.). These trends usually follow a linear relationship. Hence, linear regression can be applied to predict future values. However, this method suffers from a lack of scientific validity in cases where other potential changes can affect the data.

2. Economics: Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumer spending, fixed investment spending, inventory investment, purchases of a country’s exports, spending on imports, the demand to hold liquid assets, labour demand, and labour supply.





3. Finance: The capital asset pricing model (CAPM) uses linear regression to analyze and quantify the systematic risks of an investment.





4. Biology: Linear regression is used to model causal relationships between parameters in biological systems.
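As an illustration of the first application, a trend line can be fitted and extrapolated with np.polyfit. The series below is hypothetical and exactly linear, so the fit recovers it perfectly; real series are noisy, and the caveat in item 1 about extrapolating trends still applies.

```python
import numpy as np

years = np.arange(2010, 2020)
t = years - years[0]            # years since the start, for numerical stability
values = 100.0 + 2.5 * t        # hypothetical, exactly linear series

# Fit a degree-1 polynomial (a straight line) to the series.
slope, intercept = np.polyfit(t, values, deg=1)

# Extrapolate the trend to the year 2022.
forecast_2022 = slope * (2022 - years[0]) + intercept
print(slope, intercept, forecast_2022)
```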




