This article presents a concise introduction to linear models: what they are, how to use them in R, and an overview of the statistical concepts involved.

[Updated March 3, 2019] Thanks to the comments made by u/BigMakondo and u/Catdci, I have included some examples of linearizable models.


What is a linear regression?

A linear regression is a statistical model that analyzes the relationship between a response variable (often called y) and one or more variables and their interactions (often called x or explanatory variables). [1]


What is a linear model?

A model is linear if it is linear in parameters or can be transformed to be linear in parameters (linearizable).

A model is linear in parameters if it can be written as a sum of terms, where each term is either a constant or a parameter multiplying a predictor (Xᵢ):

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

Let’s see some examples of models on which linear regression can be applied directly, and models that require some transformation prior to the application of a linear regression.

In both cases these are only illustrative examples and do not cover every possible case.

.

1. Models that aren't immediately linear but can become linear after a transformation (linearizable): linear models of non-linear relationships

For example, the model below describes a non-linear relationship (because the derivative of Y with respect to X₁ is a function of X₁). By creating a new variable W₁ = X₁², and re-writing the equation with W₁ replacing X₁², we obtain an equation that satisfies the definition of a linear model. [B]



Y = β₀ + β₁X₁² + ε   ⇒   (letting W₁ = X₁²)   Y = β₀ + β₁W₁ + ε
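As a sketch of how this works in R (the data and the true coefficients 3 and 2 below are made up purely for illustration):

```r
# Hypothetical data: y depends on the square of x1
set.seed(1)
x1 <- runif(100, -2, 2)
y  <- 3 + 2 * x1^2 + rnorm(100, 0, 0.5)

# Create w1 = x1^2 and fit an ordinary linear regression on it
w1  <- x1^2
fit <- lm(y ~ w1)
coef(fit)  # estimates should be close to the true values 3 and 2

# Equivalently, without creating w1 explicitly:
fit2 <- lm(y ~ I(x1^2))
```

Both calls fit the same model; `I()` simply tells R to treat `x1^2` as a computed predictor inside the formula.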

2. Models that aren’t immediately linear but can become linear after a transformation (linearizable) using natural logarithm

```r
x <- runif(100, 1, 3.5)
y <- exp(x) + rnorm(100, 0, 1)
plot(x, y)
```

This model may appear non-linear; however, it can be transformed into a linear model by applying the natural logarithm to both sides, after which it satisfies the definition of a linear model.






Clearly, y is exponential in x, but (ignoring heteroskedasticity for now) a linear regression can still be used by taking the log of y and back-transforming the fitted values:

```r
points(x, exp(fitted(lm(log(y) ~ x))), pch = 20, col = 2)
```










Notice that we are still using a linear regression here: not in the sense that y is linear in x, but in the sense that we can apply transformations so that this becomes the case.
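The log transformation is exact when the error enters multiplicatively rather than additively; a minimal sketch, with made-up parameters a = 0.5 and b = 1.2 (both assumptions for illustration):

```r
# A model that is exactly linear after taking logs:
# y = exp(a + b*x) * error  =>  log(y) = a + b*x + log(error)
set.seed(2)
x <- runif(200, 1, 3.5)
y <- exp(0.5 + 1.2 * x) * exp(rnorm(200, 0, 0.1))

fit <- lm(log(y) ~ x)
coef(fit)  # estimates should be close to the true a = 0.5 and b = 1.2
```

With an additive error, as in the `exp(x) + rnorm(...)` example above, the transformation is only approximate, which is where the heteroskedasticity caveat comes in.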


3. Simple Linear model

Linear regression assumes that there exists a linear relationship between the response variable (also called the dependent variable), and the explanatory variables. [1].

In predictive models, the response variable is the variable we are interested in predicting.


Example I: Depreciation for a car based on vehicle age

A 5-year-old car is usually less expensive than a 1-year-old car, and this is mainly due to age. So here we find a linear relationship between:

value: response variable

age: explanatory variable

And yes, we are ignoring many other variables that affect the final value: mileage, fuel type, automatic or manual transmission, number of doors, color… We are considering only one input variable and one output variable. For that reason, this is a simple linear regression.
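A minimal sketch of this example in R; the prices below are invented purely for illustration:

```r
# Hypothetical data: age in years, resale value in dollars
age   <- c(1, 2, 3, 4, 5, 6, 7)
value <- c(24000, 21500, 19800, 17600, 16100, 14300, 12900)

fit <- lm(value ~ age)
coef(fit)  # intercept: estimated value of a new car;
           # slope: average yearly depreciation (negative)

# Predicted value of an 8-year-old car under this toy model:
predict(fit, newdata = data.frame(age = 8))
```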


Example II: Linear relationship between weight and age

We know that a 10-year-old kid usually weighs more than a 5-year-old kid. Bingo! We found another linear relationship. In this case [2]:

weight: response variable

age: explanatory variable


Linear regression: Practical Example

One of the most basic datasets is mtcars. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). [5]

```r
# Load package "datasets":
library(datasets)
# Load data "mtcars":
data(mtcars)
```


Objective

We want to analyze the effect of weight on fuel consumption, so let's take mpg as y and wt as x.


Linear regression applied to mtcars

```r
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_minimal()
```


Understanding the visualization

The points represent the pairs (wt, mpg) for each car in the dataset. The blue line is the regression of mpg on weight.


How can we understand the gray area?

By default, it is the 95% confidence interval for predictions from a linear model ("lm"). The documentation from ?geom_smooth states that:

The default stat for this geom is stat_smooth; see that documentation for more options to control the underlying statistical transformation.

Reviewing the documentation in ?stat_smooth, we find the methods used to compute the smoother's band and the supported arguments. One of these arguments is level: the confidence level to use (0.95 by default).

Building a thinner band involves setting level to 0.90 as the confidence level [3]:

```r
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.90)
```


What does the confidence interval mean?

In this case, it indicates that there is 95% confidence that the true regression line lies within the shaded region. It shows the uncertainty implicit in our estimate of the exact relationship between the response and the predictor variable. [4]
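The band that geom_smooth draws can also be computed directly with predict(); a sketch:

```r
fit  <- lm(mpg ~ wt, data = mtcars)

# Evaluate the fitted line and its 95% confidence band on a grid of weights
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))
ci   <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

head(ci)  # columns: fit (the line), lwr and upr (the band's edges)
```

Passing `level = 0.90` here narrows the band, exactly as it does in geom_smooth.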


Theory: How to understand the linear regression?

The Std. Error next to each Estimate is the standard deviation of the sampling distribution of the coefficient estimate under the standard regression assumptions. Such standard deviations are called standard errors of the corresponding quantity (here, the coefficient estimate).

The Residual standard error, often called s, is the standard deviation of the residuals. It is a measure of how close the fit is to the points.

The F statistic tests the null hypothesis that the full model fits no better than an intercept-only model; its associated p-value (not the statistic itself) is the probability of observing such a large F value if that null hypothesis were true. [10]

The p-value in the last row is the p-value for that test, essentially comparing the full fitted model with an intercept-only model. The larger the spread of the points around the fitted line, the larger the uncertainty of the model and the larger the p-value. Conventionally, p < 0.05 indicates that the relationship is likely significant, whereas p > 0.05 indicates it is not. [7]

The R² states how much of the variability in the data is described by the model. It is essentially one minus the ratio of the variance of the points around the fitted line (the sum of squared residuals) to the total variance in the dataset ( sum((mpg - mean(mpg))^2) ).

Multiple R-squared is simply a measure of R-squared for models that have multiple predictor variables. It measures the amount of variation in the response variable that can be explained by the predictors. The fundamental point is that when you add predictors to your model, the multiple R-squared will always increase, as any predictor will explain some portion of the variance.

Adjusted R-squared controls for this increase by adding a penalty for the number of predictors in the model. It therefore represents a balance between the most parsimonious model and the best-fitting model. Generally, a large difference between the multiple and adjusted R-squared indicates that you may have overfit your model. [9]
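To make the arithmetic concrete, both quantities can be computed by hand for the model mpg ~ wt from mtcars and checked against what summary() reports:

```r
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                   # variation around the fitted line
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in mpg

r2 <- 1 - ss_res / ss_tot
r2        # matches summary(fit)$r.squared, about 0.753

# Adjusted R-squared penalizes for the number of predictors p
n <- nrow(mtcars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2    # matches summary(fit)$adj.r.squared, about 0.745
```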


Example: How to understand the linear regression?

```r
m_model <- lm(mpg ~ wt, mtcars)
summary(m_model)
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -4.5432 -2.3647 -0.1252  1.4096  6.8727
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
# wt           -5.3445     0.5591  -9.559 1.29e-10 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 3.046 on 30 degrees of freedom
# Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
# F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
```


Multiple R-squared means our model explains 75.28% of the variation in mpg.

Multiple R-squared is 0.7528 and Adjusted R-squared is 0.7446; the small difference suggests there is little risk of overfitting.

The estimated coefficients (the slope for wt and the y-intercept) suggest that the best-fit prediction of mpg is 37.2851 + (-5.3445) * wt. [8]

The p-value < 0.05 indicates that the relationship is likely significant.


We can conclude that our model is reasonably robust.


Predicting values

The main reason we use a linear regression model is that we can use it to predict new values of the dependent variable based on the information we have about the dependent and independent variables. Using R:

```r
fitted_lm <- lm(mpg ~ wt, mtcars)
newdata <- data.frame(wt = runif(5, 1, 6))
newdata$predicted_mpg <- predict(fitted_lm, newdata = newdata)
newdata
#         wt predicted_mpg
# 1 4.490858     13.283861
# 2 5.742183      6.596194
# 3 5.820495      6.177658
# 4 4.428383     13.617756
# 5 5.605404      7.327205
```

(Since the wt values are drawn with runif, your exact numbers will differ from those shown.)


References