This introduction to linear regression discusses a simple linear regression model with one predictor variable, and then extends it to the multiple linear regression model with at least two predictors.

By Pushpa Makhija, CleverTap.

Prediction has always fascinated people, driven by one key trait – the strong human desire to know what is coming next.

Let’s start with a simple question – “Where is prediction most relevant in your life today?”

Predictions are central to every aspect of our lives, whether we realize it or not. They range from big ones, such as predicting during our school days what we would love to do in the future in order to choose a career path, through everyday ones, such as checking today’s weather to decide how to dress or evaluating inventory numbers for the next day, to the less important predictions we make in our daily interactions with other people, like a student managing time to get to classes, or dining and socializing.

So, what’s a prediction?



A prediction, or forecast, is a statement about the future. It’s a guess, sometimes based on knowledge or experience, but not always.

Now, let’s consider a familiar example, the speed of an object, to understand how predictions play an important role in the real world, shaping our lives in ways we aren’t aware of at first and helping us make informed decisions.

Have you ever thought of or answered day-to-day questions like –

How fast can vehicles such as cars and trains go, and how are their speeds calculated?

When a police officer gives someone a speeding ticket, how does she know for sure if the person was speeding?

These questions take us back to a basic concept of physical systems – the speed of an object is the magnitude of its velocity (the rate of change of its position). In other words, the speed of an object is calculated by dividing the distance travelled by the time taken to travel that distance:

speed = distance / time

At first glance, this equation states that speed is a function of two quantities – distance and time. In a given problem, however, one of the three variables is usually held constant, and the equation, known as the rate formula, then describes a simple relationship between the other two. For example, at constant speed, distance = speed x time, so distance grows linearly with time.
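As a small illustration of the rate formula, the sketch below computes a speed from a distance and a time, and then shows that at a fixed speed the distance grows by the same amount each hour. The function names and the numbers are made up for illustration.

```python
def speed(distance_km: float, time_h: float) -> float:
    """Speed as the distance travelled divided by the time taken."""
    if time_h <= 0:
        raise ValueError("time must be positive")
    return distance_km / time_h


def distance_at_constant_speed(speed_kmh: float, time_h: float) -> float:
    """With speed held constant, distance is a linear function of time."""
    return speed_kmh * time_h


# Illustrative numbers only: 150 km covered in 2 hours.
average_speed = speed(150.0, 2.0)

# At that constant speed, each extra hour adds the same distance.
distances = [distance_at_constant_speed(average_speed, t) for t in (1, 2, 3)]
print(average_speed, distances)
```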

Perfect linear relationships of this kind, common in physical systems because of their inherent nature, are termed deterministic (or functional) relationships: an equation exactly describes the relationship between the two variables. Other examples are

Fahrenheit degree and Celsius degree (Fahr = 9/5 Cels + 32),

Circumference (Circumference = pi x diameter) and

Exchange rate conversion formula (new currency = (exchange rate) x (your currency))
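A deterministic relationship can be written directly as a function, since the same input always gives exactly the same output. The sketch below encodes two of the formulas listed above; the function names are chosen here for illustration.

```python
import math


def celsius_to_fahrenheit(celsius: float) -> float:
    """Fahr = 9/5 * Cels + 32: a straight line with slope 9/5 and intercept 32."""
    return 9 / 5 * celsius + 32


def circumference(diameter: float) -> float:
    """Circumference = pi * diameter: a straight line through the origin."""
    return math.pi * diameter


# Every input maps to exactly one output, so there is no scatter around the line.
freezing_f = celsius_to_fahrenheit(0)
boiling_f = celsius_to_fahrenheit(100)
unit_circle = circumference(2.0)
print(freezing_f, boiling_f, unit_circle)
```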

More often, we come across statistical relationships, where the relationship between the two variables is not perfect but still shows a positive or negative trend. Some examples are

Height and weight – as height increases, weight might increase but not perfectly

Driving speed and gas mileage – as driving speed increases, gas mileage is expected to decrease, but not perfectly
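One common way to put a number on "positive, but not perfect" is the Pearson correlation coefficient, which equals +1 or -1 only when the points lie exactly on a straight line. The sketch below computes it by hand on a small made-up set of heights and weights; the data values are invented for illustration and are not the data behind the article's plots.

```python
import math

# Made-up height (cm) and weight (kg) pairs, for illustration only.
heights = [150, 155, 160, 165, 170, 175, 180, 185]
weights = [52, 58, 55, 63, 68, 66, 75, 80]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Positive but below 1: weight tends to rise with height, though not perfectly.
r = pearson_r(heights, weights)
print(round(r, 3))
```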

When graphed, these imperfect relationships produce a scatter plot of points, as seen in the plot of height and weight for 30 adults below:

The above scatter plot reflects the relationship between height and weight as linear, depicting a positive increase in weight with height. We can thus fit a straight line to this data which would provide the best estimate of the observed trend.

The data points in the above scatter plot could be summarized in many ways, as shown by the various lines in the plot below:

Now the question arises – which of all the possible lines is the best fit line, the one that best summarizes the relationship between height and weight?

The best fit line is the one shown in blue. It is called the regression line, and it is the plot of the predicted score on y for each possible value of x.

The next question is – how do we arrive at this best line?

The best line fitting the given data is obtained by “minimizing the residual variation”, defined for each data point as

$$e = y - \hat{y}$$

where

$y$ is the actual observed value of the response variable,

$\hat{y}$ is the predicted value of the response variable (as obtained from the model), and

$e$ is the residual variation – the variation between the observed and predicted value of y

The closer the regression line comes to all the data points on the scatter plot, the better it is. In other words, the less the points vary around the line, the lower the prediction error.
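To make "closer to all the data points" concrete, the sketch below scores two candidate lines by their total squared residual, which is the measure used in ordinary least squares (an assumption here, since the text above does not name the measure). The data and both candidate lines are made up for illustration.

```python
# Made-up height (cm) and weight (kg) pairs, for illustration only.
heights = [150, 155, 160, 165, 170, 175, 180, 185]
weights = [52, 58, 55, 63, 68, 66, 75, 80]


def sum_squared_residuals(slope, intercept, xs, ys):
    """Total squared gap between each observed y and the line's predicted y."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = slope * x + intercept  # predicted value for this x
        residual = y - predicted           # observed minus predicted
        total += residual ** 2
    return total


# Two hypothetical candidate lines, picked by eye rather than fitted.
candidates = {
    "steeper line": (1.0, -105.0),
    "flatter line": (0.75, -61.0),
}
scores = {
    name: sum_squared_residuals(m, b, heights, weights)
    for name, (m, b) in candidates.items()
}

# The candidate with the smaller total is the better summary of these points.
best = min(scores, key=scores.get)
print(scores, best)
```

Comparing lines one pair at a time does not scale, which is why a fitting method is used to find the minimizing line directly.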

The best fit straight line that summarizes the data, as described above, can be obtained with a prediction method such as Simple Regression.
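A minimal sketch of such a fit follows, assuming the residual variation is measured as the sum of squared residuals (ordinary least squares). In that case the slope and intercept have a closed-form solution. The data values and function names are made up for illustration.

```python
# Made-up height (cm) and weight (kg) pairs, for illustration only.
heights = [150, 155, 160, 165, 170, 175, 180, 185]
weights = [52, 58, 55, 63, 68, 66, 75, 80]


def fit_simple_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the point of means
    return slope, intercept


slope, intercept = fit_simple_regression(heights, weights)


def predict(x):
    """Predicted weight for a given height, read off the fitted line."""
    return slope * x + intercept


# 172 cm is an arbitrary example height.
predicted_weight = predict(172)
print(slope, intercept, predicted_weight)
```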