One of the most central concepts in statistics is the concept of standard deviation and its relationship to all other statistical quantities such as the variance and the mean. Students in introductory courses are told to “just remember the formula” but, believe me, this is not the best way to explain a concept. In this post, I will try to provide a visual and intuitive explanation of the standard deviation.

Let’s say you have a list of grades, which in this case are our real-world measurements. We want to “compress” the information in those measurements into a handful of quantities that we can later use to compare, say, the grades of different classes or of different years. Because of our limited cognitive capacity, we do not want to go over the grades one by one to find out which class scored higher on average. We need to summarize those numbers. This is why we have descriptive statistics.

There are two ways to summarize the numbers: by quantifying their similarities or their differences. Ways of quantifying their similarity to one another are formally called “measures of central tendency”. Those measures include the mean, median and mode. Ways of quantifying their differences are called “measures of variability” and include the variance and standard deviation. The standard deviation should tell us how much the numbers in a set differ from one another, with respect to the mean.

Let’s take an actual example. Imagine that you collected those numbers for student grades (and, for the sake of simplicity, let’s assume those grades are the population).

\(2, 8, 9, 3, 2, 7, 1, 6\)

Let’s first plot those numbers in a simple scatter plot

Now that we have all the numbers in a scatter plot, the first step to calculate the variation is to find the center of those numbers: the average (or the mean).

\(\bar{x} = \frac{\sum_{n=1}^{N} x_{n}}{N} = \frac{2+8+9+3+2+7+1+6}{8} = \frac{38}{8} = 4.75\)
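If you would rather check this arithmetic in code, here is a minimal Python sketch. The variable names (`grades`, `mean`) are just illustrative choices.

```python
# Grades from the example above, treated as the whole population.
grades = [2, 8, 9, 3, 2, 7, 1, 6]

# The mean is the sum of the values divided by how many there are.
mean = sum(grades) / len(grades)

print(sum(grades))  # 38
print(mean)         # 4.75
```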

Visually, we can plot a line to indicate the mean grade.

Now that we have a line for the mean, the next step is to calculate the distance between each point and the mean and then square this distance. Remember that our goal is to calculate the variation of those numbers with respect to the mean. We can do this mathematically or visually.

As you see here, “squaring” is really nothing but drawing a square. Note that we can’t just take the sum of all the differences: some of them are positive and some are negative, so the negative ones cancel out the positive ones and the sum always ends up as zero, which tells us nothing about the spread. To resolve this, we square the differences (I will explain at the end why we use the square and not another measure such as the absolute value).
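The cancellation is easy to see with the eight grades from this example. The sketch below is only an illustration, and the name `deviations` is my own label for the signed differences from the mean.

```python
grades = [2, 8, 9, 3, 2, 7, 1, 6]
mean = sum(grades) / len(grades)

# Signed distance of each grade from the mean.
deviations = [x - mean for x in grades]

print(deviations)       # [-2.75, 3.25, 4.25, -1.75, -2.75, 2.25, -3.75, 1.25]
print(sum(deviations))  # 0.0
```

The positive and negative distances balance out exactly, which is why the plain sum cannot be used as a measure of spread.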

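Now, we calculate the sum of those squared differences (or, the sum of squares):

\(\sum_{n=1}^{N} (x_{n} - \bar{x})^2 = 7.5625 + 10.5625 + 18.0625 + 3.0625 + 7.5625 + 5.0625 + 14.0625 + 1.5625 = 67.5\)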

By calculating the sum of squares we effectively calculated the total variability (i.e., differences) in those grades. Understanding how variability relates to differences is the key to understanding many statistical estimates and inference tests. What 67.5 means is that if we stack all those squares into one mega-square, its area will be equal to \(67.5 \text{ points}^2\), where points refers to the unit of the grades. The total variability of any set of measurements is the area of a square.
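Here is a small Python sketch of the same calculation, in case you want to see each little square's area before they are stacked. The names `squared_deviations` and `sum_of_squares` are illustrative.

```python
grades = [2, 8, 9, 3, 2, 7, 1, 6]
mean = sum(grades) / len(grades)

# Each squared deviation is the area of one small square.
squared_deviations = [(x - mean) ** 2 for x in grades]

# Adding up all of the small squares gives the total variability.
sum_of_squares = sum(squared_deviations)

print(squared_deviations)
# [7.5625, 10.5625, 18.0625, 3.0625, 7.5625, 5.0625, 14.0625, 1.5625]
print(sum_of_squares)  # 67.5
```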

The variance

Now that we have the total variability, or the area of the mega-square, what we really want is the mean variability. To find that mean, we just divide the total area by the number of squares.

\(\frac{\sum(x_{n} - \bar{x})^2}{N}=\frac{67.5}{8} \approx 8.44 \text{ points}^2\)

For most practical purposes you want to divide by \(N-1\), and not by \(N\), as you will be trying to estimate this value from a sample, not from a population. However, here we assumed we have the total population. The point still is that you want to calculate the mean square of those little squares. What we just calculated is the variance, which is the mean variability, or the mean squared difference.
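To see the difference between dividing by \(N\) and by \(N-1\), here is a short sketch using Python's standard `statistics` module, where `pvariance` divides by \(N\) and `variance` divides by \(N-1\). Treating the same eight grades as a sample is only for illustration, since in this post they are the population.

```python
import statistics

grades = [2, 8, 9, 3, 2, 7, 1, 6]

# Population variance: the sum of squares divided by N.
population_variance = statistics.pvariance(grades)

# Sample variance: the sum of squares divided by N - 1.
sample_variance = statistics.variance(grades)

print(population_variance)        # 8.4375
print(round(sample_variance, 2))  # 9.64
```

The sample version is always a little larger, because the same total area is shared among one fewer square.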

The standard deviation

Why can’t we just go ahead with the variance as an indicator of the variability in the grades? The only problem with the variance is that we can’t compare it with the raw grades, because the variance is a “squared” value or, in other words, it is an area and not a length. Its unit is \(\text{points}^2\), which is not the same unit as our raw grades (which is \(\text{points}\)). So what should we do to get rid of the square? Take the square root!

At last, we have the standard deviation: the square root of the variance, which is \(\sqrt{67.5 / 8} \approx 2.90 \text{ points}\).
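As a quick check, the sketch below computes the population standard deviation with `statistics.pstdev` and confirms that it matches the square root of the population variance. The variable name `population_sd` is illustrative.

```python
import math
import statistics

grades = [2, 8, 9, 3, 2, 7, 1, 6]

# Population standard deviation: the square root of the population variance.
population_sd = statistics.pstdev(grades)

print(f"{population_sd:.2f}")  # 2.90

# The same value, obtained by taking the square root of the variance by hand.
by_hand = math.sqrt(statistics.pvariance(grades))
print(math.isclose(population_sd, by_hand))  # True
```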

This is the core idea of the standard deviation. This basic intuition should make it easier to understand why it makes sense to use units of standard deviations when dealing with z-scores, the normal distribution, the standard error and the analysis of variance. Also, if you replace the mean in the standard deviation formula with the values predicted by a fitted line, you get basic regression terms: the mean squared error (without the square root) and the root mean squared error (with the square root), both measured with respect to the fitted line rather than the mean. Furthermore, both correlation and regression formulas can be written in terms of sums of squares (the total variability areas) of different quantities. Partitioning sums of squares is a key concept for understanding generalized linear models and the bias-variance tradeoff in machine learning.
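To make the link with regression concrete, here is a minimal sketch. The helper `rmse` is a hypothetical function written for this post, not a library call, and the "fitted line" is deliberately the simplest one possible: a flat line that predicts the mean for every student.

```python
import math
import statistics

def rmse(observed, predicted):
    """Root mean squared error between observed values and predictions."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

grades = [2, 8, 9, 3, 2, 7, 1, 6]
mean = sum(grades) / len(grades)

# A flat line at the mean: the same prediction for every student.
constant_prediction = [mean] * len(grades)

# With that prediction, the RMSE matches the population standard deviation.
error = rmse(grades, constant_prediction)
print(math.isclose(error, statistics.pstdev(grades)))  # True
```

Any other fitted line would simply swap `constant_prediction` for its own predicted values, and the squares would then be drawn against that line instead of the mean.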

In short: standard deviation is everywhere.

The issue with the absolute value

You might be wondering why we square the differences instead of just taking the absolute value. Nothing really prevents you from using the mean absolute difference instead of the mean squared difference. The mean absolute difference weights every difference in proportion to its size, while squaring the differences gives more weight to the numbers that are further from the mean. This might be something you want to do. In any case, most mathematical theories make use of the squared differences (for reasons beyond the scope of this post, such as differentiability).

However, I will answer this question with a counterexample that is easy to understand (source). Let’s say we have two sets of grades with the same mean, \(x_{1}\) and \(x_{2}\):

\(x_{1}= 2, 2, 10, 10\)

\(x_{2}= 13, 7, 0, 4\)

By looking at those grades, you can easily see that \(x_{1}\) has lower variability and spread of numbers than \(x_{2}\). Let’s go ahead and calculate the mean absolute difference of both (knowing that their mean is 6):

\(\frac{\sum |x - \bar{x}|}{N} = \frac{|-4| + |-4| + |4| + |4|}{4} = \frac{16}{4} = 4\)

\(\frac{\sum |x - \bar{x}|}{N} = \frac{|7| + |1| + |-6| + |-2|}{4} = \frac{16}{4} = 4\)

Oops! That is not good. Both sets give the exact same value of variability, although we would want to see \(x_{1}\) having a slightly lower value than \(x_{2}\) as its numbers are less variable. If we use the squared differences, however, we get:

\( \sqrt{\frac{\sum (x - \bar{x})^2}{N}} = \sqrt{\frac{(-4)^2 + (-4)^2 + (4)^2 + (4)^2}{4}} = \sqrt{\frac{64}{4}} = \sqrt{16} = 4 \)

\( \sqrt{\frac{\sum (x - \bar{x})^2}{N}} = \sqrt{\frac{(7)^2 + (1)^2 + (-6)^2 + (-2)^2}{4}} = \sqrt{\frac{90}{4}} = \sqrt{22.5} = 4.74 \)

Thanks to squaring the differences, this gives us what we hoped for: the standard deviation gets bigger when the numbers are more spread out.
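The whole comparison fits in a few lines of Python. This is only a sketch, and the helper names `mean_absolute_deviation` and `population_sd` are my own, written out by hand so that each step stays visible.

```python
import math

def mean_absolute_deviation(values):
    """Mean of the absolute differences from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

def population_sd(values):
    """Square root of the mean squared difference from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

x1 = [2, 2, 10, 10]
x2 = [13, 7, 0, 4]

# The mean absolute difference cannot tell the two sets apart.
print(mean_absolute_deviation(x1), mean_absolute_deviation(x2))  # 4.0 4.0

# The standard deviation can.
print(population_sd(x1))            # 4.0
print(round(population_sd(x2), 2))  # 4.74
```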