Hypothesis testing visualized

Literally seeing how stat tests work

In this article, we’ll get an intuitive, visual feel for hypothesis testing. While there are many articles online that explain it in words, there aren’t nearly enough that rely primarily on visuals, which is surprising since the subject lends itself quite well to exposition through pictures and movies.

But before getting too far ahead of ourselves, let’s briefly describe what it even is.

What is hypothesis testing?

Best to start with an example of a hypothesis test before describing it generally. The first thing we need is a hypothesis. For example, we could hypothesize that the average height of men is greater than the average height of women. In the spirit of ‘proof by contradiction’, we first assume that there is no difference between the average heights of the two genders. This becomes our default, or null hypothesis. If we collect data on the heights of the two groups and find that it is extremely unlikely to have observed this data if the null hypothesis were true (for example, “if the null is true, why do I see such a big difference between the average male and female heights in my samples?”), we can reject it and conclude there is indeed a difference.

For a general hypothesis testing problem, we need the following:

- A metric we care about (average height in the example above).
- Two (or more) groups which differ from each other in some known way (males and females in the example above).
- A null hypothesis that the metric is the same across our groups, so any difference we observe in our collected data must merely be statistical noise, and an alternate hypothesis which says there is indeed some difference.

We can then proceed to collect data for the two groups, estimate the metric of interest for each, and see how compatible our data is with the null and alternate hypotheses. The last part is where the theory of hypothesis testing comes in. We’ll literally see how it works in the following sections.

How to reject

Now that we’ve formed our hypothesis and collected our data, how do we use it to reject our null? The general framework is as follows:

1. Define a statistic that measures the deviation of the metric we care about between the two groups. In our average-heights example, such a statistic could be the difference between the average male and female heights (under the null hypothesis, it is zero). Another could be the ratio between the average male and female heights (under the null hypothesis, it is one).
2. Since we start off assuming the null hypothesis, we already know the mean of the distribution of our test statistic (in the example above, zero for the difference and one for the ratio). Everything else about the distribution (like the variance and other moments) we get from the data we collected.
3. Get a point estimate of the statistic from the collected data (for the difference in average heights, average the heights of the two groups you see and take the difference).
4. If the null hypothesis were true, what would be the probability of seeing something as or more extreme than the estimate we observed? If this probability (called the p-value) is lower than a certain threshold, we conclude that the null hypothesis couldn’t have produced our data. The threshold itself is the false positive rate of our test (since, with this probability, we’ll reject the null even though it is true).
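The steps above can be sketched in a few lines of Python. This is a minimal sketch with synthetic height data and a normal approximation to the null distribution; all the numbers (means, spreads, sample sizes) are made up for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples (made-up numbers): heights in cm for the two groups.
male = rng.normal(175, 7, size=200)
female = rng.normal(170, 6, size=200)

# Steps 1 & 3: the test statistic is the difference of sample means,
# and its point estimate comes straight from the data.
t_obs = male.mean() - female.mean()

# Step 2: under the null the statistic has mean zero; its spread is
# estimated from the sample variances (ddof=1 for the unbiased estimate).
se = math.sqrt(male.var(ddof=1) / len(male) + female.var(ddof=1) / len(female))

# Step 4: one-sided p-value -- the probability, under the null, of seeing
# a difference as large or larger than the one we observed.
p_value = 0.5 * math.erfc((t_obs / se) / math.sqrt(2))

alpha = 0.05
reject = p_value < alpha
```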

A note about the variance

It is important to stress that the distribution of concern is the distribution of our estimate of the test statistic, not the distribution of the metric in the population. For instance, in our example of comparing the mean heights of males and females, there will be some variance in the heights of males. However, this is not the variance we’re interested in. If we take n males and average their heights, we get an estimate of the average height of the male population. There is variance in this estimate unless we round up every last male on the planet, measure their heights and average them. If we conducted the experiment again, we would probably get a different set of males with different heights, so the average we get would be slightly different this time. This variance across repeat experiments (if we conducted them multiple times) is what we’re interested in estimating.

As we increase our sample size for any of the groups, the variance in our estimate for that group goes down. If you happen to be allergic to math, or just not in the mood for it and want to jump to the pictures, you can skip the following section, I won’t feel bad :)

Variance: the Math

The variance of our test statistic (for example, the difference in the estimates of average male and female heights) depends on the variance in the estimates of the metric for the two groups. To make this concrete, let’s say we sampled n_1 males and calculated the variance in their heights, s_1². The variance in the estimator of the average male height then becomes s_1²/n_1. This is because the estimated average male height is:

$$\hat{h}_m = \frac{1}{n_1}\sum_{i=1}^{n_1} h_i$$

Eq 1: Estimate of mean height of males

And so the variance in this estimate (how much the estimate, h_m would vary if we sampled n_1 males multiple times) becomes:

$$V(\hat{h}_m) = \frac{1}{n_1^2}\sum_{i=1}^{n_1} V(h_i) = \frac{s_1^2}{n_1}$$

Eq 2: Variance in estimate of average height of males given we sample n_1 of them.

In equation (2), the best estimate of V(h_i), the variance of a single sample, is s_1², the variance calculated from our sample. You can see that as we collect more samples, the variance in this estimator goes down.

Similarly, the variance in the estimate of the average female height is s_2²/n_2, if n_2 females are sampled.

If we choose the difference in means to be our test statistic, it can be expressed as:

$$t = \hat{h}_m - \hat{h}_f$$

And the variance in this statistic becomes (since there is no correlation between samples from the two groups):

$$V(t) = V(\hat{h}_m) + V(\hat{h}_f) = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$$

Eq 3: Variance of test statistic

To make this variance small, both n_1 and n_2 must be made large. Here, we took the example of the difference in means. However, the general conclusion holds for other test statistics we might have constructed (like the ratio of means) as well.
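Equation (3) can be checked by brute force: simulate the “repeat the experiment many times” thought experiment from the previous section and compare the empirical variance of the statistic against s_1²/n_1 + s_2²/n_2. The sample sizes and standard deviations below are assumptions for the simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 50, 80      # sample sizes for the two groups (assumed)
s1, s2 = 7.0, 6.0    # true standard deviations of heights (assumed)

# Each run is one full "experiment": draw fresh samples for both groups
# and record the difference of the sample means.
diffs = [rng.normal(0.0, s1, n1).mean() - rng.normal(0.0, s2, n2).mean()
         for _ in range(20_000)]

empirical = float(np.var(diffs))
predicted = s1 ** 2 / n1 + s2 ** 2 / n2   # equation (3)
```

The empirical variance across the 20,000 simulated experiments lands within a percent or two of the closed-form prediction.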

Show me the pictures

All right, I promised pictures and yet all you’ve seen so far are words and symbols. Let’s visualize what hypothesis testing looks like. Figure 0 below shows the probability density function of t (the test statistic which is the difference in means).

Let’s say we choose a false positive rate (FPR), α of 15% (we don’t want more than a 15% chance that we incorrectly conclude there is a difference when there is none).

The pink vertical line is the point at which the area to the right becomes 15%, so this becomes our threshold for rejecting the null. If the observed statistic is greater than the pink line, we reject the null (conclude there is a significant difference between the heights) and if it’s less, we don’t reject it (the data couldn’t establish a substantial difference in heights).

Fig 0: Distribution of the difference of means. The pink area is 15%, so the pink vertical line is our current threshold. If the difference we observe from the data is greater than it, we reject the null.
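The pink line can be computed directly: it is the quantile of the null distribution with 15% of the probability mass to its right. A sketch using the stdlib `statistics.NormalDist`, assuming the null distribution of t is normal with a hypothetical standard error:

```python
from statistics import NormalDist

alpha = 0.15     # tolerated false positive rate
sigma = 0.65     # hypothetical standard error of the test statistic

# The threshold is the (1 - alpha) quantile of the null distribution:
# exactly alpha of the null's mass lies to its right.
null = NormalDist(mu=0.0, sigma=sigma)
threshold = null.inv_cdf(1 - alpha)

# Decision rule: reject the null iff the observed statistic exceeds threshold.
# Sanity check: the right-tail area at the threshold is alpha.
right_tail = 1 - null.cdf(threshold)
```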

Note that our criterion for rejecting the null hypothesis is constructed not on the test statistic itself, but on a probability calculated from its distribution under the null. Why do it in this convoluted way? Because it ensures our test can catch the tiniest of differences between the metrics of the two groups, given enough data, since the variance in the estimate of the statistic will go to zero.

Let’s say the heights of males are always 5 cm more than those of the females. The green point in figure 1 below represents this 5 cm difference. However, since it lies to the left of the pink threshold, we fail to recognize it and conclude that no difference could be found.

Let’s see what happens as we start increasing the amount of data in one of the groups (say males — the groups are labeled treatment and control in the figure; take treatment to mean males and control to mean females). Per equation (3), the variance of the test statistic starts to go down. As a result, the pink line (where the area to the right is 15%) starts to move leftward towards the green point. Also per figure 1, we can see that increasing the sample size for the male group can reduce the variance only so much before the contribution from the other group starts to dominate. So, the pink line can’t quite reach the green point.

Fig 1: As we increase the sample size of one of the groups, the variance of the test statistic under the null reduces. This causes the point at which the FPR is 15% to shift to the left. Eventually however, the variance stops reducing since the second group becomes the blocker. Created using: https://github.com/ryu577/pyray

However, as we increase the sample size for the second group (women) as well, the variance starts substantially reducing again, and the pink line, which is our threshold, moves leftward once more until it finally reaches the green point. With that much data, we will be able to capture differences in average heights as small as the gap between the green point and the blue line. The moral of this story is that even the smallest effect size can be caught given enough data (in both groups).

Fig 2: Beyond a point, increasing the sample size for the first group started giving us diminishing returns. So, we had to start increasing the sample size of the other group to get a further reduction in the variance of the distribution of the test statistic under the null hypothesis.

Now, we just showed that if we have a tolerance of 15% for the false positive rate, we get the pink line in figures 1 and 2 above. But this is set by us. If the false positive rate is the only thing we care about, why not set it as low as possible (0)? This would involve moving the pink line to infinity, and we wouldn’t ever reject the null. There would be no false positives since there would be no positives at all.

The obvious downside to this test shows up when the alternate hypothesis is in fact true (there is a difference between the heights). Since we never reject the null, we would always fail to reject it even when a difference exists. This would make our false negative rate (the probability the test returns negative for a significant difference even when there is one) the worst possible, at 100%. In statistics lingo, the false positive rate is called the type-1 error and is denoted by α, while the false negative rate is called the type-2 error and is denoted by β.

Now, α comes from the null hypothesis, where we know the mean value of the test statistic (for the test of equal means we’ve been working with, the mean value of the difference of means is zero). To get β, we assume the alternate hypothesis is true (there is indeed a difference in the heights), so we need the distribution of our statistic under it. The variance and other aspects of this alternate distribution should be the same as the null’s. But what should we set the mean to? For the null, it was zero (for the difference-of-means test statistic). For the alternate, we just wave our hands and pull a number out of a hat. It’s called the “effect size” we want our test to care about. Basically, we assume that the difference in means is exactly 5 cm (say) and see how good our test is at catching this difference (rejecting the null).

In figure 3 below, the yellow curve is the null hypothesis and the purple one is the alternate hypothesis. The gap between their peaks is the effect size. Figure 3 clearly shows the trade-off between α (denoted by the yellow area) and β (denoted by the purple area). As we reduce α, we move the pink threshold to the right, but this increases the purple area, β. Also note that when α=0, we always predict negative, so the false negative rate when the alternate hypothesis is true becomes β=1. Similarly, when α=1, we’ll have β=0. Since there is a clear trade-off between them, we get a decreasing function connecting these two extremes. This kind of α-β trade-off graph is shown in the bottom left of figure 3; we move along the curve as the pink threshold moves back and forth.

Fig 3: Tradeoff between FPR and FNR. Image created using https://github.com/ryu577/pyray
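The trade-off can also be computed rather than just pictured: for a given α, find the threshold from the null distribution, then measure how much of the alternate’s mass falls to its left. A sketch assuming normal distributions, with a made-up standard error and effect size:

```python
from statistics import NormalDist

sigma = 1.0    # standard error of the statistic (assumed)
effect = 2.0   # assumed effect size: the gap between the two peaks

def beta(alpha: float) -> float:
    # Threshold where the null's right-tail area equals alpha.
    thr = NormalDist(0.0, sigma).inv_cdf(1 - alpha)
    # Beta is the alternate's mass to the left of that threshold:
    # the cases where a real effect fails to clear the bar.
    return NormalDist(effect, sigma).cdf(thr)

# Shrinking alpha pushes the threshold right and inflates beta.
```

Evaluating `beta` over a grid of α values traces out exactly the kind of decreasing curve shown in the bottom-left panel of figure 3.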

Worst possible test

Now we know that the false positive rate, α, is up to us to set, but there is a trade-off between it and the false negative rate, β. For a given sample size (in both groups), any test based on some statistic we construct will have an α-β profile like the one shown in the bottom left of figure 3. We want both α and β to be low, so if a given hypothesis test’s curve stays below that of another, we will prefer it. In statistics lingo, the preferable test is called a “more powerful test” because 1-β is called the power of a test. Now that we have a way to call a test “more powerful”, a natural quest is born for the “most powerful” test, given our metric of interest and the distributions of this metric in the treatment and control groups. This is the concept of a uniformly most powerful test, and there is considerable effort in statistics devoted to finding such tests. We, however, will go in the opposite direction in this section, searching for the worst possible, least powerful test. For if we know what the worst possible deal is, we will never get duped in the worst possible way.

Imagine Tom is tasked with determining whether or not there is a statistically significant difference between the heights of males and females. Instead of going out and collecting some data from some males and females, he stays at home and simply tosses a coin. The coin has a probability α of coming up “heads”. If he gets “heads”, he concludes that the alternate hypothesis is true and rejects the null. If he gets tails (probability 1-α), he concludes that the null hypothesis is true and there is no difference in average heights (in statistics lingo, he “fails to reject the null”). Given that the null hypothesis is true, he will have a probability α of incorrectly rejecting it (by definition). And if the alternate is true, he will have a probability β=1-α of not rejecting the null. So the relationship between α and β in this case is simply β=1-α. This is shown as the red line in figure 4 below. For a more reasonable test that actually collects some sample data and constructs a sensible test statistic (like the difference or ratio of means), the corresponding relationship might look something like the white curve below it.

Fig 4: The α-β curve for the worst possible hypothesis test is given by the red line, β=1-α. A more reasonable hypothesis test where we actually look at some data might be given by the white curve. You can see for a given α, the β we get for the red line is much higher (worse).
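Tom’s coin test is easy to simulate, and the simulation makes the red line concrete: because the coin ignores the data entirely, the same rejection probability α applies whether the null or the alternate is true, which forces β = 1-α.

```python
import random

rng = random.Random(3)
alpha = 0.15
trials = 100_000

# Tom rejects the null whenever the coin says so, with probability alpha,
# no matter what the data (which he never collected) would have shown.
rejects = sum(rng.random() < alpha for _ in range(trials))

# If the null is true, every rejection is a false positive.
fpr = rejects / trials
# If the alternate is true, every non-rejection is a false negative.
fnr = 1 - rejects / trials
# fpr is close to alpha and fnr to 1 - alpha: the red line in figure 4.
```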

Sample size

Another question of importance in hypothesis testing is how large a sample size we need for our experiment. To answer this, we need the target false positive rate (α), the target false negative rate (β) and the effect size we’re interested in measuring. Let’s say we wanted a 16% false positive rate and a 10% false negative rate. This is represented by the green point in the graph on the bottom left of figure 5 below. You can see that initially, the α-β curve isn’t touching the green point (for the target α, the β is much higher than what we desire). However, as we start increasing the sample sizes of the control and treatment groups, the entire curve starts shifting downwards until eventually, the green point lies on it (notice that the yellow area, which is α, remains constant while the purple area, β, reduces significantly). This is due to the pink line moving to the left and the purple curve getting thinner. And this is essentially how we can tell in advance the sample sizes we’ll need for our two groups, given the false positive and false negative rates, as well as the effect size we wish to capture.

Fig 5: Increasing sample size allows us to get any FNR for a given FPR. Image created using https://github.com/ryu577/pyray
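Under the normal assumptions used throughout, this sample-size question has a closed form. A sketch for equal group sizes and a one-sided difference-of-means test; the group standard deviations here are assumptions:

```python
import math
from statistics import NormalDist

alpha, beta = 0.16, 0.10   # target FPR and FNR from the example
effect = 5.0               # smallest difference (cm) we want to catch
s1, s2 = 7.0, 6.0          # assumed standard deviations of the two groups

z_a = NormalDist().inv_cdf(1 - alpha)   # null quantile setting the threshold
z_b = NormalDist().inv_cdf(1 - beta)    # alternate quantile for the power

# With n per group, the statistic's standard error is sqrt((s1^2 + s2^2)/n).
# The threshold sits z_a standard errors above zero, and we need a further
# z_b standard errors of headroom up to the effect size:
#   (z_a + z_b) * sqrt((s1^2 + s2^2) / n) <= effect
n = math.ceil((z_a + z_b) ** 2 * (s1 ** 2 + s2 ** 2) / effect ** 2)
```

Note that halving the effect size we wish to capture quadruples the required n, which is the quantitative version of “even the smallest effect can be caught given enough data”.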

That covers most of the essential aspects of hypothesis testing. Let me know what you think, what I missed and if there are any other aspects of hypothesis testing conducive to such visualizations.