Hypothesis testing: the distribution doesn’t matter(!)

Take the t-test and throw out the t. What you’re left with is just as neat.

1. The claim

I read somewhere on the internet (so it must be true) that what separates good writers from the rest is that they know how their story is going to end before they even begin writing. Knowing the ending provides direction, clarity and purpose, since they know what they’re aiming for. So, I’ll give you the ending of this story, the punch line, right here (henceforth, “the claim”): take a two-sample t-test (or any hypothesis test); now throw away the t-distribution (the distribution the test assumes for its test statistic under the null hypothesis; we’ll get into what that is) and replace it with any other distribution under the sun. The new distribution might be Gaussian, or it could be some weird multi-modal shape; it doesn’t matter. Now if you use this “modified” hypothesis test on any data, you end up with the same performance (the same trade-off between error rates) as with the original test. On the surface, that’s like opening up a Ferrari, ripping out its engine and replacing it with the engine of some other car, and then finding that the performance of this “modified” Ferrari remains the same as if we hadn’t changed anything. How?

1.1 Background

Okay, so you’ll need the basics of hypothesis testing for this journey. See here if you’re feeling rusty. While the claim holds for any hypothesis test (as we’ll prove), we’ll use the two-sample t-test for comparing means as a template since it’s one of the more widely-known hypothesis tests.

Now, what is a hypothesis test? Let’s take a concrete example for the two-sample t-test. Say we have a manufacturing process that produces nuts of a given size. Of course, no process is perfect and so, the nuts produced have some small variance in their sizes. A new automation process was introduced yesterday, which makes this process much more efficient. We’re concerned that it might have changed the distribution of the sizes of the nuts though, which is unacceptable. The two-sample t-test gives us a way of comparing the mean sizes from the old and new processes. If even the mean size for the new process is significantly different, it will be a problem. All we need is to get some nuts manufactured by the old process and some by the new process and feed the data pertaining to their sizes into the hypothesis test.

The hypothesis test is then like a watchman, keeping an eye on the quality of our manufacturing process and alerting us when it’s compromised. This raises the question though: who watches the watchman? Just like the manufacturing process isn’t perfect, the hypothesis test we use to tell if there is a difference in means or not isn’t perfect either. It might make an error. And there are two kinds of errors it can make. It might miss a real difference in the means, saying there is no difference when there in fact is, or it might raise a false alarm, saying there is a difference when there is in fact none. The first kind of error is called a false negative (we’ll denote its rate b) and the second kind is called a false positive (we’ll denote its rate a).

There is always a trade-off between these two kinds of errors. For example, it’s trivial to get a perfect (zero) false positive rate by never saying there is a difference (the test never returns positive, so there are no false positives). However, this strategy will lead to a 100% false negative rate. Similarly, one can get a zero false negative rate, but incur a 100% false positive rate. Useful tests generally lie somewhere in between these extremes. The more aggressively the test predicts positives, the higher the false positive rate and the lower the false negative rate (real differences become less likely to be missed). This trade-off between the false positive and false negative rates can be plotted for any test (see figure 1 below). If test-1 has a curve that stays below that of test-2, we can say test-1 makes fewer errors and is preferable to test-2 (it is also called “more powerful”). Now, let’s take the two-sample t-test for comparing means. It gives us a certain curve for the error rates. Surely the curve should become worse if we replace the t-distribution in the test with another arbitrary distribution? Let’s see what happens.

1.2 Measuring errors

To see how well our test performs, we don’t have to wait for real world data from manufactured nuts. There are only two possibilities: either the new process didn’t change the mean (null hypothesis) or it did (alternate hypothesis). Say we were going to do our experiment where we produce 30 nuts with the old manufacturing process and 30 with the new one and pass the data into our test. Well, we can first assume that the new process didn’t change the mean size and simulate 30 sizes corresponding to the old process and 30 corresponding to the new one from the same distribution (say, normal with the mean size set to an appropriate value and some added variance). We know there is no difference between the groups in this scenario since we simulated the data that way. If the test fires and says it found a difference in means, we know this is a false positive. Similarly, we generate 30 sizes from the old process and then 30 from the new process, this time setting the mean to a different (say, slightly higher) value for the new process. We then pass these data points to the test as well. We now expect our test to find a difference. If it doesn’t, this is a false negative. We apply our test to these data points with different proclivities for firing positives (“trigger happiness”, controlled by the significance level), and repeat the process many times. This gives us the false positive to false negative rate trade-off for the test.
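The simulation described above can be sketched in a few lines of Python. This is only an illustration under assumed settings, not the code behind the plot: the mean size of 10, the unit standard deviation, the shift of 0.5, and the names `t_statistic` and `simulate_error_rates` are all made up for this example. “Trigger happiness” is modeled directly as a threshold on the test statistic: the lower the threshold, the more readily the test fires.

```python
import random

def t_statistic(old, new):
    """Two-sample t statistic; large positive values suggest mean(new) > mean(old)."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    m_old, v_old = mean_var(old)
    m_new, v_new = mean_var(new)
    return (m_new - m_old) / (v_old / len(old) + v_new / len(new)) ** 0.5

def simulate_error_rates(thresholds, shift=0.5, n=30, trials=2000, seed=0):
    """Return one (false positive rate, false negative rate) pair per threshold."""
    rng = random.Random(seed)

    def sample(mu):
        # Assumed data model: normally distributed sizes with unit standard deviation.
        return [rng.gauss(mu, 1.0) for _ in range(n)]

    # Null scenario: both processes share the same mean size.
    null_stats = [t_statistic(sample(10.0), sample(10.0)) for _ in range(trials)]
    # Alternate scenario: the new process has a slightly larger mean size.
    alt_stats = [t_statistic(sample(10.0), sample(10.0 + shift)) for _ in range(trials)]

    rates = []
    for c in thresholds:  # the test "fires" when the statistic exceeds c
        fpr = sum(s > c for s in null_stats) / trials   # fired, but no real difference
        fnr = sum(s <= c for s in alt_stats) / trials   # stayed silent, but real difference
        rates.append((fpr, fnr))
    return rates

thresholds = [0.5, 1.0, 1.5, 2.0, 2.5]
rates = simulate_error_rates(thresholds)
for c, (fpr, fnr) in zip(thresholds, rates):
    print(f"threshold={c:.1f}  false positive rate={fpr:.3f}  false negative rate={fnr:.3f}")
```

Raising the threshold can only lower the false positive rate and raise the false negative rate, which is the trade-off the curve in figure 1 traces out.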

1.3 Illustration

Now, we proceed with the simulation described in the previous section, using the t-test applied as-is, with the correct distribution. But next, we “modify” the test by swapping the t-distribution used in the test with other arbitrary distributions. We might expect these modifications to have a detrimental effect on the effectiveness of the test (and so higher error rates, meaning the false positive to false negative trade-off curve should move upwards).

But when we actually plot these curves, the figure below is what we get:

Fig 1: False positive rate to false negative rate trade-off for various null hypothesis distributions plugged into the skeleton of a two-sample t-test for comparing means.

In the plot, we first use the standard t-distribution (location 0 and scale 1, as required by the test). Next, we replaced the t-distribution with a normal distribution with mean 0 and standard deviation 1. Then, we replaced that with a normal distribution with mean 1 and standard deviation 2. And finally, we replaced that with a Cauchy distribution with location 0 and scale 1 (a Cauchy distribution has no finite mean or standard deviation, so it is described by its location and scale instead). The false positive rate to false negative rate curves corresponding to these tests are shown in figure 1. Surprisingly, they are all right on top of each other. So, swapping the t-distribution with these other unrelated distributions has no effect whatsoever on the performance of the test! Let’s now take a look at the code used to draw this plot.

2. Creating the plot

Since we want to take the t-test and replace the t-distribution with other distributions, the “uno” step is to implement the two-sample t-test from scratch. This is based on the description here.

Two sample t-test from scratch. See gist at: https://gist.github.com/ryu577/6146e88b347bce5a66dda168bc5cefe0
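For readers who can’t open the gist, here is a minimal standard-library sketch of what such a from-scratch test might look like. It is not the gist’s code. It assumes the unequal-variance (Welch) form of the test and a one-sided alternative (second mean larger), and the names `t_sf` and `two_sample_t_test` are hypothetical. The t-distribution’s survival function is obtained by numerically integrating its density with Simpson’s rule.

```python
import math

def t_sf(x, df, steps=2000):
    """Survival function P(T > x) of Student's t with df degrees of freedom.

    Integrates the density from 0 to |x| with Simpson's rule (steps must be even).
    """
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(t):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

    h = abs(x) / steps
    area = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # integral of the density over [0, |x|]
    # The density is symmetric about 0, so half the mass lies on each side.
    return 0.5 - area if x >= 0 else 0.5 + area

def two_sample_t_test(mean1, var1, n1, mean2, var2, n2):
    """One-sided test of H1: mean2 > mean1, from summary statistics.

    Assumption: Welch's unequal-variance form; returns (statistic, df, p-value).
    """
    se2 = var1 / n1 + var2 / n2
    t_stat = (mean2 - mean1) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1))
    return t_stat, df, t_sf(t_stat, df)

# Hypothetical summary statistics: 30 nuts per process, mean size 10.0 vs 10.5.
t_stat, df, p_value = two_sample_t_test(10.0, 1.0, 30, 10.5, 1.0, 30)
print(f"t = {t_stat:.3f}, df = {df:.1f}, one-sided p-value = {p_value:.4f}")
```

The only place the t-distribution enters is the final call to `t_sf`; everything before it is arithmetic on the sample summaries. That single call is the “engine” the rest of this post swaps out.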

Next, we wrap this in a function that can use two arrays instead of the raw means and variances.

Wrapper for the two-sample t-test which takes arrays as input. We can also pass the distribution to use as the null hypothesis. See gist at: https://gist.github.com/ryu577/b18fcddaa73216448ef72593a76a2d15
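As a rough illustration of such a wrapper (again, not the gist’s code), the sketch below takes two arrays plus the survival function of whatever distribution we want to use for the null hypothesis. The factory names `normal_sf` and `cauchy_sf`, the function `two_sample_test`, and the sample sizes are invented for this example. It uses only the standard library, so the t-distribution itself is left out here.

```python
import math
from statistics import NormalDist, mean, variance

def normal_sf(mu=0.0, sigma=1.0):
    """Survival function of a normal distribution, returned as a callable."""
    dist = NormalDist(mu, sigma)
    return lambda x: 1.0 - dist.cdf(x)

def cauchy_sf(loc=0.0, scale=1.0):
    """Survival function of a Cauchy distribution, returned as a callable."""
    return lambda x: 0.5 - math.atan((x - loc) / scale) / math.pi

def two_sample_test(old, new, null_sf, alpha=0.05):
    """One-sided two-sample test with a pluggable null distribution.

    Returns (statistic, p-value, reject?), where the p-value is null_sf(statistic).
    """
    se = math.sqrt(variance(old) / len(old) + variance(new) / len(new))
    stat = (mean(new) - mean(old)) / se
    p_value = null_sf(stat)
    return stat, p_value, p_value < alpha

# Made-up nut sizes from the old and new processes.
old = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
new = [10.3, 10.4, 10.1, 10.6, 10.2, 10.5]

for name, sf in [("normal(0, 1)", normal_sf()),
                 ("normal(1, 2)", normal_sf(1.0, 2.0)),
                 ("cauchy(0, 1)", cauchy_sf())]:
    stat, p_value, reject = two_sample_test(old, new, sf)
    print(f"{name}: statistic={stat:.3f}, p-value={p_value:.4f}, reject at 0.05: {reject}")
```

Note that the p-values (and hence the decision at a fixed α) do change with the plugged-in distribution. What stays the same is the ordering: a larger statistic always gets a smaller p-value, and that is what the rest of the argument relies on.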

Now, we implement a function that simulates data from the null and alternate hypotheses and generates the false positive and false negative rates at various significance levels, hence drawing out the alpha-beta curve.

Function that swaps various distributions in place of the t in the two sample t-test. See code at: https://gist.github.com/ryu577/925fb03bb2a1869a60037eec74ddfcf5
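A compact stand-in for that function, under assumptions of my own choosing (unit-variance normal data, a mean shift of 0.5, 1000 simulated experiments per scenario, and standard-library distributions only), might look like the sketch below. The names `simulate_statistics` and `alpha_beta_curve` are hypothetical. For each null distribution it finds the significance level α that produces a target false positive rate, then measures the false negative rate at that α.

```python
import math
import random
from statistics import NormalDist

def t_statistic(old, new):
    """Two-sample t statistic; large positive values suggest mean(new) > mean(old)."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    m_old, v_old = mean_var(old)
    m_new, v_new = mean_var(new)
    return (m_new - m_old) / math.sqrt(v_old / len(old) + v_new / len(new))

def simulate_statistics(shift, n=30, trials=1000, seed=0):
    """Test statistics from repeated experiments where the new mean is off by `shift`."""
    rng = random.Random(seed)
    def sample(mu):
        return [rng.gauss(mu, 1.0) for _ in range(n)]
    return [t_statistic(sample(0.0), sample(shift)) for _ in range(trials)]

def alpha_beta_curve(null_stats, alt_stats, null_sf, target_fprs):
    """For each target false positive rate, return (alpha, fpr, fnr)."""
    null_p = sorted(null_sf(s) for s in null_stats)
    alt_p = [null_sf(s) for s in alt_stats]
    curve = []
    for a in target_fprs:
        # Significance level at which roughly a fraction `a` of null p-values fall below it.
        alpha = null_p[round(a * len(null_p))]
        fpr = sum(p < alpha for p in null_p) / len(null_p)
        fnr = sum(p >= alpha for p in alt_p) / len(alt_p)
        curve.append((alpha, fpr, fnr))
    return curve

std_normal, wide_normal = NormalDist(0.0, 1.0), NormalDist(1.0, 2.0)
null_sfs = {
    "normal(0, 1)": lambda x: 1.0 - std_normal.cdf(x),
    "normal(1, 2)": lambda x: 1.0 - wide_normal.cdf(x),
    "cauchy(0, 1)": lambda x: 0.5 - math.atan(x) / math.pi,
}

null_stats = simulate_statistics(0.0, seed=1)   # no real difference
alt_stats = simulate_statistics(0.5, seed=2)    # real difference of 0.5
curves = {name: alpha_beta_curve(null_stats, alt_stats, sf, [0.01, 0.05, 0.1, 0.2])
          for name, sf in null_sfs.items()}
for name, curve in curves.items():
    print(name, [(round(alpha, 4), fpr, fnr) for alpha, fpr, fnr in curve])
```

The α column differs from one distribution to the next, but the (false positive rate, false negative rate) pairs should line up, which is the overlap seen in figure 1.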

And finally, we put this all together to draw out the alpha-beta plots (false positive rate a against false negative rate b) using different distributions for the null hypothesis in the two-sample t-test.

And this generates something very similar to the plot we saw in the previous section.

If you don’t like Math, this is the point where you add the new nugget of knowledge to your collection and ride off into the sunset. Otherwise, stay for..

3. The Derivation of the claim

First, we define some notation. We’ll assume our test is one-sided with the alternate hypothesis being that the second population is larger on some metric (like the mean for the two-sample t-test).

The hypothesis tests we’re interested in operate as follows:

1. We want to know if some property of one group is larger than that of another group.
2. We calculate a test statistic (X) that is expected to take on large values if the property of the second group is indeed much larger than that of the first.
3. We collect some random samples from both groups and calculate this test statistic based on the samples.
4. If the test statistic is so large that the probability of seeing a value as large as or larger than it (under the null hypothesis) is less than some significance level α, we reject the null hypothesis and conclude the second group probably has a larger value for the property in question.

The distribution of the test statistic, X under various scenarios:

X_0: The distribution of the test-statistic under the null hypothesis (assuming there is no difference in our metric of interest; assumption of innocence — what all trials start with).

Y_0: The distribution of the test-statistic under the null hypothesis in the perfect world of our test — where all the distributional assumptions of our test are perfectly satisfied and unicorns ride across rainbows.

X_a: The distribution of the test statistic under the alternate hypothesis, when there is a difference in our metric of interest between the two groups.

3.1 The error rates

Per the fourth point above, our criterion for rejecting the null hypothesis (at some significance level, α) is: P(Y_0 ≥ X) < α. The left-hand side is the probability that Y_0 will be greater than some value, which is the survival function of Y_0 (denoted S_{Y_0}) evaluated at X. So, the probability of rejecting the null hypothesis becomes:

P(S_{Y_0}(X) < α) = P(X > S_{Y_0}^{-1}(α))

where in the last step, we used the fact that the survival function of any distribution is monotonically decreasing (we assume it is strictly decreasing, so that its inverse exists and applying it flips the inequality). When the null hypothesis is true, the distribution of the test statistic, X is: X ∼ X_0. So the probability above becomes:

P(X_0 > S_{Y_0}^{-1}(α)) = S_{X_0}(S_{Y_0}^{-1}(α))

But under the null hypothesis, rejecting the null is by definition an error. So the probability above is an error rate; the rate of erroneously rejecting the null when it is true. Let’s call this false positive rate, a.

So, a = S_{X_0}(S_{Y_0}^{-1}(α)), and the significance level that leads to a false positive rate, a is given by:

α = S_{Y_0}(S_{X_0}^{-1}(a))

Eq (1): The significance level that leads to false positive rate, a.

Similarly, under the alternate hypothesis, the distribution of the test statistic becomes: X ∼ X_a. So the probability of rejecting the null becomes:

P(X_a > S_{Y_0}^{-1}(α)) = S_{X_a}(S_{Y_0}^{-1}(α))

In the case of the alternate hypothesis, we’d be making an error if we don’t reject the null. So, the rate of the error where we don’t reject the null when we should have, b becomes:

b = 1 - S_{X_a}(S_{Y_0}^{-1}(α))

Eq (2): The false negative error rate.

Now, we want b as a function of a (false negative rate as a function of false positive rate), because the trade-off in the rates of the actual errors is what matters. To get this, we can substitute equation (1) into equation (2), and this gives us:

b = 1 - S_{X_a}(S_{Y_0}^{-1}(S_{Y_0}(S_{X_0}^{-1}(a)))) = 1 - S_{X_a}(S_{X_0}^{-1}(a))

Eq (3): The false negative to false positive rate trade-off.

Equation (3) does not involve Y_0 at all. So the distribution we plug in for the null hypothesis when running the test has no bearing on the trade-off between the false negative and false positive rates (as long as its survival function is strictly decreasing and hence invertible). And this proves the claim we started this blog with. In other words, it doesn’t matter if the Y_0 we use in our two-sample t-test is a t-distribution, or Normal, or Cauchy. Changing Y_0 only changes which significance level α corresponds to a given false positive rate a; the false negative to false positive rate trade-off (b as a function of a, as given by equation (3)) will remain exactly the same.
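To make the cancellation concrete, here is a small numerical check of the chain of reasoning, under assumptions chosen purely for illustration: the test statistic is taken to be N(0, 1) under the null and N(2, 1) under the alternate, and three candidate Y_0 distributions are tried. For a target false positive rate a, the code computes α = S_{Y_0}(S_{X_0}^{-1}(a)) and then b = 1 - S_{X_a}(S_{Y_0}^{-1}(α)). All function names are hypothetical.

```python
import math
from statistics import NormalDist

X0 = NormalDist(0.0, 1.0)  # assumed distribution of the statistic under the null
XA = NormalDist(2.0, 1.0)  # assumed distribution of the statistic under the alternate

# Each candidate Y_0 is a pair: (survival function, inverse survival function).
def normal_pair(mu, sigma):
    d = NormalDist(mu, sigma)
    return (lambda x: 1.0 - d.cdf(x), lambda q: d.inv_cdf(1.0 - q))

def cauchy_pair(loc, scale):
    return (lambda x: 0.5 - math.atan((x - loc) / scale) / math.pi,
            lambda q: loc + scale * math.tan(math.pi * (0.5 - q)))

def logistic_pair(loc, scale):
    return (lambda x: 1.0 / (1.0 + math.exp((x - loc) / scale)),
            lambda q: loc + scale * math.log((1.0 - q) / q))

def alpha_and_b(a, y0_sf, y0_sf_inv):
    """Return (alpha, b) for a target false positive rate a."""
    crit_null = X0.inv_cdf(1.0 - a)   # S_{X_0}^{-1}(a)
    alpha = y0_sf(crit_null)          # significance level giving false positive rate a
    crit = y0_sf_inv(alpha)           # S_{Y_0}^{-1}(alpha): rejection threshold on X
    b = XA.cdf(crit)                  # b = 1 - S_{X_a}(threshold)
    return alpha, b

candidates = {
    "normal(0, 1)": normal_pair(0.0, 1.0),
    "normal(1, 2)": normal_pair(1.0, 2.0),
    "cauchy(0, 1)": cauchy_pair(0.0, 1.0),
    "logistic(0, 1)": logistic_pair(0.0, 1.0),
}

results = {name: alpha_and_b(0.05, sf, sf_inv) for name, (sf, sf_inv) in candidates.items()}
for name, (alpha, b) in results.items():
    print(f"{name}: alpha = {alpha:.4f}, b = {b:.6f}")
```

Each choice of Y_0 needs a different α to hit a = 0.05, yet the resulting b agrees across all of them up to floating-point rounding, because S_{Y_0}^{-1} undoes S_{Y_0}.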