A two-sample t-test is intended to determine whether there’s evidence that two samples have come from distributions with different means. The test assumes that both samples come from normal distributions.

Robust to non-normality, not to asymmetry

It is fairly well known that the t-test is robust to departures from a normal distribution, as long as the actual distribution is symmetric. That is, the test works more or less as advertised as long as the distribution is symmetric like a normal distribution, but it may not work as expected if the distribution is asymmetric.

This post will explore the robustness of the t-test via simulation. How far can you be from a normal distribution and still do OK? Can you have any distribution as long as it’s symmetric? Does a little asymmetry ruin everything? If something does go wrong, how does it go wrong?

Experiment design

For the purposes of this post, we’ll compare the null hypothesis that two groups both have mean 100 to the alternative hypothesis that one group has mean 100 and the other has mean 110. We’ll assume both distributions have a standard deviation of 15. There is a variation on the two-sample t-test that does not assume equal variance, but for simplicity we will keep the variance the same in both groups.

We’ll do a typical design, 0.05 significance and 0.80 power. That is, under the null hypothesis, that is if the two groups do come from the same distribution, we expect the test to wrongly conclude that the two distributions are different 5% of the time. And under the alternative hypothesis, when the groups do come from distributions with means 100 and 110, we expect the test to conclude that the two means are indeed different 80% of the time.

Under these assumptions, you’d need a sample of 36 in each group.

Python simulation code

Here’s the code we’ll use for our simulation.

from scipy.stats import norm, t, gamma, uniform, ttest_ind num_per_group = 36 def simulate_trials(num_trials, group1, group2): num_reject = 0 for _ in range(num_trials): a = group1.rvs(num_per_group) b = group2.rvs(num_per_group) stat, pvalue = ttest_ind(a, b) if pvalue < 0.05: num_reject += 1 return(num_reject)

Normal simulation

Let’s see how the two-sample t-test works under ideal conditions by simulating from the normal distributions that the method assumes.

First we simulate from the null, i.e. we draw the data for both groups from the same distribution

n1 = norm(100, 15) n2 = norm(100, 15) print( simulate_trials(1000, n1, n2) )

Out of 1000 experiment simulations, the method rejected the null 54 times, very close to the expected number of 50 based on 0.05 significance.

Next we simulate under the alternative, i.e. we draw data from different distributions.

n1 = norm(100, 15) n2 = norm(110, 15) print( simulate_trials(1000, n1, n2) )

This time the method rejected the null 804 times in 1000 simulations, very close to the expected 80% based on the power.

Gamma simulation

Next we draw data from a gamma distribution. First we draw both groups from a gamma distribution with mean 100 and standard deviation 15. This requires a shape parameter of 44.44 and a scale of 2.25.

g1 = gamma(a = 44.44, scale=2.25) g2 = gamma(a = 44.44, scale=2.25) print( simulate_trials(1000, g1, g2) )

Under the null simulation, we rejected the null 64 times out of 1000 simulations, higher than expected, but not that remarkable.

Under the alternative simulation, we pick the second gamma distribution to have mean 110 and standard deviation 15, which requires shape 53.77 and scale 2.025.

g1 = gamma(a = 44.44, scale=2.25) g2 = gamma(a = 53.77, scale=2.045) print( simulate_trials(1000, g1, g2) )

We rejected the null 805 times in 1000 simulations, in line with what you’d expect from the power.

When the gamma distribution has such large shape parameters, the distributions are approximately normal, though slightly asymmetric. To increase the asymmetry, we’ll use a couple gamma distributions with smaller mean but shifted over to the right. Under the null, we create two gamma distributions with mean 10 and standard deviation 15, then shift them to the right by 90.

g1 = gamma(a = 6.67, scale=1.5, loc=90) g2 = gamma(a = 6.67, scale=1.5, loc=90) print( simulate_trials(1000, g1, g2) )

Under this simulation we reject the null 56 times out of 1000 simulations, in line with what you’d expect.

For the alternative simulation, we pick the second distribution to have a mean of 20 and standard deviation 15, and then shift it to the right by 90 so tht it has mean 110. This distribution has quite a long tail.

g1 = gamma(a = 6.67, scale=1.5, loc=90) g2 = gamma(a = 1.28, scale=11.25, loc=90) print( simulate_trials(1000, g1, g2) )

This time we rejected the null 499 times in 1000 simulations. This is a serious departure from what’s expected. Under the alternative hypothesis, we reject the null 50% of the time rather than the 80% we’d expect. If higher mean is better, this means that half the time you’d fail to conclude that the better group really is better.

Uniform distribution simulation

Next we use uniform random samples from the interval [74, 126]. This is a symmetric distribution with mean 100 and standard deviation 15. When we draw both groups from this distribution we rejected the null 48 times, in line with what we’d expect.

u1 = uniform(loc=74, scale=52) u2 = uniform(loc=74, scale=52) print( simulate_trials(1000, u1, u2) )

If we move the second distribution over by 10, drawing from [84, 136], we rejected the null 790 times out of 1000, again in line with what you’d get from a normal distribution.

u1 = uniform(loc=74, scale=52) u2 = uniform(loc=84, scale=52) print( simulate_trials(1000, u1, u2) )

In this case, we’ve made a big departure from normality and the test still worked as expected. But that’s not always the case, as in the t(3) distribution below.

Student-t simulation (fat tails)

Finally we simulate from a Student-t distribution. This is a symmetric distribution but it has fat tails relative to the normal distribution.

First, we simulate from a t distribution with 6 degrees of freedom and scale 12.25, making the standard deviation 15. We shift the location of the distribution by 100 to make the mean 100.

t1 = t(df=6, loc=100, scale=12.25) t2 = t(df=6, loc=100, scale=12.25) print( simulate_trials(1000, t1, t2) )

When both groups come from this distribution, we rejected the null 46 times. When we shifted the second distribution to have mean 110, we rejected the null 777 times out of 1000.

t1 = t(df=6, loc=100, scale=12.25) t2 = t(df=6, loc=110, scale=12.25) print( simulate_trials(1000, t1, t2) )

In both cases, the results are in line what we’d expect given a 5% significance level and 80% power.

A t distribution with 6 degrees of freedom has a heavy tail compared to a normal. Let’s try making the tail even heavier by using 3 degrees of freedom. We make the scale 15 to keep the standard deviation at 15.

t1 = t(df=3, loc=100, scale=15) t2 = t(df=3, loc=100, scale=15) print( simulate_trials(1000, t1, t2) )

When we draw both samples from the same distribution, we rejected the null 37 times out of 1000, less than the 50 times we’d expect.

t1 = t(df=3, loc=100, scale=15) t2 = t(df=3, loc=110, scale=15) print( simulate_trials(1000, t1, t2) )

When we shift the second distribution over to have mean 110, we rejected the null 463 times out of 1000, far less than the 80% rejection you’d expect if the data were normally distributed.

Just looking at a plot of the PDFs , a t(3) looks much closer to a normal distribution than a uniform distribution does. But the tail behavior is different. The tails of the uniform are as thin as you can get—they’re zero!—while the t(3) has heavy tails.

These two examples show that you can replace the normal distribution with a moderately heavy tailed symmetric distribution, but don’t overdo it. When the data some from a heavy tailed distribution, even one that is symmetric, the two-sample t-test may not have the operating characteristics you’d expect.