OK, so this is nothing new. Greg Francis said it, and Uri Simonsohn said it, Ulrich Schimmack said it, lots of people have said it. But it’s worth saying again.

To get NIH funding, you need to demonstrate (that is, convincingly claim) that your study has 80% power.

I hate the term “power,” as it’s all tied up with the idea that the goal of a study is to reach statistical significance. But let’s set that aside for now and just do the math: with a normal distribution, if you want an 80% probability of your 95% interval excluding zero, then the true effect size has to be at least 2.8 standard errors from zero.
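If you want to check where the 2.8 comes from, here is a minimal sketch in R. It assumes a two-sided test at the 5% level and a normally distributed estimate with known standard error; the variable names are mine, just for illustration.

```r
# Critical value for a two-sided 95% interval (about 1.96) plus the
# 80th percentile of the standard normal (about 0.84) gives the true
# effect, in standard-error units, needed for 80% power.
z_crit   <- qnorm(0.975)
z_power  <- qnorm(0.80)
z_needed <- z_crit + z_power
print(round(z_needed, 2))

# Sanity check: probability that |z| exceeds the critical value when the
# true effect sits z_needed standard errors from zero.
power <- pnorm(z_needed - z_crit) + pnorm(-z_needed - z_crit)
print(round(power, 3))
```

The second term in the power calculation is the chance of a significant result in the wrong direction, which is negligible here but is included so the formula matches a two-sided test.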

All right, then. Suppose we really were running studies with 80% power. In that case, the expected z-score is 2.8, and 95% of the time we’d see z-scores between 0.8 and 4.8.

Let’s open up the R:

> 2*pnorm(-0.8)
[1] 0.42
> 2*pnorm(-4.8)
[1] 1.6e-06

So we should expect to routinely see p-values ranging from 0.42 to . . . ummmm, 0.0000016. And those would be clean, pre-registered p-values, no funny business, no researcher degrees of freedom, no forking paths.
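A quick simulation makes the same point. This is only a sketch, assuming every study has a true effect exactly 2.8 standard errors from zero and a z-score that is normal with standard deviation 1; the seed and the number of simulated studies are arbitrary choices of mine.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n_sims <- 1e5    # number of hypothetical studies

# Observed z-scores for studies that really do have 80% power
z <- rnorm(n_sims, mean = 2.8, sd = 1)

# Two-sided p-values
p <- 2 * pnorm(-abs(z))

# Share of studies reaching p < 0.05; should come out close to 0.80
print(mean(p < 0.05))

# Spread of the p-values across studies
print(quantile(p, c(0.025, 0.25, 0.5, 0.75, 0.975)))
```

The quantiles show the p-values spanning several orders of magnitude, even though nothing differs between the simulated studies except sampling noise.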

Let’s explore further . . . the 75th percentile of the normal distribution is 0.67, so if we’re really running studies with 80% power, then one-quarter of the time we’d see z-scores above 2.8 + 0.67 = 3.47.

> 2*pnorm(-3.47)
[1] 0.00052

Dayum. We’d expect to see clean, un-hacked p-values of about 0.0005 or less, at least a quarter of the time, if we were running studies with minimum 80% power, as we routinely claim we’re doing, if we ever want any of that sweet, sweet NIH funding.
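The same calculation can be done for any threshold, not just 0.0005. Here is a sketch under the same assumptions as above (true effect exactly 2.8 standard errors from zero, two-sided p-values); the list of thresholds is my own choice.

```r
true_z <- 2.8
thresholds <- c(0.05, 0.005, 0.0005, 0.0001, 0.00005)

# z-score needed for a two-sided p-value at each threshold
z_cut <- qnorm(1 - thresholds / 2)

# Probability of landing beyond that z-score, in either direction
share <- pnorm(true_z - z_cut) + pnorm(-true_z - z_cut)

print(data.frame(p_below = thresholds, share = round(share, 2)))
```

The first row recovers the 80% power we started with, and the row for 0.0005 comes out at roughly one-quarter, matching the back-of-the-envelope calculation above.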

And, yes, that’s 0.0005, not 0.005. There’s a bunch of zeroes there.

And, no, this ain’t happening. We don’t have 80% power. Heck, we’re lucky if we have 6% power.

Remember that wonderful passage from the Nosek, Spies, and Motyl “50 shades of gray” paper:

“We conducted a direct replication while we prepared the manuscript. We ran 1,300 participants, giving us .995 power to detect an effect of the original effect size at alpha = .05.”

Followed by:

“The effect vanished (p = .59).”

None of this should be a surprise

When I say, “None of this should be a surprise,” I don’t just mean that, in response to the replication crisis and the work of Ioannidis, Button et al., etc., we should realize that statistically-based science is not what it’s claimed to be. And I don’t just mean that, given the real world of type M errors and the statistical significance filter, we should expect claims of statistical power (which are based on optimistic interpretations of a biased literature) to be wildly inflated. Issues of failed replications, type M errors, etc., are huge, and they contribute to the conditions that allow the erroneous power estimates.

But what I’m saying right here is that, even knowing nothing about any replication crisis, without any real-world experience or cynicism or sociology or documentation or whatever you want to call it . . . it just comes down to the math. With 80% power, we’d expect to see tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc. This would just be happening all the time. But it doesn’t.

I should’ve realized this the first time I was asked to demonstrate 80% power for a grant proposal. And certainly I should’ve realized this when writing the section on sample size and power analysis in my book with Jennifer, over ten years ago, well before I’d thought about all the problems in statistical practice of which we are now so painfully aware. All the math in that section is correct—but the implications of the math reveal the absurdity of the assumptions.

P.S. In response to a couple of comments below: Yes, the p-value is conditioned on the assumed effect size. But the point of the power calculation for the NIH grant is to say that the power really is at least 80%, at least much of the time. The assumptions are supposed to be reasonable. Given that we’re not routinely seeing tons and tons of p-values like 0.0005, 0.0001, 0.00005, etc., this suggests that the assumptions are not so reasonable.

P.P.S. The funny thing is, when you design a study it seems like it should be so damn easy to get 80% power. It goes something like this: Assume a small effect size, say 0.1 standard deviations; then to get 2.8 standard errors from zero you just need 0.1 = 2.8/sqrt(N), thus N = (2.8/0.1)^2 = 784. Voila! OK, 784 seems like a lot of people, so let’s assume an effect size of 0.2 standard deviations, then we just need N = 196, that’s not so bad. NIH, here we come!
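Here is that design calculation as a small R function. It is a sketch only: `n_for_power` is a hypothetical helper of mine, and it assumes, as in the formula above, that the standard error is 1/sqrt(N) when the effect is measured in standard deviations.

```r
# Hypothetical helper: N needed so that the assumed effect sits far enough
# from zero, in standard errors, to give the requested power.
n_for_power <- function(effect_sd, power = 0.80, alpha = 0.05) {
  z_needed <- qnorm(1 - alpha / 2) + qnorm(power)  # about 2.8 by default
  ceiling((z_needed / effect_sd)^2)
}

print(n_for_power(0.1))
print(n_for_power(0.2))
```

Because the function uses the unrounded value of 2.8 and rounds N up, its answers land within a unit or two of the 784 and 196 in the text rather than matching them exactly.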

What went wrong? Here’s what’s happening: (a) effects are typically much smaller than people want to believe, (b) effect size estimates from the literature are massively biased, (c) systematic error is a thing, (d) so is variation across experimental conditions. Put it all together, and even that N = 784 study is not going to do the job—and even if you do turn up a statistically significant difference in your particular experimental conditions, there’s no particular reason to expect it will generalize. So, designing a study with 80% power is not so easy after all.
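To see how much damage point (a) does on its own, here is a sketch of the actual power of the N = 196 design when the true effect is smaller than the 0.2 standard deviations assumed in the proposal. The smaller effect sizes are hypothetical values I picked for illustration, and `power_at` is my own helper; nothing here accounts for biased estimates, systematic error, or variation across conditions.

```r
# Hypothetical helper: two-sided power at the 5% level for a given true
# effect (in standard deviations) and sample size, with standard error
# 1/sqrt(n) as in the design calculation in the text.
power_at <- function(true_effect_sd, n, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  true_z <- true_effect_sd * sqrt(n)
  pnorm(true_z - z_crit) + pnorm(-true_z - z_crit)
}

n <- 196                                 # sized for an assumed effect of 0.2
true_effects <- c(0.2, 0.1, 0.05, 0.02)  # hypothetical true effects
print(data.frame(true_effect = true_effects,
                 power = round(power_at(true_effects, n), 2)))
```

Only the first row, where the true effect equals the assumed one, gives 80% power; the power falls off quickly as the true effect shrinks.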

P.P.P.S. To clarify one more thing: I do not think the goal of an experiment should be to get “statistical significance” or any other sense of certainty. The paradigm of routine discovery is over, and it can’t be recovered with N = 784 or even N = 7840.