In an otherwise pointless comment thread the other day, Dan Lakeland contributed the following gem:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

I couldn't care less about p-values, but I really, really like the identification of a null hypothesis with a random number generator. That's exactly the point.
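Lakeland's definition can be made concrete in a few lines of code (a hypothetical sketch of my own, not from the post): the null hypothesis is literally a random number generator, here 20 fair coin flips, and the p-value is just the fraction of the generator's output whose test statistic is at least as extreme as what we observed.

```python
import random

def p_value(observed_stat, null_rng, test_stat, n_sims=100_000):
    """Monte Carlo p-value: the fraction of draws from the null RNG whose
    test statistic is as extreme or more extreme than the observed one."""
    draws = (test_stat(null_rng()) for _ in range(n_sims))
    return sum(d >= observed_stat for d in draws) / n_sims

random.seed(1)
flips = lambda: [random.random() < 0.5 for _ in range(20)]  # the "null hypothesis"
n_heads = sum                                               # the test statistic

p = p_value(15, flips, n_heads)  # suppose we observed 15 heads in 20 flips
print(p)  # near the exact tail probability P(X >= 15) ~ 0.021 for X ~ Bin(20, 1/2)
```

Nothing scientific appears anywhere in this calculation: the "null hypothesis" enters only as a recipe for generating fake data.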

The only thing missing is to specify that “as extreme or more extreme” is defined in terms of a test statistic which itself needs to be defined for every possible outcome of the random number generator. For more on this last point, see section 1.2 of the forking paths paper:

The statistical framework of this paper is frequentist: we consider the statistical properties of hypothesis tests under hypothetical replications of the data. Consider the following testing procedures:

1. Simple classical test based on a unique test statistic, T, which when applied to the observed data yields T(y).

2. Classical test pre-chosen from a set of possible tests: thus, T(y; φ), with preregistered φ. For example, φ might correspond to choices of control variables in a regression, transformations, and data coding and excluding rules, as well as the decision of which main effect or interaction to focus on.

3. Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y; φ(y)), where the function φ(·) is observed in the observed case.

4. "Fishing": computing T(y; φ_j) for j = 1, …, J: that is, performing J tests and then reporting the best result given the data, thus T(y; φ_best(y)).

Our claim is that researchers are doing #3, but the confusion is that, when we say this, researchers think we're accusing them of doing #4. To put it another way, researchers assert that they are not doing #4, and the implication is that they are doing #2. In the present paper we focus on possibility #3, arguing that, even without explicit fishing, a study can have a huge number of researcher degrees of freedom, following what de Groot (1956) refers to as "trying and selecting" of associations. . . . It might seem unfair that we are criticizing published papers based on a claim about what they would have done had the data been different. But this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data. . . .
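The gap between procedures #2 and #3 shows up in a small simulation (my own sketch, not from the paper). The data come from a pure-null random number generator with three identical groups; the analyst glances at the group means and tests whichever of two contrasts looks larger. Only one p-value is ever computed per dataset, but a different test would have been computed had the data come out differently, and the rate of p < 0.05 climbs well above the nominal 5%.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def type_i_rate(n_sims=20_000, n=50, seed=0):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Null RNG: three groups drawn from the same N(0, 1) distribution.
        a, b, c = ([rng.gauss(0, 1) for _ in range(n)] for _ in range(3))
        d_ab = sum(a) / n - sum(b) / n
        d_ac = sum(a) / n - sum(c) / n
        # phi(y): focus on whichever contrast looks bigger. Only one test
        # is run, but the choice of test depends on the data (procedure #3).
        d = d_ab if abs(d_ab) > abs(d_ac) else d_ac
        z = d / math.sqrt(2 / n)  # variances are known to equal 1 here
        rejections += two_sided_p(z) < 0.05
    return rejections / n_sims

rate = type_i_rate()
print(rate)  # noticeably above the nominal 0.05
```

Under preregistration (procedure #2) the same test statistic would reject about 5% of the time; the inflation comes entirely from the data-dependent choice of which comparison to focus on.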

Summary

These are the three counterintuitive aspects of the p-value, the three things that students and researchers often don’t understand:

– The null hypothesis is not a scientific model. Rather, it is, as Lakeland writes, “a specific random number generator.”

– The p-value is not the probability that the null hypothesis is true. Rather, it is the probability of seeing a test statistic as large or larger than was observed, conditional on the data coming from this specific random number generator.

– The p-value depends entirely on what would have been done under other possible datasets. It is not rude to speculate on what a researcher would have done had the data been different; actually, such specification is required in order to interpret the p-value, in the same way that the only way to answer the Monty Hall problem is to specify what Monty would have done under alternative scenarios.
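The Monty Hall analogy can itself be checked by simulation (a standard exercise, sketched here with hypothetical function names). If Monty always opens a door he knows hides a goat, switching wins about 2/3 of the time; if he opens a random unpicked door and we merely condition on having happened to see a goat, switching wins only about half the time. The observed data are identical in both cases; only the counterfactual behavior differs, and so does the answer.

```python
import random

def switch_win_rate(monty_knows, n=100_000, seed=0):
    """Win rate for always switching, under two specifications of Monty.
    If monty_knows, he deliberately opens a goat door; otherwise he opens
    a random unpicked door and we keep only rounds where a goat appears."""
    rng = random.Random(seed)
    wins = trials = 0
    for _ in range(n):
        car, pick = rng.randrange(3), rng.randrange(3)
        others = [d for d in range(3) if d != pick]
        if monty_knows:
            opened = next(d for d in others if d != car)
        else:
            opened = rng.choice(others)
            if opened == car:
                continue  # no goat revealed: this round never matches the puzzle
        final = next(d for d in range(3) if d not in (pick, opened))  # switch
        trials += 1
        wins += final == car
    return wins / trials

r_knows = switch_win_rate(monty_knows=True)
r_random = switch_win_rate(monty_knows=False)
print(r_knows, r_random)  # roughly 2/3 versus roughly 1/2
```

Interpreting a p-value requires the same move: a full specification of what the researcher would have done under data that never occurred.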