Sam Behseta sends along this paper by Laura Lazzeroni, Ying Lu, and Ilana Belitskaya-Lévy, who write:

P values from identical experiments can differ greatly in a way that is surprising to many. The failure to appreciate this wide variability can lead researchers to expect, without adequate justification, that statistically significant findings will be replicated, only to be disappointed later.

I agree that the randomness of the p-value—the fact that it is a function of the data and thus has a sampling distribution—is an important point that is not well understood. Indeed, I think that the z-transformation (the mapping, via the normal cdf, from a z-score to a p-value) is in many ways a horrible thing, in that it takes small, noisy differences in z-scores and elevates them into the apparently huge differences between p=.1, p=.01, and p=.001. This is the point of my paper with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The p-value, like any data summary, is a random variable with a sampling distribution.
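To see how noisy this mapping is, here's a minimal simulation sketch (my own illustration, not from the paper): I assume the z-statistic from each replicate experiment is drawn from a normal distribution centered at a true z of 2 with standard deviation 1, which is the sampling distribution of a z-score, and convert each draw to a two-sided p-value.

```python
import math
import random

random.seed(1)

def p_value(z):
    # Two-sided p-value from a z-score, using the normal cdf:
    # p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Assumed setup: the true effect corresponds to z = 2 (p about .05),
# and each replicate's observed z-score is normal(2, 1).
true_z = 2.0
replicate_z = [random.gauss(true_z, 1.0) for _ in range(10)]

for z in replicate_z:
    print(f"z = {z:5.2f}  ->  p = {p_value(z):.4f}")
```

Identical experiments here yield p-values ranging over several orders of magnitude—some comfortably "significant," some nowhere close—even though each observed z differs from its neighbors by only a standard error or so.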

Incidentally, I have the same feeling about cross-validation-based estimates and even posterior distributions: all of these are functions of the data and thus have sampling distributions, but theoreticians and practitioners alike tend to forget this and instead treat them as truths.

My one criticism of this particular article is that it takes p-values at face value, whereas in real life p-values typically are the product of selection, as discussed by Uri Simonsohn et al. a few years ago in their “p-hacking” article and as discussed by Eric Loken and me a couple years ago in our “garden of forking paths” article. I think real-world p-values are much more optimistic than the nominal p-values discussed by Lazzeroni et al. But in any case I think they’re raising an important point that’s been under-emphasized in textbooks and in the statistics literature.