A few weeks ago, Nature published an article summarising the various measures and counter-measures suggested to improve statistical inference and science as a whole (Chawla, 2017). It detailed the initial call to lower the significance threshold from 0.05 to 0.005 (Benjamin et al., 2017) and the paper published in response (Lakens et al., 2017). It was a well-written article, with one minor mistake: an incorrect definition of a p-value:

The two best sources for the correct definition of a p-value (along with its implications and examples of how a p-value can be misinterpreted) are Wasserstein & Lazar (2016) and its supplementary paper Greenland et al. (2016). A p-value has been defined as: “a statistical summary of the compatibility between the observed data and what we would predict or expect to see if we knew the entire statistical model (all the assumptions used to compute the P value) were correct” (Greenland et al., 2016). To put it another way, it tells us the probability of obtaining the data we have, or data more extreme, assuming that the null hypothesis, along with all the other assumptions of the model (about randomness in sampling, treatment assignment, loss and missingness, the study protocol, etc.), is true. The definition provided in the Chawla article is incorrect because it states “the smaller the p-value, the less likely it is that the results are due to chance”. This gets things backwards: the p-value is a probability deduced from a set of assumptions (e.g. that the null hypothesis is true), so it cannot also tell you the probability of those assumptions at the same time. Joachim Vandekerckhove and Ken Rothman give further evidence as to why this definition is incorrect:

This definition is specifically wrong if the alternative hypothesis is unlikely a priori, which is reasonably often. — J. Vandekerckhove (@VandekerckhoveJ) October 4, 2017

Thinking p is the probability of the null is a form of base rate neglect fallacy, which gets worse as the base rate gets lower. — J. Vandekerckhove (@VandekerckhoveJ) October 4, 2017

…Here is a true story that shows why a P-value cannot tell you whether the null hypothesis is correct. pic.twitter.com/OjXPuIK25o — Ken Rothman (@ken_rothman) October 4, 2017
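Vandekerckhove’s base-rate point can be made concrete with a quick Bayes’ rule sketch. All of the numbers below are hypothetical, but they show how far the probability of the null given a significant result can sit from the 5% a reader might assume:

```python
# Hypothetical numbers: the prior probability that the alternative is
# true, the test's power, and the significance threshold (alpha).
prior_h1 = 0.10   # only 1 in 10 tested hypotheses is actually true
power = 0.80      # P(significant result | H1 true)
alpha = 0.05      # P(significant result | H0 true)

# Bayes' rule: probability that H0 is true GIVEN a significant result.
p_significant = alpha * (1 - prior_h1) + power * prior_h1
p_h0_given_sig = alpha * (1 - prior_h1) / p_significant

print(round(p_h0_given_sig, 2))  # 0.36
```

With these (entirely made-up) base rates, over a third of significant results come from true nulls, which is exactly the base-rate neglect Vandekerckhove describes: the lower the prior on H1, the worse the gap gets.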

What’s the big deal?

I posted the above photo to the Facebook Psychological Methods Discussion Group, where it prompted some discussion. What interested me most was a comment by J. Kivikangas, which I have screen-capped below.

I responded with:

J. Kivikangas replied:

I then asked on Twitter if anyone could provide a better answer to the question. Posted below are their responses.

…of finding p = 0.04 until it is LESS likely under H1 than H0. Should be known to people. But you *still* have a 5% error rate. — Daniël Lakens (@lakens) September 22, 2017

Problem with not knowing definition, is you don't know which questions you are asking. That's a problem if you do science. — Daniël Lakens (@lakens) September 22, 2017
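To make explicit which question a p-value does answer, here is a minimal simulation. The sample size, observed mean, and null model are all hypothetical; the point is only that the p-value is a tail proportion computed under the null model:

```python
import random
import statistics

random.seed(42)

# Hypothetical setup: a sample of n = 30 with observed mean 0.35,
# tested against a null model of Normal(0, 1) draws.
n = 30
observed_mean = 0.35
reps = 50_000

# Sampling distribution of the mean when the null model is true.
null_means = [
    statistics.fmean(random.gauss(0, 1) for _ in range(n))
    for _ in range(reps)
]

# Two-sided p-value: the fraction of null-model samples at least as
# extreme as the observed one.
p_value = sum(abs(m) >= abs(observed_mean) for m in null_means) / reps
print(round(p_value, 3))
```

That tail proportion (here around 0.05) is everything the p-value says; it describes the data’s compatibility with the null model, not the probability that the null model is true.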

Daniël Lakens also gave a link to his blog post, where he explains how p-values traditionally classed as significant (e.g. 0.03 < p < 0.05) can be more likely under H0 than under H1.
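That claim can be sketched with a quick simulation. Assuming a simple two-sided z test and a hypothetical large true effect (noncentrality delta = 4.5, i.e. very high power), p-values between 0.03 and 0.05 turn out to be rarer under H1 than under H0:

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

reps = 100_000
delta = 4.5  # hypothetical noncentrality: a large true effect

# Under H0 the z statistic is N(0, 1); under this H1 it is N(delta, 1).
frac_h0 = sum(0.03 < two_sided_p(random.gauss(0, 1)) < 0.05
              for _ in range(reps)) / reps
frac_h1 = sum(0.03 < two_sided_p(random.gauss(delta, 1)) < 0.05
              for _ in range(reps)) / reps

# With high power, p-values pile up near zero, so 0.03 < p < 0.05 is
# actually RARER under H1 than under H0.
print(frac_h0 > frac_h1)  # True
```

Under H0 about 2% of p-values land in that band (p-values are uniform under the null), while under this high-powered H1 far fewer do, so a p-value of 0.04 here is evidence *for* the null relative to the alternative.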

Agree with what you and Daniël wrote. + H0 vs. HA is purely a numerical matter; HA may be true for diff reasons (e.g. expectancy fx, bias) — Jan Vanhove (@janhove) September 22, 2017

Ben Prytherch remarked that: “It isn’t just that small p-values provide weaker evidence than one would think under the false interpretation of ‘probability these results were due to chance’. It’s also that large p-values don’t provide anywhere near the kind of evidence in support of the null that this interpretation implies. Imagine getting p = 0.8. The popular misinterpretation of the p-value would say that there’s an 80% chance that the null is true. But this is nuts, especially considering how many nulls are point nulls that can’t possibly be true (e.g. ‘the population correlation is precisely zero’).”

J. Kivikangas commented back:

This is as far as the conversation went for the time being. If you have any points you would like to make on the topic, please comment below, and I will add them.
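On Prytherch’s point: one way to see that p = 0.8 cannot mean “an 80% chance the null is true” is that, when the null is true, p-values are uniformly distributed, so a p near 0.8 is exactly as probable as a p near 0.05. A small sketch, assuming a hypothetical two-sided z test:

```python
import math
import random

random.seed(7)

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# When H0 is true, the z statistic is N(0, 1) and the resulting
# p-values are uniform on [0, 1].
reps = 100_000
ps = [two_sided_p(random.gauss(0, 1)) for _ in range(reps)]

# Compare two equally wide bands: p near 0.8 vs p near 0.05.
near_080 = sum(0.775 < p < 0.825 for p in ps) / reps
near_005 = sum(0.025 < p < 0.075 for p in ps) / reps
print(round(near_080, 2), round(near_005, 2))  # both about 0.05
```

A true null is no more likely to produce p = 0.8 than p = 0.05, so the size of a large p-value carries no special support for the null on its own.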

So where does this leave us?

Whilst I don’t think we persuaded J. Kivikangas of the importance of understanding the precise definition of a p-value, I still believe that understanding is useful. Beyond the inherent value of knowing the correct definition, I feel it has real-world consequences (as detailed above), and therefore all researchers should understand what a p-value actually is. However, I can understand why others might disagree if they believe the confusion doesn’t negatively impact the research they perform.

References

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.J., Berk, R., … Johnson, V. (2017). Redefine statistical significance. Nature Human Behaviour. Available at: https://doi.org/10.17605/OSF.IO/MKY9J

Chawla, D.S. (2017). ‘One-size-fits-all’ threshold for P values under fire. Nature News. Available at: http://www.nature.com/news/one-size-fits-all-threshold-for-p-values-under-fire-1.22625#/b1 [accessed on: 02/10/2017].

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. In: Kaplan, D. (Ed.), The Sage Handbook of Quantitative Methodology for the Social Sciences (pp. 391–408). Thousand Oaks, CA: Sage.

Greenland, S., Senn, S.J., Rothman, K.J., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31, 337–350. https://doi.org/10.1007/s10654-016-0149-3

Lakens, D., Adolfi, F., Albers, C., … Zwaan, R. (2017). Justify Your Alpha: A Response to “Redefine Statistical Significance”. DOI: 10.17605/OSF.IO/9S3Y6. Available at: https://psyarxiv.com/9s3y6 [accessed on: 02/10/2017].

Wagenmakers, E.J. & Gronau, Q. (2017). Bayesian Spectacles. Available at: https://www.bayesianspectacles.org/ [accessed on: 02/10/2017]

Wasserstein, R.L. & Lazar, N.A. (2016) The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70 (2), 129-133, DOI: 10.1080/00031305.2016.1154108