No single index should substitute for scientific reasoning. — Official ASA statement

TLDR: The American Statistical Association’s official stance is that p-values are bad measures of evidence. We as psychologists need to recalibrate our intuitions for what constitutes good evidence. See the full statement here. [Link fixed!]

The American Statistical Association just released its long-promised official statement regarding its stance on p-values. If you don’t remember (don’t worry, it was over a year ago), the ASA responded to Basic and Applied Social Psychology’s (BASP) widely publicized p-value ban by saying,

A group of more than two-dozen distinguished statistical professionals is developing an ASA statement on p-values and inference that highlights the issues and competing viewpoints. The ASA encourages the editors of this journal [BASP] and others who might share their concerns to consider what is offered in the ASA statement to appear later this year and not discard the proper and appropriate use of statistical inference.

This development is especially relevant for psychologists, since the p-value is ubiquitous in our literature. I think I have only ever seen a handful of papers without one. Are we using it correctly? What is proper? The ASA is here to set us straight.

The scope of the statement

The statement begins by saying “While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.” To help clarify how the p-value should be used, the ASA “believes that the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value.” Their stated goal is to articulate “in non-technical terms a few select principles that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.”

So first things first: what is a p-value?

The ASA gives the following definition for a p-value:

a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

So the p-value is a probability statement about the observed data, and data more extreme than those observed, given that an underlying statistical model (e.g., a null hypothesis) is true. How can we use this probability measure?
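To make the definition concrete, here is a minimal sketch of an exact permutation test for the ASA's own example, the mean difference between two groups. The "specified statistical model" assumed here is that group labels are exchangeable (no difference between groups). The scores and the function names are hypothetical, made up purely for illustration.

```python
from itertools import combinations


def mean_difference(values, idx_a):
    """Absolute difference in means between the items at idx_a and the rest."""
    in_a = set(idx_a)
    a = [v for i, v in enumerate(values) if i in in_a]
    b = [v for i, v in enumerate(values) if i not in in_a]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means.

    Assumed model: labels are exchangeable, so every way of splitting the
    pooled scores into groups of these sizes is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    # The statistical summary actually observed.
    observed = mean_difference(pooled, range(n_a))
    # Every possible relabeling of the pooled scores.
    splits = list(combinations(range(len(pooled)), n_a))
    # Count relabelings whose summary is equal to or more extreme than observed
    # (small tolerance guards against floating-point rounding).
    as_extreme = sum(
        1 for idx in splits if mean_difference(pooled, idx) >= observed - 1e-12
    )
    return as_extreme / len(splits)


# Hypothetical scores, invented for this example only.
treatment = [12, 15, 14, 17, 16, 18]
control = [10, 11, 13, 9, 12, 14]
print(f"p = {permutation_p_value(treatment, control):.4f}")
```

Note what the function returns: the proportion of relabelings that produce a difference at least as large as the observed one, assuming the null model holds. Nothing in the calculation refers to the probability that the model itself is true.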

Six principles for using p-values

The basic gist of the statement is this: p-values can be used as a measure of the misfit between the data and a model (e.g., a null hypothesis), but that measure of misfit does not tell us the probability that the null hypothesis is true (as we all hopefully know by now). It does not tell us what action we should take (submit to a big name journal, abandon/continue a research line, implement an intervention, etc.). It does not tell us how big or important the effect we’re studying is. And most importantly (in my opinion), it does not give us a meaningful measure of evidence regarding a model or hypothesis.

Here are the principles:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
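The fifth principle (a p-value does not measure the size of an effect) is easy to see with a little arithmetic. The sketch below holds a small standardized effect fixed and only changes the sample size. It assumes a two-sample z-test with a known standard deviation of 1, which is a simplification chosen for illustration; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Two-sided p-value for a two-sample z-test.

    Assumptions (for illustration only): equal group sizes, known SD of 1,
    and an observed standardized mean difference of exactly d.
    """
    # Standard error of the mean difference is sqrt(2 / n), so z = d / SE.
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same small effect (d = 0.2) observed at three different sample sizes.
for n in (20, 200, 2000):
    print(f"d = 0.2, n per group = {n:4d}, p = {two_sided_p(0.2, n):.4g}")
```

The effect is identical in every row, yet the p-value moves from clearly nonsignificant to vanishingly small as n grows. The p-value is tracking the sample size at least as much as it is tracking the effect.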

In the paper each principle is followed by a paragraph of detailed exposition. I recommend you take a look at the full statement.

So what does this mean for psychologists?

The ASA gives many explicit recommendations and it is worth reading their full (short!) report. I think the most important principle is principle 6. Psychologists mainly use p-values as a measure of the evidence we have obtained against the null hypothesis. You run your study and check the p-value; if p is below .05, you have “significant” evidence against the null hypothesis, and you feel justified in doubting it and, consequently, in having confidence in your preferred substantive hypothesis.

The ASA tells us this is not good practice. Taking a p-value as strong evidence just because it is below .05 is actually misleading; the ASA specifically says “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.” I recently discussed a paper on this blog (Berger & Delampady, 1987 [pdf]) that showed exactly this: A p-value near .05 can only achieve a maximum Bayes factor of ~2 with most acceptable priors, which is a very weak level of evidence — and usually it is much weaker still.
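To get a feel for where a number like ~2 comes from, here is a rough sketch of two standard upper bounds on the Bayes factor against the null. These are not necessarily the exact calculations in the paper cited above. The first is the -e·p·ln(p) calibration usually attributed to Sellke, Bayarri and Berger (2001). The second is the largest Bayes factor attainable with any normal prior centered on the null value, assuming a two-sided z-test. The function names are my own.

```python
import math
from statistics import NormalDist


def max_bf10_calibration(p):
    """Upper bound on the Bayes factor against the null: 1 / (-e * p * ln p).

    The calibration only applies for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("bound requires 0 < p < 1/e")
    return 1 / (-math.e * p * math.log(p))


def max_bf10_normal_prior(p):
    """Largest Bayes factor against the null over normal priors centered on it.

    Assumes a two-sided z-test. Minimizing the Bayes factor for the null over
    the prior variance gives sqrt(e) * z * exp(-z^2 / 2), valid when z > 1.
    """
    z = NormalDist().inv_cdf(1 - p / 2)  # z statistic matching the p-value
    if z <= 1:
        raise ValueError("bound requires |z| > 1")
    return math.exp((z * z - 1) / 2) / z


for p in (0.05, 0.01, 0.001):
    print(
        f"p = {p}: calibration bound = {max_bf10_calibration(p):.2f}, "
        f"normal-prior bound = {max_bf10_normal_prior(p):.2f}"
    )
```

For p = .05 both bounds come out a little above 2 (roughly 2.5 and 2.1). Keep in mind these are best-case figures: they describe the prior most favorable to the alternative, so the evidence under any specific prior you would actually defend can only be the same or weaker.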

The bottom line is this: We need to adjust our intuitions about what constitutes adequate evidence. Joachim Vandekerckhove and I recently concluded that one big reason effects “failed to replicate” in the Reproducibility Project: Psychology is that the evidence for their existence was unacceptably weak to begin with. When we properly evaluate the evidence from the original studies (even before taking publication bias into account) we see there was little reason to believe the effects ever existed in the first place. “Failed” replications are a natural consequence of our current low standards of evidence.

There are many (many, many) papers in the statistics literature showing that p-values overstate the evidence against the null hypothesis; now the ASA have officially taken this stance as well.

Choice quotes

Below I include some quotations I think are most relevant to practicing psychologists.

Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible.

In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.

The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.

Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted and all p-values computed. Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting.

Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect.