Verifying that a statistically significant result is scientifically meaningful is not only good scientific practice, it is a natural way to control the Type I error rate. Here we introduce a novel extension of the p-value, the second-generation p-value (p_δ), that formally accounts for scientific relevance and leverages this natural Type I error control. The approach relies on a pre-specified interval null hypothesis that represents the collection of effect sizes that are scientifically uninteresting or practically null. The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses. As such, second-generation p-values indicate when the data are compatible with null hypotheses (p_δ = 1), when they are compatible with alternative hypotheses (p_δ = 0), and when they are inconclusive (0 < p_δ < 1). Moreover, second-generation p-values provide a proper scientific adjustment for multiple comparisons and reduce false discovery rates. This is an advance for data-rich environments, where traditional p-value adjustments are needlessly punitive. Second-generation p-values promote transparency, rigor, and reproducibility of scientific results by specifying a priori which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with alternative or null hypotheses.

Introduction

P-values abound in the scientific literature. They have become the researcher’s essential tool for summarizing when the data are incompatible with the null hypothesis. Although p-values are widely recognized as imperfect tools for this task, the impact of their flaws on scientific inference remains hotly debated [1–5]. The debate over the proper use and interpretation of p-values has stymied and divided the statistical community [6–14]. Recurring themes include the difference between statistical and scientific significance, the routine misinterpretation of non-significant p-values, the unrealistic nature of a point null hypothesis, and the challenges of multiple comparisons. With no widely accepted alternative to promote, statisticians are left to tweak the manner in which p-values are applied and interpreted [11,12]. Some have even suggested that the problem lies with instruction: p-values are fine; they are just widely misused [15,16]. After a century of widespread adoption in science, with their flaws and advantages well known, it is time for an upgrade.

The purpose of this paper is to introduce a novel and intuitive extension that better serves the p-value’s intended purpose. We call this upgrade a second-generation p-value. Second-generation p-values are easy to compute and interpret. They offer improved inferential capability; for example, it is now possible for the data to indicate support for the null hypothesis. They control the Type I error naturally, forcing it to zero as the sample size grows. This, in turn, offsets the Type I error inflation that results from multiple comparisons or multiple examinations of accumulating data. Findings identified by second-generation p-values are less likely to be false discoveries than findings identified by classical p-values. Consequently, second-generation p-values do not require ad hoc adjustments to provide strict error control, and this improves power in studies with massive multiple comparisons. They also implicitly codify good research practice: the smallest effect size of scientific relevance must now be specified before looking at results. This prevents the inevitable rationalization that accompanies the post-hoc interpretation of mediocre results that have been deemed statistically significant. This change alone will improve rigor and reproducibility across science.

Our examples (Section 3) were selected from a wide range of contexts to highlight the broad utility of this new tool. We will not dwell on the well-known drawbacks of classical p-values [11–14]. The frequency properties of second-generation p-values are the same as, or better than, those of traditional p-values. These technical details, along with supplementary exposition, can be found in the supplementary materials (S1 and S2 Files). A distinguishing feature of second-generation p-values is that they are intended as summary statistics that indicate when a study has met its a priori defined endpoint: the observed data support only alternative hypotheses or only null hypotheses.

Given the complexity surrounding the interpretation and computation of p-values, and the plethora of ad hoc statistical adjustments for them, the reader is forgiven for any pre-emptive statistical fatigue, pessimism, or skepticism. After all, every statistical adjustment for multiple comparisons boils down to nothing more than ranking the p-values and picking a cutoff to determine significance. While each method offers its own preferred cutoff, the core value judgment—the ranking—remains the same. Second-generation p-values, however, change that ranking; they favor results that are both scientifically relevant and statistically significant. For example, Section 3 presents an application where a Bonferroni correction yields 264 genes of interest from a study of 7128 candidate genes in which 2028 had an unadjusted p-value of 0.05 or less. An application of the second-generation p-value also yields 264 gene findings (their second-generation p-value is 0), ensuring the same Type I error control. However, 82 (31%) of those genes fail to meet the Bonferroni criterion. The difference is both fascinating and striking, and is due to the second-generation p-value’s preference for scientific relevance (which in this case amounts to a preference for clinically relevant fold changes in expression levels).
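To make the contrast concrete, the following is a minimal sketch of the two selection rules, not an analysis of the study data: the gene names, p-values, and second-generation p-values below are hypothetical and serve only to show how the rankings can diverge.

```python
# Sketch of the two selection rules with hypothetical per-gene results.
m = 7128                        # number of candidate genes in the example
alpha = 0.05
bonferroni_cutoff = alpha / m   # roughly 7.0e-6

# Hypothetical values: (unadjusted p-value, second-generation p-value)
genes = {
    "gene_A": (1e-8, 0.0),   # selected by both criteria
    "gene_B": (1e-7, 0.3),   # passes Bonferroni, but the effect overlaps the null interval
    "gene_C": (1e-4, 0.0),   # fails Bonferroni, yet has a clinically relevant effect
}

bonferroni_hits = {g for g, (p, _) in genes.items() if p < bonferroni_cutoff}
sgpv_hits = {g for g, (_, p_delta) in genes.items() if p_delta == 0.0}
print(bonferroni_hits)   # {'gene_A', 'gene_B'}
print(sgpv_hits)         # {'gene_A', 'gene_C'}
```

The Bonferroni rule keeps whichever genes happen to have the smallest unadjusted p-values; a second-generation p-value of zero instead requires the estimated effect to clear the interval null entirely, so a modest p-value with a clinically relevant effect can be selected while a precisely estimated but trivial effect is not.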

1.1 Illustration of approach

The top diagram of Fig 1 depicts an estimated effect (typically the best supported hypothesis), its 95% confidence interval (CI), and the traditional point null hypothesis, H_0. The CI contains all the effect sizes that are supported by the data at the 95% level; we will refer to it as the set of data-supported hypotheses. If the null hypothesis is well outside of the interval, the p-value is very small or near zero. If the CI just barely excludes the null hypothesis, the p-value will be slightly less than 0.05. When the CI contains the null hypothesis, the p-value will be larger than 0.05. The p-value grows to 1 as the null hypothesis approaches the center of the CI.
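To make this duality concrete, here is a minimal sketch (not from the paper) under a normal approximation for the estimated effect; the estimate and standard error are invented for illustration.

```python
# Minimal sketch of the CI / p-value duality described above, assuming a normal
# approximation for the estimated effect. The numbers are hypothetical.
from scipy import stats

def two_sided_p(estimate, se, null_value=0.0):
    """Two-sided p-value for a point null hypothesis."""
    z = (estimate - null_value) / se
    return 2 * stats.norm.sf(abs(z))

def ci95(estimate, se):
    """95% confidence interval: the set of data-supported hypotheses."""
    half_width = stats.norm.ppf(0.975) * se
    return estimate - half_width, estimate + half_width

est, se = 1.2, 0.5
print(ci95(est, se))           # (0.22, 2.18): the CI excludes 0 ...
print(two_sided_p(est, se))    # ... so the p-value is below 0.05 (about 0.016)
```

When the point null sits exactly at the center of the interval (estimate equal to the null value), two_sided_p returns 1, matching the description above.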


Fig 1. Illustration of a point null hypothesis, H_0; the estimated effect that is the best supported hypothesis; the 95% confidence interval (CI) for the estimated effect, [CI−, CI+]; and the interval null hypothesis. https://doi.org/10.1371/journal.pone.0188299.g001

Now imagine that the null hypothesis is a contiguous set—an interval—rather than just a single point, as depicted in the bottom diagram of Fig 1. The interval null is the set of effects that are indistinguishable from the null hypothesis due to limited precision or practicality. For example, the null hypothesis “no age difference” might be re-framed as “no age difference of more than 365 days”, the latter being what we really mean when we say two people are the same age (e.g., they are both 45). An interval null always exists, even if it is narrow. When a 95% CI is entirely contained within the null interval, the data support only null hypotheses (this is the traditional benchmark for showing statistical equivalence). When the CI and null interval do not overlap, the data are incompatible with the null. Lastly, when the null interval and confidence interval partially intersect, the data are inconclusive. Thus, the degree of overlap conveys how compatible the data are with the null premise. The second-generation p-value is the fraction of overlap multiplied by a small-sample correction factor; we define it formally in Section 2. In a very real sense, the second-generation p-value is nothing more than the codification of today’s standards for good scientific and statistical practice.
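As a concrete illustration, here is a minimal sketch of that overlap calculation, assuming the small-sample correction factor takes the form max{|I| / (2|H_0|), 1} (the formal definition is given in Section 2); under that assumed form the value is capped at one half whenever the interval estimate is more than twice as long as the interval null. The age-difference intervals are hypothetical.

```python
# Minimal sketch: fraction of the interval estimate I = [ci_lo, ci_hi] that overlaps
# the interval null H0 = [null_lo, null_hi], times a small-sample correction factor.
# The correction factor max(|I| / (2|H0|), 1) is an assumed form; see Section 2.

def second_gen_p(ci_lo, ci_hi, null_lo, null_hi):
    est_len = ci_hi - ci_lo
    null_len = null_hi - null_lo
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
    correction = max(est_len / (2.0 * null_len), 1.0)
    return (overlap / est_len) * correction

# "No age difference of more than 365 days" as the interval null:
print(second_gen_p(-100, 200, -365, 365))   # CI inside the null interval -> 1.0
print(second_gen_p(400, 900, -365, 365))    # no overlap with the null    -> 0.0
print(second_gen_p(100, 600, -365, 365))    # partial overlap             -> 0.53
```

The three calls reproduce the three cases described above: data compatible only with null hypotheses, data incompatible with the null, and inconclusive data.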