Introduction

“It seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an ‘uncertainty laundering’ that begins with data and concludes with success as measured by statistical significance. (...) The solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.” Gelman (2016)

Scientific results can be irreproducible for at least six major reasons (Academy of Medical Sciences, 2015). There may be (1) technical problems that are specific to the particular study. There may be more general problems such as (2) weak experimental design or (3) methods that are not described precisely enough for the results to be reproduced. And there may be statistical issues affecting replicability that are largely the same across many fields of research. Such issues are (4) low statistical power, and (5) ‘data dredging’ or ‘p-hacking’, i.e., trying alternative analyses until a significant result is found, which is then selectively reported without mentioning the nonsignificant outcomes. Related to that, (6) publication bias occurs when papers are more likely to be published if they report significant results (Bishop & Thompson, 2016).

Is a major part of an apparent crisis of unreplicable research caused by the way we use statistics for analyzing, interpreting, and communicating our data? Significance testing has been severely criticized for about a century (e.g., Boring, 1919; Berkson, 1938; Rozeboom, 1960; Oakes, 1986; Cohen, 1994; Ziliak & McCloskey, 2008; Kline, 2013), but the prevalence of p-values in the biomedical literature is still increasing (Chavalarias et al., 2016). For this review, we assume that a revolution in applied statistics aiming to ban p-values is neither to be expected nor necessarily useful, and that the main problem is not p-values but how they are used (Gelman, 2013b; Gelman, 2016). We argue that one of the smallest incremental steps to address statistical issues of replicability, and at the same time one of the most urgent, is to remove thresholds of statistical significance such as p = 0.05 (see Box 1). This may still sound fairly radical to some, but for the following reasons it is actually not.

First, p-values can still be employed and interpreted in the traditional way as evidence against null hypotheses, even without a significance threshold. What needs to change, however, if we are to reduce data dredging and publication bias, is our overconfidence in what significant p-values can tell us and, as the other side of the coin, our dismissive attitude towards p-values that do not pass a threshold of significance. As long as we treat our larger p-values as unwanted children, they will continue to disappear into our file drawers, causing publication bias, which was identified long ago as possibly the most prevalent threat to the reliability and replicability of research (Sterling, 1959; Wolf, 1961; Rosenthal, 1979). Even today, in an online survey of 1576 researchers, selective reporting was considered the most important factor contributing to irreproducible research (Baker, 2016).

Second, the call to remove fixed significance thresholds is widely shared among statisticians. In 2016, the American Statistical Association (ASA) published a statement on p-values, produced by a group of more than two dozen experts (Wasserstein & Lazar, 2016). While there were controversial discussions about many topics, the consensus report of the ASA features the following statement: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process” (Wasserstein & Lazar, 2016). And a subgroup of seven ASA statisticians published an extensive review of 25 misinterpretations of p-values, confidence intervals, and power, closing with the words: “We join others in singling out the degradation of p-values into ‘significant’ and ‘nonsignificant’ as an especially pernicious statistical practice” (Greenland et al., 2016).

The idea of using p-values not as part of a binary decision rule but as a continuous measure of evidence against the null hypothesis has had many advocates, among them the late Ronald Fisher (Fisher, 1956; Fisher, 1958; Eysenck, 1960; Skipper, Guenther & Nass, 1967; Labovitz, 1968; Edgington, 1970; Oakes, 1986; Rosnow & Rosenthal, 1989; Stoehr, 1999; Sterne & Smith, 2001; Gelman, 2013a; Greenland & Poole, 2013; Higgs, 2013; Savitz, 2013; Madden, Shah & Esker, 2015; Drummond, 2016; Lemoine et al., 2016; Van Helden, 2016). Removing significance thresholds was also suggested by authors sincerely defending p-values against their critics (Weinberg, 2001; Hurlbert & Lombardi, 2009; Murtaugh, 2014a).
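In practice, treating the p-value as a continuous measure of evidence simply means reporting its exact value rather than collapsing it to ‘significant’ or ‘nonsignificant’. As a toy illustration of our own (not a procedure prescribed by the works cited above), a minimal sketch for a two-sided z-test using only the Python standard library:

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Exact two-sided p-value for a standard-normal test statistic z."""
    return erfc(abs(z) / sqrt(2))

# Report the exact value as graded evidence against the null hypothesis,
# instead of dichotomizing at an arbitrary threshold such as 0.05:
for z in (1.5, 1.8, 2.6):
    print(f"z = {z}: p = {two_sided_p(z):.3f}")
```

Under a graded-evidence reading, p = 0.072 and p = 0.034 convey rather similar amounts of evidence, whereas a fixed threshold at 0.05 would label them as categorically different outcomes.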

In the following, we start by reviewing what p-values can tell us about the replicability and reliability of results. That this will not be very encouraging should not be taken as yet more advice to stop using p-values. Rather, we want to stress that reliable information about the reliability of results cannot be obtained from p-values, nor from any other statistic calculated in an individual study. Instead, we should design, execute, and interpret our research as a ‘prospective meta-analysis’ (Ioannidis, 2010), allowing knowledge to be combined from multiple independent studies, each producing results that are as unbiased as possible. Our aim is to show that it is not p-values, but significance thresholds, that are a serious obstacle in this regard.
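One long-established way to combine evidence across independent studies, in the spirit of such a meta-analytic view, is Fisher’s method, which sums the log-transformed p-values and refers the result to a chi-square distribution. The sketch below is our own illustration (not a procedure from the works cited here), using only the Python standard library:

```python
from math import exp, log

def fisher_combined_p(p_values):
    """Fisher's method: under the joint null hypothesis,
    X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom, where k = number of studies."""
    k = len(p_values)
    x = -2.0 * sum(log(p) for p in p_values)
    # The chi-square survival function has a closed form for even df = 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total

# Three studies with individually unimpressive p-values still carry
# combined evidence worth reporting:
print(round(fisher_combined_p([0.1, 0.2, 0.3]), 3))  # prints 0.115
```

The point of the example is that each study’s exact p-value is an ingredient of the combined analysis; p-values that were censored for being above a threshold cannot contribute, which is precisely how significance thresholds feed publication bias.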

We therefore do not focus on general misconceptions about p-values, but on the problems with, history of, and solutions for applying significance thresholds. After discussing why significance cannot be used to reliably judge the credibility of results, we review why applying significance thresholds reduces replicability. We then describe how the switch in interpretation that often follows once a significance threshold is crossed leads to claimed proofs of the null hypothesis, such as ‘the earth is flat (p > 0.05)’. We continue by summarizing the opposing recommendations of Ronald Fisher versus Jerzy Neyman and Egon Pearson that led to the unclear status of nonsignificant results, contributing to publication bias. Finally, we outline how to use graded evidence and discuss potential arguments against removing significance thresholds. We conclude that we side with a neoFisherian paradigm of treating p-values as graded evidence against the null hypothesis. We think that little would need to change, but much could be gained, by respectfully discharging significance and cautiously interpreting p-values as continuous measures of evidence.