It is worth noting that earlier, in the 1980s and 1990s, the social sciences were also said to be in a crisis. Funding agencies threatened to cut budgets because they were frustrated with the ongoing production of small and inconsistent effect sizes (Hunter & Schmidt, 1996). In response, methodologists introduced several artefact corrections, namely corrections for range restriction, measurement error, and dichotomization. These corrections almost always increase the observed effect sizes (Fern & Monroe, 1996; Schmidt & Hunter, 1996).
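To illustrate why such corrections can only raise the observed effect size, consider the classical correction for attenuation due to measurement error. The sketch below uses hypothetical reliability values and is an illustration of the general principle, not the specific procedures advocated by Hunter and Schmidt.

import math

def disattenuate(r_observed, rel_x, rel_y):
    # Classical correction for attenuation: divide the observed correlation
    # by the square root of the product of the two reliabilities.
    return r_observed / math.sqrt(rel_x * rel_y)

# Reliabilities cannot exceed 1, so the corrected correlation is never
# smaller than the observed one (hypothetical values).
print(disattenuate(0.30, rel_x=0.80, rel_y=0.70))  # ~0.40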

Longitudinal trends of positive versus negative results

As in signal detection theory, measures that decrease the number of false positives will tend to increase the number of false negatives (Fiedler, Kutzner & Krueger, 2012; Finkel, Eastwick & Reis, in press). Hence, we ought to ask ourselves whether the alleged crisis reflects the true state of affairs. Have positive results really become more prevalent over the years? Furthermore, it is important to determine whether the prevalence of positive results has dropped recently, which would indicate that the methodological recommendations have been endorsed by the scientific community.

So far, research on longitudinal trends of positive versus negative results has been scarce. An exception is Fanelli (2012), who manually coded 4,656 journal papers. In his article “Negative results are disappearing from most disciplines and countries,” Fanelli found that the percentage of papers providing support for the main hypothesis had increased from 70% in 1990 to 86% in 2007. Fanelli further concluded that the increase was significantly stronger in the social sciences and in some biomedical fields than in the physical sciences. He also reported that Asian countries have been producing more positive results than the United States, which in turn produces more positive results than European countries. It is worth noting that, according to Fanelli (2012), negative results are disappearing relative to the number of papers that tested a hypothesis. As clarified by Fanelli (2014): “the absolute number of negative results in Fanelli (2012) did not show a decline” (italics added).

Fanelli’s (2012) paper has some limitations. First, although his sample size is impressive (considering that the coding was done manually), the statistical power does not seem high enough for assessing differences in growth between disciplines or countries. We extracted and re-analyzed the data shown in Fanelli’s figures and found that the 95% confidence intervals (CIs) of the regression slopes overlap between disciplines: a 1.44%/year increase in positive results for the social sciences (95% CI [0.99, 1.90]), a 0.92%/year increase for the biological sciences (95% CI [0.53, 1.32]), and a 0.65%/year increase for the physical sciences (95% CI [0.19, 1.11]); see also Fig. 1, in which the longitudinal trends in the three scientific disciplines are presented together to assist comparison. Fanelli (2012, see also 2010) claimed support for a hierarchy of sciences with the physical sciences at the top and the social sciences at the bottom, and positive results increasing toward the hierarchy’s lower end: “The average frequency of positive results was significantly higher when moving from the physical, to the biological to the social sciences, and in applied versus pure disciplines…, all of which confirms previous findings (Fanelli, 2010)” (Fanelli, 2012). Although the results shown in Fig. 1 are consistent with this ordinal relationship, the average percentages of positive results over the period 1991–2007 are highly similar for the three disciplines (78.3% in the physical sciences [N = 1,044], 80.1% in the biological sciences [N = 2,811], and 81.5% in the social sciences [N = 801]). Fanelli’s alleged differences between Europe, Asia, and the United States are not statistically strong either. For example, Fanelli (2012) reported p-values of 0.017 and 0.048 for paired comparisons between the three world regions, based on the results of a stepwise regression procedure that included all main effects and interactions.

A second limitation is that Fanelli’s (2012) assessment of positive/negative results was based on the reading of abstracts or full texts by the author himself. This approach could have introduced bias, because Fanelli’s coding was not blind to the scientific discipline that the papers belonged to. The coding order was not randomized either: the coding was first done for papers published between 2000 and 2007, and only subsequently for papers published between 1990 and 1999.
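For transparency, the slope estimates and 95% CIs reported above can be obtained with an ordinary least-squares fit of the yearly percentage of positive results on publication year. The sketch below illustrates the procedure with hypothetical percentages; the values are placeholders, not the data graphically extracted from Fanelli’s figures.

import numpy as np
from scipy import stats

# Hypothetical yearly percentages of positive results (placeholders only).
years = np.arange(1991, 2008)
pct_positive = np.array([72, 74, 73, 76, 75, 77, 78, 77, 79,
                         80, 81, 80, 82, 83, 84, 85, 86], dtype=float)

# Simple linear regression of percentage on publication year.
fit = stats.linregress(years, pct_positive)

# 95% confidence interval for the slope (in %/year).
t_crit = stats.t.ppf(0.975, df=len(years) - 2)
lower = fit.slope - t_crit * fit.stderr
upper = fit.slope + t_crit * fit.stderr
print(f"slope = {fit.slope:.2f} %/year, 95% CI [{lower:.2f}, {upper:.2f}]")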

Figure 1: Number of papers reporting a positive result divided by the total number of papers examined (i.e., papers reporting a positive result + papers reporting a negative result) per publication year, for three scientific disciplines. The figure was created by graphically extracting the data shown in Fanelli’s (2012) figures. The dashed lines represent the results of a simple linear regression analysis.

Pautasso (2010) also studied longitudinal trends of positive versus negative research findings. In contrast to Fanelli’s (2012) manual coding, Pautasso searched automatically for the phrases “significant difference” versus “no significant difference” (and variants thereof) in the titles and abstracts of papers in the (Social) Science Citation Index for 1991–2008, and in CAB Abstracts and Medline for 1970–2008. Pautasso (2010) found that the percentage of both positive and negative results with respect to all papers had increased over time (cf. Fanelli, 2012, who focused on the percentage of papers reporting a positive result with respect to the selected papers that tested a hypothesis, i.e., the sum of papers reporting either positive or negative results). Pautasso (2010) further showed that the positive results had increased more rapidly than the negative results, thereby yielding a “worsening file drawer problem,” consistent with Fanelli’s (2012) observation that negative results are disappearing. However, some of Pautasso’s (2010) conclusions do not seem in line with Fanelli’s (2010) hierarchy of sciences. For example, Pautasso found that the worsening file-drawer problem was not apparent for papers retrieved with the keyword “psychol*”, whereas psychology lies near the bottom of the hierarchy proposed by Fanelli (2010).
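As a rough illustration of such phrase-based counting (a simplified sketch, not Pautasso’s actual database queries), the share of negative results in a set of abstracts could be estimated as follows; the example abstracts are hypothetical.

import re

abstracts = [
    "We found a significant difference between the two groups.",
    "There was no significant difference in reaction times.",
    "The analysis revealed significant differences across conditions.",
]

# Classify an abstract as negative if it contains "no significant difference",
# and as positive if it mentions a significant difference without that negation.
negative = sum(bool(re.search(r"no significant difference", a, re.I)) for a in abstracts)
positive = sum(bool(re.search(r"significant difference", a, re.I))
               and not re.search(r"no significant difference", a, re.I)
               for a in abstracts)

print(positive, negative, negative / (positive + negative))  # 2 1 0.333...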

Leggett et al. (2013) manually collected 3,701 p-values between 0.01 and 0.10 that were reported in all articles published in 1965 and in 2005 in two psychological journals, Journal of Experimental Psychology: General (JEP) and Journal of Personality and Social Psychology (JPSP). The authors found a distinctive peak of p-values just below 0.05. Moreover, the overrepresentation of p-values just below the threshold of significance was more pronounced in 2005 than in 1965. For example, for JPSP in 2005, there were 3.27 (i.e., 108/33) times as many p-values in the range 0.04875 < p ≤ 0.05000 as in the range 0.04750 < p ≤ 0.04875, while for 1965 the corresponding ratio was 1.40 (i.e., 28/20; data extracted from Leggett et al.’s figures). In addition, Leggett et al. found that in 2005, 42% of the p-values reported as 0.05 were erroneously rounded-down values, as compared to 19% in 1965. According to the authors, these longitudinal trends could be caused by software-driven statistical analysis: “Modern statistical software now allows for simple and instantaneous calculations, permitting researchers to monitor their data continuously while collecting it. The ease with which data can be analysed may have rendered practices such as ‘optional stopping’ (Wagenmakers, 2007), selective exclusion of outliers, or the use of post hoc covariates (Sackett, 1979) easier to engage in.”
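The just-below-threshold ratios quoted above follow directly from the bin counts. A minimal sketch of the computation is given below; the bin width of 0.00125 matches the ranges mentioned above, and the example counts are those extracted from Leggett et al.’s figures.

def last_bin_ratio(p_values, width=0.00125):
    # Count p-values in (0.05 - width, 0.05] and in the adjacent lower bin,
    # and return the ratio of the two counts.
    last_bin = sum(1 for p in p_values if 0.05 - width < p <= 0.05)
    previous_bin = sum(1 for p in p_values if 0.05 - 2 * width < p <= 0.05 - width)
    return last_bin / previous_bin

# With the counts extracted from Leggett et al.'s figures (JPSP, 2005):
print(108 / 33)  # ~3.27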

Jager & Leek (2014) used a parsing algorithm to automatically collect p-values from the abstracts of five medical journals (The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology) between 2000 and 2010 and reported that between 75% and 80% of the 5,322 p-values they collected were lower than the significance threshold of 0.05. These authors found that the distribution of reported p-values has not changed over those years.

All four aforementioned longitudinal analyses require updating, as they cover periods up to 2007 (Fanelli, 2012), 2008 (Pautasso, 2010), 2005 (Leggett et al., 2013), and 2010 (Jager & Leek, 2014). Considering the alleged crisis (Wagenmakers, 2007) and the exponential growth of scientific literature (Larsen & Von Ins, 2010), a modern replication of these works seems warranted. Replication is also needed because Fanelli’s (2012) work has received ample citations and attention from the popular press (cf. Yong, 2012, stating in Nature that psychology and psychiatry are the “worst offenders”, based on Fanelli, 2012).