Abstract A focus on novel, confirmatory, and statistically significant results leads to substantial bias in the scientific literature. One type of bias, known as “p-hacking,” occurs when researchers collect or select data or statistical analyses until nonsignificant results become significant. Here, we use text-mining to demonstrate that p-hacking is widespread throughout science. We then illustrate how one can test for p-hacking when performing a meta-analysis and show that, while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses.

Citation: Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The Extent and Consequences of P-Hacking in Science. PLoS Biol 13(3): e1002106. https://doi.org/10.1371/journal.pbio.1002106 Published: March 13, 2015 Copyright: © 2015 Head et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: Funding for this research was provided by Australian Research Council Grants awarded to MDJ, RL, and LH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist. Abbreviations: NHST, null hypothesis significance testing

The Consequences of P-Hacking for Meta-analyses Meta-analysis is an excellent method for systematically synthesizing the literature and quantifying an effect or relationship by averaging effect sizes from multiple studies after weighting each one by its reliability [33,51]. However, meta-analyses are only as good as the data they use, and a recent study estimated that up to 37% of meta-analyses of clinical trials reporting a significant mean effect size represent false positives [34]. Tests for evidential value and p-hacking can readily be used to detect biases in datasets used in meta-analyses. We encourage researchers conducting meta-analyses to report the p-value associated with each effect size (which is not currently standard practice) and then to test for evidential value and p-hacking. For a recent example of this practice, see [52]. To demonstrate this procedure, we obtained p-values from studies subjected to meta-analyses by evolutionary biologists studying sexual selection [53–61] (see S1 Text). When we conducted our own meta-analysis of all the data used in these meta-analyses, there was clear evidential value for claims that the effect sizes are nonzero (binomial glm: estimated proportion of p-values in the upper bin (0.025 ≤ p < 0.05) (lower CI, upper CI) = 0.202 (0.179, 0.228), p < 0.001, n = 12 datasets). We then examined each dataset separately and found statistically significant evidential value for 9 of the 12 p-curves (Table 3). The three p-curves that did not show evidential value had the three lowest sample sizes, so low statistical power to detect evidential value may explain the lack of significance. Again, it is worth noting that evidential value for well-studied phenomena is not a given (see a real-world example in [62]).
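The evidential-value test described above compares how significant p-values split between a lower bin (p < 0.025) and an upper bin (0.025 ≤ p < 0.05): under a null of no true effect, p-values are uniform and fall in either bin with equal probability, whereas a true effect produces right skew (more mass in the lower bin). A minimal Python sketch of this logic (an assumed illustration, not the authors' code; the function names and the example p-values are hypothetical):

```python
# Assumed sketch of an evidential-value test on a p-curve (not the authors' code).
from math import comb

def binomial_sf(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided binomial probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def evidential_value_test(p_values):
    """Do significant p-values pile up below 0.025 (right skew) more than chance?"""
    sig = [p for p in p_values if p < 0.05]          # keep only significant results
    lower = sum(1 for p in sig if p < 0.025)         # count of lower-bin p-values
    # One-sided test against the uniform null (each bin equally likely)
    return lower / len(sig), binomial_sf(lower, len(sig))

# Hypothetical p-values for illustration only
pvals = [0.001, 0.003, 0.004, 0.010, 0.012, 0.020, 0.031, 0.046, 0.2, 0.6]
prop_lower, p = evidential_value_test(pvals)
# Here 6 of 8 significant p-values fall below 0.025, a right-skewed p-curve.
```

A small proportion in the upper bin (as in the 0.202 reported above) is the signature of genuine evidential value.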


Table 3. Tests for evidential value and p-hacking for published meta-analyses. https://doi.org/10.1371/journal.pbio.1002106.t003

When considering evidence for p-hacking, we found that when we included misreported p-values (those given as p < 0.05 that were actually larger; a total of 16 cases—see S1 Text), there were more p-values in the upper than the lower bin for 7 of the 12 p-curves (Table 3). This bias was significant in one dataset (Fig. 4), which was also the one with the largest sample size. However, the evidence for p-hacking disappeared when we excluded misreported p-values from our analyses (Table 3). One could argue that including misreported p-values in the upper bin of our binomial test biases our results toward detecting p-hacking, but reporting nonsignificant results as “p<0.05” is a component of p-hacking that should not be ignored. Indeed, Leggett et al. [45] also found considerable misreporting of p-values around the 0.05 threshold. They noted that nonsignificant p-values were more likely to be misreported as significant than the reverse, and that this “error” has become more common in recent years.


Fig 4. The distribution of p-values associated with the meta-analysis conducted by Jiang et al. (2013). The p-curve shows evidence for evidential value (strong right skew) and p-hacking (rise in p-values just below 0.05). https://doi.org/10.1371/journal.pbio.1002106.g004

More importantly, when misreported p-values were included in our analysis, we found significant p-hacking in a meta-analysis of the p-curves of the 12 meta-analyses (binomial glm: estimated proportion of p-values in the upper bin (0.045 < p < 0.05) (lower CI) = 0.615 (0.513), p = 0.033; excluding misreported p-values: 0.489 (0.375), p = 0.443). Although questions subjected to meta-analysis might not be a representative sample of all research questions asked by scientists, our results indicate that studies on questions identified by researchers as important enough to warrant a meta-analysis tend to be p-hacked. Whether this influences the general conclusions of a meta-analysis depends on both the extent of p-hacking and the strength of the true effect. For instance, we found a statistically significant indication of p-hacking in only one of the 12 questions examined in the published meta-analyses (Fig. 4). However, this study [56] also showed strong evidential value, and p-values in the 0.045–0.05 bin were only a small proportion of the published significant p-values. It is therefore unlikely that p-hacking would change the qualitative conclusions of this meta-analysis, although it might have inflated the estimated mean effect size.
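The p-hacking test compares the two narrow bins just below 0.05: an excess in the upper bin (0.045 < p < 0.05) relative to the adjacent lower bin suggests results nudged across the significance threshold. A minimal Python sketch of this comparison (an assumed illustration, not the authors' code: the analysis above fitted a binomial glm, whereas this sketch uses a simple exact binomial test, and the example p-values are hypothetical):

```python
# Assumed sketch of a p-hacking test on the bins just below 0.05 (not the authors' code).
from math import comb

def binomial_sf(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided binomial probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def p_hacking_test(p_values):
    """Test for an excess of p-values in (0.045, 0.05) vs. (0.04, 0.045].

    Under uniformity each bin is equally likely; a true effect makes the upper
    bin *less* likely, so testing against 0.5 is conservative for p-hacking.
    """
    upper = sum(1 for p in p_values if 0.045 < p < 0.05)
    lower = sum(1 for p in p_values if 0.04 < p <= 0.045)
    return upper, lower, binomial_sf(upper, upper + lower)

# Hypothetical p-curve with a hump just below 0.05 (illustration only)
pvals = [0.041, 0.043, 0.046, 0.047, 0.048, 0.049, 0.0495]
upper, lower, p = p_hacking_test(pvals)
# 5 of the 7 near-threshold p-values sit in the upper bin.
```

Applied per dataset, this is the per-p-curve test reported in Table 3; pooling the bin counts across datasets corresponds to the meta-analytic comparison above.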
In general, meta-analyses might be robust to inflated effect sizes that result from p-hacking, because: 1) all else being equal, the studies most susceptible to p-hacking are those with small sample sizes (because low statistical power means less chance of a significant result), and these are given less weight in a meta-analysis; 2) at least in some fields (e.g., ecology and evolution), meta-analyses often use data that are not directly related to the primary focus of the original paper, and the p-values associated with such secondary questions are less likely to be p-hacked. One way to check how sensitive estimates of effect sizes are to p-hacking would be to randomly remove the appropriate number of studies that contribute to a hump in the p-curve just below 0.05. Alternatively, meta-analysts could estimate effect sizes using p-curves (i.e., using only the significant p-values they find), a method that has been proposed to account for publication biases and to offer a conservative estimate of the true effect when there is p-hacking [62,63]. Development of p-curve methods is ongoing, and we look forward to further tests of their ability to correct for the file-drawer effect, p-hacking, and other forms of publication bias, given that real-world data are likely to violate some of the assumptions in the available simulations of their effectiveness.
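The first sensitivity check suggested above could be sketched as follows (a hypothetical illustration assuming inverse-variance weighting; the function names and the example data are invented for demonstration):

```python
# Assumed sketch of a p-hacking sensitivity check for a meta-analysis:
# drop the estimated "excess" studies in the hump just below p = 0.05 and
# see how far the weighted mean effect size moves.
import random

def weighted_mean_effect(studies):
    """Inverse-variance weighted mean of (effect_size, variance) pairs."""
    weights = [1.0 / v for _, v in studies]
    return sum(w * e for w, (e, _) in zip(weights, studies)) / sum(weights)

def hump_sensitivity(studies, p_values, n_excess, seed=1):
    """Randomly drop n_excess studies with 0.045 < p < 0.05 and re-estimate."""
    rng = random.Random(seed)
    hump = [i for i, p in enumerate(p_values) if 0.045 < p < 0.05]
    drop = set(rng.sample(hump, min(n_excess, len(hump))))
    kept = [s for i, s in enumerate(studies) if i not in drop]
    return weighted_mean_effect(kept)

# Hypothetical studies: (effect size, variance), with matching p-values
studies = [(0.30, 0.01), (0.25, 0.02), (0.40, 0.05), (0.35, 0.04), (0.10, 0.02)]
pvals   = [0.002,        0.031,        0.048,        0.049,        0.300]
full = weighted_mean_effect(studies)
trimmed = hump_sensitivity(studies, pvals, n_excess=2)
```

If `trimmed` barely differs from `full`, the meta-analytic conclusion is insensitive to the plausible amount of p-hacking; note that the hump studies here carry the largest variances and hence the smallest weights, illustrating point 1) above.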

Summary and Conclusions Our study provides two lines of empirical evidence that p-hacking is widespread in the scientific literature. Our text-mining approach draws on a very large dataset of p-values from many disciplines and questions, while our meta-analysis approach uses p-values concerning a few specific hypotheses. Both approaches yielded similar results: there is evidential value for claims that the mean effect sizes for key study questions are nonzero (the conclusions researchers draw from significant study findings), but the estimated mean effect sizes have probably been inflated by p-hacking. Eliminating p-hacking entirely is unlikely while career advancement is assessed by publication output and publication decisions are affected by the p-value or other measures of statistical support for relationships. Even so, there are a number of steps that the research community and scientific publishers can take to decrease the occurrence of p-hacking (see Box 3).

Box 3. Recommendations The key to decreasing p-hacking is better education of researchers, because many practices that lead to p-hacking are still deemed acceptable. John et al. [16] measured the prevalence of questionable research practices in psychology. They asked survey participants whether they had ever engaged in a set of questionable research practices and, if so, whether they thought their actions were defensible on a scale of 0–2 (0 = no, 1 = possibly, 2 = yes). Over 50% of participants admitted to “failing to report all of a study’s dependent measures” and “deciding whether to collect more data after looking to see whether the results were significant,” and these practices received a mean defensibility rating greater than 1.5. This indicates that many researchers p-hack but do not appreciate the extent to which this is a form of scientific misconduct.
Amazingly, some animal ethics boards even encourage or mandate the termination of research if a significant result is obtained during the study, which is a particularly egregious form of p-hacking (Anonymous reviewer, personal communication).

What can researchers do?

Clearly label research as prespecified (i.e., designed to answer a specific question, where details of the methods and analyses can be fully reported prior to data collection) or exploratory (i.e., involving exploration of data that look intriguing, where the methods and analyses used are often post hoc [13]), so that readers can treat results with appropriate caution. Results from prespecified studies offer far more convincing evidence than those from exploratory research [2].

Adhere to common analysis standards [2]; measure only response variables that are known (or predicted) to be important; and use sufficient sample sizes.

Perform data analysis blind wherever possible. This approach makes it difficult to p-hack for specific results.

Place greater emphasis on the quality of research methods and data collection, rather than the significance or novelty of the subsequent findings, when reviewing or assessing research. Ideally, methods should be assessed independently of results [13,44].

What can journals do?

Provide clear and detailed guidelines for the full reporting of data analyses and results: for instance, state that it is necessary to report effect sizes whether small or large, to report all p-values to three decimal places [27,64], to report sample sizes, and, most importantly, to be explicit about the entire analysis process (not just the final tests used to generate reported p-values). This will reduce p-hacking and aid the collection of data for meta-analyses and text-mining studies.

Encourage and/or provide platforms for method prespecification [13,65]. Although methods and results in publications do not always match their prespecified protocols [5,66], prespecification allows readers to assess the risk of p-hacking and adjust their confidence in the reported outcomes accordingly.

Encourage and/or provide platforms for open access to raw data. While access to raw data does not prevent p-hacking, it does make researchers more accountable for marginal results and allows readers to reanalyze data to check the robustness of results.

Supporting Information S1 Text. Details of how text-mined data and data from meta-analyses were collected and analysed. https://doi.org/10.1371/journal.pbio.1002106.s001 (DOCX)

Acknowledgments We are grateful to members of the Jennions lab for comments and discussion on various versions of the manuscript.