We examined the percentage of p values (.05 < p ≤ .10) reported as marginally significant in 44,200 articles across nine psychology disciplines, published in 70 journals belonging to the American Psychological Association between 1985 and 2016. Using regular expressions, we extracted 42,504 p values between .05 and .10. Almost 40% of p values in this range were reported as marginally significant, although there were considerable differences between disciplines. The practice is most common in organizational psychology (45.4%) and least common in clinical psychology (30.1%). Contrary to what was reported by previous researchers, our results showed no evidence of an increasing trend in any discipline; in all disciplines, the percentage of p values reported as marginally significant was decreasing or constant over time. We recommend against reporting these results as marginally significant because of the low evidential value of p values between .05 and .10.

Recent failures to reproduce findings of studies (e.g., as in the “Reproducibility Project: Psychology” by the Open Science Collaboration, 2015) have fanned the debate about the claiming of findings on the basis of their statistical significance. In their article “Redefine Statistical Significance,” Benjamin et al. (2018) argued that the standard for claiming new discoveries, p < .05, is too low and a leading cause of nonreproducibility and false-positive results, and they proposed to change the standard to p < .005. On the other hand, Lakens et al. (2018) argued that researchers should transparently report and justify their significance level, whether it is .05 or something else.

Following up on the debate on the use of significance levels in psychology, we empirically examined the extent to which studies in psychology claim a finding on the basis of a significance level that is even weaker than .05, often called marginal significance, that is, .05 < p ≤ .10. More specifically, we examined the percentage of p values between .05 and .10 that is reported in studies as marginally significant, across journals and disciplines of psychology and over time. Along the way, we also reexamined Pritschet, Powell, and Horne’s (2016) claims that marginally significant results have become more prevalent in psychology over time and that results are reported as marginally significant more frequently in social psychology than in developmental psychology. Examining the prevalence of results reported as marginally significant and reexamining the claims of Pritschet et al. is important, as it bears on differences in reproducibility across disciplines and trends over time; higher p values are generally associated with lower reproducibility and more false positives (Camerer et al., 2016; Ioannidis, 2005; Open Science Collaboration, 2015).

Pritschet et al. (2016) looked at the frequency of articles in which at least one result was reported as marginally significant or as approaching significance in articles from the journals Cognitive Psychology, Developmental Psychology, and the Journal of Personality and Social Psychology, meant to “represent three major subfields of psychology: cognitive, developmental, and social” (p. 1037), for the years 1970, 1980, 1990, 2000, and 2010. Although Pritschet et al.’s findings may be interpreted as a higher willingness of researchers over time and in social psychology to claim marginal significance in their articles, we should be careful because of the presence of confounding factors. Their outcome variable was the percentage of articles in which at least one result was reported as marginally significant. However, if an article contains more p values, the probability increases that the article contains at least one result reported as marginally significant. In devising their outcome measure, Pritschet et al. did not take into account that the number of reported p values per journal article has increased over the years or that articles in the Journal of Personality and Social Psychology, on average, contain more p values than those in (at least) Developmental Psychology (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016). In further analyses, Pritschet et al. also controlled for the number of experiments in an article, which did not affect their conclusions, but the number of experiments is only a rough and imperfect proxy for the number of p values. More generally, any factor affecting the distribution of p values and their frequency in the interval .05 to .10, such as the statistical power of research, p hacking, or merely the reporting of statistical results, will affect the percentage of articles reporting one or more results as marginally significant. 
Thus, this outcome provides limited information on researchers’ usage of the concept of marginal significance, both over time and across journals. Factors affecting the distribution of p values, however, will not affect the percentage of p values between .05 and .10 reported as marginally significant, as this percentage is conditional on the occurrence of such a p value.
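The contrast between the two outcome measures can be illustrated with a short sketch. The data here are made up for illustration, and this is not the authors' actual analysis code; it only shows why the article-level measure rises with the number of reported p values while the p-value-level measure does not.

```python
# Illustrative sketch, with made-up data, of the two outcome measures
# discussed above; not the authors' actual analysis pipeline.
articles = {
    # article id -> list of (p value, reported as marginally significant?)
    "a1": [(0.07, True), (0.08, True)],
    "a2": [(0.06, False)],
    "a3": [(0.02, False), (0.04, False)],
}

# Article-level measure (Pritschet et al.): share of articles with at
# least one result reported as marginally significant. It increases
# mechanically as articles come to contain more p values.
article_level = sum(
    any(marginal for _, marginal in results) for results in articles.values()
) / len(articles)

# p-value-level measure (this study): share of p values in (.05, .10]
# reported as marginally significant -- conditional on such a p value
# occurring, so unaffected by how many p values an article reports.
in_range = [
    marginal
    for results in articles.values()
    for p, marginal in results
    if 0.05 < p <= 0.10
]
p_level = sum(in_range) / len(in_range)

print(round(article_level, 2), round(p_level, 2))  # 0.33 0.67
```

With these toy data, only one of three articles contains a marginally significant result (33%), yet two of the three p values in the (.05, .10] range are reported as marginally significant (67%), showing how the two measures can diverge.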

Large parts of the scientific literature can be examined using automated methods, and several recent publications have successfully used automatically extracted statistics to do so (e.g., Lakens, 2015; Nuijten et al., 2016; Vermeulen et al., 2015). One of the most common automated methods uses so-called regular expressions, which search the provided article for predefined strings of text; the results are then saved to a data file for analysis. The more complex the data that need to be extracted, the more limited this method becomes. Fortunately, when p values are extracted, only three things need to be identified in the text: the p, the comparison sign, and the value itself (for an extensive treatment of the limitations of using reported p values, see Hartgerink, van Aert, Nuijten, Wicherts, & van Assen, 2016; Jager & Leek, 2014; and discussions in the first issue of Volume 15 of Biostatistics). The advantage of automated methods when examining the scientific literature is that they permit collecting large samples of data. For example, Nuijten et al. (2016), using an R package (statcheck) that extracts only complete American Psychological Association (APA)-formatted test results (t, F, etc.), collected 258,105 p values from 30,717 articles published between 1985 and 2013.
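The three components named above can be captured with a single pattern. The following is a minimal sketch of the idea, not the extraction code actually used in this study, and it deliberately ignores edge cases (e.g., "ps <", scientific notation) that a production extractor would need to handle.

```python
import re

# Minimal sketch of a regular expression capturing the three components
# described above: the "p", the comparison sign, and the reported value.
# This is an illustration, not the authors' actual extraction code.
P_VALUE_RE = re.compile(r"\bp\s*([<>=])\s*(0?\.\d+)", re.IGNORECASE)

def extract_p_values(text):
    """Return (sign, value) pairs for every p value reported in `text`."""
    return [(sign, float(value)) for sign, value in P_VALUE_RE.findall(text)]

sample = "Main effect: F(1, 38) = 3.24, p = .08; interaction: p < .001."
print(extract_p_values(sample))  # [('=', 0.08), ('<', 0.001)]
```

Each match can then be written to a data file together with metadata such as journal, year, and article identifier for later analysis.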

Using automated extraction of p values, we examined the prevalence of p values between .05 and .10 reported as marginally significant in psychology. We first partially replicated and extended Pritschet et al.’s (2016) findings by examining the prevalence of marginally significant results in two journals, the Journal of Personality and Social Psychology and Developmental Psychology. Then, we examined that prevalence between 1985 and 2016 in journals published by the APA, distinguishing nine psychology disciplines: social, developmental, cognitive, clinical, educational, experimental, forensic, health, and organizational.

Results

We present our results in two steps. First, we present results for the Journal of Personality and Social Psychology and Developmental Psychology. Here, we also considered the average number of p values between .05 and .10 reported per article and year. Second, we present the results for all included APA journals taken together and for the nine psychology disciplines previously described (see Table 1).

Journal of Personality and Social Psychology and Developmental Psychology

Our analyses confirmed that the percentage of articles with at least one result reported as marginally significant was higher in the Journal of Personality and Social Psychology than in Developmental Psychology; whereas Pritschet et al. (2016) found percentages of 39.52 (Journal of Personality and Social Psychology) and 24.29 (Developmental Psychology), we found percentages of 41.84 and 21.74, respectively (see Table 2, last column). The differences (albeit small) between their and our results are explained by the fact that we incorporated other articles and by differences in the selection and calculation of results (marginally significant results by Pritschet et al. and p values in the .05–.10 range in combination with a window of ±200 words). Following Pritschet et al., we observed an increase in the reporting of marginally significant results at the level of articles for Developmental Psychology and the Journal of Personality and Social Psychology, although the increase for Developmental Psychology was very small (estimated increase of approximately 2.5% over 30 years; see Fig. 2). For the Journal of Personality and Social Psychology, this trend was brought about by an increase in both the average number of p values between .05 and .10 per article and the percentage of p values between .05 and .10 reported as marginally significant (see Fig. 2).
For Developmental Psychology, the percentage of p values reported as marginally significant decreased over time, but this decrease was offset by a larger increase in the number of p values between .05 and .10 over time. The latter results demonstrate the importance of distinguishing results at the level of articles from those at the level of p values.

Psychology and its disciplines

Reporting p values between .05 and .10 as marginally significant was common practice in all psychology disciplines. Table 2 shows that, on average, almost 40% of p values (.05 < p ≤ .10) in the 70 examined APA journals were reported as marginally significant between 1985 and 2016. The practice was most common in organizational psychology (45.38%), social psychology (44.47%), and experimental psychology (40.65%). The fewest p values between .05 and .10 were reported as marginally significant in clinical psychology (30.08%), health psychology (31.58%), and forensic psychology (33.91%). The disciplines of educational psychology (34.69%), developmental psychology (37.72%), and cognitive psychology (39.49%) fell between these two groups. That higher percentages were consistently found for the outcome variable at the level of p values (see Table 2, penultimate column) than at the level of articles (last column) is explained by the many articles that contain p values but none in the range .05 to .10. Of the total 44,200 articles with p values, only 25,800 contained p values between .05 and .10, which thus inflates the denominator of the percentage of articles containing at least one marginally significant result. We examined the overall trend in the reporting of marginally significant results and the trends in each discipline (see Fig. 3). Across all journals, the percentage of p values reported as marginally significant decreased (b = −0.32) in the period from 1985 to 2016. For no discipline was there evidence of an increasing trend.
On the basis of the linear trend (b), the largest decreases were in forensic psychology (b = −0.92), cognitive psychology (b = −0.68), and experimental psychology (b = −0.60). Three disciplines were mostly stable over the years: social psychology (b = −0.02), organizational psychology (b = −0.09), and developmental psychology (b = −0.12). The change over time for the three remaining disciplines fell between these two groups: health psychology (b = −0.27), clinical psychology (b = −0.29), and educational psychology (b = −0.35). Note that the plots also indicate a trend toward more p values being reported in the literature.

The percentage of articles containing p values with at least one p value between .05 and .10 reported as marginally significant increased when averaged across all APA journals and for all disciplines individually, except for forensic psychology, health psychology, and organizational psychology (see Fig. 2). As demonstrated in the previous section, these trends are not straightforward to interpret, as they are also affected by trends in the frequency of p values between .05 and .10 per article. In turn, this frequency of p values is affected by trends in the reporting of p values and trends in the statistical power of psychological research over time, although there is, at most, a small increase in power over time in our data (see the Supplemental Material). Note that possible trends in p-value reporting and power do not affect the percentage of p values reported as marginally significant, as that percentage is conditional on the p value being between .05 and .10.
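The classification underlying these results (a p value in the (.05, .10] range counted as "reported as marginally significant" when an indicator expression such as "margin*" or "approach*" occurs nearby) can be sketched as follows. This is an illustration of the idea only, not the study's extraction code; the window is expressed in characters here for simplicity.

```python
import re

# Rough sketch of the indicator check: does "margin*" or "approach*"
# occur near a p value in (.05, .10]? Illustration only, not the
# authors' actual extraction code; window size is an assumption here.
P_RE = re.compile(r"\bp\s*[<>=]\s*0?(\.\d+)", re.IGNORECASE)
INDICATOR_RE = re.compile(r"margin|approach", re.IGNORECASE)

def reported_as_marginal(text, window=200):
    """Yield (p, flag) for each p in (.05, .10]: flag is True when an
    indicator expression appears within `window` characters of it."""
    for m in P_RE.finditer(text):
        p = float(m.group(1))
        if 0.05 < p <= 0.10:
            context = text[max(0, m.start() - window):m.end() + window]
            yield p, bool(INDICATOR_RE.search(context))

sample = "The interaction approached significance, p = .07."
print(list(reported_as_marginal(sample)))  # [(0.07, True)]
```

The percentage reported in the tables then corresponds to the share of flagged p values among all p values that fall in the (.05, .10] range.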

Discussion

Following up on the debate about the use of significance levels in psychology, we empirically examined the extent to which researchers have claimed a finding to be marginally significant on the basis of a p value between .05 and .10 in psychology and its disciplines between 1985 and 2016. Examining the prevalence of results reported as marginally significant is important, as it bears on differences in reproducibility across disciplines and trends over time; higher p values are generally associated with lower reproducibility and more false positives. Following Pritschet et al. (2016), we examined trends in the percentage of articles with p values reported as marginally significant and showed that these are affected by differences across disciplines in the number of p values between .05 and .10 and the development over time of this number. We also examined the prevalence of p values between .05 and .10 reported as marginally significant across time in nine psychology disciplines, which is not affected by factors influencing the distribution of p values. Interpreting p values between .05 and .10 as marginally significant appears common in psychology. Across the nine disciplines we examined, almost 40% of such values were reported as marginally significant in the period from 1985 to 2016, although the prevalence differed by discipline. We found higher percentages of p values between .05 and .10 reported as marginally significant in social psychology than in developmental and cognitive psychology, corroborating the findings of Pritschet et al. (2016), but the differences were small (up to 7%). Overall, marginally significant p values were most prevalent in organizational psychology and least prevalent in clinical psychology. A few disciplines had a stable trend, but most showed a downward trend in the percentage of p values between .05 and .10 reported as marginally significant between 1985 and 2016.
Controlling for the increasing numbers of p values across the years, we found that the positive trends reported by Pritschet et al. (2016) for cognitive psychology, developmental psychology, and social psychology disappeared. On the other hand, the Journal of Personality and Social Psychology, which Pritschet et al. used to represent social psychology, still showed a positive trend. This illustrates the problem with using a single journal to represent entire psychology disciplines. The downward trend in psychology overall may reflect increasing awareness among researchers that p values in the range of .05 to .10 represent weak evidence against the null hypothesis, or a tendency to also report p values that do not correspond to tests of the main hypotheses and are not interpreted in the main text. It may also be that percentages are decreasing because of increasingly stringent competition to publish and less leniency among editors for marginally significant results (as previously suggested by Lakens, 2015). Regardless of the reason, what matters is that results with such p values do not end up in the file drawer and are not “transformed” into significant results (Simmons, Nelson, & Simonsohn, 2011) but are reported in the literature.

We demonstrated that it is not straightforward to examine and interpret trends in the percentage of articles that report at least one p value between .05 and .10 as marginally significant, because they are affected by factors influencing the p-value distribution of results reported in articles. One can attempt to model the p-value distribution and the factors influencing it. However, because so many factors affect the p-value distribution and these models are based on strong assumptions, we believe it is impossible to draw strong conclusions about the mechanisms causing differences or trends in p-value distributions (Hartgerink et al., 2016).
We therefore recommend examining the percentage of p values between .05 and .10 that is reported as marginally significant, as it is not affected by these factors.

Our results are qualified by three issues. First, because p values of .05 tend to be reported as significant (Nuijten et al., 2016), we excluded these results, regardless of whether the sign was >, <, or =. However, a portion of p values reported as “p > .05” will also be below or equal to .10. It seems possible that researchers who report a p value between .05 and .10 as “p > .05” would also be less likely to report this result as marginally significant and would label it nonsignificant instead. If this is the case, our results may be slightly biased in favor of higher estimates. On the other hand, our second limitation leads to bias in the opposite direction. Matthew Hankins (2013) compiled a list of 508 ways in which researchers have described results as marginally significant. Of these, only 77 include the expressions “margin*” or “approach*,” our indicators of marginal significance. Although there is no telling how common the different expressions on Hankins’s list are, their existence indicates that our estimates of the prevalence of marginally significant results in psychology are likely to be underestimates, given the varied terminology available to label results that are close to significance. Third, and relatedly, our results on marginal significance are limited by our data-collection procedure; strictly speaking, our conclusions apply to the use of “margin*” and “approach*” within a window of ±200 characters of a p value between .05 and .10. We therefore cannot blindly generalize our conclusions to the overall use of marginal significance in the psychological literature.

In the end, the degree to which results reported as marginally significant are problematic depends on research design.
Questionable research practices inflate the risk of false-positive results (John, Loewenstein, & Prelec, 2012). One of a multitude of such practices is the post hoc decision to change what decision rule one uses or how strictly it is applied (Wicherts et al., 2016). Because most researchers are likely to use an implicitly predefined alpha level, later reporting results as marginally significant is an example of an implicit change in the decision rule. The severity of this practice depends on the extent to which the decision rule has been altered. Nevertheless, because p values between .05 and .10 are known to have low evidential value (Benjamin et al., 2018; Ioannidis, 2005), we recommend against reporting these results as being marginally significant.

Action Editor

Brent W. Roberts served as action editor for this article.

Author Contributions

C. H. J. Hartgerink and M. A. L. M. van Assen developed the study concept. All the authors contributed to the study design. C. H. J. Hartgerink extracted the data. A. Olsson-Collentine analyzed the data, and the analysis was checked by C. H. J. Hartgerink. All the authors interpreted the results and contributed to the writing of the manuscript, with A. Olsson-Collentine writing the first draft and M. A. L. M. van Assen and A. Olsson-Collentine writing the revision. All the authors approved the final manuscript for submission.

ORCID iDs

Anton Olsson-Collentine https://orcid.org/0000-0002-4948-0178
Chris H. J. Hartgerink https://orcid.org/0000-0003-1050-6809

Declaration of Conflicting Interests

The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.

Supplemental Material

Additional supporting information can be found at http://journals.sagepub.com/doi/suppl/10.1177/0956797619830326

Open Practices

All code and data have been made publicly available via the Open Science Framework and can be accessed at osf.io/28gxz. Materials consist of the extraction functions used to obtain the measures for the reported analyses. The design and analysis plans for this study were not preregistered. The complete Open Practices Disclosure for this article can be found at http://journals.sagepub.com/doi/suppl/10.1177/0956797619830326. This article has received the badges for Open Data and Open Materials. More information about the Open Practices badges can be found at http://www.psychologicalscience.org/publications/badges.