Here we study whether researchers' willingness to share data for reanalysis is associated with the strength of the evidence (defined as the statistical evidence against the null hypothesis of no effect) and the quality of the reporting of statistical results (defined in terms of the prevalence of inconsistencies in reported statistical results). To this end, we followed up on Wicherts et al.'s requests for data [12] by comparing statistical results in papers from which data were or were not shared, and by checking for errors in the reporting of p-values in both types of papers.

Statistical analyses of research data are quite error prone [1], [2], [3], accounts of statistical results may be inaccurate [4], and decisions that researchers make during the analytical phase of a study may lean towards the goal of achieving a preferred (significant) result [5], [6], [7], [8]. For these and other (ethical) reasons [9], many scientific journals like PLoS ONE [10] and professional organizations such as the American Psychological Association (APA) [11] have clear policies concerning the sharing of data after research results are published. For instance, upon acceptance for publication of a paper in one of the over 50 peer-reviewed journals published by the APA, authors sign a contract stating that they will make their data available to peers who wish to reanalyze them to verify the substantive claims put forth in the paper. Nonetheless, the replication of statistical analyses in published psychological research is hampered by psychologists' pervasive reluctance to share their raw data [1], [12]. In a large-scale study, Wicherts et al. [12] found that 73% of psychologists publishing in four top APA journals defied APA guidelines by not sharing their data for reanalysis. The unwillingness to share data from published research has been documented in a number of fields [13], [14], [15], [16], [17], [18], [19], [20] and is often ascribed in part to authors' fear that independent reanalysis will expose statistical or analytical errors in their work [21] and will produce conclusions that differ from theirs [22]. However, no published research to date has addressed whether this rather bleak scenario has a bearing on reality.

Methods

In the summer of 2005, Wicherts and colleagues [12] contacted the corresponding authors of 141 papers that were published in the second half of 2004 in one of four high-ranked journals published by the APA: Journal of Personality and Social Psychology (JPSP), Developmental Psychology (DP), Journal of Consulting and Clinical Psychology (JCCP), and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC). The data were requested to determine the effects of outliers on statistical outcomes (see Text S1 for details). Although all corresponding authors had signed a statement that they would share their data for such verification purposes [11], most authors failed to do so. In the current study, we related the willingness to share data from 49 papers published in JPSP or JEP:LMC to two relevant characteristics of the statistical outcomes reported in the papers, namely the internal consistency of the statistical results and the distribution of p-values reported as significant (p<.05). We restricted our attention to JPSP and JEP:LMC because (1) authors in these journals were more willing to share data than authors in the other journals from which Wicherts et al. requested data, (2) no corresponding authors in these two journals declined to share data because the data were part of an ongoing project or because of proprietary rights or ethical considerations, and (3) studies in these two journals were fairly homogeneous in terms of analysis and design (mostly lab experiments). We also restricted our attention to results from null-hypothesis significance testing (NHST) [23]. This procedure is not without its critics [24], [25], but it continues to be used extensively in psychology [26] and related fields. NHST provides p-values that, if smaller than alpha = .05, are considered by many researchers [27], [28] and reviewers [29] to lend support to the hypothesized effects. Psychological research data are often amenable to alternative methods of analysis [6], [22], [30] that may affect what can be concluded from them (at least within the rules of NHST). The specifics of the analysis will typically matter more when statistical results are nearly significant at the alpha = .05 level. Put differently, smaller p-values provide stronger evidence against the null hypothesis of no effect [31]. The strength of the evidence based on Bayes factors from Bayesian t-tests has been found to be strongly inversely related to the p-values of traditional t-tests [32]. If the strength of the evidence so defined plays a part in the willingness to share data, then p-values in papers from which data were not shared are expected to lie closer to .05. Because reported p-values are often inconsistent with the given test statistics and degrees of freedom [33], we also checked for errors in the reporting of statistical results.
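To make the inverse relation between p-values and Bayesian evidence concrete, the sketch below computes a default Bayesian t-test Bayes factor (the JZS formulation popularized for one-sample t-tests) alongside the classical p-value for a few t statistics. This is an illustration only: the function name jzs_bf10, the choice of prior scale r, and the assumption that this formulation matches the one in [32] are ours, not part of the study's procedure.

```python
# Illustrative sketch (not the authors' code): JZS Bayes factor for a one-sample
# t test, with the effect-size prior written as delta | g ~ N(0, g) and
# g ~ Inverse-Gamma(1/2, r^2/2), i.e., a Cauchy(0, r) prior on the effect size.
import numpy as np
from scipy import integrate, stats

def jzs_bf10(t, n, r=1.0):
    """Bayes factor BF10 for a one-sample t test (t statistic, sample size n)."""
    nu = n - 1
    # Marginal likelihood under H0 (common constants cancel in the ratio).
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    def integrand(g):
        prior = r / np.sqrt(2 * np.pi) * g ** (-1.5) * np.exp(-r**2 / (2 * g))
        like = (1 + n * g) ** (-0.5) * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
        return like * prior
    m1, _ = integrate.quad(integrand, 0, np.inf)   # marginal likelihood under H1
    return m1 / m0

# p-values just below .05 tend to correspond to modest Bayes factors,
# whereas larger t statistics (smaller p) yield rapidly growing BF10.
for t in (2.1, 3.0, 4.5):
    p = 2 * stats.t.sf(t, df=29)   # two-tailed p for n = 30
    print(f"t = {t:.1f}, p = {p:.4f}, BF10 = {jzs_bf10(t, 30):.2f}")
```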

Data Retrieval

We extracted from the papers all the t, F, and χ2 test statistics associated with NHST, the given degrees of freedom (e.g., F(2,24) = 3.41), the sidedness of tests (1- or 2-tailed), and the reported exact p-value (e.g., p = .03) or the reported level of significance (e.g., p<.05). We considered these tests because they are the most common test statistics used in NHST in psychology. Although it was infeasible to determine for each test whether it was in line with the researchers' predictions, NHST is typically used for the purpose of rejecting the null hypothesis. We did not consider test statistics that were not associated with NHST (e.g., model fitting or Bayesian analyses). We only included test results that were uniquely reported, complete (i.e., test statistic, degrees of freedom, and p-value were reported), and that were reported as being significant (i.e., p<.05) in the main text or in tables in the results sections. T-tests were considered 2-tailed, unless stated otherwise. The exact p-values were computed on the basis of the given test statistic and DF(s) in Microsoft Excel 2008 for Mac, version 12.1.0. A further four papers published in the two journals from which data were requested in 2005 were not included in the follow-up because they did not involve NHST or did not contain significant results on the basis of t, F, or χ2 tests. Five undergraduates, who were unaware from which papers data were shared, also independently retrieved a total of 495 statistics and DFs. We compared these 495 statistics to ours and determined that the accuracy rate in our own data was 99.4%. The three minor errors in our data retrieval were corrected but proved trivial.
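The recomputation step can be sketched as follows. The original computations were done in Excel, so the Python function below and its name (recomputed_p) are ours, for exposition only, and assume the conventions stated above (t-tests two-tailed unless noted; F and χ2 tests taken as inherently one-sided).

```python
# A minimal sketch of recomputing p-values from reported statistics and df
# (the study itself used Microsoft Excel 2008, not Python).
from scipy import stats

def recomputed_p(test, statistic, df1, df2=None, tails=2):
    """Recompute the p-value implied by a reported test statistic and df."""
    if test == "t":
        return tails * stats.t.sf(abs(statistic), df1)  # 2-tailed unless stated otherwise
    if test == "F":
        return stats.f.sf(statistic, df1, df2)          # F tests are one-sided by nature
    if test == "chi2":
        return stats.chi2.sf(statistic, df1)            # as are chi-square tests
    raise ValueError(f"unsupported test: {test}")

# Example from the text: F(2,24) = 3.41 recomputes to roughly p = .0497,
# i.e., just below the alpha = .05 threshold.
print(recomputed_p("F", 3.41, 2, 24))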

Detecting Reporting Errors

Inconsistencies between reported p-values (or ranges) and p-values recalculated from the retrieved statistics were detected automatically in Excel as follows. The recomputed p-value was first rounded to the same number of digits as was used in the reported p-value (or range). Subsequently, an IF-statement automatically checked for consistency. Next, we determined by hand whether apparent reporting errors were due to errors in our extraction (none were found) or to rounding. For example, a test result such as “t(15) = 2.3; p = 0.034” could have arisen from a test statistic ranging from 2.25 to 2.35. Consequently, the correct p-value could range from .033 to .040, and so the reported value was not seen as inconsistent, although the recomputed p-value is .0362. In the analyses of the p-value distributions, we used the nearest next decimal that attained consistency for these correctly rounded cases (i.e., 2.34 in the example), but used the p-value based on the reported test statistic in other cases. We checked whether over-reported p-values had been adjusted upwards via procedures like the Bonferroni or Huynh-Feldt correction, but did not use these corrections in analyzing p-value distributions. As some of the inconsistencies may have arisen from the use of one-sided testing, we additionally searched the text for explicit mentions of one-sided tests. In one instance, an F-test result was reported explicitly as a one-sided test, but because this result was equivalent to a one-sided t-test we did not consider it erroneous (as suggested by an independent reviewer). As a final check, the three authors independently verified all 49 inconsistencies on the basis of the papers. All documented errors are available upon request. The use of this method previously revealed quite high error rates in the reporting of p-values in papers published in Nature, the British Medical Journal [4], and two psychiatry journals [34]. In a recent study covering a fairly representative sample of 281 psychology papers [33], roughly 50% of the papers that involved NHST were found to include at least one such reporting error. As discussed elsewhere [33], likely causes include (1) errors in the retrieval and copying of test statistics, degrees of freedom, and/or p-values (e.g., reporting the total DF instead of the error DF of an F test), (2) incorrect rounding of the last decimal (e.g., p = .059 reported as p = .05), (3) the use of one-tailed tests without mentioning this, (4) incorrect use of tests (e.g., dividing the p-value of an F or χ2 test by two to report a one-sided p-value, whereas the F or χ2 test is already one-sided), (5) confusion of = with < (e.g., p = .012 reported as p<.01), and (6) copy-editing errors (e.g., a failure to alter relevant numbers after the use of “copy-paste” in writing the paper). Although many inconsistencies between reported and recomputed p-values in Bakker and Wicherts' study were minor, roughly 15% of the papers contained at least one result that was presented as being statistically significant (p<.05), but that proved, upon recalculation, not to be significant (p>.05). Such serious errors in the reporting of results increase the desirability of having the data available for reanalysis.
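For concreteness, the consistency check can be sketched as follows for an exactly reported, two-tailed t-test p-value. The original check was an Excel IF-statement supplemented by rounding checks done by hand, so the Python reconstruction below (and the function names consistent and decimals_of) is an illustrative assumption that folds both steps into one routine; reported bounds such as p<.05 would need a separate branch.

```python
# Hedged reconstruction of the consistency check described above: a reported p is
# flagged as inconsistent only if no test statistic compatible with the reported,
# rounded statistic could have produced it.
from scipy import stats

def decimals_of(x):
    """Number of decimals in a reported value, e.g. 2.3 -> 1, 0.034 -> 3."""
    s = f"{x}"
    return len(s.split(".")[1]) if "." in s else 0

def consistent(reported_t, df, reported_p):
    """Is a reported two-tailed p-value consistent with the (rounded) t statistic?"""
    # Any true statistic within half a unit of the last reported decimal rounds to
    # the reported value, e.g. a reported t = 2.3 covers 2.25 to 2.35.
    half_unit = 0.5 * 10 ** -decimals_of(reported_t)
    lo, hi = abs(reported_t) - half_unit, abs(reported_t) + half_unit
    p_hi = 2 * stats.t.sf(lo, df)   # smaller |t| gives the larger p
    p_lo = 2 * stats.t.sf(hi, df)
    # Compare at the precision of the reported p-value.
    d = decimals_of(reported_p)
    return round(p_lo, d) <= reported_p <= round(p_hi, d)

# Example from the text: t(15) = 2.3 with a reported p = .034 is not flagged,
# because the correct p-value could range from about .033 to .040.
print(consistent(2.3, 15, 0.034))   # True
```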