2.1 The Bayes factor

To evaluate replication success we will make use of Bayes factors [15, 16]. The Bayes factor (B) is a tool from Bayesian statistics that expresses how much a data set shifts the balance of evidence from one hypothesis (e.g., the null hypothesis H₀) to another (e.g., the alternative hypothesis H₁). Bayes factors require researchers to explicitly define the models under comparison.

In this report we compare the null hypothesis H₀ of no difference against an alternative hypothesis H₁ with a potentially nonzero effect size. Our prior expectation regarding the effect size δ under H₁ is represented by a normal distribution centered on zero with variance equal to 1 (this is a unit information prior, which carries a weight equivalent to approximately one observation [17]).
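Under these conjugate-normal assumptions the Bayes factor has a closed form, which the sketch below illustrates. It is a minimal illustration, not the paper's analysis code; the effect estimates and standard errors in the example calls are hypothetical values chosen for demonstration. Given an observed standardized effect d with standard error se, the marginal likelihood under H₀ is N(d; 0, se²), and integrating the N(0, 1) unit information prior out of the likelihood gives N(d; 0, 1 + se²) under H₁:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def bayes_factor_01(d, se):
    """BF01 for H0: delta = 0 versus H1: delta ~ N(0, 1),
    given an observed standardized effect d with standard error se."""
    m0 = normal_pdf(d, 0.0, se ** 2)        # marginal likelihood under H0
    m1 = normal_pdf(d, 0.0, 1.0 + se ** 2)  # marginal likelihood under H1
    return m0 / m1

# Hypothetical inputs: a near-zero observed effect yields BF01 > 1
# (evidence for H0), while a large observed effect yields BF01 < 1.
print(bayes_factor_01(0.05, 0.15))
print(bayes_factor_01(0.60, 0.15))
```

BF01 values above 1 favor the null hypothesis; values below 1 favor the alternative.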

Other analysts could reasonably choose different prior distributions when assessing these data, and it is possible they would come to different conclusions. For example, in the case of a replication study specifically, a reasonable choice for the prior distribution of δ under H₁ is the posterior distribution of the originally reported effects [18]. Using the original study’s posterior as the replication’s prior asks the question, “Does the result from the replication study fit better with predictions made by a null effect or by the originally reported effect?” A prior such as this would lend itself to more extreme values of the Bayes factor because the two hypotheses make very different predictions: the null hypothesis predicts replication effect sizes close to zero, whereas the original studies’ posterior distributions will typically be centered on relatively large effect sizes and hence predict large replication effect sizes. As such, Bayes factors for replications that find small-to-medium effect sizes will often favor H₀ (δ = 0) over the alternative model that uses the sequential prior, because the replication result fits the predictions made by the original posterior distribution poorly. The same small-to-medium effects will yield less forceful evidence in favor of H₀ over the alternative model using the unit information prior that we apply in this analysis.
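The contrast between the two priors can be sketched numerically. The example below is a minimal conjugate-normal illustration with hypothetical numbers (the effect estimates, standard errors, and the N(0, 1) prior placed on the original study are illustrative assumptions, not values from the studies analyzed here). It forms the sequential prior as the original study's posterior and compares the resulting BF01 for a small replication effect against the unit-information-prior BF01:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def posterior(d_orig, se_orig, prior_mean=0.0, prior_var=1.0):
    """Conjugate-normal posterior for delta after observing the original study."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / se_orig ** 2)
    post_mean = post_var * (prior_mean / prior_var + d_orig / se_orig ** 2)
    return post_mean, post_var

def bf01(d_rep, se_rep, prior_mean, prior_var):
    """BF01: H0 (delta = 0) vs H1 with delta ~ N(prior_mean, prior_var)."""
    m0 = normal_pdf(d_rep, 0.0, se_rep ** 2)
    m1 = normal_pdf(d_rep, prior_mean, prior_var + se_rep ** 2)
    return m0 / m1

# Hypothetical numbers: a large original effect, a small replication effect.
d_orig, se_orig = 0.8, 0.2
d_rep, se_rep = 0.1, 0.15

post_mean, post_var = posterior(d_orig, se_orig)

bf_unit = bf01(d_rep, se_rep, 0.0, 1.0)            # unit information prior
bf_seq = bf01(d_rep, se_rep, post_mean, post_var)  # sequential prior

# The sequential prior predicts large replication effects, so the small
# replication result favors H0 far more strongly under it.
print(bf_unit, bf_seq)
```

Running this shows both Bayes factors favoring H₀, but the sequential-prior value is far more extreme, matching the intuition above.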

There are two main reasons why, in the present paper, we choose to use the unit information prior over this sequential prior. First, our goal is not to evaluate how well empirical results reproduce, but rather to see how the amount of evidence gathered in an original study compares to that found in an independent replication attempt. This question is uniquely addressed by computing Bayes factors on the two data sets using identical priors. Compared to the sequential prior, the unit information prior we have chosen for our analysis is somewhat conservative, meaning that it requires more evidence before strongly favoring H₀ in a replication study. Indeed, results presented in a blog post by the first author [19] suggest that when a sequential prior is used, approximately 20% of replications show strong evidence favoring H₀, as opposed to no replications strongly favoring H₀ with the unit information prior used in this report. Of course, it is to be expected that different analysts obtain different answers with different priors, because they are asking different questions (as Sir Harold Jeffreys [20] famously quipped: “It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude,” p. vi).

A second reason we do not use the sequential prior in this report is that it does not take into account publication bias. Assuming that publication bias has a greater effect on the original studies than on the (pre-registered, certain to be published regardless of outcome) replications, the observed effect sizes in original and replication studies are not expected to be equal. Using the original posterior distribution as a prior in the replication study would penalize bias in the original result; since the replication attempts will nearly always show smaller effect sizes than the biased originals, it will be more common to ‘fail to replicate’ these original findings (by accumulating evidence in favor of H₀ in the replication). However, here we are interested in evaluating the evidential support for the effects in the replication, rather than using them to quantify the effect of publication bias. In other words, we are interested in answering the following question: If we treat the two results as independent, do they provide similar degrees of evidence?