Summary: A team of 186 researchers conducted replications of 28 classic and contemporary findings in psychology. Overall, 14 of the 28 findings failed to replicate despite massive sample sizes, with more than 60 laboratories from around the world contributing samples to test each finding. The study examined the extent to which variability in replication success can be attributed to the study sample. If a finding replicated, it replicated in most samples with occasional variation in the magnitude of the findings. If a finding did not replicate, it failed to replicate with little variation across samples and contexts. This evidence is inconsistent with a popular explanation that failures to replicate in psychology are likely due to changes in the sample between the original and replication study. Ongoing efforts to improve research rigor such as preregistration and transparency standards may be the best opportunities to improve reproducibility.

The results of a massive replication project in psychology are published today in Advances in Methods and Practices in Psychological Science. The paper “Many Labs 2: Investigating Variation in Replicability Across Sample and Setting” represents the efforts of 186 researchers from 36 nations and territories to replicate 28 classic and contemporary psychology findings. Compared to other large replication efforts, the unique quality of this study was that each of the 28 findings was repeated in more than 60 laboratories all over the world, resulting in a median sample size of 7,157. This was about 64 times the median sample size of 112 in the original studies. This provided two important features for evaluating replicability: (1) extremely sensitive tests for whether the effect could be replicated, and (2) insight into whether the replicability of the findings varied based on the sample.

Overall, 14 of the 28 findings (50%) replicated successfully. The effect sizes of the replication studies were less than half the size of the original studies on average. With samples from all six populated continents, the study tested a popular argument that some psychology findings fail to replicate because the original and replication samples were different. Across these 28 findings, there was not much evidence that replicability was highly dependent on the sample. For studies that failed to replicate, the finding failed to replicate in almost all the samples. That is, the number of samples showing the effect did not exceed what would be expected by chance if there were no effect to detect. For studies that replicated successfully, particularly those with large effect sizes, the replications were successful in almost all the samples. There was some heterogeneity among the larger effects across samples but, for these cases, that variability seemed to indicate that the effect magnitude was larger in some samples than others, and not that the finding was present in some samples and absent in others. “We were surprised that our diversity in our samples from around the world did not result in substantial diversity in the findings that we observed,” said Rick Klein, one of the project leaders and a postdoctoral associate at the University of Grenoble Alpes in France. “Among these studies at least, if the finding was replicable, we observed it in most samples, and the magnitude of the finding only varied somewhat depending on the sample.”
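
To illustrate what "expected by chance" can mean when a finding is tested in many samples, the following is a minimal sketch and not the analysis used in the paper. It assumes 60 independent samples and a conventional significance threshold of 0.05; both numbers and the function names are illustrative assumptions.

```python
from math import comb


def expected_significant(n_samples, alpha=0.05):
    """Expected number of samples that reach significance by chance alone
    when there is no true effect (assumes independent samples)."""
    return n_samples * alpha


def prob_at_least(k, n_samples, alpha=0.05):
    """Binomial probability that at least k of n_samples reach significance
    by chance alone when there is no true effect."""
    return sum(
        comb(n_samples, i) * alpha**i * (1 - alpha) ** (n_samples - i)
        for i in range(k, n_samples + 1)
    )


if __name__ == "__main__":
    n = 60  # assumed number of samples per finding, for illustration only
    print("Expected chance hits:", expected_significant(n))
    print("P(at least 1 chance hit):", round(prob_at_least(1, n), 3))
    print("P(at least 10 chance hits):", prob_at_least(10, n))
```

Under these assumptions, a handful of significant samples out of 60 is unsurprising even when no effect exists, whereas significance in most samples would be very hard to attribute to chance.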

This paper is the latest of six major replication projects in the social and behavioral sciences published since 2014. These projects are a response to collective concern that the reproducibility of published findings may not be as robust as is assumed, particularly because of publication pressures that may lead to publication bias in which studies and findings with negative results are ignored or unpublished. Such biases could distort the evidence in the published literature, implying that the findings are stronger than the existing evidence suggests and failing to identify boundary conditions on when the findings will be observed. Across the six major replication projects, 90 of 190 findings (47%) have been replicated successfully according to each study’s primary evaluation criterion. “The cumulative evidence suggests that there is a lot of room to improve the reproducibility of findings in the social and behavioral sciences,” said Fred Hasselman, one of the project leaders and an assistant professor at Radboud University in Nijmegen.

The Many Labs 2 project addressed some of the common criticisms that skeptics have offered for why studies may fail to replicate. First, the studies had massive samples, ensuring sufficient power to detect the original findings. Second, the replication team obtained the original materials to ensure faithful replications of the original findings. Third, all 28 studies underwent formal peer review at the journal prior to conducting the studies, a process called Registered Reports. This ensured that original authors and other experts could provide critical feedback on how to improve the study design to ensure that it would be an adequate test of the original finding. Fourth, all of the studies were preregistered at OSF (http://osf.io/) to ensure strong confirmatory tests of the findings, and all data, materials, and analytic code for the projects are archived and openly available on the OSF for others to review and reproduce the findings (https://osf.io/8cd4r/). And, fifth, the study directly evaluated whether sample characteristics made a meaningful difference in the likelihood of observing the original finding and, in most cases, it did not. Michelangelo Vianello, one of the project leads and a Professor at the University of Padua, concluded, “We pursued the most rigorous tests of the original findings that we could. It was surprising that even with these efforts we were only able to obtain support for the original findings for half of the studies. These results do not definitively mean that the original findings were wrong, but they do suggest that they are not as robust as might have been assumed. More research is needed to identify whether there are conditions in which the unreplicated findings can be observed. Many Labs 2 suggests that diversity in samples and settings may not be one of them.”

A second paper, “Predicting Replication Outcomes in the Many Labs 2 Study,” is scheduled to appear in the Journal of Economic Psychology and has also been released. This paper reported evidence that researchers participating in surveys and prediction markets about the Many Labs 2 findings could predict which of the studies were likely to replicate and which were not. In prediction markets, each share for a finding that successfully replicates is worth $1 and each share for a finding that fails to replicate is worth nothing. Researchers then buy and sell shares in each finding to predict which ones will succeed and fail to replicate. The final market price is interpretable as the predicted probability that the original finding will replicate. Anna Dreber, senior author of the prediction market paper and Professor at the Stockholm School of Economics and University of Innsbruck, said, “We now have four studies successfully demonstrating that researchers can predict whether findings will replicate or not in surveys and prediction markets with pretty good accuracy. This suggests potential reforms in peer review of grants and papers to help identify findings that are exciting but highly uncertain to invest resources to see if they are replicable.”
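
The link between share price and predicted probability can be sketched in a few lines. This is a minimal illustration of the payoff rule described above ($1 per share if the finding replicates, nothing otherwise), not the market mechanism used in the paper; the function names and the example price and belief are hypothetical.

```python
def expected_profit_per_share(price, belief):
    """Expected profit from buying one share at `price` (in dollars),
    given the trader's own probability `belief` that the finding replicates.
    A share pays $1 if the finding replicates and $0 otherwise."""
    return belief * 1.0 - price


def settle(shares, price_paid, replicated):
    """Realized profit once the replication outcome is known."""
    payout = 1.0 if replicated else 0.0
    return shares * (payout - price_paid)


if __name__ == "__main__":
    # Hypothetical trader who believes the finding has a 75% chance of
    # replicating while shares trade at $0.60.
    print(expected_profit_per_share(price=0.60, belief=0.75))
    print(settle(shares=10, price_paid=0.60, replicated=True))
    print(settle(shares=10, price_paid=0.60, replicated=False))
```

A trader expects to profit from buying whenever their belief exceeds the price, and from selling whenever it falls below. Trading therefore pushes the price toward the traders' collective belief, which is why the final price can be read as a predicted probability of replication.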

Failure to replicate is part of ordinary science. Researchers are investigating the unknown and there will be many false starts in the generation of new knowledge. Nevertheless, prominent failures to replicate findings in psychology and other scientific disciplines have increased concerns that the published literature is not as reproducible as expected. The scientific community is in the midst of a reformation that involves self-critical review of the reproducibility of research, evaluation of the cultural incentives that may lead to irreproducibility and inefficiency in discovery, and testing of solutions to improve reproducibility and accelerate science. The Many Labs 2 project embodies many of the reforms that are spreading across disciplines including preregistration, use of Registered Reports in partnership with a journal, and open sharing of all data, materials, and code. The investigation of irreproducibility itself serves as a mechanism for improving reproducibility and the pace of discovering knowledge, solutions, and cures.