A large-scale effort to replicate results in psychology research has rebuffed claims that failures to reproduce social-science findings might be down to differences in study populations.

The drive recruited labs around the world to try to replicate the results of 28 classic and contemporary psychology experiments. Only half were reproduced successfully using a strict threshold for significance that was set at P < 0.0001 (the P value is a common test for judging the strength of scientific evidence).

The initiative sampled populations from across six continents, and the team behind the effort says that its overall findings suggest that the culture or setting of the group of participants is not an important factor in whether results can be replicated.

Under scrutiny

The reproducibility of research results, particularly in psychology, has come under scrutiny in recent years. Several efforts have tried to repeat published findings in a variety of fields, with mixed outcomes.

The latest effort, called Many Labs 2, was led by psychologist Brian Nosek of the Center for Open Science in Charlottesville, Virginia. Nosek and his colleagues designed their project to address major criticisms of previous replication efforts — including questions about sampling and the assertion that research protocols might not be carried out properly in reproducibility attempts.

Researchers obtained the original materials used in each experiment, and asked experts — in many cases, the original authors of the studies — to review their experimental protocols in advance. Sixty different labs in 36 countries and territories then redid each experiment, providing combined sample sizes that were, on average, 62 times larger than the original ones. The results of the effort are posted today as a preprint1 and are scheduled to be published in Advances in Methods and Practices in Psychological Science.

“We wanted to address the common reaction that, of course the replication failed because the conditions changed, and people are different,” says Nosek. “It’s a possible explanation, but not a satisfying one, because we don’t know why that difference is important.”

Even under these conditions, the results of only 14 of the 28 experiments were replicated, and the researchers determined that the diversity of the study populations had little effect on the failures. “Those that failed tended to fail everywhere,” says Nosek.

For successful replication attempts, the picture was more complicated. The results of these studies differed somewhat from one replication attempt to another, but overall that variation was relatively small.

“Heterogeneity occurs, but it is not as big as we think, and is not a plausible explanation for why some studies fail to replicate,” says Nosek. “It closes off one of the obvious alternative explanations.”

Replication chain

Many Labs 2 is the latest in a series of six large-scale replication efforts in psychology. It focused on a range of studies, none of which had been looked at by other big reproducibility projects.

They include classic studies such as psychologist Daniel Kahneman’s 1981 work2 on framing effects, a form of cognitive bias in which people react differently to a particular choice depending on how it is presented (the study was successfully replicated), and modern research, including work3 by Yoel Inbar in 2009 showing that people who were more likely to experience feelings of disgust tended to be more homophobic.

The attempt to replicate Inbar’s study failed with the strict significance criterion, which surprised Nosek. “I had high confidence in that one because it’s related to things I study myself.”

Inbar, a psychologist at the University of Toronto Scarborough in Canada, who took part in Many Labs 2, was also surprised that his work failed to replicate, but he doesn’t question the outcome. “We could have just gotten lucky, since the original sample size was small, or attitudes may have shifted over time,” he says.

Inbar says that there were also weaknesses in his original study. For instance, he used data initially collected by a colleague for another study.

The focus on reproducibility in recent years means that Inbar, like many psychologists, has changed how he works in an effort to produce more-reliable results. “These days, I would never take an opportunistic secondary analysis like that,” he says.

Not a doomsayer

Replication projects such as Nosek’s do not establish the overall replication rate in a field, because the studies chosen for replication are not a representative sample. Nor do they answer the question of what a ‘good’ replication rate would be. Researchers are not aiming for a perfect score. “Achieving 100% reproducibility on initial findings would mean that we are being too conservative and not pushing the envelope hard enough,” says Nosek.

A previous Many Labs project4 successfully replicated 10 out of 13 studies, while other projects have found replication rates as low as 36%. Of the 190 studies examined in the 6 large-scale efforts combined, 90 were successfully replicated, for a rate of 47%.
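The percentages behind these figures can be checked with simple division. The short Python sketch below does that using only the counts quoted in this article; the labels and the `replication_rate` helper are illustrative names, not anything taken from the Many Labs analysis itself.

```python
# Counts quoted in the article, as (replicated, attempted).
# The labels are illustrative, not official project names.
efforts = {
    "Many Labs 2": (14, 28),
    "Earlier Many Labs project": (10, 13),
    "All six efforts combined": (90, 190),
}


def replication_rate(replicated, attempted):
    """Return the share of attempted studies that replicated, in percent."""
    return 100 * replicated / attempted


for name, (replicated, attempted) in efforts.items():
    rate = replication_rate(replicated, attempted)
    print(f"{name}: {replicated}/{attempted} = {rate:.0f}%")
```

These are raw proportions only. Because the studies were not chosen as a representative sample, the rates say nothing about replicability across the field as a whole.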

That seems too low to Inbar. “If we only have a coin-flip chance to replicate with a large sample size, that feels wrong,” he says.

But Fritz Strack, a psychologist at the University of Würzburg in Germany, is not sure that such replication projects reveal anything useful about the state of psychology. Rather, he says, each replication teaches us more about what might be affecting the result. “Instead of declaring yet another classical finding a ‘false positive’, replicators should identify the conditions under which an effect can and cannot be obtained,” he adds.

Nosek counters that ongoing replication efforts are important for two reasons: to ensure that the replication results are themselves replicable, and to address criticisms of previous work, as this one did. “That is how science advances: evidence, criticism, more evidence to examine the viability of the criticisms,” he says.