Replication has emerged as a powerful tool to check science and get us closer to the truth. Researchers take an experiment that’s already been done, and test whether its conclusions hold up by reproducing it. The general principle is that if the results repeat, then the original results were correct and reliable. If they don’t, then the first study must be flawed, or its findings false.

But there’s a big wrinkle with replication studies: They don’t work like that. As researchers reproduce more experiments, they’re learning that they can’t always get clear answers about the reliability of the original results. Replication, it seems, is a whole lot murkier than we thought.

Take the latest findings from the large-scale Reproducibility Project: Cancer Biology. Here, researchers focused on reproducing experiments from the highest-impact papers about cancer biology published from 2010 to 2012. They shared their results in five papers in the journal eLife last week, and not one of their replications definitively confirmed the original results.

The findings echoed those of another landmark reproducibility project, which, like the cancer biology project, came from the Center for Open Science. In that project, researchers replicated major psychology studies, and only 36 percent of the replications confirmed the original conclusions.

Replicating studies, it turns out, is really, really hard. For replications to get easier, scientists are going to have to learn to be much more transparent about their process. We’re also going to need to start looking at replication studies with the same critical gaze we reserve for single studies of any kind.

Replicating studies is really, really hard

Replicating a study doesn’t just mean reading the paper and trying to run it again. It’s more like trying to play a complicated board game without all the instructions or even all the parts.

To pull off the replication studies in cancer biology, the researchers had to go back to the authors of the original papers and ask them to share any information they may have left out of their methods sections and any additional unpublished data that helped them arrive at their conclusions.

They then came up with a plan for reproducing the study, and got that plan peer reviewed — including by the original authors, statisticians, and experts in the relevant field.

“Sometimes we’d learn we were doing the wrong experiment, and we’d go back and make the changes,” said Tim Errington, who led the cancer reproducibility project. Once the changes were accepted, the real work would begin: running the experiment.

After they wrote up the results, they passed them through peer review and the original author was invited to weigh in again. (Not every author wanted to participate, of course, which sometimes made the process even harder.)

Sometimes, things as small as the temperature of the lab could cause a cell biology experiment to flop, Errington said. Other researchers who’ve replicated psychology studies said they had to overcome language and cultural barriers in translating the original science to a new lab. The takeaway here is that replication can go off the rails very easily if researchers don’t take the utmost care — and even when they do.

If a replication fails, it may be a bad replication — not necessarily a bad original study

Interpreting the results of a replication study is another place where things can get messy, fast. And increasingly, scientists are realizing that if a replication fails to reproduce the original results, it doesn’t necessarily mean the original was wrong.

With the Reproducibility Project: Cancer Biology, not one of the replications definitively confirmed the original results: Two of the papers reproduced parts of the original experiments, one failed to reproduce the original experiment entirely, and the other two replications were impossible to interpret because of technical problems with the models.

“It could be that the original was false positive,” Errington said. “It could be that the replication is a false negative, that the replication did something wrong, or probably that they are both right — the original is right, and the replication is right — and what’s occurring is that we don’t know the cause of the discrepancy.”

In other words, there are any number of reasons a replication may fail, and they may say nothing about the quality of the original research.

“A single failure to replicate is no more definitive than a single study claiming discovery,” Bristol researcher Marcus Munafò, who recently co-authored a manifesto for reproducible science, told me. “Given the latter, we need to do more of the former and not just assume that a single study claiming discovery is robust.”

Doing replication studies properly is costly and time-consuming, said Lawrence Tabak, the principal deputy director at the National Institutes of Health. That “means you can’t do wholesale replications, but there is likely a subset of studies we really should consider doing replication.”

For example, researchers should try to replicate their findings from animal studies before moving to costlier human studies. “[Let’s say a] replication study of an animal study would cost half a million dollars,” Tabak said. “One could argue that’s a lot of money. Why spend it? The answer is you would want to be certain the key experimental results could be replicated before spending $5 million on a first-in-human study.”

The neurology and aging centers at the NIH already do this, he added, and the NIH recently released guidelines on reproducing research.

Scientists and journals need to get better at describing their methods and sharing data

Whenever I talk to researchers who have done replication studies, it becomes clear pretty quickly that their endeavor won’t go well without buy-in from the researchers who did the original study. That’s because the methods sections of research papers, which describe how an experiment was done, often aren’t very detailed.

“One of the biggest barriers to reproducibility is simply knowing how the original research was done,” said Brian Nosek, who co-founded the Center for Open Science, which supported the cancer biology and psychology reproducibility projects.

A greater degree of transparency and data sharing would make replications easier — and more reliable. So researchers need to get better at describing their methods in detail and sharing their data.

“There may be some results that can be reproduced all the time very easily, others that get reproduced under very specific conditions, and others that are not reproduced at all. The emerging picture is that the two last categories may be very common,” Stanford meta-researcher John Ioannidis said in an email. “Funding agencies should clearly pay attention to this,” he added, instead of wasting money on research that isn’t reproducible.

Researchers could also make use of tools, like open-source software that tracks every version of a data set, so that they can share their data more easily and have transparency built into their workflow. As we wrote in a Vox feature on the biggest problems in science, journals and funders also need to rethink their incentive structures to reward more transparency and replications.

“[Replication projects are] shining a light on the way we conduct research,” Errington said. For better or worse, the way studies are done now makes the work of replicating them very hard and the results of replications more dubious. But, he added, “If we make ourselves more open and transparent from the beginning, before our work is even published in a paper, that’ll probably help a lot.”