For scientists, getting research published in the journal Nature is a huge deal. It carries weight, prestige, and the promise of career advancement—as do the pages of its competitor, Science. Both have a reputation for publishing innovative, exciting, and high-quality work with a broad appeal. That reputation means that papers from these journals make up a substantial portion of day-to-day science news.

But the prestige of these journals doesn’t exempt them from problems that have been plaguing science for decades. In fact, because they publish such exciting and innovative work, there's a risk that they're even more likely to publish thrilling but unreliable papers. They may also be contributing to a scientific record that shows only the "yes" answers to big questions but neglects to mention the important-but-boring "no" results.

Colin Camerer, a behavioral economist at the California Institute of Technology, recently led a team of researchers in trying to repeat 21 social science studies from Science and Nature, successfully replicating 13 of them. The results, published yesterday (in Nature, naturally), hint at how the field's focus on positive results may be biasing the literature. They also paint a complicated picture of the replication crisis in social science and illustrate just how tricky the project of replication is.

How reliable is the scientific record?

Psychology’s reliability crisis erupted in 2011 with a wave of successive shocks: the publication of a paper purporting to show precognition; a fraud scandal; and a growing recognition of p-hacking, in which researchers exercise so much liberty in how they analyze their data that almost any result can be made to look real. Scientists began to wonder whether the publication record was bloated with unreliable findings.

The crisis is far from being limited to psychology; many of the problems plague fields from economics to biomedical research. But psychology has been a sustained and particularly loud voice in the conversation, with projects like the Center for Open Science aiming to understand the scope of the problem and trying to fix it.

In 2015, the Center published its first batch of results from a huge psychology replication project. Out of 100 attempted replications, only around a third were successful. At the time, the replicators were cautious about their conclusions, pointing out that a failed replication could mean that the original result was an unreliable false positive—but it could also mean that there were unnoticed differences in the experiments or that the failed replication was a false negative.

In fact, the bias toward publishing positive results makes false negatives a significant risk in replications.

The perils of false negatives

One challenge of experimental work is deciding how many research subjects you need to get a reliable result. There’s no one-size-fits-all answer when it comes to sample sizes: the right number of people (or mice, or countries) to study will depend on the question you’re asking. If you can expect a big difference between groups—for example, if you plan to find out whether men are, on average, taller than women—you don’t need that many people. But if you think your difference is going to be small, you need a much bigger sample size.

That expected difference between groups, called the effect size, helps researchers work out how many subjects they need for their study. It’s important to get it right, because if you don’t have enough subjects for the effect size, you're more likely to miss a result that is actually real—it won't be distinguishable from statistical noise.
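The relationship between effect size and sample size can be sketched with a standard power calculation. This is a minimal illustration using the common normal-approximation formula for a two-sample comparison; the specific numbers are illustrative and not drawn from the study:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sample comparison,
    using the normal approximation: n = 2 * (z_alpha + z_power)^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# A large expected difference (like average height between men and women)
# needs relatively few subjects; a small one needs many more.
print(n_per_group(0.8))  # large effect: 25 per group
print(n_per_group(0.2))  # small effect: 393 per group
```

The inverse-square dependence on the effect size is why getting that estimate right matters so much: studying too small a sample for the true effect leaves a real result buried in statistical noise.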

Researchers will often look at previous research to estimate an effect size. In the case of a replication, it seems sensible to use an effect size derived from the original paper.

The thing is, there’s reason to think that these effect sizes might not be super accurate. Experiments often ask multiple questions at the same time—if you ask enough questions, random chance will make it look like the answer is “yes” on some of them. Scientists have typically reported only their “yes” answers because those are the only ones that seem interesting. But if everyone does that, over time, the literature gets biased: the big “yes” effect sizes get published, but they may reflect luck as much as what’s really going on.
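A quick simulation shows how this bias inflates published effect sizes. This is a minimal sketch, not the paper's analysis: it assumes each study's estimated effect is normally distributed around a small true effect, and that only statistically significant positive estimates get published:

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.2   # the real (small) effect size
SE = 0.316          # standard error for roughly 20 subjects per group
published = []

for _ in range(50_000):
    estimate = random.gauss(TRUE_EFFECT, SE)
    # Only "significant" results (estimate > 1.96 * SE) make it into print.
    if estimate > 1.96 * SE:
        published.append(estimate)

# The average published effect is well above the true 0.2 -- the
# studies that cleared the significance bar got lucky on the high side.
print(round(statistics.mean(published), 2))
```

Underpowered studies plus a significance filter guarantee this "winner's curse": the estimates that survive into the literature are systematically too big.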

This means that a replication might actually be on the hunt for an effect size that’s smaller than the one in the original study. So, the researchers doing the replication might need to use more people to be sure they’ve got a decent chance of finding the effect. This could be one of the reasons why the replication rate has been so low.
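Because required sample size scales with the inverse square of the effect size, even a modest shrinkage in the true effect demands a much larger replication. A back-of-the-envelope illustration with hypothetical numbers:

```python
# n scales as 1/d^2, so if the true effect is half the published one,
# a replication needs four times as many subjects for the same power.
published_d = 0.5
true_d = published_d / 2
inflation = (published_d / true_d) ** 2
print(inflation)  # 4.0
```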

Replicating a study sounds simple, but it isn’t

Camerer and his colleagues wanted to test the reliability of social science results published in Nature and Science. They went looking for studies published between 2010 and 2015 that would be easy to replicate: those that used research subjects that were easy to access (like undergraduate students) and tested a clear experimental hypothesis. They found 21 papers that fit their criteria.

But Camerer and colleagues didn’t just want to look at each study on its own; they wanted to find out if they could say anything general about the reliability of this kind of work. They wanted to do science on the science, or meta-science. That meant that they needed to try to be consistent in how they did each replication. With wildly varying studies, that’s difficult, and it meant making some blanket decisions so that every paper got similar treatment.

The team decided to focus just on the first experiment in each paper and try to replicate that. A single experiment can produce multiple results, so if the replication shows that some are the same and some are different, how do you decide whether it has been successful? The researchers decided to focus just on the result that the original study considered the most important and compare that with the replication.

They involved the original authors in the replication of their work so they could be sure that the replications were as close to the original studies as possible and that everyone agreed on how they were going to analyze the data. They also made sure they had big enough sample sizes to find much smaller effects than those reported in the original papers, making it less likely that they’d get false negatives.