While science as a whole has produced remarkably reliable answers to a lot of questions, it does so despite the fact that any individual study may not be reliable. Issues like small errors on the part of researchers, unidentified problems with materials or equipment, or the tendency to publish positive answers can alter the results of a single paper. But collectively, through multiple studies, science as a whole inches towards an understanding of the underlying reality.

A meta-analysis is a way to formalize that process. It takes the results of multiple studies and combines them, increasing the statistical power of the analysis. This may cause exciting results seen in a few small studies to vanish into statistical noise, or it can tease out a weak effect that's completely lost in more limited studies.
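One common way to do that combining is inverse-variance weighting under a fixed-effect model, in which more precise studies count for more. The sketch below assumes that approach for illustration; it is not necessarily the method used in any meta-analysis discussed here, and the four study results are hypothetical numbers chosen so that no single study reaches significance on its own.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Combine study estimates using inverse-variance weights (fixed-effect model)."""
    # Each study is weighted by 1 / SE^2, so precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    # The pooled standard error shrinks as studies are added.
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical results from four small studies of the same effect.
estimates = [0.30, 0.10, 0.25, 0.15]
std_errors = [0.20, 0.20, 0.20, 0.20]

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)

# z-scores: every individual study falls short of 1.96, the pooled result does not.
individual_z = [est / se for est, se in zip(estimates, std_errors)]
pooled_z = pooled / pooled_se
print(individual_z, pooled_z)
```

With four equally precise studies, the pooled standard error is half that of any single study, which is where the added statistical power comes from.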

But a meta-analysis only works its magic if the underlying data is solid. And a new study that looks at multiple meta-analyses (a meta-meta-analysis?) suggests that one of the factors mentioned above, our tendency to publish results that support hypotheses, is making the underlying data less solid than we'd like.

Publication bias

It's possible for publication bias to be a form of research misconduct. If a researcher is convinced of their hypothesis, they might actively avoid publishing any results that would undercut their own ideas. But there are plenty of other ways for publication bias to set in. Researchers who find a weak effect might hold off on publishing in the hope that further research would be more convincing. Journals also tend to favor publishing positive results (ones where a hypothesis is confirmed) and avoid publishing studies that don't see any effect at all. Researchers, being aware of this, might adjust the publications they submit accordingly.

As a result, we might expect to see a bias towards the publication of positive results, and stronger effects. And, if a meta-analysis is done using results with these biases, it will end up having a similar bias, despite its larger statistical power.
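A small simulation can make that inheritance concrete. The sketch below is purely illustrative: the true effect, sample size, and "publish only significant positive results" rule are all assumptions chosen for the demonstration, not a model of any real literature.

```python
import random
import statistics

def simulate_study(true_effect, n, rng):
    """Return (estimate, standard error) for one study of n subjects."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / n ** 0.5
    return estimate, std_error

def pooled_estimate(studies):
    """Inverse-variance weighted average of (estimate, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(w * est for w, (est, _) in zip(weights, studies))
    return total / sum(weights)

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_EFFECT = 0.1        # assumed small, real effect

# Many small studies, all measuring the same underlying effect.
all_studies = [simulate_study(TRUE_EFFECT, 30, rng) for _ in range(2000)]

# Assumed publication filter: only significant positive results appear in print.
published = [(est, se) for est, se in all_studies if est / se > 1.96]

unbiased = pooled_estimate(all_studies)
biased = pooled_estimate(published)
print(f"all studies: {unbiased:.3f}, published only: {biased:.3f}")
```

Pooling every study recovers something close to the assumed true effect, while pooling only the "published" subset gives an inflated estimate, even though the pooling step itself is identical in both cases.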

While this issue has been recognized by researchers, it's not obvious how to prevent this from being a problem with meta-analyses. It's not even clear how to tell it's a problem with meta-analyses. But a small team of Scandinavian researchers—Amanda Kvarven, Eirik Strømland, and Magnus Johannesson—have figured out a way.

Their work relies on the fact that several groups have organized direct replications of studies in the behavioral sciences. Collectively, these provide a substantial number of additional test subjects (over 53,000 of them in the replications used), but aren't subject to the potential biases that influence regular scientific publications. These should, collectively, provide a reliable measure of what the underlying reality is.

The three researchers searched the literature to identify meta-analyses that addressed the same research questions as the replications, and came up with 15 of them. From there, it was a simple matter of comparing the effects seen in the meta-analyses to the ones obtained in the replication efforts. If publication bias isn't having an effect, the two should be substantially similar.

They were not substantially similar.

Almost half the replications saw a statistically significant effect of the same sort seen by the meta-analysis. An equal number saw an effect of the same sort, but the effect was small enough that it didn't rise to significance. Finally, one remaining study saw a statistically significant effect that wasn't present in the meta-analysis.

Further problems appeared when the researchers looked at the size of the effect the different studies identified. The effects seen in the meta-analyses were, on average, three times larger than those seen in the replication studies. This wasn't caused by a few outliers; instead, a dozen of the 15 topics showed larger effect sizes in the meta-analyses.

All of that's consistent with what you might expect from a publication bias favoring strong positive results. The field had recognized that this might be a problem and developed some statistical tools intended to correct for it. So, the researchers reran the meta-analyses using three of these tools. Two of them didn't work. The third was effective, but it came at the cost of reducing the statistical power of the meta-analysis. In other words, it eliminated one of the primary reasons for doing a meta-analysis in the first place.

This doesn't mean that meta-analyses are a failure, or all research results are unreliable. The work was done in a field—behavioral science—where enough problems had already been recognized to motivate extensive replication studies in the first place. The researchers cite a separate study from the medical literature that compared meta-analyses of a collection of small trials to the outcome of larger clinical trials that followed. While there was a slight bias for positive effects there, too, it was quite small, especially in comparison to the differences identified here.

But the study does indicate that the problem of publication bias is a real one. Fortunately, it's one that could be tackled if journals were more willing to publish papers with negative results. If the journals did more to encourage these sorts of studies, researchers would likely be able to provide them with no shortage of negative results.

Aside from the main message of this paper, Kvarven, Strømland, and Johannesson use an additional measure to ensure the robustness of their work. Rather than simply counting anything with a p value less than 0.05 as significant, they limit that to things with a p value less than 0.005. They term things in between these two values as "suggestive evidence."
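The two-threshold rule described above can be sketched in a few lines. The function name and the "not significant" label are hypothetical choices made for this illustration; only the 0.005 and 0.05 cutoffs and the "suggestive evidence" term come from the paper as described here.

```python
def classify_p_value(p, significant=0.005, suggestive=0.05):
    """Label a p value using the stricter two-threshold scheme."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p < significant:
        return "significant"
    if p < suggestive:
        # Would count as significant under the conventional 0.05 cutoff.
        return "suggestive evidence"
    return "not significant"  # label assumed for this sketch

for p in (0.001, 0.02, 0.3):
    print(p, classify_p_value(p))
```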

Nature Human Behaviour, 2019. DOI: 10.1038/s41562-019-0787-z.