As it turned out, that finding was entirely predictable. While the SSRP team was doing their experimental re-runs, they also ran a “prediction market”—a stock exchange in which volunteers could buy or sell “shares” in the 21 studies, based on how reproducible they seemed. They recruited 206 volunteers—a mix of psychologists and economists, students and professors, none of whom were involved in the SSRP itself. Each started with $100 and could earn more by correctly betting on studies that eventually panned out.

At the start of the market, shares for every study cost $0.50 each. As trading continued, those prices soared and dipped depending on the traders’ activities. And after two weeks, the final price reflected the traders’ collective view on the odds that each study would successfully replicate. So, for example, a stock price of $0.87 would mean a study had an 87 percent chance of replicating. Overall, the traders thought that studies in the market would replicate 63 percent of the time—a figure that was uncannily close to the actual 62-percent success rate.
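To make that arithmetic concrete, here is a minimal sketch of how closing share prices become replication forecasts. The prices in it are invented placeholders, not values from the actual market, and the code is my own illustration rather than anything the SSRP team ran:

```python
# A toy illustration of reading prediction-market prices as probabilities.
# The closing prices below are made-up placeholders, not the real values
# from the SSRP market.

final_prices = [0.87, 0.55, 0.23, 0.91, 0.40]  # dollars per share at close

# Each share pays $1 if its study replicates, so a closing price of $0.87
# is read directly as an 87 percent chance of replication.
implied_probabilities = final_prices

# The market's overall forecast is the average across studies; this is the
# kind of number that landed at 63 percent against the observed 62 percent.
mean_forecast = sum(implied_probabilities) / len(implied_probabilities)
print(f"Market-implied replication rate: {mean_forecast:.0%}")
```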

The traders’ instincts were also sound when it came to individual studies. Look at the graph below. The market assigned higher odds of success to the 13 studies that were successfully replicated than to the eight that weren’t—compare the blue diamonds to the yellow diamonds.

Camerer et al. 2018 / Nature Human Behaviour.

“It is great news,” says Anna Dreber from the Stockholm School of Economics, who came up with the idea of using prediction markets to study reproducibility in 2015. “It suggests that people more or less already know which results will replicate.”

“If researchers can anticipate which findings will replicate, or fail to, it makes it harder to sustain dismissive claims about the replications or the replicators,” adds Brian Nosek from the Center for Open Science, who was part of the SSRP.

What clues were the traders looking for? Some said that they considered a study’s sample size: A positive result from a small study is more likely to be a false positive than one from a bigger study. Some looked at a common statistical metric called the P value. If a result has a P value that’s less than 0.05, it’s said to be statistically significant, or positive. And if a study contains lots of P values that just skate under this threshold, it’s a possible sign that the authors committed “p-hacking”—that is, they futzed with their experiment or their data until they got “positive” but potentially misleading results. Signs like this can be ambiguous, and “scientists are usually reluctant to lob around claims of p-hacking when they see them,” says Sanjay Srivastava from the University of Oregon. “But if you are just quietly placing bets, those are things you’d look at.”
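To see why those near-threshold P values raise eyebrows, here is a rough simulation of one flavor of p-hacking: measuring many outcomes from the same experiment and reporting whichever one happens to slip under 0.05. The numbers of outcomes, simulations, and participants are arbitrary choices for illustration, not figures from any of the 21 studies:

```python
# A sketch of how outcome-shopping inflates false positives. Both groups are
# drawn from the same distribution, so any "significant" difference is a
# false positive by construction.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_simulations, n_outcomes, n_per_group = 2000, 10, 20
false_positives = 0

for _ in range(n_simulations):
    for _ in range(n_outcomes):
        control = rng.normal(size=n_per_group)
        treatment = rng.normal(size=n_per_group)
        if ttest_ind(control, treatment).pvalue < 0.05:
            false_positives += 1
            break  # stop and "report" the first significant outcome

print(f"False-positive rate with outcome-shopping: {false_positives / n_simulations:.0%}")
# A single pre-specified outcome would land near 5 percent; shopping across
# ten independent outcomes pushes the rate toward 40 percent.
```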

Beyond statistical issues, it strikes me that several of the studies that didn’t replicate have another quality in common: newsworthiness. They reported cute, attention-grabbing, whoa-if-true results that conform to the biases of at least some parts of society. One purportedly showed that reading literary fiction improves our ability to understand other people’s beliefs and desires. Another said that thinking analytically weakens belief in religion. Yet another said that people who think about computers are worse at recalling old information—a phenomenon that the authors billed as “the Google effect.” All of these were widely covered in the media.