By guest blogger Jesse Singal

As most observers of psychological science recognise, the field is in the midst of a replication crisis. Multiple high-profile efforts to replicate past findings have turned up some dismal results: in the Open Science Collaboration's 2015 report, published in Science, for example, just 36% of the evaluated studies showed statistically significant effects the second time around. The results of Many Labs 2, published last year, weren't quite as bad, but were still pretty dismal: just 50% of the studies replicated during that effort.

Some of these failed replications don't come across as all that surprising, at least in retrospect, given the audacity of the original claims. For example, a study published in Science in 2012 claimed that subjects who looked at an image of The Thinker reported, on average, a 20-point lower belief in God on a 100-point scale than those who looked at a supposedly less analytical statue of a discus thrower, leading to the study's headline finding that "Analytic Thinking Promotes Religious Disbelief." It's an astonishing and unlikely result given how tenaciously most people cling to (non)belief; it defies common sense to think that simply looking at a statue could have such an effect. "In hindsight, our study was outright silly," the lead author admitted to Vox after the study failed to replicate. Plenty of other psychological studies have made similarly bold claims.

In light of this, an obvious and interesting question is how much stock we should put in this sort of intuition: does it actually tell us something useful when a given psychological result seems unlikely on an intuitive level? After all, science is replete with real discoveries that seemed ridiculous at first glance.

A new study, available as a preprint, from Suzanne Hoogeveen and colleagues at the University of Amsterdam set out to answer that question by asking laypeople to estimate whether “27 high-profile social science findings” would replicate. The team put 233 people, none of whom had a PhD in psychology or had heard of the Many Labs replication project, into one of two groups. One group was simply given a description of each finding; the other was given a description and an evaluation of the strength of the finding, based on the study’s so-called “Bayes factor.” (This was presented in a way that required no actual knowledge of Bayesian analysis. For example: “BF = 11.9. This qualifies as strong evidence.”) The participants then rated how confident they were that the finding would replicate on a scale from -100 (extremely confident that the study would not replicate) to 100 (extremely confident that it would replicate).
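The qualitative labels shown to the second group follow the conventional way Bayes factors are described in words. As a rough sketch, assuming the widely used Jeffreys-style thresholds (the exact cut-offs and wording shown to the study's participants are an assumption here):

```python
def bayes_factor_label(bf: float) -> str:
    """Map a Bayes factor (evidence for H1 over H0) to a qualitative label.

    Thresholds follow the common Jeffreys-style convention; the exact
    categories used in the preprint's materials may differ.
    """
    if bf < 1:
        return "evidence favours the null"
    if bf < 3:
        return "anecdotal evidence"
    if bf < 10:
        return "moderate evidence"
    if bf < 30:
        return "strong evidence"
    if bf < 100:
        return "very strong evidence"
    return "extreme evidence"

# The article's example, BF = 11.9, falls in the 10-30 band:
print(bayes_factor_label(11.9))  # prints "strong evidence"
```

On this convention, the BF = 11.9 from the example above lands in the "strong evidence" band, matching the label participants were shown.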

Overall, the group given only the descriptions predicted replicability at above-chance levels, guessing correctly 58% of the time. Adding the strength-of-evidence information boosted that accuracy markedly, to 67%.
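One simple way to score such predictions is to treat a positive confidence rating as a "will replicate" forecast and check whether its sign matches the known outcome. A minimal sketch with made-up data (the ratings and outcomes below are illustrative, and the preprint's actual scoring procedure may differ):

```python
# Hypothetical ratings on the -100..100 confidence scale, paired with
# whether each study actually replicated. Illustrative data only.
ratings = [85, -60, 10, -5, 40, -90]
replicated = [True, False, True, False, False, False]

def accuracy(ratings, replicated):
    """Fraction of ratings whose sign matches the actual outcome.

    A positive rating counts as predicting successful replication;
    a rating of exactly 0 counts as incorrect here (a simplifying choice).
    """
    correct = sum((r > 0) == actual for r, actual in zip(ratings, replicated))
    return correct / len(ratings)

print(f"{accuracy(ratings, replicated):.0%}")  # prints "83%" for this toy data
```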

This chart runs down all the studies, and the distribution of replicability ratings the participants gave them. Light grey studies replicated, while darker ones didn’t:

It’s quite clear, visually, that while the participants were stymied by some of the studies in the middle, they did very well with the studies they were most confident about, in either direction. The studies the respondents were most confident would replicate had indeed been successfully replicated, including the findings that “Body cues, not facial expressions, discriminate between intense positive and negative emotions” and that people are less likely to choose to donate to one charity over another when told a significant chunk of their donation would go to administrative costs.

As for the studies almost everyone agreed wouldn’t replicate: one found that players given many chances to guess letters in a “Wheel of Fortune” game subsequently did better on an attention task than those given few chances, and another found that “Washing one’s hands after making a choice eliminates post-decisional dissonance effects, suggesting that hand-washing psychologically removes traces of the past, including concerns about past decisions.” And at the very, very bottom? The “silly” study about the atheism-inspiring statue.

All this suggests that these sorts of judgements can, in fact, provide useful information to the field of psychology as it starts the potentially mammoth undertaking of digging itself out of the replication crisis. And it’s worth noting that this isn’t the first hint that people are pretty good at this sort of prediction: there’s already some evidence that online bettors, at least, can “sniff out weak psychology studies,” as The Atlantic put it.

But there’s a dark side to this finding, too: “These results emphasize that the scientific culture of striving for newsworthy, extreme, and sexy findings is indeed problematic,” the authors note, “as counterintuitive findings are the least likely to replicate.”

This is a rather intense case of clashing incentives: as a research psychologist, you’re more likely to get media write-ups (or, if you’re really lucky, a TED Talk) if you come up with a counterintuitive finding. But those are exactly the findings that are least likely to replicate, which goes a long way toward explaining the mess psychology is in.

– Laypeople Can Predict Which Social Science Studies Replicate [this study is a preprint, meaning that it has yet to be subjected to peer review and the final published version may differ from the version on which this report was based]

Post written by Jesse Singal (@JesseSingal) for the BPS Research Digest. Jesse is a contributing writer at New York Magazine, and he publishes his own newsletter featuring behavioral-science-talk. He is also working on a book about why shoddy behavioral-science claims sometimes go viral for Farrar, Straus and Giroux.

At Research Digest we’re proud to showcase the expertise and writing talent of our community. Click here for more about our guest posts.