‘Many notable results in psychology are now being questioned because later research has reached different conclusions’

Imagine that a group of researchers set out to explore the idea that adopting a “power pose” could make a real difference to how we thought and acted. High-power poses include standing with hands on hips, feet planted confidently apart, or lounging back in a chair with feet on table and hands behind head; low-power poses include slumped shoulders and folded arms.

The researchers asked 200 people to adopt such poses, then tested the levels of two hormones in their saliva: testosterone, associated with high status, and cortisol, associated with stress.

The astonishing findings? Well, actually, there were no astonishing findings: the power poses seemed to make no difference worth mentioning. High-power poses were correlated with slightly lower testosterone and slightly higher cortisol — the opposite of what might be expected, but tiny and statistically indistinguishable from chance.

Now imagine that a second group of researchers re-examined the same hypothesis. There were some small variations, and the study was smaller (42 participants). The new study did produce remarkable findings: high-power poses boosted testosterone and lowered cortisol. Low-power poses had the opposite effect. The scale of the effect was described as “a whopping significant difference” by one of the researchers; more formally, the effects were both practically large and statistically significant. (Statistical significance is a test of whether the result might easily have been a fluke; it’s possible to have small but statistically significant results, or large but statistically insignificant results.)
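To see how size and significance can come apart, here is a rough sketch in Python. The numbers are invented for illustration and come from neither study. The helper `two_sided_p` is a hypothetical name, and it uses a simple normal approximation that assumes two equal-sized groups with a common, known standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference between two group means.

    Assumption: equal group sizes, a common known sd, and a normal
    approximation (a z-test rather than a t-test).
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative numbers only: a tiny effect measured on a huge sample...
p_small_effect = two_sided_p(mean_diff=0.05, sd=1.0, n_per_group=10_000)
# ...and a large effect measured on a handful of people.
p_large_effect = two_sided_p(mean_diff=0.8, sd=1.0, n_per_group=5)

print(f"small effect, big sample:  p = {p_small_effect:.4f}")
print(f"large effect, tiny sample: p = {p_large_effect:.4f}")
```

Under these assumptions the tiny effect clears the conventional 0.05 threshold and the large one does not. With only five people per group the normal approximation is crude; a proper t-test would be, if anything, less willing to call the large effect significant.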

Faced with both these research findings, published in reputable journals, what should we think? The natural conclusion is that the second study was a fluke, and that standing in a bold pose for a couple of minutes makes no difference to hormone levels. Being open-minded people, we might also be intrigued by the faint possibility that the second study had uncovered a genuine and important result.

This is a hypothetical scenario, I should emphasise. It hasn’t happened. The studies did take place but not in this order. The smaller study was conducted by Amy Cuddy of Harvard and Dana Carney and Andy Yap of Columbia. It inspired a book, and a TED talk that has been watched 34 million times. The larger study was conducted by a team led by Eva Ranehill. But the smaller Cuddy-Carney-Yap study didn’t come second; it was conducted first. The Ranehill team’s study came later.

This story will sound familiar to some. Many notable results in psychology are now being questioned because later research has reached different conclusions. Last year, the “Reproducibility Project”, a large collaborative effort reproducing 100 studies in psychology, published the unnerving finding that only 36 per cent of the replication attempts had produced statistically significant results.

But it is not easy to know quite what to make of that percentage. Failing to find a statistically significant effect in a replication does not simply discredit the original work. For example, some replications find similar effects to the original studies without achieving statistical significance. That means the replication provides (faint) support for the original study rather than evidence against it.

Wharton psychologist Uri Simonsohn suggests a replication attempt should use a substantially larger sample than the original, so it is likely to estimate effects more precisely. If the replication fails to find an effect, that’s not proof there’s no effect; it does suggest, however, that the original study was a fluke.
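Why does a bigger sample estimate effects more precisely? A back-of-the-envelope sketch: the uncertainty around a difference between two group averages shrinks with the square root of the sample size. The function name `se_of_difference`, the even split into two groups and the standard deviation of 1 are all assumptions made for illustration, not details of either study's design.

```python
from math import sqrt

def se_of_difference(total_n, sd=1.0):
    """Standard error of the difference between two group means.

    Assumption: participants are split evenly into two groups that
    share the same standard deviation `sd`.
    """
    per_group = total_n / 2
    return sd * sqrt(2 / per_group)

for total_n in (42, 200):
    se = se_of_difference(total_n)
    # Rough 95% margin of error under a normal approximation.
    margin = 1.96 * se
    print(f"n = {total_n:3d}: standard error = {se:.3f}, margin = +/- {margin:.3f}")

ratio = se_of_difference(42) / se_of_difference(200)
print(f"the smaller sample is about {ratio:.1f} times less precise")
```

On these assumptions a study of 200 pins down the effect roughly twice as tightly as a study of 42, which is why a null result from the larger study carries more weight.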

Columbia University statistician Andrew Gelman suggests a simple rule of thumb that I followed in the opening paragraphs of this column: mentally reverse the order of the studies. Imagine the “replication” came first, and the “original” study came later. Being published first should not be a privileged position from which our conclusions can only be budged with extraordinary evidence. Gelman’s rule of thumb helps us avoid doggedly sticking to the status quo.

But perhaps the most important lesson is to remember that while “statistical significance” sounds scientific, it’s hardly a cast-iron endorsement of a result. The theory behind statistical significance assumes that a single pre-chosen hypothesis will be tested. In practice, researchers rarely pre-specify their hypothesis. They can test dozens, or hundreds — and sooner or later a pattern will emerge, if only by chance.

Imagine testing the idea that vitamin supplements boost childhood achievement. OK. But only for girls? Only for boys? Only for children suffering a poor diet? Only for under-10s?

An unscrupulous researcher can grind through the myriad combinations until a statistically significant pattern appears. But, says Gelman, there is no reason to think such unethical behaviour is common. More likely, researchers gather the data, look informally at the patterns they see, and only then choose a few hypotheses to test. They will tell themselves — correctly — that they’re being led by the data. That’s fine. But nobody should take seriously a test of statistical significance that emerges from such a research process: it will bring up fluke after fluke.
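How quickly do flukes pile up? A small simulation gives a feel for it. In this sketch there is no real effect at all, yet each imaginary study checks 20 subgroups and reports success if any one of them looks significant. The function name `any_fluke`, the 20 subgroups and the 20 children per arm are hypothetical choices. The subgroups are also treated as independent, which real overlapping categories (girls, under-10s and so on) would not be.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()

def any_fluke(n_subgroups, n_per_arm, rng, alpha=0.05):
    """Return True if at least one subgroup looks 'significant' by chance.

    Both arms are drawn from the same distribution, so any
    significant difference is a fluke by construction.
    """
    for _ in range(n_subgroups):
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        # z-test with the (known) standard deviation of 1 in each arm
        z = (fmean(treated) - fmean(control)) / sqrt(2 / n_per_arm)
        p = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
        if p < alpha:
            return True
    return False

rng = random.Random(0)  # fixed seed so the run is repeatable
n_studies = 1000
hits = sum(any_fluke(20, 20, rng) for _ in range(n_studies))
simulated = hits / n_studies
expected = 1 - 0.95 ** 20  # chance of at least one false alarm in 20 independent tests

print(f"simulated share of studies with a fluke: {simulated:.2f}")
print(f"theoretical share:                       {expected:.2f}")
```

With 20 independent looks at pure noise, the theoretical chance of at least one "significant" result is about 64 per cent, and the simulation should land close to that.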

There are various technical solutions to this problem. But a little common sense also goes a long way. When a study of 42 subjects inspires 34 million people, it’s not unreasonable to go back and check the results.

Written for and first published at ft.com.