By guest blogger Dan Jones

Amid all the talk of a “replication crisis” in psychology, here’s a rare good news story – a new project has found that a sub-field of the discipline, known as “experimental philosophy” or X-Phi, is producing results that are impressively robust.

The current crisis in psychology was largely precipitated by a mass replication attempt published by the Open Science Collaboration (OSC) project in 2015. Of 100 previously published significant findings, only 39 per cent replicated unambiguously, rising to 47 per cent on more relaxed criteria.

Now a paper in Review of Philosophy and Psychology has trained the replicability lens on the burgeoning field of experimental philosophy. Born roughly 15 years ago, X-Phi takes the tools of contemporary psychology and applies them to unravelling how people think about many of the major topics of Western philosophy, from the metaphysics of free will and the nature of the self, to morality and the problem of consciousness.

Take one of the earliest and most well-known findings from the field. Back in 2003, Joshua Knobe, now at Yale University, set out to study what determines whether we describe the outcome of someone’s behaviour as intentional or not. Knobe ran two short vignettes by his participants. One described the CEO of a company being brought a new business plan by his VPs that would cause harm to the environment. The CEO says “I don’t care about harming the environment, go ahead”. The project proceeds and the environment is duly harmed.

In that case, most people say that the CEO intentionally harmed the environment. But switch the word “harm” to “help” and now people are reluctant to say that the CEO intentionally helped the environment, even though in both cases the CEO was indifferent to harming or helping the environment as a side effect of his business activities. This asymmetry, known as the Knobe effect, revealed how moral considerations affect our deployment of ostensibly non-moral concepts, like intentionality.

X-Phi has generated great excitement, making it essential to see whether the pall of the replicability crisis darkens this field too. So the X-Phi Replicability Project (XRP) – a collaboration between twenty research teams in eight countries, devised by Brent Strickland of the Institut Jean Nicod, Paris, France and led by Florian Cova of the Swiss Centre for Affective Sciences, University of Geneva, Switzerland – set about re-running a representative sample of 40 X-Phi studies to see how the field is holding up (the sample comprised the most cited X-Phi papers from 2003-2015, two random papers for each year, and a few extra to bring the total to 40).

There are numerous criteria that can be used for determining whether a study replicates or not. The XRP used three: a subjective assessment by the replicating team; whether the replication achieved a statistically significant result in the same direction as the original study; and whether the effect sizes of the original and replication studies were comparable. The XRP found that X-Phi studies replicated 78 per cent of the time according to the first two criteria, and 71 per cent of the time judged by the third – way better than the 39-47 per cent for psychology as a whole, although other specific sub-fields like cognitive psychology have also reported promising results.
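The second and third criteria lend themselves to a simple operational check. Here is a minimal illustrative sketch (not the XRP’s actual code, and the function names are mine): it assumes the common convention that effect sizes count as “comparable” when the original effect falls inside the replication’s 95 per cent confidence interval, though the XRP’s exact procedure may differ.

```python
def significant_same_direction(orig_effect, rep_effect, rep_p, alpha=0.05):
    """Criterion 2: the replication is statistically significant (p < alpha)
    and its effect points in the same direction as the original."""
    return rep_p < alpha and (orig_effect * rep_effect) > 0

def effect_sizes_comparable(orig_effect, rep_ci_low, rep_ci_high):
    """Criterion 3 (one common operationalisation): the original effect
    size lies within the replication's 95% confidence interval."""
    return rep_ci_low <= orig_effect <= rep_ci_high

# Hypothetical replication of an original finding with effect size d = 0.80:
print(significant_same_direction(orig_effect=0.80, rep_effect=0.65, rep_p=0.01))
print(effect_sizes_comparable(orig_effect=0.80, rep_ci_low=0.40, rep_ci_high=0.90))
```

A study can pass one criterion and fail another – a replication might reach significance in the right direction yet with a much smaller effect – which is why the XRP’s replication rates differ slightly depending on the criterion used.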

“It’s wonderful news,” says Knobe. “And it’s a bit ironic, because the whole approach of X-Phi was modelled on the techniques used in the rest of psychology, yet we ended up with this higher replicability!”

Why is X-Phi doing so well? A big part of the field’s success seems to come from the kinds of studies it focuses on. Specifically, 78 per cent of the studies in the XRP sample were “content-based”, meaning they looked at how participants’ behaviour, beliefs or judgments changed depending on the task or stimuli they were given (e.g., Knobe’s changing of the word “harm” to “help” to elicit different judgments about intentional behaviour), rather than, for instance, observational or demographic-based (looking at aspects of individuals that predict differences in outcome). And such content-based studies – whether in psychology as a whole or in X-Phi – replicate better than other kinds of study: 90 per cent replicated in the XRP sample (compared to 21-31 per cent for other designs), and 65 per cent in the OSC sample. Notably, content-based studies comprised just 34 per cent of the OSC sample – less than half the proportion found in X-Phi – giving X-Phi a higher overall replicability rate.

In addition, the effect sizes studied in the predominantly content-based studies of X-Phi might typically be larger than in the rest of contemporary psychology – and as the OSC found, studies that report large effect sizes tend to replicate better. One reason for this could be that the effects X-Phi explores are powerful enough to be available to introspection. The Knobe effect, for example, has an intuitive, natural feel to it – one that Knobe sensed but needed to confirm empirically. And the XRP found that early X-Phi studies tended to produce especially large effect sizes, adding plausibility to the idea that they tapped into low-hanging, introspectively visible psychological fruit.

A cluster of other factors may be giving X-Phi a boost. X-Phi studies tend to be cheaper and easier to run than those in many areas of psychology. Often, all you need are some written vignettes and a few questions about them, and a way of delivering this material to people – a process made easy by the widely used Amazon Mechanical Turk survey platform. That means that researchers may be less invested in a given study, and feel less pressure to “p-hack” the data to produce a positive, publishable, but ultimately unreliable, result. This take is backed up by the XRP’s statistical analysis, which found less evidence of p-hacking in X-Phi than in psychology generally. What’s more, X-Phi researchers also seem to be more comfortable publishing null or negative results: while such papers comprised just 3 per cent of the total in the OSC, they made up 10 per cent of the XRP’s sample.

The success of X-Phi can’t be emulated by simply choosing to investigate psychological phenomena with large effect sizes. Nor does the high replicability of content-based studies mean that psychologists should restrict their research to these kinds of studies, which Knobe says would be absurd.

Florian Cova agrees, but says that the fact that some study designs tend to produce more robust findings than others should not be ignored. “If we know that certain kinds of studies are more likely to produce fragile results or false positives, we can adjust our study designs in these cases to try to minimise these risks.”

Perhaps the broadest lesson for other psychologists is the importance of basing studies on detailed theories, rather than just looking for interesting or quirky psychological phenomena. “Research in X-Phi is usually driven by a rich theory that predicts what we should expect to happen,” says Knobe. Such theory-driven research could make null results more palatable and take the pressure off needing to find some positive result to squeeze out of the data. “If we run a study in which the predicted effect doesn’t occur, that null result is still a publishable finding because it speaks to whether the theory is true or not.”

—Estimating the Reproducibility of Experimental Philosophy

Post written by Dan Jones (@MultipleDraftz) for the BPS Research Digest. Dan is a freelance writer based in Brighton, UK, whose writing has appeared in The Psychologist, New Scientist, Nature, Science and many other magazines. He blogs at www.philosopherinthemirror.wordpress.com.