By Christian Jarrett

There’s been a lot of talk of the crisis in psychology. For decades, and often with the best of intentions, researchers have engaged in practices that have made it likely their results are “false positives” or not real. But that was in the past. The crisis is ending. “We do not call the rain that follows a long drought a ‘water crisis’,” write Leif Nelson at UC Berkeley and Joseph Simmons and Uri Simonsohn at the University of Pennsylvania. “We do not call sustained growth following a recession an ‘economic crisis’”.

In their paper, due for publication in the Annual Review of Psychology, the trio observe that had any psychologists been in hibernation for the last seven years, they would not recognise their field today. The full disclosure of methods and data, the pre-registration of studies, the publication of negative findings, and replication attempts – all of which help reduce risk of false positives – have increased immeasurably. “The improvements to our field have been dramatic,” they write. “This is psychology’s renaissance.”

As well as giving the field of psychology a pep talk, their paper provides a useful review of how we got to this point, the reasons things are getting better, and the ongoing controversies.

The crisis before the renaissance

Nelson and his colleagues believe that, starting in 2011, several key developments led to psychology entering a period of self-reflection about its methods. First, a weird, controversial study was published in a prestigious journal. It applied widely used statistical methods to demonstrate “transparently outlandish” effects (our regrettably gullible coverage at the time: “Dramatic study shows participants are affected by psychological phenomena from the future”).

The same year there were several fraud scandals. There was also Nelson and his colleagues’ own influential paper demonstrating how easy it was to use selective reporting and strategic data analysis to extract a ridiculous positive result from random data (in this case, listening to music was shown to decrease your age). They called this approach p-hacking, which is a reference to the fact that psychologists frequently use a statistic known as the p-value – specifically whether it is below 0.05 – to determine whether their result is statistically significant or not. Not long after, a survey of psychology researchers showed that “questionable research practices” indicative of p-hacking were commonplace.
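To see why this kind of flexibility matters, here is a minimal simulation sketch. It is not the analysis from Nelson and colleagues’ paper: the two “tricks” modelled (testing two outcome measures, and adding participants after a first look at the data), the sample sizes, and every function name are assumptions chosen for illustration. Both groups are always drawn from the same distribution, so any “significant” difference is a false positive.

```python
import math
import random


def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming a known SD of 1."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = math.sqrt(1 / len(a) + 1 / len(b))
    return math.erfc(abs(diff / se) / math.sqrt(2))


def draw(n, rng):
    """n scores from a standard normal distribution (no real effect exists)."""
    return [rng.gauss(0, 1) for _ in range(n)]


def honest_study(rng, n=20):
    """One planned outcome, one planned sample size, one test."""
    return z_test_p(draw(n, rng), draw(n, rng)) < 0.05


def flexible_study(rng, n=20, extra=10):
    """Hypothetical p-hacking: two outcomes, plus a second look with more data."""
    outcomes = [(draw(n, rng), draw(n, rng)) for _ in range(2)]
    if any(z_test_p(a, b) < 0.05 for a, b in outcomes):
        return True  # report whichever outcome "worked"
    # Nothing significant yet, so recruit more participants and test again.
    outcomes = [(a + draw(extra, rng), b + draw(extra, rng)) for a, b in outcomes]
    return any(z_test_p(a, b) < 0.05 for a, b in outcomes)


def false_positive_rate(study, trials=4000, seed=1):
    """Share of simulated studies that reach p < 0.05 despite no true effect."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("honest:  ", false_positive_rate(honest_study))
    print("flexible:", false_positive_rate(flexible_study))
```

Under these assumptions the honest study should produce false positives at roughly the nominal 5 per cent rate, while the flexible study should do so noticeably more often, even though each individual decision looks harmless.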

Nelson and his colleagues believe that p-hacking helps explain how for several decades psychologists have used underpowered studies with too few participants, and yet succeeded in publishing countless positive results (a similar problem afflicts neuroscience). “P-hacking has long been the biggest threat to the integrity of our discipline,” they write. Researchers have prodded and pushed their results, dropping participants here, running new trials there, selectively reporting just those conditions that seemed to work. For years, the prevailing ethos was that to succeed you do what you can to extract a positive finding from your experiments.

P-hacking could explain why so many famous findings in psychology have failed to replicate, often when tested under more rigorous conditions. But this is open to debate, and in some cases ill-tempered argument (the authors of past studies that haven’t replicated have sometimes bristled at the suggestion that they did not conduct their studies robustly enough, or that their findings are not real).

Nelson and his colleagues believe all sides can agree that most researchers are honest and well-intentioned. As self-confessed former p-hackers, they write that “p-hacking is not categorised as such by the researcher who is doing it. It is not something that malevolent researchers engage in while laughing maniacally.” Regardless of how frequent p-hacking has been, Nelson et al. hope that everyone recognises that it’s better to reduce p-hacking. Or put differently, that it’s better to do science in a way that reduces the risk of false positives.

The renaissance begins

Fortunately, a consensus seems to have emerged around this position. Amidst all the drama, a revolution is underway. A key player is the psychologist Brian Nosek at the University of Virginia, who helped launch the Open Science Framework, an online platform that makes it easy to share methods and data, and who in 2013 co-founded the Center for Open Science (read our coverage of some of the large-scale replication efforts organised by the Center).

Other positive changes include key journals such as Psychological Science and Social Psychological and Personality Science implementing the requirement for researchers to disclose all of their measures and manipulations. Even better, pre-registration of methods (publishing your planned methods and hypotheses before you collect your data) is becoming easier and more widespread (two key sites for this are AsPredicted.org and the Open Science Framework), and an increasing number of journals now publish “registered reports”, a cause championed by psychologist Chris Chambers in the UK.

“We expect that in 3-5 years, published pre-registered experimental psychology studies will be either common or extremely common,” write Nelson et al. Some have complained that pre-registration stifles scientific exploration, but Nelson’s team counter that “it does not preclude exploration, but it does communicate to readers that it occurred”.

There has also been a welcome “surge of interest” in replicating previous studies – one of the main ways to uncover and address the possible effects of p-hacking on previously published research (according to the CurateScience database, 96 per cent of over 1000 replication attempts have been conducted since 2011). With regard to the debates that often ensue after a failed replication attempt (such as whether the replication was similar enough to the original), Nelson et al. propose a compromise: “the burden of proof is on the researcher espousing the least plausible claim”.

For instance, if the author of the original finding complains that the replication study took place on a different day of the week (and that’s why it didn’t work), it’s incumbent on her or him to demonstrate why day of the week should moderate the effect that they originally claimed to have uncovered. On the other hand, if the replicators used an obviously inferior manipulation (e.g. in a study testing the effects of hunger, they used just a few minutes without food to induce hunger), it’s up to them to show that the lack of effect persists when hunger is induced in a more robust fashion. “Neutral observers often agree on who has the burden of proof,” write Nelson et al.

Another issue for psychology’s renaissance is how to categorise a replication attempt as a success or failure. In fact, Nelson et al. explain how this is often not straightforward, and in many cases it is fairer and more accurate to interpret unsuccessful replications as inconclusive.

Ongoing debates

Other issues going forward include finding optimal ways to check the veracity of collections of prior studies, such as through using a statistical technique known as “p-curve analysis”. No one approach is flawless. More also needs to be done to check for innocent errors (which are extremely common) and outright fraud (thankfully rare, although Nelson’s team say that “for every case of fraud that is identified, there are almost certainly many more that are not”).

One of the simplest solutions is simply to require researchers to post their data and materials online. “Public data posting not only allows others to verify the accuracy of the analyses, but also incentivises authors to more carefully avoid errors,” write Nelson and his colleagues.

Some readers may be surprised that Nelson et al. don’t welcome all the efforts at reform in psychology. For instance, many have called for a greater emphasis on meta-analyses, in which the findings from many studies are combined. But Nelson’s group argue that this can make matters worse – for instance, biases in the studies can accumulate rather than cancel each other out. “The end result of a meta-analysis is as strong as the weakest link; if there is some garbage in, there is only garbage out.”
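The worry about accumulating bias can be illustrated with a rough sketch. This is not Nelson and colleagues’ own analysis: it assumes a true effect of zero, a literature in which only significant positive results get published, and a naive pooling method that simply averages equally sized studies. All names and numbers are hypothetical.

```python
import math
import random


def study_effect(rng, n=20):
    """Observed difference in group means when the true effect is zero."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    return sum(a) / n - sum(b) / n


def published_effects(k, rng, n=20):
    """Keep running studies until k of them are 'significant' and positive."""
    se = math.sqrt(2 / n)  # standard error of one study, assuming SD = 1
    kept = []
    while len(kept) < k:
        d = study_effect(rng, n)
        if d / se > 1.96:  # only these results reach the journals
            kept.append(d)
    return kept


def pooled_z(effects, n=20):
    """Naive pooled z-score: mean effect divided by its shrunken standard error."""
    se = math.sqrt(2 / n)
    pooled = sum(effects) / len(effects)
    pooled_se = se / math.sqrt(len(effects))
    return pooled / pooled_se


if __name__ == "__main__":
    rng = random.Random(7)
    for k in (5, 20):
        print(k, "published studies, pooled z =", pooled_z(published_effects(k, rng)))
```

Because every published estimate is biased in the same direction, averaging them does not move the pooled effect back towards zero; it only shrinks the standard error, so the combined result looks more convincing the more biased studies are added.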

Nelson’s team are also sceptical of those who say psychology should ditch p-value based significance testing for other metrics, such as confidence intervals and Bayesian results. “It is not the [particular] statistic that causes the problem, it is the mindlessness [with which they are relied upon].”

“The Enlightenment is just around the corner”

These are healthy debates and they will continue for years to come. For now though, let’s join Nelson and his co-authors in recognising the positive and welcome changes underway in psychological science. “Practices that promise to increase the integrity of our discipline – replications, disclosure, pre-registration – are orders of magnitude more common than they were just a short time ago. And although they are not yet common enough, it is clear that the Middle Ages are behind us, and the Enlightenment is just around the corner.”

—Psychology’s renaissance [Our coverage is based on an early version of this paper published at SSRN; the final published version may differ]

—Listen to our PsychCrunch episode on whether we can trust psychological studies

Christian Jarrett (@Psych_Writer) is Editor of BPS Research Digest