Controversial software is proving surprisingly accurate at spotting errors in psychology papers

When Dutch researchers developed an open-source algorithm designed to flag statistical errors in psychology papers, it received a mixed reaction from the research community, especially after the free tool was run on tens of thousands of papers and the results were posted online. Many questioned the accuracy of the algorithm, named statcheck, or said the exercise amounted to public shaming.

But statcheck actually gets it right in more than 95% of cases, its developers claim in a study posted on the preprint server PsyArXiv on 16 November. Some outsiders agree, and are calling for routine use. “The new paper convincingly shows that statcheck is indeed robust,” says Casper Albers, a psychometrician at the University of Groningen in the Netherlands. Others still aren’t convinced.

Statcheck was developed in 2015 by Michèle Nuijten, a methodologist at Tilburg University in the Netherlands, and Sacha Epskamp, a psychometrician at the University of Amsterdam. It scours papers for data reported in the standard format prescribed by the American Psychological Association (APA) and uses them to calculate the p-value, a controversial but widely used measure of statistical significance. If the calculated p-value differs from the one reported by the researchers, the tool flags it as an “inconsistency”; if the reported p-value is below the commonly used threshold of 0.05 and statcheck’s figure isn’t, or vice versa, it is labeled a “gross inconsistency” that may call into question the conclusions. (Erroneous p-values are increasingly recognized as a big problem in psychology; Nuijten believes most stem from human error, but statcheck cannot distinguish misconduct from honest mistakes.)
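
To make the checking logic concrete, here is a minimal sketch in Python. It is not statcheck's own code (statcheck is a separate tool and handles many more test types), and it makes several simplifying assumptions: it reads only z-tests written as "z = 2.10, p = .04", it recomputes a two-sided p-value from the normal distribution, and it ignores rounding of the test statistic itself. The pattern `APA_Z_TEST` and the functions `two_sided_p` and `check_results` are hypothetical names invented for this illustration.

```python
import math
import re

# Hypothetical pattern for a z-test reported as "z = 2.10, p = .04".
# Real APA reporting covers t, F, r, chi-square and more; this sketch does not.
APA_Z_TEST = re.compile(r"\bz\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*=\s*(0?\.\d+)")


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_results(text, alpha=0.05):
    """Return (z, reported_p, recomputed_p, verdict) for each match in text."""
    findings = []
    for match in APA_Z_TEST.finditer(text):
        z = float(match.group(1))
        reported_text = match.group(2)
        reported = float(reported_text)
        recomputed = two_sided_p(z)

        # Compare at the precision the authors reported (e.g. two decimals).
        decimals = len(reported_text.split(".")[1])
        if abs(round(recomputed, decimals) - reported) < 1e-9:
            verdict = "consistent"
        elif (reported < alpha) != (recomputed < alpha):
            # The two values fall on opposite sides of the threshold.
            verdict = "gross inconsistency"
        else:
            verdict = "inconsistency"
        findings.append((z, reported, recomputed, verdict))
    return findings


# Invented example sentences, not taken from any real paper.
sample = (
    "Group A differed from B, z = 2.10, p = .04. "
    "The effect held in study 2, z = 2.50, p = .03, "
    "and in study 3, z = 1.50, p = .04."
)
for z, reported, recomputed, verdict in check_results(sample):
    print(f"z = {z:.2f}: reported p = {reported}, "
          f"recomputed p = {recomputed:.4f} -> {verdict}")
```

In this toy input the first result is consistent, the second is an inconsistency that stays on the same side of 0.05, and the third is a gross inconsistency because the reported value is below 0.05 while the recomputed one is not.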

In a 2015 study, Nuijten and colleagues ran statcheck on more than 30,000 psychology papers and found that half contained at least one statistical inconsistency, and one in eight had a gross inconsistency. Last year, Nuijten’s Tilburg University colleague Chris Hartgerink analyzed just under 700,000 results reported in more than 50,000 psychology studies using statcheck, and had the results automatically posted on the postpublication peer-review site PubPeer, with email notifications to the authors. Some researchers welcomed the feedback, but the German Psychological Society (DGPs) said the postings were causing needless reputational damage. Susan Fiske, a psychologist at Princeton University and former head of the Association for Psychological Science, called the effort a “form of harassment.” (The study was a one-time exercise; the researchers haven’t publicly subjected papers to statcheck since.)

Whether statcheck is fair depends in part on its accuracy. “If it’s known that on 99% of occasions the automated check is accurate, then fine. If the accuracy is only 90%, I’d be really unhappy about the current process,” developmental neuropsychologist Dorothy Bishop of the University of Oxford in the United Kingdom told Retraction Watch at the time.

For the new paper, the team ran statcheck on 49 papers that Nuijten’s colleagues had checked for statistical inconsistencies by hand in a paper published in 2011. They found that the algorithm’s “true positive rate” lies between 85.3% and 100%, and its “true negative rate” lies between 96% and 100%. (The precise numbers depended on various settings in statcheck.) Combined, those numbers mean that statcheck gets the right answer from the extracted results between 96.2% and 99.9% of the time.
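
For readers unfamiliar with these rates, the short Python sketch below shows how they relate to one another. The counts are made up purely for illustration and are not the study's data; the function name `rates` is likewise hypothetical. It assumes a "positive" is a result that the hand check found to be inconsistent.

```python
def rates(tp, fn, tn, fp):
    """Compute the three rates from counts of flagged and unflagged results.

    tp: real inconsistencies that were flagged
    fn: real inconsistencies that were missed
    tn: consistent results that were left alone
    fp: consistent results that were wrongly flagged
    """
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return true_positive_rate, true_negative_rate, accuracy


# Hypothetical counts, chosen only to illustrate the arithmetic.
tpr, tnr, acc = rates(tp=17, fn=3, tn=970, fp=10)
print(f"true positive rate: {tpr:.1%}")
print(f"true negative rate: {tnr:.1%}")
print(f"overall accuracy:   {acc:.1%}")
```

Overall accuracy is an average of the two rates weighted by how many results fall into each group. When consistent results greatly outnumber inconsistent ones, as in these invented counts, the accuracy sits close to the true negative rate, which is how a combined figure can be higher than the lower bound of the true positive rate.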

The researchers also tried to address another criticism: that statcheck often stumbles when researchers have applied legitimate statistical corrections to their data. By searching for specific keywords, the researchers found that such corrections are vastly more common than they had estimated in their previous paper. “Something went wrong there,” Nuijten says. But she and her colleagues found that corrected statistics weren’t a major source of inconsistencies.

Thomas Schmidt, an experimental psychologist at the University of Kaiserslautern in Germany, remains wary. Because it works only with APA-style reporting, statcheck can calculate p-values for only 61% of statistical tests, he noted in a comment posted on PsyArXiv on 22 November. By Schmidt’s calculations, statcheck has a “poor sensitivity” of only 52%. “It is generally unacceptable as a research tool, and certainly unacceptable for purely automatic scanning of multitudes of papers,” he wrote. Nuijten says the team never claimed that statcheck can handle all reported statistics; the point of the new study was to check how well it does with the stats that it does recognize, she says.
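
The article does not say how Schmidt arrived at 52%. One possible reading, offered here as an assumption rather than as his stated method, is that an error can be caught only if statcheck both reads the test and then flags it correctly. The Python sketch below multiplies the two figures quoted above; the variable names are hypothetical.

```python
# Figures quoted in the article.
detection_rate = 0.61       # share of tests statcheck can read, per Schmidt
true_positive_rate = 0.853  # lower bound reported in the new paper

# Assumption: errors are equally likely in tests statcheck cannot read,
# and the two steps are independent.
end_to_end_sensitivity = detection_rate * true_positive_rate
print(f"{end_to_end_sensitivity:.1%}")
```

Under those assumptions the product comes to roughly 52%. The two sides are therefore measuring different things: the developers report performance on the statistics statcheck recognizes, while a figure like this one counts every reported test, readable or not.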

DGPs Secretary Mario Gollwitzer, a psychologist at the Philipps University of Marburg in Germany, is now convinced. Although papers should never be dismissed based on statcheck alone, “We believe that authors should use [it] to scan their paper” before submitting it to a journal, he says.

Some already do. Since the developers released statcheck as a web application in September 2016, it has been accessed by more than 18,000 visitors, Nuijten says. “Statcheck can examine many statistics very quickly, and identify a subset for me that may be problematic,” says Brian Nosek, executive director of the Center for Open Science in Charlottesville, Virginia. “This is a huge efficiency gain.”

A few psychology journals have made statcheck part of their peer-review process, and Nuijten envisions expanding to other disciplines, such as the biomedical sciences. “Statcheck is not perfect,” its proud developer says, “but it’s pretty close.”