Online tool calculates reproducibility scores of PubMed papers

A new online tool unveiled 19 January measures the reproducibility of published scientific papers by analyzing data about articles that cite them.

The software comes at a time when scientific societies and journals are alarmed by evidence that findings in many published articles are not reproducible, and are struggling to find reliable methods for evaluating reproducibility.

The tool, developed by the for-profit firm Verum Analytics in New Haven, Connecticut, generates a metric called the r-factor that indicates the veracity of a journal article based on the number of other studies that confirm or refute its findings. The r-factor metric has drawn much criticism from academics who said its relatively simple approach might not be sufficient to solve the multifaceted problem that measuring reproducibility presents.

Early reaction to the new tool suggests that Verum has not fully allayed those concerns. The Verum developers concede the tool still has limitations; they said they released it to receive feedback about how well it works and how it could be improved. Verum has developed the project as a labor of love, and co-founder Josh Nicholson said he hopes the release of this early version of the tool will attract potential funders to help improve it.

Verum announced the methodology underlying the tool, based on the r-factor, in a preprint paper last August and refined it in the new tool. It relies solely on data from freely available research papers in the popular biomedical search engine PubMed.

Nicholson and his colleagues developed the tool by first manually examining 48,000 excerpts of text in articles that cited other published papers. Verum’s workers classified each of these passages as either confirming, refuting, or mentioning the other papers. Verum then used these classifications to train an algorithm to autonomously recognize each kind of passage in papers outside this sample group.

Based on a sample of about 10,000 excerpts, Verum’s developers claim their tool classifies passages correctly 93% of the time. But it detects mentioning citations much more precisely than confirming or refuting ones, which were much less common in their sample. The vast majority of articles mention previous studies without confirming or refuting their claims; only about 8% of all citations are confirmatory and only about 1% are refuting.
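To see why an overall accuracy figure can flatter a classifier on data this lopsided, consider a rough sketch. It assumes, purely for illustration, that a test sample has the same mix as the citation shares quoted above (about 91 mentioning, 8 confirming, and 1 refuting per 100 citations); the function name and the counts are hypothetical and are not drawn from Verum's software or evaluation data.

```python
def majority_baseline(counts):
    """Score a trivial classifier that always predicts the most common class.

    counts maps each class label to its number of examples.
    Returns (predicted_label, accuracy, per_class_recall).
    """
    total = sum(counts.values())
    label = max(counts, key=counts.get)  # the most frequent class
    accuracy = counts[label] / total
    # The baseline finds every example of its one class and none of the others.
    recall = {c: (1.0 if c == label else 0.0) for c in counts}
    return label, accuracy, recall


# Illustrative mix per 100 citations (an assumption based on the quoted shares).
citation_counts = {"mentioning": 91, "confirming": 8, "refuting": 1}

label, accuracy, recall = majority_baseline(citation_counts)
print(label, accuracy, recall)
```

Under that assumption, a classifier that labels everything "mentioning" would already be right about 91% of the time while never identifying a confirming or refuting citation, which is why per-class performance matters more here than the headline number.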

The tool’s users can apply the algorithm by entering an article’s unique PubMed identifier code. The algorithm scours PubMed to find articles that cite the paper of interest and all passages that confirm, refute, or mention the paper. The tool then generates an r-factor score for the paper by dividing the number of confirming papers by the sum of the confirming and refuting papers.

This formula tends to assign high scores, close to 1, to papers seldom refuted. The low number of refuting papers in Verum’s database means that many articles have r-factors of 1, which tends to limit the tool’s usefulness. (R-factors also contain a subscript number indicating the total number of studies that attempted to replicate the paper: an r-factor of 1₁₆ means the tool scanned 16 replication studies.)
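The arithmetic behind the score, as described here, can be sketched in a few lines. This is a minimal illustration of the stated formula, not Verum's code: the function names are hypothetical, and returning no score when a paper has no confirming or refuting citations is an assumption, since the article does not say how the tool handles that case.

```python
from typing import Optional, Tuple


def r_factor(confirming: int, refuting: int) -> Tuple[Optional[float], int]:
    """Return (score, attempts) for a paper.

    score    = confirming / (confirming + refuting)
    attempts = confirming + refuting, the number shown as the subscript
    """
    attempts = confirming + refuting
    if attempts == 0:
        # Assumption: with no replication attempts the score is left undefined.
        return None, 0
    return confirming / attempts, attempts


def describe(confirming: int, refuting: int) -> str:
    """Render the score with its attempt count, e.g. '1.00 (16 studies)'."""
    score, attempts = r_factor(confirming, refuting)
    if score is None:
        return "undefined (0 studies)"
    return f"{score:.2f} ({attempts} studies)"


print(describe(16, 0))  # never refuted: the score is exactly 1
print(describe(15, 1))  # a single refutation lowers the score only slightly
```

Mention-only citations do not enter the calculation at all, so a paper cited hundreds of times but tested once can still score 1; the subscript is what tells a reader how much evidence sits behind the number.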

Psychologist Christopher Chartier of Ashland University in Ohio, who developed an online platform that assists with the logistics of replication studies, tried the new tool at the request of Science Insider. “It appears to do what it claims to do, but I don't find much value in the results,” he said. One reason, he said, is that r-factors may be skewed by publication bias, in which scholarly journals preferentially publish positive results over negative ones. “We simply can’t trust the published literature to be a reliable and valid indicator of a finding’s replicability,” Chartier said.

“Attempting to estimate the robustness of a published research finding is notoriously difficult,” said Marcus Munafò, a biological psychologist at the University of Bristol in the United Kingdom and a key figure in tackling irreproducibility. It’s difficult, he said, to know the precision or quality of individual confirmatory or refuting studies without reading them.

Another limitation in Verum’s tool is that because it trawls only freely available papers on PubMed, it misses paywalled scholarly literature.

Still, the Verum team will press on. Next on their agenda is to increase the number of sample papers used to train their algorithm to improve its accuracy in recognizing confirming and refuting papers.