Say this much for the “reproducibility crisis” in science: It’s poorly timed. At the same instant that a significant chunk of elected and appointed policymakers seems to disbelieve the science behind global warming, and a significant chunk of parents seems to disbelieve the science behind vaccines … a bunch of actual scientists come along and point out that vast swaths of the social sciences don’t stand up to scrutiny. They don’t replicate—which is to say, if someone else does the same experiment, they get different (often contradictory) results. The scientific term for that is bad.

What’s good, though, is that the scientific method is built for self-correction. Researchers are trying to fix the problem. They’re encouraging more sharing of data sets and urging each other to preregister their hypotheses—declaring up front what they expect to find and how they plan to test it. The idea is to cut down on the statistical shenanigans and memory-holing of negative results that got the field into this mess. No more collecting a giant blob of data and then combing through it for a publishable outcome, a practice known as “HARKing”—hypothesizing after results are known.

And self-appointed teams are even going back through old work, manually, to see what holds up and what doesn’t. That means doing the same experiment again, or trying to expand it to see if the effect generalizes. It’s a slog—boring, expensive, and time-consuming. To the Defense Advanced Research Projects Agency, the Pentagon’s mad-science wing, the problem suggests an obvious solution: Robots.

A Darpa program called Systematizing Confidence in Open Research and Evidence—yes, SCORE—aims to assign a “credibility score” (see what they did there) to research findings in the social and behavioral sciences, a set of related fields to which the reproducibility crisis has been particularly unkind. In 2017, I called the project a bullshit detector for science, somewhat to the project director’s chagrin. Well, now it’s game on: Darpa has promised $7.6 million to the Center for Open Science, a nonprofit organization that’s leading the charge for reproducibility. COS is going to assemble a database of 30,000 claims from the social sciences. For 3,000 of those claims, the Center will either attempt to replicate them or subject them to a prediction market—essentially asking human beings to bet on whether the claims will replicate. (Prediction markets are pretty good at this; in a study of reproducibility in the social sciences last summer, for example, a betting market and a survey of other researchers performed about as well as actual do-overs of the studies.)

“The replication work is an assessment of ground-truth fact,” says Tim Errington, director of research at COS, meaning a final call on whether a study held up or failed. “That’s going to get benchmarked against algorithms. Other teams are going to come up with a way to do that automatically, and then you assess each against the other.”

In other words, first you get a database, then you do some human assessment, and then the future machine overlords come in? “I would say ‘machine partners,’” says Adam Russell, an anthropologist and SCORE program manager at Darpa. He’s hoping that the machine-driven “phase II” of the program—which starts taking applications in March—will lead to algorithms that outperform bettors in a prediction market. (Some early work has already suggested that it’s possible.) “It’ll potentially provide insight into ways we can do things better,” Russell says. Russell wants the Defense Department to understand problems in national security—how insurgencies form, how humanitarian aid gets distributed, how to deter enemy action. The department wants to know which research studies are worth paying attention to.