Caution is indeed warranted, according to Julia Dressel and Hany Farid from Dartmouth College. In a new study, they have shown that COMPAS is no better at predicting an individual’s risk of recidivism than random volunteers recruited from the internet.

“Imagine you’re a judge and your court has purchased this software; the people behind it say they have big data and algorithms, and their software says the defendant is high-risk,” says Farid. “Now imagine I said: Hey, I asked 20 random people online if this person will recidivate and they said yes. How would you weight those two pieces of data? I bet you’d weight them differently. But what we’ve shown should give the courts some pause.” (A spokesperson from Equivant declined a request for an interview.)

COMPAS has attracted controversy before. In 2016, the technology reporter Julia Angwin and colleagues at ProPublica analyzed COMPAS assessments for more than 7,000 arrestees in Broward County, Florida, and published an investigation claiming that the algorithm was biased against African Americans. The problems, they said, lay in the algorithm’s mistakes. “Blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend,” the team wrote. And COMPAS “makes the opposite mistake among whites: They are much more likely than blacks to be labeled lower-risk but go on to commit other crimes.”

Northpointe questioned ProPublica’s analysis, as did various academics. They noted, among other rebuttals, that the program correctly predicted recidivism in both white and black defendants at similar rates. For any given score on COMPAS’s 10-point scale, white and black people are just as likely to re-offend as each other. Others have noted that this debate hinges on one’s definition of fairness, and that it’s mathematically impossible to satisfy the standards set by both Northpointe and ProPublica—a story at The Washington Post clearly explains why.
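A small toy calculation makes the trade-off concrete. The sketch below uses hypothetical counts for two groups, invented for illustration and not taken from the Broward County data. The group names and the `rates` helper are also hypothetical. In both groups the high/low label is calibrated in Northpointe's sense: 60 percent of people labeled high risk re-offend, and so do 20 percent of those labeled low risk. Because the groups' underlying re-offense rates differ, their false positive and false negative rates still come out unequal. That is the kind of disparity ProPublica reported.

```python
from fractions import Fraction

# Hypothetical toy counts, NOT COMPAS or Broward County data.
# For each group: how many were labeled high or low risk, and how many
# in each label went on to re-offend.
GROUPS = {
    "group_a": {"high": 60, "high_reoffend": 36, "low": 40, "low_reoffend": 8},
    "group_b": {"high": 30, "high_reoffend": 18, "low": 70, "low_reoffend": 14},
}


def rates(g):
    """Return calibration and error rates for one group's counts (hypothetical helper)."""
    high_no = g["high"] - g["high_reoffend"]  # labeled high risk, did not re-offend
    low_no = g["low"] - g["low_reoffend"]     # labeled low risk, did not re-offend
    reoffenders = g["high_reoffend"] + g["low_reoffend"]
    non_reoffenders = high_no + low_no
    return {
        # Calibration: share of each label that actually re-offended.
        "reoffend_if_high": Fraction(g["high_reoffend"], g["high"]),
        "reoffend_if_low": Fraction(g["low_reoffend"], g["low"]),
        # Underlying re-offense rate of the whole group.
        "base_rate": Fraction(reoffenders, g["high"] + g["low"]),
        # False positive rate: non-reoffenders wrongly labeled high risk.
        "fpr": Fraction(high_no, non_reoffenders),
        # False negative rate: reoffenders labeled low risk.
        "fnr": Fraction(g["low_reoffend"], reoffenders),
    }


for name, counts in GROUPS.items():
    r = rates(counts)
    print(name, {k: round(float(v), 3) for k, v in r.items()})
```

With these numbers, group_a has a false positive rate of 3/7 against 3/17 for group_b, and group_b has the higher false negative rate. Unless predictions are perfect, relabeling people to equalize those error rates would break the calibration. This is the trade-off the two sides were arguing over.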

The debate continues, but when Dressel read about it, she realized that it masked a different problem. “There was this underlying assumption in the conversation that the algorithm’s predictions were inherently better than human ones,” she says, “but I couldn’t find any research proving that.” So she and Farid did their own.

They recruited 400 volunteers through a crowdsourcing site. Each person saw short descriptions of defendants from ProPublica’s investigation, highlighting seven pieces of information. Based on that, they had to guess if the defendant would commit another crime within two years.

On average, they got the right answer 63 percent of the time, and the group's accuracy rose to 67 percent when their answers were pooled. COMPAS, by contrast, has an accuracy of 65 percent. It's barely better than individual guessers, and no better than a crowd. "These are nonexperts, responding to an online survey with a fraction of the amount of information that the software has," says Farid. "So what exactly is software like COMPAS doing?"
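The article does not say exactly how the answers were pooled. This sketch assumes one common approach, a simple majority vote for each defendant. The data are invented, and the `accuracy` and `majority_vote` helpers are hypothetical. Each of three guessers is wrong about a different defendant, so each one scores two out of three, yet the majority is right every time.

```python
from collections import Counter


def accuracy(predictions, outcomes):
    """Fraction of predictions that match the actual outcomes."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)


def majority_vote(guesses_per_defendant):
    """Pool each defendant's guesses into the most common answer (ties count as 'no')."""
    pooled = []
    for guesses in guesses_per_defendant:
        counts = Counter(guesses)
        pooled.append(counts[True] > counts[False])
    return pooled


# Hypothetical toy data, not from the study: True = re-offended within two years.
outcomes = [True, False, True]
guessers = [
    [False, False, True],  # wrong only about defendant 0
    [True, True, True],    # wrong only about defendant 1
    [True, False, False],  # wrong only about defendant 2
]

individual = [accuracy(g, outcomes) for g in guessers]
mean_individual = sum(individual) / len(individual)
# Transpose so each inner list holds every guesser's answer for one defendant.
per_defendant = [list(col) for col in zip(*guessers)]
crowd = accuracy(majority_vote(per_defendant), outcomes)
print(f"mean individual accuracy: {mean_individual:.2f}, crowd accuracy: {crowd:.2f}")
```

The gain depends on the guessers making different mistakes. If everyone errs on the same defendants, a majority vote simply repeats the shared error.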