For decades, many researchers thought that statistics were better than humans at predicting whether a released criminal would end up back in jail. Today commercial risk-assessment algorithms help courts all over the country with this type of forecasting. Their results can inform how legal officials decide on sentencing, bail and the offer of parole. The widespread adoption of semi-automated justice continues despite the fact that, over the past few years, experts have raised concerns over the accuracy and fairness of these tools. Most recently, a new Science Advances paper, published on Friday, found that algorithms performed better than people at predicting whether a released criminal would be rearrested within two years. Researchers who worked on a previous study have contested these results, however. The one thing current analyses agree on is that nobody is close to perfect: both human and algorithmic predictions can be inaccurate and biased.

The new research is a direct response to a 2018 Science Advances paper that found untrained humans performed as well as a popular risk-assessment program called Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) at forecasting recidivism, or whether a convicted criminal would reoffend. That study drew a great deal of attention, in part because it contradicted received wisdom. Clinical psychologist “Paul Meehl stated, in a famous book in 1954, that actuarial, or statistical, prediction was almost always better than unguided human judgment,” says John Monahan, a psychologist at the University of Virginia School of Law, who was not involved in the most recent study but has worked with one of its authors. “And over the past six decades, scores of studies have proven him right.” When the 2018 paper came out, COMPAS’s distributor, the criminal justice software company Equivant (formerly called Northpointe), posted an official response on its Web site saying the study mischaracterized the risk-assessment program and questioning the testing method used. When contacted more recently by Scientific American, an Equivant representative had no further comment beyond this response.

To test the conclusions of the 2018 paper, researchers at Stanford University and the University of California, Berkeley, initially followed a similar method. Both studies used a data set of risk assessments performed by COMPAS. The data set covered about 7,000 defendants in Broward County in Florida and included each individual’s “risk factors”: salient information such as sex, age, the crime with which that person was charged and the number of his or her previous offenses. It also contained COMPAS’s prediction for whether the defendant would be rearrested within two years of release and confirmation of whether that prediction came true. From that information, the researchers could gauge COMPAS’s accuracy. Additionally, the researchers used the data to create profiles, or vignettes, based on each defendant’s risk factors, which they showed to several hundred untrained humans recruited through the Amazon Mechanical Turk platform. They then asked the participants whether they thought a person in a vignette would commit another crime within two years.

The study from 2018 found that COMPAS displayed about 65 percent accuracy. Individual humans were slightly less accurate, and the combined human estimate was slightly more so. Following the same procedure, the authors of the more recent paper confirmed these results. “The first interesting thing we notice is that we could, in fact, replicate their experiment,” says Sharad Goel, a co-author of the new study and a computational social scientist at Stanford. “But then we altered the experiment in various ways, and we extended it to several other data sets.” Over the course of these additional tests, he says, algorithms displayed more accuracy than humans.
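For readers who want to see how such accuracy figures are computed, here is a minimal Python sketch. It is not the researchers' code. The function names and the tiny data set are invented for illustration, and it assumes the "combined human estimate" is a simple majority vote among raters, which the article does not specify.

```python
from typing import List, Sequence


def accuracy(predictions: Sequence[bool], outcomes: Sequence[bool]) -> float:
    """Fraction of predictions that match the observed outcomes."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    if len(outcomes) == 0:
        raise ValueError("need at least one case")
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)


def majority_vote(votes_per_case: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine several raters' answers per case by strict majority.

    Assumption: ties count as a prediction of 'no rearrest' (False).
    """
    return [2 * sum(votes) > len(votes) for votes in votes_per_case]


# Made-up toy data: True means "rearrested within two years".
outcomes = [True, False, True, False]
rater_a = [True, True, True, False]
rater_b = [True, False, False, False]
rater_c = [False, False, True, True]

# Regroup the answers so each entry holds all raters' votes for one case.
votes_per_case = list(zip(rater_a, rater_b, rater_c))
crowd = majority_vote(votes_per_case)

print("rater A accuracy:", accuracy(rater_a, outcomes))
print("crowd accuracy:", accuracy(crowd, outcomes))
```

In this toy case each rater gets at least one answer wrong, but their errors fall on different cases, so the majority vote is right more often than any single rater. That is one way a combined estimate can edge out individuals.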

First, Goel and his team expanded the scope of the original experiment. For example, they tested whether accuracy changed when predicting rearrest for any offense versus a violent crime. They also analyzed evaluations from multiple programs: COMPAS, a different risk-assessment algorithm called the Level of Service Inventory-Revised (LSI-R) and a model that the researchers built themselves.

Second, the team tweaked the parameters of its experiment in several ways. For example, the previous study gave the human subjects feedback after they made each prediction, allowing people to learn more as they worked. The new paper suggests that this approach is not true to some real-life scenarios, where judges and other court officials may not learn the outcomes of their decisions immediately, or at all. So the new study gave some subjects feedback while others received none. “What we found there is that if we didn’t provide immediate feedback, then the performance dropped dramatically for humans,” Goel says.

The researchers behind the original study disagree with the idea that feedback renders their experiment unrealistic. Julia Dressel was an undergraduate computer science student at Dartmouth College when she worked on that paper and is currently a software engineer for Recidiviz, a nonprofit organization that builds data analytics tools for criminal justice reform. She notes that the people on Mechanical Turk may have no experience with the criminal justice system, whereas individuals predicting criminal behavior in the real world do. Her co-author Hany Farid, a computer scientist who worked at Dartmouth in 2018 and who is currently at U.C. Berkeley, agrees the people who use tools such as COMPAS in real life have more expertise than those who received feedback in the 2018 study. “I think they took that feedback a little too literally, because surely judges and prosecutors and parole boards and probation officers have a lot of information about people that they accumulate over years. And they use that information in making decisions,” he says.

The new paper also tested whether revealing more information about each potential backslider changed the accuracy of predictions. The original experiment provided only five risk factors about each defendant to the predictors. Goel and his colleagues tested this condition and compared it with the results when they provided 10 additional risk factors. The higher-information situation was more akin to a real court scenario, in which judges have access to more than five pieces of information about each defendant. Goel suspected this scenario might trip up humans because the additional data could be distracting. “It’s hard to incorporate all of these things in a reasonable way,” he says. Despite his reservations, the researchers found that the humans’ accuracy remained the same, although the extra information could improve an algorithm’s performance.

Based on the wider variety of experimental conditions, the new study concluded that algorithms such as COMPAS and LSI-R are indeed better than humans at predicting risk. This finding makes sense to Monahan, who emphasizes how difficult it is for people to make educated guesses about recidivism. “It’s not clear to me how, in real life situations—when actual judges are confronted with many, many things that could be risk factors and when they’re not given feedback—how the human judges could be as good as the statistical algorithms,” he says. But Goel cautions that his conclusion does not mean algorithms should be adopted unreservedly. “There are lots of open questions about the proper use of risk assessment in the criminal justice system,” he says. “I would hate for people to come away thinking, ‘Algorithms are better than humans. And so now we can all go home.’”

Goel points out that researchers are still studying how risk-assessment algorithms can encode racial biases. For instance, COMPAS can say whether a person might be arrested again—but one can be arrested without having committed an offense. “Rearrest for low-level crime is going to be dictated by where policing is occurring,” Goel says, “which itself is intensely concentrated in minority neighborhoods.” Researchers have been exploring the extent of bias in algorithms for years. Dressel and Farid also examined such issues in their 2018 paper. “Part of the problem with this idea that you're going to take the human out of [the] loop and remove the bias is: it’s ignoring the big, fat, whopping problem, which is the historical data is riddled with bias—against women, against people of color, against LGBTQ,” Farid says.

Dressel also notes that even when they outperform humans, the risk assessment tools tested in the new study do not have very high accuracy. “The COMPAS tool is around 65 percent, and the LSI-R is around 70 percent accuracy. And when you’re thinking about how these tools are being used in a courtroom context, where they have very profound significance—and can very highly impact somebody’s life if they are held in jail for weeks before their trial—I think that we should be holding them to a higher standard than 65 to 70 percent accuracy—and barely better than human predictions.”

Although all of the researchers agreed that algorithms should be applied cautiously and not blindly trusted, tools such as COMPAS and LSI-R are already widely used in the criminal justice system. “I call it techno utopia, this idea that technology just solves our problems,” Farid says. “If the past 20 years have taught us anything, it should have [been] that that is simply not true.”