In 2016, ProPublica caused a stir when it evaluated software used in criminal justice proceedings to estimate a defendant's chances of committing further crimes. The software, it found, produced systematically different results when evaluating black and white defendants.

The significance of that discrepancy is still the subject of some debate, but two Dartmouth College researchers have asked a more fundamental question: is the software any good? The answer they came up with is "not especially," since its performance could be matched by people recruited on Mechanical Turk or by a simple analysis that takes only two factors into account.

Software and bias

The software in question is called COMPAS, for Correctional Offender Management Profiling for Alternative Sanctions. It takes a wide variety of factors about a defendant into account, using them both to evaluate whether that individual is likely to commit additional crimes and to help identify intervention options. COMPAS is heavily integrated into the judicial process (see this document from the California Department of Corrections for a sense of its importance). Perhaps most significantly, it is sometimes influential in determining sentencing, which can rest on the idea that people who are likely to commit additional crimes should be incarcerated longer.

ProPublica's evaluation of the software focused on arrests in Broward County, Florida. It found that the software had similar accuracy for black and white defendants when predicting whether they would re-offend. But false positives—cases where the software predicted another offense that never happened—were twice as likely to involve black defendants. False negatives, where defendants were predicted to remain crime-free but didn't, were twice as likely to involve white defendants.

But by other measures, the software showed no indication of bias (including, as noted above, its overall accuracy). So the significance of these findings has remained a subject of debate.

The Dartmouth researchers, Julia Dressel and Hany Farid, decided to focus not on bias but on overall accuracy. To do so, they took the records of 1,000 defendants and extracted each one's age, sex, and criminal history. These records were split into pools of 20, and workers recruited through Mechanical Turk were asked to guess the probability that each of the 20 individuals in their pool would commit another crime within the next two years.
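For a concrete sense of that setup, here's a minimal Python sketch of the pooling scheme. The record fields and dummy values are illustrative assumptions, not data from the paper:

```python
import random

random.seed(0)  # reproducible for illustration

# Hypothetical stand-in for the 1,000 defendant records, stripped
# down to the kinds of fields the workers saw (names are assumptions).
records = [
    {
        "id": i,
        "age": random.randint(18, 70),
        "sex": random.choice(["male", "female"]),
        "priors_count": random.randint(0, 15),
    }
    for i in range(1000)
]

# Shuffle and split into 50 pools of 20, one pool per worker task.
random.shuffle(records)
pools = [records[i:i + 20] for i in range(0, len(records), 20)]
assert len(pools) == 50 and all(len(pool) == 20 for pool in pools)
```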

Wisdom of Mechanical Turks

Individually, these recruits had a mean accuracy of 62 percent. That's not too far off the accuracy of COMPAS, which was 65 percent. Because multiple individuals evaluated each defendant, the authors also pooled the responses and took the majority opinion as the prediction. This brought the accuracy up to 67 percent, edging out COMPAS. Other measures of the Mechanical Turk workers' performance suggested they were just as good as the software.
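That pooling step is simple majority voting. Here's a short Python sketch; the (defendant, yes/no) data layout is an assumption for illustration, since the paper reports only the pooled results:

```python
from collections import defaultdict

def majority_vote(judgments):
    """Pool per-defendant yes/no predictions and take the majority.

    judgments: list of (defendant_id, predicted_recidivism) pairs,
    with several workers voting on each defendant. Ties break
    toward "will not re-offend".
    """
    votes = defaultdict(list)
    for defendant_id, prediction in judgments:
        votes[defendant_id].append(prediction)
    # Predict re-offense only if more than half the voters said so.
    return {d: sum(v) > len(v) / 2 for d, v in votes.items()}

def accuracy(pooled, outcomes):
    """Fraction of pooled predictions matching actual re-offense."""
    hits = sum(pooled[d] == outcomes[d] for d in pooled)
    return hits / len(pooled)

# Toy usage: defendant 1 gets votes [yes, yes, no] -> yes.
judgments = [(1, True), (1, True), (1, False), (2, False), (2, False)]
pooled = majority_vote(judgments)            # {1: True, 2: False}
print(accuracy(pooled, {1: True, 2: True}))  # 0.5
```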

The results were also similar in that there was no significant difference between evaluations of black and white defendants. The same was true when the authors presented a similar set of records to a new group of people, this time including information on each defendant's race. So in terms of overall accuracy, these inexperienced people were roughly as good as the software.

But they were also roughly as bad: they, too, were more likely to produce false positives when the defendant was black, though not to the same extent as COMPAS (a 37 percent false-positive rate for black defendants, compared to 27 percent for white ones). The false-negative rate, where defendants were predicted not to re-offend but did, was also higher for white defendants (40 percent) than for black ones (29 percent). Those numbers are remarkably similar to COMPAS' error rates. Including race data on the defendants didn't make a significant difference.
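Those group-wise false-positive and false-negative rates are straightforward to compute. A minimal Python sketch, with the dict-based data layout assumed for illustration:

```python
from collections import defaultdict

def error_rates(predictions, outcomes, groups):
    """Per-group false-positive and false-negative rates.

    predictions / outcomes map defendant id -> bool (predicted /
    actual re-offense); groups maps id -> group label.
    """
    fp, fn = defaultdict(int), defaultdict(int)
    pos, neg = defaultdict(int), defaultdict(int)
    for d, predicted in predictions.items():
        g = groups[d]
        if outcomes[d]:                 # actually re-offended
            pos[g] += 1
            fn[g] += not predicted      # missed a re-offense
        else:                           # stayed crime-free
            neg[g] += 1
            fp[g] += predicted          # predicted offense never happened
    # Assumes each group has at least one positive and one negative case.
    return {g: {"fpr": fp[g] / neg[g], "fnr": fn[g] / pos[g]}
            for g in set(pos) & set(neg)}
```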

If the algorithm could be matched by what is almost certainly a bunch of amateurs, Dressel and Farid reasoned, maybe that's because it is not especially good. So they ran a series of simple statistical tests (linear classifiers) using different combinations of the data they had on each defendant. They found that they could match the performance of COMPAS using only two factors: the defendant's age and total number of prior convictions.
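A two-factor model of that kind is only a few lines in standard tools. The sketch below uses scikit-learn's logistic regression, a common linear classifier; the file and column names follow ProPublica's published Broward County dataset and are assumptions here, and the paper's exact model specification may differ:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two-feature predictor in the spirit of Dressel and Farid's
# simple classifier. CSV and column names assumed from ProPublica's
# Broward County data release; adjust paths/fields as needed.
df = pd.read_csv("compas-scores-two-years.csv")

X = df[["age", "priors_count"]]  # the only two features used
y = df["two_year_recid"]         # re-arrested within two years?

model = LogisticRegression()
# Five-fold cross-validated accuracy; a comparable two-feature
# linear model lands in the mid-60s range discussed above.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.2f}")
```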

This isn't quite as much of a shock as it might appear. Dressel and Farid make a big deal of the claim that COMPAS supposedly considers 137 different factors when making its prediction. A statement by Equivant, the company that makes the software, points out that those 137 factors are used only for evaluating interventions; the prediction of re-offense relies on just six. (The rest of the statement distills down to "this shows that our software's quite good.") Dressel and Farid also acknowledge that re-arrest is an imperfect measure of future criminal activity, since some crimes don't result in arrests and there are significant racial biases in arrest rates.

What to make of all this comes down to whether you're comfortable having a process that's wrong about a third of the time influence things like how much time people spend in prison. At the moment, however, there's no evidence that anything more effective exists.

Science Advances, 2017. DOI: 10.1126/sciadv.aao5580 (About DOIs).