How should we fairly balance this trade-off? There’s no universal answer, but in the 1760s, the English judge William Blackstone wrote, “It is better that ten guilty persons escape than that one innocent suffer.”

Blackstone’s ratio is still highly influential in the US today. So let’s use it for inspiration.

Move the threshold to where the “released but rearrested” percentage is roughly 10 times the “needlessly jailed” percentage.

You can already see two problems with using an algorithm like COMPAS. The first is that better prediction can always help reduce error rates across the board, but it can never eliminate them entirely. No matter how much data we collect, two people who look the same to the algorithm can always end up making different choices.

The second problem is that even if you follow COMPAS’s recommendations consistently, someone—a human—has to first decide where the “high risk” threshold should lie, whether by using Blackstone’s ratio or something else. That depends on all kinds of considerations—political, economic, and social.
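The trade-off that the threshold slider controls can be sketched in code. This is a minimal illustration using invented risk scores and outcomes, not the actual COMPAS or Broward County data: for each possible threshold, it computes the two error rates the article describes.

```python
# Sketch of the threshold trade-off, with invented data (not COMPAS).
# Jail everyone whose risk score exceeds the threshold, then compute:
#  - "needlessly jailed": jailed people who would not have been rearrested
#  - "released but rearrested": released people who were rearrested

def error_rates(scores, rearrested, threshold):
    """Return (needlessly-jailed rate, released-but-rearrested rate)."""
    jailed = [s > threshold for s in scores]
    needlessly_jailed = sum(j and not r for j, r in zip(jailed, rearrested))
    released_rearrested = sum((not j) and r for j, r in zip(jailed, rearrested))
    n_innocent = sum(not r for r in rearrested)
    n_rearrested = sum(rearrested)
    return (needlessly_jailed / max(n_innocent, 1),
            released_rearrested / max(n_rearrested, 1))

# Toy data: higher scores loosely track rearrest, but imperfectly --
# two people with the same score can have different outcomes.
scores     = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 3, 6, 8, 9, 2]
rearrested = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,  0, 0, 1, 1, 0]

for t in range(1, 10):
    fp_rate, fn_rate = error_rates(scores, rearrested, t)
    print(f"threshold {t}: needlessly jailed {fp_rate:.0%}, "
          f"released but rearrested {fn_rate:.0%}")
```

Moving the threshold up shrinks one error rate and grows the other; no threshold drives both to zero, which is the first problem described above. A Blackstone-style rule would pick the threshold where the second rate is roughly ten times the first.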

Now we’ll come to a third problem. This is where our explorations of fairness start to get interesting. How do the error rates compare across different groups? Are there certain types of people who are more likely to get needlessly detained?

Let’s see what our data looks like when we consider the defendants’ race.

Now move each threshold to see how it affects black and white defendants differently.

Race is an example of a protected class in the US, which means discrimination on that basis is illegal. Other protected classes include gender, age, and disability.

Now that we’ve separated black and white defendants, we’ve discovered that even though race isn’t used to calculate the COMPAS risk scores, the scores have different error rates for the two groups. At the default COMPAS threshold between 7 and 8, 16% of black defendants who don’t get rearrested have been needlessly jailed, while the same is true for only 7% of white defendants. That doesn’t seem fair at all! This is exactly what ProPublica highlighted in its investigation.
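The disparity ProPublica found can be illustrated by running the same error-rate calculation separately for each group. The numbers below are invented for illustration, not the Broward County data, but they show the mechanism: a single shared threshold can needlessly jail one group at a much higher rate than another.

```python
# Sketch: compare the "needlessly jailed" rate across groups at one shared
# threshold. Scores and outcomes here are invented, not the COMPAS data.

def needlessly_jailed_rate(scores, rearrested, threshold):
    """Among people who were NOT rearrested, what fraction was jailed?"""
    jailed_innocent = sum(s > threshold and not r
                          for s, r in zip(scores, rearrested))
    innocent = sum(not r for r in rearrested)
    return jailed_innocent / innocent if innocent else 0.0

groups = {
    "black": ([3, 5, 6, 7, 8, 8, 9, 10], [0, 0, 1, 0, 1, 0, 1, 1]),
    "white": ([1, 2, 3, 4, 5, 6, 7, 9],  [0, 0, 0, 1, 0, 0, 1, 1]),
}

threshold = 7  # shared threshold: jail scores of 8 and above
for race, (scores, rearrested) in groups.items():
    rate = needlessly_jailed_rate(scores, rearrested, threshold)
    print(f"{race}: {rate:.0%} needlessly jailed")
```

In this toy data, the group whose scores skew higher ends up with a higher needlessly-jailed rate at the same threshold, even though race never enters the calculation.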

Okay, so let’s fix this.

Move each threshold so white and black defendants are needlessly jailed at roughly the same rate.

(There are a number of solutions. We’ve picked one, but you can try to find others.)

Aiming for Blackstone’s ratio again, we arrived at the following solution: white defendants have a threshold between 6 and 7, while black defendants have a threshold between 8 and 9. Now roughly 9% of both black and white defendants who don’t get rearrested are needlessly jailed, while roughly 75% of those who do get rearrested will have been released without spending any time in jail. Good work! Your algorithm seems much fairer than COMPAS now.

But wait—is it? In the process of matching the error rates between races, we lost something important: our thresholds for each group are in different places, so our risk scores mean different things for white and black defendants.

White defendants get jailed for a risk score of 7, but black defendants get released for the same score. This, once again, doesn’t seem fair. Two people with the same risk score have the same probability of being rearrested, so shouldn’t they receive the same treatment? In the US, using different thresholds for different races may also raise complicated legal issues under the equal protection clause of the 14th Amendment.

So let’s try this one more time with a single threshold shared between both groups.

Move the threshold again so white and black defendants are needlessly jailed at the same rate.

If you’re getting frustrated, there’s good reason. There is no solution.

We gave you two definitions of fairness: keep the error rates comparable between groups, and treat people with the same risk scores in the same way. Both of these definitions are totally defensible! But satisfying both at the same time is impossible.

The reason is that black and white defendants are rearrested at different rates. Whereas 52% of black defendants were rearrested in our Broward County data, only 39% of white defendants were. There’s a similar difference in many jurisdictions across the US, in part because of the country’s history of police disproportionately targeting minorities (as we previously mentioned).

Predictions reflect the data used to make them—whether by algorithm or not. If black defendants are arrested at a higher rate than white defendants in the real world, they will have a higher rate of predicted arrest as well. This means they will also have higher risk scores on average, and a larger percentage of them will be labeled high-risk—both correctly and incorrectly. This is true no matter what algorithm is used, as long as it’s designed so that each risk score means the same thing regardless of race.
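A small worked example makes this impossibility concrete. Suppose the risk score is perfectly calibrated in the sense just described: everyone labeled high-risk is rearrested at the same rate regardless of race, and likewise for low-risk. The figures below (a 60% rearrest rate for high-risk, 20% for low-risk) are invented for illustration, with the groups’ overall rearrest rates set to the article’s 52% and 39%.

```python
# Sketch: why calibration plus different base rates forces different error
# rates. Assume a calibrated two-level score: "high risk" people are
# rearrested 60% of the time and "low risk" people 20% of the time, in
# BOTH groups. (These per-label rates are invented for illustration.)

def needlessly_jailed_rate(base_rate, p_high=0.6, p_low=0.2):
    """Needlessly-jailed rate implied by calibration, if high-risk = jailed.

    The share labeled high-risk is pinned down by the group's overall
    rearrest rate: base_rate = share_high * p_high + (1 - share_high) * p_low.
    """
    share_high = (base_rate - p_low) / (p_high - p_low)
    # Fraction of NON-rearrested people who were labeled high-risk (jailed):
    return share_high * (1 - p_high) / (1 - base_rate)

# Overall rearrest rates from the article's Broward County figures:
print(f"52% base rate: {needlessly_jailed_rate(0.52):.0%} needlessly jailed")
print(f"39% base rate: {needlessly_jailed_rate(0.39):.0%} needlessly jailed")
```

Even though the score means exactly the same thing for both groups, the group with the higher rearrest rate must have more of its members labeled high-risk, so a larger share of its non-rearrested members gets needlessly jailed. Equalizing those error rates would require breaking calibration, which is the conflict the previous paragraphs describe.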

This strange conflict of fairness definitions isn’t just limited to risk assessment algorithms in the criminal legal system. The same sorts of paradoxes hold true for credit scoring, insurance, and hiring algorithms. In any context where an automated decision-making system must allocate resources or punishments among multiple groups that have different outcomes, different definitions of fairness will inevitably turn out to be mutually exclusive.

There is no algorithm that can fix this; this isn’t even an algorithmic problem, really. Human judges are currently making the same sorts of forced trade-offs—and have done so throughout history.

But here’s what an algorithm has changed. Though judges may not always be transparent about how they choose between different notions of fairness, people can contest their decisions. In contrast, COMPAS, which is made by the private company Northpointe, is a trade secret that cannot be publicly reviewed or interrogated. Defendants can no longer question its outcomes, and government agencies lose the ability to scrutinize the decision-making process. There is no more public accountability.

So what should regulators do? The proposed Algorithmic Accountability Act of 2019 is an example of a good start, says Andrew Selbst, a law professor at the University of California, Los Angeles, who specializes in AI and the law. The bill, which seeks to regulate bias in automated decision-making systems, has two notable features that serve as a template for future legislation. First, it would require companies to audit their machine-learning systems for bias and discrimination in an “impact assessment.” Second, it doesn’t specify a definition of fairness.

“With an impact assessment, you’re being very transparent about how you as a company are approaching the fairness question,” Selbst says. That brings public accountability back into the debate. Because “fairness means different things in different contexts,” he adds, avoiding a specific definition allows for that flexibility.

But whether algorithms should be used to arbitrate fairness in the first place is a complicated question. Machine-learning algorithms are trained on “data produced through histories of exclusion and discrimination,” writes Ruha Benjamin, an associate professor at Princeton University, in her book Race After Technology. Risk assessment tools are no different. The greater question about using them—or any algorithms used to rank people—is whether they reduce existing inequities or make them worse.

Selbst recommends proceeding with caution: “Whenever you turn philosophical notions of fairness into mathematical expressions, they lose their nuance, their flexibility, their malleability,” he says. “That’s not to say that some of the efficiencies of doing so won’t eventually be worthwhile. I just have my doubts.”

Words and code by Karen Hao and Jonathan Stray. Design advising from Emily Luong and Emily Caulfield. Editing by Niall Firth and Gideon Lichfield. Special thanks to Rashida Richardson from AI Now, Mutale Nkonde from Berkman Klein Center, and William Isaac from DeepMind for their review and consultation.

Correction: A previous version of the article linked to information about a risk assessment tool different from COMPAS. It has been removed to avoid confusion.