Introduction

As data-driven algorithms have come to play an increasingly important role in the ways governments make decisions, concern over what goes into these algorithms and what comes out has become more urgent. Using data to inform government decisions promises to improve efficiency and impartiality, but many fear that in reality, these tools fail to deliver on their potential. Many advocates argue that by using data tainted by historically prejudiced practices or by reflecting the (often unconscious) biases of mostly-white, mostly-male developers, algorithms will spit out results that discriminate against people of color, religious minorities, women, and other groups. These concerns gained significant traction following a 2016 article by ProPublica that analyzed racial disparities in the predictions made by COMPAS, a tool that creates risk scores to help assign bond amounts in Broward County, FL.[1] By comparing risk scores to actual criminal activity, the analysis concluded that the software was twice as likely to falsely label black defendants as future criminals as it was white defendants, and more likely to falsely label white defendants as low risk. In other words, more supposedly high-risk black defendants did not commit crimes, while more supposedly low-risk white defendants did commit crimes. A number of other investigations and analyses have examined social media monitoring efforts that target racially and religiously-loaded language,[2] facial recognition software that has lower accuracy when evaluating black female faces,[3] and child neglect prediction software that disproportionately targets poor black families,[4] among many other examples of bias.

Overlapping with these concerns about prejudice are questions about the accuracy of algorithmic decisions. The common and intuitive narrative is that algorithm-informed decisions must be better because they are based on data rather than instinct, and that this improved accuracy also ensures less bias. However, two Dartmouth computer science researchers recently questioned these assumptions in a highly-publicized analysis of Northpointe’s COMPAS tool, the same software that ProPublica had investigated two years prior.[5] Using data on 1,000 real defendants from Broward County that ProPublica had made public, their analysis showed that COMPAS predicts recidivism no better than a random online poll of people who have no criminal justice training at all (65 percent accuracy for COMPAS and 67 percent for the random sample), and that there was the same degree of bias in both sets of predictions. The researchers also created their own algorithm and found that it only takes two data points to achieve the 65 percent accuracy achieved by COMPAS: the person's age and the number of prior convictions. While of course this is only one test of one algorithmic tool, the research undermines the common assumption that data-driven predictions inevitably improve the accuracy and impartiality of government decisions.

With this information on potential bias and the suspect accuracy of certain algorithms, what are policymakers to do? Should cities abandon data-driven tools and defer to the instincts of police officers and caseworkers, as governments have done for hundreds of years? Certainly not. It is important to remember what governments hope to replace with data-driven work, namely the practices that have led to biased datasets, that have perpetuated these biases, that have unfairly targeted members of certain racial, religious, or socio-economic groups or excluded them from certain benefits.

The answer to bias in algorithms is not to abandon data-driven decision-making, but to improve it. Cities need to put in place strategies for ensuring the automated tools they rely on to make critical decisions—like whether or not someone receives bail, whether or not a teacher gets fired, or whether to take a child away from her parents—are fair. This paper will seek to outline structural, policy, and technical strategies that governments should implement to reduce bias in algorithms. The paper will also consider a thornier question: what governments should do with an algorithm that disproportionately affects some protected class, but that does so based on “accurate” predictions that are grounded in prejudiced practices.

Structural Considerations

The foundation for ensuring fairness in the algorithms created by an organization is a set of structural conditions that promote a culture committed to reducing bias. Without this, the policy and technical strategies that follow are irrelevant, because they will never become a priority for analysts.

Create diverse teams.

Building a culture of fairness starts with hiring a diverse group of engineers. Black, Latinx, and Native American people are underrepresented in tech by 16 to 18 percentage points compared with their presence in the US labor force overall. To be sure, part of the process is improving tech education for individuals in these groups. Some have proposed integrating computer science into curricula at an earlier stage—as early as kindergarten—and actively encouraging traditionally underrepresented racial groups as well as women to pursue education in STEM fields. However, improving education is not enough. Black and Latinx people in the U.S. earn nearly 18 percent of computer science degrees, but hold barely five percent of tech jobs.[6] This means tech-driven organizations in the public, private, and non-profit sectors also need to put more emphasis on hiring women and people of color. A 2016 White House report on artificial intelligence highlighted the critical value of a diverse workforce in the tech sector. According to the report, “Research has shown that diverse groups are more effective at problem solving than homogeneous groups, and policies that promote diversity and inclusion will enhance our ability to draw from the broadest possible pool of talent [and] solve our toughest challenges.”[7]

Hiring a diverse cast of employees is particularly valuable in the effort to reduce bias in algorithms, offering an array of perspectives that can fill in the blind spots of a traditionally white and male field. A diverse group of coders will be more likely to notice disparities in training data, like the one that led a Google facial recognition algorithm to label black people as gorillas.[8] They will also be more likely to notice and account for cultural differences. A famous illustration of the often-unexpected appearance of cultural myopia is the “ketchup question.”[9] While most white Americans keep their ketchup in the fridge, black Americans in the South as well as many British people keep their ketchup in the cupboard. Now, it’s unlikely that this particular cultural difference would matter much to a team of coders, though there are certainly situations in which it might. For example, a team creating some kind of logic test might ask a question like “Ice cream is to the freezer as ketchup is to the ______”, expecting an answer like “fridge.” However, this example exhibits an important larger point: cultural diversity on a team offers perspectives that can challenge the pre-conceived notions and remedy the cultural myopia of more homogeneous groups. In the case of algorithms for policing and sentencing, these perspectives can provide critical insights that shape the ways the tools work. A black coder, for example, may be more likely to know that police are more than three times as likely to arrest black people for marijuana possession as white people, even though the two groups use drugs at roughly the same rate.[10] With this information, an agency may choose to normalize or altogether exclude drug arrests from the factors under consideration.

And it’s not only racial diversity that these organizations should strive for, but also intellectual diversity driven by differences in background. An article from Fast Company argues, “Currently AI is a rarefied field, exclusive to Ph.D. technologists and mathematicians. Teams of more diverse backgrounds will by nature raise the questions, illuminate the blind spots, check assumptions to ensure such powerful tools are built upon a spectrum of perspectives.”[11] Hiring experienced technologists is obviously essential for creating advanced analytical tools, but hiring people with backgrounds in public policy, sociology, and even philosophy can help uncover and resolve many of the ethical dilemmas at the heart of data-driven government. The Berkman Klein Center’s Mason Kortz had a similar perspective, saying of predictive criminal justice projects, “You might want someone who studied critical race theory on your team.”[12]

Teach ethical development.

It’s important that engineering teams not only have a diverse group of coders, but also that these coders are trained in creating ethical algorithms. “We need to think about changing the way the field is taught,” said Kira Hessekiel, also of the Berkman Klein Center.[13] This starts in the classroom, with integrating considerations of fairness into computer science courses. Some professors have already begun offering this type of training in class. For example, computer science professor Fei Fang of Carnegie Mellon University has designed a course titled “Artificial Intelligence for Social Good” that seeks to teach students how to use AI in order to solve problems like poverty, hunger, and healthcare. The course emphasizes the need to zoom out, analyze problems from the perspective of those who will be affected, and consider many potential solutions that leverage different tools, rather than jumping straight into technical problem solving. As a part of the class, students will choose a social problem to solve with AI and “investigate it, as a journalist, designer, or anthropological researcher would.”[14]

This emphasis on thoughtful cross-disciplinary analysis must not be reserved for the classroom, but must also seep into the culture of professional teams. DJ Patil, former Chief Data Scientist of the United States Office of Science and Technology Policy, has argued that there needs to be a code of ethics for data scientists, similar to the Hippocratic Oath that governs medical practice. Patil has called for data scientists to collaborate in order to create a set of principles that guide and hold data scientists accountable as professionals.[15] Data for Democracy (D4D)—a community of volunteers that collaborates on a variety of government data science projects—has also sought to create such a code of ethics. The organization has partnered with Bloomberg and BrightHive to develop this code that will guide professionals in becoming a “thoughtful, responsible agent of change.” D4D has used GitHub to collect ideas and suggestions from the data science community, and is now in the process of developing more targeted discussion questions that will frame the code.[16]

Policy Considerations

A representative and thoughtful team of analysts and data leaders will be more likely to put in place policies that help to reduce bias and inaccuracy in algorithmic tools. These policies prioritize openness and rigorous deliberation over algorithmic decision-making.

Encourage transparency.

New York City recently made headlines with its efforts to make the algorithms used by city agencies more transparent to the public. The premise behind these efforts is that the public not only has a right to know what goes into the algorithms that profoundly affect their lives, but that this public can also improve these tools by scrutinizing them for bias and inaccuracies. In addition to creating a task force on algorithmic transparency, New York recently—and less visibly—began publishing detailed project descriptions and source code for projects initiated by the Mayor’s Office of Data Analytics (MODA).

Amen Mashariki, urban analytics lead for Esri and former chief analytics officer for MODA, explained the value of this transparency in rooting out bias.[17] Mashariki pointed to one MODA project in particular, an effort to predict instances of tenant harassment by landlords across the city. “What if the tenant harassment analysis pointed only to landlords from one racial, religious, or other group?” Mashariki asked. If the city publishes information about its algorithmic process, residents can audit that process to make sure it aligns with the city’s public policies—like non-discrimination. “Anyone can step up and say, wait a minute, that’s unfair,” he explained. Therefore, according to Mashariki, cities need to make enough information available “such that residents can trust that public policy is being adhered to.”

One of the principal challenges is accessibility—algorithms often involve analytics techniques that are not familiar to most residents, and source code is even less comprehensible to laypeople. Vendor agreements erect another barrier to transparency, as many cities contract development to private companies, which own the tools they create and often demand secrecy for their proprietary products.

In response to these challenges, researchers, technologists, and policymakers have proposed a number of strategies for making automated tools transparent. Some have called for releasing information that will allow residents to meaningfully audit automated tools, while holding onto companies’ “secret sauce.” These strategies tend to highlight a few key elements to disclose: what data went into the algorithm and why, the analytics techniques used to analyze the data, and data on the performance of the algorithm pre- and post-implementation.[18] Other researchers have introduced innovative ways of explaining complex algorithms, either by creating surrogate models that approximate machine learning tools,[19] or by creating data visualizations that represent these tools.[20] Still others argue that cities should only employ tools that are easily comprehensible to the public. “It’s better just to build a simple model,” said Berk Ustun, a postdoctoral fellow at the Center for Computation and Society at Harvard.[21] “You can create something you can explain to a policymaker that works just as well as this other stuff.”

There’s also significant debate over the merits of publishing source code. While some vendors are hesitant to reveal this information, this type of radical transparency could in fact benefit vendors. Not only would transparency be a selling point for interested cities, but releasing source code also allows experts to analyze a vendor’s product and improve it for free. This can help avoid PR disasters like the ProPublica piece on COMPAS and create a better product for clients.

And, while many seem to think that source code wouldn’t be particularly useful to residents, some members of the computer science community disagree. Ustun told an anecdote about a different ProPublica investigation that unsealed the source code of a DNA analysis tool used by New York City’s crime lab. Within a day of the release, there was a thread on popular computer science blog Hacker News examining flaws in the model and proposing solutions. While source code might not be useful to every resident, it certainly would be useful to those with expertise.

Regardless of the method of disclosure they mandate, cities will want to ensure that disclosure requirements are not so onerous that agencies revert to unscientific, conjecture-driven approaches. As a result, any effective effort towards algorithmic transparency must engage stakeholders within city government.

Engage those affected by an algorithm.

Human-centered design—the process of engaging users in the development process—is becoming more and more prominent in the public sector. Examples include an effort to observe public works employees in Pittsburgh to understand their paperwork-laden process for filling potholes and develop a platform that fit their needs,[22] a civic engagement campaign in Chicago in which the Department of Innovation and Technology (DoIT) led demonstrations and solicited feedback on the city’s new open data platform,[23] and engagement with local business owners in Gainesville, FL to map all 13 steps of the permitting process and launch a web platform to provide guidance on the most difficult steps.[24]

Human-centered design is relevant not only to developing technological tools that residents will use, but also to creating policies and tools that will affect residents, such as algorithms. Creating a representative and thoughtful team of developers certainly helps reflect the needs and wants of citizens, but engaging those actually affected goes one step further.

Human-centered design aligns closely with many of the accessibility challenges that surround transparency. If residents are to meaningfully engage in the algorithmic design process, the automated tools at hand need to be accessible to people with varying backgrounds and levels of expertise. Holding user-centered design sessions that leverage tools like surrogate algorithms and visualizations—as well as simply using simpler algorithms—can help bring citizen voices into the fray, and allow them to ensure that automated tools work for them.

Technical Considerations

As the issue of bias in algorithms has gained prominence, a budding group of researchers at the intersection of law, ethics, and computer science has proposed technical solutions for ensuring algorithms don’t encode bias. These processes leverage insights into the analytics techniques and data used in creating automated tools in order to analyze and change algorithms to improve fairness.

Fix algorithms with algorithms.

While sometimes a source of bias, the tools of computer and data science also offer a number of potential techniques for resolving bias and inaccuracy. A 2015 paper from Feldman et al. proposed strategies for identifying and removing disparate impact in algorithms, ensuring that they do not discriminate against any particular group. In order to pinpoint algorithms with a disparate impact, the authors created a classifier algorithm that attempts to guess which of two classes an individual belongs to based on outputs. For example, this algorithm would attempt to guess the race of individuals based on the recidivism risk scores they received. When these classifier algorithms have a higher error rate (meaning they are less capable of guessing race based on outputs), the tools being audited are less biased. The paper also introduces a method for transforming the input dataset so that the protected attribute can no longer be predicted from it, while preserving much of the predictive power of the unprotected attributes. Without going into too much detail, the authors outline a number of potential repair algorithms that change the input dataset, with various degrees of effect on the predictive power of the algorithm. They also test this approach on a number of datasets, arguing that their fairness procedure retains more utility, or predictive power, than any other procedure for removing disparate impact.[25] Their work shows that while data-driven technology may have the potential to perpetuate bias, it may also contain the solutions to the problem.
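The auditing idea described above can be made concrete with a small sketch. It assumes two groups coded 0 and 1, and it swaps in the simplest possible stand-in for the classifier: a single cutoff on the risk score. The function names and the cutoff approach are illustrative assumptions, not the procedure used in the Feldman et al. paper. The error measure is the balanced error rate, the average of the two per-group error rates, so that a lopsided population does not make guessing look easier than it is.

```python
from typing import Sequence


def balanced_error_rate(groups: Sequence[int], guesses: Sequence[int]) -> float:
    """Average of the per-group misclassification rates (groups coded 0 and 1)."""
    errors = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for actual, guess in zip(groups, guesses):
        counts[actual] += 1
        if guess != actual:
            errors[actual] += 1
    return 0.5 * (errors[0] / counts[0] + errors[1] / counts[1])


def best_cutoff_error(scores: Sequence[float], groups: Sequence[int]) -> float:
    """Lowest balanced error rate achievable by guessing group from one score cutoff.

    Hypothetical stand-in for a real classifier. Values near 0.5 mean the scores
    reveal little about group membership; values near 0 mean they reveal a lot.
    """
    best = 0.5  # guessing the same group for everyone always scores 0.5
    for cutoff in sorted(set(scores)):
        guesses = [1 if s >= cutoff else 0 for s in scores]
        error = balanced_error_rate(groups, guesses)
        # Flipping every guess turns an error rate of e into 1 - e.
        best = min(best, error, 1.0 - error)
    return best
```

Under this sketch, an auditor would compute best_cutoff_error on a tool's scores and treat a result close to 0.5 as reassuring and a result close to 0 as a warning sign. A real audit would use a stronger classifier and held-out data.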

Another important contribution to the research on technical strategies for bias mitigation comes from Dwork et al. The authors focus less on group fairness (i.e., ensuring different groups have proportionate outcomes) and more on individual fairness (i.e., treating similar individuals similarly). More technically, they outline a framework for fairness such that “the distributions over outcomes observed by x and y are indistinguishable up to their distance d(x, y),” meaning that differences in outputs can be no larger than differences in relevant inputs. As in Feldman et al., the paper seeks to create fairness with minimum effect on utility. The authors approach this goal as an optimization problem, outlining an algorithm for maximizing utility subject to the fairness constraint they set out initially.[26] Calders and Verwer take a similar approach to reducing bias, proposing a method whereby developers subtract the conditional probability of a positive classification based on a “sensitive value” (such as membership in a protected class) from that of a non-sensitive value.[27] In other words, the measure asks: holding all other variables constant, what is the difference in the probability of a positive classification (e.g., a high risk score) for a member of a protected class versus another individual? The closer this difference gets to zero, the less biased the algorithm.
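Both measures in this paragraph reduce to short calculations, sketched below under simplifying assumptions. The first function computes the raw difference in positive-classification rates between the non-sensitive and sensitive groups; it does not hold other variables constant, which a fuller analysis would need to do. The second checks the individual fairness condition for a yes/no outcome, assuming that the distance between two people's outcome distributions is measured as the absolute difference in their probabilities of a positive outcome, and that a task-specific distance d(x, y) has already been supplied. The function names are hypothetical.

```python
from typing import Sequence


def discrimination_score(predictions: Sequence[int], sensitive: Sequence[bool]) -> float:
    """P(positive | non-sensitive group) minus P(positive | sensitive group).

    predictions holds 1 for a positive classification and 0 otherwise.
    A score near zero means the two groups are classified positive at similar rates.
    """
    in_group = [p for p, s in zip(predictions, sensitive) if s]
    out_group = [p for p, s in zip(predictions, sensitive) if not s]
    return sum(out_group) / len(out_group) - sum(in_group) / len(in_group)


def violates_individual_fairness(p_x: float, p_y: float, distance: float) -> bool:
    """True if two individuals' outcome probabilities differ by more than d(x, y).

    Assumes a binary outcome, so each person's outcome distribution is summarized
    by one probability and the gap between distributions is |p_x - p_y|.
    """
    return abs(p_x - p_y) > distance
```

The hard part in practice is not the arithmetic but choosing the distance d(x, y), since that choice encodes which differences between people are considered relevant to the decision.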

Deemphasize data.

On a less academic note, as a technique to reduce bias, some private companies have begun thinking carefully about the data their algorithms digest. Algorithmic policing company Azavea has considered potential sources of bias in the data on which it trains its algorithm, and on which the algorithm makes predictions. Azavea’s HunchLab platform predicts crime risk in neighborhoods across cities in order to determine police deployments, and uses historical crime as a key indicator. However, Azavea has acknowledged the potential for enforcement bias—police tendencies to arrest more people of color based on prejudice—to slip into this crime data. There is little doubt that historical crime data will reflect past police bias, and that an algorithm trained on this data will disproportionately affect communities of color. However, this type of bias is much less present in major crimes such as homicides, robberies, assaults, or burglaries than in drug-related or nuisance crimes. As a result, HunchLab has de-emphasized arrest data for the latter, minor types of crime.[28]

Navigating Disparate Impact

Even after technologists and policymakers have pursued all the structural, policy, and technical strategies to root out bias from bad data or an unrepresentative group of analysts, they may still face a trickier question. What do you do if your model—without explicitly considering membership in a protected class in its inputs—still has a disparate impact on a certain group, but one that accurately reflects current conditions? The most controversial example of this pertains to race in predictive policing or recidivism risk analysis: what if it’s actually true that people of color in a city commit violent crimes at a higher rate, and your algorithm reflects this?

This is not an unrealistic situation. According to U.S. Department of Justice data from 1980 to 2008, black residents committed homicides at a higher rate than white residents.[29] Based on this data, is disparate impact justified? Is it acceptable for black residents to receive longer sentences for the same crimes because people of their race have historically committed more crimes?

This perspective overlooks the systemic structures that have created disproportionate crime rates among black and white residents. A history of redlining, mass incarceration, employment and educational discrimination, hate crimes, and systematic robbery of black property, power, and opportunity that dates back to slavery has ensured that the statistics are as they are. Historic practices ensuring black residents cannot get a good education, job, or house have all but guaranteed people of color will commit more violent crimes. “It is patently true that black communities, home to a class of people regularly discriminated against and impoverished, have long suffered higher crime rates,” explains Ta-Nehisi Coates.[30] And, relying on the suggestions of algorithms that encourage incarcerating more black people for longer sentences will only perpetuate these structures and enhance inequity.

However, the alternative may also be unsatisfying. In some cases, radical equity would require ensuring your algorithm does not have a disparate impact, even if doing so hampers its accuracy. In adjusting your algorithm, you would fail to identify people at high risk of committing a violent crime, leaving them out on the street and endangering other residents. As Coates also acknowledges, “The argument that high crime is the predictable result of a series of oppressive racist policies does not render the victims of those policies bulletproof.”[31] Acknowledging the systemic factors that have led to higher crime rates in underrepresented groups does not make those crimes any less real or harmful to their victims.

And, even if you are willing to sacrifice some accuracy to avoid disparate impact, it is unclear how far these tradeoffs should go. What qualifies as disparate impact and to which groups does it apply? For example, is an algorithm that predicts more men will commit spousal abuse the same as one predicting more black people will commit crimes? Are there some situations in which disparate impact might be justified? Are there other ways of addressing structural bias besides eliminating inequalities in algorithmic outcomes or ignoring these algorithms altogether? I’d like to offer a couple ways of navigating these and other difficult questions.

Legal basis for disparate impact.

In order to understand the implications of disparate impact on algorithmic decision-making, it’s first important to understand the origins of the concept. Disparate impact comes originally from the Supreme Court’s 1971 decision in Griggs v. Duke Power Company, which addressed whether Duke’s employment requirements were lawful under Title VII of the Civil Rights Act of 1964. In its hiring process, the company required that employees possess a high-school diploma and take an IQ test, evaluations that disqualified many more black applicants than white. The Court ruled that criteria with a disproportionate effect on protected classes, even if “neutral on their face, and even neutral in terms of intent,” could be unlawful. However, if these criteria have “a manifest relationship to the employment in question” or support some “business necessity,” they may still be allowable.[32]

Ensuing cases have addressed what degree of disproportionality qualifies as disparate impact. It would of course be silly to require that each group have the exact same number of people in each outcome pool; if one more black person than white is hired out of a pool of 1,000, this hardly seems like disparate impact. While the Court has not adopted any “rigid mathematical formula” for disparate impact, the US Equal Employment Opportunity Commission (EEOC) has adopted an 80 percent rule, and many researchers have followed suit.[33] This rule prescribes that if the probability of a positive outcome for members of a protected class is less than 80 percent of that for a non-protected class, there is a disparate impact. In the criminal justice context, for example, if the probability of a black resident receiving a low risk score is less than 80 percent of that of a white resident, there is good reason to argue for disparate impact.
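As a rough illustration, the 80 percent rule amounts to a single ratio. The sketch below assumes outcomes are recorded as 1 for a positive result (such as a low risk score) and 0 otherwise, and it treats a ratio strictly below 0.8 as the cutoff; the function name and the default threshold parameter are assumptions made for illustration.

```python
from typing import Sequence


def has_disparate_impact(
    protected_outcomes: Sequence[int],
    reference_outcomes: Sequence[int],
    threshold: float = 0.8,
) -> bool:
    """Apply an 80-percent-style test to two groups' outcomes (1 = positive, 0 = not).

    Returns True when the protected group's rate of positive outcomes is less
    than `threshold` times the reference group's rate.
    """
    protected_rate = sum(protected_outcomes) / len(protected_outcomes)
    reference_rate = sum(reference_outcomes) / len(reference_outcomes)
    return protected_rate / reference_rate < threshold
```

For instance, if 30 percent of one group and 50 percent of another receive low risk scores, the ratio is 0.6 and the test flags a disparate impact; at 45 percent versus 50 percent the ratio is 0.9 and it does not.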

These legal structures set out a few important rules. For one, disparate impact commonly applies only to protected classes, namely sex, race, age, disability, color, creed, national origin, religion, or genetic information. It would be illogical to disallow disparate impact on every possible class of people, as there are an infinite number of potential classifications, many of which a city would want to single out with an algorithm. For example, a predictive policing algorithm that disproportionately identifies murderers as opposed to non-murderers is desirable, while one that disproportionately affects black residents is more fraught. Another important principle is that only a certain degree of inequity qualifies as disparate impact, as minute differences among outcomes are inevitable. Finally, the legal basis for disparate impact allows for exceptions in cases where the process behind the disparity supports a “business necessity,” which may be applicable to government cases that serve particularly important purposes.

Biased compared to what?

With these legal principles in mind, I’d like to develop a few recommendations for cities that come upon algorithms with disparate impact. The first is the need to compare the algorithmic process with current practices. According to Mason Kortz of Harvard Law School’s Berkman Klein Center, cities need to ask “How good are we at this now? Is the algorithm less biased than the baseline?” With respect to algorithmic sentencing, for example, cities can analyze existing disparities in sentence lengths for white and black residents who have committed the same crimes. Cities can compare these measures to the performance of an algorithm on a set of test data, and see which produces less biased results. If an algorithm is less biased—even if not totally free of bias—a city has good reason to deploy it. Amen Mashariki explained this philosophy: “With MODA…if we were able to create a one percent improvement in an agency, that was a win.” Likewise, if an algorithm can reduce bias in some practice, it should be used.
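One way such a comparison might be set up is sketched below. It assumes a city has, for the same set of past cases, the existing human decision, the algorithm's decision, the eventual outcome, and a group label for each person. It uses the gap in false positive rates between groups as the bias measure, which is only one of several reasonable choices; the function names and data layout are hypothetical.

```python
from typing import Hashable, Sequence


def false_positive_rate(decisions: Sequence[int], outcomes: Sequence[int]) -> float:
    """Share of people who did not reoffend (outcome 0) yet were flagged high risk (decision 1)."""
    flags_for_non_reoffenders = [d for d, o in zip(decisions, outcomes) if o == 0]
    return sum(flags_for_non_reoffenders) / len(flags_for_non_reoffenders)


def false_positive_gap(
    decisions: Sequence[int],
    outcomes: Sequence[int],
    groups: Sequence[Hashable],
) -> float:
    """Difference between the highest and lowest per-group false positive rates."""
    rates = []
    for group in set(groups):
        group_decisions = [d for d, g in zip(decisions, groups) if g == group]
        group_outcomes = [o for o, g in zip(outcomes, groups) if g == group]
        rates.append(false_positive_rate(group_decisions, group_outcomes))
    return max(rates) - min(rates)


def algorithm_is_less_biased(human, algorithm, outcomes, groups) -> bool:
    """True if the algorithm's false positive gap is smaller than the human baseline's."""
    return false_positive_gap(algorithm, outcomes, groups) < false_positive_gap(
        human, outcomes, groups
    )
```

The comparison is only as good as the baseline data behind it, and a smaller gap on one measure does not rule out a larger gap on another, so a city would likely want to check several measures side by side.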

The problem, according to Kortz and Hessekiel, is that often cities don’t have baselines by which to compare algorithms, nor good evidence on the performance of algorithms themselves. We often don’t know, for example, how well judges predict recidivism, nor how well the algorithms that are intended to improve judges’ decisions perform. This means that cities need to invest more resources into testing automated tools against existing systems. The aforementioned research in the Dartmouth computer science department was one of the first rigorous analyses of this kind, comparing the recidivism predictions of Northpointe’s COMPAS tool in Broward County, FL against predictions by a random sample of online users. That their analysis showed COMPAS was no more accurate and no less biased than this random sample is a sign that more research is necessary.

Think about optimization goals and data use.

Regardless of whether or not an automated tool outperforms existing systems, cities should think carefully about the repercussions of disparate impact. It’s not enough that an algorithm improves on existing processes if those processes were never particularly good in the first place.

If an automated tool produces disparate impact, cities should think about whether or not they were optimizing for the right thing in the first place. For example, it may be more likely that residents of color than white residents will commit a misdemeanor crime within two years of committing an initial crime. In this case, a sentencing algorithm may recommend on average longer sentences for black residents. Yet why should whether or not a resident commits another minor crime within two years be the relevant outcome? Kortz and Hessekiel suggested that another outcome—like whether or not a resident will maintain a steady job seven to ten years after the initial crime—might be a more relevant metric, and it’s plausible that a longer sentence would in fact decrease the chances of this happening.

Of course, there are certain outcomes on which governments would not be willing to compromise. For example, few would argue that governments should adjust the goals of an algorithm that predicts the probability of a resident committing a violent crime within two years. This interest seems to fall in a similar vein as the “business necessity” identified in Griggs, which allows for disparate impact in cases that serve a compelling goal. Whether or not this person retains employment ten years down the line, cities will want to ensure he or she doesn’t commit the crime.

In this case, it may be less about whether an algorithm makes disparate predictions about different groups, and more about what cities do with this information. If an algorithm predicts, for example, that someone is likely to commit a crime, there’s a menu of options available other than surveilling that person and waiting for him or her to engage in criminal behavior. As Kortz explained, “If a predictive policing algorithm over-identifies black people who are at risk of committing a crime, and you use the output to direct police on who to surveil and arrest, there are huge problems. If you take the same algorithm, but use it to send people mailings about social services, it’s still biased, but has a way different effect.”

A crime prediction project in Johnson County, KS illustrates why it matters how predictions are used. The county partnered with researchers from the University of Chicago to develop an early intervention system for individuals who cycle through the criminal justice, mental health, social services, and emergency services systems. The researchers generated approximately 250 features and developed a machine-learning model that outputs risk scores for people likely to re-enter jail in the near future. However, instead of sending this data to the police so they can keep their eyes on certain residents, the county will send the list of people with high risk scores to the mental health center’s emergency services, so that these individuals can be connected to care, which in turn may decrease the likelihood of future police interactions.[34]

Cities can pursue similar interventions to act ethically on other controversial algorithms. Instead of pinning an interminable sentence on those with a high recidivism risk, criminal justice agencies may wish to experiment with programs that have a proven track record in reducing recidivism. A program in Los Angeles has sought to reduce recidivism for juveniles in particular, creating a probation camp that offers individual, family, and group counseling services to kids who have committed a crime. While the typical one-year rearrest rate for juveniles is almost 75 percent, only one in three kids who attended a probation camp was arrested within a year, and only 20 percent were convicted of a crime.[35]

More broadly, these examples underscore the need for cities to be much more thoughtful about whether their policy goals match the optimizations their algorithms work towards. If people have a higher recidivism risk, why should they be detained for longer? Does a longer sentence reduce their chance of recidivating? The reality is that the outcomes cities try to predict often have nothing to do with the policy actions they take in response to those outcomes. In the case of recidivism prediction, cities should really attempt to predict which people, if given a longer sentence, will be less likely to recidivate. It may also be that a longer sentence is not the most effective intervention, and that cities should instead try to understand which residents, if given social services like nutrition assistance and job training, will be less likely to recidivate. We need more research from governments, non-profits, and academic institutions into what really works in these situations, and by extension what outcomes algorithms should attempt to predict.
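The difference between predicting an outcome and predicting the effect of an intervention can be made concrete with a small sketch. Everything here is hypothetical: the records are made up for illustration, and the group labels and the function name `estimate_uplift` are assumptions rather than part of any real tool. The sketch also assumes the intervention was assigned at random within each group. Without that assumption, a simple difference in rates says nothing reliable about cause and effect.

```python
from collections import defaultdict


def estimate_uplift(records):
    """Per group: recidivism rate without the intervention minus the rate with it.

    Each record is a tuple (group, received_intervention, recidivated).
    A positive value means the intervention is associated with less recidivism.
    """
    # counts[group][treated] = [number of people, number who recidivated]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for group, treated, recidivated in records:
        cell = counts[group][treated]
        cell[0] += 1
        cell[1] += int(recidivated)

    uplift = {}
    for group, cells in counts.items():
        n_treated, r_treated = cells[True]
        n_control, r_control = cells[False]
        if n_treated == 0 or n_control == 0:
            continue  # no comparison possible without both kinds of records
        uplift[group] = r_control / n_control - r_treated / n_treated
    return uplift


# Made-up records: (group, received job training, recidivated within two years)
records = (
    [("group_a", True, True)] * 2 + [("group_a", True, False)] * 8
    + [("group_a", False, True)] * 5 + [("group_a", False, False)] * 5
    + [("group_b", True, True)] * 4 + [("group_b", True, False)] * 6
    + [("group_b", False, True)] * 4 + [("group_b", False, False)] * 6
)

uplift = estimate_uplift(records)
for group, value in sorted(uplift.items()):
    print(f"{group}: estimated reduction in recidivism rate = {value:.2f}")
```

In this toy data, both groups contain people who recidivate, but only one of them shows a lower rate among those who received the intervention. A tool that ranks people by risk alone would not surface that difference, which is the gap between outcome prediction and policy action described above.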

By pursuing these kinds of people-centric and proven policies, cities can act on predictive insights without perpetuating bias and inequality. In fact, this rehabilitative focus can help reverse some of the damage done in underserved communities.

Examine causality.

Of course, not all communities that rely on criminal justice algorithms have these types of programs in place, or the resources or political capital to implement them. “The criminal justice examples are thorny, because we know there are all these problems, but we don’t have a good way of fixing them,” Hessekiel explained. It’s easy to say “Just fix the criminal justice system,” but accomplishing this requires a radical shift that goes well beyond algorithms, and this advice doesn’t provide much insight as to what to do in the meantime. What are jurisdictions to do with predictive algorithms in the absence of a system that prioritizes rehabilitation over retribution?

One potential course is analyzing causality in the disparate impact of algorithms. Cities should accept disparate impact only if there’s a causal relationship between the group affected and the outcome predicted, a strategy that can help distinguish between a true “business necessity” and prejudiced conjecture. Prioritizing causality helps ensure that disparate impact reflects real predictive differences between individuals, rather than sociological structures. As a study by Osonde Osoba and William Welser IV of the RAND Corporation argues, “Accurate causal justifications for algorithmic decisions are the most reliable audit trails for algorithms.”[36] If an algorithm that predicts child abuse has a disparate impact on people who’ve committed assault, this seems perfectly reasonable. Prior convictions of assault are indicative of violent tendencies, which certainly have a causal relationship with likelihood of abuse.

Admittedly, cases that involve protected or quasi-protected classes like gender, race, or national origin are more difficult. For example, is there a causal relationship between gender and a crime like spousal abuse, which men are disproportionately likely to commit? One could argue that biological differences have a causal relationship to the outcome, but these arguments are somewhat fraught, and the line between inherent and socialized differences is difficult to identify. There do exist some technical strategies for identifying causal relationships, as researchers have begun equipping machine learning algorithms with causal and counterfactual reasoning.[37] While it doesn’t always offer a clear answer, considering and attempting to isolate causality as a requirement for disparate impact can help root out practices that will perpetuate bias, while still allowing the use of useful predictors. Having these conversations is an important step towards preventing the perpetuation of oppressive structures.

“Affirmative action for algorithms.”

Another proposed solution to disparate impact involves creating different algorithms for different groups. Kortz and Hessekiel referred to this strategy as “affirmative action for algorithms”; it calls for analyzing data on separate classes separately. So, for example, instead of predicting recidivism risk for all people at once, analysts might train separate algorithms for men and women, or for black and white residents. These algorithms would then produce risk scores for people relative to their class (for example, black women, or white men). This practice has a strategic advantage over class-neutral analytics techniques because predictors for one class may not work for another. Harvard’s Berk Ustun explained this phenomenon in an analysis of Census data he had done to predict whether people will earn $50,000 or more.[38] When he analyzed data for everyone regardless of gender or race, the error rate was around 25 percent. He then “decoupled” the model, analyzing data for different groups separately. While the overall accuracy didn’t improve much because the majority group was so large, the accuracy for smaller groups (for example, black female immigrants) skyrocketed.
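The mechanics of decoupling can be illustrated with a toy sketch. This is not Ustun’s method or his Census data: the records are synthetic, the one-feature cutoff rule is a deliberately simple stand-in for a real model, and the names `fit_threshold` and `error_rate` are hypothetical. Errors are measured on the same data used for fitting to keep the sketch short, whereas a real analysis would evaluate on held-out data.

```python
def fit_threshold(rows):
    """Choose the cutoff t with the fewest training errors for the rule
    'predict True when x >= t'. Each row is (x, label)."""
    values = sorted({x for x, _ in rows})
    candidates = values + [values[-1] + 1]  # last candidate predicts all False
    return min(candidates, key=lambda t: sum((x >= t) != y for x, y in rows))


def error_rate(rows, t):
    """Fraction of rows misclassified by the rule 'predict True when x >= t'."""
    return sum((x >= t) != y for x, y in rows) / len(rows)


# Synthetic data: (group, feature value, true label).
# The feature relates to the label differently in the two groups.
data = (
    [("majority", x, x >= 5) for x in range(10)] * 10  # 100 records
    + [("minority", x, x >= 2) for x in range(10)]     # 10 records
)
groups = sorted({g for g, _, _ in data})
by_group = {g: [(x, y) for grp, x, y in data if grp == g] for g in groups}

# Pooled: one cutoff fitted on everyone, dominated by the larger group.
pooled_t = fit_threshold([(x, y) for _, x, y in data])
pooled_error = {g: error_rate(rows, pooled_t) for g, rows in by_group.items()}

# Decoupled: a separate cutoff fitted for each group.
decoupled_t = {g: fit_threshold(rows) for g, rows in by_group.items()}
decoupled_error = {
    g: error_rate(rows, decoupled_t[g]) for g, rows in by_group.items()
}

for g in groups:
    print(f"{g}: pooled error {pooled_error[g]:.2f}, "
          f"decoupled error {decoupled_error[g]:.2f}")
```

Because the larger group dominates the pooled fit, the single cutoff suits that group and misclassifies part of the smaller one. Fitting each group separately removes those errors for the smaller group while barely changing the overall error rate, which mirrors the pattern described above.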

The issue with this strategy of decoupling is the same one that has plagued other examples of affirmative action: is it fair? In a sentencing context, it’s likely that a white woman and a black man with similar criminal histories and ages would receive very different risk scores under decoupling. This might solve the issue of disparate impact, but would raise the issue of disparate treatment, whereby two people are treated differently because they belong to different classes. Historically, the courts have not looked fondly upon examples of disparate treatment, and so implementing this kind of technique would likely require a strong legal justification, especially in the context of criminal justice. However, the increases in accuracy that Ustun found in his analysis may provide enough evidence to justify decoupling in many areas.

Conclusion

Before deploying algorithmic tools, especially those that can have a profound effect on citizens’ lives, cities should do everything they can to mitigate the risk of bias, and even then take a cautious approach to data use. At the same time, there is a serious need for more research on the part of cities and academic institutions alike to understand how to involve citizens in development, to determine whether algorithms actually predict what they claim to predict, and to establish whether these optimizations align with policy goals. It is tempting for policymakers to view automated tools as a finished product, fine-tuned by the vendors that peddle them. Yet in reality, many algorithms are more like first drafts that introduce a number of unanswered questions and require more rigorous examination.

As many tech evangelists have argued, the answer to the problem of bias is not to abandon technology, but to improve it. The motivation for moving away from old systems of criminal sentencing or teacher evaluation was that these processes were flawed, and that data-driven practices offered a solution. If tech and data leaders approach algorithmic decision-making with consideration of the potential for bias, this is already a step in the right direction. The result may still be imperfect, but recognition of those imperfections will allow for consistent momentum towards algorithmic fairness.