

Abstract:

As the number of government algorithms grows, so does the need to evaluate algorithmic fairness. This paper has three goals. First, we ground the notion of algorithmic fairness in the context of disparate impact, arguing that for an algorithm to be fair, its predictions must generalize across different protected groups. Next, two algorithmic use cases are presented, with code examples showing how to evaluate fairness. Finally, we promote the concept of an open source repository of government algorithmic “scorecards,” allowing stakeholders to compare across algorithms and use cases.



Our sincerest appreciation to Dr. Dyann Daley and Predict Align Prevent for their generous support of this work. We are also extremely grateful to Matt Harris for ensuring our code is, in fact, replicable. Finally, we are immensely appreciative of the time and expertise of several reviewers, including Drs. Dennis Culhane, Dyann Daley, John Landis, and Tony Smith.

Section 2 provides some background on the accuracy/generalizability metrics used to assess fairness in this report. Section 3 and Section 4 present the real estate tax assessment and recidivism models, respectively, as well as the scorecards for each. Section 5 concludes.

The term “algorithm” can pertain to a wide array of decision-making tools. This report focuses exclusively on supervised machine learning algorithms, which we define as a class of machine learning models that learn from a set of observed experiences to predict an experience that has yet to be observed.

This report is a proof-of-concept, demonstrating two use cases. The first use case is a place-based machine learning algorithm to predict home prices, which many jurisdictions now use for tax assessment purposes. The second use case is a person-level machine learning model for predicting prisoner recidivism. Along the way, two sets of analytics are presented (one for place, one for people) that describe model accuracy and model “generalizability.” English narrative, replicable code, and data visualizations are provided for each. We hope novice and aspiring government data scientists will sharpen their skills by replicating the code found in the appendix below.

In response to public pressure, each jurisdiction will need to evaluate the fairness of their algorithms. One solution is a standard, open source “scorecard,” which we call the Open Algorithmic Scorecard (OAS). Each jurisdiction would have a separate scorecard for every algorithm it deploys, featuring a set of simple metrics describing accuracy and bias. These scorecards would live in an open repository, allowing stakeholders to compare prototype model results to models created elsewhere; promoting transparency by filing finished scorecards; and providing an arena where policy-makers, data scientists, academics, civic technologists, and other stakeholders could observe best practices.

Thus, in the future, jurisdictions may develop comparable algorithms that predict comparable outcomes of interest in the fulfillment of these shared use cases.

The main goal of this paper is to bridge the gap between these two stakeholder groups - policy-makers and data scientists - by providing code examples that introduce the novice public-sector data scientist to algorithmic fairness. Our second goal is to present an open source standard by which governments can compare their algorithms to those of their peers. Our motivation is informed by the following observations:

Recently, new open source tools have emerged to help governments evaluate algorithms for fairness. Examples include the Ethics & Algorithms Toolkit, created by a consortium of authors in government and academia, as well as the University of Chicago’s Aequitas “open source bias toolkit.” The Ethics & Algorithms Toolkit is excellent for developing governance and policy around algorithms, while Aequitas is aimed at more experienced data scientists. Both are invaluable tools, and stakeholders would be well advised to integrate them into their technology workflows.

Government is still wrestling with how to regulate private sector algorithms, which are closed, their inner workings cast as intellectual property. In the public sector, where transparency is expected, the calculus is different. Governments today use algorithms to dole out subsidies, determine program eligibility, and prioritize the allocation of limited taxpayer-funded resources. While fairness should be at the heart of government algorithms, it is still unclear how best to referee public algorithms. The fact is that if human decisions are biased, implicitly or otherwise, then the algorithms we train on those decisions will also be biased.

Algorithms are increasingly making decisions in place of humans. Algorithms affect the products we buy, like insurance, credit cards, and bank loans, and the information we are exposed to, like shopping recommendations and news articles. Between bestsellers like Cathy O’Neil’s “Weapons of Math Destruction” and relentless news coverage of tech company data mining, society is beginning to understand that algorithms can bring as much peril as they do promise.

It is impossible to identify the effect of unobserved variables. As an alternative, researchers are actively developing a series of fairness metrics. If bias cannot be judged by the input features, perhaps it can be judged by opening the black box and looking for bias in the predictions. We find this review of fairness metrics to be particularly relevant for policy-makers. In the case studies below, the fairness criteria we present hinge on an algorithm’s ability to generalize across different group typologies - like rich and poor neighborhoods, or Caucasian and African American ex-offenders.

If an algorithm does not generalize to one protected class, its use for resource allocation may have a disparate impact on that group. For example, the recidivism algorithm created below predicts a higher rate of false positives for African Americans relative to Caucasians. This may occur because the algorithm is underfit to the African American “experience.” It may also be that the training data itself is biased, a common critique of prediction in the criminal justice domain. Critics have argued that systematic over-policing of historically disenfranchised communities creates a feedback loop where more reported crime leads to more predicted crime, which leads to more cops on patrol and thus, more reported crimes. It could be that police bias leads to the over-policing of certain communities. It could also be that people with higher propensity to commit crimes sort into these communities. In reality, both likely play a role, but if the relationship goes unobserved, then like any statistical model, systematic error will lead to bias.

Disparate impact may play a role when machine learning algorithms are poorly fit. There are two general conditions that lead to poor predictions. An “underfit” model does not exhibit a high degree of predictive accuracy, likely because not enough effective predictive variables or “features” are included. An “overfit” model, traditionally, is one that may predict well on training data but fails when used to predict for new observations. Models may also be overfit to one type of experience, predicting differently from one group to the next.
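To make this distinction concrete, below is a minimal Python sketch of a common diagnostic: comparing training and test error. The data objects (`X`, `y`) and the model choice are hypothetical stand-ins for illustration, not the report’s actual pipeline.

```python
# Diagnose over/underfitting by comparing training and test error.
# `X` (features) and `y` (outcomes) are hypothetical stand-ins.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

# High error on both sets suggests underfitting; low training error paired
# with much higher test error suggests overfitting to the training data.
print(f"Train MAE: {train_mae:,.0f} | Test MAE: {test_mae:,.0f}")
```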

Social scientists are well-versed in the issues of fairness and discrimination. Identifying discrimination in an algorithm, however, is just as nuanced and complicated as it is in housing and labor markets. It is unlikely that a jurisdiction would create an algorithm with the express intent of discriminating against a protected class. Given the black box nature of these models, it is far more likely that a jurisdiction would create an algorithm whose decisions have a “disparate impact” on members of a protected class. Disparate impact is a legal theory positing that although a policy or program may not be discriminatory prima facie, it may still have an adverse discriminatory effect, even if unintended.

Biased algorithms may have more dire consequences for people-oriented use cases, like recidivism. As we discuss below, one example of bias is higher false positive rates for African American ex-offenders compared to Whites. A false positive in this context means the model predicted an ex-offender would recidivate when, in fact, they did not. When false positives are disproportionately predicted for a protected class, decisions made from that algorithm may come with significant social costs.

For some use cases, bias may have real social costs. A biased tax assessment model may systematically under- or over-assess the value of certain homes. In the former case, city tax coffers lose out on revenue while gentrifiers free-ride on new amenities and services. In the latter case, an excessive tax burden in poor communities may lead to greater housing instability and inequality.

Governments use machine learning algorithms to allocate limited taxpayer-funded resources, and a biased model may mean that these resources are misallocated. Resources may be wasted on a population that does not need them or allocated in a way that ultimately proves harmful. A biased algorithm may leave policy makers wondering whether a data-driven approach is any more useful than existing institutional knowledge.

Philadelphia and cities like it are using property tax freezes as a way to offset gentrification-induced displacement. While it is likely that gentrification causes increased property taxes, it may also be true that poorly calibrated tax assessment models are partially to blame. Figure 3.10 displays a mock scorecard for the Philadelphia tax assessment algorithm. Note the fairness score, which is entirely fabricated here, included simply to show that once many algorithms can be compared in a repository setting, it is possible to rank them accordingly.

The other approach is to enact policies that mitigate the negative effect of property tax reassessments after they occur. In 2014, the City of Philadelphia enacted the Longtime Owner Occupants Program (LOOP), which freezes assessments that increase 200% or more (triple the base amount) from year to year for homeowners living in the residence for ten years or more. Recently, Philadelphia City Council President Darrell Clarke, who has been highly critical of the tax assessment system, proposed new legislation that would lower the LOOP threshold from 200% to 50%. From the Philadelphia Inquirer (emphasis added):

Clarke’s office analyzed the most recent assessments and found that about 75 percent of households that had assessment increases (from 2017 & 2018) between 50 percent and 200 percent are in census tracts with low to moderate income, meaning their income levels would likely qualify.

Improvements to the model could be made by adding new features to equitably reduce error across space. If new features do not help, there are two potential remedies. The first is to devise a set of rules that make general corrections in instances where, for example, the predicted price is more than 20% above the neighborhood mean. Such corrections are common in predictive algorithms, but as the number of rules increases, the need for a supervised algorithm decreases.
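As a hedged illustration, one such correction rule might look like the following sketch, assuming a pandas DataFrame `sales` with hypothetical `neighborhood`, observed `price`, and `predicted` columns:

```python
# Flag predictions deviating more than 20% from the neighborhood mean
# price and shrink them back toward it. One of many possible rules.
import numpy as np

nhood_mean = sales.groupby("neighborhood")["price"].transform("mean")
deviation = (sales["predicted"] - nhood_mean) / nhood_mean
flagged = deviation.abs() > 0.20

# Cap flagged predictions at +/- 20% of the neighborhood mean
sales.loc[flagged, "predicted"] = nhood_mean[flagged] * (1 + 0.20 * np.sign(deviation[flagged]))
```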

The model exhibits significant differences in error rates across neighborhood contexts. Deployment of an algorithm like this would be problematic. Sales with the lowest (1st quintile) error rates have an average observed price and error rate of $230,696 and 3.6%, respectively. Sales with the highest (5th quintile) error rates have an average observed price and error rate of $87,644 and 174%, respectively. A model biased this way places a disproportionately higher property tax burden on lower-valued homes. In such an instance, an argument could be made that the algorithm has a disparate impact on the low-income families who likely live in these communities. For them, the algorithm may lead to more economic hardship and exacerbate housing instability.

Finally, Figure 3.9 tests how well the algorithm generalizes to the various neighborhood typologies. Interestingly, error rates appear relatively comparable across gentrifying (42.8%) and non-gentrifying tracts (51.3%). However, the gaps are much larger between low and high poverty neighborhoods (25.6% versus 70.6%) and between White and non-White neighborhoods (26% versus 69.9%).

Correlated errors suggest that spatial bias exists in the model. This is further explored in Figure 3.8, which is generated from “Leave One Group Out Cross-Validation” (LOGOCV). Instead of holding out and testing on a random subset, LOGOCV reveals how well the model generalizes to a given neighborhood by training on all but one neighborhood and validating on the holdout. Each neighborhood takes a turn acting as the holdout. The map visualizes average error rates by holdout neighborhood and reaffirms that the model works better in some parts of Philadelphia than others.
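A minimal Python sketch of LOGOCV using scikit-learn’s LeaveOneGroupOut follows; the DataFrame and column names (`sales`, `X_cols`, `price`, `neighborhood`) are assumptions for illustration.

```python
# Leave One Group Out Cross-Validation: each neighborhood takes a turn
# as the hold-out set, revealing where the model fails to generalize.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

X, y = sales[X_cols].values, sales["price"].values
groups = sales["neighborhood"].values

errors = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    # Mean absolute percent error for the held-out neighborhood
    errors[groups[test_idx][0]] = np.mean(np.abs(preds - y[test_idx]) / y[test_idx])

# Neighborhoods with the highest hold-out error rates
print(pd.Series(errors).sort_values(ascending=False).head())
```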

Figure 3.7 maps the absolute error for test set sales on both a dollar and percentage basis. Clear differences can be observed, and the arrangement of errors provides additional intuition. Ideally, the algorithm would account for enough variation in price that the remaining variation (the error) is randomly distributed across the city. Figure 3.7 clearly illustrates that this is not the case. Different communities exhibit different levels of error. We can use the results of a Global Moran’s I test to find that the spatial configuration of errors exhibits statistically significant clustering (p-value < 0.001).
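A hedged sketch of such a test using the PySAL stack is below; it assumes a GeoDataFrame `test_sales` of sale points with an `error` column (names are ours, not the report’s).

```python
# Global Moran's I test for spatial autocorrelation in prediction errors.
import libpysal
from esda.moran import Moran

# Spatial weights from each sale's 5 nearest neighboring sales
w = libpysal.weights.KNN.from_dataframe(test_sales, k=5)
w.transform = "r"  # row-standardize the weights

mi = Moran(test_sales["error"].values, w)
# A significant, positive I indicates errors cluster in space
print(f"Moran's I: {mi.I:.3f}, pseudo p-value: {mi.p_sim:.4f}")
```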

Next, the model is used to predict for the withheld 40% test set. Mean Absolute Error (MAE) is the mean absolute difference between observed and predicted sale prices for the test set. The MAE is $44,058. For context, the average single-family home price in our sample is $186,961. The Root Mean Square Error (RMSE) is similar to the MAE, but errors are squared, averaged, and then square-rooted; because it penalizes larger errors more heavily, the RMSE is higher, at $72,582. The Mean Absolute Percent Error (MAPE) is the mean absolute difference between observed and predicted sale prices on a percentage basis. The MAPE is 49.7%. Finally, 100-fold cross-validation without hyperparameter tuning is performed. This test provides some intuition about how the model would predict for data it has yet to see. The mean MAE across all holdouts is $43,587 and the standard deviation is $1,232, suggesting a model that would generalize well to new data.
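For readers following along, these three metrics reduce to a few lines of Python; `observed` and `predicted` are assumed numpy arrays of test set sale prices.

```python
# Mean Absolute Error, Root Mean Square Error, and Mean Absolute
# Percent Error for observed vs. predicted sale prices.
import numpy as np

mae = np.mean(np.abs(observed - predicted))
rmse = np.sqrt(np.mean((observed - predicted) ** 2))
mape = np.mean(np.abs(observed - predicted) / observed)

print(f"MAE: ${mae:,.0f} | RMSE: ${rmse:,.0f} | MAPE: {mape:.1%}")
```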

Accuracy and generalizability are assessed in a variety of ways. First, Figure 3.5 visualizes a scatterplot of observed prices as a function of predicted prices. The pink line represents a hypothetical perfect prediction. The plot suggests that the model fits reasonably well, with reduced accuracy at higher prices. Figure 3.6 echoes this finding by plotting Mean Absolute Percent Error by decile.

For demonstration purposes, the model is simpler than it would be in practice. 10-fold cross-validation is performed on a 60% training set to tune the hyperparameters of a Random Forest algorithm. All goodness of fit metrics are reported either from cross-validation or from the 40% test set. Figure 3.4 shows the feature importances associated with the final model.
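A sketch of this train/test and tuning workflow in scikit-learn follows; the parameter grid is illustrative rather than the grid actually used.

```python
# 60/40 split, then 10-fold cross-validation to tune a Random Forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_features": [4, 6, 8]},
    cv=10,
    scoring="neg_mean_absolute_error")
grid.fit(X_train, y_train)

# Goodness of fit is then reported from cross-validation or the test set
print(grid.best_params_)
```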

Normally a host of features are employed to model the spatial structure, but in this simple example, we include just one set of features - a fixed effect for each neighborhood. Our hypothesis is that explicitly accounting for neighborhood variation helps to control for local comparables as well as any equilibrium effects.

Figure 3.3 visualizes the neighborhood amenity features developed for the model. The goal is to quantify the level of amenity and disamenity “exposure” for each home sale citywide. To quantify exposure to aggravated assaults, for example, we measure the distance from each home sale to its k nearest assault incidents and take the mean.
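A minimal sketch of this feature, assuming numpy arrays of projected (x, y) coordinates for home sales (`sale_xy`) and assault incidents (`assault_xy`):

```python
# Average distance from each home sale to its k nearest assault incidents.
from sklearn.neighbors import NearestNeighbors

def knn_exposure(sale_xy, amenity_xy, k=5):
    """Mean distance from each sale to its k nearest amenity/disamenity points."""
    nn = NearestNeighbors(n_neighbors=k).fit(amenity_xy)
    distances, _ = nn.kneighbors(sale_xy)
    return distances.mean(axis=1)

sales["assault_exposure"] = knn_exposure(sale_xy, assault_xy, k=5)
```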

Not only do prices vary by neighborhood, they vary by neighborhood type as well. We split neighborhoods into “high” and “low” designations across three different typologies and visualize the differences in Figure 3.2. The first is Qualified Census Tracts (QCT), a poverty designation HUD uses to allocate housing tax credits. QCT designations provide a deliberate and policy-relevant threshold for judging generalizability. Neighborhoods that qualify for tax credits exhibit mean single-family home prices that are nearly half those of neighborhoods that do not. Next, we ask whether the algorithm generalizes to gentrifying neighborhoods. Tracts are designated as “gentrifying” and “non-gentrifying” using metrics from the Federal Reserve Bank of Philadelphia. Mean prices differ across gentrifying and non-gentrifying neighborhoods. A third typology is race-related. To determine whether the model generalizes with respect to race, the city is grouped into “majority White” and “majority non-White” census tracts. Mean prices are clearly higher in the former group. A well generalized algorithm should exhibit comparable error rates across each group.

Figure 3.1 visualizes the mean and standard deviation (as a percentage of the mean) of single-family home prices by neighborhood in Philadelphia. Not surprisingly, high and low price neighborhoods cluster. Perhaps more surprising is that low-priced neighborhoods also exhibit relatively high price variance. This may relate to the gentrification disequilibrium described above.

Philadelphia is missing a surprising amount of data on the internal characteristics of homes; 17% of transactions in our data list zero bedrooms. Perhaps OPA imputes missing values, but for this example, we use a fixed effect to denote when the number of bedrooms equals zero. All told, we employ eight house/parcel-specific features in the model below.

Our data come from the Philadelphia Office of Property Assessment (OPA). Those interested in replicating the analysis can download the data here or access assessment data directly on OpenDataPhilly. The dataset consists of market transactions of single-family home sales from July 2017 to July 2018. Sales less than $3,000 and greater than $1,000,000 are removed, as are observations with missing data. The final dataset includes 21,964 transactions with a mean and standard deviation sale price of $185,950 and $162,873, respectively. The table below provides a description of the variables developed for the model.
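A sketch of this sample construction in pandas, assuming a raw OPA DataFrame `opa` with a `sale_price` column (the column name is our assumption):

```python
# Trim outlier sales and drop observations with missing data.
import pandas as pd

sales = (
    opa.query("3000 <= sale_price <= 1000000")
       .dropna()
)
print(f"{len(sales):,} sales | mean ${sales.sale_price.mean():,.0f} | "
      f"sd ${sales.sale_price.std():,.0f}")
```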

Accuracy is simply the difference between the observed price of a home and its predicted price, a difference often referred to as “error.” Generalizability is a bit more complex. The general approach for assessing bias in these algorithms is to investigate how errors cluster in space. The steps are: 1) train the model; 2) use the trained model to predict for out-of-sample sales; and 3) calculate and map errors. For an assessment algorithm to generalize well, it must exhibit comparable error rates across different neighborhoods and neighborhood contexts. If the model predicts better for rich versus poor neighborhoods, or White versus African American neighborhoods, then the model may be biased. The spatial arrangement of errors is explored further below.
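Step 3 extends naturally to comparing errors across contexts. A minimal sketch, assuming a test set DataFrame `test_sales` with observed `price`, `predicted` price, and a hypothetical `majority_white` tract flag:

```python
# Compare mean absolute percent error across neighborhood contexts.
test_sales["ape"] = (test_sales["predicted"] - test_sales["price"]).abs() / test_sales["price"]

# Comparable group means suggest the model generalizes across contexts
print(test_sales.groupby("majority_white")["ape"].mean())
```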

These algorithms are not without inherent biases. There are some key assumptions, including: 1) that all buyers and sellers have access to the same market information; 2) that neighborhood crime can be measured with the same accuracy as, say, the number of bedrooms in a house; and 3) that buyers exhibit homogeneous preferences for amenities, like schools. A final source of bias is the assumption that neighborhoods are in “equilibrium.” This is almost never the case, particularly in gentrifying communities, where buyers and sellers capitalize future expectations into prices (i.e., “what will this house be worth if a new subway station opens nearby?”). Simply put, buyers and sellers will disagree on the future value of gentrified housing in a neighborhood, which may make it difficult to predict variation in prices at the neighborhood level.

The first is parcel/internal characteristics, like the size of the lot and the number of bedrooms. Next is neighborhood characteristics, including exposure to crime or access to transit. A third is the “spatial component,” which hypothesizes that house price is a function, in part, of neighboring prices. These features take their motivation from real estate appraisers who compare similar homes in close proximity (i.e., “comparables”). Properly controlling for comparables requires features that capture the unique spatial scale of prices in a neighborhood. For over two decades, Urban Spatial has developed a host of specialized approaches to account for this spatial component. Interested readers can refer to our work here, although for simplicity, we omit these more complicated features from our model below.

The Zestimate algorithm is very similar to the methodology counties use to assess home values and calculate property tax liability. These methods are rooted in the “hedonic model” - an econometric approach for decomposing the market price of a good into the value of each constituent part. The hedonic model can estimate the “capitalization effect,” or price premium, associated with an extra bedroom or the presence of a garage. It can also be used, as is the case with tax assessment, for prediction. Typically, these algorithms are trained on recent transactions, then used to predict values for all houses citywide. The hedonic model relies on several different feature types, each explained above.
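A hedged sketch of a simple hedonic regression in statsmodels is below; the column names are illustrative stand-ins for the feature types described above, not the report’s actual variables.

```python
# A log-linear hedonic price model; coefficients approximate the
# capitalization effect of each housing attribute.
import numpy as np
import statsmodels.formula.api as smf

hedonic = smf.ols(
    "np.log(sale_price) ~ bedrooms + lot_size + has_garage + C(neighborhood)",
    data=sales).fit()

# e.g., the approximate price premium associated with one more bedroom
print(hedonic.params["bedrooms"])
```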

It is difficult to participate in the real estate market and not interact with machine learning models. Airbnb’s algorithms recommend rental prices to its hosts. Trulia’s computer vision algorithms convert house photos to home features. Perhaps the most ubiquitous real estate algorithm is Zillow’s Zestimate, which predicts the current market value of a property.

4. Recidivism Prediction

4.1 Recidivism and predictive parity

In 2016, ProPublica released an evaluation of the COMPAS recidivism prediction algorithm built by a company called Northpointe and currently in use in Florida and other states around the country. ProPublica found that while the algorithm had comparable accuracy rates across different racial groups (what is known as “predictive parity”), there were clear racial differences for errors that had high social costs. This paradox led ProPublica to ask a fascinating question - “how could an algorithm simultaneously be fair and unfair?”

In the criminal justice system, as in life, decisions are made by weighing risks. Among a host of Federal sentencing guidelines, judges are to “protect the public from further crimes of the defendant.” Rhetorically, this sounds straightforward - identify the risk that an individual will cause the public harm and impose a sentence that will reduce this risk. However, bias always plays a role in decision-making. We would never ask the average citizen to weigh risks and punish accordingly because we do not believe the average citizen could act with impartiality. Although this is the standard we impose on judges, even they make systematic mistakes.

Can an algorithm help judges make better decisions? A recent paper determined that even with much less data, people predict recidivism about as accurately as COMPAS. On the other hand, studies have shown that introducing prediction into the decision-making process can reduce the odds of re-arrest.

The use of data-driven risk models in the criminal justice system has increased in recent years. These algorithms predict risk for a host of outcomes and are used in bail hearings, to determine whether an inmate should be granted parole, and to support sentencing decisions by assessing future criminal behavior. In the case below, the focus is on recidivism. Unlike tax assessment, which has fairness implications at the community or household level, recidivism prediction can have a disparate impact on individuals.

Below, a recidivism algorithm is developed using the COMPAS data provided by ProPublica. Accuracy and generalizability are discussed, and recent research is presented showing that it may not be possible to rid these algorithms of their bias. We conclude by providing some useful context for deploying these algorithms in the face of bias.

4.2 Accuracy and generalizability in recidivism algorithms

A recidivism “classifier” algorithm has two “binary” outcomes - “Recidivate” and “Non-recidivate.” While the “percent of correct predictions” is a simple measure of accuracy, it lacks nuance, particularly given the social costs associated with different types of errors. Through an understanding of the many different approaches for measuring accuracy in binary predictive models, an understanding of generalizability emerges.

The basic premise is to learn from the recidivism experience of ex-offenders in the recent past and project those experiences onto a population for which the propensity to recidivate is unknown. The prediction from the model is a number - a “risk score” - running from 0 to 1 and interpreted as “the probability person i will recidivate.” The analyst chooses a risk score threshold above which an ex-offender is classified as predicted to recidivate. The model can then be validated by comparing predicted classifications to observed outcomes, giving a host of more nuanced error types:

True Positive (“Sensitivity”) - “The person was predicted to recidivate and actually recidivated.”

True Negative (“Specificity”) - “The person was predicted not to recidivate and actually did not recidivate.”

False Positive - “The person was predicted to recidivate and actually did not recidivate.”

False Negative - “The person was predicted not to recidivate and actually did recidivate.”

The severity of classification errors and their associated social costs can only be judged in the context of the use case. The focus here, as it was with tax assessment, is to understand whether these error types generalize well across race. What makes this so difficult is that the observed base rates of recidivism vary significantly across race.
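These four quantities are straightforward to compute by race. A minimal sketch, assuming a test set DataFrame `test` with observed outcomes (`recidivated`, 0/1), predicted probabilities (`risk_score`), and a `race` column (all hypothetical names):

```python
# Confusion-matrix rates by race at a 50% classification threshold.
import pandas as pd

test["predicted"] = (test["risk_score"] >= 0.5).astype(int)

def rates(g):
    tp = ((g.predicted == 1) & (g.recidivated == 1)).sum()
    tn = ((g.predicted == 0) & (g.recidivated == 0)).sum()
    fp = ((g.predicted == 1) & (g.recidivated == 0)).sum()
    fn = ((g.predicted == 0) & (g.recidivated == 1)).sum()
    return pd.Series({
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp)})

print(test.groupby("race").apply(rates))
```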

4.3 Data and exploratory analysis

The data for this analysis was acquired by ProPublica as part of a public records request and is currently hosted on their GitHub page. At the time of analysis, we were unable to secure a data dictionary, thus many of the feature engineering routines employed in our code below were copied directly from ProPublica’s IPython Notebook. While this is not ideal, it is at times the nature of working with open data.

After cleaning, the data describes the 6,163 ex-offenders screened by COMPAS in 2013 and 2014. There are 53 columns in the original data describing length of jail stays, type of charges, the degree of crimes committed, and criminal history. Many of these variables were added by Northpointe, the original author of the COMPAS algorithm, and are not relevant to the model building process. There are also a host of variables that Northpointe collects from survey data that do not appear in the dataset. Noticeably absent are data describing economic and educational backgrounds. We return to this shortcoming in Section 4.5. The model developed below is simplistic - it is not a replication of the existing Northpointe algorithm, which is proprietary. The table below describes the features.

sex - Categorical variable indicating whether the ex-offender is male or female
age - The age of the person
age_cat - Categorizes ex-offenders into three age groups: Less than 25, 25 to 45, Greater than 45
race - The race of the person
priors_count - The number of prior crimes committed
two_year_recid - Numeric binary variable indicating whether the person recidivated (1) or not (0)
r_charge_desc - Description of the charge upon recidivating
c_charge_desc - Description of the original criminal charge
c_charge_degree - Degree of the original charge
r_charge_degree - Degree of the charge upon recidivating
juv_other_count - The number of prior juvenile convictions that are considered neither felonies nor misdemeanors
length_of_stay - How long the person stayed in jail
Recidivated - Character binary variable indicating whether the person recidivated (Recidivate) or not (notRecidivate)

Figure 4.1 illustrates the most frequent initial charges. Crimes of varying severity are included in the dataset. Figure 4.2 shows that for repeat offenders, the recidivism event tends to be a lesser crime than the initial offense. Figure 4.3 visualizes the rate of recidivism by race. Note that the rate of recidivism for African Americans (59%) is twice that of Caucasians (29%). This has important implications for generalizability, as described below.
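For those replicating the analysis, a hedged loading sketch follows. The URL points to ProPublica’s compas-analysis repository, and the screening-date filter mirrors their notebook; both are assumptions about the exact cleaning steps used here.

```python
# Load the two-year recidivism file from ProPublica's GitHub repository.
import pandas as pd

url = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")
raw = pd.read_csv(url)

# Keep screenings within 30 days of arrest, as in ProPublica's notebook
df = raw[raw.days_b_screening_arrest.between(-30, 30)]
print(df.shape)
```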

4.4 Modeling

A logistic regression model is estimated. Again, a more advanced model would be employed in reality, but for demonstration purposes the model is kept simple. Minimal feature selection tests are undertaken and, to keep the model simple, only 7 features are employed. There is a naive belief among some that algorithmic discrimination can be prevented by omitting controls for protected groups from the model. Although this is not the case, race predictors are omitted from the algorithm.

Figure 4.4 plots the variable importance for the model, defined as standardized regression coefficients. The greatest predictive power comes from features describing the total number of prior convictions for a given ex-offender. As these features skew Figure 4.4, they are omitted from the plot.

The data are split into 75/25 training/test sets. 100-fold cross-validation is performed on the 75% training set to test how well the model would generalize to new data. Ex-offenders with predicted probabilities greater than or equal to 50% are classified as “will recidivate,” while those below 50% are classified as “will not recidivate.” Other approaches to finding appropriate cut-offs are described below. Here, we err on the side of simplicity.
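A minimal sketch of this modeling step follows, continuing from the `df` loading sketch above. The feature list is illustrative rather than the report’s exact 7 features, and race is deliberately excluded from the predictors.

```python
# 75/25 split, then a simple logistic regression excluding race.
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

model = smf.logit(
    "two_year_recid ~ sex + age_cat + priors_count + "
    "c_charge_degree + juv_other_count",
    data=train).fit()

# Risk scores in [0, 1]; classify at the 50% threshold
test["risk_score"] = model.predict(test)
test["predicted"] = (test["risk_score"] >= 0.5).astype(int)
```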

4.5 Accuracy and generalizability

The simplest accuracy metric is defined as (# of True Positives + # of True Negatives) / Total Observations. Model accuracy for Caucasians, African Americans, and Hispanics is 67%, 68.2%, and 67.8%, respectively. By this metric alone, we could conclude that the model is fair.

There are some additional visual representations worth examining as well. Figure 4.5 presents a Receiver Operating Characteristic (ROC) curve - a traditional measure of accuracy for binary classifiers. The ROC curve illustrates trade-offs in True Positive and False Positive rates for a given threshold. The diagonal line in Figure 4.5 is the “random guess line.” Along this line, if the model correctly predicts recidivism, say, 40% of the time, it also mis-classifies recidivism 40% of the time. If the ROC curve dips below the random guess line, the model is less effective than a random coin flip - by definition, an underfit model. For contrast, an ROC curve with a perfect right angle at the top left of the plot is perfectly overfit: the model correctly classifies recidivism 100% of the time while incorrectly classifying it 0% of the time. Here, the model suggests that if recidivism is classified correctly 75% of the time, it is mis-classified roughly 45% of the time.

The Area Under the Curve (AUC) simply measures the plot space below the ROC curve. An ROC curve along the coin flip line has an AUC of 0.50 or 50%. The overfit ROC curve described above would have an AUC of 1 or 100%. The AUC for this model is 0.73, which represents marginal accuracy. AUC is useful for visualizing accuracy across race. Three ROC curves and corresponding AUC metrics are presented in Figure 4.6 - one for each of the three predominant race groups included in the data. The results confirm the finding of predictive parity.

These goodness of fit metrics are robust to cross-validation as well. Figure 4.7 shows the mean and standard deviation across 100 random test set holdouts for three key goodness of fit metrics. The mean True Negative Rate (Specificity) is lower than the mean True Positive Rate (Sensitivity), suggesting that the model is more error-prone in cases where ex-offenders are predicted not to recidivate.

Accuracy is too simplistic a measure, given the social costs associated with certain types of errors. Generalizing to random holdouts is important, but generalizing across race is more so. Figure 4.8 contrasts observed and predicted recidivism rates. 44.5% of ex-offenders are observed to recidivate across all races, but only 39.6% are predicted to do so. This underprediction is readily apparent for Caucasians and Hispanics, who are predicted to recidivate at far lower rates.

Figure 4.9 compares error rates by race. Again, the model appears equally accurate with respect to race. True Negatives - correctly predicting that an ex-offender will not recidivate - are far more common for Caucasians and Hispanics. False Negatives - predicting that an ex-offender will not recidivate when they in fact do - are also far more common for Caucasians and Hispanics. True Positives - correctly predicting that an ex-offender will recidivate - occur at a far higher rate for African Americans, and False Positives - predicting that an ex-offender will recidivate when they in fact do not - are likewise higher for African Americans.
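A sketch of the AUC-by-race comparison, continuing with the `test` DataFrame from the modeling sketch above:

```python
# ROC curves and AUC by race group.
from sklearn.metrics import roc_auc_score, roc_curve

for race, group in test.groupby("race"):
    if group["two_year_recid"].nunique() == 2:  # AUC needs both outcomes
        auc = roc_auc_score(group["two_year_recid"], group["risk_score"])
        print(f"{race}: AUC = {auc:.3f}")

# Full ROC curve for the pooled test set
fpr, tpr, thresholds = roc_curve(test["two_year_recid"], test["risk_score"])
```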
Finally, the tradeoff between False Negatives and False Positives is visualized for Caucasian and African American ex-offenders. The Detection Error Tradeoff (DET) curve below shows, in general, that a threshold yielding a lower False Positive rate will yield higher False Negative rates, and that the rate of False Negatives will be higher for Caucasians. Conversely, a threshold that yields lower False Negative rates will yield very high False Positive rates, but with less bias across the races. The DET curve provides an effective indicator for understanding the social costs associated with the model.
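A hedged sketch of the threshold sweep underlying a DET curve, again assuming the `test` DataFrame from above:

```python
# False Positive vs. False Negative rates across thresholds, by race.
import numpy as np
import pandas as pd

def fpr_fnr(g, threshold):
    pred = g["risk_score"] >= threshold
    obs = g["two_year_recid"] == 1
    return (pred & ~obs).sum() / (~obs).sum(), (~pred & obs).sum() / obs.sum()

rows = [(race, t, *fpr_fnr(g, t))
        for race, g in test.groupby("race")
        for t in np.arange(0.1, 0.91, 0.05)]
det = pd.DataFrame(rows, columns=["race", "threshold", "fpr", "fnr"])
# Plotting fnr against fpr by race traces out the DET curves
```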