A data-driven review of AI bias in production systems

In 2011, IBM Watson made headlines when it beat Jeopardy legends Ken Jennings and Brad Rutter in a $1M match. In Final Jeopardy, Jennings admitted defeat by writing “I, for one, welcome our new computer overlords.”

That was in 2011, when a good score in the ImageNet Challenge, a computer image recognition contest, was 75%. More recently, after numerous advances in hardware (GPUs & TPUs), software, and techniques, the best programs can now recognize over 95% of objects. Some experts hold that state-of-the-art AI has surpassed humans in image recognition tasks.

So one would expect that today’s Watson Visual Recognition, a descendant (by marketing at least) of the Jeopardy-winning supercomputer and a contemporary of superhuman image recognizers, would have no problem identifying an image of Rachel Anne Maddow, the cable news personality.

Watson thinks Maddow is male with 84% confidence.

While Watson successfully guesses Maddow’s age range (with 52% confidence), its gender labelling (with 84% confidence) of Maddow as male is wrong.

Watson is not alone though. Amazon Rekognition, the image recognizer from AWS, also thinks that Maddow is a man, albeit with only 59% confidence.

AWS also thinks that Maddow is a man.

AWS is more confident when it comes to misgendering Alex Anzalone, the 6' 3", 245lb NFL linebacker, labelling Anzalone as female with 79% confidence.

So what’s going on here? Why are Jeopardy-beating supercomputers with superhuman image recognition capabilities so confidently failing at gender-labelling*?

* A binary gender label (e.g., “male” or “female”) clearly has inherent shortcomings as we increasingly realize that gender is a spectrum. Gender identity, gender expression, and biological sex may or may not align, adding to the complexity of the problem.

How Modern Machine Learning Works

To understand the failings of AWS and Watson, one should first understand how modern machine learning actually works. The top computer vision systems of today “learn” by optimizing a vastly complex mathematical model using large labelled training sets. In the 2017 paper Revisiting the Unreasonable Effectiveness of Data, Google’s AI team reconfirmed the 2010 intuition that more data is better.

Per Google, performance continues to increase as the size of datasets increases.

As models become more complex and training data grows to hundreds of millions of images and beyond, it becomes nearly impossible for any single machine learning developer to fully understand the behavior of a system. Unlike traditional computer programs with deterministic outcomes, large machine learning systems operate in terms of probabilities (“confidence levels”) rather than certainty. They are nearly black boxes into which data is fed and whose outputs are tuned to “minimize loss.” Although researchers are increasingly focused on developing explainable AI whose decisions can be better understood by humans, such endeavors are far from production ready.

Given the difficulty of testing such complex systems, systematic errors can emerge. In a study called Gender Shades, MIT’s Joy Buolamwini and Microsoft’s Timnit Gebru demonstrated the biases of commercial image recognition systems by showing that darker-skinned females were misgendered up to 34% of the time, compared to only 0.8% of the time for light-skinned males. A study by ProPublica claims that COMPAS, a software system used in courts, overestimates the risk of recidivism for black defendants and underestimates it for white defendants.

While there are prominent conversations about, and generous funding for, AI safety work focused on preventing existential threats from Skynet-like artificial general intelligence, what is clear is that there are practical AI safety issues in production today. Critical decisions are increasingly made using machine learning systems. In China’s increasingly surveillance-focused society, for example, facial recognition is already used for “algorithmic governance”: capturing crimes and shaming unwanted behaviors. Systematic errors and biases can result in severe consequences.

Cloud Image Recognition Systems

In 2013, Matthew Zeiler, a top contestant in the aforementioned ImageNet competition, rebuffed offers from Google in order to start Clarifai. Clarifai provides a cloud-based service that lets any developer access a world-class image recognition service at $0.0012 per image. Since then, this space has become competitive, with services today provided by Amazon’s AWS, Google’s Cloud Platform, Microsoft’s Azure, and IBM’s Watson.

These services all work in a similar way. A developer sends an image to an API, which responds with the relevant results. Each service has its own collection of models, which allow particular classes of images to be optimized. For example, Clarifai offers apparel- and food-specific models for customers interested in those categories. They also offer custom models in which users can add additional data to enhance the results of bespoke image recognizers.

Gender classification in these systems is typically exposed through their face detection models. Currently, AWS offers age range, gender, and the detection of features such as beard and glasses. Azure offers age and gender classifications with no confidence levels. Clarifai offers age, gender, and ethnic classifications, and Watson offers age range and gender.

Google, the company with arguably the best image recognition technology and team, is unique in not offering age and gender classifications. Instead, it offers detection for emotions (“joy,” “sorrow,” “anger,” and “surprise”) and whether the person has “headwear.” Google’s absence from gender-labelling speaks to the ethical, business, and technical challenges of offering such a service.

Training & Testing Machine Learning

To build their services, these companies use a lot of data. In machine learning, data is divided into training, validation, and test sets. The training dataset is used to build the model. The results you see are based on the mathematical optimizations performed to fit the training data. The validation dataset is used to further tune the model. Finally, the test dataset is independent of the training data but follows the same probability distribution, and it is used to test the model.

For example, if a training dataset for pastry images contains 1 million images of donuts, 1 million images of muffins, and 1 million images of croissants, then the test dataset is expected to have the same proportions. So even if a system successfully passes the test dataset at near 100%, it would not successfully recognize images of kouign-amann or roti canai, since these were not in the original training and test sets.
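The split described above can be sketched in a few lines. This is a minimal illustration, not how any of these vendors actually prepare their data: the function name `stratified_split`, the 80/10/10 fractions, and the scaled-down pastry labels are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1), seed=0):
    """Split example indices into train/validation/test sets, class by class.

    `fractions` gives the train and validation shares; the test set gets the
    remainder. Each class is shuffled and cut separately, so all three sets
    keep roughly the same class proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, validation, test

# Toy stand-in for the pastry dataset: 1,000 labels per class instead of 1 million.
labels = ["donut"] * 1000 + ["muffin"] * 1000 + ["croissant"] * 1000
train, validation, test = stratified_split(labels)

# The test set mirrors the training proportions and contains no unseen classes.
test_counts = Counter(labels[i] for i in test)
print(test_counts)
print("kouign-amann" in test_counts)
```

Because the test set is cut from the same pool as the training set, a class that was never collected can never appear in it, which is why a near-perfect test score says nothing about kouign-amann.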

In practice, the actual makeup of the datasets used to train AWS, Azure, Clarifai, Google, and Watson is opaque. Timnit Gebru (of the aforementioned Gender Shades project) now advocates for “Datasheets for Datasets” so that the composition and intended uses of data are transparent and accountable. However, given that hardware, software, and techniques in machine learning are becoming commoditized, companies still view data as a proprietary advantage for their businesses that should remain opaque.

As more machine learning systems get used in production, it is increasingly important to adopt better testing beyond the test dataset. Unlike traditional software quality assurance, in which systems are tested to ensure that features operate as expected, machine learning testing requires the curation and generation of new datasets and a framework capable of dealing with confidence levels rather than the traditional 404 and 500 error codes from web servers.
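To make the contrast with pass/fail status codes concrete, here is a minimal sketch of a confidence-aware report. It is not the Safety Server's implementation; the function name `accuracy_by_confidence` and the band edges are assumptions. The first three rows reuse confidences quoted in this article, and the remaining rows are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_confidence(results, edges=(0.5, 0.7, 0.9)):
    """Group predictions into confidence bands and compute accuracy per band.

    `results` holds (predicted_label, confidence, true_label) tuples.
    Returns {band_lower_edge: (accuracy, count)}; predictions below the
    lowest edge fall into band 0.0.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for predicted, confidence, truth in results:
        band = max((edge for edge in edges if confidence >= edge), default=0.0)
        totals[band] += 1
        correct[band] += int(predicted == truth)
    return {band: (correct[band] / totals[band], totals[band])
            for band in sorted(totals)}

results = [
    ("male", 0.84, "female"),    # Watson on Maddow (from the article)
    ("male", 0.59, "female"),    # AWS on Maddow (from the article)
    ("female", 0.79, "male"),    # AWS on Anzalone (from the article)
    ("female", 0.97, "female"),  # invented
    ("male", 0.93, "male"),      # invented
    ("female", 0.72, "female"),  # invented
]

report = accuracy_by_confidence(results)
for band, (accuracy, count) in report.items():
    print(f"confidence >= {band:.1f}: accuracy {accuracy:.2f} over {count} images")
```

A report like this surfaces the worrying case directly: errors that sit in a high-confidence band, where a developer is most likely to trust the label.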

My partner Alex and I have been working on tools to support machine learning in production. As she wrote in The Rise of the Model Servers, as machine learning moves from the lab into production, additional security and testing services are required to complete the stack. One of our tools, ML Safety Server, allows the rapid generation and management of additional test datasets and the tracking of how these datasets perform over time. It is from using the Safety Server that we discovered that AI thinks Rachel Maddow is a man.

How We Test

We’ve been using public cloud APIs to prototype the Safety Server. We discovered the Rachel Maddow issue when testing image recognition services. AWS, Azure, Clarifai, and Watson have all misgendered Rachel Maddow when given recent images of her.

Rachel Maddow is misgendered in all 4 image recognition APIs.

However, when provided with an image of Rachel Maddow from her high school yearbook, when she had long blond hair, the gender labelling was correct.

If Rachel Maddow were to do a 20-year challenge, the changes in her facial features would be unremarkable. Natural aging aside, her face is her face. The only obvious differences are her short dark hair and her thick-framed glasses.

To test our hypothesis that the short hair and glasses are problematic, we curated 1,000 images each of light-skinned women with short hair, light-skinned women with glasses, and light-skinned women with both short hair and glasses. In all cases, the misgendering rate was higher than expected, exceeding 15% for all 4 APIs on all 3 datasets. It is clear that accessories and hairstyles (which evolve over time with fashion) can cause a mislabelling of gender.
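A per-slice comparison of this kind reduces to counting errors by API and by curated dataset. The sketch below shows one way to do that, assuming each prediction has already been collected into a plain tuple. The API names `api_a` and `api_b`, the slice names, and every count are synthetic placeholders, not our measured results.

```python
from collections import defaultdict

def misgendering_rates(records):
    """Return {(api, slice_name): error_rate}.

    `records` holds (api, slice_name, predicted_label, true_label) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for api, slice_name, predicted, truth in records:
        totals[(api, slice_name)] += 1
        errors[(api, slice_name)] += int(predicted != truth)
    return {key: errors[key] / totals[key] for key in totals}

def slices_over_budget(rates, budget):
    """List the (api, slice_name) pairs whose error rate exceeds the budget."""
    return sorted(key for key, rate in rates.items() if rate > budget)

def make_records(api, slice_name, n_wrong, n_total):
    """Build synthetic records for a slice of women, n_wrong of them mislabelled."""
    return [(api, slice_name, "male" if i < n_wrong else "female", "female")
            for i in range(n_total)]

# Synthetic data only: 10 images per slice with made-up error counts.
records = (
    make_records("api_a", "short_hair", 2, 10)
    + make_records("api_a", "glasses", 1, 10)
    + make_records("api_b", "short_hair", 0, 10)
)

rates = misgendering_rates(records)
flagged = slices_over_budget(rates, budget=0.15)
print(rates)
print(flagged)
```

Tracking the flagged list over time, rather than a single aggregate accuracy number, is what makes a regression on one subgroup visible.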

The APIs are struggling with test data of women with short hair and glasses

AWS thinks that a short-haired Katy Perry is male.

We also generated 1,000 images each of men with long hair and men with eye makeup, and the results speak for themselves, with error rates over 12%.

Clarifai: Man with eye makeup is classified as a woman.

AWS: Man with eye makeup is classified as female.

Clarifai: Prince with eye makeup is labelled as a woman.

Watson: Man with long hair is classified as female.

AWS: Man with long hair is classified as female.

Watson: Man with long hair is classified as female.

It is true that some of these mistakes happen at a lower confidence level and that AWS does warn that an appropriate threshold needs to be set for mission-critical applications. Nevertheless, not all developers will follow this guideline, and Microsoft’s Azure is so certain of itself that it doesn’t even provide a confidence score by default.
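The trade-off behind a confidence threshold can be sketched as follows. This is an illustration under stated assumptions, not vendor guidance: the function names are hypothetical, a missing confidence score is treated as a reason to abstain, and only the first three rows reuse confidences quoted in this article.

```python
def label_with_threshold(predicted, confidence, threshold):
    """Return the label only if its confidence clears the threshold.

    Returns None (abstain) when the confidence is too low or when the
    service reported no confidence at all.
    """
    if confidence is None or confidence < threshold:
        return None
    return predicted

def coverage_and_error(results, threshold):
    """Return (share of images labelled, error rate among labelled images)."""
    answered = [(predicted, truth) for predicted, confidence, truth in results
                if label_with_threshold(predicted, confidence, threshold) is not None]
    coverage = len(answered) / len(results)
    error = sum(p != t for p, t in answered) / len(answered) if answered else 0.0
    return coverage, error

results = [
    ("male", 0.84, "female"),    # Watson on Maddow (from the article)
    ("male", 0.59, "female"),    # AWS on Maddow (from the article)
    ("female", 0.79, "male"),    # AWS on Anzalone (from the article)
    ("female", 0.97, "female"),  # invented
    ("male", 0.93, "male"),      # invented
    ("female", None, "female"),  # invented: a service that returns no confidence
]

for threshold in (0.5, 0.8, 0.9):
    coverage, error = coverage_and_error(results, threshold)
    print(f"threshold {threshold}: coverage {coverage:.2f}, error {error:.2f}")
```

Raising the threshold trades coverage for accuracy, and it is not a cure: a threshold of 0.8 would still let the 84% mislabel of Maddow through.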

And The Question Is

If “The Reasons AI thinks Rachel Maddow Is A Man” comes up in Jeopardy, “What Are Incomplete Testing and Training Data?” is probably a better question than “What Are Glasses and Short Hair?” To be fair, the problems these AI companies are trying to solve are technically hard and the solutions not always obvious. It is not unexpected that their systems will fail on “edge cases” outside of the original training data. By broadening their training data, these systems could improve over time. Still, it is entirely possible that as improvements are made in one area, results will get worse in other cases.

As our world progressively moves into one in which AI plays an authoritative role, it is important for us to remember these systems’ fallibility. Companies that deploy machine learning in production should continually track and test their models even in the wild, and public cloud services like image recognition should perhaps be scrutinized the same way other products are by the likes of Consumer Reports.

Feel free to contact us if you want to talk more about testing and safety issues in machine learning.