New research out of MIT’s Media Lab is underscoring what other experts have reported or at least suspected before: facial recognition technology is subject to biases based on the data sets provided and the conditions in which algorithms are created.

Joy Buolamwini, a researcher at the MIT Media Lab, recently built a dataset of 1,270 faces, using the faces of politicians, selected based on their country’s rankings for gender parity (in other words, having a significant number of women in public office). Buolamwini then tested the accuracy of three facial recognition systems: those made by Microsoft, IBM, and Megvii of China. The results, which were originally reported in The New York Times, showed inaccuracies in gender identification dependent on a person’s skin color.

Gender was misidentified in less than one percent of lighter-skinned males; in up to seven percent of lighter-skinned females; up to 12 percent of darker-skinned males; and up to 35 percent in darker-skinned females.

“Overall, male subjects were more accurately classified than female subjects replicating previous findings (Ngan et al., 2015), and lighter subjects were more accurately classified than darker individuals,” Buolamwini wrote in a paper about her findings, which was co-authored by Timnit Gebru, a Microsoft researcher. “An intersectional breakdown reveals that all classifiers performed worst on darker female subjects.”

Update: In the days after this report came out, Ruchir Puri, chief architect and IBM Fellow in IBM’s Watson and Cloud Platform groups, published a blog post about mitigating bias in artificial intelligence models. Puri wrote that IBM conducted its own facial recognition study using the faces of parliamentarians, and while he acknowledges that IBM’s data set and methodology were slightly different, says “the error rates of IBM’s upcoming visual recognition service are significantly lower than those of the three systems presented in the paper.”

Still, it’s hardly the first time that facial recognition technology has been proven inaccurate. More and more evidence points towards the need for diverse data sets, as well as diversity among the people who create and deploy these technologies, in order for the algorithms to accurately recognize individuals regardless of race or other identifiers.

Back in 2015, Google was called out by a software engineer for mistakenly identifying his black friends as “gorillas” in its Photos app, something the company promised to fix (when in reality, it may have just removed the word “gorillas” from its index of search results in the app).

Two years ago, The Atlantic reported on how facial recognition technology used for law enforcement purposes may “disproportionately implicate African Americans.” It’s one of the larger concerns around this still-emerging technology – that innocent people could become suspects in crimes because of inaccurate tech – and something that Buolamwini and Gebru also cover in their paper, citing a year-long investigation across 100 police departments that revealed “African-American individuals are more likely to be stopped by law enforcement and be subjected to face recognition searches than individuals of other ethnicities.”

And, as The Atlantic story points out, other groups have found in the past that facial recognition algorithms developed in Asia were more likely to accurately identify Asian people than white people; while algorithms developed in parts of Europe and the US were able to identify white faces better.

The algorithms aren’t intentionally biased, but more research supports the notion that a lot more work needs to be done to limit these biases. “Since computer vision technology is being utilized in high-stakes sectors such as healthcare and law enforcement, more work needs to be done in benchmarking vision algorithms for various demographic and phenotypic groups,” Buolamwini wrote.

Editor’s Note, February 15, 2018: This post has been updated to include a response from IBM.