Teaching computers to understand human language used to be a tedious and imprecise process. Now, language algorithms analyze oceans of text to teach themselves how language works. The results can be unsettling, such as when the Microsoft bot Tay taught itself to be racist after a single day of exposure to humans on Twitter.

“You don’t tell an NLP the grammatical rules of the language explicitly,” said Ran Zmigrod, a PhD student in computer science at the University of Cambridge who specializes in “debiasing” these models.

The first step to fixing a sexist algorithm is admitting it’s not “broken.” Machine bias is human bias, and it can start polluting the data before the decision-making code even starts to run. The selling point of machine learning is that it teaches itself, to varying degrees, how to learn from the data it is given.

As NLP systems creep into every corner of the digital world, from job recruitment software to hate speech detectors to police data, that signal problem grows to fit the size of its real-world container. Every industry that uses machine learning solutions risks contamination. Algorithms given jurisdiction over public services like healthcare frequently exacerbate inequalities, excusing the ancient practice of shifting blame onto the most vulnerable populations for their circumstances in order to redistribute the best services to those least in need; models that try to predict where crime will occur can wind up making racist police practices even worse.

Skewed data is a very old problem in the social sciences, but machine learning hides its bias under a layer of confusion. Even AI researchers who work with machine learning models––like neural nets, which use weighted variables to approximate the decision-making functions of a human brain––don’t know exactly how bias creeps into their work, let alone how to address it.

Siri, Google Translate, and job applicant tracking systems all use the same kind of algorithm to talk to humans. Like other machine learning systems, NLPs (short for “natural language processors” or sometimes “natural language programs”) are bits of code that comb through vast troves of human writing and churn out something else––insights, suggestions, even policy recommendations. And like all machine learning applications, an NLP program’s functionality is tied to its training data––that is, the raw information that has informed the machine’s understanding of the reading material.

“Data and datasets are not objective; they are creations of human design,” writes data researcher Kate Crawford. When designers miss or ignore the imprint of biased data on their models, the result is what Crawford calls a “signal problem,” where “data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.”

Rather than being told the rules explicitly, Zmigrod explained, the code uses training data to infer the language’s important rules on its own, and then applies those rules to the task on a smaller, more focused dataset. One way a model might do this is with Markov chains, which estimate how closely associated two elements are by measuring how well the presence of one predicts the presence of the other. For example, it checks whether having “homemaker” in a text makes the word “she” more likely to appear.
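That kind of association check can be sketched in a few lines of code. The corpus below is invented for illustration, not drawn from any real training set, but the deliberate skew in it (“homemaker” only ever appears near “she”) shows how a co-occurrence statistic picks up whatever pattern the text contains:

```python
from collections import Counter

# Toy corpus standing in for a "ground truth" dataset. The skew is
# deliberate: "homemaker" only ever appears in sentences with "she".
corpus = [
    "she is a homemaker",
    "she works as a homemaker",
    "he is a programmer",
    "she is a doctor",
]

# Count how often each pronoun shares a sentence with "homemaker".
pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for pronoun in ("she", "he"):
        if pronoun in words and "homemaker" in words:
            pair_counts[pronoun] += 1

# P(pronoun | "homemaker"): how well "homemaker" predicts each pronoun.
total = sum(pair_counts.values())
probs = {p: pair_counts[p] / total for p in ("she", "he")}
print(probs)  # {'she': 1.0, 'he': 0.0}
```

Nothing in the code is sexist; the numbers simply mirror the text they were counted from, which is exactly how biased data becomes a biased model.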

If that just sounds like a fancy way of running the numbers, it’s because that’s exactly what it is. “People see machine learning as something supernatural, but I just view it as a very elaborate statistics,” Andrea Eunbee Jang, a research intern at the Mila Artificial Intelligence Institute in Quebec, told Motherboard.

Jang is part of a project at Mila that, last year, developed a taxonomy for all the types of gender bias found in NLP datasets. The programmer masculinity problem, according to Jang, is a prime example of a “gender generalization,” or an assumption based on the over-valuing of gender stereotypes. “In machine learning in general, data will be tailored for the majority group,” Jang said. Added fellow researcher Yasmeen Hitti: “If the text the model is based on has only seen male programmers, it will assume that the programmer you’re typing about is also male. It just depends on what it’s seen before.”
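Hitti’s point can be reduced to a toy example. The “training sentences” below are made up, and the “model” is nothing more than a pronoun tally, but the failure mode is the one she describes: a majority-vote predictor can only ever echo what it has seen.

```python
from collections import Counter

# Hypothetical training snippets in which every programmer is male.
training_sentences = [
    "he is a talented programmer",
    "the programmer said he would fix it",
    "he has worked as a programmer for years",
]

# A crude stand-in for what a statistical model learns: tally which
# pronoun co-occurs with "programmer" and predict the majority.
votes = Counter()
for sentence in training_sentences:
    words = sentence.split()
    if "programmer" in words:
        for pronoun in ("he", "she"):
            votes[pronoun] += words.count(pronoun)

prediction = votes.most_common(1)[0][0]
print(prediction)  # "he" -- the tally has simply never seen anything else
```

A real NLP system is vastly more elaborate than a word count, but as Jang notes, it is still statistics: if the majority group dominates the data, it dominates the prediction.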

SEXISM, EMBEDDED

Though projects that scrape new data from Twitter and YouTube have begun to dot the academic landscape, the traditional “ground truth” datasets used to build NLPs come from free collections—like the e-book archives at Project Gutenberg, collections of movie and restaurant reviews translated to plain text for machine learning applications, and dictionary entries. These huge piles of sentences are chosen to represent a range of language forms, but not necessarily a range of perspectives.
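One place that skew shows up is in word embeddings, the numerical vectors an NLP system learns for each word from corpora like these. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions learned from billions of words), but they show the shape of the problem: in biased training text, “programmer” ends up measurably closer to “he” than to “she.”

```python
import math

# Toy three-dimensional "embeddings," invented for illustration; real
# models learn these coordinates from huge text corpora.
vectors = {
    "he":         [0.9, 0.1, 0.3],
    "she":        [0.1, 0.9, 0.3],
    "programmer": [0.8, 0.2, 0.5],
    "homemaker":  [0.2, 0.8, 0.5],
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means the words are more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Skewed text leaves its imprint as geometry: "programmer" sits closer
# to "he" than to "she," and the reverse holds for "homemaker."
print(cosine(vectors["programmer"], vectors["he"]))   # noticeably higher...
print(cosine(vectors["programmer"], vectors["she"]))  # ...than this
```

In this toy geometry, as in real embedding models trained on skewed corpora, the bias isn’t stored in any single rule that could be deleted; it’s distributed across the coordinates themselves.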