For all the advances being made in the field, artificial intelligence still struggles when it comes to identifying hate speech. When he testified before Congress in April, Facebook CEO Mark Zuckerberg said it was “one of the hardest” problems. But, he went on, he was optimistic that “over a five- to 10-year period, we will have AI tools that can get into some of the linguistic nuances of different types of content to be more accurate in flagging things for our systems.” For that to happen, however, humans will first need to define for themselves what hate speech means—and that can be hard because it’s constantly evolving and often dependent on context.

“Hate speech can be tricky to detect since it is context and domain dependent. Trolls try to evade or even poison such [machine learning] classifiers,” says Aylin Caliskan, a computer science researcher at George Washington University who studies how to fool artificial intelligence.

In fact, today’s state-of-the-art hate-speech-detecting AIs are susceptible to trivial workarounds, according to a new study to be presented at the ACM Workshop on Artificial Intelligence and Security in October. A team of machine learning researchers from Aalto University in Finland, with help from the University of Padua in Italy, was able to successfully evade seven different hate-speech-classifying algorithms using simple attacks, like inserting typos. The researchers found that all of the algorithms were vulnerable, and argue that humanity’s trouble defining hate speech contributes to the problem. Their work is part of an ongoing project called Deception Detection via Text Analysis.

The Subjectivity of Hate-Speech Data

If you want to create an algorithm that classifies hate speech, you need to teach it what hate speech is, using data sets of examples that are labeled hateful or not. That requires a human to decide when something is hate speech. Their labeling is going to be subjective on some level, although researchers can try to mitigate the effect of any single opinion by using groups of people and majority votes. Still, the data sets for hate-speech algorithms are always going to be made up of a series of human judgment calls. That doesn’t mean AI researchers shouldn’t use them, but they have to be upfront about what they really represent.
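In practice, that aggregation step often comes down to a simple majority vote over the labels that individual annotators assign. A minimal sketch of what that might look like (the function name and label strings here are illustrative, not taken from the study):

```python
from collections import Counter

def majority_label(annotations):
    """Collapse several annotators' labels for one text into a single
    label by majority vote. `annotations` is a list of labels (e.g.
    "hate" / "not_hate") from different crowd workers."""
    counts = Counter(annotations)
    label, _ = counts.most_common(1)[0]
    return label

# Three workers disagree about the same text; the majority view wins,
# and the dissenting judgment disappears from the data set.
votes = ["hate", "not_hate", "hate"]
print(majority_label(votes))  # -> hate
```

The vote smooths out individual outliers, but as Gröndahl notes below, the result still encodes only the majority view of whoever happened to do the labeling.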

“In my view, hate-speech data sets are fine as long as we are clear what they are: they reflect the majority view of the people who collected or labeled the data,” says Tommi Gröndahl, a doctoral candidate at Aalto University and the lead author of the paper. “They do not provide us with a definition of hate speech, and they cannot be used to solve disputes concerning whether something ‘really’ constitutes hate speech.”

In this case, the data sets came from Twitter and Wikipedia comments, and were labeled by crowdsourced micro-laborers as hateful or not (one model also had a third label for “offensive speech”). The researchers discovered that the algorithms didn’t work when they swapped their data sets, meaning the machines can’t identify hate speech in new situations different from the ones they’ve seen in the past.
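The swap experiment can be illustrated with a toy model: a classifier that memorizes which words appeared in hateful training examples scores well on its own corpus but poorly on one with different vocabulary. The tiny corpora and keyword "classifier" below are invented for illustration and are far simpler than the models in the study:

```python
def train_keyword_model(texts, labels):
    """Learn the words that appear only in examples labeled 1 (hateful)."""
    hateful, benign = set(), set()
    for text, label in zip(texts, labels):
        (hateful if label == 1 else benign).update(text.lower().split())
    return hateful - benign

def predict(model, text):
    """Flag a text if it contains any learned 'hateful' keyword."""
    return int(any(word in model for word in text.lower().split()))

def accuracy(model, texts, labels):
    preds = [predict(model, t) for t in texts]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# A "Twitter-style" training corpus and a "Wikipedia-comment-style"
# test corpus with different phrasing (invented examples).
twitter = (["u r trash", "nice pic", "u r vile trash", "love this"],
           [1, 0, 1, 0])
wiki = (["this editor is a worthless idiot", "thanks for the fix"],
        [1, 0])

model = train_keyword_model(*twitter)
print(accuracy(model, *twitter))  # -> 1.0 on its own domain
print(accuracy(model, *wiki))     # -> 0.5 after the swap
```

Real classifiers generalize better than a keyword list, but the failure mode is the same: what counts as hateful on one platform is expressed differently on another.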


That’s likely due in part to how the data sets were created in the first place, but the problem is really caused by the fact that humans don’t agree what constitutes hate speech in all circumstances. “The results are suggestive of the problematic and subjective nature of what should be considered ‘hateful’ in particular contexts,” the researchers wrote.

Another problem the researchers discovered is that some of the classifiers have a tendency to conflate merely offensive speech with hate speech, creating false positives. They found that the single algorithm with three categories—hate speech, offensive speech, and ordinary speech—rather than two did a better job of avoiding false positives. But eliminating the issue altogether remains difficult, because there is no agreed-upon line where offensive speech definitely slides into hateful territory. It’s likely not a boundary you can teach a machine to see, at least for now.

Attacking With Love

For the second part of the study, the researchers attempted to evade the algorithms in a number of ways: inserting typos, using leetspeak (such as “c00l”), adding extra words, and inserting or removing spaces between words. The altered text was meant to evade AI detection but still be clear to human readers. The effectiveness of their attacks varied depending on the algorithm, but all seven hate-speech classifiers were significantly derailed by at least some of the researchers’ methods.
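The attacks are simple enough to sketch in a few lines each. These functions mirror the categories of perturbation the study describes, though the specific implementations and names here are our own:

```python
import random

# Map letters to similar-looking digits, as in leetspeak ("cool" -> "c001").
LEET = str.maketrans({"o": "0", "e": "3", "a": "4", "l": "1"})

def leetspeak(text):
    """Swap letters for look-alike digits; readable to humans,
    but the tokens no longer match the classifier's vocabulary."""
    return text.translate(LEET)

def insert_typo(word, rng=random):
    """Duplicate one character at a random position inside a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i] + word[i:]

def remove_spaces(text):
    """Join all words into one token the model has never seen."""
    return text.replace(" ", "")

def append_benign_word(text, word="love"):
    """Pad the text with an innocuous word to dilute the hateful signal."""
    return text + " " + word

print(leetspeak("cool story"))        # -> c001 st0ry
print(remove_spaces("you are awful")) # -> youareawful
```

Each transformation leaves the message legible to a person while changing the character sequences a word- or token-based classifier relies on, which is why the word-boundary attacks in particular proved so effective.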