Social media platforms like Facebook and Twitter have imposed rigorous policies in an effort to combat hate speech and extremism. Existing AI-based policing models, however, tend to simply detect and delete objectionable posts based on keywords.

Now, researchers from Intel AI and University of California at Santa Barbara have introduced a new generative hate speech intervention model, along with two large-scale fully-labeled hate speech datasets collected from Reddit and Gab.

The standout feature of the research is that, beyond hate speech labels, the datasets also provide tailored intervention responses written by Amazon Mechanical Turk workers. An AI model can therefore be trained both to detect hate speech and to generate appropriate responses for specific types of hate speech.

“Simply detecting and blocking hate speech or suspicious users often has limited ability to prevent these users from simply turning to other social media platforms to continue to engage in hate speech as can be seen in the large move of individuals blocked from Twitter to Gab,” the researchers explain.

The Reddit dataset consists of 5,020 conversations retrieved from subreddits such as “r/The_Donald,” a forum for discussion of US President Donald Trump that was “quarantined” earlier this year for incitements to violence. The research team used keywords to identify potentially hateful comments and then reconstructed the conversational context of each comment. The Gab dataset contains 11,825 conversations retrieved from the right-wing discussion platform.
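The collection step described above, keyword matching followed by context reconstruction, can be sketched roughly as follows. This is a minimal illustration, not the researchers' pipeline: the `Comment` fields, the placeholder keyword set, and the choice to rebuild context by walking parent links up to the top-level post are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Hypothetical placeholder keywords; the paper's actual keyword list is not reproduced here.
KEYWORDS: Set[str] = {"slur_a", "slur_b"}


@dataclass
class Comment:
    comment_id: str
    parent_id: Optional[str]  # None for the top-level post
    text: str


def is_candidate(comment: Comment, keywords: Set[str] = KEYWORDS) -> bool:
    """Return True if any keyword appears as a whole token in the comment."""
    tokens = {tok.strip(".,!?\"'").lower() for tok in comment.text.split()}
    return not tokens.isdisjoint(keywords)


def reconstruct_context(comment_id: str, by_id: Dict[str, Comment]) -> List[Comment]:
    """Follow parent links to the root and return the chain in root-first order."""
    chain: List[Comment] = []
    seen: Set[str] = set()  # guards against malformed data with cyclic parent links
    current = by_id.get(comment_id)
    while current is not None and current.comment_id not in seen:
        seen.add(current.comment_id)
        chain.append(current)
        current = by_id.get(current.parent_id) if current.parent_id else None
    chain.reverse()
    return chain


def build_conversations(comments: List[Comment]) -> List[List[Comment]]:
    """Return one reconstructed conversation per keyword-matched comment."""
    by_id = {c.comment_id: c for c in comments}
    return [reconstruct_context(c.comment_id, by_id) for c in comments if is_candidate(c)]
```

Keyword matching here only nominates candidates; as the article goes on to describe, whether a comment is actually hate speech was decided by human annotators.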

The research team recruited crowd workers on Amazon Mechanical Turk to label the comments and write intervention responses on a case-by-case basis. The workers were asked to answer two questions:

1. Which posts or comments in this conversation are hate speech?
2. If there exists hate speech in the conversation, how would you respond to intervene? Write down a response that can probably hold it back (word limit: 140 characters).
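One way to picture what each worker's answers amount to is a small record holding the flagged posts and the written responses. The sketch below is purely illustrative: the `Annotation` class, its field names, and the validation rules are assumptions for this example and are not the schema of the released dataset. Only the 140-character limit comes from the task instructions quoted above.

```python
from dataclasses import dataclass, field
from typing import List

MAX_RESPONSE_CHARS = 140  # limit stated in the task instructions


@dataclass
class Annotation:
    posts: List[str]                  # conversation posts, in order
    hate_speech_idx: List[int]        # question 1: indices of posts judged hateful
    responses: List[str] = field(default_factory=list)  # question 2: interventions

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means valid."""
        problems = []
        for i in self.hate_speech_idx:
            if not 0 <= i < len(self.posts):
                problems.append(f"index {i} is outside the conversation")
        if self.hate_speech_idx and not self.responses:
            problems.append("hate speech was flagged but no response was written")
        if self.responses and not self.hate_speech_idx:
            problems.append("a response was written but no post was flagged")
        for response in self.responses:
            if len(response) > MAX_RESPONSE_CHARS:
                problems.append(f"response exceeds {MAX_RESPONSE_CHARS} characters")
        return problems
```

A record with no flagged posts and no responses passes validation, which matches the conditional wording of the second question.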

In their experiments, researchers evaluated four methods on a binary hate speech detection task: Logistic Regression (LR), Support Vector Machines (SVM), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). They also evaluated three models on generative hate speech intervention tasks: Seq2Seq, Variational Auto-Encoder (VAE), and Reinforcement Learning (RL). The results are below.

Bots have a spotty history when it comes to racist or inflammatory content. Several years ago, Microsoft's chatbot “Tay” was prompted to post a series of racist and inflammatory tweets before its handlers pulled the plug. And a recent paper from the Seattle-based Allen Institute for Artificial Intelligence (AI2) showed how even relatively innocent trigger words and phrases can be used to “inflict targeted errors” on natural language processing (NLP) models, triggering the generation of racist and hostile content.

With both vulgar humans and rogue bots to contend with in the online arena, the Intel and UC Santa Barbara datasets provide a valuable tool for both detection of and intervention on hateful comments.

The paper “A Benchmark Dataset for Learning to Intervene in Online Hate Speech” is on arXiv. The dataset has been open-sourced on GitHub.