Identifying Hate Speech with BERT and CNN

Photo by Burst from Pexels.

Two years ago, the Toxic Comment Classification Challenge was published on Kaggle. The main aim of the competition was to develop tools that would help improve online conversation:

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

In this post, we develop a tool that is able to recognize toxicity in comments. We use BERT (Bidirectional Encoder Representations from Transformers) to transform comments into word embeddings. With these embeddings, we train a Convolutional Neural Network (CNN) in PyTorch that is able to identify hate speech.


To run the code, download this Jupyter notebook.

Setup

%matplotlib inline
import logging
import time
from platform import python_version

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from sklearn.metrics import roc_auc_score
from torch.autograd import Variable

Loading the data

Go to the Toxic Comment Classification Challenge to download the data (unzip it and rename the folder to data). We train and test the model with train.csv because the entries in test.csv have no labels and are intended for Kaggle submissions.

Let’s load the data.

df = pd.read_csv('data/train.csv')

Let’s set the random seed to make the experiment repeatable and shuffle the dataset. Shuffling removes any ordering present in the file, so that the training, validation and test slices we take later are comparable samples of the data.

np.random.seed(42)

df = df.sample(frac=1)

df = df.reset_index(drop=True)

The dataset consists of comments and different types of toxicity like threats, obscenity and insults. This is a multi-label classification problem because each comment can be tagged with multiple labels (or none).

df.head()

Let’s display the first comment — don’t worry, it is without toxicity threats :)

df.comment_text[0]

“Geez, are you forgetful! We’ve already discussed why Marx was not an anarchist, i.e. he wanted to use a State to mold his ‘socialist man.’ Ergo, he is a statist — the opposite of an anarchist. I know a guy who says that, when he gets old and his teeth fall out, he’ll quit eating meat. Would you call him a vegetarian?”

For example, the comment at index 103 is marked as toxic, severe_toxic, obscene, and insult (the comment_text is intentionally hidden). The goal of this post is to train a model that will be able to flag comments like these.

target_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df.iloc[[103]][target_columns]

Defining datasets

We limit the size of the trainset to 10000 comments as we train the Neural Network (NN) on the CPU. The validation set (1000 comments) is used to measure the accuracy of the NN during training and the test set (2000 comments) is used to measure the accuracy after NN is trained.

df_train = df[:10000].reset_index(drop=True)

df_val = df[10000:11000].reset_index(drop=True)

df_test = df[11000:13000].reset_index(drop=True)

Transforming the Text

To make a CNN work with textual data, we need to transform words of comments to vectors. Huggingface developed a Natural Language Processing (NLP) library called transformers that does just that. It also supports multiple state-of-the-art language models for NLP, like BERT.

What is NLP? What is BERT?

Natural Language Processing — image from hackernoon

According to Wikipedia, Natural Language Processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

BERT was created by researchers from Google — image from Articleify

BERT is a language model that was created and published in 2018 by Jacob Devlin and his colleagues at Google [3], and it achieved state-of-the-art results in a wide variety of NLP tasks. BERT replaces the sequential nature of Recurrent Neural Networks with a much faster attention-based approach. It makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. To learn more about BERT, read BERT Explained: State of the art language model for NLP by Rani Horev.

In this example, we are using BERT as an encoder and a separate CNN as a decoder that produces predictions for the task. We could use BERT for this task directly (as described in Multilabel text classification using BERT — the mighty transformer), but we would need to retrain the multi-label classification layer on top of the Transformer so that it would be able to identify the hate speech.

From words to BERT embeddings

With BERT, each token of a comment is transformed into a vector of size [1 x 768] (768 is the length of a BERT embedding). A comment consists of multiple tokens, so we get a matrix [n x 768], where n is the number of tokens in the comment. The comment itself contains fewer than n words, because BERT inserts a [CLS] token at the beginning of the first sentence and a [SEP] token at the end of each sentence, and may split rare words into several subwords.

We use the smaller BERT model (bert-base-uncased), which has 12 attention layers and a vocabulary of 30522 tokens. BERT uses a tokenizer to split the input text into a list of tokens that are available in the vocabulary. It handles words that are not in the vocabulary by splitting them into subwords.
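To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy that WordPiece-style tokenizers follow. The tiny vocabulary and the wordpiece_split helper are made up for illustration; the real tokenizer works with its full vocabulary and also handles lower-casing and punctuation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT one.
toy_vocab = {"toxic", "un", "##friend", "##ly", "##ity", "friend"}

print(wordpiece_split("toxicity", toy_vocab))    # ['toxic', '##ity']
print(wordpiece_split("unfriendly", toy_vocab))  # ['un', '##friend', '##ly']
print(wordpiece_split("xyz", toy_vocab))         # ['[UNK]']
```

A word that is missing from the vocabulary as a whole can still be represented, as long as it can be assembled from known pieces.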

Let’s load the BERT model, Bert Tokenizer and bert-base-uncased pre-trained weights.

model_class = transformers.BertModel
tokenizer_class = transformers.BertTokenizer
pretrained_weights = 'bert-base-uncased'

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
bert_model = model_class.from_pretrained(pretrained_weights)

We transform each comment into a 2D matrix. Matrices have a predefined size, but some comments have more words than others. To transform a comment to a matrix, we need to:

limit the length of a comment to 100 tokens (100 is an arbitrary number),

pad a comment with fewer than 100 tokens (add the id 0, BERT’s padding token, to the end).

max_seq = 100

def tokenize_text(df, max_seq):
    return [
        tokenizer.encode(text, add_special_tokens=True)[:max_seq] for text in df.comment_text.values
    ]

def pad_text(tokenized_text, max_seq):
    return np.array([el + [0] * (max_seq - len(el)) for el in tokenized_text])

def tokenize_and_pad_text(df, max_seq):
    tokenized_text = tokenize_text(df, max_seq)
    padded_text = pad_text(tokenized_text, max_seq)
    return torch.tensor(padded_text)

def targets_to_tensor(df, target_columns):
    return torch.tensor(df[target_columns].values, dtype=torch.float32)
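To see what truncation and padding do, here is a small sketch on arbitrary token id lists, with 0 standing in for padding as in pad_text. The truncate_and_pad helper is hypothetical. It also builds a 0/1 mask marking real tokens; this post does not use such a mask, but it is the kind of information one might pass to BERT so that padding can be ignored.

```python
def truncate_and_pad(token_ids, max_seq, pad_id=0):
    """Cut each id list to max_seq entries and right-pad shorter ones with pad_id."""
    padded, masks = [], []
    for ids in token_ids:
        ids = ids[:max_seq]
        n_pad = max_seq - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks


# Arbitrary ids for two "comments": one short, one longer than max_seq.
toy_ids = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 7615, 102]]
padded, masks = truncate_and_pad(toy_ids, max_seq=5)

print(padded)  # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2146]]
print(masks)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Note that slicing after encoding, as tokenize_text does, can cut off the final token of a long comment.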

BERT doesn’t simply map each word to an embedding, as is the case with context-free pre-trained embeddings (Word2Vec, FastText or GloVe). To take the context into account, we need to feed the comments to the BERT model.

In the code below, we tokenize, pad and convert comments to PyTorch Tensors. Then we use BERT to transform the text to embeddings. This process takes some time so be patient.

train_indices = tokenize_and_pad_text(df_train, max_seq)
val_indices = tokenize_and_pad_text(df_val, max_seq)
test_indices = tokenize_and_pad_text(df_test, max_seq)

with torch.no_grad():
    x_train = bert_model(train_indices)[0]
    x_val = bert_model(val_indices)[0]
    x_test = bert_model(test_indices)[0]

y_train = targets_to_tensor(df_train, target_columns)
y_val = targets_to_tensor(df_val, target_columns)
y_test = targets_to_tensor(df_test, target_columns)

This is the first comment transformed into word embeddings with BERT. It has a [100 x 768] shape.

x_train[0]

The first comment is not toxic, so all of its labels are 0.

y_train[0]

Convolutional Neural Network

Convolutional Neural Network for MNIST Handwritten Digits Classification

CNNs are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. Because of these successes, many researchers try to apply them to other problems, like NLP. To learn more about CNNs, read this great article about CNNs: An Intuitive Explanation of Convolutional Neural Networks.

class KimCNN(nn.Module):
    def __init__(self, embed_num, embed_dim, class_num, kernel_num, kernel_sizes, dropout, static):
        super(KimCNN, self).__init__()

        V = embed_num
        D = embed_dim
        C = class_num
        Co = kernel_num
        Ks = kernel_sizes

        self.static = static
        self.embed = nn.Embedding(V, D)  # not used in forward(): the input is already embedded by BERT
        self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, (K, D)) for K in Ks])
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(len(Ks) * Co, C)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        if self.static:
            x = Variable(x)

        x = x.unsqueeze(1)  # (N, Ci, W, D)

        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W), ...]*len(Ks)

        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)

        x = torch.cat(x, 1)
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = self.fc1(x)  # (N, C)
        output = self.sigmoid(logit)
        return output

The architecture of KimCNN [2].

The KimCNN [1] was introduced in the paper Convolutional Neural Networks for Sentence Classification by Yoon Kim from New York University in 2014. At the time, it improved the accuracy of multiple NLP tasks. The KimCNN uses an architecture similar to the networks used for analyzing visual imagery.

Steps of KimCNN [2]:

Take a word embedding on the input [n x m], where n represents the maximum number of words in a sentence and m represents the length of the embedding.

Apply convolution operations on the embeddings. The network uses multiple convolutions of different sizes [2 x m], [3 x m] and [4 x m]. The intuition behind this is to model combinations of 2 words, 3 words, etc. Note that the convolution width is m, the size of the embedding. This is different from CNNs for images, which use square convolutions like [5 x 5]. This is because [1 x m] represents a whole word and it doesn’t make sense to run a convolution with a smaller kernel size (e.g. a convolution on half of the word).

Apply Rectified Linear Unit (ReLU) to add the ability to model nonlinear problems.

Apply 1-max pooling to down-sample the input representation and to help to prevent overfitting. Fewer parameters also reduce computational cost.

Concatenate vectors from previous operations to a single vector.

Add a dropout layer to deal with overfitting.

Apply a softmax function to distribute the probability between classes. Our network differs here because we are dealing with a multilabel classification problem: each comment can have multiple labels (or none). We use a sigmoid function, which scales logits between 0 and 1 for each class. This means that multiple classes can be predicted at the same time.
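The convolution, ReLU and 1-max pooling steps can be sketched without any deep learning library. The snippet below is an illustration only: the embeddings and filter weights are tiny made-up numbers (m = 2 instead of 768), and conv_relu_maxpool is a hypothetical helper, not part of the model in this post.

```python
def conv_relu_maxpool(embeddings, kernel, bias=0.0):
    """Slide a [k x m] kernel over an [n x m] sentence, apply ReLU, keep the max."""
    k = len(kernel)
    activations = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]  # k consecutive word vectors
        total = bias
        for word_vec, kernel_row in zip(window, kernel):
            total += sum(w * x for w, x in zip(kernel_row, word_vec))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # 1-max pooling over all window positions


# A 4-word "sentence" with 2-dimensional embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# One filter per kernel size; each filter spans the full embedding width m = 2.
filters = {
    2: [[1.0, 0.0], [0.0, 1.0]],
    3: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}

# Concatenate the pooled value of every filter into one feature vector.
features = [conv_relu_maxpool(sentence, kernel) for kernel in filters.values()]
print(features)  # [2.0, 4.0]
```

Each filter contributes exactly one number regardless of sentence length, which is why the concatenated vector has a fixed size.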

Training the model

Let’s set the parameters of the model:

embed_num represents the maximum number of words in a comment (100 in this example).

embed_dim represents the size of a BERT embedding (768).

class_num is the number of toxicity labels to predict (6).

kernel_num is the number of filters for each convolution operation (e.g. 3 filters for the [2 x m] convolution).

kernel_sizes are the sizes of the convolutions, e.g. look at combinations of 2 words, 3 words, etc.

dropout is the fraction of hidden units randomly set to 0 at each update of the training phase. Tip: Make sure you disable dropout during the test/validation phase to get deterministic output.

static set to True means that we don’t calculate gradients of the embeddings and they stay static. If we set it to False, it would increase the number of parameters the model needs to learn and it could overfit.

embed_num = x_train.shape[1]
embed_dim = x_train.shape[2]
class_num = y_train.shape[1]
kernel_num = 3
kernel_sizes = [2, 3, 4]
dropout = 0.5
static = True

model = KimCNN(
    embed_num=embed_num,
    embed_dim=embed_dim,
    class_num=class_num,
    kernel_num=kernel_num,
    kernel_sizes=kernel_sizes,
    dropout=dropout,
    static=static,
)
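As a rough check on model size, the number of weights in the convolution and linear layers follows directly from these settings. The arithmetic below assumes the standard layer shapes (a Conv2d with kernel (K, D) and one input channel has Co * K * D weights plus Co biases). The nn.Embedding layer defined in the class is not counted, since forward() does not use it.

```python
embed_dim = 768           # size of a BERT embedding
class_num = 6             # number of toxicity labels
kernel_num = 3            # filters per kernel size
kernel_sizes = [2, 3, 4]

# Each filter of size [K x embed_dim] has K * embed_dim weights and one bias.
conv_params = sum(kernel_num * (k * embed_dim + 1) for k in kernel_sizes)

# The linear layer maps len(kernel_sizes) * kernel_num features to class_num outputs.
fc_inputs = len(kernel_sizes) * kernel_num
fc_params = fc_inputs * class_num + class_num

print(conv_params)              # 20745
print(fc_params)                # 60
print(conv_params + fc_params)  # 20805
```

Almost all of the trainable weights sit in the convolution filters, because each one spans the full embedding width.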

We train the model for 10 epochs with batch size set to 10 and the learning rate to 0.001. We use Adam optimizer with the BCE loss function (binary cross-entropy). Binary cross-entropy loss allows our model to assign independent probabilities to the labels, which is a necessity for multilabel classification problems.

n_epochs = 10

batch_size = 10

lr = 0.001

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

loss_fn = nn.BCELoss()
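To illustrate why a sigmoid with binary cross-entropy suits multilabel problems, the sketch below scores made-up logits for one comment in plain Python. It is a simplified stand-in for what a sigmoid followed by a mean-reduced binary cross-entropy computes, not the library code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def bce(probs, targets):
    """Mean binary cross-entropy over independent labels."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


logits = [3.0, 2.5, -4.0]   # made-up scores for three labels
targets = [1, 1, 0]         # the comment carries the first two labels

sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

print([round(p, 2) for p in sig])   # two labels can both be close to 1
print([round(p, 2) for p in soft])  # softmax forces the values to sum to 1
print(round(bce(sig, targets), 3))
```

With a sigmoid, each label gets its own probability, so two labels can be confidently predicted at once. Softmax would make them compete for a single unit of probability.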

The code below generates batches of data for training.

def generate_batch_data(x, y, batch_size):
    i, batch = 0, 0
    for batch, i in enumerate(range(0, len(x) - batch_size, batch_size), 1):
        x_batch = x[i : i + batch_size]
        y_batch = y[i : i + batch_size]
        yield x_batch, y_batch, batch
    if i + batch_size < len(x):
        yield x[i + batch_size :], y[i + batch_size :], batch + 1
    if batch == 0:
        yield x, y, 1
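The generator's behaviour is easiest to see on a toy input. The sketch below is a simplified version written for illustration (the name simple_batches is hypothetical) that is intended to behave the same way for non-empty inputs: it yields consecutive slices of at most batch_size items together with a 1-based batch counter, so the last batch may be smaller.

```python
def simple_batches(x, y, batch_size):
    """Yield (x_batch, y_batch, batch_number) with a 1-based batch counter."""
    for batch, start in enumerate(range(0, len(x), batch_size), 1):
        yield x[start:start + batch_size], y[start:start + batch_size], batch


toy_x = list(range(7))
toy_y = [v % 2 for v in toy_x]

for x_batch, y_batch, batch in simple_batches(toy_x, toy_y, batch_size=3):
    print(batch, x_batch, y_batch)
# 1 [0, 1, 2] [0, 1, 0]
# 2 [3, 4, 5] [1, 0, 1]
# 3 [6] [0]
```

Because the last value of the counter equals the number of batches, a training loop can divide the summed loss by it to get the mean loss per batch.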

Let’s train the model.

train_losses, val_losses = [], []

for epoch in range(n_epochs):
    start_time = time.time()
    train_loss = 0

    model.train(True)
    for x_batch, y_batch, batch in generate_batch_data(x_train, y_train, batch_size):
        y_pred = model(x_batch)
        optimizer.zero_grad()
        loss = loss_fn(y_pred, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    train_loss /= batch
    train_losses.append(train_loss)
    elapsed = time.time() - start_time

    model.eval()  # disable dropout for deterministic output

    # deactivate autograd engine to reduce memory usage and speed up computations
    with torch.no_grad():
        val_loss, batch = 0, 1
        for x_batch, y_batch, batch in generate_batch_data(x_val, y_val, batch_size):
            y_pred = model(x_batch)
            loss = loss_fn(y_pred, y_batch)
            val_loss += loss.item()
        val_loss /= batch
        val_losses.append(val_loss)

    print(
        "Epoch %d Train loss: %.2f. Validation loss: %.2f. Elapsed time: %.2fs."
        % (epoch + 1, train_losses[-1], val_losses[-1], elapsed)
    )

In the image below, we can observe that train and validation loss converge after 10 epochs.

plt.plot(train_losses, label="Training loss")

plt.plot(val_losses, label="Validation loss")

plt.legend()

plt.title("Losses")

Training and validation loss

Testing the model

The model is trained. We evaluate the model performance with the Area Under the Receiver Operating Characteristic Curve (ROC AUC) on the test set. scikit-learn’s implementation of AUC supports the binary and multilabel indicator format.

Let’s use the model to predict the labels for the test set.

model.eval()  # disable dropout for deterministic output

with torch.no_grad():  # deactivate autograd engine to reduce memory usage and speed up computations
    y_preds = []
    batch = 0
    for x_batch, y_batch, batch in generate_batch_data(x_test, y_test, batch_size):
        y_pred = model(x_batch)
        y_preds.extend(y_pred.cpu().numpy().tolist())
    y_preds_np = np.array(y_preds)

The model outputs 6 values (one for each toxicity label) between 0 and 1 for each comment. We can use 0.5 as a threshold to mark all values greater than 0.5 as predicted labels, but let’s calculate the AUC first.

y_preds_np
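Turning probabilities into labels with a threshold can be sketched as follows. The probabilities are made up and apply_threshold is a hypothetical helper; the 0.5 cut-off is the one mentioned above.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Mark a label as predicted when its probability exceeds the threshold."""
    return [[int(p > threshold) for p in row] for row in probabilities]


labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up model outputs for two comments.
toy_probs = [
    [0.91, 0.12, 0.77, 0.01, 0.64, 0.03],
    [0.04, 0.00, 0.02, 0.00, 0.01, 0.00],
]

for row in apply_threshold(toy_probs):
    print([name for name, flag in zip(labels, row) if flag])
# ['toxic', 'obscene', 'insult']
# []
```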

We extract real labels of toxicity threats for the test set. Real labels are binary values.

y_test_np = df_test[target_columns].values
y_test_np[1000:]

The AUC of a model is equal to the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example. The higher the AUC, the better (although it is not that simple, as we will see below). When AUC is close to 0.5, it means that the model has no label separation capacity whatsoever. When AUC is close to 0, it means that we need to invert predictions and it should work well :)
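The ranking interpretation can be checked directly on a handful of made-up scores. The helper below (pairwise_auc, a hypothetical name) counts, over all positive/negative pairs, how often the positive example gets the higher score, with ties counted as one half. It is a brute-force illustration only; the post itself relies on scikit-learn's roc_auc_score.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(pairwise_auc(y_true, y_score))

# Inverting the scores flips every pair, giving 1 - AUC.
print(pairwise_auc(y_true, [1 - s for s in y_score]))
```

The second call shows why an AUC close to 0 is as informative as one close to 1: inverting the predictions flips every pairwise comparison.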

Let’s calculate the AUC for each label.

auc_scores = roc_auc_score(y_test_np, y_preds_np, average=None)

df_accuracy = pd.DataFrame({"label": target_columns, "auc": auc_scores})

df_accuracy.sort_values('auc')[::-1]

In the table above, we can observe that the model achieves high AUC for every label. Note that AUC can be a misleading metric when working with an imbalanced dataset.

Imbalanced dataset

We say that a dataset is balanced when 50% of the labels belong to each class. The dataset is imbalanced when this ratio is closer to 90% to 10%. A known problem with models trained on imbalanced datasets is that they can report high accuracy without learning anything useful. For example, if the model always predicts 0, it achieves 90% accuracy.
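A quick, made-up example shows the effect: with 10% positive labels, a classifier that always predicts 0 reaches 90% accuracy while finding none of the positives. The accuracy and recall helpers are written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of positive examples that were predicted as positive."""
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return true_positives / sum(y_true)


# Made-up labels: 10 positives among 100 examples.
y_true = [1] * 10 + [0] * 90
always_zero = [0] * 100

print(accuracy(y_true, always_zero))  # 0.9
print(recall(y_true, always_zero))    # 0.0
```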

Let’s check if we have an imbalanced dataset.

positive_labels = df_train[target_columns].sum().sum()
positive_labels

2201

all_labels = df_train[target_columns].count().sum()
all_labels

60000

positive_labels/all_labels

0.03668333333333333

Only 2201 labels are positive out of 60000 labels. The dataset is imbalanced, so the reported accuracy above shouldn’t be taken too seriously.

Sanity check

Let’s do a sanity check to see if the model predicts all comments as 0 toxicity threats.

df_test_targets = df_test[target_columns]

df_pred_targets = pd.DataFrame(y_preds_np.round(), columns=target_columns, dtype=int)

df_sanity = df_test_targets.join(df_pred_targets, how='inner', rsuffix='_pred')

df_test_targets.sum()

df_pred_targets.sum()

We can observe that the model predicted 3 toxicity labels: toxic, obscene and insult, but it never predicted severe_toxic, threat and identity_hate. This doesn’t seem great, but at least it didn’t mark all comments with zeros.

df_sanity[df_sanity.toxic > 0][['toxic', 'toxic_pred']]

We see that the model correctly predicted some comments as toxic.

Conclusion

We trained a CNN with BERT embeddings for identifying hate speech. We used a relatively small dataset to make computation faster. Instead of BERT, we could use Word2Vec, which would speed up the transformation of words to embeddings. We spent zero time optimizing the model, as this is not the purpose of this post, so the reported accuracies shouldn’t be taken too seriously. More important are the pitfalls outlined along the way: imbalanced datasets, AUC and the dropout layer.

Instead of using novel tools like BERT, we could go old school with TF-IDF and Logistic Regression. Would you like to read a post about it? Let me know in the comments below.

References

[1] Yoon Kim, Convolutional Neural Networks for Sentence Classification (2014), https://arxiv.org/pdf/1408.5882.pdf

[2] Ye Zhang, A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification (2016), https://arxiv.org/pdf/1510.03820.pdf

[3] Jacob Devlin, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), https://arxiv.org/abs/1810.04805
