Work has been slow in the first week of the year, so I decided to try my hand at a Kaggle competition for the first time (yeah, I know I am late to the party). After signing up and looking around, I ended up on the Jigsaw Toxic Comment Classification Challenge. In case you only browse Medium and have no idea what a toxic comment is: the competition defines it as a comment that is rude, disrespectful, or otherwise likely to make someone leave a discussion.

This post describes my (kinda) successful attempt at training a ConvNet to classify a comment into one or more types of toxicity: threat, obscenity, insult, etc. (6 classes in total). Compared to the leaders' log-loss of 0.022, my simple model scored ~0.055. Not amazing, but pretty good for <100 lines of code with Keras! At the end of the post, I will also mention some meta-learnings from my first go at competitive ML :-).

Preprocessing the Text

The training data was provided as a CSV file with ~100k rows. Each row contained a unique ID, the comment text, and a 0/1 flag for each of the six toxicity classes.
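For reference, the layout looks like this (the six class columns are toxic, severe_toxic, obscene, threat, insult and identity_hate; the sample row below is made up):

id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
00aa11bb22cc33dd,"Why did you revert my edit?!",0,0,0,0,0,0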

import io
import sys

from backports import csv
import numpy as np

csv.field_size_limit(sys.maxsize)  # Helps in reading long texts


def get_texts_and_targets(filename):
    texts = []
    targets = []

    with io.open(filename, encoding='utf-8') as csvfile:
        readCSV = csv.reader(csvfile)
        for i, row in enumerate(readCSV):
            if i == 0:
                # Header row
                continue
            texts.append(row[1].strip().encode('ascii', 'replace'))
            targets.append(np.array([float(x) for x in row[2:]]))

    print("Total number of texts: %s" % len(texts))
    return texts, targets

Being a deep-learning noob, I started writing some kickass preprocessing code in NLTK. Turns out, Keras provides a handy Tokenizer class to deal with all the basic tasks such as special-character removal and conversion to lowercase. So I got lazy and just used that:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Max number of input words in any sample
MAX_SEQUENCE_LENGTH = 200
VALIDATION_SPLIT = 0.1


def get_datasets(texts, targets, tokenizer=None):
    if tokenizer is None:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(texts)

    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

    targets = np.asarray(targets)

    # Shuffle before splitting off the validation set
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    data = data[indices]
    targets = targets[indices]
    nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

    x_train = data[:-nb_validation_samples]
    y_train = targets[:-nb_validation_samples]
    x_val = data[-nb_validation_samples:]
    y_val = targets[-nb_validation_samples:]

    return tokenizer, word_index, x_train, y_train, x_val, y_val
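To see what that lazy preprocessing buys you, here is Tokenizer's default behaviour (lowercasing plus punctuation stripping) on a toy example; the exact indices depend on word frequencies, so treat them as illustrative:

from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(["You are SO rude!!", "so so rude..."])
print(t.word_index)   # {'so': 1, 'rude': 2, 'you': 3, 'are': 4}
print(t.texts_to_sequences(["You are SO rude!!"]))   # [[3, 4, 1, 2]]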

Word Embeddings

For word embeddings, I used the GloVe Twitter vectors with 100 dimensions. Other options were pre-trained vectors from Word2Vec or fastText. I tried Word2Vec and, like others, found that GloVe worked better for me. I did not get around to trying fastText, which remains a promising prospect, mainly because fastText builds vectors from character n-grams ('fractions' of words), and that might be useful for the misspelt words (and other OOV terms) commonly found in comments.

If you are used to the Gensim Python package like me, you can use its glove2word2vec script to convert the GloVe embeddings into word2vec format.
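The conversion itself is a one-liner; a minimal sketch, assuming you have downloaded glove.twitter.27B.100d.txt (the output file name is my own choice):

from gensim.scripts.glove2word2vec import glove2word2vec

# Prepends the "<vocab_size> <dimensions>" header that the word2vec text format expects
glove2word2vec('glove.twitter.27B.100d.txt', 'word2vec_twitter_glove.txt')

Once that is done, the vectors can be loaded pretty easily: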

import os

from gensim.models import KeyedVectors


def load_glove_model():
    # WORD2VEC_FOLDER points to wherever the converted file was saved
    word2vec = KeyedVectors.load_word2vec_format(
        os.path.join(WORD2VEC_FOLDER, 'word2vec_twitter_glove.txt'),
        binary=False)
    return word2vec

An embedding layer can be defined in Keras as:

from keras.layers import Embedding


def get_embedding_layer(word_index, gensim_model):
    embedding_dim = gensim_model.vector_size  # 100 for the Twitter GloVe vectors
    # Row i holds the vector for the word with index i (row 0 stays all-zeros, for padding)
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        if word in gensim_model.vocab:
            embedding_matrix[i] = gensim_model[word]
    embedding_layer = Embedding(len(word_index) + 1,
                                embedding_dim,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    return embedding_layer

Notice the trainable=True part in the above snippet. We could use the embeddings as they are, but fine-tuning them during training adjusts their semantic 'location' for our particular application. This is basically a form of transfer learning.
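If you want the frozen alternative for comparison, it is a one-line change (a minimal sketch reusing the names from the snippet above; the rest of the pipeline stays the same):

frozen_layer = Embedding(len(word_index) + 1,
                         embedding_dim,
                         weights=[embedding_matrix],
                         input_length=MAX_SEQUENCE_LENGTH,
                         trainable=False)  # GloVe vectors stay fixed during training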

The CNN

At this point you might be wondering why I used CNNs for a text-understanding task. One reason: I had never trained a CNN in Keras (and I wanted to). But more seriously, this article gives a good intuition about how 1-dimensional convolutions can be useful for processing text. In a 1-D convolution, you essentially go over patches of words instead of pixels (think of a sliding window of words, à la reading). For a visual feel (and to make this post more attractive), I have added this totally-original image:

[Image: a 1-D convolution filter sliding over a sequence of words]

CNNs are not particularly good at most NLP tasks, since they lose out on the sequential flow of information (which RNNs capture). But since the objective here boils down to recognizing 'blocks' of sentiment scattered across the text, they work decently well!

We use 2 Convolutional+Max-Pooling blocks followed by 3 dense layers:

from keras.layers import *
from keras.models import Model

N_TARGET_CLASSES = 6


def get_convnet_model(embedding_layer):
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    x = Conv1D(128, 5, activation='relu')(embedded_sequences)
    x = MaxPooling1D(5)(x)
    x = Conv1D(128, 5, activation='relu')(x)
    x = MaxPooling1D(5)(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    preds = Dense(N_TARGET_CLASSES, activation='sigmoid')(x)

    model = Model(sequence_input, preds)
    return model
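For intuition, here is how the tensor shapes evolve through the network (you can verify with model.summary(); the numbers follow from 'valid' convolutions of width 5 and pool size 5):

# Input:            (batch, 200)        token indices
# Embedding:        (batch, 200, 100)   one 100-d GloVe vector per token
# Conv1D(128, 5):   (batch, 196, 128)   200 - 5 + 1 = 196 window positions
# MaxPooling1D(5):  (batch, 39, 128)    floor(196 / 5) = 39
# Conv1D(128, 5):   (batch, 35, 128)    39 - 5 + 1 = 35
# MaxPooling1D(5):  (batch, 7, 128)     floor(35 / 5) = 7
# Flatten:          (batch, 896)        7 * 128
# Dense(128):       (batch, 128)
# Dense(64):        (batch, 64)
# Dense(6):         (batch, 6)          one sigmoid score per toxicity class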

Sigmoid (and not softmax) is the appropriate output activation here, since each sample can belong to multiple classes at once (a comment could be an insult and obscene at the same time).
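A quick numpy illustration of the difference, using made-up pre-activation scores: softmax forces the scores to compete for a single label, while per-class sigmoids can flag several classes at once.

import numpy as np

logits = np.array([2.0, -3.0, 1.5, -4.0, 2.5, -3.5])  # hypothetical scores

sigmoid = 1 / (1 + np.exp(-logits))
print(np.round(sigmoid, 2))  # [0.88 0.05 0.82 0.02 0.92 0.03] -> toxic, obscene AND insult

softmax = np.exp(logits) / np.exp(logits).sum()
print(np.round(softmax, 2))  # [0.31 0.   0.19 0.   0.5  0.  ] -> forced to sum to 1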

I tried using Dropout for regularization, but it did not seem to help with the scores. Therefore, I dropped the idea.

Training

I found that Adagrad (with its default settings) worked best for this use case. Keras has support for various optimizers, and I did not try tuning parameters such as the learning-rate decay (which could have reduced the error further). For a brief overview of the various optimization techniques out there, look at Ruder's awesome blog post.
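If you do want to tune those knobs, you can pass a configured optimizer object instead of the 'adagrad' string used in the training code below (a sketch with a made-up decay value; lr=0.01 is the Keras default):

from keras.optimizers import Adagrad

# decay shrinks the learning rate a little after each parameter update
optimizer = Adagrad(lr=0.01, decay=1e-6)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])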

texts, targets = get_texts_and_targets('train.csv')
tokenizer, word_index, x_train, y_train, x_val, y_val = get_datasets(texts, targets)

word2vec = load_glove_model()
embedding_layer = get_embedding_layer(word_index, word2vec)

model = get_convnet_model(embedding_layer)
model.compile(loss='binary_crossentropy',
              optimizer='adagrad',
              metrics=['accuracy'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=32, verbose=1)

The binary_crossentropy objective is Keras' version of log-loss (so you get the same value). Since I used pre-trained vectors and a dataset of ~85k training instances, 2 epochs were enough (based on the Keras logs, the loss seems to plateau in the second half of the second epoch).
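In case the equivalence isn't obvious, here is binary cross-entropy computed by hand for one sample (made-up predictions):

import numpy as np

y_true = np.array([1., 0., 1., 0., 1., 0.])         # toxic, obscene, insult
y_pred = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2])  # hypothetical sigmoid outputs

# Keras' binary_crossentropy averages the per-class log-loss terms
loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(round(loss, 4))  # 0.1775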

For the sake of brevity, I won't write out the full code I used for inference and building the output file (you can do it easily with model.predict; a rough sketch follows). The overall log-loss computed by Keras turned out to be around 0.055, which is not bad for a single-model approach.
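A minimal sketch of what that inference code could look like. I'm assuming test.csv rows are (id, comment_text) and that the submission wants the six class probabilities per ID; get_test_texts is a hypothetical variant of get_texts_and_targets that also returns the IDs and skips the missing target columns:

CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

ids, test_texts = get_test_texts('test.csv')  # hypothetical helper, see note above

# Reuse the tokenizer fitted on the training data
test_sequences = tokenizer.texts_to_sequences(test_texts)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

predictions = model.predict(test_data, batch_size=32, verbose=1)

with io.open('submission.csv', 'w', encoding='utf-8') as f:
    f.write(u'id,' + u','.join(CLASSES) + u'\n')
    for id_, probs in zip(ids, predictions):
        f.write(u'%s,%s\n' % (id_, u','.join(u'%.6f' % p for p in probs)))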