Problem definition and setup

Link to code

Looking for a small dataset to play with, I found a Kaggle competition from way back when: Detecting Insults in Social Commentary. The training set has 3000 datapoints, roughly 100 times smaller than the last natural language processing challenge I tackled.

The challenge: Identify whether a comment would be considered insulting to another participant in the conversation.

The comments were taken from commenting sites, message boards, etc., and were provided as CSV files.
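As a rough sketch, loading the data might look something like this (the 'Insult' and 'Comment' column names are my assumption about the layout of the competition's train.csv):

```python
import pandas as pd

# Load the Kaggle training data. The column names ('Insult', 'Comment')
# are assumed here; check train.csv for the exact layout.
train = pd.read_csv("train.csv")

comments = train["Comment"]  # raw comment strings
labels = train["Insult"]     # 1 = insulting, 0 = not insulting

print(len(train))  # ~3000 training points
```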

Here is an example of a non-insulting comment:

"Celebrity Big Brother IS a mistake."

And of an insulting comment:

"You are stuck on stupid obviously...give me a break and don\'t vote.\

moron"

Cleaning the data

As the above examples show, it was necessary to clean the sentences. I’m going to be using word embeddings to quantify the words, so I want to isolate the words in the sentence, without any of the fluff such as the line breaks or the apostrophes.

To do this, I removed all line breaks from the comments (the '\n' escape sequence was especially common in the data, so I specifically made a point to remove it), and only kept letters.

Doing this changed the above comment to:

You are stuck on stupid obviously give me a break and don t vote moron
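A minimal sketch of this cleaning step (my reconstruction, not necessarily the exact code used):

```python
import re

def clean_comment(comment):
    # Remove the literal '\n' escape sequences that litter the raw data.
    comment = comment.replace("\\n", " ")
    # Keep only letters; punctuation, digits and stray backslashes become
    # spaces, which are then collapsed.
    comment = re.sub(r"[^a-zA-Z]", " ", comment)
    return re.sub(r"\s+", " ", comment).strip()

print(clean_comment("You are stuck on stupid obviously...give me a break and don\\'t vote.\\n\\nmoron"))
# -> You are stuck on stupid obviously give me a break and don t vote moron
```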

Tokenizing

SpaCy is very good at dealing with words, and in particular is very good at tokenizing the data. What this means is that it is good at recognizing where the different words are in a sentence (which is useful, since they are not always delimited by whitespace).

Another very cool part of SpaCy is that it automatically assigns these tokens 300-dimensional vectors, based on the GloVe word embeddings (which I explore here). It’s therefore super easy for me to take a sentence and generate a matrix of embeddings from it.
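For example (assuming a spaCy model that ships with the 300-dimensional vectors, such as en_core_web_lg, is installed):

```python
import spacy

# Load a spaCy model with 300-dimensional word vectors.
nlp = spacy.load("en_core_web_lg")

doc = nlp("You are stuck on stupid obviously give me a break and don t vote moron")

for token in doc:
    print(token.text, token.vector.shape)  # each token gets a (300,) vector
```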

However, the intuitive approach to creating a matrix from a sentence using embeddings doesn’t work for small datasets. If I was training a neural network, my approach would simply be to append all the embedding vectors together:

The intuitive approach to generating a matrix using word embeddings. Each word has its own 300-dimensional embedding. They are all appended together to create a ‘sentence matrix’.

The problem with this is that as the sentence length increases, the number of features (i.e. the size of the matrix) explodes. The average sentence length in this dataset is 33 words; this would yield a matrix of size (33, 300), with 9900 elements.

Given my 3000 training points, this is begging to be overfit.

My solution was to average the word vectors element-wise across the sentence, so that I ended up with a single 300-dimensional ‘mean’ vector for each sentence:

This meant each input would have 300 features; this is far more reasonable.
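In code, this could be as simple as averaging the token vectors; conveniently, spaCy's doc.vector already does exactly this averaging. A sketch, carrying over the nlp object from above and assuming a list of already-cleaned comment strings called cleaned_comments:

```python
import numpy as np

def sentence_vector(doc):
    # Stack the (n_words, 300) matrix of token embeddings...
    matrix = np.array([token.vector for token in doc])
    # ...and average element-wise over the words to get one (300,) vector.
    return matrix.mean(axis=0)

X = np.array([sentence_vector(nlp(comment)) for comment in cleaned_comments])
print(X.shape)  # (n_comments, 300)
```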

Solving with scikit-learn

Link to code

The metric for this competition was AUC ROC. Roughly, the AUC ROC score can be interpreted as the probability that a randomly chosen insulting comment is scored as more likely to be an insult than a randomly chosen non-insulting one.

This is better than simple accuracy, because it accounts for skew in the dataset as well. For instance, if 95% of my comments were insults, then a classifier which classified every comment as insulting would have 95% accuracy, but it would be a useless classifier. AUC ROC avoids this.
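To make that concrete, here is the skewed example above in scikit-learn terms:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 95% of comments are insults; a classifier that flags everything as an
# insult gets 95% accuracy but tells us nothing.
y_true = np.array([1] * 95 + [0] * 5)
always_insult = np.ones(100)

print(accuracy_score(y_true, always_insult))  # 0.95
print(roc_auc_score(y_true, always_insult))   # 0.5, i.e. no better than chance
```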

With this in mind, let’s now investigate each of the algorithms individually.

Logistic Regression

How does it work? This classifier defines a coefficient for each of the variables in my input (so, since my input is a 300-dimensional vector, 300 coefficients are defined). This then defines a function:
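Concretely, the function is a weighted sum of the 300 input features, with one coefficient per feature:

$$f(X) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_{300} x_{300}$$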

This relationship does not have to be linear; different x terms could be multiplied together, for instance. However, LIBLINEAR, the library used by scikit-learn, does assume a linear relationship.

The coefficients are then trained so that the output is 1 if the input features come from an insult, and 0 if not. Oftentimes, a sigmoid function is applied to make this last step easier:
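The sigmoid takes that weighted sum f(X) and squashes it into the range (0, 1):

$$h(X) = \frac{1}{1 + e^{-f(X)}}$$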

This function is called the sigmoid function, and it restricts the output to between 0 and 1, which is ideal for classification, where something either is an insult (1) or is not (0). In particular, because the function is asymptotic, it allows f(X) to become very large or very small when the model is confident in a classification.

How did it do? Finetuning this model was quite easy, as it only has a single parameter which can really be changed: C, the inverse of the regularization strength (basically, regularization limits how large the coefficients can grow, which helps prevent overfitting).
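A minimal sketch of this in scikit-learn, tuning C with a cross-validated grid search (the grid is illustrative; X is the matrix of mean sentence vectors and y the insult labels):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the inverse regularization strength C with 5-fold cross validation,
# scoring on AUC ROC to match the competition metric.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```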

Logistic Regression yielded the following training curve:

These plots show the performance of the models on the training set and on a cross validation set (made from 20% of the data) as the training set grows from just 100 samples to all of the available training data (the remaining 80%). The size of the cross validation set remains fixed. Plotting model performance against training set size like this is a useful way to uncover how a model behaves, and to identify under- and overfitting.
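For what it's worth, scikit-learn's learning_curve utility can generate the numbers behind plots like these (a sketch; the actual plotting is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Score on AUC ROC as the training set grows; with cv=5, each validation
# fold is 20% of the data, matching the cross validation set described above.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(C=1.0), X, y,
    train_sizes=np.linspace(0.05, 1.0, 10),
    scoring="roc_auc", cv=5)

print(train_scores.mean(axis=1))  # training curve
print(valid_scores.mean(axis=1))  # cross validation curve
```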

The simplicity of this algorithm explains why it is so quick to train. Its success suggests that taking the mean of the word vectors in a sentence is a surprisingly effective way of capturing its meaning, since a linear function of that mean vector maps well onto whether or not the sentence was offensive.

Random Forest

How does it work? To understand a random forest, it’s helpful to first understand a decision tree. This post does a great job of explaining them, which I won’t try to repeat, but very broadly: given some dataset (e.g. some job offers, below), features are used to break the data into smaller and smaller subsets, until everything is classified as either ‘accept’ or ‘decline’ (or ‘insult’ or ‘not insult’, in my case). The most informative features are used higher up the tree; in the example below, salary would be the most indicative feature as to whether or not a job should be accepted or declined.

A random forest (intuitively) uses many decision trees: it randomly splits the data into (many) subsets, trains a decision tree on each of those subsets, and then combines their predictions to reach a conclusion.

How did it do? The important parameters to tune when training a random forest are the number of decision trees (around 100 is a good place to start) and the number of features each tree considers at each split (here, the square root of the number of features, about 17 for 300 features, is recommended).
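In scikit-learn, that starting point might look something like this (again a sketch with illustrative values, reusing the X and y from before):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 100 trees; each split considers sqrt(300) ≈ 17 of the 300 features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")

scores = cross_val_score(forest, X, y, scoring="roc_auc", cv=5)
print(scores.mean())
```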

This configuration proved the most successful for me, yielding the following training curve:

There’s some serious overfitting happening here; the training set has an AUC-ROC score of nearly 1, whilst the cross validation set had a final score of 0.87.

This does make sense when considering how random forests work: the individual decision trees overfit the training data (hence the near-perfect training score), but because their errors are largely uncorrelated, they should mostly cancel each other out when their predictions are combined, leaving an ensemble which does generalize to the cross validation set (and then to future data).

Support Vector Machines

How does it work? Logistic Regression uses linear coefficients to define a function, which is then used to predict the class of the data.

Instead of using linear coefficients on the raw inputs, a support vector machine compares each data point to a set of landmarks, measuring how close it is to each one. These similarities define a new feature space, which can then be used to predict the class of the data.

This is a little confusing, so let’s consider it for one-dimensional data (imagine if, instead of a length-300 input vector, it was length 1):
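A reconstruction of the feature being described, assuming the standard Gaussian (radial basis function) form that sigma belongs to: the new feature F1 measures how close the input X1 is to the landmark L1,

$$F_1 = \exp\left(-\frac{D^2}{2\sigma^2}\right), \qquad D = |X_1 - L_1|$$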

The sigma defines L1’s ‘circle of influence’. If it is larger, then F1 will stay large even when D, the distance between X1 and L1, is large. If it is smaller, then F1 will vanish as D grows.

My feature X1 has now been transformed into a new feature, F1, which I can then use in an equation much like the one for logistic regression:
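With the new feature standing in for the raw input (and, in general, one such feature per landmark), that equation would look like:

$$f(X) = \theta_0 + \theta_1 F_1 + \theta_2 F_2 + \dots$$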

This lets me separate my data with far more complex decision boundaries than a simple linear logistic regression, allowing for much better performance.

How did it do? As with logistic regression, the most important hyperparameter to tune is the regularization constant, C.

Note: SVC stands for support vector classification.
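A sketch of this in scikit-learn (the grid for C is illustrative; gamma, which plays the role of 1/(2σ²) above, is left at its default):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# RBF-kernel support vector classifier; tune the regularization constant C,
# again scoring on AUC ROC.
param_grid = {"C": [0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```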

With a final AUC ROC of 0.89, this was the best classifier. However, because it needs to define a whole new feature space on which to do the classification, it’s also much slower than the other algorithms, taking 14.8 seconds to train on this dataset (compared to 421 ms for logistic regression).

Comparison of Algorithms

How do these all do?

SVM is the best, but defining the new feature space incurs a significant computational cost, and it takes 35 times longer to train than a Logistic Regression (for the relatively small improvement of 0.01 AUC ROC).

Conclusion

A few takeaways from this: