Modeling!

Once we convert all our comments into TF-IDF word vectors, the next step is modeling. The end goal is to add a kind of auto-moderation algorithm to a chatroom, which requires a fast model capable of handling hundreds of thousands of concurrent viewers chatting with each other. To do this, we use basic logistic regression for classification. In essence, logistic regression builds on your middle school slope formula:

y = mx + b

where y is the probability that something will occur (the linear output is squashed between 0 and 1 using a sigmoid function), m is the change in y per unit change in the independent variable x, and b is the bias, or the y-intercept. This article is great for a more in-depth explanation of logistic regression.
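To make the squashing concrete, here is a minimal sketch of that formula in Python. The function names and the sample numbers are illustrative, not from the original article:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x: float, m: float, b: float) -> float:
    """Logistic regression with one feature: the slope formula mx + b,
    passed through a sigmoid to yield a probability."""
    return sigmoid(m * x + b)

# A large positive linear output maps close to 1; a large negative one close to 0.
print(predict_proba(2.0, m=3.0, b=-1.0))   # sigmoid(5.0), close to 1
print(predict_proba(-2.0, m=3.0, b=-1.0))  # sigmoid(-7.0), close to 0
```

With many features, mx + b simply becomes a dot product of a weight vector with the feature vector, plus the bias.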

So now we have all the pieces to build our models. At this point, we train by feeding each comment into our TF-IDF vectorizer, which turns that comment into a vector of 20,000 features (the maximum number of words we want to track as our vocabulary). We then pass those TF-IDF scores into the model. For simplicity, we train six separate models independently, one for each label. This gives us the following ROC scores!
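The training setup above can be sketched with scikit-learn. The toy comments and the two example label columns below are hypothetical stand-ins for the real dataset and its six labels; `max_features` matches the 20,000-word vocabulary mentioned in the text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy comments standing in for the real training data.
comments = [
    "you are awesome, great stream",
    "what a terrible, hateful thing to say",
    "loved that play, nicely done",
    "get out of here, nobody wants you",
]

# One binary target column per label; the real setup has six such columns.
labels = {
    "label_a": [0, 1, 0, 1],
    "label_b": [0, 0, 0, 1],
}

# Turn each comment into a sparse vector of TF-IDF scores.
vectorizer = TfidfVectorizer(max_features=20_000)
X = vectorizer.fit_transform(comments)

# Train one independent logistic regression per label.
models = {}
for name, y in labels.items():
    clf = LogisticRegression()
    clf.fit(X, y)
    models[name] = clf

# Score a new comment against every label independently.
new = vectorizer.transform(["what a terrible stream"])
for name, clf in models.items():
    print(name, clf.predict_proba(new)[0, 1])  # probability of the positive class
```

Because each model is just a weight vector and a bias, scoring a new comment is a single sparse dot product per label, which is what makes this approach fast enough for a busy chatroom.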