November 01, 2017 Ben Ellerby 3 min read

This is a running blog written during my attempt to build a Trump-Obama tweet classifier in under an hour, providing a quick guide to text classification using a Naive Bayesian approach without ‘recoding the wheel’.

Note: This is less a tutorial on Machine Learning / Classifier theory, and more a demonstration of how well-established classification techniques, along with simple libraries, can be used to build classifiers quickly for real-world data.

Live demo here: https://benellerby.github.io/trump-obama-tweet-classifier/

Data

First, we need labeled historic data, which machine learning approaches such as _Bayesian Classifiers_ rely on for training.

So we need to get past tweets of both presidents. Luckily Twitter gives us access to the last 3,200 tweets of a user, but it relies on a bit of scripting to automate the process.

Let’s start with Tweepy, a simple Python interface to the Twitter API that should speed up the scripting side.

Note: I had issues with pip install, so I cloned and built the package manually on OS X.

Now we need credentials so let’s go to Twitter, sign in and use the Twitter Application Console to create a new app and get credentials.

If using a placeholder for your app’s URL fails, point it at your public GitHub page; that’s what I’ve done.

Now, there is a challenge to get the tweets as there are multiple API calls to get the list of tweet IDs and then the tweet content. To save time I found a script and adapted it for Donald Trump and Obama respectively.

After running this twice we have two JSON files of the last 3,200 tweets of each president. Yet the objects are just listed back to back as “{...}{...}”, with no comma delimiters and no surrounding square brackets. Taken as a whole, each file is therefore invalid JSON and needs to be fixed.

In fact, it's in the JSON Lines format. As we won't have a scaling issue parsing this JSON, we can convert it to a standard JSON array and parse it directly in JS rather than splitting on "\n".

A quick regex turns the files into usable JSON arrays: replace “}{” with “},{” and wrap the whole list in square brackets.
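The conversion described above can be sketched in a few lines of JavaScript. This is a minimal sketch rather than the exact regex used for the demo: the function name `toJsonArray` and the sample string are made up for illustration, and it assumes that “}{” (optionally separated by whitespace such as a newline) only ever appears between two tweets and never inside a tweet's text.

```javascript
// Sketch: turn a dump of back-to-back JSON objects into one JSON array string.
// Assumption: "}{" (optionally with whitespace in between) only occurs
// at the boundary between two tweets, never inside a tweet's text.
function toJsonArray(raw) {
  const joined = raw.trim().replace(/\}\s*\{/g, '},{');
  return '[' + joined + ']';
}

// Hypothetical two-tweet dump standing in for the real file contents.
const sample = '{"text":"first tweet"}\n{"text":"second tweet"}';
const tweets = JSON.parse(toJsonArray(sample));

console.log(tweets.length);  // 2
console.log(tweets[1].text); // second tweet
```

If a tweet could contain “}{” in its text, parsing each line separately with JSON.parse would be the safer route, at the cost of the extra split.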

Building the Classifier

Next, we build a Naive Bayesian Classifier for our two categories, Trump and Obama.

The main decision to make is what feature set (attributes of each data element that are used in classification e.g. length, words) to use and how to implement it. Both of these are solved by the Bayes NPM package which provides a simple interface to build Naive Bayesian models from textual data.

The Bayes package uses term frequency as the single, relatively simple, feature for classification. Text input is tokenized (split up into individual words without punctuation) and then a frequency table is constructed, mapping each token to the number of times it’s used within the document (tweet).
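To make the idea of tokenizing and counting concrete, here is a simplified sketch. It is not the package's own code: the names `tokenize` and `frequencyTable` are illustrative, and the tokenizer only keeps ASCII letters, digits and underscores, which is a simplifying assumption for the example.

```javascript
// Simplified tokenizer: replace anything that is not an ASCII letter, digit,
// underscore or whitespace with a space, then split on whitespace.
function tokenize(text) {
  return text
    .replace(/[^A-Za-z0-9_\s]/g, ' ')
    .split(/\s+/)
    .filter(token => token.length > 0);
}

// Map each token to the number of times it appears in the document.
function frequencyTable(tokens) {
  const table = new Map();
  for (const token of tokens) {
    table.set(token, (table.get(token) || 0) + 1);
  }
  return table;
}

const table = frequencyTable(tokenize('Great day, a really great day!'));
console.log(table.get('day'));   // 2
console.log(table.get('great')); // 1 (this sketch is case-sensitive, so "Great" is counted separately)
```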

There are perhaps some improvements that could be made to the tokenisation such as stop word removal and stemming, but let’s see how this performs.
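As a rough illustration of the stop word idea, a tokenizer could lowercase the text and drop very common words before counting. Everything here is hypothetical: the function name, the tiny stop word list, and the ASCII-only cleanup are assumptions made for the sketch, and stemming is left out entirely. The package's README describes a `tokenizer` option, so a function like this could presumably be supplied when creating the classifier, but that is not what the demo does.

```javascript
// Illustrative stop word list only; a real list would be much longer.
const STOP_WORDS = new Set(['a', 'an', 'and', 'the', 'to', 'of', 'is', 'in']);

// Lowercase, strip punctuation, split on whitespace, then drop stop words.
function tokenizeWithoutStopWords(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9_\s]/g, ' ')
    .split(/\s+/)
    .filter(token => token.length > 0 && !STOP_WORDS.has(token));
}

console.log(tokenizeWithoutStopWords('The future of the country is bright'));
// [ 'future', 'country', 'bright' ]
```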

_([Check out the implementation](https://github.com/ttezel/bayes/blob/master/lib/naivebayes.js); it’s ~300 lines of very readable JavaScript.)_

We can open up a fresh NPM project, require the Bayes package and jump into importing the JSON files… so far so good. (Don’t forget to run npm init and install the package.)

    // Load the Bayes package and create an empty classifier
    var bayes = require('bayes');
    var classifier = bayes();

    // Load the two tweet files as JSON arrays
    var trumpTweets = require('./tweetFormatted.json');
    var obamaTweets = require('./tweetFormatted2.json');

Now we train the model by iterating over the presidents and then their tweets, using each tweet's text attribute to get its content. The classifier is trained with a simple call to the ‘learn’ function for each tweet.