The Data and Exploratory Data Analysis

I scraped the scripts from TheInfosphere.org, where fans of the show have organized transcripts from the first six seasons, using BeautifulSoup. In total, there were about 21,000 lines of script. The three main characters shared the bulk of the lines, at around 3,500–4,000 apiece, while the other characters had 1,500 or fewer. The formatting of the lines was not uniform, so it took quite a bit of work to process the data. One particular difficulty was that many episodes have slightly different versions of characters (for example, Hermes was Hermaphrodite in “Bender’s Game”), so I had to go through and group the many versions of Leela or Bender together.

After tokenizing the lines of script (a step needed for all of the analysis that follows), I used sklearn’s chi-square feature selection to take a look at the important unigrams and bigrams associated with each character. I highlighted some of the ones that gave me a laugh, like Zapp Brannigan saying “sham pag,” which is how he pronounces champagne.
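For reference, the selection step looks roughly like this. A minimal sketch, assuming the scraped lines sit in a DataFrame with hypothetical “line” and “character” columns (the file name is illustrative too):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

df = pd.read_csv('futurama_scripts.csv')  # hypothetical file

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(df['line'])
terms = vectorizer.get_feature_names_out()

# For each character, score every unigram/bigram against a binary
# "is this character speaking?" label and print the top terms.
for character in df['character'].unique():
    scores, _ = chi2(X, df['character'] == character)
    top = scores.argsort()[-10:][::-1]
    print(character, [terms[i] for i in top])
```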

I also had some fun with Tableau visualizations looking at word counts and associations. I noted how Fry and Bender both say the word “love” quite a bit, but it is clear who Fry loves. And of course, Bender loves himself, wants to kill humans, and loves his shiny metal ass.

Topic Modeling

Initially, my goal with the topic modeling was to produce topics that aligned with the characters themselves. This turned out to be quite difficult. I tried combinations of Latent Semantic Analysis, Non-Negative Matrix Factorization, and Latent Dirichlet Allocation to find topics that were easy to interpret and could separate out the characters.

Before I could use any of these strategies, I preprocessed the data using spaCy’s ‘en’ model for lemmatization, which reduces words down to their base forms. (spaCy also changes all pronouns to ‘-PRON-’, which makes the text easier to work with.) I then tokenized the data using sklearn’s CountVectorizer, which takes basic counts of words in the text, and TfidfVectorizer, which down-weights words that appear frequently across the data. Both methods produce a sparse matrix (unigrams and bigrams totaled around 17,000 words and phrases).
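A rough sketch of that lemmatization pass, assuming the older spaCy 2.x ‘en’ model (the version that maps pronouns to ‘-PRON-’):

```python
import spacy

# Load the small English model; disable the parser and NER since
# we only need lemmas here.
nlp = spacy.load('en', disable=['parser', 'ner'])

def lemmatize(line):
    # spaCy 2.x replaces every pronoun's lemma with '-PRON-'.
    return ' '.join(token.lemma_ for token in nlp(line))

lemmas = [lemmatize(line) for line in df['line']]
```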

vectorizers and LSA/NMF

Some key things I considered and tinkered with: min_df and max_df (the thresholds for how rare or common a term can be, which accept either integer document counts or decimal proportions), and the stop words I would add. Ultimately, lowering max_df to around 0.3 seemed to work best. I also tried filtering out character names via the stop words, but decided to keep the names, since certain characters were clearly much more associated with one another (Amy and Kif, for example, or Kif and Zapp).
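Here is roughly where the vectorizer settings landed; the extra stop words below are illustrative examples, not my exact list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

extra_stops = {'yeah', 'oh', 'hey', 'gonna'}  # illustrative additions
stop_words = list(ENGLISH_STOP_WORDS | extra_stops)

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2,       # drop terms appearing in fewer than 2 documents
    max_df=0.3,     # drop terms appearing in more than 30% of documents
    stop_words=stop_words,
)
doc_term = tfidf.fit_transform(lemmas)
```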

Then I fit LSA and NMF on the sparse matrix to reduce the number of features. LSA works by performing singular value decomposition on the sparse matrix, which factors the one matrix into three: the middle one is a diagonal matrix of singular values, and truncating to the largest of them leaves you with the new, reduced number of features (check out this DataCamp article for a more thorough explanation). NMF is similar, but the factor matrices must be non-negative (for a breakdown of NMF, go to this blog here). The topics produced by NMF seemed to work a little better, but they still did not make much sense to me.
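Both reductions are a couple of lines in sklearn. A minimal sketch on the tf-idf matrix from above, with the topic count chosen for illustration:

```python
from sklearn.decomposition import TruncatedSVD, NMF

n_topics = 6

# LSA: truncated SVD keeps only the top singular values/vectors.
lsa = TruncatedSVD(n_components=n_topics, random_state=42)
lsa_topics = lsa.fit_transform(doc_term)

# NMF: factors the matrix into two non-negative matrices.
nmf = NMF(n_components=n_topics, random_state=42)
nmf_topics = nmf.fit_transform(doc_term)

# Peek at the top words per NMF topic via the components_ matrix.
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(nmf.components_):
    top = comp.argsort()[-8:][::-1]
    print(f'Topic {i}:', [terms[j] for j in top])
```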

Next, I used Gensim’s Latent Dirichlet Allocation (LDA) to produce the topics. For text-based topic modeling, this process seems to be favored nowadays, even though it is not quite as simple to use. The code I used here was largely derived from this machinelearningplus post. To determine the best number of topics, I looked at the perplexity and coherence scores. Even though coherence peaked around 10 topics, I settled on 6 because that number was the most easily interpretable for me.
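The Gensim fit and the coherence sweep look roughly like this (the passes and topic range are illustrative, not my exact settings):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Tokenize the lemmatized lines from earlier.
tokenized = [line.split() for line in lemmas]
dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Fit LDA at several topic counts and compare coherence scores.
for k in range(2, 15):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=42, passes=10)
    coherence = CoherenceModel(model=lda, texts=tokenized,
                               dictionary=dictionary,
                               coherence='c_v').get_coherence()
    print(k, coherence)
```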

The pyLDAvis module creates a nice visualization of the topics produced by Gensim’s LDA. Ideally, you want the topics to appear as large, clearly separated circles. After numerous attempts at tinkering with min_df/max_df, stop words, and the number of topics, this was the best set of topics I could identify (the bubbles are indeed large and mostly non-overlapping).
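Producing the visualization is only a couple of lines, assuming the LDA model, corpus, and dictionary from the previous step:

```python
import pyLDAvis
import pyLDAvis.gensim  # renamed pyLDAvis.gensim_models in newer releases

vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_topics.html')
```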

pyLDAvis plot

I had to reduce max_df down to 0.3 to get topics that separated, and the largest bubbles were a little difficult to interpret, but the third through sixth bubbles seemed to be associated with Planet Express, Hermes, Leela, and Amy, which is the result I wanted.

Character Identification

Next, I wanted to see if I could identify characters given a line of script. This process would prove to be as “slippery as a snake in a sugarcane patch!” My hypothesis is that dialogue in a television show like this is not great for differentiating characters, since many lines are driven more by events than by character personalities. For example, a reaction like “whoa” is not helpful at all. Some lines are clearly associated with a specific character (“bite my shiny metal a**”), but these catchphrases account for only a small percentage of a character’s lines.

For the classification, I first tried using features produced by TfidfVectorizer with several machine learning models. Each line was labeled with the character who said it, and I limited the data to the top 10 characters. I also concatenated the lines into groups of 5 at a time, since there were far too many one- or two-word lines that were not helpful on their own. This reduced my number of observations from 21,000 down to about 4,000. I performed a train-test split using sklearn, trained my various models on the training data, and evaluated on the test data.
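A sketch of the grouping and split, again assuming the hypothetical line/character DataFrame from earlier (the split proportion here is a guess, not my exact setting):

```python
from sklearn.model_selection import train_test_split

# Keep only the ten most frequent speakers.
top10 = df['character'].value_counts().head(10).index
df10 = df[df['character'].isin(top10)]

# Concatenate every 5 consecutive lines from the same character.
grouped_lines, grouped_labels = [], []
for character, lines in df10.groupby('character')['line']:
    lines = lines.tolist()
    for i in range(0, len(lines), 5):
        grouped_lines.append(' '.join(lines[i:i + 5]))
        grouped_labels.append(character)

X_train, X_test, y_train, y_test = train_test_split(
    grouped_lines, grouped_labels, test_size=0.25,
    stratify=grouped_labels, random_state=42)
```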

Example of model fitting/testing

The models I tried included logistic regression, multinomial naive Bayes, a decision tree, and SGDClassifier (which, with its default hinge loss, fits a linear support vector machine), all of which are available through sklearn. The SGDClassifier was the big winner, with an accuracy score of 55% (not great, but better than the rest).
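The bake-off can be written as a simple loop over sklearn pipelines; the hyperparameters here are defaults, not a tuned configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'multinomial naive Bayes': MultinomialNB(),
    'decision tree': DecisionTreeClassifier(),
    'SGD (hinge loss = linear SVM)': SGDClassifier(),
}

# Fit each model on tf-idf features and report test accuracy.
for name, model in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), model)
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))
```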

SGD Classifier Recall Scores

As you can see from the confusion matrix, the dark diagonal indicates that the model did fairly well at identifying the characters correctly.

I then attempted more complicated techniques following the process laid out in this excellent blog by Susan Li (who was so helpful for this project overall), but sadly Word2Vec, Doc2Vec, and a Keras LSTM recurrent neural network did not perform nearly as well as my SGDClassifier did on simple TfidfVectorizer features. I will admit, I only trained the LSTM for a few epochs, so in the future I may try again with more fine-tuning and far more epochs. But the LSTM would play a much more significant part in the final portion of my project.

Text Generation

Having machines identify the characters wasn’t enough. I wanted a machine to completely understand the characters. In fact, I wanted to create a robot that could mimic a character’s speech. And of course, if I had a robot copy a character’s speech, I was going to be meta about it and mimic Bender.

To learn speech patterns from text, one of the preferred methods is a Long Short-Term Memory recurrent neural network (what a mouthful). Keras has a sequential LSTM model that is pretty effective for this. I essentially used the code laid out in this blog post by Jason Brownlee.
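The model definition closely follows the structure of Brownlee’s character-level example (the layer sizes here are his, and may differ from my final settings):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

seq_length = 100   # characters per input window
n_vocab = 50       # size of the character dictionary (illustrative)

# Two stacked LSTM layers, each followed by dropout, then a softmax
# over the character vocabulary.
model = Sequential()
model.add(LSTM(256, input_shape=(seq_length, 1), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(n_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```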

An LSTM works like a normal recurrent neural network, except it maintains a thread of information (the cell state) across time steps, with gate mechanisms that decide what fraction of the information to keep, discard, or send along. In my code, I used the LSTM to look at one character at a time. I lowercased all characters and replaced accented letters with their non-accented versions to reduce the size of the dictionary. Then I read in a random 100-letter sequence, predicted the next letter, cut off the first letter of the sequence, and repeated the process.
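In code, the sliding-window generation loop looks something like this; seed_sequence and int_to_char (the index-to-character map) are assumed to come from the data preparation:

```python
import numpy as np

pattern = list(seed_sequence)  # 100 integer-encoded characters to start from
generated = []

for _ in range(100):
    # Scale to [0, 1] and reshape to (samples, timesteps, features).
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
    probs = model.predict(x, verbose=0)[0]
    index = int(np.argmax(probs))          # greedy choice (see the variant below)
    generated.append(int_to_char[index])   # int_to_char: index -> character
    pattern.append(index)
    pattern = pattern[1:]                  # drop the first letter of the window

print(''.join(generated))
```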

I tried the “Adam” and “RMSProp” optimizers (RMSProp achieved lower log loss for me, which is the quantity we are trying to minimize), and I tried using 1 or 2 dropout layers with a rate of 0.2. After 100 epochs of training on the model with two dropout layers (13 hours on my poor 2080 Ti GPU), I achieved a log loss of 1.52. The model actually seemed to do well, sometimes creating output that had real words and made some grammatical sense. However, it would also get stuck in repetitive loops. Around a third of my attempts would repeat the phrase “i reckon i dont even know.”

In the original model, I simply chose the letter with the highest probability (the model produces a vector of probabilities over the letters), but to introduce some variability I tried sampling from the probabilities themselves (so the model would not always default to the highest-probability letter, but would still weigh the probabilities). This definitely fixed the repetition problem, but made my output much harder to read.
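The change amounts to a drop-in replacement for the argmax line in the loop above: draw an index from the predicted distribution instead of always taking the peak.

```python
import numpy as np

def sample_index(probs):
    # Renormalize to guard against floating-point drift, then draw an
    # index with probability proportional to the model's output.
    probs = np.asarray(probs, dtype='float64')
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```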

After my model generated text, I wanted to at least ensure some of the words were more readable (instead of something like ‘bugdy’, which in context is clearly closest to ‘buddy’). For this, I used SeatGeek’s fuzzywuzzy module, which uses the Levenshtein distance to match each word to its nearest real word. The Levenshtein distance is a metric based on the number of letters that need to be changed, added, or deleted to turn one word into another.
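The cleanup pass is short; a sketch assuming a known_words list built from the real scripts, with a match threshold that is a guess rather than a tuned value:

```python
from fuzzywuzzy import process

vocabulary = sorted(known_words)  # e.g. all words seen in the real scripts

def snap_to_vocab(word):
    # extractOne returns the best (match, score) pair from the choices.
    match, score = process.extractOne(word, vocabulary)
    return match if score >= 80 else word

print(snap_to_vocab('bugdy'))  # -> 'buddy', ideally
```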

To show off my bot, I created a Flask app that generates a random 100-character Bender line at a time. I hosted it with Heroku here, but it seems to be broken (it takes a really long time to load, if it loads at all, since the free tier is not nearly as fast as my laptop), so for the time being, here are some outputs from my local machine.
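The app itself is about as small as Flask gets; a bare-bones sketch assuming a hypothetical generate_line() helper that wraps the generation loop above:

```python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def bender_line():
    # generate_line() is assumed to run the LSTM loop and
    # return one 100-character Bender line as a string.
    return generate_line()

if __name__ == '__main__':
    app.run()
```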