For the past few months, the Curriculum team at Codecademy has been hard at work creating Machine Learning courses. While we all loved writing the courses, we also wanted to see what we could do with real-world data. As a result, we challenged each other to find a use for machine learning in a topic that we were passionate about.

For my project, I decided to see if I could make a machine learning model that would give me some insights into my favorite reality TV show, Survivor.

If you're not familiar with Survivor, it's a show where contestants are stranded on an island together. Each week one person is voted off the island. The last contestant remaining wins. In this post, I'll walk you through my entire process—from forming my primary question to presenting my final results.

Forming my question and gathering data

As I started working on this project, I had so many different ideas that I wanted to investigate. I created a huge list of interesting ways machine learning could be applied to my favorite TV show. However, as I began looking for available data, I quickly realized that I wouldn't be able to answer many of those questions.

For example, I really wanted to create a system that could identify who would become a "hero" or a "villain", but I didn't have a dataset with these labels. This is sometimes the unfortunate reality of machine learning. Gathering the data can be almost as difficult as building the model. You have to decide how much effort you want to put into the data collecting process. If your question is incredibly specific, you might need to collect your data in a very intentional way.

On the other hand, if your project is more exploratory, it can still be useful to use a dataset that isn't ideal in order to run some initial experiments.

In this case, I found a dataset containing the confessionals of every contestant from every season of the show. In Survivor, a confessional is a scene where a contestant is talking directly to the camera, away from the other competitors. This data was in a .csv file, so importing it into a Pandas DataFrame was relatively simple.
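A minimal sketch of that import step might look like the following. The column names and rows are invented for illustration, since the real file's layout isn't shown in this post, and an in-memory string stands in for the .csv file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; the column names and the
# confessional text are assumptions made up for this sketch.
csv_text = """season,episode,contestant,confessional
worlds-apart,1,Mike,I need to find an idol.
australia,1,Mike,Our camp is falling apart.
"""

# With a real file on disk this would be pd.read_csv("some_file.csv").
confessionals = pd.read_csv(io.StringIO(csv_text))

print(confessionals.shape)
print(confessionals.head())
```

Calling `head()` right after loading is a quick way to check that the columns were parsed the way you expected.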

With this data in hand, I was able to refine my list of questions to ones that focused on confessionals. Could I predict the winner of a season based on confessionals? Could I identify alliances or storylines based on what people were saying? By looking at my data and thinking about the questions I wanted to answer, I had a clear path. It was time to start coding!

Cleaning the data

I now had access to my data as a Pandas DataFrame and knew what questions I wanted to answer with it. However, there was still a ton of work that I needed to do in order to feed the data into a machine learning model.

For example, a problem that needed to be solved was how to handle contestants with the same name. What if I wanted to include data from two different seasons, where each season had a contestant named "Mike"?

As it stood, the confessionals from the two different Mikes would be grouped together into the same column. To solve this, I changed the column names in my DataFrame to include the season name in addition to the contestant's name. Instead of one column named "Mike", I now had two columns named "Mike-worlds-apart" and "Mike-australia".
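Here is a rough sketch of that renaming step. It assumes one small DataFrame per season with one column per contestant; the helper `add_season_suffix`, the second contestant names, and the placeholder text are all hypothetical.

```python
import pandas as pd

# Toy per-season tables: one column per contestant, one row per episode.
# The layout and the text are invented for illustration.
worlds_apart = pd.DataFrame({
    "Mike": ["episode 1 text", "episode 2 text"],
    "Alex": ["episode 1 text", "episode 2 text"],
})
australia = pd.DataFrame({
    "Mike": ["episode 1 text", "episode 2 text"],
    "Sam": ["episode 1 text", "episode 2 text"],
})


def add_season_suffix(df, season):
    """Return a copy whose column names look like 'Mike-worlds-apart'."""
    return df.rename(columns=lambda name: f"{name}-{season}")


# Place the renamed seasons side by side; the two Mikes no longer collide.
combined = pd.concat(
    [
        add_season_suffix(worlds_apart, "worlds-apart"),
        add_season_suffix(australia, "australia"),
    ],
    axis=1,
)

print(list(combined.columns))
```

Checking `combined.columns.is_unique` afterwards is a cheap way to confirm that no two contestants still share a column name.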

This example is very specific to my Survivor dataset, but it highlights a common problem in almost all machine learning projects. There will usually be some small quirk in your dataset that needs to be fixed before moving on to the fun part—running the model!

I've found that there are two major ways to fix these issues before they become bigger headaches. The first is to rely on your content knowledge. If you're able to anticipate an issue like two contestants sharing a name, you can fix it before it causes problems later on.

The second tactic is to gain a strong understanding of your dataset. If you didn't collect the data yourself, really take some time to look through what you have. Which columns have troublesome NaN (not a number) values? Which columns can be removed entirely? Are there useful columns that you can add based on the data you already have?
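Those three questions map onto a few Pandas calls. The sketch below uses a tiny invented DataFrame; the `notes` and `n_confessionals` columns are hypothetical and only exist to show the idea.

```python
import pandas as pd

# Invented data: one column has a gap, and one column is entirely empty.
df = pd.DataFrame({
    "Mike-worlds-apart": ["some text", None, "more text"],
    "Mike-australia": ["some text", "more text", "even more"],
    "notes": [None, None, None],
})

# Which columns have troublesome NaN values? Count them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Which columns can be removed entirely? Drop the ones with no data at all.
trimmed = df.dropna(axis="columns", how="all")

# Are there useful columns to add? Here: how many contestants spoke in each row.
trimmed = trimmed.assign(n_confessionals=trimmed.notna().sum(axis=1))
print(trimmed)
```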

Taking the time to correctly format the data might not be the most exciting part of the machine learning process, but it is one of the most important.

Making the model

Now that I finally had my data in a usable format, it was time to start using some machine learning models.

The fact that my data was text-based helped narrow down which models to try. I first tried using a Naive Bayes classifier, an algorithm that is often used to detect spam emails. I thought if a Naive Bayes model could detect spam, maybe it could detect a Survivor winner based on their confessionals.

This was not particularly successful. Because there are so few Survivor winners compared to Survivor losers, the system almost always predicted that a contestant would lose the game.
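To give a sense of what an attempt like this looks like in code, here is a minimal sketch using scikit-learn. It is not the exact model from my project: the choice of `CountVectorizer` plus `MultinomialNB`, the snippets of text, and the labels are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented confessional snippets. Label 1 = winner, 0 = everyone else.
# Note how lopsided the labels are, just like a real season.
texts = [
    "i need to keep my alliance together",
    "i am so hungry and tired out here",
    "nobody on this tribe listens to me",
    "i found the idol and told no one",
    "we lost the challenge again today",
    "i want to make a big move tonight",
]
labels = [0, 0, 0, 1, 0, 0]

# Turn each snippet into word counts, then fit a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

new_texts = ["i think my alliance is solid", "i am hiding the idol"]
predictions = model.predict(new_texts)
print(predictions)
```

With so few positive examples, the learned prior leans heavily toward the "everyone else" class, which is the behavior described above.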

This is a great demonstration of class imbalance: when one class is far rarer than the other, a model can look accurate simply by always predicting the common one. You can imagine this problem cropping up in other systems that try to predict rare events. For example, trying to predict whether a tumor is malignant can be tricky if most tumors are benign.
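A little arithmetic shows why this is so misleading. The sketch below assumes a season with 18 contestants and exactly one winner (the cast size is an assumption chosen for illustration) and scores a "model" that always predicts a loss.

```python
# Assumed season size for illustration: 18 contestants, exactly one winner.
true_labels = ["winner"] + ["loser"] * 17

# A "model" that ignores its input and always predicts the majority class.
predicted = ["loser"] * len(true_labels)

pairs = list(zip(true_labels, predicted))

# Accuracy: the fraction of contestants labeled correctly.
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Recall for the winner class: the fraction of actual winners that were found.
winners_found = sum(t == "winner" and p == "winner" for t, p in pairs)
recall = winners_found / true_labels.count("winner")

print(f"accuracy: {accuracy:.3f}")
print(f"winner recall: {recall:.3f}")
```

The accuracy comes out to 17/18, which sounds impressive until you notice that the recall for winners is zero. For rare-event problems, accuracy alone says very little.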

I tried a few more models, all with unimpressive results. At this point, I was stumped, so I asked for help. I went to two different sources for inspiration.

First, I went to Ian, another Curriculum Developer at Codecademy. Ian knows nothing about Survivor (yet…), but he's a machine learning wiz. I described the data I had and the kinds of questions I was interested in, and Ian was able to show me some projects he had worked on that were similar.

I also took to the ultimate source of Survivor knowledge—Survivor Twitter. Getting the chance to talk with other people who really understood the subject area was invaluable. My Survivor friends were able to help me refine the questions I was asking.

Have you experimented with other predictions? Merge vs Non-merge, top 5 versus everyone else? Stuff like that. Maybe even top 3. — Sean Falconer (@seanfalconer) September 12, 2018

After getting some help, I finally ended up with a model that worked reasonably well. I used non-negative matrix factorization (NMF), an unsupervised technique that can be used to cluster text. My system could take a list of players and cluster them into groups based on the content of their confessionals. These groups were reasonable approximations of alliances and storylines throughout the season.

The image below shows the raw data when clustering an episode from season 32 into 6 clusters.

Presenting the results

I had my raw results. Now I needed to turn that information into something other people could understand—nobody wants to look at what my terminal is spitting out. Figuring out how to display my results was a fun challenge that involved a few Matplotlib techniques I had never used before. I ultimately ended up with the gif you see below. (Spoilers ahead for Survivor Cagayan!)

This gif shows how alliances and storylines changed over the course of Survivor Cagayan. For every episode, if a contestant is entirely in cluster one, they will be on the far left side of the graph. If they're entirely in cluster two, they'll be on the far right side of the graph. They can be anywhere in between if they have qualities of both clusters. On the x-axis you can see some of the most important words for each cluster.
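One simple way to turn two cluster weights into a left-to-right position is to take the share of the total weight that belongs to cluster two. The function below is a sketch of that idea; the name `horizontal_position`, the formula, and the example weights are assumptions, not necessarily what produced the gif.

```python
def horizontal_position(weight_one, weight_two):
    """Map two non-negative cluster weights to a position between 0 and 1.

    0.0 means entirely cluster one (far left);
    1.0 means entirely cluster two (far right).
    """
    total = weight_one + weight_two
    if total == 0:
        # No signal for either cluster: put the player in the middle.
        # This fallback is a choice made for this sketch.
        return 0.5
    return weight_two / total


# Invented weights for three players.
weights = {
    "player_a": (0.9, 0.0),
    "player_b": (0.0, 0.4),
    "player_c": (0.3, 0.3),
}

positions = {
    name: horizontal_position(w1, w2) for name, (w1, w2) in weights.items()
}
print(positions)
```

Positions like these could then be handed to a Matplotlib scatter plot, one frame per episode.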

Ultimately I feel like these clusters represented storylines of the season. People would appear in a cluster together if they talked about each other a lot. They weren't necessarily voting together; in fact, they were sometimes bitter rivals!

This is especially clear in the clusters from the third episode of the season. Three clusters appear—one to the left, one in the middle, and one to the right. At this point in the season, the contestants were split into three teams. Those three clusters are almost identical to the three teams!

Further work

There is so much more I want to do with this project. I would love to see what clusters appear if you mix and match contestants from different seasons. Could we find clusters that represent personality types? Maybe the clusters would reflect a certain style of gameplay?

In order to answer these questions, I would need to iterate on the steps above. If I mix and match contestants from different seasons, I might want to remove all names from the confessionals—it doesn't matter who they're talking about, it matters what they're talking about. I'm also curious about what would happen if I played with the time the confessional was said. Maybe confessionals later in an episode are more significant than earlier ones.
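Removing names could start out as simply as the sketch below, which swaps every known contestant name for a placeholder. The helper `remove_names`, the placeholder token, and the example sentence are hypothetical; a real version would also need to handle nicknames and misspellings.

```python
import re


def remove_names(text, names, placeholder="PERSON"):
    """Replace each whole-word occurrence of a known name with a placeholder."""
    if not names:
        return text
    # \b keeps "Mike" from matching inside a longer word such as "Mikey".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


names = ["Mike", "Alex"]
text = "I trust Mike, but Alex is playing too hard."
cleaned = remove_names(text, names)
print(cleaned)
```

Using a placeholder instead of deleting the name outright keeps the sentence structure intact, so the model still sees that someone was being talked about.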

Thank you to Reddit user /u/m4milo for giving us access to the data. If you're a Survivor fan and want to see some other visualizations, get in contact with me on Twitter.

Be sure to also check out our other machine learning project analyzing Taylor Swift lyrics.