For the past few months, the Curriculum team at Codecademy has been hard at work creating Machine Learning courses. While we all loved writing the courses, we also wanted to see what we could do with real-world data. As a result, we challenged each other to find a use for machine learning in a topic that we were passionate about. For me, that's music.

It's said that popular music is a reflection of society, a barometer for our collective wants, fears, and emotional states. Others are of the belief that music is more a reflection of the artist, a diary that's been flung from the nightstand drawer into the media frenzy of our modern world. In either case, music can serve as an insight into the human mind in ways that many other mediums cannot.

One tool we can use to dig further into lyric-based music is natural language processing, or NLP, a subset of artificial intelligence devoted to the analysis of language. I wanted to use NLP to analyze the body of work of a popular artist with an intriguing history: Taylor Swift.

Forming my question and gathering data

For my analysis, I worked with a Kaggle dataset containing all of Taylor Swift's lyrics, from her 2006 eponymous album to her most recent release, 2017's Reputation.

As someone only familiar with her bigger hits, I was interested in learning more about Taylor's progression as an artist and person. What are the core themes she addresses, and how have they changed as she's grown from a teenage country sweetheart into an international pop sensation? And what are the deeper connections in the word choices she makes in her songs?

Cleaning the data

The original dataset contains 4,862 rows, one for each line of lyrics Taylor has sung, along with the track title and album for that line. Since I was interested in analyzing themes on a song-by-song basis, I had to aggregate the lyrics up to the song level using Pandas.

Transforming your data to the right level of granularity for the purposes of your analysis or machine learning project is a common task, and will often require some level of preprocessing and experimentation with Pandas to get just right.
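As a minimal sketch of that aggregation step, assuming the line-level data has columns named album, track_title, and lyric (hypothetical names, with placeholder text standing in for real lyrics), a groupby can join every line of a song into one string:

```python
import pandas as pd

# Toy stand-in for the line-level dataset. Column names and values are
# assumptions for illustration, not the real Kaggle schema.
lines = pd.DataFrame({
    "album": ["Album X", "Album X", "Album X", "Album Y"],
    "track_title": ["Song A", "Song A", "Song B", "Song C"],
    "lyric": [
        "first line of song a",
        "second line of song a",
        "only line of song b",
        "only line of song c",
    ],
})

# Collapse one-row-per-line into one-row-per-song by joining the lines.
# sort=False keeps songs in their order of first appearance.
songs = (
    lines.groupby(["album", "track_title"], sort=False)["lyric"]
    .agg(" ".join)
    .reset_index()
)

print(songs)
```

Each row of songs is now a single document, which is the granularity a song-level topic model needs.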

One tricky aspect of NLP projects is that all texts analyzed will contain a variety of words that do not provide any meaningful information in terms of detecting underlying structure or themes. These can be common words such as "I", "me", or "my", as well as specific words that appear frequently in the entire collection of text that is being studied, known as the corpus.

For Taylor's songs, these can be words such as "oh" and "yeah". In industry, these words are called stop words, and removing them from our corpus before analysis is a helpful step toward achieving better results!
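Here is a rough sketch of stop word removal in plain Python. The stop list below is a tiny illustrative sample, not the list used for this project, and the tokenizer is a simple assumption:

```python
import re

# Illustrative stop list only: a few common English words plus
# corpus-specific fillers such as "oh" and "yeah".
STOP_WORDS = {"i", "me", "my", "the", "a", "and", "you", "oh", "yeah"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]

cleaned = remove_stop_words("Oh yeah, I told you the story of my town")
print(cleaned)
```

In practice, a longer general-purpose stop list would usually be combined with the corpus-specific additions.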

Making the model

To understand the thematic changes in Taylor's music over time, I decided to build a topic model based on her song lyrics. Topic modeling is a process by which we find latent, or hidden, topics in a series of documents.

What does this mean? It means that by looking at a series of documents—in this case, the songs in Taylor's discography—we can find sets of words that often co-occur, forming cohesive "topics" that are prevalent in certain songs from throughout her career. Once we define these topics and the words that compose them, we can then track how prevalent these topics are over time, indicating tonal shifts in Taylor's music and thus reflecting back on her life.

The first step to building a topic model is to extract features from the corpus for the model to learn from. In NLP, a common technique is the bag-of-words model. A bag-of-words model counts how often each word appears in a document, with each unique word being its own feature and its frequency being the value.
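To make the idea concrete, here is a small bag-of-words sketch using only the standard library. The two "songs" are made-up strings, and a real project would more likely use a library vectorizer:

```python
from collections import Counter

# Made-up documents standing in for cleaned song lyrics.
docs = {
    "song_a": "love story love town",
    "song_b": "dance night dance dance",
}

# One Counter per document: word -> number of occurrences.
bags = {title: Counter(text.split()) for title, text in docs.items()}

# The vocabulary is every unique word in the corpus, in a fixed order.
vocab = sorted(set().union(*bags.values()))

# Document-term matrix: one row per song, one column per vocabulary word.
matrix = [[bags[title][word] for word in vocab] for title in docs]

print(vocab)
print(matrix)
```

Note that word order is discarded entirely; only the counts survive, which is where the "bag" in the name comes from.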

An even better means of feature extraction that digs a bit deeper is tf-idf, or term frequency-inverse document frequency. This method down-weights the counts of words that appear in many documents across the corpus, since such common words are less likely to provide insight into the specific topics of any one document. Using tf-idf, we are able to identify features for each song that represent how important each word is to that song.
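The sketch below shows this down-weighting with scikit-learn's TfidfVectorizer on three made-up documents. The library choice and default settings are assumptions, since the article does not say which implementation was used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "love" appears in every one, the other words in only one.
docs = ["love story town", "love dance night", "love truck road"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: rows = songs, columns = words

# Look up the learned inverse document frequency for two words.
love_idf = vectorizer.idf_[vectorizer.vocabulary_["love"]]
truck_idf = vectorizer.idf_[vectorizer.vocabulary_["truck"]]

# Compare the weights of a common word and a rarer word within the first song.
row0 = tfidf.toarray()[0]
love_weight = row0[vectorizer.vocabulary_["love"]]
story_weight = row0[vectorizer.vocabulary_["story"]]

print(love_idf, truck_idf)
print(love_weight, story_weight)
```

Because "love" occurs in every document, it gets the lowest idf and therefore a smaller weight than "story" within the first song, even though both words appear there exactly once.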

Now that I had my features, I could create a topic model! The modeling technique I chose for this project is NMF. NMF, or non-negative matrix factorization, is an algorithm that we can use to pull out our topics, or co-occurring word groupings, and the prevalence of these topics across each song in Taylor's discography.

The first part of the output from NMF is the set of words that make up each topic. At this point in the ML process, we get to use those creative juices! By checking the top 10 words in each topic, you can make an executive decision about what topic or idea is represented by the words.
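A minimal sketch of this step with scikit-learn's NMF follows. The corpus is invented, and the number of topics, the initialization, and the top-3 cutoff are assumptions chosen to keep the example small:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus with two loose themes.
docs = [
    "love heart kiss beautiful",
    "love kiss heart dream",
    "beautiful love dream heart",
    "dance night party lights",
    "dance party night music",
    "music lights dance night",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Factor X into W (documents x topics) and H (topics x words).
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_

# Recover the vocabulary in column order, then list the top words per topic.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_words = [[vocab[i] for i in row.argsort()[::-1][:3]] for row in H]

for topic_index, words in enumerate(top_words):
    print(topic_index, words)
```

The model only returns word lists. Naming each topic is still a human judgment call.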

Based on the words and my knowledge of Taylor Swift, I came up with the topics below:

The other piece of output from NMF is a document-topic matrix. In this matrix, every row is a song, every column is a topic, and the value is a relative score of how much the topic exists in a specific song.

I wanted to see how often each topic appears in each of Taylor's songs, but I also wanted to set a threshold for how high a topic score needs to be in order for a song to be labeled with that topic. After playing around with the topic score threshold, I decided to set the threshold at 0.1.
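Applying such a threshold can be sketched as follows. The scores are invented, the topic names are borrowed from this article for illustration, and treating a score equal to the threshold as a match is an assumption:

```python
import numpy as np

# Hypothetical document-topic scores: rows are songs, columns are topics.
doc_topic = np.array([
    [0.42, 0.03, 0.00],
    [0.08, 0.15, 0.31],
    [0.05, 0.09, 0.02],
])
topic_names = ["Love and Beauty", "Dancing", "Growing Up"]

THRESHOLD = 0.1  # a song is labeled with a topic when its score reaches this value

# Boolean matrix: True where a song's score for a topic clears the threshold.
labels = doc_topic >= THRESHOLD

# Translate each boolean row into a list of topic names for that song.
song_topics = [
    [name for name, hit in zip(topic_names, row) if hit]
    for row in labels
]

print(song_topics)
```

A song can carry several topics or none at all, which is why the threshold is worth experimenting with.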

Now that I had this transformed matrix of topics and songs, I was able to focus my analysis on each album/year. After grouping by year and summing the count of songs across each topic, I had what I was looking for: the number of songs in each topic per album!
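The grouping step might look like the sketch below, assuming a table with one row per song, a year column, and a 0/1 indicator column per topic. The layout and the values are hypothetical:

```python
import pandas as pd

# Hypothetical labeled songs: 1 means the song was tagged with that topic.
labeled = pd.DataFrame({
    "year": [2008, 2008, 2012, 2012, 2012],
    "Love and Beauty": [1, 0, 1, 1, 0],
    "Bad/Remorse": [0, 1, 0, 1, 1],
})

# Summing the indicators within each year gives songs-per-topic-per-year.
topic_counts = labeled.groupby("year").sum()

print(topic_counts)
```

The resulting table has one row per year and one column per topic, which is a convenient shape for plotting topics over time.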

Presenting the results

From the topic count of songs per album, I was able to construct the "Song Topics over Time" graph below:

When it came to interpreting and validating my results, I referred to Codecademy's in-house Taylor Swift expert, my colleague Laura. With Laura's deep knowledge of both Taylor's catalog and life, we were able to match the changes in topic density over time seen above with the events of Taylor's life.

As Taylor progresses from a country artist into a pop artist, we see an increase in the content of songs related to dancing. When Taylor moves out of her family home into her first apartment just before 2010, we see the greatest prevalence of words related to growing up. Throughout her career, we see an interesting fight between the topics of Love and Beauty versus Bad/Remorse. With each new romance and subsequent heartbreak Taylor experiences, these topics continue to be at war with each other in terms of dominance in her music.

2012 brought Red, arguably Taylor's best and most popular album, and with it the greatest number of songs with the topic of Love and Beauty. Other interesting highlights include a spike in the Bad/Remorse topic during 2014, the time when Taylor was infamously at odds with Kanye West and Katy Perry. In 2017, Taylor seems to be less contemplative than before, indicating a greater sense of self and confidence.

For all its greatness, topic modeling is not perfect. According to Laura, during Taylor's Reputation era, she was experiencing the greatest level of love, beauty, and acceptance in her personal life. This doesn't seem to match up with what our topic model says, so perhaps some finer tuning is needed.

Making a new model

Now that I had my complete topic model, I wanted to create a second model that looked at the deeper relationship between individual words, rather than the overarching topics of Taylor's songs.

To do this I used a modeling technique called word2vec. With this model, we can map each word that appears in Taylor's lyrics to a 100-dimensional vector space, where semantically similar words are mapped to nearby points. We can then look at the similarity of words by comparing the distance between their mapped points. This mapping of a word to a vector space is called a word embedding. With this kind of model, we are able to see how similar certain words are to each other with respect to Taylor's songwriting style.
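To illustrate how embeddings are compared, here is a sketch using cosine similarity, a common choice for word vectors. The article only says distances were compared, so the measure is an assumption, and the 3-dimensional vectors are invented rather than taken from a trained word2vec model:

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy embeddings. Real ones would have 100 dimensions
# and come from a model trained on the lyrics.
embeddings = {
    "bad": [0.9, 0.1, 0.0],
    "blood": [0.8, 0.2, 0.1],
    "sing": [0.0, 0.3, 0.9],
}

close_pair = cosine_similarity(embeddings["bad"], embeddings["blood"])
far_pair = cosine_similarity(embeddings["bad"], embeddings["sing"])

print(close_pair, far_pair)
```

A score near 1 means two words point in nearly the same direction in the embedding space, while a score near 0 means they are unrelated in this model.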

Given these word embeddings, I wanted to find a way to visualize which words are related and which do not show a connection. Enter everyone's favorite high-dimensional visualization tool, t-SNE!

t-SNE, or t-Distributed Stochastic Neighbor Embedding, is a dimensionality reduction technique designed for visualizing higher dimensional data in a 2-D space.

What does a 100-dimensional word vector look like? It's hard to tell! I could spend hours researching the intricacies of higher dimensional spaces, but finding a way to represent this data in 2-D can be much more useful to a simple-minded human like me. By putting the Taylor-specific word embeddings into t-SNE, we can explore Taylor's word choices!
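A rough sketch of the reduction step with scikit-learn's TSNE follows. Random vectors stand in for the real word embeddings, and the perplexity and initialization values are assumptions picked for a tiny example:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 30 word vectors of 100 dimensions each.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 100))

# Perplexity must be smaller than the number of points being embedded.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
coords = tsne.fit_transform(vectors)  # one (x, y) pair per word

print(coords.shape)
```

Each row of coords can then be drawn on a scatter plot and annotated with its word. Because t-SNE mainly preserves local neighborhoods, distances between far-apart clusters should be read with caution.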

Presenting the results

By observing the words that cluster together in the t-SNE below, we get an idea of how closely related words are in terms of how Taylor uses them. Remember, the words are clustered based on their embeddings' similarities.

Let's zoom in on a cluster from the bottom right to see what is going on.

Now you might not claim to be a fan of Taylor Swift, but most would be sure to recognize this clustering of words as coming from Taylor's song "We Are Never Ever Getting Back Together". C'mon, you know this one!

Other interesting clusterings include "sing" and "loving" on the bottom left, suggesting Taylor's affinity for the talent that has brought her fame, as well as "bad" and "blood" at the bottom center, alluding to her song with Kendrick Lamar, "Bad Blood."

Beyond the clusters, we can presume that words sitting on their own are used in ways unlike the rest of Taylor's diction. Both "sad" and "heart", more isolated in the t-SNE than most terms, popped out to me as provocative words that Taylor seems to use in her songwriting uniquely and with great intent.

Further work

The analysis done here is just the start of all the cool things I could do with this data set. Given my topic model, I could create a recommendation engine to help listeners discover new Taylor Swift songs based on their favorites. I could also dig deeper into the syntax of Taylor's lyrics, performing a grammatical analysis and then creating a song generator to make my own Taylor Swift lyrics!

The fruits of natural language processing are endless, and the insights it can provide give us deeper understanding of who we are as a communicative species. What will you find out with NLP?

Be sure to also check out our other machine learning project analyzing Survivor confessionals.