This essay covers fairly advanced statistical concepts (including machine learning!), which you probably didn’t expect going in.

Which means we’re expecting experts with machine learning and data science PhDs to email us with, “That was the most complete piece of bullshit ever. You didn't even consider...,” followed by a 10-point response on how they would have done this differently.

But if you’re not one of those cantankerous experts, you might be intrigued by this mathematical wizardry. Which means you get:

This was an outcome of our project on rappers’ vocabularies, which introduced many folks to the wonders of natural language processing.

To read more about each concept, check out the links below:

Unique Words to Hip Hop and Artists

Mapping Lyrically Similar Artists

Methodology Notes

The general music corpus was formed using data from LyricFind. We filtered hip-hop artists by cross-referencing their primary genre on MusixMatch.

For consistency, the hip hop data was cleaned using the same script as the LyricFind corpus. This included efforts to standardize spelling, remove capitalization, and apply light lemmatization.

Most Hip Hop: To find the words most “characteristic” of hip-hop, we compared how often a word appeared in the hip hop corpus vs. the general corpus. For each word, we took the number of appearances in the hip hop corpus divided by the total words in the hip hop corpus. We then divided that rate by the same figure computed for the general corpus, which gives a ratio of how much more common the word is in hip hop.

We filtered out some words that, while indexing high in hip hop vs. the general corpus, were still rather rare: all had fewer than 1,000 occurrences in the hip hop corpus. For example, “lowrider” had a 255:1 ratio in hip hop vs. other genres, but was only used 116 times in 26 million words.
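As a rough illustration of the ratio and the rarity filter described above, here is a minimal Python sketch. The function name, the toy token lists, and the choice to skip words that never appear in the general corpus are all assumptions for the example, not details from our actual script.

```python
from collections import Counter


def characteristic_words(genre_tokens, general_tokens, min_count=1000):
    """Rank words by how much more often they appear in the genre corpus."""
    genre_counts = Counter(genre_tokens)
    general_counts = Counter(general_tokens)
    genre_total = len(genre_tokens)
    general_total = len(general_tokens)

    ratios = {}
    for word, count in genre_counts.items():
        # Drop rare words (the filter described above).
        if count < min_count:
            continue
        # Assumption: skip words absent from the general corpus,
        # since their ratio would be undefined.
        if general_counts[word] == 0:
            continue
        genre_rate = count / genre_total
        general_rate = general_counts[word] / general_total
        ratios[word] = genre_rate / general_rate

    # Highest ratio first
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpora (hypothetical); min_count is lowered to fit the tiny data.
hip_hop = ["whip"] * 3 + ["love"] * 1
general = ["whip"] * 1 + ["love"] * 7
ranked = characteristic_words(hip_hop, general, min_count=1)
print(ranked)
```

With real corpora, the default threshold of 1,000 would do the job of dropping words like “lowrider.”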

TF-IDF: To determine the words that characterize each hip-hop artist, we used a technique called term frequency-inverse document frequency (tf-idf). Each rapper gets assigned a tf-idf score for every word in the hip-hop corpus. For a given word, we count the number of times it occurs in one rapper’s catalogue (its term frequency) and divide by the number of artists that use it across the hip-hop corpus (its document frequency). The ten words with the highest tf-idf scores for each artist were deemed the words “most unique” to him or her.

We made two slight modifications to the traditional formula. 1) We used sublinear scaling on the term frequencies (replacing the raw count with 1 plus its logarithm), giving us a little more variation across our lists. You can read more about why you might want to do sublinear scaling here. 2) We also set a “cut-off” for document frequency of 0.1. That means, to be considered in our tf-idf computation, a term had to be used at least once by at least 10% of the artists in our dataset. This rules out words that are repeated over and over by one or a few artists (think “controlla” for Drake).
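To make the two modifications concrete, here is a small Python sketch of tf-idf with sublinear term frequencies and a document-frequency cut-off. It assumes the common log-scaled form of inverse document frequency, log(N / df); the function name, the toy catalogues, and that exact idf variant are assumptions for illustration rather than a copy of our pipeline.

```python
import math
from collections import Counter


def tfidf_top_words(catalogues, min_df=0.1, top_n=10):
    """catalogues: dict mapping artist name -> list of word tokens."""
    n_artists = len(catalogues)
    term_counts = {a: Counter(tokens) for a, tokens in catalogues.items()}

    # Document frequency: how many artists use each word at least once
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Cut-off: keep only words used by at least min_df of the artists
    vocab = {w for w, df in doc_freq.items() if df / n_artists >= min_df}

    top = {}
    for artist, counts in term_counts.items():
        scores = {}
        for word, tf in counts.items():
            if word not in vocab:
                continue
            sublinear_tf = 1 + math.log(tf)  # sublinear scaling
            idf = math.log(n_artists / doc_freq[word])  # assumed idf form
            scores[word] = sublinear_tf * idf
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        top[artist] = ranked[:top_n]
    return top


# Toy data (hypothetical); min_df is raised because there are only 3 artists.
catalogues = {
    "artist_a": ["money", "money", "money", "whip", "yeah"],
    "artist_b": ["love", "love", "yeah"],
    "artist_c": ["money", "yeah"] + ["controlla"] * 4,
}
top_words = tfidf_top_words(catalogues, min_df=0.5)
print(top_words["artist_a"])
```

In the toy data, “controlla” is used heavily by one artist only, so the cut-off removes it before scoring. Libraries such as scikit-learn expose comparable options (`sublinear_tf` and `min_df` on `TfidfVectorizer`), though their exact idf smoothing differs from this sketch.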

Cosine Similarity: Cosine similarity is a common way of calculating the similarity between two vectors by taking the cosine of the angle between them. In our case, that means taking the tf-idf vector for an artist and comparing it to that of another. Higher cosine values imply more similarity, with an upper bound of 1 when the two vectors point in exactly the same direction.
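For readers who want to see the arithmetic, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal Python sketch follows; the function name and the zero-vector handling are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: treat an all-zero vector as having no similarity
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tf-idf vectors for two artists
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
no_overlap = cosine_similarity([1, 0], [0, 1])
print(same_direction, no_overlap)
```

Note that only direction matters: doubling every score in one artist’s vector leaves the similarity unchanged.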

t-SNE: To create our map of rappers, we used a dimensionality reduction technique called t-SNE. We took the tf-idf matrix and first reduced it to 50 dimensions using truncated singular value decomposition (SVD). We then took the resulting matrix and fed it into t-SNE with a perplexity parameter set to 40. The output of the t-SNE algorithm mapped rappers to a two-dimensional space based on the similarity of their lyrics.
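The two-step reduction might look something like the sketch below, which uses scikit-learn and NumPy. The library choice, the random placeholder matrix (60 hypothetical artists by 200 words), and the random seeds are all assumptions; only the 50 dimensions and the perplexity of 40 come from the description above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Placeholder standing in for the real tf-idf matrix (artists x words)
rng = np.random.default_rng(0)
tfidf_matrix = rng.random((60, 200))

# Step 1: reduce to 50 dimensions with truncated SVD
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)

# Step 2: map the 50-dimensional vectors to 2D with t-SNE
tsne = TSNE(n_components=2, perplexity=40, random_state=0)
coords = tsne.fit_transform(reduced)

print(coords.shape)
```

Each row of `coords` is an (x, y) position for one artist. t-SNE requires the perplexity to be smaller than the number of rows, which is why the placeholder has more than 40 artists.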

Special thanks to Josh Upton for edits.