3, 2, 1… Go!

Six months ago I was sitting at my desk, looking for proposals for my Master’s Thesis, when I suddenly got a notification on my phone: an email from a PhD student at the Institute of Neuroinformatics (INI):

We are starting a new project: we want to generate rap lyrics with AI. If you want to write your Master’s Thesis about it, please let us know.

“Wow, that’s the sort of intriguing, out-of-the-box project I need, let me reply and check it out”.

One week later, I was already dealing with rap lines, rhyming and related papers. Two weeks later, I had the first meeting with the whole team:

Nikola Nikolov, finishing his PhD at INI with a focus on Natural Language Processing (NLP).

Eric Malmi, working at Google Zurich, with a PhD in data mining.

Curtis Northcutt, PhD student in EECS at MIT and rapper.

Loreto Parisi, director of machine learning at Musixmatch, together with his team.

Top left: Nikola - Top right: Curtis - Bottom left: Eric - Bottom right: Loreto.

They are all amazing guys with incredible minds: I think there is no better introduction for them than their accomplishments.

Text generation

Generating machine text that is almost indistinguishable from human text is a very challenging task in NLP. We address a specific form of creative text generation: rap lyrics. When producing new lyrics, we have to take into account the stylistic features of rap text, such as flow, meter and rhyming, on top of the general properties of natural language, like coherence and grammatical correctness.

Not only is the task fun and fascinating, allowing us to explore the limits of AI, but the outputs of such models may also be helpful and inspirational for artists and writers.

Rhyme Density

We focus on generating lyrics that contain more rhymes. To measure the rhyme technicality of a rap text, we use Rhyme Density (RD), a metric introduced by Malmi et al. in the “DopeLearning” paper.

The metric captures the amount of assonance rhymes in a given piece of text by computing, for every word, the longest matching vowel sequence (LVS) it shares with the preceding words.

“This is a job, I get paid to sling some raps,
What you made last year was less than my income tax”

In the example, the LVS for the word tax is 3.

The rhyme density of a text is then the average of the LVS values of all its words.
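To make the definition concrete, here is a minimal Python sketch of the idea. It works on plain vowel letters rather than on the phonemes used in the DopeLearning implementation, so its numbers only approximate the official metric; the window size and tokenization are illustrative choices.

```python
import re

def vowels(text):
    """Keep only the vowel letters of a text (naive stand-in for phonemes)."""
    return "".join(c for c in text.lower() if c in "aeiou")

def lvs_at(words, i, window=8):
    """Longest matching vowel sequence ending at word i.

    Compares the vowel suffix of the text ending at word i with the vowel
    suffixes ending at each of the previous `window` words, so a rhyme may
    span word boundaries (e.g. "some raps" / "income tax").
    """
    vi = vowels(" ".join(words[max(0, i - window):i + 1]))
    best = 0
    for j in range(max(0, i - window), i):
        vj = vowels(" ".join(words[max(0, j - window):j + 1]))
        k = 0
        while k < min(len(vi), len(vj)) and vi[-1 - k] == vj[-1 - k]:
            k += 1
        best = max(best, k)
    return best

def rhyme_density(text):
    """Average LVS over all the words of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(lvs_at(words, i) for i in range(len(words))) / len(words)
```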

The artist with the highest rhyme density per song is Inspectah Deck with an average of 1.187; Eminem is 39th with 1.04; The Lonely Island is 94th with 0.870.

Dataset

Musixmatch provided us with a dataset composed of 60k songs whose genre is rap or very close to it. The dataset covers 24k different artists, with an average of 70 lines per song.

The average rhyme density of a lyric in the dataset is 1.045. We also use a repetition rate (RR) metric, which measures the share of repeated n-grams in a text: RR ranges from 0 to 100%, where 0 means no repetition and 100 means full repetition. The average RR of a song in the dataset is 22.3%.
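As a rough illustration (the exact n-gram range and averaging we used are not spelled out here), RR can be computed as the share of n-grams occurring more than once, averaged over small n:

```python
from collections import Counter

def repetition_rate(text, max_n=4):
    """Share of repeated n-grams, averaged over n = 1..max_n, in percent.

    0 means every n-gram is unique, 100 means everything repeats.
    """
    tokens = text.lower().split()
    rates = []
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if not ngrams:
            continue
        counts = Counter(ngrams)
        repeated = sum(c for c in counts.values() if c > 1)
        rates.append(repeated / len(ngrams))
    return 100.0 * sum(rates) / len(rates) if rates else 0.0
```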

The songs go through a preprocessing pipeline that normalizes the text and removes unwanted tokens. The final dataset is a single text file where songs are appended one after another, separated by an “end of song” token.
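The concatenation step looks roughly like this; the cleaning rules and the <|endofsong|> token string are placeholders for illustration, not the exact ones from our pipeline:

```python
import re

END_OF_SONG = "<|endofsong|>"  # placeholder name for the separator token

def clean(lyrics):
    """Toy normalization: lowercase, drop [Verse]/[Chorus] markers, collapse blank lines."""
    lyrics = lyrics.lower()
    lyrics = re.sub(r"\[.*?\]", "", lyrics)
    lyrics = re.sub(r"\n{2,}", "\n", lyrics).strip()
    return lyrics

def build_training_file(songs, path):
    """Append cleaned songs to one text file, separated by the end-of-song token."""
    with open(path, "w", encoding="utf-8") as f:
        for song in songs:
            f.write(clean(song) + "\n" + END_OF_SONG + "\n")
```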

GPT2 fine-tuning

To generate rap lyrics we use GPT2, the state-of-the-art language model released by OpenAI.

Image taken from the blog post “The Illustrated GPT-2”.

GPT2 is a stack of decoders: given an input context, it outputs a vector which is then multiplied by the vocabulary embedding matrix. This operation produces a score for each word in the vocabulary: the higher the score, the higher the probability that the word is the next word.
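This scoring step is easy to reproduce with the Hugging Face transformers library (shown here for illustration; it is not necessarily the tooling we used):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the smallest released model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("I get paid to sling some", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The last position holds one score per vocabulary entry for the next token;
# softmax turns those scores into next-word probabilities.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p.item():.3f}")
```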

We work with the smallest GPT2 version released by OpenAI, the one with 117M parameters. In our experiments we found that our dataset is not big enough to take advantage of the larger models, which are also much slower to train.

The GPT2 version released by OpenAI has already been trained on 40GB of human text extracted from the web. We want to exploit that knowledge and adapt it to our task: rap generation.

We fine-tuned the pretrained 117M model on our dataset of 60k rap songs: we trained the network for 200k iterations, which took around 40 hours on our single GeForce GTX 1080 GPU. Below we report the perplexity scores of the best model. Perplexity measures how well the model predicts a word: the lower, the better. The test and validation sets are composed of 1000 random songs extracted from the dataset.

Training perplexity: 22.873
Test perplexity: 27.473
Validation perplexity: 24.894
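Our training setup aside, the generic fine-tuning recipe is a plain language-modeling loop over the concatenated lyrics file; the block size, learning rate and file path below are illustrative. Note that the perplexity reported above is just the exponential of the average cross-entropy loss.

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

# Tokenize the whole training file, then cut it into fixed-size blocks.
# (For a real 60k-song corpus you would stream this step instead.)
text = open("rap_dataset.txt", encoding="utf-8").read()  # illustrative path
ids = tokenizer(text, return_tensors="pt").input_ids[0]
block = 512
chunks = [ids[i:i + block] for i in range(0, len(ids) - block, block)]

optimizer = AdamW(model.parameters(), lr=2e-5)
for step, chunk in enumerate(chunks):
    batch = chunk.unsqueeze(0).to(device)
    # With labels == input_ids the model returns the LM cross-entropy loss.
    loss = model(batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 100 == 0:
        print(f"step {step}: perplexity {loss.exp().item():.1f}")
```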

We produced 300 songs with the fine-tuned GPT2, our baseline model, and evaluated them quantitatively.

The longest matching sequence is the longest substring shared between a generated song and the whole training dataset, measured as the number of consecutive characters in common. We compute it to make sure there is no overfitting and thus no copying from the original lyrics: an average of 23 characters means that only a few consecutive words are copied from any song in the dataset.
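A simple (if slow) way to compute this check is Python’s difflib; our actual implementation may differ:

```python
from difflib import SequenceMatcher

def longest_match_len(generated, corpus):
    """Length in characters of the longest substring the two texts share.

    Quadratic in the input sizes, so for a full 60k-song corpus a smarter
    structure (e.g. a suffix automaton) would be needed.
    """
    m = SequenceMatcher(None, generated, corpus, autojunk=False)
    return m.find_longest_match(0, len(generated), 0, len(corpus)).size
```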

These are two interesting verses generated by the network:

Verse 1 - Baseline model

I’ve had a lot of chances and I’ve lost a lot of battles
I’ve lost a lot of friends, I’ve lost myself
I don’t know why I haven’t lost all the rage
I’ve lost the power to fight, I’ve lost the power to move
I’ve got so much on my mind, I think I’m just caught in between

Verse 2 - Baseline model

Once you take my hand, you’ll understand
Sometimes I feel like I’m walking on a tightrope
I feel like I’m running from my past
I feel like I’m running from who I am
But I’ll be okay, ’cause I’m out here with you, you, you
We’ll make it through whatever comes our way
’Cause we are one, we are one
Even when you go down I’ll hold you up
And I know there’s nothing that I can say
To convince you to stay
Maybe I am too abstract

The RD of the first verse is 0.784; the RD of the second is 0.807.

Rhyme density biased sampling

The average rhyme density of the generated songs is 0.932; we want to improve it by biasing the probabilities of the next word estimated by the model.

We call P the probability the model assigns to a specific candidate word, and we rescale it according to the formula:

P' = P * (1 + alpha * RD)

where RD is the rhyme density the generated text would have if that word were chosen as the next word, and alpha is a new hyperparameter we introduce to regulate the rhyme component.
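In sampling code the reweighting looks roughly as follows. The rhyme_density() helper is the metric sketched earlier, and restricting the bias to the top-k candidates is an illustrative shortcut to keep the RD computation affordable, not necessarily what we did:

```python
import torch

def sample_next_token(model, tokenizer, ids, alpha, k=40):
    """Sample the next token with probabilities biased towards rhymes.

    Each top-k candidate is scored by the rhyme density the text would have
    if that candidate were appended; P is then rescaled by (1 + alpha * RD)
    and renormalized before sampling.
    """
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)

    text = tokenizer.decode(ids[0])
    weights = []
    for p, idx in zip(top.values, top.indices):
        candidate = text + tokenizer.decode(idx.item())
        rd = rhyme_density(candidate)              # metric sketched earlier
        weights.append(p.item() * (1.0 + alpha * rd))

    weights = torch.tensor(weights)
    weights /= weights.sum()                       # back to a distribution
    choice = top.indices[torch.multinomial(weights, 1)]
    return choice.item()
```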

We generated 300 songs for each of several alpha values ranging from 0 to 100, and evaluated the rhyme density, perplexity and repetition rate of each batch.