Text Augmentation

The dataset consisted of 144 songs, which comes to 167,887 words. I really wanted to make a comment about the number of songs Alex has written, and these don't even include the songs from The Last Shadow Puppets or his solo album. But I am getting distracted!

Given that the dataset isn't as large as you'd want for a language-modelling task, text augmentation could be applied.

The two types of text augmentation used here were:

Substitution: replaces the current word with one that the language model predicts for that position.

Insertion: uses the surrounding words as features to predict a new word to slot in (see the toy sketch below).
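To make the two operations concrete, here is a toy sketch in plain Python. The "predicted" words are made up for illustration; they are not real model output.

```python
# Toy illustration of substitution vs. insertion (no language model involved).
tokens = "there is always somebody taller".split()

# Substitution: replace the word at a position with a predicted word.
substituted = tokens.copy()
substituted[2] = "invariably"      # hypothetical prediction for "always"

# Insertion: insert a predicted word at a position, shifting the rest right.
inserted = tokens.copy()
inserted.insert(3, "apparently")   # hypothetical prediction

print(" ".join(substituted))  # there is invariably somebody taller
print(" ".join(inserted))     # there is always apparently somebody taller
```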

I used nlpaug for this, and a really good overview can be found in the article Data Augmentation library for text by Edward Ma.

nlpaug has character, word and flow augmenters. To generate synthetic lyrics, I believed word-level models would be more beneficial; flow augmenters like 'naf.Sequential' are used to sequentially apply different augmentations.

I used two types of augmentation: BertAug and FasttextAug. Both insert or substitute similar words based on context. BertAug uses the BERT language model to predict the replacement word, or the next word in the case of insertion. FasttextAug replaces or inserts words based on similarity in pre-trained fastText word embeddings.
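For reference, here's roughly what that setup looks like; a minimal sketch assuming a recent nlpaug release, in which BertAug has since been renamed ContextualWordEmbsAug. The insert-then-substitute ordering simply mirrors the results below.

```python
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

text = "there is always somebody taller with more of a wit"

# Apply insertion and then substitution, one after the other, via a flow.
bert_flow = naf.Sequential([
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute'),
])

print(bert_flow.augment(text))
```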

Results after BertAug insert and substitute

in : there is always somebody taller with more of a wit

out: it is always somebody taller with more of temper wit

weeeirrrdddd.. but sounds about right.

Results after FasttextAug insert and substitute

in : there is always somebody taller with more of a wit

out: There is invariably somebody tall with more of a wit

Another interesting thing that happened: there were no ValueError exceptions for unknown words with FasttextAug, thanks to the sub-word embeddings (I used wiki-news-300d-1M-subword.vec to load the model).

Except, that is, for “i.d.s.t. i.d.s.t. i.d.s.t i.d.s.t”, “choo-choo! choo-choo! choo-choo!” and “shoo-wop shoo-wop shoo-wop”. I honestly don't blame it.
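For completeness, here is the fastText side sketched the same way, assuming the current nlpaug API (WordEmbsAug, the successor to FasttextAug) pointed at that same vector file.

```python
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

text = "there is always somebody taller with more of a wit"

# Same insert-then-substitute flow, driven by fastText vectors instead of BERT.
fasttext_flow = naf.Sequential([
    naw.WordEmbsAug(model_type='fasttext',
                    model_path='wiki-news-300d-1M-subword.vec',
                    action='insert'),
    naw.WordEmbsAug(model_type='fasttext',
                    model_path='wiki-news-300d-1M-subword.vec',
                    action='substitute'),
])

print(fasttext_flow.augment(text))
```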

After augmentation there were 334,524 words in the corpus, meaning the new corpus is roughly twice the size of the original.

Creating the augmented dataset took quite some time (around an hour-ish). I've uploaded the .txt file of the final corpus to Google Drive.
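If you want to reproduce the corpus-building step, the loop is essentially "write every original line, then one augmented copy", which is what doubles the word count. A rough sketch; the file names are hypothetical placeholders, not the actual paths.

```python
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

# Insert-then-substitute flow, as in the sketches above.
flow = naf.Sequential([
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute'),
])

# Keep every original line plus one augmented copy of it.
with open('lyrics.txt') as src, open('augmented_corpus.txt', 'w') as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue
        dst.write(line + '\n')
        try:
            augmented = flow.augment(line)
            # Newer nlpaug releases return a list; older ones return a string.
            if isinstance(augmented, list):
                augmented = augmented[0]
            dst.write(augmented + '\n')
        except ValueError:
            # Lines the augmenter can't handle (the "i.d.s.t." problem):
            # keep only the original and move on.
            pass
```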