Can We Generate High-Quality Movie Reviews Using Language Models?

Fine-tuning a language model on IMDB movie reviews and generating movie reviews using various different methods.

Photo by Ahmet Yalçınkaya on Unsplash

Introduction

Recently, language models (models that attempt to predict the next word in a sequence, typically deep neural networks) have made a serious splash, as OpenAI announced that it had trained a 1.5-billion-parameter language model called “GPT-2”, which it originally deemed too dangerous to release. It has since been released, and now you can even talk to it all you want!

The examples on the OpenAI blog and the “Talk to Transformer” website linked above really blew my mind, and I began to wonder whether we could generate high-quality movie reviews similar to those in the IMDb dataset, which contains 100,000 movie reviews: 25,000 positive, 25,000 negative, and 50,000 unlabeled. As far as I am aware, however, the full GPT-2 model, with its 1.5 billion parameters, cannot fit on a single GPU without a significant amount of effort, and it definitely would not fit on my personal GPU. So I became curious: what would be the quality of reviews generated by a standard language model after training it on the IMDb dataset?

Setup

I decided to perform some experiments using fastai’s provided language model, which has been pretrained on Wikipedia (see the ULMFiT paper here for more details about the model). I then fine-tuned it on the entire IMDb dataset; since I am not interested in sentiment analysis, I did not need to maintain the train/test split provided with the dataset. The purpose of this fine-tuning is to focus the model on the domain of movie reviews: a model that can accurately predict the next word of a review is well placed to realistically generate the next word of one, which is our goal. If you are interested in generating text from another domain, feel free to repeat these experiments using a corpus of your choice!

I recently came across “The Curious Case of Neural Text Degeneration”, which explores existing text-generation strategies in depth, highlights their deficiencies, and proposes some new, well-motivated, and successful approaches, so I decided to compare the quality of the movie reviews generated by each of the methods. All methods use the same prompts (as seen below) and review length (100 words), and use the same model for inference.

Methods & Results

Here I will briefly present each of the methods evaluated and provide some examples (apologies for the image quality). These examples are the result of running the model once for each approach and are not cherry-picked or otherwise tampered with.

For increased readability, implementation details, and more examples, please check out the link to the GitHub repo at the bottom.

Greedy top-1 approach — since the language model outputs a probability vector the size of our vocabulary, we can simply take the token with the highest probability at each step. By repeating this process and continually appending the generated token to the existing prompt, we can generate the full review.
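The loop above can be sketched as follows. This is a minimal illustration, not the repo’s actual code: I assume a `next_probs(tokens)` callable that returns the model’s next-token probability vector as a NumPy array; the name and interface are hypothetical.

```python
import numpy as np

def generate_greedy(next_probs, prompt, n_tokens):
    """Greedy top-1 generation: always append the most probable token.

    next_probs(tokens) is assumed to return a probability vector over
    the vocabulary, given the tokens generated so far.
    """
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = next_probs(tokens)
        tokens.append(int(np.argmax(probs)))  # index of the most likely token
    return tokens
```

Because `argmax` is deterministic, the same prompt always yields the same continuation, which is exactly why this method tends toward repetition.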

Greedy multinomial approach — in this version, I sample a token from the predicted probability distribution instead of always taking the most likely one. The procedure is still greedy (it only considers the distribution at the current token) but leads to less repetitive results.
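A sketch of the multinomial variant, under the same assumed `next_probs(tokens)` interface as above (illustrative names, not the repo’s code); the only change from top-1 is that we draw from the distribution instead of taking its argmax.

```python
import numpy as np

def generate_multinomial(next_probs, prompt, n_tokens, seed=0):
    """Sample each next token from the full predicted distribution."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = next_probs(tokens)
        # draw one token index with probability proportional to probs
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens
```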

As we can see in the examples, the reviews generated by greedy top-1 are repetitive and do not resemble natural language. The reviews generated by the greedy multinomial approach are, in my opinion, surprisingly good, but are still not quite there yet.

A few examples of generated reviews using the “greedy top-1” approach. The words in square brackets are the given prompt, “xxbos” represents the beginning of a review.

A few examples of generated reviews using the “greedy-multinomial” approach. The words in square brackets are the given prompt, “xxbos” represents the beginning of a review.

Beam-search approach — in this approach, instead of picking only the token with the highest probability (top-1), we pick the k tokens with the highest probability; this parameter k is also known as the “beam width”. At each step, we extend each kept sequence with its candidate tokens and retain only the top-k combinations so far. In theory, this allows us to pick tokens based not only on the tokens which precede them but also on the tokens which succeed them. In practice, however, this method often fails miserably and ends up generating extremely repetitive text: all the top-k results seem to be chains of repetitive and “safe” choices for the model, so even sampling from a multinomial distribution over the final top-k produces poor results most of the time. For a more in-depth analysis of this particular failure mode, please refer to the paper. As we can see in the examples, this method tends to produce sentences which make short-term sense but are very repetitive, and it overall fails to produce high-quality reviews.
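The beam-search procedure can be sketched like this, again assuming a hypothetical `next_probs(tokens)` returning a probability vector; scores are accumulated as log-probabilities, which is the usual way to compare partial sequences without underflow.

```python
import numpy as np

def beam_search(next_probs, prompt, n_tokens, beam_width=3):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [(list(prompt), 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(n_tokens):
        candidates = []
        for tokens, score in beams:
            probs = next_probs(tokens)
            # expand each beam with its beam_width most likely tokens
            for tok in np.argsort(probs)[-beam_width:]:
                candidates.append(
                    (tokens + [int(tok)], score + np.log(probs[tok] + 1e-12))
                )
        # keep only the best beam_width sequences overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # highest-scoring full sequence
```

Note how the final `sort` and truncation is what squeezes out diversity: every surviving beam is a maximally “safe” continuation, which is the failure mode described above.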

A few examples of generated reviews using the “beam-search” approach. The words in square brackets are the given prompt, “xxbos” represents the beginning of a review.

Top-k approach — instead of always picking the single most probable token, we now restrict sampling to the k tokens with the highest probability, similar to beam-search’s candidate set, and sample from the renormalized distribution over them. As we can see in the examples, the results can be hit-or-miss: sometimes the entire review is cohesive and seems realistic, but at other times the sentences do not make much sense together. On the whole, it seems to produce more interesting and varied results than the previous methods, in my opinion.
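The per-step sampling rule can be sketched as a small helper (illustrative, not the repo’s code): keep the k most probable tokens, renormalize, and draw one.

```python
import numpy as np

def sample_top_k(probs, k, rng):
    """Sample from the k most probable tokens after renormalizing."""
    top = np.argsort(probs)[-k:]        # indices of the k most likely tokens
    p = probs[top] / probs[top].sum()   # renormalize mass over the top-k
    return int(rng.choice(top, p=p))
```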

A few examples of generated reviews using the “top-k” approach. The words in square brackets are the given prompt, “xxbos” represents the beginning of a review.

Top-p nucleus approach — this approach was proposed in the paper to solve the “guess a number for k” problem of the top-k approach. Instead of working with a fixed k, it makes more sense to choose k based on the probability distribution: k should be whatever number is needed for the top-k tokens to contain most of the probability mass. In the top-p nucleus approach, we therefore provide the function with a probability p, and for each token it decides on a number k and then performs top-k sampling as before. Overall, the quality seems quite similar to top-k to me, but it can probably handle edge cases better: when the probability distribution is relatively uniform we want a large k, and when it is sharply peaked we want a small k.
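A sketch of the nucleus rule (illustrative helper, not the repo’s code): sort tokens by probability, take the smallest prefix whose cumulative mass reaches p, and sample from it.

```python
import numpy as np

def sample_top_p(probs, p, rng):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample from that set."""
    order = np.argsort(probs)[::-1]               # tokens, most likely first
    cumulative = np.cumsum(probs[order])
    k = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix covering p
    nucleus = order[:k]
    weights = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=weights))
```

On a peaked distribution the nucleus shrinks to one or two tokens, while on a flat one it grows, which is exactly the adaptive-k behaviour described above.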

A few examples of generated reviews using the “top-p” approach. The words in square brackets are the given prompt, “xxbos” represents the beginning of a review.

Some more implementation details: I chose to remove the special token representing unknown words from generation, and added support for a minimum token probability and for probability-distribution temperature (in addition to the parameters each method takes). I have not had the chance to tweak these parameters yet, so it is entirely possible that better text generation is achievable through some experimentation.
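The two extra knobs mentioned above can be sketched as standalone transforms of the probability vector (illustrative code, not the repo’s implementation): temperature rescales the distribution before sampling, and the minimum-probability filter drops unlikely tokens entirely.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(probs + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())  # subtract max for numerical stability
    return scaled / scaled.sum()

def filter_min_prob(probs, min_p):
    """Zero out tokens below min_p and renormalize the rest."""
    kept = np.where(probs >= min_p, probs, 0.0)
    return kept / kept.sum()
```

Either transform can be applied to the model’s output vector before any of the sampling strategies above.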

Conclusion

In conclusion, we have observed that the quality of generated text depends not only on the size of the model but also on the generation approach itself, with different methods producing reviews of vastly different quality: greedy top-1 produces low-quality text, beam-search produces extremely repetitive text, and top-k and top-p produce non-repetitive text of higher quality than top-1. Hopefully, text-generation approaches will remain an active field of research, and soon we will not need a 1.5B-parameter model in order to generate realistic text!

References