Part 2: Applying Language Models to Real Data

Data Source and Pre-Processing

For this demonstration, we will be using the IMDB Large Movie Review dataset made available by Stanford. The data contains the rating given by the reviewer, the polarity (positive or negative) and the full text of the review.
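As a rough sketch of how such a dataset might be read in, the snippet below walks one folder of the extracted archive; the `aclImdb/` path, folder layout and filename convention (`id_rating.txt`) are assumptions for illustration rather than details from the text above.

```python
import os

def load_reviews(split="train", polarity="neg", root="aclImdb"):
    """Load (rating, polarity, text) tuples from one folder of the dataset."""
    folder = os.path.join(root, split, polarity)
    reviews = []
    for name in sorted(os.listdir(folder)):
        # Assumed filename convention: "0_3.txt" -> star rating 3.
        rating = int(name.split("_")[1].split(".")[0])
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            reviews.append((rating, polarity, f.read()))
    return reviews
```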

For example, the first negative comment here in full is the following:

First, we convert the full comments into their individual sentences, introduce markers for the start and end of each sentence, and clean the text by removing any punctuation and lowercasing all words.
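The cleaning step above can be sketched as follows; the exact marker strings and the regex-based sentence split are assumptions for illustration.

```python
import re

def preprocess(review):
    """Split a review into cleaned sentences wrapped in start/end markers."""
    sentences = re.split(r"[.!?]+", review)
    cleaned = []
    for sentence in sentences:
        # Lowercase, then keep only word characters (dropping punctuation).
        words = re.findall(r"[a-z']+", sentence.lower())
        if words:
            cleaned.append(["<s>"] + words + ["</s>"])
    return cleaned

preprocess("What a film! I loved it.")
# -> [['<s>', 'what', 'a', 'film', '</s>'], ['<s>', 'i', 'loved', 'it', '</s>']]
```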

Unigram Model

As this is the easiest to compute, we can find the probability of each word occurring and use this to estimate the probability of the whole sentence occurring by the following:

Alternatively, we can compute this using logarithms, since by the log rules the following holds true:

We do this because summing log-probabilities avoids the numerical underflow that comes from multiplying many very small probabilities, and addition is also typically computationally faster than multiplication.
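A minimal sketch of the unigram estimate, P(w) = count(w)/N, and the sentence log-probability as a sum of word log-probabilities; the toy corpus here is an assumption for illustration, not data from the IMDB set.

```python
import math
from collections import Counter

# Toy corpus (assumed for illustration): a flat list of cleaned tokens.
corpus = ["<s>", "i", "loved", "this", "film", "</s>",
          "<s>", "i", "hated", "this", "film", "</s>"]
counts = Counter(corpus)
total = len(corpus)

def unigram_logprob(sentence):
    # log P(w1..wn) = sum_i log P(wi), with P(wi) = count(wi) / total tokens
    return sum(math.log(counts[w] / total) for w in sentence)

unigram_logprob(["i", "loved", "this", "film"])
```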

For example, with the unigram model, we can calculate the probability of the following words.

Bigram Model

The unigram model ignores word order and so is not very accurate; we therefore introduce the bigram estimation instead. Applying this is somewhat more complex: first we collect the co-occurrence counts of each word pair into a word-word matrix. The counts are then normalised by the counts of the previous word, as shown in the following equation:
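This normalisation, P(wₙ | wₙ₋₁) = count(wₙ₋₁, wₙ) / count(wₙ₋₁), can be sketched with a nested-dict "word-word matrix"; the toy corpus is again an assumption for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration).
corpus = ["<s>", "to", "a", "movie", "</s>",
          "<s>", "to", "a", "film", "</s>",
          "<s>", "to", "the", "cinema", "</s>"]

# Word-word matrix of co-occurrence counts: bigram_counts[prev][next].
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1
unigram_counts = Counter(corpus)

def bigram_prob(prev, nxt):
    # Normalise the co-occurrence count by the count of the previous word.
    return bigram_counts[prev][nxt] / unigram_counts[prev]

bigram_prob("to", "a")  # count(to, a) / count(to) = 2/3
```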

So, for example, if we wanted to improve our calculation of P(a|to) shown previously, we first count the occurrences of (to, a) and divide this by the count of occurrences of (to).

and likewise, if we were to change the initial word to ‘has’:

As mentioned, to properly utilise the bigram model we need to compute the word-word matrix for all word pair occurrences. With this, we can find the most likely word to follow the current one. However, this also requires an exceptional amount of time if the corpus is large, and so it may be better to compute the counts for words as required rather than doing so exhaustively.

With this, we can find some examples of the most likely word to follow the given word:

Some words have many likely words to follow, but others, such as “unnatural”, have only one. This is likely because there are few instances of the word occurring in the first place.
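Looking up the most likely follower from the bigram counts can be sketched as below; the tiny corpus is an assumption chosen so that “unnatural” has a single observed follower, mirroring the observation above.

```python
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration).
corpus = ("<s> this is an unnatural film </s> "
          "<s> this is a good film </s> "
          "<s> this film is good </s>").split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def most_likely_next(word):
    # Counter.most_common(1) returns [(word, count)] for the top entry.
    return bigram_counts[word].most_common(1)[0][0]

most_likely_next("this")       # 'is' (occurs twice after 'this')
most_likely_next("unnatural")  # 'film' -- its only observed follower
```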

Forming Sentences with Probabilities

These computations can be used to form basic sentences. We will fix the start and end of the sentence to the respective markers “<s>” and “</s>” and will vary the columns chosen from the word-word matrix so that the sentences become varied.
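One way to sketch this generation loop is to sample the next word from the bigram distribution until “</s>” is produced; the corpus and the sampling strategy (weighted random choice rather than always the single most likely word, which is what keeps the sentences varied) are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration).
corpus = ("<s> i loved this film </s> "
          "<s> i hated this movie </s> "
          "<s> this film was great </s>").split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def generate(max_len=20, seed=None):
    """Build a sentence from <s> by sampling followers until </s>."""
    rng = random.Random(seed)
    sentence = ["<s>"]
    while sentence[-1] != "</s>" and len(sentence) < max_len:
        followers = bigram_counts[sentence[-1]]
        words = list(followers)
        weights = list(followers.values())
        # Sample proportionally to the bigram counts.
        sentence.append(rng.choices(words, weights=weights)[0])
    return " ".join(sentence)

generate(seed=0)
```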

Even calculating the next word ‘on the fly’, the time required to do this is exceptionally large; for example, a single sentence took almost two hours to compute on my machine:

Checking some of these probabilities manually we find that the words are very likely, for example:

Tri-grams and beyond!

If we continue the estimation equation, we can form one for trigrams:

For example:

Therefore, the tri-gram phrase ‘to a movie’ is used more commonly than ‘to a film’ and is the choice our algorithm would make when forming sentences.
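The trigram estimate, P(wₙ | wₙ₋₂, wₙ₋₁) = count(wₙ₋₂, wₙ₋₁, wₙ) / count(wₙ₋₂, wₙ₋₁), can be sketched on the ‘to a movie’ vs ‘to a film’ comparison; the counts below are toy assumptions, not the actual IMDB figures.

```python
from collections import Counter

# Toy corpus (assumed) in which 'to a movie' outnumbers 'to a film'.
corpus = ("<s> we went to a movie </s> "
          "<s> we went to a movie </s> "
          "<s> we went to a film </s>").split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    # Normalise the trigram count by the count of the preceding bigram.
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

trigram_prob("to", "a", "movie")  # 2/3
trigram_prob("to", "a", "film")   # 1/3
```

Under these toy counts the algorithm would indeed prefer ‘movie’ after ‘to a’.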