This article describes the problem of learning word representations with a neural network from scratch. The problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by Prof. Geoffrey Hinton from the University of Toronto in 2012. It also appeared as an assignment in this course from the same university. The problem description is taken from the assignment pdf.

Problem Statement

In this article we will design a neural net language model. The model will learn to predict the next word given the previous three words. The network looks like the following:

The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.

The 4-grams are chosen so that all the words involved come from a small vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.

A few of the 250 words in the vocabulary are shown in the output of the MATLAB / Octave code below.

    load data.mat
    data.vocab
    ans =
    {
      [1,1] = all
      [1,2] = set
      [1,3] = just
      [1,4] = show
      [1,5] = being
      [1,6] = money
      [1,7] = over
      [1,8] = both
      [1,9] = years
      [1,10] = four
      [1,11] = through
      [1,12] = during
      [1,13] = go
      [1,14] = still
      [1,15] = children
      [1,16] = before
      [1,17] = police
      [1,18] = office
      [1,19] = million
      [1,20] = also
      .
      .
      [1,246] = so
      [1,247] = time
      [1,248] = five
      [1,249] = the
      [1,250] = left
    }

The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.

Let’s first look at the raw sentences file, the first few lines of which are shown below. It contains the raw sentences from which these 4-grams were extracted. As can be seen, the sentences we are dealing with here are fairly simple.

The raw sentences file: first few lines

    No , he says now .
    And what did he do ?
    The money 's there .
    That was less than a year ago .
    But he made only the first .
    There 's still time for them to do it .
    But he should nt have .
    They have to come down to the people .
    I do nt know where that is .
    No , I would nt .
    Who Will It Be ?
    And no , I was not the one .
    You could do a Where are they now ?
    There 's no place like it that I know of .
    Be here now , and so on .
    It 's not you or him , it 's both of you .
    So it 's not going to get in my way .
    When it 's time to go , it 's time to go .
    No one 's going to do any of it for us .
    Well , I want more .
    Will they make it ?
    Who to take into school or not take into school ?
    But it 's about to get one just the same .
    We all have it .

The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.

Each entry is an integer that is the index of a word in the vocabulary, so each row represents a sequence of 4 words. The following Octave / MATLAB code shows what the training dataset looks like.

    load data.mat
    [train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

    % 3-gram features for a training data-tuple
    train_x(:,13,14)
    %ans =
    %46
    %58
    %32
    data.vocab{train_x(:,13,14)}
    %ans = now
    %ans = where
    %ans = do

    % target for the same data tuple from training dataset
    train_t(:,13,14)
    %ans = 91
    data.vocab{train_t(:,13,14)}
    %ans = we

The validation and test data have the same format. They contain 46,568 4-grams each.

Before starting the training, the data needs to be loaded and all three sets (training, validation and test) need to be separated into inputs and targets. After that, the training set is split into mini-batches of size 100.
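The splitting step can be sketched as follows. This is a minimal sketch on a tiny synthetic matrix, not the assignment's `load_data` function: the helper `make_minibatches` and the variable `data4` are hypothetical names, and the sketch assumes one 4-gram per row and that an incomplete final mini-batch is simply dropped.

```octave
1;  % mark this file as a script so the function below can be defined inline

% Hypothetical helper (not the assignment's load_data).
% data4: N x 4 matrix, one 4-gram of word indices per row (assumed layout).
function [x, t] = make_minibatches(data4, batchsize)
  N = size(data4, 1);
  numbatches = floor(N / batchsize);            % drop the incomplete last batch
  used = data4(1:numbatches * batchsize, :)';   % 4 x (numbatches * batchsize)
  x = reshape(used(1:3, :), 3, batchsize, numbatches);  % inputs: first 3 words
  t = reshape(used(4, :), 1, batchsize, numbatches);    % targets: 4th word
end

% Five toy 4-grams, mini-batches of size 2 (the 5th row is left unused).
data4 = [1 2 3 4; 2 3 4 5; 3 4 5 6; 4 5 6 7; 5 6 7 8];
[x, t] = make_minibatches(data4, 2);
size(x)        % 3 x 2 x 2: numwords x batchsize x numbatches
x(:, 1, 2)'    % first case of the second mini-batch: 3 4 5
t(:, 1, 2)     % its target: 6
```

Under the same drop-the-remainder assumption, 372,550 training cases with a batch size of 100 would give 3,725 full mini-batches, leaving 50 cases unused.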

First we need to train the model for one epoch (one full pass through the training set). Once the training code is implemented correctly, the cross-entropy loss should start decreasing.

At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rate, momentum, etc.) to see what effect they have on the training and validation cross entropy.

The training method will output a ‘model’ (weight matrices, biases for each layer in the network).

Description of the Network

As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.

The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.

This layer is connected to the hidden layer, which in turn is connected to the output layer.

The output layer is a softmax over the 250 words.

The training consists of two steps: (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.

Forward Propagation

The forward propagation is pretty straightforward and can be implemented as shown in the following code:

    function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
      fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
      hid_to_output_weights, hid_bias, output_bias)
    % This method forward propagates through a neural network.
    % Inputs:
    %   input_batch: The input data as a matrix of size numwords X batchsize where,
    %     numwords is the number of words, batchsize is the number of data points.
    %     So, if input_batch(i, j) = k then the ith word in data point j is word
    %     index k of the vocabulary.
    %
    %   word_embedding_weights: Word embedding as a matrix of size
    %     vocab_size X numhid1, where vocab_size is the size of the vocabulary
    %     numhid1 is the dimensionality of the embedding space.
    %
    %   embed_to_hid_weights: Weights between the word embedding layer and hidden
    %     layer as a matrix of size numhid1*numwords X numhid2, numhid2 is the
    %     number of hidden units.
    %
    %   hid_to_output_weights: Weights between the hidden layer and output softmax
    %     unit as a matrix of size numhid2 X vocab_size
    %
    %   hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
    %
    %   output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
    %
    % Outputs:
    %   embedding_layer_state: State of units in the embedding layer as a matrix of
    %     size numhid1*numwords X batchsize
    %
    %   hidden_layer_state: State of units in the hidden layer as a matrix of size
    %     numhid2 X batchsize
    %
    %   output_layer_state: State of units in the output layer as a matrix of size
    %     vocab_size X batchsize
    %

    [numwords, batchsize] = size(input_batch);
    [vocab_size, numhid1] = size(word_embedding_weights);
    numhid2 = size(embed_to_hid_weights, 2);

    %% COMPUTE STATE OF WORD EMBEDDING LAYER.
    % Look up the inputs word indices in the word_embedding_weights matrix.
    embedding_layer_state = reshape(...
      word_embedding_weights(reshape(input_batch, 1, []),:)',...
      numhid1 * numwords, []);

    %% COMPUTE STATE OF HIDDEN LAYER.
    % Compute inputs to hidden units.
    inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
      repmat(hid_bias, 1, batchsize);

    % Apply logistic activation function.
    hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));

    %% COMPUTE STATE OF OUTPUT LAYER.
    % Compute inputs to softmax.
    inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + ...
      repmat(output_bias, 1, batchsize);

    % Subtract maximum.
    % Remember that adding or subtracting the same constant from each input to a
    % softmax unit does not affect the outputs. Here we are subtracting maximum to
    % make all inputs <= 0. This prevents overflows when computing their
    % exponents.
    inputs_to_softmax = inputs_to_softmax...
      - repmat(max(inputs_to_softmax), vocab_size, 1);

    % Compute exp.
    output_layer_state = exp(inputs_to_softmax);

    % Normalize to get probability distribution.
    output_layer_state = output_layer_state ./ repmat(...
      sum(output_layer_state, 1), vocab_size, 1);

Back-Propagation

The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.
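For the network in this article (softmax output, cross-entropy loss, logistic hidden units), the chain-rule steps can be sketched as below for a single training case. The symbols are notation introduced here for illustration, not taken from the assignment: $x$ is the embedding layer state, $W_1$ and $b_1$ stand for `embed_to_hid_weights` and `hid_bias`, $W_2$ and $b_2$ for `hid_to_output_weights` and `output_bias`, $h$ and $y$ are the hidden and output layer states, $t$ is the one-hot target vector, and $\odot$ is the element-wise product.

```latex
\begin{aligned}
a &= W_1^{\top} x + b_1, \qquad h = \frac{1}{1 + \exp(-a)}, \\
z &= W_2^{\top} h + b_2, \qquad y_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}, \\
C &= -\sum_k t_k \log y_k, \\
\frac{\partial C}{\partial z} &= y - t, \\
\frac{\partial C}{\partial W_2} &= h\,(y - t)^{\top}, \qquad
\frac{\partial C}{\partial b_2} = y - t, \\
\delta &= \frac{\partial C}{\partial a} = \bigl(W_2\,(y - t)\bigr) \odot h \odot (1 - h), \\
\frac{\partial C}{\partial W_1} &= x\,\delta^{\top}, \qquad
\frac{\partial C}{\partial b_1} = \delta, \\
\frac{\partial C}{\partial x} &= W_1\,\delta.
\end{aligned}
```

The gradient for the row of `word_embedding_weights` belonging to a given word is then the sum of the corresponding numhid1-sized slices of $\partial C / \partial x$ over the input positions where that word occurs. For a mini-batch, the per-case gradients are summed.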

As the model trains it prints out some numbers that tell how well the training is going.

The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
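The average per-case cross entropy for one mini-batch can be computed along the following lines. This is a sketch on a hand-made 3-word example, not the assignment's training code: `target_batch` and the small constant `tiny` (used to avoid taking the log of zero) are assumptions made for illustration.

```octave
% output_layer_state: vocab_size x batchsize softmax outputs (columns sum to 1).
% target_batch: 1 x batchsize vector of target word indices (hypothetical name).
output_layer_state = [0.7 0.1; 0.2 0.8; 0.1 0.1];
target_batch = [1 2];

[vocab_size, batchsize] = size(output_layer_state);
% Linear indices of the probability the model assigned to each target word.
idx = sub2ind([vocab_size, batchsize], target_batch, 1:batchsize);
tiny = exp(-30);  % assumed guard against log(0)
CE = -sum(log(output_layer_state(idx) + tiny)) / batchsize;
fprintf(1, "%.4f\n", CE);  % prints 0.2899
```

Here the two cases contribute -log(0.7) and -log(0.8), and their mean is the reported value.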

After every 1000 mini-batches of training, the model is run on the validation set. Recall that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on the validation set is reported.

The validation error is expected to decrease with increasing epochs until the model starts over-fitting the training data. Hence, to prevent over-fitting, the training is stopped as soon as the validation error starts increasing.
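The stopping rule can be sketched as follows. The validation cross-entropy values below are made up purely to illustrate the control flow and are not results from the model; in the real training loop each value would come from running the model on the validation set.

```octave
% Made-up validation CE values, one per validation check (illustration only).
valid_CE = [3.9 3.2 2.9 2.8 2.85 2.9];

best_CE = Inf;
stop_check = numel(valid_CE);   % default: never stopped early
for k = 1:numel(valid_CE)
  if valid_CE(k) > best_CE
    stop_check = k;             % validation error went up: stop here
    break;
  end
  best_CE = valid_CE(k);
end
fprintf(1, "stopped at check %d, best validation CE %.2f\n", stop_check, best_CE);
% prints: stopped at check 5, best validation CE 2.80
```

A more forgiving variant would tolerate a few consecutive increases before stopping, but that is a design choice beyond what is described here.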

At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.

Some Applications

1. Predict next word

Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.

The next example shows that when the model is given the 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts ‘york’ as the most likely word, with the highest probability (~0.94), and assigns low probabilities to words such as ‘year’, ‘life’ and ‘world’.

It also shows how the forward propagation is used to compute the prediction, i.e. the distribution for the next word given the 3-gram. First the words are projected into the embedding space and flattened, then the weight matrices are applied sequentially, and finally the softmax function is applied to compute the probability of each word being the next word following the 3-gram.
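The prediction step for a single 3-gram can be sketched as follows. This is a self-contained toy with a 5-word vocabulary and random, untrained weights, so the printed probabilities are meaningless; the sizes and the random initialisation are assumptions, and only the sequence of operations mirrors the description above.

```octave
% Toy vocabulary and sizes (illustrative, not the assignment's values).
vocab = {'life', 'in', 'new', 'york', '.'};
vocab_size = numel(vocab); numwords = 3; numhid1 = 4; numhid2 = 6;

randn('state', 1);  % fixed seed so repeated runs agree
word_embedding_weights = 0.1 * randn(vocab_size, numhid1);
embed_to_hid_weights   = 0.1 * randn(numhid1 * numwords, numhid2);
hid_to_output_weights  = 0.1 * randn(numhid2, vocab_size);
hid_bias = zeros(numhid2, 1);
output_bias = zeros(vocab_size, 1);

input = [1; 2; 3];  % indices of 'life', 'in', 'new'
% Look up the three embeddings and flatten them into one column vector.
embedding = reshape(word_embedding_weights(input, :)', numhid1 * numwords, 1);
hidden = 1 ./ (1 + exp(-(embed_to_hid_weights' * embedding + hid_bias)));
z = hid_to_output_weights' * hidden + output_bias;
z = z - max(z);                  % for numerical stability
probs = exp(z) / sum(exp(z));    % softmax over the vocabulary

% Report the k most probable next words.
k = 3;
[sorted_probs, order] = sort(probs, 'descend');
for i = 1:k
  fprintf(1, "%s %.3f\n", vocab{order(i)}, sorted_probs(i));
end
```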

2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random text:

1. Given 3 words to start from, initialize the text with those 3 words.
2. Ask the model to predict the k most probable words as candidates to follow the last 3 words.
3. Randomly choose one of these candidate words and append it to the end of the text.
4. Repeat steps 2-3 to generate more words; otherwise stop.

Here is the code that, by default, generates the top 3 predictions for each 3-gram sliding window and chooses one of the predicted words randomly:

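    function gen_rand_text(words, model, k)
      % Generate text starting from the 3 words in the cell array 'words',
      % choosing randomly among the top-k predictions at each step.
      if nargin < 3
        k = 3;  % default number of candidate words
      end
      probs = [];
      word = '';
      i = 4;
      % Stop only once i has reached 20 and the last generated word is '.'.
      while (i < 20 || ~strcmp(word, '.'))
        [word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);
        words = {words{:}, word};
        probs = [probs; prob];
        i = i + 1;
      end
      fprintf(1, "%s ", words{:});
      fprintf(1, "\n");
      fprintf(1, "%.2f ", round(probs.*100)./100);
      fprintf(1, "\n");
    end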

Starting with the words ‘i was going’, here are some texts that were generated using the model:

Starting with the words ‘life in new’, here is a piece of text that was generated using the model:

3. Find nearest words

The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
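The nearest-word lookup can be sketched as follows. The 2-dimensional embeddings below are made up so that the result is easy to check by hand; they are not the learned weights, and Euclidean distance is an assumption about the distance used.

```octave
% Made-up embeddings, one row per word (illustration only).
vocab = {'day', 'week', 'year', 'police'};
word_embedding_weights = [0 0; 1 0; 0 2; 5 5];

query = 1;  % index of 'day'
% Euclidean distance from the query word to every word in the vocabulary.
delta = word_embedding_weights - ...
  repmat(word_embedding_weights(query, :), numel(vocab), 1);
dist = sqrt(sum(delta .^ 2, 2));
dist(query) = Inf;  % exclude the query word itself

k = 2;
[sorted_dist, order] = sort(dist);
nearest = vocab(order(1:k));
fprintf(1, "%s ", nearest{:});  % prints: week year
fprintf(1, "\n");
```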

As can be seen from the following animation examples, semantically closer words are mostly chosen as the nearest words for a given word. Also, the higher the number of epochs, the better the ordering of the words in terms of semantic similarity.

For example, the closest semantically similar word (i.e. with the least distance) for the word ‘between’ is the word ‘among’, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than to the word ‘or’.

4. Visualization in 2-dimension with t-SNE

In all the above examples, the dimension of the word embedding space was 50. Using a t-SNE plot (t-distributed stochastic neighbor embedding, by Laurens van der Maaten), the words can be projected into a 2-dimensional space and visualized, keeping words that are (semantically) near each other in the distributed representation space near each other in the projected space.

As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.

Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).

5. Solving Word-Analogy Problem with the distributed representation

In this type of problem, 2 words (w1, w2) from the vocabulary are given, where the first is related to the second by some semantic relation. Then a third word (w3, from the vocabulary) is given, and a fourth word that has a similar semantic relation with the third word is to be found from the vocabulary.

The following figure shows the word analogy problem and a possible solution: an exhaustive search in the embedding space for the word whose distance to the third word is closest to the distance between the first and second words in the representation space.
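The exhaustive search can be sketched as follows. The embeddings are made up for illustration and are not the learned weights; Euclidean distance and the exclusion of the three given words from the candidates are assumptions, and this is not necessarily how the `analogy` function used in this article is implemented.

```octave
% Made-up 2-D embeddings, one row per word (illustration only).
vocab = {'year', 'years', 'day', 'days', 'police'};
E = [0 0; 0 1; 3 0; 3 1.1; 3 4];

w1 = 1; w2 = 2; w3 = 3;  % year : years :: day : ?
d12 = norm(E(w1, :) - E(w2, :));  % distance between the first two words

% Distance from the third word to every word in the vocabulary.
d3 = sqrt(sum((E - repmat(E(w3, :), size(E, 1), 1)) .^ 2, 2));

% Pick the word whose distance to w3 is closest to d12.
gap = abs(d3 - d12);
gap([w1 w2 w3]) = Inf;   % do not return one of the given words
[best_gap, w4] = min(gap);
fprintf(1, "%s:%s::%s:%s\n", vocab{w1}, vocab{w2}, vocab{w3}, vocab{w4});
% prints: year:years::day:days
```

Matching on distance alone ignores direction in the embedding space, so in general several unrelated words can tie; comparing difference vectors instead would be a stricter alternative.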

The next code shows the results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well in finding the answers for the examples shown.

    analogy('year', 'years', 'day', model); % singular-plural relation
    %year:years::day:days
    %dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186

    analogy('on', 'off', 'new', model) % antonyms relation
    %on:off::new:old
    %dist_E('on','off')=2.013958, dist_E('new','old')=2.265665

    analogy('use', 'used', 'do', model) % present-past relation
    %use:used::do:did
    %dist_E('use','used')=2.556175, dist_E('do','did')=2.456098

    analogy('he', 'his', 'they', model) % pronoun-relations
    %he:his::they:their
    %dist_E('he','his')=3.824808, dist_E('they','their')=3.825453

    analogy('today', 'yesterday', 'now', model)
    %today:yesterday::now:then
    %dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935

Model Selection

Now the model is trained 4 times with different values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), trying all four combinations of d=8 or d=32 with h=64 or h=256.

The following figures show the cross-entropy errors on the training and validation sets for the models. As can be seen from the figures, the models with hidden layer size 64 are trained for 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since they have more parameters to train).
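To make the difference in model size concrete, the number of trainable parameters for each combination can be counted from the layer shapes given in the forward propagation code. This is a sketch that assumes the architecture described in this article (vocabulary of 250 words, 3 input words, one hidden layer with biases on the hidden and output layers).

```octave
vocab_size = 250; numwords = 3;
d_values = [8 32];
h_values = [64 256];
counts = zeros(numel(d_values), numel(h_values));

for i = 1:numel(d_values)
  for j = 1:numel(h_values)
    d = d_values(i); h = h_values(j);
    n_embed = vocab_size * d;      % word_embedding_weights
    n_e2h   = numwords * d * h;    % embed_to_hid_weights
    n_h2o   = h * vocab_size;      % hid_to_output_weights
    counts(i, j) = n_embed + n_e2h + h + n_h2o + vocab_size;  % plus both biases
    fprintf(1, "d=%d h=%d parameters=%d\n", d, h, counts(i, j));
  end
end
% d=8 h=64 parameters=19850
% d=8 h=256 parameters=72650
% d=32 h=64 parameters=30458
% d=32 h=256 parameters=97082
```

Under these assumptions, the h=256 models have roughly three to four times as many parameters as the h=64 models with the same d.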

The lowest validation error (and also the lowest training error) is obtained for the model with d=32 and h=256, so this is the best model.