In the past few years, deep learning has been all the rage in the tech industry.

To keep up, I like to get my hands dirty implementing interesting network architectures I come across in papers and articles.

A few months ago I came across a very nice article called Siamese Recurrent Architectures for Learning Sentence Similarity. It offers a pretty straightforward approach to the common problem of sentence similarity.

The model, named MaLSTM (“Ma” for Manhattan distance), has the architecture depicted in figure 1 (the diagram excludes the sentence preprocessing part).

Notice that since this is a Siamese network, the weights are shared on both sides, which makes it easier to train.

Figure 1 MaLSTM’s architecture — Similar color means the weights are shared between the same-colored elements

Network explained

(I will be using Keras, so some technical details are related to the implementation)

So first of all, what is a “Siamese network”?

A Siamese network is a network that contains two or more identical sub-networks.

Siamese networks seem to perform well on similarity tasks and have been used for things like sentence semantic similarity, recognizing forged signatures, and more.

In MaLSTM the identical sub-network extends all the way from the embedding layer up to the final LSTM hidden state.

Word embedding is a modern way to represent words in deep learning models. More about it can be found in this nice blog post.

Essentially it’s a method to give words semantic meaning in a vector representation.

Inputs to the network are zero-padded sequences of word indices. These are fixed-length vectors where the leading zeros are ignored and the non-zero entries are indices that uniquely identify words.
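As a minimal sketch, assuming the questions have already been tokenized and mapped to word indices (the indices and the maximum length below are made up for illustration), the padding step in Keras could look like this:

```python
from keras.preprocessing.sequence import pad_sequences

# Hypothetical example: two questions already mapped to word indices.
question_1 = [12, 845, 3, 67]
question_2 = [12, 845, 90]

max_seq_length = 10  # an assumed maximum length; the real value comes from the data

# 'pre' padding puts the zeros first, matching the description above
padded = pad_sequences([question_1, question_2],
                       maxlen=max_seq_length, padding='pre')
print(padded)
# [[  0   0   0   0   0   0  12 845   3  67]
#  [  0   0   0   0   0   0   0  12 845  90]]
```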

Those vectors are then fed into the embedding layer. This layer looks up the corresponding embedding for each word and stacks all of them into a matrix that represents the given text as a sequence of embeddings.

I use Google’s word2vec embeddings, the same as in the original paper.
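Here is one way this could look. This sketch assumes a vocabulary dict mapping each word to its index (with 0 reserved for padding) has already been built from the data, and that the word2vec binary file is available locally:

```python
import numpy as np
from gensim.models import KeyedVectors
from keras.layers import Embedding

# Load Google's pretrained word2vec vectors (300 dimensions).
word2vec = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

embedding_dim = 300
# 'vocabulary' is an assumed dict: word -> index, index 0 reserved for padding.
embeddings = np.zeros((len(vocabulary) + 1, embedding_dim))
for word, index in vocabulary.items():
    if word in word2vec:
        embeddings[index] = word2vec[word]

# The embedding layer holds the lookup matrix; it is frozen since the
# vectors are pretrained.
embedding_layer = Embedding(len(embeddings), embedding_dim,
                            weights=[embeddings],
                            input_length=max_seq_length,
                            trainable=False)
```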

The process is depicted in figure 2.

Figure 2 Embedding process

We have two embedded matrices that represent a candidate pair of questions. We feed them into the LSTM (in practice there is only one, since the weights are shared), and the final hidden state of the LSTM for each question is a 50-dimensional vector, trained to capture the semantic meaning of the question.

In figure 1, this vector is denoted by the letter h.
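Here is a sketch of the weight sharing in Keras, reusing the embedding_layer and max_seq_length from the snippets above. A single LSTM instance is applied to both inputs, so both sides use the same weights:

```python
from keras.layers import Input, LSTM

n_hidden = 50  # dimensionality of the LSTM hidden state, as in the paper

left_input = Input(shape=(max_seq_length,), dtype='int32')
right_input = Input(shape=(max_seq_length,), dtype='int32')

# One LSTM applied to both sides: this is what makes the network Siamese.
shared_lstm = LSTM(n_hidden)

left_output = shared_lstm(embedding_layer(left_input))    # h(left)
right_output = shared_lstm(embedding_layer(right_input))  # h(right)
```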

If you don’t entirely understand LSTMs, I suggest reading this wonderful post.

By now we have the two vectors that hold the semantic meaning of each question. We put them through the similarity function defined below:

exp(−||h(left) − h(right)||₁)

MaLSTM similarity function

Since we take the exponent of a negative number (the Manhattan distance is non-negative), the output (the prediction in our case) will lie between 0 and 1.
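A possible Keras implementation of this function is a Lambda layer applied to the two LSTM outputs from the sketch above:

```python
from keras import backend as K
from keras.layers import Lambda

def exponent_neg_manhattan_distance(vectors):
    """exp(-||h_left - h_right||_1): 1 for identical vectors, approaching 0 as they diverge."""
    left, right = vectors
    return K.exp(-K.sum(K.abs(left - right), axis=1, keepdims=True))

malstm_distance = Lambda(exponent_neg_manhattan_distance,
                         output_shape=lambda shapes: (shapes[0][0], 1))(
    [left_output, right_output])
```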

Training

The optimizer of choice in the article is Adadelta, which you can read about in this article. We also use gradient clipping to avoid the exploding gradient problem. You may find a nice explanation of gradient clipping in this video from the Udacity deep learning course.
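Putting the pieces together, a sketch of how the model could be compiled looks like this. The clipping norm value here is an illustrative choice, and mean squared error is a natural loss since the target labels are 0 or 1:

```python
from keras.models import Model
from keras.optimizers import Adadelta

malstm = Model([left_input, right_input], malstm_distance)

# clipnorm caps the gradient norm, guarding against exploding gradients;
# 1.25 is an example value, not a prescription from the paper.
optimizer = Adadelta(clipnorm=1.25)

malstm.compile(loss='mean_squared_error', optimizer=optimizer,
               metrics=['accuracy'])
```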

This is where I diverge a little from the original paper. For the sake of simplicity, I do not use a specific weight initialization scheme and do not pretrain the network on a different task.

Other parameters, such as the batch size, the number of epochs, and the gradient clipping norm value, are my own choices.
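For completeness, here is a hypothetical training call. X_train_left, X_train_right, y_train and the validation arrays are assumed to be the padded index matrices and 0/1 similarity labels; the hyperparameter values are examples only:

```python
# Illustrative values only, not tuned or taken from the paper.
batch_size = 64
n_epoch = 25

malstm.fit([X_train_left, X_train_right], y_train,
           batch_size=batch_size, epochs=n_epoch,
           validation_data=([X_val_left, X_val_right], y_val))
```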