The challenge: Correctly assess whether or not questions on Quora are duplicates of each other.

The data provided by Kaggle comes as a CSV file, where each row has six fields:

id              3
qid1            7
qid2            8
question1       Why am I mentally very lonely? How can I solve...
question2       Find the remainder when [math]23^{24}[/math] i...
is_duplicate    0

where is_duplicate is 1 if the questions have the same intent, and 0 if not. I want to use this data to predict whether question pairs in a test set are duplicates.

0. Setup

Link to code

To begin with, I need to turn the data from this CSV file into something I can feed into a neural network.
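Throughout this post, I’ll assume the training data lives in a pandas DataFrame called train (the variable the later snippets use). A minimal sketch of the loading step, assuming the competition’s standard train.csv filename:

import pandas as pd

# Load the Kaggle training data into a DataFrame with the fields shown above
train = pd.read_csv('train.csv')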

0.1 Tokenizing and padding the questions

The first step to doing this is turning the words in the questions into something my neural network can more readily compare. Luckily, this has already been done for me using word embeddings (which I explore here), which turn words into float arrays.

Keras’ neural networks also expect all their inputs to be the same size. Since not all the questions are the same length, I need to pick a question length (how many words per question I will input into my neural network).

To find the best question length, let’s look at how the question lengths change across the dataset:
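A sketch of this check, assuming the train DataFrame from above and counting whitespace-separated words:

# Word counts per question (astype(str) guards against any null entries)
q1_lengths = train.question1.astype(str).str.split().str.len()
q2_lengths = train.question2.astype(str).str.split().str.len()

print('The longest question length is', max(q1_lengths.max(), q2_lengths.max()), 'words')
print('The mean question lengths are', q1_lengths.mean(), 'and', q2_lengths.mean())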

The longest question length is 237 words

The mean question lengths are 10.9422317754 and 11.1820410203 (for question1 and question2 respectively)

That’s not tremendously helpful. I can also plot the frequencies of the question lengths:
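A minimal matplotlib sketch of such a plot, reusing the q1_lengths and q2_lengths series from above:

import matplotlib.pyplot as plt

# Histogram of question lengths; cap the x-axis so the long tail doesn't swamp the plot
plt.hist(q1_lengths, bins=range(60), alpha=0.5, label='question1')
plt.hist(q2_lengths, bins=range(60), alpha=0.5, label='question2')
plt.xlabel('Question length (words)')
plt.ylabel('Frequency')
plt.legend()
plt.show()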

A plot of the frequency of question lengths in the dataset

This is far more telling! There’s a very long tail to this dataset, but the vast majority of questions are shorter than 35 words. It will be far more computationally efficient to pick this as my max question length than 237 words (I’ll be using 300-dimensional word embeddings, so this way I’m training over 35×300 arrays instead of 237×300 arrays).

MAX_LENGTH = 35

Now that I have defined this length, I want to turn each question into a 35-element array which can then be mapped to word embeddings.

An inefficient approach would be to pass raw strings to the neural network and have an embedding layer look each word up in a dictionary of word embeddings. A better approach is to represent each word by an integer, using a dictionary which translates each word to its integer.

Keras has a Tokenizer class which does exactly this. I can pick how many words I want turned into indices (in my case, the 20,000 most common words in the dataset):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=20000)

# Fit on both question columns as one list of texts (adding the two Series
# directly would concatenate the strings pairwise rather than stacking them)
tokenizer.fit_on_texts(list(train.question1.astype(str)) + list(train.question2.astype(str)))

q1sequence = tokenizer.texts_to_sequences(train.question1.astype(str))
q2sequence = tokenizer.texts_to_sequences(train.question2.astype(str))
word_index = tokenizer.word_index

word_index is now a dictionary which links each word to a unique integer (e.g. 'replaces': 28265).
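To make the mapping concrete, a toy example (apart from 'replaces', the indices here are purely illustrative; the real values depend on word frequencies in the dataset):

# Hypothetical output -- actual indices depend on the fitted vocabulary
tokenizer.texts_to_sequences(['What replaces a dictionary'])
# -> [[4, 28265, 15, 2871]]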

Keras also has a function to make sure all the input sequences are the same length:

from keras.preprocessing.sequence import pad_sequences

q1_input = pad_sequences(q1sequence, maxlen=MAX_LENGTH)
q2_input = pad_sequences(q2sequence, maxlen=MAX_LENGTH)

I’m nearly ready to train a neural network!

0.2 Adding word embeddings

The last step is to take these inputs, which are now length-35 arrays of integers (where each integer maps to a word), and add the useful information captured by the word embeddings.

I can’t just build an input array which includes the word embeddings directly, since the resulting array is so large it takes up all of Python’s memory (I tried this, and it was 18 GB when I saved it as a text file).

A better approach is to add an embedding layer to my neural network, where each input is only given its embeddings as it’s passed through the network. To do this, I need to make another dictionary, which takes the integers in q1_input and q2_input and translates them to their respective word embeddings.

The first step is to open the GloVe embeddings, which come as a text file, and turn them into a dictionary. I can then create an embedding matrix, which contains all of the GloVe embeddings and links them (by index) to all of the words in word_index.

Of course, not every word in my data will have a GloVe equivalent (for instance, some people will misspell words); if this is the case, I leave the vector as all zeros.
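A sketch of both steps, assuming the 300-dimensional glove.840B.300d.txt file (any 300-dimensional GloVe file works the same way):

import numpy as np

EMBEDDING_DIM = 300

# Step 1: parse the GloVe text file into a {word: vector} dictionary.
# Splitting from the right handles the few tokens that contain spaces.
embeddings_index = {}
with open('glove.840B.300d.txt') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = ' '.join(values[:-EMBEDDING_DIM])
        embeddings_index[word] = np.asarray(values[-EMBEDDING_DIM:], dtype='float32')

# Step 2: row i of the matrix holds the embedding of the word with index i;
# words with no GloVe entry keep a row of zeros
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector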

With this embedding matrix, I can then create an embedding layer, which will be the input layer of my neural network:

from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LENGTH,
                            trainable=False)

I now have all the pieces to train a neural network!

1. My First Natural Language Processing Neural Network

Recurrent Neural Networks (which I look at here) make a ton of sense for this task: for my neural network to understand the intent behind a question, it needs to remember what the first word was even as it reaches the tenth word.

I also want my neural network to consider the two questions separately, to extract meaning from them, before considering them together.
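As a preview of the shape this takes, here is a minimal sketch (not necessarily the final architecture) of a two-branch Keras model that reads each question with a shared LSTM before merging the two representations:

from keras.layers import Input, LSTM, Dense, concatenate
from keras.models import Model

# Each branch receives a padded sequence of word indices
q1 = Input(shape=(MAX_LENGTH,), dtype='int32')
q2 = Input(shape=(MAX_LENGTH,), dtype='int32')

# The embedding layer and LSTM are shared, so both questions are encoded the same way
shared_lstm = LSTM(128)
q1_vector = shared_lstm(embedding_layer(q1))
q2_vector = shared_lstm(embedding_layer(q2))

# Only after each question has its own representation are the two combined
merged = concatenate([q1_vector, q2_vector])
prediction = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[q1, q2], outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Training is then a matter of calling model.fit([q1_input, q2_input], train.is_duplicate, ...).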