Introduction

Numeric representation of Text documents is challenging task in machine learning and there are different ways there to create the numerical features for texts such as vector representation using Bag of Words, Tf-IDF etc.I am not going in detail what are the advantages of one over the other or which is the best one to use in which case. There are lot of good reads available to explain this. My focus here is more on the doc2vec and how to use it for sentence similarity

What is Word2Vec?

It’s a Model to create the word embeddings, where it takes input as a large corpus of text and produces a vector space typically of several hundred dimesions. it was introduced in two papers between September and October 2013, by a team of researchers at Google. The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model.

The meaning of a word can be found from the company it keeps

For instance: “Bank”, “money” and “accounts” are often used in similar situations, with similar surrounding words like “dollar”, “loan” or “credit”, and according to Word2Vec they will therefore share a similar vector representation. From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

Source: Google Images

What is Doc2Vec?

if you understand word2vec then it would be easier to understand the Doc2vec, since it’s an extension for word2vec. So the objective of doc2vec is to create the numerical representation of sentence/paragraphs/documents unlike word2vec that computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus..The vectors generated by doc2vec can be used for tasks like finding similarity between sentences/paragraphs/documents

source: Google Images

As per the orignal document, Paragraph Vector is capable of constructing representations of input sequences of variable length. Unlike some of the previous approaches, it is general and applicable to texts of any length: sentences, paragraphs, and documents.

Paragraph Vector framework (see Figure above), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors

Check this link for Doc2vec implementation in Gensim Library

Now we will see how to use doc2vec(using Gensim) and find the Duplicate Questions pair, Competition hosted on Kaggle by Quora

Problem Statement:

Quora gets lot of duplicate questions which is added by it’s user from different locations and the main intent of Quora is to have a unique questions which can be answered by other users who are an expert or provide their opinion about the Question being asked. The primary goal of this competition is to go through the pair of questions and identify whether they are identical or not. For example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora because the intent behind both is identical

Downloading Data

Data can be downloaded using the below Kaggle link

https://www.kaggle.com/quora/question-pairs-dataset

Import Data and Cleaning

After downloading the csv file using the above Kaggle link clean the Data and drop the row if any of the questions out of the two are null Remove Stopwords using NLTK library and strip all the special characters

Check for null Questions and drop the rows

# Import required libraries import pandas as pd import pandas as pd import numpy as np import nltk from nltk.corpus import stopwords from nltk.stem import SnowballStemmer import re from gensim import utils from gensim.models.doc2vec import LabeledSentence from gensim.models.doc2vec import TaggedDocument from gensim.models import Doc2Vec from sklearn.metrics.pairwise import cosine_similarity from sklearn.metrics import accuracy_score # Import Data df=pd.read_csv('./questions.csv') # Check for null values df[df.isnull().any(axis=1)] # Drop rows with null Values df.drop(df[df.isnull().any(axis=1)].index,inplace=True)

Remove Stop Words

# Remove stop words if remove_stopwords: stops = set(stopwords.words("english")) words = [w for w in text.lower().split() if not w in stops] final_text = " ".join(words)

Remove Special Characters

# Special Characters review_text = re.sub(r"[^A-Za-z0-9(),!.?\'\`]", " ", final_text ) review_text = re.sub(r"\'s", " 's ", final_text ) review_text = re.sub(r"\'ve", " 've ", final_text ) review_text = re.sub(r"n\'t", " 't ", final_text ) review_text = re.sub(r"\'re", " 're ", final_text ) review_text = re.sub(r"\'d", " 'd ", final_text ) review_text = re.sub(r"\'ll", " 'll ", final_text ) review_text = re.sub(r",", " ", final_text ) review_text = re.sub(r"\.", " ", final_text ) review_text = re.sub(r"!", " ", final_text ) review_text = re.sub(r"\(", " ( ", final_text ) review_text = re.sub(r"\)", " ) ", final_text ) review_text = re.sub(r"\?", " ", final_text ) review_text = re.sub(r"\s{2,}", " ", final_text )

Label all the Questions

Gensim Doc2Vec needs model training data to tag each question with a unique id, So here we would be tagging the questions with their qid using TaggedDocument API. Check the original data for the column qid1 and 1id2

Before feeding these questions to the Model, we will split each questions into different word and form list of words for each of them along with the tagging. You can see below we have used split to separate it into individual words.

labeled_questions=[] labeled_questions.append(TaggedDocument(questions1[i].split(), df[df.index == i].qid1)) labeled_questions.append(TaggedDocument(questions2[i].split(), df[df.index == i].qid2))

Build the Model

The labeled question is used to build the vocabulary from a sequence of sentences. This represents the vocabulary (sometimes called Dictionary in gensim) of the model. which keeps track of all unique words

model = Doc2Vec ( dm = 1 , min_count = 1 , window = 10 , size = 150 , sample = 1e-4 , negative = 10 ) model . build_vocab ( labeled_questions )

Train the Model

Model should be initialized, trained for a few epochs. This might take some time depends on your hardware configuration

# Train the model with 20 epochs for epoch in range(20): model.train(labeled_questions,epochs=model.iter,total_examples=model.corpus_count) print("Epoch #{} is complete.".format(epoch+1))

Test the Model

After the model is trained we will check if it has learnt all the words and it’s contextual meaning. We will search for “Washington” city using most_similar api and see what is the result?, It should show all the words in the document which is near or similar to Washington contextually

model.most_similar('washington')

Looks like the model is trained well and the results are coming up good here. We looked up for Washington and it gives similar Cities in US as an outputA

Cosine Similarity

We will iterate through each of the question pair and find out what is the cosine Similarity for each pair. Check this link to find out what is cosine similarity and How it is used to find similarity between two word vectors

score = model.n_similarity(questions1_split[i],questions2_split[i])

Accuracy

There are different ways using which you can evaluate the accuracy of this model on the training data. Once the similarity score is calculated for each of the Questions pair then you can set a threshold value to find out which of the pair is duplicate or not. Since your score should be either 0 or 1 so you can set a threshold of 0.6 so if similarity score of any pair is > 0.6 then it’s a duplicate score is 1 and for any pair of question if it is <0.6 then the pair is not a duplicate and score is 0. Then pass this score to the accuracy score sklearn api with the original score in the csv file and check the accuracy of the model

from sklearn.metrics import accuracy_score accuracy = accuracy_score(df.is_duplicate, scores) * 100

Conclusion

We have seen that with a minimum effort how we have produce numerical features of the sentences(questions) and compare it using the cosine similarity to find out whether the question pair is duplicate or not. Additionaly, As a next step you can use the Bag of Words or TF-IDF model to covert these texts into numerical feature and check the accuracy score using cosine similarity.

To conclude - if you have a document related task then DOC2Vec is the ultimate way to convert the documents into numerical vectors.