In this series, I will build prototypes for generating sentence vectors using popular approaches. In this first post, I start with the simplest method: averaging word vectors to obtain a sentence vector.

Edited 3rd June 2017:

I am adding results of evaluating each approach on the Quora duplicate question dataset. It contains about 400,000 pairs of questions, manually labeled for similarity of intent. You can read more about this dataset at https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

This article assumes that you know how to obtain word vectors from a given text. If you don’t, refer to the first two lectures of this amazing series on applying deep learning to natural language processing problems.

Code for plain averaging method

So what’s the first thing that comes to your mind when you have word vectors and need to calculate a sentence vector?

Just average them?

Yes, that’s what we are going to do here.

Basic idea of plain average of word vectors to get sentence vector

It turns out we don’t even have to write much code. The spaCy module has a built-in document vector, computed as the average of the token vectors. spaCy uses 300-dimensional GloVe word vectors trained on the Common Crawl corpus.

Basic code for obtaining a sentence vector with this approach would be as follows:

# Code for obtaining a sentence vector from the spaCy module

import spacy

nlp = spacy.load('en')

tweet_doc = nlp("Here goes some input text")

print(tweet_doc.vector)
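Under the hood, the document vector is just the element-wise mean of the word vectors. A minimal sketch of that averaging step, using a made-up toy vocabulary (the words and 4-dimensional vectors below are illustrative assumptions, not real 300-dimensional GloVe vectors):

```python
import numpy as np

# Toy 4-dimensional "word vectors" (made up for illustration)
word_vectors = {
    "quantum":   np.array([0.9, 0.1, 0.0, 0.2]),
    "computing": np.array([0.7, 0.3, 0.1, 0.0]),
    "is":        np.array([0.0, 0.0, 0.5, 0.5]),
    "hard":      np.array([0.1, 0.8, 0.2, 0.1]),
}

def sentence_vector(sentence):
    """Average the word vectors of all known tokens in the sentence."""
    vectors = [word_vectors[w] for w in sentence.lower().split()
               if w in word_vectors]
    return np.mean(vectors, axis=0)

# Each component of the result is the mean of that component
# across the four word vectors
print(sentence_vector("Quantum computing is hard"))
```

Note that this averaging throws away word order entirely, which is the main limitation of the approach.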

I took the most recent 100,000 tweets from popular handles and calculated sentence vectors for them.

I have uploaded the full code at https://github.com/premrajnarkhede/sentence2vec, along with a sample tweets.json containing 100,000 tweets. All you have to do to run this code is

python sen2vec.py

To find the closest tweets, the program takes text as input from the user, calculates its sentence vector, and then looks for the closest vectors among the stored tweets.
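That lookup can be sketched as a brute-force nearest-neighbour search over the stored sentence vectors. The function and toy 2-dimensional data below are assumptions for illustration, not the actual sen2vec.py code:

```python
import numpy as np

def closest(query_vec, tweet_vecs, tweet_texts, k=3):
    """Return the k tweets whose sentence vectors have the smallest
    Euclidean distance to the query vector."""
    dists = np.linalg.norm(tweet_vecs - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return [(dists[i], tweet_texts[i]) for i in order]

# Toy example: three stored "tweets" with 2-dimensional vectors
tweet_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
tweet_texts = ["tweet A", "tweet B", "tweet C"]

for dist, text in closest(np.array([1.0, 0.0]), tweet_vecs, tweet_texts):
    print(f"Distance: {dist:.4f} Tweet Text: {text}")
```

A distance of 0.0 to the query itself, as in the results below, simply means the query tweet was present in the stored set.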

Results

Results on the Quora dataset (quantitative checks)

While I was able to calculate sentence vectors for all questions and find the distance between the two questions in each pair, I had to run logistic regression on the distance score to find a cutoff: above it, questions are called dissimilar; below it, similar.
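The cutoff step can be sketched as a one-feature logistic regression fit by gradient descent. The distances and labels below are made-up toy numbers, not the Quora results, and this is a from-scratch sketch rather than the code actually used:

```python
import numpy as np

# Toy data: small distances labeled similar (1), large labeled dissimilar (0)
dists  = np.array([0.2, 0.4, 0.5, 0.9, 1.2, 1.4, 1.5, 1.8])
labels = np.array([1,   1,   1,   1,   0,   0,   0,   0  ])

# One-feature logistic regression fit by gradient descent
w, b = 0.0, 0.0
lr = 0.5
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * dists + b)))  # predicted P(similar)
    grad_w = np.mean((p - labels) * dists)      # cross-entropy gradients
    grad_b = np.mean(p - labels)
    w -= lr * grad_w
    b -= lr * grad_b

# The decision boundary p = 0.5 corresponds to w*d + b = 0,
# so the distance cutoff is where that line crosses zero
cutoff = -b / w
print(f"distance cutoff: {cutoff:.3f}")
```

Since similar pairs have smaller distances, the learned weight w comes out negative, and the cutoff lands between the two groups of toy distances.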

I got 63% accuracy on the test set, which had an equal proportion of similar and dissimilar questions.

Results on tweets (qualitative checks)

I gave some tweet texts as input to the program to find similar tweets; the results were as follows.

Query text: Research Associate in first-principles electronic structure methods for quantum embedding https://t.co/ke1o2w5b5M

Results
------------------------------

Distance: 0.0 Tweet Text: Research Associate in first-principles electronic structure methods for quantum embedding https://t.co/ke1o2w5b5M

Distance: 1.49180190612 Tweet Text: New Materials Could Make Quantum Computers More Practical — Stanford electrical engineering Professor Jelena Vu… https://t.co/mwRcMcEriO

Distance: 1.55241788122 Tweet Text: Eigentechno — Principal Component Analysis applied to electronic music https://t.co/Jfsste5RNs

Distance: 1.57296885052 Tweet Text: 2D Materials Go Ferromagnetic, Creating a New Scientific Field https://t.co/RQRhYz4wHP

Distance: 1.64445564296 Tweet Text: The best strategies for self-assessment, according to Buddhist and Stoic philosophy https://t.co/SVF5Iku8qK

Distance: 1.64445564296 Tweet Text: The best strategies for self-assessment, according to Buddhist and Stoic philosophy https://t.co/9CNHz1ab3Z

Distance: 1.65336638565 Tweet Text: Google Plans to Demonstrate the Supremacy of Quantum Computing https://t.co/h84ioIf9GQ

Distance: 1.6694213117 Tweet Text: Creative Destruction Lab is launching a quantum machine learning accelerator in Toronto https://t.co/lap3gXkvf8 by @ryanlawler

Distance: 1.67209887565 Tweet Text: A Stanford University psychologist’s elegant three-step method for creating new habits https://t.co/pxiEmgxpRd

Distance: 1.67446022161 Tweet Text: Modeling Reality: Putting Systems Engineering Theory into Practice https://t.co/oHY3190Px3

------------------------------

Comment: Quite decent output, isn’t it?

Another example

Query text: The US is helping allies hide civilian casualties in the fight against ISIS — @Foreignpolicy https://t.co/inPNFw8CQm https://t.co/kJQ1W6kPsM

Distance: 0.0 Tweet Text: The US is helping allies hide civilian casualties in the fight against ISIS — @Foreignpolicy https://t.co/inPNFw8CQm https://t.co/kJQ1W6kPsM

Distance: 1.03971526787 Tweet Text: Iraq and the U.S. are in talks to keep an American troop presence after ISIS fight https://t.co/4uHPdimNQ3 https://t.co/o2NVdj82iu

Distance: 1.04809368426 Tweet Text: OPEC and its allies are digging in for a long war of attrition against shale https://t.co/SuKYozYlyl https://t.co/zAtbkXeeRZ

Distance: 1.07585984153 Tweet Text: Trump’s calls to step up the fight against terror resonate at NATO after Manchester attack https://t.co/S9Pnhl2jrf https://t.co/cktVHIj8xu

Distance: 1.07882313999 Tweet Text: How to defend yourself against the WannaCrypt global ransomware attack https://t.co/hNqfbMk6XU https://t.co/o6xUhIJtr0

Distance: 1.07882313999 Tweet Text: How to defend yourself against the WannaCrypt global ransomware attack https://t.co/BkYKvZnfNp https://t.co/Qv2PutAgHN

Distance: 1.07882313999 Tweet Text: How to defend yourself against the WannaCrypt global ransomware attack https://t.co/KGLa4fDfCN https://t.co/GRnOawqrHc

Distance: 1.07882313999 Tweet Text: How to defend yourself against the WannaCrypt global ransomware attack https://t.co/FU8L4mb0zT https://t.co/8yAWHyW796

Distance: 1.07882313999 Tweet Text: How to defend yourself against the WannaCrypt global ransomware attack https://t.co/sp4KoNaipX https://t.co/CERUqOp9oX

Distance: 1.08526030358 Tweet Text: Several worshippers and the imam at the Manchester bomber’s mosque fought in the Libyan civil war https://t.co/IQhb3ThuPT

------------------------------

Comment: The ransomware tweets are almost as close to the queried text as the actual ISIS-related tweets are. So that’s a fail!