Based on the above depiction, the model represents each document by a dense vector that is trained to predict words in the document. The only difference from the regular word-vector models is the additional paragraph (document) ID, which is used along with the regular word tokens to build the embeddings. Such a design enables this model to overcome the weaknesses of bag-of-words models.
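For readers who want to see what this looks like in code, here is a minimal sketch using gensim's Doc2Vec; the toy corpus, vector size and the gensim 4.x `dv` accessor are my own illustrative choices, not something prescribed by the article.

```python
# Minimal Doc2Vec sketch: each document gets a unique tag (the document ID mentioned above),
# and a dense vector is trained to predict the words appearing in that document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the movie was absolutely wonderful",
    "a dull and predictable storyline",
]

documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(corpus)]

model = Doc2Vec(documents, vector_size=128, window=5, min_count=1, epochs=40)

print(model.dv[0])                                              # dense vector for document 0
print(model.infer_vector("an unforgettable moving film".split()))  # vector for an unseen document
```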

The Neural-Net Language Model (NNLM) is a very early idea based on the neural probabilistic language model proposed by Bengio et al. in their 2003 paper, ‘A Neural Probabilistic Language Model’. They talk about learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model simultaneously learns a distributed representation for each word along with the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to the words forming an already seen sentence.

Google has built a universal sentence embedding model, nnlm-en-dim128, which is a token-based text embedding model trained using a three-hidden-layer feed-forward Neural-Net Language Model on the English Google News 200B corpus. This model maps any body of text into 128-dimensional embeddings. We will be using this in our hands-on demonstration shortly!
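As a quick preview, the module can be pulled straight from TF-Hub. The snippet below uses the TF2-style hub.load API with the /2 handle as an assumption on my part; the notebook accompanying this article relies on the older hub.Module / tf.estimator APIs instead.

```python
# Load the nnlm-en-dim128 module from TF-Hub and embed a couple of sentences.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")

sentences = [
    "The movie was an absolute delight",
    "A slow, tedious and forgettable film",
]

embeddings = embed(sentences)
print(embeddings.shape)   # (2, 128): one 128-dimensional vector per sentence
```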

Skip-Thought Vectors were also among the first models in the domain of unsupervised learning of generic sentence encoders. In their paper, ‘Skip-Thought Vectors’, the authors use the continuity of text from books to train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are mapped to similar vector representations.

This is just like the Skip-gram model, but for sentences, where we try to predict the surrounding sentences of a given source sentence.

Quick Thought Vectors are a more recent unsupervised approach to learning sentence embeddings. Details are given in the paper ‘An efficient framework for learning sentence representations’. Interestingly, they reformulate the problem of predicting the context in which a sentence appears as a classification problem by replacing the decoder with a classifier in the regular encoder-decoder architecture.

Quick Thought Vectors (Source: https://openreview.net/forum?id=rJvJXZb0W)

Thus, given a sentence and the context in which it appears, a classifier distinguishes context sentences from other contrastive sentences based on their embedding representations. Given an input sentence, it is first encoded by using some function. But instead of generating the target sentence, the model chooses the correct target sentence from a set of candidate sentences. Viewing generation as choosing a sentence from all possible sentences, this can be seen as a discriminative approximation to the generation problem.
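To make the classification view concrete, here is a toy numpy sketch of the scoring step: the encoded source sentence is scored against a set of candidate sentence encodings, and a softmax over those scores is trained so the true context sentence wins. The dimensions and random "encodings" are placeholders of mine, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dim = 4
f_source = np.random.randn(dim)             # encoding of the input sentence, f(s)
g_candidates = np.random.randn(5, dim)      # encodings of 5 candidate sentences, g(c)

scores = g_candidates @ f_source            # one score per candidate (inner product)
probs = softmax(scores)                     # classifier output over the candidate set

true_context = 2                            # index of the actual context sentence
loss = -np.log(probs[true_context])         # cross-entropy: push up the true context
print(probs, loss)
```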

InferSent is, interestingly, a supervised learning approach to learning universal sentence embeddings using natural language inference data. This is hardcore supervised transfer learning: just as we get pre-trained models trained on the ImageNet dataset for computer vision, here we get universal sentence representations trained using supervised data from the Stanford Natural Language Inference datasets. Details are in their paper, ‘Supervised Learning of Universal Sentence Representations from Natural Language Inference Data’. The dataset used by this model is SNLI, which comprises 570k human-generated English sentence pairs manually labeled with one of three categories: entailment, contradiction and neutral. It captures natural language inference useful for understanding sentence semantics.

InferSent training scheme (Source: https://arxiv.org/abs/1705.02364)

Based on the architecture depicted in the above figure, we can see that it uses a shared sentence encoder that outputs a representation for the premise u and the hypothesis v. Once the sentence vectors are generated, 3 matching methods are applied to extract relations between u and v:

Concatenation (u, v)

Element-wise product u ∗ v

Absolute element-wise difference |u − v|

The resulting vector is then fed into a 3-class classifier consisting of multiple fully connected layers culminating in a softmax layer.
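To make this concrete, here is a rough tf.keras sketch of the matching features and the 3-way classifier head. The layer sizes and encoding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

def nli_head(u, v):
    # Matching features: concatenation, absolute difference and element-wise product
    features = tf.concat([u, v, tf.abs(u - v), u * v], axis=-1)
    hidden = tf.keras.layers.Dense(512, activation="relu")(features)
    # 3-way softmax over entailment / contradiction / neutral
    return tf.keras.layers.Dense(3, activation="softmax")(hidden)

# Toy usage: batches of premise (u) and hypothesis (v) encodings from a shared encoder
u = tf.random.normal((8, 4096))
v = tf.random.normal((8, 4096))
print(nli_head(u, v).shape)   # (8, 3)
```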

Universal Sentence Encoder from Google is one of the latest and best universal sentence embedding models, published in early 2018! The Universal Sentence Encoder encodes any body of text into 512-dimensional embeddings that can be used for a wide variety of NLP tasks including text classification, semantic similarity and clustering. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks, which require modeling the meaning of sequences of words rather than just individual words.

Their key finding is that transfer learning using sentence embeddings tends to outperform word-embedding-level transfer. Do check out their paper, ‘Universal Sentence Encoder’, for further details. Essentially, they have two versions of their model available in TF-Hub as universal-sentence-encoder. Version 1 makes use of the transformer-network based sentence encoding model and Version 2 makes use of a Deep Averaging Network (DAN), where input embeddings for words and bi-grams are first averaged together and then passed through a feed-forward deep neural network (DNN) to produce sentence embeddings. We will be using Version 2 in our hands-on demonstration shortly.
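Since the notebook relies on the tf.estimator APIs, the typical way to plug such a TF-Hub module into an estimator is through a text embedding feature column. The snippet below is a sketch of that pattern; the module handle/version and hidden-unit sizes are my assumptions, not taken verbatim from the notebook.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Feature column that maps a raw 'sentence' string into a 512-dim USE embedding
embedding_feature = hub.text_embedding_column(
    key='sentence',
    module_spec="https://tfhub.dev/google/universal-sentence-encoder/2",
    trainable=False)

# Simple DNN classifier on top of the sentence embeddings (binary sentiment)
dnn = tf.estimator.DNNClassifier(
    hidden_units=[512, 128],
    feature_columns=[embedding_feature],
    n_classes=2)
```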

Understanding our Text Classification Problem

It’s time to put some of these universal sentence encoders into action with a hands-on demonstration! As mentioned earlier, our demonstration will focus on a very popular NLP task, text classification, in the context of sentiment analysis. We will be working with the benchmark IMDB Large Movie Review Dataset. Feel free to download it here, or you can even download it from my GitHub repository.

This dataset comprises a total of 50,000 movie reviews, where 25K have positive sentiment and 25K have negative sentiment. We will train our models on 30,000 reviews, validate on 5,000 reviews, and use the remaining 15,000 reviews as our test dataset. The main objective is to correctly predict the sentiment of each review as either positive or negative.
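As a rough sketch of the split described above, assuming the reviews have already been consolidated into a single CSV with review and sentiment columns (the file name and column layout are hypothetical, not necessarily what the original notebook uses):

```python
import pandas as pd

dataset = pd.read_csv('movie_reviews.csv')   # 50,000 labeled IMDB reviews

train = dataset[:30000]        # 30,000 reviews for training
val = dataset[30000:35000]     # 5,000 reviews for validation
test = dataset[35000:]         # 15,000 reviews for testing

print(train.shape, val.shape, test.shape)
```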

Universal Sentence Embeddings in Action

Now that we have our main objective cleared up, let’s put universal sentence encoders into action! The entire tutorial is available in my GitHub repository as a Jupyter Notebook. Feel free to download it and play around with it. I recommend using a GPU-based instance for playing around with this. I love using Paperspace where you can spin up notebooks in the cloud without needing to worry about configuring instances manually.

My setup was an 8-CPU, 30 GB RAM, 250 GB SSD machine with an NVIDIA Quadro P4000, which is usually cheaper than most AWS GPU instances (I love AWS though!).

Note: This tutorial is built entirely using TensorFlow, given that it provides easy access to the sentence encoders. However, I’m not a big fan of its older APIs, and I’m looking for someone to assist me in re-implementing the code using the tf.keras APIs instead of tf.estimator. Do reach out to me if you are interested in contributing, and we can even feature your work! (contact links in my profile and in the footer)

Load up Dependencies