Embeddings for anything

Word vectors are essential tools for a wide variety of NLP tasks. But pre-trained word vectors don’t exist for the types of entities businesses often care most about. While there are pre-trained word2vec embeddings for words like ‘red’ and ‘banana’, there are none for users of a social network, local businesses, or any other entity that isn’t frequently mentioned in the Google News corpus from which the word2vec embeddings were derived.

Businesses care about their customers, their employees, their suppliers, and other entities for which there are no pre-trained embeddings. Once trained, vectorized representations of entities can be used as inputs to a wide range of machine learning models. For example, they could be used in models predicting which ads users are likely to click on, which university applicants are likely to graduate with honors, or which politician is likely to win an election.

Entity embeddings allow us to accomplish these types of tasks by leveraging the bodies of natural language text associated with these entities that businesses frequently have. For example, we can create entity embeddings from the posts a user has written, the personal statement a university applicant wrote, or the tweets and blog posts people write about a politician.

Any business that has entities paired with text could make use of entity embeddings, and when you think about it, most businesses have this one way or another: Facebook has users and the text they post or are tagged in; LinkedIn has users and the text of their profiles; Yelp has users and the reviews they write, along with businesses and the reviews written about them; Airbnb has places to stay, along with descriptions and reviews; universities have applicants and the admissions essays they write; and the list goes on. In fact, Facebook recently published a paper detailing an entity embedding technique.

The aim with my entity2vec project was to find a way to use text associated with entities to create general-use embeddings that represent those entities. To do this, I used a technique somewhat similar to word2vec’s negative sampling to squeeze the information from a large body of text known to be associated with a certain entity into entity embeddings.

Example 1: Famous People

To develop and test the technique, I tried training embeddings to represent prominent people (e.g. Barack Obama, Lady Gaga, Angelina Jolie, Bill Gates). Prominent people were a good starting point because pre-trained Google word2vec embeddings exist for these very famous people’s names and are freely available, so I’d be able to compare my embeddings’ performance against the word2vec vectors for those names.

As with word2vec, I needed a training task that would force the entity embeddings to learn general information about the entities they stand for. I decided to train a classifier that would take a snippet of text from a person’s Wikipedia article and learn to guess who that snippet is about.

The training task would take several entity embeddings as input and would output the position of the entity embedding that the text snippet is about. In the following example, the classifier would see as input a text snippet about Obama, as well as the embeddings for Obama, and three other randomly chosen people. The classifier would output a number representing which of its inputs is the Obama embedding.

All of the embeddings would be trainable at each step, so not only would the correct person’s embedding learn something about who that person is, but the incorrect embeddings would also learn something about who their people are not.
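The core of this training step can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the embedding dimension, entity count, and learning rate are arbitrary stand-ins, and a random vector plays the role of the encoded text snippet. It shows how one cross-entropy step over the candidate positions updates all of the candidate embeddings at once.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_entities, k = 50, 100, 4      # hypothetical sizes, not the real ones

# One trainable embedding row per entity.
entity_emb = rng.normal(scale=0.1, size=(n_entities, dim))

def forward(snippet_vec, candidate_ids):
    """Score each candidate embedding against the snippet vector,
    then softmax over the k candidate positions."""
    scores = entity_emb[candidate_ids] @ snippet_vec
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def train_step(snippet_vec, candidate_ids, target_pos, lr=0.1):
    """Cross-entropy loss on the position of the correct entity.
    All k candidate embeddings receive gradient: the correct one
    moves toward the snippet, the negatives move away from it."""
    probs = forward(snippet_vec, candidate_ids)
    grad = probs.copy()
    grad[target_pos] -= 1.0                          # d(loss)/d(scores)
    entity_emb[candidate_ids] -= lr * np.outer(grad, snippet_vec)
    return -np.log(probs[target_pos])

# Toy step: a snippet "about" entity 7, shown with three random negatives.
snippet = rng.normal(size=dim)
candidates = np.array([7, 12, 55, 90])
loss_before = train_step(snippet, candidates, target_pos=0)
loss_after = -np.log(forward(snippet, candidates)[0])
```

In the full setup the snippet vector would itself come from some encoder over the Wikipedia text; here a random vector stands in for it, but the gradient flow into both the correct and the negative embeddings is the same.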

This technique seemed sensible intuitively, but, in order to validate my results, I needed to try the resulting embeddings out on some other tasks to see if they’d actually learned general information about their entities.

To do this, I trained simple classifiers on several other tasks that took entity embeddings as inputs and output classifications such as the entity’s gender or occupation. Here is the architecture of these classifiers:
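A probe of this shape can be as simple as a single logistic layer over the embedding. The sketch below uses synthetic data (random vectors with a weak label-correlated direction baked in) purely to illustrate the setup; real entity embeddings and real labels would replace the stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 50, 200                          # hypothetical embedding size / dataset size

# Stand-in for trained entity embeddings: random vectors plus a weak
# signal along one direction that correlates with the label.
true_dir = rng.normal(size=dim)
labels = rng.integers(0, 2, size=n)       # e.g. 0/1 gender labels
embs = rng.normal(size=(n, dim)) + np.outer(labels * 2 - 1, true_dir) * 0.5

# A single logistic layer over the embedding, trained by
# full-batch gradient descent on the cross-entropy loss.
w, b = np.zeros(dim), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(embs @ w + b)))  # predicted P(label = 1)
    grad = p - labels
    w -= 0.1 * embs.T @ grad / n
    b -= 0.1 * grad.mean()

acc = ((embs @ w + b > 0) == labels).mean()
```

If the embeddings really do encode general information about their entities, even a shallow probe like this should beat chance on attributes such as gender or occupation, which is exactly what the comparison below checks.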

And here are the results obtained, compared against random guessing and against the same classifiers trained on word2vec embeddings.

My embeddings performed pretty much on par with the word2vec embeddings even though mine were trained on much less text — about 30 million words vs 100 billion. That’s roughly 3,000 times less text, more than three orders of magnitude!