Credits

If your time is in short supply, just read the Explained Paper Short Version.

Goals

The English language has on the order of 100,000 words. If we are working on an NLP problem, we can represent each word as a one-hot vector of dimension 100,000. This is a sparse, high-dimensional input. Our goal is to map it into a dense, low-dimensional representation of around 300 dimensions, \( v _w\). We can then feed this to some other model, like an LSTM, for NLP tasks. Hopefully, our new representation respects some semantic rules like

Words with similar meaning should be close to each other

The direction from “do” to “doing” should be similar to “go” to “going”

Doing vector arithmetic: king - male + female ≈ queen
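The analogy arithmetic above amounts to adding and subtracting vectors, then looking up the nearest word by cosine similarity. Here is a minimal sketch with hand-crafted 2-D toy vectors (real embeddings are ~300-D and learned, not constructed like this):

```python
import numpy as np

def most_similar(query, embeddings, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy embeddings, hand-crafted so the analogy holds exactly.
emb = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "male":   np.array([0.0, 1.0]),
    "female": np.array([0.0, -1.0]),
}

query = emb["king"] - emb["male"] + emb["female"]
print(most_similar(query, emb, exclude={"king", "male", "female"}))  # queen
```

In practice the query word itself (and the words in the analogy) are excluded from the search, since they are often the nearest vectors.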

It is surprising that word2vec can give us this without any labels for the meaning of words. All it needs is word co-occurrence. It is also surprisingly simple: nothing more complicated than logistic regression.

Ideas of Word2vec

Given a sentence:

You used to call me on my cell phone.

and a window size \( k \) (consider \( k = 1 \) for simplicity), we define the context of a word as its \( 2k \) neighboring words. Word2vec defines the positive dataset \( \mathcal{D} \) of word-context pairs:

| word | context |
|------|---------|
| you  | used    |
| used | you     |
| used | to      |
| to   | used    |
| to   | call    |
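Generating these pairs is a simple sliding-window pass over the sentence. A minimal sketch:

```python
def positive_pairs(tokens, k=1):
    """Pair each word with its neighbors within a window of k on each side."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "you used to call me on my cell phone".split()
print(positive_pairs(sentence, k=1)[:5])
# [('you', 'used'), ('used', 'you'), ('used', 'to'), ('to', 'used'), ('to', 'call')]
```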

Each of these pairs appears in the dataset, so we associate them with the label 1. We now define the negative word pairs dataset \( \mathcal{D}' \)

| word | non_context  |
|------|--------------|
| you  | random_word1 |
| used | random_word2 |
| to   | random_word3 |
| call | random_word4 |

and label these pairs as 0. (Note that when we pick a random word from the vocabulary, there is some tiny chance that the picked word is actually a valid context, but that chance is small enough to ignore.)
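The negative pairs can be generated by drawing random words from the vocabulary for each positive pair. A minimal sketch (sampling uniformly here; the enhanced sampling distribution is discussed later):

```python
import random

def negative_pairs(pos_pairs, vocabulary, num_neg=1, seed=0):
    """For each positive (word, context) pair, draw random 'non-context' words."""
    rng = random.Random(seed)
    neg = []
    for word, _context in pos_pairs:
        for _ in range(num_neg):
            neg.append((word, rng.choice(vocabulary)))
    return neg

sentence = "you used to call me on my cell phone".split()
vocab = sorted(set(sentence))
pos = [("you", "used"), ("used", "you"), ("used", "to")]
print(negative_pairs(pos, vocab, num_neg=1))
```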

We then use the logistic loss to train a vector representation for each word. Writing \( \sigma(x) = \frac{1}{1 + e ^ {-x}} \) for the sigmoid, and modeling the probability that a pair is a true word-context pair as \( p(D = 1 \mid w, c) = \sigma(v _c \cdot v _w) \), we maximize

\[ \prod_{(w, c) \in \mathcal{D}} p(D = 1 \mid w, c) \prod_{(w, c) \in \mathcal{D}'} \left( 1 - p(D = 1 \mid w, c) \right) \]

So we basically maximize the probability of seeing those word pairs in \( \mathcal{D}\), and not seeing those word pairs in \( \mathcal{D}'\). Taking the log, and using \( 1 - \sigma(x) = \sigma(-x) \), we can rewrite the optimization as

\[ \arg\max_{v _w, v _c} \sum_{(w, c) \in \mathcal{D}} \log \sigma(v _c \cdot v _w) + \sum_{(w, c) \in \mathcal{D}'} \log \sigma(- v _c \cdot v _w) \]

Now to be clear, all of the \( v _w\) are our vector representations of words; together, they form a matrix of size (vocabulary size by projection size), e.g. (100,000 by 300). All of the \( v _c \) are our vector representations of contexts; together, they form a matrix of the same size if we have one negative sample per positive sample. In practice this latter matrix is often discarded. We basically optimize with respect to those two matrices, such that the product \( v _c \cdot v _w\) is large for those [word, context] pairs we see in our dataset, and small for those [word, context] pairs we do not.
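The whole training loop can be sketched in a few lines of numpy. This is a toy illustration of the objective above (plain SGD, one pair at a time, no subsampling or learning-rate decay), not the optimized original implementation; the function name and hyperparameters are made up for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(pairs, labels, vocab, dim=8, lr=0.1, epochs=500, seed=0):
    """Gradient ascent on the log-likelihood: push sigma(v_c . v_w) toward the label."""
    rng = np.random.default_rng(seed)
    idx = {w: i for i, w in enumerate(vocab)}
    Vw = rng.normal(scale=0.1, size=(len(vocab), dim))  # word matrix
    Vc = rng.normal(scale=0.1, size=(len(vocab), dim))  # context matrix (often discarded)
    for _ in range(epochs):
        for (word, ctx), label in zip(pairs, labels):
            w, c = idx[word], idx[ctx]
            g = label - sigmoid(Vw[w] @ Vc[c])  # gradient of the log-likelihood
            Vw[w], Vc[c] = Vw[w] + lr * g * Vc[c], Vc[c] + lr * g * Vw[w]
    return Vw, Vc, idx
```

After training on a positive pair (label 1) and a negative pair (label 0), the dot product \( v _c \cdot v _w \) ends up large for the former and small for the latter, which is exactly what the objective asks for.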

Enhancement

The final method used is a bit more complicated, with tricks that make it work better, for example

Instead of sampling one negative context per [word, context] pair uniformly, it samples \(m\) context words with probability proportional to how often each context word appears in the dataset (raised to the 3/4 power in the original implementation).

Instead of using a fixed window \( k\) around each word, the window size is sampled uniformly from \( 1, 2, \ldots, K \).

Treat all rare words as an "UNK" token, and downsample the most common words.
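The first trick above, the smoothed sampling distribution for negatives, can be sketched like this (assuming counts are taken over the training corpus; the 3/4 exponent follows the original implementation):

```python
import collections
import random

def negative_sampler(tokens, power=0.75, seed=0):
    """Sample 'non-context' words with probability proportional to count**power."""
    counts = collections.Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)
    return lambda: rng.choices(words, weights=weights, k=1)[0]

sampler = negative_sampler("the the the cat sat".split())
print([sampler() for _ in range(5)])
```

Raising the counts to the 3/4 power flattens the distribution a bit: frequent words are still sampled more often, but not as overwhelmingly as their raw frequency would dictate.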

The method we have described so far is called Skip-Gram with Negative Sampling; the original paper also describes the Continuous Bag of Words (CBOW) model, which models the probability of a word given its context. The authors claim that skip-gram with negative sampling works better for bigger datasets (see more in the TensorFlow example).

Extra

This word representation, finding the two matrix representations \( V _w\) and \( V _c\), can be thought of as factorizing an implicit matrix \( M = V _w \cdot V _c ^ T \), where each element \( M _{ij}\) reflects the strength of association between word \( i \) and context \( j\). More specifically, it is found that

\[ M _{ij} = \mathrm{PMI}(w _i, c _j) - \log m \]

where \( m \) is the number of negative samples per positive pair, and PMI is the pointwise mutual information, defined as

\[ \mathrm{PMI}(w, c) = \log \frac{N(w, c) \cdot |\mathcal{D}|}{N(w) \cdot N(c)} \]

where \( N(w)\) counts the number of occurrences of \(w \) in \( \mathcal{D}\).
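The PMI matrix above can be computed directly from co-occurrence counts. A minimal sketch on a toy set of pairs (unseen pairs get \( \log 0 = -\infty \), which is why practical variants use the positive PMI instead):

```python
import numpy as np

def pmi_matrix(pairs, words, contexts):
    """PMI(w, c) = log( N(w,c) * |D| / (N(w) * N(c)) ), with -inf for unseen pairs."""
    N_wc = np.zeros((len(words), len(contexts)))
    wi = {w: i for i, w in enumerate(words)}
    ci = {c: i for i, c in enumerate(contexts)}
    for w, c in pairs:
        N_wc[wi[w], ci[c]] += 1
    N_w = N_wc.sum(axis=1, keepdims=True)  # N(w)
    N_c = N_wc.sum(axis=0, keepdims=True)  # N(c)
    D = N_wc.sum()                         # |D|
    with np.errstate(divide="ignore"):     # log(0) -> -inf for unseen pairs
        return np.log(N_wc * D / (N_w * N_c))

M = pmi_matrix([("a", "x"), ("a", "x"), ("b", "y")], ["a", "b"], ["x", "y"])
print(M)
```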

Testing out

The official website of word2vec has very fast C++ code, where one can test things out pretty easily. Head over there; training the model should take about five minutes. It even has the option of downloading pre-trained models. Here I'm using their pre-trained model on Google News words, and finding similar words to a given word (think of this like a thesaurus).