One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the past year, including GloVe and matrix factorization via SVD.

In case you missed the buzz, word2vec was widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as deep learning (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one vector per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) resembles the vector for “Toronto Maple Leafs”.
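To make the vector arithmetic concrete, here is a minimal sketch of how such an analogy query can be answered with cosine similarity. The five 2D vectors are hand-made for illustration (they are not learned by word2vec), and the `analogy` helper is a hypothetical name, not part of any library.

```python
import numpy as np

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# vec("king") - vec("man") + vec("woman")
best = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(best)
```

Excluding the query words from the candidates is a common convention in analogy evaluation; without it, the nearest vector is often one of the inputs.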

Check out my online word2vec demo and the blog series on optimizing word2vec in Python for more background.

So, what’s changed?

For one, Tomáš Mikolov no longer works for Google :-)

More relevantly, there was a lovely piece of research done by the good people at Stanford: Jeffrey Pennington, Richard Socher and Christopher Manning. They explicitly identified the objective that word2vec optimizes through its async stochastic gradient backpropagation algorithm, and neatly connected it to the well-established field of matrix factorizations.

And in case you’ve never heard of that: in short, word2vec ultimately learns word vectors and word context vectors. These can be viewed as two 2D matrices (of floats), of size #words x #dim each. Their method GloVe (Global Vectors) identified a matrix which, when factorized using the particular SGD algorithm of word2vec, yields exactly these two matrices. So where word2vec was a bit hazy about what’s going on underneath, GloVe explicitly names the “objective” matrix, identifies the factorization, and provides some intuitive justification as to why this should give us working similarities.

Very nice and clear paper, go read it if you haven’t!

For example, if we have the following nine preprocessed sentences, and set window=5, the co-occurrence matrix looks like this:

    # nine input sentences
    texts = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['eps', 'user', 'interface', 'system'],
             ['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]

    # word-word co-occurrence matrix, with context window size of 5
    [[0 1 1 1 1 1 1 1 0 0 0 0]
     [1 0 1 0 0 2 0 0 1 0 0 0]
     [1 1 0 0 0 1 0 1 1 0 0 0]
     [1 0 0 0 1 1 2 2 0 0 0 0]
     [1 0 0 1 0 1 1 1 0 0 1 1]
     [1 2 1 1 1 2 1 2 3 0 0 0]
     [1 0 0 2 1 1 0 2 0 0 0 0]
     [1 0 1 2 1 2 2 0 1 0 0 0]
     [0 1 1 0 0 3 0 1 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 2 1]
     [0 0 0 0 1 0 0 0 0 2 0 2]
     [0 0 0 0 1 0 0 0 0 1 2 0]]
    # (rows/columns represent words:
    # "computer human interface response survey system time user eps trees graph minors",
    # in that order)

Note how the matrix is very sparse and symmetrical; the implementation we’ll use below takes advantage of both these properties to train GloVe more efficiently.
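As a rough sketch of the counting step, the snippet below builds the raw symmetric co-occurrence counts for the nine sentences with a plain double loop. The function name `cooccurrence_counts` is my own, and this dense NumPy version is for illustration only; it is not how the Cython implementation used later stores its (sparse) matrix.

```python
import numpy as np

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

# Same row/column order as the matrix in the post.
vocab = "computer human interface response survey system time user eps trees graph minors".split()
word2id = {word: i for i, word in enumerate(vocab)}


def cooccurrence_counts(texts, word2id, window=5):
    """Count how often each pair of words appears within `window` positions of each other."""
    counts = np.zeros((len(word2id), len(word2id)), dtype=int)
    for text in texts:
        ids = [word2id[word] for word in text]
        for pos, word in enumerate(ids):
            start = max(0, pos - window)
            end = min(len(ids), pos + window + 1)
            for other_pos in range(start, end):
                if other_pos != pos:
                    counts[word, ids[other_pos]] += 1
    return counts


counts = cooccurrence_counts(texts, word2id, window=5)
print(counts)
```

Because every pair of positions is visited from both sides, the result is symmetric by construction, which is the property the note above refers to.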

The GloVe algorithm then transforms such raw integer counts into a matrix where the co-occurrences are weighted based on their distance within the window (word pairs farther apart get less co-occurrence weight):

    # same row/column names as above
    [[ 0.    0.5   1.    0.5   0.5   1.    0.33  1.    0.    0.    0.    0.  ]
     [ 0.    0.    1.    0.    0.    2.    0.    0.    0.5   0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    1.    0.    1.    0.5   0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.25  1.    2.    1.33  0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.33  0.2   1.    0.    0.    0.5   1.  ]
     [ 0.    0.    0.    0.    0.    0.    0.5   1.    1.67  0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.75  0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    1.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    1.5   1.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]

then takes a log, and factorizes this matrix to produce the final word vectors.
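The distance weighting can be sketched as follows. This is my reading of the weighted matrix shown above, not code taken from the GloVe implementation: each pair of positions contributes 1/distance, only the upper triangle is stored, and a word co-occurring with itself is skipped (an assumption inferred from the zero diagonal). The function name `weighted_cooccurrences` is hypothetical.

```python
import numpy as np

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

vocab = "computer human interface response survey system time user eps trees graph minors".split()
word2id = {word: i for i, word in enumerate(vocab)}


def weighted_cooccurrences(texts, word2id, window=5):
    """Upper-triangular co-occurrence weights, each pair contributing 1 / distance."""
    weights = np.zeros((len(word2id), len(word2id)))
    for text in texts:
        ids = [word2id[word] for word in text]
        for pos, word in enumerate(ids):
            # Look only to the right, so each pair of positions is visited once.
            for other_pos in range(pos + 1, min(len(ids), pos + window + 1)):
                other = ids[other_pos]
                if other == word:
                    continue  # assumption: self co-occurrences are ignored
                i, j = min(word, other), max(word, other)
                weights[i, j] += 1.0 / (other_pos - pos)
    return weights


weights = weighted_cooccurrences(texts, word2id, window=5)
print(np.round(weights, 2))
```

The subsequent log and factorization steps are not shown here; the sketch covers only the weighting.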

This was really exciting news — it means the plain softmax word2vec essentially reduces to counting how many times words occur together, with some scaling thrown in. Technically, this is just a glorified cooccurrence_counts[word, other_word]++ in a loop, followed by any of the standard matrix factorization algorithms, both of which are well understood processes with efficient implementations.
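For reference, the weighted least-squares objective from the GloVe paper can be written as below. The symbols follow the paper rather than this post: $X_{ij}$ is the (distance-weighted) co-occurrence of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. As far as I recall, the paper's defaults are $x_{\max} = 100$ and $\alpha = 3/4$.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

Since $f(0) = 0$, only the non-zero cells of the co-occurrence matrix contribute to the sum, which is why the sparsity of that matrix matters for training speed.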

GloVe vs word2vec

Oddly, the evaluation section of this GloVe paper didn’t match the quality of the rest. It had serious flaws in how the experiments compared GloVe to other methods. Several people called the authors out on the weirdness, most lucidly the Levy & Goldberg research duo from Bar-Ilan University; check out their “apples to apples” blog post for a bit of academic drama. To summarize, when evaluated properly, paying attention to parameter settings, GloVe doesn’t really seem to outperform the original word2vec, let alone by 11% as the GloVe paper claimed.

Luckily, Maciej Kula implemented GloVe in Python, using Cython for performance. Using his neat implementation, we can try to make sense of the performance and accuracy ourselves.

Code to train GloVe in Python:

    from gensim import utils, corpora, matutils, models
    import glove

    # Restrict dictionary to the 30k most common words.
    wiki = models.word2vec.LineSentence('/data/shootout/title_tokens.txt.gz')
    id2word = corpora.Dictionary(wiki)
    id2word.filter_extremes(keep_n=30000)
    word2id = dict((word, id) for id, word in id2word.iteritems())

    # Filter all wiki documents to contain only those 30k words.
    filter_text = lambda text: [word for word in text if word in word2id]
    filtered_wiki = lambda: (filter_text(text) for text in wiki)  # generator

    # Get the word co-occurrence matrix -- needs lots of RAM!!
    cooccur = glove.Corpus()
    cooccur.fit(filtered_wiki(), window=10)

    # and train GloVe model itself, using 10 epochs
    model_glove = glove.Glove(no_components=600, learning_rate=0.05)
    model_glove.fit(cooccur.matrix, epochs=10)

And similarly for training word2vec:

    model_word2vec = models.Word2Vec(size=600, window=10)
    model_word2vec.build_vocab(filtered_wiki())
    model_word2vec.train(filtered_wiki())

The reason why we restricted the vocabulary to only 30,000 words is that Maciej’s implementation of GloVe requires memory quadratic in the number of words: it keeps that sparse matrix of all word x word co-occurrences in RAM. In contrast, the gensim word2vec implementation is happy with linear memory, so millions of words are not a problem there. This is not an intrinsic limitation of GloVe though; with a different implementation, the co-occurrence matrix could be assembled out-of-core (Map/Reduce seems ideal for the job), and the factorization could just stream over it with constant memory too, in a more gensim-like fashion.

Results for 600 dims, context window of 10, 1.9B words of EN Wikipedia.

| algorithm | accuracy on the word analogy task | wallclock time | peak RAM [MB] |
|---|---|---|---|
| I/O only = iterating over wiki with sum(len(text) for text in filtered_wiki()) | N/A | 3m | 25 |
| GloVe, 10 epochs, learning rate 0.05 | 67.1% | 4h12m | 9,414 |
| GloVe, 100 epochs, learning rate 0.05 | 67.3% | 18h39m | 9,452 |
| word2vec, hierarchical skipgram, 1 epoch | 57.4% | 3h10m | 266 |
| word2vec, negative sampling with 10 samples, 1 epoch | 68.3% | 8h38m | 628 |
| word2vec, pre-trained GoogleNews model released by Tomáš Mikolov, 300 dims, 3,000,000 vocabulary | 55.3% | ? | ? |

Basically, where GloVe precomputes the large word x word co-occurrence matrix in memory and then quickly factorizes it, word2vec sweeps through the sentences in an online fashion, handling each co-occurrence separately. So, there is a tradeoff between taking more memory (GloVe) vs. taking longer to train (word2vec). Also, once computed, GloVe can re-use the co-occurrence matrix to quickly factorize with any dimensionality, whereas word2vec has to be trained from scratch after changing its embedding dimensionality.

Note that both implementations are fairly optimized, running on 8 threads (on an 8 core machine), using the exact same input corpus, text preprocessing, vocabulary and evaluation code, so that the numbers are directly comparable. Code here.

SPPMI and SVD

In a manner analogous to GloVe, Levy and Goldberg (the same researchers mentioned above) analyzed the objective function of word2vec with negative sampling. That’s the one that performed best in the table above, so I decided to check it out too.

Again, they manage to derive a beautifully simple connection to matrix factorization. This time, the word x context objective “source” matrix is computed differently to GloVe. Each matrix cell, corresponding to word $w$ and context word $c$, is computed as $\max(0, \mathrm{PMI}(w, c) - \log k)$, where $k$ is the number of negative samples in word2vec (for example, $k = 10$). PMI is the standard pointwise mutual information: if we use the notation that word $w$ and context $c$ occurred together $\#(w, c)$ times in the training corpus, then $\mathrm{PMI}(w, c) = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$ (no smoothing), where $\#(w)$ and $\#(c)$ are the total counts of $w$ and $c$, and $|D|$ is the total number of word-context pairs.

The funky “SPPMI” name simply reflects that we’re subtracting $\log k$ from PMI (“shifting”) and that we’re taking the $\max(0, \cdot)$ (“positive”; should be non-negative, really). So, Shifted Positive Pointwise Mutual Information.

For example, for the same nine texts we used above and $k = 1$, the SPPMI matrix looks like this:

    [[ 0.    0.83  0.83  0.49  0.49  0.    0.49  0.13  0.    0.    0.    0.  ]
     [ 0.83  0.    1.16  0.    0.    0.83  0.    0.    0.98  0.    0.    0.  ]
     [ 0.83  1.16  0.    0.    0.    0.13  0.    0.47  0.98  0.    0.    0.  ]
     [ 0.49  0.    0.    0.    0.49  0.    1.18  0.83  0.    0.    0.    0.  ]
     [ 0.49  0.    0.    0.49  0.    0.    0.49  0.13  0.    0.    0.83  1.05]
     [ 0.    0.83  0.13  0.    0.    0.    0.    0.13  1.05  0.    0.    0.  ]
     [ 0.49  0.    0.    1.18  0.49  0.    0.    0.83  0.    0.    0.    0.  ]
     [ 0.13  0.    0.47  0.83  0.13  0.13  0.83  0.    0.29  0.    0.    0.  ]
     [ 0.    0.98  0.98  0.    0.    1.05  0.    0.29  0.    0.    0.    0.  ]
     [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.37  1.9 ]
     [ 0.    0.    0.    0.    0.83  0.    0.    0.    0.    2.37  0.    2.08]
     [ 0.    0.    0.    0.    1.05  0.    0.    0.    0.    1.9   2.08  0.  ]]
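The matrix above can be reproduced from the raw co-occurrence counts with a few lines of NumPy. This is a minimal dense sketch under my own naming (`sppmi` is a hypothetical helper), using the natural logarithm and a shift of log k, with the word and context totals taken as row and column sums of the count matrix; a real vocabulary would need a sparse implementation.

```python
import numpy as np

# Raw co-occurrence counts for the nine toy sentences (window=5), copied from the post.
counts = np.array([
    [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 2, 2, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
    [1, 2, 1, 1, 1, 2, 1, 2, 3, 0, 0, 0],
    [1, 0, 0, 2, 1, 1, 0, 2, 0, 0, 0, 0],
    [1, 0, 1, 2, 1, 2, 2, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 3, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0],
], dtype=float)


def sppmi(counts, k=1):
    """Shifted positive PMI: max(0, PMI(w, c) - log k), with no smoothing."""
    total = counts.sum()                              # |D|
    word_totals = counts.sum(axis=1, keepdims=True)   # #(w), one per row
    context_totals = counts.sum(axis=0, keepdims=True)  # #(c), one per column
    with np.errstate(divide="ignore"):
        # log(0) = -inf for pairs that never co-occur; clipped to 0 below.
        pmi = np.log(counts * total / (word_totals * context_totals))
    return np.maximum(pmi - np.log(k), 0.0)


sppmi_k1 = sppmi(counts, k=1)
sppmi_k10 = sppmi(counts, k=10)
print(np.round(sppmi_k1, 2))
```

With k=10 almost every cell of this tiny example is shifted below zero and clipped, which gives a feel for how aggressively a larger shift sparsifies the matrix.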

No neural network training, no parameter tuning, we can directly take rows of this SPPMI matrix to be the word vectors. Very fast and simple. How does raw SPPMI compare to word2vec’s and GloVe’s factorizations though?

Comparison on 600 dims, context window 10, 1.9B words of EN Wikipedia.

| algorithm | accuracy on the analogy task | wallclock time | peak RAM [MB] |
|---|---|---|---|
| word2vec, negative sampling k=10, 1 epoch | 68.3% | 8h38m | 628 |
| GloVe, learning rate 0.05, 10 epochs | 67.1% | 4h12m | 9,414 |
| SPPMI, k=1 | 48.7% | 50m | 3,433 |
| SPPMI, k=10 | 30.3% | 50m | 3,429 |
| SPPMI-SVD, k=1 | 39.4% | 1h23m | 3,426 |
| SPPMI-SVD, k=10 | 3.8% | 1h23m | 3,444 |

The SPPMI-SVD method simply factorizes the sparse SPPMI matrix using Singular Value Decomposition (SVD), rather than the gradient descent methods of word2vec/GloVe, and uses the (dense) left singular vectors as the final word embeddings. SVD is a fast, scalable method with a straightforward geometric interpretation, and it performed very well in the NIPS experiments of Levy & Goldberg, who suggested SPPMI-SVD.
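A minimal sketch of the SVD step, assuming a small dense matrix: the random symmetric `toy` matrix below merely stands in for a real SPPMI matrix, and `svd_embeddings` is a hypothetical helper name. For a real vocabulary-sized sparse matrix, a truncated sparse solver (for example `scipy.sparse.linalg.svds`) would be the more realistic choice than the dense `numpy.linalg.svd` used here.

```python
import numpy as np


def svd_embeddings(matrix, dim):
    """Keep the `dim` left singular vectors with the largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # numpy returns singular values in descending order, so truncation is a slice.
    return u[:, :dim], s[:dim]


# Stand-in for an SPPMI matrix: symmetric, non-negative, with many zeros.
rng = np.random.default_rng(0)
toy = rng.random((12, 12))
toy = np.maximum(toy + toy.T - 1.0, 0.0)

# One 3-dimensional embedding per row (i.e. per word).
embeddings, singular_values = svd_embeddings(toy, dim=3)
print(embeddings.shape)
```

This sketch uses the left singular vectors unscaled, as described above; scaling them by some power of the singular values is a separate design choice that I have not evaluated here.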

In the table above, the quality of both SPPMI and SPPMI-SVD models is atrocious, especially for higher values of $k$ (more “shift”). I’m not sure why this is; I’ll try to get the original implementation of Levy & Goldberg to compare.

Also, I originally tried to get this table on 1,000 dims, rather than 600. But the GloVe implementation started failing, producing word vectors with NaNs in them, and weird <1% accuracies when I tried to combat that by decreasing its learning rate. Maciej is still working on that one, so if you’re thinking of using GloVe in production, beware. EDIT: Successfully resolved, Maciej’s GloVe handles that fine now.

To make experiments easier, I wrote and published a script that takes the parsed English Wikipedia and computes accuracy on the analogy task, using each of these different algorithms in turn. You can get it from GitHub and experiment with the various methods yourself, trying them on your own data / application.

What does that all mean?

Playing with Wikipedia is fun, but usually clients require more concrete insights from us.

How do we tweak word2vec to better model what we want?

How to tune word2vec model quality on a specific task (which is, in all likelihood, not “word analogies”)?

I’ll postpone that until the next post. Suffice it to say that the performance depends on tuning the methods’ internal parameters, in non-obvious ways. The Bar-Ilan powerhouse of Levy, Goldberg & Dagan wrote a full paper on that, exploring the various parameter combinations (dynamic vs. fixed context window, subsampling, negative distribution smoothing, taking context vectors into account as well, and so on). Their paper is under review now; I’ll post a link here as soon as it becomes public. EDIT: ACL paper here.

In the meantime, there has been some tentative research into the area of word2vec error analysis and tuning. Check out this web demo (by Levy & Goldberg again, from this paper), for investigating which contexts get activated for different words. This can lead to visual insights into which co-occurrences are responsible for a particular class of errors.

TL;DR: the word2vec implementation is still fine and state-of-the-art, you can continue using it :-)
