
Back in 2013, Mikolov et al. released word2vec, a model which learns to associate words with vectors based on how often words appear near each other. What was intriguing was that the vectors it learned turned analogies into arithmetic:

$$v_\text{queen} - v_\text{woman} + v_\text{man} \approx v_\text{king}$$ $$v_\text{paris} - v_\text{france} + v_\text{italy} \approx v_\text{rome}$$
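If you have a set of pretrained vectors lying around, this is easy to check for yourself. Here's a minimal sketch using gensim; the file name is just a placeholder for whichever pretrained word2vec binary you happen to have downloaded:

```python
# Quick check of the analogy arithmetic, assuming gensim is installed and a
# pretrained word2vec binary (e.g. the GoogleNews vectors) is available.
from gensim.models import KeyedVectors

# Path is a placeholder -- point it at whatever pretrained binary you have.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# v_queen - v_woman + v_man: gensim adds the 'positive' vectors, subtracts
# the 'negative' ones, and returns the nearest words by cosine similarity.
print(vectors.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))
# Expect 'king' near the top of the list.
```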

Last week, Arora et al. released *Random Walks on Context Spaces*, which gives a rigorous explanation of how this phenomenon might have come about.

So mathematically, what’s an analogy?

If we’re given a word $w$, let’s use $p(w^\prime|w)$ to denote the probability that when we see the word $w$, another word $w^\prime$ shows up ‘nearby’. Then a natural probabilistic interpretation of

king is to queen as man is to woman

is that for most other words $w^\prime$, it should be the case that

$$\frac{p(w^\prime|\text{king})}{p(w^\prime|\text{queen})} \approx \frac{p(w^\prime|\text{man})}{p(w^\prime|\text{woman})}.$$
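To make ‘nearby’ concrete: one crude way to estimate these conditional probabilities is to count how often $w^\prime$ falls within a fixed window of $w$ in a corpus. A sketch, where the corpus file, window size, and probe word are all placeholder choices of mine:

```python
# Estimate p(w'|w) from windowed co-occurrence counts in a tokenised corpus.
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=5):
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

def conditional_prob(counts, w_prime, w):
    total = sum(counts[w].values())
    return counts[w][w_prime] / total if total else 0.0

tokens = open("corpus.txt").read().lower().split()  # corpus.txt is a placeholder
counts = cooccurrence_counts(tokens)

# The two ratios the analogy compares, for one arbitrary probe word w'.
# (A real corpus needs smoothing to avoid dividing by zero here.)
w_prime = "crown"
lhs = conditional_prob(counts, w_prime, "king") / conditional_prob(counts, w_prime, "queen")
rhs = conditional_prob(counts, w_prime, "man") / conditional_prob(counts, w_prime, "woman")
print(lhs, rhs)
```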

So finding a word $w$ such that

$w$ is to queen as man is to woman

becomes an optimization problem:

$$\arg \min_w E_{w^\prime}\left[ \left( \ln \frac{p(w^\prime|w)}{p(w^\prime|\text{queen})} - \ln \frac{p(w^\prime|\text{man})}{p(w^\prime|\text{woman})} \right)^2 \right].$$
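Spelled out as (slow, brute-force) code, and assuming you've already tabulated conditional probability estimates `probs[w][w_prime]` as in the sketch above, the search might look like this:

```python
# Scan the vocabulary for the w whose log-ratio profile against 'queen' best
# matches the 'man'/'woman' profile, averaging over a set of probe words w'.
# Assumes the four query words all appear as keys of `probs`.
import math

def best_analogy(probs, a="queen", b="man", c="woman", probes=None):
    """probs[w][w_prime] holds estimates of p(w_prime | w)."""
    probes = probes or list(probs)            # default: probe with every word
    best, best_score = None, float("inf")
    for w in probs:
        if w in (a, b, c):
            continue
        score, n = 0.0, 0
        for wp in probes:
            vals = (probs[w].get(wp), probs[a].get(wp),
                    probs[b].get(wp), probs[c].get(wp))
            if not all(vals):                 # skip probes with zero/missing estimates
                continue
            diff = math.log(vals[0] / vals[1]) - math.log(vals[2] / vals[3])
            score += diff ** 2
            n += 1
        if n and score / n < best_score:
            best, best_score = w, score / n
    return best
```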

But what’s that got to do with vectors?

Well, the optimization problem can be rewritten as

$$\arg \min_w E_{w^\prime} \left[ \left( (\text{PMI}(w^\prime, w) - \text{PMI}(w^\prime, \text{queen})) - (\text{PMI}(w^\prime, \text{man}) - \text{PMI}(w^\prime, \text{woman})) \right)^2 \right]$$

using the pointwise mutual information $\text{PMI}(w^\prime, w) = \ln \frac{p(w^\prime|w)}{p(w^\prime)}$, which measures how strongly associated two events are; dividing the numerator and denominator of each ratio above by $p(w^\prime)$ turns each log-ratio into a difference of PMIs. Now, suppose we had a set of word vectors $v_w$ such that $\text{PMI}(w^\prime, w) \approx v_{w^\prime} \cdot v_w $. Then the above would become

$$\arg \min_w E_{w^\prime} \left[ \left( (v_{w^\prime} \cdot v_w - v_{w^\prime} \cdot v_\text{queen}) - (v_{w^\prime} \cdot v_\text{man} - v_{w^\prime} \cdot v_\text{woman}) \right)^2 \right]$$

If we were lucky enough that these $v_w$ were also distributed isotropically, then it’d also be the case that $E_{w^\prime}[(v_{w^\prime} \cdot v_u)^2] \approx ||v_u||^2$ for any vector $v_u$, up to a constant factor that doesn’t affect the arg min. Taking $v_u = (v_w - v_\text{queen}) - (v_\text{man} - v_\text{woman})$, our optimization problem would become

$$\arg \min_w ||(v_w - v_\text{queen}) - (v_\text{man} - v_\text{woman})||^2$$

Or equivalently,

Find a $w$ such that $v_w \approx v_\text{queen} - v_\text{woman} + v_\text{man}$.

Which, hey, is how we started this article!

So to explain word2vec’s behaviour, we need a plausible way to construct vectors $v_w$ from co-occurrence frequencies such that

1. $v_w \cdot v_{w^\prime} \approx \text{PMI}(w, w^\prime)$, and
2. the $v_w$ are isotropically distributed.

And how’d you do that?

That’s the crux of Arora et al’s paper, where they construct a Markov chain such that (1) and (2) are satisfied. The idea is that as you move through a text, a hidden vector called the context drifts about. When the context is $c$, the probability of word $w$ being output is

$$p(w|c) \propto \exp(v_w \cdot c)$$

So word $w$ is likely to appear when the context vector is pointing in the same direction as $v_w$, and less likely when it’s pointing away. The intuition that the authors give is

The coordinates of the context vector represent topics. If the $i$th coordinate of the context corresponds to gender, its value represents the extent to which gender is being talked about at the moment. A positive value of this coordinate could correspond to maleness —leading to an increase in the probability of producing words like he, king, man— and negative value could correspond to femaleness —causing a probability increase for words like she, queen, woman.
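To make that concrete, here's a toy simulation of this kind of model, with made-up word vectors, dimensions, and step size of my own choosing: the context vector drifts slowly, and at each step a word is sampled from the softmax it defines.

```python
# Toy generative model: a slowly drifting context vector c, with each word
# emitted with probability proportional to exp(v_w . c).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["he", "she", "king", "queen", "man", "woman", "the", "of"]
dim = 50
V = rng.normal(size=(len(vocab), dim)) / np.sqrt(dim)   # stand-in word vectors

c = rng.normal(size=dim) / np.sqrt(dim)                 # initial context
words = []
for _ in range(20):
    logits = V @ c
    p = np.exp(logits - logits.max())
    p /= p.sum()                                        # softmax: p(w|c) proportional to exp(v_w . c)
    words.append(rng.choice(vocab, p=p))
    c += 0.05 * rng.normal(size=dim) / np.sqrt(dim)     # the context drifts slowly
print(" ".join(words))
```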

Anyway, the model assumes (2) holds, and then the authors show that (1) holds using a couple of pages of maths. The high-level idea is that

If $\text{PMI}(w, w^\prime)$ is large, then it must be that many of the context vectors which are likely to output one word are likely to output the other too.

And since a context vector $c$ is likely to output a word $w$ when $c$ and $v_w$ are ‘similar’,

if $c$ is likely to output both $w$ and $w^\prime$, then $v_w$ and $v_{w^\prime}$ must be similar to each other.

Which means $v_w \cdot v_{w^\prime}$ should be large.

No-one’s ever thought of that model before?

They probably have, but with the usual Bayesian methods it’s computationally intensive to fit. With (1), however, the word vectors $v_w$ can be constructed by calculating the empirical PMIs ($\widehat{\text{PMI}}$) from co-occurrence frequencies, and then solving

$$\min_{{v}} \sum_{w, w^\prime} \left( \widehat{\text{PMI}}(w, w^\prime) - v_{w^\prime} \cdot v_w\right)^2$$

using your continuous optimizer of choice.
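For instance, a bare-bones numpy version of that recipe (with dimension, learning rate, and step count all arbitrary placeholder choices) might look like:

```python
# Fit word vectors to a matrix of empirical PMI values by gradient descent
# on the squared reconstruction error.
import numpy as np

def fit_vectors(pmi, dim=50, lr=0.05, steps=500, seed=0):
    """pmi: (n_words, n_words) array of empirical PMI estimates, assumed symmetric."""
    rng = np.random.default_rng(seed)
    n = pmi.shape[0]
    v = rng.normal(scale=1.0 / np.sqrt(dim), size=(n, dim))
    for _ in range(steps):
        err = v @ v.T - pmi             # residual of PMI(w, w') ~ v_w . v_w'
        grad = 4.0 * err @ v / n**2     # gradient of the mean squared error
        v -= lr * grad
    return v
```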

Neat.

Yup. In order to simplify matters, I’ve done some fairly grave disservices to their work in this article, so if you’ve got the time it’s definitely worth looking at the paper. If you’re intimidated by the amount of maths you see, don’t be: they’ve pushed most of it into the appendices, and while anyone familiar with the phrase ‘concentration inequality’ will find those a useful read, anyone who isn’t won’t be missing much in the way of intuition.