
I was reading about the standard/famous word2vec model, and according to Stanford's notes for CS224n the objective function changes from:

$$J_{original} = -\sum^{2m}_{j=0,\, j \neq m} u^\top_{c-m+j} v_c + 2m \log \left( \sum^{|V|}_{k=1} \exp(u^{\top}_k v_c) \right)$$

to:

$$J_{NS1} = -\log \sigma( u^\top_{c-m+j} v_c ) - \sum^{K}_{k=1} \log \sigma( -u^{\top}_k v_c )$$

or

$$ J_{NS2} = - \left( \log\sigma( u_{w_o}^\top v_{w_c} ) + \sum^K_{i=1} \mathbb{E}_{w_i \sim P(w)} \left[ \log \sigma( - u^\top_{w_i} v_{w_c})\right] \right)$$

I was wondering, where does the second objective function come from? Where does negative sampling come from? I don't require a rigorous proof/derivation, but any kind of justification would be nice. How is the second one approximating the first? In any sense? Roughly, approximately, intuitively: is there anything to justify this?

Note that I understand there is a speed gain. I am more interested in understanding what the thought process might have been to derive the above while still approximately optimizing the original function, or at least still ending up with good word embeddings.
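To make the two objectives concrete for myself, here is a tiny numerical sketch (everything in it is made up for illustration: the sizes, the random vectors, and the helper names `full_softmax_loss` / `neg_sampling_loss`; I also sample negatives uniformly rather than from the $P(w)$ distribution used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K = 1000, 50, 5                    # vocab size, embedding dim, # negatives

U = rng.normal(scale=0.1, size=(V, d))   # "outside"/output vectors u_k
v_c = rng.normal(scale=0.1, size=d)      # center word vector v_c
o = 17                                   # index of the true context word

def full_softmax_loss(U, v_c, o):
    # one (center, context) term of J_original: -u_o^T v_c + log sum_k exp(u_k^T v_c)
    scores = U @ v_c                     # needs all |V| inner products
    return -scores[o] + np.log(np.exp(scores).sum())

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(U, v_c, o, neg_idx):
    # J_NS1 for one pair: -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)
    pos = -np.log(sigmoid(U[o] @ v_c))
    neg = -np.log(sigmoid(-(U[neg_idx] @ v_c))).sum()
    return pos + neg

neg_idx = rng.integers(0, V, size=K)          # stand-in for drawing K negatives from P(w)
print(full_softmax_loss(U, v_c, o))           # touches all |V| rows of U
print(neg_sampling_loss(U, v_c, o, neg_idx))  # touches only 1 + K rows
```

The computational point is obvious from the sketch (only $1+K$ rows of $U$ are touched instead of all $|V|$); what I am after is why minimizing the second quantity is a sensible surrogate for minimizing the first.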

My own thoughts:

Let $P_{\theta}(D=1 \mid w,c)$ be the probability that a given word-context pair $(w,c)$ came from the corpus data. Consider $-J_{NS1} = \log \sigma( u^\top_{c-m+j} v_c ) + \sum^{K}_{k=1} \log \sigma( -u^{\top}_k v_c )$ (i.e. let's view things as maximizing probabilities). Maximizing the first term seems to correctly produce word vectors that are correlated: to make $-J_{NS1}$ large one can make the first term large by pushing $\sigma( u^\top_{c-m+j} v_c )$ close to 1, which is attained by making the inner product of the two vectors large.
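As a quick sanity check of that reading (my own algebra, not from the notes): writing $x = u^\top_{c-m+j} v_c$, the first term is $\log \sigma(x)$ and

$$\frac{d}{dx} \log \sigma(x) = \frac{\sigma(x)\bigl(1-\sigma(x)\bigr)}{\sigma(x)} = 1 - \sigma(x) = \sigma(-x) > 0,$$

so the term is strictly increasing in the inner product, i.e. the positive pair is indeed pushed toward a large inner product.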

However, it seems to me that the second term is actually pushing us toward word representations that are bad. Let's look at what the second term is:

$$ \log \sigma( -u^\top_{\text{not context}} v_{center}) = \log \left(1 - \sigma( u^\top_{\text{not context}} v_{center}) \right)$$

We can increase the above term by making $1 - \sigma( u^\top_{\text{not context}} v_{center})$ large, which means making $\sigma( u^\top_{\text{not context}} v_{center})$ small (close to zero "probability"). That requires a very negative argument to the sigmoid, which means we end up with vectors that have a large negative inner product. This seems wrong to me: if the inner product were zero, i.e. the words were perpendicular, that seems like the better target. Why did they choose the other one instead? Wouldn't perpendicular words be better? That is, if the words are not similar and thus not correlated, then they have nothing to do with each other and should have zero inner product.
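To put rough numbers on this (my own arithmetic, not from the notes): if $u^\top_{\text{not context}} v_{center} = 0$ (perpendicular), the term contributes $\log \sigma(0) = \log \tfrac{1}{2} \approx -0.69$, while an inner product of $-5$ contributes $\log \sigma(5) \approx -0.007$. So the objective clearly rewards strongly negative inner products over orthogonality.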

Essentially, why is a large negative inner product a better notion of word dissimilarity than an inner product of zero?