
How does one actually apply SGD to word2vec?

My questions are:

What constitutes a mini-batch? Do we (at least conceptually) take derivatives with respect to all model parameters? And how does changing the objective function (to negative sampling, for example) affect how we do this?

It's not clear to me what the “data examples” are. My guess is that, since the objective function is:

$$ J(\theta) = \frac{1}{T} \sum^T_{t=1} J(w_{t} ; \theta)= \frac{1}{T} \sum^T_{t=1} \sum_{-m\leq j \leq m} -\log p(w_{t+j} \mid w_t ; \theta) $$

we sample some subset of center words of size $T' \ll T$, since there are lots of center words. Further, since we can't minimize this exactly in one step (if we could, I assume we probably wouldn't be doing SGD), we approximately minimize it by sampling center words.
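To make my mental model concrete, here is a minimal sketch of what I imagine a mini-batch to be (all sizes and the uniform sampling are hypothetical, just for illustration): sample a few center positions $t$, then pair each with its window words $w_{t+j}$, $-m \leq j \leq m$, $j \neq 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000          # corpus length (hypothetical)
batch_size = 32     # number of center positions per SGD step
m = 2               # context window half-width

# One "data example" would be a (center, context) position pair;
# a mini-batch samples a handful of center positions t, then
# pairs each with its surrounding positions t+j, -m <= j <= m, j != 0.
centers = rng.integers(m, T - m, size=batch_size)
pairs = [(t, t + j) for t in centers
         for j in range(-m, m + 1) if j != 0]

print(len(pairs))   # batch_size * 2m pairs in total
```

Is this roughly what a mini-batch means here, i.e., the SGD step is taken on the summed loss over these pairs?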

Also, when we use SGD I assume we compute $\nabla_\theta J$ where $\theta = [U, V]$ stacks all the input and output vectors. I assume that most partial derivatives are zero, since for most words $w$ we have $\frac{ \partial \log p(w_{t+j} \mid w_t ; \theta)}{\partial v_w} = 0$, because $w$ does not actually appear in the current window. I assume we don't actually do the update with a gigantic gradient vector that is mostly zeros. If not, how is this actually done?
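For instance, this is how I picture the sparse update on the input vectors $V$ for a single (center, context) pair under the full softmax (word indices, sizes, and learning rate are made up for the sketch; note the gradient with respect to $U$ is still dense here, since the softmax sums over the whole vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, lr = 1000, 50, 0.05
V = rng.normal(scale=0.1, size=(vocab, dim))   # input (center) vectors
U = rng.normal(scale=0.1, size=(vocab, dim))   # output (context) vectors

center, context = 7, 42                        # hypothetical word indices

# Full-softmax probabilities for this center word:
scores = U @ V[center]                         # (vocab,)
p = np.exp(scores - scores.max())
p /= p.sum()

# dJ/dV is nonzero only in the center's row, so we update
# just that one row rather than a dense (vocab, dim) matrix
# of mostly zeros:
grad_v = U.T @ p - U[context]
V[center] -= lr * grad_v
```

Is this row-wise update essentially what implementations do?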

Last thing: I assume that the gradient of the per-example objective $J(w_{t} ; \theta)$ depends only on which variables appear in it, i.e., which variables we can actually take derivatives with respect to. For example, if we use negative sampling, which has a different loss form:

$$ J(w_{t} ; \theta) = \log \sigma( u_{w_o}^T v_{w_t} ) + \sum^K_{i=1} \mathbb{E}_{w_i \sim P(w)} \left[ \log \sigma( - u^T_{w_i} v_{w_t} )\right]$$

then whether a given partial derivative is zero or not (or takes some other value) depends on which words we sample. Right?
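To check my understanding with a sketch: minimizing the negative of the objective above, the gradient for one pair should only touch the center row of $V$ and the $K+1$ sampled rows of $U$ (indices, sizes, and the uniform noise distribution are placeholders; real implementations draw negatives from a unigram-based $P(w)$):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, dim, lr, K = 1000, 50, 0.05, 5
V = rng.normal(scale=0.1, size=(vocab, dim))   # input (center) vectors
U = rng.normal(scale=0.1, size=(vocab, dim))   # output (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, context = 7, 42                               # hypothetical indices
negatives = rng.choice(vocab, size=K, replace=False)  # stand-in for draws from P(w)

v_c = V[center]
# Positive term: gradient coefficient for the true context word.
g_pos = sigmoid(U[context] @ v_c) - 1.0       # scalar
# Negative terms: gradient coefficients for the K sampled words.
g_neg = sigmoid(U[negatives] @ v_c)           # (K,)

# Only these K+2 rows receive a nonzero gradient:
grad_v = g_pos * U[context] + g_neg @ U[negatives]
U[context] -= lr * g_pos * v_c
U[negatives] -= lr * g_neg[:, None] * v_c
V[center] -= lr * grad_v
```

So the gradient is zero with respect to every word vector that was neither the center, the context, nor one of the sampled negatives. Is that the right picture?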