Estimating Gradients for Discrete Random Variables by Sampling without Replacement
Kool, Wouter and van Hoof, Herke and Welling, Max, 2020

Paper summary

gngdb

It's a shame that the authors weren't able to continue their series of [great][reinforce] [paper][attention] [titles][beams], although it looks like they thought about calling this paper **"Put Replacement In Your Basement"**. Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the **"unordered set estimator"**, which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of non-differentiable expectations over structured discrete distributions (that sounds specific, but it covers a vast number of important problems; for example, see [the first paragraph of this paper][mcts] and, less important but fun, adversarial examples or GANs on sequences). For a distribution $p_\theta(s)$, where $s$ can take some set of discrete states, this paper is concerned with how to compute the gradient of the expectation of a function $f$ over $p_\theta(s)$:

$$\nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right].$$

To jump directly to the definition of the estimator, I first have to define the *leave-one-out ratio*. Given a distribution $p$ over unordered sets $S^k$ of $k$ elements sampled without replacement, the leave-one-out ratio is

$$R(S^k, s) = \frac{p^{D \backslash \{ s \}} (S^k \backslash \{ s \} )}{p(S^k)},$$

where $p^{D \backslash \{ s \}}$ is the distribution over sets sampled from the domain with $s$ removed, i.e. conditioned on $s$ being the first sampled element. For a given element $s$, the ratio is the probability of sampling the other elements of $S^k$ *given $s$*, divided by the probability of sampling $S^k$. Then the *unordered set estimator* is

$$\nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \nabla_\theta \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} p_\theta(s) R(S^k, s) f(s) \right] \\ = \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) f(s) \right],$$

and they proceed to show that it is a Rao-Blackwellization of (so guaranteed to have variance as low as or lower than) the single-sample estimator, importance-weighted estimators, and [the RB discrete estimator][rb]. The authors also show that the multiple samples can be used to compute a built-in baseline and further reduce variance. If we define the leave-one-out ratio on a distribution with one element already left out as $R^{D \backslash \{ s \} } (S^k, s')$ (the *second-order* leave-one-out ratio), then

$$\nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \mathbb{E}_{S^k} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) \left( f(s) - \sum_{s' \in S^k} p_{\theta} (s') R^{D \backslash \{ s \} } (S^k, s') f(s') \right) \right].$$

Thanks to [stochastic beam search][beams], these sets can be sampled quickly and easily, so it all adds up to a powerful generic method. If I need gradients in an SGD setting of an expectation involving discrete variables, and I can take multiple samples, this is the most promising method I know about at the moment.

[reinforce]: https://openreview.net/forum?id=r1lgTGL5DE
[attention]: https://arxiv.org/abs/1803.08475
[beams]: https://arxiv.org/abs/1903.06059
[mcts]: https://arxiv.org/pdf/1910.06862.pdf
[rb]: https://arxiv.org/abs/1810.04777