The 2014 paper by Kingma et al. does not deal with modelling discrete latent variables per se. In that paper, $y$ represents the discrete label information. When we don't have a label, we instead estimate it using $q_{\phi}(y|x)$ (a classifier). However, this is not given a full probabilistic treatment, as we never sample $y$ from it or score such a sample during training.

To make learning tractable in this model, we instead resort to marginalisation as shown below: $$p(x) = \sum_{y} p(x, y)$$
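As a toy illustration of this marginalisation (a sketch only: the Gaussian class-conditionals and uniform prior here are assumptions for the example, not the model from the paper):

```python
import numpy as np

# Toy generative model with K discrete classes:
# p(y) uniform, p(x|y) a unit-variance Gaussian with a class-dependent mean.
K = 3
means = np.array([-2.0, 0.0, 2.0])  # hypothetical class means

def p_x_given_y(x, y):
    # Density of a unit-variance Gaussian centred at means[y]
    return np.exp(-0.5 * (x - means[y]) ** 2) / np.sqrt(2 * np.pi)

def p_x(x):
    # Exact marginal likelihood: enumerate every class and sum p(x, y)
    return sum((1.0 / K) * p_x_given_y(x, y) for y in range(K))
```

With $K$ classes the sum has $K$ terms, but with $d$ independent categorical factors it would have $K^d$ terms, which is the blow-up the answer refers to.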

You can imagine that, as $y$ becomes high-dimensional, marginalisation becomes intractable: the sum grows exponentially with the number of categorical factors. In the paper by Jang et al., by contrast, $y$ is a true latent random variable, so we can both sample $y \sim q_{\phi}(y|x)$ and score the sample. The main contribution of that paper (and of a concurrent one) is a reparametrisation trick for a relaxed, one-hot-encoded categorical distribution $\text{GS}(\pi, \tau)$ that converges to the categorical distribution $\text{Cat}(\pi)$ as the temperature $\tau$ is annealed to $0$.
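A minimal NumPy sketch of drawing one such relaxed sample (the logits and temperature below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    # Draw i.i.d. Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u))
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))
    # Temperature-scaled softmax of the perturbed logits;
    # this is the reparametrised draw from the relaxed categorical.
    z = (logits + g) / tau
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.6, 0.3]))  # log of the class probabilities pi
y_soft = gumbel_softmax_sample(logits, tau=0.5, rng=rng)
# y_soft lies on the probability simplex; as tau -> 0 the sample
# concentrates on a single coordinate, i.e. an exact one-hot draw.
```

Because the noise $g$ is sampled independently of the parameters, gradients flow through the softmax into the logits, which is exactly what direct sampling of a discrete $y$ does not allow.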