My current project involves working with deep autoregressive models: a class of remarkable neural networks that aren’t usually seen on a first pass through deep learning. These notes are a quick write-up of my reading and research: I assume basic familiarity with deep learning, and aim to highlight general trends and similarities across autoregressive models, instead of commenting on individual architectures.

tldr: Deep autoregressive models are sequence models, yet feed-forward (i.e. not recurrent); generative models, yet supervised. They are a compelling alternative to RNNs for sequential data, and to GANs for generation tasks.

Deep Autoregressive Models

To be explicit (at the expense of redundancy), this blog post is about deep autoregressive generative sequence models. That’s quite a mouthful of jargon (and two of those words are actually unnecessary), so let’s unpack that.

Deep

Well, these papers are using TensorFlow or PyTorch… so they must be “deep”.

You would think this word is unnecessary, but it’s actually not! Autoregressive linear models like ARMA or ARCH have been used in statistics, econometrics and financial modelling for ages.

Autoregressive

Stanford has a good introduction to autoregressive models, but I think a good way to explain these models is to compare them to recurrent neural networks (RNNs), which are far better known.

Obligatory RNN diagram. Source: Chris Olah.

Like an RNN, an autoregressive model’s output \(h_t\) at time \(t\) depends not just on \(x_t\), but also on \(x\)’s from previous time steps. However, unlike an RNN, the previous \(x\)’s are not provided via some hidden state: they are given as just another input to the model.

The following animation of Google DeepMind’s WaveNet illustrates this well: the \(t\)-th output is generated in a feed-forward fashion from several input \(x\) values.

WaveNet animation. Source: Google DeepMind.

Put simply, an autoregressive model is merely a feed-forward model which predicts future values from past values.
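To make "a feed-forward model which predicts future values from past values" concrete, here is a minimal sketch. It assumes a fixed window of `k` past values and uses a plain weighted sum as a stand-in for the deep network (WaveNet itself uses stacked convolutions, which this does not attempt to reproduce). The function names and the weights are hypothetical, chosen only for illustration.

```python
def make_training_pairs(series, k):
    """Turn one sequence into supervised (window, target) examples."""
    return [(series[t - k:t], series[t]) for t in range(k, len(series))]


def predict_next(window, weights, bias=0.0):
    """Feed-forward prediction: a weighted sum of the last k values, with no hidden state."""
    return bias + sum(w * x for w, x in zip(weights, window))


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pairs = make_training_pairs(series, k=2)

# Hypothetical weights that extrapolate a straight line: x_t = 2 * x_{t-1} - x_{t-2}
weights = [-1.0, 2.0]

# Every prediction depends only on its own window, so none has to wait for another.
predictions = [predict_next(window, weights) for window, _ in pairs]
```

Since the past values arrive as ordinary inputs, each `(window, target)` pair is an ordinary supervised example, and the predictions for different time steps can be computed independently of one another.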

I’ll explain this more later, but it’s worth saying now: autoregressive models offer a compelling bargain. You can have stable, parallel and easy-to-optimize training, faster inference computations, and completely do away with the fickleness of truncated backpropagation through time, if you are willing to accept a model that (by design) cannot have infinite memory. There is recent research to suggest that this is a worthwhile tradeoff.

Generative

Informally, a generative model is one that can generate new data after learning from the dataset.

More formally, a generative model models the joint distribution \(P(X, Y)\) of the observation \(X\) and the target \(Y\). Contrast this to a discriminative model that models the conditional distribution \(P(Y|X)\).
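As a short worked note on how the two definitions relate, using only standard probability identities: the joint distribution factors into the conditional that a discriminative model learns, times the distribution of the observations themselves.

\[ P(X, Y) = P(Y \mid X) \, P(X) \]

For a sequence (the notation \(x_1, \dots, x_T\) for its time steps is an assumption here, matching the time-step notation used earlier), the chain rule of probability writes the joint distribution as a product of next-step conditionals:

\[ P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}) \]

Each factor on the right is a prediction of one value from the values that came before it.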

GANs and VAEs are two families of popular generative models.

This is unnecessary word #1: any autoregressive model can be run sequentially to generate a new sequence! Start with your seed \(x_1, x_2, ..., x_k\) and predict \(x_{k+1}\). Then use \(x_2, x_3, ..., x_{k+1}\) to predict \(x_{k+2}\), and so on.

Sequence model

Fairly self-explanatory: a model that deals with sequential data, whether it is mapping sequences to scalars (e.g. language models), or mapping sequences to sequences (e.g. machine translation models).

Although sequence models are designed for sequential data (duh), there has been success at applying them to non-sequential data. For example, PixelCNN (discussed below) can generate entire images, even though images are not sequential in nature: the model generates a pixel at a time, in sequence!
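The generation procedure described earlier (start from a seed, predict one value, slide the window forward) can be sketched as a short loop. This is an illustration under stated assumptions, not PixelCNN itself: `toy_predict` is a hypothetical stand-in for a trained network, and `to_grid` assumes pixels are produced row by row, left to right, so that a flat generated sequence can be folded back into an image.

```python
def generate(predict, seed, n_steps):
    """Extend `seed` by n_steps values, feeding each prediction back in as input."""
    k = len(seed)
    sequence = list(seed)
    for _ in range(n_steps):
        window = sequence[-k:]            # the k most recent values
        sequence.append(predict(window))  # the new value becomes part of the next window
    return sequence


def toy_predict(window):
    # Hypothetical stand-in for a trained network: linear extrapolation from the last two values
    return 2 * window[-1] - window[-2]


def to_grid(flat, width):
    """Fold a flat sequence into rows of `width` values (row-by-row ordering assumed)."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]


generated = generate(toy_predict, seed=[1, 2], n_steps=4)
grid = to_grid(generated, width=3)
```

Unlike training, this loop is inherently sequential: each step needs the value produced by the step before it.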

Notice that an autoregressive model must be a sequence model, so it’s redundant to further describe these models as sequential (which makes this unnecessary word #2).

A good distinction is that “generative” and “sequential” describe what these models do, or what kind of data they deal with. “Autoregressive” describes how these models do what they do: i.e. it describes properties of the network or its architecture.

Some Architectures and Applications

Deep autoregressive models have seen a good degree of success: below is a list of some examples. Each architecture merits exposition and discussion, but unfortunately there isn’t enough space here to do any of them justice.

These models have also found applications: for example, Google DeepMind’s ByteNet can perform neural machine translation (in linear time!) and Google DeepMind’s Video Pixel Network can model video.

Some Thoughts and Observations