It has been interesting to watch deep learning evolve over the past four years. Deep learning has made some significant advances, but it is the progress in unsupervised learning that has caught my eye recently. I was academically birthed from the womb of a Frequentist, but the impact of deep Bayesian models cannot be ignored. At the Lab we recently completed the D*Script challenge on handwriting author identification, and I found myself delving into deep generative models and actually enjoying it. However, I got hung up on some of the key components that differentiate generative architectures from discriminative architectures. The purpose of this post is to highlight some of the differences between generative and discriminative auto-encoders, and to show how one could use generative models (specifically DRAW) to solve the problem of writer identification in handwritten documents.

The main method for deep unsupervised learning is the auto-encoder. The traditional auto-encoder (AE) is typically made up of the dense, fully connected layers of a feed-forward neural network. AEs are unsupervised in that their target output is the same as their input. The goal of an AE is to compress the input data and then attempt to recreate it (see figure below), much like crushing a soda can and then trying to bend it back into its original form.

The first half of the AE is called the encoder, because it compresses the original input and encodes it into a latent space. Often this latent space has a lower dimensionality than the input. The second half is called the decoder, and its job is to undo the compression and recreate what the encoder was given. The loss function simply measures how much information was lost by comparing the original input to the reconstruction; the better the reconstruction, the lower the loss. The encoder and decoder can be arbitrary functions, and in recent years other types of auto-encoders have been developed.
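
To make the encoder, decoder, and loss concrete, here is a minimal sketch in plain Python. The `encode` and `decode` functions are hand-written stand-ins that I made up for illustration (a real AE would learn them as neural network layers), and the mean squared error is just one common choice of reconstruction loss. The input length is assumed to be even.

```python
# Toy auto-encoder: hand-written (not learned) encoder and decoder.

def encode(x):
    """Compress a vector by averaging each adjacent pair of values."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]

def decode(z):
    """Expand a code back to input size by repeating each value twice."""
    return [value for value in z for _ in range(2)]

def reconstruction_loss(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

x = [1.0, 1.0, 4.0, 2.0]
z = encode(x)                          # latent code: 2 values instead of 4
x_hat = decode(z)                      # reconstruction, same size as x
loss = reconstruction_loss(x, x_hat)   # zero only if nothing was lost
```

The first pair of values survives the round trip untouched, while the second pair gets dented like the soda can, and the loss picks up exactly that damage.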

The variational auto-encoder (VAE) was introduced with beautiful Bayesian theory to support its existence. VAEs differ from traditional AEs in that they have a generative component and want to say something about the distribution of the latent space z. After data are compressed by the encoder, the encoding is used to define the parameters of a latent posterior distribution. In other words, we travel from a prior to a posterior in light of the data. From this distribution we can randomly draw a sample z, conditioned on input x, for the decoder to reconstruct back into the original input. The decoder is the generative portion of the VAE that makes sense of the latent sample. If we return to the crushed can analogy, this is like crushing a soda can, having it engulfed by the “Nothing” from The NeverEnding Story, having the Nothing manifest itself on Earth in one of its many forms of human sorrow and suffering, and then reconstructing that back into a soda can (see figure below).

Two significant benefits emerge from this:

1) The input no longer has a direct route from beginning to end as with the AE, thus the decoder must learn how to disentangle the randomly drawn latent sample. This allows the decoder to model complex and even multi-modal distributions.

2) Given a randomly drawn sample z, not conditioned on input x, the decoder can “hallucinate” a reconstruction from the randomness. This allows it to generate samples that look like real world data.

However, there are a few obstacles to overcome with the latent posterior from which z is drawn. First, you won't be able to find it in the forest, because it is untrackable. Second, in practice, exact inference of the posterior is usually intractable. The latter means we cannot compute or differentiate what we need to and must use approximate inference. One method is to approximate the true posterior p(z|x) with a distribution we know, call it q(z|x). How close we come using q(z|x) is evaluated using the Kullback-Leibler divergence (in the VAE objective this term ends up as the divergence between q(z|x) and the prior over z). This is the first part of the final loss function: the latent loss. But we still can't back-propagate through this architecture, because z is a random draw. Enter the Diederik (Kingma). In the enlightening paper Auto-Encoding Variational Bayes, Kingma and Welling developed the “reparameterization trick” that transforms z from being defined probabilistically as z ~ N(μ, σ²) to being defined deterministically as z = μ + σε, where ε is random noise from a standard normal distribution (Note: This specific reparameterization is for the normal distribution. Others exist for other families of distributions.). μ and σ are parameters learned through back-propagation. The encoder helps define the parameters of q(z|x), while the decoder attempts to recreate the input from the random sample z. How well we reconstruct the input from the latent sample provides the second part of the VAE's final loss function: the reconstruction loss.
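
The reparameterization trick and the latent loss are easier to see in a few lines of code. The sketch below is my own illustration, not code from the paper: it assumes a diagonal Gaussian for the approximate posterior and a standard normal prior, for which the KL divergence has a well-known closed form. The function names `sample_latent` and `latent_loss` are made up for this post.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is a deterministic
    function of mu and sigma that gradients can flow through.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def latent_loss(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

rng = random.Random(0)                  # seeded only for repeatability
mu, sigma = [0.5, -1.0], [1.0, 0.5]     # stand-ins for encoder outputs
z = sample_latent(mu, sigma, rng)       # one latent sample for the decoder
kl = latent_loss(mu, sigma)             # grows as q drifts away from N(0, 1)
```

The latent loss is zero only when the encoder outputs a standard normal (μ = 0, σ = 1) and grows as the approximate posterior moves away from it. In a real VAE, μ and σ would come out of the encoder network and an autodiff framework would handle the gradients.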

So far we have talked about auto-encoders in an unsupervised setting. How do we make the leap to supervised or semi-supervised learning using the auto-encoder architecture? The traditional way to use an AE in the supervised setting is to encode the data and use the latent representation, call it z, as a feature-rich replacement for the input. Then all of the z's are used as a dataset to train a separate classifier. This approach uses two separate architectures (AE and classifier) and hence two separate loss functions. The VAE, on the other hand, already has two parts to its loss function in its single architecture. In another Kingma et al. paper, titled Semi-Supervised Learning with Deep Generative Models, a third loss is added to the final loss function: the classification loss (a presentation by Kingma and Welling adds more clarity and visuals to the paper). This three-loss architecture is well poised to tackle handwriting challenges. But before we jump into the possible VAE end-to-end solution, let's briefly explore the handwriting authorship challenge.
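
As a rough sketch of how three losses can live in one objective, the snippet below adds a cross-entropy classification term to the reconstruction and latent terms. This is a deliberate simplification and not the exact objective from the paper (which treats unlabeled examples more carefully); the function names and the weight `alpha` are assumptions of mine for illustration.

```python
import math

def classification_loss(class_probs, true_label):
    """Cross-entropy for one example: negative log-probability of the true class."""
    return -math.log(class_probs[true_label])

def total_loss(reconstruction, latent, classification=None, alpha=1.0):
    """Sum the loss terms; the classification term is skipped for unlabeled data."""
    loss = reconstruction + latent
    if classification is not None:
        loss += alpha * classification   # alpha weights the supervised signal
    return loss

# A labeled example: the classifier gives the true author (index 0) probability 0.5.
labeled = total_loss(0.5, 0.25, classification_loss([0.5, 0.25, 0.25], 0), alpha=2.0)

# An unlabeled example contributes only reconstruction and latent losses.
unlabeled = total_loss(0.5, 0.25)
```

Because every term feeds a single number, one back-propagation pass updates the encoder, decoder, and classifier together, instead of training an AE and a classifier separately.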

Author identification is a very difficult problem in handwriting which often gets confused with handwriting optical character recognition (OCR). OCR is concerned with which letter is written, whereas author identification is concerned with how a letter is written. There are a few obstacles that stand in the way of analyzing writing styles. First, unique features need to be captured about each author. Without an automated process, this requires handwriting experts and can be very time-intensive. Second, any noise (lines, watermarks, stains, etc.) can throw off even the most sophisticated algorithm. Data for many of the handwriting competitions consist of blank white paper with handwriting in black ballpoint pen, which is far cleaner than what you find in practice. See the ICDAR competition for an example. Third, in practice you don’t always know who wrote what. Many handwritten documents in the wild don’t have a signature, and even if they do there is no guarantee the document was written by the person whose name is signed. Thus, there is a classification part and a data retrieval part to this problem. So, what do VAEs have to do with all of this? I believe generative models have the end-to-end answer.

There are two VAE papers that could Captain Planet their powers together for an end-to-end solution. The first is DRAW, and the second is the Semi-Supervised VAE already mentioned. DRAW is a VAE with an attention mechanism, and a very clever one. It attends to the most important part of the image while iteratively reading the data and writing it onto a “canvas.” The video of DRAW in action is well worth watching if you haven’t seen it. Attention in a VAE could be valuable, because it could be taught to attend to the features that separate one writer from another. The DRAW paper has an example of classification, but it only uses the encoder with attention. It achieves excellent results without the decoder or reconstruction, but the reconstruction loss can add a lot of value. One value-add is noise removal. Auto-encoders have long been used to remove imperfections from images. One could take pristine competition data, add noise (lines, watermarks, stains, etc.) to the inputs, and score the reconstruction against the clean target image, so that the model learns what to ignore. To keep the reconstruction with attention in the DRAW network and still classify, one could add the semi-supervised (or supervised, if that’s what your heart and data desire) classification loss of Kingma et al. to the VAE in a way that it contributes to the whole system. It allows for writing styles to be incorporated into the model, which is something of tremendous worth to the problem at hand. Adding in the latent loss, this three-loss DRAW architecture could be well poised to address the handwriting writer identification problem.
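
The noise-removal idea boils down to which image the reconstruction is scored against. Here is a small sketch, assuming grayscale images stored as lists of rows with 1.0 meaning ink; the helper names `add_ruled_line` and `denoising_loss` are hypothetical and only meant to illustrate the training setup, not DRAW itself.

```python
def add_ruled_line(image, row, ink=1.0):
    """Return a copy of the image with one full row overwritten by ink."""
    noisy = [list(r) for r in image]
    noisy[row] = [ink] * len(noisy[row])
    return noisy

def denoising_loss(reconstruction, clean):
    """Mean squared error against the clean target, not the noisy input."""
    errors = [(a - b) ** 2
              for rec_row, clean_row in zip(reconstruction, clean)
              for a, b in zip(rec_row, clean_row)]
    return sum(errors) / len(errors)

# A pristine 3x3 "document": a single vertical pen stroke.
clean = [[0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0]]
noisy = add_ruled_line(clean, row=1)    # what the model is shown

# A model that copies its input is penalized; one that ignores the line is not.
loss_if_copying = denoising_loss(noisy, clean)
loss_if_cleaning = denoising_loss(clean, clean)
```

Since the ruled line only ever appears in the input and never in the target, the cheapest way for the model to lower its loss is to learn that lines are not part of the handwriting.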