It seems that quite a few people interested in deep learning think of it in terms of unsupervised pre-training, autoencoders, stacked RBMs and deep belief networks. It’s easy to get into this groove by watching one of Geoff Hinton’s videos from a few years ago, where he bashes backpropagation in favour of unsupervised methods that can discover the structure in data by themselves, much as the human brain does. Those videos, papers and tutorials linger. They were state of the art once, but things have changed since then.

These days supervised learning is king again. This has to do with the fact that you can look at data from many different angles, and usually you’d prefer a representation that is useful for the discriminative task at hand. Unsupervised learning will find some angle, but will it be the one you want? In the case of MNIST digits, sure. Otherwise, probably not. Or maybe it will find a lot of angles when you only need one.

Ladies and gentlemen, please welcome Sander Dieleman. He won the Galaxy Zoo challenge on Kaggle, so he might know a thing or two about deep learning. Here’s a Reddit comment Sander recently made about deep learning, reprinted with the author’s permission:

It’s true that unsupervised pre-training was initially what made it possible to train deeper networks, but over the last few years the pre-training approach has largely become obsolete.

Nowadays, deep neural networks are a lot more similar to their ’80s cousins. Instead of pre-training, the difference is now in the activation functions and regularisation methods used (and sometimes in the optimisation algorithm, although much more rarely).

I would say that the “pre-training era”, which started around 2006, ended in the early ’10s when people started using rectified linear units (ReLUs), and later dropout, and discovered that pre-training was no longer beneficial for this type of network.

ReLUs (and modern variants such as maxout) suffer significantly less from the vanishing gradient problem. Dropout is a strong regulariser that helps ensure the solution we get generalises well. These are precisely the two issues pre-training sought to solve, but they are now solved in different ways.
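To see why ReLUs help with vanishing gradients, consider how the chain rule scales a gradient as it passes back through each layer: every sigmoid layer multiplies it by sigmoid'(x) ≤ 0.25, while an active ReLU unit multiplies it by exactly 1. A toy sketch (the depth and input value are illustrative, not from the comment):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # at most 0.25, tiny for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # exactly 1 for any active unit

# Backprop through 10 layers at pre-activation x = 2.0: each layer
# multiplies the gradient by the activation's local derivative.
x = 2.0
print(sigmoid_grad(x) ** 10)  # tiny: the gradient has all but vanished
print(relu_grad(x) ** 10)     # 1.0: the gradient passes through intact
```

This is of course a caricature (real gradients also pass through weight matrices), but it shows the multiplicative shrinkage that pre-training was originally working around.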

Using ReLUs also tends to result in faster training, because they’re faster to compute, and because the optimisation is easier.

Commercial applications of deep learning usually consist of very large, feed-forward neural nets with ReLUs and dropout, and trained simply with backprop. Unsupervised pre-training still has its applications when labeled data is very scarce and unlabeled data is abundant, but more often than not, it’s no longer needed.
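The kind of network Sander describes, feed-forward with ReLUs and dropout, trained with plain backprop, fits in a few lines of NumPy. The layer sizes, learning rate, dropout rate and toy regression task below are illustrative choices, not anything from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Tiny two-layer net: input -> ReLU hidden layer with dropout -> linear output.
# All hyperparameters here are arbitrary illustrative values.
n_in, n_hid, n_out = 4, 32, 1
W1 = rng.normal(0, 0.5, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.5, (n_hid, n_out)); b2 = np.zeros(n_out)

# Toy regression data: a simple nonlinear target.
X = rng.normal(size=(256, n_in))
y = X.sum(axis=1, keepdims=True) ** 2 / 4.0

lr, p_drop = 0.01, 0.5
init_mse = np.mean(((relu(X @ W1 + b1) @ W2 + b2) - y) ** 2)

for step in range(1000):
    # Forward pass with inverted dropout on the hidden layer.
    h = relu(X @ W1 + b1)
    mask = (rng.random(h.shape) > p_drop) / (1.0 - p_drop)
    h_drop = h * mask
    pred = h_drop @ W2 + b2
    # Backward pass: plain backprop on mean squared error.
    d_pred = 2 * (pred - y) / len(X)
    dW2 = h_drop.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * mask * (h > 0)   # ReLU gradient: 1 where active
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad   # vanilla SGD update, in place

# At test time dropout is switched off; inverted dropout keeps the scale right.
test_pred = relu(X @ W1 + b1) @ W2 + b2
print(np.mean((test_pred - y) ** 2))
```

Inverted dropout (dividing the mask by `1 - p_drop` during training) means no rescaling is needed at test time, which is why the test-time forward pass is just the plain network.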

If you want more, Sander pointed us to a similar comment by our favourite Dave W-F from Toronto.