In 2017 the “Godfather of Deep Learning” Geoffrey Hinton and his students Sara Sabour and Nicholas Frosst proposed the discriminatively trained, multi-layer capsule system “CapsNet” in their paper Dynamic Routing Between Capsules. CapsNet outperformed convolutional networks at recognizing overlapping digits on the MNIST dataset and continues to garner attention as a promising research direction in computer vision, deep learning and beyond.

Hinton and Sabour have now co-developed an unsupervised version of the capsule network in a joint research effort with the Oxford Robotics Institute. In the paper Stacked Capsule Autoencoders, they show the new approach achieves state-of-the-art results for unsupervised classification on SVHN (the Street View House Numbers dataset, which comprises over 600k real-world images of house numbers from Google Street View imagery), and near state-of-the-art performance on the MNIST handwritten digit dataset.

Akin to modules in human brains, capsules are remarkably good at understanding and encoding nuances such as pose (position, size, orientation), deformation, velocity, albedo, hue and texture. A capsule system understands an object by geometrically interpreting the organized set of its interrelated parts. Because these geometric relationships between an object and its parts remain intact under changes in viewpoint, a system can rely on them to identify objects from new vantage points, i.e. viewpoint invariance.
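The viewpoint-invariance argument above can be made concrete with a little linear algebra. In the sketch below (a toy illustration, not the paper's implementation; the `pose` helper and the specific transforms are assumptions), an object and one of its parts are each represented by a 2-D pose matrix, and the object-to-part transform that relates them is unchanged when a viewpoint transformation is applied to the whole scene:

```python
import numpy as np

def pose(theta, tx, ty, s=1.0):
    """A 2-D similarity transform (rotation theta, translation tx/ty,
    scale s) as a 3x3 homogeneous matrix."""
    c, si = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -si, tx],
                     [si,  c,  ty],
                     [0.,  0.,  1.]])

# Hypothetical object with one part: obj_to_part encodes the fixed
# geometric relationship between the object and that part.
obj = pose(0.3, 2.0, 1.0)
obj_to_part = pose(0.5, 0.4, -0.2)
part = obj @ obj_to_part           # part's pose in the image frame

# Apply a viewpoint change to the whole scene.
view = pose(-0.7, -1.0, 3.0)
obj2, part2 = view @ obj, view @ part

# The object-to-part relationship survives the viewpoint change.
recovered = np.linalg.inv(obj2) @ part2
assert np.allclose(recovered, obj_to_part)
```

This is exactly the property a capsule system exploits: the image-frame poses of object and part both change with viewpoint, but their relative pose does not.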

The researchers used an unsupervised version of a capsule network, where a neural encoder trained through backpropagation looks at all image parts to infer the presence and poses of object capsules.

The researchers designed their Stacked Capsule Autoencoders (SCAE) in three stages:

A Constellation Autoencoder (CCAE) arranges two-dimensional points into constellations; the model is trained without supervision by maximizing the likelihood of part capsules subject to sparsity constraints.

A Part Capsule Autoencoder (PCAE) segments an image into parts and infers their poses.

An Object Capsule Autoencoder (OCAE) organizes the discovered parts and their poses into a smaller set of objects.

Finally, the researchers stacked the OCAE on top of the PCAE to form the SCAE, which identifies objects and captures the spatial relationships between whole objects and their parts.
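The stacked data flow described above can be sketched schematically. The snippet below is a toy stand-in, not the paper's architecture: the real PCAE and OCAE are learned neural encoders with set-transformer components, while here fixed random projections merely illustrate how an image becomes per-part poses and features, which in turn become per-object presence probabilities (all shapes and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def pcae(image, n_parts=4):
    """Toy PCAE stand-in: infer a 6-D affine pose and a feature vector
    for each part capsule. A real PCAE uses a trained CNN encoder."""
    flat = image.ravel()
    W_pose = rng.standard_normal((n_parts, 6, flat.size)) * 0.01
    W_feat = rng.standard_normal((n_parts, 16, flat.size)) * 0.01
    return W_pose @ flat, W_feat @ flat   # poses: (n_parts, 6), feats: (n_parts, 16)

def ocae(poses, feats, n_objects=2):
    """Toy OCAE stand-in: summarize the set of part capsules into a
    smaller set of object capsules, each with a presence probability."""
    x = np.concatenate([poses, feats], axis=1).ravel()
    W = rng.standard_normal((n_objects, x.size)) * 0.01
    logits = W @ x
    return 1.0 / (1.0 + np.exp(-logits))  # per-object presence in [0, 1]

image = rng.random((28, 28))
poses, feats = pcae(image)       # stage 1: image -> part capsules
presence = ocae(poses, feats)    # stage 2: part capsules -> object capsules
print(presence.shape)            # (2,) — one presence probability per object
```

The point of the sketch is the interface between the stages: the OCAE never sees pixels, only the set of part poses and features produced by the PCAE.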

The researchers observed that the presence-probability vectors of the object capsules tend to form tight clusters, and that assigning a class to each cluster yields state-of-the-art and near state-of-the-art unsupervised classification results on SVHN and MNIST respectively. They further improved the results on these two datasets (from 55% to 67% and from 98.5% to 99%) by learning fewer than 300 parameters.
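The cluster-and-assign evaluation protocol described above is easy to illustrate on synthetic data. In this hedged sketch (the data, the tiny k-means loop, and the majority-vote scoring function are all assumptions, standing in for the paper's actual presence vectors and clustering), each cluster is assigned its majority ground-truth class and accuracy is scored against that assignment:

```python
import numpy as np

def unsupervised_accuracy(cluster_ids, labels, n_clusters):
    """Assign each cluster its majority class, then score accuracy."""
    correct = 0
    for k in range(n_clusters):
        members = labels[cluster_ids == k]
        if members.size:
            correct += np.bincount(members).max()
    return correct / labels.size

# Synthetic stand-in for object-capsule presence vectors: two tight,
# well-separated clusters.
rng = np.random.default_rng(1)
centers = np.array([[0.9, 0.1], [0.1, 0.9]])
labels = rng.integers(0, 2, 200)
vecs = centers[labels] + 0.05 * rng.standard_normal((200, 2))

# Minimal k-means on the presence vectors.
cent = np.array([[0.8, 0.2], [0.2, 0.8]])
for _ in range(10):
    d = ((vecs[:, None, :] - cent[None]) ** 2).sum(-1)
    assign = d.argmin(1)
    cent = np.array([vecs[assign == k].mean(0) for k in range(2)])

print(unsupervised_accuracy(assign, labels, 2))  # close to 1.0 on this toy data
```

Because the clusters are tight, the majority-vote assignment recovers the true classes almost perfectly, which is the intuition behind reading class labels off tight presence-probability clusters.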

The research demonstrates SCAE as a novel method for representation learning in which highly structured decoder networks train one encoder network that can segment images into parts and their poses, and another encoder network that can compose the parts into coherent wholes.

The paper Stacked Capsule Autoencoders is on arXiv.