Transfer learning

From the perspective of general intelligence, the most interesting thing about AlexNet’s vocabulary is that it can be reused, or transferred, to visual tasks other than the one it was trained on, such as recognising whole scenes rather than individual objects. Transfer is essential in an ever-changing world, and humans excel at it: we are able to rapidly adapt the skills and understanding we’ve gleaned from our experiences (our ‘world model’) to whatever situation is at hand. For example, a classically-trained pianist can pick up jazz piano with relative ease. Artificial agents that form the right internal representations of the world, the reasoning goes, should be able to do similarly.

Nonetheless, the representations learned by classifiers such as AlexNet have limitations. In particular, as the network was only trained to label images with a single class (cat, dog, car, volcano), any information not required to infer the label—no matter how useful it might be for other tasks—is liable to be ignored. For example, the representations may fail to capture the background of the image if the label always refers to the foreground. A possible solution is to provide more comprehensive training signals, like detailed captions describing the images: not just “dog,” but “A Corgi catching a frisbee in a sunny park.” However, such targets are laborious to provide, especially at scale, and still may be insufficient to capture all the information needed to complete a task. The basic premise of unsupervised learning is that the best way to learn rich, broadly transferable representations is to attempt to learn everything that can be learned about the data.

If the notion of transfer through representation learning seems too abstract, consider a child who has learned to draw people as stick figures. She has discovered a representation of the human form that is both highly compact and rapidly adaptable. By augmenting each stick figure with specifics, she can create portraits of all her classmates: glasses for her best friend, her deskmate in his favorite red tee-shirt. And she has developed this skill not in order to complete a specific task or receive a reward, but rather in response to her basic urge to reflect the world around her.

Learning by creating: generative models

Perhaps the simplest objective for unsupervised learning is to train an algorithm to generate its own instances of data. So-called generative models should not simply reproduce the data they are trained on (an uninteresting act of memorisation), but rather build a model of the underlying class from which that data was drawn: not a particular photograph of a horse or a rainbow, but the set of all photographs of horses and rainbows; not a specific utterance from a specific speaker, but the general distribution of spoken utterances. The guiding principle of generative models is that being able to construct a convincing example of the data is the strongest evidence of having understood it: as Richard Feynman put it, "what I cannot create, I do not understand."

For images, the most successful generative model so far has been the Generative Adversarial Network (GAN for short), in which two networks—a generator and a discriminator—engage in a contest of discernment akin to that of an artistic forger and a detective. The generator produces images with the goal of tricking the discriminator into believing they are real; the discriminator, meanwhile, is rewarded for spotting the fakes. The generated images, first messy and random, are refined over many iterations, and the ongoing dynamic between the networks leads to ever-more realistic images that are in many cases indistinguishable from real photographs. Generative adversarial networks can also dream details of landscapes defined by the rough sketches of users.

A glance at the images below is enough to convince us that the network has learned to represent many of the key features of the photographs they were trained on, such as the structure of animal’s bodies, the texture of grass, and detailed effects of light and shade (even when refracted through a soap bubble). Close inspection reveals slight anomalies, such as the white dog’s apparent extra leg and the oddly right-angled flow of one of the jets in the fountain. While the creators of generative models strive to avoid such imperfections, their visibility highlights one of the benefits of recreating familiar data such as images: by inspecting the samples, researchers can infer what the model has and hasn’t learned.