Visual and audio events tend to occur together: a musician plucking guitar strings and the resulting melody; a wine glass shattering and the accompanying crash; the roar of a motorcycle as it accelerates. These visual and audio stimuli are concurrent because they share a common cause. Understanding the relationship between visual events and their associated sounds is a fundamental way that we make sense of the world around us.

In “Look, Listen, and Learn” and “Objects that Sound” (to appear at ECCV 2018), we explore this observation by asking: what can be learnt by looking at and listening to a large number of unlabelled videos? By constructing an audio-visual correspondence learning task that enables visual and audio networks to be jointly trained from scratch, we demonstrate that:

the networks are able to learn useful semantic concepts;

the two modalities can be used to search one another (e.g. to answer the question, “Which sound fits well with this image?”); and

the object making the sound can be localised.

Limitations of previous cross-modal learning approaches

Learning from multiple modalities is not new; historically, researchers have largely focused on image-text or audio-vision pairings. A common approach has been to train a “student” network in one modality using the automatic supervision provided by a “teacher” network in the other modality (“teacher-student supervision”), where the “teacher” has been trained using a large number of human annotations.

For instance, a vision network trained on ImageNet can be used to annotate frames of a YouTube video as “acoustic guitar”, which provides training data to the “student” audio network for learning what an “acoustic guitar” sounds like. In contrast, we train both visual and audio networks from scratch, where the concept of the “acoustic guitar” naturally emerges in both modalities. Somewhat surprisingly, this approach achieves superior audio classification compared to teacher-student supervision. As described below, this also equips us to localise the object making the sound, which was not possible with previous approaches.

Learning from cross-modal self-supervision

Our core idea is to use a valuable source of information contained in the video itself: the correspondence between the visual and audio streams, which is available simply because they appear together at the same time in the same video. By seeing and hearing many examples of a person playing a violin and of a dog barking, and rarely or never seeing a violin being played while hearing a dog bark (or vice versa), it should be possible to infer what a violin and a dog look and sound like. This approach is, in part, motivated by the way an infant might learn about the world as their visual and audio capabilities develop.

We apply learning by audio-visual correspondence (AVC), a simple binary classification task: given an example video frame and a short audio clip, decide whether they correspond to each other or not.
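
To make the task concrete, here is a minimal sketch of how training pairs for such a binary classifier might be drawn from unlabelled videos. It is an illustration under stated assumptions rather than our actual pipeline: the function name sample_avc_pair, the representation of a video as a list of time-aligned (frame, audio) tuples, the 50/50 balance between classes, and the choice to take mismatched audio from a different video are all assumptions made for the example.

```python
import random
from typing import List, Tuple

# Hypothetical stand-in for real data: each video is a list of
# (frame, audio_clip) tuples that are aligned in time.
Video = List[Tuple[str, str]]


def sample_avc_pair(videos: List[Video], rng: random.Random) -> Tuple[str, str, int]:
    """Draw one (frame, audio, label) example for the correspondence task.

    label is 1 if the frame and audio come from the same moment of the same
    video, and 0 if the audio was taken from a different video.
    """
    if len(videos) < 2:
        raise ValueError("need at least two videos to build mismatched pairs")

    # Pick a random moment in a random video.
    vid_idx = rng.randrange(len(videos))
    frame, audio = rng.choice(videos[vid_idx])

    # Assumption: positives and negatives are sampled with equal probability.
    if rng.random() < 0.5:
        return frame, audio, 1

    # Negative pair: keep the frame, but take audio from some other video.
    other_idx = rng.choice([i for i in range(len(videos)) if i != vid_idx])
    _, other_audio = rng.choice(videos[other_idx])
    return frame, other_audio, 0
```

No human labels are needed at any point: the label comes for free from whether the two streams were sampled together. A visual network and an audio network would then be trained jointly to predict this label from the pair.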