Object localization involves predicting the location of a moving object within a scene. Not surprisingly, researchers have tended to rely on visual data as input, which, together with some understanding of physics, will generally enable a machine to perform the task. This camera-based approach, however, can be compromised by low light, fog, occlusions and other adverse conditions.

In a bid to improve object localization in such less-than-ideal circumstances, an MIT and IBM research group has proposed a cross-modal auditory localization framework that can effectively locate objects using stereo sound.

Although vision is humans’ go-to sense for understanding environments, we instinctively draw on additional senses when vision is insufficient. Auditory cues can play a huge role, for example, in localizing an approaching ambulance on a busy street or a meowing cat in a dark room. Sound localization and cross-modal learning are research directions that aim to augment machines’ abilities in this regard.

Sound localization uses microphone arrays and beamforming to analyze the delays with which a sound reaches differently positioned microphones, and from those delays estimates the location of the object emitting the sound. Because audio-visual data contains a wealth of resources for knowledge transfer between different modalities, cross-modal learning is also a growing research area.
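The delay-based idea can be illustrated with a minimal two-microphone sketch. This is not the method from the MIT and IBM paper; it is a generic cross-correlation estimate of the time difference of arrival, followed by a far-field conversion to a bearing angle. The function names, the sign convention and the 343 m/s speed of sound in air are assumptions made for illustration.

```python
import numpy as np

def estimate_delay(left, right, sample_rate):
    """Estimate, in seconds, how much `right` lags `left`.

    The lag is taken at the peak of the full cross-correlation of the
    two equal-length channels. A positive result means the sound
    reached the left microphone first.
    """
    corr = np.correlate(right, left, mode="full")
    # Index 0 of the full correlation corresponds to lag -(len(left) - 1).
    lag_samples = np.argmax(corr) - (len(left) - 1)
    return lag_samples / sample_rate

def delay_to_angle(delay, mic_distance, speed_of_sound=343.0):
    """Convert a delay to a bearing in degrees (far-field assumption).

    0 degrees is straight ahead; positive angles point toward the
    left microphone under the sign convention of estimate_delay.
    """
    # Clip guards against ratios slightly outside [-1, 1] due to noise.
    ratio = np.clip(delay * speed_of_sound / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```

Given two equal-length channels sampled at 16 kHz from microphones 0.2 m apart, `delay_to_angle(estimate_delay(left, right, 16000), 0.2)` would return an approximate bearing. Real systems with larger arrays typically combine many such pairwise delays, which is where beamforming comes in.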

The MIT and IBM paper proposes a framework comprising a “teacher” vision network and a “student” stereo sound network. During training, object detection knowledge is transferred across modalities as the student network learns to mimic the teacher network’s outputs: the vision network detects an object in a video and marks it with a bounding box, and the stereo sound network learns to map the accompanying audio signals to the predicted bounding box coordinates. At inference time, the student network predicts an object’s location directly from sound, without any visual input.
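The teacher-student setup can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the paper’s architecture: it replaces the stereo sound network with a linear model trained by gradient descent, and treats the teacher’s bounding boxes as fixed regression targets. The names `train_student` and `predict_boxes`, the feature representation and the hyperparameters are all hypothetical.

```python
import numpy as np

def train_student(audio_features, teacher_boxes, lr=0.1, epochs=500):
    """Fit a linear student to the teacher's bounding boxes.

    audio_features: (n_samples, n_features) array derived from stereo audio.
    teacher_boxes:  (n_samples, 4) boxes predicted by the vision teacher.
    Minimizes the mean squared error between student and teacher outputs.
    """
    n_samples, n_features = audio_features.shape
    weights = np.zeros((n_features, teacher_boxes.shape[1]))
    bias = np.zeros(teacher_boxes.shape[1])
    for _ in range(epochs):
        # The student's current guess at the teacher's boxes.
        error = audio_features @ weights + bias - teacher_boxes
        # Gradient step on the squared error, averaged over samples.
        weights -= lr * (audio_features.T @ error) / n_samples
        bias -= lr * error.mean(axis=0)
    return weights, bias

def predict_boxes(audio_features, weights, bias):
    """Inference: boxes from audio alone; no visual input is needed."""
    return audio_features @ weights + bias
```

The point the sketch preserves is the division of labor: the vision teacher supplies the targets only during training, and `predict_boxes` takes audio features as its sole input.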