The “cocktail party effect” describes the human ability to follow a conversation in a noisy environment by focusing on one speaker while filtering out other chatter, music and ambient noise. People do this naturally, but it remains a widely studied problem in machine learning, where researchers are developing environmental sound recognition and source separation techniques that can tune into a single sound and filter out all others.

MIT CSAIL researchers recently introduced their PixelPlayer system, which has learned to identify objects that produce sound in videos. The system uses deep learning and was trained by binge-watching 60 hours of musical performances to identify the natural synchronization of visual and audio information.

The team trained deep neural networks to analyze images and audio jointly and to identify, at the pixel level, the image locations of the sound sources in the videos.

The PixelPlayer architecture includes a video analysis network responsible for extracting visual features from video frames, an audio analysis network that encodes the audio input, and an audio synthesizer network that predicts sounds by combining pixel-level visual features with audio features.
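The three-part design described above can be illustrated with a toy sketch of the final synthesizer step. This is not the team's code. It assumes the video network outputs a K-channel feature vector for each pixel, the audio network outputs K spectrogram-shaped component maps, and the synthesizer scores each time-frequency bin with a weighted sum passed through a sigmoid to form a mask over the mixture. The names `synthesize_mask` and `apply_mask` and all numbers are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def synthesize_mask(pixel_feature, audio_components):
    """Combine one pixel's K visual features with K audio component maps.

    pixel_feature:    list of K floats (assumed video-network output for a pixel)
    audio_components: K maps, each a [freq][time] grid (assumed audio-network output)
    Returns a [freq][time] mask with values in (0, 1).
    """
    if len(pixel_feature) != len(audio_components):
        raise ValueError("need one audio component per visual feature channel")
    n_freq = len(audio_components[0])
    n_time = len(audio_components[0][0])
    mask = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            # Weighted sum over the K channels for this time-frequency bin.
            score = sum(v * comp[f][t]
                        for v, comp in zip(pixel_feature, audio_components))
            row.append(sigmoid(score))
        mask.append(row)
    return mask


def apply_mask(mask, mixture):
    """Scale each bin of the mixture spectrogram by the predicted mask."""
    return [[m * s for m, s in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)]


# Toy input: K = 2 channels, 1 frequency bin, 2 time frames.
pixel_feature = [1.0, -1.0]
audio_components = [
    [[3.0, 0.0]],  # component 0 is active in frame 0
    [[0.0, 3.0]],  # component 1 is active in frame 1
]
mixture = [[2.0, 2.0]]

mask = synthesize_mask(pixel_feature, audio_components)
pixel_sound = apply_mask(mask, mixture)
```

In this toy case the pixel's features respond positively to component 0 and negatively to component 1, so the mask keeps most of frame 0 and suppresses most of frame 1. A pixel whose features are all zero would get a neutral mask of 0.5 everywhere.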

PixelPlayer’s self-supervised mix-and-separate training also enables it to annotate instrument characteristics without manual intervention. Team member Hang Zhao, a former NVIDIA Research intern, says the deep learning system “gets to know which objects make what kinds of sounds.”
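The idea behind mix-and-separate training can be shown with a small sketch: two solo recordings are mixed artificially, and because the original solos are known, they supply the training targets without any human labels. The code below is only an illustration under simplifying assumptions. It treats magnitude spectrograms as additive (a real pipeline would typically mix waveforms), uses ratio masks as targets, and scores a guess with an L1 loss chosen for illustration. All function names and values are hypothetical.

```python
def mix(spec_a, spec_b):
    """Create a synthetic mixture by adding two spectrograms bin by bin."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(spec_a, spec_b)]


def ratio_mask(source, mixture, eps=1e-8):
    """Target mask: the fraction of each mixture bin that belongs to `source`."""
    return [[s / (m + eps) for s, m in zip(row_s, row_m)]
            for row_s, row_m in zip(source, mixture)]


def l1_loss(predicted, target):
    """Mean absolute difference between a predicted and a target mask."""
    diffs = [abs(p - t)
             for row_p, row_t in zip(predicted, target)
             for p, t in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)


# Two toy "solo" spectrograms (2 frequency bins x 2 time frames).
solo_a = [[4.0, 0.0], [1.0, 1.0]]
solo_b = [[0.0, 2.0], [1.0, 3.0]]

# Step 1: mix the solos. The model would only ever see this mixture.
mixture = mix(solo_a, solo_b)

# Step 2: the known solos give the supervision signal for free.
target_a = ratio_mask(solo_a, mixture)
target_b = ratio_mask(solo_b, mixture)

# Step 3: score a guess. An untrained model might output 0.5 everywhere.
untrained_guess = [[0.5, 0.5], [0.5, 0.5]]
loss = l1_loss(untrained_guess, target_a)
```

Training would adjust the networks so that the predicted masks move toward `target_a` and `target_b`, driving the loss down. No one ever has to annotate which instrument is playing.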

Researchers trained the model on MUSIC (Multimodal Sources of Instrument Combinations), a dataset built from YouTube videos. MUSIC contains 714 videos of musical solos and duets, none of them post-processed, across 11 instrument categories. The processing power of the NVIDIA Titan V GPU allowed the CNN to analyze the videos at very high speed. “It learned in about a day,” says Zhao. PixelPlayer can now identify more than 20 instruments.

PixelPlayer can extract the soundtracks of individual instruments, enabling audio engineers, for example, to isolate each instrument and adjust its level. Zhao adds that “the system could also be used by robots to understand environmental sounds.”
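Once each instrument has its own track, adjusting levels amounts to a weighted remix. The sketch below is a hypothetical illustration, not part of PixelPlayer. It assumes the separated tracks are available as equal-length lists of audio samples keyed by instrument name, and the `remix` function and sample values are invented for the example.

```python
def remix(tracks, gains):
    """Sum separated tracks after scaling each by its gain.

    tracks: dict mapping instrument name -> list of samples (equal lengths)
    gains:  dict mapping instrument name -> multiplier; missing names use 1.0
    """
    length = len(next(iter(tracks.values())))
    output = [0.0] * length
    for name, samples in tracks.items():
        gain = gains.get(name, 1.0)
        for i, sample in enumerate(samples):
            output[i] += gain * sample
    return output


# Hypothetical separated tracks, three samples each.
tracks = {
    "violin": [0.2, 0.4, -0.2],
    "guitar": [0.1, -0.1, 0.3],
}

# Double the violin and halve the guitar.
louder_violin = remix(tracks, {"violin": 2.0, "guitar": 0.5})

# Mute the violin to isolate the guitar.
guitar_only = remix(tracks, {"violin": 0.0})
```

Setting a gain to zero isolates the remaining instruments, while values above or below 1.0 raise or lower one instrument relative to the rest.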

Other research teams are tackling the cocktail party problem using a variety of approaches, including developing deep learning techniques for hearing aids.

The MIT CSAIL paper “The Sound of Pixels” is available on arXiv, and the team will present their work at the European Conference on Computer Vision (ECCV) in September. Further demonstrations can be found at http://sound-of-pixels.csail.mit.edu/.