Image: Computer Science and Artificial Intelligence Laboratory

Using machine learning, researchers from MIT have developed a system that produces sound effects that are so realistic they even fool human listeners.


The new algorithm, developed by researchers from MIT’s Computer Science and Artificial Intelligence Laboratory, can predict the precise acoustical qualities of a sound, and then simulate it in an extremely realistic way. Given a silent video clip, such as one showing an object being hit by a drumstick, the system can produce a sound for the hit that’s realistic enough to fool human listeners.

To make it work, PhD student Andrew Owens and his team applied a technique known as “deep learning” that enables computers to autonomously pick out important patterns buried in massive amounts of raw data. Over the course of several months, the researchers recorded about 1,000 videos containing an estimated 46,000 sounds of various objects being hit, scraped, and prodded by a drumstick. (The drumstick was chosen because of its ability to produce consistent sounds.) A deep-learning algorithm then analyzed the videos, deconstructing the sounds according to pitch, loudness, and other acoustical qualities.
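To give a feel for what describing a sound by "pitch" and "loudness" can mean, here is a minimal Python sketch. It is not the researchers' method: their system learns its representation with deep learning, whereas this toy example uses two hand-written measures (root-mean-square amplitude for loudness, zero-crossing counting for pitch). The function names and the synthetic test tone are assumptions made purely for illustration.

```python
import math


def rms_loudness(samples):
    """Root-mean-square amplitude, used here as a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples, sample_rate):
    """Rough pitch estimate in Hz: a pure tone changes sign twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = (len(samples) - 1) / sample_rate
    return crossings / (2 * duration)


def make_tone(freq_hz, amplitude, sample_rate=8000, seconds=0.5):
    """Synthetic sine wave standing in for a recorded sound (illustration only)."""
    n = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


if __name__ == "__main__":
    tone = make_tone(freq_hz=440, amplitude=0.5)
    print("loudness (RMS):", round(rms_loudness(tone), 3))
    print("pitch estimate (Hz):", round(zero_crossing_pitch(tone, 8000), 1))
```

Zero-crossing counting only works for clean, single-pitch tones. Real impact sounds such as a drumstick on leaves are far noisier, which is one reason a learned representation is attractive.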


“To then predict the sound of a new video, the algorithm looks at the sound properties of each frame of that video, and matches them to the most similar sounds in the database,” noted Owens in MIT News. “Once the system has those bits of audio, it stitches them together to create one coherent sound.”
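The matching-and-stitching step Owens describes can be sketched as a nearest-neighbor lookup. The Python below is a simplified illustration under stated assumptions, not the team's implementation: the two-number feature vectors, the tiny `DATABASE`, and the function names are all hypothetical, and "stitching" is reduced to plain concatenation of the matched audio snippets.

```python
import math

# Hypothetical database: each entry pairs a feature vector (normalized pitch,
# normalized loudness) with a short audio snippet. Values are made up.
DATABASE = [
    {"label": "metal", "features": (0.9, 0.8), "samples": [0.5, -0.5]},
    {"label": "wood", "features": (0.5, 0.6), "samples": [0.3, -0.2]},
    {"label": "leaves", "features": (0.2, 0.1), "samples": [0.05, -0.04]},
]


def nearest_clip(query, database):
    """Return the entry whose features are closest to the query (Euclidean)."""
    return min(database, key=lambda entry: math.dist(query, entry["features"]))


def synthesize(frame_features, database):
    """Match each frame's predicted features to a stored clip, then stitch.

    Stitching here is simple concatenation; a real system would need to
    handle timing and smooth transitions between snippets.
    """
    output = []
    for features in frame_features:
        output.extend(nearest_clip(features, database)["samples"])
    return output


if __name__ == "__main__":
    # Assumed per-frame predictions for a new silent video (illustrative).
    predicted = [(0.85, 0.75), (0.25, 0.15)]
    print([nearest_clip(f, DATABASE)["label"] for f in predicted])
    print(synthesize(predicted, DATABASE))
```

With the toy values above, the first frame is closest to the "metal" entry and the second to "leaves", so the stitched output is the metal snippet followed by the leaves snippet.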

Incredibly, the algorithm was able to simulate, with a surprising degree of accuracy, the fine acoustical details of various hits, including the sounds of the drumstick on metal, wood, rocks, dirt, and even leaves. The synthetic sounds were so good that test subjects picked the fake sound over the real one twice as often as they did with a baseline algorithm. Sounds from materials like leaves and dirt were particularly difficult to distinguish from the real thing, mostly because these materials tend to have less “clean” sounds than other objects.



This research will do more than put Foley artists out of work. In the future, this system could improve robots’ abilities to evaluate and interact with their environment.


“A robot could look at a sidewalk and instinctively know that the cement is hard and the grass is soft, and therefore know what would happen if they stepped on either of them,” said Owens. “Being able to predict sound is an important first step toward being able to predict the consequences of physical interactions with the world.”

[MIT News, arXiv]