A deep-learning algorithm from MIT CSAIL passed the Turing Test for sound, leading researchers to believe this could improve robots’ abilities to interact with their surroundings.

Researchers from MIT’s Computer Science and Artificial Intelligence Lab (CSAIL) have developed a deep-learning algorithm that passes the Turing Test for sound: when shown a silent video clip of an object being hit, the algorithm can produce a sound for the hit that is realistic enough to fool human viewers.

The team believes that future work in this area could improve robots’ abilities to interact with their surroundings. For robots to navigate the world, for example, they need to be able to make reasonable assumptions about their surroundings and what might happen during a sequence of events.

“A robot could look at a sidewalk and instinctively know that the cement is hard and the grass is soft, and therefore know what would happen if they stepped on either of them,” said MIT CSAIL PhD student Andrew Owens. “Being able to predict sound is an important first step toward being able to predict the consequences of physical interactions with the world.”

How MIT’s AI Works

Over several months, the researchers recorded roughly 1,000 videos containing an estimated 46,000 sounds of various objects being hit, scraped and prodded with a drumstick. (They used a drumstick because it provided a consistent way to produce a sound.) Next, the team fed those videos to a deep-learning algorithm that deconstructed the sounds and analyzed their pitch, loudness and other features.
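
To make "pitch" and "loudness" concrete, here is a minimal Python sketch of how such features could be measured from raw audio samples. It is an illustration only, not the CSAIL team's feature extractor: the function names are hypothetical, loudness is approximated by root-mean-square amplitude, and pitch is approximated crudely by counting zero crossings.

```python
import math

def rms_loudness(samples):
    """Root-mean-square amplitude, a simple stand-in for loudness (assumption)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_pitch(samples, sample_rate):
    """Crude pitch estimate in Hz: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2 * duration) if duration > 0 else 0.0

def sine(freq, sample_rate=8000, seconds=1.0, amplitude=1.0):
    """Generate a synthetic test tone (hypothetical helper for the demo)."""
    n = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
        for i in range(n)
    ]

# A quiet 440 Hz tone: loudness is about amplitude / sqrt(2), pitch about 440 Hz.
tone = sine(440, amplitude=0.5)
loudness = rms_loudness(tone)
pitch = zero_crossing_pitch(tone, sample_rate=8000)
```

Real impact sounds are far noisier than a pure tone, so a production system would likely rely on richer spectral features than these two numbers.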

“To then predict the sound of a new video, the algorithm looks at the sound properties of each frame of that video, and matches them to the most similar sounds in the database,” said Owens. “Once the system has those bits of audio, it stitches them together to create one coherent sound.”
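
The match-and-stitch step Owens describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' code: it assumes each video frame has already been turned into a predicted feature vector of (pitch, loudness), which is the job of the deep-learning model and is not shown here. The database layout and the function names `nearest_clip`, `stitch` and `predict_sound` are hypothetical.

```python
import math

def nearest_clip(query, database):
    """Return the audio of the database entry whose features are closest to `query`."""
    best = min(database, key=lambda entry: math.dist(query, entry["features"]))
    return best["audio"]

def stitch(clips):
    """Concatenate audio snippets end to end into one sample list."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out

def predict_sound(frame_features, database):
    """For each frame's features, retrieve the closest clip, then join the clips."""
    return stitch(nearest_clip(f, database) for f in frame_features)

# Toy database: features are (pitch in Hz, loudness); audio is a short sample list.
database = [
    {"features": (100.0, 0.8), "audio": [0.1, 0.2]},  # low-pitched "thud"
    {"features": (2000.0, 0.3), "audio": [0.9]},      # high-pitched "click"
]

# Hypothetical predicted features for three frames: thud, click, thud.
frames = [(120.0, 0.7), (1900.0, 0.2), (90.0, 0.9)]
result = predict_sound(frames, database)
```

Plain concatenation is the simplest possible way to join clips. A real system would probably need to smooth the boundaries between snippets so the result sounds like one coherent sound.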

The result is that the algorithm can accurately simulate the subtleties of different hits, from the staccato taps of a rock to the longer waveforms of rustling ivy. Pitch is no problem either, as it can synthesize hit-sounds ranging from the low-pitched “thuds” of a soft couch to the high-pitched “clicks” of a hard wood railing.

The AI Fooled Humans

To test how realistic the fake sounds were, the team conducted an online study in which subjects saw two videos of collisions – one with the actual recorded sound, and one with the algorithm’s – and were asked which one was real.

The result: subjects picked the algorithm’s fake sound over the real one twice as often as they picked the fake sound produced by a baseline algorithm. They were particularly fooled by materials like leaves and dirt that tend to have less “clean” sounds than, say, wood or metal.
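
To show what "twice as often" means in a two-choice test like this one, the short Python sketch below computes a fooling rate for each algorithm and compares them. The trial data is invented purely for illustration and is not the study's data; the function name `fooling_rate` is hypothetical.

```python
def fooling_rate(choices):
    """Fraction of trials in which the subject labeled the fake sound as real."""
    if not choices:
        return 0.0
    return sum(1 for c in choices if c == "fake") / len(choices)

# Invented example trials: each entry is the clip the subject believed was real.
algorithm_trials = ["fake"] * 4 + ["real"] * 6
baseline_trials = ["fake"] * 2 + ["real"] * 8

algorithm_rate = fooling_rate(algorithm_trials)
baseline_rate = fooling_rate(baseline_trials)

# How many times more often the algorithm fooled subjects than the baseline did.
ratio = algorithm_rate / baseline_rate
```

In a test of this kind, a fake sound that is indistinguishable from the real one would be picked about half the time, so 50 percent is the practical ceiling for the fooling rate.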

Improving the AI in the Future

Researchers say that there’s still room to improve the system. For example, if the drumstick moves especially erratically in a video, the algorithm is more likely to miss a hit or hallucinate a false one. It is also limited by the fact that it applies only to “visually indicated sounds” – sounds that are directly caused by the physical interaction that is being depicted in the video.

“From the gentle blowing of the wind to the buzzing of laptops, at any given moment there are so many ambient sounds that aren’t related to what we’re actually looking at,” said Owens. “What would be really exciting is to somehow simulate sound that is less directly associated to the visuals.”