Listen to a bird singing in a nearby tree, and you can relatively quickly identify its approximate location without looking. Listen to the roar of a car engine as you cross the road, and you can usually tell immediately whether it is behind you.

The human ability to locate a sound in three-dimensional space is extraordinary. The phenomenon is well understood—it is the result of the asymmetric shape of our ears and the distance between them.

But while researchers have learned how to create 3D images that easily fool our visual systems, nobody has found a satisfactory way to create synthetic 3D sounds that convincingly fool our aural systems.

Today, that looks set to change, at least in part, thanks to the work of Ruohan Gao at the University of Texas at Austin and Kristen Grauman at Facebook AI Research. They have used a trick that humans also exploit to teach an AI system to convert ordinary mono sounds into pretty good 3D sound. The researchers call it 2.5D sound.

First some background. The brain uses a variety of clues to work out where a sound is coming from in 3D space. One important clue is the difference between a sound’s arrival times at each ear—the interaural time difference.

A sound produced on your left will obviously arrive at your left ear before the right. And although you are not conscious of this difference, the brain uses it to determine where the sound has come from.

Another clue is the difference in volume. This same sound will be louder in the left ear than in the right, and the brain uses this information as well to make its reckoning. This is called the interaural level difference.
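The interaural time difference can be estimated with a simple geometric model. Here is a minimal sketch using the Woodworth spherical-head approximation; the head radius is an assumed typical value, not a figure from the research described here.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second in air at room temperature
HEAD_RADIUS = 0.0875    # meters; an assumed, roughly average head radius

def interaural_time_difference(azimuth_deg):
    """Approximate ITD in seconds via the Woodworth model:
    ITD = (a / c) * (theta + sin(theta)),
    where theta is the azimuth in radians, measured from straight ahead."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

# A sound directly ahead produces no delay; a sound 90 degrees to one
# side arrives roughly 0.66 milliseconds later at the far ear.
itd_side = interaural_time_difference(90)
```

Even this crude model shows why the cue works: the delay grows smoothly with the angle of the source, so the brain can map a measured delay back to a direction.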

These differences depend on the distance between the ears. Stereo recordings do not reproduce the effect faithfully, because the separation between stereo microphones rarely matches the separation between a listener's ears.

The way sound interacts with ear flaps is also important. The flaps distort the sound in ways that depend on the direction it arrives from. For example, a sound from the front reaches the ear canal before hitting the ear flap. By contrast, the same sound coming from behind the head is distorted by the ear flap before it reaches the ear canal.

The brain can sense these differences too. In fact, the asymmetric shape of the ear is the reason we can tell when a sound is coming from above, for example, or from many other directions.

The trick to reproducing 3D sound artificially is to reproduce the effect that all this geometry has on sound. And that’s a tough problem.

One way to measure the distortion is with binaural recording. This is a recording made by placing a microphone inside each ear, which can pick up these tiny variations.

By analyzing the variations, researchers can then reproduce them using a mathematical filter known as a head-related transfer function. That turns any ordinary pair of headphones into an extraordinary 3D sound machine.
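In the time domain, applying a head-related transfer function amounts to convolving the source signal with a measured impulse response for each ear. Here is a minimal, dependency-free sketch of that operation; the toy impulse responses in the usage note are hypothetical, not real measurements.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Filter a mono signal through a pair of head-related impulse
    responses (the time-domain form of the HRTF) to produce the
    left and right channels for headphone playback."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy example: the right-ear response is delayed by one sample and
# attenuated, mimicking a source located to the listener's left.
left, right = spatialize([1.0, 0.0], hrir_left=[1.0], hrir_right=[0.0, 0.5])
```

Real systems use measured impulse responses hundreds of samples long and perform the convolution in the frequency domain for speed, but the principle is the same.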

But because everybody’s ears are different, everybody hears sound in a different way. So creating a person’s head-related transfer function means measuring the shape of the person’s ears before playing a recording. And although that can be done in the lab, nobody has worked out how to do it in the wild.

Still, there are ways to approximate 3D sound using the cues that don't depend on ear shape—the interaural time and level differences.

The trick that Grauman and Gao use is to determine what direction a sound is coming from using visual cues (as humans often do too). So given a video of a scene and a mono sound recording, the machine-learning system works out where the sounds are coming from and then adjusts the interaural time and level differences to simulate that direction for the listener.
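One convenient way to frame mono-to-binaural conversion—and the arithmetic behind it can be shown in a few lines—is to treat the mono track as the average of the two channels and have a model predict only the left-right difference signal. The prediction step itself is the hard, learned part; the sketch below shows only the channel reconstruction, with hypothetical names.

```python
def to_binaural(mono, predicted_diff):
    """Reconstruct left and right channels from a mono mixture and a
    predicted difference signal, using the identities
    mono = (L + R) / 2 and diff = (L - R) / 2."""
    left = [m + d for m, d in zip(mono, predicted_diff)]
    right = [m - d for m, d in zip(mono, predicted_diff)]
    return left, right

# Toy example: a positive difference shifts energy toward the left ear.
left, right = to_binaural(mono=[1.0, 1.0], predicted_diff=[0.2, -0.2])
```

Framing the task as predicting a difference signal rather than two full channels means the model only has to learn the spatial cues, since the mono content is already given.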