Almost 25 years ago, researcher Xuedong Huang founded the speech recognition program at Microsoft Research (MSR). His groundbreaking work ended up in Microsoft’s products like Cortana and Kinect. Today, voice recognition is pretty much figured out. But while computers hear us well, they still don’t see us very well. Gestural interfaces are still rudimentary. We may have virtual reality at home–and yet, those systems can’t even make out our own hands.

That may change soon, as Huang says a “paradigm shift” is happening within Microsoft Research. In a newly released demo of its gesture platform, Handpose, the company is revealing an unprecedentedly accurate hand tracking system that requires so little processing power that it could scale from computers to tablets to VR headsets.

Huang, who now consults on gesture research at MSR labs spanning the globe, explains that gestures have been stuck where voice recognition was in the 1970s.

“A very simple way to understand it is, in the ’70s, for every word, we had a whole template,” he explains. So “banana” had, essentially, a stored recording in the computer that was matched against your utterance. Better voice recognition introduced more and more of these templates for “banana” to cover the different ways people might pronounce the word.

“Mobile freed people from being tethered to a PC . . . but they’re being tethered to their phone!”

In the ’80s, a profound shift happened, he continues. Voice recognition systems began analyzing phonemes–the unique sound chunks that together make up words–rather than entire words, so a whole logic system could be built that mixed and matched different sounds to postulate what you might be saying. Add a few decades of data, and mountains of information collected by services like Google, and voice recognition is pretty good.

Most gesture systems, including Microsoft Kinect, still use this simple style of template matching. But Handpose, MSR’s new gesture recognition system, abandons those templates completely. Instead, it builds what MSR calls a “gesture vocabulary.” Rather than seeing your hand as a whole blob that must match something preprogrammed, the system breaks it into independent pieces–so it can reason about how chunks of your fingers and knuckles curl into a fist. “Those core elements are almost like a phoneme for a pronunciation of a word,” says Huang.
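The phoneme analogy can be made concrete with a toy sketch. Everything below is hypothetical–the part names, the curl thresholds, and the pose table are invented for illustration, and Handpose’s actual machine-learned model is far more sophisticated–but it shows the basic idea: score each finger independently, then compose those part states into a pose, the way phonemes compose into words.

```python
# Toy sketch of part-based gesture recognition (all values hypothetical).
# Instead of matching the whole hand against stored templates, each finger
# is classified independently and the per-part results compose into a pose.

def classify_part(curl_degrees):
    """Map one finger's curl angle to a coarse part state (a 'phoneme')."""
    if curl_degrees < 30:
        return "extended"
    if curl_degrees < 120:
        return "half-curled"
    return "curled"

# Hypothetical pose "vocabulary": combinations of part states, in finger
# order thumb, index, middle, ring, pinky.
POSES = {
    ("curled",) * 5: "fist",
    ("extended",) * 5: "open-hand",
    ("curled", "extended", "extended", "curled", "curled"): "peace-sign",
}

def recognize(finger_curls):
    """Compose independent per-finger states into a pose label."""
    states = tuple(classify_part(c) for c in finger_curls)
    return POSES.get(states, "unknown")

print(recognize([160, 150, 170, 155, 165]))  # five curled fingers -> fist
print(recognize([140, 10, 5, 150, 160]))     # index+middle extended -> peace-sign
```

The payoff of this decomposition is the same as it was for speech: a handful of per-part states can describe a combinatorially large space of hand shapes, so new gestures can be recognized without storing a whole-hand template for each one.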

Suddenly, a vision system like Kinect, which can currently only recognize large sweeps of your hand using a broad image-matching technique, could use these finger phonemes to track fine motor movements like grasping tiny objects or touch-typing on a holographic QWERTY keyboard floating in midair. It might seem like MSR is playing with semantics, but Huang views this gesture vocabulary as the “physics to express ourselves.”