Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition

Microsoft researchers from the Speech & Dialog research group include, from back left, Wayne Xiong, Geoffrey Zweig, Xuedong Huang, Dong Yu, Frank Seide, Mike Seltzer, Jasha Droppo and Andreas Stolcke. (Photo by Dan DeLong)

Microsoft has made a major breakthrough in speech recognition, creating a technology that recognizes the words in a conversation as well as a person does.

In a paper published Monday, a team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes no more errors than professional transcriptionists do. The researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent WER the team reported just last month.

The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the industry-standard Switchboard speech recognition task.

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”

The milestone means that, for the first time, a computer can recognize the words in a conversation as well as a person would. In doing so, the team has beaten a goal it set less than a year ago, and greatly exceeded everyone else’s expectations as well.

“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, the executive vice president who heads the Microsoft Artificial Intelligence and Research group.

The research milestone comes after decades of research in speech recognition, beginning in the early 1970s with DARPA, the U.S. agency tasked with making technology breakthroughs in the interest of national security. Over the decades, most major technology companies and many research organizations joined in the pursuit.

“This accomplishment is the culmination of over twenty years of effort,” said Geoffrey Zweig, who manages the Speech & Dialog research group.

The milestone will have broad implications for consumer and business products that can be significantly augmented by speech recognition. That includes consumer entertainment devices like the Xbox, accessibility tools such as instant speech-to-text transcription and personal digital assistants such as Cortana.

“This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum said.

Parity, not perfection

The research milestone doesn’t mean the computer recognized every word perfectly. In fact, humans don’t do that, either. Instead, it means that the error rate – or the rate at which the computer misheard a word like “have” for “is” or “a” for “the” – is the same as you’d expect from a person hearing the same conversation.
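For readers who want to see how such an error rate is typically computed, here is a minimal sketch in Python. It assumes the common definition of word error rate: the fewest word substitutions, insertions and deletions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The function name and the sample sentences are illustrative and are not taken from the researchers' paper or their scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper, assuming the common WER definition; it is not
    the scoring tool used by the researchers.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: a reference word was dropped
                curr[j - 1] + 1,     # insertion: an extra word was added
                prev[j - 1] + cost,  # match, or substitution of one word
            )
        prev = curr
    return prev[-1] / len(ref)


# Two substitutions ("have" for "is", "a" for "the") in four words.
print(word_error_rate("this is the plan", "this have a plan"))  # 0.5
```

By this measure, a 5.9 percent WER means roughly six errors for every 100 words in the reference transcript.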

Zweig attributed the accomplishment to the systematic use of the latest neural network technology in all aspects of the system.

The push that got the researchers over the top was the use of neural language models in which words are represented as continuous vectors in space, and words like “fast” and “quick” are close together.

“This lets the models generalize very well from word to word,” Zweig said.
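The idea of words sitting close together in space can be sketched in a few lines of Python. The three-dimensional vectors below are invented for illustration; real neural language models learn vectors with hundreds of dimensions from data, and nothing here reflects the team's actual models. Cosine similarity is one common way to measure closeness, chosen here as an assumption.

```python
import math

# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "fast": [0.9, 0.1, 0.3],
    "quick": [0.8, 0.2, 0.3],
    "table": [0.1, 0.9, 0.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )


print(nearest_word("fast", toy_vectors))  # quick
```

Because "fast" and "quick" point in nearly the same direction, what a model learns about one word can carry over to the other.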

‘A dream come true’

Deep neural networks use large amounts of data – called training sets – to teach computer systems to recognize patterns from inputs such as images or sounds.

To reach the human parity milestone, the team used Microsoft Cognitive Toolkit, a homegrown system for deep learning that the research team has made available on GitHub under an open source license.

Huang said Microsoft Cognitive Toolkit can quickly run deep learning algorithms across multiple computers equipped with specialized chips called graphics processing units. That ability vastly improved the speed at which the team was able to do its research and, ultimately, reach human parity.

The gains were quick, and once the team realized they were on to something it was hard to stop working on it. Huang said the milestone was reached around 3:30 a.m.; he found out about it when he woke up a few hours later and saw a victorious post on a private social network.

“It was a dream come true for me,” said Huang, who has been working on speech recognition for more than three decades.

The news came the same week that another group of Microsoft researchers, who are focused on computer vision, reached a milestone of their own. The team won first place in the COCO image segmentation challenge, which judges how well a technology can determine where certain objects are in an image.

Baining Guo, the assistant managing director of Microsoft Research Asia, said segmentation is particularly difficult because the technology must precisely delineate the boundary of where an object appears in a picture.

“That’s the hardest part of the picture to figure out,” he said.

The team’s results, which built on the award-winning very deep neural network system Microsoft’s computer vision experts designed last year, were 11 percent better than the second-place winner’s and a significant improvement over Microsoft’s first-place win last year.

“We continue to be a leader in the field of image recognition,” Guo said.

From recognition to true understanding

Despite huge strides in recent years in both vision and speech recognition, the researchers caution there is still much work to be done.

Moving forward, Zweig said the researchers are working on ways to make sure that speech recognition works well in more real-life settings. That includes places where there is a lot of background noise, such as at a party or while driving on the highway. They’ll also focus on better ways to help the technology assign names to individual speakers when multiple people are talking, and on making sure that it works well with a wide variety of voices, regardless of age, accent or ability.

In the longer term, researchers will focus on ways to teach computers not just to transcribe the acoustic signals that come out of people’s mouths, but to understand the words they are saying. That would give the technology the ability to answer questions or take action based on what it is told.

“The next frontier is to move from recognition to understanding,” Zweig said.

Shum has noted that we are moving away from a world where people must understand computers to a world in which computers must understand us. Even so, he cautioned, true artificial intelligence is still on the distant horizon.

“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” Shum said.

Allison Linn is a senior writer at Microsoft. Follow her on Twitter.