Microsoft researchers from the Speech & Dialog research group include, from back left, Wayne Xiong, Geoffrey Zweig, Xuedong Huang, Dong Yu, Frank Seide, Mike Seltzer, Jasha Droppo and Andreas Stolcke. (Photo by Dan DeLong)

Microsoft announced that its speech recognition technology has achieved a word error rate (WER) of only 5.9%, which the company said was similar to what human transcribers are able to achieve.



Historic Achievement In Word Error Rates

The company also said many have sought this milestone since the early 1970s, when DARPA began researching speech recognition technology in the interest of national security.

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”

Microsoft has steadily improved its speech recognition technology: just last month it hit a WER of 6.3%, which isn’t far from the 5.9% it achieved this month. However, the 5.9% milestone carries much more significance because it matches the error rate of human transcribers, and it’s the first time any company has reached it.

Human-Level WER, But Achieved Differently

Microsoft is indeed right that reaching this low WER is a significant milestone. However, just as CPU benchmarks that return a total score don’t tell you the whole story about a chip’s performance, neither does the “Switchboard” (SWB) benchmark Microsoft used to compare its software against human transcribers.

As you can see in the table below, taken from Microsoft’s paper, the overall WER may be exactly the same for humans and the company’s automatic speech recognition (ASR) system, but the breakdown by error type is quite different. The ASR system’s deletion rate is significantly lower than that of the human transcribers; for substitutions, the situation reverses.

Overall substitution, deletion and insertion rates - Microsoft's "Achieving Human Parity In Conversational Speech Recognition"

"Substitution" in this case refers to a spoken word being replaced with a different word in the transcript. "Deletion" refers to a spoken word being left out of the transcript entirely, and "insertion" refers to a word appearing in the transcript that was never spoken.

In another conversational telephone speech benchmark, CallHome (CH), the ASR system makes significantly more substitutions and insertions than humans, but fewer deletions. The overall WER is similar here too (11.1% for the ASR system and 11.3% for human transcribers), although it’s higher than in the Switchboard test for both.
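
To make the three error types concrete, here is a minimal sketch of how a word error rate is conventionally computed: the substitutions, deletions and insertions needed to turn the reference transcript into the system's output are counted, and their sum is divided by the number of words in the reference. The function name `wer_breakdown` and the sample sentences are hypothetical, and this is not the scoring tool used in Microsoft's paper; when several minimal alignments exist, the split between error types can depend on the tie-breaking rule.

```python
def wer_breakdown(reference, hypothesis):
    """Return (wer, substitutions, deletions, insertions) for two transcripts.

    Illustrative helper (not from the paper): aligns the two word sequences
    with a standard edit-distance dynamic program.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell holds (total_errors, subs, dels, ins) for aligning
    # the first i reference words with the first j hypothesis words.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # all insertions
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]  # all deletions
        for j in range(1, len(hyp) + 1):
            e, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # words match, no error
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = prev[j]
            deletion = (e + 1, s, d + 1, n)      # reference word missing
            e, s, d, n = cur[j - 1]
            insertion = (e + 1, s, d, n + 1)     # extra word in hypothesis
            # Tuples compare by total errors first, so min() keeps the
            # cheapest alignment.
            cur.append(min(diagonal, deletion, insertion))
        prev = cur

    errors, subs, dels, ins = prev[-1]
    return errors / len(ref), subs, dels, ins


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    for hypothesis in ("the cat sat in the mat",      # one substitution
                       "the cat sat on mat",          # one deletion
                       "the cat sat on the the mat"): # one insertion
        print(hypothesis, wer_breakdown(reference, hypothesis))
```

All three sample hypotheses contain exactly one error against the six-word reference, so they score the same WER even though the kind of error differs in each case.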

WER Parity, Not True Human Parity

Even if the word error rates were identical in every way, that still wouldn’t mean machine speech recognition is just as good as a human's. Even if the number of word errors that machines make is on par with humans, the kinds of errors they make can still differ significantly. Therefore, sentences transcribed by a machine could be much more confusing to readers than they would be if other humans transcribed them, even if the error rate is the same.

For instance, Microsoft’s paper also noted that the ASR system confused “backchannel” words such as “uh-huh,” which is an acknowledgement of what the other speaker is saying, with hesitations such as “uh,” which is a pause before continuing to speak. Humans don’t make these mistakes because they know intuitively what these spoken words represent.

Speech Recognition Keeps Getting Better

Human speech recognition isn’t perfect either, as the Switchboard and CallHome benchmarks show. Machine learning-based speech recognition may not yet be quite as good as humans in real-world usage, but the fact that word error rates are now similar means that speech recognition software is getting close to achieving true human parity, or even surpassing humans in speech recognition.

These latest improvements also mean that Microsoft’s services that take advantage of speech recognition, such as Cortana, are going to become more useful and less frustrating to use. Microsoft's latest achievement joins Google's recently announced near-human level accuracy in machine translation, synthetic speech generation that sounds almost as good as humans, and better-than-human image recognition. Together, they show that we're living in a time when machines are beginning to truly understand humans and the world around us.