Microsoft is boasting that it's top of the tree when it comes to speech recognition these days, and of course that bodes well for digital assistant Cortana.

According to Microsoft's chief speech scientist, Xuedong Huang, the company has just set a new record on the industry-standard Switchboard speech recognition benchmark, hitting a word error rate (WER) of 6.3%.
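For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The snippet below is a minimal sketch of that standard calculation, not the scoring tool used for the Switchboard results; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real benchmark scoring also normalizes text
    (case, punctuation, hesitations), which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one insertion ("e") over 4 reference words.
print(word_error_rate("a b c d", "a x c d e"))
```

On this definition, a WER of 6.3% works out to roughly six word errors for every 100 words in the reference transcript.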

That beats IBM's achievement of a WER of 6.6%, which was only recorded last weekend.

Things have certainly come a long way in the last couple of decades: as Microsoft notes, 20 years ago the best WER achieved was just over 43%.

In a recently published paper, Microsoft researchers stated: "Our best single system achieves an error rate of 6.9% on the NIST 2000 Switchboard set. We believe this is the best performance reported to date for a recognition system not based on system combination. An ensemble of acoustic models advances the state of the art to 6.3% on the Switchboard test data."

This is good news for the accuracy of Cortana, the speech-based virtual assistant that handles natural-language queries and is a big part of Microsoft's strategy going forward, having been introduced to desktop computers with Windows 10.

Deep neural nets

Both Microsoft and IBM are pushing speech recognition further ahead thanks to deep neural networks, which are really paying dividends these days and helping the technology to develop at speed.

Recent advances in the deep neural net field have been critical in refining these systems, and they include a new type of cross-layer network connection, along with the use of Microsoft's Computational Network Toolkit (CNTK).

CNTK bristles with optimizations that enable these networks to run much faster, harnessing the power of many GPUs in parallel.

Microsoft stated: "CNTK is already used by the team that helps Microsoft's virtual assistant, Cortana. By combining the use of CNTK and GPU clusters, Cortana's speech training is now able to ingest 10 times more data in the same amount of time."

The end goal is, of course, to have Cortana be able to understand every word someone is saying just as effectively as a real person can. And just maybe that's not as far off as we might imagine…

Via: WinBeta