SCALE: How accurate is Deep Speech at transcribing Mandarin?

AWNI HANNUN: It has a 6 percent character error rate, which essentially means it gets about 6 out of every 100 characters wrong. To put that in context, this is, in my opinion and to the best of our lab’s knowledge, the best system in the world at transcribing Mandarin voice queries.

In fact, we ran an experiment where we had a few people at the lab who speak Chinese transcribe some of the examples we were testing the system on. It turned out that our system was better at transcribing those examples than they were, provided we restricted them to transcribing without the help of the internet and similar resources.
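
For readers unfamiliar with the metric: character error rate (CER) is conventionally the Levenshtein edit distance between the system’s output and a reference transcript, divided by the reference length, so it counts substitutions, insertions, and deletions rather than only outright wrong characters. Here is a minimal sketch of the computation (not Baidu’s evaluation code):

```python
# Minimal character error rate (CER): edit distance between the
# hypothesis and the reference, normalized by reference length.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    # dist[i][j] = edits needed to turn reference[:i] into hypothesis[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n] / m

# One wrong character out of six -> CER of about 0.17.
print(cer("今天天气很好", "今天天汽很好"))
```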

“We give it enough data that it’s able to learn what’s relevant from the input to correctly transcribe the output, with as little human intervention as possible.”

What is it about Mandarin that makes it such a challenge compared with other languages?

There are a couple of differences with Mandarin that made us think it would be very difficult to get our English speech system to work well with it. One is that it’s a tonal language: when you say a word at a different pitch, it changes the meaning of the word, which is definitely not the case in English. The syllable “ma,” for instance, can mean “mother,” “hemp,” “horse,” or “to scold” depending on the tone. In traditional speech recognition, some degree of pitch invariance is actually a desirable property, which essentially means the system tries to ignore pitch when it does the transcription. So you have to change a bunch of things to get such a system to work with Mandarin, or any Chinese for that matter.

Awni Hannun. Source: Baidu

For us, however, it was not the case that we had to change a whole bunch of things, because our pipeline is much simpler than the traditional speech pipeline. We don’t do a whole lot of pre-processing on the audio to make it pitch-invariant; we just let the model learn whatever is relevant in the data to transcribe it properly. It was able to do that fine in Mandarin without us having to change the input.
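
To make that contrast concrete, here is a brief sketch, assuming a 16 kHz recording in a hypothetical file query.wav: a traditional front end computes features such as MFCCs, which smooth away the harmonic fine structure that carries pitch, whereas a minimally processed input like a log spectrogram keeps that structure available for the model to use.

```python
import numpy as np
import librosa

# Hypothetical audio file; 16 kHz is a common sample rate for speech.
y, sr = librosa.load("query.wav", sr=16000)

# Traditional front end: MFCCs keep the spectral envelope and smooth
# over the harmonics, which is what makes them largely pitch-invariant.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# A plain log spectrogram retains the harmonic structure, so pitch,
# and hence Mandarin tone, remains visible to a model trained on it.
spec = np.abs(librosa.stft(y, n_fft=400, hop_length=160))
log_spec = np.log1p(spec)
```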

The other thing that is very different about Chinese (Mandarin, in this case) is the character set. The English alphabet has 26 letters, whereas Chinese has something like 80,000 different characters. Our system directly outputs one character at a time as it builds its transcription, so we speculated it would be very challenging to choose among 80,000 characters at each step rather than 26. That’s a challenge we were able to overcome by restricting the output to the much smaller subset of characters people commonly use.
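
One plausible way to pick such a subset, sketched below with illustrative names and an illustrative coverage threshold rather than Baidu’s actual procedure, is to keep the most frequent characters until they account for nearly all character occurrences in a transcript corpus:

```python
from collections import Counter

def build_char_vocab(corpus_lines, coverage=0.9995):
    """Keep the most frequent characters that together account for
    `coverage` of all character occurrences in the corpus."""
    counts = Counter(ch for line in corpus_lines for ch in line)
    total = sum(counts.values())
    vocab, covered = [], 0
    for ch, n in counts.most_common():
        vocab.append(ch)
        covered += n
        if covered / total >= coverage:
            break
    return vocab  # typically a few thousand characters, not 80,000

# Characters outside the vocabulary would be mapped to a catch-all
# <unk> token at training time.
```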

Baidu has been handling a fairly high volume of voice searches for a while now. How is the Deep Speech system better than the previous system for handling queries in Mandarin?

Baidu has a very active system for voice search in Mandarin, and it works pretty well. That said, I think voice still accounts for a relatively small percentage of total query activity. We want to make that share larger, or at least enable people to use voice more, by making the system more accurate.

Can you describe the difference between a search-based system like Deep Speech and something like Microsoft’s Skype Translator, which is also based on deep learning?

Typically, the way it’s done is that there are three modules in the pipeline. The first is a speech-transcription module, the second is a machine-translation module, and the third is a speech-synthesis module. What we’re talking about here, specifically, is just the speech-transcription module, and I’m sure Microsoft has one as part of Skype Translator.
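
In outline, such a cascade is simply three models composed in sequence. The stubs below only mark the interfaces; they are illustrative, not any vendor’s actual API:

```python
# Illustrative three-stage cascade; in a real system each stage is a
# separately trained model. These stubs mark the interfaces only.
def transcribe(audio: bytes) -> str:      # module 1: speech transcription
    raise NotImplementedError

def translate(text: str) -> str:          # module 2: machine translation
    raise NotImplementedError

def synthesize(text: str) -> bytes:       # module 3: speech synthesis
    raise NotImplementedError

def speech_to_speech(audio: bytes) -> bytes:
    return synthesize(translate(transcribe(audio)))
```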

Our system is different in that it’s what we call end-to-end. Rather than having a lot of human-engineered components that have been developed over decades of speech research, by looking at the system and deciding which features are important or which phonemes the model should predict, we just have some input data, which is an audio .WAV file on which we do very little pre-processing. And then we have a big, deep neural network that outputs characters directly. We give it enough data that it’s able to learn what’s relevant from the input to correctly transcribe the output, with as little human intervention as possible.
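
The general shape of such a model can be sketched in a few lines of PyTorch. This is an illustration of the recipe with made-up sizes, not Baidu’s implementation (the published Deep Speech papers describe the real architecture): spectrogram frames go into a recurrent network that emits a per-frame distribution over characters, and the CTC loss lets the network learn the alignment between frames and characters on its own.

```python
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    """Spectrogram frames in, per-frame character probabilities out."""
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        # 29 = 26 letters + space + apostrophe + the CTC blank symbol
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, spectrograms):            # (batch, time, n_mels)
        h, _ = self.rnn(spectrograms)
        return self.out(h).log_softmax(dim=-1)  # (batch, time, n_chars)

model = EndToEndASR()
ctc = nn.CTCLoss(blank=0)

x = torch.randn(4, 200, 80)              # 4 utterances, 200 frames each
log_probs = model(x).transpose(0, 1)     # CTCLoss expects (time, batch, chars)
targets = torch.randint(1, 29, (4, 30))  # dummy character ids
loss = ctc(log_probs, targets,
           torch.full((4,), 200, dtype=torch.long),
           torch.full((4,), 30, dtype=torch.long))
loss.backward()  # no frame-level alignments or phoneme labels needed
```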

One thing that pleasantly surprised us is how little we had to change, beyond scaling the system up and giving it the right data, to make the system we showed in December, which worked really well on English, work remarkably well on Chinese as well.

“We want to build a speech system that can be used as the interface to any smart device, not just voice search.”

What’s the usual timeline to get this type of system from R&D into production?

It’s not an easy process, but I think it’s easier than getting a model to be very accurate, in the sense that it’s more of an engineering problem than a research problem. We’re actively working on that now, and I’m hopeful our research system will be in production in the near term.

Baidu has plans — and products — in other areas, including wearables and other embedded forms of speech recognition. Does the work you’re doing on search relate to these other initiatives?

We want to build a speech system that can be used as the interface to any smart device, not just voice search. It turns out that voice search is a very important part of Baidu’s ecosystem, so that’s one place we can have a lot of impact right now.