Google’s latest take on machine translation could make it easier for people to communicate with those speaking a different language, by translating speech directly into text in a language they understand.

Machine translation of speech normally works by first converting it into text, then translating that into text in another language. But any error in speech recognition is carried into the transcript and from there into the translation.

Researchers at Google Brain, the tech giant’s deep learning research arm, have turned to neural networks to cut out the middle step. By skipping transcription, the approach could potentially allow for more accurate and quicker translations.


The team trained its system on hundreds of hours of Spanish audio with corresponding English text. In each case, it used several layers of neural networks – computer systems loosely modelled on the human brain – to match sections of the spoken Spanish with the written translation. To do this, it analysed the waveform of the Spanish audio to learn which parts seemed to correspond with which chunks of written English. When it was then asked to translate, each neural layer used this knowledge to manipulate the audio waveform until it was turned into the corresponding section of written English.

Corresponding patterns

“It learns to find patterns of correspondence between the waveforms in the source language and the written text,” says Dzmitry Bahdanau at the University of Montreal in Canada, who wasn’t involved with the work.

After a learning period, Google’s system produced a better-quality English translation of Spanish speech than one that transcribed the speech into written Spanish first. It was evaluated using the BLEU score, which is designed to judge machine translations based on how close they are to a translation produced by a professional human translator.
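
As a rough illustration of the kind of comparison BLEU makes, the Python sketch below scores a candidate translation against a single human reference by counting the word sequences they share. It is a simplified version for illustration only, not the evaluation code used by the Google team: the function names, the whitespace tokenisation, the single reference and the absence of smoothing are all assumptions made here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference translation.

    Assumptions for illustration: lower-cased whitespace tokens,
    a single reference, and no smoothing of zero counts.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by how often it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            # Without smoothing, one empty precision makes the whole score zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: candidates shorter than the reference are marked down.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    human = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", human))  # identical wording
    print(simple_bleu("the cat sat on a mat", human))    # one word differs
```

A translation that matches the reference word for word scores 1, one that shares no words scores 0, and a near miss falls in between. Published evaluations typically rely on an established implementation and score a whole test set rather than single sentences.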

The system could be particularly useful for translating speech in languages that are spoken by very few people, says Sharon Goldwater at the University of Edinburgh in the UK.

International disaster relief teams, for instance, could use it to quickly put together a translation system to communicate with people they are trying to assist. When an earthquake hit Haiti in 2010, says Goldwater, there was no translation software available for Haitian Creole.

Goldwater’s team is using a similar method to translate speech from Arapaho, a language spoken by only 1000 or so people in the Native American tribe of the same name, and Ainu, a language spoken by a handful of people in Japan.

Rare languages

The system could also be used to translate languages that are rarely written down, since it doesn’t require a written version of the source language to produce successful translations.

Until it is tested on a much larger dataset, it’s hard to tell how the new approach really compares with more conventional translation systems, says Goldwater. But she thinks it could set the standard for future machine translation.

Some services already use machine translation to let people who speak different languages have conversations in real time. Skype introduced a live speech-to-text translation feature in 2014 and now supports nine languages, including Mandarin and Arabic as well as the most common European languages. But like other existing translation methods, Skype’s transcribes speech into text before translating that text into a different language.

And text translation service Google Translate already uses neural networks on its most popular language pairs, which lets it analyse entire sentences at once to figure out the best written translation. Intriguingly, this system appears to use an “interlingua” – a common representation of sentences that have the same meaning in different languages – to translate from one language to another, meaning it could translate between a language pair it hasn’t explicitly been trained on. The Google Brain researchers suggest the new speech-to-text approach may also be able to produce a system that can translate multiple languages.

But while machine translation keeps improving, it’s difficult to tell how neural networks are coming to their solutions, says Bahdanau. “It’s very hard to understand what’s happening inside.”

Reference: arXiv, arxiv.org/abs/1703.08581