Google AI yesterday released its latest research result in speech-to-speech translation, the futuristic-sounding “Translatotron.” Billed as the world’s first end-to-end speech-to-speech translation model, Translatotron promises the potential for real-time cross-linguistic conversations with low latency and high accuracy.

Humans have always dreamed of a voice-based device that could enable them to simply leap over language barriers. While advances in deep learning have greatly improved accuracy in speech recognition and machine translation, smooth conversations between speakers of different languages remain hampered by unnatural pauses during machine processing.

Google’s wireless earbuds, Pixel Buds, released in 2017, boasted real-time speech translation, but users found the practical experience less than satisfying. Delivering an English-language prompt such as “Help me speak Russian” would connect the earbuds to the Google Translate app on the user’s smartphone. The app would then convert the user’s English speech into English text, translate that into Russian text, and finally read the result aloud in Russian. The speech-to-text, text-to-text, and text-to-speech steps, however, introduced a few seconds of latency, which Google strove to reduce.
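The cascade described above can be sketched as three sequential model calls. This is a minimal illustrative sketch, not Google's implementation: the function names and placeholder return values are assumptions standing in for real ASR, machine translation, and TTS models.

```python
# Hypothetical sketch of the cascade pipeline: each stage stands in for a
# real model, and each sequential hop adds latency.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for automatic speech recognition (English audio -> English text)."""
    return "help me book a taxi"  # placeholder transcript

def translate_text(text: str, target_lang: str = "ru") -> str:
    """Stand-in for text-to-text machine translation."""
    return f"[{target_lang}] {text}"  # placeholder translation

def text_to_speech(text: str) -> bytes:
    """Stand-in for speech synthesis in the target language."""
    return text.encode("utf-8")  # placeholder waveform

def cascade_translate(audio: bytes) -> bytes:
    # speech -> text -> text -> speech: three model invocations in sequence
    transcript = speech_to_text(audio)
    translation = translate_text(transcript)
    return text_to_speech(translation)
```

The key point is structural: three separate models run back-to-back, and the user waits for all of them.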

In 2017, Google researchers introduced a deep neural network architecture that could directly translate speech in one language into text in another. Their experiments showed the end-to-end approach outperformed previous cascade models, which combined separate speech recognition and machine translation systems, on Spanish-English speech translation tasks. The research laid the foundation for Google Assistant Interpreter Mode, introduced earlier this year, which translates a user’s speech into target-language text on a Google smart display.

Translatotron marks another leap forward. The new model comprises three components: an attention-based sequence-to-sequence network, trained on voice spectrograms, that generates spectrograms of the target-language translation; a neural vocoder that converts the output spectrograms into time-domain waveforms; and a pretrained speaker encoder that preserves the source speaker’s vocal characteristics. Voice transcripts are still needed during training, but not during inference.
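In contrast to the cascade, the direct approach maps a source-language spectrogram straight to a target-language spectrogram, with no text in between, and a vocoder then renders audio. The sketch below is a toy illustration of that data flow only: the transforms, dimensions (80 mel bins, 10 frames), and function names are assumptions, not the actual Translatotron architecture.

```python
import random

random.seed(0)

def direct_translate(src_spectrogram):
    """Stand-in for the attention-based seq2seq network: maps a source-language
    spectrogram directly to a target-language spectrogram, no text step."""
    # Toy transform: keeps the frame width; a real model would also change length.
    return [[v * 0.5 for v in frame] for frame in src_spectrogram]

def vocoder(spectrogram):
    """Stand-in for a neural vocoder (e.g. WaveRNN): spectrogram -> waveform."""
    return [sum(frame) / len(frame) for frame in spectrogram]  # one sample per frame

# 10 frames of an 80-bin spectrogram (sizes are illustrative assumptions)
src = [[random.random() for _ in range(80)] for _ in range(10)]
tgt_spec = direct_translate(src)  # single model call, no intermediate text
waveform = vocoder(tgt_spec)
```

The single `direct_translate` call replaces the ASR and MT stages of the cascade, which is where the latency savings come from.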

Translatotron demonstrated impressive translation accuracy in Spanish-to-English tasks. However, the model did not beat the baseline ST (speech-to-text translation) → TTS (text-to-speech) cascade model in experiments, falling 6 BLEU points below the baseline on the Conversational Spanish-to-English dataset and 9.3 BLEU points short on the Fisher Spanish-English dataset (target speech synthesized by Parallel WaveNet in a female English speaker’s voice).
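BLEU, the metric behind these comparisons, scores n-gram overlap between a candidate translation and a reference. The sketch below is a simplified sentence-level version using only unigrams and bigrams; published results like those above use corpus-level BLEU-4, so treat this as an illustration of the idea rather than the evaluation procedure.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: geometric mean of n-gram precisions
    times a brevity penalty, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c & r).values())        # clipped n-gram matches
        precisions.append(overlap / max(sum(c.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # brevity penalty discourages overly short candidates
    bp = math.exp(min(0.0, 1 - len(ref) / len(cand)))
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(simple_bleu("the cat sat on the mat", "the cat sat on the mat"), 1))  # 100.0
```

A gap of 6 or 9.3 points on this 0-100 scale, as reported above, is a substantial quality difference.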

In two additional speech quality evaluations, Translatotron with a WaveRNN vocoder scored over 4.0 (a “very good range”) for speech naturalness, and managed to preserve speakers’ vocal characteristics in cross-language voice transfer tasks, though not as well as conventional TTS models.

The researchers concluded that further work is required to improve Translatotron, but believe their experiments open up new possibilities for faster and more efficient Google Translate applications.

The paper “Direct speech-to-speech translation with a sequence-to-sequence model” is on arXiv.