Alphabet’s Tacotron 2 Text-to-Speech Engine Sounds Nearly Indistinguishable From a Human

Alphabet’s subsidiary DeepMind developed WaveNet, a neural network that powers the Google Assistant’s speech synthesis, in October. It produces better and more realistic audio samples than the search giant’s previous text-to-speech system, and what’s more, it generates raw audio rather than splicing together sounds recorded by voice actors. Now, researchers at Alphabet have developed a new system, Tacotron 2, that uses multiple neural networks to produce speech almost indistinguishable from a human’s.

Here are two samples. The first was generated using Tacotron 2, and the second is a recording of a voice actor:

Tacotron 2 consists of two deep neural networks. As the research paper published this month describes it, the first translates text into a spectrogram, a visual representation of how the frequency content of audio changes over time. The second, DeepMind’s WaveNet, interprets the spectrogram and generates the corresponding audio waveform. The result is an end-to-end engine that can emphasize words, correctly pronounce names, pick up on syntactical clues (e.g., stress words that are italicized or capitalized), and alter the way it enunciates based on punctuation.
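
For readers unfamiliar with spectrograms, here is a minimal Python sketch of how one can be computed from audio samples. It is not Tacotron 2's code: the paper's first network predicts mel-scale spectrograms from text, whereas this example goes the other way and simply measures the frequency content of an existing signal. The function name, frame size, hop length, sample rate, and test tone are all assumptions chosen for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame lists the magnitude of every frequency bin."""
    # A Hann window tapers the edges of each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive discrete Fourier transform, keeping only non-negative frequencies.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


# Illustrative input (an assumption, not data from the paper):
# a 1,000 Hz sine tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(tone, frame_size=FRAME_SIZE, hop=32)

# Find the strongest frequency bin in the first frame and convert it to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(f"{len(spec)} frames, {len(spec[0])} bins each, peak near {peak_hz:.0f} Hz")
```

Each row of the result is one slice of time and each column is one frequency band, which is the "chart" that a vocoder such as WaveNet would be conditioned on to produce a waveform.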

It’s unclear whether Tacotron 2 will make its way to user-facing services like the Google Assistant, but it’d be par for the course. Shortly after the publication of DeepMind’s WaveNet research, Google rolled out machine learning-powered speech recognition in multiple languages on Assistant-powered smartphones, speakers, and tablets.

There’s only one problem: Right now, the Tacotron 2 system is trained to mimic one female voice. To generate new voices and speech patterns, Google would need to train the system again.