We augment Tacotron with a prosody encoder. The lower half of the image is the original Tacotron sequence-to-sequence model. For technical details, please refer to the paper.
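As a rough illustration of how a prosody encoder's output might be fed into the sequence-to-sequence model, the sketch below appends one fixed-length prosody embedding to every text-encoder timestep. This broadcast-and-concatenate conditioning is an assumption made here for illustration; the function name `condition_on_prosody` and the toy values are hypothetical, and the paper remains the authoritative description.

```python
def condition_on_prosody(text_encodings, prosody_embedding):
    """Append the same prosody embedding to every text-encoder timestep.

    text_encodings: list of per-timestep feature lists (toy stand-in for
    the text encoder output; an assumption, not the paper's tensors).
    prosody_embedding: one fixed-length list for the whole utterance.
    """
    return [list(frame) + list(prosody_embedding) for frame in text_encodings]


# Toy values, chosen only to show the shapes involved.
text_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 timesteps, 2 features
prosody_embedding = [0.9, -0.9, 0.0]                    # 3 prosody features

conditioned = condition_on_prosody(text_encodings, prosody_embedding)
```

Each timestep keeps its text features and gains the same prosody features, so the decoder can attend over text content while seeing one utterance-level prosody summary.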

Text: *Is* that Utah travel agency?

Reference prosody (Australian) | Synthesized without prosody embedding (American) | Synthesized with prosody embedding (American)

Reference Text: For the first time in her life she had been danced tired.

Synthesized Text: For the last time in his life he had been handily embarrassed.

Reference prosody (American) | Synthesized without prosody embedding (American) | Synthesized with prosody embedding (American)

Text: I've Swallowed a Pollywog.

Reference prosody (Unseen American Speaker) | Synthesized without prosody embedding (British) | Synthesized with prosody embedding (British)

Model architecture of Global Style Tokens. The prosody embedding is decomposed into “style tokens” to enable unsupervised style control and transfer. For technical details, please refer to the paper.
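One way to picture the decomposition into style tokens is as an attention-weighted sum over a small bank of token vectors. The sketch below is a simplification under stated assumptions: it uses single-head dot-product attention with no learned projections, and the names `style_embedding` and `mix_tokens` and all numeric values are hypothetical. The architecture in the paper may differ in these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_tokens(weights, style_tokens):
    """Weighted sum of the style tokens (the resulting style embedding)."""
    dim = len(style_tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, style_tokens))
            for d in range(dim)]


def style_embedding(reference_vector, style_tokens):
    """Score each token against a reference vector, then mix the tokens.

    Assumption: plain dot-product scores stand in for the attention
    module; this is an illustrative simplification.
    """
    scores = [sum(r * t for r, t in zip(reference_vector, tok))
              for tok in style_tokens]
    weights = softmax(scores)
    return weights, mix_tokens(weights, style_tokens)


# Toy token bank: 3 tokens of dimension 2 (values are made up).
style_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Style transfer: weights are inferred from a reference signal.
weights, transferred = style_embedding([2.0, 0.0], style_tokens)

# Style control: weights are chosen by hand, here selecting token 1 only.
controlled = mix_tokens([0.0, 1.0, 0.0], style_tokens)
```

The two calls at the end mirror the two uses named in the caption: inferring the weights from a reference gives transfer, while setting the weights directly gives control without any reference audio.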

Text: United Airlines five six three from Los Angeles to New Orleans has landed.