As I was getting ready to write this post, I thought back to my childhood (largely spent watching TV) and some of the popular computer and robot voices of the 1960s and 1970s. In just a few minutes, pleasant memories of HAL-9000, B9 (Lost in Space), the original Star Trek Computer, and Rosie (from The Jetsons) all came to mind. At that time, most people expected mechanically generated speech to sound precise, clipped, and devoid of human emotion.

Fast forward many years and we now see that there are many great applications and use cases for computer-generated speech, commonly known as Text-to-Speech or TTS. Entertainment, gaming, public announcement systems, e-learning, telephony, assistive apps & devices, and personal assistants are just a few starting points. Many of these applications are great fits for mobile environments where connectivity is very good but local processing power and storage are so-so at best.

Hello, Polly

In order to address these use cases (and others that you will dream up), we are introducing Polly, a cloud service that converts text to lifelike speech that you can use in your own tools and applications. Polly currently supports a total of 47 male & female voices spread across 24 languages, with additional languages and voices on the roadmap.

Polly was designed to address many of the more challenging aspects of speech generation. For example, consider the difference in pronunciation of the word “live” in the phrases “I live in Seattle” and “Live from New York.” Polly knows that these homographs are spelled the same but pronounced quite differently. Or what about “St.”? Depending on the language and the context, it could mean (and should be pronounced as) either “street” or “saint.” Again, Polly knows what to do here. Polly can also deal with units, fractions, abbreviations, currencies, dates, times, and other speech components in a sophisticated, language-specific fashion.

In order to do this, we worked with professional, native speakers of each target language. We asked each speaker to pronounce a myriad of representative words and phrases in their chosen language, and then disassembled the audio into sound units known as diphones.

Polly works really well with unadorned text. You simply provide the text and Polly will take care of the rest, delivering an audio file or a stream that represents the text in an accurate, natural, and lifelike way. For more sophisticated applications, you can use SSML (Speech Synthesis Markup Language) to provide Polly with additional information. For example, if your text contains words drawn from more than one language (perhaps English with some French mixed in), you can flag it to be pronounced as such using SSML.
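To make the mixed-language case a bit more concrete, here is a minimal Python sketch that wraps the foreign-language parts of a sentence in SSML. The helper name mixed_language_ssml is my own invention for illustration. The `<speak>` and `<lang xml:lang="...">` elements come from the W3C SSML specification; I am assuming that Polly honors them for the voice you pick, so check the Polly documentation before relying on this.

```python
from xml.sax.saxutils import escape, quoteattr


def mixed_language_ssml(segments):
    """Build an SSML string from (text, language_code) pairs.

    A language_code of None leaves the text in the voice's own language.
    (Hypothetical helper, not part of any AWS SDK.)
    """
    parts = []
    for text, language_code in segments:
        safe_text = escape(text)  # escape &, <, > so the markup stays valid
        if language_code is None:
            parts.append(safe_text)
        else:
            # Flag this span as belonging to another language.
            parts.append(
                f"<lang xml:lang={quoteattr(language_code)}>{safe_text}</lang>"
            )
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(mixed_language_ssml([
        ("The French word for cat is ", None),
        ("chat", "fr-FR"),
        (".", None),
    ]))
```

The resulting string is what you would hand to Polly in place of plain text, after telling it that the input is SSML.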

I can’t embed sound clips in this post, so you’ll have to visit the Polly Console and try it out yourself. You simply enter your text and click on Listen to speech:

You can also save the generated audio in an MP3 file and use it within your own applications.

Here is the fully expanded Language and Region menu:

Technical Details

Although you are welcome to use Polly from the Console, you will probably want to do something a bit more dynamic. You can simply call the SynthesizeSpeech API function with your text or your SSML. You can stream the output directly to your user, or you can generate an MP3 or Ogg file and play it back as desired. Polly can generate high quality (up to 22 kHz sampling rate) audio in MP3 or Vorbis formats, along with telephony-quality (8 kHz) audio in PCM format.
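Here is a rough sketch of what a SynthesizeSpeech call could look like from Python. It assumes the AWS SDK for Python (boto3), whose Polly client exposes synthesize_speech and returns the audio as a file-like AudioStream. The helper name synthesize_to_file and its defaults are mine, and the format names in the comment are what I would expect rather than something this post spells out.

```python
def synthesize_to_file(polly_client, text, path, voice_id="Joanna",
                       output_format="mp3", is_ssml=False):
    """Call SynthesizeSpeech and write the returned audio bytes to `path`.

    Hypothetical helper for illustration; `polly_client` is expected to
    behave like boto3.client("polly").
    """
    response = polly_client.synthesize_speech(
        Text=text,
        TextType="ssml" if is_ssml else "text",
        VoiceId=voice_id,
        OutputFormat=output_format,  # assumed values: "mp3", "ogg_vorbis", "pcm"
    )
    audio = response["AudioStream"].read()  # file-like streaming body
    with open(path, "wb") as audio_file:
        audio_file.write(audio)
    return len(audio)  # number of bytes written


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   client = boto3.client("polly")
#   synthesize_to_file(client, "Hello my name is Joanna.", "joanna.mp3")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object, without touching the network.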

You can also use the AWS Command Line Interface (CLI) to generate audio. For example:

$ aws polly synthesize-speech \
    --output-format mp3 --voice-id Joanna \
    --text "Hello my name is Joanna." \
    joanna.mp3

Polly encrypts all data at rest and transfers the audio across SSL connections. The text submissions are disassociated from the submitter, stored in encrypted form for up to 6 months, and used to maintain and improve Polly.

Pricing and Availability

You can use Polly to process 5 million characters per month at no charge. After that, you pay $0.000004 per character, or about $0.004 per minute of generated audio. That works out to about $0.018 for this blog post, or around $2.40 for the full text of Adventures of Huckleberry Finn.
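If you want to run the numbers for your own text, the arithmetic is simple enough to sketch in a few lines of Python. The per-character price and the free allowance come straight from the paragraph above. The function names are mine, and the 600,000-character figure for Huckleberry Finn is back-calculated from the $2.40 estimate, not a measured count.

```python
from decimal import Decimal

PRICE_PER_CHARACTER = Decimal("0.000004")  # USD, from the pricing above
FREE_CHARACTERS_PER_MONTH = 5_000_000      # monthly no-charge allowance


def text_cost(characters):
    """Cost of synthesizing `characters`, ignoring the free allowance."""
    return PRICE_PER_CHARACTER * characters


def monthly_cost(characters):
    """Cost for a month's total usage after subtracting the free allowance."""
    billable = max(0, characters - FREE_CHARACTERS_PER_MONTH)
    return PRICE_PER_CHARACTER * billable


if __name__ == "__main__":
    # Assumption: roughly 600,000 characters, implied by the $2.40 estimate.
    print("One long novel:", text_cost(600_000))
    print("6 million characters in a month:", monthly_cost(6_000_000))
```

Decimal is used here so that the tiny per-character price does not pick up floating-point rounding noise.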

Polly is available now in the US East (N. Virginia), US West (Oregon), US East (Ohio), and Europe (Ireland) Regions and you can start using it today. Let me know what you come up with!

Ready to learn more? Register for our webinar on December 13th!

— Jeff;