Amazon Polly

Amazon’s text-to-speech service Polly was announced at the end of 2016.

At the time of writing, it supported 25 languages, ranging from Danish and Australian English to Brazilian Portuguese and Turkish.

A larger number of voices has several benefits: the generated speech can be used for dialogues, represent different personas, and achieve a higher degree of localization.

Polly offers a selection of eight different voices for American English. The names of the voices used in the study are Joanna and Matthew.

It should be noted that Amazon has promised not to retire any current or future voices made available through Polly.

In the experiment I ran, Polly achieved the second-highest overall ratings. In terms of the pleasantness of the speech, Amazon’s service edged out Google’s Cloud Text-to-Speech API.

Here is the speech output that was presented to the participants:

Samples generated with the Amazon Polly voices Joanna and Matthew

Speech markers can be obtained for positions specified with a mark element and on the level of words and sentences. In addition, Polly allows you to retrieve visemes that represent the position of the speaker’s face and mouth when saying a word.
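When the output format is set to JSON, Polly returns the speech marks as newline-delimited JSON objects. A minimal parser for a single such line might look as follows; the field names match the documented format, but the SpeechMark class itself is a hypothetical helper, not part of the AWS SDK:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal parser for one line of Polly's newline-delimited speech-mark JSON.
public class SpeechMark {
    public final long time;      // offset from stream start, in milliseconds
    public final String type;    // "word", "sentence", "viseme" or "ssml"
    public final String value;   // the word, viseme code or mark name

    private static final Pattern FIELD =
        Pattern.compile("\"(time|type|value)\"\\s*:\\s*\"?([^,\"}]*)\"?");

    public SpeechMark(long time, String type, String value) {
        this.time = time; this.type = type; this.value = value;
    }

    public static SpeechMark parse(String jsonLine) {
        long time = -1; String type = null, value = null;
        Matcher m = FIELD.matcher(jsonLine);
        while (m.find()) {
            switch (m.group(1)) {
                case "time":  time  = Long.parseLong(m.group(2)); break;
                case "type":  type  = m.group(2); break;
                case "value": value = m.group(2); break;
            }
        }
        return new SpeechMark(time, type, value);
    }
}
```

A word mark additionally carries start and end offsets into the input text; the sketch above simply ignores fields it does not know.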

The console is a great way to experiment with SSML and get a first impression of the available feature set.

100 requests are allowed per second. The input can consist of up to 3,000 billed characters. SSML tags do not count as billed characters. The output is limited to 10 minutes of synthesized speech per request.

The pricing model is simple. During the first twelve months, the first five million characters are on Amazon. Above this tier, requests are billed on a pay-as-you-go basis at $4 per one million characters.

Speech synthesis with Amazon Polly

Polly supports all SSML tags that we’ve mentioned as well as two extensions: breaths and voice effects.

The self-closing amazon:breath tag instructs the artificial speaker to take a (fairly life-like) breath of a specified length and volume.

Voice effects include whispering, speaking softly and changing the vocal tract length to make the speaker sound bigger or smaller.
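Both extensions are ordinary SSML elements, so the documents can be assembled as plain strings. A sketch, with tag and attribute names taken from the Polly documentation and the builder methods themselves made up:

```java
// Sketch of SSML documents using Polly's two extensions.
public class PollySsml {

    // A loud, long breath followed by whispered speech.
    public static String whisperWithBreath(String text) {
        return "<speak>"
             + "<amazon:breath duration=\"long\" volume=\"x-loud\"/>"
             + "<amazon:effect name=\"whispered\">" + text + "</amazon:effect>"
             + "</speak>";
    }

    // A longer vocal tract makes the speaker sound bigger.
    public static String deepVoice(String text) {
        return "<speak>"
             + "<amazon:effect vocal-tract-length=\"+15%\">" + text + "</amazon:effect>"
             + "</speak>";
    }
}
```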

Deep voice and heavy breathing: Amazon’s SSML extensions

Google Cloud Text-to-Speech

Cited more than 250 times, the WaveNet paper[1] published by researchers from Google DeepMind is an important milestone in the recent history of speech synthesis.

The GitHub repositories that sprang up to replicate the results achieved by the DeepMind researchers have been starred and forked thousands of times.[9, 10, 11]

In a study described in that paper, subjects were asked to rate the naturalness of speech generated with WaveNet, actual human speech and output of two competing models. On the same scale from 1 to 5 that was used in the study reported in this article, the mean opinion score was 4.2 for the WaveNet samples, 4.5 for the human speech and less than 4 for the competing models.

Last November, Google finally released the alpha version of its long-awaited Cloud Text-to-Speech service. At the time of writing, the service is in the beta stage and “not intended for real-time usage in critical applications”.

The service offers WaveNet-based speech synthesis and what Google refers to as standard voices or non-WaveNet voices.

The six available WaveNet voices are in US English. According to the documentation, these are the same voices that are used in Google Assistant, Google Search, and Google Translate.

The 28 standard voices cover several European languages and include a few female voices for Asian markets.

In contrast to the other services, the voices have technical identifiers rather than memorable names. The two voices I’ve used, for example, are referred to as en-US-Wavenet-A and en-US-Wavenet-C.

This is the playlist for the output used in the experiment:

Samples generated with the Google Cloud Text-to-Speech voices en-US-Wavenet-A and en-US-Wavenet-C

My own results are comparable to those reported in the WaveNet paper. Among the four competitors, Google’s service achieved the highest naturalness score and the best overall ratings.

If natural sound is your primary concern, then this is most likely the right choice for you.

It should, however, be pointed out that both Amazon Web Services and IBM Watson offer more features. Neither timing information nor SSML extensions are supported by Google Cloud Text-to-Speech.

The premium price for the WaveNet functionality is set at $16 per one million characters for requests in excess of the first one million characters covered by the free tier.

Four million characters per month can be synthesized with the standard voices at no cost. Subsequent requests set you back $4 for every one million characters.

In addition to limits of 300 requests per minute and 5,000 characters per request, there is a quota of 150,000 characters per minute.
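Longer documents therefore have to be split across several requests. A hypothetical helper that cuts the input at sentence boundaries while keeping each piece under the per-request character cap:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that splits long input into request-sized chunks.
public class RequestChunker {

    public static List<String> chunk(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
        sentences.setText(text);
        StringBuilder current = new StringBuilder();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                 start = end, end = sentences.next()) {
            String sentence = text.substring(start, end);
            // Flush the current chunk if the next sentence would overflow it.
            // (A single sentence longer than maxChars still becomes its own,
            // oversized chunk and would have to be split further.)
            if (current.length() > 0 && current.length() + sentence.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```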

If you decide to use the Java SDK, make sure to import from the package v1beta1 in the namespace com.google.cloud.texttospeech (and not from the v1 package).

Speech synthesis with Google Cloud Text-to-Speech

Microsoft Cognitive Services Text to Speech

Microsoft’s Cognitive Services Text to Speech is currently available as a preview. The greatest strength of this service is the degree of localization that it offers.

The 80 voices that are available across 32 languages cover an unparalleled range of European and Asian locales.

At this point, however, there is a clear trade-off between quantity and quality. The output generated with the two voices ZiraRUS and BenjaminRUS received the worst ratings in the experiment: 3.2 for naturalness and 3.33 for pleasantness.

The samples that were generated for the experiment can be accessed through the following playlist:

Samples generated with the Microsoft Cognitive Services Text to Speech voices ZiraRUS and BenjaminRUS

Microsoft’s customization feature creates a unique voice model using studio recordings and associated scripts as training data. This feature is currently in private preview and limited to US English and mainland Chinese.

The free tier covers five million characters per month. In the S1 tier, the price per one million characters synthesized with the default voices is $2. Text-to-speech with custom models is available at a price of $3 per one million characters plus a $20 monthly fee per model.

A console appears to be available only for its precursor, the Bing text-to-speech API.

The service supports version 1.0 of SSML without extensions and limits the input to 1,024 characters per request, a fraction of the length of a news article.

The only official Java library that exists is used for Android development. Interacting with the REST API, however, is a straightforward two-step process. The client first obtains a token by providing the subscription key. This token, which is valid for 10 minutes, is then used to obtain the synthesized speech from the API. Note that voices are specified inside the SSML document using the voice tag.
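The SSML body of that second request might be assembled as follows; the long voice name format follows Microsoft's documentation, while the builder itself is hypothetical:

```java
// Hypothetical builder for the SSML body of a synthesis request.
public class BingSsml {

    public static String build(String locale, String voiceName, String text) {
        return "<speak version='1.0' xml:lang='" + locale + "'>"
             + "<voice xml:lang='" + locale + "' name='" + voiceName + "'>"
             + text
             + "</voice></speak>";
    }
}
```

A call such as BingSsml.build("en-US", "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)", "Hello") produces the request body; the token from the first step goes into the Authorization header.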

Speech synthesis with Microsoft Cognitive Services Text to Speech

Watson Text to Speech

IBM has introduced two interesting SSML extensions for its Watson Text to Speech service: Expressive SSML and Voice Transformation SSML.

The first extension is available for the US English voice Allison and implemented through the express-as element. The tag has a type attribute with three possible self-descriptive settings: GoodNews, Apology and Uncertainty.

Expressive SSML in Watson Text To Speech

One can easily see how Expressive SSML enhances customer support solutions and other applications aiming at life-like conversations.
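An apology, for instance, is produced by wrapping the text in the express-as element. A hypothetical builder for the three documented styles:

```java
// Hypothetical builder for Watson's express-as element; the three type
// values are the ones listed in the documentation.
public class ExpressiveSsml {

    public static String expressAs(String type, String text) {
        return "<express-as type=\"" + type + "\">" + text + "</express-as>";
    }

    public static String goodNews(String text)  { return expressAs("GoodNews", text); }
    public static String apology(String text)   { return expressAs("Apology", text); }
    public static String uncertain(String text) { return expressAs("Uncertainty", text); }
}
```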

While Watson Text to Speech supports only 13 voices across 7 languages out of the box, the second SSML extension enables the creation of new voices.

Going beyond the benefits of a broad range of default voices that are in general use, unique voices can enhance branding efforts through a memorable and differentiated user experience.

Using the voice transformation element, customers can apply built-in transformations or define their own changes to create new voices based on the three existing US English alternatives.

Using the values Young and Soft for the type attribute, the three existing voices can be made to sound more youthful and softer.

To apply custom transformations, the type attribute must be set to Custom. This provides fine-grained control over different aspects of the voice through optional attributes. Adjustable voice characteristics include pitch, rate, timbre, breathiness and glottal tension.
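A custom transformation is expressed as a voice-transformation element whose optional attributes carry the adjustments. A sketch with attribute names from the Watson documentation and arbitrarily chosen values:

```java
// Sketch of a Custom voice transformation; the percentage values are arbitrary.
public class VoiceTransformSsml {

    public static String custom(String text) {
        return "<voice-transformation type=\"Custom\""
             + " pitch=\"-20%\" rate=\"+10%\""
             + " breathiness=\"35%\" glottal_tension=\"-40%\">"
             + text
             + "</voice-transformation>";
    }
}
```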

In the experiments I’ve conducted, Watson Text to Speech performed slightly better than Microsoft’s service, but did not achieve the level of naturalness and pleasantness that Amazon and Google provide.

The names of the voices that have been used in the experiment are Allison and Michael. The generated samples rated by the participants are available through the following playlist:

Samples with the IBM Watson voices Allison and Michael

With the exception of the w tag, all of the SSML elements we’ve mentioned are supported. For languages other than US English, however, the say-as instruction is limited to only two types of interpretations: digits and letters.

Timing information can be obtained for words and markers.

The Lite plan is restricted to 10,000 characters. Under the Standard tier, the synthesis of the first one million characters is free. Subsequent requests are charged at a rate of $0.02 per 1,000 characters, making Watson Text to Speech the most expensive of the four services.
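The comparison can be made concrete with the pay-as-you-go rates quoted in this article, normalized to one million characters and with Google represented by its premium WaveNet tier:

```java
// Pay-as-you-go rates per one million characters, as quoted in this article.
public class PricePerMillion {

    public static double amazonPolly()   { return 4.0; }
    public static double googleWaveNet() { return 16.0; }
    public static double microsoftS1()   { return 2.0; }

    // $0.02 per 1,000 characters, i.e. $20 per one million characters
    public static double watsonStandard() { return 0.02 * 1_000; }
}
```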

A web demo showcases the basic functionality and the SSML extensions.

While the body of a single request can have, at most, 5,000 characters, there is no limit on the number of requests sent per minute.

The Java SDK works seamlessly and intuitively.
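Under the hood, the SDK issues authenticated calls against the synthesize endpoint. A rough, stdlib-only sketch of the request it constructs; the endpoint and parameter names are assumptions based on the v1 API, and the credentials are placeholders:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Stdlib-only sketch of the HTTP request behind a synthesize call.
public class WatsonRequest {

    static final String ENDPOINT =
        "https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize";

    // GET {endpoint}?voice=...&accept=...&text=...
    public static String url(String voice, String accept, String text) {
        return ENDPOINT
             + "?voice="  + URLEncoder.encode(voice,  StandardCharsets.UTF_8)
             + "&accept=" + URLEncoder.encode(accept, StandardCharsets.UTF_8)
             + "&text="   + URLEncoder.encode(text,   StandardCharsets.UTF_8);
    }

    // Value of the Authorization header for basic authentication.
    public static String basicAuth(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```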