TL;DR

In this post, I’ll walk through the basics of using PowerShell to interact with the Google Cloud Text-to-Speech API. It is partly a documentation exercise and partly the guide I would have liked to read when I started my Elite Dangerous Google Cloud Text-to-Speech project.

We’ll start with a basic script that produces an audio file and explain how it works. Then I’ll walk through some parameters that affect the response. Lastly, I’ll show how SSML can add variation to the response.

The bare minimum

This post assumes you’ve followed the steps in this Google article to correctly set up a project and configure the Cloud SDK on your machine.

To get started with PowerShell, let’s look at the minimum script that would produce an audio file:

    #Common auth, URL and headers
    $gauth = gcloud auth print-access-token
    $headers = @{}
    $headers.add("Authorization", "Bearer $gauth")
    $target = "https://texttospeech.googleapis.com/v1/text:synthesize"

    #Variables
    $text = "Hey there! You're using plain text for this synthesis"
    $languageCode = 'en'
    $body = @{
        input = @{ text = $text }
        voice = @{ languageCode = $languageCode }
        audioConfig = @{ audioEncoding = 'MP3' }
    }

    #Build JSON body for the request
    $jbody = ConvertTo-Json ($body)

    #Try conversion request
    try {
        $response = Invoke-RestMethod -ContentType 'application/json' -headers $headers -Uri $target -Method Post -body $jbody

        #Extract the base64 encoded response
        $base64Audio = $response.audioContent

        #Produce output file (HH is the 24-hour clock, so AM and PM runs get distinct names)
        $base64Audio | Out-File -FilePath "./google.txt" -Encoding ascii -Force
        $convertedFileName = 'GTTS-Plain-{0}.mpga' -f (get-date -f yyyy-MM-dd-HH-mm-ss)
        certutil -decode google.txt $convertedFileName
    }
    catch {
        Write-Host "StatusCode:" $_.Exception.Response.StatusCode.value__
        Write-Host "StatusDescription:" $_.Exception.Response.StatusDescription
    }
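The script above decodes the response by writing it to a text file and calling certutil, which is a Windows tool. As an alternative, here is a minimal sketch that decodes the base64 string in-process with .NET calls. The function name Save-Base64Audio and the output file name are my own choices for illustration, not part of the original script.

```powershell
# Hypothetical helper: decode a base64 string and write the raw bytes to disk.
function Save-Base64Audio {
    param(
        [Parameter(Mandatory)] [string] $Base64Audio,
        [Parameter(Mandatory)] [string] $Path
    )
    $bytes = [System.Convert]::FromBase64String($Base64Audio)

    # .NET resolves relative paths against the process directory rather than
    # the current PowerShell location, so expand $Path to a full path first.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)

    [System.IO.File]::WriteAllBytes($fullPath, $bytes)
    $fullPath   # return the path that was written
}

# Usage with the response object from the script above:
# Save-Base64Audio -Base64Audio $response.audioContent -Path './GTTS-Plain.mp3'
```

This avoids the intermediate google.txt file and should also work on PowerShell installs where certutil is not available.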

Breaking it down

Most of the code is structural: it builds the authentication header, the target URL and the file output. The #Variables section is where our changes will go. According to the request documentation, a request only needs to supply the following information:

Text or SSML field for synthesis input

Language code

Audio encoding
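As a rough sketch of how those three pieces map onto the JSON request body, the snippet below builds the hash table and prints the JSON that would be sent. The sample text is a placeholder of mine. The explicit -Depth is a precaution: ConvertTo-Json only serialises two levels of nesting by default, which is just enough for this body but easy to exceed once it grows.

```powershell
# Minimal request body: synthesis input, language code, audio encoding.
$body = @{
    input       = @{ text = 'Hello' }            # or: @{ ssml = '<speak>...</speak>' }
    voice       = @{ languageCode = 'en' }
    audioConfig = @{ audioEncoding = 'MP3' }
}

# Serialise to JSON; -Depth guards against truncation of nested values.
$json = ConvertTo-Json -InputObject $body -Depth 5
$json
```

Printing the JSON before sending it is a cheap way to check that the structure matches what the API documentation describes.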

You might notice I’ve picked en as the language code. This invokes a specific behaviour:

Note that the TTS service may choose a voice with a slightly different language code than the one selected; it may substitute a different region (e.g. using en-US rather than en-CA if there isn’t a Canadian voice available), or even a different language, e.g. using “nb” (Norwegian Bokmal) instead of “no” (Norwegian).

Therefore, with en we can only be sure of getting an English result back, with no certainty about which region is selected. We can use en-GB to prefer a British voice, with the understanding that a voice from another region could still be selected, perhaps due to capacity within the system. Here are some plain text examples with only the region specified:

| languageCode | Audio |
|---|---|
| EN-GB | [audio sample] |
| EN-US | [audio sample] |

Specifying gender

We can request a preferred gender for the voice. The documentation notes that if the preferred gender is not available, another may be picked rather than the request failing. The relevant PowerShell code changes as follows:

    $body = @{
        input = @{ text = $text }
        voice = @{
            languageCode = $languageCode
            ssmlGender = 'FEMALE'
        }
        audioConfig = @{ audioEncoding = 'MP3' }
    }

| languageCode | ssmlGender | Audio |
|---|---|---|
| EN-GB | female | [audio sample] |
| EN-GB | male | [audio sample] |

Voice selection

If you do not specify a voice in the JSON request then one is picked for you based on the indicated language code. If you want to specify a voice you can choose from a list of supported voices. Below is an example to indicate a voice name preference:

    #Variables
    $languageCode = 'en-GB'
    $voicename = 'en-GB-Standard-D'
    $text = "Hey there! You're using plain text for this synthesis and selecting voice $voicename."
    $body = @{
        input = @{ text = $text }
        voice = @{
            languageCode = $languageCode
            name = $voicename
        }
        audioConfig = @{ audioEncoding = 'MP3' }
    }

| languageCode | voicename | Audio |
|---|---|---|
| EN-GB | en-GB-Standard-A | [audio sample] |
| EN-GB | en-GB-Standard-B | [audio sample] |
| EN-GB | en-GB-Standard-C | [audio sample] |
| EN-GB | en-GB-Standard-D | [audio sample] |

For completeness, here is a similar set of examples with the premium WaveNet option:

$text = "Hey there! You're using plain text for this synthesis and selecting premium voice $voicename."

| languageCode | voicename | Audio |
|---|---|---|
| EN-GB | en-AU-Wavenet-A | [audio sample] |
| EN-GB | en-AU-Wavenet-B | [audio sample] |
| EN-GB | en-AU-Wavenet-C | [audio sample] |
| EN-GB | en-AU-Wavenet-D | [audio sample] |
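The list of supported voices can also be fetched from the API itself, through the v1/voices endpoint with a languageCode query parameter. The following is a sketch only: the function name Get-TtsVoice and its parameters are hypothetical, and it assumes the same gcloud-based token as the main script.

```powershell
# Hypothetical helper: return the voices the API reports for one language code.
function Get-TtsVoice {
    param(
        [string] $LanguageCode = 'en-GB',
        # Falls back to the gcloud token used elsewhere in this post.
        [string] $AccessToken = (gcloud auth print-access-token)
    )
    $headers = @{ Authorization = "Bearer $AccessToken" }
    $uri = "https://texttospeech.googleapis.com/v1/voices?languageCode=$LanguageCode"

    # The response carries a 'voices' array; each entry has a name and ssmlGender.
    $response = Invoke-RestMethod -Headers $headers -Uri $uri -Method Get
    $response.voices | Sort-Object -Property name
}

# Usage:
# Get-TtsVoice -LanguageCode 'en-GB' | Select-Object name, ssmlGender
```

Checking the list first makes it easier to pick a name that matches the language code being requested.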

Changing the audio configuration

A number of options are available for changing the resulting audio. Here are three key ones:

Rate of speech

Speaking pitch

Gain

We can extend our hash table to include these options. The values below are the same as the current defaults:

    audioConfig = @{
        audioEncoding = 'MP3'
        speakingRate = 1
        pitch = 0
        volumeGainDb = 0
    }

Here are some examples with the modification of those values with the following text:

$text = "The quick brown fox jumps over the lazy dog."

| Audio config | Audio |
|---|---|
| speakingRate = 1, pitch = 0, volumeGainDb = 0 | [audio sample] |
| speakingRate = 1.5, pitch = 0, volumeGainDb = 0 | [audio sample] |
| speakingRate = 1, pitch = 0, volumeGainDb = 5 | [audio sample] |
| speakingRate = 1, pitch = 10, volumeGainDb = 0 | [audio sample] |
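One way to produce the four combinations listed above without editing the script each time is to loop over a list of settings and build one request body per entry. This is a sketch under assumptions: the en-GB language code is my choice for illustration, and each JSON string would still need to be posted as in the main script.

```powershell
$text = 'The quick brown fox jumps over the lazy dog.'

# One hash table per variation of the audio configuration.
$variations = @(
    @{ speakingRate = 1;   pitch = 0;  volumeGainDb = 0 }
    @{ speakingRate = 1.5; pitch = 0;  volumeGainDb = 0 }
    @{ speakingRate = 1;   pitch = 0;  volumeGainDb = 5 }
    @{ speakingRate = 1;   pitch = 10; volumeGainDb = 0 }
)

# Build a JSON request body for each variation.
$requests = foreach ($v in $variations) {
    # Adding two hash tables merges their keys into a new one.
    $audioConfig = @{ audioEncoding = 'MP3' } + $v
    $body = @{
        input       = @{ text = $text }
        voice       = @{ languageCode = 'en-GB' }   # assumed for this sketch
        audioConfig = $audioConfig
    }
    ConvertTo-Json -InputObject $body -Depth 5
}
```

After the loop, $requests holds four JSON strings, one per row, ready to be passed to Invoke-RestMethod as the -Body value.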

Using Speech Synthesis Markup Language (SSML)

SSML is a mark-up language that lets you nuance how text is spoken with a variety of tools. Google documents the elements it supports in a clear fashion. Modifying the code to use SSML is as simple as:

    #Variables
    $languageCode = 'en-GB'
    $voicename = 'en-GB-Wavenet-A'
    $text = "<speak>The <say-as interpret-as=`"characters`">quick</say-as> brown fox jumps over the lazy dog.</speak>"
    $body = @{
        input = @{ ssml = $text }
        voice = @{
            languageCode = $languageCode
            name = $voicename
        }
        audioConfig = @{ audioEncoding = 'MP3' }
    }

The modifications are to specify ssml = $text and for $text to contain a valid SSML string. Note the escape sequence `" (a backtick before the quote), which lets us use " inside a double-quoted PowerShell string.

| SSML say-as effect | Audio |
|---|---|
| none | [audio sample] |
| interpret-as="characters" | [audio sample] |
| interpret-as="expletive" | [audio sample] |
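If the backtick escaping gets hard to read, one alternative is a single-quoted template filled in with the -f format operator. The sketch below also XML-escapes the text, since characters such as & or < would otherwise make the SSML invalid. The function name New-SayAsSsml and its parameters are hypothetical, chosen only for this illustration.

```powershell
# Hypothetical helper: wrap one word in <say-as> inside a <speak> document.
function New-SayAsSsml {
    param(
        [Parameter(Mandatory)] [string] $Before,
        [Parameter(Mandatory)] [string] $Word,
        [Parameter(Mandatory)] [string] $After,
        [string] $InterpretAs = 'characters'
    )
    # SecurityElement.Escape replaces <, >, &, " and ' with XML entities.
    $parts = $Before, $InterpretAs, $Word, $After |
        ForEach-Object { [System.Security.SecurityElement]::Escape($_) }

    # Single quotes mean the double quotes in the template need no escaping.
    '<speak>{0}<say-as interpret-as="{1}">{2}</say-as>{3}</speak>' -f $parts
}

$ssml = New-SayAsSsml -Before 'The ' -Word 'quick' -After ' brown fox & friends.'
$ssml
# <speak>The <say-as interpret-as="characters">quick</say-as> brown fox &amp; friends.</speak>
```

The resulting string can be assigned to the ssml field of the input hash table exactly as in the script above.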

In conclusion

Part of the novelty in using PowerShell for this project was the lack of direct documentation or examples. On the flip side, there were numerous examples in other languages (Ruby/Python/PHP/Node.js/Java/Go/C# and curl). Those examples helped me find analogues in PowerShell.

The ability to select voices and nuance speech is relevant to adding variation to my Elite Dangerous project and to working on one of its documented limitations.

Acknowledgements

I’ve a standing acknowledgement to add due to the use of PowerShell and hash tables. Thanks, Kevin Marquette, for your ever excellent “Everything you wanted to know about hashtables”.

Related posts

Elite Dangerous Google Cloud Text-to-Speech project