In-Depth

Say What? Incorporating Windows Phone 8 Speech Recognition into Your Apps

The Windows Phone 8 SDK added a speech recognition API that's easy to use and flexible. Learn how to put it to work in your application.

I think it's safe to say that most mobile developers skip a platform's first generation in case it fails to take off. Now that Windows Phone has reached its third generation with Windows Phone 8, more and more developers are taking note. A new OS also brings new hardware that can take advantage of several key features in the API. Windows Phone 8 includes all sorts of new features, such as near-field communication, native code support, in-app purchasing, speech recognition and more.

In this article, I'll focus solely on the speech APIs introduced in Windows Phone 8. Most speech-enabled applications use two components, Text-to-Speech (TTS) and Speech-to-Text (STT), and I'll show you how to work with both.

Here's what you'll need to get started:

The Windows Phone SDK 8.0 provides everything you need to develop apps and games for Windows Phone 8 and Windows Phone 7.5. After downloading it, take a moment to read over the system requirements: you'll need to be running Windows 8 and Visual Studio 2012 to develop apps for Windows Phone 8.

I'd also recommend reviewing the Windows Phone 8 app samples to learn how to work with the platform through code. The samples are available in C#, Visual Basic .NET and C++.

Begin by launching Visual Studio 2012 from the Windows 8 Start screen, choosing File | New | Project and selecting Installed | Templates | Visual C# | Windows Phone, as shown in Figure 1.

Figure 1. The default Windows Phone templates.

Note that there are many types of templates for Windows Phone apps. The only real difference from previous versions of the SDK is the addition of two templates: Windows Phone XAML and Direct3D App, and Windows Phone HTML5 App. The first is for building managed XAML apps that use native Direct3D components; the second simply hosts a WebBrowser control inside a XAML page (note that this template has no relation to WinJS for Windows Store apps).

Choose the Windows Phone App template, name the project VSMagSR and click OK. When prompted for the target OS, select Windows Phone OS 8.0 and click OK again. After the project is created, expand Properties in Solution Explorer and double-click WMAppManifest.xml. If you've built Windows Store apps, some of the options on this page will look familiar. Open the Capabilities tab and select ID_CAP_SPEECH_RECOGNITION and ID_CAP_MICROPHONE, as shown in Figure 2.

Figure 2. Required capabilities for speech recognition.

Save the file. Don't worry about the other capabilities that are already selected; the template turns them on automatically because Microsoft expects most Windows Phone 8 apps to use them.

After adding the necessary capabilities, open MainPage.xaml and replace the LayoutRoot Grid with the markup in Listing 1, which adds two buttons.

Listing 1. Adding speech recognition buttons.



<Grid x:Name="LayoutRoot" Background="Transparent">
    <Grid.RowDefinitions>
        <RowDefinition Height="Auto"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <!--TitlePanel contains the name of the application and page title-->
    <StackPanel x:Name="TitlePanel" Grid.Row="0" Margin="12,17,0,28">
        <TextBlock Text="VS Mag" Style="{StaticResource PhoneTextNormalStyle}" Margin="12,0"/>
        <TextBlock Text="speech" Margin="9,-7,0,0" Style="{StaticResource PhoneTextTitle1Style}"/>
    </StackPanel>
    <!--ContentPanel - place additional content here-->
    <StackPanel x:Name="ContentPanel" Grid.Row="1" Margin="12,0,12,0">
        <Button x:Name="TextToSpeech" Click="TextToSpeech_Click" Content="Text to Speech" />
        <Button x:Name="SpeechToText" Click="SpeechToText_Click" Content="Speech to Text" />
    </StackPanel>
</Grid>

This provides the necessary UI to demonstrate TTS and STT.

Text-to-Speech

TTS, also known as speech synthesis, is simpler to implement than STT: it just reads text aloud to the user.

Switch over to MainPage.xaml.cs and add the following event handler to work with the first button:

// Requires "using Windows.Phone.Speech.Synthesis;" at the top of MainPage.xaml.cs.
public async void TextToSpeech_Click(object sender, RoutedEventArgs e)
{
    SpeechSynthesizer synth = new SpeechSynthesizer();
    await synth.SpeakTextAsync("You are reading Visual Studio Magazine!");
}

As you can see, it takes only two lines of code to make the phone speak to you. Because SpeechSynthesizer works asynchronously, you'll need to use the async and await keywords. Run the app in the Windows Phone 8 emulator or on a device, tap Text to Speech, and you should hear the phone say, "You are reading Visual Studio Magazine!"

The default spoken language and speaker gender depend on the phone's speech settings, found under Settings | speech. You can change either one pretty easily:

// Requires "using System.Linq;" and "using Windows.Phone.Speech.Synthesis;".
SpeechSynthesizer synth = new SpeechSynthesizer();
var frenchVoice = InstalledVoices.All
    .Where(voice => voice.Language.Equals("fr-FR") && voice.Gender == VoiceGender.Female)
    .FirstOrDefault();
synth.SetVoice(frenchVoice);
await synth.SpeakTextAsync("Salut tout le monde!");

This sample again uses the SpeechSynthesizer class, but a simple LINQ query over InstalledVoices.All picks a female French (fr-FR) voice. SpeechSynthesizer exposes a SetVoice method, to which you pass that voice, frenchVoice. Once the voice is set, you simply call SpeakTextAsync to say, "Hi, everyone," in French.

The only concern is that the requested voice might not be installed. In that case FirstOrDefault returns null, so check whether frenchVoice is null before calling SetVoice and SpeakTextAsync.
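If you'd rather fall back to another voice than skip speaking altogether, a small helper can do the choosing. The sketch below is plain C# under stated assumptions: VoicePicker and Pick are hypothetical names, and the helper is generic so it doesn't depend on the phone SDK. It prefers an exact language match, then any voice that shares the base language (fr-FR when fr-CA is requested, for example), then a fallback you supply.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the Windows Phone SDK.
public static class VoicePicker
{
    // Returns the first voice whose language matches `wanted` exactly,
    // otherwise the first voice with the same base language (the part
    // before '-'), otherwise `fallback`.
    public static T Pick<T>(IEnumerable<T> voices, Func<T, string> languageOf,
                            string wanted, T fallback) where T : class
    {
        List<T> list = voices.ToList();

        T exact = list.FirstOrDefault(v =>
            string.Equals(languageOf(v), wanted, StringComparison.OrdinalIgnoreCase));
        if (exact != null)
        {
            return exact;
        }

        string baseLanguage = wanted.Split('-')[0];
        T sameBase = list.FirstOrDefault(v =>
            string.Equals(languageOf(v).Split('-')[0], baseLanguage,
                          StringComparison.OrdinalIgnoreCase));

        // No voice for this language at all: use the caller's fallback.
        return sameBase ?? fallback;
    }
}
```

On the phone, you would presumably pass InstalledVoices.All (filtered with Where on Gender if that matters to you), v => v.Language, the language tag you want and InstalledVoices.Default as the fallback, then hand the result to SetVoice. Passing the language selector as a function keeps the helper testable without the SDK.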

You can also use Speech Synthesis Markup Language (SSML) version 1.0 to control how text is read back, supplying the markup either from code or from an SSML document. Here's how to build a simple SSML prompt in code:

SpeechSynthesizer synth = new SpeechSynthesizer();
string ssmlPrompt = "<speak version=\"1.0\" ";
ssmlPrompt += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
ssmlPrompt += " You are reading Visual Studio Magazine! </speak>";
await synth.SpeakSsmlAsync(ssmlPrompt);
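Concatenating markup by hand gets fragile once the prompt contains user-supplied text, because characters such as & and < must be escaped. As an alternative sketch, LINQ to XML (System.Xml.Linq) can build the same kind of prompt and handle the escaping; SsmlPrompt and Build are hypothetical names. The break and prosody elements come from the W3C SSML 1.0 specification; check the Windows Phone documentation for which elements the phone's synthesizer honors.

```csharp
using System.Xml.Linq;

// Hypothetical helper for building SSML prompts with LINQ to XML.
public static class SsmlPrompt
{
    // SSML elements live in the W3C synthesis namespace.
    static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // Builds a <speak> document that says `intro`, pauses for half a
    // second, then says `slowText` at a slow speaking rate.
    public static string Build(string language, string intro, string slowText)
    {
        XElement speak = new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", language),
            new XText(intro),
            new XElement(Ssml + "break", new XAttribute("time", "500ms")),
            new XElement(Ssml + "prosody", new XAttribute("rate", "slow"), slowText));

        // Serialization escapes characters such as & and < in the text
        // and emits the xmlns declaration for the SSML namespace.
        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string would then be passed to SpeakSsmlAsync in place of ssmlPrompt.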

Because TTS is built into the OS, you don't need an Internet connection to play back text.

Speech-to-Text

STT is more commonly called speech recognition. The simplest implementation takes only a few lines of code. Head back to MainPage.xaml.cs, add the following event handler, run the application and tap the Speech to Text button:

// Requires "using Windows.Phone.Speech.Recognition;" at the top of MainPage.xaml.cs.
public async void SpeechToText_Click(object sender, RoutedEventArgs e)
{
    SpeechRecognizerUI speechRecognition = new SpeechRecognizerUI();
    SpeechRecognitionUIResult recoResult = await speechRecognition.RecognizeWithUIAsync();
    if (recoResult.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
    {
        MessageBox.Show(string.Format("You said {0}.", recoResult.RecognitionResult.Text));
    }
}

As before, the code uses the async and await keywords, but this time two new classes, SpeechRecognizerUI and SpeechRecognitionUIResult, translate spoken words into text.

The first time you use speech recognition, the phone asks for permission to send your voice data to Microsoft's speech recognition service. If you accept, a screen says "Listening …" After you speak a few words, it says "Heard you say …" and reads back the recognized text. Finally, after checking that ResultStatus is Succeeded, the code displays a MessageBox that should contain the spoken text.

You can customize some parts of the UI, such as the ListenText and ExampleText settings, but for the most part the UI is defined by Microsoft. You can, however, bypass the UI entirely by using the SpeechRecognizer class directly. If you do, make sure you add exception handlers around the recognition call in case something goes wrong.

You can also restrict speech recognition to a fixed set of words. This is extremely useful if, for example, your driving game accepts only "left" and "right" commands and the user doesn't know what to say. You can limit the acceptable words by adding the code from Listing 2; Figure 3 shows the resulting UI.

Figure 3. The UI informs the user of available choices.

Listing 2. Limiting the words speech recognition will identify.



// Run inside an async method, such as the SpeechToText_Click handler.
SpeechRecognizerUI speechRecognizer = new SpeechRecognizerUI();
speechRecognizer.Settings.ListenText = "Go Left or Right?";
speechRecognizer.Settings.ExampleText = "Examples you can use are: left, right";
speechRecognizer.Settings.ReadoutEnabled = true;
speechRecognizer.Settings.ShowConfirmation = true;
// Restrict recognition to the two words in a list grammar named "answer".
speechRecognizer.Recognizer.Grammars.AddGrammarFromList("answer", new string[] { "left", "right" });
SpeechRecognitionUIResult result = await speechRecognizer.RecognizeWithUIAsync();
if (result.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
{
    MessageBox.Show(result.RecognitionResult.Text);
}

If the user says a word from the list, the UI confirms the selection, as shown in Figure 4.

Figure 4. Confirming the speech.

Spoken words not in the grammar list result in a "Sorry, didn't catch that" dialog.
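Once recognition succeeds, a game still has to turn the recognized text into an action. The following sketch maps the text from Listing 2 onto a hypothetical SteerCommand enum; SteerCommand and CommandParser are illustrative names, not part of the SDK. The list grammar should only return "left" or "right", but trimming and a case-insensitive comparison keep the mapping tolerant of capitalization or trailing punctuation.

```csharp
using System;

// Hypothetical game commands matching the "answer" grammar in Listing 2.
public enum SteerCommand { None, Left, Right }

public static class CommandParser
{
    // Maps recognized text to a command; anything unexpected maps to None.
    public static SteerCommand Parse(string recognized)
    {
        if (string.IsNullOrWhiteSpace(recognized))
        {
            return SteerCommand.None;
        }

        // Strip surrounding whitespace and trailing sentence punctuation.
        string word = recognized.Trim().TrimEnd('.', '!', '?');

        if (word.Equals("left", StringComparison.OrdinalIgnoreCase))
        {
            return SteerCommand.Left;
        }
        if (word.Equals("right", StringComparison.OrdinalIgnoreCase))
        {
            return SteerCommand.Right;
        }
        return SteerCommand.None;
    }
}
```

In the code from Listing 2, you could call CommandParser.Parse(result.RecognitionResult.Text) instead of showing a MessageBox, and treat SteerCommand.None as a cue to prompt the user again.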

Rich API

One topic outside the scope of this article is voice commands, which let a user press and hold the phone's Start button to launch an application by voice, or even navigate to a specific page in your app. It's a very powerful feature, and it's fully documented on MSDN.

Microsoft provides a rich speech API that enables STT and TTS with only a few lines of code, while still giving you the flexibility to customize different parts of the API. As more Windows Phone 8 devices hit the market, more developers will start using this technology. You now have a head start, so go ahead and start adding speech recognition to your Windows Phone 8 apps.