WP8 Speech Recognition – Dynamically Generating SRGS grammar files

Recently I released another update to Car Dash; the main feature of this update was to improve the quality of speech recognition in the app’s voice commands for playing music. I did this by switching my Speech Recognition from programmatic list grammars to dynamically generated SRGS grammar files. In this post I wanted to share how and why I did this.

When I first released Car Dash, it supported voice commands that let the user play music from a named album or artist in their Xbox Music library. This was pretty easy to do: I just looped through the phone’s list of albums and artists and added each one to a list, which was then added to the speech recognizer’s grammar:

    private void _loadGrammar()
    {
        var library = new MediaLibrary();
        var artistGrammar = new List<string>();
        var albumGrammar = new List<string>();

        // One phrase per artist and per album in the phone's media library
        foreach (var artist in library.Artists)
            artistGrammar.Add("Play Artist " + artist.Name);

        foreach (var album in library.Albums)
            albumGrammar.Add("Play Album " + album.Name);

        _recognizer.Grammars.AddGrammarFromList("artistCommands", artistGrammar);
        _recognizer.Grammars.AddGrammarFromList("albumCommands", albumGrammar);
    }

This worked pretty well for the initial release of Car Dash, but I started to get a lot of feedback from users who wanted more options for playing their music. So over the last few months I added the ability to play songs from a certain genre or playlist, and eventually even the option to name a specific song to play, all using the programmatic list grammars.

Unfortunately the addition of voice commands for specific individual songs seemed to hurt the accuracy of the speech recognition. I got some complaints via UserVoice and saw a few negative reviews complaining about this. So I knew I needed to do something to improve the quality and accuracy of the Speech Recognition.

I went back to the documentation on Speech Recognition and found that using SRGS grammar files should produce better results. But the documentation provided no method of generating an SRGS file, aside from creating one by hand and including it in your application package. Obviously this wouldn’t work for Car Dash, where I had to include the user’s music collection in the grammar files.

What I decided to do was include a template SRGS file in the application assets that would include all the voice command rules. The vocabulary that the rules depended on would have to be dynamically generated, so I would create a new SRGS file on the fly combining the template and the generated vocabulary. Here’s a sample of the template .grxml file:

    <?xml version="1.0" encoding="utf-8"?>
    <grammar xml:lang="en" root="musicCommand" tag-format="semantics/1.0" version="1.0" xmlns="http://www.w3.org/2001/06/grammar">

      <rule id="artist_Request">
        <item>
          play artist
          <item>
            <ruleref uri="#artist"/>
          </item>
        </item>
      </rule>

      <rule id="album_Request">
        <item>
          play album
          <item>
            <ruleref uri="#album"/>
          </item>
        </item>
      </rule>

      <rule id="song_Request">
        <item>
          play song
          <item>
            <ruleref uri="#song"/>
          </item>
        </item>
      </rule>

      <rule id="genre_Request">
        <item>
          play genre
          <item>
            <ruleref uri="#genre"/>
          </item>
        </item>
      </rule>

      <rule id="musicCommand">
        <one-of>
          <item>
            <ruleref uri="#artist_Request"/>
            <tag> out.artist=rules.latest(); </tag>
          </item>
          <item>
            <ruleref uri="#album_Request"/>
            <tag> out.album=rules.latest(); </tag>
          </item>
          <item>
            <ruleref uri="#song_Request"/>
            <tag> out.song=rules.latest(); </tag>
          </item>
          <item>
            <ruleref uri="#genre_Request"/>
            <tag> out.genre=rules.latest(); </tag>
          </item>
        </one-of>
      </rule>

    </grammar>

And here’s some sample code to build the actual .grxml file by combining the above template with vocabulary from the user’s media library:

    public static async Task<Uri> CreateCommandsGrxml()
    {
        var localFolder = ApplicationData.Current.LocalFolder;
        var assetsFolder = await StorageFolder.GetFolderFromPathAsync("Assets");

        // Open the base file from the assets folder
        StorageFile commandBaseGrxmlFile = await _getBaseCommandFile();
        if (commandBaseGrxmlFile != null)
        {
            var commandBaseGrxml = commandBaseGrxmlFile.Name;
            var commandGrxml = commandBaseGrxml.Replace(".base", string.Empty);
            var commandBaseGrxmlUri = new Uri("ms-appx:///Assets/" + commandBaseGrxml, UriKind.Absolute);
            var commandGrxmlFile = await localFolder.CreateFileAsync(commandGrxml, CreationCollisionOption.ReplaceExisting);

            using (IRandomAccessStream commandBaseGrxmlStream = await commandBaseGrxmlFile.OpenReadAsync())
            {
                XNamespace xmlns = "http://www.w3.org/2001/06/grammar";
                XDocument loadedData = XDocument.Load(commandBaseGrxmlStream.AsStreamForRead());

                _addArtistRule(xmlns, loadedData.Root);
                _addAlbumRule(xmlns, loadedData.Root);
                _addSongRule(xmlns, loadedData.Root);
                _addGenreRule(xmlns, loadedData.Root);
                _addPlaylistRule(xmlns, loadedData.Root);

                using (var commandGrxmlStream = await commandGrxmlFile.OpenAsync(FileAccessMode.ReadWrite))
                {
                    XmlWriterSettings writerSettings = new XmlWriterSettings
                    {
                        Indent = true,
                        NewLineHandling = NewLineHandling.Entitize,
                        NewLineOnAttributes = false
                    };

                    XmlWriter writer = XmlWriter.Create(commandGrxmlStream.AsStreamForWrite(), writerSettings);
                    loadedData.WriteTo(writer);
                    writer.Close();
                }

                return new Uri(string.Format("ms-appdata:///Local/{0}", commandGrxml), UriKind.Absolute);
            }
        }

        return null;
    }

In _addArtistRule() and the related functions I programmatically add the names of artists, albums, etc. Only now, instead of just adding strings to a list, we create an XML node for each one. Here is some more sample code:

    private static void _addArtistRule(XNamespace xmlns, XElement rootElement)
    {
        var artistRule = new XElement(xmlns + "rule", new XAttribute("id", ArtistCommandsRule));
        var artistCollection = new XElement(xmlns + "one-of");

        if ((Library.Artists != null) && (Library.Artists.Count > 0))
        {
            foreach (var artist in Library.Artists)
            {
                if (artist.Songs.Count > 0)
                {
                    var sanitizedArtistName = SanitizeItem(artist.Name);
                    artistCollection.Add(new XElement(xmlns + "item",
                        new XText(sanitizedArtistName),
                        new XElement(xmlns + "tag", new XText(string.Format("out=\"{0}\";", sanitizedArtistName)))));
                }
            }
        }

        // It's important to have at least one element; without this you'll get an exception when you try to use the grammar file
        if (!artistCollection.HasElements)
            artistCollection.Add(new XElement(xmlns + "item",
                new XText("Artist Name"),
                new XElement(xmlns + "tag", new XText("out=\"Artist Name\";"))));

        artistRule.Add(artistCollection);
        rootElement.Add(artistRule);
    }
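To make the result concrete, here is roughly what the generated artist rule would look like in the output .grxml file for a library with two artists. This is only an illustrative sketch: the artist names are made up, and the rule id "artist" is an assumption (the value of the ArtistCommandsRule constant isn't shown above, but the template refers to #artist).

```xml
<rule id="artist">
  <one-of>
    <item>Daft Punk<tag>out="Daft Punk";</tag></item>
    <item>Radiohead<tag>out="Radiohead";</tag></item>
  </one-of>
</rule>
```

Each item pairs the phrase the recognizer listens for with a tag that sets the rule's output to the same name, so the ruleref in the template's artist_Request rule has a value to pass along.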

You’ll notice I’m using a function called SanitizeItem(). I added it shortly after the initial release of this update, after finding that some users had problems with speech recognition if their music library contained a double quote or other invalid characters.

    static bool IsValidXmlString(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            if (text.Contains('"'))
                return false;
            return true;
        }
        catch
        {
            return false;
        }
    }

    static string SanitizeItem(string item)
    {
        if (IsValidXmlString(item))
            return item;
        else
            return XmlConvert.EncodeName(item);
    }
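As an illustration of what this does, take a hypothetical artist name that contains double quotes, such as "Weird Al" Yankovic. Left as-is, the quotes in the name would end the string inside the tag early. XmlConvert.EncodeName replaces every character that isn't valid in an XML name, spaces included, with an _xHHHH_ escape, so the generated item for that name would look roughly like this:

```xml
<item>_x0022_Weird_x0020_Al_x0022__x0020_Yankovic<tag>out="_x0022_Weird_x0020_Al_x0022__x0020_Yankovic";</tag></item>
```

Note that the same encoded string is used for both the item text and the tag value. Names that pass IsValidXmlString are written unchanged.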

And that’s it! Since switching from the programmatic lists to SRGS, the accuracy of the speech commands has drastically improved, and I’ve gotten feedback from users who appreciated the improvements. Please let me know what you think of this post. I may refactor this code and try to create an SRGS builder library to put up on GitHub for the community if there is enough interest.