Summary — With the advancements in the field of Natural Language Processing, adding voice control to applications has never been easier. This post will show how to add a voice interface which transcribes a user’s speech, understands their intent and responds with spoken feedback.

There’s something inherently futuristic about voice assistants. Until recently, we would see these interfaces predominantly in sci-fi movies, the classic example being HAL 9000 in 2001: A Space Odyssey. But with recent advances in Natural Language Processing and Machine Learning research, voice interfaces have become a lot more common in our everyday lives. Voice assistants such as Siri, Google Assistant or Amazon Alexa best showcase how voice interfaces provide useful functionality in a natural, convenient way.

HAL 9000 from 2001: A Space Odyssey (source)

Although the technology has long been available for developers to use, voice interfaces have been generally lacking in most mobile and web apps. This could be due to a number of reasons: it’s possible the technology is still in its early stages, there isn’t the motivation for adding it, or there is a perceived difficulty in implementing it. In this blog post, I will aim to tackle the last point and show how simple it is to integrate a voice control interface into any project.

You may be asking yourself “Why add a voice interface when I already have a user interface?” A voice-controlled application has several benefits:

It allows your application to be controlled in a hands-free way without having to look at your device. This is most beneficial in situations such as cooking, when the user’s hands are occupied and their visual focus is needed.

It can improve accessibility and provide a better experience for visually impaired users.

In my opinion, having a voice interface is just plain bad-ass, and brings us closer to a future where we interact with our technology like it’s Jarvis (the talking AI in Iron Man’s suit).

In the examples, I will use modern JavaScript (ES2015), which is transpiled with Babel to a version compatible with older browsers. I will aim to express the ideas in a platform-independent manner.

Starting point

Let’s begin with a simple application which generates jokes on certain topics. The implementation is pretty simple. Jokes are stored by their topic and when they are generated, we randomly choose a joke from a given topic. If a topic isn’t supplied, a random topic is used instead. To make it a bit more interesting, we also add functionality for retrieving the current time. You can see it implemented here:
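In outline, the logic might look like the sketch below. The joke data and the names JOKES, pickRandom, generateJoke and getTime are illustrative assumptions rather than the demo's actual code. The random source is a parameter so the behaviour can be checked deterministically.

```javascript
// Hypothetical joke data, keyed by topic.
const JOKES = {
  dad: [
    'I used to hate facial hair, but then it grew on me.',
    'Why did the scarecrow win an award? He was outstanding in his field.',
  ],
  programming: [
    'There are 10 kinds of people: those who understand binary and those who do not.',
  ],
};

// Pick a random element of `items`; `random` returns a number in [0, 1).
const pickRandom = (items, random = Math.random) =>
  items[Math.floor(random() * items.length)];

// Return a joke for `topic`, falling back to a random topic when the topic
// is missing or unknown.
function generateJoke(topic, random = Math.random) {
  const topics = Object.keys(JOKES);
  const chosenTopic = topics.includes(topic) ? topic : pickRandom(topics, random);
  return { topic: chosenTopic, joke: pickRandom(JOKES[chosenTopic], random) };
}

// Return the current time as a locale-formatted string.
const getTime = (now = new Date()) => now.toLocaleTimeString();
```

Calling generateJoke('dad') returns an object holding the topic and one of the dad jokes, while generateJoke() picks the topic at random first.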

Currently, the user interface is kind of boring and there isn’t a way to specify topics. Rather than improving the user interface, let’s go ahead with the less orthodox option of adding a voice interface!

Transcribing the user’s voice commands

The first step is capturing what a user says when they interact with our application. It would be useful to get this in a string format, which can then be processed programmatically to understand the intent. Attempting to manually implement this functionality is futile, due to all the complexity from processing audio signals and extracting the individual words said.

Fortunately, many environments provide native Speech to Text functionality. For web apps, there is the SpeechRecognition interface of the experimental Web Speech API. Although browser support is currently limited, it is likely to improve in the future.

Similarly, iOS applications can make use of the SFSpeechRecognizer class and Android developers can utilise SpeechRecognizer. In environments without this functionality, the Bing Speech API can be used to extract text from uploaded audio files, at the cost of increased network bandwidth.

Let’s add this transcription functionality to our joke generator. As it’s a web app, we’ll utilise the speech recognition API, which is fairly straightforward to set up:
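A minimal sketch of that setup follows. The function name createRecognizer, its callbacks and the en-GB locale are assumptions made for illustration. The SpeechRecognition constructor is passed in as a parameter so the same code can be exercised with a stub outside the browser.

```javascript
// Create a one-shot recognizer. `Recognition` is the SpeechRecognition
// constructor; `onTranscript` receives the recognised text and its confidence.
function createRecognizer(Recognition, onTranscript, onError = console.error) {
  const recognition = new Recognition();
  recognition.lang = 'en-GB';          // assumed locale
  recognition.continuous = false;      // end the session after a single phrase
  recognition.interimResults = false;  // only report final results

  recognition.onresult = (event) => {
    // The first alternative of the first result is the most likely transcript.
    const { transcript, confidence } = event.results[0][0];
    onTranscript(transcript.trim(), confidence);
  };
  recognition.onerror = (event) => onError(event.error);

  return recognition;
}

// Browser usage (Chrome exposes the constructor with a webkit prefix):
// const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
// const recognizer = createRecognizer(Recognition, (text) => console.log(text));
// button.addEventListener('click', () => recognizer.start());
```

Starting the recognizer from a click handler is a deliberate choice here, since browsers ask for microphone permission and the prompt is less surprising after a user action.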

Note that the speech recognition purposely isn’t continuous to ensure compatibility between the different demos, as the Web Speech API doesn’t work across multiple tabs and iframes simultaneously. To enable continuous speech recognition, simply start the recognizer again from the event listener that is triggered when the session finishes.
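As a rough sketch of that restart approach, assuming a hypothetical helper named keepListening that accepts any object with start(), abort() and an onend slot (such as a SpeechRecognition instance):

```javascript
// Restart the recognizer every time a session ends, until stop() is called.
function keepListening(recognition) {
  let active = true;

  recognition.onend = () => {
    // The end event fires after every session, so start a new one while active.
    if (active) recognition.start();
  };
  recognition.start();

  return {
    stop() {
      active = false;       // prevent the onend handler from restarting
      recognition.abort();  // end the current session immediately
    },
  };
}
```

The active flag matters because aborting a session also triggers the end handler, which would otherwise restart recognition straight away.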

Understanding the intent

Now we arrive at the part of finding out what the user wants from their transcription. There are multiple ways of approaching this.

We need a function which can determine the intent and other parameters

With very simple applications, which have a limited set of intents and parameters, we can use pattern matching with regex to find keywords within the user’s command. Due to the flexibility in phrasing (think of the number of ways you can say a command), it may be useful to look for synonyms of these keywords. It’s also helpful to lemmatise each word, which removes all inflections and converts it into its base form (e.g. “running” -> “run”), thus making it easier to find certain keywords.
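A minimal sketch of this keyword approach is shown below. The keyword tables and function names are hypothetical, and the stem function is only a crude suffix stripper standing in for a real lemmatiser: it is applied to both the keywords and the spoken words so that the two line up, but it will mangle many English words.

```javascript
// Hypothetical keyword tables: each label lists its keywords and synonyms.
const INTENT_KEYWORDS = {
  joke: ['joke', 'laugh', 'funny'],
  time: ['time', 'clock'],
};
const TOPIC_KEYWORDS = {
  dad: ['dad', 'father'],
  programming: ['programming', 'code', 'computer'],
};

// Crude stand-in for lemmatisation: strip one common suffix.
const stem = (word) => word.replace(/(ing|es|e|s)$/, '');

// Build one regex per label that matches any of its stemmed keywords as a whole word.
const compile = (table) =>
  Object.entries(table).map(([label, keywords]) => ({
    label,
    pattern: new RegExp(`\\b(${keywords.map(stem).join('|')})\\b`),
  }));

const INTENT_PATTERNS = compile(INTENT_KEYWORDS);
const TOPIC_PATTERNS = compile(TOPIC_KEYWORDS);

// Return the first label whose pattern matches, or null.
const findLabel = (patterns, text) => {
  const match = patterns.find(({ pattern }) => pattern.test(text));
  return match ? match.label : null;
};

// Normalise the transcript, then look for an intent and a topic.
function parseCommand(transcript) {
  const words = transcript.toLowerCase().match(/[a-z]+/g) || [];
  const normalised = words.map(stem).join(' ');
  return {
    intent: findLabel(INTENT_PATTERNS, normalised),
    topic: findLabel(TOPIC_PATTERNS, normalised),
  };
}
```

With these tables, parseCommand('Any jokes about coding?') yields the joke intent with the programming topic, because “jokes” and “coding” reduce to the same stems as “joke” and “code”.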

With more complex applications, where there are more intents and parameters that are harder to distinguish, a sequence-based machine learning classifier such as Mallet can be employed. However, this requires substantial effort to set up and to prepare the dataset.

To achieve machine learning level performance with little effort, we can use wit.ai, a free (at the time of this writing) API which can determine the user’s intent and extract corresponding parameters. It works by teaching the model with examples of user commands and defining the possible intents and parameters. The more examples it is trained on, the more accurate it becomes. Alternatives include LUIS and the Alexa API.

This service provides the best balance of performance and ease of use, so we’ll use it for our joke generator. After registering and creating a new app, the “understand” screen will show up where we can train the model from examples.

wit.ai is trained to recognise the intent for generating jokes

Here we can type examples of user commands and teach the model to recognise entities. The most important entity is the intent, which determines which functionality is triggered.

Other entities can also be recognised as parameters. For specific types of parameters (e.g. locations, dates, emails) we can use the wit entities, which are pre-trained to recognise these types.

In our simple application, we just want to recognise the topic of the joke. For this, we can specify our entity to be keyword based, and supply a list of our topics with additional synonyms to allow for flexibility.

Here we specify a “topic” parameter for the joke generator

Once the model is successfully trained to recognise the intents and other entities, we can make requests to the wit.ai API with our user transcript. This will return a list of entities, each with a confidence value representing the certainty that the right entity was extracted.
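A minimal sketch of such a request, assuming the message endpoint at https://api.wit.ai/message with the transcript in the q parameter, a version date in v and the app's access token sent as a Bearer token. The function names and the version value are placeholders chosen for illustration.

```javascript
const WIT_ENDPOINT = 'https://api.wit.ai/message';

// Build the URL and fetch options for a transcript.
// `token` is the app's access token; `version` is an example version date.
function buildWitRequest(transcript, token, version = '20170307') {
  const params = new URLSearchParams({ v: version, q: transcript });
  return {
    url: `${WIT_ENDPOINT}?${params}`,
    options: { headers: { Authorization: `Bearer ${token}` } },
  };
}

// Send the transcript and resolve with the parsed JSON body.
// `fetchImpl` defaults to the global fetch and can be swapped for a stub.
async function queryWit(transcript, token, fetchImpl = fetch) {
  const { url, options } = buildWitRequest(transcript, token);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`wit.ai request failed with status ${response.status}`);
  }
  return response.json();
}
```

URLSearchParams takes care of encoding the transcript, so spaces and punctuation in the user's command do not break the query string.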

A response from the wit.ai message endpoint in Postman

From the response, we can simply access the intent value and the corresponding parameters and trigger the correct application functionality.
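As a sketch of that last step, assume the response carries an entities object in which each entity name maps to a list of candidates with value and confidence fields. This shape is an assumption and depends on the API version in use. The helper names, the handlers and the 0.5 threshold are likewise hypothetical.

```javascript
// Return the value of the most confident candidate for an entity, or null
// when the entity is missing or falls below the confidence threshold.
function bestValue(entities, name, minConfidence = 0.5) {
  const candidates = entities[name] || [];
  const best = candidates.reduce(
    (top, candidate) =>
      !top || candidate.confidence > top.confidence ? candidate : top,
    null
  );
  return best && best.confidence >= minConfidence ? best.value : null;
}

// Look up the handler for the detected intent and call it with the parameters.
// `handlers.unknown` is used when no intent is recognised confidently.
function dispatch(response, handlers) {
  const entities = response.entities || {};
  const intent = bestValue(entities, 'intent');
  const handler = handlers[intent] || handlers.unknown;
  return handler({ topic: bestValue(entities, 'topic') });
}
```

A handlers object might map joke to the joke generator and time to the clock function, with unknown asking the user to repeat the command.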

Replying with spoken feedback

Now that we have our small project fully controllable by voice, let’s take the extra step and respond with spoken feedback. This step is the inverse of the earlier task of generating text from speech. We have some data about the result of the user’s command, and we want to say it back in a user-friendly way.

To begin, let’s format our data to express it in a more coherent form. This is easy to do with templates, where the data is incorporated into a pre-made response.

Essentially, we define the possible replies where the data is represented with variables. When the response is generated, a random response template is chosen, and the variables are replaced with the data. Alternatively, lodash’s template function can be used.
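A small sketch of this templating idea follows. The template strings, the {name} placeholder syntax and the function names are assumptions for illustration, not the format used in the demo.

```javascript
// Hypothetical reply templates; `{name}` marks a variable to fill in.
const TEMPLATES = {
  time: ['The time is {time}.', 'It is currently {time}.'],
  joke: ['Here is a {topic} joke: {joke}', 'Try this {topic} joke. {joke}'],
};

// Replace every `{name}` placeholder with the matching property of `data`.
// Placeholders without a matching property are left untouched.
const fillTemplate = (template, data) =>
  template.replace(/\{(\w+)\}/g, (placeholder, name) =>
    name in data ? String(data[name]) : placeholder
  );

// Choose a random template for the intent and fill it with the data.
function buildReply(intent, data, random = Math.random) {
  const options = TEMPLATES[intent];
  const template = options[Math.floor(random() * options.length)];
  return fillTemplate(template, data);
}
```

Having more than one template per intent keeps the replies from sounding identical every time the same command is used.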

Now that we have a string of what to reply to the user, we can use Text to Speech to generate audio and play it. Like Speech to Text, this functionality is present on most platforms, such as AVSpeechSynthesizer on iOS and TextToSpeech on Android. In environments where this is not supported, the Text to Speech functionality in the Bing Speech API can be used instead.

For web apps there is SpeechSynthesis, which is simple to set up:
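A minimal sketch, assuming a hypothetical speak helper. The synthesiser and the SpeechSynthesisUtterance constructor are passed in as parameters so the function is not tied to window, and the en-GB default is an arbitrary choice.

```javascript
// Speak `text` aloud. `synth` is a SpeechSynthesis instance and `Utterance`
// is the SpeechSynthesisUtterance constructor.
function speak(text, synth, Utterance, { lang = 'en-GB', rate = 1, pitch = 1 } = {}) {
  const utterance = new Utterance(text);
  utterance.lang = lang;
  utterance.rate = rate;    // 1 is the normal speaking rate
  utterance.pitch = pitch;  // 1 is the normal pitch

  synth.cancel();           // drop anything still queued from a previous reply
  synth.speak(utterance);
  return utterance;
}

// Browser usage:
// speak('Here is a dad joke', window.speechSynthesis, window.SpeechSynthesisUtterance);
```

Cancelling before speaking is a design choice that favours the latest reply; remove that line if replies should queue up instead.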

Finally, we integrate all this functionality into our application, which results in a fully voice-controlled joke generator. Try it out by saying “Tell me a dad joke”:

Conclusion

In the end, we achieve a fully voice-controllable application just by integrating the speech and wit.ai APIs. I hope this post showed you how easy it is to add voice control functionality to any project. I highly recommend reading the wit.ai and Web Speech API documentation, which explains how to customise the voice recognition to suit your application.

Special thanks to Christian Silver, Charlie Crisp and Lucia Bura for their feedback, and to Hari Prasad for editing this blog post.