I just finished the Coursera deep learning online program this week. The last programming assignment is about trigger word detection, also known as wake word or hot word detection. It is what happens when you call out to Amazon Alexa or Google Home to wake them up.

Wouldn’t it be cool to build one yourself and run it in real time?

In this post, I am going to show you exactly how to build a Keras model that does the same thing from scratch. No third-party voice API or network connection is required to make it work.

A lot of the background is covered in the Coursera course. Don’t worry if you are new to this; I will give just enough of an overview for you to understand what happens next.

Prepare the training datasets

For the sake of simplicity, let’s take the word “Activate” as our trigger word.

The training dataset needs to be as similar to the real test environment as possible. For example, the model needs to be exposed to non-trigger words and background noise during training, so that it does not fire the trigger signal when we say other words or when there is only background noise.

As you may expect, training a good speech model requires a lot of labeled training samples. Do we really have to record every audio clip and label where the trigger word was spoken? Here is a simple trick to solve this problem.

We generate them!

First, we have 3 types of audio recordings:

1. Recordings of different background audio. They might be as simple as two clips of background noise, 10 seconds each: a coffee shop and a living room.

2. Recordings of the trigger word “activate”. They might be just you speaking the word 10 times in different tones, 1 second each.

3. Recordings of the negative words. They might be you speaking other words like “baby”, “coffee”, 1 second for each recording.

Here are the steps to generate the training input audio clips:

Pick a random 10-second background audio clip

Randomly overlay 0–4 audio clips of “activate” into this 10sec clip

Randomly overlay 0–2 audio clips of negative words into this 10sec clip
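The three steps above can be sketched with plain NumPy arrays standing in for the recordings. This is only an illustration under stated assumptions: the function names, the 44.1 kHz sample rate, and the rule that inserted words never overlap each other are mine, not details given in this post.

```python
import numpy as np

SAMPLE_RATE = 44100   # assumed sample rate; not stated in the post
CLIP_SECONDS = 10


def pick_free_slot(length, total, taken, rng, max_tries=100):
    """Return (start, end) of a random slot that overlaps no slot in `taken`, or None."""
    for _ in range(max_tries):
        start = int(rng.integers(0, total - length + 1))
        end = start + length
        if all(end <= s or start >= e for s, e in taken):
            return start, end
    return None


def synthesize_clip(background, activates, negatives, rng):
    """Overlay 0-4 trigger words and 0-2 negative words onto a background clip.

    Returns the mixed audio and the end sample of every inserted trigger word.
    """
    mix = background.astype(np.float32).copy()
    taken, trigger_ends = [], []
    n_pos = int(rng.integers(0, 5))   # 0..4 inclusive (upper bound is exclusive)
    n_neg = int(rng.integers(0, 3))   # 0..2 inclusive
    words = [(activates, True)] * n_pos + [(negatives, False)] * n_neg
    for pool, is_trigger in words:
        word = pool[int(rng.integers(0, len(pool)))].astype(np.float32)
        slot = pick_free_slot(len(word), len(mix), taken, rng)
        if slot is None:
            continue                  # no room left, skip this word
        start, end = slot
        mix[start:end] += word        # overlay: add the word on top of the background
        taken.append(slot)
        if is_trigger:
            trigger_ends.append(end)  # needed later to build the labels
    return mix, trigger_ends
```

Adding the word to the background, rather than replacing that stretch of audio, is what keeps the background noise audible underneath the spoken word.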

We choose to overlay because we want to mix the spoken words with the background noise so that the result sounds more realistic.

For the output labels, we want it to represent whether or not someone has just finished saying “activate”.

We first initialize all timesteps of the output labels to “0”. Then, for each “activate” we overlaid, we update the target labels by setting the 50 timesteps that follow it to “1”.

Why 50 timesteps of “1”s?

Because if we set only 1 timestep after the “activate” to “1”, there would be too many 0s in the target labels, which creates a very imbalanced training set.

Using 50 “1”s is a bit of a hack, but it can make the model a little easier to train. Here is an illustration to show you the idea.

Credit: Coursera — deeplearning.ai

This is a clip into which we have inserted “activate”, “innocent”, “activate”, and “baby”. Note that the positive labels “1” are associated only with the positive words.
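The labeling rule can be sketched as follows. It assumes the 1375 output timesteps per 10-second clip quoted later in this post; the function name and the two example end times are hypothetical.

```python
import numpy as np

TY = 1375               # output timesteps per 10-second clip (see the model section)
CLIP_MS = 10_000
ONES_PER_TRIGGER = 50


def mark_trigger(y, trigger_end_ms):
    """Set the 50 output steps right after a trigger word ends to 1 (in place)."""
    end_step = int(trigger_end_ms * TY / CLIP_MS)
    # Slicing past the end of the array is safe: NumPy just clips the range.
    y[end_step + 1 : end_step + 1 + ONES_PER_TRIGGER] = 1
    return y


y = np.zeros(TY, dtype=np.float32)
mark_trigger(y, 3_000)   # an "activate" that ends at 3.0 s
mark_trigger(y, 9_900)   # one ending near the clip end gets fewer than 50 ones
```

With 1375 output steps, marking a single step per trigger word would leave at most 4 of 1375 labels (about 0.3%) positive. Marking 50 steps per word raises that to as much as 200 of 1375 (about 14.5%).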

The green/blueish plot is the spectrogram, which is the frequency representation of the audio wave over time. The x-axis is time and the y-axis is frequency. The more yellow/bright the color, the more active (loud) that frequency is.

Our input data will be the spectrogram of each generated audio clip, and the target will be the labels we created earlier.
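To make the spectrogram step concrete, here is a minimal NumPy sketch of a short-time Fourier transform. The 44.1 kHz sample rate, the 200-sample window, and the 80-sample hop are assumptions; I picked them because they turn a 10-second clip into the 5511 timesteps the model below expects. The Hann window and the function name are also my choices, not something this post specifies.

```python
import numpy as np


def spectrogram(samples, window=200, hop=80):
    """Return a (timesteps, frequencies) power spectrogram of a 1-D signal."""
    # Cut the signal into overlapping frames: one frame every `hop` samples.
    frames = np.lib.stride_tricks.sliding_window_view(samples, window)[::hop]
    # Taper each frame, then take the FFT of the real-valued signal.
    spectrum = np.fft.rfft(frames * np.hanning(window), axis=-1)
    return np.abs(spectrum) ** 2


ten_seconds = np.random.default_rng(0).standard_normal(441_000)  # stand-in for real audio
features = spectrogram(ten_seconds)
print(features.shape)  # (5511, 101)
```

Each row is one timestep and each column one frequency bin, which is the layout a sequence model reads one step at a time.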

Build the Model

Without further ado, let’s take a look at the model structure.

The 1D convolutional step takes 5511 timesteps of the spectrogram (10 seconds) as input and produces a 1375-step output. It extracts low-level audio features, similar to how 2D convolutions extract image features. It also helps speed up the model by reducing the number of timesteps.
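The drop from 5511 to 1375 timesteps follows from the usual output-length formula for an unpadded convolution. The kernel size of 15 and stride of 4 used here are assumptions: they are one combination that reproduces the numbers above, not values stated in this post.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Number of output steps of a 1D convolution with no padding."""
    return (input_length - kernel_size) // stride + 1


print(conv1d_output_length(5511, kernel_size=15, stride=4))  # 1375
```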

The two GRU layers read the sequence of inputs from left to right, and a dense + sigmoid layer then makes a prediction at each timestep. The sigmoid keeps each predicted label in the range 0 to 1, where 1 corresponds to the user having just said “activate”.

Here is the code written in Keras’ functional API.
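As a rough sketch of such a model in the functional API: one 1D convolution, two GRU layers, and a per-timestep dense + sigmoid output. Only the 5511 input and 1375 output timesteps come from this post. The 101 frequency bins, 196 filters, kernel size 15, stride 4, 128 GRU units, dropout rates, and the placement of batch normalization are all assumptions for illustration.

```python
from tensorflow.keras.layers import (
    Activation, BatchNormalization, Conv1D, Dense, Dropout, GRU, Input, TimeDistributed,
)
from tensorflow.keras.models import Model


def build_model(input_shape=(5511, 101)):
    """Conv1D -> GRU -> GRU -> per-timestep sigmoid (hyperparameters are assumed)."""
    x_in = Input(shape=input_shape)

    # Convolution: extracts local features and shortens 5511 steps to 1375.
    x = Conv1D(filters=196, kernel_size=15, strides=4)(x_in)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Dropout(0.8)(x)

    # Two GRU layers that return the full sequence, not just the last state.
    x = GRU(128, return_sequences=True)(x)
    x = Dropout(0.8)(x)
    x = BatchNormalization()(x)
    x = GRU(128, return_sequences=True)(x)
    x = Dropout(0.8)(x)
    x = BatchNormalization()(x)

    # One sigmoid output per timestep: "was the trigger word just said?"
    x = TimeDistributed(Dense(1, activation="sigmoid"))(x)
    return Model(inputs=x_in, outputs=x)


model = build_model()
```

Both GRU layers use `return_sequences=True` because the labels are per timestep, so the network has to emit a prediction at every one of the 1375 steps.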

Trigger word detection takes a long time to train. To save time, Coursera has already trained a model for about 3 hours on a GPU using the architecture shown above and a large training set of about 4000 examples. Let’s load the model.

Real-time Demo

So far, our model can only take a static 10-second audio clip and predict the location of the trigger word.

Here is the fun part: let’s replace it with a live audio stream instead!

The model we have built expects 10-second audio clips as input. Training another model that takes shorter audio clips is possible, but it would require retraining the model on a GPU for several hours.

We also don’t want to wait 10 seconds for the model to tell us the trigger word is detected. So one solution is to have a moving 10-second audio window with a step size of 0.5 seconds. This means we ask the model for a prediction every 0.5 seconds, which reduces the delay and makes it responsive.

We also add a silence detection mechanism that skips the prediction if the loudness is below a threshold. This can save some computing power.
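A simple way to implement that check is to compare the mean absolute amplitude of the chunk against a fixed threshold. This sketch assumes 16-bit integer samples; the function name and the threshold of 100 are placeholders that would need tuning for a real microphone.

```python
import numpy as np

SILENCE_THRESHOLD = 100   # assumed value for 16-bit audio; tune for your microphone


def is_silent(chunk, threshold=SILENCE_THRESHOLD):
    """True when the mean absolute amplitude of an int16 chunk is below the threshold."""
    # Widen to int32 first so that abs(-32768) does not overflow int16.
    return float(np.abs(chunk.astype(np.int32)).mean()) < threshold
```

When `is_silent` returns `True`, the model call for that 0.5-second step can be skipped entirely.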

Let’s see how to build it.

The 10-second input audio is updated every 0.5 seconds. That is, every 0.5 seconds the oldest 0.5-second chunk of audio is discarded and a fresh 0.5-second chunk is shifted in. The job of the model is to tell whether a new trigger word is detected in the fresh 0.5-second audio chunk.
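The shifting itself takes only a few lines. This is a minimal sketch that assumes a 44.1 kHz sample rate and 16-bit samples; the names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 44100                    # assumed sample rate
WINDOW_SAMPLES = 10 * SAMPLE_RATE      # 10-second model input
CHUNK_SAMPLES = SAMPLE_RATE // 2       # 0.5-second step


def shift_in(window, chunk):
    """Drop the oldest len(chunk) samples and append the new chunk at the end."""
    return np.concatenate([window[len(chunk):], chunk])


window = np.zeros(WINDOW_SAMPLES, dtype=np.int16)   # start with 10 s of silence
chunk = np.ones(CHUNK_SAMPLES, dtype=np.int16)      # stand-in for fresh microphone data
window = shift_in(window, chunk)
```

The window always stays exactly 10 seconds long, so the trained model can be reused unchanged.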

And here is the code to make it happen.
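One way to decide whether the newest chunk contains a new trigger word is to look for a rising edge in the part of the prediction sequence that covers that chunk. The sketch below is my own take on that idea, not necessarily the exact logic used here. The 0.5 threshold and the choice to look one step before the chunk are assumptions.

```python
import numpy as np


def has_new_trigger(predictions, chunk_seconds=0.5, window_seconds=10.0, threshold=0.5):
    """True if the prediction rises above the threshold inside the newest chunk."""
    active = np.asarray(predictions) > threshold
    # Number of prediction steps that correspond to the newest audio chunk.
    chunk_steps = int(len(active) * chunk_seconds / window_seconds)
    # Include one step before the chunk so a rise on its first step is not missed.
    recent = active[-(chunk_steps + 1):]
    rising = ~recent[:-1] & recent[1:]   # step i is low and step i + 1 is high
    return bool(rising.any())
```

Checking for a rise, and only in the newest chunk, keeps a trigger word from being reported again on later steps while it is still inside the 10-second window.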

To get the audio stream, we use the pyaudio library, which has an option to read the audio stream asynchronously. That means the audio recording happens in another thread and, when a new fixed-length chunk of audio data is available, it notifies our model to process it in the main thread.

You may ask: why not just read a fixed length of audio and process it in one function?

The reason is that the model takes quite some time to generate a prediction, sometimes tens of milliseconds. Reading and processing in one function would risk creating gaps in the audio stream while the computation runs.

Here is the code for the pyaudio library’s callback. In the callback function, we put the audio data on a queue to notify the main thread that there is data for the model to process.
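A minimal sketch of this pattern is shown below. The callback signature and the `stream_callback` argument are PyAudio's callback mode. The sample rate, chunk size, and names such as `audio_queue` and `open_microphone` are assumptions for illustration.

```python
import queue

import numpy as np

SAMPLE_RATE = 44100                 # assumed sample rate
CHUNK_SAMPLES = SAMPLE_RATE // 2    # one callback per 0.5 s of audio
PA_CONTINUE = 0                     # same value as pyaudio.paContinue

audio_queue = queue.Queue()         # thread-safe hand-off to the main thread


def callback(in_data, frame_count, time_info, status):
    """Runs on PyAudio's thread: queue the chunk and return quickly."""
    audio_queue.put(np.frombuffer(in_data, dtype=np.int16))
    return in_data, PA_CONTINUE     # keep the stream running


def open_microphone(stream_callback):
    """Open a callback-driven input stream (requires PyAudio and a microphone)."""
    import pyaudio  # imported here so the queue logic above runs without PyAudio

    audio = pyaudio.PyAudio()
    stream = audio.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                        input=True, frames_per_buffer=CHUNK_SAMPLES,
                        stream_callback=stream_callback)
    return audio, stream
```

The main thread can then block on `audio_queue.get()` and run the model on each chunk, so a slow prediction delays only the consumer and never the recording.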

When you run it, it outputs one of 3 characters every 0.5 seconds.

“-” means silence,

“.” means not silence and no trigger word,

“1” means a new trigger word is detected.

--.--......-1----.-..--...-1---..-------..1---------..-1---------.----.--------.----.---.--.-.------------.
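The mapping from an audio chunk to one of these three characters can be sketched as a small function. Here `detect_fn` is a hypothetical callable that runs the model and returns whether a new trigger word was found, and the silence threshold of 100 is an assumed value for 16-bit audio.

```python
import numpy as np


def status_char(chunk, detect_fn, silence_threshold=100):
    """Map one audio chunk to '-', '.' or '1' as in the output above."""
    if np.abs(chunk.astype(np.int32)).mean() < silence_threshold:
        return "-"                      # silence: the model is not called at all
    return "1" if detect_fn(chunk) else "."


# Two fake chunks: one silent, one loud but without a trigger word.
chunks = [np.zeros(8, dtype=np.int16), np.full(8, 500, dtype=np.int16)]
line = "".join(status_char(c, detect_fn=lambda c: False) for c in chunks)
print(line)  # -.
```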

Feel free to replace printing the “1” character with anything you want to happen when a trigger word is detected: launch an app, play a sound, etc.

Here is the demo on YouTube

Summary and Further Reading

This article demonstrates how to build a real-time trigger word detector from scratch with the Keras deep learning framework.

Here’s what you should remember:

Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.

Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.

An end-to-end deep learning approach can be used to build a very effective trigger word detection system.

Deep learning model prediction takes time. Process the audio data asynchronously from the audio input stream to avoid gaps in the recording.

A sliding/moving input window is an effective way to reduce delay.

Further reading

Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant

Trigger Word Detection lecture — Coursera

Now, grab the full source code from my GitHub repo and build an awesome trigger word application.