Much of this processing is done in the cloud, using powerful neural networks trained on enormous amounts of data. However, speech command models, which recognize a single word like “start”, “stop”, “left” or “right” (or “Hey Google”, “Alexa”, “Echo”), usually run locally, for a variety of reasons. First, it would be expensive to continuously stream all of the audio captured by a device to the cloud. Second, it would not even be possible in some applications: consider an operator on an offshore platform who wants to use simple voice commands to control an auxiliary robot; there may be no internet connection available, or the latency may be too high.

In this work, the Google Speech Commands dataset was used. It contains one-second audio clips of various spoken words and is an excellent starting point for learning how to apply deep learning to speech.
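As a minimal sketch of how to get started, the dataset can be pulled with torchaudio’s built-in wrapper (assuming torchaudio is the library of choice; the ./data path below is an arbitrary placeholder):

```python
# Minimal sketch: loading the Speech Commands dataset with torchaudio.
# The "./data" path is an arbitrary choice, not something prescribed here.
import os
import torchaudio

os.makedirs("./data", exist_ok=True)
dataset = torchaudio.datasets.SPEECHCOMMANDS(root="./data", download=True)

# Each item is a tuple: (waveform, sample_rate, label, speaker_id, utterance_number)
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate, waveform.shape)  # e.g. a 1 s clip at 16 kHz: shape [1, 16000]
```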

Fundamentals of Human Voice

When analyzing the human voice, one very important concept is that of a filter: a frequency-selective transmission system that lets energy through at some frequencies and blocks it at others. As shown in the picture below, the pharynx, oral cavity and lips play an important role in human speech.

Voice formants reveal the frequency regions where energy is concentrated. They appear as amplitude peaks in the sound spectrum and are tied to the particular configuration the vocal tract adopts when a vowel is produced. When a word is spoken, the formants correspond to the natural resonance frequencies of the vocal tract and depend on the position of the tongue relative to the inner structures and on lip movement. The system can be approximated by a tube with one closed end (the larynx) and one open end (the lips), modified by the movement of the tongue, lips and pharynx. The resonances that occur in the cavities of this tube are called formants.
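To make the tube analogy concrete: a tube closed at one end and open at the other resonates at odd multiples of c/(4L). Assuming typical values for the speed of sound (about 343 m/s) and the vocal tract length (about 17 cm), the first resonances land near 500, 1500 and 2500 Hz, close to the formants of a neutral vowel:

```python
# Back-of-the-envelope check of the closed-open tube model.
# The constants are typical assumed values, not measurements.
SPEED_OF_SOUND = 343.0  # m/s, air at about 20 °C
TRACT_LENGTH = 0.17     # m, typical adult vocal tract length

# A tube closed at one end resonates at odd multiples of c / (4 L).
for n in range(1, 4):
    formant = (2 * n - 1) * SPEED_OF_SOUND / (4 * TRACT_LENGTH)
    print(f"F{n} ≈ {formant:.0f} Hz")
# F1 ≈ 504 Hz, F2 ≈ 1513 Hz, F3 ≈ 2522 Hz
```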

Interestingly, deep learning practitioners were quick to set most of this information aside. During his time at Baidu, researcher Andrew Ng went as far as saying that phonemes, the smallest components of sound, didn’t matter. To a certain extent, what matters instead is that voice is mostly a (quasi-)periodic signal over sufficiently small time intervals. This leads to the idea of ignoring the phase of the signal and using only its power spectrum as a source of information. The fact that sound can be reconstructed from its power spectrum (with the Griffin-Lim algorithm or a neural vocoder, for example) is strong evidence that very little useful information is lost.
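A quick way to see this in practice is to throw the phase away and rebuild the audio from the magnitude spectrogram alone, for example with librosa’s Griffin-Lim implementation (a sketch; the file name and STFT parameters are placeholders):

```python
# Sketch: reconstruct audio from its magnitude spectrogram only, discarding phase.
# "speech.wav" and the STFT parameters are placeholder assumptions.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=16000)

# Magnitude spectrogram: |STFT|; all phase information is dropped here.
magnitude = abs(librosa.stft(y, n_fft=512, hop_length=128))

# Griffin-Lim iteratively estimates a phase that is consistent with the magnitudes.
y_rebuilt = librosa.griffinlim(magnitude, n_iter=60, hop_length=128)

sf.write("speech_rebuilt.wav", y_rebuilt, sr)  # should sound close to the original
```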

Though there is a considerable amount of work being done on processing raw waveforms (for example, in this recent Facebook study), spectral methods are prevalent. At the moment, the standard way of preprocessing audio is to compute the short-time Fourier transform (STFT) of the raw waveform with a given window and hop size. The result, called a spectrogram, is a two-dimensional array whose axes are time and frequency and whose values give the intensity of the audio, as shown in the picture below. Another useful trick is to “stretch” the lower frequencies to mimic human perception, which is less sensitive to changes at higher frequencies. This can be done by mapping the spectrogram onto the mel scale (mel spectrograms and the related mel-frequency cepstral coefficients, MFCCs, are both common choices). There are many online resources that explain these representations in detail should this be of interest (Wikipedia is a good start).
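As an illustration, the whole chain (waveform → STFT → power spectrogram → mel scale) takes only a few lines with librosa; the file name and parameter values below are reasonable assumptions for 16 kHz speech rather than anything prescribed above:

```python
# Sketch of the standard preprocessing chain for short 16 kHz speech clips.
# File name and parameters are assumptions, not fixed choices.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)

# Short-time Fourier transform: 512-sample (32 ms) windows, 128-sample (8 ms) hop.
stft = librosa.stft(y, n_fft=512, hop_length=128)
power = np.abs(stft) ** 2  # power spectrogram (phase discarded)

# Warp the frequency axis onto the mel scale to mimic human perception,
# then convert to decibels, since loudness perception is roughly logarithmic.
mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)

print(log_mel.shape)  # (n_mels, n_frames): an image-like input for a neural network
```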