An example of MFCC. The y-axis represents features and the x-axis represents time.

MFCC stands for Mel-Frequency Cepstral Coefficients, and it has been almost an industry standard since it was introduced in the 1980s by Davis and Mermelstein. You can find a better theoretical explanation of MFCCs in this amazingly readable article. For basic usage, all you need to know is that MFCCs capture only the parts of the sound that are heard best by the human ear.
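To make the idea less abstract, the classic MFCC pipeline (frame, window, power spectrum, mel filterbank, log, DCT) can be sketched in a few numpy steps. This is a simplified illustration, not Kaldi's actual `compute-mfcc-feats` implementation; the frame sizes, filterbank details, and omitted liftering are all simplifications:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=23, n_ceps=13):
    # 1. Slice the signal into overlapping frames and apply a Hamming window
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log of the energy in each mel band
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (num_frames, n_ceps)

# One second of a 440 Hz tone as a stand-in for real speech
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
```

The output matches the figure above: one row of coefficients (features) per time frame.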

In Kaldi we use two more features:

CMVNs, which are used for better normalization of the MFCCs.

I-Vectors (which deserve an article of their own), which are used for better modeling of the variance within the domain, for example by creating a speaker-dependent representation. I-Vectors are based on the same ideas as JFA (Joint Factor Analysis), but are better suited to capturing both channel and speaker variance. The math behind I-Vectors is clearly described here and here.
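Per-utterance CMVN boils down to subtracting the mean and dividing by the standard deviation of each feature dimension across time. A minimal numpy sketch (Kaldi's own `compute-cmvn-stats`/`apply-cmvn` tools also support per-speaker and sliding-window variants, which this ignores):

```python
import numpy as np

def apply_cmvn(feats):
    # feats: (num_frames, num_coeffs) matrix of MFCCs for one utterance.
    # Normalize each coefficient to zero mean and unit variance over time.
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / np.maximum(std, 1e-10)

# Fake MFCCs with an arbitrary offset and scale
rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.0, size=(200, 13))
norm = apply_cmvn(raw)
```

After normalization, every coefficient has zero mean and unit variance, which removes constant channel effects (e.g. microphone coloration) from the features.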

The process of using I-Vectors as described in Dehak, N., & Shum, S. (2011). In practice: it's complicated.

For a basic understanding of these concepts, remember the following things:

MFCC and CMVN are used for representing the content of each audio utterance. I-Vectors are used for representing the style of each audio utterance or speaker.

The Model

The matrix math behind Kaldi is implemented with either BLAS and LAPACK (written in Fortran!) or with an alternative GPU implementation based on CUDA. Because it builds on such low-level packages, Kaldi is highly efficient at these tasks.

Kaldi’s model can be divided into two main components:

The first part is the Acoustic Model, which used to be a GMM but has now been widely replaced by deep neural networks. This model transcribes the audio features we created into a sequence of context-dependent phonemes (in Kaldi dialect we call them “pdf-ids” and represent them by numbers).

The Acoustic model, generalized. On top you can see IPA phoneme representation.

The second part is the Decoding Graph, which takes the phonemes and turns them into lattices. A lattice is a representation of the alternative word sequences that are likely for a particular audio segment. This is generally the output you want from a speech recognition system. The decoding graph takes into account the grammar of your data, as well as the probabilities of specific word sequences (n-grams).

A representation of a lattice: the words and the probability of each word.
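As a toy stand-in for the information a lattice carries, here are a few competing word sequences for one audio segment (a real Kaldi lattice is a WFST over frames; the words and probabilities below are made up for illustration):

```python
# Alternative word sequences for one audio segment, each with a probability.
lattice = [
    (["i", "want", "four"], 0.6),
    (["i", "want", "for"], 0.3),
    (["i", "went", "four"], 0.1),
]

# The usual final decoding result is the highest-probability path.
best_words, best_prob = max(lattice, key=lambda entry: entry[1])
```

Keeping the alternatives around (rather than only the best path) is what makes lattices useful for rescoring and for confidence estimation.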

The decoding graph is essentially a WFST, and I highly encourage anyone who wants to go deep in this field to learn the subject thoroughly. The easiest way to do so is through these videos and this classic article. After understanding both, you will find it much easier to follow how the decoding graph works. In the Kaldi project this composition of different WFSTs is called the “HCLG.fst” file, and it is based on the OpenFst framework.

A simple representation of a WFST taken from “Springer Handbook on Speech Processing and Speech Communication”. Each connection is labeled: Input:Output/Weighted likelihood
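To make the input:output/weight labeling concrete, here is a tiny WFST in plain Python that maps phoneme sequences to words, with a best-path search treating weights as negative log-likelihoods (lower total weight = more likely). The states, labels, and weights are invented for illustration; real decoding uses OpenFst, not code like this:

```python
import heapq

# Each arc: (next_state, input_label, output_label, weight).
# "" as output label plays the role of epsilon (no output).
ARCS = {
    0: [(1, "f", "", 0.0)],
    1: [(2, "ay", "", 0.0), (3, "ao", "", 0.0)],
    2: [(4, "v", "five", 0.3)],
    3: [(4, "r", "four", 0.5)],
}
FINAL = {4}

def best_path(start=0):
    # Dijkstra over the WFST: cheapest path from start to a final state.
    heap = [(0.0, start, [])]
    while heap:
        cost, state, out = heapq.heappop(heap)
        if state in FINAL:
            return out, cost
        for nxt, _inp, outp, w in ARCS.get(state, []):
            heapq.heappush(heap, (cost + w, nxt, out + ([outp] if outp else [])))
    return None, float("inf")

hypothesis, cost = best_path()
```

Here the graph accepts the phoneme sequences "f ay v" and "f ao r" and outputs the words "five" and "four" respectively; the search picks the cheaper path.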

Worth noting: this is a simplification of how the model works. There is actually a lot of detail about connecting the two models with a decision tree and about how the phonemes are represented, but this simplification should help you grasp the process.

You can learn in depth about the entire architecture in the original article describing Kaldi, and about the decoding graph specifically in this amazing blog post.

The Training Process

In general, this is the trickiest part. In Kaldi you’ll need to arrange your transcribed audio data in a very specific structure, which is described in depth in the documentation.

After arranging your data, you’ll need a mapping from each word to the phonemes that make it up. This mapping is called the “dictionary” (or lexicon), and it determines the outputs of the acoustic model. Here is an example of such a dictionary:

eight -> ey t
five -> f ay v
four -> f ao r
nine -> n ay n
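Parsing such a dictionary into a word-to-phonemes mapping is straightforward. A small sketch (the arrow format follows the example above; Kaldi's actual `lexicon.txt` is simply space-separated, without the arrow):

```python
# The example dictionary from above, as text.
LEXICON_TEXT = """\
eight -> ey t
five -> f ay v
four -> f ao r
nine -> n ay n
"""

def parse_lexicon(text):
    # Map each word to its list of phonemes.
    lexicon = {}
    for line in text.strip().splitlines():
        word, phones = line.split("->")
        lexicon[word.strip()] = phones.split()
    return lexicon

lexicon = parse_lexicon(LEXICON_TEXT)
```

Note that real lexicons often contain multiple pronunciations per word, so a production parser would map each word to a list of phoneme sequences.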

When you have both of those things at hand, you can start training your model. The different training pipelines you can use are called, in Kaldi dialect, “recipes”. The most widely used recipe is the WSJ recipe, and you can look at its run bash script for a better understanding of it.

In most of the recipes we start by aligning the phonemes to the audio with a GMM. This basic step (named “alignment”) helps us determine the sequence we want our DNN to output later.

The general process of training a Kaldi model. The input is the transcribed data and the output is the lattices.
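In code terms, the alignment step can be thought of as expanding each phoneme into a run of frame-level targets, which is exactly what the DNN is later trained to predict frame by frame. A toy sketch (real alignments use integer pdf-ids produced by the GMM, and the durations below are made up):

```python
# Expand (phoneme, duration-in-frames) pairs into per-frame targets.
def expand_alignment(phone_durations):
    frames = []
    for phone, n_frames in phone_durations:
        frames.extend([phone] * n_frames)
    return frames

# "five" = f ay v, with invented frame durations
targets = expand_alignment([("f", 3), ("ay", 6), ("v", 2)])
```

Each audio frame now has exactly one label, so acoustic-model training reduces to an ordinary frame-level classification problem.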

After the alignment, we create the DNN that will form the Acoustic Model and train it to match the alignment output. Once the acoustic model is ready, we can build the WFST that transforms the DNN output into the desired lattices.