This article presents my understanding of the Microsoft 2016 speech recognition paper: The Microsoft 2016 Conversational Speech Recognition System by W. Xiong et al., 2016 (abbreviated as MSR2016 below). My intention is to deconstruct the paper to gain a better understanding of state-of-the-art speech recognition technology. Please feel free to comment and correct my mistakes.

Speech Recognition Basics

For a given acoustic observation, i.e. a feature vector sequence X = X1 X2 … Xn, the goal of speech recognition is to find the word sequence Ŵ = w1 w2 … wm that has the maximum posterior probability P(W|X). By Bayes' rule:

Ŵ = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X)

Since X is fixed, P(X) is a constant, and the equation can be simplified to:

Ŵ = argmax_W P(X|W) P(W)

P(X|W), the probability of the acoustic features X given the words W, is modeled by the Acoustic Model (AM).

P(W), the prior probability of the word sequence, is modeled by the Language Model (LM).
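To make the decomposition concrete, here is a toy sketch (all scores are made up for illustration) of picking the word sequence that maximizes P(X|W) · P(W), computed in log space:

```python
def decode(candidates, am_logprob, lm_logprob):
    """Pick the word sequence W maximizing log P(X|W) + log P(W)."""
    return max(candidates, key=lambda w: am_logprob[w] + lm_logprob[w])

# Hypothetical scores for two competing hypotheses
candidates = ["recognize speech", "wreck a nice beach"]
am_logprob = {"recognize speech": -12.0, "wreck a nice beach": -11.5}  # log P(X|W)
lm_logprob = {"recognize speech": -3.0, "wreck a nice beach": -7.0}    # log P(W)

best = decode(candidates, am_logprob, lm_logprob)
# The LM rescues the right answer: -12.0 + -3.0 = -15.0 beats -11.5 + -7.0 = -18.5
```

Note how the language model can overrule a slightly better acoustic score, which is exactly why the product of the two terms is decoded jointly.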

Acoustic Model (AM)

The acoustic model is not my main interest, so I will only discuss it briefly.

AM Basics

From Huang and Deng: An Overview of Modern Speech Recognition (2010):

There are a number of well-known factors that determine the accuracy of a speech-recognition system. The most noticeable ones are context variations, speaker variations, and environment variations. Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. Acoustic modeling also encompasses “pronunciation modeling,” which describes how a sequence or multi-sequences of fundamental speech units (such as phones or phonetic feature) are used to represent larger speech units such as words or phrases that are the object of speech recognition.

Setup: log-filterbank

Forty-dimensional log-filterbank features were extracted every 10 milliseconds, using a 25-millisecond analysis window. — MSR2016

Below is an excerpt from James F. Allen's CS248 notes discussing what filter banks and log filter banks are.

One way to more concisely characterize the signal is by a filter bank. We divide the frequency range of interest (say 100–8000 Hz) into N bands and measure the overall intensity in each band. A better alternative is to organize the ranges using a logarithmic scale, and this actually agrees better with human perceptual capabilities as well. We set the first range to have the width W, and then subsequent widths are a^n × W. If a = 2 and W is 200 Hz, we get widths of 200 Hz, 400 Hz, 800 Hz, 1600 Hz, and so on. Our frequency ranges would now be 100–300 Hz, 300–700 Hz, 700–1500 Hz, 1500–3100 Hz, 3100–6300 Hz.
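The band edges in the excerpt can be reproduced with a few lines; this is a sketch of the logarithmic band layout described above, not the actual filterbank MSR2016 uses:

```python
def log_band_edges(start_hz, first_width_hz, a, n_bands):
    """Band edges where the n-th band has width a**n * first_width_hz."""
    edges = [start_hz]
    for n in range(n_bands):
        edges.append(edges[-1] + (a ** n) * first_width_hz)
    return edges

# W = 200 Hz, a = 2, starting at 100 Hz, as in the excerpt
edges = log_band_edges(100, 200, 2, 5)
# edges == [100, 300, 700, 1500, 3100, 6300]
```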

Setup: Triphone

The basic sounds from which word pronunciations are composed are known as phones or phonemes. Approximately 50 phones may be used to pronounce any word in the English language. However, there are co-articulation effects: the beginning and end of a phone are modified by the preceding and succeeding phones. Triphones take these co-articulation effects into consideration. This is explained in this article.

MSR2016 was set up to use left-to-right triphone models with 9000 tied states and, in some cases, 27k tied states. This includes special models for noise, vocalized noise, laughter, and silence. A 30k-word vocabulary was derived from the Switchboard and Fisher corpora.
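As a toy illustration of triphone context expansion (the phone set and the "left-phone+right" notation here are my own choices, not MSR2016's):

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones,
    written left-phone+right (e.g. 'k-ae+t'); 'sil' pads the edges."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

tri = to_triphones(["k", "ae", "t"])  # the word "cat"
# tri == ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Each phone is now modeled jointly with its left and right neighbors, which is what captures the co-articulation effects; tying then clusters the resulting explosion of units down to a few thousand shared states.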

Convolutional and LSTM Neural Networks

The MSR2016 paper tests the following CNN variants: VGG, Residual Networks (ResNet), and LACE. A long short-term memory network with a bidirectional architecture and without frame-skipping (BLSTM) was also used. According to the results, LACE and BLSTM perform best.

Speaker Adaptation

Speaker adaptation is done with a 100-dimensional i-vector.
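A common way to use i-vectors is to append the speaker's i-vector to every frame's acoustic features; a minimal sketch, assuming that setup (the dimensions match the paper, the values are placeholders):

```python
def append_ivector(frames, ivector):
    """Append the same speaker-level i-vector to every frame's features."""
    return [f + ivector for f in frames]

frames = [[0.0] * 40 for _ in range(3)]  # 3 frames of 40-dim log-filterbank features
ivector = [0.1] * 100                    # one 100-dim i-vector for the speaker
adapted = append_ivector(frames, ivector)
# each adapted frame is now 140-dimensional
```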

Acoustic Model Training Cost Functions

1. Cross-entropy training: a standard cost function that trains fast in all circumstances. The frame-level cross-entropy cost function is

CE = −(1/T) Σt log p(st | xt)

where st is the reference senone for frame t and xt is that frame's feature vector.

2. Maximum mutual information (MMI) objective function

The general term for this is "sequence discriminative training", because here we focus on finding the best sequence matching the observed acoustics.

This technique (MMI) seems uniquely suited to acoustic models for speech recognition, since it is sensitive to the ordering of sounds. MSR2016 used lattice-free MMI (LFMMI).
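The frame-level cross-entropy objective mentioned above can be sketched as follows (the senone posteriors here are made up for illustration):

```python
import math

def frame_cross_entropy(log_posteriors, targets):
    """Average frame-level cross-entropy: -(1/T) * sum_t log p(s_t | x_t),
    where s_t is the reference senone for frame t."""
    return -sum(lp[t] for lp, t in zip(log_posteriors, targets)) / len(targets)

# Toy log-posteriors over 3 senones for 2 frames
log_posteriors = [
    {0: math.log(0.7), 1: math.log(0.2), 2: math.log(0.1)},
    {0: math.log(0.1), 1: math.log(0.8), 2: math.log(0.1)},
]
targets = [0, 1]  # reference senone per frame, from a forced alignment
ce = frame_cross_entropy(log_posteriors, targets)
```

Cross-entropy scores each frame independently against an alignment, which is exactly the limitation that sequence discriminative objectives like MMI are meant to address.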

Here is a description of speech recognition lattice:

Lattices and word graphs are very important representations of search results, serving as an intermediate format between recognition passes as well as an interoperation format with various tools such as machine translation tools. Lattices contain timing information, and in this way they can describe the recognition results in more detail than a plain 1-best string or n-best lists.

lattice from http://www.cs.nyu.edu/~mohri/asr12/lecture_12.pdf
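To make the lattice idea concrete, here is a minimal, hypothetical representation of a word lattice carrying timing and scores, plus a 1-best search over it (node ids are assumed to be in topological order):

```python
from collections import namedtuple

Edge = namedtuple("Edge", "src dst word start_ms end_ms score")

# Two competing arcs ("see" vs "sea") over the same time span
lattice = [
    Edge(0, 1, "i", 0, 120, -1.0),
    Edge(1, 2, "see", 120, 400, -2.5),
    Edge(1, 2, "sea", 120, 400, -2.8),
    Edge(2, 3, "you", 400, 600, -1.2),
]

def best_path(lattice, start=0, final=3):
    """1-best word sequence by dynamic programming over the lattice."""
    best = {start: (0.0, [])}
    for e in sorted(lattice, key=lambda e: e.src):
        if e.src in best:
            cand = (best[e.src][0] + e.score, best[e.src][1] + [e.word])
            if e.dst not in best or cand[0] > best[e.dst][0]:
                best[e.dst] = cand
    return best[final][1]

words = best_path(lattice)
# words == ['i', 'see', 'you']
```

The same structure supports n-best extraction and rescoring, since alternative arcs and their times are preserved rather than collapsed to a single string.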

Here is the paper about lattice-free MMI:

The basic premise of this paper is to do MMI training directly on the GPU, without lattices, using the forward-backward algorithm for both numerator and denominator parts of the objective function

In MSR2016, the modified LFMMI is described as the following:

In our implementation, we use a mixed-history acoustic unit language model. In this model, the probability of transitioning into a new context-dependent phonetic state (senone) is conditioned on both the senone and phone history. We found this model to perform better than either purely word-based or phone-based models. Based on a set of initial experiments, we developed the following procedure:

1. Perform a forced alignment of the training data to select lexical variants and determine senone sequences.

2. Compress consecutive framewise occurrences of a single senone into a single occurrence.

3. Compute a variable-length N-gram language model from this data, where the history state consists of the previous phone and previous senones within the current phone.

To illustrate this, consider the sample senone sequence {s s2.1288, s s3.1061, s s4.1096}, {eh s2.527, eh s3.128, eh s4.66}, {t s2.729, t s3.572, t s4.748}. We would compute, e.g., P(t s2.729|s, eh s2.527, eh s3.128, eh s4.66) and then P(t s3.572|eh, t s2.729). We construct the denominator graph from this language model, and HMM transition probabilities as determined by transition counting in the senone sequences found in the training data. Our approach not only largely reduces the complexity of building up the language model but also provides very reliable training performance.

We have found it convenient to do the full computation, without pruning, in a series of matrix-vector operations on the GPU. The underlying acceptor is represented with a sparse matrix, and we maintain a dense likelihood vector for each time frame. The alpha and beta recursions are implemented with CUSPARSE level-2 routines: sparse-matrix, dense-vector multiplies. Run time is about 100 times faster than real time. As in [19], we use cross-entropy regularization. In all the lattice-free MMI (LFMMI) experiments mentioned below we use a trigram language model. Most of the gain is usually obtained after processing 24 to 48 hours of data.
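As a sketch of the mixed-history conditioning described in the quote (the underscore senone naming and the parsing rule are my own assumptions, chosen to reproduce the paper's two examples):

```python
def histories(senones):
    """Return (senone, history) pairs following the mixed-history rule:
    the previous phone plus the senones seen so far within the current
    phone; at a phone boundary this naturally becomes the phone before
    the previous phone plus the previous phone's full senone sequence."""
    out = []
    prev_phone, cur_phone, cur_senones = None, None, []
    for s in senones:
        phone = s.split("_")[0]
        # history is formed before updating the running state
        history = ([prev_phone] if prev_phone else []) + cur_senones
        if phone != cur_phone:
            prev_phone, cur_phone, cur_senones = cur_phone, phone, []
        out.append((s, tuple(history)))
        cur_senones.append(s)
    return out

seq = ["s_s2.1288", "s_s3.1061", "s_s4.1096",
       "eh_s2.527", "eh_s3.128", "eh_s4.66",
       "t_s2.729", "t_s3.572", "t_s4.748"]
pairs = histories(seq)
# Reproduces the paper's examples:
#   t_s2.729 is conditioned on ('s', 'eh_s2.527', 'eh_s3.128', 'eh_s4.66')
#   t_s3.572 is conditioned on ('eh', 't_s2.729')
```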

Language Model

1. Create N-gram Language Model

Use the SRILM toolkit to train and prune the N-gram language model.
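As a toy stand-in for what N-gram estimation does (SRILM's actual estimation adds the smoothing and pruning that a usable LM needs, which this sketch omits):

```python
from collections import defaultdict

def train_trigram(tokens):
    """Maximum-likelihood trigram probabilities from a token stream:
    P(c | a, b) = count(a, b, c) / count(a, b)."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        tri[(a, b, c)] += 1
        bi[(a, b)] += 1
    return {k: v / bi[k[:2]] for k, v in tri.items()}

tokens = "<s> i see you </s> <s> i see them </s>".split()
lm = train_trigram(tokens)
# P(you | i, see) == 0.5 and P(them | i, see) == 0.5
```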

2. Decode N-gram Language Model

WFST decoder is described in this article.

Modern-day Large-Vocabulary Continuous Speech-Recognition (LVCSR) systems are based on Hidden Markov Models (HMMs). They break the input utterance into a sequence of frames, each typically accounting for 10 ms of speech, and extract suitable acoustic-state scores per frame using Gaussian Mixture Models or, more recently, using Deep Neural Nets. Subsequently, an acoustic model (AM) graph and language model (LM) translate the sequence of acoustic-state scores into the likeliest sequence of words that could have produced the utterance. While the score computation can be efficiently implemented on a GPU, efficient parallel algorithms for the rest of the decoding process remain elusive.

The parallel WFST decoder is an efficient parallel algorithm for Viterbi decoding. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states (the Viterbi path) that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models.
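A minimal serial Viterbi implementation over an HMM, in log space; this sketches the algorithm itself, not the parallel WFST decoder (the toy model and all numbers are made up):

```python
import math

def viterbi(obs_loglik, trans_logp, init_logp):
    """Most likely hidden-state path through an HMM, in the log domain.
    obs_loglik[t][s] is the log-likelihood of frame t under state s."""
    n_states = len(init_logp)
    score = [init_logp[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        new_score, ptr = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: score[p] + trans_logp[p][s])
            ptr.append(prev)
            new_score.append(score[prev] + trans_logp[prev][s] + obs_loglik[t][s])
        score = new_score
        back.append(ptr)
    # backtrace from the best final state
    path = [max(range(n_states), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

LOG0 = -1e9  # stand-in for log(0)
# Toy 2-state left-to-right HMM over 3 frames
obs = [[math.log(0.9), math.log(0.1)],
       [math.log(0.2), math.log(0.8)],
       [math.log(0.1), math.log(0.9)]]
trans = [[math.log(0.6), math.log(0.4)],  # state 0 -> {0, 1}
         [LOG0, 0.0]]                     # state 1 cannot return to 0
init = [0.0, LOG0]                        # must start in state 0
path = viterbi(obs, trans, init)
# path == [0, 1, 1]: the decoder switches states once the evidence favors it
```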

In the case of MSR2016 paper:

The initial decoding produces a lattice with the pronunciation variants marked, from which 500-best lists are generated for rescoring purposes. Subsequent N-best rescoring uses an unpruned LM comprising 145 million N-grams. All N-gram LMs were estimated by a maximum entropy criterion as described in [33].

3. RNNLM

The N-best hypotheses are then rescored using a combination of the large N-gram LM and an RNNLM trained and evaluated using the CUED-RNNLM toolkit.
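N-best rescoring with an interpolated language model can be sketched as follows (the interpolation weight and all scores here are made up; real systems tune the weights on held-out data):

```python
def rescore(nbest, lam=0.5):
    """Rerank N-best hypotheses with a log-linear interpolation of the
    N-gram LM and RNNLM log-scores added to the acoustic score."""
    def total(h):
        return h["am"] + lam * h["ngram"] + (1 - lam) * h["rnnlm"]
    return max(nbest, key=total)

nbest = [
    {"words": "i see you", "am": -10.0, "ngram": -4.0, "rnnlm": -3.0},
    {"words": "i sea you", "am": -9.5, "ngram": -8.0, "rnnlm": -9.0},
]
best = rescore(nbest)
# hypothesis 1: -10.0 + 0.5*(-4.0) + 0.5*(-3.0) = -13.5 beats -18.0
```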