Just a quick comparison of two very interesting articles.

This article compares the IBM 2016 speech recognition system (article: The IBM 2016 English Conversational Telephone Speech Recognition System by G. Saon et al., a.k.a. ISR2016) and the Microsoft 2016 speech recognition systems (article: The Microsoft 2016 Conversational Speech Recognition Systems by W. Xiong et al., a.k.a. MSR2016). I have also previously written about the Microsoft paper here.

Data Extraction and Preprocessing

Data Extraction: the same in both systems, with a 25 ms analysis frame and a 10 ms frame shift.
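To make the framing concrete, here is a minimal sketch of slicing a waveform into 25 ms frames with a 10 ms shift. The 8 kHz sample rate is my assumption (it is the usual rate for telephone speech, but neither list above states it), and `frame_signal` is a hypothetical helper name, not something taken from either paper.

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=25.0, shift_ms=10.0):
    """Slice a 1-D signal into overlapping frames; a trailing partial frame is dropped."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))    # samples per frame
    frame_shift = int(round(sample_rate * shift_ms / 1000.0))  # samples between frame starts
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Row i of the index matrix holds the sample indices that belong to frame i.
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(num_frames)[:, None]
    return signal[idx]

# One second of (silent) audio at the assumed 8 kHz sample rate.
one_second = np.zeros(8000, dtype=np.float32)
frames = frame_signal(one_second)
print(frames.shape)
```

Each row of `frames` is one analysis window; features such as log-filterbank energies would then be computed per row.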

Data Processing:

MSR2016: log-filterbank features are extracted

ISR2016:

Speaker Adaptation Model:

MSR2016: 100-dimensional i-vector

ISR2016: 100-dimensional i-vector, VTL, PLP etc.
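The lists above do not say how the i-vector is fed to the network. One common approach, shown here purely as an illustration and not as either paper's confirmed method, is to append the same per-speaker i-vector to every frame's feature vector. The 40-dimensional frame features and the helper name `append_ivector` are assumptions made for the example.

```python
import numpy as np

def append_ivector(frame_features, ivector):
    """Concatenate one speaker-level i-vector onto every frame's feature vector."""
    num_frames = frame_features.shape[0]
    # Repeat the i-vector once per frame: shape (num_frames, ivector_dim).
    tiled = np.tile(ivector[None, :], (num_frames, 1))
    return np.concatenate([frame_features, tiled], axis=1)

rng = np.random.default_rng(0)
features = rng.standard_normal((98, 40))  # hypothetical: 98 frames of 40-dim features
ivector = rng.standard_normal(100)        # 100-dim i-vector, the size both systems use
adapted = append_ivector(features, ivector)
print(adapted.shape)
```

Because the i-vector is constant across an utterance or speaker, the network sees the same 100 extra inputs on every frame and can use them to normalize for speaker characteristics.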

Data Source

Training Data:

ISR2016:

SwitchBoard 1: 262 hours

Fisher: 1698 hours

CallHome: 15 hours

MSR2016 Acoustic Training Data:

Switchboard-1 Release 2 (LDC 97S62)

Fisher English Training Speech Part 1 Speech (LDC 2004S13)

Fisher English Training Part 2, Speech (LDC 2005S13)

2002 Rich Transcription Broadcast News and Conversational Telephone Speech (LDC 2004S11)

NIST Meeting Pilot Corpus Speech (LDC 2004S09)

MSR2016 Language Training Data: CTS transcripts from the DARPA EARS program:

Switchboard (3M words),

BBN Switchboard-2 transcripts (850k),

Fisher (21M),

English CallHome (200k),

the University of Washington conversational Web corpus (191M).

Testing Data:

ISR2016 - Hub5 2000:

SwitchBoard Data: 2.1 hours with 21.4K words and 40 speakers

CallHome Data: 1.6 hours with 21.6K words and 40 speakers

MSR2016 appears to use the CallHome and Switchboard test data, but the article does not specify this.

Acoustic Model Training

ISR2016:

Recurrent nets with maxout activations trained with Hessian-free sequence discriminative training

Very deep convolutional networks, similar to VGG: cross-entropy training, possibly with NAG (Nesterov accelerated gradient)?

Bidirectional LSTM

MSR2016:

CNN variant — VGG

CNN variant — Residual Net

CNN variant — LACE (layer-wise context expansion with attention) model

Bidirectional LSTM

MSR2016 uses cross-entropy training plus lattice-free maximum mutual information (LFMMI) training.

Language Model Training

ISR2016:

vocabulary size is 85K

trained a 4-gram model with modified Kneser-Ney smoothing

linearly interpolated with weights chosen to optimize perplexity on a held-out set

Entropy pruning

LM Rescoring: model M (a class-based exponential model) and feed-forward neural network LM (NNLM)
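The ISR2016 interpolation step (weights chosen to optimize perplexity on a held-out set) can be sketched as follows. This is a minimal sketch using the standard EM update for mixture weights; I am not claiming it is the procedure IBM used. The per-word probabilities are random toy data, and both function names are hypothetical.

```python
import numpy as np

def mixture_perplexity(probs, weights):
    """Perplexity of the linearly interpolated model on the held-out words."""
    mixture = probs @ weights  # P(w_i) = sum_k weights[k] * probs[i, k]
    return float(np.exp(-np.mean(np.log(mixture))))

def estimate_interpolation_weights(probs, iterations=100):
    """EM for mixture weights; probs[i, k] is model k's probability of held-out word i."""
    num_models = probs.shape[1]
    weights = np.full(num_models, 1.0 / num_models)  # start from uniform weights
    for _ in range(iterations):
        weighted = probs * weights
        # E-step: posterior responsibility of each component model for each word.
        posteriors = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: the new weight is the average responsibility.
        weights = posteriors.mean(axis=0)
    return weights

# Toy held-out data: 5000 words scored by 3 hypothetical component LMs.
rng = np.random.default_rng(0)
held_out_probs = rng.uniform(0.001, 0.2, size=(5000, 3))
held_out_probs[:, 0] *= 2.0  # pretend the first component fits the held-out set best

uniform = np.full(3, 1.0 / 3.0)
tuned = estimate_interpolation_weights(held_out_probs)
print("uniform perplexity:", mixture_perplexity(held_out_probs, uniform))
print("tuned perplexity:  ", mixture_perplexity(held_out_probs, tuned))
print("tuned weights:     ", tuned)
```

EM never decreases the held-out likelihood, so the tuned weights give a perplexity no worse than the uniform starting point.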

MSR2016:

As described in the paper (bracketed numbers are the paper's own citations): an initial decoding is done with a WFST decoder, using the architecture described in [31]. The N-gram language model is trained and pruned with the SRILM toolkit [32]. The first-pass LM has approximately 15.9 million bigrams, trigrams, and 4-grams, and a vocabulary of 30,500 words, and gives a perplexity of 54 on RT-03 speech transcripts. The initial decoding produces a lattice with the pronunciation variants marked, from which 500-best lists are generated for rescoring purposes. Subsequent N-best rescoring uses an unpruned LM comprising 145 million N-grams. All N-gram LMs were estimated by a maximum entropy criterion as described in [33].

The CUED-RNNLM toolkit is used to train and score several RNN LMs: a forward-predicting RNNLM and a backward-predicting RNNLM.
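To illustrate what N-best rescoring with forward and backward RNNLMs looks like, here is a minimal sketch. The combination formula, the `lm_weight` and `rnn_share` values, and all scores are assumptions invented for the example; the summary above does not give the exact recipe Microsoft used, and a real list would hold up to 500 hypotheses.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: str
    acoustic_logprob: float
    ngram_logprob: float
    forward_rnn_logprob: float
    backward_rnn_logprob: float

def first_pass_score(h, lm_weight=10.0):
    """Score using only the acoustic model and the N-gram LM."""
    return h.acoustic_logprob + lm_weight * h.ngram_logprob

def rescored(h, lm_weight=10.0, rnn_share=0.5):
    """Assumed recipe: average the two RNNLM scores, then mix with the N-gram score."""
    rnn = 0.5 * (h.forward_rnn_logprob + h.backward_rnn_logprob)
    lm = (1.0 - rnn_share) * h.ngram_logprob + rnn_share * rnn
    return h.acoustic_logprob + lm_weight * lm

def rescore_nbest(nbest):
    """Return the hypotheses sorted best-first under the rescored objective."""
    return sorted(nbest, key=rescored, reverse=True)

# Two toy hypotheses with made-up log-probabilities.
nbest = [
    Hypothesis("i scream for it", -100.0, -12.0, -14.0, -14.0),
    Hypothesis("ice cream for it", -101.0, -12.5, -11.0, -11.0),
]
best_first_pass = max(nbest, key=first_pass_score)
best_rescored = rescore_nbest(nbest)[0]
print("first pass:", best_first_pass.words)
print("rescored:  ", best_rescored.words)
```

In this toy case the RNNLM scores overturn the first-pass ranking, which is the point of the second pass: a stronger LM gets to re-order a short list that the cheaper first-pass LM produced.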