Speech Recognition: You down with CTC?

Demystifying the Connectionist Temporal Classification Loss

You may have heard that speech recognition nowadays does away with everything that’s not a neural network. “Not a neural network” might be a matter of semantics, but much of that philosophy comes from a cost function called the CTC loss. You see, in the old days, everyone who wanted to train neural networks used a two-step procedure: (1) initialize by pre-training your neural network and (2) fine-tune it through backpropagation. The CTC loss function, by contrast, operates under the assumption that we don’t need the pre-training step at all. And this little cost function can get away with that because of a big thing called data.

“Yes…yes, it’s true. I don’t need pre-training,” said Big Data.

In fact, adoption of a pure neural network approach with CTC is widespread. It’s what the rich kids (Baidu, Google, and Microsoft) and the poor kids (university graduate students) use. Perhaps you’ve heard of CTC, but chances are you still don’t know what the acronym stands for (it’s Connectionist Temporal Classification).

If you fall into that latter group, the beginner-to-intermediate category of deep learning practitioners, you might find this blog post worth reading. It’s meant to give you a brief introduction to, and some intuition behind, modern speech recognition solutions for the masses. Along the way, hopefully you’ll also start to understand how the CTC loss function works.

Background: Speech Recognition Pipelines

Typical speech processing approaches use a deep learning component (either a CNN or an RNN) followed by a mechanism to ensure that there’s consistency in time (traditionally an HMM).

The deep learning component predicts what’s being uttered, i.e., a phoneme, frequently based on a single 10 ms frame of the input signal. The input signal may be a spectrogram, Mel features, or the raw waveform. This component corresponds to the light blue boxes in Diagram 1.

The time consistency component deals with rate of speech as well as what’s being said overall. That’s the purple box in Diagram 1.

Diagram 1: CTC loss (purple) across several NN (blue) outputs

Since it’s really hard to predict what’s in a 10 ms frame without context, this second, time-consistency component is necessary. Its job is to inform our predictions using previous or future frames. It is in this component that the community has made the wholesale change from Hidden Markov Models (HMMs) to Recurrent Neural Networks (RNNs).

Behind the face of RNNs lies a peculiar cost function called the Connectionist Temporal Classification (CTC) loss, initially presented at ICML in 2006 by Alex Graves and Co. This is quite significant since the HMM (the RNN & CTC predecessor) has been used for speech processing since forever, before and even after neural networks got hot. What does the CTC loss function do? We’ll get to that, but first we need to know why we need it.

The Problem

Let’s take a closer look at Diagram 1, with a blow-up of the neural network in Diagram 2. At each time step, the neural network predicts an output vector we call y. Each dimension of y indicates how strongly we believe a particular utterance is present given an audio frame. For example, let’s say the 4th dimension of the output vector y (denoted by y₄) represents the phoneme “/y/”.

Diagram 2: Each frame prediction is called y.

If y₄ has a value of 0.84 and all the other dimensions are small (like 0.02), my neural network is pretty sure I’m saying “/y/”. I’m essentially predicting my outputs given my inputs: p( y | x ) = Neural-Network( x ), where x is my input frame.
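As a toy illustration (the phoneme set and the numbers here are made up, not from any real model), a single frame’s prediction is just a softmax over the possible outputs:

```python
import numpy as np

# Made-up scores from the network for one 10 ms frame x.
phonemes = ["_", "/h/", "/oo/", "/y/", "/j/"]  # "_" is the blank symbol
logits = np.array([0.2, 0.5, 0.1, 3.2, -0.3])  # hypothetical network outputs

y = np.exp(logits) / np.exp(logits).sum()      # softmax: p(y | x)
print(dict(zip(phonemes, y.round(2))))         # the "/y/" entry is about 0.84
```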

Nothing new, just neural networks 101. But wait, maybe the word I’m saying is “huge”, and that frame actually sits at the start of the word, right next to the “/y/”, so it should really be “/h/”. Unless I’m Donald Trump (I’m not), the “/y/” phoneme is incorrect. In this case, context actually buys us a lot.

If you wanted to guess right, you’d probably need some context.

Rather than predicting each frame one at a time, what would be more informative is to predict the sequence of outputs that makes the most sense, {y[0], y[1], …, y[T]}, where y[t] is a vector over all possible phonemes at time t. So let’s line up the outputs from Diagram 2 in a row over five time steps, shown in Diagram 3. The connected line through one node at each time step is called a path, often represented by the Greek symbol π.

Diagram 3: Activations over time

The problem is then to optimize for the best path π. In math, you’re optimizing the below equation.
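In Graves’ notation, where π_t is the symbol the path passes through at time t and y^t is the network output at that time, the equation is:

$$p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t}$$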

That’s just a compact way of saying that the probability of a particular path is the product of the softmax outputs it passes through over all T time steps. Now, you might be wondering: doesn’t recurrence or convolution take care of that? Just by using a CNN or RNN, you’re already considering past outputs, right?

True, but you might also notice in Diagram 3 that “/oo/” appears twice, and the repetition doesn’t contribute to the meaning of the phrase. That matters, because rate of speech should be discounted when recognizing words. Instead of “_-h-yoo-oo-oo-j-_”, we really want “_-h-yoo-oo-j-_”, with the repeated “oo” deleted. This is especially noticeable when we consider that silence/blank/repeat/transition characters actually make up a lot of speech. In fact, we’ll often have something that looks like:

___bbbb__eeeee_

which in reality, we should perceive as “be”.
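To make that collapsing rule concrete, here’s a minimal sketch of CTC’s many-to-one mapping (Graves calls it B): merge repeated symbols, then drop the blanks. The function name and the “_” blank symbol are my choices for illustration.

```python
def collapse(path, blank="_"):
    """CTC's many-to-one map: merge consecutive repeats, then remove blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

print(collapse("___bbbb__eeeee_"))            # ['b', 'e']
print(collapse("_ h yoo oo oo j _".split()))  # ['h', 'yoo', 'oo', 'j']
```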

There are two things you can do about this.

1. Label every single time step, and where the phoneme changes
2. Just do an overall labeling and make the network figure it out

The first option is unpalatable; I’m okay with a little bit of annotation, but this one’s going to cut into my nights trying to figure out who shot Longmire at the beginning of season 5 (I think it was the psychiatrist). The second option is where the CTC loss comes in. Instead of just optimizing for the path π, we also optimize for the label. Let’s call that label l. Now, we’re going to predict the best label among all paths, the CTC objective:
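In symbols, writing $\mathcal{B}$ for the collapsing map sketched above (so $\mathcal{B}^{-1}(l)$ is the set of all paths that collapse to $l$):

$$p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)$$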

The above equation is really just saying that the probability that I’ve said label l is the sum, over all the possible paths that collapse to it, of each path’s probability (i.e., marginalizing out π). If you want to be fancy, you’d call this step dynamic time warping. If not, just call it getting rid of crap that’s repetitive.
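If you want to see that marginalization spelled out, here’s a brute-force sketch that literally enumerates every path and sums up the ones that collapse to the label. It’s exponential in T, so it’s purely for intuition (the forward-backward algorithm below does the same job efficiently); it reuses the collapse function and phoneme list from earlier, and here y is a (T, K) matrix with one softmax row per frame.

```python
import itertools
import numpy as np

def ctc_brute_force(y, label, symbols, blank="_"):
    """p(l | x) by summing over every length-T path that collapses to label."""
    T = y.shape[0]
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=T):
        if collapse([symbols[i] for i in path], blank) == list(label):
            total += np.prod([y[t, i] for t, i in enumerate(path)])
    return total

# e.g., with a (T, K) matrix of frame-wise softmax outputs:
# ctc_brute_force(y, ["/h/", "/y/", "/oo/", "/j/"], phonemes)
```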

It can’t be that simple, right? Actually, according to scientists at Baidu, it really is. They are literally predicting Mandarin characters (the labels l) directly from the input audio sequence.

Solving the Label Problem

Okay, so now we know the problem (i.e., the CTC objective). What’s the solution (i.e., how do we optimize)? As in the case of HMMs, we can use dynamic programming. In fact, we can optimize using the analogous forward-backward algorithm.

Diagram 4: Decoding the word “CAT”. Taken from Graves et al.

The rows in Diagram 4 are the labels of the word “CAT”, interspersed with “blank” or “transition” symbols. Each column is a time step. At each time step (again, each column), we’re picking which symbol could come next, which is shown by the arrows. It could be one of three things:

My current label

A transition to the next label

The next predicted label

Well, how do I choose which one of these things to do? That’s where the forward-backward algorithm comes in.

There’s some math, but the basic gist of the forward-backward algorithm is that you’re calculating p(l|x) with the help of two functions, α(t) and β(t). Going forward, you calculate the forward variable α(t). Going backwards, you calculate β(t), and the probability of a label given your input is:
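In Graves’ notation, with l′ the label interspersed with blanks (the rows of Diagram 4) and s indexing its symbols, this holds at any time step t:

$$p(l \mid x) = \sum_{s=1}^{|l'|} \frac{\alpha_t(s)\,\beta_t(s)}{y^{t}_{l'_s}}$$

(The division by $y^{t}_{l'_s}$ is there only because both α and β count the emission at time t.)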

Both α(t) and β(t) are built from the same neural network outputs y that appear in the earlier equations; the exact recursions are in Graves’ paper. To choose which label comes next, you’re calculating α(t) and β(t). The maximum of p(l|x) will tell you what phoneme is being uttered. But more importantly, the key to tractability is that you only consider the paths with “good” labels. In other words, if you’re not a node with an arrow coming in and going out of you in Diagram 4, then you’re not being considered.
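If you’d like to see those “three things” and the pruning in code, here’s a minimal sketch of the forward (α) recursion, under my own naming rather than any official implementation: y is a (T, K) matrix of softmax outputs, labels is the target sequence without blanks, and blank is the blank index. In practice you’d do all of this in log space to avoid underflow.

```python
import numpy as np

def ctc_forward(y, labels, blank=0):
    """Toy CTC forward pass: returns p(l | x) for one utterance."""
    # Expand the label with blanks: e.g. [C, A, T] -> [_, C, _, A, _, T, _]
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    T, S = y.shape[0], len(ext)

    alpha = np.zeros((T, S))
    # A path can only start on the first blank or the first real label.
    alpha[0, 0] = y[0, ext[0]]
    alpha[0, 1] = y[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]              # 1) stay on the current symbol
            if s > 0:
                a += alpha[t - 1, s - 1]     # 2) move in from the previous symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]     # 3) skip the blank between two different labels
            # Nodes unreachable by any arrow keep alpha = 0 and are never considered.
            alpha[t, s] = a * y[t, ext[s]]

    # Valid paths must finish on the last label or the trailing blank.
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
```

Graves’ β recursion is the mirror image, run backwards from the end of the utterance.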

Final Touches

But wait. I just said that α(t) and β(t) depend on the output of the neural network, y. Doesn’t that mean that when I optimize my neural network, α(t) and β(t) will change? Well, yes, but that’s what backpropagation is for! We just need the gradient of p(l|x) with respect to its input y, which happens to be exactly what the neural network gives you up to that point. We can derive it from the equation I just described, but it’s easier just to take equation 15 from Graves’ paper. I’m not going to include it here because if you’re serious about implementation, you’ll probably read his paper anyway. ;-)

That’s All Folks!

“You were supposed to say, yeah you know me. Like the h…”

I’ve skipped a whole bunch of steps, but hopefully this gives you a bit of an overview and some intuition about how the CTC loss works. Since I tried to keep this short, any feedback is welcome! Please find me at kni@iqt.org or karllab41 on Twitter.