Decoding Speech

Posted January 1, 2010

DAVID LEVIN: You're listening to a NOVA podcast. I'm David Levin.

Take a listen to these audio clips and see if you can tell what everybody's saying.

VARIOUS VOICES: I put on my hat.

DAVID LEVIN: Yep, it's "I put on my hat." Easy, right? Well, maybe for you, but for a computer, recognizing that takes some serious processing power. Humans are pros when it comes to recognizing speech – our brains can filter out the noise of a busy street, or the tinny sound of a phone call. But how can a computer decipher spoken words?

VLAD SEJNOHA: That's actually a very simple question with a complex answer.

DAVID LEVIN: Vlad Sejnoha is the vice president and chief scientist for Nuance Communications, a company that specializes in speech recognition software.

VLAD SEJNOHA: What we do, just in a nutshell, is first process the audio into a form that we can then manipulate by computer programs. We don't care, necessarily, about what speaker was speaking. We don't care about the pitch, the emotional content – we just want the words.

DAVID LEVIN: To get to those, it's necessary to filter out background noise, or variations in a person's intonation or pitch that could fool the computer. Essentially, their voice is whittled down to its basic frequencies, and that's what the computer looks at. Not the words themselves, but the individual sounds they're made of.

VLAD SEJNOHA: Different sounds have different frequency energy profiles. So something like an "s" sound has a lot of energy in the upper frequencies, whereas some kind of a low vowel has a lot of energy and a lot of harmonic structure, like a musical tone, at the lower frequencies.
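
To make the idea of a frequency energy profile concrete, here is a minimal Python sketch. It is not Nuance's method; it assumes a 16 kHz sample rate, a 20-millisecond frame, and a 3,000 Hz dividing line between "low" and "high," and it stands in for real speech with synthetic tones: a low tone with harmonics for the vowel, and a cluster of high tones for the "s." The function names are made up for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)
FRAME_SIZE = 320     # 20 ms of audio at 16 kHz


def tone_mix(freqs_hz):
    """Build one frame of audio as a sum of sine waves (synthetic stand-in for speech)."""
    return [
        sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f in freqs_hz)
        for n in range(FRAME_SIZE)
    ]


def band_energy(frame, lo_hz, hi_hz):
    """Sum the squared DFT magnitudes for bins whose frequency is in [lo_hz, hi_hz)."""
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * SAMPLE_RATE / n
        if lo_hz <= freq < hi_hz:
            # Naive discrete Fourier transform of a single bin (slow but dependency-free).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            total += abs(coeff) ** 2
    return total


def high_band_share(frame, split_hz=3000):
    """Fraction of the frame's energy that sits at or above split_hz."""
    low = band_energy(frame, 0, split_hz)
    high = band_energy(frame, split_hz, SAMPLE_RATE / 2 + 1)
    return high / (low + high)


vowel_like = tone_mix([200, 400, 600])  # low tone plus two harmonics
s_like = tone_mix([5000, 6000, 7000])   # energy concentrated in the upper frequencies

print("vowel-like high-band share:", round(high_band_share(vowel_like), 3))
print("s-like high-band share:", round(high_band_share(s_like), 3))
```

In this toy setup, almost none of the vowel-like frame's energy lands above the dividing line, while almost all of the s-like frame's energy does. A real recognizer would use a much richer description of the spectrum than a single ratio.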

DAVID LEVIN: These sounds are called phonemes, and they're the building blocks of spoken language. So the word hat –

MALE VOICE: Hat

DAVID LEVIN: – is really made up of three distinct noises.

MALE VOICE: Hhh, AAA, tuh

VLAD SEJNOHA: They can vary considerably. There are some sounds which are extremely brief. The shortest could be 10, 20 milliseconds long. "Ah," could be quite pronounced. So there's a big variation.

DAVID LEVIN: Now that all these bits of sound have been isolated, the next step is figuring out what they are and which words they form.

VLAD SEJNOHA: This is called the search part of speech recognition, where we effectively search through our collection of models and try to find an explanation of the input.

DAVID LEVIN: The computer isn't identifying words directly. Instead of saying, "Ah-ha, that word sounds a lot like 'hat,'" it works sort of backwards.

VLAD SEJNOHA: At the root this is done blindly, where you are setting out to basically try every possible combination of all the words in English, or whatever language you're dealing with, and evaluating against the input.

DAVID LEVIN: Think of it this way. Let's say you're the computer, and you're asked to identify the color of a piece of paper.

VLAD SEJNOHA: So the way the recognition engine actually does it is not by looking at that piece and saying, "I somehow infer that this is red."

DAVID LEVIN: Instead, it takes the paper you've shown it and compares it to a huge database that's filled with samples.

VLAD SEJNOHA: They're little mathematical models.

DAVID LEVIN: And each one represents a different word – or in this case, color. So it looks at the paper you've shown it, then compares it to all those models in the database. Looks at the paper, looks in the database –

VLAD SEJNOHA: And says, hmm, is the input like this one? No. Is the input like this one? No. Is the input like this one? Oh, yeah, that's a pretty good match, and we look at the one that matches the best and say, hey, that's labeled red. So we have determined that the input was red.
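
The paper-and-database comparison can be sketched in a few lines of Python. This is only an illustration of "compare the input to every model and keep the best match." The models here are plain RGB values and the score is a squared distance, both assumptions made for the sketch; the models Sejnoha describes are mathematical models of sounds and words, and the names below are hypothetical.

```python
# Hypothetical "database" of labeled models: each color is a single RGB value.
COLOR_MODELS = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def distance(a, b):
    """Squared distance between two RGB values; smaller means a closer match."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(sample, models):
    """Compare the input against every model and return the label of the closest one."""
    return min(models, key=lambda label: distance(sample, models[label]))


paper = (230, 30, 40)  # the piece of paper shown to the computer
print(best_match(paper, COLOR_MODELS))  # prints "red"
```

The program never reasons that the paper "is" red. It only reports which stored model the input resembles most, which is the point of the analogy.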

DAVID LEVIN: Once the computer has individual words identified, it can start trying to figure out how they fit together. To do that, it relies on what are called language models. They help piece together sentences based on context.

VLAD SEJNOHA: The language models embody or represent our knowledge of how language works in the language model's view of the world. "It's so cold today, I had to wear a hat," would be much more likely than "I have to wear my cat." Although wearing one's cat might actually be a reasonable thing to do when it's minus two. [Laughs]

DAVID LEVIN: Identifying words by their context is especially handy for dealing with words that sound the same, like "sun," the thing in the sky, and "son," somebody's male child.

VLAD SEJNOHA: That's where the language model really kicks in because the probability of the S-O-N spelling will be very different from the S-U-N spelling. You know, for example, if the preceding words are "my" and "dear," the likelihood that the next word is S-O-N is much higher than S-U-N.
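
Here is a rough Python sketch of how the two preceding words could tip the choice between the spellings. It assumes a tiny table of three-word counts; the numbers are invented for illustration and do not come from any real language model, and the function names are hypothetical.

```python
from collections import Counter

# Invented counts of three-word sequences, standing in for a trained language model.
TRIGRAM_COUNTS = Counter({
    ("my", "dear", "son"): 40,
    ("my", "dear", "sun"): 1,
    ("the", "bright", "sun"): 30,
    ("the", "bright", "son"): 2,
})


def next_word_probability(prev2, prev1, word, candidates):
    """Estimate how likely `word` is, among the candidates, after the two preceding words."""
    total = sum(TRIGRAM_COUNTS[(prev2, prev1, c)] for c in candidates)
    if total == 0:
        # Context never seen: treat every candidate as equally likely.
        return 1 / len(candidates)
    return TRIGRAM_COUNTS[(prev2, prev1, word)] / total


def pick_spelling(prev2, prev1, candidates):
    """Choose the candidate the toy language model finds most probable in this context."""
    return max(candidates, key=lambda w: next_word_probability(prev2, prev1, w, candidates))


homophones = ["son", "sun"]
print(pick_spelling("my", "dear", homophones))     # prints "son"
print(pick_spelling("the", "bright", homophones))  # prints "sun"
```

The audio is identical in both cases. Only the surrounding words change, and with them the probabilities the toy model assigns to each spelling.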

DAVID LEVIN: Sejnoha says that language models can also help figure out phrases that sound similar but are totally different in meaning. One classic example is "It's easy to recognize speech," which can be misheard as "it's easy to wreck a nice beach."

But if the computer sees that most of the words we've said in the last few sentences have to do with language and not destroying coastlines, it might lean toward the former choice. Still, Sejnoha says that although these systems are good, they're far from perfect.

VLAD SEJNOHA: There are a number of problems in speech recognition which really aren't solved and are kind of the holy grail.

DAVID LEVIN: Some things that would be easy for humans, like understanding three or four people talking around a table, are still well beyond what software can handle.

VLAD SEJNOHA: I mean, we've all been on teleconferences that have been hard to parse even for humans. So that represents really sort of the toughest speech recognition problem, and we're just nowhere near solving it.

DAVID LEVIN: So next time you're stuck talking to a robot when you call your insurance company, maybe you'll be a little more forgiving. After all, the job's not easy.

For NOVA, I'm David Levin.