Our approach to APIA

When an ASR (Automatic Speech Recognition) system receives a speech signal, it tries to figure out the sequence of phonemes spoken, the duration of each phoneme, and its own confidence score (how sure the ASR is that a particular sound was spoken in that segment of the signal). The ASR does not output just the most probable sequence, but also other, less probable candidate sequences about which it is less certain. We use this information to extract features relevant for APIA (automatic pronunciation intelligibility assessment). We could simply match the recognized phoneme sequence against the true sequence for a word in context, but that would not let us generate relevant feedback for the speaker, nor would it help us figure out the exact nature of the errors committed by the speaker.

There are a few terms we will use in our approach.

N-best list: The list of sound sequences output by the ASR, ranked by the ASR's confidence in them. This list is called the N-best list, where N is chosen at the programmer's discretion.

Grammar: The ASR requires the set of sequences that are likely to be produced in a language, to reduce the list of possibilities (cut down the search space) and speed up the computation. A grammar can consist of sound sequences or word sequences; we will use sound sequences as our grammar since we operate within a word. The grammar can be stated in a Finite State Grammar format, where the probabilities of transitioning from the current sound to another are represented as arcs in a graph.
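
As a rough illustration (not any particular toolkit's actual format), such a grammar for the word CAT could be sketched in Python as a graph whose arcs carry transition probabilities; the probabilities below are placeholders:

# A minimal finite-state-grammar sketch for the word CAT.
# Keys are states (sounds); values map the next sound to the
# probability of taking that arc. All probabilities are illustrative.
FSG_CAT = {
    "<start>": {"K": 1.0},
    "K": {"AE": 1.0},
    "AE": {"T": 1.0},
    "T": {"<end>": 1.0},
}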

Our approach is to extract features relevant to APIA and then present them to a neural network along with ground-truth scores manually annotated by expert speakers. If you are not familiar with neural networks, you can safely think of one as a complex function that takes input values and produces relevant output once it has been exposed to a lot of data to learn from.

f(x) = y

The form of this function is generic enough to learn complex mappings by adjusting its parameters. During training, we present many examples of word features together with their intelligibility scores (between 0 and 1) to adjust the parameters of the neural network. At test time, when a speaker pronounces a word, we extract features from it and present them to the trained network, which produces the intelligibility score as output.
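
For readers who like code, here is a minimal sketch of such a regressor in PyTorch. The feature dimension, layer sizes, and training loop are arbitrary placeholders, not the configuration we actually use:

import torch
import torch.nn as nn

NUM_FEATURES = 7  # placeholder: one value per extracted feature

# A small feed-forward network mapping word features to a score in [0, 1].
model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),  # squashes the output into the (0, 1) range
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, scores):
    # features: (batch, NUM_FEATURES); scores: (batch,) expert annotations
    optimizer.zero_grad()
    predictions = model(features).squeeze(-1)
    loss = loss_fn(predictions, scores)
    loss.backward()
    optimizer.step()
    return loss.item()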

Let us now talk about the features themselves, since they are the most important part of our approach.

The features are extracted over several passes of recognition. To run a single pass, we need the audio and the grammar. For the word CAT, our grammar simply looks like this:

K AE T

This tells the ASR that the three sounds (phonemes) in the above grammar are present in the audio input. All it must do is find their locations, durations, and confidence scores. It must also output the less likely sequences as an N-best list. After the first pass, we know the alignment of the phonemes with the audio signal. Now we break the phoneme sequence into triphones (overlapping windows of three phonemes). For the given example:

SIL K AE
K AE T
AE T SIL

The SIL phoneme stands for SILENCE. For a longer word, we will have a longer list of triphones. Subsequent passes through the ASR are of three types.
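
Breaking a sequence into triphones is mechanical; a small Python sketch (assuming the phoneme sequence is already known from the first pass):

def make_triphones(phonemes):
    # Pad the sequence with silence, then slide a window of three over it.
    padded = ["SIL"] + phonemes + ["SIL"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

# For CAT: [('SIL', 'K', 'AE'), ('K', 'AE', 'T'), ('AE', 'T', 'SIL')]
print(make_triphones(["K", "AE", "T"]))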

Three types of feature extraction passes through the ASR

Substitution pass: The middle phoneme in each triphone is replaced by arbitrary phonemes of the language. So SIL K AE becomes SIL <some phoneme> AE. English has 39 phonemes, so we get 39 new substitution sequences. Each of these sequences is converted to a grammar, and the corresponding segment of audio is aligned with the new grammar. For each pass, we note the rank of the true phoneme in the N-best list. These ranks are normalized over all the passes, and we finally obtain a single number that becomes part of our list of features.
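
A sketch of the grammar generation for this pass; here PHONEMES is an abbreviated stand-in for the full 39-phoneme inventory, and rank_in_nbest is a hypothetical helper wrapping the actual ASR alignment and N-best lookup:

# Abbreviated stand-in for the full 39-phoneme English inventory.
PHONEMES = ["AA", "AE", "AH", "B", "D", "IY", "K", "S", "T"]

def substitution_grammars(triphone):
    # Replace the middle phoneme with every phoneme in the language.
    left, _, right = triphone
    return [(left, p, right) for p in PHONEMES]

def substitution_feature(triphone, audio_segment, rank_in_nbest):
    # Average the true phoneme's N-best rank over all substitution
    # passes; averaging is one plausible normalization, not
    # necessarily the exact one we use.
    ranks = [rank_in_nbest(g, audio_segment)
             for g in substitution_grammars(triphone)]
    return sum(ranks) / len(ranks)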

Insertion pass: Similarly, the insertion pass measures the likelihood that an arbitrary phoneme has been inserted within the correct sequence. So SIL <some phoneme> K becomes the new grammar; likewise, all the grammars are aligned, and the rank of the true phoneme is noted in the N-best list.
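
The corresponding grammar generator, under the same assumptions as the substitution sketch above:

def insertion_grammars(left, right):
    # Insert each candidate phoneme between two adjacent true phonemes.
    # Reuses PHONEMES from the substitution sketch above.
    return [(left, p, right) for p in PHONEMES]

# For the pair (SIL, K): [('SIL', 'AA', 'K'), ('SIL', 'AE', 'K'), ...]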

Deletion pass: This pass checks whether the ASR thinks that a phoneme is missing from the audio. Phonemes in the true sequence are omitted one at a time, and the resulting sequences are aligned with the audio segment.
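
And the deletion grammars can be sketched the same way:

def deletion_grammars(phonemes):
    # Omit one phoneme at a time from the true sequence.
    return [tuple(phonemes[:i] + phonemes[i + 1:])
            for i in range(len(phonemes))]

# For CAT: [('AE', 'T'), ('K', 'T'), ('K', 'AE')]
print(deletion_grammars(["K", "AE", "T"]))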

Further, we also obtain features inspired by the physiology of the human vocal tract. Researchers model the human vocal apparatus as a series of connected pipes of varying cross-sectional area, as in the figure below (image courtesy: Macquarie University). Please refer to Azu's blog for a detailed overview.

Representation of the human vocal tract as a pipe model

The features we use are:

Place of articulation: The place in the vocal tract where the cavity might be obstructed by the tongue, lips, or velum.

Closedness: This value indicates the proximity of the tongue to the roof of the mouth without creating a constriction.

Roundedness: This value indicates the shape of the lips while pronouncing a sound; for example, /o/ and /i/ cause the lips to take different shapes.

Voicing: Try pronouncing /aa/ with your hand placed on your neck, and notice the vibrations. Now repeat while just hissing (pronouncing /s/), and the vibrations disappear. Those vibrations are your vocal folds vibrating, and this vibration is called voicing.

We can predict these values for each phoneme in context using a lookup table or a predictor such as a neural network.
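
A lookup table for these articulatory features might look like the following sketch; the entries are illustrative simplifications, not a complete or authoritative phonetic table:

# Illustrative articulatory lookup table (abbreviated and simplified).
ARTICULATORY = {
    #        place         closedness  rounded  voiced
    "K":  ("velar",        None,       False,   False),
    "T":  ("alveolar",     None,       False,   False),
    "S":  ("alveolar",     None,       False,   False),
    "AE": ("front vowel",  "open",     False,   True),
}

def articulatory_features(phoneme):
    # Fall back to an all-unknown entry for phonemes not in the table.
    return ARTICULATORY.get(phoneme, (None, None, None, None))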

Once we have all the features (ASR passes + physiological features), we have enough information to deduce the types of errors made by the speaker as compared to the ground truth. We can now provide constructive feedback to the speaker about the exact nature of their errors. We can even pass this information to a 3-D model of the vocal tract and ask the speaker to correct themselves using visual feedback. KTH Royal Institute of Technology is working towards developing such interactive models of the vocal tract (See: link).