Many speech codecs use Linear Predictive Coding (LPC) to model the short term speech spectrum. For very low bit rate codecs, most of the bit rate is allocated to this information.

While working on the 700 bit/s version of Codec 2 I hit a few problems with LPC and started thinking about alternatives based on the masking properties of the human ear. I’ve written Octave code to prototype these ideas.

I’ve spent about 2 weeks on this so far, so thought I better write it up. Helps me clarify my thoughts. This is hard work for me. Many of the steps below took several days of scratching on paper and procrastinating. The human mind can only hold so many pieces of information. So it’s like a puzzle with too many pieces missing. The trick is to find a way in, a simple step that gets you a working algorithm that is a little bit closer to your goal. Like evolution, each small change needs to be viable. You need to build a gentle ramp up Mount Improbable.

Problems with LPC

We perceive speech based on the position of peaks in the speech spectrum. These peaks are called formants. To clearly perceive speech the formants need to be distinct, e.g. two peaks with a low level (anti-formant) region between them.

LPC is not very good at modeling anti-formants, the space between formants. Effectively an all pole filter, it can only explicitly model peaks in the speech spectrum. This can lead to unwanted energy in the anti-formants which makes speech muffled and hard to understand. The Codec 2 LPC postfilter improves the quality of the decoded speech by suppressing inter-formant energy.

LPC attempts to model spectral slope and other features of the speech spectrum which are not important for speech perception. For example “flat”, high pass, or low pass filtered speech is equally easy for us to understand. We can pass speech through a Q=1 bandpass or notch filter and it will still sound OK. However LPC wastes bits on these features, and gets into trouble with large spectral slope.

LPC has trouble with high pitched speakers where it tends to model individual pitch harmonics rather than formants.

LPC is based on “designing” a filter to minimise mean square error rather than the properties of the human ear. For example it works on a linear frequency axis rather than log frequency like the human ear. This means it tends to allocate bits evenly across frequency, whereas an allocation weighted towards low frequencies would be more sensible. LPC often produces large errors near DC, an important area of human speech perception.

LPC puts significant information into the bandwidth of filters or width of formants, however due to masking the ear is not very sensitive to formant bandwidth. What is more important is sharp definition of the formant and anti-formant regions.

So I started thinking about a spectral envelope model with these properties:

- Specifies the location of formants with just 3 or 4 frequencies.
- Focuses on good formant definition, not the bandwidth of formants.
- Doesn’t care much about the relative amplitude of formants (spectral slope). This can be coarsely quantised or just hard coded, e.g. voiced speech has a natural low pass spectral slope.
- Works in the log amplitude and log frequency domains.

Auditory Masking

Auditory masking refers to the “capture effect” of the human ear, a bit like an FM receiver. If you hear a strong tone, then you can’t hear slightly weaker tones nearby. The weaker ones are masked. If you can’t hear these masked tones, there is no point sending them to the decoder. So we can save some bits. Masking is often used in (relatively) high bit rate audio codecs like MP3.

I found some Octave code for generating masking curves (Thanks Jon!), and went to work applying masking to Codec 2 amplitude modelling.
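To give a feel for the shape of a masking curve, here is a toy Octave sketch: a single masker tone produces a roughly triangular mask in dB on a log frequency axis. The slopes and the 10 dB offset below are made-up numbers for illustration, not the curves from that code.

f   = 100:10:4000;                 % frequency axis, Hz
f_m = 1000; L_m = 60;              % one masker: a 60 dB tone at 1 kHz
d   = log10(f/f_m);                % log frequency distance from the masker
maskdB = (L_m - 10) + 30*d.*(d<0) - 20*d.*(d>=0);   % falls away on either side
semilogx(f, maskdB); hold on;
plot(f_m, L_m, 'rx');              % the masker itself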

Masking in Action

Here are some plots to show how it works. Let’s take a look at frame 83 from hts2a, a female speaker. First, 40 ms of the input speech:

Now the same frame in the frequency domain:

The blue line is the speech spectrum, the red the amplitude samples {Am}, one for each harmonic. It’s these samples we would like to send to the decoder. The goal is to encode them efficiently. They form a spectral envelope that describes the speech being articulated.
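To make the {Am} idea concrete, here is a toy Octave sketch: a synthetic voiced frame, then one magnitude sample per pitch harmonic. c2sim extracts {Am} properly inside Codec 2; the FFT-bin sampling below is just a crude stand-in, and the 200 Hz pitch is an assumed value.

Fs = 8000; N = 320; F0 = 200;            % 40 ms frame, assumed pitch of 200 Hz
L  = floor((Fs/2)/F0) - 1;               % number of harmonics, kept below Nyquist
n  = (0:N-1)';
s  = zeros(N,1);
for m = 1:L
  s += (1/m)*cos(2*pi*m*F0*n/Fs);        % falling spectral slope, like voiced speech
end
win  = 0.5 - 0.5*cos(2*pi*n/(N-1));      % Hanning window
Nfft = 512;
S    = fft(s.*win, Nfft);
bins = round((1:L)*F0*Nfft/Fs) + 1;      % FFT bin nearest each harmonic m*F0
AmdB = 20*log10(abs(S(bins)));           % the {Am} samples, in dB
plot((0:Nfft/2-1)*Fs/Nfft, 20*log10(abs(S(1:Nfft/2))), 'b'); hold on;
plot((1:L)*F0, AmdB, 'r+');              % blue: spectrum, red: {Am}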

OK, so let’s look at the effect of masking. Here is the masking curve for a single harmonic (m=3, the highest one):

Masking theory says we can’t hear any harmonics beneath the level of this curve. This means we don’t need to send them over the channel and can save bits. Yayyyyyy.

Now let’s plot the masking curves for all harmonics:

Wow, that’s a bit busy and hard to understand. Instead, let’s just plot the top of all the masking curves (green):

Better. We can see that the entire masking curve is dominated by just a few harmonics. I’ve marked the frequencies of the harmonics that matter with black crosses. We can’t really hear the contribution from other harmonics. The two crosses near 1500Hz can probably be tossed away as they just describe the bottom of an anti-formant region. So that leaves us with just three samples to describe the entire speech spectrum. That’s very efficient, and worth investigating further.
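Here is one plausible way to find the harmonics that dominate the combined mask (the black crosses): a harmonic matters if it sits above the mask generated by all the other harmonics. The triangular dB masks, their slopes, and the toy {Am} values below are assumptions for illustration, not the real newamp.m code.

Fs = 8000; F0 = 200; L = floor((Fs/2)/F0) - 1;
fHz  = (1:L)*F0;                          % harmonic frequencies
AmdB = 50 - 10*log10(1:L) ...
       + 12*exp(-((fHz-600)/200).^2) + 8*exp(-((fHz-2300)/300).^2);  % toy formants
maskdB = zeros(L, L);
for m = 1:L                               % mask generated by harmonic m ...
  d = log10(fHz/fHz(m));                  % ... on a log frequency axis
  maskdB(m,:) = (AmdB(m) - 10) + 30*d.*(d<0) - 20*d.*(d>=0);
end
combined = max(maskdB);                   % green line: top of all the masks
matters  = zeros(1, L);
for m = 1:L
  others = max(maskdB([1:m-1 m+1:L], m)); % mask from all the *other* harmonics
  matters(m) = AmdB(m) > others;          % audible above everyone else's mask?
end
plot(fHz, AmdB, 'r+'); hold on;
plot(fHz, combined, 'g');
plot(fHz(find(matters)), AmdB(find(matters)), 'kx');  % the samples that matter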

Spectral Slope and Coding Quality

Some speech signals have a strong “low pass filter” slope between 0 and 4000 Hz. Others have a “flat” spectrum – the high frequencies are about the same level as the low frequencies.

Notice how the high frequency harmonics spread their masking down to lower frequencies? Now imagine we bumped up the level of the high frequency harmonics, e.g. with a first order high pass filter. Their masks would then rise, masking more low frequency harmonics, e.g. those near 1500Hz in the example above. Which means we could toss the masked harmonics away, and not send them to the decoder. Neat. The only downside is the speech would sound a bit high pass filtered. That’s no problem as long as it’s intelligible. This is an analog HF radio SSB replacement, not Hi-Fi.

This also explains why “flat” samples (hts1a, ve9qrp) with relatively less spectral slope code well, whereas others (kristoff, cq_ref) with a strong spectral slope are harder to code. Flat speech has improved masking, leaving less perceptually important information to model and code.

This is consistent with what I have heard about other low bit rate codecs. They often employ pre-processing such as equalisation to make the speech signal code better.
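As a one-liner, that sort of pre-processing could look like a first order high pass (pre-emphasis) filter that lifts the high frequencies before coding. This isn’t something Codec 2 does today, and the 0.95 coefficient is a typical textbook value, not a tuned one.

s = randn(8000,1);                    % stand-in for a second of speech samples
s_hp = filter([1 -0.95], 1, s);       % y[n] = x[n] - 0.95 x[n-1]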

Putting Masking to Work

Speech compression is the art of throwing stuff away. So how can we use this masking model to compress the speech? What can we throw away? Well, let’s start by assuming only the samples with the black crosses matter. This means we get to toss quite a bit of information away. This is good. We only have to transmit a subset of {Am}. How, I’m not sure yet. Never mind that for now. At the decoder, we need to synthesise the speech, just from the black crosses. Hopefully it won’t sound like crap. Let’s work on that for now, and see if we are getting anywhere.

Attempt 1: Let’s toss away any harmonics that have a smaller amplitude than the mask (Listen). Hmm, that sounds interesting! Apart from not being very good, I can hear a tinkling sound, like trickling water. I suspect (but haven’t proved) this is because harmonics are coming and going quickly as the masking model puts them above and below the mask. Little packets of sine waves. I’ve heard similar sounds on other codecs when they are nearing their limits.

Attempt 2: OK, so how about we set the amplitude of all harmonics to exactly the mask level (Listen). Hmmm, sounds a bit artificial and muffled. Now I’ve learned that muffled means the formants are not well formed. Needs more difference between the formant and anti-formant regions. I guess this makes sense if all samples are exactly on the masking curve – we can just hear ALL of them. The LPC post filter I developed a few years ago increased the definition of formants, which had a big impact on speech quality. So let’s try….

Attempt 3: Rather than deleting any harmonics beneath the mask, let’s reduce their level a bit. That way we won’t get tinkling – harmonics will always be there rather than coming and going. We can use the mask instead of the LPC post filter to know which harmonics we need to attenuate (Listen).

That’s better! Close enough to using the original {Am} (Listen), however with lots of information removed.
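Here are the three attempts in a nutshell, using toy numbers. AmdB stands for the harmonic magnitudes and maskdB for the combined mask sampled at each harmonic frequency (e.g. from the sketch above); the vectors and the 6 dB attenuation are assumed values, and this is not the real newamp.m code.

AmdB   = [60 58 50 40 42 55 52 44 38 36];    % toy harmonic magnitudes (dB)
maskdB = [55 56 52 45 43 50 51 46 40 37];    % toy mask at each harmonic (dB)

% Attempt 1: throw away any harmonic below the mask
Am1dB = AmdB;  Am1dB(AmdB < maskdB) = -100;  % -100 dB ~ deleted

% Attempt 2: set every harmonic to exactly the mask level
Am2dB = maskdB;

% Attempt 3: keep all harmonics, but knock the masked ones down a bit
atten  = 6;                                  % attenuation in dB (assumed value)
masked = AmdB < maskdB;
Am3dB  = AmdB;  Am3dB(masked) = Am3dB(masked) - atten;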

For comparison here is Codec 2 700B (Listen) and Codec 2 1300 (aka FreeDV 1600 when we add FEC) (Listen). This is the best I’ve done with LPC/LSP to date.

The post filter algorithm is very simple. I set the harmonic magnitudes to the mask (green line), then boost only the non-masked harmonics (black crosses) by 6dB. Here is a plot of the original harmonics (red), and the version (green) I mangle with my model and send to the decoder for synthesis:

Here is a spectrogram (thanks Audacity) for Attempts 1, 2, and 3 for the first 1.6 seconds (“The navy attacked the big”). You can see the clearer formant representation with Attempt 3, compared to Attempt 2 (lower inter-formant energy), and the effect of the post filter (dark line in center of formants).
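The post filter itself is only a couple of lines. Reusing the same toy AmdB and maskdB vectors from the sketch above: set every harmonic to the mask, then lift the non-masked harmonics (the black crosses) by 6 dB to sharpen the formants.

AmdB   = [60 58 50 40 42 55 52 44 38 36];    % toy harmonic magnitudes (dB)
maskdB = [55 56 52 45 43 50 51 46 40 37];    % toy mask at each harmonic (dB)
Am_dB  = maskdB;                             % start with everything on the mask
crosses = (AmdB >= maskdB);                  % the non-masked harmonics
Am_dB(crosses) = Am_dB(crosses) + 6;         % boost them 6 dB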

Command Line Kung Fu

If you want to play along:

~/codec2-dev/build_linux/src$ ./c2sim ../../raw/kristoff.raw --dump kristoff

octave:49> newamp_batch("../build_linux/src/kristoff");

~/codec2-dev/build_linux/src$ ./c2sim ../../raw/kristoff.raw --amread kristoff_am.out -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q

The “newamp_fbf” script lets you single step through frames.

Phases

To synthesise the speech at the decoder I also need to come up with a phase for each harmonic. Phase and speech is still a bit of a mystery to me. Not sure what to do here. In the zero phase model, I sampled the phase of the LPC synthesis filter. However I don’t have one of them any more.

Let’s think about what the LPC filter does with the phase. We know that at resonance the phase shifts rapidly:

The sharper the resonance the faster it swings. This has the effect of dispersing the energy in the pitch pulse exciting the filter.

So with the masking model I could just choose the center of each resonance, and swing the phase about madly. I know where the center of each resonance is, as we found that with the masking model.
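If I ever code that up, it might look something like the sketch below: drop a second order all-pole resonator at each formant centre found by the masking model, and sample the combined phase response at the pitch harmonics. This is speculative, not Codec 2 code; the formant frequencies, pitch and pole radius are made-up values for illustration.

Fs = 8000; F0 = 200; L = floor((Fs/2)/F0) - 1;
wm = 2*pi*(1:L)*F0/Fs;                       % harmonic frequencies, rad/sample
formants = [600 2300];                       % centres from the masking model (toy values)
r = 0.95;                                    % sharper resonance = faster phase swing
phi = zeros(1, L);
for f = formants
  w0 = 2*pi*f/Fs;
  a  = [1 -2*r*cos(w0) r^2];                 % all-pole resonator denominator
  H  = 1 ./ (exp(-1j*wm(:)*(0:2)) * a(:));   % its response at each harmonic
  phi += angle(H)';                          % accumulate the phase swing
end
% phi(m) would then be paired with Am(m) when synthesising harmonic m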

Next Steps

The core idea is to apply a masking model to the set of harmonic magnitudes {Am} and select just 3-4 samples of that set that define the mask. At the decoder we use the masking model and a simple post filter to reconstruct a set of {Am_} that we use to synthesise the decoded speech.

Still a few problems to solve, however I think this masking model holds some promise for high quality speech at low bit rates. As it’s completely different to conventional LPC/LSP I’m flying blind. However the pieces are falling into place.

I’m currently working on i) how to reduce the number of samples to a low number; ii) how to determine which ones we really need (e.g. discarding inter-formant samples); and iii) how to represent the amplitude of each sample with a low or zero number of bits. There are also some artifacts with background noise and chunks of spectrum coming and going.

I’m pretty sure the frequencies of the samples can be quantised coarsely, say 3 bits each using scalar quantisation, or perhaps 8 bits/frame using VQ. There will also be quite a bit of correlation between the amplitudes and frequencies of each sample.
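A 3 bit scalar quantiser for one mask sample frequency might look like this toy sketch, uniform on a log frequency axis. The 200–3600 Hz range and the log spacing are assumptions; a quantiser trained on real speech (or the VQ sketched below) would do better.

levels = logspace(log10(200), log10(3600), 2^3);   % 8 reconstruction levels
f = 2300;                                          % frequency to quantise (Hz)
[~, idx] = min(abs(log(levels) - log(f)));         % nearest level on the log axis
f_hat = levels(idx);                               % quantised frequency, sent in 3 bits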

For voiced speech there will be a downwards (low pass) slope in the amplitudes, for unvoiced speech more energy at high frequencies. This suggests joint VQ of the sample frequencies and amplitudes might be useful.
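A minimal joint VQ search over stacked (frequency, amplitude) vectors would be along these lines. The random codebook is only a placeholder; a real one would be trained (e.g. with k-means) on vectors gathered from lots of speech, and the dimensions are assumed values.

K  = 8;                                   % 4 sample frequencies + 4 amplitudes
Nc = 256;                                 % 8 bit codebook (assumed size)
codebook = randn(Nc, K);                  % placeholder entries
x  = randn(1, K);                         % the vector to quantise
d  = sum((codebook - x).^2, 2);           % squared error to every entry
[~, index] = min(d);                      % index sent over the channel
x_hat = codebook(index, :);               % decoder's reconstruction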

The frequency and amplitude of the mask samples will be highly correlated in time (small frame to frame variations) so will have good robustness to bit errors if we apply trellis decoding techniques. Compared to LPC/LSP the bandwidth of formants is “hard coded” by the masking curves, so the dreaded R2D2 noises from LSPs getting too close due to bit errors might be a thing of the past. I’ll explore robustness to bit errors when we get to the fully quantised stage.