Analyze this!

So let's start with a regular sound file stored as uncompressed PCM audio. We'll assume, for the sake of argument, that the source and the MP3 both use the same sample rate—MP3 supports several rates but typically uses the CD-standard 44.1kHz.

The first step is to group these samples into "frames," each of which contains 1152 samples. Why 1152? It's another accommodation for backwards compatibility with Layer 2. Technically, Layer 3 frames are split into two "granules" of 576 samples each. This is something of a kludge, and one that was simplified in the newer standard: when MPEG-2 was created, its audio layer used only one of these granules per frame. For the purposes of encoding, MP3 really only acts on a single granule at a time, although it may use parts of the previous and following granules to get a wider view of change over time.
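As a rough sketch of that bookkeeping (the function name and the zero-padding behavior here are my own invention for illustration, not anything from the spec), here's how one second of PCM audio breaks down into granules:

```python
# Sketch of how PCM samples are grouped for encoding (names are illustrative).
FRAME_SIZE = 1152     # samples per MP3 frame
GRANULE_SIZE = 576    # two granules per frame

def split_into_granules(samples):
    """Chop a list of PCM samples into 576-sample granules,
    zero-padding the tail so the last granule is full."""
    granules = []
    for start in range(0, len(samples), GRANULE_SIZE):
        chunk = samples[start:start + GRANULE_SIZE]
        chunk = chunk + [0] * (GRANULE_SIZE - len(chunk))  # pad the final chunk
        granules.append(chunk)
    return granules

pcm = [0] * 44100            # one second of silence at 44.1kHz
granules = split_into_granules(pcm)
# 44100 / 576 is about 76.6, so with padding we get 77 granules
```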

Now those samples are run through a filterbank that divides the sound into a set of 32 frequency ranges (in other audio applications, we'd call each of these a "bandpass filter," since it lets only a specific band of frequencies through). This is another concession to Layer 2, which actually used those 32 values for its encoding. One of the advances of Layer 3 is that it subsequently subdivides each of those 32 frequency bands into 18 parts, creating 576 smaller bands. Each of these bands, therefore, contains 1/576th of the frequency range of the original samples.
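The arithmetic of that subdivision can be sketched like this. A real encoder does the work with a polyphase filterbank followed by a transform; the names here, and the assumption of perfectly equal-width bands, are just for illustration:

```python
# Band bookkeeping only: a real encoder uses a polyphase filterbank,
# but the counts work out like this (names are illustrative).
SUBBANDS = 32           # Layer 2-compatible filterbank outputs
LINES_PER_SUBBAND = 18  # Layer 3's finer subdivision
TOTAL_LINES = SUBBANDS * LINES_PER_SUBBAND  # 576 frequency lines

SAMPLE_RATE = 44100        # CD-standard rate, for illustration
NYQUIST = SAMPLE_RATE / 2  # highest representable frequency

def line_width_hz():
    """Each of the 576 lines covers an equal slice of the 0..Nyquist range."""
    return NYQUIST / TOTAL_LINES

# At 44.1kHz, each line is about 38.3 Hz wide
```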

At this stage, a set of two parallel processes takes place: the Modified Discrete Cosine Transform (MDCT) and Fast Fourier Transforms (FFT). The math for these is complicated, but their functions can be explained without having to show our work.

The FFTs are used as analysis functions, turning each frequency band into information that can be fed into the encoder's psychoacoustic model—a kind of virtual human ear. The encoder uses that model to answer a few questions: are there sounds in each band below the masking threshold (i.e., sounds that will be hidden by louder sounds at nearby frequencies)? Is the audio fairly constant, or does it change? Are there any sharp transient sounds that need to be preserved, and which might mask other sounds just before or after them? This information will be used during compression to figure out which information can be safely discarded, since (according to the masking behavior of the psychoacoustic model) our ears would ignore it anyway.
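To make the masking idea concrete, here is a deliberately crude toy model. It bears no resemblance to the real psychoacoustic model in the spec; it just captures the core intuition that a quiet spectral line sitting next to a much louder one can be ignored. The function name, threshold, and neighborhood size are all invented for this sketch:

```python
import math

# Toy illustration of masking (not the real psychoacoustic model):
# a loud spectral line hides quieter content at nearby frequencies.
def toy_masking(magnitudes, drop_db=20.0, spread=2):
    """Flag bins whose level sits more than drop_db below a
    louder neighbor within `spread` bins on either side."""
    masked = []
    for i, m in enumerate(magnitudes):
        neighbors = magnitudes[max(0, i - spread):i] + magnitudes[i + 1:i + 1 + spread]
        loudest = max(neighbors, default=0.0)
        # 20*log10 ratio: how far below the loudest nearby line is this bin?
        below = loudest > 0 and m > 0 and 20 * math.log10(loudest / m) > drop_db
        masked.append(below)
    return masked

spectrum = [0.001, 1.0, 0.002, 0.5, 0.4]
flags = toy_masking(spectrum)
# The two tiny bins next to loud ones get flagged as maskable
```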

Before going into the MDCT on the other side of the parallel process, the samples are sorted into different "window" patterns based on whether they contain steady or transient sound. MP3 allows frequency bands to be described using either one long window or three short windows. Constant sound without much change over time can be expressed using the long window. Transient noises, like drum hits or vocal consonants, are described across three short windows (each containing 192 samples, or about 4 milliseconds at 44.1kHz).
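A real encoder bases this decision on the psychoacoustic analysis from the FFT stage; as a simplified stand-in, one could just compare energy between consecutive short blocks and switch to short windows when it jumps sharply. The names and the threshold below are illustrative:

```python
# Simplified window decision (real encoders use the psychoacoustic
# analysis; this just compares short-term energy between blocks).
SHORT_WINDOW = 192  # samples per short window; three of them cover a long one

def choose_window(samples, jump_ratio=4.0):
    """Return 'short' if energy jumps sharply between consecutive
    192-sample slices (a transient), otherwise 'long'."""
    energies = []
    for start in range(0, len(samples) - SHORT_WINDOW + 1, SHORT_WINDOW):
        block = samples[start:start + SHORT_WINDOW]
        energies.append(sum(s * s for s in block))
    for prev, cur in zip(energies, energies[1:]):
        if prev > 0 and cur / prev > jump_ratio:
            return "short"
    return "long"

steady = [0.5] * 576                       # constant tone-like signal
drum_hit = [0.01] * 384 + [0.9] * 192      # quiet, then a sudden attack
```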

The MDCT turns each windowed band into a set of spectral values. Unlike the initial audio, which represents sound as the position of a waveform over regularly collected samples, spectral analysis looks at sound as energy across the range of frequencies.
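The transform itself can be written straight from its textbook definition. This is the direct O(N²) form; real encoders use fast algorithms, and they apply the window function before transforming, which this sketch omits:

```python
import math

# Direct MDCT from the standard definition: 2N time samples in,
# N spectral values out. Illustrative only; encoders use fast versions.
def mdct(x):
    n2 = len(x)   # window length (2N)
    n = n2 // 2   # number of output spectral lines (N)
    out = []
    for k in range(n):
        s = 0.0
        for i in range(n2):
            s += x[i] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
        out.append(s)
    return out

# A 36-sample input (one long MP3 window per subband) yields 18 spectral lines
coeffs = mdct([0.0] * 36)
```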

In this spectral view of a sound file, frequencies with more energy are shown as brighter patches. The lowest frequencies are at the bottom, and the highest at the top. Time moves from left to right.
Because spectral information bears more of a resemblance to the way our hearing interprets audio, many compressed-audio encoders use it when deciding which psychoacoustic information to remove, instead of operating on the sampled waveform. Once the MDCT finishes its math, the MP3 process has 576 "frequency bins" to work with, each containing the spectral intensity for 1/576th of the total frequency range.

Now that the encoder has both the spectral information and the psychoacoustic analysis of the granule, it starts the actual compression process.