From anonymity to ubiquity

Since its standardization in 1991, MP3 has gone from being a little-known portion of a video file format to the kind of ubiquity that most brands can only dream of having. It's both widespread, with small players flying off the shelves, and controversial, dropping from the lips of politicians and advocates for all sides of the intellectual property debate.

But what is MP3? Explanations usually take one of two forms. The long version, available in technical papers, is written in jargon and filled with math. The short version, often used by newspapers and nontechnical periodicals, simply states that the process eliminates parts of sound not normally heard by the human ear. But this one-sentence description raises more questions than it answers for any reasonably tech-savvy reader: how does it find those unheard sounds, and how does it get rid of them? What's the difference between the various bit rates and quality levels? If you're anything like me, you've often wanted to know the mechanics of MP3, but not to the point of writing your own encoder.

This guide attempts to explain the process of MP3 compression in simple terms, without oversimplifying it. Although some parts have been omitted, like the details of stereo encoding schemes and in-depth file composition, it covers the basic theory of turning uncompressed sound files into compressed MP3. In order to tour the MP3 codec without getting overwhelmed by the technical minutiae, we'll take a look at some of the background principles and legacy of MP3, then break the process down into analysis and compression before finally considering the impact that this humble format has had on digital audio.

Hear, hear

Depending on the number of concerts you've attended, your ears may be more or less healthy for your age. But even if they're in perfect shape, human hearing has a number of limitations. At best, tests generally show that we can hear frequencies from 20 to 20,000Hz. Our ears are most sensitive between 2kHz and 5kHz, and they can detect changes between frequencies in increments of about 2Hz, which is the effective "resolution" of hearing. As the average person gets older or the delicate cells of the ear are damaged by loud noise, high-frequency perception is reduced. In fact, most adults (myself included) have trouble hearing above 16kHz.

And these are just the physical limitations of the human ear. Our brains also play a role in filtering and analyzing the signals sent by the auditory nerve. The science of how we perceive sound is called psychoacoustics, and it has discovered a number of useful auditory effects. For example, one of my favorites is the Haas effect, which states that two identical sounds arriving within 30-40ms of each other from different directions will be perceived as a single sound coming from the direction of the first. It's often used in public address systems to reinforce the sound "from the stage," even if the loudspeakers are located farther to the side. MP3, like many other lossy audio compression schemes, relies heavily on these kinds of psychoacoustic effects to work its magic. In particular, it exploits the phenomenon of frequency masking.

Imagine two sounds with similar frequency profiles—say at 100Hz and 110Hz—but with different volume levels. If played by itself, the weaker sound is perfectly audible, but only the stronger will be heard if both are played simultaneously. The process of covering one frequency with another close (but not identical) frequency is called "masking." The degree to which frequencies can mask each other varies across the range of human hearing—our ears are less precise at the top and bottom of the audible spectrum. Loud transient signals (ones with very short duration) can also mask weaker signals for a short time, similar to the Haas effect. This type of masking is known as "temporal" masking and is also used in MP3 compression.
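To make the idea of frequency masking concrete, here is a minimal toy sketch in Python. It is not the psychoacoustic model used by MP3, and every name and number in it (the Tone class, the 10dB offset, the 30dB-per-octave slope) is an assumption chosen purely for illustration. It only shows the shape of the decision: a quiet tone close in frequency to a loud one falls under the loud tone's masking threshold, while an equally quiet tone far away does not.

```python
import math
from dataclasses import dataclass


@dataclass
class Tone:
    """A pure tone described by its frequency and loudness (hypothetical helper)."""
    freq_hz: float
    level_db: float


def masking_threshold_db(masker, freq_hz, offset_db=10.0, slope_db_per_octave=30.0):
    """Level below which a tone at freq_hz is assumed hidden by the masker.

    Toy model: the threshold sits offset_db below the masker at the masker's
    own frequency and drops linearly with distance measured in octaves.
    Both parameters are illustrative assumptions, not measured values.
    """
    distance_octaves = abs(math.log2(freq_hz / masker.freq_hz))
    return masker.level_db - offset_db - slope_db_per_octave * distance_octaves


def is_masked(probe, masker):
    """True if the probe tone falls under the masker's toy threshold."""
    return probe.level_db < masking_threshold_db(masker, probe.freq_hz)


loud = Tone(freq_hz=100.0, level_db=70.0)
quiet_near = Tone(freq_hz=110.0, level_db=40.0)   # close neighbor of the loud tone
quiet_far = Tone(freq_hz=1000.0, level_db=40.0)   # same loudness, far away

print(is_masked(quiet_near, loud))  # the 110Hz tone is hidden in this toy model
print(is_masked(quiet_far, loud))   # the 1000Hz tone stays audible
```

A real encoder measures distance on a perceptual frequency scale rather than in plain octaves, and it also has to account for temporal masking, but the underlying comparison is the same: anything that lands below the threshold is a candidate for being thrown away.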

Leftovers

Something else to keep in mind while looking at the techniques of MP3 is that it continues a compression legacy that has influenced its design. MP3 actually stands for "MPEG-1 Audio Layer 3." MPEG, in turn, stands for the Moving Picture Experts Group, which created the standard. MPEG video (and its successors, MPEG-2 and MPEG-4) is used all around us: DVDs use a modified version of MPEG-2, as does your digital TV signal.

Because MP3 is Layer 3 of the MPEG-1 audio specification, there are obviously two earlier audio layers, neither of which caught on in the consumer market (few of us listen to MP2s at home). Several features of MP3 may seem pointlessly complicated, or implemented in more steps than would seem strictly necessary, and these are often holdovers from the older designs. This legacy means that MP3 is not actually terribly elegant or streamlined.

Which is a great excuse for me as an author, honestly. So if you have trouble following the process laid out in this article, don't blame me for a poor explanation. Blame Layer 2 instead.