For a long time now, I’ve been wanting to write an article about how to properly encode an MP3 file for use in a podcast. Something so fundamental to the medium turns out to be surprisingly difficult: most easy-to-use tools are expensive, and the free tools are overcomplicated and confusing to set up.

Recently I got a Pinecast support ticket from a customer with quite large MP3 files. The files were larger than his plan would allow, and he was out of surge. When I made some suggestions about his encoding settings, I discovered that the tool he was using didn’t even have those options! Beyond my surprise that a paid piece of software lacked these features, I was astounded that there are no good, free pieces of software that make this process simple. The software that I did find wasn’t cross-platform, and couldn’t be used by my customer anyway.

I set out to create a tool to do this work for you, now released in beta as the Pinecoder, but I needed to know how exactly we should be crafting these MP3s. I wanted to have data to back up my choices for what a properly encoded MP3 looks like. To do that, I crawled the list of top podcasts on iTunes (or Apple Podcasts, depending on who you ask) and did a bit of an analysis.

Getting Data to Analyze

First, I scraped the iTunes charts with a simple bit of JavaScript in my devtools console:

Array.from($('[target=itunes]')).map(a => a.href)

This yielded the URLs of each show’s iTunes page. I copied the resulting JSON to a text file. An iTunes podcast URL looks like this:

https://itunes.apple.com/us/podcast/sworn/id1243525941?mt=2

The important part here is the id... bit towards the end. That’s the ID of the show in iTunes, which lets us fetch the feed URL by passing it to this endpoint:

https://itunes.apple.com/lookup?id=1243525941&entity=podcast

Notice the ID from above in the id query string parameter. This endpoint returns a JSON blob containing the URL of the podcast’s feed.
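Extracting the ID and building the lookup URL takes just a regular expression (the same pattern the full script uses):

```python
import re

# Matches the characters following "/id" in an iTunes podcast URL
id_extractor = re.compile(r'(?:/id)(\w+)\b')

itunes_url = 'https://itunes.apple.com/us/podcast/sworn/id1243525941?mt=2'
itunes_id = id_extractor.search(itunes_url).group(1)
print(itunes_id)  # 1243525941

lookup_url = 'https://itunes.apple.com/lookup?id=%s&entity=podcast' % itunes_id
```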

Next, I simply extracted the URL, MIME type, and content length of the first <enclosure> tag in each feed. From here, it’s simple to cURL each file.

import json
import re

import requests
from defusedxml.minidom import parseString as parseXMLString

# RegExp to extract IDs from iTunes URLs
id_extractor = re.compile(r'(?:/id)(\w+)\b')

# top100.json is the JSON from itunescharts.com
top100 = json.load(open('top100.json'))

for itunes_url in top100:
    itunes_id = id_extractor.search(itunes_url).group(1)
    lookup_url = 'https://itunes.apple.com/lookup?id=%s&entity=podcast' % itunes_id
    output = requests.get(lookup_url).json()
    feed_url = output['results'][0]['feedUrl']
    print('feed:', feed_url)

    feed = requests.get(
        feed_url,
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X)'},
    ).text
    try:
        parsed_feed = parseXMLString(feed)
    except Exception as e:
        print(feed_url)
        print(e)
        break

    enclosure = parsed_feed.getElementsByTagName('enclosure')[0]
    audio_url = enclosure.getAttribute('url')
    print('audio:', audio_url)

Interesting Metadata

The output of the script above looks something like this (URLs here are placeholders):

feed: https://example.com/feed.xml
audio: https://example.com/episode-1.mp3

Using a combination of grep, awk, and wc, I did all of the counting shown below.

The first interesting tidbit is that one of the feeds failed to download because I was accessing it using Python Requests’ default user agent string. This is probably a bad practice on the part of the website owner (feeds are cheap to generate, and banning programmatic access to them is unwise). It was simple to update the script to pass a custom user agent string.

Next, out of the 100 podcasts I crawled, 44 of them are using Podtrac. Podtrac offers podcast analytics, though it’s unclear how exactly they do a better job than other analytics platforms. In my own experience, the numbers behind Podtrac are useful because they’re trusted by advertisers, though perhaps not because of technical merit.

Almost a quarter of the podcasts were hosted on Libsyn. 15 of the 100 are hosted by NPR. Three of the 100 are using SoundCloud (I honestly expected more). Only one podcast of the bunch uses Podbean. These numbers might be higher because of feed proxies like Feedburner, though I did not investigate further.

22 of the 100 podcasts were behind Feedburner. This is very curious, as Feedburner provides little value when used with most modern hosting services. Feedburner prevents hosts from measuring subscribers in any meaningful way, since it proxies the feed. At Pinecast, we recommend that users do not use Feedburner in conjunction with our service or any other podcast hosting service. From a technical perspective, Feedburner is an old Google acquisition (2007) that has fallen by the wayside — little has gone into its upkeep, and given Google’s track record with minor services, it could go away at any time.

Analyzing the Audio

I went about investigating the encoding of the audio files. To do this, I used the Python library ffprobe3, which conveniently wraps the ffmpeg tool ffprobe. Running ffprobe on a file produces output like this:

> ffprobe audio/246.mp3 -hide_banner -show_streams
Input #0, mp3, from 'audio/246.mp3':
  Metadata:
    title           : #246: My Pen Pal
    artist          : This American Life
    album_artist    : Chicago Public Media
    TS2             : Chicago Public Media
    genre           : Podcast
    comment         : © 1995-2017 Ira Glass
    TSP             : This American Life
    date            : 2017
  Duration: 01:00:16.55, start: 0.025056, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, mono, s16p, 64 kb/s
    Metadata:
      encoder         : LAME3.99r
    Side data:
      replaygain: track gain - -3.300000, track peak - unknown, album gain - unknown, album peak - unknown,
    Stream #0:1: Video: png, rgb24(pc), 3000x3000, 90k tbr, 90k tbn, 90k tbc
    Metadata:
      comment         : Other
[STREAM]
index=0
codec_name=mp3
codec_long_name=MP3 (MPEG audio layer 3)
profile=unknown
codec_type=audio
codec_time_base=1/44100
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
sample_fmt=s16p
sample_rate=44100
channels=1
channel_layout=mono
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/14112000
start_pts=353600
start_time=0.025057
duration_ts=51036733440
duration=3616.548571
bit_rate=64000
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
TAG:encoder=LAME3.99r
[SIDE_DATA]
side_data_type=Replay Gain
[/SIDE_DATA]
[/STREAM]
...

One initial problem that I encountered is ffprobe3 being unable to parse the [SIDE_DATA] blocks at the end of the above snippet. I don’t really care about them for the purposes of this article, so I monkey-patched ffprobe3 to ignore them:

import ffprobe3

orig = ffprobe3.ffprobe.FFStream.__init__

def replacement(self, data_lines):
    # Drop the [SIDE_DATA] lines that ffprobe3 chokes on
    data_lines = [x for x in data_lines if not x.startswith('[')]
    return orig(self, data_lines)

ffprobe3.ffprobe.FFStream.__init__ = replacement

Simple enough.

I also used exiftool, a common file analysis tool, to get more information about the audio files. ffprobe doesn’t report whether a file uses a constant or variable bitrate, nor which channel mode it uses, so exiftool was necessary. I simply used subprocess.Popen to invoke exiftool and scrape the relevant information from stdout.

I used Python’s collections.Counter to tally up the bitrates, max bitrates, and codecs of each of the audio files. I also used csv.writer to output a CSV for use in a spreadsheet.
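That tallying step looks roughly like this (the records below are made-up examples, not the real data):

```python
import csv
from collections import Counter

# Hypothetical per-file records: (bitrate, channels, is_vbr)
files = [
    (128000, 2, False),
    (128000, 2, False),
    (64000, 1, False),
    (192000, 2, True),
]

# Tally bitrates the same way the analysis tallied codecs and channel modes
bitrates = Counter(bitrate for bitrate, _, _ in files)
print(bitrates.most_common(1))  # [(128000, 2)]

# Dump everything to a CSV for use in a spreadsheet
with open('audio_stats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['bitrate', 'channels', 'vbr'])
    writer.writerows(files)
```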

Codecs

First, I wanted to look at the most basic information: which codec the files used. I wasn’t sure what to expect here. On one hand, I fully expected 100% of the audio files to be encoded as MP3. MP3 is overwhelmingly dominant in podcasting. On the other hand, I expected to see some unusual entries, like AAC.

The result was whelming. 99 of the 100 files were MP3. The last, as it turns out, was an M4V file containing H.264 video and AAC audio. It contained a recording of the Apple WWDC event, unsurprisingly from the Apple Keynotes feed.

It was a bit surprising to see a video podcast coming from iTunes. Support for video in podcasts is dodgy at best, but I suppose if you’re Apple and you only list it in Apple Podcasts, you can be fairly sure it’s going to play on your users’ devices.

For the sake of sanity, I excluded the video file from the rest of the analysis.

Conclusion: Use MP3 to encode your podcasts. Avoid AAC.

CBR vs. VBR

The next thing I was interested in is the type of encoding used for those MP3 files. There are two options:

Constant Bitrate (CBR)

Variable Bitrate (VBR)

The difference is simple: with CBR encoding, one second of audio will always take the same amount of data, regardless of where in the audio file you find it. For example, a 128kbps audio file will take 128 kilobits to store one second of audio. Want to find the start of the audio at 00:00:05? Skip to the 5 × 128 kilobits (640 kilobits) mark and you’ll find it. VBR, on the other hand, allows the encoder to turn the bitrate up and down depending on what the audio contains. A second of silence might be encoded at low quality while music immediately after it would be bumped up to a higher quality. Adjusting the bitrate dynamically allows the files to be smaller by only increasing quality for audio that needs it.

With a CBR file, skipping forward or backward is easy because you can calculate exactly where to jump to. With VBR, skipping ahead ten seconds might mean skipping up to 1280 kilobits — but that might be too much if the quality is lowered within those ten seconds. This also means that the duration of the audio file can’t be determined by looking at the file size. With CBR, you simply divide the file size by the bitrate: that’s the number of seconds long the audio is. With VBR, the same calculation will overestimate the audio’s length substantially. Instead, VBR-encoded files need to list their duration in the file’s metadata, though this can be complicated and difficult to do with most encoding tools.
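The seek and duration arithmetic above can be written out directly (illustrative helpers, not from the analysis script):

```python
def cbr_seek_offset_kilobits(seconds, bitrate_kbps):
    # With CBR, each second always occupies exactly `bitrate_kbps` kilobits
    return seconds * bitrate_kbps

def cbr_duration_seconds(file_size_bytes, bitrate_kbps):
    # Valid for CBR only; VBR files must carry their duration in metadata
    return file_size_bytes * 8 / (bitrate_kbps * 1000)

print(cbr_seek_offset_kilobits(5, 128))       # 640 kilobits in
print(cbr_duration_seconds(28_800_000, 128))  # a 28.8 MB file at 128kbps is 1800 seconds
```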

Extracting this information with exiftool was very simple: exiftool only outputs the string “VBR” if the MP3 is encoded with VBR:

from subprocess import Popen, PIPE

# Run the tool
proc = Popen('exiftool "%s"' % path, stdout=PIPE, shell=True)

# Check for VBR in the stdout
stdout_lines = iter(proc.stdout.readline, b'')
is_vbr = any(': VBR' in a.decode('UTF-8') for a in stdout_lines)

# Close the output
proc.stdout.close()

98 of the 99 audio files were CBR-encoded. Only one was VBR-encoded.

Update: Due to a bug in the analysis script, this post originally claimed that fifteen podcasts used VBR encoding. After correcting the code, only one was found to be VBR-encoded.

In my own experience, VBR can provide dramatic savings over CBR. VBR is well-documented as a good practice. If you don’t believe me, take Jeff Atwood’s word for it.

Update: This is a controversial viewpoint. You should read my followup post about VBR.

Conclusion: Consider using VBR if the trade-offs aren’t offensive to you.

Channels

Within an audio file, a channel is something akin to an “audio feed.” Each channel produces sound. A mono audio file has a single channel, while a stereo audio file has two (one for each ear). I have always suggested that podcasters encode their content as mono rather than stereo. The reasoning is simple: most listeners just won’t be able to tell, you probably aren’t mixing your audio for two channels, and increasing the channel count requires more bits to achieve the same quality. Even still, three quarters of the files were encoded with two channels.

How the number of channels affects file size is a bit complicated. A two-channel audio file encoded at 128kbps consumes the same amount of space as a one-channel audio file encoded at 128kbps. Each channel in the two-channel stereo file, though, effectively gets half of the bitrate — the result is lower quality audio. The rules are a bit fuzzy here: besides stereo (where each ear’s audio is a channel) and mono (where there is a single channel for both ears’ audio) there is a channel mode called “joint stereo.” Joint stereo generally stores the sum of the left and right channels and the difference between the two. Since the left and right channels are likely very similar, more bits can be spent on the sum and fewer bits can be spent on the difference. The result is — usually — higher quality audio at the same bitrate as “vanilla” stereo encoding.
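The sum-and-difference trick can be sketched with a few samples (toy numbers; real encoders work on frequency-domain data, not raw samples like this):

```python
left = [0.50, 0.52, 0.49]
right = [0.50, 0.51, 0.50]

# Joint stereo stores the sum (mid) and the difference (side)
mid = [(l + r) / 2 for l, r in zip(left, right)]
side = [(l - r) / 2 for l, r in zip(left, right)]

# For similar channels the side signal is nearly silent, so it compresses
# to almost nothing and the bits go to the mid channel instead
print(max(abs(s) for s in side))
```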

At the end of the day, channels are a complicated matter. Minimizing the number of channels is ultimately best, but how do we know for sure? There are a few scenarios for how an MP3’s channels can be put together:

One Channel: This is simplest, and easiest to get right.

Two Distinct Channels: This is simple, but bad for podcasts. A stereo track requires twice the space to achieve the same quality as the equivalent mono audio.

Two Identical Channels (Faux-Stereo): This is almost certainly the result of a mistake. Faux-stereo is when a single audio channel is duplicated as the left and right channels of a stereo MP3. A faux-stereo audio file is audibly indistinguishable from a mono audio file, but is encoded as two channels instead of one.

Let’s figure out what’s going on with all of these two-channel audio files and see whether there are any obvious errors.

Testing for Faux-Stereo

It turns out that there are exactly zero tools for checking whether a two-channel MP3 file is faux-stereo. One such tool exists for WAV files, though: zrtstr from indiscipline on GitHub. For this experiment, the plan is to convert each MP3 to WAV, then use zrtstr to compare the two channels for faux-stereo.
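The comparison zrtstr performs boils down to something like this (a toy sketch over decoded samples, not the actual tool):

```python
def is_faux_stereo(left, right, tolerance=1e-4):
    # Faux-stereo: the two channels never differ beyond a tiny tolerance
    return all(abs(l - r) <= tolerance for l, r in zip(left, right))

print(is_faux_stereo([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))   # True: duplicated channel
print(is_faux_stereo([0.1, 0.2, 0.3], [0.1, 0.25, 0.3]))  # False: real stereo content
```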

The first step is getting zrtstr to run. Since I’m not on Windows, there is no binary. I installed the Rust compiler, but the compilation failed with an error in one of the dependencies. After some investigation, I found that the offending code had been rewritten in a newer version of the package. Bumping the version and deleting the Cargo.lock file made the compilation process succeed.

Once I got the decoding and analysis process automated with another Python script, I was surprised at the results. Of the 99 audio files, 5 of them were indeed faux stereo as reported by zrtstr.

The first file was a 128kbps stereo MP3. To my surprise, the file sounded fine and was not an unreasonable size. The trick here is that the file used joint stereo: when a faux-stereo file is encoded as joint stereo, the sum of the left and right channels is numerically double that of a mono channel, and the difference channel is just a bunch of zeroes. That second channel is essentially silence, which is easily compressed to almost nothing. In the end, there is a minor amount of overhead from encoding the second joint stereo channel, but not enough to matter much. The producers could probably increase the quality of the audio marginally by simply encoding as mono instead, or encode as mono at a slightly lower bitrate for the same quality.

The second, third, and fourth files were the same. The fifth, though, exhibited the exact characteristics of faux-stereo. It uses the “vanilla” stereo channel mode with identical left and right channels. The file is small, clocking in at around 20MB at 256kbps, but this is because the audio itself was only a few minutes long. Encoded at half the bitrate with a single channel, it would easily fit in 10MB instead.

Conclusion: Never use vanilla stereo as your channel mode. Joint stereo will produce better quality audio and save heartache if you make a mistake. Mono will never do you wrong.

Almost-Faux-Stereo

It seemed unlikely to me that any podcasts in the iTunes charts would make such a mistake, but seeing that at least one did, I’m inclined to think that there are others.

Looking at the other two-channel audio files, many looked like this going into zrtstr:

857053 / 85705344 [=>---------------------] 1.00 % 65071217067.80/s 0s
File is not double mono, channels are different!

zrtstr takes a chunk of each channel of the file (in blobs of 1% of the file’s duration) and compares them. If it finds any substantial differences, it bails at that point, as in the example above. Consider examples like this, though:

16763124 / 79824431 [======>-----------------------] 21.00 % 30667033.39/s 2s
File is not double mono, channels are different!

In that instance, we got through 21% of the file (16 megabytes) before we found any differences between the channels! This could mean a few things:

Glitches in the audio, or corruption in the file caused a difference between channels.

Ads injected by the hosting platform are stereo, while the rest of the episode is mono.

Certain clips of background music or other imported audio is stereo, while the rest of the episode is mono.

Numeric rounding during the MP3 to WAV conversion led to just enough difference between the channels to cause zrtstr to detect a difference.

zrtstr has an option that lets you specify the tolerance in amplitude difference that’s allowed when comparing channels. I increased the tolerance by 10x and some files progressed further, but the tool found no additional faux-stereo files.

Because some of the files progressed further, it’s not impossible that the stereo component of the two-channel files is background audio. I attempted to note the start and end of background audio to try to pinpoint where stereo audio might start or end, but many podcasts blend multiple tracks together, making it very difficult to identify by ear. I could not find any audio files (through manual listening) that appeared to be stereo as a result of an injected advertisement.

Of the 68-odd non-faux-stereo two-channel audio files, 20 did not contain true stereo audio for at least one percent of the file. That is, zrtstr made it more than one percent of the way through the file before it found a difference between the left and right channels. One audio file made it 77% through before zrtstr found a difference!
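A simplified model of that chunked scan shows how the “how far did it get” number falls out (toy samples, not actual WAV decoding):

```python
def identical_prefix_percent(left, right, tolerance=1e-4):
    # Scan in 1% chunks; report how far we get before the channels diverge
    chunk_size = max(1, len(left) // 100)
    for i in range(100):
        chunk = slice(i * chunk_size, (i + 1) * chunk_size)
        if any(abs(l - r) > tolerance
               for l, r in zip(left[chunk], right[chunk])):
            return i  # diverged within this chunk
    return 100  # never diverged: faux-stereo

# Mono intro for the first half, real stereo content afterwards
left = [0.0] * 50 + [0.3] * 50
right = [0.0] * 50 + [0.1] * 50
print(identical_prefix_percent(left, right))  # 50
```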

Conclusion: Encode your audio as mono unless your primary source audio is stereo. If you do not have multiple microphones or pan tracks to one channel or the other, stereo encoding will only decrease the quality of your output.

Let’s talk about bitrate

As mentioned above, bitrate represents the number of bits required to encode one second of audio. Bitrate’s effect on file size is easy enough to calculate for CBR files, but determining its impact on quality is tricky. Bitrate isn’t a great measure of audio quality: having a second channel decreases the quality of the audio at a particular bitrate, but how much it decreases depends on the channel mode and the contents of the audio itself.
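For CBR files, at least, the file-size side of this is simple arithmetic (a sketch; real files add a little overhead for ID3 tags and artwork):

```python
def mp3_size_mb(duration_seconds, bitrate_kbps):
    # kilobits per second for the whole duration, converted to megabytes
    return duration_seconds * bitrate_kbps / 8 / 1000

# A one-hour episode at the bitrates seen in the data
for kbps in (64, 96, 128, 192):
    print('%d kbps -> %.1f MB' % (kbps, mp3_size_mb(3600, kbps)))
```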

In my analysis, the most common bitrate was 128kbps with a majority of 57 audio files. 192kbps came in with 12 and 64kbps had 10. A number of other strange bitrates had fewer than five each.

I also broke down bitrates by the number of channels. 54 of the 73 two-channel files were 128kbps (54/57 128kbps audio files were two-channel). 9 of the two-channel files were 192kbps (or 75% of 192kbps files). Unexpectedly, 64kbps was the dominant bitrate for single-channel audio files with eight files, followed by 48kbps, 128kbps, 96kbps, and 192kbps.

As-is, this doesn’t mean a lot. The juicy details are in the breakdown of bitrates by channel mode. That will tell us a few things:

Are particular channel modes biased towards higher or lower quality?

Are certain bitrates necessary to compensate for quality issues introduced by the chosen channel mode?

What are the most common bitrates?

To break this down, I’ve created something of a histogram showing the percentage of audio files encoded with each channel mode at all of the notable bitrates that I encountered. That sounds crazy, but it’ll make a bit more sense in chart form: