So I finished the SuperFast LAME multi-threaded MP3 encoder last week and it's time to write about some technical aspects of it.

tl;dr: Implementing SuperFast LAME required some additional work to handle certain features of the MP3 format. You can download a preview release of fre:ac with SuperFast LAME support from GitHub.

The challenge

SuperFast LAME is significantly more complex than the SuperFast components for AAC, Opus and Speex, mostly because of technical peculiarities of the MP3 format.

The main difficulty is that while most other formats have discrete frames of audio samples in their bitstreams, MP3 frames can overlap each other:

In this example, the average frame size is 4 blocks of data. The individual frame lengths are 4, 3, 4, 3, 1, 5, 5 and 7 blocks. In an AAC bitstream, each frame simply has a length matching the number of data blocks required for that frame, and the frames neatly come one after another. In an MP3 bitstream, however, frames have a fixed size (at least for CBR files; VBR is more complicated), and when there is space left in a frame after all samples have been encoded, that space can be used by the following frames. This space available to following frames is called the bit reservoir. It allows the codec to maintain a set target quality in most cases, even when frame sizes are fixed and audio complexity changes.

Have a look at the example. The 5th frame is only one data block long and that data block fits completely into the 4th frame. It even leaves some space, so the first data block of the 6th frame starts in the 4th frame as well. Looking at only the 5th and 6th frame, their layout in the bitstream looks like this:

Here the frame headers come after the data and in case of frame #5, there even is data of another frame (#6) between its data and its header. In real world MP3 streams, the situation can be even more intricate.
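To make the layout more concrete, here is a small Python sketch of a toy model, not of the real MP3 syntax. It assumes that frame data is written back to back, that every frame owns a fixed-size slot, and that a frame's data has to end no later than the end of its own slot. The function name layout_frames and the block-based units are made up for illustration, and the limit real MP3 puts on the reservoir size is ignored.

```python
def layout_frames(lengths, slot_size):
    """Toy model: return (data_start, back_pointer) for every frame.

    lengths   -- data blocks needed by each frame
    slot_size -- fixed number of data blocks per frame slot
    The back pointer says how many blocks before its own slot a frame's
    data begins, i.e. how much bit reservoir the frame uses.
    """
    layout = []
    cursor = 0  # position right after the data written so far
    for index, length in enumerate(lengths):
        slot_start = index * slot_size
        slot_end = slot_start + slot_size
        start = cursor
        end = start + length
        if end > slot_end:
            # Not enough reservoir: the data would run past its own slot.
            raise ValueError("frame %d does not fit" % (index + 1))
        layout.append((start, slot_start - start))
        cursor = end
    return layout


if __name__ == "__main__":
    for number, (start, back) in enumerate(
            layout_frames([4, 3, 4, 3, 1, 5, 5, 7], 4), 1):
        print("frame", number, "starts at block", start, "back pointer", back)
```

For the frame lengths from the example, this model puts the data of frame 5 at block 14 and the start of frame 6 at block 15, both inside the slot of frame 4 (blocks 12 to 15), which is the situation described above.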

Basic SuperFast operation

So this is a problem when implementing the SuperFast technology for MP3. SuperFast works by passing chunks of audio data to separate encoder instances and later joining the encoded data blocks back together in the right order. This requires the frames to be available in discrete form, so that overlap can be handled and the frames can be joined correctly. The SuperFast encoding loop usually looks like this:
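As a rough illustration of the chunk, encode and join idea, here is a minimal Python sketch. It is not the fre:ac implementation: the encoder is a stand-in that just sums samples, and the names FRAME, OVERLAP, encode_chunk and superfast_encode are invented for this example. It assumes that every chunk is encoded together with a few lead-in frames from the previous chunk and that those overlapping frames are dropped again before joining.

```python
from concurrent.futures import ThreadPoolExecutor

FRAME = 4    # samples per frame (toy value)
OVERLAP = 2  # lead-in frames encoded twice and then skipped (assumption)


def encode_chunk(samples):
    """Stand-in encoder: produces one 'frame' (a sum) per FRAME samples."""
    return [sum(samples[i:i + FRAME]) for i in range(0, len(samples), FRAME)]


def superfast_encode(samples, frames_per_chunk, workers=4):
    chunk = frames_per_chunk * FRAME
    lead = OVERLAP * FRAME
    jobs = []
    for start in range(0, len(samples), chunk):
        begin = max(0, start - lead)     # include overlap with previous chunk
        skip = (start - begin) // FRAME  # frames to drop after encoding
        jobs.append((samples[begin:start + chunk], skip))

    output = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so joining is simple.
        encoded = pool.map(lambda job: encode_chunk(job[0]), jobs)
        for frames, (_, skip) in zip(encoded, jobs):
            output.extend(frames[skip:])  # overlap skipping
    return output


if __name__ == "__main__":
    audio = list(range(40))
    print(superfast_encode(audio, frames_per_chunk=3) == encode_chunk(audio))
```

The final slicing step is where discrete frames matter: dropping the first few frames of a chunk only works if each frame is a self-contained unit.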

MP3 difficulties

When dealing with MP3, multiple issues arise from the peculiarities around the bit reservoir:

1. The encoder might not return all encoded frames after processing a chunk of data as some frames might still be waiting for additional data to put in the bit reservoir.

2. Frames are not available in discrete form, but may be overlapping each other.

3. After dealing with the above, frames need to be put back into an MP3 compatible bitstream after joining.

4. Frames might require more reservoir than is available after joining with frames coming from other codec instances.

Previous attempts to create multi-threaded MP3 encoders dealt with these issues in a very simple way: They completely disabled the bit reservoir to get nicely laid out frames with no overlapping data. This solution cuts into the resulting MP3's quality, though, which is why such encoders never really gained traction.

So let's see how we can handle these issues more adequately.

Unraveling it

The first issue is relatively simple. After encoding a chunk of data, we call lame_encode_flush_nogap to force the encoder to return all encoded frames even if they are not completely filled yet. This makes sure we can operate with all the relevant frames in the next steps.

The second issue is handled by a bitstream unpacker that parses the data returned by the encoder and extracts discrete frames from the bitstream. After this step all frames will be laid out as a frame header followed by the complete data belonging to that frame. No more intermixing with other frames' headers or data.
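The unpacking step can be sketched with the same kind of toy model. The Python snippet below is an assumption-laden illustration and not the actual unpacker: it represents each slot as a tuple of back pointer, data length and the payload bytes physically stored in that slot, instead of parsing real MP3 headers and side info. The name unpack_frames is hypothetical.

```python
def unpack_frames(slots):
    """Extract discrete frames from a toy bit-reservoir stream.

    slots -- list of (back_pointer, length, payload) tuples, where
             back_pointer is how far before its own slot the frame's data
             begins, length is the size of the frame's data and payload
             holds the bytes physically stored in that slot.
    Returns one bytes object per frame with exactly that frame's data.
    """
    seen = b""  # all payload bytes encountered so far
    frames = []
    for back_pointer, length, payload in slots:
        start = len(seen) - back_pointer
        if start < 0:
            raise ValueError("back pointer reaches before start of stream")
        seen += payload
        if start + length > len(seen):
            raise ValueError("frame data runs past the end of its slot")
        frames.append(seen[start:start + length])
    return frames


if __name__ == "__main__":
    stream = [(0, 4, b"AAAA"), (0, 3, b"BBBC"), (1, 4, b"CCCD"),
              (1, 3, b"DDEF"), (2, 1, b"FFFF"), (5, 5, b"GGGG"),
              (4, 5, b"GHHH"), (3, 7, b"HHHH")]
    print(unpack_frames(stream))
```

The sample stream encodes the eight-frame example from the beginning of this post, with one letter per frame. After unpacking, every frame's data is contiguous again, no matter which slots it was spread over.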

After unpacking, we are ready to perform overlap skipping and ordering of data chunks from different encoder instances.

When writing the ordered frames to the output stream, we now need to make sure to repack them back into an MP3 compatible bitstream. The repacker deals with frame sizes and the bit reservoir and tries to pack frames in the most compact way.

Sometimes, though, a frame requires more reservoir than is currently available and the repacker needs to find a way to fit it in. It basically has two options to accomplish this. If only a few extra bits are needed, the repacker can add padding to a frame. This adds one additional byte, and sometimes that is enough to provide the required reservoir. In cases where it is not sufficient, the repacker can enlarge one or more previous frames to a bigger frame size. This usually provides enough reservoir, but requires all affected frames to be repacked again.
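The two options can be sketched as a small search over slot sizes. The following Python snippet is a simplified illustration and not the actual repacker: it only knows two frame sizes (base and bigger, with bigger assumed to be at least two bytes larger than base), counts in bytes, ignores headers and the maximum reservoir size, and restarts the fit check from the first frame after every change. The names first_misfit and repack are made up for this example.

```python
def first_misfit(lengths, slots):
    """Return (index, missing_bytes) for the first frame whose data would
    run past the end of its own slot, or None if every frame fits."""
    data_end = 0
    slot_end = 0
    for index, (length, slot) in enumerate(zip(lengths, slots)):
        data_end += length
        slot_end += slot
        if data_end > slot_end:
            return index, data_end - slot_end
    return None


def repack(lengths, base, bigger):
    """Choose a slot size per frame so that all frame data fits."""
    slots = [base] * len(lengths)
    while True:
        misfit = first_misfit(lengths, slots)
        if misfit is None:
            return slots
        index, missing = misfit
        if missing == 1 and slots[index] == base:
            slots[index] += 1  # option 1: one padding byte on this frame
            continue
        # Option 2: enlarge the nearest earlier frame that is still small.
        for earlier in range(index - 1, -1, -1):
            if slots[earlier] < bigger:
                slots[earlier] = bigger
                break
        else:
            raise ValueError("frame %d cannot be made to fit" % index)


if __name__ == "__main__":
    print(repack([10, 10, 11, 10], base=10, bigger=14))  # padding is enough
    print(repack([10, 10, 13, 10], base=10, bigger=14))  # enlarges a frame
```

Re-running the check from the start after each change mirrors the point made above: once a previous frame is enlarged, everything after it has to be laid out again.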

However, even this might not be enough when issue number 4 comes into play. In some rare cases, a frame requires so much reservoir that it is simply not possible to fit it into the bitstream. This can happen because one encoder instance cannot know how much reservoir will be left over by the instance encoding the preceding chunk. In cases where the preceding instance has to deal with a difficult-to-encode signal, it might leave next to no reservoir available to the next encoder.

Dealing with this was difficult. While there are some simple options like forcing the encoder to use a lower bitrate, these might potentially result in audible quality drops. So I tried to find another way to handle this.

Basically, the SuperFast algorithm will try to re-encode the audio part starting with the non-fitting frame and repeat this until it fits. To work around situations where it might never fit using this strategy, each time it fails, we try to put some more pressure on the bit reservoir by prepending a few frames of difficult-to-encode dummy data. These dummy frames force the encoder to spend some reservoir on them, which leaves less reservoir for our previously non-fitting frame and eventually allows us to fit the frame into the bitstream.
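In outline, the retry strategy is a loop that increases the number of dummy frames until the result fits. The Python sketch below only shows that control flow under stated assumptions: reencode is a hypothetical callback that re-encodes the audio with the given number of dummy frames prepended and reports how much reservoir the first real frame then needs, and the step size and upper limit are arbitrary values, not the ones SuperFast uses.

```python
def fit_with_retries(reencode, available, step=2, max_dummy_frames=8):
    """Re-encode with more and more dummy frames until the frame fits.

    reencode  -- callable(dummy_frames) -> reservoir bytes needed by the
                 first real frame (hypothetical stand-in for the encoder)
    available -- reservoir bytes left over by the preceding chunk
    Returns (dummy_frames_used, reservoir_needed).
    """
    dummy_frames = 0  # first retry: plain re-encode without dummy data
    while dummy_frames <= max_dummy_frames:
        needed = reencode(dummy_frames)
        if needed <= available:
            return dummy_frames, needed
        dummy_frames += step  # put more pressure on the bit reservoir
    raise RuntimeError("frame still does not fit after all retries")


if __name__ == "__main__":
    # Toy encoder: every dummy frame eats 150 bytes of a 400 byte reservoir.
    def toy_reencode(dummy_frames):
        return max(0, 400 - 150 * dummy_frames)

    print(fit_with_retries(toy_reencode, available=120))
```

The dummy frames themselves are only there to shape the encoder's state; they would be dropped again before the re-encoded frames are written to the output.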

The result

With all these additional steps, the process for SuperFast LAME now looks like this:

Arriving at this point took several months of work, but was absolutely worth it. The SuperFast LAME encoder scales well with the number of CPU cores and can provide a 3.5x speedup on a quad-core processor. On my 8-core, 16-thread CPU, I was able to achieve up to a 12x speed increase with it.

Unlike previous attempts to speed up MP3 encoding, SuperFast LAME does this while still using the MP3 format's bit reservoir feature, and it uses an unmodified encoder library. The necessary changes are all implemented in the frontend application and could be used with alternative MP3 encoders as well.

I plan to implement this technology on top of the command line LAME frontend in the future. For now, though, my priority is on releasing the fre:ac 1.1 beta and final versions. But keep watching this blog for future announcements about a SuperFast-enabled stand-alone LAME version.

Downloads

SuperFast LAME is now in testing and included in the SuperFast Preview Release 3 available at GitHub.

Source code

Check out the SuperFast repository on GitHub if you would like to learn more or build the code yourself. The SuperFast LAME implementation can be found in the components/lame folder.