Audio on the mobile web is a mess. The easy way to play sound — creating an <audio> element and calling the audio.play() method — doesn’t work unless playback starts in response to a ‘user gesture’, and will only let you play one clip at a time.
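The easy way looks something like this. A minimal sketch (function name and usage are illustrative, not from any particular library) — note that on mobile, `play()` returns a promise that rejects unless it was triggered by a gesture:

```javascript
// The 'easy way': an <audio> element. On mobile browsers, playback
// must begin inside a user-gesture handler (click/touch) or play()
// will reject.
function playClip(url) {
  const audio = new Audio(url);
  // play() returns a promise; it rejects when autoplay is blocked
  return audio.play().catch(err => {
    console.warn('Playback blocked:', err.name);
  });
}

// Must be called from a gesture handler, e.g.:
// document.querySelector('#play').addEventListener('click', () => playClip('clip.mp3'));
```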

The hard way — loading the audio, decoding it using the web audio API’s context.decodeAudioData(…), creating an AudioBufferSourceNode and playing that — gives you a lot more flexibility, but comes with a rather important caveat: it will crash your phone.
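The hard way, sketched below under the same caveat (illustrative names, not a real library's API). The key point is that `decodeAudioData` inflates the whole compressed file into raw PCM before you can play a single sample:

```javascript
// The 'hard way': decode the entire file with the web audio API.
// Flexible (layering, precise scheduling) — but the decoded PCM for
// the whole clip must sit in memory at once.
async function playWithWebAudio(url, context) {
  const response = await fetch(url);
  const encoded = await response.arrayBuffer();
  // decodeAudioData expands the compressed mp3 into raw PCM in memory
  const buffer = await context.decodeAudioData(encoded);
  const source = context.createBufferSource();
  source.buffer = buffer;
  source.connect(context.destination);
  source.start(0); // no gesture needed once the context is running
  return source;
}

// const context = new AudioContext();
// playWithWebAudio('clip.mp3', context);
```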

There’s a simple reason for that: the browser has to hold the entire audio clip, decoded, in memory. Since a 5MB mp3 file typically equates to a 55MB wav file, large audio files can exhaust that memory fast. When it happens, the way you find out about it is the whole tab going kaput.
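The arithmetic behind that inflation is easy to sketch. The numbers below (a 128kbps stereo mp3, decoded to 16-bit 44.1kHz PCM) are illustrative assumptions:

```javascript
// Decoded audio is raw PCM: sampleRate × channels × bytesPerSample,
// per second of audio.
function decodedSizeBytes(seconds, sampleRate = 44100, channels = 2, bytesPerSample = 2) {
  return seconds * sampleRate * channels * bytesPerSample;
}

// A 5MB mp3 at 128kbps holds roughly this many seconds of audio:
const mp3Seconds = (5 * 1024 * 1024 * 8) / 128000; // ≈ 327.7s

const decodedMB = decodedSizeBytes(mp3Seconds) / (1024 * 1024);
// decodedMB ≈ 55 — an 11× inflation. And decodeAudioData actually
// stores samples as 32-bit floats, so the real in-memory cost is
// roughly double that again.
```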

For RioRun, an interactive podcast we (the Guardian US interactive team) recently built, this was a major problem. At any one time we might have as many as three separate layered audio clips playing at once, and each of those clips might be several minutes in length. And even if we didn’t have to worry about bursting the memory banks, the web audio API approach has another major drawback in that you have to download the entire clip before you can start playing any of it.

Since playback is controlled by the distance you’ve covered, as measured by your phone’s GPS, using <audio> is a non-starter — we can’t rely on the user tapping their screen.

Break it down

We created Phonograph, an open source JavaScript library, to tackle this problem. It exploits a useful fact about mp3 files: like the planarian flatworm, you can slice them into smaller chunks and they won’t die — each chunk becomes a block of audio that can be played independently.

By reading in the raw binary data and breaking it into Uint8Arrays of a few kilobytes each, we can decode just enough audio to get us through the next few seconds. As we reach the end of each chunk, the next one is decoded and starts playing.
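The slicing itself is straightforward. A sketch of the idea (not Phonograph’s actual code; the 64KB chunk size is an illustrative default):

```javascript
// Slice raw mp3 bytes into fixed-size Uint8Array chunks for
// piecewise decoding.
function chunkify(data, chunkSize = 64 * 1024) {
  const chunks = [];
  for (let i = 0; i < data.length; i += chunkSize) {
    chunks.push(data.subarray(i, i + chunkSize)); // a view — no copy
  }
  return chunks;
}
```

Using `subarray` rather than `slice` means each chunk is a view onto the original buffer, so chunking itself costs almost nothing; only the decoded audio for the current chunks needs to live in memory.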

Better still, if we’re in a browser that supports the fetch API and implements streaming, we can start playback before download is complete, by estimating the duration of the clip and how long it will take to arrive at the current rate. (That is, of course, something you get for free with traditional HTML5 <audio> — just not with the web audio API.)
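A hedged sketch of both halves of that idea — the function names are illustrative, not Phonograph’s API. The estimate compares how long the rest of the file will take to arrive against how much audio we already have:

```javascript
// Is it safe to start playing before the download finishes?
// Safe if the remaining bytes will arrive (at the current rate)
// before we run out of already-buffered audio.
function estimateSafeToPlay(bytesLoaded, totalBytes, elapsedSeconds, clipDuration) {
  const rate = bytesLoaded / elapsedSeconds;                 // bytes/second so far
  const secondsRemaining = (totalBytes - bytesLoaded) / rate; // download time left
  const secondsBuffered = (bytesLoaded / totalBytes) * clipDuration;
  return secondsRemaining < secondsBuffered;
}

// Streaming the body requires a browser with fetch + ReadableStream:
async function streamChunks(url, onChunk) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(value); // value is a Uint8Array of raw mp3 bytes
  }
}
```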

Here be dragons

That’s the theory, at least. It turns out to be somewhat more challenging in practice. For one thing, you can’t just slice the mp3 file anywhere — you have to do it on a frame boundary, on which more later, otherwise you’ll lose data — and even then you’ll get audible seams between chunks because of something called the bit reservoir. This is one of the tricks that mp3 encoders use to cram more data into a smaller space. By filling unused space in less-complex-to-encode parts of the clip (such as silence) with extra bytes from upcoming more-complex-to-encode parts, encoders can achieve better quality with the same filesize. (This isn’t variable bitrate encoding or VBR, by the way — that’s a whole other can of planarian flatworms that we’ll open later.) The upshot is that any one frame may depend on as many as nine preceding frames — and since a frame represents about 1/40th of a second, your ears notice it.
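Finding those frame boundaries comes down to scanning for the frame sync word. A simplified sketch (an mp3 frame header starts with eleven set bits: an 0xFF byte followed by a byte whose top three bits are set; a robust parser would also validate the version, layer and bitrate fields, which we skip here):

```javascript
// Return the index of the first plausible mp3 frame boundary at or
// after `from`, or -1 if none is found.
function findFrameBoundary(data, from = 0) {
  for (let i = from; i < data.length - 1; i += 1) {
    // sync word: 0xFF then top three bits of the next byte set
    if (data[i] === 0xff && (data[i + 1] & 0xe0) === 0xe0) {
      return i;
    }
  }
  return -1;
}
```

In practice you scan near your intended chunk size and cut at the nearest boundary, so chunks end up roughly — not exactly — the same length.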

Phonograph solves this problem by linking chunks together: each chunk appends the first few kilobytes of the next chunk’s raw data to its own. Rather than stopping playback at the end of the audio that ‘belongs’ to the chunk, it continues for a fraction of a second while the next chunk starts playing silently. Once it’s safe to do so, Phonograph silences the first clip and unsilences the second.
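The overlap trick can be sketched with a gain node — this is illustrative, not Phonograph’s internals, and the timings are placeholder values:

```javascript
// Start the next chunk muted underneath the tail of the current one,
// then swap audibility once the overlap region is safely playing.
function crossover(context, currentSource, nextBuffer, when) {
  const gain = context.createGain();
  gain.gain.value = 0; // next chunk starts silent
  const next = context.createBufferSource();
  next.buffer = nextBuffer;
  next.connect(gain).connect(context.destination);
  next.start(when); // plays silently under the current chunk's tail
  // a beat later, unsilence the new chunk and stop the old one
  gain.gain.setValueAtTime(1, when + 0.05);
  currentSource.stop(when + 0.05);
  return next;
}
```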