Using locks in real-time audio processing, safely

When developing music software, you are operating under tight time constraints. The time between subsequent audio processing callbacks is typically between 1-10 ms. A common default setting is a buffer size of 128 samples at a sample rate of 44,100 Hz, which translates to 2.9 ms in between callbacks. If your process does not compute its audio output and write it into the provided buffer before this deadline, you will get an audible glitch, rendering your product worthless for professional use.

It is therefore commonly taught that you cannot do anything on the audio thread that might block the thread or otherwise take an unknown amount of time, such as allocating memory, performing any system call, or doing any I/O. This includes waiting on a mutex, which not only blocks the audio thread but also leads to priority inversion. This is well known and covered in many articles and talks, such as Ross Bencina’s seminal article, my own CppCon 2015 talk, and more recently a brilliant series of talks by Dave Rowland and Fabian Renn-Giles (part 1, part 2).

So how do we synchronise the audio thread with the rest of our application?

In the simplest case, a single numerical value needs to be shared between threads. For example, the user might turn a knob, which will update a parameter in your DSP algorithm. For this, you can use a std::atomic<T> variable, where T is a built-in numeric type such as int or float ; this will be lock-free on modern platforms.

If you have a stream of objects flowing from one thread to the other, such as MIDI messages, you can use a lock-free single-producer single-consumer FIFO ( boost::spsc_queue is a good implementation).

But sometimes we find ourselves in a more complicated situation. The audio thread needs to access a bigger data structure describing some part of the system, such as a std::vector or a directed graph or perhaps a linked list. Typically, the audio thread will be traversing that structure to do its work, while another thread might be modifying or re-generating that structure at the same time. How do we deal with this situation?

The correct answer, in my opinion, is to design your audio engine in such a way that this case never occurs. This can be achieved using immutable data structures. Instead of modifying the data structure in-place, the message thread peels off a copy that contains the modification, while the audio thread still looks at the previous version for however long it needs to. When done correctly, this will lead to a system that is thread-safe by design, and you will never need locks on the audio thread.

Unfortunately, not everyone is in the position where they can re-engineer their audio engine to follow this pattern. You might be stuck with data structures accessed by multiple threads simultaneously. In this situation, many developers reach for locks, and that’s exactly what we will be looking at in this article.

Use cases

Looking at the codebase I am currently working with, I have found various places where locks on the audio thread are used to avoid data races. These use cases are all variations on the same theme. Here are just three examples:

The audio thread is searching a list of audio samples, trying to find the one that matches the currently played note; another thread loads more samples and inserts them into that list.

The audio thread is searching for a free polyphonic synthesiser voice that it can use; another thread allocates more such voices.

The audio thread is processing a whole audiograph; another thread re-builds this graph.

You might notice that the only real difference between all those use cases is how large the data structure in question is, and how big the portion of the audio graph is which depends on the current state of that data structure.

Now (putting aside for a moment the fact that all of these can be implemented without any locks at all if we use more clever data structures), let’s look at the types of locks chosen. It seems that some developers intuitively choose different types of locks depending on the perceived (but often not actually measured) duration of the code under lock. If there is a lot of code, they might reach for std::mutex or something similar. If, on the other hand, it looks like a relatively quick operation, they might opt for a spinlock instead.

Below, we look at both of those options.

Don’t use std::mutex::try_lock!

It is well-known by now that you should not lock a mutex on the audio thread. At this point, it might seem like a good idea to resort to a different strategy: using try_lock instead. In fact, I saw this being recommended in tutorials. If you’re using C++17 and the C++ standard library, it might look like this:

if (std::unique_lock lock(mtx, std::try_to_lock); lock.owns_lock()) { // do your audio processing } else { // use a fallback strategy }

In this code, the audio thread is trying to obtain the mutex. If this fails, because the mutex is being held by another thread, the audio thread doesn’t wait (it’s not allowed to). Instead, we fall back to an alternative strategy such that we can keep the audio running without accessing the data that is being used by the other thread. If you’re lucky, the object can use the last known values of the needed parameters instead, or do something else that the user won’t notice. If you’re less lucky, you’ll have to switch your audio effect to bypass mode, or perhaps fade out, or some similar fallback strategy. It’s probably not ideal, but in many cases still better than a glitch.

At first glance, this seems like a good strategy. After all, the C++ standard says that try_lock doesn’t block and returns immediately. So it’s safe to call on the audio thread. That is true. But unfortunately, this strategy has a fatal flaw.

If we look more closely at what the above code snipped actually does, it turns out that std::mutex::try_lock() is not the only call that is being made by std::unique_lock . After all, it’s an RAII class, so when it goes out of scope, it will call std::mutex::unlock() in its destructor. Most of the time, that’s not going to do very much. But if there is another thread waiting to acquire the mutex, the audio thread will have to interact with the OS thread scheduler at that point so that that other thread can then be woken up. And that’s a system call. It’s not realtime-safe to do that. So you should not use std::mutex on the audio thread, not even with try_lock .

What about spinlocks?

The alternative to mutexes are spinlocks. For example, a widely used implementation is juce::SpinLock. It is probably being used in half of the VST plug-ins out there, and the other half will use something very much like it. I am not a huge fan of this class, because its API is not STL-compatible (so you can’t use it with std::scoped_lock and std::unique_lock ), and it doesn’t use the optimal memory order flags in the implementation. But we can easily fix that and put together a more standard C++ style version of this type of lock:

struct spin_mutex { void lock() noexcept { if (!try_lock()) /* spin */; } bool try_lock() noexcept { return !flag.test_and_set(std::memory_order_acquire); } void unlock() noexcept { flag.clear(std::memory_order_release); } private: std::atomic_flag flag = ATOMIC_FLAG_INIT; };

This can now be used as a drop-in replacement for std::mutex , except that it now has a realtime-safe unlock() method. Because of this, it is much, much better than std::mutex . But is it good enough?

On the audio thread, we will only be using try_lock and unlock (via a std::unique_lock wrapper like in the first example), and never lock . As we can see, both of these will only do a single atomic operation, so it’s perfectly safe.

On the other thread however, whenever we try to access the lock while the audio thread has it, we will use lock (via one of the RAII wrappers). Therefore, the non-audio thread will be spending some time in that spin loop. It will keep hammering that atomic flag again and again and again, about 200,000,000 times a second (on my machine), until the audio thread is done and the atomic test-and-set finally succeeds. This loop will completely max out the CPU load of the core it will be on. It’s like using a continuous Sheldon’s knock as a method to determine when someone returns home from vacation.

If this happens rarely, or if you are on a powerful machine with many cores, that’s probably not a problem. In a typical audio application, apart from the audio processing itself, the most CPU time goes into rendering the GUI and doing file I/O. A short spinloop shouldn’t cause you much trouble in the overall picture.

But if the lock is around your whole processing callback or a significant part of it (like in some of the use cases above), and if you are doing a lot of audio processing (let’s say you’re at 90% CPU load on the audio thread), depending on where you use the spinlock, you may end up waiting on it longer and more often. If you’re then on a machine with not too many cores, or worse, a mobile device where energy consumption becomes really important, spinlocks like the above suddenly start looking like not a great strategy at all. Not only do they waste a lot of CPU cycles with useless instructions, sucking your battery dry, but they also keep any other thread from doing any work on that core. So if the audio thread is doing a lot and is near high load, and the message thread is spinning on that lock at the same time, any additional threads in your audio app trying to do meaningful work will possibly not have a great time.

Exponential back-off

When stuck with a problem like this, it is often helpful to look outside of the audio industry for solutions. Spinlocks are used in other applications that demand low latency and high efficiency. If we search the literature, we will find a strategy called exponential back-off. In his recent talk about the C++20 synchronisation library, Bryce Adelstein Lelbach shows an example implementation of such a strategy (starting at 25:50 mins in the video):

struct spin_mutex { void lock() noexcept { for (std::uint64_t k = 0; !try_lock(); ++k) { if (k < 4) ; else if (k < 16) __asm__ __volatile__("rep; nop" : : : "memory"); else if (k < 64) sched_yield(); else { timespec rqtp = {0, 0}; rqtp.tv_sec = 0; rqtp.tv_nsec = 1000; nanosleep(&rqtp, nullptr); } } } bool try_lock() noexcept { return !flag.test_and_set(std::memory_order_acquire); } void unlock() noexcept { flag.clear(std::memory_order_release); } private: std::atomic_flag flag = ATOMIC_FLAG_INIT; };

Now, there is quite a lot to unpick here, and unfortunately the code above is specific to x86 gcc and non-portable. But the idea is to use progressively longer wait times the longer you fail to acquire the lock. First, we try to spin four times; if we can’t get the lock, we insert a CPU pause instruction for the next 12 times, which will consume less energy; then if we still can’t get the lock, we yield and tell the thread scheduler to put the thread back into the run-queue; and finally, if we still fail to get the lock after a series of such yields, we start putting the thread to sleep for some amount of time.

Of course, no one wants to be re-implementing the above by hand for all the platforms and compilers you care about, and then tuning the numbers for each one. Bryce suggests to instead use the new std::atomic::wait/notify operations in C++20 as a portable solution:

struct spin_mutex { void lock() noexcept { while (!try_lock()) flag.wait(true, std::memory_order_relaxed); } bool try_lock() noexcept { return !flag.test_and_set(std::memory_order_acquire); } void unlock() noexcept { flag.clear(std::memory_order_release); flag.notify_one(); } private: std::atomic_flag flag = ATOMIC_FLAG_INIT; };

In general, we can expect this to be low-latency. Unfortunately, it’s up to the standard library implementation to decide how to implement this. For example, std::atomic::notify_one() might be implemented such that it does a system call to wake up another thread waiting for the lock, so (like with most of the C++20 synchronisation library) we have no guarantees that this would be safe to use for real-time audio. It is therefore not really helpful here.

So let’s look again at the exponential back-off strategy. Upon closer inspection, it turns out that the above strategy is tuned for a scenario where you have many cores, and many threads (dozens, maybe hundreds) are contending for the same lock. In such situations with high concurrency and high contention, exponential back-off is really great. However, in audio, we are in a very different situation. In all relevant cases I have seen, there were only two threads involved in trying to acquire the lock: the audio processing thread, accessing some object or data structure, and a second thread modifying it (either the message thread or some other thread loading stuff in the background). And the product will typically run on consumer hardware such as a phone or a mid-range laptop.

So what can we do for the audio use case? Are any bits of the exponential back-off strategy useful for us? Let’s find out. First, let’s try to figure out what the different phases in the loop actually do, and measure how long they take.

Look and measure

First of all, many audio apps are cross-platform, and therefore we need to come up with something more portable than the Linux-specific thing above. I don’t have a mobile development setup at home at the moment, so I can’t test this on ARM, but at least I have two different Intel x64 machines with three different operating systems and three different C++ compilers, so let’s see if we can get something to work on all three of them.

The first stage of the back-off strategy is just an empty spinloop like before – this will work everywhere.

In the second stage, Bryce uses inline assembly to introduce a rep; nop CPU instruction, which according to Intel’s Software Developer Manual (a truly great resource!) is equivalent to a pause instruction, effectively causing the CPU to briefly idle. This is great on modern Intel CPUs, because not only does it save energy compared to the busy wait, but it also allows another thread to make progress on the same core thanks to hyperthreading. It’s effectively CPU magic for implementing efficient spinlocks.

The code as written above is unfortunately not portable, because MSVC no longer allows inline assembly (thanks, Microsoft). But it turns out there is a portable way to do this: we can call an x86 intrinsic called _mm_pause() , which you can include via #include <emmintrin.h> . Even though it’s not part of the C++ standard, every major compiler targeting Intel x64 will understand this and turn it into the right instruction, which is documented here.

On ARM, it will look slightly different – you could perhaps use the WFE instruction instead to achieve a similar effect. I won’t go into ARM details this time, as this blog post focuses on Intel, but it’s on my to do list.

The third stage after the CPU pause is a thread yield, which can be done in standard C++ as well: just call std::this_thread::yield() for the same effect. The fourth stage introduces sleeping, but as we will see later, this is less interesting for us, so let’s ignore sleeping for now.

Let’s take the three stages we got so far, and measure how much time, on average, they consume – at least, on my two machines – when used in a spinloop:

Average time for a single spinlock loop iteration using different back-off levels, across three different operating systems, three different compilers, and two different machines (Dell XPS 15″ and MacBook Pro, both with a 6-core 2.9 GHz Intel Core i9 processor and 32 GB RAM).

Note that I used a logarithmic scale for the time. Every stage takes roughly one order of magnitude longer per loop iteration than the previous one. The linearity of the plot in log scale demonstrates the exponential nature of the different stages.

The spin, which is basically just checking the atomic flag and increasing the loop counter, takes about 5 ns on both machines and all three OSes. The _mm_pause() takes about 39 ns. Finally, the yield consumes between 230 and 510 ns, depending on the OS, showing some variance across OSes. This is expected, because every OS implements their thread scheduler differently.

It’s worth looking at the yield more closely.

Don’t yield in a loop

In the scenario above, I would have expected the yield to take much longer – after all, a timeslice of the thread scheduler is typically something between 10 and 100 ms. And indeed, this happens sometimes.

For my next test, I recreated a scenario typical of high-load music software usage. I let the above yield-in-a-loop spin inside a plug-in that synthesises some sound, which was loaded inside a DAW with a whole bunch of other plug-ins and tracks going on, including something CPU-intensive like a convolution reverb with a very long IR, and various other apps and processes running at the same time.

And indeed, in this scenario, a single iteration of the yield loop would sometimes take 10 or even 30 ms. This is an indicator that our waiting thread was actually scheduled to the back of the run-queue and another thread jumped in. But it was still happening very rarely, perhaps once every few seconds, and only if I drive my machine very hard. I imagine that on a less powerful machine (both my laptops have a 2.9 GHz i9 with six cores) under high load, the re-schedulings might happen more often , and/or take longer. That’s when you will start dropping GUI frames, but this is actually expected in this scenario: under high load, what we want is the audio to not glitch and other threads doing important work to be able to progress.

But in general, almost all of the time, our message or I/O thread would just spin in that loop, repeatedly calling std::thread::yield() with nothing else to do. Each such call triggers a chain of events. It summons the thread scheduler, at which point the OS kernel does the whole dance around suspending the thread, putting it all the way back into the run-queue, then realising it’s the only thread in the queue anyway, and then resuming it again. That’s what takes 230-510 ns in the graph above. It is doing a lot of work, actually even worse than the busy waiting in a certain way, because it also keeps the OS kernel busy. And since there is no contention for that lock in our scenario, doing a yield in a loop seems like a bad idea: we are back to burning unnecessary energy, bumping that CPU core up to a constant 100% load again, without significant benefit. We don’t want that!

Don’t sleep

Now that we figured out we probably don’t want to yield repeatedly, it doesn’t make much sense to consider the next possible back-off stage: putting the thread to sleep. Sleeping implies yielding as well, and both yielding and sleeping are scheduling techniques rather than wait-techniques. But remember, we don’t have contention on that spinlock. If the audio thread has it, there is typically only one other thread waiting for it (and that’s the one we’re in).

Moreover, sleeping creates an additional problem. If the CPU load is high, the audio thread might spend 90% or more of its time processing audio. If the app is architected in a particularly unfortunate way, most of that might be under the spinlock. So, if the callback period is every 3 ms (a typical value), the other thread has only a 300 μs time window, or less than that, to jump in and do its work. If it’s sleeping, it might repeatedly miss that narrow window of time and not proceed at all. This won’t cause any audio glitches, but it is a recipe for a GUI that keeps dropping frames and feels unresponsive. On the one hand, we don’t want to burn unnecessary energy; on the other hand, we also want the message thread to jump in and acquire the lock from the audio thread as soon as possible so that things keep flowing smoothly. How can we achieve that?

Let’s tune this for audio

To get to a suitable implementation, we need to figure out which of the possible back-off stages we should use for our spinlock, and how often. Instead of guessing, let’s go back to our original use cases and measure how long they take.

It turns out that this is spread out over a relatively broad spectrum. If the audio thread only does a quick lookup, perhaps inside a std::vector which it finds to be empty, such things typically take 100 ns or less (for my audio plug-in on my machine). If we have to traverse only a few elements, this will typically be under 1 μs. On the other end of the spectrum, if the lock is around the whole audiograph, and our instrument is quite CPU-heavy, one round of processing the graph and producing an audio buffer can take up to 1 ms.

In other words, the time it takes until the audio thread releases the lock can vary over five orders of magnitude. We need a spinlock that will work well for all these cases. For this, we need to choose appropriate values for how often each back-off stage should get executed on the waiting thread while it stays inside the loop.

Let’s start with the first stage, the spin. I chose to do this five times, which brings us up to 25 ns. This is probably the shortest amount of time in which the audio thread might complete any kind of meaningful operation. In all honesty, this is the most arbitrary of the values I picked. I might just as well have chosen four or three. If the first spin fails, most likely the next couple of spins will fail as well.

The next stage is to do _mm_pause() in a loop, which takes 39 ns each time around. I conducted some additional measurements and found that of those 39 ns, about 90% are spent with the CPU actually paused, and the remaining 10% are spent checking the atomic flag and incrementing the loop counter. 10% overhead is not bad. But we are still touching memory and the CPU’s integer unit every 39 ns in those 10%, and we probably can do better. By doing ten subsequent _mm_pause() ‘s, we can bring that number down from 10% to 1%. Each such iteration of 10 _mm_pause() s now takes about 350 ns. So, as the second back-off stage, I chose 10 iterations of a single _mm_pause() , and then after that I switch to the third back-off stage, which does 10 _mm_pause() s at a time.

Annoyingly, I actually had to write out the 10 _mm_pause() ‘s by hand, because when checking the codegen in godbolt, I found that if I stick them into an inner loop, the compilers don’t unroll that loop – instead, they insert a loop counter that then gets incremented and checked, which defeats the purpose of putting the core to sleep.

Now the waiting thread will spend most of its time in the stage that does 10 _mm_pause() ‘s at a time. This is exactly where we want it to be: on a spinlock that consumes very little energy and allows other threads to progress. At the same time, the time between each check of the lock (350 ns) is still short enough that the waiting thread won’t miss its opportunity of acquiring the lock after the audio thread is done, even with small buffer sizes and under high audio processing load.

But when we reach a waiting time around 1ms (or about 3000 iterations of the 10 _mm_pause() s-loop), we should begin to worry. This is getting close to the maximum amount of time that the audio thread will keep itself busy during a single processing callback. If we are here, the audio thread is operating at high CPU load. It might get close to missing its deadline (and then we’re in bigger trouble), or perhaps the system is just generally clogged up with threads trying to do other things, or something weird might be going on.

In this case, we choose to call std::this_thread::yield() once, to give the scheduler a chance to sort things out (the waiting thread is not the priority in this situation!). But remember, we don’t want to be yielding in a loop as it would increase CPU load even further. Instead, after that single yield, we immediately go back into the previous stage of looping in the low-energy 10 _mm_pause() s stage. We keep going back and forth between these two states until the waiting thread finally gets the lock.

Considering all of the above, this strategy seems to me like the most efficient way to implement a spinlock for a typical audio plug-in.

Implementation

Let’s turn the above strategy into C++. It turned out that sticking all stages into a giant for-loop like in Bryce’s talk isn’t actually that great for our use case. The optimiser and branch predictor can do a much better job if we break things up by stage and make a separate loop for each one, instead of a single loop with four different branches in it. It makes our intent clearer and the code more readable as well. In the end, I arrived at the implementation below, which I will use in my code from now on:

#include <array> #include <thread> #include <atomic> #include <emmintrin.h> struct audio_spin_mutex { void lock() noexcept { // approx. 5x5 ns (= 25 ns), 10x40 ns (= 400 ns), and 3000x350 ns // (~ 1 ms), respectively, when measured on a 2.9 GHz Intel i9 constexpr std::array iterations = {5, 10, 3000}; for (int i = 0; i < iterations[0]; ++i) { if (try_lock()) return; } for (int i = 0; i < iterations[1]; ++i) { if (try_lock()) return; _mm_pause(); } while (true) { for (int i = 0; i < iterations[2]; ++i) { if (try_lock()) return; _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause(); } // waiting longer than we should, let's give other threads // a chance to recover std::this_thread::yield(); } } bool try_lock() noexcept { return !flag.test_and_set(std::memory_order_acquire); } void unlock() noexcept { flag.clear(std::memory_order_release); } private: std::atomic_flag flag = ATOMIC_FLAG_INIT; };

Summary

Ideally, you should avoid having to do any non-trivial thread synchronisation in your audio code. Instead, use std::atomic s, lock-free FIFOs, immutable data structures, and separate your processing code from your data model as much as possible.

But if you have to use locks, never use std::mutex on the audio thread, not even with try_lock() . While that call is realtime-safe, the subsequent unlock() is not – hard to see because it will be hidden behind a std::unique_lock or similar RAII wrapper.

Spinlocks are a good fallback solution, but if you find they waste too much energy and CPU time, you can back off from the busy-wait spin loop to progressively longer sequences of CPU pause instructions ( pause on x64, WFE on ARM). Once in a while, yield the waiting thread to allow the thread scheduler to do its job.

This wait loop should never run on the audio thread; the audio thread should only ever call try_lock() and fall back to an alternative strategy on failure.

If you choose to use a progressive back-off implementation using pause and yield, measure what your code is doing, and tune the numbers for your use case. For mine, the above implementation seems to do the job.

Thanks much to Dave Rowland and Gašper Ažman, who helped me a lot with putting all this together.