I’ve seen a lot of poorly implemented spinlocks floating around the internet. They all make the same mistake: spinning on a read-modify-write (RMW) operation. I will explain why that’s bad and how to correctly implement a spinlock in C++. The explanation and implementation will easily carry over to other programming languages. At the end of the note I provide a complete spinlock implementation adhering to the Mutex named requirements.

Discussion thread on Reddit.

The bad spinlock

The spinlock works by using a boolean (or a single bit) to indicate whether the lock is held or not. To acquire the lock an atomic exchange operation is used to set the boolean to true. If the previous value returned by the exchange operation was false, the lock is now held; otherwise we need to try again until the lock is successfully acquired:

```cpp
struct taslock {
  std::atomic<bool> lock_ = {0};

  void lock() {
    while (lock_.exchange(true)) {}
  }

  void unlock() { lock_.store(false); }
};
```

Other atomic operations can be used to implement a spinlock. In my benchmarking on the x86_64 architecture I found no difference in spinlock performance between the different atomic operations that can be used. Additionally, according to Agner Fog’s instruction tables and uops.info, all LOCK-prefixed instructions have very similar instruction latencies.

Use correct memory ordering

The default memory ordering of C++ atomics, std::memory_order_seq_cst (sequentially consistent ordering), is overly restrictive here. It can be relaxed to std::memory_order_acquire for operations that acquire the lock and std::memory_order_release for operations that release the lock. This will allow the compiler to emit more efficient code.

```cpp
struct taslock {
  std::atomic<bool> lock_ = {0};

  void lock() {
    while (lock_.exchange(true, std::memory_order_acquire)) {}
  }

  void unlock() { lock_.store(false, std::memory_order_release); }
};
```

Reducing cache coherency traffic under contention

The problem with the above implementation occurs when there is contention on the lock. In order to keep all CPU caches synchronized a cache coherency protocol is used. There are several different cache coherency protocols (MESI, MOESI, MESIF), but they all have in common that only a single CPU core may write to a cache line but multiple CPU cores may simultaneously read from a cache line. The atomic exchange operation requires write access to the cache line where the lock is stored. If more than one thread is spinning trying to acquire the lock a lot of cache coherency traffic is required to continuously change which core has exclusive write access to the lock.

Since the coherency protocol allows multiple simultaneous readers, we can fix this by instead spinning on an atomic load operation (a spin-wait loop). Instead of continuous futile attempts to acquire the held lock, we wait for the lock holder to release it first. This eliminates cache coherency traffic during spinning:

```cpp
void lock() {
  for (;;) {
    if (!lock_.exchange(true, std::memory_order_acquire)) {
      break;
    }
    while (lock_.load(std::memory_order_relaxed)) {}
  }
}
```

Reducing load-store unit utilization

Typical CPUs in use today support hyper-threading, allowing two logical CPU cores to share the main execution resources such as the arithmetic logic unit (ALU) and the load-store unit (LSU). Our improved lock above will execute a load operation almost every cycle while spinning. This will starve any other logical CPU core sharing the same load-store unit as the spinning core. To alleviate this problem Intel introduced the PAUSE instruction, which provides a hint that a spin-wait loop is running and throttles the CPU core in some architecture-specific way in order to reduce power usage and contention on the load-store units. On GCC and Clang we can emit the PAUSE instruction using the built-in function __builtin_ia32_pause() and on MSVC using _mm_pause(). Adding this to our spin-wait loop we get:

```cpp
void lock() {
  for (;;) {
    if (!lock_.exchange(true, std::memory_order_acquire)) {
      break;
    }
    while (lock_.load(std::memory_order_relaxed)) {
      __builtin_ia32_pause();
    }
  }
}
```

Other CPU architectures may have similar primitives. For example ARM has the YIELD instruction.
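Pulling these together, a portable spin-wait hint might look like the following sketch; the cpu_pause name is my own, and the branches assume the GCC/Clang built-in, the MSVC intrinsic, and inline assembly for ARM:

```cpp
// Portable spin-wait hint: expands to PAUSE on x86, YIELD on ARM,
// and a no-op where no equivalent is known.
#if defined(_MSC_VER)
#include <intrin.h>
static inline void cpu_pause() { _mm_pause(); }
#elif defined(__i386__) || defined(__x86_64__)
static inline void cpu_pause() { __builtin_ia32_pause(); }
#elif defined(__arm__) || defined(__aarch64__)
static inline void cpu_pause() { __asm__ __volatile__("yield"); }
#else
static inline void cpu_pause() {}  // no hint available; plain busy-wait
#endif
```

The spin-wait loop would then call cpu_pause() instead of the x86-specific built-in directly.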

Note that if you don’t use hyper-threading and you don’t care about power efficiency, using the PAUSE instruction will slow down lock acquisition on contended locks.

Benchmark

I created a simple benchmark where N threads are launched, each thread acquires and releases the lock 100M / N times and finally the average time for a lock()-unlock() pair is calculated.
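A harness for this kind of benchmark could be sketched as follows; this is not the author’s actual benchmark code, and the bench name and the lock shown (the improved spinlock, without the PAUSE hint to stay portable) are illustrative:

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// The improved spinlock from the previous section.
struct spinlock {
  std::atomic<bool> lock_ = {false};
  void lock() {
    for (;;) {
      if (!lock_.exchange(true, std::memory_order_acquire)) break;
      while (lock_.load(std::memory_order_relaxed)) {}
    }
  }
  void unlock() { lock_.store(false, std::memory_order_release); }
};

// Launch nthreads threads, each acquiring and releasing the lock
// iters / nthreads times; return the average wall-clock time per
// lock()-unlock() pair per thread, in nanoseconds.
template <typename Lock>
double bench(long iters, int nthreads) {
  Lock lock;
  long per_thread = iters / nthreads;
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; ++t) {
    threads.emplace_back([&lock, per_thread] {
      for (long i = 0; i < per_thread; ++i) {
        lock.lock();
        lock.unlock();
      }
    });
  }
  for (auto &th : threads) th.join();
  auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration<double, std::nano>(elapsed).count() /
         per_thread;
}
```

Swapping in the bad spinlock for the Lock template parameter reproduces the comparison in the table below.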

Running on my computer with 4 cores / 8 hyper-threads we can see that the good spinlock implementation scales better than the bad spinlock:

| N threads | Good spinlock ns/op | Bad spinlock ns/op |
|----------:|--------------------:|-------------------:|
| 1         | 7                   | 7                  |
| 2         | 28                  | 43                 |
| 3         | 70                  | 168                |
| 4         | 112                 | 205                |
| 5         | 190                 | 381                |
| 6         | 241                 | 509                |
| 7         | 274                 | 884                |
| 8         | 442                 | 854                |

We can also use the perf command to see the number of cache misses and atomic operations for each implementation.

For the good spinlock:

```
sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses,mem_inst_retired.lock_loads ./a.out 100000000 8

 Performance counter stats for './a.out 100000000 8':

     1,038,332,933      L1-dcache-loads
       282,408,737      L1-dcache-load-misses     #   27.20% of all L1-dcache hits
       161,632,506      mem_inst_retired.lock_loads

       5.148648364 seconds time elapsed

      39.249547000 seconds user
       0.015969000 seconds sys
```

For the bad spinlock:

```
sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses,mem_inst_retired.lock_loads ./a.out 100000000 8

 Performance counter stats for './a.out 100000000 8':

     1,067,168,715      L1-dcache-loads
       509,077,190      L1-dcache-load-misses     #   47.70% of all L1-dcache hits
       552,378,083      mem_inst_retired.lock_loads

      11.225589133 seconds time elapsed

      82.805909000 seconds user
       0.051584000 seconds sys
```

We can see that the good spinlock has significantly fewer cache misses and atomic operations.


Appendix: Complete spinlock implementation

Below is the complete implementation of the basic spinlock using C++11 atomics. It should be easy to adapt to C11 atomics and other languages.

Update 2020-04-26: As suggested by user adops_dsp on Reddit the try_lock() should first check if the lock is free before attempting to acquire it. This would prevent excessive coherency traffic in case someone loops over try_lock() .
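The code block itself appears to have been lost from this copy. The following reconstructs the complete implementation from the pieces shown above, including the try_lock() from the update; it assumes GCC or Clang on x86, where the pause built-in is available:

```cpp
#include <atomic>

struct spinlock {
  std::atomic<bool> lock_ = {false};

  void lock() noexcept {
    for (;;) {
      // Optimistically assume the lock is free on the first try.
      if (!lock_.exchange(true, std::memory_order_acquire)) {
        return;
      }
      // Wait for the lock to be released without generating cache misses.
      while (lock_.load(std::memory_order_relaxed)) {
        // Emit the PAUSE instruction to reduce contention between
        // hyper-threads while spinning.
        __builtin_ia32_pause();
      }
    }
  }

  bool try_lock() noexcept {
    // First do a relaxed load to check if the lock is free, to prevent
    // excessive coherency traffic if someone does while (!try_lock()).
    return !lock_.load(std::memory_order_relaxed) &&
           !lock_.exchange(true, std::memory_order_acquire);
  }

  void unlock() noexcept {
    lock_.store(false, std::memory_order_release);
  }
};
```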