Introduction

Atomic operations are the building blocks of synchronization primitives and non-blocking algorithms. They guarantee that a modification to a memory location happens without any interference from other threads. They are required when writing applications for multi-core architectures.

Consider this function:

    void NonAtomicAND(tS32* Value, tS32 Op)
    {
        *Value &= Op;
    }

Which translates to this on x86:

    mov eax, dword ptr [Value]
    mov ecx, dword ptr [eax]
    and ecx, dword ptr [Op]
    mov dword ptr [eax], ecx

As you can see, we read the value into a register on line 2, perform the operation on line 3, and finally write the new value back to the memory location on line 4. What would happen if another thread updated the same memory location while we're holding the new value in a register? Its update would be silently overwritten by our store, and in C or C++ terms such a data race is undefined behavior. We need a way to know whether the memory location was modified between our read and our write.
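
To make the lost update concrete, here is a minimal sketch that replays one unlucky interleaving by hand on a single thread, so the outcome is deterministic. The names (`LostUpdateDemo`, `SerialResult`, `RegA`, `RegB`) are made up for illustration, and `std::int32_t` stands in for `tS32`.

```cpp
#include <cstdint>

// Replays, step by step, what two threads running NonAtomicAND could do
// if their instructions interleave. No real threads are involved.
std::int32_t LostUpdateDemo()
{
    std::int32_t Value = 0xFF;

    // "Thread A" wants Value &= 0x0F, "thread B" wants Value &= 0xF0.
    std::int32_t RegA = Value;   // A: load (0xFF)
    std::int32_t RegB = Value;   // B: load (0xFF), before A has stored
    RegA &= 0x0F;                // A: compute 0x0F
    RegB &= 0xF0;                // B: compute 0xF0
    Value = RegA;                // A: store 0x0F
    Value = RegB;                // B: store 0xF0, overwriting A's update

    return Value;
}

// What either serial ordering of the two operations would produce.
std::int32_t SerialResult()
{
    std::int32_t Value = 0xFF;
    Value &= 0x0F;
    Value &= 0xF0;
    return Value;
}
```

With this interleaving the final value is 0xF0, while running the two operations one after the other gives 0: thread A's update has been lost.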

Hardware Implementations

Depending on the platform, there are two hardware implementations of atomic operations: Compare-And-Swap (CAS) on x86, and Load-Link/Store-Conditional (LL/SC) on Alpha, PowerPC, MIPS and ARM.

CAS

CAS compares the contents of a memory location with a given value and, if they are the same, stores a new value there. The return value is the value that was in memory before the swap was attempted. That way, we can know if the memory location was written to between our read and our write, and repeat the operation if it was.
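
As a rough sketch of these semantics in portable C++, a `CAS` helper with the same argument order as the one used later in this post could be built on `std::atomic`. This is an assumption on my part: the post's own `CAS` takes a `volatile tS32*` and is platform-specific, whereas this version takes a `std::atomic<std::int32_t>*`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical CAS helper: CAS(Value, NewValue, OldValue).
// Stores NewValue only if *Value currently equals OldValue, and returns
// the value that was in memory before the swap was attempted.
std::int32_t CAS(std::atomic<std::int32_t>* Value,
                 std::int32_t NewValue,
                 std::int32_t OldValue)
{
    std::int32_t Expected = OldValue;

    // On failure, compare_exchange_strong writes the value it found into
    // Expected. On success, Expected is left equal to OldValue. Either
    // way, Expected ends up holding the previous contents of *Value.
    Value->compare_exchange_strong(Expected, NewValue);
    return Expected;
}
```

If the returned value equals `OldValue`, the store happened. Otherwise the memory location was left untouched and the return value tells us what is there now.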

Atomicity is guaranteed when used like this:

1. Read the original value from a memory location.
2. Compute the new value to be set.
3. Set the new value only if the memory location is still the original value.

On x86, the LOCK instruction prefix makes some instructions (ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG) atomic. The CMPXCHG instruction is used to implement CAS.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using CAS:

    void AtomicAND(volatile tS32* Value, tS32 Op)
    {
        while(1)
        {
            const tS32 OldValue = *Value;
            const tS32 NewValue = OldValue & Op;

            // If the result is the original value, the new value was stored.
            if(CAS(Value, NewValue, OldValue) == OldValue)
            {
                return;
            }
        }
    }

As you can see, we first read the original value and compute the new one. Then, we try to store the new value only if the memory location still holds the original value. If it changed, we repeat the whole operation until we succeed.
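
For comparison, here is how the same loop might look with standard C++ atomics, along with a small multi-threaded check. This is a sketch under assumptions: it uses `std::atomic<std::int32_t>` instead of `volatile tS32*`, and `ClearBitsConcurrently` is a made-up test helper.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

void AtomicAND(std::atomic<std::int32_t>& Value, std::int32_t Op)
{
    std::int32_t OldValue = Value.load();

    // compare_exchange_weak refreshes OldValue with the current contents
    // when it fails, so each retry recomputes OldValue & Op from fresh data.
    while (!Value.compare_exchange_weak(OldValue, OldValue & Op))
    {
    }
}

// Eight threads each clear one distinct bit of the low byte.
// If no update is lost, every bit ends up cleared.
std::int32_t ClearBitsConcurrently()
{
    std::atomic<std::int32_t> Value{0xFF};
    std::vector<std::thread> Threads;

    for (int Bit = 0; Bit < 8; ++Bit)
    {
        Threads.emplace_back([&Value, Bit] { AtomicAND(Value, ~(1 << Bit)); });
    }
    for (auto& Thread : Threads)
    {
        Thread.join();
    }
    return Value.load();
}
```

The standard library also provides `Value.fetch_and(Op)`, which performs the same atomic AND without a hand-written loop.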

The ABA Problem

Be aware of the ABA problem though: between the time you read the original value and the time you try to swap it with the new one, it could have been changed to something else and then back to the original value. In that case, the change will not be detected.

For simple operations, like the AtomicAND function, this isn’t a problem since the resulting value is still valid in the end. But when implementing lock-free algorithms such as queues or linked lists, this will cause unwanted behavior.

The usual solution to this problem is to append a counter to the value and increment it at each operation. That way, A-B-A becomes A1-B2-A3 and the change is properly detected. It may not be as easy as it looks though, and may require 64- or 128-bit CAS instructions, especially when working with pointers.
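
Here is a minimal sketch of the counter idea, assuming a 32-bit payload packed with a 32-bit counter into a single 64-bit atomic. The layout and the names (`tTagged`, `Pack`, `PayloadOf`, `CounterOf`, `TaggedStore`) are hypothetical and only serve the illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: the low 32 bits hold the payload, the high 32 bits
// hold a counter that is bumped on every successful update.
using tTagged = std::uint64_t;

inline tTagged Pack(std::uint32_t Payload, std::uint32_t Counter)
{
    return (static_cast<tTagged>(Counter) << 32) | Payload;
}

inline std::uint32_t PayloadOf(tTagged Tagged)
{
    return static_cast<std::uint32_t>(Tagged);
}

inline std::uint32_t CounterOf(tTagged Tagged)
{
    return static_cast<std::uint32_t>(Tagged >> 32);
}

// Stores NewPayload only if Cell still holds exactly Expected, meaning
// both the payload and the counter are unchanged. The counter is
// incremented on success, so A -> B -> A leaves a different tagged value.
bool TaggedStore(std::atomic<tTagged>& Cell, tTagged Expected, std::uint32_t NewPayload)
{
    const tTagged Desired = Pack(NewPayload, CounterOf(Expected) + 1);
    return Cell.compare_exchange_strong(Expected, Desired);
}
```

A thread that took a snapshot before an A-B-A sequence will fail its `TaggedStore`, because the counter no longer matches even though the payload does. Note that a 32-bit counter can wrap around, so this makes an undetected ABA very unlikely rather than impossible.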

LL/SC

LL/SC works differently. It is implemented using two instructions (lwarx for LL and stwcx. for SC on PowerPC): LL loads the value and reserves the memory location, while SC stores the new value only if the memory location is still reserved. The reservation is lost whenever the memory location gets written to. For this reason, LL/SC does not suffer from the ABA problem.
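
The reservation rule can be illustrated with a toy, single-threaded model. To be clear, this is not how LL/SC is used from C++ and it does not model real hardware. `tReservedWord` and its methods are invented names that only mimic the rule "any write kills the reservation".

```cpp
#include <cstdint>

// Toy model of one memory word with a single reservation.
struct tReservedWord
{
    std::int32_t Value = 0;
    bool Reserved = false;

    // LL: read the value and create the reservation.
    std::int32_t LoadLinked()
    {
        Reserved = true;
        return Value;
    }

    // Models a write coming from "another thread": it clears the
    // reservation, even if it writes back the same value.
    void PlainStore(std::int32_t NewValue)
    {
        Value = NewValue;
        Reserved = false;
    }

    // SC: store only if the reservation survived. Returns true on success.
    bool StoreConditional(std::int32_t NewValue)
    {
        if (!Reserved)
        {
            return false;
        }
        Value = NewValue;
        Reserved = false;
        return true;
    }
};
```

In this model, an A-B-A sequence of `PlainStore` calls between `LoadLinked` and `StoreConditional` makes the conditional store fail, even though the value looks unchanged. A value-comparing CAS would have succeeded in the same situation.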

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using LL/SC:

    void AtomicAND(volatile tS32* Value, tS32 Op)
    {
        while(1)
        {
            const tS32 NewValue = __lwarx(Value) & Op;

            // If the reservation was still valid, new value was stored.
            if(__stwcx(Value, NewValue))
            {
                return;
            }
        }
    }

Reservation Granularity

Depending on the architecture, the reservation is performed on aligned words or, in the worst case, on whole cache lines. On current-generation PowerPC consoles, the reservation granularity is a cache line, so special care must be taken to avoid false sharing or the performance hit can be dramatic.
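
One common way to avoid false sharing is to give each frequently updated variable its own cache line. The sketch below assumes a 128-byte line; `kCacheLineSize`, `tPaddedCounter` and `tPerThreadCounters` are hypothetical names, and the right size should be checked against the target's documentation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size. This is an assumption, not a queried value:
// adjust it for the platform you are targeting.
constexpr std::size_t kCacheLineSize = 128;

// alignas forces each counter to start on its own cache line, and the
// compiler pads the struct so that its size is a multiple of its alignment.
struct alignas(kCacheLineSize) tPaddedCounter
{
    std::atomic<std::int32_t> Value{0};
};

// Neighboring counters in this array never share a cache line, so a
// reservation (or a locked operation) on one does not disturb the others.
struct tPerThreadCounters
{
    tPaddedCounter Counters[4];
};

static_assert(sizeof(tPaddedCounter) == kCacheLineSize,
              "each counter should occupy exactly one cache line");
```

The trade-off is memory: each 4-byte counter now takes a full line, which is usually acceptable for a handful of hot variables but not for large arrays.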

CAS Emulation

Implementing CAS using LL/SC instructions may be tempting to keep the code platform-agnostic. Doing so adds a compare and a branch though, which may or may not be optimized away depending on the compiler and optimization level.

Here’s the code for AtomicAND from the native LL/SC implementation:

    loop:
        lwarx   r6,0,r3     # Load and create reservation
        and     r4,r6,r5    # Compute the new value
        stwcx.  r4,0,r3     # Store the new value if the reservation is still valid
        bne     loop        # Loop if the reservation was invalidated

And here’s the one using the CAS emulation:

    loop:
        lwz     r8,0(r3)    # Load the original value
        and     r4,r8,r5    # Compute the new value
        lwarx   r6,0,r3     # Load and create reservation
        cmpw    r8,r6       # CAS comparison
        bne     loop        # Retry if not equal
        stwcx.  r4,0,r3     # Store the new value if the reservation is still valid
        bne     loop        # Loop if the reservation was invalidated

As we can see, the CAS emulation executes an extra load, compare and branch on every iteration, so it is slower. Depending on the usage, using native LL/SC directly might help in some edge cases.

Performance

If thread contention is low, the compare loops should almost never repeat. However, when the same location is hammered concurrently by a large number of threads, the loop count can get quite high and can even lead to a livelock (strictly speaking, starvation), where a thread is never able to set the new value, or takes a long time to do so, because other threads keep modifying it at the same time. Fixing this particular issue might require algorithm-level refactoring.
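
A cheap way to find out whether a given loop suffers from this is to count its retries. The variant below is a diagnostic sketch, not something to ship: `AtomicANDCountingRetries` is a made-up name, and it uses `std::atomic` rather than the `volatile tS32*` interface from earlier.

```cpp
#include <atomic>
#include <cstdint>

// Same CAS loop as AtomicAND, but reports how many times it had to retry.
std::uint32_t AtomicANDCountingRetries(std::atomic<std::int32_t>& Value, std::int32_t Op)
{
    std::uint32_t Retries = 0;
    std::int32_t OldValue = Value.load();

    // compare_exchange_strong only fails when the stored value differs
    // from OldValue, i.e. when another thread got in between. On failure
    // it refreshes OldValue, so the next attempt works on fresh data.
    while (!Value.compare_exchange_strong(OldValue, OldValue & Op))
    {
        ++Retries;
    }
    return Retries;
}
```

Without contention the function returns 0. If the counts get large under load, that is a hint that too many threads are fighting over the same location and that the data layout or the algorithm should be revisited.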

Conclusion

In an upcoming post, I’ll talk about memory ordering issues, which are especially important when using atomic operations on some platforms.