UPDATED! Added explanation about how to check if you are affected below. Thanks to vicaya on hacker news for commenting about this.

This article is going to take a look at a rather nasty low-level hardware bug that can bite people writing multi-threaded applications. The bug manifests itself in MySQL, as we’ll see a bit later. Part of the blame for the nastiness of this bug certainly lies with a third party who should have (and could have) publicly released a detailed errata, as we’ll see in a bit.

Consider this article a bit of a thought experiment; when you read this article ask yourself: “How would I have debugged this?”

A quick cautionary note: We’re going to look at a little bit of assembly. If you don’t know assembly, don’t worry.

Starting at the atom

An important concept when building complex systems with multiple threads or processes is the concept of atomicity. Very briefly, atomicity is the idea that there is some set of instructions that when executed appear as an indivisible unit to the rest of the system.

The Intel x86 architecture has two ways to execute atomic operations:

guaranteed atomic operations

system bus locking – a processor asserts a signal and other processors are unable to gain control of the system bus during this time

Guaranteed atomic operations

This is the less interesting of the two, but on the Intel 486 and above:

Reading or writing a byte

Reading or writing a word aligned on a 16-bit boundary

Reading or writing a doubleword aligned on a 32-bit boundary

are all guaranteed to be atomic.

[NOTE: There are a few more guaranteed atomic operations for the Pentium and P6 families, but I won’t go into the details of them here as there is way too much bleeding involved. If you are the type of person who runs toward a burning building check out: Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A Chapter 7 for more information about the other guaranteed atomic operations.]

This list is good, but it doesn’t give us enough power to actually implement a synchronization primitive. You can imagine two processes competing for a resource: one process reads a flag to check whether the resource is available, but before it has time to flip the flag, the other process reads the same value, and now both believe they have exclusive access.

We need a little more juice: we’d like to be able to read, write, and maybe compare as one indivisible operation. This is where bus locking and two x86 instructions step in to save the day.

Bus locking

There are two instructions on the x86 which are commonly used as atomic operations:

xchg – exchange a source and destination.

cmpxchg – compare the EAX register with the destination; if they are equal, load the source into the destination. Otherwise, load the destination into EAX.

And there are two flavors of bus locking:

automatic bus locking – the bus is locked whenever xchg and certain other special instructions are executed

software controlled bus locking – a small set of instructions (cmpxchg, dec, and others) can be executed atomically if preceded with the lock prefix

OK, cool. Now we’re talking. We have enough firepower to build some synchronization primitives.

Let’s build a mutex

Let’s build a simple mutex using cmpxchg.

Assume a function exists:

/**
 * compare old with the destination memory. if they match, put new
 * in the destination and return old. Otherwise, return new.
 */
int cmpxchg(void *memory, int old, int new);

This implementation is left as an exercise to the reader. Wow, I’ve always wanted to say that.

(Protip: If you don’t feel like bleeding, you can just: #include <asm/system.h> which will take care of it for you on Linux.)

Then we can implement a stupid and simple spinlock like this:

typedef int mutex;

void mutex_init(mutex *m)   { *m = 1; }
void mutex_lock(mutex *m)   { while (!cmpxchg(m, 1, 0)) ; }
void mutex_unlock(mutex *m) { *m = 1; }

That was pretty easy! Admittedly, it is a pretty stupid implementation. You could imagine adding more features, and even building more complex synchronization primitives (semaphores, barriers, read-write locks, etc). Go crazy and build them out.

Synchronization bugs are nasty

So we’ve talked a little bit about how atomic operations work, and we’ve built a simple mutex. Complex multi-threaded applications can use lots of synchronization primitives to control access to a shared resource (like a piece of memory, or a file, etc). As you can imagine, weird bugs can crop up like deadlock, livelock, and memory corruption if synchronization primitives are used incorrectly.

A large multi-threaded application used by lots of people in the Web 2.0 world is MySQL. MySQL, like any other application, has bugs. Take a look at an interesting bug report filed on Feb 5, 2007 regarding a MySQL crash: http://bugs.mysql.com/bug.php?id=26081

So, it looks as if an assertion is being tripped in the source code, but the code in the second update to the bug looks very straightforward. Maybe someone else is writing all over the mutex and corrupting its state? Definitely possible, but there is something strange about the bug reports that start to roll in afterward.

It seems that everyone hitting this bug is using AMD Opteron CPUs. Maybe this is a hardware bug?

AMD Opteron bug causes a MySQL crash

It turns out that MySQL was crashing because of a hardware bug in AMD Opteron Revision E and pre-release Revision F processors.

There is no official errata published about this, but we can find clues in the OpenSolaris mutex implementation.

From OpenSolaris’ IA-32 lock_prim.S file (NOTE: AT&T syntax):

560	ENTRY_NP(mutex_enter)
	...
563	lock
564	cmpxchgq %rdx, (%rdi)
565	jnz	mutex_vector_enter
566	.mutex_enter_lockstat_patch_point:
567	#if defined(OPTERON_WORKAROUND_6323525)
568	.mutex_enter_6323525_patch_point:
569	ret		/* nop space for lfence */
570	nop
571	nop
	....

Note that on line 563 the lock prefix is used, so the cmpxchg instruction should be atomic. But something weird is going on here: notice how line 567 checks for a macro called OPTERON_WORKAROUND_6323525? That looks suspicious, as does the comment on line 569. The nop instruction is a no-op; it doesn’t do anything. In this case it looks as if the nop instructions are being used to set aside space that will be overwritten later with an lfence if you have an affected AMD CPU.

If you look up lfence and nop in the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M, you can see that the nop instruction is 1 byte wide and the lfence instruction is 3 bytes wide.

Looks like the OpenSolaris guys left just enough room for one lfence.

If we dig a little deeper in lock_prim.S, we discover a helpful comment:

1133	/*
1134	 * patch_workaround_6323525: provide workaround for 6323525
1135	 *
1136	 * The workaround is to place a fencing instruction (lfence) between the
1137	 * mutex operation and the subsequent read-modify-write instruction.
1138	 *
1139	 * This routine hot patches the lfence instruction on top of the space
1140	 * reserved by nops in the lock enter routines.
1141	 */

Wow.

That is extremely painful. So it turns out that an lfence instruction is needed after the cmpxchg. lfence (load fence) forces all memory reads preceding the lfence to complete before any memory reads following it. Without the lfence, a read issued after the cmpxchg can be satisfied before the cmpxchg completes, and that read can return stale/incomplete state information.

The workaround hot patches the kernel and writes an lfence instruction in the space set aside by the nop instructions on lines 570 and 571.

Thankfully, MySQL is well written and a runtime assertion caught this bug before it could do real damage or corrupt data.

But what about your multi-threaded code? Is it equally well-written? Would you have caught the same hardware bug with runtime assertions? Now is definitely a good time to go add those assertions you’ve been putting off.

Current state of affairs of this bug

This bug is fixed in recent versions of OpenSolaris, and it is unclear whether or not this is currently an issue with Linux. Depending on which version of the Linux kernel you are using, Linux uses an atomic decrement (dec) in the fast path and an xchg in the slow path. With the lack of an official public errata about this issue, it is difficult to tell which other instructions could be affected.

How to check if you are affected

Special thanks to vicaya on hacker news for suggesting this!

If you are using Linux, you can simply check /proc/cpuinfo:

$ cat /proc/cpuinfo

If your CPU family is 15 and your model is 32-63 inclusive, your processor is buggy!

Conclusion

There are a couple of things to take away from this: