Often I’ve found myself wanting to point someone to a description of the x86’s memory model, but there wasn’t any that quite laid it out the way I wanted. So this is my take on how shared memory works on multiprocessor x86 systems. The guts of this description is adapted/copied from “A Better x86 Memory Model: x86-TSO” by Scott Owens, Susmit Sarkar, and Peter Sewell; this presentation strips away most of the math and presents it in a more operational style. Any mistakes are almost certainly mine and not theirs.

Components of the System:

There is a memory subsystem that supports the following operations: store, load, fence, lock, unlock. The memory subsystem contains the following:

Memory: A map from addresses to values Write buffers: Per-processor lists of (address, value) pairs; these are pending writes, waiting to be sent to memory “The Lock”: Which processor holds the lock, or None, if it is not held. Roughly speaking, while the lock is held, only the processor that holds it can perform memory operations.

There is a set of processors that execute instructions in program order, dispatching commands to the memory subsystem when they need to do memory operations. Atomic instructions are implemented by taking “the lock”, doing whatever reads and writes are necessary, and then dropping “the lock”. We abstract away from this.

Definitions

A processor is “not blocked” if either the lock is unheld or it holds the lock.

Memory System Operation

Processors issue commands to the memory subsystem. The subsystem loops, processing commands; each iteration it can pick the command issued by any of the processors to execute. (Each will only have one.) Some of the commands issued by processors may not be eligible to execute because their preconditions do not hold.

If a processor p wants to read from address a and p is not blocked:

a. If there are no pending writes to a in p’s write buffer, return the value from memory

b. If there is a pending write to a in p’s write buffer, return the most recent value in the write buffer If a processor p wants to write value v to address a, add (a, v) to the back of p’s write buffer At any time, if a processor p is not blocked, the memory subsystem can remove the oldest entry (a, v) from p’s write buffer and update memory so that a maps to v If a processor p wants to issue a barrier

a. If the barrier is an MFENCE, p’s write buffer must be empty

b. If the barrier is an LFENCE/SFENCE, there are no preconditions; these are no-ops ** If a processor p’s wants to lock the lock, the lock must not be held and p’s write buffer must be empty; the lock is set to be p If a processor p’s wants to unlock the lock, the lock must held by p and p’s write buffer must be empty; the lock is set to be None

Remarks

So, the only funny business that can happen is that a load can happen before a prior store to a different location has been flushed from the write buffer into memory. This means that if CPU0 executes “x = 1; r0 = y” and CPU1 executes “y = 1; r1 = x”, with x and y both initially zero, we can get “r0 == r1 == 0”.

The common intuition that atomic instructions act like there is an MFENCE before and after them is basically right; MFENCE requires the write buffer to empty before it can execute and so do lock and unlock.

x86 is a pleasure to compile atomics code for. The “release” and “acquire” operations in the C++11 memory model don’t require any fencing to work. Neither do the notions of “execution order” and “visibility order” in my advisor and my RMC memory model.

** The story about LFENCE/SFENCE is a little complicated. Some sources insist that they actually do things. The Cambridge model models them as no-ops. The guarantees that they are documented to provide are just true all the time, though. I think they are useful when using non-temporal memory accesses (which I’ve never done), but not in general.