My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in brewing, cooking and skiing. I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our other serverless products. Before that, I worked on EC2 and EBS. All opinions are my own.

C++11's atomic and volatile, under the hood on x86

In my previous post, Java's Atomic and volatile, under the hood on x86, I looked at Atomic and volatile in Java, and how they affect the generated assembly. In this post, I'm looking at std::atomic and volatile in C++.

Like in Java, it's well known that std::atomic and volatile have different meanings in C++, but it's still interesting to take a look at how that translates to what actually gets run. Let's start with a very simple program:

for (int i = 0; i < 500000000; i++) { x += 0x3; }

Then define x in one of three ways:

long x;
volatile long x;
std::atomic_long x;

Before digging directly into the assembly, we can compare the run-time of the three programs (on a Core2 Q6600 compiled with gcc4.6 -O2):

long took 0.0018s

volatile took 1.9s

atomic_long took 8.5s

It's clear from the difference in run times that these three programs produce significantly different code. The step up from long to volatile long is around 1000x, and from volatile long to atomic_long another 4x. Starting with the assembly for the long version, we can see why it's so fast:

addq $1500000000, %rsi

Oh gcc, you're sneaky. The compiler has completely discarded the loop, and calculated the result into a constant. Without the guarantees of atomic or volatile, it's free to make optimizations like this. Next, the volatile version:

movl $500000000, %eax
.L2:
movq x(%rip), %rdx
addq $3, %rdx
subl $1, %eax
movq %rdx, x(%rip)
jne .L2

The inclusion of volatile has forced the compiler to not only run the loop, but also load and store the variable from memory on every iteration (the two movq instructions). The overhead of this is clearly significant, but it's hard to separate the effects of the two. Moving the load and store out of the loop breaks the volatile guarantee, but keeps the loop:

movl $500000000, %eax
movq x(%rip), %rdx
.L2:
addq $3, %rdx
subl $1, %eax
jne .L2
movq %rdx, x(%rip)

That version takes about 0.3s to run, so it's clear that memory access time dominates the run time of the volatile program. It's also somewhat interesting that gcc has chosen to load, modify and store instead of doing the add directly to memory. We can modify the original volatile version to do that:

movl $500000000, %eax
.L2:
addq $3, x(%rip)
subl $1, %eax
jne .L2

That's much less code, but it's no faster. In fact, taking the average of a very large number of runs shows that it's about 1% slower on my hardware. perf stat tells the story for the unmodified load-modify-store code:

3,006,477,937 cycles
2,504,227,487 instructions  # 0.83 insns per cycle

and for the modified code:

3,035,431,555 cycles
1,504,575,146 instructions  # 0.50 insns per cycle

The modified code saves two fifths of the instructions, but drops the instructions per cycle by slightly more. After that diversion, on to the atomic_long version of our original program:

movl $500000000, %eax
.L2:
lock addq $3, x(%rip)
subl $1, %eax
jne .L2

GCC is generating smarter code than Java does with volatile, but uses a similar basic approach: a lock-prefixed instruction acts as a write barrier. In this case, gcc uses the same instruction to do the addition, which makes the code much simpler (and much faster on some machines). The performance impact of that lock prefix, compared to the nearly identical code above without it, is quite clear. perf stat says:

13,522,633,468 cycles
1,513,530,642 instructions  # 0.11 insns per cycle

That's 0.11 instructions per cycle, compared to 0.5 for the non-locked version above. Comparing these two pieces of code makes the difference in intent between atomic and volatile quite clear. Even ignoring memory ordering issues, the volatile code can lead to an incorrect answer under concurrent access. Take another look at that code:

movl $500000000, %eax
.L2:
movq x(%rip), %rdx
addq $3, %rdx
subl $1, %eax
movq %rdx, x(%rip)
jne .L2