This is a story about the fun part of my job: getting handed something close to the hardware and unexplainable, and tracking down the odd details that explain it. What can I say, it beats code reviews and writing design docs. (All quotes and all code here are paraphrased to omit confidential details and protect the innocent; numbers have been replaced with representative but fake data; I don’t speak for my employer or any of my coworkers.)

A while back a coworker circulated a draft of our paper on convenient RCU APIs for internal comment. Someone replied “Looks interesting–our team wrote something similar, called Locked, a while back which we use for the same reason.” Now, I’m pretty proud of View as an interface, and I’m even more proud of our RCU implementation that backs it; to my knowledge it’s the fastest and most scalable userspace version in existence…and it took quite a lot of time and trouble for me to get it to that point. So I told him, well, that: “Not bad, but looking at your implementation, I’m pretty sure View will be 3-4x faster, not to mention scale better to multiple cores. Give it a try!”

So he does try it. “Well, you’re sort of right! I wrote a simple multithreaded benchmark and View is usually faster…except it’s inconsistent. Every few runs it slows down by a factor of 10 or more. I don’t think I like that.” What can I say? Talk shit, get hit. But it really didn’t seem right to me–I knew what their code did and I knew what my code did, and there shouldn’t have been any way mine could be slowed down. So I was determined to find out why.

I checked out his benchmark, started running it, and quickly found he was right.

Pinning to well-chosen CPUs didn’t help; running for longer periods made no difference–so neither the load balancer nor some fixed overhead was to blame. The profiler showed no difference in called functions from good runs to bad, so nothing was interrupting the code, and it wasn’t causing any background work to show up. The same code was running more slowly; nothing else explained it, and `perf stat` showed a huge change in IPC.

But here’s the thing: our RCU was very carefully designed to avoid any sort of false sharing or other cache coherency problems. All of the control words were carefully segregated; code running on CPU 3 should only touch one page, full of other CPU 3 data. Moreover, I had my own benchmarks, doing very similar things to my coworker’s, that proved it (and remained consistently fast). What else could be causing that sort of thing?

My coworker had written the code in what I thought was a slightly odd fashion. I’d write a parallel benchmark roughly like so (using Google’s benchmarking library):
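Something in this shape, say–reconstructed from memory, not the actual benchmark. The `FakeView` type here is an illustrative stand-in for the real `rcu::View` interface, which I can’t show:

```cpp
#include <benchmark/benchmark.h>

#include <atomic>

// Stand-in for rcu::View<char>; the real type and its API are not shown here.
struct FakeView {
  std::atomic<const char*> ptr{"hello"};
  const char* read() const { return ptr.load(std::memory_order_acquire); }
};

FakeView g_view;  // shared state lives at namespace scope, not on any stack

static void BM_ViewRead(benchmark::State& state) {
  for (auto _ : state) {
    // Copy the protected pointer and dereference it.
    benchmark::DoNotOptimize(*g_view.read());
  }
}
// The library spawns (and times) the requested number of threads itself.
BENCHMARK(BM_ViewRead)->ThreadRange(1, 16);

BENCHMARK_MAIN();
```

The important structural point: the library owns the threads, and the shared object is a global.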

But his was different: instead of using our library’s ability to spawn a parallel benchmark directly, he spawned the threads himself. (He was careful to avoid having the thread creation attributed to the cost of the benchmark.) It’s not what I would have done, and I think it looks clunky, but it shouldn’t have mattered, right? Yet it clearly did–my test was consistently fast.

At this point, I could have stopped; it was clearly a benchmark effect. But I felt pot-committed, so I kept looking, and found something even stranger. I wanted to double-check that kernel timekeeping wasn’t buggy (we’ve seen problems there before), so I added some manual timing with rdtsc. And when we ran it…too bad–the manual timing agreed 100% with the kernel: consistently 15-20ns per rep, every run, like clockwork.

Wait, what?

Adding the calls to rdtsc fixed the problem. Entirely.

Some more fiddling proved this wasn’t just rdtsc – any number of tiny changes to the code made the problem disappear. Add a debug print in the wrong place, move some data to global locations, change compiler flags, run under GDB, do everything in a dedicated thread pool…all sorts of things might fix the problem. It was incredibly sensitive. What the heck did this benchmark do wrong? I was tempted to dismiss it as one of those microarchitectural anomalies, or a compiler issue, but…

At this point the benchmark’s author mentioned that he had noticed more memory traffic during bad runs. I had dismissed cache interference–I *knew* my RCU was designed so that taking and dropping reader locks couldn’t cause it–but that had been stupid. Just because I knew my code didn’t have a problem didn’t excuse not measuring it. In God We Trust, All Others Pay Cash.

So I brought out `perf stat` again. Good runs showed next to no L3 misses, and bad ones a constant stream. Pinned to one socket, bad ones stopped showing L3 misses…and replaced them with L3_HITS marked as pulled from another CPU’s L2. Bingo. Someone was playing cache ping pong. But who?

PEBS is very powerful here; it allows me to find (to the instruction!) the precise location of every cache miss. And they were all in the same place: rcu::View<char>, internally, is just a const char pointer. We take an RCU read lock, copy that pointer, and then dereference it safely…and copying the pointer was missing L3. How could that be happening?
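In pseudocode, the reader hot path has this shape (the real interface differs; this is just the structure):

```cpp
// Pseudocode sketch of the reader hot path -- not our real interface.
auto guard = rcu_read_lock();        // take an RCU read lock
const char* p = view.raw_pointer();  // copy the pointer: THIS load was missing L3
use(*p);                             // dereference safely under the lock
// guard's destructor drops the read lock
```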

The pointer read couldn’t be the objectionable part; any number of CPUs could read that pointer with a cache line in SHARED state, and no one was changing it. Something on that cache line was being written, but I couldn’t see it in the code. The benchmark inner loop wasn’t doing any modifications at all. What’s more, my coworker’s Locked abstraction was a very similar protected pointer, and the inner loop read that pointer exactly the same way–why wasn’t he seeing the same cache traffic?

Annoyingly, perf is less useful here; memory store events are much less configurable. There isn’t any event at all for store uops that miss cache! (I am given to understand this is because stores are buffered in a way that makes them more difficult to track.) But I can get a profile of every store, good run or bad. So I did; thankfully, there weren’t too many. Most looked totally innocuous…all RCU traffic was on different cache lines. The only store I couldn’t account for was the `call` instruction…and then I knew what had happened.

The RSEQ technology backing our RCU implementation has one major limitation: the code the kernel accelerates must live in one contiguous block of assembly per binary. So RSEQ operations are emitted as calls into that block, which cannot be inlined. (Our upcoming V2 release eliminates this problem–we’re looking forward to the performance gains of inlining!) Calls aren’t free, but avoiding atomic instructions is still worth it.

But here it was killing us. The strange way my coworker had written his benchmark meant that the shared rcu::View object was on the main benchmark thread’s stack…exactly 48 bytes from the stack’s frontier. Since the main thread was participating in the benchmark loop, it was constantly pushing a return address onto the stack, 48 bytes away; when that push landed on the same cache line as our shared pointer, it ruined every reader’s day.

This neatly explained everything! My coworker’s Locked abstraction was inlined, so it didn’t touch the main thread’s stack in the inner loop. Adding timestamping (or debug printfs) enlarged the stack frame. Moving the relevant data to the heap took the shared pointer away from the writes.

The last question: why was the problem inconsistent? ASLR. It was easy to verify that the address of the stack variable changed from run to run, in 16-byte increments. This was a bit surprising, actually–I knew a lot of kernel mappings would be randomized, but I expected this to only affect things at the page level (which wouldn’t change alignments). Was libc startup playing silly buggers? I made a trivial binary with no user space code at all other than exiting with a return status of the stack alignment:

and the results were clear: the kernel does in fact apply ASLR at very fine granularity to the main thread stack. Learn something new every day, I guess. (If we’d so much as run his benchmark in a child thread, this would never have become visible.) Now that I knew where to look, it was easy to find with some judicious use of LXR and LWN.

So like so many of my work investigations, I was done, and on one level had nothing to show for it: my code was fast and didn’t need changing. (In any real binary, concurrently accessed Views wouldn’t live on an active thread’s last stack frame. One hopes!) I guess I walked away with vindication that I did, indeed, have the fastest RCU in the west. And there were a few good lessons.

Measure everything, even the things you’re sure about.

Tiny structural differences can matter.

When in doubt, check the cache misses.

And if nothing else it was great to see a new case of interference; I’ve never before seen cache ping-pong due to call instructions. (Have you?)

My only annoyance is that I still don’t know a good, principled way to use perf to find the store uops that force a cache line from SHARED to EXCLUSIVE. If you do, drop me a line. At the moment, I’m stuck doing a lot of assembly-level forensics. Mind you, I *enjoy* having one window of `perf report -D` and one of objdump, but I do have a day job.

If you find stories like this interesting and want to share your own; if you have suggestions for doing better analysis of this sort; if you have questions about high performance multithreaded C++; if you want to correct me or give me advice about anything; if you just want to chat and think I might be able to help you somehow, please email me. I want to hear from you–yes, you–about these things. Some of my best work has come out of random strangers asking me questions…you never know what’s possible until you see someone ask what isn’t. My email is in the sidebar, or just rot13: naqerjuuhagre@tznvy.pbz