The recent reveal of Meltdown and Spectre reminded me of the time I found a related design bug in the Xbox 360 CPU – a newly added instruction whose mere existence was dangerous.

Back in 2005 I was the Xbox 360 CPU guy. I lived and breathed that chip. I still have a 30-cm CPU wafer on my wall, and a four-foot poster of the CPU’s layout. I spent so much time understanding how that CPU’s pipelines worked that when I was asked to investigate some impossible crashes I was able to intuit how a design bug must be their cause. But first, some background…

The Xbox 360 CPU is a three-core PowerPC chip made by IBM. The three cores sit in three separate quadrants, with the fourth quadrant containing a 1-MB L2 cache – you can see the different components in the picture at right and on my CPU wafer. Each core has a 32-KB instruction cache and a 32-KB data cache.

Trivia: Core 0 was closer to the L2 cache and had measurably lower L2 latencies.

The Xbox 360 CPU had high latencies for everything, with memory latencies being particularly bad. And, the 1-MB L2 cache (all that could fit) was pretty small for a three-core CPU. So, conserving space in the L2 cache in order to minimize cache misses was important.

CPU caches improve performance due to spatial and temporal locality. Spatial locality means that if you’ve used one byte of data then you’ll probably use other nearby bytes of data soon. Temporal locality means that if you’ve used some memory then you will probably use it again in the near future.
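As a minimal sketch of spatial locality (my own illustration, not from the console's code), consider summing the same 2D array in two traversal orders. The row-major loop touches adjacent bytes, so every fetched cache line is fully used; the column-major loop strides a full row at a time, so on a large array it uses only one element per line before the line is evicted:

```c
#include <stddef.h>

enum { ROWS = 512, COLS = 512 };

/* Row-major traversal: consecutive elements share cache lines,
   so each line fetched from memory is fully consumed. */
long sum_row_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Column-major traversal: each access strides COLS * sizeof(int)
   bytes, touching a different cache line almost every time. */
long sum_col_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}
```

Both functions compute the same result; on a large enough array only the first one makes good use of the cache.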

But sometimes temporal locality doesn’t actually happen. If you are processing a large array of data once-per-frame then it may be trivially provable that it will all be gone from the L2 cache by the time you need it again. You still want that data in the L1 cache so that you can benefit from spatial locality, but having it consuming valuable space in the L2 cache just means it will evict other data, perhaps slowing down the other two cores.

Normally this is unavoidable. The memory coherency mechanism of our PowerPC CPU required that all data in the L1 caches also be in the L2 cache. The MESI protocol used for memory coherency requires that when one core writes to a cache line that any other cores with a copy of the same cache line need to discard it – and the L2 cache was responsible for keeping track of which L1 caches were caching which addresses.
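The invalidation rule described above can be sketched as a toy state machine (my own illustration of the MESI idea, not IBM's actual implementation): each core's L1 holds a line in one of four states, and a write by any core forces every other core's copy to Invalid.

```c
/* Toy MESI sketch for one cache line across three cores.
   Real hardware tracks this per line, with the L2 acting as
   the directory of which L1s hold which addresses. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;

#define NUM_CORES 3

static MesiState l1[NUM_CORES];  /* all start as INVALID (0) */

/* A core reads the line: it gets at least a Shared copy. */
void core_read(int core) {
    if (l1[core] == INVALID)
        l1[core] = SHARED;
}

/* A core writes the line: its copy becomes Modified, and every
   other core must discard its copy - this is the step that
   xdcbt-prefetched lines silently opt out of. */
void core_write(int core) {
    for (int i = 0; i < NUM_CORES; i++)
        if (i != core)
            l1[i] = INVALID;
    l1[core] = MODIFIED;
}
```

With this invariant intact, no two cores can ever disagree about a line's contents; the rest of the post is about what happens when an instruction bypasses it.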

But, the CPU was for a video game console and performance trumped all so a new instruction was added – xdcbt. The normal PowerPC dcbt instruction was a typical prefetch instruction. The xdcbt instruction was an extended prefetch instruction that fetched straight from memory to the L1 d-cache, skipping L2. This meant that memory coherency was no longer guaranteed, but hey, we’re video game programmers, we know what we’re doing, it will be fine.

Oops.

I wrote a widely-used Xbox 360 memory copy routine that optionally used xdcbt. Prefetching the source data was crucial for performance, and normally it would use dcbt, but if the caller passed in the PREFETCH_EX flag it would prefetch with xdcbt instead. This was not well thought out. The prefetching was basically:

    if (flags & PREFETCH_EX)
        __xdcbt(src + offset);
    else
        __dcbt(src + offset);

A game developer who was using this function reported weird crashes – heap corruption crashes, but the heap structures in the memory dumps looked normal. After staring at the crash dumps for a while I realized what a mistake I had made.

Memory that is prefetched with xdcbt is toxic. If it is written by another core before being flushed from L1 then two cores have different views of memory and there is no guarantee their views will ever converge. The Xbox 360 cache lines were 128 bytes and my copy routine’s prefetching went right to the end of the source memory, meaning that xdcbt was applied to some cache lines whose latter portions were part of adjacent data structures. Typically this was heap metadata – at least that’s where we saw the crashes. The incoherent core saw stale data (despite careful use of locks) and crashed, but the crash dump wrote out the actual contents of RAM rather than the stale values the crashing core had seen, so we couldn’t see what happened.

So, the only safe way to use xdcbt was to be very careful not to prefetch even a single byte beyond the end of the buffer. I fixed my memory copy routine to avoid prefetching too far, but while waiting for the fix the game developer stopped passing the PREFETCH_EX flag and the crashes went away.
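The fix can be sketched as follows (the function name and shape are mine, not the shipped routine): only cache lines that lie entirely within the source buffer may be prefetched with xdcbt, so the prefetch limit is the buffer's end rounded down to a 128-byte cache-line boundary. The post describes the end-of-buffer hazard; a fully paranoid version would also skip a partially covered first line for the same reason.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 128  /* Xbox 360 cache line size in bytes */

/* Returns the offset one past the last byte that is safe to
   prefetch with xdcbt: the end of the buffer rounded down to a
   cache-line boundary, so no line shared with adjacent data
   (e.g. heap metadata) is ever made incoherent. */
size_t safe_prefetch_limit(const void *src, size_t size) {
    uintptr_t start = (uintptr_t)src;
    uintptr_t end = start + size;
    uintptr_t last_full_line_end = end & ~(uintptr_t)(CACHE_LINE - 1);
    if (last_full_line_end <= start)
        return 0;  /* buffer doesn't fully cover even one line */
    return last_full_line_end - start;
}
```

The prefetch loop then iterates offsets up to this limit in CACHE_LINE steps instead of running to the end of the copy.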

The real bug

So far so normal, right? Cocky game developers play with fire, fly too close to the sun, marry their mothers, and a game console almost misses Christmas.

But, we caught it in time, we got away with it, and we were all set to ship the games and the console and go home happy.

And then the same game started crashing again.

The symptoms were identical. Except that the game was no longer using the xdcbt instruction. I could step through the code and see that. We had a serious problem.

I used the ancient debugging technique of staring at my screen with a blank mind, let the CPU pipelines fill my subconscious, and I suddenly realized the problem. A quick email to IBM confirmed my suspicion about a subtle internal CPU detail that I had never thought about before. And it’s the same culprit behind Meltdown and Spectre.

The Xbox 360 CPU is an in-order CPU. It’s pretty simple really, relying on its high frequency (not as high as hoped despite 10 FO4) for performance. But it does have a branch predictor – its very long pipelines make that necessary. Here’s a publicly shared CPU pipeline diagram I made (my cycle-accurate version is NDA only, but looky here) that shows all of the pipelines:

You can see the branch predictor, and you can see that the pipelines are very long (wide on the diagram) – plenty long enough for mispredicted instructions to get up to speed, even with in-order processing.

So, the branch predictor makes a prediction and the predicted instructions are fetched, decoded, and executed – but not retired until the prediction is known to be correct. Sound familiar? The realization I had – it was new to me at the time – was what it meant to speculatively execute a prefetch. The latencies were long, so it was important to get the prefetch transaction on the bus as soon as possible, and once a prefetch had been initiated there was no way to cancel it. So a speculatively-executed xdcbt was identical to a real xdcbt! (a speculatively-executed load instruction was just a prefetch, FWIW).

And that was the problem – the branch predictor would sometimes cause xdcbt instructions to be speculatively executed and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this – replace every xdcbt in the game with a breakpoint. This achieved two things:

1. The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
2. The crashes went away.

I knew that would be the result and yet it was still amazing. All these years later, and even after reading about Meltdown, it’s still nerdy cool to see solid proof that instructions that were not executed were causing crashes.

The branch predictor realization made it clear that this instruction was too dangerous to have anywhere in the code segment of any game – controlling when an instruction might be speculatively executed is too difficult. The branch predictor for indirect branches could, theoretically, predict any address, so there was no “safe place” to put an xdcbt instruction. And, if speculatively executed, it would happily do an extended prefetch of whatever memory the specified registers happened to randomly contain. It was possible to reduce the risk, but not eliminate it, and it just wasn’t worth it. While Xbox 360 architecture discussions continue to mention the instruction, I doubt that any games ever shipped with it.

I mentioned this once during a job interview – “describe the toughest bug you’ve had to investigate” – and the interviewer’s reaction was “yeah, we hit something similar on the Alpha processor”. The more things change…

Thanks to Michael for some editing.

Postscript

How can a branch that is never taken be predicted to be taken? Easy. Branch predictors don’t maintain perfect history for every branch in the executable – that would be impractical. Instead, simple branch predictors typically squish together a bunch of address bits, maybe some branch history bits as well, and index into an array of two-bit entries. Thus, the branch prediction result is affected by other, unrelated branches, leading to sometimes spurious predictions. But it’s okay, because it’s “just a prediction” and it doesn’t need to be right.
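This aliasing can be shown with a toy predictor (my own sketch, not the Xbox 360's actual design): a small table of two-bit saturating counters indexed by a hash of the branch address. Because many addresses map to the same entry, a frequently taken branch can train an entry and make a completely unrelated, never-executed branch appear "predicted taken".

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_BITS 6
#define TABLE_SIZE (1u << TABLE_BITS)

/* Two-bit saturating counters: 0-1 predict not-taken, 2-3 taken. */
static uint8_t counters[TABLE_SIZE];

/* Squish the branch address down to a table index; different
   branches can collide on the same entry (aliasing). */
static unsigned index_of(uint32_t branch_addr) {
    return (branch_addr >> 2) & (TABLE_SIZE - 1);
}

bool predict_taken(uint32_t branch_addr) {
    return counters[index_of(branch_addr)] >= 2;
}

/* After a branch resolves, nudge its counter toward the outcome. */
void update(uint32_t branch_addr, bool taken) {
    uint8_t *c = &counters[index_of(branch_addr)];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

Here addresses 0x1000 and 0x1100 hash to the same entry, so training one trains the other – which is exactly how a branch guarding an xdcbt that is never taken can still be predicted taken.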

Discussions of this post can be found on hacker news, r/programming, r/emulation, and twitter.

A somewhat related (Xbox, caches) bug was discussed a few years ago here.