Microcode Mystery

Did you ever wonder what is inside those microcode updates that get silently applied to your CPU via Windows update, BIOS upgrades, and various microcode packages on Linux?

Well, you are in the wrong place, because this blog post won’t answer that question (you might like this though).

In fact, the overwhelming majority of this post is about the performance of scattered writes, and not very much at all about the details of CPU microcode. Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store-heavy workload when some stores hit in the L1 data cache and some miss.

My results are intended to be reproducible and the benchmarking and data collection code is available as described at the bottom.

A series of random writes

How fast can you perform a series of random writes? Because of the importance of caching, you might reasonably expect that it depends heavily on how big a region the writes are scattered across, and you’d be right. For example, if we test a series of random writes to a region that fits entirely in L1, we find that random writes take almost exactly 1 cycle on modern Intel chips, matching the published limit of one write per cycle.
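To make the setup concrete, here is a minimal sketch of such a random-write kernel. This isn’t the exact code from the benchmark project: the xorshift RNG and the function name are stand-ins, but the structure (mask a random value into a power-of-two-sized region, store one byte) matches the tests.

```c
#include <stddef.h>
#include <stdint.h>

// simple xorshift32 RNG, standing in for the benchmark's RAND_FUNC
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

// perform iters random single-byte writes scattered over a region whose
// size must be a power of two; time this loop externally (e.g. with rdtsc)
void writes_random(size_t iters, char *region, size_t size) {
    uint32_t rng = 0x12345678;
    do {
        region[xorshift32(&rng) & (size - 1)] = 1;
    } while (--iters > 0);
}
```

Varying `size` from L1-sized up through RAM-sized regions is what produces the behavior discussed next.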

If we use larger regions, we expect performance to slow down as many of the writes miss to outer cache levels. In fact, I measure roughly the following performance, whether for linear (64-byte stride) or random writes, across various region sizes:

| Region Size | Cycles/Write | Typical Read Latency |
|-------------|--------------|----------------------|
| L1          | 1            | 5                    |
| L2          | 3            | 12                   |
| L3          | 5-6          | ~35                  |
| RAM         | 15-20        | ~200                 |

I’ve also included a third column in the table above which records typical read latency figures for each cache level. This gives an indication of roughly how far away each cache is from the core, based on the round-trip read time. Since all normal stores also involve a read (to bring the cache line being written into the L1 cache with its existing contents), the time to “complete” a single store should be at least that long. As the observed time per write is much less, these tests must exhibit significant memory level parallelism (MLP), i.e., several store misses are in progress in the memory subsystem at once and their latencies overlap. We usually care about MLP when it comes to loads, but it is important also for a long stream of stores such as these benchmarks. The last line in the above table implies that we may have 10 or more stores in flight in the memory subsystem at once, in order to achieve an average store time of 15-20 cycles with a memory latency of about 200 cycles.
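That estimate is just a Little’s-law style calculation: concurrent stores in flight ≈ latency / observed time per store. Making the arithmetic explicit:

```c
// in-flight requests needed to sustain a given per-store throughput
// at a given round-trip latency (Little's law)
static double stores_in_flight(double latency_cycles, double cycles_per_store) {
    return latency_cycles / cycles_per_store;
}
```

For the RAM row of the table, 200 / 20 = 10 and 200 / 15 ≈ 13 stores in flight.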

You can reproduce this table yourself using the wrandom1-unroll and wlinear1 tests.

Interleaved writes

Let’s move on to the case where we actually observe some interesting behavior. Here we tackle the same scenario that I asked about in a twitter poll.

Consider the following loop, which writes randomly to two character arrays.

```c
int writes_inter(size_t iters, char *a1, size_t size1, char *a2, size_t size2) {
    rng_state rng = RAND_INIT;
    do {
        uint32_t val = RAND_FUNC(&rng);
        a1[(val & (size1 - 1))] = 1;
        a2[(val & (size2 - 1))] = 2;
    } while (--iters > 0);
    return 0;
}
```

Let’s say we fix the size of the first array, size1 , to something like half the size of the L2 cache, and evaluate the performance for a range of sizes for the second array, size2 . What type of performance do we expect? We already know the time it takes for a single write to regions of various size, so in principle one might expect the above loop to perform something like the sum of the time of one write to an L2-sized region (the write to a1 ) and one write to a size2 sized region (the write to a2 ).
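We can write that naive additive model down directly, using the per-write costs measured in the table above (the size thresholds assume my Skylake’s 32 KiB L1, 256 KiB L2 and 6 MiB L3):

```c
#include <stddef.h>

// approximate cycles per write for a region of the given size, taken
// from the measurements in the table above
static double write_cost(size_t region_kib) {
    if (region_kib <= 32)   return 1.0;   // L1
    if (region_kib <= 256)  return 3.0;   // L2
    if (region_kib <= 6144) return 5.5;   // L3 (5-6)
    return 17.5;                          // RAM (15-20)
}

// naive additive prediction for one iteration of the interleaved loop:
// one write to the fixed region plus one write to the varying region
static double predicted_iteration(size_t size1_kib, size_t size2_kib) {
    return write_cost(size1_kib) + write_cost(size2_kib);
}
```

For a 128 KiB fixed region this predicts about 4 cycles per iteration when the second region is L1-sized, rising to about 20 cycles when it is RAM-sized.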

Let’s try it! Here’s a test of single stores vs interleaved stores (with one of the interleaved stores accessing a fixed 128 KiB region), varying the size of the other region, run on my Skylake i7-6700HQ.

Overall, we see that the behavior of the two benchmarks roughly tracks each other, with the interleaved version (twice as many stores) taking longer than the single store version, as expected.

Especially for large region sizes (the right side of the graph), the assumption that interleaved accesses are more or less additive with the same accesses by themselves mostly pans out: there is a gap of about 4 cycles between the single stream and the stream with interleaved accesses, which is just slightly more than the cost of an L2 access. For small region sizes, the correspondence is less exact. In particular, the single stream drops down to ~1 cycle accesses when the region fits in L1, but in the interleaved case this doesn’t occur.

At least part of this behavior makes sense: the two streams of stores will interact in the caches, and the L1 contained region isn’t really “L1 contained” in the interleaved case because the second stream of stores will be evicting lines from L1 constantly. So with a 16 KiB second region, the test really behaves as if a 16 + 128 = 144 KiB region was being accessed, i.e., L2 resident, but in a biased way (with the 16 KiB block being accessed much more frequently), so there is no sharp decrease in iteration time at the 32 KiB boundary.

The weirdness begins

So far, so good, and nothing too weird. However, it is about to get weird!

Everything above is a reduced version of a benchmark I was using to test some real code about a year ago. That code had a tight loop with a table lookup followed by writes to two different arrays. When I benchmarked it, performance was usually consistent with that of the “interleaved” benchmark plotted above.

Recently, I returned to the benchmark to check the performance on newer CPU architectures. First, I went back to check the results on the original hardware (the Skylake i7-6700HQ in my laptop), and failed to reproduce them: with the same test and the same hardware as before, the benchmark always ran significantly slower (about half the original speed).

With some help from user Adrian on the RWT forums, I was able to bisect the difference down to a CPU microcode update. In particular, with the newest microcode version, 0xc6, the interleaved stores scenario runs much slower. For example, the same benchmark as above now looks like this, every time you run it:

The behavior of the interleaved test for small regions (left hand side of the chart) is drastically different: the throughput is less than half that of the old microcode. It is not obvious just by visual comparison, but performance is actually reduced across the whole range of tested sizes for the interleaved case, albeit by only a few cycles as the region size becomes large. I tested various microcode versions and found that only the most recent SKL microcode, revision 0xc6, released in August 2018, exhibits the “always slow” behavior shown above. The preceding version, 0xc2, usually results in the fast behavior.
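As an aside, if you want to know which microcode revision your own Linux machine is running, the kernel reports it (the 0xc6 value here is just an example):

```shell
# currently loaded microcode revision, one line per distinct value
# (e.g. "microcode : 0xc6"); dmesg also logs updates applied at boot
grep microcode /proc/cpuinfo | sort -u
```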

What’s up with that?

Performance Counters

We can check the performance counters to see if they reveal anything. We’ll use the l2_rqsts.references, l2_rqsts.all_rfo and l2_rqsts.rfo_miss counters, which count the total number of L2 accesses (references), the total accesses related to RFO requests (all_rfo, aka stores) from the core, as well as the number of RFOs that miss (rfo_miss). Since we are only performing stores, we expect these counts to match each other and to correspond to the number of L1 store misses, since any store that misses in L1 ultimately contributes an L2 access.
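On Linux, these counters can be collected with perf stat while running the benchmark (here ./bench interleaved refers to the test binary from the project linked at the bottom; the event names are the Intel client-part names used above):

```shell
# RFO-related L2 traffic for the interleaved store test
perf stat -e l2_rqsts.references,l2_rqsts.all_rfo,l2_rqsts.rfo_miss \
    ./bench interleaved
```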

Here’s the old microcode:

… and with the new microcode (note the change in the y axis; it’s about 3x slower for the L1 hit region):

Despite the large difference in performance, there is little to no difference in the relevant performance counters. In both cases, the number of L1 misses (i.e., L2 references) approaches 0.75 as the second region size approaches zero, as we’d expect (all L1 hits in the second region, and about 25% L1 hits in the 128 KiB fixed region, as the 32 KiB L1D is 25% of the size of that fixed region). On the right side, the number of L1 misses approaches something like 1.875, as the L1 hits in the 128 KiB region are cut in half by competition with the other large region.
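These expected values follow from simple arithmetic, assuming a 32 KiB L1D and uniform random addresses within each region:

```c
// fraction of the fixed 128 KiB region that can be L1D resident
static const double L1_FRAC = 32.0 / 128.0;   // 0.25

// second region tiny: its store always hits L1, and the store to the
// fixed region misses 75% of the time
static double misses_small_second_region(void) {
    return (1.0 - L1_FRAC) + 0.0;
}

// second region huge: its store always misses, and competition halves
// the L1 hits in the fixed region
static double misses_large_second_region(void) {
    return (1.0 - L1_FRAC / 2.0) + 1.0;
}
```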

So despite the much slower performance, for L1-sized second regions, the difference doesn’t obviously originate in different cache hit behavior. Indeed, with the new microcode, performance goes down as the L1 hit rate goes up.

So it seems that the likeliest explanation is that the presence of an L1 hit in the store buffer prevents overlapping of miss handling for stores on either side, at least with the new microcode, on SKL hardware. That is, a series of consecutive stores can be handled in parallel only if none of them is an L1 hit. In this way L1 store hits somehow act as a store fence with the new microcode. The performance is in line with each store going alone to the memory hierarchy: roughly the L2 latency plus a few cycles.

Will the real sfence please stand up

Let’s test the “L1 hits act as a store fence” theory. In fact, there is already an instruction that acts as a store fence in the x86 ISA: sfence. Repeatedly executed back-to-back, this instruction takes only a few cycles, but its most interesting effect occurs when stores are in the pipeline: it blocks dispatch of subsequent stores until all earlier stores have committed to the L1 cache, implying that stores on different sides of the fence cannot overlap.

We will look at two versions of the interleaved loop with sfence: one with the sfence inserted right after the store to the first region (the fixed 128 KiB one), and the other with it inserted after the store to the second region. Let’s call them sfenceA and sfenceB respectively. Both have the same number of fences (one per iteration, i.e., per pair of stores) and differ only in which store happens to be last in the store buffer when the sfence executes. Here’s the result on the new microcode (the results on the old microcode are over here):

The right side of the graph is fairly unremarkable: both versions with sfence perform roughly at the latency of the associated cache level, because there is zero memory level parallelism (no, I don’t know why one performs better than the other, or why the performance crosses over near 64 KiB). The left part is pretty amazing, though: one of the sfence configurations is faster than the same code without sfence. That’s right: adding a store-serializing instruction like sfence can speed up the code by several cycles. It doesn’t come close to the fast performance of the old microcode versions, but the behavior is very surprising nonetheless.

The version that was faster, sfenceA, had the sfence between the 128 KiB store and the L1 store. So perhaps there is some kind of penalty when an L1-hit store arrives right after an L1-miss-L2-hit store, in addition to the “no MLP” penalty we normally see.
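Concretely, the sfenceA variant looks something like this (same stand-in RNG as the earlier sketches; sfenceB just moves the _mm_sfence() after the a2 store instead):

```c
#include <emmintrin.h>  // _mm_sfence
#include <stddef.h>
#include <stdint.h>

// stand-in xorshift32 RNG for the benchmark's RAND_FUNC
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

// interleaved stores with an sfence after the store to the fixed
// region a1: stores on either side of the fence cannot overlap
void writes_sfenceA(size_t iters, char *a1, size_t size1,
                    char *a2, size_t size2) {
    uint32_t rng = 0x12345678;
    do {
        uint32_t val = xorshift32(&rng);
        a1[val & (size1 - 1)] = 1;
        _mm_sfence();
        a2[val & (size2 - 1)] = 2;
    } while (--iters > 0);
}
```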

Larger fixed regions

To this point we’ve been looking at the scenario where a write to a 128 KiB region is interleaved with a write to a region of varying size. The fixed size of 128 KiB means that most of those writes will be L2 hits. What if we make the fixed region larger? Let’s say 2 MiB, which is much larger than L2 (256 KiB) but still fits easily in L3 (6 MiB on my CPU). Now we expect most writes to the fixed region to be L2 misses but L3 hits.

What’s the behavior? Here’s the old microcode:

… and the new:

Again we see a large performance impact with the new microcode, and the results are consistent with the theory that L1 hits in the store stream prevent overlapping of store misses on either side. In particular, we see that the region with L1 hits takes about 37 cycles, almost exactly the L3 latency on this CPU. In this scenario, it is slower to have L1 hits mixed into the stream of accesses than to replace those L1 hits with misses to DRAM. That’s a remarkable demonstration of the power of memory level parallelism and of the potential impact of this change.

Why?

I can’t tell you for certain why the store-related machinery acts the way it does in this case. Speculating is fun though, so let’s do that. Here are a couple of possibilities.

The x86 Memory Model

First, let’s quickly review the x86 memory model.

The x86 has a relatively strong memory model. Intel doesn’t give it a handy name, but let’s call it x86-TSO. In x86-TSO, stores from all CPUs appear in a global total order, with the stores from each CPU consistent with program order. If a given CPU makes stores A and B, in that order, all other CPUs will observe not only a consistent order of stores A and B, but that same A-before-B program order. All this store ordering complicates the pipeline. In weaker memory models like those of ARM and POWER, in the absence of fences, you can simply commit senior stores in whatever order is convenient. If some store locations are already present in L1, you can commit those, while making RFO requests for the other store locations which aren’t in L1.

An x86 CPU has to take a more conservative strategy. The basic idea is that stores are only made globally observable in program order as they reach the head of the store buffer. The CPU may still try to get parallelism by prefetching upcoming stores, as described for example in Intel’s US patent 7130965 - but care must be taken. For example, any snoop request that comes in for any of the lines in flight must see a consistent result: whether the line is in a write-back buffer being evicted from L1, in a fill buffer making its way to L2, in a write-combining buffer waiting to commit to L1, and so on.

Write Combining Buffers

With that out of the way, let’s talk about how the store pipeline might actually work.

Let’s assume that when a store misses in the L1, it allocates a fill buffer to fetch the associated line from the outer levels of the memory hierarchy (we can be pretty sure this is true). Let’s further assume that if another store in the store buffer reaches the head of the store buffer and is to the same line, we effectively get a “fill buffer hit”, and in this case the store is merged into the existing fill buffer and removed from the store buffer*. That is, the fill buffer entry itself keeps track of the written bytes, and merges those bytes with any unwritten ones when the line returns from the memory hierarchy, before finally committing the line to L1.
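As a toy model of that assumption (purely illustrative: the real structure is not architecturally visible, and the field layout here is invented), a fill buffer entry tracking written bytes might look like:

```c
#include <stdint.h>

// hypothetical fill buffer entry that accumulates the bytes of stores
// to a line still in flight from the memory hierarchy
typedef struct {
    uint64_t line_addr;      // aligned address of the 64B line
    uint8_t  data[64];       // pending store data
    uint64_t written_mask;   // bit i set => data[i] holds a store byte
} fill_buffer;

// merge a store into an in-flight fill buffer (a "fill buffer hit")
static void fb_store(fill_buffer *fb, unsigned offset, uint8_t byte) {
    fb->data[offset] = byte;
    fb->written_mask |= 1ull << offset;
}

// when the line arrives from memory, keep the written bytes and fill in
// the rest, producing the value that finally commits to L1
static void fb_merge(fill_buffer *fb, const uint8_t line[64]) {
    for (unsigned i = 0; i < 64; i++)
        if (!((fb->written_mask >> i) & 1))
            fb->data[i] = line[i];
}
```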

In the scenario where there are outstanding fill buffers containing store data, committing stores that hit in L1 is tricky: if you have several outstanding fill buffers for outstanding stores, as well as several interleaved L1-hit stores, the strong memory model used by x86 and described above means that you have to ensure that any incoming snoop request sees all those stores in a consistent order. You can’t just snoop all the fill buffers and then the L1, or vice-versa, since that might change the apparent order. Additionally, stores become globally visible when they are committed to the L1, but the global observability point for stores whose data is being collected in the fill buffers is less clear.

One simple approach for dealing with L1-hit stores when there are outstanding stores in the fill buffers is to delay the L1-hit store until the outstanding stores complete and are committed to L1. This would prevent any parallelism between stores with an intervening L1 hit, unless RFO prefetching kicks in. So perhaps the difference is whether the RFO prefetch heuristic determines it is profitable to prefetch stores. Or perhaps the CPU is able to choose between two strategies in this scenario, one of which allows parallelism and one of which doesn’t. For example, perhaps the L1-hit stores could themselves be buffered in fill buffers, which seems silly except that it may allow preserving the order among stores regardless of whether they hit or miss in L1. For whatever reason, the CPU chooses the no-parallelism strategy more often with the new microcode.

Perhaps the overlapping behavior was completely disabled to support some recent type of Spectre mitigation (see, for example, the SSB disable functionality, which was probably added in this newest microcode version).

Without more details on the mechanisms in modern Intel CPUs it is hard to say more, but there are certainly cases where extreme care has to be taken to preserve the order of writes. The fill buffers used for L1 misses, as well as the associated components in the outer cache layers, already need to be ordered to support the memory model (which also disallows load-load reordering), so in that sense all the stores that miss L1 are already in good hands. Stores that want to commit directly to L1 are more problematic, since once committed they are no longer tracked and have become globally observable (a snoop may arrive at any moment and see the newly written value). I did take a good long look at the patents, but didn’t find any smoking gun to explain the current behavior.

Workarounds

Now that we’re aware of the problem, is there anything we can do if we are bitten by it? Yes.

Avoid or reduce fine-grained interleaving

The problem occurs when you have fine-grained interleaving between L1 hits and L1 misses. Sometimes you can avoid the interleaving entirely, but if not, you can perhaps make it coarser grained. For example, the current interleaved test alternates between L1 hits and L1 misses, like L1-hit, L1-miss, L1-hit, L1-miss. If you unroll by a factor of two and then move the writes to the same region to be adjacent in the source (which doesn’t change the semantics, since the regions don’t overlap), you get coarser grained interleaving, like L1-hit, L1-hit, L1-miss, L1-miss. Based on our theory of reduced memory level parallelism, grouping the stores in this way should allow at least some overlapping (in this example, two stores can be overlapped).

Let’s try this, comparing unrolling by a factor of two and four against the plain non-unrolled version. The main loop in the factor-of-two unrolled version (the factor-of-four version is analogous) looks like:

```c
do {
    uint32_t val1 = RAND_FUNC(&rng);
    uint32_t val2 = RAND_FUNC(&rng);
    a1[(val1 & (size1 - 1))] = 1;
    a1[(val2 & (size1 - 1))] = 1;
    a2[(val1 & (size2 - 1))] = 2;
    a2[(val2 & (size2 - 1))] = 2;
} while (--iters > 0);
```

Here’s the performance with a fixed array size of 2048 KiB (since the performance degradation is more dramatic with large fixed region sizes):

For the region where L1 hits occur, the unroll by 2 gives a 1.6x speedup, and the unroll by 4 a 2.5x speedup. Even when unrolling by 4 we still see an impact from this issue (performance still improves once almost every store is an L1 miss), but we are much closer to the expected baseline performance from before the microcode update.

This change doesn’t come for free: unrolling the loop by hand has a cost in development complexity, as the unrolled loop is more complicated. Indeed, the implementation in the benchmark doesn’t handle values of iters which aren’t a multiple of 2 or 4. It also has a cost in code size, as the unrolled functions are larger:

| Function    | Loop Size in Bytes | Function Size in Bytes |
|-------------|--------------------|------------------------|
| Original    | 40                 | 74                     |
| Unrolled 2x | 72                 | 108                    |
| Unrolled 4x | 140                | 191                    |

Finally, note that while more unrolling is faster in the region where L1 hits occur, the situation reverses itself around 64 KiB, and beyond that point no unrolling is fastest.

All this means that in this particular example you would face some tough tradeoffs if you want to reduce the impact by unrolling.

Prefetching

You can solve this particular problem using software prefetching instructions. If you prefetch the lines you are going to store to, a totally different path is invoked: the same one that handles loads, and there the memory level parallelism is available regardless of the limitations of the store path. One complication is that, except for prefetchw, such prefetches request the line in a “shared OK” mode, rather than issuing an RFO (request for ownership). This means the core might receive the line in the S MESI state, and then when the store occurs, a second request may be incurred to change the line from the S state to the M state. In my testing this didn’t seem to be a problem in practice, perhaps because the lines are not shared across cores and so generally arrive in the E state, and the E->M transition is cheap.

We don’t even really have to pre-fetch: that is, we don’t need to issue the prefetch instructions early (which would be hard in this case since we’d need to run ahead of the RNG) - we just issue the PF at the same spot we would have otherwise done the store. This transforms the nature of the request from a store to a load, which is the goal here - even though it doesn’t make the request visible to the CPU any earlier than before.

One question is which of the two regions to prefetch: the fixed region, the variable region, or both? It turns out that “both” is a fine strategy: it is often the outright fastest approach and is generally tied in the remaining cases. Here’s a look at all three approaches, against no prefetching at all, on the new microcode (128 KiB fixed region size):

A key observation is that if you had decided to only prefetch one or the other of the two stores, you’d be slower than no prefetching at all over most of the range. It isn’t exactly clear to me why this is the case: perhaps the prefetches compete for fill buffers or otherwise result in a worse allocation of fill buffers to requests.
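A sketch of the “prefetch both” variant (stand-in RNG as before; _MM_HINT_T0 requests the line into all cache levels via the load path):

```c
#include <xmmintrin.h>  // _mm_prefetch
#include <stddef.h>
#include <stdint.h>

// stand-in xorshift32 RNG for the benchmark's RAND_FUNC
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

// issue load-path prefetches for both target lines immediately before
// the stores: the requests aren't any earlier, just of a different kind
void writes_inter_pf(size_t iters, char *a1, size_t size1,
                     char *a2, size_t size2) {
    uint32_t rng = 0x12345678;
    do {
        uint32_t val = xorshift32(&rng);
        size_t i1 = val & (size1 - 1);
        size_t i2 = val & (size2 - 1);
        _mm_prefetch(&a1[i1], _MM_HINT_T0);
        _mm_prefetch(&a2[i2], _MM_HINT_T0);
        a1[i1] = 1;
        a2[i2] = 2;
    } while (--iters > 0);
}
```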

Avoid the microcode update

The simplest solution is to simply avoid the newest microcode updates. These updates seem to be driven by new Spectre mitigations, so if you are not enabling that functionality (e.g., SSB disable is off by default in Linux, so if you aren’t explicitly enabling it, you won’t get it), perhaps you can do without these updates.

This strategy is not feasible once the microcode update contains something you need.

Additionally, as noted above, even the old microcodes sometimes experience the same slower performance that new microcodes always exhibit. I cannot exactly characterize the conditions in which this occurs, but one should at least be aware that old microcodes aren’t always fast.

Other findings

This post is already longer than I wanted it to be. The idea is for posts closer in length to JVM Anatomy Park than War and Peace. Still, there is a bunch of stuff left uncovered, which I’ll summarize here:

- The current test uses regions whose addresses have their bottom 12 bits identically zero, but whose 13th bit varies. That is, the regions “4K alias” but do not “8K alias”. Since the main loop uses the same random address for both regions (wrapped to the region size by masking) in each iteration, the stores alias as described above. However, this is not the cause of the main effects reported here: you can remove the aliasing completely and the behavior is largely the same. You can go the other way too: if you increase the aliasing (you can try this by setting the environment variable ALLOW_ALIAS=1) up to 64 KiB (bottom 16 bits of the physical address), I found a strong effect where performance was slower with the old microcode. This effect seems to have disappeared with the new microcode. Now, 64 KiB aliasing (especially physical aliasing) is probably a lot rarer than mixed L1 hits and L1 misses in the stream of stores, so I’d rather have the old behavior than the new - but this is probably interesting enough to write about separately.

- I do sometimes see the “slow mode” behavior with earlier microcode versions. Almost a year ago, when the last several versions of the microcode didn’t even exist, I experienced periodic slow mode behavior while benchmarking - the same type of performance in the L1 region as the current microcode shows all the time. On older microcode I can still reproduce this consistently if all CPUs are loaded when I start the bench process. For example, ./bench interleaved consistently gives fast mode, but stress -c 4 & ./bench interleaved consistently gives slow timings … even when I kill the CPU-using processes before the results roll in. In that case, the test keeps running in slow mode even though it’s the only thing running on the system.

- This seems to explain why I randomly got slow mode in the past. For example, I noticed that something like ./bench interleaved > data; plot-csv.py data would give fast mode results, but when I shortened it to ./bench interleaved | plot-csv.py it would be in slow mode, because apparently launching the Python interpreter in parallel on the RHS of the pipe used enough CPU to trigger the slow mode. I had a weird 10 minutes or so where I’d run ./bench without piping it and look at the data, and then try to plot it and it would be totally different, back and forth.

- I considered the idea that this bad behavior only shows up when the store buffer is full, e.g., because of some interaction that occurs when renaming is stalled on store buffer entries, but versions of the test which periodically drain the store buffer with sfence (so it never becomes very full) showed the same result.

- I examined the values of a lot more performance counters than the few shown above, but none of them provided any smoking gun for the behavior: they were all consistent with L1 hits simply blocking overlap of L1 miss stores on either side.

Other platforms

An obvious and immediate question is what happens on other micro-architectures, beyond my Skylake client core.

On Haswell, the behavior is always slow. That is, whether with old or new microcode, store misses mixed with L1 store hits were much slower than expected. So if you target Haswell or (perhaps) Broadwell era hardware, you might want to keep this in mind regardless of microcode version.

On Skylake-X (Xeon W-2104), the behavior is always fast. That is, even with the newest microcode version I did not see the slow behavior. I was also not able to trigger the behavior by starting the test with loaded CPUs, as I was on Skylake client with old microcode.

On Cannonlake I did not observe the slow behavior. I don’t know if I was using an “old” or “new” microcode as Intel does not publish microcode guidance for Cannonlake (and it isn’t clear to me if any Cannonlake microcodes have been released at all as very few chips were ever shipped).

You can look at the results for all the platforms I tested in the assets directory. The plots are the same as described above for Skylake, plus some variants not shown here but which should be obvious from the title or filename.

The Source

You can have fun reproducing all these results yourself, as my code is available in the store-bench project on GitHub. Documentation is a bit lacking, but it shouldn’t be too hard to figure out. Open an issue if anything is unclear or you find a bug; pull requests are gladly accepted.

Thanks

Thanks to Nathan Kurz and Leonard Inkret who pointed out some errors in the text and to Daniel Lemire who kindly provided additional hardware on which I was able to test these results.

I don’t have a comment system, but I’ll reply to stuff posted in the Hacker News discussion, or on Twitter.

If you liked this post, check out the homepage for others you might enjoy.



