If you are in a rush, you can skip to the summary, but you’ll miss out on the journey.

AVX-512 introduced eight so-called mask registers, k0 through k7 , which apply to most ALU operations and allow you to apply a zero-masking or merging operation on a per-element basis, speeding up code that would otherwise require extra blending operations in AVX2 and earlier.

If that single sentence doesn’t immediately indoctrinate you into the mask register religion, here’s a copy and paste from Wikipedia that should fill in the gaps and close the deal:

Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register k0 is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, k0 is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be “zero”, which zeros everything not selected by the mask, or “merge”, which leaves everything not selected untouched. The merge behavior is identical to the blend instructions.
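The zero vs merge distinction is easy to see in a scalar sketch. Here is a toy Python model of a masked add over four lanes (purely illustrative; the function name and lane count are mine, not the hardware's):

```python
def masked_add(src, k, a, b, zero_mask=False):
    """Toy per-lane masked add: lane i is a[i] + b[i] when mask bit i is
    set; otherwise it is zeroed (zero-masking) or kept from src (merging)."""
    out = []
    for i in range(len(a)):
        if (k >> i) & 1:
            out.append(a[i] + b[i])
        elif zero_mask:
            out.append(0)
        else:
            out.append(src[i])
    return out

src = [9, 9, 9, 9]
a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(masked_add(src, 0b0101, a, b, zero_mask=True))   # [11, 0, 33, 0]
print(masked_add(src, 0b0101, a, b, zero_mask=False))  # [11, 9, 33, 9]
```

In AVX2 and earlier, the merge case would typically cost an extra blend instruction; with AVX-512 the mask is part of the operation itself.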

So mask registers are important, but they are not household names, unlike say general purpose registers (eax, rsi and friends) or SIMD registers (xmm0, ymm5, etc). They certainly aren't going to show up on Intel slides disclosing the size of uarch resources, like these:





In particular, I don’t think the size of the mask register physical register file (PRF) has ever been reported. Let’s fix that today.

We use an updated version of the ROB size probing tool originally authored and described by Henry Wong (hereafter simply Henry), who used it to probe the size of various documented and undocumented out-of-order structures on earlier architectures. If you haven’t already read that post, stop now and do it. This post will be here when you get back.

You’ve already read Henry’s blog for a full description (right?), but for the naughty among you here’s the fast food version:

Fast Food Method of Operation

We separate two cache miss load instructions by a variable number of filler instructions which vary based on the CPU resource we are probing. When the number of filler instructions is small enough, the two cache misses execute in parallel and their latencies are overlapped so the total execution time is roughly as long as a single miss.

However, once the number of filler instructions reaches a critical threshold, all of the targeted resources are consumed and instruction allocation stalls before the second miss is issued, so the cache misses can no longer run in parallel. This causes the runtime to spike to about twice the baseline cache miss latency.

Finally, we ensure that each filler instruction consumes exactly one of the resource we are interested in, so that the location of the spike indicates the size of the underlying resource. For example, regular GP instructions usually consume one physical register from the GP PRF, so they are a good choice to measure the size of that resource.
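Under some illustrative assumptions (a ~300 cycle miss, a single target resource of a given size), the expected shape of the measurement is a step function:

```python
MISS = 300  # assumed cache miss latency in cycles (illustrative)

def block_time(fillers, resource_size):
    """Toy model of one probe block: while the fillers fit in the target
    resource, the second miss issues before allocation stalls and the two
    misses overlap; past that point they serialize."""
    if fillers <= resource_size:
        return MISS
    return 2 * MISS

# runtime roughly doubles right after the resource size
print([block_time(n, 10) for n in (9, 10, 11)])  # [300, 300, 600]
```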

Mask Register PRF Size

Here, we use instructions that write a mask register, and so can measure the size of the mask register PRF.

To start, we use a series of kaddd k1, k2, k3 instructions, as such (shown for 16 filler instructions):

mov rcx, QWORD PTR [rcx]  ; first cache miss load
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
kaddd k1, k2, k3
mov rdx, QWORD PTR [rdx]  ; second cache miss load
lfence                    ; stop issue until the above block completes
; this block is repeated 16 more times

Each kaddd instruction consumes one physical mask register. If the number of filler instructions is equal to or less than the number of mask registers, we expect the misses to happen in parallel; otherwise, the misses will be resolved serially. So at that point we expect to see a large spike in the running time.

That’s exactly what we see:

Let’s zoom in on the critical region, where the spike occurs:

Here we clearly see that the transition isn’t sharp: when the filler instruction count is between 130 and 134, the runtime is intermediate, falling between the low and high levels. Henry calls this non-ideal behavior, and I have seen it repeatedly across many (but not all) of these resource size tests. The idea is that the hardware implementation doesn’t always allow all of the resources to be used as you approach the limit: sometimes you get to use every last resource, but in other cases you may hit the limit a few filler instructions before the theoretical maximum.

Under this assumption, we want to look at the last (rightmost) point which is still faster than the slow performance level, since it indicates that sometimes that many resources are available, implying that at least that many are physically present. Here, we see that final point occurs at 134 filler instructions.

So we conclude that SKX has 134 physical registers available to hold speculative mask register values. As Henry indicates in the original post, it is likely that there are 8 physical registers dedicated to holding the non-speculative architectural state of the 8 mask registers, so our best guess at the total size of the mask register PRF is 142. That’s somewhat smaller than the GP PRF (180 entries) or the SIMD PRF (168 entries), but still quite large (see this table of out of order resource sizes for sizes on other platforms).

In particular, it is definitely large enough that you aren’t likely to run into this limit in practical code: it’s hard to imagine non-contrived code where almost 60% of the instructions write to mask registers, because that’s what you’d need to hit this limit.
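Both of those figures are simple arithmetic on the measured values (134 speculative entries, 8 architectural registers, a 224-entry ROB):

```python
speculative = 134    # measured location of the last fast point
architectural = 8    # k0-k7, assumed to occupy dedicated entries
total = speculative + architectural
print(total)  # 142

ROB = 224  # SKX reorder buffer size
# fraction of in-flight instructions that must write a mask register
# before the mask PRF, rather than the ROB, becomes the binding limit
print(round(speculative / ROB, 3))  # 0.598, i.e. almost 60%
```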

Are They Distinct PRFs?

You may have noticed that so far I’m simply assuming that the mask register PRF is distinct from the others. I think this is highly likely, given the way mask registers are used and since they are part of a disjoint renaming domain. It is also supported by the fact that the apparent mask register PRF size doesn’t match either the GP or SIMD PRF sizes, but we can go further and actually test it!

To do that, we use a similar test to the above, but with the filler instructions alternating between the same kaddd instruction as the original test and an instruction that uses either a GP or SIMD register. If the register file is shared, we expect to hit a limit at the size of the shared PRF. If the PRFs are not shared, we expect that neither PRF limit will be hit, and we will instead hit a different limit such as the ROB size.
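As a toy model of what the alternating test distinguishes (PRF and ROB sizes from this article; the model deliberately ignores the extra shared limit discussed later):

```python
def alternating_limit(mask_prf=134, other_prf=180, rob=224, shared=False):
    """Filler count at which allocation stalls in the alternating test
    (toy model). Fillers alternate between two classes, one register each."""
    if shared:
        # one combined pool: every filler consumes an entry from it
        return min(mask_prf, rob)
    # distinct pools: only every other filler draws from each pool,
    # so each pool lasts twice as long and the ROB binds first
    return min(2 * mask_prf, 2 * other_prf, rob)

print(alternating_limit(shared=True))   # 134: spike at the PRF size
print(alternating_limit(shared=False))  # 224: spike only at the ROB
```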

Test 29 alternates kaddd and scalar add instructions, like this:

mov rcx, QWORD PTR [rcx]
add ebx, ebx
kaddd k1, k2, k3
add esi, esi
kaddd k1, k2, k3
add ebx, ebx
kaddd k1, k2, k3
add esi, esi
kaddd k1, k2, k3
add ebx, ebx
kaddd k1, k2, k3
add esi, esi
kaddd k1, k2, k3
add ebx, ebx
kaddd k1, k2, k3
mov rdx, QWORD PTR [rdx]
lfence

Here’s the chart:

We see that the spike occurs at a filler count larger than either the GP or mask PRF sizes. So we can conclude that the mask and GP PRFs are not shared.

Maybe the mask register PRF is shared with the SIMD PRF? After all, mask registers are more closely associated with SIMD instructions than general purpose ones, so maybe there is some synergy there.

To check, here’s Test 35, which is similar to 29 except that it alternates between kaddd and vxorps , like so:

mov rcx, QWORD PTR [rcx]
vxorps ymm0, ymm0, ymm1
kaddd k1, k2, k3
vxorps ymm2, ymm2, ymm3
kaddd k1, k2, k3
vxorps ymm4, ymm4, ymm5
kaddd k1, k2, k3
vxorps ymm6, ymm6, ymm7
kaddd k1, k2, k3
vxorps ymm0, ymm0, ymm1
kaddd k1, k2, k3
vxorps ymm2, ymm2, ymm3
kaddd k1, k2, k3
vxorps ymm4, ymm4, ymm5
kaddd k1, k2, k3
mov rdx, QWORD PTR [rdx]
lfence

Here’s the corresponding chart:

The behavior is basically identical to the prior test, so we conclude that there is no direct sharing between the mask register and SIMD PRFs either.

This turned out not to be the end of the story. The mask registers are shared, just not with the general purpose or SSE/AVX register file. For all the details, see this follow up post.

An Unresolved Puzzle

Something we notice in both of the above tests, however, is that the spike seems to finish around 212 filler instructions. However, the ROB size for this microarchitecture is 224. Is this just non-ideal behavior as we saw earlier? Well, we can test this by comparing against Test 4, which just uses nop instructions as the filler: these consume almost no resources beyond ROB entries. Here’s Test 4 (nop filler) versus Test 29 (alternating kaddd and scalar add):

The nop-using Test 4 nails the ROB size at exactly 224 (these charts are SVG, so feel free to “View Image” and zoom in to confirm). So it seems that we hit some other limit around 212 when we mix mask and GP registers, or when we mix mask and SIMD registers. In fact, the same limit applies even between GP and SIMD registers, if we compare Test 4 and Test 21 (which mixes GP adds with SIMD vxorps):

Henry mentions a more extreme version of the same thing in the original blog entry, in the section also headed Unresolved Puzzle:

Sandy Bridge AVX or SSE interleaved with integer instructions seems to be limited to looking ahead ~147 instructions by something other than the ROB . Having tried other combinations (e.g., varying the ordering and proportion of AVX vs. integer instructions, inserting some NOPs into the mix), it seems as though both SSE/AVX and integer instructions consume registers from some form of shared pool, as the instruction window is always limited to around 147 regardless of how many of each type of instruction are used, as long as neither type exhausts its own PRF supply on its own.

Read the full section for all the details. The effect is similar here but smaller: we at least get 95% of the way to the ROB size, but still stop before it. It is possible the shared resource is related to register reclamation, e.g., the PRRT, a table which keeps track of which registers can be reclaimed when a given instruction retires.

Finally, we finish this party off with a few miscellaneous notes on mask registers, checking for parity with some features available to GP and SIMD registers.

Move Elimination

Both GP and SIMD registers are eligible for so-called move elimination. This means that a register to register move like mov eax, edx or vmovdqu ymm1, ymm2 can be eliminated at rename by “simply” pointing the destination register entry in the RAT to the same physical register as the source, without involving the ALU.
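A toy register alias table makes the mechanism concrete (a sketch only: real renamers also reference-count physical registers so they can be reclaimed safely):

```python
class Renamer:
    def __init__(self, n_phys):
        self.free = list(range(n_phys))  # free physical registers
        self.rat = {}                    # architectural -> physical mapping

    def write(self, dst):
        self.rat[dst] = self.free.pop()  # a normal write allocates

    def mov(self, dst, src, eliminated):
        if eliminated:
            # move elimination: re-point the RAT entry; no ALU, no new register
            self.rat[dst] = self.rat[src]
        else:
            self.write(dst)              # un-eliminated move costs a register

r = Renamer(4)
r.write("edx")
free_before = len(r.free)
r.mov("eax", "edx", eliminated=True)
print(r.rat["eax"] == r.rat["edx"])  # True: both name the same physical reg
print(len(r.free) == free_before)    # True: no register was consumed
```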

Let’s check if something like kmov k1, k2 also qualifies for move elimination. First, we check the chart for Test 28, where the filler instruction is kmovd k1, k2 :

It looks exactly like Test 27, which we saw earlier with kaddd. So we would suspect that physical registers are being consumed, unless we have happened to hit a different move-elimination related limit with exactly the same size and limiting behavior.

Additional confirmation comes from uops.info which shows that all variants of mask to mask register kmov take one uop dispatched to p0 . If the move is eliminated, we wouldn’t see any dispatched uops.

Therefore I conclude that register to register moves involving mask registers are not eliminated.

Dependency Breaking Idioms

The best way to set a GP register to zero in x86 is via the xor zeroing idiom: xor reg, reg . This works because any value xored with itself is zero. This is smaller (fewer instruction bytes) than the more obvious mov eax, 0 , and also faster since the processor recognizes it as a zeroing idiom and performs the necessary work at rename, so no ALU is involved and no uop is dispatched.

Furthermore, the idiom is dependency breaking: although xor reg1, reg2 in general depends on the value of both reg1 and reg2 , in the special case that reg1 and reg2 are the same, there is no dependency as the result is zero regardless of the inputs. All modern x86 CPUs recognize this special case for xor . The same applies to SIMD versions of xor such as integer vpxor and floating point vxorps and vxorpd .

That background out of the way, a curious person might wonder if the kxor variants are treated the same way. Is kxorb k1, k1, k1 treated as a zeroing idiom?

This is actually two separate questions, since there are two aspects to zeroing idioms:

Zero latency execution with no execution unit (elimination)

Dependency breaking

Let’s look at each in turn.

Execution Elimination

So are zeroing xors like kxorb k1, k1, k1 executed at rename without latency and without needing an execution unit?

No.

Here, I don’t even have to do any work: uops.info has our back because they’ve performed this exact test and report a latency of 1 cycle and one p0 uop used. So we can conclude that zeroing xors of mask registers are not eliminated.

Dependency Breaking

Well maybe zeroing kxors are dependency breaking, even though they require an execution unit?

In this case, we can’t simply check uops.info. kxor is a one cycle latency instruction that runs only on a single execution port (p0), so we hit the interesting (?) case where a chain of kxor instructions runs at the same speed regardless of whether they are dependent or independent: the throughput bottleneck of 1/cycle is the same as the latency bottleneck of 1/cycle!

Don’t worry, we’ve got other tricks up our sleeve. We can test this by constructing a test which involves a kxor in a carried dependency chain with enough total latency so that the chain latency is the bottleneck. If the kxor carries a dependency, the runtime will be equal to the sum of the latencies in the chain. If the instruction is dependency breaking, the chain is broken, the disconnected chains can overlap, and performance will likely be limited by some throughput restriction (e.g., port contention). This could use a good diagram, but I’m not good at diagrams.
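A minimal model of the two regimes, using the 4-cycle round-trip and 1-cycle kxorb latencies measured in this article, and assuming a 2-cycle p5 throughput bound when the chain is broken:

```python
def iteration_cycles(chain_latency, throughput_bound, chain_broken):
    """Toy model of a loop-carried chain: latency-bound while the chain
    is intact, throughput-bound once some link breaks the dependency."""
    return throughput_bound if chain_broken else chain_latency

ROUNDTRIP, KXOR, P5_BOUND = 4, 1, 2  # cycles, per the measurements below

print(iteration_cycles(ROUNDTRIP + KXOR, P5_BOUND, chain_broken=False))  # 5
print(iteration_cycles(ROUNDTRIP + KXOR, P5_BOUND, chain_broken=True))   # 2
```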

All the tests are in uarch bench, but I’ll show the key parts here.

First we get a baseline measurement for the latency of moving from a mask register to a GP register and back:

kmovb k0, eax
kmovb eax, k0
; repeated 127 more times

This pair clocks in at 4 cycles. It’s hard to know how to partition the latency between the two instructions (are they both 2 cycles, or is there a 3-1 split one way or the other?), but for our purposes it doesn’t matter, because we just care about the latency of the round-trip. Importantly, the port-based throughput limit of this sequence is 1/cycle, 4x faster than the latency limit, because each instruction goes to a different port (p5 and p0, respectively). This means we will be able to tease out latency effects independent of throughput.

Next, we throw a kxor into the chain that we know is not zeroing:

kmovb k0, eax
kxorb k0, k0, k1
kmovb eax, k0
; repeated 127 more times

Since we know kxorb has 1 cycle of latency, we expect the latency to increase to 5 cycles, and that’s exactly what we measure (the first two tests shown):

** Running group avx512 : AVX512 stuff **
                               Benchmark   Cycles   Nanos
               kreg-GP roundtrip latency     4.00    1.25
    kreg-GP roundtrip + nonzeroing kxorb     5.00    1.57

Finally, the key test:

kmovb k0, eax
kxorb k0, k0, k0
kmovb eax, k0
; repeated 127 more times

This has a zeroing kxorb k0, k0, k0. If it breaks the dependency on k0, the kmovb eax, k0 would no longer depend on the earlier kmovb k0, eax, the carried chain would be broken, and we’d see a lower cycle time.

Drumroll…

We measure this at the exact same 5.0 cycles as the prior example:

** Running group avx512 : AVX512 stuff **
                               Benchmark   Cycles   Nanos
               kreg-GP roundtrip latency     4.00    1.25
    kreg-GP roundtrip + nonzeroing kxorb     5.00    1.57
       kreg-GP roundtrip + zeroing kxorb     5.00    1.57

So we tentatively conclude that zeroing idioms aren’t recognized at all when they involve mask registers.

Finally, as a check on our logic, we use the following test which replaces the kxor with a kmov which we know is always dependency breaking:

kmovb k0, eax
kmovb k0, ecx
kmovb eax, k0
; repeated 127 more times

This is the final result shown in the output below, and it runs much more quickly at 2 cycles, bottlenecked on p5 (the two kmov k, r32 instructions both go only to p5):

** Running group avx512 : AVX512 stuff **
                               Benchmark   Cycles   Nanos
               kreg-GP roundtrip latency     4.00    1.25
    kreg-GP roundtrip + nonzeroing kxorb     5.00    1.57
       kreg-GP roundtrip + zeroing kxorb     5.00    1.57
          kreg-GP roundtrip + mov from GP    2.00    0.63

So our experiment seems to check out.

Reproduction

You can reproduce these results yourself with the robsize binary on Linux or Windows (using WSL). The specific results for this article are also available as are the scripts used to collect them and generate the plots.

Summary

SKX has a separate PRF for mask registers, with a speculative size of 134 and an estimated total size of 142

This is large enough compared to the other PRF sizes and the ROB to make it unlikely to be a bottleneck

Mask registers are not eligible for move elimination

Zeroing idioms in mask registers are not recognized for execution elimination or dependency breaking

Part II

I didn’t expect it to happen, but it did: there is a follow up post about mask registers, where we (roughly) confirm the register file size by looking at an image of a SKX CPU captured via microscope, and make an interesting discovery regarding sharing.

Discussion on Hacker News, Reddit (r/asm and r/programming) or Twitter.

Direct feedback also welcomed by email or as a GitHub issue.

Thanks

Daniel Lemire, who provided access to the AVX-512 system I used for testing.

Henry Wong, who wrote the original article which introduced me to this technique and graciously shared the code for his tool, which I now host on GitHub.

Jeff Baker and Wojciech Muła for reporting typos.

Image credit: Kellogg’s Special K by Like_the_Grand_Canyon is licensed under CC BY 2.0.

If you liked this post, check out the homepage for others you might enjoy.



