With Computex, there’s been a ton of news about Ice Lake (hereafter ICL) and the Sunny Cove core (SNC). Wikichip, Extremetech and Anandtech among many others have coverage (and there will be a lot more), so I won’t rehash this.

I also don’t want to embark on a long discussion about 14nm vs 10nm, Intel’s problems with clock speeds, etc. Much of this is covered elsewhere both well and badly. I’d rather focus on something I find interesting: a summary of what SNC is bringing us as programmers, assuming we’re programmers who like using SIMD (I’ve dabbled a bit).

I’m also going to take it as a given that Cannonlake (CNL) didn’t really happen (for all practical purposes). A few NUCs got into the hands of smart people who did some helpful measurements of VBMI, but from the perspective of the mass market, the transition is really going to be from AVX2 to “what SNC can do”.

While I’m strewing provisos around, I will also admit that I’m not very interested in deep learning, floating point or crypto (not that there’s anything wrong with that; there are just plenty of other people willing to analyze how many FMAs or crypto ops/second new processors have).

Also interesting: SNC is a wider machine with some new ports and new capabilities on existing ports. I don’t know enough about this to comment, and what I do remember from my time at Intel isn’t public, so I’m only going to talk about things that are on the public record. In any case, I can’t predict the implications of this sort of change without getting to play with a machine.

So, leaving all that aside, here are the interesting bits:

ICL will add AVX-512 capability to a consumer chip for the first time. Before ICL, you can’t buy a consumer-grade computer that runs AVX-512 – there are various Xeon D chips that you could buy to get AVX-512 or you could buy the obscure CNL NUC or laptop (maybe), but there wasn’t a mainstream development platform for this. So we’re going straight from AVX2, which is an incremental SSE update that didn’t really extend SSE across to 256 bits in a general way (looking at VPSHUFB and VPALIGNR makes it starkly clear that, from the perspective of a bit/byte-basher, AVX2 is 2x128b) – all the way to AVX-512. AVX-512 capability isn’t just a extension of AVX 2 across more bits. The way that nearly all operations can be controlled by ‘mask’ registers allows us to do all sorts of cool things that previously would have required mental gymnastics to do in SIMD. The short version of the story is that AVX-512 introduces 8 mask registers that allow most operations to be conditionally controlled by whether the corresponding bit in the mask register in on or off (including the ability to suppress loads and stores that might otherwise introduce faults).

Further, the AVX-512 capabilities in ICL are a big update on what’s there in Skylake Server (SKX), the mainstream server platform that would be the main way that most people would have encountered AVX-512. When you get SNC, you’re getting not just SKX-type AVX-512, you’re getting a host of interesting add-ons. There include, but are not limited to (I told you I don’t care about floating point, neural nets or crypto): VBMI VBMI2 VPOPCNTDQ BITALG GFNI VPCLMULQDQ



So, what’s so good about all this? Well, armed with a few handy manuals (the Extensions Reference manual and Volume 2 of the Intel Architecture Manual set), let’s dig in.

VBMI

This one is the only extension that we’ve seen before – it’s in Cannonlake. VBMI adds the capability of dynamically (not just a fixed pattern shuffle) shuffling bytes as well as words, “doublewords” (Intel-ese for 32-bits), and “quadwords” (64 bits). All the other granularities of shuffle up to a 512-bit size are there in AVX-512 but bytes don’t make it until AVX512_VBMI.

Not only that, the shuffles can have 2-register forms, so you can pull in values over a 2x512b pair of registers. So in addition to VPERMB we add VPERMT2B and VPERMI2B (the 2-register source shuffles have 2 variants depending on what thing gets overwritten by the shuffle result).

This is significant both for the ‘traditional’ sense of shuffles (suppose you have a bunch of bytes you want to rearrange before processing) but also for table lookup. If you treat the shuffle instructions as ‘table lookups’, the byte operations in VBMI allow you to look up 64 different bytes at once out of a 64-byte table, or 64 different bytes at once out of a 128-byte table in the 2-register form.

The throughput and latency costs on CNL shuffles are pretty good, too, so I expect these operations to be fairly cheap on SNC (I don’t know yet).

VBMI also adds VPMULTISHIFTQB – this one works on 64-bit lanes and allows unaligned selection of any 8-bit field from the corresponding 64-bit lane. A pretty handy one for people pulling things out of packed columnar databases, where someone might have packed annoying sized values (say 5 or 6 bits) into a dense representation.

VBMI2

VBMI2 extends the VPCOMPRESS/VPEXPAND instructions to byte and word granularity. This allow a mask register to be used to extract (or deposit) only the appropriate data elements either out of (or into) another SIMD register or memory.

VPCOMPRESSB pretty much ‘dynamites the trout stream’ for the transformation of bits to indexes, ruining all the cleverness (?) I perpetrated here with AVX512 and BMI2 (not the vector kind, the PDEP/PEXT kind).

VBMI2 also adds a new collection of instructions: VPSHLD, VPSHLDV, VPSHRD, VPSHRDV. These instructions allow left/right logical double shifts, either by a fixed about or a variable amount (thus the 4 variants) across 2 SIMD registers at once. So, for either 16, 32 or 64-bit granularity, we can effectively concatenate a pair of corresponding elements, shift them (variably or not, left or right) and extract the result. This is a handy building block and would have been nice to have while building Hyperscan – we spend a lot of time working around the fact that it’s hard to move bits around a SIMD register (one of these double-shifts, plus a coarse-grained shuffle, would allow bit-shuffles across SIMD registers).

VPOPCNTDQ/BITALG

I’m grouping these together, as VPOPCNTDQ is older (from the MIC product line) but the BITALG capabilities that arrive together with VPOPCNTDQ for everyone barring the Knights* nerds nicely round out the capabilities.

VPOPCNT does what it says on the tin: a bitwise population count for everything from bytes and words (BITALG) up to doublewords and quadwords (VPOPCNTDQ). We like counting bits. Yay.

VPSHUFBITQMB, also introduced with BITALG, is a lot like VPMULTISHIFTQB, except that it extracts 8 single bits from each 64-bit lane and deposits it in a mask register.

GFNI

OK, I said I didn’t care about crypto. I don’t! However, even a lowly bit-basher can get something out of these ones. If I’ve mangled the details, or am missing some obvious better ways of presenting this, let me know – be aware that the below picture is an accurate presentation of what I look like when dealing with these instructions:

GF2P8AFFINEINVQB I’ll pass over in silence; I think it’s for the real crypto folks, not a monkey with a stick like me.

GF2P8AFFINEQB on the other hand is likely awesome. It takes each 8 bit value and ‘matrix multiplies’ it, in a carryless multiply sense, with a 8×8 bit matrix held in the same 64-bit lane as the 8 bit value came from.

This can do some nice stuff. Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of any bit from the source byte. So if you wanted to replace (b0, b1, .. b7) with (b7^b6, b6^b5, … b0^b0) you could. Trivially, of course, this also gets you 8-bit shift and rotate (not operations that exist on Intel SIMD otherwise). This use of the instruction effectively assumes the 64-bit value is ‘fixed’ and our 8-bit values are coming from an unknown input.

One could also view GF2P8AFFINEQB as something where the 8-bit values are ‘fixed’ and the 64-bit values are unknown – this would allow the user to, say, extract bits 0,8,16… from a 64-bit value and put it in byte 0, as well as 1,9,17,… and put it in byte 1, etc. – thus doing a 8×8 bit matrix transpose of our 64-bit values.

I don’t have too much useful stuff for GF2P8MULB outside crypto, but it is worth noting that there aren’t that many cheap byte->byte transformations that can be done over 8 bits that aren’t straightforward arithmetic or logic (add, and, or, etc) – notably no lanewise multiplies. So this might come in handy, in a kind of monkey-with-a-stick fashion.

VPCLMULQDQ

OK, I rhapsodized already about a use of carry-less multiply to find quote pairs.

So the addition of a vectorized version of this instruction – VPCLMULQDQ – that allows us to not just use SIMD registers to hold the results of a 64b x 64b->128b multiply, but to carry out up to 4 of them at once, could be straightfowardly handy.

Longer term, carryless multiply works as a good substitute for some uses of PEXT. While it would be nice to have a vectorized PEXT/PDEP (make it happen, Intel!), it is possible to get a poor-man’s version of PEXT via AND + PCLMULQDQ – we can’t get a clean bitwise extract, but, we can get a 1:1 pattern of extracted bits by carefully choosing our carryless multiply multiplier. This is probably worth a separate blog post.

I have a few nice string matching algorithms that rest on PEXT and take advantage of PEXT as a useful ‘hash function’ (not really a hash function, of course). The advantage in the string matching world of using PEXT and not a hash function is the ability to ‘hash’ simultaneously over ‘a’, ‘abc’ and ‘xyz’, while ensuring that the effectively ‘wild-carded’ nature of ‘a’ in a hash big enough to cover ‘abc’ and ‘xyz’ doesn’t ruin the hash table completely.

Conclusion

So, what’s so great about all this? From an integer-focused programmer’s perspective, ICL/SNC adds a huge collection of instructions that allow us – in many cases for the first time – to move bits and bytes around within SIMD registers in complex and useful ways. This radically expands the number of operations that can be done in SIMD – without branching and potentially without having to go out to memory for table accesses.

It’s my contention that this kind of SIMD programming is hugely important. There are plenty of ways to do ‘bulk data processing’ – on SIMD, on GPGPU, etc. This approach is the traditional “well, I had to multiply a huge vector by a huge matrix” type problem. Setup costs, latencies – all this stuff is less important if we can amortize over thousands of elements.

On the other hand, doing scrappy little bits of SIMD within otherwise scalar code can yield all sorts of speedups and new capabilities. The overhead of mixing in some SIMD tricks that do things that are extremely expensive in general purpose registers is very low on Intel. It’s time to get exploring (or will be, in July).

Conclusion Caveats

Plenty of things could go wrong, and this whole project is a long-term thing. Obviously we are a long way away from having SNC across the whole product range at Intel, much less across a significant proportion of the installed base. Some of the above instructions might be too expensive for the uses I’d like to put them (see also: SSE4.2, which was never been a good idea). The processors might clock down too much with AVX-512, although this seems to be less and less of a problem.

If you have some corrections, or interesting ideas about these instructions, hit me up in the comments! If you just want to complain about Intel (or, for that matter, anyone or anything else), I can suggest some alternative venues.