This article was inspired by Geoff Langdale's text Why Ice Lake is Important (a bit-basher’s perspective). I'm also grateful to Zach Wegner for an inspiring discussion.

The function parity is called with the bit-and of two bytes fetched from the two argument vectors (qword[7 - i] & byte). If we ensure that one of the bytes is constant and has only the k-th bit set, then parity yields the k-th bit of the other, non-constant byte.

This function calculates the bit-xor of all bits of its input, i.e. it returns 1 when the number of ones in the input is odd. We know that 0 xor 0 = 0, thus parity(0) = 0. If the input has exactly one bit set, i.e. it has the form 1 << k, we hit the case 1 xor 0 = 1 during the computation, which means that parity(1 << k) = 1.
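Both observations about parity can be checked with a small scalar sketch. The helper name select_bit is hypothetical and introduced only for illustration; the xor-folding used to compute the parity is one common way to do it, not necessarily how the hardware does.

```c
#include <stdint.h>

/* XOR of all eight bits of x: returns 1 when the number of set bits is odd. */
static uint8_t parity(uint8_t x) {
    x ^= (uint8_t)(x >> 4);  /* fold the high nibble into the low nibble */
    x ^= (uint8_t)(x >> 2);  /* fold 4 bits into 2 */
    x ^= (uint8_t)(x >> 1);  /* fold 2 bits into 1 */
    return x & 1u;
}

/* Hypothetical helper: masking with the constant 1 << k leaves at most one
 * bit set, so the parity of the result is exactly the k-th bit of x. */
static uint8_t select_bit(uint8_t x, unsigned k) {
    return parity((uint8_t)(x & (1u << (k & 7u))));
}
```

With a single-bit mask the xor chain degenerates to either 0 xor 0 or 1 xor 0, which is why select_bit(x, k) equals (x >> k) & 1.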

For instance, to interleave bits, i.e. set the output order to 0, 4, 1, 5, 2, 6, 3, 7, the constant has to be 0x0110022004400880 (not 0x8008400420021001). If we want to reverse bits within a byte, the constant is 0x8040201008040201. If we want to populate one bit, let's say the 5th, the constant is 0x2020202020202020.
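These constants can be verified without GFNI hardware using a scalar model of a single output byte. The sketch below assumes that byte[n] denotes the n-th least significant byte of the 64-bit pattern, so A.byte[7] is the most significant one; the function signature is an illustration, not the instruction's actual interface.

```c
#include <stdint.h>

/* XOR of all eight bits of x. */
static uint8_t parity(uint8_t x) {
    x ^= (uint8_t)(x >> 4);
    x ^= (uint8_t)(x >> 2);
    x ^= (uint8_t)(x >> 1);
    return x & 1u;
}

/* Scalar model of one output byte:
 * output bit i = parity(A.byte[7 - i] & x), then the whole byte is XORed with imm8. */
static uint8_t affine_byte(uint64_t A, uint8_t x, uint8_t imm8) {
    uint8_t res = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t row = (uint8_t)(A >> (8 * (7 - i)));  /* A.byte[7 - i] */
        res |= (uint8_t)(parity(row & x) << i);
    }
    return res ^ imm8;
}
```

For example, affine_byte(0x8040201008040201, 0xB4, 0) yields 0x2D, the bit-reversed input, and affine_byte(0x2020202020202020, x, 0) yields 0xFF or 0x00 depending on bit 5 of x.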

where the constants bit0, bit1, ..., bit7 have to be in the range 0..7. Please bear in mind that the order of bytes in a constant has to be reversed, as the procedure affine_byte fetches bytes from A using index 7 - i.

Bit shuffling requires setting up a pattern in argument A. The pattern for each lane is a 64-bit number of the form:
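Such a pattern can also be assembled programmatically. The sketch below is an assumption based on the description of the constants: it places the mask 1 << bits[i] in byte 7 - i of the pattern, where bits[i] is the index of the input bit that should land in output bit i. The name make_shuffle_pattern is hypothetical.

```c
#include <stdint.h>

/* Build the 64-bit shuffle pattern for one lane.
 * bits[i] (0..7) selects which input bit is copied to output bit i. */
static uint64_t make_shuffle_pattern(const uint8_t bits[8]) {
    uint64_t pattern = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t mask = (uint64_t)1 << (bits[i] & 7u);  /* single-bit mask 1 << bit_i */
        /* Output bit i is driven by A.byte[7 - i], hence the reversed byte order. */
        pattern |= mask << (8 * (7 - i));
    }
    return pattern;
}
```

Feeding it the order 0, 4, 1, 5, 2, 6, 3, 7 gives 0x0110022004400880, and the order 7, 6, ..., 0 gives 0x8040201008040201.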

Let's do some inlining on the sample pseudocode to make that ability clearly visible:

Gathering bits

To build a byte from a selected bit we must fill the argument x with the proper masks; argument A is then treated as the "variable". Again, we do some simplifications to the pseudocode to reveal this property:

    __m512i gather_bits(__m512i x, __m512i A, uint8_t imm8) {
        for (j = 0; j < 8; j++) {
            qword_A = A.qword[j];
            qword_x = x.qword[j];
            for (i = 0; i < 8; i++) {
                // x contains the fixed bit-masks in form 1 << k
                k = bit_pos(qword_x.byte[i]);

                // output bit n is the k-th bit of byte 7 - n of A
                res.qword[j].byte[i].bit[0] = qword_A.byte[7].bit[k];
                res.qword[j].byte[i].bit[1] = qword_A.byte[6].bit[k];
                res.qword[j].byte[i].bit[2] = qword_A.byte[5].bit[k];
                res.qword[j].byte[i].bit[3] = qword_A.byte[4].bit[k];
                res.qword[j].byte[i].bit[4] = qword_A.byte[3].bit[k];
                res.qword[j].byte[i].bit[5] = qword_A.byte[2].bit[k];
                res.qword[j].byte[i].bit[6] = qword_A.byte[1].bit[k];
                res.qword[j].byte[i].bit[7] = qword_A.byte[0].bit[k];

                res.qword[j].byte[i] = res.qword[j].byte[i] ^ imm8;
            }
        }
        return res;
    }

Please note that the order of bits is reversed, because in affine_byte bytes from A are fetched from index 7 - i .
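The reversed order is easy to see in a scalar model of a single gathered byte. This is only a sketch for one 64-bit lane with imm8 = 0; the name gather_column is hypothetical, and byte[n] is assumed to be the n-th least significant byte of A.

```c
#include <stdint.h>

/* Gather the k-th bit of each of the eight bytes of A into one byte.
 * Output bit b comes from A.byte[7 - b], so the most significant byte of A
 * supplies output bit 0 and the least significant byte supplies output bit 7. */
static uint8_t gather_column(uint64_t A, unsigned k) {
    uint8_t res = 0;
    for (int b = 0; b < 8; b++) {
        uint8_t byte = (uint8_t)(A >> (8 * (7 - b)));  /* A.byte[7 - b] */
        res |= (uint8_t)(((byte >> (k & 7u)) & 1u) << b);
    }
    return res;
}
```

For instance, with A = 0xFF00000000000000 every gathered byte is 0x01, while with A = 0x00000000000000FF it is 0x80.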