This article shows a neat use of the SAD (sum of absolute differences) instruction to calculate an 8-bit pospopcnt. It's not the fastest method, but I really like the whole algorithm.

Positional population count (pospopcnt) is a procedure that calculates a histogram of the bits set at each position of a byte, word, double word, etc., over a larger stream of such entities.
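To make the definition concrete, here is a minimal scalar sketch of an 8-bit pospopcnt. The function name `pospopcnt_u8_ref` and the `uint32_t` counter width are assumptions made for illustration; it is a plain reference version, not the SAD-based method described in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference 8-bit positional popcount (hypothetical helper name):
   hist[i] is increased by the number of input bytes that have bit i set.
   The caller is expected to zero hist before the first call. */
void pospopcnt_u8_ref(const uint8_t *data, size_t n, uint32_t hist[8]) {
    assert(data != NULL || n == 0);
    for (size_t j = 0; j < n; j++) {
        for (int bit = 0; bit < 8; bit++) {
            hist[bit] += (data[j] >> bit) & 1u;
        }
    }
}
```

For the input bytes 0x01, 0x03, 0x80 the histogram would be 2 for bit 0, 1 for bit 1, 1 for bit 7 and 0 elsewhere.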

Pospopcnt algorithm overview

The simplest way to use SAD to calculate pospopcnt would be to mask each bit (0, 1, ..., 7) and invoke SAD eight times. However, since SAD adds up the bytes of an 8-byte lane, the sum of an isolated bit never exceeds 8, so we may use a single SAD to calculate two sums at once, for the pairs of bits 0-4, 1-5, 2-6 and 3-7. Such sums are then stored in the 4-bit subwords of an 8-bit word. Additionally, if we shift each pair of bits right to positions 0 and 4, then the combined sums never overflow a byte, as the maximum value is 0x88.
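The pairing trick can be modelled in scalar C for a single 8-byte lane. This is only a sketch: `sad_lane` is a hypothetical stand-in for SAD against a zero vector (it just sums the eight bytes of a 64-bit word), and `partial_pair` is an illustrative name, not something taken from an actual vector implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for SAD against zero on one 8-byte lane:
   returns the sum of the eight bytes of x. */
static uint64_t sad_lane(uint64_t x) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += (x >> (8 * i)) & 0xff;
    }
    return sum;
}

/* Partial pospopcnt of the bit pair (k, k+4), k in 0..3, over 8 bytes.
   Low nibble of the result: count of bit k.
   High nibble of the result: count of bit k+4.
   Each count is at most 8, so the result is at most 0x88. */
uint64_t partial_pair(uint64_t lane, unsigned k) {
    assert(k < 4);
    /* After the shift, bits k and k+4 of every byte sit at positions 0 and 4. */
    const uint64_t mask = 0x1111111111111111ULL;
    return sad_lane((lane >> k) & mask);
}
```

Shifting the whole 64-bit word is safe here because, for k up to 3, positions 0 and 4 of each byte still receive bits from that same byte. With all input bytes equal to 0xff, every pair yields the maximum value 0x88.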

Thanks to these tricks we need only four invocations of the SAD instruction to get pospopcnts for 8-byte lanes; we call them partial pospopcnts.

These partial results are stored in separate 64-bit words of four registers. The next step is to gather the bytes holding the sums for a given pair of bits into the same lane, and then place the lanes in a single register. That can be done with byte-shuffling instructions and basic bitwise instructions.

The last step is to mask out either the lower or the higher nibbles of the combined sums, and invoke SAD for the last time. At this point each 64-bit word of the result vector holds the total pospopcnt of one bit position for the bytes of the input register.
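The whole sequence (four partial SADs, gathering, nibble masking and the final SAD) can be sketched as a scalar model of one 64-byte input. The assumptions are stated plainly: a 512-bit register is represented as an array of eight 64-bit lanes, `sad_lane` is a hypothetical stand-in for SAD against zero, and the gathering is done with plain shifts rather than byte-shuffling instructions. It illustrates the data flow only and says nothing about performance.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 }; /* a 512-bit register modelled as eight 64-bit lanes */

/* Scalar stand-in for SAD against zero: sum of the eight bytes of x. */
static uint64_t sad_lane(uint64_t x) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += (x >> (8 * i)) & 0xff;
    }
    return sum;
}

/* Adds the pospopcnt of one 64-byte chunk to hist (hypothetical name). */
void pospopcnt_chunk64(const uint64_t chunk[LANES], uint64_t hist[8]) {
    const uint64_t bits_0_4   = 0x1111111111111111ULL;
    const uint64_t lo_nibbles = 0x0f0f0f0f0f0f0f0fULL;

    for (unsigned k = 0; k < 4; k++) {
        /* Partial pospopcnts for the pair (k, k+4): one sum per lane.
           The eight sums are gathered into the bytes of one 64-bit word. */
        uint64_t gathered = 0;
        for (unsigned lane = 0; lane < LANES; lane++) {
            uint64_t partial = sad_lane((chunk[lane] >> k) & bits_0_4);
            assert(partial <= 0x88); /* always fits in a single byte */
            gathered |= partial << (8 * lane);
        }
        /* Separate the nibbles and run the final SAD on each half. */
        hist[k]     += sad_lane(gathered & lo_nibbles);
        hist[k + 4] += sad_lane((gathered >> 4) & lo_nibbles);
    }
}
```

In this model each of the eight `hist` entries plays the role of one 64-bit word of the result vector; with every input byte equal to 0xff, each entry grows by 64.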

Such a resulting vector can easily be used to update the global histogram. The global histogram can also be kept in a register during the main loop, as it fits in a single AVX-512 register.