SIMDized check which bytes are in a set

Author: Wojciech Muła Added on: 2018-10-18

Introduction The problem is defined as follows: there's a stream of bytes and we want to get a byte-mask (or a bit-mask) that indicates which bytes are in the predefined set. Thanks to SIMD instructions this task can be performed faster than scalar code. Jobs like input validation or parsing (for instance CSV files), might benefit from a vectorized approach. In this text I show several SIMD methods: The universal algoithm that can handle arbitrary sets (from 1 to 255 elements) with a few instructions.

that can handle arbitrary sets (from 1 to 255 elements) with a few instructions. Several specialized algorithms, that handle small sets of peculiar properties. They require fewer instructions than the universal algorithm. However, the algorithms are rather meant for compilers/code generators, where we can statically determine the best code sequence for a predefined set.

For sake of completeness I describe basic SIMD methods. If the set has a few elements then no fancy algorithm is needed. Likewise, if the set can be represented as a union of ranges the code is also not complicated.

SIMD algorithms The main ingredient of the techniques shown below is instruction pshufb ( _mm_shuffle_epi8 ), which is present in SSE, AVX2 and also AVX512BW. The instruction does parallel byte lookup in a 16-byte register (or lane, in AVX2 and AVX512 variants) using 4-bit indices from another vector. A C-like code shows the pshufb algorithm: for ( int i = 0 ; i < VECTOR_SIZE ; i ++ ) { uint8_t index = indices_vector [ i ]; if ( index & 0x80 ) result [ i ] = 0x00 ; else result [ i ] = lookup_vector [ index & 0x0f ]; }

Universal algorithm In this algorithm the set is represented as a bitmap of size 16 x 16 bits, where a bit of value 1 indicates that given element is in the set. The bitmap is addressed with nibbles, i.e. 4-bit halves of a byte. The lower nibble of each input byte selects the bitmap's row, i.e. a 16-bit value. The higher nibble selects the column on the bitmap — once we fetched a row row, the higher nibble points at specific bit. Scalar code which deals with such representation is quite simple: bool in_set ( uint16_t bitmap [ 16 ], uint8_t byte ) { const uint8_t lo_nibble = byte & 0xf ; const uint8_t hi_nibble = byte >> 4 ; const uint16_t bitset = bitmap [ lo_nibble ]; const uint16_t bitmask = uint16_t ( 1 ) << hi_nibble ; return ( bitset & bitmask ) != 0 ; } Example Let's consider following set; for better visibility zeroes are shown with dots, and ones with the 'x': lo / hi nibble +-------------------------------- | 0 1 2 3 4 5 6 7 8 9 a b c d e f --+-------------------------------- 0 | x x . . . . x . . . x . . x . . 1 | x x x x . x x . . . . . x x . x 2 | . x . . x . x . . . x . . x . . 3 | . x x . . . . x . . x . x . x . 4 | . . . . . . . . . . . . x x x x 5 | x x . . x . x x x . x . . . x x 6 | x . . . . x . x . . x . x . . . 7 | . . x . . . . . . . . x . . x . 8 | . . x x . . . . . . . . . . . x 9 | . . x x x . . x . . x . . . . . a | . . . . . . x . . . x . . . . x b | . . . x . . x . . . . . . . . . c | x . . . x . . . . . . . . . x x d | . . . x x x . x . . x x . . . . e | x . x . . . . x . x . x . . . . f | x x . . . . x . . . . . x x x . We want to check if byte 0xa5 is in the set. Its lower nibble is 5, so we select the highlighted row, getting 16-bit value 1100_0101_1101_0011 . The higher nibble is 10 ( 0xa ), thus we're testing this bit: 1100_0101_1101_0011 ^ It's set and that means 0xa5 belongs to the set. Implementation Since the instruction pshufb maps bytes into bytes, it necessary to store two halves of a bitmap, i.e. 16 x 8 bit arrays. These sub-bitmaps are highlighted with blue and magenta below. lo / hi nibble +-------------------------------- | 0 1 2 3 4 5 6 7 8 9 a b c d e f --+-------------------------------- 0 | x x . . . . x . . . x . . x . . 1 | x x x x . x x . . . . . x x . x 2 | . x . . x . x . . . x . . x . . 3 | . x x . . . . x . . x . x . x . 4 | . . . . . . . . . . . . x x x x 5 | x x . . x . x x x . x . . . x x 6 | x . . . . x . x . . x . x . . . 7 | . . x . . . . . . . . x . . x . 8 | . . x x . . . . . . . . . . . x 9 | . . x x x . . x . . x . . . . . a | . . . . . . x . . . x . . . . x b | . . . x . . x . . . . . . . . . c | x . . . x . . . . . . . . . x x d | . . . x x x . x . . x x . . . . e | x . x . . . . x . x . x . . . . f | x x . . . . x . . . . . x x x . Fetching a row from the bitmap requires two lookups: to get "blue" and "magenta" parts of the row. Then, depending on higher nibble, we test a bit in "blue" (for higher nibble in range 0..7) or "magenta" part (for higher nibble in range 8..15). A note on calculating bitmask . In the scalar code it's just 1 << hi_nibble , here we must evaluate (1 << higher_nibble) % 8 . Fortunately, this is possible with single pshufb invocation. Below is a scalar code that show the algorithm: /* */ const uint8_t lo_nibble = input & 0x0f ; /* */ const uint8_t hi_nibble = input >> 4 ; /* 1 */ const uint8_t bitset_0_7 = bitmap_0_7 [ lo_nibble ]; /* 2 */ const uint8_t bitset_8_15 = bitmap_8_15 [ lo_nibble ]; /* 3 */ const uint8_t bitmask = ( 1 << ( hi_nibble & 0x7 )) uint8_t bitset ; /* 4 */ if ( hi_nibble < 8 ) bitset = row_0_7 ; else bitset = row_8_15 ; /* 5 */ return ( bitset & bitmask ) != 0 ; Instructions 1, 2 and 3 are pshufb . The fourth expression might be expressed by vector blend instruction ( x[i] ? t[i] : f[i] ). The last, 5th instruction, is simple bit-and followed by conversion into a bytemask. Following code shows a detailed SSE implementation of the above steps. Load 16 input bytes; hex digits are shown. // example values in set: {0x10, 0x21, 0xbd} // not in set: {0x36, 0x91, 0xed} // input = [36|10|91|21|10|ed|ed|21|36|bd|36|21|91|91|ed|10] // ^^ ^^ ^^ ^^ ^^ ^^ ^^ const __m128i input = _mm_loadu_si128 ( ptr ); Extract lower nibbles. Simple bit-and is needed. // lower_nibbles = [06|00|01|01|00|0d|0d|01|06|0d|06|01|01|01|0d|00] const __m128i lower_nibbles = _mm_and_si128 ( input , _mm_set1_epi8 ( 0x0f )); Extract higher nibbles, this requires also a bit shift. // higher_nibbles = [03|01|09|02|01|0e|0e|02|03|0b|03|02|09|09|0e|01] const __m128i higher_nibbles = _mm_and_si128 ( _mm_srli_epi16 ( input , 4 ), _mm_set1_epi8 ( 0x0f )); Pick rows for lower nibble (0..7). static const __m128i bitmap_0_7 = _mm_setr_epi8 ( /* 0 */ 0x43 , /* 01000011 */ /* 1 */ 0x6f , /* 01101111 */ /* 2 */ 0x52 , /* 01010010 */ /* 3 */ 0x86 , /* 10000110 */ /* 4 */ 0x00 , /* 00000000 */ /* 5 */ 0xd3 , /* 11010011 */ /* 6 */ 0xa1 , /* 10100001 */ /* 7 */ 0x04 , /* 00000100 */ /* 8 */ 0x0c , /* 00001100 */ /* 9 */ 0x9c , /* 10011100 */ /* a */ 0x40 , /* 01000000 */ /* b */ 0x48 , /* 01001000 */ /* c */ 0x11 , /* 00010001 */ /* d */ 0xb8 , /* 10111000 */ /* e */ 0x85 , /* 10000101 */ /* f */ 0x43 /* 01000011 */ ); // row_0_7 = [a1|43|6f|6f|43|b8|b8|6f|a1|b8|a1|6f|6f|6f|b8|43] const __m128i row_0_7 = _mm_shuffle_epi8 ( bitmap_0_7 , lower_nibbles ); Likewise pick rows for higher nibble (8..15). static const __m128i bitmap_8_15 = _mm_setr_epi8 ( /* 0 */ 0x24 , /* 00100100 */ /* 1 */ 0xb0 , /* 10110000 */ /* 2 */ 0x24 , /* 00100100 */ /* 3 */ 0x54 , /* 01010100 */ /* 4 */ 0xf0 , /* 11110000 */ /* 5 */ 0xc5 , /* 11000101 */ /* 6 */ 0x14 , /* 00010100 */ /* 7 */ 0x48 , /* 01001000 */ /* 8 */ 0x80 , /* 10000000 */ /* 9 */ 0x04 , /* 00000100 */ /* a */ 0x84 , /* 10000100 */ /* b */ 0x00 , /* 00000000 */ /* c */ 0xc0 , /* 11000000 */ /* d */ 0x0c , /* 00001100 */ /* e */ 0x0a , /* 00001010 */ /* f */ 0x70 /* 01110000 */ ); // row_8_15 = [14|24|b0|b0|24|0c|0c|b0|14|0c|14|b0|b0|b0|0c|24] const __m128i row_8_15 = _mm_shuffle_epi8 ( bitmap_8_15 , lower_nibbles ); Calculate a bitmask, i.e. (1 << hi_nibble % 8) . static const __m128i bitmask_lookup = _mm_setr_epi8 ( 1 , 2 , 4 , 8 , 16 , 32 , 64 , - 128 , 1 , 2 , 4 , 8 , 16 , 32 , 64 , - 128 ); // bitmasks = [08|02|02|02|02|40|40|02|08|08|08|04|02|02|40|02] const __m128i bitmask = _mm_shuffle_epi8 ( bitmask_lookup , higher_nibbles ); Choose rows halves depending on higher nibbles. // mask = [ff|ff|00|ff|ff|00|00|ff|ff|00|ff|ff|00|00|00|ff] const __m128i mask = _mm_cmplt_epi8 ( higher_nibbles , _mm_set1_epi8 ( 8 )); // bitsets = [ff|ff|00|ff|ff|00|00|ff|ff|00|ff|ff|00|00|00|ff] // ? [a1|43|..|6f|43|..|..|6f|a1|..|a1|6f|..|..|..|43] // : [..|..|b0|..|..|0c|0c|..|..|0c|..|..|b0|b0|0c|..] // // = [a1|43|b0|6f|43|0c|0c|6f|a1|0c|a1|6f|b0|b0|0c|43] const __m128i bitsets = _mm_blendv_epi8 ( row_0_7 , row_8_15 , mask ); Finally check which bytes belong to the set. // tmp = [a1|43|b0|6f|43|0c|0c|6f|a1|0c|a1|6f|b0|b0|0c|43] // & [08|02|02|02|02|40|40|02|08|08|08|04|02|02|40|02] // = [00|02|00|02|02|00|00|02|00|08|00|04|00|00|00|02] // ^^ ^^ ^^ ^^ ^^ ^^ ^^ const __m128i tmp = _mm_and_si128 ( bitsets , bitmask ); // result = [00|ff|00|ff|ff|00|00|ff|00|ff|00|ff|00|00|00|ff] const __m128i result = _mm_cmpeq_epi8 ( t0 , bitmask ); Number of instructions: 3 x bit-and;

1 x shift;

3 x shuffle bytes;

2 x comparison;

1 x blend. Alternative implementation Geoff Langdale pointed out that we might get rid of the blend instruction, which is not the fastest one. Instead, we use pshufb , that zeros destination bytes when the highest bit of index is 1. This requires different indices when fetching row_0_7 and row_8_15 . The indices for bitmap_0_7 keep bits 0..3 and also the most significant, i.e. 7th bit of the input. Similarly the indices for bitmap_8_15 have also bits 0..3, but the bit 7th is negated. When the most significant bit is set, it means that the higher nibble is greater or equal 8 — in such situation bitset_0_7 is zeroed and bitset_8_15 kept. The modified algorithm: Load 16 input bytes. const __m128i input = _mm_loadu_si128 ( ptr ); Extract indices for row_0_7 . const __m128i indices_0_7 = _mm_and_si128 ( input , _mm_set1_epi8 ( 0x8f )); // 0b1000_1111 Extract indices for row_8_15 . const __m128i msb = _mm_and_si128 ( input , _mm_set1_epi ( 0x80 )); const __m128i indices_8_15 = _mm_xor_si128 ( indices_0_7 , msb ); Fetch row_0_7 and row_8_15 : const __m128i row_0_7 = _mm_shuffle_epi8 ( bitmap_0_7 , indices_0_7 ); const __m128i row_8_15 = _mm_shuffle_epi8 ( bitmap_8_15 , indices_8_15 ); Calculate a bitmask, i.e. (1 << hi_nibble % 8) . const __m128i bitmask = _mm_shuffle_epi8 ( bitmask_lookup , higher_nibbles ); Choose rows halves depending on higher nibbles. const __m128i bitsets = _mm_or_si128 ( row_0_7 , row_8_15 ); Finally check which bytes belong to the set. const __m128i tmp = _mm_and_si128 ( bitsets , bitmask ); const __m128i result = _mm_cmpeq_epi8 ( t0 , bitmask ); Number of instructions: 3 x bit-and,

1 x bit-or,

1 x bit-xor,

3 x shuffle,

1 x compare.

Special case 1 — small sets This method handles up to eight distinct elements. Let's consider this sample set: {0x01, 0x31, 0xc1, 0x35, 0x65, 0x77, 0x8b, 0x3e}; elements are labelled with letters K .. R. lo / hi nibble +-------------------------------- | 0 1 2 3 4 5 6 7 8 9 a b c d e f --+-------------------------------- 0 | . . . . . . . . . . . . . . . . 1 | K . . L . . . . . . . . M . . . 2 | . . . . . . . . . . . . . . . . 3 | . . . . . . . . . . . . . . . . 4 | . . . . . . . . . . . . . . . . 5 | . . . N . . O . . . . . . . . . 6 | . . . . . . . . . . . . . . . . 7 | . . . . . . . P . . . . . . . . 8 | . . . . . . . . . . . . . . . . 9 | . . . . . . . . . . . . . . . . a | . . . . . . . . . . . . . . . . b | . . . . . . . . Q . . . . . . . c | . . . . . . . . . . . . . . . . d | . . . . . . . . . . . . . . . . e | . . . R . . . . . . . . . . . . f | . . . . . . . . . . . . . . . . We separately consider lower and higher nibble, i.e. rows and columns in the diagram above. We associate a subset with each nibble, for lower nibbles there are following subsets: 1: {K, L, M}

5: {N, O}

7: {P}

b: {Q}

e: {R} For higher nibbles we have: 0: {K}

3: {L, N, R}

6: {O}

7: {P}

8: {Q}

c: {M} These subsets are encoded using bit sets, i.e. each element has assigned a bit in a byte (this is the source of size limit). For the example data we might have following mapping: const uint8_t empty = 0x00 ; const uint8_t K = ( 1 << 0 ); // 0x01 const uint8_t L = ( 1 << 1 ); // 0x02 const uint8_t M = ( 1 << 2 ); // 0x04 const uint8_t N = ( 1 << 3 ); // 0x08 const uint8_t O = ( 1 << 4 ); // 0x10 const uint8_t P = ( 1 << 5 ); // 0x20 const uint8_t Q = ( 1 << 6 ); // 0x40 const uint8_t R = ( 1 << 7 ); // 0x80 And then lookup tables are: const uint8_t lo_nibbles_lookup [ 16 ] = { /* 0 */ 0x00 , /* 1 */ 0x07 , // K | L | M /* 2 */ 0x00 , /* 3 */ 0x00 , /* 4 */ 0x00 , /* 5 */ 0x18 , // N | O /* 6 */ 0x00 , /* 7 */ 0x20 , // P /* 8 */ 0x00 , /* 9 */ 0x00 , /* a */ 0x00 , /* b */ 0x40 , // Q /* c */ 0x00 , /* d */ 0x00 , /* e */ 0x80 , // R /* f */ 0x00 }; const uint8_t hi_nibbles_lookup [ 16 ] = { /* 0 */ 0x01 , // K /* 1 */ 0x00 , /* 2 */ 0x00 , /* 3 */ 0x8a , // L | N | R /* 4 */ 0x00 , /* 5 */ 0x00 , /* 6 */ 0x10 , // O /* 7 */ 0x20 , // P /* 8 */ 0x40 , // Q /* 9 */ 0x00 , /* a */ 0x00 , /* b */ 0x00 , /* c */ 0x04 , // M /* d */ 0x00 , /* e */ 0x00 , /* f */ 0x00 }; Algorithm consist following steps (SSE code is shown in example): Load 16 input bytes. // input = [11|31|11|35|8b|ff|ee|77|11|c1|11|8b|11|11|ff|01] // ^^ ^^ ^^ ^^ ^^ ^^ ^^ const __m128i input = _mm_loadu_si128 ( ptr ); Extract lower nibbles. // lower_nibbles = [01|01|01|05|0b|0f|0e|07|01|01|01|0b|01|01|0f|01] const __m128i lower_nibbles = _mm_and_si128 ( input , _mm_set1_epi8 ( 0x0f )); Extract higher nibbles. // higher_nibbles = [01|03|01|03|08|0f|0e|07|01|0c|01|08|01|01|0f|00] const __m128i higher_nibbles = _mm_and_si128 ( _mm_srli_epi16 ( input , 4 ), _mm_set1_epi8 ( 0x0f )); Translate the lower and higher nibbles (the lookup tables are defined above). // lo_translated = [07|07|07|18|40|00|80|20|07|07|07|40|07|07|00|07] const __m128i lo_translated = _mm_shuffle_epi8 ( _mm_load_si128 ( lo_nibbles_lookup ), lower_nibbles ); // hi_translated = [00|8a|00|8a|40|00|00|20|00|04|00|40|00|00|00|01] const __m128i hi_translated = _mm_shuffle_epi8 ( _mm_load_si128 ( hi_nibbles_lookup ), higher_nibbles ); Test bitsets for intersection. // lo_translated = [07|07|07|18|40|00|80|20|07|07|07|40|07|07|00|07] // hi_translated = [00|8a|00|8a|40|00|00|20|00|04|00|40|00|00|00|01] // intersection = [00|02|00|08|40|00|00|20|00|04|00|40|00|00|00|01] // ^^ ^^ ^^ ^^ ^^ ^^ ^^ const __m128i intersection = _mm_and_si128 ( lo_translated , hi_translated ); Fix up if necessary. Convert all non-zero elements into 0xff . // t0 = [ff|00|ff|00|00|ff|ff|00|ff|00|ff|00|ff|ff|ff|00] const __m128i t0 = _mm_cmpeq_epi8 ( intersection , _mm_setzero_si128 ()); const __m128i t1 = _mm_andnot_si128 ( t0 , _mm_setzero_si128 ()); Number of instructions: 2 x bit-and;

1 x shift

2 x shuffle bytes;

1 x comparison [optional];

1 x bit-andnot [optional].

Special case 2 — constant nibble When either higher or lower nibbles of set values are constant, a single invocation of pshufb is sufficient to classify up to 16-element set. The variable nibble addresses a lookup vector which contains elements from the set; nibbles not present in the set are marked with some other value. The translated vector is finally compared for equal with the input vector, yielding zeros for bytes that are not in the set. Let's consider set {0x10, 0x12, 0x14, 0x15, 0x17, 0x18, 0x1a, 0x1f}. lo / hi nibble +-------------------------------- | 0 1 2 3 4 5 6 7 8 9 a b c d e f --+-------------------------------- 0 | . x . . . . . . . . . . . . . . 1 | . . . . . . . . . . . . . . . . 2 | . . . . . . . . . . . . . . . . 3 | . . . . . . . . . . . . . . . . 4 | . x . . . . . . . . . . . . . . 5 | . x . . . . . . . . . . . . . . 6 | . . . . . . . . . . . . . . . . 7 | . x . . . . . . . . . . . . . . 8 | . x . . . . . . . . . . . . . . 9 | . . . . . . . . . . . . . . . . a | . x . . . . . . . . . . . . . . b | . . . . . . . . . . . . . . . . c | . . . . . . . . . . . . . . . . d | . . . . . . . . . . . . . . . . e | . . . . . . . . . . . . . . . . f | . x . . . . . . . . . . . . . . The higher nibble is constant, equal to 0x10; we will map lower, variable nibbles as shown below. const uint8_t lookup [ 16 ] = { /* 0 */ 0x10 , /* 1 */ 0x00 , // not in the set /* 2 */ 0x12 , /* 3 */ 0x00 , // not in the set /* 4 */ 0x14 , /* 5 */ 0x15 , /* 6 */ 0x00 , // not in the set /* 7 */ 0x17 , /* 8 */ 0x18 , /* 9 */ 0x00 , // not in the set /* a */ 0x1a , /* b */ 0x00 , // not in the set /* c */ 0x00 , // not in the set /* d */ 0x00 , // not in the set /* e */ 0x00 , // not in the set /* f */ 0x1f , }; Algorithm consist following steps: Load 16 input bytes. // input = [21|12|13|15|14|fa|ca|17|55|aa|2a|1a|3a|ff|af|1f] // ^^ ^^ ^^ ^^ ^^ ^^ const __m128i input = _mm_loadu_si128 ( ptr ); Extract the variable nibble. If it's lower one, then simple bit-and is needed; for higher nibble also a shift right is required. // lower_nibbles = [01|02|03|05|04|0a|0a|07|05|0a|0a|0a|0a|0f|0f|0f] const __m128i lower_nibbles = _mm_and_si128 ( input , _mm_set1_epi8 ( 0x0f )); // higher_nibbles = [02|01|01|01|01|0f|0c|01|05|0a|02|01|03|0f|0a|01] const __m128i higher_nibbles = _mm_and_si128 ( _mm_srli_epi16 ( input , 4 ), _mm_set1_epi8 ( 0x0f )); // in our example we use lower nibbles // nibbles = [01|02|03|05|04|0a|0a|07|05|0a|0a|0a|0a|0f|0f|0f] const __m128i nibbles = lower_nibbles ; Translate the nibbles using lookup table. // translated = [10|12|10|15|14|1a|1a|17|15|1a|1a|1a|1a|1f|1f|1f] const __m128i translated = _mm_shuffle_epi8 ( _mm_load_si128 ( lookup ), nibbles ) Compare the translated vector with the input one. // input = [21|12|13|15|14|fa|ca|17|55|aa|2a|1a|3a|ff|af|1f] // ^^ ^^ ^^ ^^ ^^ ^^ // translated = [10|12|10|15|14|1a|1a|17|15|1a|1a|1a|1a|1f|1f|1f] // eq = [00|ff|00|ff|ff|00|00|ff|00|00|00|ff|00|00|00|ff] const __m128i eq = _mm_cmpeq_epi8 ( input , translated ); Number of instructions: 1 x bit-and;

1 x shift (if higher nibble is variable);

1 x shuffle bytes;

1 x comparison.

Special case 3 — unique lower and higher nibbles When none of lower and higher nibbles repeats then just two invocations of pshufb are needed to classify a vector. In this case up to 16-elements sets are supported. Let's consider following 11-element set {0x20, 0x31, 0x42, 0x53, 0x64, 0x75, 0x86, 0x97, 0xa8, 0xb9, 0xca}. Lower nibbles are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a}, higher nibbles are {2, 3, 4, 5, 6, 7, 8, 9, a, b, c} — as we see, the both sets contain unique values. lo / hi nibble +-------------------------------- | 0 1 2 3 4 5 6 7 8 9 a b c d e f --+-------------------------------- 0 | . . K . . . . . . . . . . . . . 1 | . . . L . . . . . . . . . . . . 2 | . . . . M . . . . . . . . . . . 3 | . . . . . N . . . . . . . . . . 4 | . . . . . . O . . . . . . . . . 5 | . . . . . . . P . . . . . . . . 6 | . . . . . . . . Q . . . . . . . 7 | . . . . . . . . . R . . . . . . 8 | . . . . . . . . . . S . . . . . 9 | . . . . . . . . . . . T . . . . a | . . . . . . . . . . . . U . . . b | . . . . . . . . . . . . . . . . c | . . . . . . . . . . . . . . . . d | . . . . . . . . . . . . . . . . e | . . . . . . . . . . . . . . . . f | . . . . . . . . . . . . . . . . The idea is to label each set element with unique index and then map lower and higher nibbles to that index. If the indices are equal, it means that both nibbles form set's element. Mapping for nibbles outside the valid set must yield different values, so comparison is always false. For instance, we might have following mapping: const uint8_t K = 0 ; const uint8_t L = 1 ; const uint8_t M = 2 ; const uint8_t N = 3 ; const uint8_t O = 4 ; const uint8_t P = 5 ; const uint8_t Q = 6 ; const uint8_t R = 7 ; const uint8_t S = 8 ; const uint8_t T = 9 ; const uint8_t U = 10 ; const uint8_t lo_outside = 0x10 ; const uint8_t hi_outside = 0x20 ; // lo_outside != hi_outside Algorithm: Load 16 bytes. // in = [20|21|ca|cb|aa|a8|86|42|43|12|44|75|86|8f|fa|97] // ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ const __m128 in = _mm_loadu_si128 ( ptr ); Extract lower nibbles. // lo = [ 0| 1| a| b| a| 8| 6| 2| 3| 2| 4| 5| 6| f| a| 7] const __m128i lo = _mm_and_si128 ( in , _mm_set1_epi8 ( 0x0f )); Extract higher nibbles. // hi = [ 2| 2| c| c| a| a| 8| 4| 4| 1| 4| 7| 8| 8| f| 9] __m128i hi = _mm_srli_epi16 ( in , 4 ); hi = _mm_and_si128 ( hi , _mm_set1_epi8 ( 0x0f )); Translate lower nibbles into indices. const __m128i lo_lookup = _mm_setr_epi8 ( 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 0xa , 0x20 , 0x20 , 0x20 , 0x20 , 0x20 ); // lo = [ 0| 1| a| b| a| 8| 6| 2| 3| 2| 4| 5| 6| f| a| 7] // lo_idx = [ 0| 1| a|20| a| 8| 6| 2| 3| 2| 4| 5| 6|20| a| 7] const __m128i lo_idx = _mm_shuffle_epi8 ( lo , lo_lookup ) Translate higher nibbles into indices. const __m128i hi_lookup = _mm_setr_epi8 ( 0x10 , 0x10 , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 0xa , 0x10 , 0x10 , 0x10 ); // hi = [ 2| 2| c| c| a| a| 8| 4| 4| 1| 4| 7| 8| 8| f| 9] // hi_idx = [ 0| 0| a| a| 8| 8| 6| 2| 2|10| 2| 5| 6| 6|10| 7] const __m128i hi_idx = _mm_shuffle_epi8 ( hi , hi_lookup ) Compare translated nibbles — if they match, it means that bytes also match. // in = [20|21|ca|cb|aa|a8|86|42|43|12|44|75|86|8f|fa|97] // ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ // lo_idx = [ 0| 1| a|20| a| 8| 6| 2| 3| 2| 4| 5| 6|20| a| 7] // hi_idx = [ 0| 0| a| a| 8| 8| 6| 2| 2|10| 2| 5| 6| 6|10| 7] // result = [ff|00|ff|00|00|ff|ff|ff|00|00|00|ff|ff|00|00|ff] const __m128i result = _mm_cmpeq_epi8 ( lo_idx , hi_idx ) Number of instructions: 2 x bit-and;

1 x shift;

2 x shuffle bytes;

1 x comparison.

Acknowledgements Thanks to Geoff Langdale and Michael Howard for their comments and fixes.

See also Intel Hyperscan, that use similar algorithms