Gray code encoding

Gray code encoding is a bijective function between natural numbers (including 0) in a way that the Gray code of subsequent integers only differ in a single binary digit. This encoding was proved useful for rotary encoders and other kinds of counting where switching multiple bits at once would be problematic.



A 10bit Gray code rotary encoder.

I use the following common definition for Gray code encoding:

where denotes the ,,bitwise exclusive or’’ and denotes the ,,bitwise right shift’’ operators.

Gray code decoding

I denote the inverse of the function with . can be expressed in the following way:

It can be shown that is indeed the inverse of using the commutative and associative properties of and the following identities:

Implementations

Implementing Gray code encoding is rather simple:

unsigned gray_encode ( unsigned i ) { return i ^ ( i >> 1 ); }

One can implement Gray code decoding directly from the defining expression of :

#include <stdint.h> unsigned gray_decode ( unsigned i ) { unsigned ret = 0 ; while ( i != 0 ) { ret = ( ret ^ i ); i = ( i >> 1 ); } return ret ; }

One can actually do better for a known fixed-width integral type. The following algorithm is from Wikipedia:

#include <stdint.h> uint32_t gray_decode ( uint32_t i ) { i ^= ( i >> 16 ); i ^= ( i >> 8 ); i ^= ( i >> 4 ); i ^= ( i >> 2 ); i ^= ( i >> 1 ); return i ; }

One can show the correctness of this implementation using the identities I presented before and induction.

I also developed an alternative implementation that makes use of the popcnt and pdep x86_64 CPU instructions (part of the POPCNT and BMI2 extensions respectively).

#include <stdint.h> #include <x86intrin.h> uint32_t gray_decode ( uint32_t i ) { uint32_t evens = _pdep_u32 ( 0x55555555u , i ); uint32_t odds = _pdep_u32 ( 0xAAAAAAAAu , i ); uint32_t popcount = __builtin_popcount ( i ); return ( - ( popcount & 1 )) ^ (( odds << 1 ) - ( evens << 1 )); }

In the following sections I break down how this implementation works. I stick with 32 bit numbers, but my reasoning works for any bitwidth.

If and – where the bold letters denote binary digits – then . In other words one can get the th digit in by xoring the digits from the th to the most significant digit in . For a given where the 1 digits are sparsely laid out one can think of in the following way:

The leading digits (from the most significant digits) are 0 up until the first 1 digit in . Then the following digits in are all 1 up until the next 1 digit in , and so on. The 1 digits in mark the transitions between strips of 0s and 1s in . For this description it’s not required that the 1s in are sparsely laid out, it just makes the intuition easier (for me at least).

I define by taking every second 1 binary digit in starting from the least significant one. I also define by taking every second 1 binary digit in starting from the second least significant one. and correspond to the evens and odds variables in my gray_decode function above.

After adding and , where denotes bitwise negation, the result has similar strips of 0s and 1s as :

There are two main differences to :

The transitions between the strips of 0s and 1s start one place to the right compared to . The first strips from the left can be all 1s or all 0s depending if there are even or odd number of 1 digits in .

The first issue can be fixed by shifting and to the left by one place beforehand. The second difference can be addressed by first calculating if there are even or odd number of 1 bits in . Then one can flip the bits in the result for the even case.

I denote the number of 1 bits in as , this corresponds to the popcnt variable in my gray_decode function. By applying these fixes we get the following expression for :

The branching between the even and odd cases can also be eliminated by xoring the intermediate result with 00…0 or 11…1 depending on the parity of . is either 0 or 1 , is either 00…0 or 11…11 for the even and odd cases respectively. Since bitwise negation is required for the even case an additional bitwise negation is necessary:

Update: IJzerbaard pointed out that this expression can be further simplified by using the and identities:

The popcnt and pdep CPU instructions can be used to calculate , and efficiently.

The popcnt instruction takes one source and one destination argument. It writes the number of 1 binary digits in the source and returns the count in the destination. can be calculated by applying popcnt on .

pdep takes two source arguments ( src and mask ) and one destination argument ( dst ). It takes the low bits from src and deposits them in the destination to the corresponding set bit locations in mask .

One can calculate by using pdep and setting src to 0101…01 and mask to . can be calculated similarly with pdep but with 1010…10 as src . This is how evens and odds are calculated in my gray_decode function, the magic numbers are the src arguments.

Benchmark

I think there is no correct way to objectively benchmark these Gray code decoding functions in isolation. One can call these functions in a tight loop and compare their throughput that way, but one shouldn’t take the resulting numbers too seriously.

I used Google’s benchmark library and the following setup:

static void Generic32 ( benchmark :: State & state ) { uint32_t i = 0 ; for ( auto _ : state ) { ++ i ; uint32_t x = gray_decode_generic ( i ); benchmark :: DoNotOptimize ( x ); } state . SetBytesProcessed ( i * sizeof ( uint32_t )); } BENCHMARK ( Generic32 );

And similarly for all the other functions for both uint32_t and uint64_t . I compiled the benchmark with gcc -O2 .

------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------- Generic32 2 ns 2 ns 58150989 1.56283GB/s PDEP32 2 ns 2 ns 82701968 2.18142GB/s Generic64 2 ns 2 ns 75010161 4.67384GB/s PDEP64 1 ns 1 ns 110827182 5.80519GB/s

The throughput of the popcnt and pdep method is somewhat faster in this specific benchmark.

Notes

can also be calculated using carry-less multiplying with 11…1 and fixing the result with popcnt . The PCLMUL extension provides this operation, but it can only carry-less multiply 64bit arguments and it still requires XMM registers. The performance of this method wasn’t great the last time I tried. There can easily be a CPU instruction that could help beating my method at least in a streaming case, but I’m not familiar with all of the vector instructions. Fun fact: is the Thue–Morse sequence I use a variation of my algorithm in my fast Hilbert-curve library. For my purposes I don’t need to shift and .

Update2: IJzerbaard discovered an other improvement: the left shifting can be done early before the pdep instructions, this saves a single instruction in the resulting binary when compiled with gcc: