Last week someone I mentor asked for help with a programming challenge that involved some bit twiddling. The first part of the problem was:

A number is known as a special bit number if its binary representation contains at least two consecutive 1’s or set bits. For example, 7 with binary representation 111 is a special bit number. Similarly, 3 (11) is also a special bit number.

The naive implementation was quite simple but got me thinking if there were better options.
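For reference, the definition can be written down almost word for word. This is only a sketch and not one of the approaches measured in this post: isSpecialString is a hypothetical helper name, and going through a string is slow, but it makes a handy obviously-correct baseline.

```go
package main

import (
	"strconv"
	"strings"
)

// isSpecialString is a hypothetical reference helper (not from the original
// solution). It renders n in base 2 and looks for two adjacent '1' digits.
func isSpecialString(n uint32) bool {
	return strings.Contains(strconv.FormatUint(uint64(n), 2), "11")
}
```

With this, 7 (111) and 3 (11) come out special, while 5 (101) does not.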

1. Right Shift and Counter

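This is the most straightforward approach: we check the least significant bit (LSB), remember whether the previous bit was also 1, and shift right. As soon as we see two set bits in a row we exit; an unset bit resets the run. Since a run of two is all we need, the code tracks it with a single boolean rather than an actual counter.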

    func isSpecial(n uint32) bool {
        // Walk the bits from least to most significant.
        for wasLastBitOne := false; n > 0; n = n >> 1 {
            isCurrentBitOne := n%2 == 1
            if isCurrentBitOne && wasLastBitOne {
                return true
            }
            wasLastBitOne = isCurrentBitOne
        }
        return false
    }

Now let’s write a quick benchmark to see how it performs.

    // Lives in a _test.go file that imports "testing".
    func BenchmarkIsSpecial(b *testing.B) {
        for n := 0; n < b.N; n++ {
            isSpecial(uint32(n))
        }
    }

And the results are in: we can perform about 100 million of these checks per second.

    > go test -bench 'BenchmarkIsSpecial$'
    goos: darwin
    goarch: amd64
    pkg: github.com/zqureshi/special
    BenchmarkIsSpecial-8    200000000    9.46 ns/op
    PASS
    ok      github.com/zqureshi/special    2.861s

2. Speedup via Lookup Table

We are working with 32-bit unsigned integers and notice that checking an 8-bit long subsequence is no different than checking the whole number. For an 8-bit sequence we have 256 (2**8) possible combinations and can precompute a lookup table of all possible results.

    import "fmt"

    var lookupTable = [256]bool{}

    func init() {
        fmt.Println("Recomputing lookup table...")
        for i := uint32(0); i < 256; i++ {
            lookupTable[i] = isSpecial(i)
        }
    }

    func isSpecialLookup(n uint32) bool {
        // 3 (binary 11) is special, so its entry is true once init has run.
        if !lookupTable[3] {
            panic("Lookup table not initialized!")
        }

        // Overlapping 8-bit windows: bits 0-7, 7-14, 14-21, 21-28, 24-31.
        return lookupTable[uint8(n)] ||
            lookupTable[uint8(n>>7)] ||
            lookupTable[uint8(n>>14)] ||
            lookupTable[uint8(n>>21)] ||
            lookupTable[uint8(n>>24)]
    }

An interesting thing to note here is that we cannot just check the four disjoint 8-bit ranges 0-7, 8-15, 16-23, 24-31, because we would miss numbers whose consecutive set bits straddle a boundary, e.g. bits 7 and 8. We must therefore always check overlapping ranges: 0-7, 7-14, 14-21, 21-28, 24-31. The last lookup of bits 24-31 re-checks bits 24-28, but it is easier to just do so than to special-case it and add extra logic.
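To make the boundary problem concrete, here is a small sketch that puts a disjoint-window check next to the overlapping one. The names table8, hasAdjacentOnes, disjointWindows and overlappingWindows are hypothetical and exist only for this illustration; table8 plays the role of the lookup table above.

```go
package main

// table8 is a hypothetical stand-in for the 8-bit lookup table: entry i says
// whether the 8-bit value i contains two adjacent set bits.
var table8 [256]bool

func init() {
	for i := 0; i < 256; i++ {
		table8[i] = hasAdjacentOnes(uint8(i))
	}
}

// hasAdjacentOnes is the naive bit-by-bit check, restricted to one byte.
func hasAdjacentOnes(b uint8) bool {
	for prev := false; b > 0; b >>= 1 {
		cur := b&1 == 1
		if cur && prev {
			return true
		}
		prev = cur
	}
	return false
}

// disjointWindows looks at bits 0-7, 8-15, 16-23 and 24-31 with no overlap,
// so a pair of set bits that straddles two windows goes unnoticed.
func disjointWindows(n uint32) bool {
	return table8[uint8(n)] || table8[uint8(n>>8)] ||
		table8[uint8(n>>16)] || table8[uint8(n>>24)]
}

// overlappingWindows looks at bits 0-7, 7-14, 14-21, 21-28 and 24-31, so
// every adjacent pair of bits falls inside at least one window.
func overlappingWindows(n uint32) bool {
	return table8[uint8(n)] || table8[uint8(n>>7)] ||
		table8[uint8(n>>14)] || table8[uint8(n>>21)] ||
		table8[uint8(n>>24)]
}
```

For 0x180, which has only bits 7 and 8 set, disjointWindows reports false while overlappingWindows reports true.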

And now let’s also benchmark this.

    func BenchmarkIsSpecialLookup(b *testing.B) {
        for n := 0; n < b.N; n++ {
            isSpecialLookup(uint32(n))
        }
    }

    > go test -bench '.'
    Recomputing lookup table...
    goos: darwin
    goarch: amd64
    pkg: github.com/zqureshi/special
    BenchmarkIsSpecial-8          200000000     9.85 ns/op
    BenchmarkIsSpecialLookup-8    1000000000    2.60 ns/op
    PASS
    ok      github.com/zqureshi/special    5.817s

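We get an almost 4x speedup using this approach, as we are doing fewer comparisons and memory writes, but also because the lookup table is tiny. Packed one bit per entry it would need only 32 bytes, half of a 64-byte cache line; as a Go array of 256 bools it takes 256 bytes (four cache lines), which still sits cosily in the L1 cache.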
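If you really want the table to occupy 32 bytes, the 256 flags can be packed one per bit. The sketch below shows one way to do that; packedTable, adjacentOnes, packedLookup and isSpecialPacked are hypothetical names, and I have not benchmarked this variant, so treat it as an illustration of the layout rather than a faster alternative. Each lookup now costs an extra shift and mask.

```go
package main

// packedTable is a hypothetical bit-packed variant of the 8-bit lookup
// table: 256 one-bit flags stored in four 64-bit words (32 bytes in total).
var packedTable [4]uint64

func init() {
	for i := uint32(0); i < 256; i++ {
		if adjacentOnes(i) {
			packedTable[i/64] |= uint64(1) << (i % 64)
		}
	}
}

// adjacentOnes is the naive bit-by-bit check, repeated here so the sketch
// is self-contained.
func adjacentOnes(n uint32) bool {
	for prev := false; n > 0; n >>= 1 {
		cur := n&1 == 1
		if cur && prev {
			return true
		}
		prev = cur
	}
	return false
}

// packedLookup reads the flag for the 8-bit value b out of packedTable.
func packedLookup(b uint8) bool {
	return packedTable[b/64]>>(b%64)&1 == 1
}

// isSpecialPacked uses the same overlapping windows as the byte-per-entry
// version: bits 0-7, 7-14, 14-21, 21-28 and 24-31.
func isSpecialPacked(n uint32) bool {
	return packedLookup(uint8(n)) || packedLookup(uint8(n>>7)) ||
		packedLookup(uint8(n>>14)) || packedLookup(uint8(n>>21)) ||
		packedLookup(uint8(n>>24))
}
```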

3. Larger Lookup Table

We could extend the previous approach and precompute 65536 (2**16) values, which would allow us to cover the whole number in just 3 comparisons over bits 0-15, 15-30, 16-31.

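    // Assumption: a separate 65536-entry table, since lookupTable has only 256 entries.
    var lookupTable16 = [1 << 16]bool{}

    func init() {
        for i := uint32(0); i < 1<<16; i++ {
            lookupTable16[i] = isSpecial(i)
        }
    }

    func isSpecialLookup16(n uint32) bool {
        if !lookupTable16[3] {
            panic("Lookup table not initialized!")
        }
        return lookupTable16[uint16(n)] ||
            lookupTable16[uint16(n>>15)] ||
            lookupTable16[uint16(n>>16)]
    }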

And benchmark

    func BenchmarkIsSpecialLookup16(b *testing.B) {
        for n := 0; n < b.N; n++ {
            isSpecialLookup16(uint32(n))
        }
    }

    Recomputing lookup table...
    goos: darwin
    goarch: amd64
    pkg: github.com/zqureshi/special
    BenchmarkIsSpecial-8            200000000     9.57 ns/op
    BenchmarkIsSpecialLookup-8      1000000000    2.62 ns/op
    BenchmarkIsSpecialLookup16-8    1000000000    2.31 ns/op
    PASS
    ok      github.com/zqureshi/special    8.325s

A 12% improvement in runtime but a 256x blowup in table size (32 bytes vs 8192 bytes when packed one bit per entry, or 256 bytes vs 65536 bytes as Go bool arrays). While this performs better in this synthetic benchmark, it might not perform as well in production because of cache and page misses.

4. Finding Something Better

At this point I was quite happy with the progress but was wondering if there was some intrinsic property of numbers that could be exploited to make this even faster. I landed upon this HackerRank post and a collection of bit twiddling hacks from Stanford’s graphics department.

The solution was quite ingenious. Any binary number consists of runs of set bits separated by one or more unset bits. If you left shift the number, every run moves up by one position, so the 0 below each run now lines up with that run's lowest 1. If you then bitwise AND the shifted value with the original number, you have effectively cleared one set bit from each run. Repeat this process and you clear another bit, and another, until the number is 0.

A worked out example will demonstrate this better.

    14      =  1110
    14 << 1 = 11100
    a & b   = 01100

    12      =  1100
    12 << 1 = 11000
    a & b   = 01000

    8       =  1000
    8 << 1  = 10000
    a & b   = 00000
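The repeated shift-and-AND in the example above can be sketched as a loop. Each round shortens every run by one bit, so the number of rounds needed to reach zero equals the length of the longest run of set bits. The name longestRun is hypothetical and not part of the original solution.

```go
package main

// longestRun is a hypothetical helper. It repeatedly ANDs n with itself
// shifted left by one; every round removes one bit from each run of 1s, so
// the round count at which n becomes 0 is the length of the longest run.
func longestRun(n uint32) int {
	rounds := 0
	for n != 0 {
		n &= n << 1
		rounds++
	}
	return rounds
}
```

For 14 (1110) the loop goes 14, 12, 8, 0 and returns 3, matching the three steps of the worked example.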

We can then simplify this algorithm: if a number has at least two consecutive bits set, then a single left shift followed by a bitwise AND will result in a non-zero value; otherwise the result is 0. Let’s work through this.

    12      =  1100
    12 << 1 = 11000
    a & b   = 01000

    10      =  1010
    10 << 1 = 10100
    a & b   = 00000

So now the code simply becomes

    func isSpecialLeftShift(n uint32) bool {
        return (n & (n << 1)) > 0
    }

and benchmark

    func BenchmarkIsSpecialLeftShift(b *testing.B) {
        for n := 0; n < b.N; n++ {
            isSpecialLeftShift(uint32(n))
        }
    }

    Recomputing lookup table...
    goos: darwin
    goarch: amd64
    pkg: github.com/zqureshi/special
    BenchmarkIsSpecial-8             200000000     9.55 ns/op
    BenchmarkIsSpecialLookup-8       1000000000    2.60 ns/op
    BenchmarkIsSpecialLookup16-8     1000000000    2.32 ns/op
    BenchmarkIsSpecialLeftShift-8    2000000000    0.27 ns/op
    PASS
    ok      github.com/zqureshi/special    8.871s

This is a 35x improvement over our naive approach. The function boils down to about 3 instructions (a shift, an AND and a comparison), which means it is also a candidate for inlining.
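One caveat worth checking: 0.27 ns/op is close to the cost of the loop itself, which suggests (though I have not confirmed it) that the compiler inlined the call and dropped the unused result. A common way to guard against that is to feed the result into a package-level variable. The sketch below assumes that pattern; sink and BenchmarkIsSpecialLeftShiftSink are hypothetical names, and isSpecialLeftShift is repeated so the snippet stands on its own.

```go
package main

import "testing"

// sink is a hypothetical package-level variable. Storing the final result in
// it makes the calls observable, so they cannot be discarded as dead code.
var sink bool

// isSpecialLeftShift is repeated from the section above.
func isSpecialLeftShift(n uint32) bool {
	return n&(n<<1) != 0
}

// BenchmarkIsSpecialLeftShiftSink folds every result into a local value and
// publishes it to sink once the loop is done.
func BenchmarkIsSpecialLeftShiftSink(b *testing.B) {
	result := false
	for n := 0; n < b.N; n++ {
		// Boolean XOR: each iteration depends on the call's return value.
		result = result != isSpecialLeftShift(uint32(n))
	}
	sink = result
}
```

If the number stays in the same ballpark with the sink in place, the original measurement was fair; if it jumps, the 35x figure overstates the gain.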

5. Epilogue

One thing that we didn’t talk about is how I verified that my implementations were correct. I implemented the naive version and manually tested it on various inputs. Then, for every other implementation, I iterated through the first billion integers and reconciled the optimized results against the naive ones.
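A reconciliation loop along those lines might look like the sketch below. It is not the exact harness I ran: firstMismatch is a hypothetical helper, and the two implementations are repeated from earlier so the snippet is self-contained.

```go
package main

// isSpecial is the naive implementation from section 1.
func isSpecial(n uint32) bool {
	for wasLastBitOne := false; n > 0; n = n >> 1 {
		isCurrentBitOne := n%2 == 1
		if isCurrentBitOne && wasLastBitOne {
			return true
		}
		wasLastBitOne = isCurrentBitOne
	}
	return false
}

// isSpecialLeftShift is the optimized implementation from section 4.
func isSpecialLeftShift(n uint32) bool {
	return n&(n<<1) != 0
}

// firstMismatch is a hypothetical helper. It compares candidate against
// reference for every n in [0, limit). If they disagree it returns the first
// such n with ok == false; if they agree everywhere it returns ok == true.
func firstMismatch(reference, candidate func(uint32) bool, limit uint32) (n uint32, ok bool) {
	for n = 0; n < limit; n++ {
		if reference(n) != candidate(n) {
			return n, false
		}
	}
	return 0, true
}
```

Calling firstMismatch(isSpecial, isSpecialLeftShift, 1000000000) would mirror the billion-value sweep. Note that the first billion integers never set bits 30 and 31, so a separate pass over large values would be needed to exercise the top of the range.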