Let’s jump straight to the code. This is the original version (before the unrolling) that processes a single byte of input at a time:

Original

First, you might be surprised not to see any loop. The actual loop lives at a higher level of abstraction, which allows us to plug various algorithms into the search logic. The compiler is able to inline the body of this method into the loop.
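To make the split between the loop and the per-byte method more concrete, here is a minimal sketch. It is not the original code: the `ByteProcessor` interface, the `forEachByte` driver and the Shift-And matcher are all hypothetical names and choices made for this illustration, and the matcher assumes a non-empty pattern of at most 64 bytes.

```java
final class SearchLoopSketch {
    /** Hypothetical per-byte callback: returns false to stop the scan. */
    interface ByteProcessor {
        boolean process(byte value);
    }

    /** Generic driver loop; it knows nothing about the algorithm plugged in. */
    static int forEachByte(byte[] input, ByteProcessor processor) {
        for (int i = 0; i < input.length; i++) {
            if (!processor.process(input[i])) {
                return i; // index at which the processor asked to stop
            }
        }
        return -1; // processor never asked to stop
    }

    /** Assumed algorithm for this sketch: a Shift-And matcher with no loop in process(). */
    static final class ShiftAndProcessor implements ByteProcessor {
        private final long[] masks = new long[256];
        private final long successBit;
        private long state;

        ShiftAndProcessor(byte[] needle) {
            // Bit i of masks[b] is set when needle[i] == b.
            for (int i = 0; i < needle.length; i++) {
                masks[needle[i] & 0xFF] |= 1L << i;
            }
            successBit = 1L << (needle.length - 1);
        }

        @Override
        public boolean process(byte value) {
            // Bit i of state: needle[0..i] matches the input ending at this byte.
            state = ((state << 1) | 1) & masks[value & 0xFF];
            return (state & successBit) == 0; // continue until a full match
        }
    }

    /** Returns the start index of the first match, or -1 if there is none. */
    static int indexOf(byte[] haystack, byte[] needle) {
        int end = forEachByte(haystack, new ShiftAndProcessor(needle));
        return end < 0 ? -1 : end - needle.length + 1;
    }
}
```

Because `process` is tiny and contains no loop of its own, a JIT compiler has a good chance of inlining it into `forEachByte`, which is the effect described above.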

According to my previous benchmark results, this code already performs very well, easily outperforming some well-known textbook algorithms. It all boils down to just a few CPU instructions and is effectively branch-free. Is there any hope of making it even faster? Let’s start with a dumb unrolling of this code and see where we can get from there (commit):

Unrolling, step 1.

In our first attempt, we simply repeat the same code 8 times, processing every byte of the input parameter, which is now a 64-bit long. There is an additional complication: we cannot simply return the first match among those 8 bytes. There can be more than one match during a single invocation of this method, so we need to collect and return all of them. We do that by setting one bit of the result variable for every matching position. This complicates the unrolled code further, and I am not even sure what to expect from running the benchmark:

Benchmark          Mode   Cnt  Score    Error  Units
before-unrolling   thrpt    5  0.579 ± 0.007   GB/s
unrolling-step-1   thrpt    5  0.591 ± 0.003   GB/s
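For readers who want to see the shape of such a naive unrolling, here is a rough sketch. It is not the code from the commit: it assumes a Shift-And matcher as the per-byte step, a non-empty pattern of at most 64 bytes, and that the first input byte sits in the lowest 8 bits of the long. The names `UnrolledSketch` and `pack` are hypothetical. Bit i of the result is set when a match ends at byte i of the input word.

```java
final class UnrolledSketch {
    private final long[] masks = new long[256];
    private final int shift; // position of the "full match" bit in the state
    private long state;      // carried over between invocations

    UnrolledSketch(byte[] needle) {
        for (int i = 0; i < needle.length; i++) {
            masks[needle[i] & 0xFF] |= 1L << i;
        }
        shift = needle.length - 1;
    }

    /** Processes 8 packed bytes; returns one bit per position where a match ends. */
    long process(long value) {
        long s = state;
        long result;

        // The same step repeated 8 times, once per byte, with no inner loop.
        s = ((s << 1) | 1) & masks[(int) (value & 0xFF)];
        result = (s >>> shift) & 1;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 8) & 0xFF)];
        result |= ((s >>> shift) & 1) << 1;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 16) & 0xFF)];
        result |= ((s >>> shift) & 1) << 2;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 24) & 0xFF)];
        result |= ((s >>> shift) & 1) << 3;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 32) & 0xFF)];
        result |= ((s >>> shift) & 1) << 4;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 40) & 0xFF)];
        result |= ((s >>> shift) & 1) << 5;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 48) & 0xFF)];
        result |= ((s >>> shift) & 1) << 6;
        s = ((s << 1) | 1) & masks[(int) ((value >>> 56) & 0xFF)];
        result |= ((s >>> shift) & 1) << 7;

        state = s;
        return result;
    }

    /** Helper for this sketch: packs 8 bytes into a long, first byte lowest. */
    static long pack(byte[] bytes) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v |= (bytes[i] & 0xFFL) << (8 * i);
        }
        return v;
    }
}
```

With the pattern "ab" and the input "abcabxab", matches end at bytes 1, 4 and 7, so `process` returns 0x92. Every step still depends on the state produced by the previous one, so the eight steps form one sequential dependency chain.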