Computer science books contain timeless wisdom, but performance advice doesn’t always age well. While reading Programming Pearls, by Jon Bentley, I found that modern hardware advances have turned some of its conventional wisdom on its head.

Consider the following problem:

Rotate an array of n elements left by i positions. For example, rotating “abcdefgh” by 3 gives “defghabc”. Is it possible to do this in place, in time proportional to n?

The question models a few real-world computing problems:

Moving text in a text editor, which can be seen as rotating the text to its new location.

Swapping adjacent, unequally sized chunks of memory.

Given these use cases, the operation should be done eagerly rather than lazily.

One algorithm involves a delicate juggling act: save the first element in a temporary, move the element at position i to position 0, then the element at 2i to i, and so on, taking every index modulo n, until we arrive back at the starting position. This is repeated from each of the gcd(n, i) starting positions until the entire array is rotated.

Credits: Programming Pearls, 2nd Edition, page 14.

Another way to think about this problem is to treat the array as two segments, A and B; rotating the array is then identical to transforming the array AB into the array BA. If we reverse A in place to produce A’, and likewise reverse B to produce B’, then reversing the whole array finishes the job:

(A’B’)’ = BA

With A = “abc” and B = “defgh”, reversing each segment gives “cbahgfed”, and reversing the whole array yields “defghabc”.

The juggling algorithm reads and writes each array element once. The reversing rotate reads and writes each element twice. Conventional wisdom would lead us to conclude that the juggling algorithm runs faster; in Programming Pearls, the juggle code is reported to be twice as fast. Does this result still hold in the 21st century?

Here is an implementation of the juggle rotate in Python:

Juggle Swap
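A minimal sketch of the cycle-chasing juggle rotate described above; the gcd-based cycle count is an assumption on my part about how the listing was structured:

    from math import gcd

    def juggle_rotate(arr, i):
        # Rotate arr left by i positions in place by chasing cycles.
        n = len(arr)
        if n == 0:
            return
        i %= n
        if i == 0:
            return
        # The rotation permutation splits into gcd(n, i) independent cycles.
        for start in range(gcd(n, i)):
            tmp = arr[start]
            cur = start
            while True:
                nxt = (cur + i) % n
                if nxt == start:
                    break
                arr[cur] = arr[nxt]
                cur = nxt
            arr[cur] = tmp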

I thought that using a typed Python array would give better performance than a list. Unfortunately, I had committed the sin of premature optimization.

I ran the juggle rotate on a Python array of 10,000 elements, repeating it 10,000 times.
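The bench itself isn’t reproduced above, but a minimal harness along these lines (timeit assumed; the exact setup is my guess) captures the workload:

    from array import array
    import timeit

    data = array('i', range(10_000))  # typed array of C ints

    # Rotate the same 10,000-element array 10,000 times,
    # using the juggle_rotate defined earlier.
    elapsed = timeit.timeit(lambda: juggle_rotate(data, 3), number=10_000)
    print(f'{elapsed:.2f} seconds')

The result: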

…15 seconds?

I ran the tests on a 4GHz Intel i7-6700K. As a quick sanity check, my CPU should be able to crunch through a few billion integer operations per second. The juggle rotate moves roughly 100 million integers (10,000 elements × 10,000 repeats), so 15 seconds seems ridiculously long for the work it has to do.

What happens when a list is used?

12.56 seconds…. eh….

It’s better, but still not great. Reading the CPython source reveals that, when retrieving an element from a typed array, Python creates a new PyObject to wrap the raw value. This cost is incurred every time it accesses an integer it hasn’t already cached. What if we inline the function?
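It isn’t shown exactly which call was inlined; as a plausible sketch, here is the juggle rotate with the gcd computation and the index arithmetic written directly into the loop body instead of hiding behind function calls:

    def juggle_rotate_inlined(arr, i):
        n = len(arr)
        i %= n
        if i == 0:
            return
        # Euclid's algorithm written inline instead of calling math.gcd.
        a, b = n, i
        while b:
            a, b = b, a % b
        for start in range(a):
            tmp = arr[start]
            cur = start
            nxt = cur + i
            if nxt >= n:
                nxt -= n  # conditional subtract instead of a modulo
            while nxt != start:
                arr[cur] = arr[nxt]
                cur = nxt
                nxt += i
                if nxt >= n:
                    nxt -= n
            arr[cur] = tmp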

A 5-second improvement via inlining… not bad.

Right, CPython does no function inlining. That’s not great for things like map() and filter(), where you are essentially calling a function over and over again.

What about the reverse rotate?

Starting from the two ends, we walk inward one element at a time, swapping elements as we go, and we perform three such segment reversals. Pretty straightforward.
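A minimal sketch of the reversing rotate (the helper name reverse_segment is mine):

    def reverse_segment(arr, lo, hi):
        # Reverse arr[lo:hi] in place, swapping inward from both ends.
        hi -= 1
        while lo < hi:
            arr[lo], arr[hi] = arr[hi], arr[lo]
            lo += 1
            hi -= 1

    def reverse_rotate(arr, i):
        # AB -> A'B' -> (A'B')' = BA
        n = len(arr)
        i %= n
        reverse_segment(arr, 0, i)  # reverse A
        reverse_segment(arr, i, n)  # reverse B
        reverse_segment(arr, 0, n)  # reverse the whole array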

Setting up a profiling bench….
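Something along these lines, with cProfile assumed as the profiler (the original bench isn’t shown), fits the description:

    import cProfile
    from array import array

    def bench(rotate, n=10_000, repeats=10_000):
        # Same workload as before, parameterized over the rotate function.
        data = array('i', range(n))
        for _ in range(repeats):
            rotate(data, 3)

    cProfile.run('bench(juggle_rotate)')
    cProfile.run('bench(reverse_rotate)')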

Let’s go:

At this point, while the reverse function seems pretty slow, everything just seems to be slow in general.

Well, we could run a Python environment with a JIT, such as PyPy.

Let’s give that a go.

We’ve managed to achieve a 10x speedup in the juggle function and a 200x speedup in the reverse function. The reverse rotate is now much faster than the juggle rotate: with the interpreter overhead JIT-compiled away, memory access patterns start to dominate, and the reversal’s sequential walk through the array is far kinder to the CPU cache than the juggle’s stride-i jumps.

Can we improve even further?

Through Cython, with some type annotations added, the code can be compiled, eliminating even more overhead. Here’s the converted function.
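A sketch of what the converted reverse rotate might look like; the typed memoryview signature and the cdef helper are assumptions on my part:

    # cython: boundscheck=False, wraparound=False
    # (directive comment above belongs at the top of the .pyx file;
    #  build with cythonize)

    cdef void reverse_segment(int[:] arr, int lo, int hi):
        # Reverse arr[lo:hi] in place; compiles down to a plain C loop.
        cdef int tmp
        hi -= 1
        while lo < hi:
            tmp = arr[lo]
            arr[lo] = arr[hi]
            arr[hi] = tmp
            lo += 1
            hi -= 1

    def reverse_rotate(int[:] arr, int i):
        cdef int n = arr.shape[0]
        i %= n
        reverse_segment(arr, 0, i)  # reverse A
        reverse_segment(arr, i, n)  # reverse B
        reverse_segment(arr, 0, n)  # (A'B')' = BA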

Note how similar it is to the Python code we wrote before.

How did we go?

Wow

As another sanity check: if our CPU can punch through a few billion instructions per second, then moving 100 million integers should take on the order of a few hundredths of a second. Our reverse rotate does it in 0.07 seconds.