David Chisnall takes a look at a number of optimization techniques that have been recommended over the years, discussing why they no longer work on modern architectures.



It's often said that there are two rules to follow when optimizing code:

Rule 1: Don't optimize.
Rule 2 (experts only): Don't optimize yet.

Although this thinking is slightly facetious, there's a good reason why the second rule is marked as "experts only." Optimizing code means either making it run faster or making it take up less space (ideally both, but in most cases it's a tradeoff).

Generally speaking, you can do two kinds of optimizations:

Algorithmic improvements. If you go from using a bubble sort to using a merge sort, for example, your code will be faster. A quicksort usually will make it faster still, although this result depends somewhat on the input data. Choosing good algorithms is generally the best way of making your code go quickly, and it has the nice advantage of working independently of the programming language you use or the architecture of the underlying system. Sometimes, however, even the best algorithm you can design is still too slow.

Optimizations specific to a language or implementation. This type of optimization requires a thorough understanding of how your compiler and CPU work. Unfortunately, over the years many of these techniques have hardened into rules that made sense for some architectures in the past, but are no longer practical.

This article examines some of these popular (but now outdated) rules for optimization.

A Load of Shift

On older CPUs, multiplication was very slow; in particular, multiplying by a constant was no faster than multiplying by a variable. One common trick was to decompose multiplications into sequences of add and shift operations. If you were compiling for an early RISC CPU, this would be done for you, because the CPU had no "multiply" hardware. On other machines, it was done for speed.

Multiplication by a power of two can be translated trivially into a left-shift operation, and any multiplication can be rewritten in terms of add and shift operations. For example, multiplying by 5 is equivalent to multiplying by 4 (a left shift by 2) and then adding the original value. The following two statements are equivalent:

// Multiply directly
a *= 5;

// Multiply using shift-and-add
a = a + (a << 2);

You can rewrite any multiplication operation in this way, and on a lot of systems this technique used to be much faster for multiplying by a constant. This was such a good trick that some compilers, including GCC, used to do this translation automatically.

Then hardware multipliers got better. CPUs like the AMD Athlon could do integer multiplication quickly. Because most programs contained a lot of multiplications and not many shifts, AMD put two multiply units and one shift unit in the Athlon, which meant that it could do two multiplications at a time, but only one shift. Because multiplication by a constant usually requires several add-and-shift operations, suddenly the code was faster (not to mention clearer) if you wrote the multiplication instead of the add-and-shift.

Most modern CPUs are superscalar, meaning that they can execute two or more adjacent instructions simultaneously if the operations aren't dependent on each other and the CPU has enough free execution units. The independence requirement is important here: while two multiplications can often be done in parallel, each shift and add needed to simulate a multiplication depends on the result of the previous one, so this practice can cause pipeline stalls on in-order architectures (and even on out-of-order architectures, when independent instructions are pushed too far back in the instruction stream to be seen). Modern CPUs get a lot of their speed from pipelining. Many instructions on something like a Core 2 take 10-20 cycles to execute, but because they're decomposed into a large number of independent steps, you can start a second operation only one or two cycles after the first one has started.

This effect is even more pronounced on modern systems. To demonstrate the principle, I wrote two simple programs for the Core 2 Duo. Both loop 1,000,000,000 times, multiplying a running counter by 12345. Program A uses a multiply instruction; program B uses a sequence of shifts and adds (10 in total). Program A takes 3.7 seconds to complete, whereas the shift-and-add version in program B takes 21.3 seconds. Thus, the "naïve" version is almost 6 times as fast as the "optimized" version. This result isn't just because the multiplication itself is faster. A lot of the extra time is taken up with loads and stores, and program B's add-and-shift version involves a lot of moving of operands between hidden registers inside the execution units, and shuffling data from the end of the pipeline back to the start. Another version, with just a single shift, ran marginally faster than program A's multiplication; even turning a "multiply by 2" into a "left-shift by 1" gives almost no performance improvement (less than 1%) on a Core 2.

The important point here is that just because something used to be slow on old CPUs doesn't mean that it's still slow. You should always time your optimizations and make sure that they really offer an improvement in speed.

When I started programming in C, most processors didn't have floating-point units. Any floating-point calculations were implemented entirely in software, and typically were two orders of magnitude slower than integer operations. These days, floating-point operations are done in hardware and are often within a factor of 2 of the speed of integer operations. Floating-point has gone from being something that should be used only when absolutely necessary to something to be avoided only in performance-critical code, or in code intended to run on embedded systems with no floating-point unit (FPU).