





The following is a list of things programmers often either take as self-evident or have picked up from others, and mistakenly believe. In disputing these misbeliefs I only intend to improve awareness about performance optimization. But in doing so, I offer only half the solution. To be really convinced, the aspiring performance-conscious programmer should write up examples and disassemble them (in the case of a HLL like C), or simply profile or time them, to see for themselves.



Ever-improving hardware makes software optimization unimportant (1) If you don't optimize your code, but your competition does, then they end up with a faster solution, independent of hardware improvements. (2) Expectations of hardware performance improvements almost always exceed reality. PC disk performance has remained relatively stable for the past 5 years, and memory technology has been driven more by quantity than performance. (3) Hardware improvements often target only optimal software solutions.

Using tables always beats recalculating This is not always true. Remember that recalculating gives you the potential of using parallelism, and of incremental calculation with the right formulation (see the sketch below). Tables that are too large will not fit in your cache and hence may be very slow to access. If your table is multi-indexed, then there is a hidden multiply, which can be costly if the stride is not a power of 2. On a fast Pentium, an uncached memory access can take more time than a divide.
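
For illustration, here is a minimal sketch (the sizes and names are hypothetical) contrasting a lookup table of squares against the incremental reformulation (n+1)^2 = n^2 + 2n + 1, which stays entirely in registers:

    #define N 4096
    static unsigned squareTable[N];   /* 16KB: already a large slice of an L1 cache */

    void buildTable(void) {
        unsigned n;
        for (n = 0; n < N; n++) squareTable[n] = n * n;
    }

    unsigned sumSquaresTable(void) {
        unsigned n, sum = 0;
        for (n = 0; n < N; n++) sum += squareTable[n];  /* one memory read per step */
        return sum;
    }

    unsigned sumSquaresIncremental(void) {
        unsigned n, sq = 0, sum = 0;                    /* sq tracks n^2 */
        for (n = 0; n < N; n++) {
            sum += sq;
            sq += 2 * n + 1;                            /* (n+1)^2 = n^2 + 2n + 1 */
        }
        return sum;
    }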

My C compiler can't be much worse than 10% off of optimal assembly on my code This is something some people can only be convinced of by experience. The cogitations and bizarre operator fumbling that you can do in the C language convince many that they exist for the purposes of optimization. Modern C compilers usually unify C's complex operators, semantics and syntax into a much simpler format before proceeding to the optimization and code generation phase. ANSI C only addresses a subset of your computer's capabilities and is in and of itself too generic in specification to take advantage of all of your processor's nuances. Remember that ANSI C does not know the difference between cached and uncached memory. Also, many arithmetic redundancies allow for usage of processor features that C compilers to date have not yet mastered (e.g., there are clever tricks for host-based texture mapping if the stride of the source texture is a power of two, or in particular 256 on an x86; see the sketch below.) Unfortunately, many research-based and workstation programmers, as well as professors of higher education, who might even know better, have taken it upon themselves to impress upon newbie programmers to avoid assembly language programming at all costs, all in the name of maintainability and portability. A blatant example of this can be seen in the POV-Ray FAQ, which outright asserts that there is no benefit to be had in attempting assembly language optimizations. (I wouldn't be surprised if you couldn't simply low-level optimize POV-Ray, change the interface, and turn around and sell the darn thing!) The fact is, low-level optimization has its place, and should be passed over only if there is a conflicting requirement (like portability), there is no need, or there are no resources to do it. For more, see High level vs. Low level below.
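
As a rough sketch of that power-of-two stride trick (in C rather than assembly, and with hypothetical names): with a 256-byte-wide texture, the (u,v) coordinates pack into a single index with a shift and an OR, with no multiply and free horizontal wraparound. The x86 assembly variants go further still by packing both coordinates into one register.

    /* Fetch a texel from a 256-byte-wide texture: v*256 + u becomes a
       shift and an OR, and the AND gives free horizontal wraparound.   */
    unsigned char texel(const unsigned char *texture, unsigned u, unsigned v) {
        return texture[(v << 8) | (u & 255)];
    }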

Using C compilers makes it impossible to optimize code for performance Most C compilers come with an "inline assembly" feature that allows you to roll your own opcodes. Most also come with linkers that allow you to link completely external assembly modules. Of course, not all C compilers are created equal, and the effects of mixing C and assembly will vary depending on the compiler implementation. (Example: WATCOM and DJGPP mix ASM in very smoothly, whereas VC++ and Borland do not.) Modern C compilers will do a reasonable job if they are given assistance. I usually try to break my inner loops down into the most basic expressions possible, as analogous to low-level assembly as I can make them, without resorting to inline assembly unless it is necessary. Again, your results will vary from compiler to compiler. (The WATCOM C/C++ compiler can be helped significantly with this sort of approach.)
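
As a minimal sketch of the inline assembly route (using the GCC/DJGPP extended-asm dialect; WATCOM and others have their own syntax for the same idea, and the portable fallback is kept alongside):

    static unsigned byteSwap(unsigned x) {
    #if defined(__GNUC__) && defined(__i386__)
        __asm__("bswap %0" : "+r"(x));              /* one 486+ instruction */
        return x;
    #else
        return (x >> 24) | ((x >> 8) & 0xFF00) |    /* portable fallback */
               ((x & 0xFF00) << 8) | (x << 24);
    #endif
    }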

Compiled bitmaps are the fastest way to plot graphics This method replaces each word of bitmap graphics source data with a specific CPU instruction to store it straight to graphics memory. The problem with it is that it chews up large amounts of instruction cache space. Compare this against a data copying routine, which needs to read the source data from memory (and typically caches it.) Both use lots of cache space, but the compiled bitmap method uses far more, since it must encode a CPU store instruction for each source data word. Furthermore, CPU performance is usually more sensitive to instruction cache performance than to data cache performance. The reason is that data manipulations and resource contentions can be managed by write buffers and by modern CPUs' ability to execute instructions out of order. Instructions that are not in the cache, however, must be fetched, paying non-overlapping sequential penalties whenever the prefetch buffer runs out. On older x86s this method worked well because the instruction prefetch penalties were paid on a per-instruction basis regardless (there was no cache to put them into!) But starting with the 486, this was no longer a sensible solution, since short loops paid no instruction prefetch penalties, which rendered the compiled bitmap technique completely useless.
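
To make the contrast concrete, here is a sketch of the plain copy loop; a compiled bitmap would instead emit, for every source word, a store such as mov dword ptr [edi+0], 0x0A0A0A0A, so its code size (and instruction cache footprint) grows with the bitmap:

    /* The data-copying alternative: a handful of instructions that stay
       resident in the instruction cache no matter how big the bitmap is. */
    void copyRow(unsigned *dst, const unsigned *src, int words) {
        int i;
        for (i = 0; i < words; i++) dst[i] = src[i];
    }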

Using the register keyword in strategic places in C will improve performance substantially This keyword is a complete placebo in most modern C compilers. Keep in mind that K&R and the ANSI committee did not design the C language to embody all of the performance characteristics of your CPU. The bulk of the burden of optimizing your C source is in the hands of your compiler's optimizer, which will typically have its own ideas about which variables should go where. If you are interested in the level of optimization available by hand-assigning variables to registers, you are better off going to hand-rolled assembly, rather than relying on these kinds of language features. (Addendum: The only real purpose of "register" is to assert to the compiler that an auto variable is never addressed and therefore can never alias with any pointer. While this might assist the compiler's optimizer, a good optimizing compiler is more than capable of deducing this property of a local by itself; see the sketch below.)
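
A quick sketch of that one legitimate effect: a register variable may never have its address taken, so the compiler knows it cannot alias with any pointer. But, again, a good optimizer deduces the same thing for any local whose address is never taken:

    int sumArray(const int *a, int n) {
        register int sum = 0;   /* "&sum" would now be a compile error, so no
                                   store through "a" can possibly touch sum   */
        int i;
        for (i = 0; i < n; i++) sum += a[i];
        return sum;
    }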

Globals are faster than locals Most modern C compilers will alias local variables to your CPU's registers or SRAM. Furthermore, if all variables in a given scope are local, then an optimizing compiler can forgo maintaining the variables outside the scope, and therefore has more simplification and optimization opportunities than with globals. So, in fact, you should find that the opposite tends to be true more of the time.
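
A minimal sketch of why: the global below could be aliased by the pointer (or touched by another function), so the compiler must keep it up to date in memory, while the local can live in a register for the whole loop:

    int total;                               /* global: must stay current in memory */

    void sumGlobal(const int *a, int n) {
        int i;
        total = 0;
        for (i = 0; i < n; i++) total += a[i];  /* a load and a store per iteration */
    }

    int sumLocal(const int *a, int n) {
        int i, t = 0;                        /* local: typically aliased to a register */
        for (i = 0; i < n; i++) t += a[i];
        return t;
    }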

Using smaller data types is faster than larger ones The original reason int was put into the C language was so that the fastest data type on each platform remained abstracted away from the programmer. On modern 32- and 64-bit platforms, small data types like chars and shorts actually incur extra overhead when converting to and from the default machine-word-sized data type. On the other hand, one must be wary of cache usage. Using packed data (and, in this vein, small structure fields) for large data objects may pay larger dividends in overall cache efficiency than any local algorithmic optimization would.
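
A sketch of both sides of the trade-off (names hypothetical): keep scalars at the machine word size, but pack large in-memory aggregates so more of them fit per cache line:

    /* Word-sized counter and accumulator: no per-iteration widening or
       truncation operations on a 32-bit target.                         */
    unsigned sumBytes(const unsigned char *p, int n) {
        int i;
        unsigned sum = 0;
        for (i = 0; i < n; i++) sum += p[i];
        return sum;
    }

    /* Packed fields for bulk data: 8 bytes per particle instead of 16,
       so twice as many fit in each cache line.                          */
    struct Particle { short x, y, dx, dy; };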

Fixed point always beats floating point for performance Most modern CPUs have a separate floating point unit that executes in parallel with the main/integer unit. This means that you can simultaneously do floating point and integer calculations. While many processors can perform high-throughput multiplies (the Pentium being an exception), general divides, and modulos by values that are not a power of two, are slow to execute (from Cray supercomputers right on down to 6502s; nobody has a really good algorithm to perform them in general.) Parallelism (via the usually undertaxed concurrent floating point units in many processors) and redundancy are often better bets than going to fixed point. On the redundancy front, if you are dividing or calculating a modulo, and you know the divisor is fixed or one of only a few possible fixed values, there are ways to exploit fast integer (aka fixed point) methods; see the sketch below. On the Pentium, the biggest concern is moving data around to and from the FPU and the main integer unit. Optimizing FPU usage takes careful programming; no x86 compiler I have seen does a particularly good job of this. To exploit the maximum optimization potential, you are likely going to have to go to assembly language. As a rule of thumb: if you need many simple results as fast as possible, use fixed point; if you need only a few complicated results, use floating point. See Pentium Optimizations by Agner Fog for more information. With the introduction of AMD's 3DNow! SIMD floating point technology, these older rules about floating point performance have been turned upside down. Approximate (14/15-bit) divides or reciprocal square roots can be computed in a single clock. Two multiplies and two adds can also be computed per clock, allowing better than 1 gigaflop of peak performance. We are now at a point in the industry where floating point performance truly matches integer performance. With such technologies, the right answer is to use the data type format that most closely matches its intended meaning.
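
On the fixed-divisor front, here is a classic sketch: a divide by the known constant 10 replaced by a multiply by a scaled reciprocal (52429 is 2^19/10, rounded up). These constants are the standard ones for 16-bit dividends; verify the range before reusing the idea elsewhere:

    /* x / 10 for any 16-bit x, computed with a multiply and a shift
       instead of a (slow) general divide.                              */
    unsigned divideBy10(unsigned short x) {
        return ((unsigned long)x * 52429UL) >> 19;
    }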

Performance optimization is achieved primarily by counting the cycles of the assembly code You usually get a lot more mileage out of optimizing your code at a high level first (not meaning to imply that you need a HLL to do this). At the very least, changes to the high-level source will tend to affect more target code at one time than what you will be able to do in assembly language with the same effort. In more extreme cases, such as exponential (usually highly recursive) algorithms, thorough hand optimization often buys you significantly less than good up-front design. The cycle counts given in processor instruction lists are usually misleading about the real cycle expenditure of your code. They usually ignore the wait states required to access memory or other devices (which usually have their own independent clocks.) They are also typically misleading with regard to the hidden side effects of branch targets, pipelining and parallelism. The Pentium can take up to 39 clocks to perform a floating point divide. However, the instruction can be issued in one clock, 39 clocks of integer calculations can then be done in parallel, and the result of the divide retrieved in one more clock: about 41 clocks in total, but only two of those clocks are actually spent issuing and retrieving the result of the divide itself.



In fact, all modern x86 processors have internal clock timers that can be used to assist in getting real timing results, and Intel recommends that programmers use them to get accurate timing results. (See the RDTSC instruction as documented in Intel's processor architecture manuals.)
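
A minimal RDTSC sketch (GCC-style inline asm for the x86; other compilers expose the same instruction through their own inline-asm syntax). Since the counter counts real core clocks, it captures the wait states and cache misses that the instruction tables ignore:

    static unsigned long long readTSC(void) {
        unsigned lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  /* result in EDX:EAX */
        return ((unsigned long long)hi << 32) | lo;
    }

    /* usage:
         unsigned long long t0 = readTSC();
         codeUnderTest();
         printf("%llu clocks\n", readTSC() - t0);            */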

Assembly programming is only done under DOS; there's no need for it under Windows The benefits of assembly over C are the same under Windows or Linux as they are under DOS. This delusion doesn't have anything close to logic backing it up, and therefore doesn't deserve much comment. See Iczelion's Win32 Assembly Home Page if you don't believe me.

Complete optimization is impossible; there is always room left to optimize, thus it is pointless to sustain too much effort in pursuit of it This is not a technical belief -- it's a marketing one. It's one often heard from the folks that live in Redmond, WA. Even to the degree that it is true (in a very large software project, for example), it ignores the fact that optimal performance can be approached asymptotically with a finite, and usually acceptable, amount of effort. Using proper profiling and benchmarking, one can iteratively grab the "low hanging fruit", which will get most of the available performance. Absolute optimization is also not a completely unattainable goal. Understanding the nature of your task, and bounding it by its input/output performance and the best possible algorithm in the middle, is in many cases quite doable. For example, reading a file from disk, sorting its contents, and writing the result back out ought to be a very doable performance optimization exercise; see the sketch below. (The input/output performance is known, and the algorithm in the middle is approachable by considering the nature of the input and going with a standard algorithm such as heap sort or radix sort.) Of course, the degree of analysis that you can apply to your specific problem will vary greatly depending on its nature. My main objection to this misconception is just that it cannot be applied globally.
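
As a sketch of that bounded exercise (error handling omitted for brevity, and the standard library's qsort standing in for whatever heap or radix sort actually fits the data): the two I/O passes run at known disk speed, and the middle is a standard O(n log n) sort, so the total is easy to compare against the theoretical floor.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmpInt(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);                /* overflow-safe compare */
    }

    int main(int argc, char **argv) {
        FILE *f = fopen(argv[1], "rb+");
        long n;
        int *buf;

        fseek(f, 0, SEEK_END);
        n = ftell(f) / (long)sizeof(int);        /* records in the file */
        buf = malloc(n * sizeof(int));

        rewind(f);
        fread(buf, sizeof(int), n, f);           /* input pass: disk-bound */
        qsort(buf, n, sizeof(int), cmpInt);      /* the algorithmic middle */
        rewind(f);
        fwrite(buf, sizeof(int), n, f);          /* output pass: disk-bound */

        free(buf);
        fclose(f);
        return 0;
    }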