Programming as if Performance Mattered

James Hague May 4, 2004 james.hague@gmail.com

I frequently see bare queries from programmers in discussion forums, especially from new programmers, who are worried about performance. These worries often stem from popular notions about what operations are "slow." Division. Square roots. Mispredicted branches. Cache unfriendly data structures. Visual Basic. Inevitably someone chimes in that making out-of-context assumptions, especially without profiling, is a bad idea. And they're right. But I often wonder if those warnings are also just regurgitations of popular advice. After all, if a C++ programmer were truly concerned with reliability above all else, would he or she still be using C++?

At the same time, such concerns and advice seem to remain constant despite rapid advances in hardware. A dozen years ago a different crop of new programmers was fretting about the speed of division, square roots, and Visual Basic, and yet look at the hundred-fold increase in computing power since then. Are particular performance problems perennial? Can we never escape them?

This essay is an attempt to look at things from a different point of view, to put performance into proper perspective.

The Benchmark

To show how PC speed has changed, I'll use a Targa graphics file decoder as a benchmark. Why a Targa decoder? It gets away from tired old benchmarks involving prime numbers and replaces them with something more concrete and useful. It also involves a lot of data: 576x576 24-bit pixels, for a total of 995,328 bytes. That's enough raw data processing to require performance-oriented coding, and not some pretty but unrealistic approach.

The benchmark works as follows. Read a run-length compressed 24-bit Targa file, parse the header, and decompress the pixel data into a raw 32-bit image. If a pixel contains the RGB value {255,0,255}, then the pixel is considered to be fully transparent, so it ends up with an alpha value (the fourth color value in a 32-bit pixel) of zero. All other pixels get alpha values of 255. All we're doing is decoding run-length compressed 24-bit pixels and turning them into 32-bit pixels. Of course there are a lot of pixels.
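The run-length step can be sketched in a few lines of Erlang. This is an illustration written for this description, not the benchmark's own code: the module and function names are hypothetical, and the packet layout is the standard Targa one (a header byte whose top bit marks a run and whose low seven bits hold the pixel count minus one).

```erlang
-module(tga_rle_sketch).
-export([unpack/1]).

%% Expand Targa run-length packets of 24-bit pixels into one raw binary.
%% Hypothetical helper for illustration only.
unpack(Data) -> unpack(Data, []).

unpack(<<>>, Acc) ->
    %% Acc holds binaries and lists of binaries; list_to_binary flattens them.
    list_to_binary(lists:reverse(Acc));
%% Run packet: top bit set, one pixel repeated N + 1 times.
unpack(<<1:1, N:7, Pixel:3/binary, Rest/binary>>, Acc) ->
    unpack(Rest, [lists:duplicate(N + 1, Pixel) | Acc]);
%% Literal packet: top bit clear, N + 1 pixels stored as-is.
unpack(<<0:1, N:7, Rest/binary>>, Acc) ->
    Size = (N + 1) * 3,
    <<Pixels:Size/binary, Rest1/binary>> = Rest,
    unpack(Rest1, [Pixels | Acc]).
```

A truncated or malformed packet makes the pattern match fail and raises an error; nothing is read past the end of the data.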

For timing purposes, I included reading the Targa file from the hard disk in the total time for the benchmark. Naturally this is slower the first time it is run, because the file isn't cached in memory yet, so I took the timings from subsequent runs.

The Results

The first PC I ran this on was an old 333MHz Pentium II, originally purchased in March of 1998, running Windows ME. The total time to load and process the 576x576 24-bit image on this machine is:

176,000 microseconds

-or-

0.176 seconds

-or-

5.68 images/second

0.176 seconds to decode so much data on such an old PC seems quick enough. But I also ran this benchmark on a modern PC with a 3GHz Pentium 4 and an 800MHz front-side bus, running Windows XP. The total time to load and process the image on this machine is:

15,000 microseconds

-or-

0.015 seconds

-or-

66 images/second

Now this is impressive! 0.015 seconds to load and sling so much data? 66 images decoded per second means you could take a video game that runs at sixty frames per second—a speed generally considered the target if silky smooth motion is required—and run the Targa benchmark in less time than a single frame. This is 11.7 times faster than the PC from six years earlier.

What I neglected to mention

I talked about the hardware used to run this benchmark, and I talked about the operating system. I didn't mention the programming language the benchmark is implemented in. The natural assumption is that this is a performance-oriented benchmark, so it is probably implemented in C or C++. It is neither. The benchmark is written in Erlang.

Why? Primarily because I enjoy programming in it and I'm comfortable using it, not because it's a secret high-performance language or because I'm hell-bent on advocating it. In fact, Erlang is hardly the best language for the job. Consider a few properties of the Erlang implementation I used:

It is compiled to code for a virtual machine, which is then interpreted. The benchmark is not compiled to machine code.

It is garbage collected.

It is a functional language, meaning that once a data structure is created it cannot be modified (you create a new version of it).

It is dynamically typed. "X + Y" involves runtime checks to see what the types of X and Y are.

It is not specifically designed for the type of problem solved by the benchmark. In fact, if you asked the creators or maintainers of the language, then you might even be advised that such bulk processing of data is not one of Erlang's strong points. (They would likely tell you that Erlang was designed for fault tolerant, distributed applications.)

This is not a trick. The Targa decoder is written in pure Erlang, a language with the above characteristics, and yet the performance, even on old hardware, is excellent. The "trick" element comes from the difficulty in understanding the magnitude of three billion cycles per second, with multiple instructions executing in parallel, and the leap of assuming that the amount of "raw data processing" required by the benchmark somehow requires the full use of all those cycles.

Even worse than it sounds

Not only is the decoder written in pure Erlang, but if you look at the code through a microscope there are some stunning low-level inefficiencies. For example, consider this clean and simple statement:

decode_rgb(Pixels) -> list_to_binary(decode_rgb1(binary_to_list(Pixels))).

This is the "convert raw 24-bit pixels into 32-bit pixels, accounting for transparency" function. The real work happens inside of decode_rgb1, which is the function where most of the decoding time is spent. But before calling that function, we take a binary chunk of pixels and turn it into a list. Then the decoding function churns away on this list, returning an entirely new list. That final list is turned back into a binary block of data.
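For a sense of what the inner function might look like, here is a minimal sketch. It is a guess at the shape of decode_rgb1, not the original source. The module name and the two alpha macros are assumptions; in particular the sketch assumes the common convention that alpha 0 is fully transparent and 255 is fully opaque, and those two constants are the only place that choice appears.

```erlang
-module(tga_rgb_sketch).
-export([decode_rgb/1]).

%% Assumed alpha convention: 0 = fully transparent, 255 = fully opaque.
-define(TRANSPARENT, 0).
-define(OPAQUE, 255).

decode_rgb(Pixels) ->
    list_to_binary(decode_rgb1(binary_to_list(Pixels))).

%% Walk the flat byte list three values at a time, emitting four.
%% The first clause catches the {255,0,255} transparency key.
decode_rgb1([255, 0, 255 | Rest]) ->
    [255, 0, 255, ?TRANSPARENT | decode_rgb1(Rest)];
decode_rgb1([R, G, B | Rest]) ->
    [R, G, B, ?OPAQUE | decode_rgb1(Rest)];
decode_rgb1([]) ->
    [].
```

The transparency test is just one more clause in the function head, which is why this style stays short even with the special case.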

Now let's look at this more closely. Say we're looking at a block of twenty 24-bit pixels. The Pixels binary object contains 20 * 3 = 60 bytes of data. When this is turned into a list, each of the R,G,B color values ends up taking eight bytes: four for the value and four for the "next" pointer in each node. This means that the first binary_to_list call immediately eats up 60 * 8 = 480 bytes.

This 480 byte list is processed in decode_rgb1, and a new list that takes (20 * 4) * 8 = 640 bytes is returned. Overall, to process one run of twenty pixels, 640 + 480 = 1120 bytes are allocated. Then all of this is immediately thrown away when the resultant list is turned back into a binary block of data. Naturally all of this constant allocation (remember, these are the numbers for a mere twenty pixels) causes the garbage collector to kick in, most likely several times, during the execution of the benchmark.
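The arithmetic generalizes to any pixel count. The helper below is a hypothetical illustration only; it takes the figure of eight bytes per list cell given above as its sole assumption about the runtime.

```erlang
-module(list_cost_sketch).
-export([bytes_allocated/1]).

%% Eight bytes per list cell: four for the value, four for the pointer.
-define(CELL_BYTES, 8).

%% Returns {InputListBytes, OutputListBytes, TotalBytes} for one call
%% to decode_rgb on PixelCount 24-bit pixels.
bytes_allocated(PixelCount) ->
    InList = PixelCount * 3 * ?CELL_BYTES,
    OutList = PixelCount * 4 * ?CELL_BYTES,
    {InList, OutList, InList + OutList}.
```

bytes_allocated(20) reproduces the 480, 640, and 1120 byte figures. bytes_allocated(576 * 576) gives an upper bound of 18,579,456 bytes of short-lived list cells if every pixel of the test image took this path.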

From a low-level point of view, this is crazy! But remember, we're still looking at a total execution time of 0.015 seconds, so all of this concern about low-level operations—and even bigger issues, like using an interpreted language—is misguided and irrelevant. Keeping this in mind, I'll rewind and explain how the benchmark code was written.

How the Targa decoder was written

I needed to write a Targa decoder in order to test some OpenGL code. The code was driven from Erlang, via a dynamically linked library (DLL) written in Delphi, so it made sense to drive the texture loading and decoding from Erlang as well.

Version 0

As a first pass, I kept the loading and header parsing in Erlang, then passed off the raw block of image data to the DLL. There were two decoders: one for uncompressed images and one for run-length compressed images. This worked well enough, once I convinced myself that the pointer fiddling was correct, except that I didn't realize that Targa images usually have the bottom row of the image stored first in the file. The resultant image, after going through my decoder, was upside-down. I was quietly annoyed at having to rewrite the code to start at the bottom of the buffer and work backward.

I was also bothered (and I freely admit that I am probably one of the few people this would bother) by the fragility of my decoder. If an incorrectly encoded image was passed in, say with a height field indicating more lines than there's data for, then there was a chance that memory would be silently tromped. This seemed to defeat the purpose of working in a safe, high-level language. To fix this, I'd need much additional error checking in the decoder, plus a way to communicate errors back to the host.

Instead of dealing with these problems, I decided it would be cleaner and easier to move Targa decoding into a module written in Erlang.

Version 1

My first crack at the Erlang version dealt only with uncompressed images, and it took roughly 0.8 seconds to decode the 576x576 test image on the 333MHz PC. "Almost a second" is in the realm of physically feeling the execution time, so I decided to try to speed it up, as this slowness would be multiplied by the number of images to be decoded. Waiting a minute for many images to be decoded fell decidedly into the realm of "too slow."

Version 2

I diddled around with the high-level Erlang code a bit, trying to cut down on unnecessary conversions, comparing different data representations. This was all easy and safe compared to tweaking the Delphi code, and resulted in noticeable speedups. I interactively tested each change as I went. I can't emphasize this enough: the high-level optimizations were painless and fun.

Version 3

I noticed that much time was spent decoding a large {255,0,255} border around the image. I expected most images to have similarly transparent borders. Because I was decoding a line at a time, I added code to compare the new line with the previous line, and if they were identical to pass the decoded version of the previous line as the result. This was fast even if the two didn't match, because the "compare two blobs of data" operator was implemented in C inside the interpreter. This cut the execution time substantially (I'd have an exact figure if I had been keeping notes).
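The previous-line comparison might look something like the sketch below. The module name, the function names, and the idea of passing the row decoder in as a fun are all assumptions made for illustration. Repeating the variable Row in the function head is what performs the comparison.

```erlang
-module(tga_rows_sketch).
-export([decode_rows/2]).

%% Decode a list of raw row binaries with DecodeFun, reusing the previous
%% result whenever a row is identical to the one before it.
decode_rows(Rows, DecodeFun) ->
    decode_rows(Rows, DecodeFun, none, none, []).

decode_rows([], _DecodeFun, _PrevRaw, _PrevDecoded, Acc) ->
    lists:reverse(Acc);
%% Row appears twice in the head, so this clause only matches when the
%% new row equals the previous raw row; the old decoded row is reused.
decode_rows([Row | Rest], DecodeFun, Row, PrevDecoded, Acc) ->
    decode_rows(Rest, DecodeFun, Row, PrevDecoded, [PrevDecoded | Acc]);
%% Otherwise decode the row and remember it for the next comparison.
decode_rows([Row | Rest], DecodeFun, _PrevRaw, _PrevDecoded, Acc) ->
    Decoded = DecodeFun(Row),
    decode_rows(Rest, DecodeFun, Row, Decoded, [Decoded | Acc]).
```

Only adjacent rows are compared, which is enough to collapse a tall border of identical lines into a single decode.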

Version 4

Next I started looking into decoding compressed images. My first attempt decompressed a piece of the image, either a run or a literal, then passed it to the same function used for uncompressed pixels. This was slower than the uncompressed version, even though it took less time to read the compressed image into memory. (In the Delphi code, the compressed case was faster.)

I special-cased the "run of transparent pixels" operation, to avoid checking each pixel for transparency, and because I could build up the list of pixels by creating one pixel and duplicating it. This was faster, but still not great. Note that this change was simply one additional pattern matching statement in the pixel decoding function. In C or Delphi it would have been messier, taking care not to overrun the end of the buffer.
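As a rough picture of that extra clause, consider the sketch below. The names are hypothetical, the second clause is added only so the example is complete, and the alpha values again assume that 0 means fully transparent and 255 fully opaque.

```erlang
-module(tga_run_sketch).
-export([decode_run/2]).

%% Assumed alpha convention: 0 = fully transparent, 255 = fully opaque.
-define(TRANSPARENT, 0).
-define(OPAQUE, 255).

%% Expand a run of Count identical 24-bit pixels into 32-bit pixels.
%% Special case: the transparency key is matched once for the whole run,
%% then a single 32-bit pixel is built and duplicated.
decode_run(Count, <<255, 0, 255>>) ->
    lists:duplicate(Count, <<255, 0, 255, ?TRANSPARENT>>);
%% Any other run: build one opaque pixel and duplicate it.
decode_run(Count, <<R, G, B>>) ->
    lists:duplicate(Count, <<R, G, B, ?OPAQUE>>).
```

The result is a list of small binaries, which list_to_binary can flatten into one block at the end.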

Version 5

It finally occurred to me that I could symbolically work with the compressed chunks of data rather than decoding each of them, but this requires some insight into how images are to be used. The second step, after decoding an image, is to squish it into a minimally-sized rectangle. If there are only 30x30 relevant pixels surrounded by a transparent border, then the goal is to create a 30x30 image without the border. In essence, the pixels in the original border are decoded, then later they are scanned and determined to be disposable.

To avoid this, I switched the representation of an image to a list of lines, with each line of the form {Left, Right, Pixel_data}. Left and Right are the numbers of leading and trailing transparent pixels. Pixel_data is the raw block of pixels between the two transparent runs. If a row of a 576x576 image has no transparent pixels, it's {0, 0, Pixel_data}. If a row is entirely transparent, then it's {576, 576, null}. The key is that we can now skip huge runs of transparent pixels without decoding them at all. A subsequent "scan for transparent border" function throws them away anyway. Again, I'd like to emphasize that this kind of manipulation was fairly straightforward in Erlang, but I wouldn't want to attempt it in a language like C.
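To make the line representation concrete, the sketch below builds the tuple from an uncompressed row of 24-bit pixels. This is a simplification for illustration: the decoder described above derives the same tuple from the compressed runs without expanding them, and the module and function names here are hypothetical.

```erlang
-module(tga_line_sketch).
-export([summarize/1]).

%% Turn a raw row of 24-bit pixels into {Left, Right, PixelData}.
summarize(Row) ->
    Width = byte_size(Row) div 3,
    case leading(Row, 0) of
        %% Width is already bound, so this matches a fully transparent row.
        Width ->
            {Width, Width, null};
        Left ->
            Right = trailing(Row, byte_size(Row), 0),
            Middle = binary:part(Row, Left * 3, (Width - Left - Right) * 3),
            {Left, Right, Middle}
    end.

%% Count transparent pixels from the start of the row.
leading(<<255, 0, 255, Rest/binary>>, N) -> leading(Rest, N + 1);
leading(_, N) -> N.

%% Count transparent pixels backward from byte offset End.
trailing(Row, End, N) when End >= 3 ->
    case binary:part(Row, End - 3, 3) of
        <<255, 0, 255>> -> trailing(Row, End - 3, N + 1);
        _ -> N
    end;
trailing(_Row, _End, N) -> N.
```

With rows in this form, finding the minimal bounding rectangle only requires looking at the Left and Right counts; the pixel data itself is never touched.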

With this code in place, the compressed image case became faster than the uncompressed image. The bulk of the execution time is now inside of one function (decode_rgb1, as mentioned earlier). If I wanted further speedups, moving just that one function into C would do it, but I haven't felt the need to go down that road.

This isn't 1985...or is it?

The golden rule of programming has always been that clarity and correctness matter much more than the utmost speed. Very few people will argue with that. And yet do we really believe it? If we did, then 99% of all programs would be written in something like Python. Or Erlang. Even traditional disclaimers such as "except for video games, which need to stay close to the machine level" usually don't hold water any more. After all, who ever thought you could use an interpreted, functional language to decode Targa images, especially without any performance concerns?

That tempting, enticing, puzzle-solving activity called "optimization," it hasn't gone away either. The optimization process I used to speed up the decoder is similar to that of Commodore 64 coders speeding up arcade games. Only now the process is on a different level. It isn't machine level twiddling and cycle counting, but it isn't simply mathematical analysis of algorithms either. The big difference is that the code changes I made are substantially safer than running a program and having it silently hang the system. All array accesses are bounds-checked. There's no way to accidentally overwrite a data structure. There's no way to create a memory leak. Really, this is what those cycle-counting programmers from 1985 dreamed of.