Stock in Trade The classic trade in computer science is trading memory for execution speed.



In years past, I used to travel around teaching week-long seminars on different programming topics. Although I haven't done that for awhile, I do miss teaching and working with students. I guess because I miss it, I try to drop in on some high-school engineering classes from time to time (and yes, these days high schools have engineering classes). I also talk to regular classes about what it means to be an engineer.

One theme I talk about with kids is that engineers often make trades. I want a car that is perfectly safe, costs $10, and gets 200 miles per gallon of gas. But I can't have all of those things, so I make trades, or perhaps trade-offs would be a better way to put it. More metal makes the car safer, but makes it get less gas mileage and cost more. That's a trade.

The classic trade in computer science is trading memory for execution speed. You can see that in a nearly pure form by looking at how you'd write a C function to count the number of bits set in a word (sometimes known as the bit population count).

At first glance this seems pretty easy:

unsigned int value, count; for (count=0; value; value >>=1) count+=value & 1;

That's easy enough and doesn't take much memory once compiled. The downside is the speed. If you have an 8-bit byte, you could make 8 passes through the loop. At least the code does stop early if you run out of set bits, but you can't count on it ever taking less than 8 passes for a byte. For a 32-bit word, you could take 32 loops.

One easy way to speed things up is to trade memory for speed. I can go faster if I'm willing to invest in memory for a lookup table.

unsigned int value, count; unsigned char bitset[] = { 0, 1, 1, 2, 1, 2, 2, 3, …. }; count=bitset[value];

That's great and takes only one array lookup. But the price is a 256 element array to process a byte. A 32-bit word? Forget about it! You'd need 232 entries in your table.

On the other hand, both of these are extreme cases. Consider this code:

unsigned int value, count=0; unsigned char bit4set[] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 }; while (value) { count+=bit4set(value&0xF); value>>=4; }

This is the middle trade. The table only works on 4 bits at a time, and you have to (potentially) take more than one pass through the loop. For a byte, you might make 0, 1, or 2 loops. A 32-bit word could take as much as 8 loops (which is still better than 32).

There's at least one more engineering lesson to be learned here: You can optimize code to some point, but really big optimizations are usually due to selecting a smarter algorithm. Consider this algorithm that Peter Wegner published in 1960:

unsigned int value, count; for (count=0; value; count++) { value&=value-1; }

You have to think about this one a little. If the number is already zero, the for loop doesn't execute and count is zero, as it should be. Any other number is going to have a 1 in some position. Calculating value-1 gives me the same number except that the rightmost 1 bit will become a 0, and all the bits to the right of that will be 1s. For example:

00000100 -> 00000011 10101000 -> 10100111 00000001 -> 00000000

Using the and operator ensures that all the rightmost bits will go to 0, but none of the bits ahead of the rightmost 1 will change. What this means is that the loop only executes once for every 1 bit in the word. Granted, it could still take 8 loops to count a byte, but assuming the inputs are evenly distributed, that won't happen often. Using the simple non-table approach, half the possible inputs would take 8 loops and 64 would require 7 loops. Only one bit pattern would take a single loop.

With the value-1 method, there are 8 different byte-sized inputs that take a single loop. There is only one input ( 11111111 ) that actually requires 8 loops. So while it is memory efficient, it should be faster on the average.