Which clock is the best?

We can easily rule out the one which has stopped …

Or can we? In “The Rectory Umbrella” Lewis Carroll argues otherwise.

Which is better, a clock that is right only once a year, or a clock that is right twice every day? “The latter,” you reply, “unquestionably.” Very good, now attend. I have two clocks: one doesn’t go at all, and the other loses a minute a day: which would you prefer? “The losing one,” you answer, “without a doubt.” Now observe: the one which loses a minute a day has to lose twelve hours, or seven hundred and twenty minutes before it is right again, consequently it is only right once in two years, whereas the other is evidently right as often as the time it points to comes round, which happens twice a day.

It’s an amusing diversion, but not really that puzzling: of course the clock which loses time is of more practical use, even if, somewhat paradoxically, the less time it loses the less often it tells the right time. A clock which loses just a second a day only tells the right time every 118 years or so.

I mention these defective clocks because I’m thinking about bugs in software and how we go about finding and fixing them.

Code which is obviously wrong is easier to spot than code which is almost right, and spotting bugs is the precursor to fixing them. This implies — building on Carroll’s terminology — that we’re unlikely to ship many stopped clocks but if we’re not careful we may end up delivering a few which lose time. And, in general, code which is obviously wrong is easier to fix than code which is almost right. A badly-broken function clearly needs a rethink; whereas one which almost works may simply get tweaked until it appears to work, often resulting in a more subtle bug.

C and C++ provide a good example of what I’m talking about. Consider a program which misuses memory. An attempt to allocate workspace of 4294967295 bytes fails instantly[1]; a slow memory leak, like a slow running clock, may cause no perceptible damage for an extended period.

Decent tools detect memory leaks. Race conditions in multi-threaded code are harder to track and may prove elusive during system testing. More than once I’ve left a program running under a debugger, being fed random inputs, in the hope some rare and apparently random condition will trigger a break in execution. Give me truly broken code any day!

Here are two implementations of a C function to find an integer midway between a pair of ordered, positive integer values, truncating downwards. Before reading on, ask yourself which is better.

int midpoint1(int low, int high)
{
    return low/2 + high/2;
}

int midpoint2(int low, int high)
{
    return (low + high)/2;
}

Midpoint1 is a “stopped clock”, returning 3 instead of 4 as the mid-point of 3 and 5, for example. It gets the wrong answer 25% of the time — fatally wrong were it to be used at the heart of, say, a binary search. I think we’d quickly detect the problem.

An obvious fix would be the one shown in midpoint2 which does indeed return 4 as the mid-point of 3 and 5.

Midpoint2 turns out to be a losing clock, though. If the sum low + high overflows then the result is undefined. On my implementation I get a negative value — a dangerous thing to use as an array index. This is a notorious and very real defect, nicely documented in a note by Joshua Bloch subtitled “Nearly all Binary Searches and Mergesorts are broken”.

Bloch offers more than one fix so I’ll just note here that:

- this defect simply doesn’t exist in a high-level language like Python or Haskell, where integers are bounded only by machine resources

- I think Bloch is unfair to suggest Jon Bentley’s analysis in chapter 4 of Programming Pearls is wrong. The pseudo-code in this chapter is written in a C-like language somewhere between C and Python, and in fact one of Bentley’s exercises is to examine what effect word size has on this analysis.

- in a sense, midpoint2 is more broken than midpoint1 : over the range of possible low and high inputs, the sum overflows and triggers the defect 50% of the time.

Computers are supposed to be predictable and we typically aim for correct programs. There’s no reason why we shouldn’t consider aiming for programs which are good enough, though, and indeed many programs which are good enough to be useful are also flawed. Google adverts, for example, analyse the contents of web pages and serve up related links. The algorithm used is secret, clever and quick, but often results in semantic blunders and, on occasion, offensive mistakes. Few could deny how useful to Google this program has been, though.

Here’s a more interesting example of an algorithm which, like a losing clock, is nearly right.

def is_fprime(n):
    """Use Fermat's little theorem to guess if n is prime.
    """
    from random import randrange
    tries = 3
    xs = (randrange(1, n) for _ in range(tries))
    return all((x ** n) % n == x for x in xs)

We won’t go into the mathematics here. A quick play with this function looks promising.

>>> all(is_fprime(n) for n in [2, 3, 5, 7, 11, 13, 17, 19])
True
>>> any(is_fprime(n) for n in [4, 6, 8, 9, 10, 12, 14, 15])
False

In fact, if we give it a real work-out on some large numbers, it does well. I used it to guess which of the numbers between 100000 and 102000 were prime, comparing the answer with the correct result (the code is at the end of this article). It had a better than 99% success rate (in clock terms, it lost around 8 minutes a day) and increasing tries will boost its performance.

The better is_fprime performs, the less likely we are to spot that it’s wrong. What’s worse, though, is that it cannot be fixed by simple tweaking. However high we set tries we won’t have a correct function. We could even take the random probing out of the function and shove every single value of x in the range 1 to n into the predicate:

def exhaustive_is_fprime(n):
    return all((x ** n) % n == x for x in range(1, n))

Exhaustive_is_fprime is expensive to run and will (very) occasionally return True for a composite number[2]. If you want to know more, search for Carmichael numbers.

The point I’m making is that code which is almost right can be dangerous. We are tempted to fix it by adjusting the existing implementation, even if, as in this case, a complete overhaul is required. By contrast, we all know what needs doing with code which is plainly wrong.

We’ve all seen nervous functions which go beyond their stated interface in an attempt to protect themselves from careless users.

/**
 * Return the maximum value found in the input array.
 * Pre-condition: the input array must not be empty.
 */
int nervy_maximum_value(int const * items, size_t count)
{
    int M = -INT_MAX;
    if (items == NULL || count == 0) {
        return M;
    }
    for ( ; count-- != 0; ++items) {
        if (*items > M) {
            M = *items;
        }
    }
    return M;
}

What’s really wanted is both simpler and easier for clients to code against.

int maximum_value(int const * items, size_t count)
{
    int const * const end = items + count;
    int M = *items++;
    for ( ; items != end; ++items) {
        if (*items > M) {
            M = *items;
        }
    }
    return M;
}

Did you spot the subtle bug in nervy_maximum_value ? It uses -INT_MAX instead of INT_MIN which will cause trouble if clients code against this undocumented behaviour; if nervy_maximum_value is subsequently fixed, this client code back-fires.

Note that I’m not against the use of assertions to check pre-conditions, and a simple assert(items != NULL && count != 0) works well in maximum_value ; it’s writing code which swallows these failed pre-conditions I consider wrong.

The occurrence of defects in complex software systems can be modelled in the same way as radioactive decay. I haven’t studied this theory and my physics is rusty[3], but the basic idea is that the population of bugs in some software is rather like a population of radioactive particles. Any given bug fires (any given particle decays) at random, so we can’t predict when this event will happen, but it is equally likely to fire at any particular time. This gives each defect an average lifetime: a small lifetime for howling defects, such as dereferencing NULL pointers, and a longer one for more subtle problems, such as accumulated rounding errors. Assuming we fix a bug once it occurs, the population of defects decays exponentially, and we get the classic tailing-off curve.

Anyone who has ever tried to release a software product knows how it feels to slide down the slope of this curve. We system test, find bugs, fix them, repeat. At the start it can be exhilarating as bugs with short half-lives fall out and get squashed, but the end game is demoralising as defects get reported which then cannot be reproduced, and we find ourselves clawing out progress. When we eventually draw the line and ship the product we do so suspecting the worst problems are yet to be found. To put it more succinctly[4]:

Ship happens!

A combination of techniques can help us escape this depressing picture. The most obvious one would be to avoid it: rather than aim for “big-bang” releases every few years, we can move towards continual and incremental delivery. A modular, decoupled architecture helps. So does insistence on unit testing. Rather than shake the system and sweep up the bugs which fall off we should develop a suite of automated tests which actively seek the various paths through the code, and exercise edge cases. Within the code-base, as already mentioned, defensive programming can cause defects to become entrenched. Instead, we should adopt a more confident style, where code fails hard and fast.

Have you ever fixed a defect and wondered how the code ever even appeared to work before your fix? It’s an important question and one which requires investigation. Perhaps the bug you’ve fixed is compensated for by defensive programming elsewhere. Or perhaps there are vast routes through the code which have yet to be exercised.

None of these clocks is much good. The first has stopped, the second loses a second every minute, the third gains a second every minute. At least it’s easy to see the problem with the first: we won’t be tempted to patch it.

We should never expect our code to work first time and we should be suspicious if it appears to do so. Defensive programming seems to mean different things to different people. If I’ve misused the term here, I’m sorry. Our best defence is to assume code is broken until we’ve tested it, to assume it will break in future if our tests are not automated, and to fail hard and fast when we detect errors.

import math
from itertools import islice, count
from random import randrange

def primes(lo, hi):
    '''Return the list of primes in the range [lo, hi).

    >>> primes(0, 19)
    [2, 3, 5, 7, 11, 13, 17]
    >>> primes(5, 10)
    [5, 7]
    '''
    sqrt_hi = int(math.sqrt(hi))
    sieve = range(hi)
    zeros = [0] * hi
    sieve[1] = 0
    for i in islice(count(2), sqrt_hi):
        if sieve[i] != 0:
            remove = slice(i * i, hi, i)
            sieve[remove] = zeros[remove]
    return [p for p in sieve[lo:] if p != 0]

def is_fprime(n, tries=3):
    '''Use Fermat's little theorem to guess if n is prime.
    '''
    xs = (randrange(1, n) for _ in range(tries))
    return all((x ** n) % n == x for x in xs)

def fprimes(lo, hi, tries=10):
    '''Alternative implementation of primes.
    '''
    return filter(lambda n: is_fprime(n, tries), range(lo, hi))

if __name__ == '__main__':
    import doctest
    doctest.testmod()
    lo, hi = 100000, 102000
    primes_set = set(primes(lo, hi))
    fprimes_set = set(fprimes(lo, hi))
    print "Range [%r, %r)" % (lo, hi)
    print "Actual number of primes", len(primes_set)
    print "Number of fprimes", len(fprimes_set)
    print "Primes missed", primes_set - fprimes_set
    print "False fprimes", fprimes_set - primes_set

Running this program produced the following output:

Range [100000, 102000)
Actual number of primes 174
Number of fprimes 175
Primes missed set([])
False fprimes set([101101])

[1] In the first version of this article I wrote that an attempt to allocate 4294967295 bytes would cause the program to crash, which isn’t quite right. Malloc returns NULL in the event of failure; standard C++ operator new behaviour is to throw a bad_alloc exception. My thanks to R Samuel Klatchko for the correction.

[2] “Structure and Interpretation of Computer Programs” discusses Carmichael numbers in a footnote:

Numbers that fool the Fermat test are called Carmichael numbers, and little is known about them other than that they are extremely rare. There are 255 Carmichael numbers below 100,000,000. The smallest few are 561, 1105, 1729, 2465, 2821, and 6601. In testing primality of very large numbers chosen at random, the chance of stumbling upon a value that fools the Fermat test is less than the chance that cosmic radiation will cause the computer to make an error in carrying out a “correct” algorithm. Considering an algorithm to be inadequate for the first reason but not for the second illustrates the difference between mathematics and engineering.

[3] Being lazy and online I thought I’d search for a nice radioactive decay graphic rather than draw my own. I found a real gem on the University of Colorado site, where Kyla and Bob discuss radioactive decay.

Kyla: Hmmm… so a lot of decays happen really fast when there are lots of atoms, and then things slow down when there aren’t so many. The halflife is always the same, but the half gets smaller and smaller.

Bob: That’s exactly right. Here’s another applet that illustrates radioactive decay in action.

Visit the site to play with the applet Bob mentions. You’ll find more Kyla and Bob pictures there too.

[4] I’m unable to provide a definitive attribution for the “Ship happens!” quotation. I first heard it from Andrei Alexandrescu at an ACCU conference, who in turn thinks he got it from Erich Gamma. I haven’t managed to contact Erich Gamma. Matthew B. Doar reports using the term back in 2002, and it appears as a section heading in his book “Practical Development Environments”.