How Can One Test a Program's Average Performance?



Last week, I argued that testing a program's performance is harder than testing its functionality. Not only is it hard to verify that the performance is up to par, but it can be hard to define exactly what "par" means.

I would like to continue by looking at the standard-library sort function. This function typically implements the Quicksort algorithm, which sorts an n-element sequence in O(n log n) time on average. Despite this average performance, input that is chosen unfortunately can cause a Quicksort implementation to run much more slowly than average; for example, in O(n^2) time. I chose the word "unfortunately" on purpose, because it is not unusual for Quicksort implementations to use randomness to ensure that the quadratic-performance cases come along only very rarely.

Why is randomness important here? Quicksort starts by picking an element, called the pivot, of the array to be sorted. Quicksort then typically rearranges the elements of the sequence so that all elements less than or equal to the pivot come first, followed by all of the elements greater than the pivot. This rearrangement can always be done in O(n) time. Finally, Quicksort calls itself recursively to sort the elements of the two sections of the (now rearranged) array.
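The three steps just described (pick a pivot, rearrange, recurse) can be sketched in a few lines. This is only an illustration, not the code of any particular standard library: the names partition_around and quicksort_sketch are made up for this example, and the choice of the middle element as the pivot is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Rearranges v[lo, hi) around the value found at pivot_index so that every
// element <= pivot comes first, followed by every element > pivot.
// Returns the pivot's final position. One pass over the range: O(hi - lo).
std::size_t partition_around(std::vector<int>& v, std::size_t lo,
                             std::size_t hi, std::size_t pivot_index) {
    assert(lo <= pivot_index && pivot_index < hi && hi <= v.size());
    std::swap(v[pivot_index], v[hi - 1]);  // park the pivot at the end
    const int pivot = v[hi - 1];
    std::size_t store = lo;                // v[lo, store) holds elements <= pivot
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        if (v[i] <= pivot) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[store], v[hi - 1]);        // put the pivot between the sections
    return store;
}

// Sorts v[lo, hi). The pivot is the middle element, an arbitrary choice
// made only for this sketch.
void quicksort_sketch(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;               // 0 or 1 elements: already sorted
    const std::size_t p = partition_around(v, lo, hi, lo + (hi - lo) / 2);
    quicksort_sketch(v, lo, p);            // elements <= pivot
    quicksort_sketch(v, p + 1, hi);        // elements > pivot
}
```

Calling quicksort_sketch(v, 0, v.size()) sorts the whole vector. The pivot itself is excluded from both recursive calls, which is what guarantees that each call works on a strictly smaller range.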

Accordingly, Quicksort's running time is no worse than proportional to the number of elements times the maximum recursion depth. By implication, Quicksort's performance depends on having the recursion depth usually be no more than O(log n). This depth limit can be achieved so long as the pivot, on average, is not too close to the largest or smallest element.
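To make the relationship between running time and recursion depth concrete, here is a small instrumented sketch. It assumes a deliberately naive pivot rule (always the first element) and uses a hypothetical SortStats record; neither comes from a real library. Given input that is already sorted, every pivot is the smallest remaining element, so the recursion depth grows linearly rather than logarithmically.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical instrumentation record, invented for this sketch.
struct SortStats {
    std::size_t comparisons = 0;  // element-versus-pivot comparisons
    std::size_t max_depth = 0;    // deepest recursion level that did any work
};

// Quicksort on v[lo, hi) that ALWAYS uses the first element as the pivot.
// depth is the level of the current call (1 for the outermost call).
void quicksort_first_pivot(std::vector<int>& v, std::size_t lo, std::size_t hi,
                           std::size_t depth, SortStats& stats) {
    if (hi - lo < 2) return;
    stats.max_depth = std::max(stats.max_depth, depth);
    const int pivot = v[lo];
    std::size_t store = lo + 1;   // v[lo + 1, store) holds elements <= pivot
    for (std::size_t i = lo + 1; i < hi; ++i) {
        ++stats.comparisons;
        if (v[i] <= pivot) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[lo], v[store - 1]);  // move the pivot between the two sections
    quicksort_first_pivot(v, lo, store - 1, depth + 1, stats);
    quicksort_first_pivot(v, store, hi, depth + 1, stats);
}

// Sorts a copy of the input and reports what the sort cost.
SortStats measure_first_pivot(std::vector<int> v) {
    SortStats stats;
    quicksort_first_pivot(v, 0, v.size(), 1, stats);
    return stats;
}
```

For an already sorted vector of 1,000 distinct values, this sketch reaches a depth of 999 and performs 1000 * 999 / 2 = 499,500 comparisons. Each level of recursion performs fewer than n comparisons in total, so the comparison count never exceeds n times the maximum depth.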

How does Quicksort guarantee that the pivot is not too close to the endpoints? In general, it can't. Nevertheless, it can avoid performance problems most of the time by picking the pivot at random. Doing so ensures that Quicksort's average performance is reasonable, even though once in a while the pivot might happen to be close enough to an endpoint to cause performance problems. Such occasional problems aren't a big deal as long as they're rare. Right?
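A random pivot might be chosen along the following lines. This is a minimal sketch under stated assumptions: the function name quicksort_random_pivot is invented here, std::mt19937 is just one convenient generator, and the code does not claim to match what any real library does. It returns the number of comparisons so that the effect of the pivot choice can be observed.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Quicksort on v[lo, hi) with the pivot position drawn uniformly at random.
// Returns the number of element-versus-pivot comparisons performed.
// Note: this simple partition scheme still degrades on inputs with many
// equal elements; that is a separate problem this sketch does not address.
std::size_t quicksort_random_pivot(std::vector<int>& v, std::size_t lo,
                                   std::size_t hi, std::mt19937& rng) {
    if (hi - lo < 2) return 0;
    std::uniform_int_distribution<std::size_t> pick(lo, hi - 1);
    std::swap(v[lo], v[pick(rng)]);   // bring the randomly chosen pivot to the front
    const int pivot = v[lo];
    std::size_t store = lo + 1;       // v[lo + 1, store) holds elements <= pivot
    for (std::size_t i = lo + 1; i < hi; ++i) {
        if (v[i] <= pivot) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[lo], v[store - 1]);   // move the pivot between the two sections
    std::size_t comparisons = hi - lo - 1;
    comparisons += quicksort_random_pivot(v, lo, store - 1, rng);
    comparisons += quicksort_random_pivot(v, store, hi, rng);
    return comparisons;
}
```

With this version, no fixed arrangement of distinct values is reliably bad, because the order in which pivots are tried depends on the generator rather than on the input. A bad run is still possible; it is merely improbable.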

Well, that depends. Suppose your job is to write performance tests for an implementation of Quicksort.

How do you translate the vague "average performance" claim in the C++ standard into a requirement that it is possible to test at all?

How do you test Quicksort in a way that gives you any confidence in the results?

What makes average performance so hard to test is that the very notion has an element of probability in it. If a program is required to produce a particular result, then you can say with certainty that the result of a particular test run is either right or wrong. In contrast, if you are testing a requirement on average performance, no single test can be said with certainty to be right or wrong. The best you can hope for is that by running more and more tests, you can increase your confidence that the program is working correctly; there is always the possibility that further testing may cause you to change your mind about the program's correctness.
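One way to picture "running more and more tests" is a harness that sorts many random permutations and summarizes how many comparisons each run needed. The sketch below is an assumption-laden illustration: TrialSummary and run_trials are hypothetical names, comparisons stand in for running time, and dividing by n log2 n is just one possible normalization.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical summary record, invented for this sketch.
struct TrialSummary {
    double mean_ratio;  // mean over all trials of comparisons / (n * log2 n)
    double max_ratio;   // largest ratio seen in any single trial
};

// Sorts `trials` random permutations of n distinct values with std::sort,
// counting comparisons through the comparator. Requires n >= 2 and trials >= 1.
TrialSummary run_trials(std::size_t n, int trials, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<int> data(n);
    const double scale =
        static_cast<double>(n) * std::log2(static_cast<double>(n));
    double sum = 0.0;
    double worst = 0.0;
    for (int t = 0; t < trials; ++t) {
        for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);
        std::shuffle(data.begin(), data.end(), rng);  // fresh random input
        std::size_t comparisons = 0;
        std::sort(data.begin(), data.end(), [&comparisons](int a, int b) {
            ++comparisons;
            return a < b;
        });
        const double ratio = static_cast<double>(comparisons) / scale;
        sum += ratio;
        worst = std::max(worst, ratio);
    }
    return {sum / trials, worst};
}
```

A test might call run_trials(1000, 50, 42u) and check that mean_ratio stays below some threshold. Choosing that threshold, and deciding how many trials justify how much confidence, is exactly the statistical question: a passing run raises confidence but proves nothing, and a single slow trial does not by itself show a defect.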

In short, if the performance requirements include claims about average execution time, testing those claims is apt to require some kind of statistical analysis. Such analysis is not always easy, but certainly has a long tradition in engineering. As an example, consider American Airlines flight 191.

Flight 191 took off from O'Hare Airport on May 25, 1979. Just as the airplane was leaving the ground, the engine on the left wing seized up and separated from the wing. The engine was attached to the wing by shear pins that were designed to break rather than damage the wing. Nevertheless, because of faulty maintenance, the wing was damaged; that damage caused the airplane to go out of control and crash, killing everyone aboard.

In reading about the ensuing investigation, I saw a discussion of how a different aircraft manufacturer tested its shear pins in order to ensure that, assuming that the aircraft is maintained properly, the pins will allow the engine to leave the wing rather than damage it. It had not occurred to me before, but a major engineering problem in designing shear pins is that the purpose of a shear pin is to break if too much force is applied to it. There is no way to test whether a pin meets that requirement without destroying it. It follows, therefore, that the pins that are actually used in the airplane cannot be tested.

How can one possibly be confident in the safety of an airplane that is built this way? The answer is quite clever.

The engine is attached to the wing with several shear pins in such a way that even if one of them fails to break, the engine will still separate from the wing rather than damage the wing.

The shear pins are manufactured in batches of 100, all made at the same time in the same way.

From each batch of 100 pins, ten pins are selected at random and tested, thereby destroying them. If all ten pins pass the tests, the other 90 are assumed to be good enough to use. If even a single pin fails, the entire batch is discarded.

Obviously, this design involves not only clever mechanical engineering, but also sophisticated statistical reasoning. The limits on the pins must be chosen so that the probability of two randomly chosen pins being out of limits is very small once the 10% sample of the pins has passed its tests. I imagine that this probability can be made even smaller by making the limits on the tested pins narrower than the pins need to be in practice.
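The sampling step can be given a rough quantitative feel. The sketch below assumes a deliberately simplified model that the article does not state: each pin is either within limits or not, and a batch contains some fixed number of out-of-limits pins. Under that assumption, the chance that a random sample contains none of them is a hypergeometric probability. The function name acceptance_probability is hypothetical.

```cpp
#include <cassert>

// Probability that a random sample of `sample` pins, drawn without
// replacement from a batch of `batch` pins of which `defective` are out of
// limits, contains no out-of-limits pin (so the batch is accepted).
// Simplified model: a pin is either good or bad, with nothing in between.
double acceptance_probability(int batch, int sample, int defective) {
    assert(batch > 0 && sample >= 0 && sample <= batch);
    assert(defective >= 0 && defective <= batch);
    double p = 1.0;
    for (int i = 0; i < sample; ++i) {
        const int good_left = batch - defective - i;  // good pins not yet drawn
        if (good_left <= 0) return 0.0;               // sample must hit a bad pin
        p *= static_cast<double>(good_left) / (batch - i);
    }
    return p;
}
```

With the numbers from the procedure above, a batch of 100 containing exactly one bad pin passes with probability 0.9, and a batch containing two passes with probability (90 * 89) / (100 * 99), or about 0.81. In this simplified model, then, sampling alone does not reliably catch a batch with only a few bad pins, which is consistent with the point about also testing against limits narrower than the pins need in practice.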

I would not want to have to do this kind of statistical analysis in order to test the performance of a Quicksort implementation. Even if I were confident enough in my ability to get the statistics right, there is always the possibility that a future change to the specifications or to the test procedure might render the statistics invalid. Moreover, there is one important difference between algorithms such as Quicksort and mechanical devices such as shear pins, namely that algorithms are sometimes given difficult inputs on purpose. For example, Doug McIlroy wrote a paper in 1999 that detailed how one can construct input to Quicksort that will force it to take O(n^2) operations to sort an n-element array. Does a Quicksort implementation that misbehaves in this way fail to meet its specifications? If so, it's hard to see how we can use Quicksort at all.

One way to simplify such performance-testing problems is to use white-box testing, which is testing that takes advantage of knowledge of the program's implementation details. I'll discuss such testing techniques in more detail next week.