What Happens After A Failure?



Last week, I distinguished between two kinds of software failure. In one kind, the program fails to produce a correct result; in the other, it produces an incorrect result. The difference between these two cases is whether the software's user can tell that the program has failed. For now, I want to concentrate on the case in which the program's user knows that it has not produced a correct result.

Imagine that you have asked a program to do something for you, and it has reported that it is unable to do so. What do you do next? The answer depends on the context, including what you can learn about what the program did instead of what you asked it to do. At one extreme, the program might have failed in a way that makes further progress impossible, such as by destroying critical data. At the other extreme, the failure might occur in a network component that promises only a "best effort" at handling incoming data, in which case you can simply disregard the failure and wait for whatever is on the other end of the network to try again.

Even these two simple examples suggest what it might be useful to know about a part of a program that has failed:

Do we know what it actually did?

Did the failure leave the system in a state from which it is possible to proceed?

If it is possible to proceed, what do we need to do in order to be able to use the failing component in the future?

If these questions have answers at all, those answers are probably not known until the failure has occurred. The desire to be able to answer such questions is a key motivation for the design of exception-handling facilities in languages that have them.

In order for us to be able to answer our three questions after a failure, two things must be true:

The program has to be able to report how it failed in a way that lets us identify the nature of the failure.

The failure must be documented in enough detail that we can figure out whether, and how, we can proceed.
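As a rough illustration of these two requirements, here is a minimal Python sketch. Everything in it is hypothetical and invented for this example: the `load_batch` function, the `BatchError` exception, and the guarantee stated in the docstring. The exception object reports the nature of the failure, and the docstring plays the role of the documentation that tells the caller what state the destination is in afterward.

```python
class BatchError(Exception):
    """Raised by load_batch (a hypothetical function) when a record is rejected.

    Documented guarantee (an assumption of this sketch): exactly `loaded`
    values were appended to the destination before the failure, and
    nothing else was changed.
    """

    def __init__(self, index, cause, loaded):
        super().__init__(f"record {index} rejected: {cause}")
        self.index = index    # which input record could not be converted
        self.cause = cause    # the underlying exception
        self.loaded = loaded  # how many values were appended before failing


def load_batch(dest, records):
    """Append int(record) to dest for each record; stop at the first failure."""
    loaded = 0
    for index, record in enumerate(records):
        try:
            value = int(record)
        except ValueError as exc:
            # Report what failed and how far we got, then give up.
            raise BatchError(index, exc, loaded) from exc
        dest.append(value)
        loaded += 1
    return loaded


dest = []
try:
    load_batch(dest, ["1", "2", "oops", "4"])
except BatchError as err:
    # What did it do? It appended err.loaded values and then stopped.
    # Can we proceed? Yes: dest holds only successfully converted values.
    # How do we resume? Skip the rejected record and continue after it.
    resume_from = err.index + 1
```

The caller can answer all three questions here only because the exception carries the details and the docstring promises what was left behind; the code that raised the exception does not, by itself, tell the caller what is safe to do next.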

Perhaps the most important factor in being able to continue after a failure is that we must be able to understand the damage that the failure caused. This understanding typically comes from documentation, rather than from the program itself.

For example, suppose we call a sort function. Such a function typically works by comparing elements of the data structure that we give it, and any of those comparisons might fail. If a comparison fails, the sort function reports the failure back to us. In order for us to figure out how to continue, one question will be crucially important: Can we be confident that the data structure that we tried — and failed — to sort is a permutation of its original contents? That is, do we know that every element in the original data structure is still there, and that all that might have changed are the relative positions of the elements?

This kind of question is very unlikely to be answered in the code. Instead, we would hope to be able to find some kind of guarantee in the sort function's documentation that if the data cannot be sorted, the function will throw an exception, after which the data we were trying to sort will be some kind of rearrangement of the original data, with all values still present. In effect, such documentation will help us understand what the failed program did before it failed, which will help us answer our other two questions.
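To make the sort example concrete, here is a small Python sketch of a sort function whose documentation makes exactly this kind of promise. It is an illustration under stated assumptions, not a description of any particular library: `insertion_sort` and `fragile_less` are hypothetical names, and the guarantee holds here only because the function moves elements exclusively by swapping them within an ordinary list.

```python
from collections import Counter


def insertion_sort(items, less):
    """Sort items in place using the caller-supplied comparison less(a, b).

    Documented guarantee (the point of this sketch): if less raises, the
    exception propagates to the caller and items is left as a permutation
    of its original contents, because elements are only ever swapped.
    """
    for i in range(1, len(items)):
        j = i
        # Move items[i] left until it is no longer less than its neighbor.
        while j > 0 and less(items[j], items[j - 1]):
            items[j], items[j - 1] = items[j - 1], items[j]
            j -= 1


def fragile_less(a, b):
    # Hypothetical comparison that fails when either operand is None.
    if a is None or b is None:
        raise TypeError("cannot compare None")
    return a < b


data = [3, 1, None, 2]
before = Counter(data)  # multiset of the original contents

try:
    insertion_sort(data, fragile_less)
    failed = False
except TypeError:
    failed = True

# After the failure, check that every original element is still present.
still_permutation = Counter(data) == before
```

In this run the sort fails partway through, and the data ends up rearranged but complete. The caller could not have learned that from the exception alone; it is the docstring's promise that justifies continuing to use `data`. By contrast, the documentation for Python's built-in `list.sort` says only that the list will likely be left in a partially modified state if a comparison fails, which is a much weaker statement to build on.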

In short, when part of a program can detect — reliably — that it cannot continue, it can use some kind of exception-handling mechanism to report this inability back to its caller. The fact that an exception has occurred, coupled with the author's description of what the program has done at the point when it reports an exception, is what makes it possible to figure out how to proceed after the exception.

Next week, we'll look more closely at what kinds of guarantees it is feasible for a program to make in the context of exceptions.