←

On Rigorous Error Handling

In my long career as a programmer I've seen a lot of projects with a lot of error handling practices and, sadly, almost none that actually worked.

We are facing facing few high-level problems here:

For many applications error handling is not critical. Error handling code is rarely executed. Consequently, it looks like it works until it doesn't. Programmers want to implement new features. Writing error handling is just an annoyance that slows them down.

Who cares about error handling?

For many applications writing rigorous error handling code doesn't pay off.

As long as you have some semi-decent way to deal with errors transparently (e.g. exceptions) you are fine. If there's a problem, throw an exception. On the top level catch all the exceptions and open a dialog box with the error message (in frontend applications) or write it to the log (on the server).

The above, of course, means that you hope that, after the exception is processed, the application will we left in a consistent, functional state. Once again, you HOPE. You do not KNOW. You do not know because, when writing the application, you haven't thought about error handling at all.

Now, don't get me wrong. I am not saying that you should never do this. It often makes sense from economic perspective. Writing rigorous error handling is expensive. If the worst thing that can happen is that a frontend application crashes once in a long while making user curse and restart it, then it's totally not worth doing.

However, there is also a different kind of application. There are medical application where the patient dies if they misbehave. There are space probes that crash into planets, taking millions of dollars of investment as well as scientific careers down with them. There are HFT application which can generate bazillion of dollars of loss for every minute of downtime.

In these cases it's definitely worth writing rigorous error handling code.

That much everybody agrees with.

However, it is often not explicitly understood that whenever you are writing a piece of infrastructure (as opposed to a client-facing application) you are always in the latter category.

A piece of infrastructure can be used in a medical application, in a space probe, in a high-frequency trading system, but unless it's a commercial product and you take care not to sell it to anyone who looks too important to fail, then you have no way to prevent that.

Ergo, if you are doing infrastructure, you should always do rigorous error handling.

Error handling code is never executed

The mainstream attitude to fixing bugs at the moment is "run it, see what fails, then fix it". That's why we write tests. Tests make the code run and, hopefully, fail if there's a problem. That's why we have user-facing issue trackers. If a bug gets past testing, it will fail for the user and user will fill in a bug report — or so we hope, at least.

But the error handling code is, almost by definition, executed only rarely. And code that is triggered by two concurrent problems is executed almost never at all.

Therefore, tests are not likely to help. User reports may catch the issue, but are often closed with "cannot reproduce" status.

There are alternative approaches, like formal verification or fuzzing, but almost nobody uses those.

In the end, we want programmers to write error handling code that works on the first try, without any testing.

And that is, of course, impossible to do. But, on the other hand, it's not black and white. Errors are rare and if the bugs in error handling code get rare as well then you can increase the reliability of the system from, say, three nines to five nines. And as medical applications go, adding one nine of reliability saves some actual human lives.

The dilemma

So here we are. We have programmers who want to spend no time on error handling, who just want to move on and implement cool new features, and we want them to write perfect error handling code on the first try, without testing.

Even worse, attempts to solve the first problem (to make error handling invisible) often make the second problem (failure-proof error handling) even more complex. And given that everyone is writing new code all the time, but errors are rare, the solutions to the first problem are going to proliferate even at the expense of making the second problem harder or even impossible to solve.

An example

Let's have a look at OpenSSL API. I don't want to pick on OpenSSL, mind you. You'll encounter similar kind of problem with almost any library, but I happened to work with OpenSSL lately so that's what I'll use as an example. Also note that OpenSSL is notoriously underfunded, so instead of getting angry at them, do your bit and send them a donation.

Anyway, when an OpenSSL function fails it reports that it failed and you can have a look at the "error queue". Error queue is effectively a list of integer error codes. The list of error codes can be found here.

Note how documentation of an error code invariably looks like this:

#define ERR_R_BAD_GET_ASN1_OBJECT_CALL 60 Definition at line 296 of file err.h.

Yes, you get a cryptic string, a cryptic number, and a reference to the line in the source code. If you look at the source code you'll find a cryptic string and a cryptic number again. No explanation of what the error means whatsoever.

Now think for a second about how you would, as a user, handle such an error. It's a multidimensional beast and it's not even clear whether order of errors matters or not. And each error is itself chosen from a large set of cryptic errors with no explicit associated semantics at all. Needless to say, the set of error codes is going to expand in newer versions, so even if you handled every single one of them correctly you can still get a nasty failure at some point in the future.

So what I've ended up doing was logging all the errors from the queue for debugging purposes and converting the entire thing into a single "OpenSSL failed" error. I am not proud of that but I challenge you to come up with a better solution.

And no, passing the monstrosity to the caller, who has even less context about how to handle it than you have, and breaking the encapsulation properties along the way, is not an option.

The takeaway from the example is that even if you are willing to do rigorous error handling, the underlying layers often give you no chance but to go the easy and sloppy way.

As a side note, OpenSSL's system of reporting errors is by no means the worst of all. Consider typical exceptions as they appear in most languages. Exception is not a list of error codes. It's an object. In other words, it can be a list, a map, a binary tree or whatever you want, really. There's absolutely no way to deal with this Turing-complete beast is a systemic way.

What's needed

When you look at it from the point of view of the user of the API, the requirements are suddenly crystal-clear. If you are to do rigorous error handling you want a small amount of well-defined failure modes.

A function may succeed. Or it may, for example, fail because of disconnected backend. Or it may time out. And that's it. There are only two failure modes and they are documented as a part of the API. Once you have that, the error handling becomes obvious:

int err = some_library_function(); if(err != 0) { swich(err) { case ECONNRESET: ... case ETIMEDOUT: ... default: assert(0); // the function violates its specification } }

Some libraries are closer to this ideal state, some are further away.

As already mentioned, classic exceptions are the worst. You get an object, or worse, just an interface that you can (the horror!) downcast to an actual exception class. All bets are off.

OpenSSL's approach is somewhat better. You still get a list of error codes, but at least you know it's a list and not a map or something else.

Some libraries go further in the right direction and return a single error, but the list of error codes is long, poorly documented and subject to the future growth. Moreover, there's no clear association between a function and the error codes it can return. You have to expect any error code when calling any function. Often, the rigorous error handling code, as show above, would have to have dozens if not hundreds of error codes in the switch statement.

When possible errors are part of the function specification, on the other hand, we are almost OK. This is the case with POSIX. I've also tried to used the same approach with my own libraries, such as libdill. However, even in this case it's possible to do it wrong. The list of possible errors in POSIX specification of connect() API lists more than a dozen of possible errors which in turn disincentivizes the user from doing rigorous error handling.

In the best of the cases, the list of errors is small not only for a particular function, but also for the library as a whole. In my experience, it's entirely possible to do with as little as 10-15 different error codes for the entire library.

The advantage of that approach is twofold.

First, the user, as they use the library, will eventually learn those error codes and what they mean by heart. That in turn makes writing rigorous error handling code much less painful.

Second, a limited set of error codes makes the implementer of the library to actually think about the failure modes. If he can just define a new error code for every error condition he encounters, it's an easy way out with no need to do much thinking. If, on the other hand, he has to choose one of the 10 possible error codes, he has to think about which of them matches the semantics of the error the best. Which in turn results in a better, more consistent library, which it is joy to use.

November 17th, 2018

Discussion Forum