Error codes or Exceptions? Why is Reliable Software so Hard?

Error codes or exceptions? Like static vs. dynamic programming languages or how great David Hasselhoff is (most people say great, I say super-great), it tends to turn into a pointless argument ("Hasselhoff is super-great ASSHOLE!").

Very little software really gets error handling right. Even many critical, backend server systems tend to break under heavy loads. And the vast majority of end-user applications handle errors gracefully only for the most well understood, commonly encountered conditions (e.g. HTTP timeout), but very poorly for most other conditions (failed allocations, bad data, I/O errors, missing files, etc).

When these sorts of errors occur, bad things happen. Bad bad things. Like when my web browser crashes, taking one half-composed email and 8 open web pages with it. Why did a single flaw cause so much damage? I use Firefox and it's pretty reliable compared to most applications. It's engineered impressively, with logical program layers well separated and a great deal of the application logic is written in JavaScript, a high-level "safe" programming language. But occasionally it still just crashes or locks up.

Why is this? Because it's using error codes when it should be using exceptions, and exceptions when it should be using error codes? And why should a single flaw in the software cause the world to explode? Is the only way to have reliable software to have perfect software?



I argue that it's not the "handling" part that's hard; few errors are things we can even respond to. How do we "handle" the inability to allocate memory? We can't fix those errors, we just hope they don't make us crash or lock up. And yet so often a single error does exactly that, and causes us to lose everything.

The problem is deeper than how we communicate errors in our languages, it's really everything we've done leading up to the error that's the problem.

I'll describe the three styles of error handling, why one of those styles is usually wrong, and why the problem is more fundamental than error handling.

"Get the Hell Out of Dodge" Error Handling

This is the simplest case of error handling: when a step in some action fails, all the subsequent steps in that action are simply NOT executed. This is where exceptions shine, because the application code need not worry about checking for errors after each step; once the exception is thrown (either directly or by a called routine), the routine exits automatically. And its caller will have a chance to catch it, or do nothing and let the exception bubble up to its caller, and so on up the call stack.

void DoIt() {
    // An exception in Foo means
    // Bar doesn't get called
    Thing thing = Foo();
    Bar(thing);
}

Thing Foo() {
    if (JupiterInLineWithPluto) {
        throw new PlanetAlignmentException();
    }
    return new Thing();
}

A second, slightly more advanced variation of this error handling is when, as in the first case, you want to halt execution of the current code, but before you do you need to free any resources previously allocated. This is different from the "just stop executing the action" case, because we actually need to do some additional work in the presence of the error.

In C, this most often means freeing allocated memory. In garbage collected languages like Java, it's more typically closing opened files or sockets (although they will eventually get closed by the garbage collector regardless). In this style of error handling, you are simply returning resources you've acquired, be it memory, file handles, locks, etc. Most programming languages offer simple ways to deal with this: Java has "finally" blocks, C# has "using" blocks, C++ has stack-based variables and the RAII idiom.

Here's an example of a "finally" block in Java:

void DoIt() {
    Thing thing = Foo();
    thing.CreateTempFiles();
    try {
        Bar(thing);
        Baz(thing);
    } finally {
        // This gets called regardless
        // of exceptions in Bar and Baz.
        thing.DeleteTempFiles();
    }
}

To generalize the description of this type of error handling, you are returning the software back to the default state. Whatever intermediate state your code was in is now lost forever. Stack frames are popped, memory freed, resources recovered, etc. And that's okay because you want those things to go away and start fresh.

This is easy and simple error handling, as easy as turning around and leaving town. And you'll leave town if you know what's good for you. Got that partner?

"Plan B" Error Handling

This type of error handling is for error conditions that are known and understood, where there is an action the code should take in the situation. This differs from the other styles in that these errors aren't "exceptional": they are expected, and we have alternate paths to take; we don't just go home and pretend like it never happened.

One example might be attempting to deliver an SMTP mail message when the connection times out. The error handling in that case may be to look in the MX record for a backup host, or to put aside the mail message for later delivery. (I'm sure it's way more complicated than that; humor me.)

With this type of error handling, status codes are easier to deal with syntactically and logically: "if" and "switch" statements are more compact and natural than "try/catch" for most logic flow.

Error codes:

if (DeliverMessage(msg, primaryHost) == FAILED) {
    if (DeliverMessage(msg, secondaryHost) == FAILED) {
        PutInFailedDeliveryQueue(msg);
    }
}

Exceptions:

try {
    DeliverMessage(msg, primaryHost);
} catch (FailedDeliveryException e) {
    try {
        DeliverMessage(msg, secondaryHost);
    } catch (FailedDeliveryException e2) {
        PutInFailedDeliveryQueue(msg);
    }
}



But regardless of whether you use error codes or exceptions, Plan B error handling isn't particularly difficult. The error conditions and scenarios are understood and your code has actions to deal with those scenarios. If you use status codes here, this type of error handling is as natural as regular application code. And that's the way it should be; it should be just like adding any other branching logic. Exceptions aren't as useful here, because in this case the errors aren't "exceptional", and the code to handle common conditions becomes much more convoluted.

"Reverse the Flow of Time" Error Handling

The third, and truly nastiest, case of error handling is when you must "undo" any state changes your program has made leading up to the error condition. This is where things can get real complicated real quick: you aren't just freeing resources like before, you are backing up in time to a previous program state.

The analogy of putting the toothpaste back in the tube seems appropriate, but that's a piece of cake comparatively. In this case you're actually trying to un-brush the crud back onto your teeth, and each piece of crud should go right back where it was originally.

And how do you do that? How do you put back state you've changed? Do you keep a copy of every variable and property change so you can put it back? Where do you keep it? What if the change is down in some deeply nested composite object? What if another thread or some other code has already seen the state change and acted on it? What happens if another error occurs while putting stuff back?

This is the hard stuff. This is the stuff where the error handling easily becomes as complex as the application logic, and sometimes to do it right it has to be even more complex. So what can we do? What techniques or secrets can we use to make this error handling easier? If only we had something that reversed the actual flow of time, that could do the trick.

Or maybe we shouldn't be trying to figure out an easier way to do this type of error handling, but rather avoiding the need for it altogether.

Why is this style of error handling necessary? Is it our actions leading up to the error? And what could we have done differently? To understand a little better what's going on here, I'll use the analogy of building a deck.

Building a Deck

Let's say you want to build a deck onto your house. You foresee a grand deck on a beautiful summer day, you're sipping lemonade and eating pie and playing Battleship! with friends.

So you get the permits, you buy the materials, you dig, you saw, you hammer, you drill. (Anyone tell you how much you look like Bob Vila?)

Then a few days into it a building inspector shows up and asks to see your permits. You dutifully retrieve them and give them to the inspector. Uh-oh, there's a problem, you didn't apply for a county building permit, you only got the permits from the city.

That's too bad, the inspector says, because then you might have known the placement of the deck is out of line with Jupiter in the autumn sky; it's clearly in violation of county building code regulation number 109.8723.b17, section 4, paragraph 2. So sorry, you cannot continue building this deck.



How could you have known? You thought you planned for everything you could think of, but here, halfway into building your deck, there is a problem you didn't foresee. You can't believe how bad it's going to suck to not have that deck; you're devastated, you already bought Electronic Battleship! Deluxe and everything. But that's not the half of your problems. Not even close.

The worst part by far is that your home is in a completely wrecked state: you've dug up the yard, torn off a bunch of siding and trim, and there's a big door-shaped hole in the side of your house into your living room. Putting all these things back the way they were is going to be just as hard, if not harder, than pushing forward.

In short, you're fucked.

So you patch up the door-shaped hole, you nail back up the siding and you pick up your tools and building materials. Later you start digging up the concrete posts, and it's hard, heavy work. After a while you stop trying so hard; other matters are more pressing. And who cares if the new wall is unpainted or all the posts aren't dug up right away? Most of the building materials you bought are salvageable, and Home Depot is forgiving with their return policy, so you figure no big deal, you have plenty of resources to go around, you'll recover those later.

But you forgot to nail back up a board near the sill, and now a family of chipmunks has taken residence in your walls. You hear the scurrying noises sometimes; you're never quite sure what it is or how it got there, but clearly something is, uh, squirrelly.

This is the real world, where things get screwed up in a big way because of the unexpected, the unknown, and going back is just as hard as going forward. We can't escape this in the real world, building a deck always has the possibility of being a huge disaster.

But what if the real world worked differently? What if it could all be completely undone when things go wrong?

The Miracle Deck

What if Home Depot sold a do-it-yourself deck kit with an installation "undo" feature? At any point during the deck's installation, if something went wrong, the whole thing could be undone and it's like no one ever touched your house.

You'd just press a button, and the whole deck and everything zips itself up and drives back to the store and your charge card is refunded, all automatically. Even if it's at the very end of the installation, if you didn't like the way it looked ("it makes my house look fat"), just press the button and back to the store it goes. And the cool thing is, even if you hit a power line while digging the footers, you could just press a button and all damage is undone.

This product, once installed, is no better than conventional decks. The wood, nails and screws are the same color and quality, the foundation is dug just as deep and cement just as strong. The only difference is during installation, the miracle deck can be undone at any time.

If such a deck product really existed, there could be no serious problems when trying to install a deck, because if anything goes wrong the house is kept in the exact condition as if nothing ever happened. This product might not actually install any more successfully than the old product, but when things go wrong you won't end up with chipmunks in your wall and a garage filled with unreturned Home Depot supplies.

The real world can't work that way, but the programming world can.

Object Oriented Programming: Works Just Like the Real World. Dammit!

One of the great things about Object Oriented Programming is it is a very natural, intuitive way to model software. Things in the real world behave in many ways like the objects we use in programming. The objects in the real world contain other objects, they have interchangeable interfaces, they hide their internal workings, they change over time and take on new state.

Of course, there are many ways the world isn't like OO programming too, but I won't go into that here.

So here we have programming constructs that act and work much like things in the real world act and work. Great, OO makes it easier to write programs that work like the real world, but does OO make it easier to write programs that are useful and reliable?

I remember a crummy movie with Michael Douglas and Demi Moore where Demi was the bad guy. I don't remember much about it except that for some reason the movie -- with no relevance to the plot other than they worked in a tech company -- included a virtual reality sequence that was supposed to showcase a brilliant advance in data retrieval UI.

The system worked by immersing you in a virtual reality representation of a library. Then you could walk around the library to find the information you need. You'd navigate by following categorized signs into progressively narrower categories until you found the virtual bookshelf with the virtual book of information you're looking for. That was supposed to be a huge advance in data retrieval: it made finding information as simple as going to the library.

Here's the problem: What's the very first thing you do when you want to find a book in a real library? You walk over to a computer and use the digital card catalog system.

Sometimes you don't want things that work like the real world, sometimes you want things that work like computers.

Similarly, our object oriented languages are modeling reality too closely. I'm sure it's a slam dunk when actually modeling real world objects, but just how often are we as programmers doing that? OOP's strength also ties us to many of the inherent problems we have with real objects. Why are we limiting ourselves this way?

OO is the problem?

No, OO is NOT the problem, not at its core. It's just that all popular OO languages have the same problem. The problem is more fundamental than what OO brings to the party, it's a problem that exists in nearly every popular programming language, OO or not.

The problem is variable mutation, the problem of complex state change and how to manage what happens when we can no longer go forward. It's the same problem of building a deck.

Another term for variable mutation is "destructive update", because when you change the state of a variable, you are destroying the previous state. In every popular language, updating a variable means the previous state of that variable is destroyed, vanished, gone, and you can't get it back. And that's kind of a problem: your code is doing the equivalent of tearing your house apart in order to achieve an action, but if it fails it won't have achieved its objective and your house is in ruins. Ouch Ouch Ouch.
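To make the deck-building problem concrete in code, here's a minimal sketch (the names and the failure condition are made up for illustration) of a destructive update failing partway through a multi-step action, leaving the data half old and half new:

```java
import java.util.ArrayList;
import java.util.List;

public class DestructiveUpdate {
    // Uppercases every entry in place; if one entry fails partway through,
    // the earlier mutations are already applied and can't be recovered.
    public static void renameAllInPlace(List<String> names) {
        for (int i = 0; i < names.size(); i++) {
            String n = names.get(i);
            if (n.isEmpty()) {
                // Mid-action failure: the list is now half old, half new.
                throw new IllegalArgumentException("empty name");
            }
            names.set(i, n.toUpperCase()); // destroys the previous value
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("ann", "bob", "", "cat"));
        try {
            renameAllInPlace(names);
        } catch (IllegalArgumentException e) {
            // "ann" and "bob" are already uppercased; their old state is gone.
            System.out.println(names); // [ANN, BOB, , cat]
        }
    }
}
```

The list is your house: dug-up yard, hole in the wall, and no record of how it used to look.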



What we need in languages and tools is the ability to easily isolate our changes for when the shit hits the fan, so that incomplete changes aren't seen (all or nothing). And we cannot be in denial that the shit can hit the fan at any time. We need to make it easy to detect when things do go wrong, and make it simple to do the right thing once that happens.

PHP to the Rescue! PHP?



Expecting someone else?

Believe it or not we already have it, in rudimentary form, in PHP. Yup, good old, stupid-simple PHP. On a webserver, PHP scripts have no shared state, so each instance of a PHP script runs in its own logical memory space. The scripts maintain no persisted state, so each script starts off fresh as a daisy, blissfully unaware of what happened the previous times it was executed.

The only shared state in PHP exists at the database level (or file level, but don't go there), and if you commit all changes in a single transaction, you've basically solved the deck building problem. Your code might not be any better at successfully completing its update, but failure is isolated: all the actions leading up to a failure are forgotten, and they can't cause further problems or inconsistencies in the application.
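The commit-or-forget semantics a transactional database provides can be sketched with a toy in-memory store (the `TinyStore` class and its methods are invented for this example, not a real library): writes go to a buffer, commit applies them all at once, and rollback just throws the buffer away. No undo logic is ever written.

```java
import java.util.HashMap;
import java.util.Map;

// A toy key-value store with single-level transactions: writes go to a
// pending buffer and only touch the real state on commit. Rollback just
// forgets the buffer -- nothing to undo.
public class TinyStore {
    private final Map<String, String> data = new HashMap<>();
    private Map<String, String> pending = null;

    public void begin() { pending = new HashMap<>(); }

    public void put(String key, String value) {
        if (pending != null) pending.put(key, value);
        else data.put(key, value);
    }

    public String get(String key) {
        if (pending != null && pending.containsKey(key)) return pending.get(key);
        return data.get(key);
    }

    public void commit() { data.putAll(pending); pending = null; }

    public void rollback() { pending = null; } // just forget it all

    public static void main(String[] args) {
        TinyStore store = new TinyStore();
        store.begin();
        store.put("deck", "half-built");
        store.rollback(); // the inspector showed up
        System.out.println(store.get("deck")); // null: like it never happened
    }
}
```

A real database does the same thing with vastly more machinery, but the shape is identical to what a PHP script gets for free: the shared state either sees the whole change or none of it.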

But there's nothing special about PHP as a language that gives it these properties; rather, it's how it's being used. Any language -- Java/C++/VB/Ruby/Python -- coupled with a transactional database has the same ability if it's used in the manner PHP is used: each invocation starts from scratch with no shared state and no memory of previous invocations.

However, all these languages begin to have issues once they start modifying persisted, in-memory program state. Once again, it's the deck building problem. As some multi-step action is getting carried out, if one step fails, then any modifications in the previous steps must be undone, or like your deck project, the program may be left in a shambles. Databases have transaction support, but our languages do not.

Pretty much any application that keeps state in memory has to worry about this: everything from highly concurrent application servers down to single user GUI applications.

So, how can we solve this problem more generally?

Don't Undo Your Actions, Just Forget Them

There are strategies to avoid the intermediate destructive updates that cause problems, but unfortunately none of the popular languages provide direct support, so it feels hacky. And it is. But just say they're design patterns and you won't feel so bad about it.

The key to these strategies is to minimize destructive updates, so that any actions we take need not be undone, but simply forgotten. By doing this, we turn the super difficult "Reverse the Flow of Time" error handling into the super easy "Get the Hell out of Dodge" error handling.

Make a Copy of Everything Up Front

The first technique is low-tech and easy to understand, but expensive computationally and resource-wise.

Before the code does anything, make a deep copy of all the objects you might modify, then have the action modify the copies. Once all those modifications are completed, swap out the old objects with the new at the very end.

If an error happens during the action, the copied objects are simply forgotten about and garbage collected later. And you need not change the way the object methods work; the bulk of the application code remains unchanged. Easy as pie... a very expensive, memory intensive pie. But simple and easy nonetheless.
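A small sketch of the copy-up-front technique (the names are hypothetical): the action works on a copy, and the caller adopts the result only if every step succeeded. A failure leaves the original untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyUpFront {
    // Works on a copy of the input; the caller swaps in the result only
    // after every step succeeds, so a failure leaves its data untouched.
    public static List<String> renameAll(List<String> names) {
        List<String> copy = new ArrayList<>(names); // copy up front
        for (int i = 0; i < copy.size(); i++) {
            String n = copy.get(i);
            if (n.isEmpty()) throw new IllegalArgumentException("empty name");
            copy.set(i, n.toUpperCase()); // mutate only the copy
        }
        return copy; // the "swap": caller replaces its reference with this
    }

    public static void main(String[] args) {
        List<String> names = List.of("ann", "bob");
        names = renameAll(names); // success: adopt the new state
        System.out.println(names); // [ANN, BOB]
    }
}
```

If `renameAll` throws, the caller still holds the old list, and the half-modified copy just gets garbage collected -- "Get the Hell out of Dodge" error handling.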

Immutable Objects

The second way to avoid destructive updates is to make your objects immutable. An immutable object is one that, once created, cannot be changed. Lord knows, it can't change.

Java strings work this way. No method of the String class ever modifies an existing string object; instead they create a brand new string object that's the result of the operation, and the caller will at that point have two distinct strings, a pre-action string and a post-action string. In practice this works very well and easily for strings. But strings are simple datatypes; they aren't composite like most of our application objects (they only contain a char array).

Unfortunately, most popular languages don't directly support this style of development. C++ has the "const" modifier, which enables static enforcement of immutable objects, but that only tells us when we are doing it wrong (attempting to modify const objects); it doesn't make it any easier to actually achieve this style of programming, which is difficult when working with deeply composite objects. None of the popular languages offer much support for this style of programming; there is no syntactic sugar or other features to make it less awkward.

Consider this example of object composition. We have a house. That house contains a bathroom, that bathroom contains a toilet, and so on. When we want to clean the house, we call down through objects, cleaning each sub object. First take a look at a classic, mutable-object implementation:

void DoIt(House house) {
    ...
    house.Clean();
    ...
}

class House {
    Bathroom bathroom;
    Bedroom bedroom;
    ...
    void Clean() {
        bathroom.Clean();
        bedroom.Clean();
        ...
    }
}

class Bathroom {
    Toilet toilet;
    Mirror mirror;
    ...
    void Clean() {
        toilet.Flush();
        mirror.Clean();
        ...
    }
}

class Toilet {
    int poops;
    ...
    void Flush() {
        poops = 0;
    }
}

Here is an "immutable" version of the above code:

void DoIt(House house) {
    ...
    house = house.Clean();
    ...
}

class House {
    Bathroom bathroom;
    Bedroom bedroom;
    ...
    House Clean() {
        // make a new copy of the house
        // with the cleaned contents
        House house = new House();
        house.bathroom = bathroom.Clean();
        house.bedroom = bedroom.Clean();
        ...
        return house;
    }
}

class Bathroom {
    Toilet toilet;
    Mirror mirror;
    ...
    Bathroom Clean() {
        // make a new copy of the bathroom
        // with the cleaned contents
        Bathroom bathroom = new Bathroom();
        bathroom.toilet = toilet.Flush();
        bathroom.mirror = mirror.Clean();
        ...
        return bathroom;
    }
}

class Toilet {
    int poops;
    Toilet Flush() {
        // make a new copy of the toilet
        // with no poop
        Toilet toilet = new Toilet();
        toilet.poops = 0;
        return toilet;
    }
}

Clearly the immutable version is longer and more complex, and it only gets worse if you also want to have a second return value. However, the immutable version is more robust: if any cleaning operation fails, the house won't wind up in a half-cleaned state.

Being in a half-cleaned state might seem harmless enough, but it can cause surprisingly serious problems. If, for example, part of cleaning the house meant moving all the furniture into the lawn so the floors could be polished, you would have big problems if the cleaners suddenly left. And they're calling for rain. And migrating seagulls.

Keep Object Mutation to a Single Operation

Another strategy that is helpful in certain circumstances is to keep existing object mutation down to one operation. This strategy is to do as much work in isolation as possible, then apply those changes in a single operation.

This is also known as an atomic update. Not atomic like an atomic bomb, but atomic like a tiny atom, as in can't get any smaller.



(photo of actual atom)

An example might be if you have GUI application, and your code wants to add a dockable tool bar to the UI window.

One approach is to add an empty tool bar to the UI, then add each individual button to the bar. This is bad because now you are mutating the UI program state for each button added, and if one tool bar button fails to be added, then the user gets a wacked-out, partially constructed bar. You could put out an eye like that. Not to mention each time you add a button, you may be kicking off all sorts of ripple mutations as layout managers do work, increasing the chances of something going haywire.

Instead, the better strategy is to build the tool bar in isolation. Once the bar is completely constructed with all its buttons, then add it to the UI in a single operation. This way you minimize the mutation of existing objects (the top-level window); we are only mutating our new object during its multi-step construction. If we fail to construct it fully, we can just forget about it and let the garbage collector get it.

So you fully construct the bar and then add it to the window in one operation. Unfortunately, adding the toolbar to the window may not truly be an atomic operation deep down, but from your perspective it is, since you can't make the mutation operation any smaller. You may not have completely eliminated the chance of things going into a bad state, but you've minimized it as far as you can.
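The toolbar strategy above can be sketched with a toy window model (the `Window`, `Toolbar`, and `install` names are invented for illustration, not a real GUI API): all the fallible multi-step construction happens on a private object, and the shared UI state is mutated exactly once, at the end.

```java
import java.util.ArrayList;
import java.util.List;

public class AtomicToolbar {
    public static class Toolbar {
        public final List<String> buttons = new ArrayList<>();
        public void addButton(String name) {
            if (name.isEmpty()) throw new IllegalArgumentException("bad button");
            buttons.add(name);
        }
    }

    public static class Window {
        public final List<Toolbar> toolbars = new ArrayList<>();
        public void addToolbar(Toolbar bar) { toolbars.add(bar); }
    }

    // Build the bar in isolation; the window is touched only once, at the end.
    public static void install(Window window, String... buttonNames) {
        Toolbar bar = new Toolbar();
        for (String name : buttonNames) {
            bar.addButton(name); // may throw -- but only the new bar is affected
        }
        window.addToolbar(bar); // the single mutation of shared state
    }

    public static void main(String[] args) {
        Window window = new Window();
        try {
            install(window, "Save", "", "Print"); // one button fails to build
        } catch (IllegalArgumentException e) {
            // The half-built bar is simply garbage collected; the window
            // never saw it.
            System.out.println(window.toolbars.size()); // 0
        }
    }
}
```

If construction fails at any button, the window never changed, so error handling collapses back to "Get the Hell out of Dodge": throw, forget the bar, done.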

Plus people will be totally impressed you're using atomic powered code.

Use a Functional Language

Functional languages get immutability and state change right (they'd better, it's a key attribute of functional programming). Unfortunately, I don't know of any functional language I'd call popular. I think it's because they all have dumb names like LISP and Haskell.



Why pthat's a lovely monad you're wearing, Mrsh. Cleather.

Erlang, which started me thinking about these issues, is a functional programming language that gets reliability right in a simple and elegant way that I think is fairly easy for an experienced OO programmer to pick up. You don't even have to learn about monads, but you damn sure need to understand recursion. Erlang is dynamic and somewhat "scripty", making the development process more incremental and approachable. It also has a hideous syntax.

But Erlang is marvelously beautiful in the way it meshes the concepts of immutability, messaging, pattern matching, processes and process hierarchy to create a language and runtime where extreme concurrency and reliability means adhering to a few simple design principles.

The point

No, this article wasn't really about error codes vs. exceptions. Sorry, but the truth is, there is no one best way to communicate error conditions. "It depends" is the only honest answer. Unfortunately the designers of APIs have to decide ahead of time how callers will be signaled of errors, while the caller -- who knows best how the errors should be communicated and managed -- isn't given a choice.

The much bigger problem in software reliability is not how we communicate errors, it's the state we are in when the error happens. So often the errors are things we can't really do anything about, we can't force the network connection to work, or somehow create more disk space or memory if we run out. But we can see to it that we don't do the programmatic equivalent of half-destroying our house in the process of building a deck. Attempts to "Reverse the Flow of Time" in code are bad. Avoid mutations (destructive updates) and use "Get the Hell out of Dodge" error handling whenever possible.

Posted April 27, 2006 2:20 PM