Getting to Zero Exceptions

This is a blog about the development of Yeller, the Exception Tracker with Answers. Read more about Yeller here

Oh that’s just the usual exception

I think that’s just the error that happened before

These API failures are normal

We have 500 exceptions happening all the time - there’s a lot of noise

I guess these timeouts are normal

I think we don’t really care about this URI encoding bug

These are all direct quotes from Yeller’s customers and other folk I’ve talked to. They’ve gone through the pain of running a production application without a zero exception policy. They have “background radiation” of NullPointerException or NoMethodError or other small coding errors that remain unfixed.

When you first launch a new app, development is usually pretty easy. There’s not all that much code, and you can hold it all in your head. When bugs do turn up, you can fix them relatively easily.

But, as use of your application grows, and the pressure on the development team increases, you start letting small things slip. You leave exceptions in production, because they aren’t a high priority. Things start to slide.

Small bugs are the little-death that brings total obliteration

Letting things pile up really hurts your development though. Suddenly you have to worry about compounding errors. You have to know to ignore them when you’re looking through your exception tracker. You have to nag coworkers to look at bugs. It gets difficult to figure out if you broke anything new after a deploy, because there are too many exceptions there already.

Imagine your app was mostly bug free. Imagine you never had to trawl through a list of exceptions, trying to figure out which one was the most important right now. Imagine you never said “oh we know about that bug”, but hadn’t fixed it yet.

Adopt a Zero Exception Policy

Saying “we will not tolerate exceptions in production” sounds well and good. But how do you actually get there? If you have hundreds of different exceptions happening in production, you need a plan to deal with that, and to stop new exceptions when they turn up.

The 4 Rules to get to Zero Exceptions

Here are 4 simple rules to get to zero production exceptions. They work whether your team is one developer or a hundred developers. They work whether you have just a few persistent bugs or thousands of different exceptions.

Fix new unique exceptions as soon as they show up

Expected exceptions get turned into metrics

Add a regression test every time you fix an exception

Start with the most painful exceptions

Fix new unique exceptions as soon as they show up

The first thing to do to stem the tide of exceptions is to stop letting new exceptions accumulate. Send exception notifications to chat and to email. Upon seeing a new exception, one developer picks it up and takes responsibility. That developer then fixes the exception, tells the exception tracker to ignore it, or turns it into a metric.

This is the most important step - just stopping new exceptions from accumulating will save your application from a whole bunch of bugs and make things a lot less noisy.

If the exception is genuinely an issue you don’t care about, use your exception tracker to ignore that exception, so it doesn’t show up again.
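To make "new unique exception" a bit more concrete, here is a minimal sketch of one way to decide whether an exception has been seen before, so that only first occurrences trigger a notification. Everything in it is an assumption made for illustration: the names `fingerprint`, `report`, and `seen` are hypothetical, the in-memory set stands in for whatever storage a real exception tracker uses, and grouping by exception type plus raise location is just one plausible rule, not a description of how Yeller or any other tracker groups errors.

```python
import traceback

# In-memory record of fingerprints already reported (assumption: a real
# tracker would persist this somewhere durable).
seen = set()


def fingerprint(exc):
    """Group exceptions by type plus the file, function and line that raised them."""
    frames = traceback.extract_tb(exc.__traceback__)
    if frames:
        last = frames[-1]  # innermost frame, where the exception was raised
        where = (last.filename, last.name, last.lineno)
    else:
        where = None  # exception was never raised, so it has no traceback
    return (type(exc).__name__, where)


def report(exc, notify):
    """Call notify(exc) only the first time a given fingerprint is seen.

    Returns True if the exception was new, False if it was a repeat.
    """
    key = fingerprint(exc)
    if key in seen:
        return False
    seen.add(key)
    notify(exc)  # e.g. post to chat or send an email
    return True
```

In use, `notify` would be whatever function posts to your chat room or sends the email. Repeats of the same error stay quiet, so a new notification always means someone needs to pick the exception up.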

Turn expected exceptions into metrics

How do you deal with buggy third-party APIs? How do you deal with timeouts that will get retried?

Expected exceptions shouldn’t go in your exception tracker. Instead, use a monitoring system to track the rate of timeouts/expected errors over time (ideally the success/failure percentage), and add alerting. If you don’t have metrics set up yet, look into Librato as a starting point.
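As a rough illustration of the idea, the sketch below wraps a call so that expected errors are counted rather than reported, while anything unexpected still propagates to the exception tracker. The names `OutcomeCounter` and `call_with_metrics` are hypothetical, and the in-memory counter is only a stand-in for a real monitoring client. It also assumes the caller retries or falls back when it gets `None`.

```python
import collections


class OutcomeCounter:
    """Counts successes and failures per operation name, in memory."""

    def __init__(self):
        self.counts = collections.defaultdict(
            lambda: {"success": 0, "failure": 0}
        )

    def record(self, name, ok):
        self.counts[name]["success" if ok else "failure"] += 1

    def failure_rate(self, name):
        c = self.counts[name]
        total = c["success"] + c["failure"]
        return c["failure"] / total if total else 0.0


# Stand-in for a real metrics client (assumption for this sketch).
metrics = OutcomeCounter()


def call_with_metrics(name, func, expected=(TimeoutError,)):
    """Run func(), counting expected errors as metrics instead of raising them.

    Exceptions not listed in `expected` are not caught, so they still
    reach the exception tracker as genuine bugs.
    """
    try:
        result = func()
    except expected:
        metrics.record(name, ok=False)
        return None  # caller is assumed to retry or fall back
    metrics.record(name, ok=True)
    return result
```

Alerting would then hang off something like `metrics.failure_rate("payments_api")` crossing a threshold, rather than off individual timeout exceptions.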

Add a regression test every time you fix an exception

This is the dual to fixing new exceptions as they show up. You have to make sure they stay fixed! Doing this is often easy, and prevents a bunch of pain in the future.

write a new test to reproduce the exception

run the test and watch it fail

fix the bug

run the test and watch it pass
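The steps above might look something like the following sketch. It uses Python's standard `unittest` module and an invented example: `display_name` is a hypothetical function that used to blow up when a user record had no profile (the Python cousin of a NullPointerException or NoMethodError). The version shown already contains the fix. Without the `or {}`, the test fails with an `AttributeError`, which is the "watch it fail" step.

```python
import unittest


def display_name(user):
    """Return a name to show for a user record (a plain dict in this sketch)."""
    # The fix: tolerate a missing or None profile instead of assuming one exists.
    profile = user.get("profile") or {}
    return profile.get("name") or user["email"]


class DisplayNameRegressionTest(unittest.TestCase):
    def test_user_without_profile_falls_back_to_email(self):
        # Data shaped like the record that triggered the production exception.
        user = {"email": "a@example.com", "profile": None}
        self.assertEqual(display_name(user), "a@example.com")

    def test_user_with_profile_uses_profile_name(self):
        # Guard against the fix breaking the normal case.
        user = {"email": "a@example.com", "profile": {"name": "Ann"}}
        self.assertEqual(display_name(user), "Ann")
```

Saved as a test file, this runs with `python -m unittest`. The test data is copied from the shape of the failing record, so the test keeps guarding against that exact production case.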

Sometimes this gets trickier - especially if the exception requires production data to reproduce. Still, you can use your exception tracker to figure out exactly what data that is - look at the user/database records involved and work out the cause of the error.

Start with the most painful exceptions

Ok, so you’ve stemmed the tide. New exceptions get jumped on, and fixed such that they can’t happen again. What do you do about all the existing exceptions?

Set aside a bit of time each week to fix one or more existing exceptions.

I like setting aside Monday mornings for this. On Monday mornings, each developer picks an existing exception, and either fixes the bug that’s causing it, turns it into a metric, or sets it to “ignore” in the exception tracker.

Over time, this process will bring the number of production exceptions you see down to zero. Then you’ll know what to do about new errors - they won’t be that common, so you just fix them right away. And you won’t be discussing “oh, is this the same error we’ve been ignoring for months?”

References and thanks

Thanks to Chris Sinjakli for the original chat that provoked this discussion.

Thanks to Merlin Mann and his amazing Inbox Zero series, which applies just as well to production exceptions as it does to email.
