How to Fix the Hardest Bug You've Ever Seen: The Scientific Method

This is a blog about the development of Yeller, the Exception Tracker with Answers. Read more about Yeller here.

“Holy Shit”

Yeller had a race condition.

A nasty one, too: it prevented work from being shut down cleanly, resulting in double writes into the counters Yeller uses to tell users how many times each exception has happened.
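To make that failure mode concrete, here's a minimal, hypothetical sketch - not Yeller's actual code, which isn't shown in this post - of how an unclean shutdown can double-count an exception when a job is redelivered after its counter write but before its acknowledgement:

```python
# Hypothetical sketch of the double-write failure mode: if shutdown interrupts
# a worker after it increments the counter but before it acknowledges the job,
# the job is redelivered and the same occurrence gets counted twice.
from collections import deque

counts = {}                        # exception id -> occurrence count
queue = deque(["exception-123"])   # one real occurrence, queued once

def process(job, *, crash_before_ack=False):
    counts[job] = counts.get(job, 0) + 1   # write the counter first
    if crash_before_ack:
        queue.append(job)                  # unacked job gets redelivered
        return
    # normal path: the job would be acknowledged here and never requeued

process(queue.popleft(), crash_before_ack=True)  # unclean shutdown mid-job
process(queue.popleft())                         # redelivery counts it again

print(counts["exception-123"])  # 2, for a single real occurrence
```

The gap between "counter written" and "work acknowledged" is exactly the kind of window a shutdown race exploits.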

I was still stress testing the system when this bug was discovered, and it has long since been fixed.

But the race condition was complicated as hell to debug.

Hard bugs like this require attention to your process - wildly hacking away will leave you wandering in circles, confused about the state of both the problem and the system.

How do you go about fixing the really hard bugs?

Difficult bugs are often a big jump up in pain from easier ones. Most easy bugs can be solved quickly enough (though obviously solving them faster and with less fallout still matters). But how do you solve the real head-scratchers? I turn to an age-old, proven technique for understanding issues in complex systems: the Scientific Method.

The Scientific Method, adapted for debugging

The Scientific Method is what you break out when you’re stumped. It’s for when you’ve hit a tough problem, racked your brain and nothing works, and so you say “Ok Computers, that’s the end of the nice guy”, and break out the Scientific Method. Here’s roughly how it goes down for debugging:

1. Write down the problem you’re trying to fix, and write down any observations about it.
2. Formulate a hypothesis as to what causes the bug, and write it down.
3. Design an experiment to test the hypothesis, with explicit expected results. Write both down.
4. Perform the experiment/measurements. For many production problems this will involve looking through log files, metrics, and any other visibility tools you have, maybe changing and deploying code, and so on.

Note that the “fixes” you apply should also be experiments - that is, you should have a hypothesis that goes something like “if I make this change, then I won’t observe the bug anymore”. You should write down results for those as well - often bugs are multilayered, or your original “fix” turns out not to do anything.
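One way to see the shape of this loop is as data you fill in, in order. This is purely an illustrative sketch - all names are mine, and the post itself prescribes prose notes, not code - but it captures the key constraint: the expected result gets written down before you look at the actual one.

```python
# Illustrative sketch of the debugging loop as records; names are invented.
from dataclasses import dataclass, field

@dataclass
class Experiment:
    hypothesis: str      # what you believe causes the bug
    design: str          # the logs, metrics, queries, or changes you'll use
    expected: str        # written down BEFORE you perform the experiment
    actual: str = ""     # filled in afterwards

@dataclass
class DebugLog:
    problem: str
    experiments: list = field(default_factory=list)

    def run(self, exp: Experiment, perform) -> bool:
        exp.actual = perform()          # look at logs, metrics, deploy a change...
        self.experiments.append(exp)    # nothing gets thrown away
        return exp.actual == exp.expected

log = DebugLog(problem="Counters double-count after unclean shutdown")
ok = log.run(
    Experiment(
        hypothesis="Unacked jobs are redelivered and re-counted",
        design="Grep worker logs for duplicate job ids",
        expected="duplicate job ids present",
    ),
    perform=lambda: "duplicate job ids present",  # stand-in for the real check
)
print(ok)  # True: expected matched actual; record the result either way
```

Whether or not the hypothesis holds, the experiment stays in the log - that record is the whole point.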

Why the scientific method wins

The most useful part of this methodology for me is WRITE EVERYTHING DOWN. It’s too easy as a stressed developer working on a production system to:

- repeat a hypothesis that you’ve already discarded, because you forgot that you tested it
- make debugging harder, because your “fixes” cause new problems

Writing shit down solves that: you have a set of notes to consult about what’s going on, what has happened, and so on. It’s also helpful if you have to hand off the issue to somebody else - your notes follow a rigid structure, so your thinking should be clear to the person you hand off to. You may also find it very useful to timestamp each change you make to production in your notes, so you can correlate it against graphs.

You are only human

So why is writing shit down so effective? And why does the scientific method help so much? Here’s the problem you face when debugging a complicated system:

You are only human

Human beings, especially stressed ones, are prone to wild leaps of faith, bad judgement, forgetfulness, and many similar failings. These are there for good reasons - they helped your genetic ancestors survive. But they don’t help so much when dealing with complicated software bugs. Writing down everything you do and think about the problem, in an organized fashion, helps you battle the parts of your humanity that make complex issues harder to debug.

Humans forget shit

If your feedback loop is slow, a debugging session can stretch over days or weeks (I’ve had a few like that with large data import processes). In those situations, writing everything down means you aren’t relying on your limited memory to recall what you ran, what changes you made, and so on.

Tips for using the scientific method effectively

The scientific method is a technique you’ll get better at the more you apply it. I’ve been using it to debug hard bugs for a while now, and I’ve picked up a bunch of tips and tricks for being effective.

Check your assumptions

Your original problem statement will have some assumptions baked into it. Sometimes these will turn out to be just plain false, so some of your experiments should investigate those assumptions and check that they’re correct.

Lab Notebook

This entire post was kicked off by this post from nelhage (a Stripe engineer) on keeping a lab notebook. Writing shit down is essential, but where do you do it? I like two different places, depending on the bug:

- For time-critical production incidents: a dedicated chatroom is my favorite lab notebook. Others can see your work as you do it, what changes you’re making, and everything gets a timestamp. Campfire, Slack, Hipchat etc. are all good enough for this purpose - use whatever you’re already using.
- For everything else: I use a markdown file stored in Dropbox or git.

Where you write matters far less than the fact that you are writing. Just use wherever you usually write and leave it at that.

My notes typically look something like this:

- Problem Statement: right at the top of the document. See above for what to write.
- Hypothesis: a description of the theory you are about to test.
- Experiment: a description of the metrics you’re going to look at, code you’re going to change, query you’re going to run, etc.
- Expected Results: roughly what results you expect. If it’s a database query, often a count of rows. If it’s a log file, the line(s) you’re looking for. If it’s a metric, a trend and some limits.
- Actual Results: the results of the experiment. I like pasting in screenshots of graphs for metrics, the actual log lines, query results, etc.

I title every single subsection, so that it’s easy to scan later. Rinse and repeat the loop until you’ve fixed the issue, and are happy with your diagnosis and your fix.
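A filled-in entry following that structure might look like this - every number, timestamp, and job id below is invented purely for illustration:

```markdown
## Problem Statement
Exception counters show roughly double the expected occurrence counts after deploys.

## Hypothesis: unclean shutdown redelivers already-counted jobs
Workers killed mid-job leave the job unacknowledged, so it gets processed twice.

## Experiment
Grep worker logs for job ids that appear more than once around a deploy.

## Expected Results
At least one job id logged twice, straddling the deploy timestamp.

## Actual Results
14:32 deploy; job 8f3a logged at 14:31:58 and again at 14:32:04. Hypothesis supported.
```

Each entry stays short; the discipline is in filling in Expected Results before Actual Results, not in the formatting.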

Clearly Stating the Problem is Half the Battle

Take time with your original problem description. Try to be as precise as you can. If you mess up the problem definition, it’s easy to get stuck or confused. If you were ever a “rules lawyer” in D&D or a similar game, write it as though somebody is trying to exploit it, because your debugging is going to.

Common Mistake: Bias

A common mistake that even real scientists make is introducing bias into their experiments. “I really want it to be this cause” is an issue you’ll face as a production engineer as well. Your understanding of the system is incomplete, and your guesses as to what the issue is can confound your experimental results. The only real way to guard against this is to watch for it, or, if you can, to get somebody else to repeat your experiments.

Common Mistake: One root cause

Complex systems fail in complex ways. Often the failures interact with each other. Assuming that there’s a single root cause is an easy route to misdiagnosis. Instead look for combinations of failures that together explain the issue at hand.

John Allspaw has an excellent post on this topic, and I’d recommend reading it to learn more.

Tradeoffs with the scientific method

So, there’s one huge downside to using the scientific method for debugging: it is really slow. One of my favorite books describes the scientific method as:

an enormous juggernaut, a huge bulldozer – slow, tedious, lumbering, laborious, but invincible.

The invincibility is huge - it’s what makes it so effective.

You can walk right up to really scary errors, and know that your human nature won’t lose, because the scientific method has your back.

But sometimes the speed tradeoff really hurts.

Sometimes your hair is on fire.
