One advantage of test-driven development I’ve rarely seen emphasized is that it enables experimental programming. This was recently brought home to me while I was working on XOM. Steve Loughran had requested the ability to use a NodeFactory when converting from a DOM document. It was a reasonable request so I set off to implement it.

My first iteration satisfied Steve’s use case. It was actually a fairly major rework of the code, but the test suite assured me that it all worked and that I hadn’t broken anything that used to work. However, Wolfgang Hoschek quickly found a real bug in the new functionality I had added, and things got a little sticky here.

To satisfy John Cowan, the DOMConverter uses some fancy non-recursive algorithms. This enables it to process arbitrarily deep documents without stack overflows. However, the algorithm is far from obvious, and even though the code is well written, it’s hard to come back to it a couple of years later and figure out just how it works.
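For readers who haven’t seen the general technique, here is a minimal sketch of walking a tree with an explicit stack instead of recursion. It is not the DOMConverter’s actual algorithm, which isn’t shown here. SketchNode, IterativeWalk, preorder and chain are hypothetical names invented for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a document node; not a XOM or DOM class.
final class SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name) {
        this.name = name;
    }

    // Appends a child and returns this node so calls can be chained.
    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

final class IterativeWalk {

    // Visits nodes in document order. The pending nodes live in a
    // heap-allocated deque, so tree depth is not limited by the call stack.
    static List<String> preorder(SketchNode root) {
        List<String> visited = new ArrayList<>();
        Deque<SketchNode> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            SketchNode current = pending.pop();
            visited.add(current.name);
            // Push children in reverse so the first child is popped first.
            for (int i = current.children.size() - 1; i >= 0; i--) {
                pending.push(current.children.get(i));
            }
        }
        return visited;
    }

    // Builds a single-branch tree of the given depth without recursion.
    static SketchNode chain(int depth) {
        SketchNode root = new SketchNode("n0");
        SketchNode tip = root;
        for (int i = 1; i < depth; i++) {
            SketchNode next = new SketchNode("n" + i);
            tip.add(next);
            tip = next;
        }
        return root;
    }

    public static void main(String[] args) {
        // A depth like this would typically overflow a naive recursive walk.
        SketchNode deep = chain(100_000);
        System.out.println(preorder(deep).size());
    }
}
```

The price of this style is the one described above: the bookkeeping that recursion would do implicitly is now spread across loop variables, which is what makes such code hard to reread later.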

But this is where test-driven development really shines: I don’t have to understand the code to fix it. All I need is a test case for the new functionality and the bug. Once that test is passing, I’m done. Rather than trying to understand what the code is doing and exactly how the bug is triggered, I just start making changes and running the tests. The changes aren’t random, but they’re based more on intuition and guesswork than on detailed analysis of the code paths. In this case, my first attempt to fix the problem hit a NullPointerException. I could tell from the stack trace that the exception was thrown by this line:

parent.appendChild(children.get(i));

I was pretty sure the parent variable was the offending null object, since children had already been used in the immediately preceding line. The question then becomes: why is parent null?

The most likely candidate seemed to be this line:

parent = parent.getParent();

I wasn’t absolutely sure of that, but it seemed likely. I didn’t check it with a debugger or a System.err.println() statement, but given what else was going on in the code, that was the obvious place to look. Now how to fix it? Here I got really confused. There was no obvious fix. I was thinking deeply about the problem, and I thought I was going to have to rewrite the entire method from scratch with a totally new algorithm.

But then I stopped thinking for a minute. “What’s the simplest thing that could possibly work?” I asked. This was the simplest thing I could come up with:

if (parent.getParent() != null) parent = parent.getParent();
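To isolate what that guard changes, here is a small sketch using a made-up ToyNode class rather than XOM’s real node classes. GuardedStep, stepUp and guardedStepUp are likewise hypothetical names. It only shows the local effect of the one-line change: the unguarded step yields null at the top of the tree, while the guarded step stays put. It says nothing about why that is enough for the surrounding algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node; XOM's real classes are much richer.
final class ToyNode {
    private final String name;
    private ToyNode parent;
    private final List<ToyNode> children = new ArrayList<>();

    ToyNode(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Returns null for a node that has not been attached to a parent.
    ToyNode getParent() {
        return parent;
    }

    int getChildCount() {
        return children.size();
    }

    void appendChild(ToyNode child) {
        child.parent = this;
        children.add(child);
    }
}

final class GuardedStep {

    // Unguarded step: returns null once the cursor is already at the top.
    static ToyNode stepUp(ToyNode parent) {
        return parent.getParent();
    }

    // Guarded step: same shape as the one-line fix, so the cursor
    // never becomes null when there is nothing above it.
    static ToyNode guardedStepUp(ToyNode parent) {
        if (parent.getParent() != null) parent = parent.getParent();
        return parent;
    }

    public static void main(String[] args) {
        ToyNode root = new ToyNode("root");
        ToyNode child = new ToyNode("child");
        root.appendChild(child);

        System.out.println(stepUp(root));                  // the unguarded cursor
        System.out.println(guardedStepUp(root).getName()); // the guarded cursor
    }
}
```

With the unguarded version, a later call such as parent.appendChild(...) on the returned cursor is where a NullPointerException would surface.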

The traditional way of approaching this problem would have been to think carefully about the algorithm, and consider exactly how this change would affect that. What would the parent variable be when the parent was null, and so forth? I could have done that, and if I had done that I would have rapidly concluded that this fix wouldn’t work. Just looking at it, I really thought this would fail. But instead I went ahead and ran the test anyway.

Damned if the test didn’t pass!

I’m done. I saved myself hours of hard mental effort trying to understand this code. In fact, I don’t need to understand the code. I only need to understand what the code is supposed to do, and have test cases that prove it does it. This is a radical rethinking of how we program, but I think it’s essential for modern programs. XOM is small enough that one person could understand it, but many programs aren’t. Does anyone really know all the inner workings of Apache? Or Mozilla? Or MySQL? Maybe, but I doubt it, and I know no one really understands everything that’s going on inside Linux or Windows XP.

The only way we can have real confidence in our programs is by testing them. Practical programmers long ago gave up on the fantasy of proving programs correct mathematically. Increasingly I think we’re going to need to give up on the fantasy of even understanding many programs. We’ll understand how they work at the low scale of individual lines, and we’ll understand what they’re supposed to do, but the working of the whole program? Forget it. It’s not possible.

This sounds scary. This sounds like a radical idea, and to a computer scientist, it is. But that’s only because until relatively recently computer scientists have only dealt with very simple systems. In the rest of the sciences (physics, chemistry, economics, and so forth) this is how all real problems are handled. We identify the low-level basic principles like Schrödinger’s equation or Newton’s Laws that define how the world works, but when we do actual engineering we use approximations to those laws and we experiment to find out which approximations work in which domains. Chemists don’t start with Schrödinger’s equation when trying to understand the properties of a complex molecule. They experiment with it. They poke at it and they prod it, and hit it with electricity, light, heat and a thousand other things, until they’re confident they know how it behaves. In many cases, chaos theory tells us we can’t even theoretically hope to solve the underlying equations. Experiment is all we’ve got.

Computer science has really avoided experiment of this nature. Programmers do experiments, but they don’t trust the results unless they can understand them and rationalize them. I think that’s going to have to change. The systems are getting too big. While there will always be simple programs and simple systems that can be understood in toto, the more interesting systems are too big. The only way to manage and understand them is empirically, by experiment and by test. The good news is that experiments do work. Test-driven development works. It produces demonstrably more reliable, more robust, less buggy code. You don’t need to understand why or how the program works as long as the tests prove that it does.