The Stan meeting today reminded me of Joel Spolsky’s recasting of the Yiddish joke about Shlemiel the Painter. Joel retold it on his blog, Joel on Software, in the post Back to Basics:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck. The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck. The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?” “I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

Joel used it as an example of the kind of string processing naive programmers are prone to use.
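To make the string-processing version concrete, here is a minimal sketch of my own (not Joel's code) that counts how many characters get touched when you build a string by rescanning from the start each time, the way repeated calls to C's strcat do, versus remembering where the end is. The function names are made up for illustration.

```python
def shlemiel_concat(pieces):
    """Append each piece after walking the whole result so far to find its end."""
    result = ""
    touched = 0
    for piece in pieces:
        touched += len(result)  # walk back from the paint can to the end of the line
        touched += len(piece)   # then paint (copy) the new piece
        result += piece
    return result, touched


def tracked_concat(pieces):
    """Append each piece at a remembered end position, so nothing is rescanned."""
    touched = sum(len(piece) for piece in pieces)  # only the copying remains
    return "".join(pieces), touched


pieces = ["ab"] * 4
print(shlemiel_concat(pieces))  # ('abababab', 20)
print(tracked_concat(pieces))   # ('abababab', 8)
```

Both versions produce the same string; only the amount of walking differs, and the gap grows quadratically with the number of pieces.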

The reason I bring it up is that software development almost inevitably employs the Shlemiel the Painter algorithm.

Here’s the problem in a nutshell: The more moving pieces your software has, the longer it takes to add a new feature or change an existing feature.

For example, when we have N special functions defined, if we want to change the way error handling works in all of them, it takes N units of work. If we have K interfaces defined, the amount of work to roll out a new user-facing feature increases K-fold over the situation where there’s only a single interface.

So the first feature takes one unit of time, the second two units, and so on. And we all know where this goes, though if you’re like me rather than like Gauss, you didn’t derive the result in your head in primary school:

$$1 + 2 + \cdots + N = \frac{N(N+1)}{2}.$$

The upshot is that to add $N$ features takes time proportional to $N^2$.
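For those of us who aren't Gauss, the arithmetic is easy to check by brute force. This is just a sketch under the simplifying assumption used here, that the k-th feature costs exactly k units of work; the function names are mine.

```python
def total_cost(n_features):
    """Add up the work when the k-th feature costs k units."""
    total = 0
    for k in range(1, n_features + 1):
        total += k
    return total


def closed_form(n_features):
    """Gauss's shortcut for 1 + 2 + ... + N."""
    return n_features * (n_features + 1) // 2


for n in (10, 100, 1000):
    assert total_cost(n) == closed_form(n)
    print(n, total_cost(n))
# 10 55
# 100 5050
# 1000 500500
```

Going from 10 features to 100 multiplies the total work by roughly 92, not 10.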

This is why I’m always arguing that adding new features, as simple as they look, isn’t free. We have to test them and we have to document them and we have to build them into all of our interfaces. Then when things change, we have to change all of those moving pieces.

Sometimes you can design around the problem with modularity. In an ideal world, you look into the future when designing the algorithm the first time and imagine all the ways it might change and design something simple with that in mind. At least that’s how software design works in theory.
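As a toy illustration of what designing around it can look like, here is a sketch in which the argument checking for several special functions lives in one shared wrapper, so changing the error handling is one edit rather than N. This is not how Stan is implemented; the decorator and function names are hypothetical.

```python
import math
from functools import wraps


def require_positive(func):
    """Hypothetical shared check: the single place the error handling lives."""
    @wraps(func)
    def wrapper(x):
        if not x > 0:  # written this way so NaN is rejected too
            raise ValueError(f"{func.__name__}: argument must be positive, got {x}")
        return func(x)
    return wrapper


@require_positive
def log_gamma(x):
    return math.lgamma(x)


@require_positive
def log_sqrt(x):
    return 0.5 * math.log(x)


print(log_sqrt(4.0))  # log(2), about 0.693
try:
    log_gamma(-1.0)
except ValueError as err:
    print(err)  # log_gamma: argument must be positive, got -1.0
```

Of course, this only pays off if argument checking turns out to be the thing that changes.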

In practice, it’s nearly impossible to write simple and modular code with oracular accuracy about how it’ll be used in the future. It’s the unknown unknowns that get you every time. As every engineer knows, but as Donald Rumsfeld was mocked for pointing out,

There are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don’t know. But there are also unknown unknowns – there are things we do not know we don’t know.

It’s the unknown unknowns that get you every time. What usually happens is that you only figure out where the modularity has to be after the work is done.

At this point, you can spend even more time and refactor the code and hopefully not have to do yet another round of work in the future. If you’re lucky. One issue is that those without a lot of experience in software development can see refactoring or trying to design modularly in the first place as akin to rearranging the deck chairs on the Titanic. But what it’s really about is trying to wrestle software into a manageable state.

Just an example related to Stan: we’re about to go through and rewrite every one of our distribution functions yet again so that we can take higher-order derivatives (we need this for some optimization, for Laplace approximations, and for RHMC). We didn’t even know Stan would be used by anyone other than us when we first wrote the code. The second rewrite was for vectorization. The third time around we added template metaprogramming to drop constant terms. On the fourth rewrite, we added expression templates to make the vectorization efficient. And now we’re on the fifth time, adding higher-order derivative-compatible code. Each time, the amount of work we had to do was proportional to the number of distributions we supported, which continues to grow with both new distributions and more convenient reparameterizations of existing distributions, like Bernoulli on the logit scale or Poisson on the log scale or multivariate normal with a Cholesky factor covariance or precision.

And that work’s not just in defining the functions, but in testing them. We had basic tests, then we needed tests for vectorization, then tests for all the varying ways the functions could be called with data and parameters. And now we need tests for the second-order derivatives. But it’s not just tests. It’s documentation. And debugging. And providing support to users, because the doc gets more confusing as we support more options.

And at one point, we actually simplified all the distributions (again requiring N units of work) to get rid of the traits-based error-handling configuration that we anticipated needing but never needed. It just added drag to all of our development and seriously complicated our testing. One of the biggest reasons to keep things simple is to simplify testing and documentation.