The strength of these persistent models becomes apparent late in the development cycle. Software usually goes through several phases:

Analysis → Design → Impl. → Test → Deploy → Maintenance

It is important to stress that software development is a dynamic activity. We repeatedly change the software in production, layering more and more complexity and features on top of the system, and we fix bugs while the software is live in production.

In recent years, development has tended toward so-called Agile methods, in which many small, dynamic iterations of the software construction process run at the same time. We have social tooling in place which tries to achieve this (Scrum, Kanban, …), and we have technical tooling in place to support it (git, Mercurial, …).

The “Maintenance” part is very expensive. Maintaining running software has recurring costs associated with it. In a world where everything is a service, we have to pay for operators, hardware resources, developers, and so on.

When we program, we try to remove errors early. We employ static type systems, we do extensive testing, we use static analysis. Perhaps we even use probabilistic model checkers like QuickCheck, exhaustive model checkers like SPIN, or prove our software correct in Coq. We know, inherently, that eradicating bugs early in the software life cycle means less work in the maintenance phase.
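As an illustration, here is a minimal property-based test in the QuickCheck style, assuming the PropEr library; the property itself is a stock example, not taken from any system discussed here:

-module(prop_lists).
-include_lib("proper/include/proper.hrl").

%% Reversing a list twice yields the original list. PropEr generates
%% random integer lists and searches for a counterexample.
prop_reverse() ->
    ?FORALL(L, list(integer()),
            lists:reverse(lists:reverse(L)) =:= L).

Running proper:quickcheck(prop_lists:prop_reverse()) hammers the property with generated inputs, which catches a class of errors that unit tests built from hand-picked values tend to miss.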

But interestingly, all this only raises the bar for errors. When we have done all our hard work, the errors that do remain are of the subtle kind: errors which were not caught by our initial guardian systems. Most static type systems won’t capture the class of faults that has to do with slow algorithms or excessive memory consumption, for instance. A proper benchmark suite will, but only if we can envision the failure case up front.
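For instance, a crude timing check (a sketch only; the workload here is made up) will catch an algorithmic regression, but only the one we thought to measure:

%% Time a single call with timer:tc/3. This guards exactly one anticipated
%% failure case; anything we did not envision slips straight past it.
{Micros, _Sorted} = timer:tc(lists, sort, [lists:seq(100000, 1, -1)]),
io:format("lists:sort/1 on 100k elements: ~p microseconds~n", [Micros])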

The class of faults that tends to be interesting is the class that can survive a static type check. The mere fact that we could not capture the fault by static analysis in the compile phase makes the error much more subtle. It also often means the fault is much harder to trigger in production systems. If the fault furthermore survives the test suite, it becomes more interesting still. The viral strain has a certain basic DNA which let it mutate past two barriers of correctness checks. Now it is a latent bug in your software.

Aside: I absolutely love static type systems. I enjoy them a lot when I program in Go, Standard ML, OCaml or Haskell. I am all for the richer description that comes with having a static type system.

There is great power in being able to say v : τ rather than just v, exactly because the former representation is richer in structure. Richer structure helps documentation, makes it possible to pick better in-memory representations, makes programs go faster, and forces a more coherent programming model.
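Even in dynamically typed Erlang one can recover some of this richer structure with type specs checked by Dialyzer. A small sketch, where refresh_count/1 is a hypothetical function:

%% The -spec names the accepted input and the shape of the result, which a
%% bare function name cannot; Dialyzer checks callers against it.
-spec refresh_count(pos_integer()) -> {ok, [binary()]} | {error, term()}.
refresh_count(Limit) when is_integer(Limit), Limit > 0 ->
    {ok, []}.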

Yet, I also recognize that most of the errors caught by static type systems are not interesting. They are of the kind where a simple run of the program will find them instantly.

End of Aside.

Concurrency and Distribution Failures

When systems have faults due to concurrency and distribution, debuggers will not work. The problem is that you can’t stop the world and then go inspect it. A foreign system will expect an answer in time or it will time out. Many modern systems have large parts over which you have no direct control anymore. Such is life in the Post-1991 era of computing, where the internet defines the interface to your program and its components. An Erlang system with two nodes is enough to be problematic: even if I could snapshot one node, the other node will carry on.

The same is true for concurrency errors. They often involve race conditions which must actually trigger. Attaching a debugger alters the execution schedule, making the race condition disappear in the process. The only way to debug such systems is by analysing post-mortem traces of what went wrong, or by inspecting the systems online while they are running.

To make matters worse, a lot of races only occur when data sizes are increased to production-system batches. Say you have a small write conflict in the data store due to inappropriate transactional serialization and isolation. If your test system has few users, this conflict will never show up. And if it does, you will disregard it as a one-time fluke that will never happen again. Yet on the production system, as you increase capacity, this problem will start to occur. The statistical “Birthday Paradox” will rear its ugly head and you will hit the conflict more and more often, up until the point where it occurs multiple times a day.
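A hypothetical sketch of the phenomenon (illustrative code, not from any system mentioned here): a two-step read-modify-write against a shared ETS table loses updates under concurrency, and the loss rate grows with the number of writers:

-module(race_demo).
-export([run/1]).

%% Spawn N writers that each bump a shared counter non-atomically.
run(N) ->
    T = ets:new(counter, [public]),
    ets:insert(T, {hits, 0}),
    Parent = self(),
    [spawn(fun() -> bump(T), Parent ! done end) || _ <- lists:seq(1, N)],
    [receive done -> ok end || _ <- lists:seq(1, N)],
    [{hits, Final}] = ets:lookup(T, hits),
    Final. %% often less than N under load: lost updates

%% The race: lookup and insert are individually safe, but not together.
%% An atomic ets:update_counter/3 would remove the race entirely.
bump(T) ->
    [{hits, V}] = ets:lookup(T, hits),
    ets:insert(T, {hits, V + 1}).

With a handful of writers the counter usually comes out right; with a hundred thousand on a multi-core node, lost updates appear. This is exactly the scale-dependent behaviour described above.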

In conclusion, capturing these kinds of bugs up front is deceptively hard.

The Erlang Shell

The Erlang shell is a necessary tool for producing correct software. Its usefulness is mostly targeted at the maintenance phase, but it is also useful in the initial phases of development. You can connect to an Erlang system while it is running:

(qlglicko@127.0.0.1)3>

This provides a REPL so you can work with the software. But note that this is a REPL on the running production system. If I run commands on the system:

(qlglicko@127.0.0.1)3> qlg_db:players_to_refresh(1000).

{ok,[]}

(qlglicko@127.0.0.1)4>

I hook into running processes, in this case qlg_db, which does connection pooling towards the Postgres database. This allows me to probe the system while it is running to check for its correct operation. Any exported functionality can be probed from the shell.
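For reference, such a shell is commonly attached as a remote shell against the live node; the cookie is elided here and the exact invocation is an assumption about this particular setup:

erl -name debug@127.0.0.1 -setcookie <cookie> -remsh qlglicko@127.0.0.1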

I often keep a module around, named z.erl, which I can compile and inject into the running system:

(qlglicko@127.0.0.1)6> c("../../z.erl").

{ok,z}

(qlglicko@127.0.0.1)7>

This dynamically compiles and loads the z module into the running system, making the functions of the module available for system introspection and manipulation. When debugging hard-to-find bugs on live systems, you need this functionality.
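What goes into z.erl varies with the bug at hand. A hypothetical sketch of such a helper module (these functions are illustrative, not the author's actual file):

-module(z).
-export([queues/0, mem/1]).

%% The five processes with the longest message queues: a quick way to
%% spot a process that cannot keep up with its inbox.
queues() ->
    Ps = [{process_info(P, message_queue_len), P} || P <- processes()],
    lists:sublist(lists:reverse(lists:sort(Ps)), 5).

%% Memory footprint, in bytes, of a registered process.
mem(Name) ->
    case whereis(Name) of
        undefined -> {error, not_registered};
        Pid -> process_info(Pid, memory)
    end.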

And yes, if you want, Erlang nodes contain the compiler application, so they can compile modules.