There should be more language research like Stefan Hanenberg's experiment on how static typing affects productivity. It uses a rigorous methodology which is standard everywhere except in software engineering research: a controlled experiment. It compares a substantial number of programmers solving the same problem in two purpose-built languages which differ only in a single feature: the presence of a static type system.

Unfortunately the languages tested are not state of the art – the dynamically typed language is approximately Smalltalk, and the static types are those of early Java, with required type annotations and casts and no parametric types. Not surprisingly, it found that this sort of static typing is useless or worse. (Curiously, it also found that debugging type errors took longer when they were found statically than when they were found dynamically, but this is weak evidence, because the debugging times may include work other than fixing the type errors.)

This needs to be replicated with parametric types and optional type annotations!

The obvious languages to compare are a minimal pair differing only in the presence of the static checker — perhaps ML and ML-minus-the-typechecker, since that's easy to implement. We can do better, though: we can distinguish between different effects of static typing. There are many hypotheses about how it affects productivity; some of the most plausible are:

It finds bugs faster.

It directs the user's attention to types. (It's not clear whether (when?) this helps or hurts.)

It restricts programs by preventing constructs like arbitrary union types and variable arity.

ML-minus-the-typechecker suffers from the third effect: even though it doesn't have a typechecker, its library is limited to things the typechecker can handle – so no read or eval or type predicates or even printf (without exotic tricks). So if that effect is important, comparing ML to ML-minus-the-typechecker will not detect it.

We can distinguish between these three effects by devising a way to inflict each effect independently. We can start with a dynamically typed language with an appropriate standard library, and vary three features:

To find bugs, use a nonrestrictive (“soft”) typechecker, which detects some type errors but doesn't interfere with running programs.

To selectively divert the user's attention to types, vary the prominence of type declarations: whether they're used in the standard library documentation, whether they're printed in the output of the REPL, whether they're emphasized in the language documentation, and maybe even whether they're enabled at all.

To restrict programs, use a version of the standard library with everything that would confuse the Hindley-Milner typechecker removed or altered. (This is not the typechecker we're testing, but it represents a typical degree of restriction.) This doesn't restrict user code, but restricting the library should have a similar effect, especially on small programs. (We could also restrict user code by running the H-M typechecker on it and delaying error reporting until runtime, to make it less helpful for finding bugs.)

The eight combinations of these three features allow us to test all three hypotheses, and detect interactions between them.

(Note that we don't test the Hindley-Milner typechecker, because it both finds bugs and restricts programs, and we want to distinguish between these effects. It would be interesting to compare it to the nonrestrictive typechecker, though, to compare the two checkers' efficacy at finding bugs. Has this been done before?)

One experiment like this can do more to advance our understanding of type systems than ten type theory papers, because it addresses questions we care about, not just ones we know how to answer. I wish I could do this one myself, but as that's unlikely to happen, I encourage someone else to do it.

(Via It will never work in theory — the world's finest, and probably only, empirical software engineering research blog.)

(By the way, I was surprised to see the Mann-Whitney U-test used for all the significance tests, instead of Student's t-test. I thought the U-test was a special-purpose test for ordinal (not cardinal) data, but apparently some statisticians recommend it as the default significance test, on the grounds that it's more robust and only a little less powerful.)