Big Ideas for Haskell

In a conversation last night, a friend and I talked about the following thought experiment:

If you had a few million dollars and a few years to spend on hiring/recruiting and leading a strike force to make some big improvements in the infrastructure of your favorite programming language… what would you do?

Note that the changes have to be to infrastructure. We aren’t interested in language changes, which tend to get too much play already, but rather in the tools and fundamental libraries available to us. Here’s my list for Haskell.

Project #1: Fix Dependencies in GHC/Cabal

Managing dependencies is a HUGE issue in Haskell right now. Any substantial software project seems to require nearly as much time managing the various versions of different dependent libraries as writing code. And there’s one really big thing that can be done about it: don’t expose dependencies from a library unless they are really, truly needed.

Let’s look at an example. A gazillion different libraries use Parsec internally to parse various things from text, and they use different versions of Parsec. There is no reason why two different versions of Parsec can’t be used in the same program, yet Cabal’s dependency resolution will try very hard to avoid combining two such libraries, just because of some inconsequential implementation details. Parsec isn’t the only such package, either: QuickCheck, for years, split the Haskell world because different packages depended on different versions of QuickCheck for their internal testing. The situation with mtl and transformers is a little more complicated, since it’s possible in principle that libraries that depended on both actually needed the same versions; but poking around a bit reveals that, for the most part, these libraries used mtl or transformers internally to build monads that could very well have been wrapped up in opaque newtype wrappers at the package boundary.

Basically, we have a lot of confusion between implementation dependencies and exposed dependencies.
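
To make the distinction concrete, here is a minimal sketch of a library that uses transformers purely as an implementation detail. Every name in it (Counter, fresh, runCounter) is made up for illustration. If the library exported Counter without its constructor, no type from transformers would be reachable through its API, so the version of transformers it was built against would be invisible to its clients.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
-- Hypothetical library code; all names are invented for this example.
-- A real library would export `Counter` abstractly (without its
-- constructor), along with `fresh` and `runCounter`.
import Control.Monad.Trans.State.Strict (State, evalState, get, put)

-- The State monad from transformers is hidden inside an opaque newtype.
newtype Counter a = Counter (State Int a)
  deriving (Functor, Applicative, Monad)

-- | Hand out the next unused number.
fresh :: Counter Int
fresh = Counter $ do
  n <- get
  put (n + 1)
  return n

-- | Run a computation with the counter starting at zero.
runCounter :: Counter a -> a
runCounter (Counter m) = evalState m 0

main :: IO ()
main = print (runCounter ((,) <$> fresh <*> fresh))  -- prints (0,1)
```

Nothing in the types of fresh or runCounter mentions State, which is exactly the property a hidden dependency would need to have.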

So this idea has two steps. Step #1: instead of just telling GHC about all of the packages your code depends on, you should be able to give it a list of hidden dependencies and a separate list of exposed dependencies. It should check that every type accessible via the exports of your package’s exposed modules is defined either in your package or in one of its exposed dependencies. Step #2: Cabal should get separate fields for exposed and hidden dependencies, and pass them along to GHC. Its constraint solver should then be set up to never fail to install a package merely because different packages need different versions of a hidden dependency.

I don’t know an easy way to measure this, but my strong suspicion is that this would save 80% of the time Haskell programmers currently spend on maintaining the dependencies of large Haskell projects. It would remove a huge obstacle to real-world use of Haskell in one fell swoop. It easily gets the #1 position on my list.

Project #2: Expand STM to External Systems

In my opinion, one of the most underestimated bits of potential in Haskell is software transactional memory. The STM library is, as Don Stewart so elegantly pointed out recently, done and working and quite usable. The problem is that it’s just about transactional memory.

In my experience, a very dangerous source of software bugs is that so much software is written from the perspective of someone standing on the outside, looking in on a transaction. Managing transactions properly can be very difficult, and many programmers just plain get it wrong. I shudder to think of the number of web applications we trust every day that probably have data loss bugs because of poor transaction handling. This is something that desperately needs a solution.

One solution is database stored procedures, and they are used liberally for this purpose. The ability to write code that sees the world from inside of the transaction is, I think, at the root of why it’s generally considered safer to write data-related code in stored procedures rather than in applications. But stored procedures are specific to a single data source. Most significant information stores provide some support for distributed transactions these days… but using them requires doing your data manipulation in application code, which means back outside the transaction. Very little work has been done on writing code at the application level that nevertheless lives inside of a (distributed) transaction. A big reason for that is that applications have always been effectful, and we’ve had no good way to take the side effects of the code in the application itself and manage even that central bit in a transaction-safe way.

Enter STM. Now we do have a nice, working, fully functional system for writing application level code that sees the world from inside of a transaction. STM was developed as a means of writing composable code that gets speedups from parallelism; I think that’s missing the bigger picture. Speedups from parallelism are nice, but what STM really gives us is a way to write composable code that’s organized as concurrent processes acting on data with safety offered by transactions. The next step is to expand those transactions. I want to write code that makes changes to both a database and my data structures in memory, and have that code run as a transaction, which either succeeds or fails as a whole.
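
For readers who haven’t used it, here is a small sketch of what “inside the transaction” looks like with the stm package as it exists today. The transfer function and the three balances are invented for the example. Note that this covers only the in-memory half of the story; the database half is the part that doesn’t exist yet.

```haskell
import Control.Concurrent.STM

-- | Move an amount between two in-memory balances (hypothetical example).
-- This is written from inside the transaction: no locks, no cleanup code.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  modifyTVar' from (subtract amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  c <- newTVarIO 0
  -- Two transfers composed into a single transaction: no other thread
  -- can observe the state in between them.
  atomically (transfer a b 30 >> transfer b c 10)
  balances <- mapM readTVarIO [a, b, c]
  print balances  -- prints [70,20,10]
```

The point is the composition on the atomically line: transfer knows nothing about being part of a larger transaction, yet the combined action still commits as a whole.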

There are challenges here, to be sure: STM’s crucial retry operation is unlikely to be supported by any kind of external transaction system, for example, and the transaction models of different external systems may be hard to bring together under a single interface (savepoints? nested transactions? isolation levels?). But even a simple lowest common denominator here would be a huge step forward.
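
To see why retry in particular is awkward, here is a sketch of how it behaves for plain memory. The withdraw function is invented for the example. When a transaction calls retry, the runtime abandons it and re-runs it only after one of the TVars it read has changed. A database can roll a transaction back, but it generally has no way to tell us when it would be worth trying again.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM

-- | Withdraw from a balance, waiting until enough money is available
-- (hypothetical example).
withdraw :: TVar Int -> Int -> STM ()
withdraw account amount = do
  balance <- readTVar account
  if balance < amount
    then retry  -- abandon this attempt; re-run once `account` changes
    else writeTVar account (balance - amount)

main :: IO ()
main = do
  account <- newTVarIO 0
  done <- newEmptyTMVarIO
  _ <- forkIO $ do
    atomically (withdraw account 50)   -- waits if the deposit hasn't landed
    atomically (putTMVar done ())
  atomically (writeTVar account 80)    -- the deposit wakes the withdrawal
  atomically (takeTMVar done)          -- wait for the other thread
  readTVarIO account >>= print         -- prints 30
```

The final balance is 30 whichever thread happens to run first, because the withdrawal either waits for the deposit or runs after it.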

I feel like Haskell has a huge opportunity here to be the first popular general-purpose language in which it’s possible to easily write and compose code that is really transaction-safe. It would be a true shame to miss the opportunity.

Project #3: Universal Serialization

This is cheating a little bit, because it’s almost asking for a language feature… but I can get away with it because it doesn’t actually require a change to the language: just exposing a somewhat magical new API from the compiler.

Basically, it’s possible to take a lot of types in Haskell, and turn them into some kind of serialized form from which they can be recovered later. This is what Data.Binary (from the binary package) and Data.Serialize (from the cereal package) both do. But it’s not possible for all types. First-class functions, for example, cannot be serialized. It’s not just that no one has written the code yet; in Haskell, it simply can’t happen.
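
Here is roughly what the working case looks like with Data.Binary today, as a minimal sketch. The Point type is invented for the example; the instance is filled in generically by the binary package.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Binary (Binary, decode, encode)
import GHC.Generics (Generic)

-- Hypothetical record type, used only for this example.
data Point = Point { px :: Int, py :: Int }
  deriving (Eq, Show, Generic)

-- Empty instance body: binary derives the methods from the Generic
-- representation of Point.
instance Binary Point

main :: IO ()
main = do
  let original = Point 3 4
      bytes    = encode original         -- a lazy ByteString
      restored = decode bytes :: Point
  print (restored == original)           -- prints True
  -- By contrast, this line would be rejected by the type checker,
  -- because there is no Binary instance for function types:
  --   encode ((+ 1) :: Int -> Int)
```

The round trip works because Point is plain data all the way down. The commented-out line is the case the rest of this section is about.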

But in a magic library provided by a Haskell compiler, it definitely could happen.

You’d have to be willing to live with a few limitations: recompiling your code would most likely invalidate serialized first-class functions. The interpreter is an even harder challenge. Serialized values would need to embed checksums of various bits of the code, and I haven’t even begun to think through which checksums would be needed. But it should be possible. After all, a closure is just a function pointer (to a function which has some offset in the compiled binary code) together with an environment consisting of values of other types. Throw in some smart handling of cyclic data structures based on a lookup table, and you’ve got something.
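
The “function pointer plus environment” view can be illustrated by hand today with defunctionalization. To be clear, this is a manual workaround and not the compiler-provided API I’m asking for: in the sketch below, the constructor name stands in for the code pointer, the constructor’s fields stand in for the captured environment, and Show/Read stand in for a real wire format. All of the names are invented for the example.

```haskell
-- A closure made explicit: a tag naming the code, plus the values it
-- captured. Each constructor corresponds to one lambda in the program.
data Closure
  = AddN Int              -- stands for \x -> x + n
  | ScaleThenAdd Int Int  -- stands for \x -> x * k + n
  deriving (Eq, Show, Read)

-- | Interpret a closure as the function it represents.
apply :: Closure -> Int -> Int
apply (AddN n)           x = x + n
apply (ScaleThenAdd k n) x = x * k + n

main :: IO ()
main = do
  let f    = ScaleThenAdd 3 1
      wire = show f                -- "serialize" the closure to a String
      g    = read wire :: Closure  -- recover it, possibly somewhere else
  putStrLn wire                    -- prints ScaleThenAdd 3 1
  print (apply g 10)               -- prints 31
```

The cost is obvious: every function that might be serialized has to be listed in Closure ahead of time, and both ends have to agree on that list. A compiler that already knows every code pointer and every environment layout could do this bookkeeping for us, which is where the checksums mentioned above would come in.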

Even a half-baked attempt to make higher-order programming with functions and closures compatible with the existing world of SQL databases and binary network protocols would be a huge benefit.