Haskell vs. Go vs. OCaml vs. ...

update 31/08/2018 - moved “Libraries” section into a separate post

A couple of years ago I read the paper Haskell vs. Ada vs. C++ vs. AWK vs. … which left me with the impression that Haskell might be a great choice for prototyping. Since then I have done a few toy projects in Haskell and read Real World Haskell and Programming in Haskell (which I found absolutely marvelous), but it was hard to come across a real-world project where I could verify the claim.

A few weeks ago I finally got the chance. The team I am a part of, apart from writing totally awesome things in Elixir/Erlang, in between times maintains a tremendous amount of legacy tools written in Perl with some bits of Python. One of the tools is responsible for daily querying a bunch of RRD files using Perl, enriching the data with some metadata using Python, and dumping it all into a single .csv file with all the metrics collected on the previous day. And when I say “a bunch” I mean almost a million in some cases. No one really looked at the tool for years until the number of files to process increased fourfold to 800000 overnight. The tool could not keep up, effectively making the daily reports unusable. A temporary workaround was to run the tool on a beefy 2x CPU 24-core 128GB RAM HP Gen9 blade (with only one core actually utilised) which was able to run the thing in 9 hours.

It just so happened that when the problem manifested itself I was chatting with my boss about how awesome Haskell was, how OCaml was the implementation language of choice for some modern programming languages, and how most software developers unreasonably turn a blind eye to their great merits. And the boss - otherwise an absolutely rational and pragmatic person - said that my wet dream might have come true and I was to rewrite that tool in Haskell. While I was putting some finishing touches on the project I was on at the moment, eagerly awaiting the start of the Haskell Rewrite Project, another team member found… let’s say an anti-pattern in the Python part of the tool, reducing 8 hours of Python crunching to 20 minutes and thus bringing the total run time of the tool down to 1 hour and 20 minutes. That eliminated the real-world business justification for the rewrite. However, my boss - being one of the most awesome bosses around - allowed me to start the project and prototype the tool in Haskell on the condition that I also write a prototype in Go and have it all done within a week. After that I would present the results of the experiment to the team. I love challenges and I wanted some hands-on experience with Go (which I knew nothing about at that moment), so I happily accepted those conditions.

The Problem

In a nutshell the problem can be described as follows:

- parse command line arguments to gather parameters about the DB connection, in/out folders/files, and something called device groups
- query the required set of devices and relevant metadata from the MySQL database
- based on the device set, deduce the list of RRD files
- for each file in the file set, query data points for the 12:00AM - 11:55PM interval of the previous day
- dump every data point, using a human-readable metric name, device name, etc., as a row to the specified .csv file
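The steps above can be sketched as a tiny Go skeleton. Everything here is invented for illustration - the post shows no code, so the `Device` shape, the RRD naming scheme, and the CSV columns are all my assumptions standing in for the real MySQL query and rrdtool fetch:

```go
package main

import (
	"encoding/csv"
	"os"
	"path/filepath"
)

// Device is a stand-in for a row returned by the MySQL metadata
// query; the actual schema is not shown in the post.
type Device struct {
	Name  string
	Group string
}

// rrdFilesFor deduces the RRD file paths for a device. The naming
// scheme used here is made up purely for illustration.
func rrdFilesFor(root string, d Device) []string {
	return []string{filepath.Join(root, d.Group, d.Name+".rrd")}
}

func main() {
	// A real version would query MySQL here; we stub the result.
	devices := []Device{{Name: "router-1", Group: "core"}}

	w := csv.NewWriter(os.Stdout)
	defer w.Flush()
	for _, d := range devices {
		for _, f := range rrdFilesFor("/var/rrd", d) {
			// A real version would fetch yesterday's 12:00AM-11:55PM
			// data points from f; we emit a placeholder row instead.
			w.Write([]string{d.Name, "some-metric", f, "0"})
		}
	}
}
```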

TL;DR or comparison table

update 17/09/2018 - added multiprocess OCaml and Go+goroutines results

update 28/10/2018 - added async OCaml results

The Go version straight out of the box demonstrated the best performance. The Haskell version came second after some tinkering, and the OCaml version, while slower than the other two, was the best in memory consumption.

I haven’t used goroutines or any other concurrency/parallelism techniques in any version because I wanted a fair comparison with the Perl/Python version and also wanted as simple a solution as possible. Go and Haskell used parallel GC though, so the comparison was not completely fair.
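The “Go + goroutines” row in the table below came from a later update and its code is not shown, but a variant like it could fan the per-file work out over a fixed worker pool, roughly like this sketch (the pool shape and the `processFile` stub are mine, not the post’s):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processFile is a stub standing in for querying one RRD file;
// it just returns the path length so the sketch has a result.
func processFile(path string) int { return len(path) }

// processAll spreads the file list over one worker per CPU and
// sums the per-file results.
func processAll(files []string) int {
	jobs := make(chan string)
	results := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range jobs {
				results <- processFile(f)
			}
		}()
	}
	go func() {
		for _, f := range files {
			jobs <- f
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()
	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(processAll([]string{"a.rrd", "bb.rrd"}))
}
```

Note that even a bounded pool like this keeps many files in flight at once, which is consistent with the much higher memory figure for the concurrent variants in the table.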

It is also worth noting that the Go version used a librrd wrapper and consequently skipped parsing of rrdtool output altogether.
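The other versions, in contrast, had to parse rrdtool’s textual output. `rrdtool fetch` prints a DS-name header followed by `timestamp: value` lines, with `nan` for missing slots; a minimal Go parser for that shape might look like the sketch below (the actual parsing code of the Haskell/OCaml versions is not shown in the post):

```go
package main

import (
	"bufio"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// Point is one data point recovered from `rrdtool fetch` output.
type Point struct {
	Ts    int64
	Value float64
	OK    bool // false when rrdtool printed "nan" for this slot
}

// parseFetch parses the "<timestamp>: <value>" lines that
// `rrdtool fetch` prints; the DS-name header and blank lines
// have no colon-separated timestamp and are skipped.
func parseFetch(out string) []Point {
	var pts []Point
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) != 2 {
			continue
		}
		ts, err := strconv.ParseInt(strings.TrimSpace(parts[0]), 10, 64)
		if err != nil {
			continue
		}
		v, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
		if err != nil || math.IsNaN(v) {
			pts = append(pts, Point{Ts: ts})
			continue
		}
		pts = append(pts, Point{Ts: ts, Value: v, OK: true})
	}
	return pts
}

func main() {
	out := "            speed\n\n1535760000: 1.5e+01\n1535760300: nan\n"
	fmt.Println(parseFetch(out))
}
```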

Run time is how long it took to process those 800000 files.

Memory is the maximum memory consumption for the file set.

LOC is the number of lines of code including imports.

Size is the executable file size (everything linked statically).

Build time is how long it takes to rebuild the executable. All versions but OCaml + glob have only one file. The build time does not take into account building any extra dependencies.

Impl time is how many hours (approximately) it took me to implement a version.

| Version | Run time | Memory | LOC | Size | Build time | Impl time |
|---|---|---|---|---|---|---|
| Perl + Python | 80 min | - | 480+360 | - | - | - |
| Go | 12 min | 1.3GB | 277 | 6.5MB | ~1s | 24 h |
| Go + goroutines | 130 s | 42GB | 301 | 6.5MB | ~1s | 27 h (24+3) |
| Haskell (rushed) | 144 min | 1.7GB | 234 | 4.2MB | 1.7s | 16 h |
| Haskell optimised | 17 min | 1.4GB | 286 | 4.2MB | 1.8s | 39 h (16+23) |
| OCaml + find | 22 min | 60MB | 233 | 2.2MB | 0.3s | 14 h |
| OCaml + glob | 23 min | 100MB | 232+25 | 4.7MB | 0.4s | 18 h (14+4) |
| OCaml multiprocess | 67 s | 30GB | 288+25 | 4.7MB | 0.4s | 23 h (18+5) |
| OCaml + Lwt | 14 min 40 s | 1.6GB | 253+25 | 5.3MB | 0.8s | 24 h (18+6) |

I used the latest stable version of each compiler: Go 1.10.4, GHC 8.4.3, and OCaml 4.07.0. I also tried compiling with GHC 8.2 and OCaml 4.05, but the results were consistent across versions.

Go version

Go had to be the first one because I was not sure whether I would be able to complete it in time and wanted as much leeway as possible. I started with A Tour of Go and some other tutorials, which consumed one weekend of my time. I excluded that time from the table above because I already had some experience with Haskell and OCaml, so counting it did not seem fair.

I found the tooling very good: easy to use and fast. Most things just worked out of the box. There is a single entry point - the go command - which lets you build/install/remove packages and executables.

What I liked the most (although it’s probably irrelevant for many) is that setting up the Emacs environment took me only 30-40 minutes. go-mode, go-eldoc, go-autocomplete, and go-rename required little to no extra tweaking, delivering a very pleasant, virtually bug-free development experience. I especially liked go-playground, which provided me with almost repl-like functionality.

Documentation for the language, the standard library, tools, and third-party packages is extremely good. It is approachable and beginner-friendly, yet comprehensive at the same time.

To sum it up: excellent.

Go’s overall impression and thoughts

Go was easy to set up and easy to use. An abundance of documentation and ready-to-use libraries, written clearly and concisely, helped a lot. Language-wise, the experience was not dissimilar to what I had with AWK as a code-gen tool for a bigger C project - somewhat awkward but straightforward - the KISS principle at its fullest. I think it is a good language that anyone can start hacking in without having to read a Category Theory book.

The flip side of the coin is not that bright, I am afraid. There is no standard way (at the time of this writing at least) to lock dependencies. That is not a problem for small one-off projects, but I can’t imagine how people are able to develop/maintain larger code bases without falling into a cabal-hell-like situation. As with Haskell, NixOS could be an answer to that problem.

Another observation - and it does stand out, especially after the Haskell experience - is that language-wise it feels like “can I have C (or AWK) but without this and that annoyance, please?”. Which is not necessarily a bad thing. I’ve got mixed feelings about it and will probably write a post or two going into more detail.

Haskell version

Evaluating Haskell in that semi-production setting was the main goal of the experiment. I knew enough of the language to start writing code straight away, but I wanted to measure how productive I would be (and how much faster the tool would become). The only problem was that I had only two days left to answer the question.

Back in 2016, haskell-mode with interactive-haskell-mode and structured-haskell-mode provided a well-rounded development environment for Emacs. I am not sure what has changed since then, but I could not glue it all together and spent at least three hours trying the setup. I ended up using intero-mode and brittany for code formatting. It worked reasonably well, but determining a highlighted expression’s type, especially if it involved polymorphic functions, was highly unreliable. I had to use typed holes all the time. Worst of all, the repl within Emacs tended to produce some scary error messages - “unfathomable operation” or something along those lines - every time I called anything IO-related. I had to run the repl in the terminal to avoid that.

Documentation is scarce at best. Most libraries provide but a few pointers or remarks for the authors themselves rather than comprehensive documentation, let alone “getting started” guides or tutorials.

Hoogle is your friend and works fine from the command line/Emacs. It has been invaluable and I wish all languages provided something like it. The problem, however, is that Haddock pages are hard to navigate. “Synopsis” is useless when there are more than a handful of functions (which is true in most cases) because it does not scroll, effectively displaying only the first 10-15 items. I ended up using w3m from Emacs because it was the only sane way to navigate through e.g. 100+ functions in the ByteString library.

stack was OK. I think it has become even better over the past two years. My only gripe is that it lacks a search command and something like deps fetch/update/remove to simplify handling of dependencies.

Haskell’s overall impression and thoughts

Using Haskell was surprisingly hard. When searching for information scattered across the Haskell Wiki, StackOverflow posts, and multiple blogs, I felt more like a detective or investigator than a software engineer. Had I had something akin to Real World Haskell but up to date, the whole process could have been smoother. Alas, I had not, and I struggled.

Another observation I made was that my intuition about performance was almost always off. GHC provides fantastic tools for profiling, and they were the only way for me to improve performance. The laziness of the language changes run-time behaviour drastically compared to strict languages, and that is something to always keep in mind. I should write another post about the journey to cut the run time down from 144 minutes to 17. It is almost a detective story. All I can say here is that the performance hits were never where I expected them to be.

update 05/09/2018 - the post has been published.

OCaml version

I decided to add OCaml into the mix because I wanted to compare Haskell with a strict language that had an equally powerful type system. I also wanted to know if I could write the whole thing without a single type declaration (spoiler: I was not able to).

Setting up the Emacs environment was straightforward. There is an excellent merlin tool that provides auto-completion, type information, and source code navigation, and it integrates with Emacs seamlessly. There is also a fancy repl called utop that I think far exceeds any stand-alone repl out there in terms of usability and number of colours in use. And there is an easy-to-use and feature-rich package manager called opam. I would consider it superior to Haskell’s stack tool in terms of usability, if not feature-wise. Last but not least, the dune build system is insanely fast, although like everything (it would seem) coming from Jane Street it feels somewhat opinionated. Overall my impression of the tools is very positive - on Go’s level or even better.

The story is not that great with regard to documentation. While many packages have concise and comprehensible READMEs, the tutorials on the official website are mostly outdated and/or incomplete. Real World OCaml is also somewhat dated and is heavily biased towards Core - the standard library replacement. There is a newer edition of the book in progress, but it is still Core (or rather Base)-biased and does not touch on some of the other excellent libraries out there.

The ecosystem is not as healthy as Go’s and definitely not as vast as Haskell’s, but I found it good enough and, more importantly, useful. For example, I could install a package whose latest commit was 5-8 years old and it worked straight out of the box. That’s something unimaginable with Haskell.

OCaml’s overall impression and thoughts

I fell in love with OCaml. It’s a simple, very explicit language that is pleasant to work with and has a strong, sophisticated type system that actually guides the development process instead of producing scary, intimidating error messages. The type-level language has a different syntax from the main language, making code easier to “brain-parse”.

More importantly though, all the libraries seem to have been written with the goal of getting the job done, not defending a CS thesis. I used batteries because I was looking for something that could interleave IO with pure computations. I found Enum and a lot, lot more in there. I think it’s an excellent library, and somehow I find it more to my liking than Core.

I was also stunned by how effortless it was to write a C stub. Just throw a C file into the source directory, declare an external function signature, and it’s done! I couldn’t find a “glob” library, so I used the find tool in the first version. The program built a huge pattern and then read the command output through a pipe. It felt somewhat unfair because find was doing all the work. After some research I stumbled upon a piece of code that called a libc function from OCaml. So I wrote ~20 lines of code that wrapped libc’s glob function and used it instead. It did slow down the program a little, but that painless, almost native-like interop with C code was something.

Conclusion

I believe the experiment was successful at illuminating the strengths and weaknesses of the languages in the context of writing small, one-off tools. In other words: if you have a bunch of Python/Perl/Ruby scripts that you want/need to make run faster, which language should you choose?

Go had the best bang for the buck - the best performance per hour spent - straight out of the box. If you have a team of engineers of varying levels of expertise, by using Go you could expect that a) the result would be good-to-great and b) anyone on the team would be able to maintain it.

Haskell’s laziness may be tricky, but the excellent built-in profiling tools remedy that, albeit at the cost of longer development time. More importantly, the library ecosystem leaves a lot to be desired, and coupled with the tendency to have stumbling blocks where there really shouldn’t be any, it turned out to be the worst tool for this particular job. On the other hand, it was the most powerful language I have ever used, and if you have the time/resources to build a set of domain-specific, optimised libraries, it seems capable of solving anything you can throw at it.

OCaml appears to be somewhere in the middle. On the one hand, an insanely fast compiler (much faster than Go’s!), a sophisticated type system, and a more powerful module system than the other two make it extremely pleasant to work with. On the other hand, it is somewhat lacking in centralised documentation. There aren’t that many books oriented at both beginner and advanced levels. The variety of available libraries/packages also leaves much to be desired. The absence of a parallel GC may incur performance hits in some situations, although the low memory footprint somewhat mitigates that.