TL;DR: I explain how I improved the performance of an interpreter for WebAssembly written in Haskell by plucking some low-hanging fruit.

Background

Motivated by my work at the DFINITY Foundation, I was looking into interpreters for WebAssembly written in Haskell, and found my colleague John Wiegley’s winter : A straight-forward port of the WebAssembly reference interpreter, written in Ocaml by Andreas Rossberg (another colleague of mine … I guess there is a common theme here.)

Obviously, an interpreter will never be as fast as a real compiler implementation (such as the one in v8, lucet or wasmtime). But for my purposes an interpreter is fine. Nevertheless, I don’t want it to be needlessly slow, and when I picked up wasm , it was clear that I had to work at least a little bit on performance.

None of this is to blame John, of course: I have no doubt that he could have sped it up at least as well! I assume it just wasn't a priority at that time, and winter is a nicely done piece of Haskell.

A graph

I like statistics and graphs! In this blog post, I am always running the following very simple C program:

unsigned int i; void init () { i = 1024 * 1024 ; while (i--) {}; }

which I compile using clang version 9 and run with wasm-invoke from the winter package (compiled with GHC-8.6.5):

$ clang-9 --target=wasm32-unknown-unknown-wasm -fno-builtin -ffreestanding loop.c --compile $ wasm-ld-9 loop.o --gc-sections --export init --no-entry -o loop.wasm $ wasm-invoke -w loop.wasm -f init +RTS -t

The +RTS t tells the Haskell runtime to produce a line with some statistics, of which I collect

total number of byte allocated

maximum memory size used and

total time elapsed

To reduce the noise a little bit, I run the program five times, and take the minimum of all the runs.

I did this for all the commits I produce while trying to make the code faster. This yields the following graph:

Note that I had to draw this with a logarithmic y axis, because some of the improvements were so drastic that a linear graph would not be helpful.

But how did I get there? I will take you through a guided tour through all the commits that have improved performance, explain how I found an performance issue, what I did to improve things and why that improved things. To keep this post at a reasonable size, I split it into a series of blog post, faithful to the chronological order in which I performed them:

Here they are:

Was it worth it? I’d say so! Initially, running the program above would take 44 minutes (slow!) and continuously grows to require 6,7 GB of memory (Houston, we have a space leak!).

After all the steps above, memory consumption is constant and below 2MB, and the program finishes in 3,6s. Overall, this yields the following, very satisfying, statistics:

Improvement: Allocations: -99.60% Memory: -99.97% Time: -99.86% (Commits de9f4f8...5406efd)

By the way, the Ocaml reference interpreter takes 2,92s to run this (and it does a bit more, such as validation of the module). So there is still room for improvement…