My earlier post Four MLs (and a Python) listed a toy example program written in four languages in the ML family: Standard ML, OCaml, Yeti, and F♯.

Perhaps unwisely, I measured and reported the runtimes for each version of this program when processing a test file on my Linux laptop. The figures I got were, in seconds:

SML OCaml Yeti F♯ 16.4 7.8 14.4 13.9

At this point I have to repeat what I said in the earlier post—this is not a meaningful benchmark; these are all pretty good; they’re not so far apart that you can’t imagine a slightly different test case reversing the order.

Even so, it puzzled me why this particular SML code (compiled using the MLton compiler) ran slower than the OCaml. They’re similar languages, well-established, both compiled to native code using optimising compilers. What’s the difference?

I took a rambling, not-entirely-conclusive look at this, with some detours along the way.

First, the OCaml

Several commentators noted that the OCaml implementation could have been sped up by hoisting regexp construction out of the inner function. I tried out a couple of the suggested alternatives:

Original from Marek from Dora 7.8 7.4 7.3

So it makes a bit of a difference, but not as much as I’d expected. (Both alternatives are nicer to read than my own code was, though.) Another comment here proposes using the Core library from Jane Street instead of the supplied Str module, but I haven’t yet tried that.

Now, the Standard ML

The first thing I thought I’d do, since Standard ML is a standard that has many implementations, was to try the program in more than one of them.

SML environments

Here are all the SML implementations I could find that showed any signs of life—I’d be interested to hear of any I missed.

Note that, unlike the previous survey, these are not different languages. They all implement the same language, even if in practice many of them also have extensions reflecting whatever research projects engaged their authors. All are open source.

MLton is the compiler I used in my original post. It’s an optimising native-code compiler that aims to produce fast code complying with the SML standard without any novel extensions. Unlike the other implementations here, it is only a compiler and does not include an interactive environment (or REPL, read-eval-print-loop).

Standard ML of New Jersey (SML/NJ) is the most venerable complete implementation and, I think, the only one of these that existed the last time I wrote any ML. It provides an interactive environment but doesn’t have a straightforward native-code compiler. It includes a number of language and library extensions, and I avoided it for my earlier test because I was keen to stick to the standard parts of the language and I’d heard it was hard to generate a native binary with.

Poly/ML is a compact distribution including an interactive environment and a native compiler. I’m not sure when it first appeared, but it’s been around long enough to have been mentioned in the 1996 edition of Larry C Paulson’s excellent ML for the Working Programmer. (As an aside, it’s a pity to see that book available only either second-hand or in hugely expensive academic pricing, even after almost 20 years. Almost everything in it is still relevant and enlightening to a reader with an interest in functional programming. I’m sure an online version would be well-loved, though that’s probably too big a job for too little reward to hope that it would ever appear. Anyway I hope it sells enough to be useful to the author still, because it’s a brilliant piece of work.)

Moscow ML is another neat distribution with both interactive environment and native compiler. A nice quality of Standard ML, at least until you start writing code that depends on non-standard libraries, is that you can try out your code in more than one compiler and compare their error messages and the like: Moscow ML is fast to run, and has (for my taste) the most readable error messages of these.

MLKit is a modular distribution that also dates back to the 90s. The name made me think abstractly of Apple platform libraries, but it has nothing to do with those. Like MLton it has a compiler but not an interactive environment.

Alice ML is an implementation that extends the language with concurrency primitives and various other interesting bits and bobs. It consists of an interactive environment and bytecode compiler, without a native code compiler.

SML♯ is an SML implementation from Tohoku university in Japan. Just as MLKit has nothing to do with Apple’s *Kit frameworks, so SML♯ has nothing to do with Microsoft’s C♯ and F♯. I was unable to get it to build on my usual system (64-bit Arch Linux with LLVM 3.6) so I ran it in a 32-bit Ubuntu 14.04 container instead. Rather shockingly, when compiling with the SML♯ v1.2 distro package I got completely wrong answers—it didn’t parse negative exponents correctly when reading from file. I installed the official SML♯ v2.0 instead, and that gave the right answers.

All of these were able to run the test program. They all took an essentially negligible amount of time to compile it, except for MLton and SML♯ which took a second or two. Here’s how long the resulting code took to process my test file (in seconds).

MLton

20130715 SML/NJ v110.77 Poly/ML 5.5.2 Moscow ML 2.10 MLKit 4.3.7 Alice ML 1.4+ SML♯ 2.0 16.4 9.3 * 25.6 99.1 33.5 579.5 ** 173.9

(Notes:

* I know that it’s possible to make SML/NJ produce a native binary, but I still haven’t taken the time to figure out how to do it. This time comes from adding a timing wrapper in the program itself and then importing it into the interactive environment.

** Alice ML has three JIT compiler settings; this was mode 1, which proved the fastest.)

So SML/NJ is not far off OCaml; it’s clearly possible. We can see from the other columns why I said the timings in the original article were all pretty good—Moscow ML is a nice environment to develop with but it’s much slower here, as is SML♯, and Alice ML is clearly not well suited to this particular job.

So where did the time go?

MLton comes with a simple-to-use instrumenting profiler; running it on this task gives:

function cur ---------------------------------- ----- to_number 51.1% values_of_line 10.7% <gc> 6.3% fold_stream 5.9% IEEEReal.scan.digitStar 4.7% IEEEReal.scan.digitStar.done 4.4% Sequence.Slice.concat 3.8% IEEEReal.scan.done 3.5% Integer.fmt 3.3% values_of_line.fn 2.0% IEEEReal.scan.normal' 1.8% Sequence.concat 1.7% PosixError.SysCall.simpleResultAux 0.3% add_to 0.3% IEEEReal.scan.afterE.neg 0.1%

It’s spending about 65% of its runtime converting strings to floating-point numbers. (Most of that is reported as spent in my own function, which does nothing except call out to the Basis library: I guess this is because MLton has inlined the conversion code, since it’s only called in one place.)

For a quick sanity check, I replace the call to Real.fromString with a function that only pretends to convert the string:

fun fake_real_from_string str = if str = "" then NONE else SOME ((Real.fromInt (length (explode str))) / 10000.0)

… and the runtime is reduced from 16.4 seconds to 4.5. We now get the wrong answers, but it’s clear that this is where most of the time went.

What does Real.fromString consist of? The implementation supplied with MLton appears to be this section of real.sml, calling out to this code in IEEE-real.sml. It looks as if the IEEE-real structure does some cleaning and preprocessing and parses into a standard format, which is then converted to a floating-point value by the code in real.sml.

For comparison, here’s the primitive as defined in the OCaml standard library; it calls out to this C function in the OCaml runtime, which simply calls strtod from the C library. A quick scan of the OCaml version with the sampling profiler sysprof shows that it spends around 40% of its runtime in strtod . That’s about 3 seconds, compared to more than 10 for MLton.

It’s no surprise that simply calling the C library might be faster. I wonder why they take such different approaches. My first thought was that the MLton version, which has a look of careful code about it, was a pure SML solution, but then I noticed that ultimately it also calls out to a C library function to do the eventual conversion ( strtor , from the gdtoa library).

Still, all I really know about parsing floats is that it’s tricky and full of pitfalls and I certainly wouldn’t trust myself to do it right. Searching for discussion about MLton’s Real.fromString implementation, I run into this post, which says, “Real.fromString is difficult to implement quickly in SML because of the requirements for getting good-to-the last bit conversions and getting all the corner cases right” and notes that it’s likely only a problem in pathological cases like this one.

So my question isn’t quite answered, although that’s enough to satisfy my own curiosity for the moment. But there is one thing I’m going to try before I close:

Hacking it with the FFI

MLton has a foreign function interface (FFI) for calling C functions; why not just call strtod ourselves? Or its simpler sibling atof?

fun to_number str = let val atof = _import "atof": char array -> real; in (* Pass a C-style null-terminated string *) atof (Array.fromList (explode (str ^ "00"))) end

Compiling with

$ mlton -default-ann 'allowFFI true' sum.sml

this actually runs fine, taking about 10.5 seconds, four less than the original. The results for this particular test run are the same, but, well… you’d have to be a bit desperate.

Conclusions

In both SML and OCaml, the largest slice of runtime of my test program went to string-to-floating-point conversions; OCaml performs these by calling out to the C library directly, while MLton has a more laborious implementation (I don’t entirely understand why and would very much appreciate hearing from anyone who does)

MLton seems to perform well as a compiler, given the more elaborate library code

MLton’s foreign-function interface seems straightforward, at least for simple examples (though of course this is specific to the compiler and doesn’t work in other ML implementations)

The OCaml program seems to be still the fastest even after accounting for this, and I hazard that that may be because other library functions are similarly minimal

This was indeed a pretty terrible benchmark