I recently upgraded my phone to a Galaxy S3, an ARMv7 device with 2GB of RAM and a quad-core 1.3GHz processor. Installing Arch Linux on it seemed like the right thing to do (only in a chroot; I’m not brave enough to boot directly into it). Having done that, I thought it’d be interesting to do some benchmarking and see how various language implementations perform on ARM compared to my x86-64 laptop (Intel dual-core i5 M430, 2.27GHz).

This benchmark involves calculating a large number of Fibonacci numbers repeatedly. The exact code used is available here. Note that the purpose of this benchmark is to compare the ARM and x86 implementations of each language, not to compare the different languages to each other, so no effort has been put into optimising the different implementations.

The benchmarks were run with an input of 500 million. Each language’s speed is expressed as a percentage of the speed of the C implementation on the same machine (the C running time divided by the language’s running time, times 100). Expressing speed relative to C rather than in absolute terms minimises the effect of the two processors’ different speeds and makes the platforms easy to compare.
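As a minimal sketch of that calculation (the function name `percent_of_c` is my own, not something from the benchmark scripts), here it is applied to three of the x86-64 running times reported in this post:

```python
def percent_of_c(c_ms, lang_ms):
    """Speed of a language as a percentage of C's speed.

    Speed is inversely proportional to running time, so the ratio is
    C's time divided by the language's time.
    """
    return 100.0 * c_ms / lang_ms


# Running times in ms, taken from the x86-64 results in this post.
x86_ms = {"C (Clang)": 3726, "Go": 8489, "Rust": 3910}

for name, ms in x86_ms.items():
    print(f"{name}: {percent_of_c(x86_ms['C (Clang)'], ms):.4f}% of C speed")
```

A language that takes twice as long as C therefore scores 50, regardless of how fast the machine is in absolute terms.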

The results from my x86-64 laptop are as follows (input of 500 million):

| Language | Running time (ms) | % C speed |
|---|---|---|
| C (Clang) | 3726 | 100 |
| C# (Mono) | 7625 | 48.8656 |
| Go | 8489 | 43.8921 |
| Haskell | 9839 | 37.8697 |
| Java (OpenJDK) | 4237 | 87.9396 |
| LuaJit | 15135 | 24.6184 |
| Ocaml | 9668 | 38.5395 |
| Racket | 22785 | 16.3529 |
| Rust | 3910 | 95.3 |

The results from my ARMv7 Galaxy S3 (input of 500 million):

| Language | Running time (ms) | % C speed |
|---|---|---|
| C (Clang) | 5051 | 100 |
| C# (Mono) | 52736 | 9.5779 |
| Go | 12246 | 41.2461 |
| Haskell | 22304 | 22.6462 |
| Java (OpenJDK) | 50079 | 10.0861 |
| LuaJit | 21738 | 23.2358 |
| Ocaml | 28774 | 17.554 |
| Racket | 56519 | 8.93682 |
| Rust | 5741 | 87.9812 |

The performance of each language on ARMv7 expressed as a percentage of its x86-64 performance:

| Language | Performance (% x86-64 speed) |
|---|---|
| C | 100 |
| LuaJit | 94.3839 |
| Go | 93.9716 |
| Rust | 92.3190 |
| Haskell | 59.8003 |
| Racket | 54.6498 |
| Ocaml | 45.5481 |
| C# (Mono) | 19.6005 |
| Java (OpenJDK) | 11.4693 |
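The figures in this last table appear to be each language’s “% C speed” on ARM divided by its “% C speed” on x86-64. The sketch below reproduces that for three of the languages; it is a reconstruction from the published numbers, not the actual logic of `fibdiff.sh`, and the names `x86_pct`, `arm_pct` and `arm_vs_x86` are made up for illustration.

```python
# "% C speed" values copied from the two results tables in this post.
x86_pct = {"Go": 43.8921, "LuaJit": 24.6184, "Java (OpenJDK)": 87.9396}
arm_pct = {"Go": 41.2461, "LuaJit": 23.2358, "Java (OpenJDK)": 10.0861}


def arm_vs_x86(name):
    """ARM performance as a percentage of x86-64 performance,
    with both sides already normalised against C on their own platform."""
    return 100.0 * arm_pct[name] / x86_pct[name]


# Best-preserved performance first, as in the table.
ranking = sorted(x86_pct, key=arm_vs_x86, reverse=True)
for name in ranking:
    print(f"{name}: {arm_vs_x86(name):.4f}")
```

Because both inputs are relative to C, C itself always comes out at 100, which is why it sits at the top of the table.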

The benchmarks can be run with `sh fib.sh 500000000 true`, replacing true with false if the executables have already been compiled. This prints an HTML table to stdout and stores the results in text form in `./sortedResults`. Rename that file to something like armResults or x86Results; two such files can then be compared using `sh fibdiff.sh armResults x86Results`, which outputs the above HTML table to stdout.

Again, note the purpose of this benchmark is not to compare the languages with each other, but to see how well each one performs on ARM compared to on x86.

Interestingly, the Rust implementation actually runs faster on ARM. I suspect this indicates a bug of some sort in the compiler, as Rust generally performs at close to C speed. Both the ARM and x86 implementations are running on the latest Arch Linux build, but the Rust version on ARM appears to be “0.11.0-pre”, whereas on x86 it’s “0.11.0”, so it’s possible there was some regression between those two versions.

*Edit: the Rust version used the Rust int type, which is 64bit on x86-64, whereas C’s int is 32bit. This exaggerated Rust’s speed relative to C on ARM. The results table has now been updated to show results from using Rust’s fixed size i32 type, which is much faster on x86-64. It can now be seen that Rust manages to perform quite well on both platforms, which makes sense considering its use of the LLVM backend.

*Another edit: according to strncat on Reddit: “Rust is slower than C at this task on ARM (and x86_64, once the 64-bit overhead is gone) because it uses LLVM’s segmented stack support for stack overflow checking. The sane alternative is outputting a one byte write in multiples of the guard size, so 99.5% of functions won’t have any overhead, and the remaining ones will only have some writes rather than the branch that is there today.”

Java on ARM is hilariously slow, 1/10th of its x86 speed, which I suspect may be because the JVM I’m using, OpenJDK, doesn’t yet have a JIT compiler. Or, if it does, it’s not particularly well developed. The Oracle JVM is apparently faster on ARM.

C# (via Mono) is also much slower on ARM, 1/5 of its x86 speed. Presumably ARM code generation hasn’t received as much attention as x86.

LuaJIT’s performance on ARM, almost equal to its x86 performance, proves once again that Mike Pall is a genius. Interestingly, as everything is a double in Lua, it may actually have an advantage in this case: there’s a modulus operation in the code, and ARM doesn’t (as far as I’m aware) have hardware support for integer division, only float division.

The Go compiler also runs surprisingly well on ARM, nearly matching its performance on x86. I’m not sure whether this is due to lots of work having been put into ARM code generation or not so much work being put into x86 code generation, but it certainly bodes well for future plans to allow the use of Go in the Android NDK.

Racket, OCaml and Haskell all run at about half their x86 speed on ARM, which seems reasonable: I can’t imagine they’re often run on ARM, so ARM performance probably hasn’t received much attention.

In terms of the C implementation, it’s interesting to note that its absolute speed on the ARM device is about 74% of its x86 speed (3726 ms versus 5051 ms), in spite of the x86 device being a 16 inch i5 laptop and the ARM device being just a 5 inch phone.

Miscellanea

The Haskell implementation is named hsfib.hs instead of fib.hs, breaking from the naming conventions of the other implementations. Why is this? If I name it fib.hs, GHC will notice the fib.o left over from compiling the OCaml implementation and try to link that, with hilarity ensuing soon thereafter. This could be avoided if GHC checked whether the object files it was attempting to link were actually Haskell object files.

Getting an Integer from a Number in Typed Racket seems to take a lot of work:

`(numerator (inexact->exact (real-part n)))`

Although there’s probably a simpler method I’m missing.

There’s a Common Lisp implementation in the repo, but the ARM support of SBCL is quite recent and I couldn’t get it to build.

At least 70% of the C# implementation is just copy pasted from the Java implementation. Syntactically those languages seem even more similar than Lisp and Scheme.

Recursion in Lua is interesting. Doing a comparison like `if n == 1` does not work out well, due to the nature of floating point equality, so `if n < 1` is necessary instead.
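The Lua source isn’t reproduced here, so the following is only a generic illustration of the hazard, written in Python (whose floats are the same IEEE doubles Lua uses). The fractional step of 0.1 and the helper name `steps_until` are my own assumptions, chosen to make the effect visible; the benchmark’s actual arguments may differ.

```python
def steps_until(stop, n=1.0, step=0.1, limit=50):
    """Recurse, subtracting `step` from `n`, until stop(n) is true.

    Returns the number of recursive steps taken. `limit` is a safety
    net so a base case that is never hit cannot recurse forever.
    """
    def go(n, count):
        if stop(n) or count == limit:
            return count
        return go(n - step, count + 1)

    return go(n, 0)


# Equality base case: repeated subtraction of 0.1 never lands on exactly 0.0,
# so the recursion only stops because of the safety limit.
exact = steps_until(lambda n: n == 0.0)

# Ordered base case: stops as soon as n drops below the threshold.
ordered = steps_until(lambda n: n < 0.05)

print(exact, ordered)
```

The ordered comparison terminates after the expected ten steps, while the equality test sails straight past zero; that is the general reason an inequality is the safer base case when the counter is a float.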

The correctness of the Rust and C implementations is actually optimisation dependent. If compiled with `-O`, they use constant space; if compiled without it, they use linear space and fail with a stack overflow. Never, ever write code like this for any project that’s even remotely important.
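The difference comes down to whether the compiler turns the recursive call into a loop. Python never performs that transformation, so it behaves like the unoptimised build and makes a convenient stand-in. This is a sketch, not the benchmark’s code: the function names and the modulus value are hypothetical (the post only says the code contains a modulus operation).

```python
import sys

MOD = 1_000_000_007  # hypothetical modulus, chosen only for illustration


def fib_tail(n, a=0, b=1):
    """Tail-recursive form: the recursive call is the last thing done.

    Without tail-call elimination each call still gets its own stack
    frame, so stack use grows linearly with n.
    """
    if n == 0:
        return a
    return fib_tail(n - 1, b, (a + b) % MOD)


def fib_loop(n):
    """The loop an optimiser can rewrite the tail call into: constant stack."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, (a + b) % MOD
    return a


deep = sys.getrecursionlimit() * 2  # deeper than the interpreter allows

try:
    fib_tail(deep)
    overflowed = False
except RecursionError:
    overflowed = True

print("recursive version overflowed:", overflowed)
print("loop version finished:", isinstance(fib_loop(deep), int))
```

Both functions agree on small inputs; only the recursive one falls over on large ones. Writing the loop by hand removes the dependence on an optimisation flag.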

OCaml’s Emacs support is really awesome. I installed a couple of plugins, Tuareg and Merlin, and not only did they give me autocomplete but also automatic error checking whenever I save. I was particularly impressed when it combined with OCaml’s type-checking to warn me that the float I was accidentally passing to printf didn’t match the type required by the %d in the format string; even Haskell’s default printf function doesn’t do compile time checking of the format string.

*Edit: I was asked why I didn’t include Nimrod. Nimrod compiles to C, so its speed relative to C should be about the same on both ARM and x86. Seems like it would be a good option for ARM development.

Bonus

I ran the Java implementation of the benchmark under Dalvik. The results are rather underwhelming, to say the least. Here are the results from running with an input of 500 million (the same as above), with the Dalvik results included.

| Language | Running time (ms) | % C speed |
|---|---|---|
| C | 5051 | 100 |
| C# (Mono) | 52736 | 9.5779 |
| Dalvik | 75699 | 6.6725 |
| Go | 12246 | 41.2461 |
| Haskell | 22304 | 22.6462 |
| Java (OpenJDK) | 50079 | 10.0861 |
| LuaJit | 21738 | 23.2358 |
| Ocaml | 28774 | 17.554 |
| Racket | 56519 | 8.93682 |
| Rust | 5741 | 87.9812 |

Now that’s why the new ahead-of-time-compiled Android RunTime is A Good Thing. Also possibly part of the reason why the Android UI is less fluid than the iPhone’s, and I say this as someone who wouldn’t use an iPhone if you paid me. Where would Objective C appear in that table? Right at the top; Objective C is a superset of C, so valid C code is generally valid Objective C code, making it extremely easy to write fast code in Objective C (just write C).

Note however that I ran the Dalvik file (compiled with `dx --dex --output=fib.dex fib.class`) with `dalvikvm -cp fib.dex fib`, the simplest way to run it. It’s quite possible there’s a better way to compile/run it that makes it faster; I’m not particularly familiar with Android development. There apparently exists a tool called `dexopt`, which is run automatically when a new apk is installed, but I’m not sure whether it’s run when I execute a .dex file manually with `dalvikvm`. Running `dexopt` tells me to look at the dexopt source for usage information; I don’t have time to figure it out right now, but if someone wants to send me a script for running dexopt on a .dex file I’ll be happy to test it.

Moral of the story: performantly porting a JIT compiler to ARM is difficult (unless you’re Mike Pall).

*Edit: apparently development of the ARM port of LuaJIT was sponsored, so the developer may have had more incentive to optimise it than the developers of the ARM ports of OpenJDK and Mono.