By pascal on Friday, October 18 2013, 18:30 - Permalink

The OCaml compiler can generate native code for various architectures. One of the few code generation options is named -compact :

$ ocamlopt -help Usage: ocamlopt <options> <files> Options are: ... -compact Optimize code size rather than speed ...

One of the effects, perhaps the only effect(?), of this option is on the code generated for allocation. In OCaml, a small block typically gets allocated from the minor heap. When all goes well, allocating in the minor heap means simply decrementing the pointer that separates the already-allocated top of the minor heap from the still-available bottom of the minor heap. Eventually the minor heap becomes full; what happens then is complicated. But the fast path is basically a subtraction and a conditional branch (usually not taken) to the complicated stuff.

With option -compact , an allocation simply calls a routine that does all the above. Without it, the subtraction and conditional jump are inlined at the site where the allocation is requested. OCaml being a typical functional language, there are plenty of these. Inlining a couple of instructions makes the code bigger, but when execution stays on the fast path, which is often, a routine call is avoided completely.

A small example

Consider the one-line OCaml function let f i = Some i . This function takes a value i and returns a Some memory cell that points to i . I intended i to be an integer, but I inadvertently wrote code that works for any type of i . This was possible because in OCaml, all types of values share a uniform representation, as discussed in the introduction of a previous post.

A Some cell occupies 16 bytes (yes, this is huge. Blame 64-bit computing). The fast code generated by ocamlopt -S t.ml , where L102 is the label at which selcom-called code manages the case when the minor heap is full:

... subq $16, %r15 movq _caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb L102 ...

The -compact option causes more compact code to be emitted for the allocation. When our function f is compiled with ocamlopt -S -compact t.ml , the sequence generated for the allocation is much shorter indeed. There is even a specialized allocation function for allocating 16-byte blocs, so no arguments need to be set up:

... call _caml_alloc1 ...

Effects of option -compact in the large

When the development version of Frama-C is compiled with option -compact , the binary takes 14.4MB.

Without option -compact, the Frama-C binary weights in at 15.4MB.

One additional megabyte of code is not going to hurt us at this point, right? It is not as if the Frama-C binary had to fit exactly in a box of ten floppy disks. As long as the larger code is faster, it is worth it…

$ time bin/toplevel.lean_and_mean tests/test/adpcm.c -sparecode > /dev/null user 0m0.843s $ time bin/toplevel.big_and_slow tests/test/adpcm.c -sparecode > /dev/null user 0m0.856s

… but it turns out that the larger code isn't faster. In the case of Frama-C, and of the computer I am typing on, at least, the code generated with option -compact is smaller and faster.

Conclusion

It is not easy to try to predict performance on modern out-of-order architectures, but if I had to hazard an explanation, I would say that, besides the obvious cache concerns, having many scattered predictable conditional branches may be a losing proposition compared to one single conditional branch visited from many places with call and ret . Every conditional branch that isn't in the branch prediction buffer is at risk of being mispredicted. In addition, each entry taken by the branches that have recently been visited is an entry that isn't available for branches translated from conditionals in the source code.

In contrast, the execution of the call and ret instructions has been optimized over the years. The target of the ret instruction is predicted with a Return Stack Buffer that has a conceptually simpler job than the Branch Predictor (although it is clearly possible to write code for which the ret instructions cause a stall, which could re-establish an advantage for the partially inlined version).