BLOG ON CAMLCITY.ORG: OMake

Faster builds with omake, part 2: Linux - by Gerd Stolpmann, 2015-06-19

The Linux version of OMake suffered from specific problems, and it is worth looking at these in detail.

Part 1: Overview

Part 2: Linux (this page)

Part 3: Caches (will be released on Tuesday, 6/23) The original publishing is on This text is part 2/3 of a series about the OMake improvements sponsored by LexiFi The original publishing is on camlcity.org

While analyzing the performance characteristics of OMake, I found that the features of the OS were used in a non-optimal way. In particular, the fork() system call can be very expensive, and by avoiding it the speed of OMake could be dramatically improved. This is the biggest contribution to the performance optimizations allowing OMake to run roughly twice as fast on Linux (see part 1 for numbers).

The fork/exec problem

The traditional way of starting commands is to use the fork/exec combination: The fork() system call creates an almost identical copy of the process, and in this copy the exec() call starts the command. This has a number of logical advantages, namely that you can run code between fork() and exec() that modifies the environment for the new command. Often, the file descriptors 0, 1, and 2 are assigned as it is required for creating pipelines. You can also do other things, e.g. change the working directory.

The whole problem with this is that it is slow. Even for a modern OS like Linux, fork() includes a number of expensive operations. Although it can be avoided to actually copy memory, the new address space must be set up by duplicating the page table. This is the more expensive the bigger the address space is. Also, memory must be set aside even if it is not immediately used. The entries for all file mappings must be duplicated (and every linked-in shared library needs such mappings). The point is now that all these actions are not really needed because at exec() time the whole process image is replaced by a different one.

In my performance tests I could measure that forking a 450 MB process image needs around 10 ms. In the n=8 test for compiling each of the 4096 modules two commands are needed (ocamldep.opt and ocamlopt.opt). The time for this fork alone sums up to 80 seconds. Even worse, this dramatically limits the benefit of parallelizing the build, because this time is always spent in the main process.

The POSIX standard includes an alternate way of starting commands, the posix_spawn() call. It was originally developed for small systems without virtual memory where it is difficult to implement fork() efficiently. However, because of the mentioned problems of the fork/exec combinations it was quickly picked up by all current POSIX systems. The posix_spawn() call takes a potentially long list of parameters that describes all the actions needed to be done between fork() and exec(). This gives the implementer all freedom to exploit low-level features of the OS for speeding the call up. Some OS, e.g. Mac OS X, even implement posix_spawn directly as system call.

On Linux, posix_spawn is a library function of glibc. By default, however, it is no real help because it uses fork/exec (being very conservative). If you pass the flag POSIX_SPAWN_USEVFORK, though, it switches to a fast alternate implementation. I was pointed (by Török Edwin) to a few emails showing that the quality in glibc is not yet optimal. In particular, there are weaknesses in signal handling and in thread cancellation. Fortunately, these weaknesses do not matter for this application (signals are not actively used, and on Linux OMake is single-threaded).

Note that I developed the wrapper for posix_spawn already years ago for OCamlnet where it is still used. So, if you want to test the speed advantage out on yourself, just use OCamlnet's Shell library for starting commands.

Pipelines and fork()

It turned that there is another application of fork() in OMake. When creating pipelines, it is sometimes required that the OMake process forks itself, namely when one of commands of the pipeline is implemented in the OMake language. This is somewhat expected, as the parts of a pipeline need to run concurrently. However, this feature turned out to be a little bit in the way because the default build rules used it. In particular, there is the pipeline

$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) | ocamldep-postproc

which is started for scanning OCaml modules. While the first command, $(OCAMLFIND), is a normal external command, the second command, ocamldep-postprocess, is written in the OMake language.

Forking for creating pipelines is even more expensive than the fork/exec combination discussed above, because memory needs really to be copied. I could finally avoid this fork() by some trickery in the command starter. When used for scanning, and the command is the last one in the pipeline (as in the above pipeline), a workaround is activated that writes the data to a temporary file, as if the pipeline would read

$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) >$(tmpfile);

ocamldep-postproc <$(tmpfile)

(NB. You actually can also program this in the OMake language. However, this does not solve the problem, because for sequences of commands $(cmd1);$(cmd2) it is also required to fork the process. Hence, I had to find a solution deeper in the OMake internals.)

There is a now a pre-release omake-0.10.0-test1 that can be bootstrapped! It contains all of the described improvements, plus a number of bugfixes.

There is one drawback of this, though: The latency of the pipeline is increased when the commands are run sequentially rather than in parallel. The effect is that OMake takes longer for a j=1 build even if less CPU resources are consumed. A number of further improvements compensate for this:

Most importantly, ocamldep-postprocess can now use a builtin function, speeding this part up by switching the implementation language (now OCaml, previously the OMake language).

Because ocamldep-postprocess mainly accesses the target cache, speeding up this cache also helped (see the next part of this article series).

Finally, there is now a way how functions like ocamldep-postprocess can propagate updates of the target cache to the main environment. The background is here that functions implementing commands run in a sub environment simulating some isolation from the parent environment. This isolation prevented that updates of the target cache found by one invocation of ocamldep-postprocess could be used by the next invocation. This also speeds up this function.

Windows is not affected

The Windows port of OMake is not affected by the fork problems. For starting commands, an optimized technique similar to posix_spawn() is used anyway. For pipelines and other internal uses of fork() the Windows port uses threads. (Note beside: You may ask why we don't use threads on Linux. There are a couple of reasons: First, the emulation of the process environment with threads is probably not quite as stable as the original using real processes. Second, there are difficult interoperability problems between threads and signals (something that does not exist in Windows). Finally, this would not save us maintaining the code branch using real processes and fork() because OCaml does not support multi-threading for all POSIX systems. Of course, this does not mean we cannot implement it as optional feature, and probably this will be done at some point in the future.)

The trick of using temporary files for speeding up pipelines is not enabled on Windows. Here, it is more important to get the benefits of parallelization that the real pipeline allows.

The next part will be published on Tuesday, 6/23.