The Garbage Collector probably is OCaml’s greatest unsung hero. Its pragmatic approach allows us to allocate without much fear of efficiency loss. In a way, the fact that most OCaml hackers know little about it is a good sign: you want a runtime to gracefully do its job without having to mind it all the time.

But as OCaml 4.10.0 has now hit the shelves, a very exciting feature is in the changelog:

- #8809, #9292: Add a best-fit allocator for the major heap; still experimental, it should be much better than current allocation policies (first-fit and next-fit) for programs with large heaps, reducing both GC cost and memory usage. This new best-fit is not (yet) the default; set it explicitly with OCAMLRUNPARAM="a=2" (or Gc.set from the program). You may also want to increase the `space_overhead` parameter of the GC (a percentage, 80 by default), for example OCAMLRUNPARAM="o=85", for optimal speed. (Damien Doligez, review by Stephen Dolan, Jacques-Henri Jourdan, Xavier Leroy, Leo White)

At OCamlPro, some of the tools that we develop, such as the package manager Opam, the Alt-Ergo SMT solver or the Flambda optimizer, can be quite demanding in memory usage, so we were curious to better understand the properties of this new allocator.

Minor heap and Major heap: the GC in a nutshell

Not all values are allocated equal. Some will only be useful for the span of local calculations, some will last as long as the program lives. To handle those two kinds of values, the runtime uses a Generational Garbage Collector with two spaces:

– The minor heap uses the Stop-and-copy principle. It is fast but has to stop the computation to perform a full iteration.

– The major heap uses the Mark-and-sweep principle. It has the perk of being incremental and behaves better for long-lived data.

Allocation in the minor heap is straightforward and efficient: values are stored sequentially, and when there is no space anymore, space is emptied, surviving values get allocated in the major heap while dead values are just forgotten for free. However, the major heap is a bit more tricky, since we will have random allocations and deallocations that will eventually produce a scattered memory. This is called fragmentation, and this means that you’re using more memory than necessary. Thankfully, the GC has two strategies to counter that problem:

– Compaction: a heavyweight reallocation of everything that will remove those holes in our heap. OCaml’s compactor is cleverly written to work in constant space, and would be worth its own specific article!

– Free-list Allocation: allocating the newly coming data in the holes (the free-list) in memory, de-scattering it in the process.

Of course, asking the GC to be smarter about how it allocates data makes the GC slower. Coding a good GC is a subtle art: you need to have something smart enough to avoid fragmentation but simple enough to run as fast as possible.

Where and how to allocate: the 3 strategies

OCaml used to propose 2 free-list allocation strategies: next-fit, the default, and first-fit. Version 4.10 of OCaml introduces the new best-fit strategy. Let’s compare them:

Next-fit, the original and remaining champion

OCaml’s original (and default) “next-fit” allocating strategy is pretty simple:

Keep a (circular) list of every hole in memory ordered by increasing addresses; Have a pointer on an element of that list; When an allocation is needed, if the currently pointed-at hole is big enough, allocate in it; Otherwise, try the next hole and so-on.

This strategy is extremely efficient, but a big hole might be fragmented with very small data while small holes stay unused. In some cases, the GC would trigger costly compactions that would have been avoidable.

First-fit, the unsuccessful contender

To counteract that problem, the “first-fit” strategy was implemented in 2008 (OCaml 3.11.0):

– Same idea as next-fit, but with an extra allocation table.

– Put the pointer back at the beginning of the list for each allocation.

– Use the allocation table to skip some parts of the list.

Unfortunately, that strategy is slower than the previous one. This is an example of making the GC smarter ends up making it slower. It does, however, reduce fragmentation. It was still useful to have this strategy at hand for the case where compaction would be too costly (on a 100Gb heap, for instance). An application that requires low latency might want to disable compaction and use that strategy.

Best-fit: a new challenger enters!

This leads us to the brand new “best-fit” strategy. This strategy is actually composite and will have different behaviors depending on the size of the data you’re trying to allocate.

– On small data (up to 32 words), segregated free lists will allow allocation in (mostly) constant time.

– On big data, a general best-fit allocator based on splay trees.

This allows for the best of the two worlds, as you can easily allocate your numerous small blocks in the small holes in your memory while you take a bit more time to select a good place for your big arrays.

How will best-fit fare? Let’s find out!

Try it!

First, let us remind you that this is still an experimental feature, which from the OCaml development team means “We’ve tested it thoroughly on different systems, but only for months and not on a scale as large as the whole OCaml ecosystem”.

That being said, we’d advise you don’t use it in production code yet.

Why you should try it

Making benchmarks of this new strategy could be beneficial for you and the language at large: the dev team is hoping for feedback, the more quality feedback you give means the more the future GC will be tuned for your needs.

In 2008, the first-fit strategy was released with the hope of improving memory usage by reducing fragmentation. However, the lack of feedback meant that the developers were not aware that it didn’t meet the users’ needs. If more feedback had been given, it’s possible that work on improving the strategy or on better strategies would have happened sooner.

Choosing the allocator strategy

Now, there are two ways to control the GC behavior: through the code or through environment variables.

First method: Adding instructions in your code

This method should be used by those of us who have code that already does some GC fine-tuning. As early as possible in your program, you want to execute the following lines:

let () = Gc.(set { (get()) with allocation_policy = 2; (* Use the best-fit strategy *) space_overhead = 100; (* Let the major GC work a bit less since it's more efficient *) })

You might also want to add verbose = 0x400; or verbose = 0x404; in order to get some GC debug information. See here for more details on how to use the GC module.

Of course, you’ll need to recompile your code, and this will apply only after the runtime has initialized itself, triggering a compaction in the process. Also, since you might want to easily switch between different allocation policies and overhead specifications, we suggest you use the second method.

Second method: setting $OCAMLRUNPARAM

At OCamlPro, we develop and maintain a program that any OCaml developer should want to run smoothly. It’s called Opam, maybe you’ve heard of it? Though most commands take a few seconds, some administrative-heavy commands can be a strain on our computer. In other words: those are perfect for a benchmark.

Here’s what we did to benchmark Opam:

$ opam update $ opam switch create 4.10.0 $ opam install opam-devel # or build your own code $ export OCAMLRUNPARAM='b=1,a=2,o=100,v=0x404' $ cd my/local/opam-repository $ perf stat ~/.opam/4.10.0/lib/opam-devel/opam admin check --installability # requires right to execute perf, time can do the trick

If you want to compile and run your own benchmarks, here are a few details on OCAMLRUNPARAM :

– b=1 means “print the backtrace in case of uncaught exception”

– a=2 means “use best-fit” (default is 0 , first-fit is 1 )

– o=100 means “do less work” (default is 80 , lower means more work)

– v=0x404 means “have the gc be verbose” ( 0x400 is “print statistics at exit”, 0x4 is “print when changing heap size”)

See the manual for more details on OCAMLRUNPARAM

You might want to compare how your code fares on all three different GC strategies (and fiddle a bit with the overhead to find your best configuration).

Our results on opam

Our contribution in this article is to benchmark opam with the different allocation strategies:

Strategy: Next-fit First-fit Best-fit Overhead: 80 80 80 100 120 Cycles used (Gcycle) 2,040 3,808 3,372 2,851 2,428 Maximum heap size (kb) 793,148 793,148 689,692 689,692 793,148 User time (s) 674 1,350 1,217 1,016 791

A quick word on these results. Most of opam ‘s calculations are done by dose and rely heavily on small interconnected blocks. We don’t really have big chunks of data we want to allocate, so the strategy won’t give us the bonus you might have as it perfectly falls into the best-case scenario of the next-fit strategy. As a matter of fact, for every strategy, we didn’t have a single GC compaction happen. However, Best-fit still allows for a lower memory footprint!

Conclusions

If your software is highly reliant on memory usage, you should definitely try the new Best-fit strategy and stay tuned on its future development. If your software requires good performance, knowing if your performances are better with Best-fit (and giving feedback on those) might help you in the long run.

The different strategies are:

– Next-fit: generally good and fast, but has very bad worst cases with big heaps.

– First fit: mainly useful for very big heaps that must avoid compaction as much as possible.

– Best-fit: almost the best of both worlds, with a small performance hit for programs that fit well with next-fit.

Remember that whatever works best for you, it’s still better than having to malloc and free by hand. Happy allocating!