LLVM and SBCL

This is about the slightly conservative generational GC that SBCL uses on i386 and amd64. SBCL has other GCs as well, but gencgc is the most tuned one since it is used in pretty much all production uses of SBCL.

LLVM has several different mechanisms for GC support, including a plugin loader and a number of other choices, and things change quite rapidly. It also turns out that even basic memory allocation in SBCL isn’t what LLVM expects right now (LLVM wants allocation to go through a function, but SBCL allocates by incrementing a pointer inline, see https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/alloc.lisp#L89).

There are very few garbage collecting languages on LLVM right now, and they have driven much of the interface design so far.

First, a description of how SBCL’s generational GC works:

written for CMUCL by Peter Van Eynde @pvaneynd (ETA: Peter says it wasn’t him; Douglas Crosher, maybe?) when CMUCL was ported to x86, and later used when SBCL split off from CMUCL and amd64 support was implemented

generational, with pages/segments within generations (I call them “GC cards” to avoid confusion with VM pages)

slightly conservative. Due to the shortage of registers on x86, the x86 code generator does not split the register set into tagged and untagged halves (on RISC ports, one half of the registers may only hold non-pointers and the other half only pointers, so the GC knows exactly what is a pointer and what is not). SBCL’s gencgc is not conservative when it comes to heap-to-heap references, only for ambiguous pointer sources such as registers, the stack (avoidable, but done for now), signal contexts and a couple of other places.
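To make the “slightly conservative” part concrete, here is a minimal sketch of the filtering a conservative scanner applies to an ambiguous root word, written in Python for brevity. The heap range, the lowtag values and the valid-start check are illustrative assumptions, not SBCL’s actual constants:

```python
# Sketch of conservative root filtering: decide whether an ambiguous
# machine word (register or stack slot) might be a Lisp pointer.
# Tag values and heap layout are illustrative, not SBCL's actual ones.

HEAP_START = 0x1000_0000
HEAP_END   = 0x2000_0000

# Illustrative lowtags: low 4 bits of a pointer-tagged word.
POINTER_LOWTAGS = {0x3, 0x7, 0xB, 0xF}

def possible_heap_pointer(word, valid_starts):
    """Return True if `word` must be treated as a pointer (pinning its
    target), False if the GC can safely ignore it."""
    if not (HEAP_START <= word < HEAP_END):
        return False                      # outside the dynamic space
    if (word & 0xF) not in POINTER_LOWTAGS:
        return False                      # not a pointer-tagged word
    # A stricter check also verifies the untagged address is the
    # start of a live object on its GC card:
    return (word & ~0xF) in valid_starts

starts = {0x1000_0040, 0x1000_0080}
assert possible_heap_pointer(0x1000_0043, starts)           # plausible pointer
assert not possible_heap_pointer(0x1000_0044, starts)       # fixnum-tagged word
assert not possible_heap_pointer(0xDEAD_BEEF_0003, starts)  # out of range
```

A word that survives these tests pins its target: the GC must assume it is a pointer, even though it might just be an integer that happens to look like one.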

Being conservative with regard to the stack creates a major clash with LLVM, which expects a stack map. A stack map declares which data type can be found where on the stack for a function (every invocation of that function must comply with it), whereas SBCL’s compiler can and does generate code that just places anything anywhere. This makes the stack a lot more compact under some call patterns.

supports a write barrier but no read barrier. This means that parts of the heap can be excluded from scavenging (scanning) when it can be proven that they did not point into the to-be-GCed area at some earlier time and have not been modified since.

the write barrier is implemented via VM page protection and SIGSEGV. When I get to do some more coding, the faster userfaultfd(2) mechanism on Linux should be used. The “soft-dirty bit” mechanism might also be suitable. I wrote extensively about this here: https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f https://www.cons.org/cracauer/cracauer-userfaultfd.html
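For illustration, here is a sketch of the bookkeeping such a page-protection write barrier performs. The SIGSEGV trap is simulated in Python so the card-table logic stands alone; the class and method names are made up for this sketch:

```python
# Sketch of the bookkeeping a page-protection write barrier performs.
# In SBCL the "fault" is a real SIGSEGV on an mprotect()ed page; here
# the trap is simulated so the card-table logic can be shown on its own.

class CardTable:
    def __init__(self, ncards):
        self.protected = [True] * ncards   # all cards write-protected
        self.dirty = [False] * ncards

    def write(self, card):
        """A mutator store into `card`."""
        if self.protected[card]:
            # What the SIGSEGV handler does: record the card as dirty,
            # then re-enable writes so later stores to it are free.
            self.dirty[card] = True
            self.protected[card] = False

    def cards_to_scavenge(self):
        """The GC only scans cards written to since the last collection."""
        return [i for i, d in enumerate(self.dirty) if d]

    def after_gc(self):
        # Re-protect everything; the dirty info has been consumed.
        self.protected = [True] * len(self.protected)
        self.dirty = [False] * len(self.dirty)

table = CardTable(8)
table.write(2)
table.write(2)          # second write: card already unprotected, no trap
table.write(5)
assert table.cards_to_scavenge() == [2, 5]
table.after_gc()
assert table.cards_to_scavenge() == []
```

Note the key property: only the first write to a protected card pays anything at all; every subsequent write to that card runs at full speed.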

it is copying; however, other passes that just punch holes into existing cards (to reclaim objects that are no longer pointed to, without moving a lot of memory) have been and will be added. Conservative GC and use from C are implemented by protecting a whole GC card from moving, then GCing inside it by punching holes (which leaves the pinned objects in place).
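A sketch of the pinning behavior described above, with a made-up card layout: on a pinned card nothing moves and dead objects become holes, while a normal card is emptied by evacuating its live objects:

```python
# Sketch of per-card GC with pinning. Object representation is
# illustrative: each object is (name, live) and cards are just lists.

def gc_card(objects, pinned_card):
    """On a pinned card nothing moves: live objects stay and dead ones
    become holes (filler). On a normal card all live objects are
    evacuated (copied elsewhere) and the card is reclaimed wholesale."""
    stays, evacuated, holes = [], [], []
    for name, live in objects:
        if pinned_card:
            (stays if live else holes).append(name)
        elif live:
            evacuated.append(name)
    return stays, evacuated, holes

# A card holding a conservatively-referenced object must not move anything:
s, e, h = gc_card([("a", True), ("b", False), ("c", True)], pinned_card=True)
assert s == ["a", "c"] and e == [] and h == ["b"]

# A normal card is emptied by copying its live objects out:
s, e, h = gc_card([("a", True), ("b", False)], pinned_card=False)
assert s == [] and e == ["a"] and h == []
```

The point of the holes is that pointers into the pinned card (possibly false positives from the conservative scan) never go stale, while the dead space is still recovered.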

This GC scheme ends up with the following properties:

fast memory writes. There is no overhead on instructions that write to memory. Since we are using OS facilities for the write barrier, there is no need to annotate each memory write with a second bookkeeping write for the barrier.

awesome performance can be had if your overall system produces transient garbage during queries and resets all of it before the next query, as long as no older objects point into it (which, sadly, is not the case for QPX).

Looking at memory management overhead during GC time and non-GC time, you will see that non-GC code suffers very little (just some updates to VM protection for GC cards, and that can be tuned via granularity). Fast non-GC time can be a huge advantage for query-oriented systems, because you can do GC between the queries, so that customer-visible queries do not add latency from GC. You can also do fuller GCs between the queries and quick GCs during them, and still win. Of course, overall CPU consumption for the combined workload of query and non-query activity goes up when you play these games.

Write-barrier GCs that maintain their own bitmaps via annotated writes have the flexibility to teach the system more tricks and do more bookkeeping during non-GC code, slowing down queries some more but saving overall real or CPU time. Users of the VM page protection scheme have little to play with, but userfaultfd(2) will give SBCL something to play with.

SBCL’s GC can potentially run multi-threaded, but not concurrently. That means it needs to stop the world; it could then use multiple threads and CPUs to do GC (unimplemented), but it cannot do GC in the background while the rest of the world is running. Generally you need read barriers for that. It is unknown to me at this time whether VM page protection schemes can implement read barriers sufficient for concurrent GC, and if so, whether the unavoidable granularity allows for sufficient performance. These correspond to Java’s three modes of GCing. The JVM uses annotated writes, with lots of tweaks, for its write barrier.

Other memory management in SBCL:

SBCL-compiled code does not use an allocation function for most allocations. Only large allocations, or allocations hitting the end of an allocation region, call a function. All memory allocation in between is done by atomically incrementing a free-space pointer, with multiple threads using the same allocation region and the same free-space pointer (that works because they use compare-and-swap to claim their allocated space).
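A sketch of that allocation path, with the hardware compare-and-swap stood in for by a lock so the logic is runnable in Python; the names and the slow-path behavior are illustrative (in SBCL this is a few inline instructions, not function calls):

```python
# Sketch of inline pointer-bump allocation: bump a shared free-space
# pointer with compare-and-swap, falling back to a slow path only when
# the region is exhausted.

import threading

class Region:
    def __init__(self, start, end):
        self.free = start                 # shared free-space pointer
        self.end = end
        self._lock = threading.Lock()     # stands in for hardware CAS

    def cas_free(self, old, new):
        """Compare-and-swap on the free pointer (CMPXCHG in real code)."""
        with self._lock:
            if self.free == old:
                self.free = new
                return True
            return False

def alloc(region, nbytes):
    while True:
        old = region.free
        new = old + nbytes
        if new > region.end:
            raise MemoryError("slow path: call into the runtime")
        if region.cas_free(old, new):
            return old                    # we own [old, old + nbytes)

r = Region(0x1000, 0x1100)
a = alloc(r, 16)
b = alloc(r, 32)
assert (a, b) == (0x1000, 0x1010)
assert r.free == 0x1030
```

If another thread wins the race, the CAS fails, the loop re-reads the free pointer and retries; no lock is held on the fast path in the real implementation.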

This is potentially tricky to preserve, but I would like to keep it. The speed at which some garbage can be generated is enormous, and it all adds up.

SBCL’s GC is copying (moving) by default, which enables this scheme. There is no fragmentation. New objects are simply blasted into place, one after another.

There is no malloc-like function which has to keep track of fragments in order to maybe fill them, or to maintain heap pools in lists or other collection classes. A malloc function for all practical purposes needs to either use thread-specific heap pools or use thread locking. Doing a malloc in C is several orders of magnitude more expensive than allocating in SBCL code, more so in multithreaded code. In SBCL code there is a lock-free way to generate a bunch of stuff on the heap with no function calls.

There is also no zero-initialization of allocated space. Common Lisp does not allow uninitialized variables, so the allocated space is overwritten with the initialization value before use (as opposed to first zeroing it, then writing it again with the initialization value).

All this makes various LLVM mechanisms look slow; namely, LLVM usually expects you to go through functions for allocation, and it zeros memory.

Compiler properties of SBCL:

no aliasing of pointers in generated code (we come to that later)

a write barrier implementation using a bitmap was available as a patch for SBCL by Paul Khuong @pkhuong. I benchmarked it against the VM page implementation. The untweaked real-world application did not show overall performance differences. It was obviously dominated by doing the actual GC work (especially memory scans), so it didn’t really matter how you do the write barrier, and both mechanisms apparently did an equal job of excluding GC cards from scavenging. I might be able to dig up some numbers about the resulting compiled code size. I have SBCL versions from that time around if you want to play with those two.

very decent stack allocation abilities. This tremendously reduced overall heap management time for my toy at work. Unlike in Golang, it is not possible in Lisp to automatically determine stack-allocatability for nontrivial code; you have to tell the compiler (as in C) and hope you are not wrong (the Lisp language would allow us to use safe-mode compilation to actually make mistakes detectable during regression tests, but this has not been done).

To reiterate the basics: CMUCL and SBCL are native compilers. They compile to machine code, and they do their own code loading and starting. They do not use the OS compiler, assembler or linker (except for the executable image). CMUCL/SBCL have various transformation phases into Lisp-like abstract representations, then an abstract machine language, then a very simple and linear step from the abstract machine language to the requested architecture’s binary code.

My interest in LLVM results from that last step. No optimization is done during that last phase (abstract assembly to machine assembly), and no optimization happens on machine code. The resulting code can look repetitive and wasteful when a human reads it. LLVM is precisely what I need: an assembly-to-assembly optimizing machine. If I could just replace the last one or two steps in SBCL with LLVM, I would get much better looking machine code.

Now, it is debatable, and was debated inside Google, whether the current state of affairs actually does much harm. The code is just a little bit bigger; I was more concerned about the needless instructions. To test this I wrote C functions that did roughly the same as my Lisp functions, compiled the C to assembly, edited the assembly file to be wasteful in the same way that SBCL’s last stage is wasteful, and benchmarked. Modern x86 CPUs blast through a couple of needless or inefficient instructions with amazing speed. Since the storage dependencies are obviously quite favorable in the instruction stream (we are talking about repeated manipulations of the same locations, using the same read data), the difference was barely noticeable. So I decided against “selling” this huge project to be done at work; the outcome was too uncertain. Now I’m on my own time and I want to know more.

LLVM GC requirements:

LLVM might alias pointers, calling them “derived pointers”, as part of optimizations. A moving GC that needs to find all pointers to an object it moved must be informed of such copies and where they live, so that each copy can be adjusted, too. This is ordinarily not difficult if you just place the copies in the heap (where scavenging will find them), but it isn’t necessarily done that way (it would require wrapping them). This needs thinking over.

What is worse is that such derived pointers might not actually be copies of pointers to (the beginning of) an object; they might point into the middle. The existing SBCL GC has some facilities to deal with that, but it is a pain and a slowdown: you need to determine the surrounding object during scavenging.
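One way such a lookup can work, sketched in Python with a made-up global table of object starts (a real GC would use per-card metadata rather than one sorted list):

```python
# Sketch of mapping an interior (derived) pointer back to the object
# that contains it. The table of starts/sizes is illustrative.

import bisect

def object_containing(addr, starts, sizes):
    """Return the start address of the object whose [start, start+size)
    range contains addr, or None if addr hits no object. `starts` must
    be sorted ascending."""
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    if addr < starts[i] + sizes[i]:
        return starts[i]
    return None

starts = [0x100, 0x140, 0x200]
sizes  = [0x40,  0x20,  0x80]
assert object_containing(0x148, starts, sizes) == 0x140  # interior pointer
assert object_containing(0x100, starts, sizes) == 0x100  # points at start
assert object_containing(0x170, starts, sizes) is None   # in a gap
```

The binary search itself is cheap; the pain is keeping the object-start information queryable at all, which is extra bookkeeping the GC would not otherwise need.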

LLVM is language-neutral and naturally assumes that in a mixed-language system (say Lisp and C) both languages can call each other freely. That is a change from SBCL, where calling Lisp from C is only supported if you wrap it in routines that tell the system about it (e.g. so that Lisp objects known to be pointed to by C are not moved; this is simply folded into the “slightly conservative” mechanism).

What I don’t pay for now (in SBCL):