One of the things that’s annoyed me for a while is that there aren’t any high performance, high level languages, really being seriously developed. It seems to me that most of the benefits from high level languages, if carefully designed, have no or very limited performance implications, but yet we get languages that either largely abandon safety and productivity concerns even when there’s no reason (iterations on C/C++ mostly, including Go and D), or we get high level languages that throw out performance considerations left and right for minor convenience reasons (e.g. C#, Java).

Luckily this appears to be changing. E.g. languages like Rust seem very promising.

This brings me to one of the more controversial features of high level languages: garbage collection. IMO this is almost a no-brainer. Automatic memory management is such a productivity and safety booster that any high level language should have it, and I’m fairly convinced that it can be done in a high performance manner as well. We’ve just been conditioned to see problems with it because the host languages haven’t really tried to do this efficiently by making the right tradeoffs.

The first rather obvious optimization is to avoid GC work whenever possible – the easiest way to do this is to reduce the number of allocations. Languages like C# and Java get this entirely wrong. They heavily encourage heap allocations so you tend to get them all over the place. Each object consists mostly of pointers to other heap allocated objects. Where’s the actual data dammit? As a result, the GC has a crazy amount of work to do. C++ gets this right, and so does Rust. Most “member fields” in data types should just be stored in-place rather than behind a heap allocated pointer. This complicates the language if you want GC (especially with regards to taking pointers to internal parts of a data structure), but it’s worth the extra complexity. In practice this means that rather than each object having a massive fan-out where every member is another allocation, most objects will have only a couple of heal-allocated members – in aggregate this drastically reduces the heap graph and therefore the amount of work the GC has to do.

Another optimization, which rust and C++ makes alike, is to have “unique” pointers that refer to objects that are uniquely owned by their pointer. These could live in a separate heap altogether. Since these signify unique ownership, the object dies when its parent dies. Ownership can be transferred, but never shared (and this is statically ensured in Rust). This is particularly useful for reference counting based GC, since there’s no RC overhead for these pointers, but tracing collectors can often stop tracing early when it hits a unique pointer, if the type it’s looking at has no GC pointers in its transitive “type graph”.

These two things basically boil down to this: the language should allow you to express simple memory ownership when you have it, and only resort to arbitrary (shared) heap allocations in extreme circumstances. The language should prefer and promote stack allocation, and in-place allocation of sub-objects. When you do need heap allocations, the language should nudge you towards unique ownership for most of them.

Even D, which is ostensibly supposed to be a language prioritizing both efficiency and safety, doesn’t get this right with its struct/class separation (I want to very carefully elect to put stuff on the heap only on the small subset data where it’s absolutely necessary, rather than being forced – or at least heavily encouraged – into heap allocations because something happens to be a class).

The third and maybe most important optimization is to make garbage collection a per-thread operation. You simply don’t allow garbage collected heap allocations to transfer between threads. This means you can use efficient “stop the world” collectors (and compactors), without actually stopping the whole world – only stop one thread. Assuming threads are cheap and ubiquitous (they should be) we may have hundreds or even thousands of isolated heaps – each of which may be stopped for a few milliseconds at a time, but “on average” there are always enough threads that can run that we keep the mutator resources (CPU cores) busy doing actual work and make overall progress. You might call it “concurrent garbage collection in aggregate”. Plus, most of these threads are probably blocked at any one time – running GC on a blocked thread is basically a ghetto version of concurrent GC (from “inside the thread” it looks as if GC just happened while you weren’t looking, just like a concurrent GC, but without expensive read or write barriers).

Anyway, the point is that before we even start talking about GC, the assumption is that we are in a language designed with memory efficiency in mind, and have far less garbage than any existing mainstream high level languages, and that we can run the garbage collectors on small, isolated islands without stopping the entire world (or resorting to expensive concurrent collectors requiring mutator memory barriers).

There are two issues that come up here that I’ve been pondering for a while: external fragmentation due to all these heaps, and tracing collection versus reference counting.

External fragmentation



A lot of high performance garbage collectors (such as Immix etc.) gain their performance largely due to their heap organization. Typically the heaps are allocated in big “regions”, e.g. 32KB each. This is problematic for per-thread GC since it implies a minimum heap size, which in turn implies that having tons (e.g. thousands) of tiny threads will lead to substantial memory waste in the form of external fragmentation. Even if each heap is efficient on its own (e.g. runs compaction periodically), for the overall system the empty space at the end of each heap will add up to substantial fragmentation that cannot be reclaimed.

Ideally, each heap would start small, like a few hundred bytes or so, and then grow using bigger and bigger regions as the overall heap size increases. I haven’t really seen any GC papers do this, they all seem to pick a fixed region size up front. The reasoning is probably quite simple: with a fixed region size you can find the start of the region by looking at the pointer itself (power of two sized regions are cheaper).

I only really mention this because I think I have an efficient way of identifying the region a pointer belongs to, even if region sizes vary, that I haven’t seen mentioned anywhere. Perhaps this is well known, and I wouldn’t be surprised if people have done it for decades, but it’s not completely obvious and it seems like a trick that should’ve been mentioned but has been conspicuously absent in many GC books and papers I’ve read. So I should probably write it down in case it’s helpful to someone.

So the standard way of partitioning the heap is to have power-of-two region sizes so you can mask the lower order bits of the pointer to get a pointer to the start of the region (you could place per-region meta data here, or store a pointer from the region to a meta data block “off to the side” – the latter reduces contention for the ways in a set associative cache). The problem with this in the presence of varying region sizes is that you don’t know ahead of time how many bits you need to mask out (the number of bits to mask out depends on the region size, after all).

The basic idea is to simply try all the different region sizes, and see if the proposed region is actually valid. The trick is to eliminate false positives. The main way to verify if an arbitrary pointer actually refers to a region is to simply store a pointer to all regions in a separate array that user-code cannot write to. If the current proposed region is not in this array, then it is not the start of a valid region. Call this the “region pointer array”. To make the lookup in this array fast, we also store, as the first DWORD of each actual region, an index into this array. Call this the “region index”. We also store the region size in the meta data as well (along with all the other stuff, such as occupancy bits, mark bits, reference count bits, and whatever else you may store per region).

To figure out which region a pointer refers to in this scheme, we simply loop through the different region sizes that our allocator allows, mask the appropriate number of bits from the pointer to get the start of the proposed region, read the region index in the first DWORD of this memory area (this may be garbage if we have the wrong size) use this to index into the region pointer array (taking care to avoid buffer overruns from garbage indices) and ensure that the pointer stored in that element refers right back to the proposed region. We also must ensure that the size of the region at this address actually contains the original pointer.

This seems like a scheme that would allow us to have very small initial region sizes that quickly grows as the heap does, while only doing very simple tests to find which region a pointer belongs to. No, it’s not as cheap as a simple bitwise and, but it should essentially optimize to that in the common case due to allocation coherency. E.g. take the tracing loop in a mark and sweep collector, you can imagine having special-cased tracing loops for each region size (or at least the most common ones), and then just jumping from one loop to the next when the region size changes, but keeping the common path fast by being able to just mask and compare the next pointer from the tracing stack (mask with a constant, compare against a register) to ensure you’re still in the same region as you were, and going through the generic region-finding algorithm above only when that test fails.

Tracing GC vs Reference counting



The next question is if we should be using a tracing GC or reference counting. The conventional wisdom seems to be that reference counting isn’t performance competitive but keep in mind that:

Most GC research seems to happen in the context of Java. I.e. an extremely heap-heavy language, where each object is small and consists largely of pointers to other heap objects. This favors tracing over RC.

Even in this context, modern reference counting optimization can reclaim the performance difference vs. tracing collectors. See Down for the Count? Getting Reference Counting Back in the Ring.

For our hypothetical language we will have large objects, and few heap pointers, so the overhead for reference counting will be lower both in space and time. Few pointers mean fewer reference counts to keep up to date. Few allocations mean fewer reference count fields that take up space.

Per-thread GC heaps means we don’t need atomic reference count operations.

Reference counting costs are proportional to the size of garbage, rather than the size of the heap. This seems better for scalability in the future where heap sizes will keep getting bigger and bigger.

A strongly typed high level language should be immutable by default, or at the very least track mutability, which means we can statically know that large object graphs are acyclic (as immutable values must be, in a strict language) – no need to trace these in any cycle collector.

One of the benefits RC has that I’m not too concerned about is promptness of deallocation. The idea that as soon as a sub graph goes out of scope, memory is immediately reclaimed. This is appealing, but doesn’t eliminate pause times as some might think (you tend to get big “bubbles” that burst as a lynch pin object is deallocated, causing the mutator to stall for a long time as deallocations cascade). Many RC optimizations rely on giving up on this benefit in order to be competitive with tracing GC, so it’s a benefit I’m prepared to sacrifice.

Basically, I think that with a language that has the properties I’ve outlined for a hypothetical modern and performance oriented language (or the non-hypothetical language Rust), reference counting may well come out ahead.

There’s one final bit of conjecture: if immutable values are the default and therefore common in this language, the vast majority of heap pointer mutations will happen on the stack (i.e. local variables). I.e. you will grab some pointers to the heap, shuffle them around on the stack for a while in various algorithms, and then they go out of scope. This may not be true, but it seems very plausible. This implies that along various static analysis to avoid reference counting when possible (such as Rust’s “borrowed pointers” which amount to safe versions of C’s naked pointers), you can simply add the most naïve form of deferred reference counting as well and eliminate most remaining reference counting operations that way.

The idea is basically this. Don’t do any reference counting operations as the result of writing to stack variables. Only do them for heap variables. In other words, grabbing a pointer to a heap allocation and storing it in a local variable does not introduce a reference counting operation, only mutating a heap object’s pointer to another heap object does (these operations should be rare, largely because mutable pointer fields in data structures would be rare). If a reference count goes to zero during execution we can’t just delete it immediately in case there are also stack variables referring to it, instead we put a pointer to it in a thread local “zero count buffer” or ZCB. Put all newly allocated heap objects in this buffer too (unless we can statically ensure that it will be immediately referenced by another heap object). Once this buffer is full, or some other heuristic is met (e.g. a certain number of bytes has been allocated), stop the current thread, scan the stack for references into the GC heap and increment reference counts for these objects. This restores accurate reference counts to all objects that are potentially dead. Then, scan the ZCB and delete any objects whose reference counts are still zero. Finally, go through the stack again and decrement the reference counts you just incremented so that reference counts once again do not take the stack into account. The benefit is twofold, the first one is that we only perform reference counting for stack-rooted pointers periodically, rather than every time one is written to, which by the conjecture above should eliminate the vast majority of remaining reference counting operations. The other is that we batch up our deallocations into groups instead of deallocating throughout normal execution which should improve locality. The cost of writing an object to the ZCB when its reference count goes to zero is low. If we enforce that the ZCB always has at least one free element left we can safely do an unconditional write into it and perform the test for a full ZCB afterwards, without having to stall the subsequent code. This means modern OOE CPUs will correctly predict the branch and keep going assuming the ZCB is not full, which is the common case.

There are other common reference counting optimization in the literature, such as coalesced reference counting. I believe that in a language such as Rust these will have low value (having less RC traffic to start with), and that a very simple deferred scheme like above would get most of the benefit while avoiding the cost of per-object dirty bits and the like.

Anyway. This post is getting long. It may also be the worst kind of blog post – too specific to be interesting for most people, and written by a non-expert in the area so that people who would be interested think it’s either obvious or stupid. Let me know which in the comments!