Rethinking Static Reference Tables in GHC

It seems rare these days to be able to make an improvement that’s unambiguously better on every axis. Most changes involve a tradeoff of some kind. With a compiler, the tradeoff is often between performance and code size (e.g. specialising code to make it faster leaves us with more code), or between performance and complexity (e.g. adding a fancy new optimisation), or between compile-time performance and runtime performance.

Recently I was lucky enough to be able to finish a project I’ve been working on intermittently in GHC for several years, and the result was satisfyingly better on just about every axis.

Code size: overall binary sizes are reduced by ~5% for large programs, ~3% for smaller programs.

Runtime performance: no measurable change on benchmarks, although some really bad corner cases where the old code performed terribly should now be gone.

Complexity: some complex representations were removed from the runtime, making GC simpler, and the compiler itself also became simpler.

Compile-time performance: slightly improved (0.2%).

To explain what the change is, first we’ll need some background.

Garbage collecting CAFs

A Constant Applicative Form (CAF) is a top-level thunk. For example:

myMap :: HashMap Text Int myMap = HashMap.fromList [ -- lots of data ]

Now, myMap is represented in the compiled program by a static closure that looks like this:

When the program demands the value of myMap for the first time, the representation will change to this:

At this point, we have a reference from the original static closure, which is part of the compiled program, into the dynamic heap. The garbage collector needs to know about this reference, because it has to treat the value of myMap as live data, and ensure that this reference remains valid.

How could we do that? One way would be to just keep all the CAFs alive for ever. We could keep a list of them and use the list as a source of roots in the GC. That would work, but we’d never be able to garbage-collect any top-level data. Back in the distant past GHC used to work this way, but it interacted badly with the full-laziness optimisation which likes to float things out to the top level - we had to be really careful not to float things out as CAFs because the data would be retained for ever.

Or, we could track the liveness of CAFs properly, like we do for other data. But how can we find all the references to myMap ? The problem with top-level closures is that their references appear in code, not just data. For example, somewhere else in our program we might have

myLookup :: String -> Maybe Int myLookup name = HashMap.lookup name myMap

and in the compiled code for myLookup will be a reference to myMap .

To be able to know when we should keep myMap alive, the garbage collector has to traverse all the references from code as well as data.

Of course, actually searching through the code for symbols isn’t practical, so GHC produces an additional data structure for all the code it compiles, called the Static Reference Table (SRT). The SRT for myLookup will contain a reference to myMap .

The naive way to do this would be to just have a table of all the static references for each code block. But it turns out that there’s quite a lot of opportunities for sharing between SRTs - lots of code blocks refer to the same things - so it makes sense to try to use a more optimised representation.

The representation that GHC 8.4 and earlier used was this:

All the static references in a module were collected together into a single table ( ThisModule_srt in the diagram), and every static closure selects the entries it needs with a combination of a pointer ( srt ) into the table and a bitmap ( srt_bitmap ).

This had a few problems:

On a 64-bit machine we need at least 96 bits for the SRT in every static closure and continuation that has at least one static reference: 64 bits to point to the table and a 32-bit bitmap.

Sometimes the heuristics in the compiler for generating the table worked really badly. I observed some cases with particularly large modules where we generated an SRT containing two entries that were thousands of entries apart in the table, which required a huge bitmap.

There was complex code in the RTS for traversing these bitmaps, and complex code in the compiler to generate this table that nobody really understood.

The shiny new way

The basic idea is quite straightforward: instead of the single table and bitmap representation, each code block that needs an SRT will have an associated SRT object, like this:

Firstly, this representation is a lot simpler, because an SRT object has exactly the same representation as a static constructor, so we need no new code in the GC to handle it. All the code to deal with bitmaps goes away.

However, just making this representation change by itself will cause a lot of code growth, because we lose many of the optimisations and sharing that we were able to do with the table and bitmap representation.

But the new representation has some great opportunities for optimisation of its own, and exploiting all these optimisations results in more compact code than before.

We never need a singleton SRT

If an SRT has one reference in it, we replace the pointer to the SRT with the pointer to the reference itself.

The SRT field for each code block can be 32 bits, not 96

Since we only need a pointer, not a pointer and a bitmap, the overhead goes down to 64 bits. Furthermore, by exploiting the fact that we can represent local pointers by 32-bit offsets (on x86_64), the overhead goes down to 32 bits.

We can common up identical SRTs

This is an obvious one: if multiple code blocks have the same set of static references, they can share a single SRT object.

We can drop duplicate references from an SRT

Sometimes an SRT refers to a closure that is also referred to by something that is reachable from the same SRT. For example:

In this case we can drop the reference to x in the outer SRT, because it’s already contained in the inner SRT. That leaves the outer SRT with a single reference, which means the SRT object itself can just disappear, by the singleton optimisation mentioned earlier.

For a function, we can combine the SRT with the static closure itself

A top-level function with an SRT would look like this:

We might as well just merge the two objects together, and put the SRT entries in the function closure, to give this:

Together, these optimisations were enough to reduce code size compared with the old table/bitmap representation.

Show me the code

Look out for (slightly) smaller binaries in GHC 8.6.1.