How to get them?

Shenandoah has been integrated into OpenJDK as of Java 12, but the Oracle builds don’t include it, so you’ll have to use other builds for now. Of course, if you use Red Hat Enterprise Linux then you get it out of the box. Usefully, Red Hat also provides backports for Java 8 and 11.

ZGC comes with the default OpenJDK and Oracle builds. In Java 12 it’s pretty complete, but supports “only” a maximum heap size of 4 terabytes. In Java 13 they plan to raise that maximum to a whopping 16 terabytes, as well as add the increasingly vital ability to shrink the heap and release memory back to the OS.

Edit: Java 13 is out now and adds these features.

However, obtaining a JVM with the GC you want built in isn’t enough. You also have to enable it. Because these GCs are tuned for huge heaps at a cost of runtime throughput, and because they are new, you are expected to opt in for now. Perhaps in future the JVM will automatically use these new GCs when the configured maximum heap size is big enough.
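For example, with a JDK 12 or 13 build that includes them, the launch flags look like this (both collectors are still marked experimental at this stage, so they have to be unlocked first; the heap size is just illustrative):

```
# ZGC, in the standard OpenJDK and Oracle builds
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx16g -jar app.jar

# Shenandoah, in builds that ship it (e.g. Red Hat’s)
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -Xmx16g -jar app.jar
```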

Comparisons

So how do these new GCs compare to other initiatives outside the Java world? That’s very hard to say because you can’t normally hold all other variables constant: other languages may put different amounts of pressure on the heap or GC algorithm. But I’d say they compare extremely well indeed.

In my previous article about garbage collection, written in 2016, I observed that Go was advertising itself as somehow having a one-size-fits-all collector that was both better than ‘enterprise’ alternatives and sufficient for everything into the foreseeable future. That wasn’t true, so I criticised their marketing approach. A couple of years later the Go team published this excellent talk called “Getting to Go” that I’d describe as almost the exact opposite: it’s remarkably honest.

It tells us that the Go GC was designed the way it was largely because of previously unmentioned internal engineering constraints at Google, and in particular because of a surprising need for short-term success.

The original plan was to do a read barrier free concurrent copying GC. That was the long term plan. There was a lot of uncertainty about the overhead of read barriers so Go wanted to avoid them. But short term 2014 we had to get our act together … We also needed something quickly and focused on latency but the performance hit had to be less than the speedups provided by the compiler. So we were limited. We were also concerned about compiler speed, that is the code the compiler generated … Go also desperately needed short term success in 2015.

There’s nothing wrong with designing to meet your time constraints. Some lucky projects are funded for as long as they need in order to achieve their goals (for instance, it took the Kotlin team about six years to reach version 1.0); others need to obtain market success more quickly. Go seems to have been in the latter category, although I have no idea why. Creating a new language is always a very long-term endeavour; perhaps Google’s senior management were demanding to see adoption or else the project would be cancelled.

So it’s 2014 and Jeff Dean had just come out with his paper called ‘The Tail at Scale’ which digs into this further. It was being widely read around Google since it had serious ramifications for Google going forward and trying to scale at Google scale.

So Go’s GC was designed under another constraint too: in an environment in which their colleagues had become obsessed with the long tail of pause latencies, the Go team couldn’t ignore an important position paper by Jeff Dean.

Regardless of the causes, Go’s unusual design choices were clearly the product of unusual constraints: the company around them was going through an obsession with tail latencies; programming time was limited because the team was simultaneously rewriting the standard library in Go; and, most of all, they needed to drive pause latency down very fast to obtain “short term success”.

Good for them: I’d say they achieved their goals splendidly. Go did drive pause latencies down, albeit at a huge cost in other areas; they did it very quickly with a limited engineering budget; and they obtained the short-term success they needed. It’s unfortunate that they claimed their hastily assembled GC was “a future where applications scale effortlessly” and “good for the next decade and beyond” whilst immediately starting to experiment with other GC algorithms, but let’s let bygones be bygones.

Unfortunately, along the way they adopted some strange requirements, like the desire not to expose tuning knobs to users, and this left them badly pinned down by the wildly varying needs of different programs.

The Go compiler is a classic batch job. Pause times don’t matter at all for a compiler; only the total runtime does. But techniques to reduce pause times all increase total runtime, so that’s a problem. A good GC algorithm for a batch job is something like the JVM’s Parallel GC. Go doesn’t have any equivalent of that.
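On the JVM that’s a one-flag choice; a throughput-sensitive batch job might be launched like this (heap size illustrative):

```
# Parallel GC: maximise total throughput, ignore pause times
java -XX:+UseParallelGC -Xmx8g -jar batch-job.jar
```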

They experimented with a “request oriented collector” which scaled better for some kinds of important apps:

As you can see if you have ROC on and not a lot of sharing, things actually scale quite nicely. If you don’t have ROC on it wasn’t nearly as good.

But … it slowed down their compiler:

At that point there was a lot of concern about our compiler and we could not slow down our compilers. Unfortunately the compilers were exactly the programs that ROC did not do well at. We were seeing 30, 40, 50% and more slowdowns and that was unacceptable. Go is proud of how fast its compiler is.

Darn, GC design is hard! The Go guys could have gone the JVM route and simply let the user pick the GC algorithm depending on whether they care about latency or throughput, but that would have violated the “two knobs is perfect” policy they had:

We also do not intend to increase the GC API surface. We’ve had almost a decade now and we have two knobs and that feels about right. There is not an application that is important enough for us to add a new flag.

Why two knobs? Why not three? Everyone can agree fewer is better, but drawing the line at two feels totally arbitrary.

So they gave up and tried a new approach: a regular generational garbage collector, like their “enterprise” competitors use. But they built this in a strange way too, because they insisted that it not move things around in memory. That third try also failed:

This is typical in performance tuning — you try something and it makes some programs faster, but others slower. This is the root of the tuning knobs problem.

Their position as of 2018 is therefore that they’re going to just wait for memory prices to drop and ask Go users to give Go apps huge heaps, to reduce the amount of GC work that needs to be done with the current algorithm (see the last slides).
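The knob for doing that already exists: Go’s main tunable is the GOGC environment variable, which sets how far the heap may grow beyond the live data before the next collection runs. Raising it is exactly how you’d give a Go app a huge heap (the binary name below is made up):

```
# Default is GOGC=100: collect once the heap has doubled over the live set.
# Higher values trade memory for less GC work; GOGC=off disables collection.
GOGC=400 ./myserver
```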

If we compare this story to the equivalent in the JVM world, we can see things played out differently. In the SPECjbb2015 benchmark (which simulates a warehouse management database app), ZGC doesn’t really slow the app down but does significantly improve latency:

Pause times are almost always a millisecond or less, yet throughput isn’t significantly harmed. The ZGC team aim for a maximum throughput impact of 15%, but in practice they appear to do significantly better than their own goal (ZGC is new, so it’s hard to say exactly which workloads that holds for). Meanwhile, there are only three standard tuning flags: one to select ZGC instead of another algorithm, one to set the max heap size, and one that you can normally leave alone, as it’s automatically adjusted, but which sets how much CPU time the collector gets. A few more obscure flags can increase performance further by utilising the Linux kernel’s hugepages feature. This seems pretty good. And if the new algorithm does harm a specific program or benchmark, it’s easy to go back to the old ones.
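To make that concrete, here’s roughly what a tuned ZGC command line looks like; the flag names are the JDK’s, but the values here are invented:

```
# Select ZGC (experimental in JDK 12/13, hence the unlock flag).
# -Xmx sets the max heap; ConcGCThreads caps the collector’s CPU share
# (normally best left auto-tuned); UseLargePages opts into hugepages.
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
     -Xmx100g \
     -XX:ConcGCThreads=4 \
     -XX:+UseLargePages \
     -jar app.jar
```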

Overall, it looks to me like the Java guys are winning the low-latency game.