Introduction

The Java language specification mandates some form of automatic garbage collection to reclaim unused storage, and forbids manual memory deallocation. Automatic garbage collection frees the programmer from much of the worry about releasing objects that are no longer needed, which can otherwise consume substantial design effort. It also prevents some common kinds of bugs from occurring, including certain kinds of memory leaks, dangling pointer bugs (which occur when a piece of memory is freed while there are still pointers to it and one of those pointers is then used) and double free bugs (which occur when the program tries to free a region of memory that has already been freed and perhaps already been allocated again).

Whilst garbage collection clearly has a number of advantages, it does also have some problems. The most significant is that thus far, practical implementations of garbage collection in commercial Java runtimes have generally involved an unpredictable "pause" during collection. As Java programs have expanded in size and complexity, so the garbage collection pause has become an increasingly significant problem for Java software architects.

A widely used technique in enterprise Java to work around this is to distribute programs. This can both keep the heap size smaller, making the pauses shorter, and also allow some requests to continue whilst others are paused. Increasingly though, this means that Garbage Collector (GC) managed workloads are unable to take full advantage of the capabilities of the hardware on which they run. We spoke to Gil Tene, Vice President of Technology, CTO, and Co-Founder of Azul Systems, who suggested that garbage collection hasn’t been able to keep pace with changes in hardware. Whilst 10 years ago a 512MB-1GB heap size would have been considered substantial, and a high-end commodity server might have shipped with 1GB-2GB RAM and a 2-core CPU, a modern commodity server typically has around 96GB-256GB of memory and 24-48 virtual or physical cores. Over a period during which commodity hardware memory capacities have increased a hundredfold, commonly used garbage collected heap sizes have only doubled. Tene argues that garbage collectors have lagged significantly behind both the hardware and software demands of many larger enterprise applications.

Collector Mechanisms

Common Tasks

While implementations of garbage collectors in Java runtimes vary, there are common and unavoidable tasks that are performed by all commercial JVMs and garbage collection modes. These common tasks include:

Identifying the live objects in the memory heap

Reclaiming the resources used by those objects that are not live (aka “dead” objects)

Periodically relocating live objects to create some amount of contiguous free space in order to accommodate the allocation of new objects of varying sizes – this process is referred to as either relocation or compaction

Depending on the collector’s specific algorithm, these tasks may be performed in separate phases, or as part of a single combined pass. For example, commonly used Tracing Collectors (popular for old generation collection in commercial JVMs) often use separate Mark, Sweep, and Compact phases to perform the identification, reclaiming, and relocation tasks. On the other hand, commonly used Copying Collectors (popular for young generation collection in commercial JVMs) will typically perform all three tasks in a single copying pass (all live objects are copied to a new location as they are identified).
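As a rough illustration of the separate-phase approach, the sketch below runs a mark pass and then a sweep pass over a toy object graph. The class and method names (ToyMarkSweep, ToyObject, mark, sweep) are invented for this example and do not model any real JVM's heap; the compact phase is omitted for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy collector, for illustration only.
final class ToyMarkSweep {
    // A toy heap object: a name, outgoing references, and a mark bit.
    static final class ToyObject {
        final String name;
        final List<ToyObject> refs = new ArrayList<>();
        boolean marked;
        ToyObject(String name) { this.name = name; }
    }

    // Mark: trace from the roots, flagging every reachable object as live.
    static void mark(List<ToyObject> roots) {
        Deque<ToyObject> workList = new ArrayDeque<>(roots);
        while (!workList.isEmpty()) {
            ToyObject o = workList.pop();
            if (o.marked) continue;      // already visited
            o.marked = true;
            workList.addAll(o.refs);     // queue the objects it points to
        }
    }

    // Sweep: walk the whole heap, keep marked objects, and clear marks for the next cycle.
    static List<ToyObject> sweep(List<ToyObject> heap) {
        List<ToyObject> survivors = new ArrayList<>();
        for (ToyObject o : heap) {
            if (o.marked) {
                o.marked = false;
                survivors.add(o);
            }
        }
        return survivors;
    }

    // Builds a four-object heap with one root and returns the names of the survivors.
    static List<String> demo() {
        ToyObject a = new ToyObject("a"), b = new ToyObject("b");
        ToyObject c = new ToyObject("c"), d = new ToyObject("d");
        a.refs.add(b);   // a -> b, reachable from the root
        c.refs.add(d);   // c -> d, but nothing live points at c
        d.refs.add(a);   // a dead object may still point at a live one
        List<ToyObject> heap = List.of(a, b, c, d);
        mark(List.of(a));
        List<String> names = new ArrayList<>();
        for (ToyObject o : sweep(heap)) names.add(o.name);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, b]
    }
}
```

Note that in this sketch the mark pass only touches reachable objects, while the sweep pass visits every object in the heap, live or dead.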

Parallelism and Concurrency

Garbage collectors can be either single-threaded or parallel:

A single-threaded collector uses at most one CPU core to perform garbage collection.

A parallel collector uses multiple threads that can perform collection work at the same time, and can therefore make simultaneous use of more than one CPU core.

They are also either stop-the-world or concurrent:

A stop-the-world collector performs garbage collection work with the application code stopped.

A concurrent collector performs garbage collection work concurrently with program execution, allowing program execution to proceed while collection is done (potentially using separate CPU cores).

A collector is termed “partly concurrent” or “mostly concurrent” when some of the garbage collection work is done concurrently with application execution, but some part of the collection work is performed with application execution stopped.

Responsiveness and Sensitivity

Traditional "stop-the-world" collector implementations clearly affect an application's responsiveness. What is less commonly understood is that “somewhat concurrent” and “mostly concurrent” collectors will also exhibit application response time issues, depending on exactly what processing requires a pause. Application responsiveness and scalability limitations are generally sensitive to the type of processing that occurs during stop-the-world pauses, and to the type or rate of operations that cause such pauses to occur more often. Examples of metrics that applications may be sensitive to include:

Live set size: The amount of live data in the heap will generally determine the amount of work a collector must do in a given cycle. When a collector performs operations that depend on the amount of live data in the heap during stop-the-world pauses, the application’s responsiveness becomes sensitive to the live set size. Operations that tend to depend on live set size are marking (in Mark-Sweep, Mark-Sweep-Compact, or Mark-Compact collectors), copying (in copying collectors), and the remapping of references during heap compaction (in any compacting collector).

Heap size: Depending on the algorithm and mechanisms used, the total size of the heap may determine the amount of work a collector must do in a given cycle. When a collector performs operations that depend on the amount of memory in the heap (regardless of whether the contents are live or dead), and those operations are performed during a stop-the-world event, the application’s responsiveness becomes sensitive to the heap size. The operation that tends to depend on heap size is sweeping (in Mark-Sweep and Mark-Sweep-Compact collectors).

Fragmentation and compaction: Fragmentation is inevitable, and as a result so is compaction. As you allocate objects and some of them die, the heap develops “holes” that are large enough for some objects but not for others. As time goes by you get more, smaller holes. Eventually, you get to a point where there is plenty of room in the heap, but no place for some object that is larger than the largest available hole, and a compaction is required in order to keep the application operating. Compaction will relocate at least some live objects in the heap in order to free up some contiguous space for further allocation. When live objects are relocated to new locations during compaction, all references to the relocated objects must be tracked down and remapped to point to the new locations. All commercial JVMs contain code to compact the heap, without which the heap will become useless with time. All the collector modes (concurrent or not) of the JRockit, HotSpot, and J9 JVMs will perform compaction only during stop-the-world pauses. A compaction pause is generally the longest pause an application will experience during normal operation, and these collectors are therefore sensitive to heap fragmentation and heap size.

Mutation rate: Defined as how quickly a program updates references in memory, that is, how fast the heap, and specifically the pointers between objects in the heap, are changing. Mutation rate is generally proportional to application load; the more operations per second you do, the more you will mutate references in the heap. Thus if a collector can only mark mutated references in a pause, or if a concurrent collector has to repeatedly revisit mutated references before a scanning phase is completed, the collector is sensitive to load, and a high mutation rate can result in significant application pauses.

Number of weak or soft references: Some collectors, including the popular CMS and ParallelGC collectors in HotSpot, process weak or soft references only in a stop-the-world pause. As such, pause time is sensitive to the number of weak or soft references that the application uses.

Object lifetime: Most allocated objects die young, so collectors often make a generational distinction between young and old objects. A generational collector collects recently allocated objects separately from long-lived objects. Many collectors use different algorithms for handling the young and old generations - for example they may be able to compact the entire young generation in a single pass, but need longer to deal with the older generation. However, many applications have long-lived datasets, and large parts of those datasets are not static (for example, caching is a very common pattern in enterprise systems). The generational assumption is an effective filter, but if collecting old objects is more time-consuming then your collector becomes sensitive to object lifetime.
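To make the fragmentation sensitivity concrete, here is a minimal sketch of the situation in which the heap has plenty of free space but no single hole large enough for a request. The hole sizes and the 48-unit request are invented numbers, and FragmentationSketch is a hypothetical helper, not part of any JVM.

```java
import java.util.List;

// Hypothetical illustration of heap fragmentation; sizes are arbitrary units.
final class FragmentationSketch {
    // Total free space across all holes.
    static int totalFree(List<Integer> holes) {
        int sum = 0;
        for (int h : holes) sum += h;
        return sum;
    }

    // Size of the largest single contiguous hole.
    static int largestHole(List<Integer> holes) {
        int max = 0;
        for (int h : holes) max = Math.max(max, h);
        return max;
    }

    // An allocation succeeds without compaction only if one hole can hold it.
    static boolean fitsWithoutCompaction(List<Integer> holes, int request) {
        return largestHole(holes) >= request;
    }

    public static void main(String[] args) {
        List<Integer> fragmented = List.of(16, 24, 8, 32, 16);
        System.out.println(totalFree(fragmented));                  // prints 96
        System.out.println(fitsWithoutCompaction(fragmented, 48));  // prints false
        // After compaction the same free space forms one contiguous hole.
        List<Integer> compacted = List.of(totalFree(fragmented));
        System.out.println(fitsWithoutCompaction(compacted, 48));   // prints true
    }
}
```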

Given the importance of application response times, a need clearly exists for concurrent collectors that exhibit fewer sensitivities to application behaviour metrics. The most widely used collectors in production systems however are often unable to achieve these goals.

According to Tene, the Azul GPGC collector included in the Zing platform is designed to be insensitive to many of these metrics, and remain robust across a wide operating range. Through the use of a guaranteed single-pass marker, the collector is completely insensitive to mutation rates. By performing concurrent compaction of the entire heap, the collector is insensitive to fragmentation. By performing weak, soft, and final reference processing concurrently, the collector has been made insensitive to the use of such features by applications. By using quick-release, and through the nature of the loaded value barrier’s self-healing properties, the collector avoids the sensitivity of “being in a hurry” to complete a phase of its operation for fear that its efficiency might drop, or that it may not be able to complete a phase or make forward progress without falling back to a stop-the-world pause. In the sections that follow we will explore how this is achieved, comparing Azul’s approach with other commonly used collector algorithms.

The Azul Garbage Collector

Azul's GPGC[1] collector (which stands for Generational Pauseless Garbage Collector), included in its HotSpot-based JVM, is both parallel and concurrent. GPGC has been widely used in a number of production systems for several years now, and has been successful at removing or significantly reducing sensitivity to the factors that typically cause other concurrent collectors to pause. Factors that the Azul GPGC collector was specifically designed to be insensitive to include fragmentation, allocation rates, mutation rates, use of soft or weak references, and heap topology. Tene explained that whilst GPGC is a generational collector, this was mostly an efficiency measure; GPGC uses the same GC mechanism for both the new and old generations, working concurrently and compacting in both cases. Most importantly, GPGC has no stop-the-world fallback. All compaction is performed concurrently with the running application.

According to Tene, a central theme of the algorithm's design was the idea that there is no “rush” to finish any given phase. No phase places a substantial burden on the mutators that needs to be relieved by ending the phase quickly. There is no “race” to finish some phase before collection can begin again; relocation runs continuously and can immediately free memory at any point. Moreover, since all phases are parallel, the GC can keep up with any number of mutator threads simply by adding more GC threads.

The Loaded Value Barrier

Core to GPGC’s design, the collector uses what Tene calls a “loaded value barrier”, a type of read barrier which tests the values of object references as they are loaded from memory, and enforces key collector invariants. The loaded value barrier (LVB) effectively ensures that all references are “sane” before the mutator ever sees them, and this strong guarantee is responsible for the algorithm's relative simplicity.

The loaded value barrier enforces two critical qualities for all application-visible references:

The reference state indicates that it has been “marked through” if a mark phase is in progress (see the Mark Phase description below).

The reference is not pointing to an object that has been relocated.

If either of the above conditions is not met, the loaded value barrier will trigger a “trap condition”, and the collector will correct the reference to adhere to the required invariants before it becomes visible to application code. The use of the loaded value barrier test in combination with self healing trapping (see below) ensures safe single pass marking, eliminating the possibility that the application threads would cause live references to escape the collector’s reach. The same barrier and trapping mechanism combination is also responsible for supporting lazy, on-demand, concurrent compaction (both object relocation and reference remapping), ensuring that application threads will always see the proper values for object references, without ever needing to wait for the collector to complete either compaction or remapping.

“Self Healing”

Key to GPGC’s concurrent collection is the self-healing nature of handling barrier “trap” conditions. When a loaded value barrier test indicates that a loaded reference value must be changed before the application code proceeds, the value of the loaded reference AND the value of the memory location it was loaded from will both be modified to adhere to the collector’s current invariant requirements (e.g. to indicate that the reference has already been marked through, or to remap the reference to a new object location). By correcting the cause of the trap in the source memory location (possible only with a read barrier, such as the loaded value barrier, that intercepts the source address), the GC trap has a “self healing” effect: the same object references will not re-trigger additional GC traps for this or other application threads. This ensures a finite and predictable amount of work in a mark phase, as well as the relocate and remap phases. Azul coined the term “self healing” in their first publication of the pauseless GC algorithm in 2005, and Tene believes this "self-healing" aspect is still unique to the Azul collector.
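The self-healing behaviour described above can be sketched in a few lines. This is a simplified model, not Azul's implementation: the reference encoding (NMT state in the low bit of a long), the LvbSketch class, and the trap counter are all assumptions made for illustration, and the mark queue and counter are not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical model of a loaded value barrier with self-healing.
final class LvbSketch {
    static final long NMT_BIT = 1L;                     // assumed encoding: low bit holds the NMT state
    final AtomicLongArray slots;                        // memory locations that hold references
    final Deque<Long> markQueue = new ArrayDeque<>();   // addresses handed to the marker
    long expectedNmt = 0;                               // NMT value meaning "marked through" this cycle
    int traps = 0;                                      // how often the slow path ran

    LvbSketch(int slotCount) { slots = new AtomicLongArray(slotCount); }

    static long encode(long address, long nmt) { return (address << 1) | nmt; }
    static long address(long ref) { return ref >>> 1; }

    // A new mark cycle flips the meaning of the NMT bit.
    void startMarkCycle() { expectedNmt ^= NMT_BIT; }

    // Every reference load goes through this test before the mutator sees the value.
    long loadReference(int slot) {
        long ref = slots.get(slot);
        if ((ref & NMT_BIT) == expectedNmt) {
            return ref;                                 // hot path: one test, no trap
        }
        // Slow path ("trap"): fix the value and queue the target for the marker.
        traps++;
        long healed = (ref & ~NMT_BIT) | expectedNmt;
        markQueue.add(address(healed));
        // Self-healing: also repair the memory location the value came from,
        // so later loads of this slot take the hot path. If the CAS fails,
        // another thread has already changed the slot.
        slots.compareAndSet(slot, ref, healed);
        return healed;
    }

    public static void main(String[] args) {
        LvbSketch heap = new LvbSketch(1);
        heap.slots.set(0, encode(42, 0));
        heap.startMarkCycle();
        heap.loadReference(0);                          // traps and heals
        heap.loadReference(0);                          // hot path
        System.out.println(heap.traps);                 // prints 1
    }
}
```

Because the slot itself is repaired, the second load does not trap again, which is what bounds the amount of barrier work per reference in this model.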

Azul implements the same logical LVB tests on modern x86 processors as it first did with its custom Vega processors. On Azul’s Vega hardware, LVB tests include bit field checking in reference metadata as well as special virtual memory protection for GC-compacted pages. In addition, Azul’s hardware supports a number of fast user-mode trap handlers to handle “slow path” trap conditions triggered when values that would contradict the collector’s current invariants are loaded from memory and tested. These “slow path” trap handlers can be entered and left in a handful of clock cycles (4-10, depending) and are used by the GC algorithm to handle such conditions - these are the “Self Healing traps”. On modern x86-64 hardware, the same LVB effects are achieved using a combination of instructions in the execution stream and proper manipulation of virtual memory mapping (see “Appendix A” for more details).

How the Azul Algorithm Works

The Azul algorithm is implemented in three logical phases as illustrated below.

Mark: responsible for periodically refreshing the mark bits.

Relocate: uses the most recently available mark bits to find pages with little live data, to relocate and compact those pages and to free the backing physical memory.

Remap: updates (forwards) every relocated pointer in the heap.

The Mark Phase

The Azul GC mark phase has been characterised by David Bacon et al. as a "precise wavefront" marker, not a snapshot-at-the-beginning (SATB) marker, augmented with a read barrier. The mark phase introduces and tracks a “not marked through” (NMT) state on object references. The NMT state is tracked as a metadata bit in each reference, and the loaded value barrier verifies that each reference’s state matches the current GC cycle’s expected NMT state. The invariant imposed on the NMT state by the loaded value barrier eliminates the possibility that the application threads would cause live references to escape the collector’s reach, allowing the collector to guarantee robust and efficient single pass marking.

At the start of the mark phase, the marker’s work list is “primed” with a root-set which includes all object references in application threads. As is common to all markers, the root-set generally includes all refs in CPU registers and on the threads' stacks. Running threads collaborate by marking their own root-set, while blocked (or stalled) threads get marked in parallel by the collector’s mark-phase threads.

Rather than use a global, stop-the-world safepoint (where all application threads are stopped at the same time), the marker algorithm uses a “checkpoint” mechanism. As the Azul 2005 VEE "Pauseless GC Algorithm" paper explained:

Each thread can immediately proceed after its root set has been marked (and expected-NMT flipped) but the mark phase cannot proceed until all threads have crossed the Checkpoint.

(Note: this paper is no longer available on Azul's website).
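The checkpoint idea in the quoted passage can be approximated with a countdown latch: each thread marks its own roots, crosses the checkpoint, and carries on at once, while the mark phase waits only until every thread has crossed. This is a loose analogy using java.util.concurrent, not Azul's mechanism; CheckpointSketch and runCheckpoint are hypothetical names, and the "root marking" is reduced to incrementing a counter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical analogy for a checkpoint (as opposed to a global safepoint).
final class CheckpointSketch {
    // Returns how many root sets were marked at the moment the mark phase proceeded.
    static int runCheckpoint(int threadCount) {
        CountDownLatch checkpoint = new CountDownLatch(threadCount);
        AtomicInteger rootSetsMarked = new AtomicInteger();
        Thread[] mutators = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            mutators[i] = new Thread(() -> {
                rootSetsMarked.incrementAndGet();  // stand-in for marking this thread's own roots
                checkpoint.countDown();            // cross the checkpoint
                // The thread resumes application work here without waiting for the others.
            });
            mutators[i].start();
        }
        try {
            checkpoint.await();                    // the mark phase waits for all threads to cross
            int markedWhenPhaseProceeds = rootSetsMarked.get();
            for (Thread t : mutators) t.join();
            return markedWhenPhaseProceeds;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting at the checkpoint", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runCheckpoint(4)); // prints 4
    }
}
```

The contrast with a safepoint is that no thread ever waits for another thread here; only the collector waits.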

After the root-sets are all marked the algorithm continues with a parallel and concurrent marking phase. Live refs are pulled from the worklists, their target objects marked live and their internal refs are recursively worked on.

In addition to references discovered by the marker’s own threads during the normal course of following the marking work lists, freely running mutator threads can discover and queue up references they encounter that do not have their NMT bit set to the expected “marked through” value when they are loaded from memory. Such reference loads trigger the loaded value barrier’s trapping condition, at which point the offending reference will be queued to ensure that it will be traversed and properly marked through by the collector. Its NMT value will therefore be immediately fixed and healed (in the original memory location) to indicate that it can be considered to be properly marked through, avoiding further LVB condition triggers.

The mark phase continues until all objects in the marker work list are exhausted, at which point all live objects have been traversed. At the end of the mark phase, only objects that are known to be dead are not marked as “live”, and all valid references have their NMT bit set to “marked through”.

It is worth noting that the GPGC mark phase also performs concurrent processing of soft, weak, and phantom references. This quality makes the collector relatively insensitive to the number of soft or weak references used by the application.

The Relocate and Remap Phases

Relocation

The relocate phase is where objects are relocated and pages are reclaimed. During this phase, pages with some dead objects are made wholly unused by concurrently relocating their remaining live objects to other pages. References to these relocated objects are not immediately remapped to point to the new object locations. Instead, by relying on the loaded value barrier’s imposed invariants, reference remapping can proceed lazily and concurrently after relocation, until the completion of the collector’s next remap phase assures the collector that no references that require remapping exist.

During relocation, sets of pages (starting with the sparsest pages) are selected for relocation and compaction. Each page in the set is protected from mutator access, and all live objects in the page are copied out and relocated into contiguous, compacted pages. Forwarding information tracking the location of relocated objects is maintained outside the original page, and is used by concurrent remapping.

During and after relocation, any attempts by the mutator to use references to relocated objects are intercepted and corrected. Attempts by the mutator to load such references will trigger the loaded value barrier’s trapping condition, at which point the stale reference will be corrected to point to the object’s proper location, and the original memory location that the reference was loaded from will be healed to avoid future triggering of the LVB condition.
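A minimal sketch of this lazy correction follows, assuming a forwarding table kept outside the relocated page and a load barrier that consults it. The RelocationSketch class, the 1024-byte page size, and the address arithmetic are all invented for illustration; the sketch also assumes every live object on a relocated page already has a forwarding entry.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of relocation with lazy, self-healing remapping.
final class RelocationSketch {
    static final long PAGE_SIZE = 1024;                       // assumed page size for the toy heap
    final long[] slots;                                       // memory locations holding references
    final Set<Long> relocatedPages = new HashSet<>();         // pages protected from mutator access
    final Map<Long, Long> forwarding = new HashMap<>();       // old address -> new address, kept off-page
    int traps = 0;

    RelocationSketch(int slotCount) { slots = new long[slotCount]; }

    // Records that an object has moved and that its old page is now protected.
    void relocate(long oldAddress, long newAddress) {
        forwarding.put(oldAddress, newAddress);
        relocatedPages.add(oldAddress / PAGE_SIZE);
    }

    // Load barrier: a reference into a relocated page is stale and must be corrected.
    long loadReference(int slot) {
        long ref = slots[slot];
        if (!relocatedPages.contains(ref / PAGE_SIZE)) {
            return ref;                                       // hot path
        }
        traps++;
        long current = forwarding.get(ref);                   // assumes the object was already copied
        slots[slot] = current;                                // heal the source location
        return current;
    }

    public static void main(String[] args) {
        RelocationSketch heap = new RelocationSketch(2);
        long oldAddress = 3 * PAGE_SIZE + 16;
        long newAddress = 7 * PAGE_SIZE;
        heap.slots[0] = oldAddress;
        heap.slots[1] = oldAddress;
        heap.relocate(oldAddress, newAddress);
        System.out.println(heap.loadReference(0) == newAddress); // prints true
        System.out.println(heap.slots[1] == oldAddress);         // prints true: not yet loaded, still stale
    }
}
```

In this model the second slot stays stale until something loads it, which is why the collector still needs a later pass over all references before it can retire the old page's address range.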

“Quick Release”

Using a feature that Tene calls “Quick Release”, the GPGC collector immediately recycles memory page resources without waiting for remapping to complete. By keeping all forwarding information outside of the original page, the collector is able to safely release physical memory immediately after the page contents have been relocated (and before remapping has been completed). A compacted page’s virtual memory space cannot be freed until no more stale references to that page remain in the heap (which will only be reliably true at the end of the next remap phase), but the physical memory resources backing that page are immediately released by the relocate phase, and recycled at new virtual memory locations as needed. The quick-released physical resources are used to satisfy new object allocations, as well as the collector’s own compaction pipeline. By using “hand over hand” compaction along with the quick release feature, page resources released by the compaction of one page are used as compaction destinations for compacting additional pages, and the collector is able to compact the entire heap in a single pass without requiring empty target memory to compact into.
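The "hand over hand" effect can be illustrated with a small simulation that only counts physical pages. It is a toy model under stated assumptions: pages hold 100 units, live data can be split freely across destination pages, and QuickReleaseSketch.compact is a hypothetical function, not Azul's compaction pipeline.

```java
import java.util.Arrays;

// Hypothetical page-counting model of quick release with hand-over-hand compaction.
final class QuickReleaseSketch {
    static final int PAGE_CAPACITY = 100;   // assumed units of data per page

    // Compacts all source pages, starting with the given number of spare physical pages.
    // Returns the number of free physical pages at the end.
    static int compact(int[] liveUnitsPerPage, int sparePages) {
        int[] sources = liveUnitsPerPage.clone();
        Arrays.sort(sources);               // sparsest pages first
        int freePages = sparePages;
        int roomInDestination = 0;          // space left in the current destination page
        for (int live : sources) {
            int remaining = live;
            while (remaining > 0) {
                if (roomInDestination == 0) {
                    if (freePages == 0) {
                        throw new IllegalStateException("no free physical page to compact into");
                    }
                    freePages--;            // take a page as the next compaction destination
                    roomInDestination = PAGE_CAPACITY;
                }
                int moved = Math.min(remaining, roomInDestination);
                remaining -= moved;
                roomInDestination -= moved;
            }
            freePages++;                    // quick release: the emptied source page is reusable at once
        }
        return freePages;
    }

    public static void main(String[] args) {
        // Six pages holding 200 units of live data in total, one spare page to start with.
        int[] live = {50, 10, 40, 20, 50, 30};
        System.out.println(compact(live, 1)); // prints 5
    }
}
```

With a single spare page the whole toy heap is compacted: each emptied source page becomes a destination for later pages, so the simulation never runs out of physical memory.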

Remap

During the remap phase, which follows the relocate phase, GC threads complete reference remapping by traversing the object graph and executing a loaded value barrier test on every live reference found in the heap. If a reference is found to be pointing to a relocated object, it is corrected to point to the object’s proper location. Once the remap phase completes, no live heap reference can exist that would refer to pages protected by the previous relocate phase, and at that point the virtual memory for those pages is freed.

Since the remap phase traverses the same live object graph as a mark phase would, and because the collector is in no hurry to complete the remap phase, the two logical phases are rolled into one in actual implementation, known as the Combined Mark and Remap phase. In each combined Mark-Remap phase, the collector will complete the remapping of references affected by the previous relocate phase, and at the same time perform the marking and discovery of live objects used by the next relocate phase.

Comparison with existing Garbage Collectors

HotSpot’s Concurrent Mark Sweep (CMS) is a mostly concurrent collector. It performs some garbage collection operations concurrently with program execution, but leaves some operations to long stop-the-world pauses. Tene described it as follows:

HotSpot’s CMS collector uses a full, stop-the-world parallel new generation collector which compacts the new generation on every pass. Since young generation collection is very efficient, this stop-the-world pass typically completes very quickly - in the order of 100s of milliseconds or less. CMS uses a mostly-concurrent collector for OldGen. It has a mostly concurrent, multipass marker that marks the heap while the mutator is running, and tracks mutations as they happen. It then revisits these mutations and repeats the marking. CMS does a final stop-the-world “catch up” marking of mutations, and also processes all weak and soft refs in a stop-the-world pause. CMS does not compact concurrently. Instead, it does concurrent sweeping, maintains a free-list, and attempts to satisfy old generation allocations from this free list. However, because the free list is not compacted, free space will inevitably be fragmented, and CMS falls back to a full stop-the-world to compact the heap.

Oracle’s experimental Garbage First (G1) collector, which is expected to ship as part of Java 7, is an incrementally compacting collector that uses a snapshot-at-the-beginning (henceforth SATB) concurrent marking algorithm and stop-the-world compaction pauses to compact only parts of the heap in each pause. G1 allows the user to set pause time goals, and uses these goal inputs along with topology information collected during its mark phase in order to compact a limited amount of heap space in each pause, in an attempt to contain the amount of the heap references that need to be scanned during each pause.

Tene explains that, given its experimental state, it is still early to tell how G1 will fare with real applications. According to Tene, while G1 is intended to incrementally improve upon CMS’s handling of fragmentation, its compactions are still performed in complete stop-the-world conditions, and its ability to limit the length of incremental compaction pauses strongly depends on an application’s specific heap topology and object relationships. Applications with popular objects or popular heap portions will still experience long stop-the-world pauses when the entire heap needs to be scanned for references that need remapping, and those long pauses, while less frequent, will be no shorter than those experienced by the current CMS collector. As Tene puts it:

If a collector includes code that performs a full heap scan in stop-the-world conditions, that code is intended to run at some point, and applications should expect to eventually experience such pauses.

Whilst G1 attempts to improve determinism, it cannot guarantee it. Tene explained:

It does let the user set pause time goals, but will only “try” to follow those goals. Even if OS scheduling was perfect, G1 would not guarantee determinism as its ability to contain individual compaction pauses is inherently sensitive to heap shape. Popular objects and (more importantly) popular heap regions (regions referred to from many other regions, but do not necessarily need to have popular individual objects) will cause G1 to eventually perform a full heap compaction in a single pause if they were to be compacted… Real application patterns (such as slowly churning LRU caches that get lots of hits) will exhibit such heap relationships in the vast majority of the heap, forcing G1 to perform what is effectively full compaction remap scans in a single pause…

A previous InfoQ article provides more of the technical details.

Conclusion

With the introduction of Zing, Azul has made pauseless garbage collection available in a pure software product on commodity hardware, making it more consumable and much easier to adopt. Pauseless garbage collection can:

Allow Java instances to make effective use of the abundant, cheap, commodity hardware capacities that are already available, and grow with future improvements to hardware.

Allow architects and developers to make aggressive use of memory capacity in their designs, applying techniques such as large in-memory, in-process caching, as well as complete in-memory representations to new problem sets.

Allow developers to avoid the fine tuning and re-tuning of current sensitive GC systems.

Appendix A: Implementing the Algorithm on modern Intel and AMD processors

The GPGC algorithm was first deployed on Azul’s Vega systems, and has evolved and matured since it was first commercially introduced in 2005. It is now available on both Vega and x86-64 architectures. On Vega’s custom processors, GPGC made use of a special loaded value barrier (LVB) instruction to perform the barrier checks. Recent improvements in Intel and AMD processors, along with virtualization, have allowed Azul to bring the same capabilities to Intel and AMD based servers. While a single-cycle LVB instruction does not currently exist on x86-64 architectures, Azul uses its JIT compiler to generate a semantically equivalent set of x86 instructions and efficiently interleave it with the normal instruction stream. Specifically for Intel, Azul exploits the EPT (Extended Page Table) feature (which first appeared in Intel's Xeon 55xx, and later in Xeon 56xx, 65xx and 75xx chips), and for AMD the NPT (otherwise known as AMD-V Nested Paging) feature. This works in conjunction with the x86 virtual memory subsystem to remap and protect GC-compacted pages and thereby achieve the same loaded value barrier effect, and maintain the same algorithmic invariants needed for the Pauseless GC algorithms to work.

To simulate the “fast traps” on x86, Azul injects a sequence of x86 instructions that perform a semantically equivalent set of tests and a conditional call at the loaded value barrier site, all using a single conditional branch in the hot case. Tene described this as analogous to an "LVB" instruction using x86 instructions as "micro-code".

The x86 instructions are injected by the JIT compilers, and we chose a set of operations that hide well in the nice wide 4-issue, out of order, speculative execution pipeline of modern x86s. (A modern Nehalem core, for example, has a 128-u-op deep reorder buffer, with 36 reservation stations, combined with macrofusion, loop streaming, and other cool features).

An example discussion can be found here.

Tene went on to provide some details on how the hot code can use the pipeline efficiently:

Since a "trapping" condition is rare (a ref having an offending NMT+SpaceID combo, or a page number that is being relocated), Azul have designed the sequence to use a single conditional branch for determining that it doesn't happen. This single branch is correctly predicted by the x86 processors as "not going down the branching path", and the x86 processor will simply keep executing far down the hot path.

Since the operations needed for resolving the branch are dependent only on registers and a nearly-always-cache-hitting memory access, the branch is commonly resolved well before the processor runs out of speculation depth, keeping the pipeline going without a large stall.

x86 rarely executes semantic program code that can efficiently fill up all 4 of its execution units all the time - it would have a CPI of 0.25 if it did, and the CPI is typically much higher than that in real apps due to memory access and cache misses needed by the actual program logic. At a CPI of 0.25 the extra ops we do would certainly cost, but at higher CPIs they hide well in the pipe. Think about it this way: adding cache hitting and register-only ops to an otherwise identical semantic operations flow becomes cheaper and cheaper the higher the CPI of the original flow is, as there is plenty of idle room in the pipe.
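The CPI argument can be worked through with a deliberately crude model. Assume that run time is the larger of two bounds: the issue bound (instructions divided by the 4-wide issue width) and a stall bound set by the original code's memory behaviour, which the extra barrier operations do not change. The model, the BarrierCostSketch class, and the numbers (100 original instructions, 10 extra operations) are assumptions for illustration only, not measurements of any real processor.

```java
// Hypothetical toy model: cycles = max(issue bound, stall bound).
final class BarrierCostSketch {
    static final double ISSUE_WIDTH = 4.0;   // instructions issued per cycle at best

    static double cycles(double instructions, double stallBoundCycles) {
        return Math.max(instructions / ISSUE_WIDTH, stallBoundCycles);
    }

    // Relative slowdown from adding cache-hitting, register-only barrier ops
    // to a flow that originally ran at the given CPI (assumed to be >= 0.25).
    static double relativeOverhead(double baselineCpi, double instructions, double extraOps) {
        double stallBound = instructions * baselineCpi;    // unchanged by the extra ops
        double before = cycles(instructions, stallBound);
        double after = cycles(instructions + extraOps, stallBound);
        return after / before - 1.0;
    }

    public static void main(String[] args) {
        // At a CPI of 0.25 the pipe is full, so 10% more ops cost about 10% more time.
        System.out.println(relativeOverhead(0.25, 100, 10));
        // At a CPI of 1.0 the extra ops fit into idle issue slots in this model.
        System.out.println(relativeOverhead(1.0, 100, 10)); // prints 0.0
    }
}
```

Real pipelines are far less tidy than this, but the sketch shows the shape of the claim: the cost of the added operations shrinks as the baseline CPI rises.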

Appendix B: GC Testing Methodology Recommendations

In this article we’ve looked at the various factors, such as mutation rates, soft references and object lifetime, which garbage collectors can be sensitive to. However, regardless of which GC algorithm you use, it is important to test how your application performs during interesting garbage collection events. During load testing, an inexperienced team will often tune garbage collection so that it no longer occurs. However, this does not prevent it from happening in the real world.

Tene suggested specifically designing tests that can cram a few days' or several hours' worth of interesting GC events into a short but practical test length, by adding low load noise. There are several techniques which are good for this, such as background LRU caches. But one of the easiest, and most effective, is a simple fragmentation generator, such as the Fragger tool which Tene wrote and which is freely available from the Azul website.

Fragger works by repeatedly generating large sets of objects of a given size, pruning each set down to a much smaller remaining live set, and increasing the object size between passes such that it becomes unlikely to fit in the areas freed up by objects released in a previous pass without some amount of compaction. Fragger ages object sets before pruning them down in order to bypass potential artificial early compaction by young generation collectors. When run with default settings, Fragger will occupy ~25% of the total heap space, and allocate objects at a rate of 20MB/sec.
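The generate, prune, and grow cycle described above can be sketched as follows. This is not the Fragger tool and shares no code with it: FragmentationGeneratorSketch, its parameters, and the default-free design are invented for this example, and it omits the ageing step, the allocation-rate throttling, and any bound on the retained live set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, much simplified fragmentation generator (not Fragger itself).
final class FragmentationGeneratorSketch {
    // Allocates objectsPerPass arrays per pass, keeps every keepEvery-th one,
    // and grows the object size between passes. Returns the retained live set.
    static List<byte[]> run(int passes, int objectsPerPass, int keepEvery,
                            int initialSize, double growthFactor) {
        List<byte[]> retained = new ArrayList<>();
        int size = initialSize;
        for (int pass = 0; pass < passes; pass++) {
            // Generate a large set of same-sized objects.
            List<byte[]> generated = new ArrayList<>(objectsPerPass);
            for (int i = 0; i < objectsPerPass; i++) {
                generated.add(new byte[size]);
            }
            // Prune: only a sparse subset stays live; the rest become garbage,
            // leaving size-byte holes between the survivors.
            for (int i = 0; i < generated.size(); i += keepEvery) {
                retained.add(generated.get(i));
            }
            // Grow: the next pass's objects are unlikely to fit in those holes.
            size = (int) Math.ceil(size * growthFactor);
        }
        return retained;
    }

    public static void main(String[] args) {
        List<byte[]> live = run(3, 100, 10, 64, 1.5);
        System.out.println(live.size());                        // prints 30
        System.out.println(live.get(live.size() - 1).length);   // prints 144
    }
}
```

A real test harness would run something like this in a background thread alongside the application's normal load, with sizes and rates chosen for the heap under test.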

The general idea is to test the metrics we’ve discussed for sensitivity, and establish the range in which your application works well. To do this, vary the metrics under your control by increasing the things the application does to generate them. For example you can:

increase the size of state that each session or main object carries.

increase the lifetime of state associated with sessions or requests.

increase the amount of data the application caches.

By pushing these things until your application ceases to meet your performance requirements, you’ll find the edges of the safe envelope inside which your application can run.

Footnotes

1. Since this article was first published GPGC has been renamed to C4 (Continuously Concurrent Compacting Collector). The algorithm described here is the same as that used in the C4 collector.