Java HotSpot™ Virtual Machine Performance Enhancements

Tiered compilation, introduced in Java SE 7, brings client startup speeds to the server VM. Normally, a server VM uses the interpreter to collect profiling information about methods that is fed into the compiler. In the tiered scheme, in addition to the interpreter, the client compiler is used to generate compiled versions of methods that collect profiling information about themselves. Since the compiled code is substantially faster than the interpreter, the program executes with greater performance during the profiling phase. In many cases, a startup that is even faster than with the client VM can be achieved because the final code produced by the server compiler may be already available during the early stages of application initialization. The tiered scheme can also achieve better peak performance than a regular server VM because the faster profiling phase allows a longer period of profiling, which may yield better optimization.

Both 32 and 64 bit modes are supported, as well as compressed oops (see the next section). Use the -XX:+TieredCompilation flag with the java command to enable tiered compilation.

An "oop", or ordinary object pointer in Java Hotspot parlance, is a managed pointer to an object. An oop is normally the same size as a native machine pointer, which means 64 bits on an LP64 system. On an ILP32 system, maximum heap size is somewhat less than 4 gigabytes, which is insufficient for many applications. On an LP64 system, the heap used by a given program might have to be around 1.5 times larger than when it is run on an ILP32 system. This requirement is due to the expanded size of managed pointers. Memory is inexpensive, but these days bandwidth and cache are in short supply, so significantly increasing the size of the heap and only getting just over the 4 gigabyte limit is undesirable.

Managed pointers in the Java heap point to objects which are aligned on 8-byte address boundaries. Compressed oops represent managed pointers (in many but not all places in the JVM software) as 32-bit object offsets from the 64-bit Java heap base address. Because they're object offsets rather than byte offsets, they can be used to address up to four billion objects (not bytes), or a heap size of up to about 32 gigabytes. To use them, they must be scaled by a factor of 8 and added to the Java heap base address to find the object to which they refer. Object sizes using compressed oops are comparable to those in ILP32 mode.

The term decode is used to express the operation by which a 32-bit compressed oop is converted into a 64-bit native address into the managed heap. The inverse operation is referred to as encoding.

Compressed oops is supported and enabled by default in Java SE 6u23 and later. In Java SE 7, use of compressed oops is the default for 64-bit JVM processes when -Xmx isn't specified and for values of -Xmx less than 32 gigabytes. For JDK 6 before the 6u23 release, use the -XX:+UseCompressedOops flag with the java command to enable the feature.

When using compressed oops in a 64-bit Java Virtual Machine process, the JVM software asks the operating system to reserve memory for the Java heap starting at virtual address zero. If the operating system supports such a request and can reserve memory for the Java heap at virtual address zero, then zero-based compressed oops are used.

Use of zero-based compressed oops means that a 64-bit pointer can be decoded from a 32-bit object offset without adding in the Java heap base address. For heap sizes less than 4 gigabytes, the JVM software can use a byte offset instead of an object offset and thus also avoid scaling the offset by 8. Encoding a 64-bit address into a 32-bit offset is correspondingly efficient.

For Java heap sizes up around 26 gigabytes, any of Solaris, Linux, and Windows operating systems will typically be able to allocate the Java heap at virtual address zero.

Escape analysis is a technique by which the Java Hotspot Server Compiler can analyze the scope of a new object's uses and decide whether to allocate it on the Java heap.

Escape analysis is supported and enabled by default in Java SE 6u23 and later.

The Java Hotspot Server Compiler implements the flow-insensitive escape analysis algorithm described in:

[Choi99] Jong-Deok Choi, Manish Gupta, Mauricio Seffano, Vugranam C. Sreedhar, Sam Midkiff, "Escape Analysis for Java", Procedings of ACM SIGPLAN OOPSLA Conference, November 1, 1999

Based on escape analysis, an object's escape state might be one of the following:

GlobalEscape – An object escapes the method and thread. For example, an object stored in a static field, or, stored in a field of an escaped object, or, returned as the result of the current method.

– An object escapes the method and thread. For example, an object stored in a static field, or, stored in a field of an escaped object, or, returned as the result of the current method. ArgEscape – An object passed as an argument or referenced by an argument but does not globally escape during a call. This state is determined by analyzing the bytecode of called method.

– An object passed as an argument or referenced by an argument but does not globally escape during a call. This state is determined by analyzing the bytecode of called method. NoEscape – A scalar replaceable object, meaning its allocation could be removed from generated code.

After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does not replace a heap allocation with a stack allocation for non-globally escaping objects.

Some scenarios for escape analysis are described next.

The server compiler might eliminate certain object allocations. Consider the example where a method makes a defensive copy of an object and returns the copy to the caller. public class Person { private String name; private int age; public Person(String personName, int personAge) { name = personName; age = personAge; } public Person(Person p) { this(p.getName(), p.getAge()); } public int getName() { return name; } public int getAge() { return age; } } public class Employee { private Person person; // makes a defensive copy to protect against modifications by caller public Person getPerson() { return new Person(person) }; public void printEmployeeDetail(Employee emp) { Person person = emp.getPerson(); // this caller does not modify the object, so defensive copy was unnecessary System.out.println ("Employee's name: " + person.getName() + "; age: " + person.getAge()); } } The method makes a copy to prevent modification of the original object by the caller. If the compiler determines that the getPerson method is being invoked in a loop, it will inline that method. In addition, through escape analysis, if the compiler determines that the original object is never modified, it might optimize and eliminate the call to make a copy.

The server compiler might eliminate synchronization blocks (lock elision) if it determines that an object is thread local. For example, methods of classes such as StringBuffer and Vector are synchronized because they can be accessed by different threads. However, in most scenarios, they are used in a thread local manner. In cases where the usage is thread local, the compiler might optimize and remove the synchronization blocks.

The Parallel Scavenger garbage collector has been extended to take advantage of machines with NUMA (Non Uniform Memory Access) architecture. Most modern computers are based on NUMA architecture, in which it takes a different amount of time to access different parts of memory. Typically, every processor in the system has a local memory that provides low access latency and high bandwidth, and remote memory that is considerably slower to access.

In the Java HotSpot Virtual Machine, the NUMA-aware allocator has been implemented to take advantage of such systems and provide automatic memory placement optimizations for Java applications. The allocator controls the eden space of the young generation of the heap, where most of the new objects are created. The allocator divides the space into regions each of which is placed in the memory of a specific node. The allocator relies on a hypothesis that a thread that allocates the object will be the most likely to use the object. To ensure the fastest access to the new object, the allocator places it in the region local to the allocating thread. The regions can be dynamically resized to reflect the allocation rate of the application threads running on different nodes. That makes it possible to increase performance even of single-threaded applications. In addition, "from" and "to" survivor spaces of the young generation, the old generation, and the permanent generation have page interleaving turned on for them. This ensures that all threads have equal access latencies to these spaces on average.

The NUMA-aware allocator is available on the Solaris™ operating system starting in Solaris 9 12/02 and on the Linux operating system starting in Linux kernel 2.6.19 and glibc 2.6.1.

The NUMA-aware allocator can be turned on with the -XX:+UseNUMA flag in conjunction with the selection of the Parallel Scavenger garbage collector. The Parallel Scavenger garbage collector is the default for a server-class machine. The Parallel Scavenger garbage collector can also be turned on explicitly by specifying the -XX:+UseParallelGC option.

The -XX:+UseNUMA flag was added in Java SE 6u2.

Note: There was a known bug in the Linux Kernel that may cause the JVM to crash when being run with -XX:UseNUMA . The bug was fixed in 2012, so this should not affect the latest versions of the Linux Kernel. To see if your Kernel has this bug, you can run the native reproducer.

NUMA Performance Metrics

When evaluated against the SPEC JBB 2005 benchmark on an 8-chip Opteron machine, NUMA-aware systems showed the following performance increases: