“Challenge accepted,” said Tagir Valeev when I recently asked the readers of the jOOQ blog to show whether the Java JIT (Just-In-Time) compiler can optimise away a for loop.

Tagir is the author of StreamEx, a very useful Java 8 Stream extension library that adds additional parallelism features on top of standard streams. He’s a speaker at conferences and has contributed a dozen patches to the OpenJDK Stream API (including bug fixes, performance optimizations and new features). He’s interested in static code analysis and works on a new Java bytecode analyzer.

I’m very happy to publish Tagir’s guest post here on the jOOQ blog.

The Java JIT Compiler

In a recent article, Lukas wondered whether the JIT could optimize code like this to remove an unnecessary iteration:

// ... than this, where we "know" the list
// only contains one value
for (Object object : Collections.singletonList("abc")) {
    doSomethingWith(object);
}

Here’s my answer: the JIT can do even better. Let’s consider this simple method, which calculates the total length of all the strings in the supplied list:

static int testIterator(List<String> list) {
    int sum = 0;
    for (String s : list) {
        sum += s.length();
    }
    return sum;
}

As you might know, this code is equivalent to the following:

static int testIterator(List<String> list) {
    int sum = 0;
    Iterator<String> it = list.iterator();
    while (it.hasNext()) {
        String s = it.next();
        sum += s.length();
    }
    return sum;
}

Of course, in the general case the list could be anything, so when creating an iterator and calling the hasNext and next methods, the JIT must emit honest virtual calls, which are not very fast. However, what happens if you always supply a singletonList here? Let’s create a simple test:

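import java.util.Collections;
import java.util.List;

public class Test {
    static int res = 0;

    public static void main(String[] args) {
        // Call testIterator often enough to trigger JIT compilation
        for (int i = 0; i < 100000; i++) {
            res += testIterator(Collections.singletonList("x"));
        }
        System.out.println(res);
    }

    // The testIterator(List<String>) method shown above goes here
}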

We call testIterator in a loop so that it runs often enough to be compiled by the C2 JIT compiler. As you might know, the HotSpot JVM has two JIT compilers: the C1 (client) compiler and the C2 (server) compiler. In 64-bit Java 8 they work together. First, a method is compiled with C1, and special instructions are added to gather statistics (this is called profiling). Among them are type statistics: the JVM carefully records which exact types our list variable has. In our case it discovers that in 100% of cases it’s a singleton list and nothing else. When the method is called often enough, it gets recompiled by the better C2 compiler, which can use this information. Thus, when C2 compiles the method, it can assume that a singleton list will appear quite often in the future as well.

You can ask the JIT compiler to output the assembly generated for methods. To do this, you need to install hsdis on your system. After that, you can use convenient tools like JITWatch, or write a JMH benchmark and use its perfasm profiler (-prof perfasm). Here we will not use third-party tools and will simply launch the JVM with the following command-line options:

$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintAssembly Test >output.txt

This will generate a huge amount of output, which may scare the children. The assembly generated by the C2 compiler for our testIterator method looks like this (on the Intel x64 platform):

# {method} {0x0000000055120518}
# 'testIterator' '(Ljava/util/List;)I' in 'Test'
# parm0: rdx:rdx = 'java/util/List'
# [sp+0x20] (sp of caller)
0x00000000028e7560: mov %eax,-0x6000(%rsp)
0x00000000028e7567: push %rbp
  ;*synchronization entry
  ; - Test::testIterator@-1 (line 15)
0x00000000028e7568: sub $0x10,%rsp
  ; implicit exception: dispatches to 0x00000000028e75bd
0x00000000028e756c: mov 0x8(%rdx),%r10d
  ; {metadata('java/util/Collections$SingletonList')}
0x00000000028e7570: cmp $0x14d66a20,%r10d
  ;*synchronization entry
  ; - java.util.Collections::singletonIterator@-1
  ; - java.util.Collections$SingletonList::iterator@4
  ; - Test::testIterator@3 (line 16)
0x00000000028e7577: jne 0x00000000028e75a0
  ;*getfield element
  ; - java.util.Collections$SingletonList::iterator@1
  ; - Test::testIterator@3 (line 16)
0x00000000028e7579: mov 0x10(%rdx),%ebp
  ; implicit exception: dispatches to 0x00000000028e75c9
0x00000000028e757c: mov 0x8(%rbp),%r11d
  ; {metadata('java/lang/String')}
0x00000000028e7580: cmp $0x14d216d0,%r11d
0x00000000028e7587: jne 0x00000000028e75b1
  ;*checkcast
  ; - Test::testIterator@24 (line 16)
0x00000000028e7589: mov %rbp,%r10
  ;*getfield value
  ; - java.lang.String::length@1
  ; - Test::testIterator@30 (line 17)
0x00000000028e758c: mov 0xc(%r10),%r10d
  ;*synchronization entry
  ; - Test::testIterator@-1 (line 15)
  ; implicit exception: dispatches to 0x00000000028e75d5
0x00000000028e7590: mov 0xc(%r10),%eax
0x00000000028e7594: add $0x10,%rsp
0x00000000028e7598: pop %rbp
  # 0x0000000000130000
0x00000000028e7599: test %eax,-0x27b759f(%rip)
  ; {poll_return}
0x00000000028e759f: retq
... // slow paths follow

What you may notice is that it’s surprisingly short. I took the liberty of annotating what happens here:

// Standard stack frame: every method has such a prolog
mov %eax,-0x6000(%rsp)
push %rbp
sub $0x10,%rsp
// Load the class identifier from the list argument (which is stored in the rdx
// register), like list.getClass(). This also does an implicit null-check: if
// null is supplied, the CPU will trigger a hardware exception. The exception
// will be caught by the JVM and translated into a NullPointerException
mov 0x8(%rdx),%r10d
// Compare list.getClass() with the class ID of the Collections$SingletonList
// class, which is constant and known to the JIT
cmp $0x14d66a20,%r10d
// If the list is not a singleton list, jump out to the slow path
jne 0x00000000028e75a0
// Read the Collections$SingletonList.element private field into the rbp register
mov 0x10(%rdx),%ebp
// Read its class identifier and check whether it's actually a String
mov 0x8(%rbp),%r11d
cmp $0x14d216d0,%r11d
// Jump out to the exceptional path if not (this will create and throw
// a ClassCastException)
jne 0x00000000028e75b1
// Read the private field String.value into r10; it is the char[] array
// containing the String content
mov %rbp,%r10
mov 0xc(%r10),%r10d
// Read the array length field into the eax register (by default a method
// returns its value via eax/rax)
mov 0xc(%r10),%eax
// Standard method epilog
add $0x10,%rsp
pop %rbp
// Safe-point check (so the JVM can take control if necessary, for example,
// to perform garbage collection)
test %eax,-0x27b759f(%rip)
// Return
retq

If it’s still hard to understand, let’s rewrite it as pseudo-code:

if (list.class != Collections$SingletonList) {
    goto SLOW_PATH;
}
str = ((Collections$SingletonList)list).element;
if (str.class != String) {
    goto EXCEPTIONAL_PATH;
}
return ((String)str).value.length;

So on the hot path we have no iterator allocated and no loop, just several dereferences and two quick checks (which are always false, so the CPU branch predictor will predict them nicely). The iterator object has evaporated completely, even though it originally carries additional bookkeeping, such as tracking whether next() was already called and throwing NoSuchElementException in that case. The JIT compiler statically proved that these parts of the code are unnecessary and removed them. The sum variable has evaporated as well. Nevertheless, the method is correct: if it is ever called with something other than a singleton list, it will handle this situation on the SLOW_PATH (which is of course much longer). Other cases, such as list == null or a list element that is not a String, are also handled.
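To make the removed bookkeeping concrete, here is a minimal sketch in plain Java. It is not the JDK source and not what the JIT literally produces: SingleIterator is a hypothetical stand-in for a one-element iterator, and handSpecialized is a hand-written approximation of the hot path. It assumes the public get(0) is an acceptable substitute for the private element field, and it uses an explicit class comparison to imitate the type guard.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class FastPathSketch {

    // Hypothetical stand-in for a one-element iterator; not the JDK source.
    static final class SingleIterator<E> implements Iterator<E> {
        private final E element;
        private boolean hasNext = true; // the kind of bookkeeping the JIT removes

        SingleIterator(E element) {
            this.element = element;
        }

        @Override
        public boolean hasNext() {
            return hasNext;
        }

        @Override
        public E next() {
            if (!hasNext) {
                throw new NoSuchElementException();
            }
            hasNext = false;
            return element;
        }
    }

    // Runtime class of singleton lists, obtained through the public API.
    static final Class<?> SINGLETON_LIST = Collections.singletonList("").getClass();

    // The generic version: works for any List implementation.
    static int testIterator(List<String> list) {
        int sum = 0;
        for (String s : list) {
            sum += s.length();
        }
        return sum;
    }

    // Hand-written approximation of the compiled hot path.
    static int handSpecialized(List<String> list) {
        // list.getClass() throws NullPointerException for a null list,
        // mirroring the implicit null check.
        if (list.getClass() != SINGLETON_LIST) {
            return testIterator(list); // stands in for SLOW_PATH
        }
        // The implicit cast of get(0) to String plays the role of the
        // checkcast guard: no iterator, no loop, no sum variable.
        return list.get(0).length();
    }

    public static void main(String[] args) {
        Iterator<String> it = new SingleIterator<>("abc");
        System.out.println(it.next().length() + " " + it.hasNext()); // 3 false
        System.out.println(handSpecialized(Collections.singletonList("abc"))); // 3
        System.out.println(handSpecialized(Arrays.asList("ab", "cde"))); // 5
    }
}
```

The point of the sketch is only that both methods return the same result for every input, while the specialized one skips the iterator state entirely when the guard holds.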

What happens if your program’s behavior changes? Imagine that at some point you stop using singleton lists and pass different list implementations here. When the JIT discovers that the SLOW_PATH is hit too often, it will recompile the method to remove the special handling of singleton lists. This is different from pre-compiled applications: the JIT can change your code to follow the behavioral changes of your program.
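One way to try this yourself is a two-phase variant of the test, sketched below under a few assumptions: the class name ProfileShiftDemo and the run helper are made up for illustration, and 100,000 iterations per phase is assumed to be enough to trigger compilation, as in the test above. Launched with -XX:+PrintCompilation, the log may show testIterator being marked "made not entrant" and compiled again during the second phase, but the exact lines depend on the JVM version and settings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ProfileShiftDemo {

    static int testIterator(List<String> list) {
        int sum = 0;
        for (String s : list) {
            sum += s.length();
        }
        return sum;
    }

    // Hypothetical helper: calls testIterator repeatedly with the same list
    // and accumulates the results.
    static int run(int iterations, List<String> list) {
        int res = 0;
        for (int i = 0; i < iterations; i++) {
            res += testIterator(list);
        }
        return res;
    }

    public static void main(String[] args) {
        // Phase 1: the type profile only ever sees singleton lists.
        int phase1 = run(100_000, Collections.singletonList("x"));

        // Phase 2: a different List implementation hits the slow path.
        int phase2 = run(100_000, new ArrayList<>(Arrays.asList("x", "yz")));

        // The results are the same whichever compiled version ran.
        System.out.println(phase1 + " " + phase2); // 100000 300000
    }
}
```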