In our field, we can skip building hypotheses and go straight to poking the stuff with power tools. This section could have been shorter if we used hindsight and breezed straight to the conclusion, but one of the points of this post is to show off the approaches one can take to analyze these benchmarks.
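For concreteness, the workloads under analysis are the ArrayList.toArray() variants. Here is a minimal JMH sketch of them, assuming the method names simple, sized, and zero that are used in the discussion below (the actual benchmark carries more parameters and setup):

import java.util.ArrayList;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ToArrayBench {
    @Param({"1000"})
    int size;

    ArrayList<Integer> list;

    @Setup
    public void setup() {
        list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add(i);
        }
    }

    @Benchmark
    public Object[] simple() {
        return list.toArray();                         // plain Object[] copy
    }

    @Benchmark
    public Integer[] sized() {
        return list.toArray(new Integer[list.size()]); // pre-sized typed array
    }

    @Benchmark
    public Integer[] zero() {
        return list.toArray(new Integer[0]);           // zero-sized typed array
    }
}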

The most obvious approach is to attach a Java profiler to the benchmarked JVM and see what is going on there. Without loss of generality, let's take the VisualVM profiler, which is shipped with the JDK itself and thus readily available on most installations.

It is very easy to use: start the application, start VisualVM (if the JDK is on your PATH, invoking jvisualvm is enough), select the target VM from the list, select "Sample" from a drop-down list or a tab, press the "CPU Sampling" button, and enjoy the results. Let's take one of our tests and see what it shows us.

Most Java profilers are inherently biased: they either instrument the code, and thus distort the real results, or they sample at designated places in the code (e.g. safepoints), which also skews the results. In our case above, while the bulk of the work is done in the simple() method, the profiler misattributes the work to the …​_jmhStub method that holds the benchmark loop.
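To see why, it helps to know the rough shape of the measurement loop JMH generates around the benchmark method. Below is a simplified sketch, not the actual generated code (the real stub is considerably more involved):

// Simplified sketch of a JMH-generated measurement loop. The JIT inlines
// the benchmark body into this loop, so a safepoint-sampling profiler
// attributes the samples to the ..._jmhStub frame, not to simple().
public void simple_avgt_jmhStub(InfraControl control, Blackhole bh, ToArrayBench bench) {
    long operations = 0;
    do {
        bh.consume(bench.simple()); // simple() is inlined right here
        operations++;
    } while (!control.isDone);
    // ... report operations to the harness ...
}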

But this is not the central issue here. The most problematic part for us is the absence of any low-level details that may help us answer performance questions. Can you see anything in the profiling snapshot above that can verify any sensible hypothesis in our case? You cannot, because the data is too coarse. Note that this is not an issue for larger workloads, where performance phenomena manifest in larger and more distributed call trees; in those workloads, bias effects are smoothed out.

The need to explore low-level details of microbenchmarks is the reason why good microbenchmark harnesses provide a clear-cut way to profile, analyze, and introspect tiny workloads. In the case of JMH, we have the embedded "perfasm" profiler, which dumps the PrintAssembly output from the VM, annotates it with perf_events counters, and prints out the hottest parts. perf_events provides non-intrusive sampling profiling with hardware counters, which is exactly what we want for fine-grained performance engineering.
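Enabling it is a single JMH flag. Note the assumptions: perf_events means a Linux host, and PrintAssembly needs the hsdis disassembler library installed. Mirroring the invocation used later in this section, the command would look roughly like this:

$ java -jar benchmarks.jar -f 1 -prof perfasm -p size=1000 -p type=arraylist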

Below is the example output for one of the tests:

It shows that the hottest part is somewhere in the cryptic StubRoutines::checkcast_arraycopy code block. We will dive into it later, but the main takeaway is that our performance phenomenon lies somewhere within the generated code, and probably not even within the Java-backed generated code. While the further insights in this post are recoverable with perfasm, I want to focus on another option that is more conventional and scales better to larger workloads.

Meet Solaris Studio Performance Analyzer

What if we want a profiler that combines unbiased Java/native profiling with optional hardware-counter profiling? There is an app for that! Confusingly, it is called Solaris Studio Performance Analyzer, though it works on Linux too.

There are a few suitable workflows for the Analyzer; here is the one we mostly use in our experiments. The Analyzer has two distinct parts: the collect tool, which collects the performance data, and the analyzer GUI, which is used to process the results. While you can start or attach to a running process from the GUI, having a CLI collect tool is very handy, because you can define a few reusable shell aliases:

$ tail ~/.bashrc

# Collect Java+native, clock profiling
alias 'perfanal'="collect -o test.1.er -S on -j on -A on "

# Collect native only (useful when loading *lots* of classes
# -- avoids profiler's JVMTI hooks overheads), clock profiling
alias 'perfanal-native'="collect -o test.1.er -S on -j off -A on "

# Collect Java+native, hardware counter profiling
alias 'perfanal-hwc'="collect -o test.1.er -S on -j on -A on -h cycles,on,insts,on "

# Collect native only, hardware counter profiling
alias 'perfanal-hwc-native'="collect -o test.1.er -S on -j off -A on -h cycles,on,insts,on "

These aliases allow you to quickly attach Analyzer to JMH like this:

$ perfanal-hwc java -jar benchmarks.jar -f 1 -p size=1000 -p type=arraylist

This will create the test.${n}.er result file in your current directory, which you can then open with the Analyzer GUI. The GUI is rich but intuitive enough to navigate.
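If the Studio binaries are on your PATH, the result also opens straight from the shell; the exact binary name may vary across Studio versions, so treat this as a sketch:

$ analyzer test.1.er

Here is what we see in it.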

Figure 4. ArrayList (size = 1000), toArray().

In the simple case, we see that the hottest function is jint_disjoint_arraycopy, which appears to be the function that implements an "int" arraycopy for disjoint arrays.

Note how the profiler resolved Java code (org.openjdk…​), VM native code (ParallelTaskTerminator::…​ and SpinPause), and generated VM stubs (e.g. jint_disjoint_arraycopy) alike. In complex scenarios, call trees show both Java and native frames, which is immensely helpful when debugging JNI/JNA cases, including calls into the VM itself.

The "Call Tree" would show you that this jint_disjoint_arraycopy is a leaf function. There is no other indication as to what this function is, but you can search around in Hotspot source base, then you will stumble upon stubRoutines.hpp, which will say that "StubRoutines provides entry points to assembly routines used by compiled code and the run-time system. Platform-specific entry points are defined in the platform-specific inner class." Both compiler and runtime use that to optimize some operations, notably arraycopy.

Figure 5. ArrayList (size = 1000), toArray(new T[list.size()]).

In the sized case, we see a different arraycopy function, checkcast_arraycopy, plus the Java code for the sized() method.

Figure 6. ArrayList (size = 1000), toArray(new T[0]).

In the zero case, we see yet another contender, checkcast_arraycopy_uninit, which looks like a variant of checkcast_arraycopy, but "uninitialized"? Notice that the Java method is gone from the hotspots.

Well, the function names are already quite revealing, but they give us only an educated guess until we invoke another outstanding feature of the Analyzer. If we select a function from the list (or from the "Call Tree") and switch to the "Disassembly" view, we will see…​ wait for it…​ the disassembly!

Figure 7. simple case disassembly.

The jint_disjoint_arraycopy is indeed the copying stub, which employs AVX for vectorized copy! No wonder this thing is fast: it copies the backing storage in large strides.

Figure 8. zero case disassembly.

The zero case disassembly shows the non-vectorized checkcast_arraycopy_uninit. What does it do? The loop at relative addresses 30-51 is indeed copying, but it does not seem to be vectorized at all. Alas, deciphering it requires some knowledge of VM internals. Normally, you would be able to attach the Java code to the assembly, but VM stubs have no Java code associated with them. You could also trace the VM source code that generates the stub. Here, we will just stare patiently at the code.

The loop reads with movl (%rdi,%rdx,4), %eax and writes with movl %eax, 0(%r13,%rdx,4), and %rdx is increasing; it follows that %rdx is the loop counter. After reading the current element into %eax, we null-check it, load something from offset 0x8, shift it, and compare it with something else. That is loading the class word, unpacking it, and doing the typecheck against it.
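Putting that together, the hot loop behaves roughly like this Java rendition. This is an approximation reconstructed from the disassembly above, not actual VM code; all the names are invented:

// Semantic sketch of the checkcast_arraycopy hot loop.
static int checkcastCopy(Object[] src, Object[] dst,
                         Class<?> destElemClass, int length) {
    for (int i = 0; i < length; i++) {   // %rdx plays the loop counter
        Object e = src[i];               // movl (%rdi,%rdx,4), %eax
        // nulls need no typecheck; for everything else, the class word
        // is loaded, unpacked, and checked against the destination type
        if (e != null && !destElemClass.isInstance(e)) {
            return i;                    // bail out on the failing element
        }
        dst[i] = e;                      // movl %eax, 0(%r13,%rdx,4)
    }
    return length;
}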

You can peek around the runtime representation of Java objects with JOL. Notably, see the JOLSample_11_ClassWord example.
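For instance, something along these lines prints the object layout, class word included (assuming the JOL library is on the classpath):

import org.openjdk.jol.info.ClassLayout;

public class ShowLayout {
    public static void main(String[] args) {
        // prints mark word, class word, and field offsets for Integer
        System.out.println(ClassLayout.parseClass(Integer.class).toPrintable());
    }
}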

Figure 9. sized case disassembly, arraycopy stub.

The sized case has two hotspots: one in checkcast_arraycopy, where the hot loop is actually the same as in the zero case above, and then we come to a second hotspot:

Figure 10. sized case disassembly, Java code.