One primitive that can tell us about the initialization part of the story is the plain old Java array. It needs to be initialized too, and its length is variable, which lets us see what is going on at different sizes. With that in mind, let us construct this benchmark:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UA {
    @Param({"1", "10", "100", "1000", "10000", "100000"})
    int size;

    @Benchmark
    public byte[] java() {
        return new byte[size];
    }
}

It makes sense to run this with the latest JDK 9 EA (the follow-up test below requires it), with -XX:+UseParallelOldGC to minimize GC overhead, and -Xmx20g -Xms20g -Xmn18g to keep most of the heap for these new allocations. Without further delay, this is what you will get on a good development desktop (like my i7-4790K, 4.0 GHz, Linux x86_64), when all 8 hardware threads are busy:

Benchmark                  (size)  Mode  Cnt      Score       Error   Units

# Time to allocate
UA.java                         1  avgt   15     20.307 ±     4.532   ns/op
UA.java                        10  avgt   15     26.657 ±     6.072   ns/op
UA.java                       100  avgt   15    106.632 ±    34.742   ns/op
UA.java                      1000  avgt   15    681.176 ±   124.980   ns/op
UA.java                     10000  avgt   15   4576.433 ±   909.956   ns/op
UA.java                    100000  avgt   15  44881.095 ± 13765.440   ns/op

# Allocation rate
UA.java:·gc.alloc.rate          1  avgt   15   6228.153 ±  1059.385  MB/sec
UA.java:·gc.alloc.rate         10  avgt   15   6335.809 ±   986.395  MB/sec
UA.java:·gc.alloc.rate        100  avgt   15   6126.333 ±  1354.964  MB/sec
UA.java:·gc.alloc.rate       1000  avgt   15   7772.263 ±  1263.453  MB/sec
UA.java:·gc.alloc.rate      10000  avgt   15  11518.422 ±  2155.516  MB/sec
UA.java:·gc.alloc.rate     100000  avgt   15  12039.594 ±  2724.242  MB/sec

One can see that the allocation takes around 20 ns (the single-threaded figure is much lower, but this average is skewed by hyper-threads), which goes gradually up to about 45 us for the 100K array. If you look at the allocation rate, you will see it saturates at around 12 GB/sec. These experiments form the basis for other performance tests, by the way: it is important to understand what order of memory bandwidth / allocation rate you can pull off on a particular machine.
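As a sanity check, we can back this out ourselves. A back-of-the-envelope sketch (the figures below are the measured averages from the table above; it deliberately ignores object headers, alignment, and measurement overheads):

```java
public class AllocRateEstimate {
    public static void main(String[] args) {
        double bytesPerOp = 100_000;  // payload of one byte[100000]
        double nsPerOp    = 44_881;   // measured average time per allocation
        int    threads    = 8;        // all hardware threads busy

        // Per-thread rate scaled by thread count; 1 byte/ns == 1 GB/sec.
        double gbPerSec = bytesPerOp / nsPerOp * threads;
        System.out.printf("~%.1f GB/sec%n", gbPerSec);
        // Same order of magnitude as the ~12 GB/sec JMH reports; headers,
        // alignment and profiling overheads account for the gap.
    }
}
```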

Can we see what code dominates the execution time? Of course we can, again with JMH’s -prof perfasm. For -p size=100000, it prints the hottest code like this:

            0x00007f1f094f650b: movq   $0x1,(%rdx)              ; store mark word
   0.00%    0x00007f1f094f6512: prefetchnta 0xc0(%r9)
   0.64%    0x00007f1f094f651a: movl   $0xf80000f5,0x8(%rdx)    ; store klass word
   0.02%    0x00007f1f094f6521: mov    %r11d,0xc(%rdx)          ; store array length
            0x00007f1f094f6525: prefetchnta 0x100(%r9)
   0.05%    0x00007f1f094f652d: prefetchnta 0x140(%r9)
   0.07%    0x00007f1f094f6535: prefetchnta 0x180(%r9)
   0.09%    0x00007f1f094f653d: shr    $0x3,%rcx
   0.00%    0x00007f1f094f6541: add    $0xfffffffffffffffe,%rcx
            0x00007f1f094f6545: xor    %rax,%rax
            0x00007f1f094f6548: cmp    $0x8,%rcx
         ╭  0x00007f1f094f654c: jg     0x00007f1f094f655e       ; large enough? jump
         │  0x00007f1f094f654e: dec    %rcx
         │╭ 0x00007f1f094f6551: js     0x00007f1f094f6565       ; zero length? jump
         ││↗ 0x00007f1f094f6553: mov   %rax,(%rdi,%rcx,8)       ; small loop init
         │││ 0x00007f1f094f6557: dec   %rcx
         ││╰ 0x00007f1f094f655a: jge   0x00007f1f094f6553
         ││ ╭ 0x00007f1f094f655c: jmp  0x00007f1f094f6565
         ↘│ │ 0x00007f1f094f655e: shl  $0x3,%rcx
  89.12%  │ │ 0x00007f1f094f6562: rep rex.W stos %al,%es:(%rdi) ; large loop init
   0.20%  ↘ ↘ 0x00007f1f094f6565: mov  %r8,(%rsp)

You may find this code shape familiar from the "TLAB Allocation" and "New Object Stages" issues in the series. What is interesting here is that we have to initialize a much larger chunk of array data. For this reason, we see the inlined rep stos sequence on x86, which repeatedly stores the given byte a given number of times, and seems most effective on recent x86-s. One can spy with a little eye that there is also an initializing loop for small arrays (8 elements or fewer): rep stos has upfront startup costs, so a simple loop is beneficial there.
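The same dispatch can be sketched in plain Java. This is a hypothetical rendering of what the generated code above decides, not the actual VM code: a short counting loop for tiny arrays, and a bulk fill (standing in for rep stos) for everything else, with the threshold of 8 words taken from the disassembly:

```java
import java.util.Arrays;

public class InitStrategy {
    static final int SMALL_THRESHOLD = 8; // words, as in the cmp $0x8 above

    // Zero an array of words the way the generated code does.
    static void zero(long[] words) {
        if (words.length <= SMALL_THRESHOLD) {
            for (int i = words.length - 1; i >= 0; i--) {
                words[i] = 0;           // "small loop init": counts down, like the asm
            }
        } else {
            Arrays.fill(words, 0L);     // stands in for the "large loop init" rep stos
        }
    }

    public static void main(String[] args) {
        long[] small = {1, 2, 3};
        long[] large = new long[100];
        Arrays.fill(large, 42L);
        zero(small);
        zero(large);
        System.out.println(small[0] + " " + large[99]); // prints "0 0"
    }
}
```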

This example shows that for large objects/arrays, initialization costs dominate the performance. For small objects/arrays, the writes of metadata (headers, array lengths) dominate instead. The small-array case is not significantly different from the small-object case.
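A quick footprint estimate shows why metadata dominates at the small end. This assumes a 64-bit HotSpot layout consistent with the disassembly above (8-byte mark word, 4-byte compressed klass word, 4-byte array length, 8-byte alignment); exact numbers vary by VM configuration:

```java
public class SmallArrayFootprint {
    public static void main(String[] args) {
        int header  = 8 + 4 + 4; // mark word + klass word + array length
        int payload = 1;         // byte[1]
        int align   = 8;

        // Round the total up to the alignment boundary.
        int total = (header + payload + align - 1) / align * align;
        System.out.println(total + " bytes"); // prints "24 bytes"
        // 23 of 24 bytes are metadata and padding: for a tiny array,
        // nearly all the work is writing headers, not array data.
    }
}
```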

Can we guesstimate what the performance would be if we somehow avoided initialization? Compilers routinely coalesce system and user initialization, but can we get no initialization at all? Granted, an uninitialized object would serve no practical purpose, because we will still fill it with user data later. But synthetic tests are fun, am I right?
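To make "coalescing system and user initialization" concrete, here is a hypothetical illustration (the pattern, not the actual compiler transformation): the VM must hand out zeroed memory, but when the compiler can prove every element is immediately overwritten, it may elide the zeroing stores and do a single pass.

```java
import java.util.Arrays;

public class Coalesce {
    static byte[] filled(int size, byte value) {
        byte[] a = new byte[size]; // system init: the VM guarantees zeros
        Arrays.fill(a, value);     // user init: immediately overwrites every element
        return a;                  // optimized code may emit just one fill pass
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(filled(4, (byte) 7))); // [7, 7, 7, 7]
    }
}
```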

It turns out there is an Unsafe method that allocates uninitialized arrays, so we can use it to see the benefit. Unsafe is not Java: it does not play by Java rules, and it sometimes even defies the JVM rules. It is not an API for public use; it exists for internal JDK use and JDK-VM interoperability. You can’t have it, it is not guaranteed to work for you, and it can go away, crash and burn without telling you.

Still, we can use it in synthetics, like this:

import jdk.internal.misc.Unsafe;
import org.openjdk.jmh.annotations.*;

import java.lang.reflect.Field;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UA {
    static Unsafe U;

    static {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            U = (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    @Param({"1", "10", "100", "1000", "10000", "100000"})
    int size;

    @Benchmark
    public byte[] unsafe() {
        return (byte[]) U.allocateUninitializedArray(byte.class, size);
    }
}

Why don’t we run it? Okay:

Benchmark                  (size)  Mode  Cnt        Score       Error   Units
UA.unsafe                       1  avgt   15       19.766 ±     4.002   ns/op
UA.unsafe                      10  avgt   15       27.486 ±     7.005   ns/op
UA.unsafe                     100  avgt   15       80.040 ±    15.754   ns/op
UA.unsafe                    1000  avgt   15      156.041 ±     0.552   ns/op
UA.unsafe                   10000  avgt   15      162.384 ±     1.448   ns/op
UA.unsafe                  100000  avgt   15      309.769 ±     2.819   ns/op
UA.unsafe:·gc.alloc.rate        1  avgt   15     6359.987 ±   928.472  MB/sec
UA.unsafe:·gc.alloc.rate       10  avgt   15     6193.103 ±  1160.353  MB/sec
UA.unsafe:·gc.alloc.rate      100  avgt   15     7855.147 ±  1313.314  MB/sec
UA.unsafe:·gc.alloc.rate     1000  avgt   15    33171.384 ±   153.645  MB/sec
UA.unsafe:·gc.alloc.rate    10000  avgt   15   315740.299 ±  3678.459  MB/sec
UA.unsafe:·gc.alloc.rate   100000  avgt   15  1650860.763 ± 14498.920  MB/sec

Holy crap! 100K arrays are allocated at a rate of 1.6 terabytes per second. Can we see where we spend time now?

            0x00007f65fd722c74: prefetchnta 0xc0(%r11)
  66.06%    0x00007f65fd722c7c: movq   $0x1,(%rax)            ; store mark word
   0.40%    0x00007f65fd722c83: prefetchnta 0x100(%r11)
   4.43%    0x00007f65fd722c8b: movl   $0xf80000f5,0x8(%rax)  ; store class word
   0.01%    0x00007f65fd722c92: mov    %edx,0xc(%rax)         ; store array length
            0x00007f65fd722c95: prefetchnta 0x140(%r11)
   5.18%    0x00007f65fd722c9d: prefetchnta 0x180(%r11)
   4.99%    0x00007f65fd722ca5: mov    %r8,0x40(%rsp)
            0x00007f65fd722caa: mov    %rax,%rdx

Oh yeah, only touching the memory to push the metadata out. The cycle accounting is skewed towards the prefetches, because they now pay most of the cost of pre-accessing memory for the upcoming writes.

One might wonder what toll this exerts on GC, and the answer is: not that much! The objects are almost all dead, so any mark-(copy|sweep|compact) GC would breeze through a workload like this. The picture gets interesting when objects start surviving at TB/sec rates. Some GC folks I know call these things "impossible workloads": they are both impossible in reality and impossible to deal with. Try drinking from a firehose to learn this in practice.

Anyhow, we can see that with pure allocations, GCs survive fine. With the same workload, we can see the apparent application pauses by using JMH’s -prof pauses profiler. It runs a high-priority thread and records the perceived pauses:

Benchmark                    (size)  Mode  Cnt    Score   Error  Units
UA.unsafe                    100000  avgt    5  315.732 ± 5.133  ns/op
UA.unsafe:·pauses            100000  avgt   84  537.018          ms
UA.unsafe:·pauses.avg        100000  avgt         6.393          ms
UA.unsafe:·pauses.count      100000  avgt        84.000          #
UA.unsafe:·pauses.p0.00      100000  avgt         2.560          ms
UA.unsafe:·pauses.p0.50      100000  avgt         6.148          ms
UA.unsafe:·pauses.p0.90      100000  avgt         9.642          ms
UA.unsafe:·pauses.p0.95      100000  avgt         9.802          ms
UA.unsafe:·pauses.p0.99     100000  avgt        14.418          ms
UA.unsafe:·pauses.p0.999     100000  avgt        14.418          ms
UA.unsafe:·pauses.p0.9999    100000  avgt        14.418          ms
UA.unsafe:·pauses.p1.00      100000  avgt        14.418          ms

So, it detected 84 pauses; the longest is around 14 ms, while the average is about 6 ms. Profilers like this are inherently imprecise: they contend for CPUs with the workload threads, they are at the mercy of the OS scheduler, and so on.

In many cases, it is better to let the JVM itself tell us when it stops the application threads. This can be done with JMH’s -prof safepoints profiler, which records the "safepoint" ("stop the world") events, when all application threads are stopped and the VM does its work. GC pauses are naturally a subset of safepoint events.

Benchmark                              (size)  Mode  Cnt     Score    Error  Units
UA.unsafe                              100000  avgt    5   328.247 ± 34.450  ns/op
UA.unsafe:·safepoints.interval         100000  avgt       5043.000          ms
UA.unsafe:·safepoints.pause            100000  avgt  639   617.796          ms
UA.unsafe:·safepoints.pause.avg        100000  avgt          0.967          ms
UA.unsafe:·safepoints.pause.count      100000  avgt        639.000          #
UA.unsafe:·safepoints.pause.p0.00      100000  avgt          0.433          ms
UA.unsafe:·safepoints.pause.p0.50      100000  avgt          0.627          ms
UA.unsafe:·safepoints.pause.p0.90      100000  avgt          2.150          ms
UA.unsafe:·safepoints.pause.p0.95      100000  avgt          2.241          ms
UA.unsafe:·safepoints.pause.p0.99      100000  avgt          2.979          ms
UA.unsafe:·safepoints.pause.p0.999     100000  avgt         12.599          ms
UA.unsafe:·safepoints.pause.p0.9999    100000  avgt         12.599          ms
UA.unsafe:·safepoints.pause.p1.00      100000  avgt         12.599          ms