It is possible to write the same code using AVX2 instructions (256-bit registers). (Thanks, @kellylittlepage, for an awesome article where I read how to do it.)

This is how a simple checksum function looks after rewriting it for vectorized execution. Here, GCC intrinsics such as _mm256_unpacklo_epi8 and _mm256_add_epi32 are used. GCC has a special implementation for these functions that uses AVX2 instructions; almost always an intrinsic compiles to just one instruction.

Java way

Let’s say we have now met our performance requirements, but can we make the code more readable than an ugly blob of ASM produced by GCC? We can keep the main loop in Java and use Long4 vectors to pass data around.

Java version of that scary function

    public class VectorIntrinsics {
        ...
        private static final MethodHandle _mm256_loadu_si256 =
            jdk.internal.panama.CodeSnippet.make(
                "_mm256_loadu_si256",
                MethodType.methodType(Long4.class, long.class), true,
                0xC5, 0xFE, 0x6F, 0x06  // vmovdqu ymm0, YMMWORD PTR [rdi]
            );

        public static Long4 _mm256_loadu_si256(long address) throws Throwable {
            return (Long4) _mm256_loadu_si256.invoke(address);
        }
        ...
    }

    private static int JAVA_avxChecksumAVX2(ByteBuffer buffer, long target, int targetLength) throws Throwable {
        Long4 zeroVec = Long4.ZERO;
        Long4 oneVec = ones;
        Long4 accum = Long4.ZERO;
        int checksum = 0;
        int offset = 0;
        if (targetLength >= 32) {
            for (; offset <= targetLength - 32; offset += 32) {
                Long4 vec = _mm256_loadu_si256(target + offset);
                Long4 vl = _mm256_unpacklo_epi8(vec, zeroVec);
                Long4 vh = _mm256_unpackhi_epi8(vec, zeroVec);
                accum = _mm256_add_epi32(accum, _mm256_madd_epi16(vl, oneVec));
                accum = _mm256_add_epi32(accum, _mm256_madd_epi16(vh, oneVec));
            }
        }
        for (; offset < targetLength; ++offset) {
            checksum += (int) buffer.get(offset);
        }
        accum = _mm256_add_epi32(accum, _mm256_srli_si256_4(accum));
        accum = _mm256_add_epi32(accum, _mm256_srli_si256_8(accum));
        long finalChecksum = _mm256_extract_epi32_0(accum)
                + _mm256_extract_epi32_4(accum)
                + checksum;
        return (int) (Integer.toUnsignedLong((int) finalChecksum) % 256);
    }

Now it is written in the right way. We wrote a lot of small methods; each one represents a single AVX2 instruction, and the main loop is written in Java. This code is reusable, and it is much easier to write and understand than one big ASM blob. But, surprise: it is much slower than the ugly ASM blob.

And again, JMH, together with its GC profiler, will help us find the answer.

That’s why:

    Benchmark                      (size)  Mode  Cnt   Score   Error  Units
    JAVA_avx2Impl                  129536  avgt    4  30.394 ± 6.813  us/op
    JAVA_avx2Impl:·gc.alloc.rate   129536  avgt    4     NaN          MB/sec
    JAVA_avx2Impl:·gc.count        129536  avgt    4  34.000          counts
    JAVA_avx2Impl:·gc.time         129536  avgt    4  39.000          ms
    avx2Impl                       129536  avgt    4   4.192 ± 0.246  us/op
    avx2Impl:·gc.alloc.rate        129536  avgt    4     NaN          MB/sec
    avx2Impl:·gc.count             129536  avgt    4     ≈ 0          counts

JAVA_avxChecksumAVX2 produces a high allocation rate. Although vector types generally work really well with escape analysis, this loop defeats it. Because Long4 is immutable, we have to assign a new accum value to the same variable on every loop iteration. Escape analysis can’t see through this loop-carried value, so we get a lot of allocations of boxed vectors.
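The effect is easy to reproduce without any vectors at all. Below is a minimal sketch with a hypothetical immutable Box class standing in for Long4 (not part of the original code): because accum is carried from one iteration to the next, escape analysis cannot scalar-replace the allocation inside add, so every iteration allocates a new object on the heap.

```java
// Hypothetical immutable value class, standing in for Long4.
final class Box {
    final long value;
    Box(long value) { this.value = value; }
    // Immutable "operation": every call allocates a fresh Box.
    Box add(Box other) { return new Box(this.value + other.value); }
}

public class EscapeAnalysisDemo {
    public static void main(String[] args) {
        Box accum = new Box(0);
        for (int i = 0; i < 1_000_000; i++) {
            // accum escapes the loop body through the loop-carried variable,
            // so the JIT cannot eliminate the Box allocations.
            accum = accum.add(new Box(i));
        }
        System.out.println(accum.value);
    }
}
```

Running this under the JMH GC profiler would show the same picture as JAVA_avxChecksumAVX2: one short-lived object per iteration, none of them eliminated.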

Problematic code for escape analysis

    Long4 accum = Long4.ZERO;
    for (; offset <= targetLength - 32; offset += 32) {
        Long4 vec = _mm256_loadu_si256(target + offset);
        accum = operation(accum, vec); // EA, you are drunk, go home
    }

This is a known issue. It will most likely be fixed soon, but how can we work around it now?

As a workaround, we can create a temporary buffer and use a pair of _mm256_loadu_si256 and _mm256_storeu_si256 calls on every iteration. These intrinsics use the vmovdqu instruction to load/store a register value to memory.

GC-free solution

    static final ByteBuffer tmpBuf = ...
    ...
    for (; offset <= targetLength - 32; offset += 32) {
        Long4 vec = _mm256_loadu_si256(target + offset);
        Long4 accum = _mm256_loadu_si256(tmpBuffAddr);
        Long4 result = operation(accum, vec);
        _mm256_storeu_si256(tmpBuffAddr, result);
    }
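In plain Java terms, the trick looks like the sketch below, with a hypothetical immutable Box class standing in for Long4 and a long[] slot standing in for the temporary ByteBuffer (both are illustrations, not part of the original code). Now every Box is created and consumed within a single iteration, so none of them escapes and escape analysis can eliminate the allocations.

```java
// Hypothetical immutable value class, standing in for Long4.
final class Box {
    final long value;
    Box(long value) { this.value = value; }
    Box add(Box other) { return new Box(this.value + other.value); }
}

public class BufferWorkaroundDemo {
    public static void main(String[] args) {
        long[] tmpBuf = new long[1]; // preallocated accumulator memory, like tmpBuf in the article
        for (int i = 0; i < 1_000_000; i++) {
            Box accum = new Box(tmpBuf[0]);     // "load":  _mm256_loadu_si256(tmpBuffAddr)
            Box result = accum.add(new Box(i)); // operation(accum, vec)
            tmpBuf[0] = result.value;           // "store": _mm256_storeu_si256(tmpBuffAddr, result)
        }
        System.out.println(tmpBuf[0]);
    }
}
```

The allocations disappear, but the state now round-trips through memory on every iteration, which is exactly the cost the perf listing below makes visible.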

Results

    Benchmark                                       (size)  Mode  Cnt   Score   Error  Units
    ChecksumBenchmark.JAVA_avx2Impl                 129536  avgt    4  23.837 ± 0.064  us/op
    ChecksumBenchmark.JAVA_avx2Impl:·gc.alloc.rate  129536  avgt    4     NaN          MB/sec
    ChecksumBenchmark.JAVA_avx2Impl:·gc.count       129536  avgt    4     ≈ 0          counts

Now the function is GC-free: there is no garbage anymore and it is faster, but it’s still quite slow. To understand why, we need a profiler, but tools like YourKit or JProfiler won’t help us here; we have to work at the instruction level. Thank goodness, JMH has excellent support for the perf profiler; you just need to pass an option to it (don’t forget to install perf on your system first).

    12.39%  26.58%  vmovdqu YMMWORD PTR [rsp+0x40],ymm0
    12.88%   2.85%  movabs  r10,0x6d61010e8
     0.01%          vmovdqu ymm1,YMMWORD PTR [r10+0x10]
     0.01%          vmovdqu ymm0,YMMWORD PTR [rsp+0x20]
                    vpunpcklbw ymm0,ymm0,ymm1
     4.42%   0.03%  movabs  r10,0x6d61010b8
     0.01%          vmovdqu ymm1,YMMWORD PTR [r10+0x10]
     0.02%   0.01%  vpmaddwd ymm0,ymm0,ymm1
     0.02%   0.01%  vpmaddwd ymm0,ymm0,ymm1
     0.02%          vmovdqu ymm1,ymm0
     4.20%   2.95%  vmovdqu ymm0,YMMWORD PTR [rsp+0x40]
     8.45%  22.88%  vpaddd  ymm0,ymm1,ymm0
    12.91%   5.79%  vmovdqu YMMWORD PTR [rsp+0x40],ymm0

As you can see, we are spending an enormous amount of time just loading the accumulator from the temporary buffer and storing it back, all to avoid GC. So let’s rewrite the algorithm a little instead: we’ll accumulate the final result in the scalar checksum variable right in the loop, instead of carrying it through further vector calculations.

Here is the code

    for (; offset <= targetLength - 32; offset += 32) {
        Long4 vec = _mm256_loadu_si256(target + offset);
        Long4 lVec = _mm256_unpacklo_epi8(vec, zeroVec);
        Long4 hVec = _mm256_unpackhi_epi8(vec, zeroVec);
        Long4 sum = _mm256_add_epi16(lVec, hVec);
        sum = _mm256_hadd_epi16(sum, sum);
        sum = _mm256_hadd_epi16(sum, sum);
        sum = _mm256_hadd_epi16(sum, sum);
        checksum += _mm256_extract_epi16_0(sum) + _mm256_extract_epi16_15(sum);
    }

Benchmark results

    Benchmark                        (size)  Mode  Cnt   Score   Error  Units
    ChecksumBenchmark.JAVA_avx2Impl       4  avgt    4   0.005 ± 0.001  us/op
    ChecksumBenchmark.JAVA_avx2Impl    8096  avgt    4   1.245 ± 0.028  us/op
    ChecksumBenchmark.JAVA_avx2Impl  129536  avgt    4  20.095 ± 0.314  us/op
    ChecksumBenchmark.avx2Impl            4  avgt    4   0.013 ± 0.001  us/op
    ChecksumBenchmark.avx2Impl         8096  avgt    4   0.211 ± 0.004  us/op
    ChecksumBenchmark.avx2Impl       129536  avgt    4   3.317 ± 0.077  us/op
    ChecksumBenchmark.plainJava           4  avgt    4   0.005 ± 0.001  us/op
    ChecksumBenchmark.plainJava        8096  avgt    4   2.109 ± 0.035  us/op
    ChecksumBenchmark.plainJava      129536  avgt    4  33.503 ± 0.227  us/op

This version of the code is even faster, but it still can’t match the performance of the big ugly assembly blob, because escape analysis remains a big stone in our way. However, this code can be maintained easily, and this API is under active development; a lot of experiments are happening right now. So the ugly blob may well be beaten once these features are released.

Moreover, all these machine-code snippets and raw Long* vector parameters are a really low-level API. Prototypes of a higher-level API can be found here and here.