Optimizing the way to Valhalla: JIT Status Update

Hi,

I thought now is a good time to give a high level status update about the inline type implementation and optimizations in the JIT. The goal is to set realistic expectations about what is currently optimized and about what could be added with reasonable engineering effort in the near future.

Below, I'm only distinguishing between null-free (.inline) and nullable (.ref) inline types because that's what the JIT cares about most.

After the LW1 EA binaries were released in July 2018, we were working towards LW2:
- C2 support for LW2 specific features: nullable and non-scalarized inline types, array covariance, substitutability check, conversions/casting between types and calling convention changes [1].
- Various performance improvements: Object array accesses, aaload/aastore profiling, reflective invocations, synchronization, lock coarsening, unsafe/hashCode/reflection/array intrinsics and inline type array specific loop unswitching to mitigate impact on legacy code [2].
- Other new inline type specific C2 optimizations [3].
- Full C1 support for LW2 including calling convention.
- Stabilization work: fixed ~130 compiler bugs for LW1 and LW2 [4].
- Thousands of lines of new test code and many extensions to our inline type specific test framework.

Below are some more details about various optimizations that might be of interest.

Array access (aaload, aastore):
- Optimized flattened load/store if array type is known at compile time, runtime call otherwise.
- Optimized runtime checks based on array storage properties.
- Type information is also used to guide optimizations of following code and omit runtime checks:
  - After successfully storing null, the destination array can't be null-free (-> not flat).
  - After successfully casting an array element to a non-inline-type, the source array can't be null-free (-> not flat).
  - After successfully casting an array element to a non-flattenable type, the source array can't be flat.
- Speculate on varargs Object array being not null-free (-> not flat).
- Loop unswitching and hoisting based on flattened array check to mitigate performance impact on Object array accesses.
- New profiling points for array type, element type and whether the array is flat or null-free are collected for both aaload and aastore. We then speculate based on these properties.

Optimized acmp implementation:

  if (a == b) {
    return true;
  } else if (a != NULL && a.isValue() && b != NULL && a.type() == b.type()) {
    // Slow runtime call for the substitutability check
    return ValueBootstrapMethods.isSubstitutable(a, b);
  } else {
    return false;
  }

- Based on type system information, C2 is often able to remove parts or all of the above.
- Implicit null checks and knowledge about nullity/flatness are used to improve remaining checks.
- We currently always delegate the substitutability check to the runtime (-> slow).
- Planned: profiling [5] and optimized substitutability check [6].

Scalarization in the scope of a compiled method:
- C2 is aggressively scalarizing whenever null-free inline types are created, loaded or passed. For example, at defaultvalue, withfield, flattened array/field load, through inlined calls/returns (also method handles and incremental inlining), scalarized calls and returns. This means that each field of the inline type is passed individually in registers or on the stack and no heap allocation is necessary.
- In addition, we attempt to prove or speculate that nullable inline types are null-free and then scalarize these as well. Please note that this is *not* done across call boundaries.

Scalarized calling convention:
- Null-free inline types are passed as arguments and returned in a scalarized form. That means that instead of passing/returning a pointer, each field of the inline type is passed/returned individually in registers or on the stack.
- The implementation is very complex because we need to handle mismatches in the calling convention between the interpreter, C1 and C2. The following variants exist:
  - All null-free inline type arguments are scalarized (C2).
  - Inline type receiver is not scalarized (interface, method handle call).
  - No arguments are scalarized (interpreter and C1).
  We can basically have any combination of the above where there is an inconsistency between what the caller passes and what the callee expects (in many cases, the caller does not "know" what the callee expects). To solve that, we need to translate between calling conventions in the adapters / entry points by allocating and packing or unpacking. The same problem exists for returns.
- Nullable inline types are *not* scalarized in the calling convention. That's mainly because the VM only supports one compiled version of each method. If we speculatively scalarized an inline type argument, that compiled method could not handle null and we would need to deoptimize when seeing null (-> huge, unexpected performance impact). Since scalarized adapters are created at link time, we would also not be able to re-compile that method without scalarization, i.e. passing null would always be extremely slow. Related to that, Roland investigated lazy adapter creation a while ago and explained some of the additional problems here [7].
- One option to scalarize nullable inline types in the calling convention would be to pass an additional, artificial field that can be used to check if the inline type is null. Compiled code would then "null-check" before using the fields. However, this solution is far from trivial to implement and the overhead of the additional fields and the runtime checks might cancel out the improvements of scalarization. Also, the VM would need to ensure that the type is loaded when the adapters are created at method link time.
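
To make the "artificial field" idea and the adapter packing/unpacking a bit more concrete, here is a minimal hand-written model in plain Java. It is only a sketch under assumptions: Point is an ordinary class standing in for an inline type, and the isNull flag, the class name and all method names are hypothetical. The real translation happens in VM adapters and compiled code, not in Java source.

```java
// Hypothetical model only: Point stands in for an inline type, and the
// "scalarized" method spells out by hand what the JIT would do implicitly.
final class ScalarizedNullableSketch {

    // Stand-in for an inline type with two int fields.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // "Scalarized" entry point: the fields are passed individually, plus an
    // artificial isNull flag so that a null argument can still be expressed.
    static long lengthSquaredScalarized(boolean isNull, int x, int y) {
        if (isNull) {
            // The extra "null check" compiled code needs before using fields.
            throw new NullPointerException("point is null");
        }
        return (long) x * x + (long) y * y;
    }

    // "Non-scalarized" entry point, as a caller passing a reference would
    // use it. It plays the role of an adapter that unpacks the reference.
    static long lengthSquaredBuffered(Point p) {
        if (p == null) {
            return lengthSquaredScalarized(true, 0, 0);
        }
        return lengthSquaredScalarized(false, p.x, p.y);
    }

    // Opposite direction: a scalarized caller reaching a callee that expects
    // a reference has to allocate and pack the fields again.
    static Point pack(boolean isNull, int x, int y) {
        return isNull ? null : new Point(x, y);
    }

    public static void main(String[] args) {
        System.out.println(lengthSquaredBuffered(new Point(3, 4)));
        System.out.println(pack(true, 0, 0) == null);
    }
}
```

The sketch also shows where the cost comes from: every scalarized call carries one more argument, and every use of the fields is guarded by a check on the flag.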

The currently planned JIT work can be found here:
https://bugs.openjdk.java.net/issues/?filter=36444

In my opinion, the main ongoing challenge is that we don't have a good understanding about what still needs to be done with respect to performance. For example, we don't have any numbers on the performance impact of the calling convention optimization. We also need to evaluate the inline type specific profiling that we have. Is the current version good enough? Do we need more? In general, we need to identify performance issues and prioritize them. After all, this entire project is solely about performance.

Hope this helps. Please let me know if you have any questions.

Best regards,
Tobias

[1] For example:
https://bugs.openjdk.java.net/browse/JDK-8215477
https://bugs.openjdk.java.net/browse/JDK-8220716
https://bugs.openjdk.java.net/browse/JDK-8215559
https://bugs.openjdk.java.net/browse/JDK-8206139
https://bugs.openjdk.java.net/browse/JDK-8212190
https://bugs.openjdk.java.net/browse/JDK-8211772

[2] For example:
https://bugs.openjdk.java.net/browse/JDK-8227634
https://bugs.openjdk.java.net/browse/JDK-8227463
https://bugs.openjdk.java.net/browse/JDK-8229288
https://bugs.openjdk.java.net/browse/JDK-8222221

[3] For example:
https://bugs.openjdk.java.net/browse/JDK-8220666
https://bugs.openjdk.java.net/browse/JDK-8227180
https://bugs.openjdk.java.net/browse/JDK-8228367

[4] https://bugs.openjdk.java.net/secure/Dashboard.jspa?selectPageId=18410
[5] https://bugs.openjdk.java.net/browse/JDK-8235914
[6] https://bugs.openjdk.java.net/browse/JDK-8228361
[7] https://mail.openjdk.java.net/pipermail/valhalla-dev/2018-April/004093.html