RyuJIT Just-in-Time Compiler Optimization Enhancements

Joseph

October 16th, 2017

I’d like to tell you about some of the recent changes we’ve made as part of our ongoing work to extend the optimization capabilities of RyuJIT, the MSIL-to-native code generator used by .NET Core and .NET Framework. I hope it will make for an interesting read, and offer some insight into the sorts of optimization opportunities we have our eyes on.

Note: The changes described here landed after the release fork for .NET Core 2.0 was created, so they are available in daily preview builds but not the released 2.0 bits. Similarly, these changes landed after the fork for .NET Framework 4.7.1 was created. The changes to struct argument passing and block layout, which are purely JIT changes, will automatically propagate to subsequent .NET Framework releases with the new JIT bits (the RyuJIT sources are shared between .NET Core and .NET Framework); the other changes depend on their runtime components to propagate to .NET Framework.

Improvements for Span

Some of our work was motivated by the introduction of Span<T>, so that it and similar types could better deliver on their performance promises.

One such change was #10910, which made the JIT recognize the Item property getters of Span<T> and ReadOnlySpan<T> as intrinsics — the JIT now recognizes calls to these getters and, rather than generate code for them the same way it would for other calls, it transforms them directly into code sequences in its intermediate representation that are similar to the sequences used for the ldelem MSIL opcode that fetches an element from an array. As noted in the PR’s performance assessment (n.b., if you follow that link, see also the follow-up where the initially-discovered regressions were fixed with subsequent improvements in #10956 and dotnet/roslyn#20548), this improved several benchmarks in the tests we added to track Span<T> performance, by allowing the existing JIT code that optimized array bound checks that are redundant with prior checks, or that are against arrays with known constant length, to kick in for Span<T> as well. This is what some of those improved benchmark methods look like, and their improvements:
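The original benchmark listings aren't reproduced here, but a minimal sketch of the kind of pattern that benefits looks like this (type and method names are illustrative, not the actual test-suite code):

```csharp
using System;

public class SpanIndexerSketch
{
    // With the indexer recognized as an intrinsic, the JIT can see that the
    // second span[index] access is redundant with the first one's bounds
    // check and elide it, just as it already did for arrays.
    public static int RedundantCheck(Span<int> span, int index)
    {
        return span[index] + span[index];
    }

    // A span of known constant length lets constant-index accesses skip
    // their bounds checks entirely.
    public static int KnownLength()
    {
        Span<int> span = stackalloc int[4] { 1, 2, 3, 4 };
        return span[0] + span[3];
    }
}
```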

Building on that, change #11521 updated the analysis machinery the JIT uses to eliminate bounds checks for provably in-bounds array accesses, so that it similarly eliminates bounds checks for provably in-bounds Span<T> accesses (in particular, bounds checks in for loops bounded by span.Length). As noted in the PR (numbers here), this brought the codegen for four more microbenchmarks in the Span<T> tests up to par with the codegen for equivalent patterns with arrays; here are two of them:
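A representative sketch of the pattern in question (the name below is illustrative, not the actual benchmark code):

```csharp
using System;

public class SpanLoopSketch
{
    // Because span.Length is invariant for a given span, the JIT's range
    // analysis can prove i is always in bounds and remove the
    // per-iteration bounds check, matching the equivalent array codegen.
    public static int Sum(ReadOnlySpan<int> span)
    {
        int sum = 0;
        for (int i = 0; i < span.Length; i++)
            sum += span[i]; // bounds check eliminated
        return sum;
    }
}
```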

One key fact that these bounds-check removal optimizations exploit is that array lengths are immutable; any two loads of a.Length, if a refers to the same array each time, will load the same length value. It’s common for the JIT to encounter different accesses to the same array, where the reference to the array is held in a local or parameter of type T[], such that it can determine that intervening code hasn’t modified the local/parameter in question, even if that intervening code has unknown side-effects. The same isn’t true for parameters of type ref T[], since intervening code with unknown side-effects might change which array object is referenced. Consider:
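A sketch of the distinction (parameter names a and b match the discussion; the method bodies are illustrative):

```csharp
public class ArrayParamSketch
{
    static void SomeUnknownCall() { /* arbitrary side effects */ }

    // 'a' is a by-value array reference: no callee can change which array
    // it refers to, so the a.Length behind the bounds check on a[0] must
    // match the a.Length already tested, and the check can be removed.
    public static int ByValue(int[] a)
    {
        if (a.Length > 0)
        {
            SomeUnknownCall();
            return a[0]; // bounds check removable
        }
        return -1;
    }

    // 'b' is by-reference: SomeUnknownCall might store a different,
    // shorter array through the reference, so b.Length must be reloaded
    // and the bounds check on b[0] has to stay.
    public static int ByRef(ref int[] b)
    {
        if (b.Length > 0)
        {
            SomeUnknownCall();
            return b[0]; // bounds check retained
        }
        return -1;
    }
}
```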

Since Span<T> is a struct, some platforms’ ABIs specify that an argument of type Span<T> actually be passed by creating a copy of the struct in the caller’s stack frame, and passing a pointer to that copy in to the callee via the argument registers/stack. The JIT’s internal modeling of this convention is to rewrite Span<T> parameters as ref Span<T> parameters. That internal rewrite at first caused problems for applying bounds-check removal optimizations to spans passed as parameters. The problem was that methods written with by-value Span<T> parameters, which at source look analogous to by-value array parameter a in the example above, when rewritten looked to the JIT like by-reference parameters, analogous to by-reference array parameter b above. This caused the JIT to handle references to such parameters’ Length fields with the same conservatism needed for b above. Change #10453 taught the JIT to make local copies of such parameters before doing that rewrite (in beneficial cases), so that bounds-check removal optimizations can equally apply to spans passed by value. As noted in the PR, this change allowed these optimizations to fire in 9 more of the Span<T> micro-benchmarks in our test suite; here are three of them:
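A sketch of the affected pattern (illustrative, not the actual benchmark code):

```csharp
using System;

public class SpanParamSketch
{
    // 'values' is by-value in source; on ABIs that pass structs by hidden
    // reference, the JIT's internal rewrite used to make it look like
    // ref Span<int>, so values.Length was treated as potentially mutable
    // and the loop's bounds checks stayed. With the local-copy change,
    // they can be removed again.
    public static int Fill(Span<int> values, int value)
    {
        for (int i = 0; i < values.Length; i++)
            values[i] = value; // bounds check eliminated
        return values.Length;
    }
}
```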

This last change applies more generally to any structs passed as parameters (not just Span<T>); the JIT is now better able to analyze value propagation through their fields.

Enum.HasFlag Optimization

The Enum.HasFlag method offers nice readability (compare targets.HasFlag(AttributeTargets.Class | AttributeTargets.Struct) with (targets & (AttributeTargets.Class | AttributeTargets.Struct)) == (AttributeTargets.Class | AttributeTargets.Struct)), but, since it needs to handle reflection cases where the exact enum type isn’t known until run-time, it is notoriously expensive. Change #13748 taught the JIT to recognize when the enum type is known (and known to equal the argument type) at JIT time, and to generate the simple bit test rather than the expensive Enum.HasFlag call. Here’s a micro-benchmark to demonstrate, comparing .NET Core 2.0 (which doesn’t have this change) to a recent daily preview build (which does). Much thanks to @adamsitnik for making it easy to use BenchmarkDotNet with daily preview builds of .NET Core!
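The benchmark body can be reconstructed, consistent with the disassembly below, roughly as follows (the original was run under BenchmarkDotNet, whose harness and attributes are omitted here; field and method shapes are inferred from the generated code):

```csharp
using System;

public class Bench
{
    public AttributeTargets targets = AttributeTargets.Class | AttributeTargets.Struct;
    public bool result;

    // The measured body: 1000 iterations (0x3E8 in the disassembly) of a
    // HasFlag call whose enum type is known at JIT time.
    public void HasFlag()
    {
        for (int i = 0; i < 1000; i++)
            result = targets.HasFlag(AttributeTargets.Class | AttributeTargets.Struct);
    }

    // The simple bit test the JIT now generates in place of the call:
    public bool BitTest()
    {
        const AttributeTargets mask = AttributeTargets.Class | AttributeTargets.Struct;
        return (targets & mask) == mask;
    }
}
```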

Output:

BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063)
Processor=Intel Core i7-4790 CPU 3.60GHz (Haswell), ProcessorCount=8
Frequency=3507517 Hz, Resolution=285.1020 ns, Timer=TSC
.NET Core SDK=2.1.0-preview1-007228
  [Host]     : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT
  Job-WFNGKY : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  Job-VIXUQP : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT

 Method | Toolchain                         |        Mean |     Error |    StdDev
------- |---------------------------------- |------------:|----------:|----------:
HasFlag | .NET Core 2.0                     | 14,917.4 ns | 80.147 ns | 71.048 ns
HasFlag | .NET Core 2.1.0-preview1-25719-04 |    449.3 ns |  1.239 ns |  1.034 ns

With the cool new BenchmarkDotNet DisassemblyDiagnoser (again thanks to @adamsitnik), we can see that the optimized code really is a simple bit test:

Bench.HasFlag

RyuJIT x64 .NET Core 2.0:

    HasFlagBench.Bench.HasFlag():
        push rdi
        push rsi
        push rbx
        sub rsp,20h
        mov rsi,rcx
        xor edi,edi
    L1: mov rcx, [[AttributeTargets type]]
        call [[box]]
        mov rbx,rax
        mov rcx, [[AttributeTargets type]]
        call [[box]]
        mov ecx,dword ptr [rsi+8]
        mov dword ptr [rbx+8],ecx
        mov rcx,rbx
        mov dword ptr [rax+8],0Ch
        mov rdx,rax
        call [[System.Enum.HasFlag]]
        mov byte ptr [rsi+0Ch],al
        inc edi
        cmp edi,3E8h
        jl L1
        add rsp,20h
        pop rbx
        pop rsi
        pop rdi
        ret

RyuJIT x64 .NET Core 2.1.0-preview1-25719-04:

    HasFlagBench.Bench.HasFlag():
        xor eax,eax
        mov edx,dword ptr [rcx+8]
    L1: mov r8d,edx
        and r8d,0Ch
        cmp r8d,0Ch
        sete r8b
        mov byte ptr [rcx+0Ch],r8b
        inc eax
        cmp eax,3E8h
        jl L1
        ret

What’s more, implementing this optimization involved building a new scheme for recognizing intrinsics in the JIT, which is more flexible than the previous one, and which is being leveraged in the implementation of Intel SIMD intrinsics for .NET Core.

Block Layout for Search Loops

Outside of profile-guided optimization, the JIT has traditionally been conservative about rearranging the basic blocks of methods it compiles, leaving them in MSIL order except to segregate code it identifies as “rarely-run” (e.g. blocks that throw or catch exceptions). Of course, MSIL order isn’t always the most performant one; notably, in the case of loops with conditional exits/returns, it’s generally a good idea to keep the in-loop code together, and move everything on the exit path after the conditional branch out of the loop. For particularly hot loops, this can cause a significant enough difference that developers have been using gotos to make the MSIL order reflect the desired machine code order. Change #13314 updated the JIT’s loop detection to effect this layout automatically. As usual, the PR included a performance assessment, which noted speed-ups in 5 of the benchmarks in our performance test suite.

Again comparing .NET Core 2.0 (which didn’t have this change) to a recent daily preview build (which does), let’s look at the effect on the repro case from the GitHub issue describing this opportunity:
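The repro pair looks roughly like this; it is a sketch consistent with the disassembly below (the exact code is in the linked issue), and it requires compiling with unsafe code enabled. The caller is assumed to pass strings of equal length:

```csharp
public class LoopWithExit
{
    // The exit path ("return false") sits mid-loop in MSIL order; before
    // change #13314 the JIT emitted it there, splitting the hot loop body.
    public static unsafe bool LoopReturn(string s1, string s2)
    {
        int length = s1.Length; // assumes s2.Length >= s1.Length
        fixed (char* a = s1)
        fixed (char* b = s2)
        {
            char* pa = a, pb = b;
            while (length != 0)
            {
                if (*pa != *pb)
                    return false; // early exit in the middle of the loop
                pa++; pb++;
                length--;
            }
        }
        return true;
    }

    // The same loop, with the failure path moved below the loop via goto,
    // hand-achieving the layout the JIT now produces automatically.
    public static unsafe bool LoopGoto(string s1, string s2)
    {
        int length = s1.Length;
        fixed (char* a = s1)
        fixed (char* b = s2)
        {
            char* pa = a, pb = b;
            while (length != 0)
            {
                if (*pa != *pb)
                    goto ReturnFalse;
                pa++; pb++;
                length--;
            }
        }
        return true;
    ReturnFalse:
        return false;
    }
}
```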

The results confirm that the new JIT brings the performance of the loop with the in-place return in line with the performance of the loop with the goto, and that doing so constituted a 15% speed-up:

BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063)
Processor=Intel Core i7-4790 CPU 3.60GHz (Haswell), ProcessorCount=8
Frequency=3507517 Hz, Resolution=285.1020 ns, Timer=TSC
.NET Core SDK=2.1.0-preview1-007228
  [Host]     : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  Job-NHAVNC : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  Job-CTEHPT : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT

    Method | Toolchain                         |     Mean |     Error |    StdDev
---------- |---------------------------------- |---------:|----------:|----------:
LoopReturn | .NET Core 2.0                     | 61.97 ns | 0.1254 ns | 0.1111 ns
  LoopGoto | .NET Core 2.0                     | 53.63 ns | 0.5171 ns | 0.4837 ns
LoopReturn | .NET Core 2.1.0-preview1-25719-04 | 53.75 ns | 0.5089 ns | 0.4511 ns
  LoopGoto | .NET Core 2.1.0-preview1-25719-04 | 53.52 ns | 0.0999 ns | 0.0934 ns

Disassembly confirms that the difference is entirely block placement:

LoopWithExit.LoopReturn

RyuJIT x64 .NET Core 2.0:

    LoopLayoutBench.LoopWithExit.LoopReturn_(System.String, System.String):
        sub rsp,18h
        xor eax,eax
        mov qword ptr [rsp+10h],rax
        mov qword ptr [rsp+8],rax
        mov ecx,dword ptr [rdx+8]
        mov qword ptr [rsp+10h],rdx
        mov rax,rdx
        test rax,rax
        je L1
        add rax,0Ch
    L1: mov qword ptr [rsp+8],r8
        mov rdx,r8
        test rdx,rdx
        je L2
        add rdx,0Ch
    L2: test ecx,ecx
        je L5
    L3: movzx r8d,word ptr [rax]
        movzx r9d,word ptr [rdx]
        cmp r8d,r9d
        je L4
        xor eax,eax
        add rsp,18h
        ret
    L4: add rax,2
        add rdx,2
        dec ecx
        test ecx,ecx
        jne L3
    L5: mov eax,1
        add rsp,18h
        ret

RyuJIT x64 .NET Core 2.1.0-preview1-25719-04:

    LoopLayoutBench.LoopWithExit.LoopReturn_(System.String, System.String):
        sub rsp,18h
        xor eax,eax
        mov qword ptr [rsp+10h],rax
        mov qword ptr [rsp+8],rax
        mov eax,dword ptr [rdx+8]
        mov qword ptr [rsp+10h],rdx
        test rdx,rdx
        je L1
        add rdx,0Ch
    L1: mov qword ptr [rsp+8],r8
        mov rcx,r8
        test rcx,rcx
        je L2
        add rcx,0Ch
    L2: test eax,eax
        je L4
    L3: movzx r8d,word ptr [rdx]
        movzx r9d,word ptr [rcx]
        cmp r8d,r9d
        jne L5
        add rdx,2
        add rcx,2
        dec eax
        test eax,eax
        jne L3
    L4: mov eax,1
        add rsp,18h
        ret
    L5: xor eax,eax
        add rsp,18h
        ret

LoopWithExit.LoopGoto

RyuJIT x64 .NET Core 2.0:

    LoopLayoutBench.LoopWithExit.LoopGoto_(System.String, System.String):
        sub rsp,18h
        xor eax,eax
        mov qword ptr [rsp+10h],rax
        mov qword ptr [rsp+8],rax
        mov eax,dword ptr [rcx+8]
        mov qword ptr [rsp+10h],rcx
        test rcx,rcx
        je L1
        add rcx,0Ch
    L1: mov qword ptr [rsp+8],rdx
        test rdx,rdx
        je L2
        add rdx,0Ch
    L2: test eax,eax
        je L4
    L3: movzx r8d,word ptr [rcx]
        movzx r9d,word ptr [rdx]
        cmp r8d,r9d
        jne L5
        add rcx,2
        add rdx,2
        dec eax
        test eax,eax
        jne L3
    L4: mov eax,1
        add rsp,18h
        ret
    L5: xor eax,eax
        add rsp,18h
        ret

RyuJIT x64 .NET Core 2.1.0-preview1-25719-04:

    LoopLayoutBench.LoopWithExit.LoopGoto_(System.String, System.String):
        sub rsp,18h
        xor eax,eax
        mov qword ptr [rsp+10h],rax
        mov qword ptr [rsp+8],rax
        mov eax,dword ptr [rcx+8]
        mov qword ptr [rsp+10h],rcx
        test rcx,rcx
        je L1
        add rcx,0Ch
    L1: mov qword ptr [rsp+8],rdx
        test rdx,rdx
        je L2
        add rdx,0Ch
    L2: test eax,eax
        je L4
    L3: movzx r8d,word ptr [rcx]
        movzx r9d,word ptr [rdx]
        cmp r8d,r9d
        jne L5
        add rcx,2
        add rdx,2
        dec eax
        test eax,eax
        jne L3
    L4: mov eax,1
        add rsp,18h
        ret
    L5: xor eax,eax
        add rsp,18h
        ret

Conclusion