Let us try to see what the virtual call slow path looks like. For that, we make an artificial megamorphic call site in a JMH benchmark: three subclasses visiting the same call site:

package org.openjdk;

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VirtualCall {

    static abstract class A {
        int c1, c2, c3;
        public abstract void m();
    }

    static class C1 extends A {
        public void m() { c1++; }
    }

    static class C2 extends A {
        public void m() { c2++; }
    }

    static class C3 extends A {
        public void m() { c3++; }
    }

    A[] as;

    @Param({"mono", "mega"})
    private String mode;

    @Setup
    public void setup() {
        as = new A[300];
        boolean mega = mode.equals("mega");
        for (int c = 0; c < 300; c += 3) {
            as[c]     = new C1();
            as[c + 1] = mega ? new C2() : new C1();
            as[c + 2] = mega ? new C3() : new C1();
        }
    }

    @Benchmark
    public void test() {
        for (A a : as) {
            a.m();
        }
    }
}

To simplify the analysis, we invoke this with -XX:LoopUnrollLimit=1 -XX:-TieredCompilation: this keeps loop unrolling from complicating the disassembly, and disabling tiered compilation guarantees compilation with the final optimizing compiler. We do not care about the performance numbers all that much, but let us have them to frame the discussion:

Benchmark         (mode)  Mode  Cnt     Score    Error  Units
VirtualCall.test    mono  avgt    5   325.478 ± 18.156  ns/op
VirtualCall.test    mega  avgt    5  1070.304 ± 53.910  ns/op

To give you a taste of what happens when we do not use the optimizing compiler on test, run with -XX:CompileCommand=exclude,org.openjdk.VirtualCall::test:

Benchmark         (mode)  Mode  Cnt      Score     Error  Units
VirtualCall.test    mono  avgt    5  11598.390 ± 535.593  ns/op
VirtualCall.test    mega  avgt    5  11787.686 ± 884.384  ns/op

So, the megamorphic call does indeed cost something, but it is definitely not interpreter-bad performance. The difference between "mono" and "mega" in the optimized case is basically the call overhead: we spend around 3 ns per element in the "mega" case, while spending only around 1 ns per element in the "mono" case.

What does the "mega" case look like in perfasm? Like this, with many things pruned for brevity:

....[Hottest Region 1].......................................................................
C2, org.openjdk.generated.VirtualCall_test_jmhTest::test_avgt_jmhStub, version 88 (143 bytes)

 6.93%    5.40%  ↗  0x...5c450: mov    0x40(%rsp),%r9
                 │  ...
 3.65%    4.31%  │  0x...5c47b: callq  0x...0bf60        ;*invokevirtual m
                 │                                       ; - org.openjdk.VirtualCall::test@22 (line 76)
                 │                                       ; {virtual_call}
 3.12%    2.34%  │  0x...5c480: inc    %ebp
 3.33%    0.02%  │  0x...5c482: cmp    0x10(%rsp),%ebp
                 ╰  0x...5c486: jl     0x...5c450
...
.............................................................................................
31.26%   21.77%  <total for region 1>

....[Hottest Region 2].......................................................................
C2, org.openjdk.VirtualCall$C1::m, version 84 (14 bytes)   <--- mis-attributed :(

...
Decoding VtableStub vtbl[5]@12
 3.95%    1.57%  0x...59bf0: mov    0x8(%rsi),%eax
 3.73%    3.34%  0x...59bf3: shl    $0x3,%rax
 3.73%    5.04%  0x...59bf7: mov    0x1d0(%rax),%rbx
16.45%   22.42%  0x...59bfe: jmpq   *0x40(%rbx)          ; jump to target
                 0x...59c01: add    %al,(%rax)
                 0x...59c03: add    %al,(%rax)
...
.............................................................................................
27.87%   32.37%  <total for region 2>

....[Hottest Region 3].......................................................................
C2, org.openjdk.VirtualCall$C3::m, version 86 (26 bytes)

# {method} {0x00007f75aaf4dd50} 'm' '()V' in 'org/openjdk/VirtualCall$C3'
...
[Verified Entry Point]
17.82%   26.04%  0x...595c0: sub    $0x18,%rsp
 0.06%    0.04%  0x...595c7: mov    %rbp,0x10(%rsp)
                 0x...595cc: incl   0x14(%rsi)           ; c3++
 3.53%    5.14%  0x...595cf: add    $0x10,%rsp
                 0x...595d3: pop    %rbp
 3.29%    5.10%  0x...595d4: test   %eax,0x9f01a26(%rip)
 0.02%    0.02%  0x...595da: retq
...
.............................................................................................
24.73%   36.35%  <total for region 3>

So the benchmarking loop calls into something, which we can assume is the virtual call handler, and then it ends up in the VtableStub, which supposedly does what every other runtime does with virtual calls: jumps to the actual method with the help of the Virtual Method Table (VMT).
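Conceptually, the three instructions in the VtableStub can be modeled in plain Java (a hand-rolled sketch, not HotSpot code; all names here are made up): load the receiver's class word, index into that class's method table at a slot that is fixed per method across the hierarchy, and jump to the code found there. The fixed slot is what lets the stub use a constant offset like 0x40(%rbx).

```java
import java.util.function.Consumer;

// A sketch of virtual method table dispatch. Each "class" carries a table of
// code pointers; the slot index for m() is identical in every subclass, so
// the dispatch sequence never needs to know the concrete receiver type.
public class VtableModel {
    static final int M_SLOT = 0; // slot for m(), fixed across the hierarchy

    static class Klass {
        final Consumer<Obj>[] vtable;
        @SafeVarargs
        Klass(Consumer<Obj>... vtable) { this.vtable = vtable; }
    }

    static class Obj {
        final Klass klass;   // models the class word the stub loads first
        int c1, c2, c3;
        Obj(Klass k) { klass = k; }
    }

    // What the stub does: load class, load table entry, jump.
    static void invokeVirtualM(Obj receiver) {
        receiver.klass.vtable[M_SLOT].accept(receiver);
    }

    public static void main(String[] args) {
        Klass kC1 = new Klass(o -> o.c1++);
        Klass kC3 = new Klass(o -> o.c3++);
        Obj a = new Obj(kC1);
        Obj b = new Obj(kC3);
        invokeVirtualM(a);   // lands in kC1's code, bumps c1
        invokeVirtualM(b);   // lands in kC3's code, bumps c3
        System.out.println(a.c1 + " " + b.c3);
    }
}
```

Note the dispatch cost is the same no matter how many receiver types visit the call site, which is why "mega" stays flat instead of degrading further.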

But wait a minute, this does not compute! The disassembly says we are actually calling to 0x…0bf60, not into the VtableStub that is at 0x…59bf0?! And that call is hot, so the call target should also be hot, right? This is where the runtime itself plays tricks on us. Even if the compiler bails out of optimizing the virtual call, the runtime can handle these "pessimistic" cases on its own. To diagnose this better, we need to get a fastdebug OpenJDK build and supply a tracing option for Inline Caches (ICs): -XX:+TraceIC. Additionally, we want to save the HotSpot log to a file with -prof perfasm:saveLog=true.

Lo and behold!

$ grep IC org.openjdk.VirtualCall.test-AverageTime.log
IC@0x00007fac4fcb428b: to megamorphic {method} {0x00007fabefa81880} 'm' '()V'; in 'org/openjdk/VirtualCall$C2'; entry: 0x00007fac4fcb2ab0

Okay, it says the inline cache acted for the call site at 0x00007fac4fcb428b. What is that? It is our Java call!

$ grep -A 4 0x00007fac4fcb428b: org.openjdk.VirtualCall.test-AverageTime.log
 0.02%  0x00007fac4fcb428b: callq  0x00007fac4fb7dda0  ;*invokevirtual m {reexecute=0 rethrow=0 return_oop=0}
                                                       ; - org.openjdk.VirtualCall::test@22 (line 76)
                                                       ; {virtual_call}

But what is at the target address of that Java call? This is the resolving runtime stub:

$ grep -C 2 0x00007fac4fb7dda0 org.openjdk.VirtualCall.test-AverageTime.log
0x00007fac4fb7dcdf: hlt

Decoding RuntimeStub - resolve_virtual_call 0x00007fac4fb7dd10
0x00007fac4fb7dda0: push   %rbp
0x00007fac4fb7dda1: mov    %rsp,%rbp
0x00007fac4fb7dda4: pushfq

This stub basically calls into the runtime, figures out which method we want to call, and then asks the IC to patch the call site to point to the newly resolved address! Since that is a one-time action, no wonder we do not see it as hot code. The IC action line mentions changing the entry to another address, which is, by the way, our actual VtableStub:

$ grep -C 4 0x00007fac4fcb2ab0: org.openjdk.VirtualCall.test-AverageTime.log
Decoding VtableStub vtbl[5]@12
 8.94%    6.49%  0x00007fac4fcb2ab0: mov    0x8(%rsi),%eax
 0.16%    0.06%  0x00007fac4fcb2ab3: shl    $0x3,%rax
 0.20%    0.10%  0x00007fac4fcb2ab7: mov    0x1e0(%rax),%rbx
 2.34%    1.90%  0x00007fac4fcb2abe: jmpq   *0x40(%rbx)
                 0x00007fac4fcb2ac1: int3
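The resolve-then-patch behavior can be modeled in plain Java (a sketch under assumed names, not HotSpot code): the call site initially points at a resolver which, on first invocation, figures out the real target and rewrites the call site to point straight at it, so subsequent calls never enter the runtime again.

```java
import java.util.function.IntSupplier;

// A sketch of a self-patching call site: the first call goes through a
// "resolve stub" that replaces itself with the resolved target, mirroring
// how resolve_virtual_call patches the IC once and then never runs again.
public class PatchingCallSite {
    static int resolveCount = 0;

    // The slot in "machine code" that gets patched.
    static IntSupplier callSite;

    static final IntSupplier RESOLVED_TARGET = () -> 42;

    static final IntSupplier RESOLVE_STUB = () -> {
        resolveCount++;                 // the expensive runtime trip happens here
        callSite = RESOLVED_TARGET;     // patch the call site in place
        return RESOLVED_TARGET.getAsInt();
    };

    public static void main(String[] args) {
        callSite = RESOLVE_STUB;
        int sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += callSite.getAsInt(); // only the first iteration resolves
        }
        System.out.println(sum + ", resolved " + resolveCount + " time(s)");
    }
}
```

This is why the resolving stub is cold in the profile: out of a thousand calls, it executes exactly once.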

In the end, no runtime/compiler calls are needed to dispatch the resolved call: the call site just calls the VtableStub that does the VMT dispatch, never leaving the generated machine code. The IC machinery handles virtual monomorphic and static calls in a similar way, pointing to a stub/address that does not do VMT dispatch.
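For the monomorphic case, the shape of the win can be sketched as follows (an illustrative model with made-up names, not HotSpot code): the call site remembers the single observed receiver class and verifies it with one compare; on a hit, it makes a direct, inlinable call, skipping the load-load-jump of the VMT dispatch entirely.

```java
// A sketch of a monomorphic inline cache check: one class compare guards a
// direct call; a miss falls back to full virtual dispatch (and a real IC
// would also consider flipping the site to the megamorphic vtable path).
public class MonomorphicIC {
    abstract static class A { int c1, c2; abstract void m(); }
    static class C1 extends A { void m() { c1++; } }
    static class C2 extends A { void m() { c2++; } }

    static Class<?> cachedClass;  // the class the IC was patched to expect

    static void callM(A receiver) {
        if (receiver.getClass() == cachedClass) {
            // IC hit: direct call to a known method, no vtable lookup
            ((C1) receiver).m();
        } else {
            // IC miss: full virtual dispatch
            receiver.m();
        }
    }

    public static void main(String[] args) {
        cachedClass = C1.class;      // site resolved against C1
        A mono = new C1();
        A mega = new C2();
        callM(mono);  // fast path
        callM(mega);  // slow path
        System.out.println(mono.c1 + " " + ((C2) mega).c2);
    }
}
```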