This effect is caused by type profile pollution. Let me explain using a simplified benchmark:

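    import java.util.stream.Stream;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    public class Streams {
        @Param({"500", "520"})
        int iterations;

        @Setup
        public void init() {
            // Runs once before measurement; it is never part of the measured loop
            for (int i = 0; i < iterations; i++) {
                Stream.empty().reduce((x, y) -> x);
            }
        }

        @Benchmark
        public long loop() {
            return Stream.empty().count();
        }
    }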

Although the iterations parameter changes only slightly and does not affect the main benchmark loop, the results show a surprising 2.5x performance degradation:

    Benchmark     (iterations)   Mode  Cnt      Score     Error   Units
    Streams.loop           500  thrpt    5  29491,039 ± 240,953  ops/ms
    Streams.loop           520  thrpt    5  11867,860 ± 344,779  ops/ms

Now let's run JMH with the -prof perfasm option to see the hottest code regions:

Fast case (iterations = 500):

    ....[Hottest Methods (after inlining)]..................................
     48,66%  bench.generated.Streams_loop::loop_thrpt_jmhStub
     23,14%  <unknown>
      2,99%  java.util.stream.Sink$ChainedReference::<init>
      1,98%  org.openjdk.jmh.infra.Blackhole::consume
      1,68%  java.util.Objects::requireNonNull
      0,65%  java.util.stream.AbstractPipeline::evaluate

Slow case (iterations = 520):

    ....[Hottest Methods (after inlining)]..................................
     40,09%  java.util.stream.ReduceOps$ReduceOp::evaluateSequential
     22,02%  <unknown>
     17,61%  bench.generated.Streams_loop::loop_thrpt_jmhStub
      1,25%  org.openjdk.jmh.infra.Blackhole::consume
      0,74%  java.util.stream.AbstractPipeline::evaluate

It looks like the slow case spends most of its time in the ReduceOp.evaluateSequential method, which is not inlined. Furthermore, if we study the assembly code for this method, we'll find that the most expensive operation is checkcast.

You know how the HotSpot compiler works: before the JIT kicks in, a method is executed in the interpreter for some time to collect profile data, e.g. which methods are called, which classes are seen, which branches are taken, etc. With tiered compilation, the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimized code. However, if the application changes its execution pattern midway, the generated code may not be optimal for the new behavior.
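To make the "changed execution pattern" scenario concrete, here is a minimal plain-Java sketch of the same shape of code: a shared helper whose single virtual call site is first exercised by setup code with one lambda, then by the hot loop with another. This is not the benchmark above; the class name, the setupCalls argument and the loop counts are all made up for illustration. Whether you actually observe a timing difference depends on the JVM version, flags and compilation thresholds, so treat the printed time as a hint only and use a harness such as JMH for real measurements.

```java
import java.util.function.IntUnaryOperator;

final class SharedCallSiteDemo {
    // The op.applyAsInt(...) call below is one bytecode-level call site.
    // Its profile is shared by every caller of applyAll, whichever lambda they pass.
    static int applyAll(IntUnaryOperator op, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            acc = op.applyAsInt(acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Hypothetical knob, playing the role of the iterations parameter
        int setupCalls = args.length > 0 ? Integer.parseInt(args[0]) : 500;

        IntUnaryOperator setupOp = x -> x + 2; // used only during setup
        IntUnaryOperator hotOp = x -> x + 1;   // the only type the hot loop ever passes

        int sink = 0;
        for (int i = 0; i < setupCalls; i++) {
            // If this runs long enough, it may leave a second receiver type in the profile
            sink += applyAll(setupOp, 1);
        }

        long start = System.nanoTime();
        for (int i = 0; i < 5_000_000; i++) {
            sink += applyAll(hotOp, 1);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Print sink so the loops cannot be discarded as dead code
        System.out.println("setupCalls=" + setupCalls + " elapsedMs=" + elapsedMs + " sink=" + sink);
    }
}
```

Running it with different setupCalls values (for example 500 and 50000) is one way to experiment, but no particular outcome is guaranteed.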

Let's use -XX:+PrintMethodData (available in debug JVM builds) to compare the execution profiles:

    ----- Fast case -----
    java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
      interpreter_invocation_count:    13382
      invocation_counter:              13382
      backedge_counter:                    0
      mdo size: 552 bytes

    0 aload_1
    1 fast_aload_0
    2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;>
      0   bci: 2    VirtualCallData     count(0) entries(1)
                                        'java/util/stream/ReduceOps$8'(12870 1.00)
    5 aload_2
    6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;>
      48  bci: 6    VirtualCallData     count(0) entries(1)
                                        'java/util/stream/ReferencePipeline$5'(12870 1.00)
    9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
      96  bci: 9    ReceiverTypeData    count(0) entries(1)
                                        'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
    12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;>
      144 bci: 12   VirtualCallData     count(0) entries(1)
                                        'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
    17 areturn

    ----- Slow case -----
    java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
      interpreter_invocation_count:    54751
      invocation_counter:              54751
      backedge_counter:                    0
      mdo size: 552 bytes

    0 aload_1
    1 fast_aload_0
    2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;>
      0   bci: 2    VirtualCallData     count(0) entries(2)
                                        'java/util/stream/ReduceOps$2'(16 0.00)
                                        'java/util/stream/ReduceOps$8'(54223 1.00)
    5 aload_2
    6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;>
      48  bci: 6    VirtualCallData     count(0) entries(2)
                                        'java/util/stream/ReferencePipeline$Head'(16 0.00)
                                        'java/util/stream/ReferencePipeline$5'(54223 1.00)
    9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
      96  bci: 9    ReceiverTypeData    count(0) entries(2)
                                        'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                        'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
    12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;>
      144 bci: 12   VirtualCallData     count(0) entries(2)
                                        'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                        'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
    17 areturn

As you can see, the initialization loop ran long enough for its statistics to appear in the execution profile: every virtual call site has seen two receiver types, and checkcast also has two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and the JIT can easily inline and optimize them.
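One way to read the entries(1) versus entries(2) lines is as a tiny per-site table of receiver types. The following is a toy model of such a table, not HotSpot's actual data structure: the class and method names are invented, String and Integer stand in for the real sink classes, and the two-row limit is an assumption that mirrors HotSpot's default -XX:TypeProfileWidth=2. It only shows how 16 stray calls are enough to turn a monomorphic site into a bimorphic one, no matter how many calls follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of one call site's receiver-type profile; NOT HotSpot's implementation. */
final class TypeProfileSketch {
    // Assumption: two rows per site, mirroring the default -XX:TypeProfileWidth=2
    static final int WIDTH = 2;

    private final Map<String, Long> rows = new LinkedHashMap<>();
    private long overflow; // receivers whose type did not fit into a free row

    void record(Object receiver) {
        String type = receiver.getClass().getName();
        if (rows.containsKey(type) || rows.size() < WIDTH) {
            rows.merge(type, 1L, Long::sum); // bump an existing row or claim a free one
        } else {
            overflow++;                      // table is full: only the generic counter grows
        }
    }

    int entries() {
        return rows.size();
    }

    String shape() {
        if (overflow > 0) return "megamorphic";
        switch (rows.size()) {
            case 0:  return "unreached";
            case 1:  return "monomorphic";
            default: return "bimorphic";
        }
    }

    @Override
    public String toString() {
        return shape() + " count(" + overflow + ") entries(" + entries() + ") " + rows;
    }

    public static void main(String[] args) {
        TypeProfileSketch clean = new TypeProfileSketch();
        for (int i = 0; i < 12870; i++) clean.record("hot path");          // one receiver type only

        TypeProfileSketch polluted = new TypeProfileSketch();
        for (int i = 0; i < 16; i++) polluted.record(Integer.valueOf(i));  // leftovers from setup
        for (int i = 0; i < 54223; i++) polluted.record("hot path");

        System.out.println("clean:    " + clean);
        System.out.println("polluted: " + polluted);
    }
}
```

In this model the first profile is reported as monomorphic and the second as bimorphic, even though the stray type accounts for a negligible share of the calls.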

The same is true for your original benchmark: the longer stream operations in the init() method polluted the profile. If you play with the profiling and tiered compilation options, the results can be quite different. For example, try

-XX:-ProfileInterpreter -XX:Tier3InvocationThreshold=1000 -XX:-TieredCompilation

Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hope this will be improved in Java 9.