Benchmark test

It’s a Superman vs. Batman battle. If we pit sequential streams against regular for-loops, which one comes out faster? After some careful benchmark tests, Angelika Langer shows us which is faster, and why we must be careful about passing judgement.

A while ago I mused about the performance of Java 8 streams in this forum, showed a couple of figures that illustrate certain aspects of the performance characteristics of streams, and explained why these results are plausible. In one of these benchmark experiments we compared the performance of sequential streams and regular for-loops. In the context in which we did the measurement, the for-loop was approximately 15 times faster than the corresponding sequential stream.

Reactions to this performance comparison vary from surprise to utter disbelief. Some people even conclude that streams are too slow to be useful. Jumping to such a conclusion from a single benchmark result is – I don’t know how to put it – hasty? misleading? irresponsible? In any case, it is not helpful. When you benchmark, you usually benchmark a lot and never rely on a single set of figures – neither did we. So, let us put the results into perspective.

Both ends of a spectrum

For the benchmark that showed the for-loop to be 15 times faster than the corresponding sequential stream, we used an int-array filled with 500,000 random integral values. In this array we searched for the maximum value.

The for-loop looked like this:

    int[] a = ints;
    int e = ints.length;
    int m = Integer.MIN_VALUE;
    for (int i = 0; i < e; i++)
        if (a[i] > m) m = a[i];

As the counterpart to an int[] we used a sequential primitive IntStream:

    int m = Arrays.stream(ints)
                  .reduce(Integer.MIN_VALUE, Math::max);

The results on our outdated hardware (dual core, no dynamic overclocking), with proper warm-up and all it takes to produce halfway reliable benchmark figures, were these:

    int-array, for-loop   : 0.36 ms
    int-array, seq. stream: 5.35 ms

The for-loop is substantially faster than the sequential stream. We ran the same benchmark on more modern hardware (with 4 virtual cores) and found a slowdown factor of 4.2 (instead of 15). The result is reproducible; see for instance Nicolai Parlog’s blog (/CDFX/). He came up with a slowdown factor of 4.5 (instead of 15). As expected, a different context yields different results. Yet the performance characteristic – “a for-loop is faster than a sequential stream” (in this particular benchmark) – is the same on various platforms.
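The article does not show the harness behind these figures. As a rough illustration only, a hand-rolled measurement with a warm-up phase might look like the sketch below. The class name, the helper averageMillis, the random seed, and the warm-up and run counts are all assumptions made for this example; a dedicated tool such as JMH is generally a better choice for figures you intend to rely on.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToIntFunction;

// Hypothetical harness; names and parameters are illustrative, not taken from the article.
class MaxBenchmarkSketch {

    // The for-loop variant shown in the article.
    static int maxLoop(int[] ints) {
        int[] a = ints;
        int e = ints.length;
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < e; i++)
            if (a[i] > m) m = a[i];
        return m;
    }

    // The sequential IntStream variant shown in the article.
    static int maxStream(int[] ints) {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
    }

    // Runs the task untimed first (warm-up, giving the JIT a chance to compile it),
    // then returns the average wall-clock time per timed run in milliseconds.
    static double averageMillis(ToIntFunction<int[]> task, int[] data, int warmups, int runs) {
        int sink = 0; // accumulate results so the work is not discarded as dead code
        for (int i = 0; i < warmups; i++) sink += task.applyAsInt(data);
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink += task.applyAsInt(data);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print(""); // keeps 'sink' observable
        return elapsed / 1e6 / runs;
    }

    public static void main(String[] args) {
        int[] ints = new Random(42).ints(500_000).toArray(); // seed chosen arbitrarily
        System.out.printf("int-array, for-loop   : %.2f ms%n",
                averageMillis(MaxBenchmarkSketch::maxLoop, ints, 50, 50));
        System.out.printf("int-array, seq. stream: %.2f ms%n",
                averageMillis(MaxBenchmarkSketch::maxStream, ints, 50, 50));
    }
}
```

The absolute numbers such a sketch prints will differ from machine to machine, which is exactly the platform dependence discussed above.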

We repeated the benchmark with a different stream source: we replaced the int[] by an ArrayList<Integer>, again filled with 500,000 random integral values. The results were:

    ArrayList, for-loop   : 6.55 ms
    ArrayList, seq. stream: 8.33 ms

Again, the for-loop is faster than the sequential stream operation, but the difference on an ArrayList (a factor of roughly 1.3) is not nearly as significant as it was on an array.
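The ArrayList variants of the two code snippets are not shown in the article. A plausible reconstruction, under the assumption that the loop uses indexed access and the stream uses the same kind of reduction as before, could look like this (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical reconstruction of the ArrayList<Integer> variants; not the article's actual code.
class ListMaxSketch {

    // Indexed for-loop over the list; each get() yields an Integer that must be unboxed.
    static int maxLoop(List<Integer> list) {
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < list.size(); i++) {
            int v = list.get(i);
            if (v > m) m = v;
        }
        return m;
    }

    // Sequential reference stream (Stream<Integer>) with the same reduction.
    static int maxStream(List<Integer> list) {
        return list.stream().reduce(Integer.MIN_VALUE, Integer::max);
    }

    public static void main(String[] args) {
        Random random = new Random(42); // seed chosen arbitrarily
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 500_000; i++) list.add(random.nextInt());
        System.out.println(maxLoop(list) == maxStream(list)); // both variants agree
    }
}
```

In both variants every element is reached through a reference to an Integer object, which is why neither can profit from the compact memory layout of an int[].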

Is any of it surprising? No, actually not, if you think about it. In this benchmark we invest a lot to retrieve half a million values from a sequence in memory, and once we get hold of the values we perform a petty pair-wise comparison of two integral values, which after JIT compilation is barely more than one assembly instruction. For this reason, the benchmark results are dominated by the cost of memory access and the cost of iteration. As the speed of memory access is hardware-dependent, the results vary from platform to platform. This explains the different slowdown factors on different platforms.

The fact that the for-loop beats the sequential stream in our benchmark is not surprising either. We deliberately picked an extreme situation, which represents an end of a spectrum. Actually, there are several spectra involved.

We compared for-loops to streams. Loops are JIT-friendly. Compilers have 40+ years of experience optimizing loops and we picked a loop that the JIT compiler can heavily optimize. This is one end of a spectrum: a JIT-friendly, highly optimizable access to the sequence elements. Streams are at the other end. Using a stream means calling into a major framework, which inevitably adds overhead. A JIT compiler will eliminate the overhead to some extent, but usually not completely. (By the way, there are sequential streams that are far slower than the ArrayList stream in our benchmark. We will get to this later.) Let us call it the JIT-friendly/JIT-unfriendly spectrum. The for-loop is on the friendly side and thus it wins performance-wise compared to a sequential stream. No surprise here.

We compared sequences of primitive type elements to sequences of reference type elements. Let us call it the cache-friendly/cache-unfriendly spectrum. An array of primitive type ints is very cache-friendly (and would be even more so if Java had immutable arrays, which is under discussion for future versions of Java as part of the “Array 2.0” effort). A collection of reference type elements, even if it is an array-based collection such as ArrayList, has little chance to be cached efficiently. Every individual access to a sequence element requires dereferencing a pointer and reaching out into memory – which means yet another cache miss. Clearly, a for-loop over an int[] is on the cache-friendly side and will win performance-wise compared to a sequential reference stream. No surprise here, either.

We compared a light-weight element usage to CPU-intensive usage. More precisely, we compared a pair-wise comparison when looking for the maximum value with a Taylor approximation for sine values. We will get to sine values in a minute. Let us call it the CPU-friendly/CPU-unfriendly spectrum. If we perform a heavy-weight, CPU-intensive operation on each element in a sequence, then the benchmark results are dominated by the availability and speed of your CPU, and all other aspects such as cache misses and JIT-compiled loops become fairly insignificant, as we will see.

As you can tell now, a for-loop over an int[] searching for the maximum element benefits from the JIT-friendly loop and the cache-friendly primitive array with no CPU-intensive activities that would mask any of these advantages. Of course the for-loop shines under these circumstances. It would be surprising if it didn’t.

Now, let’s put it into perspective. How likely is it that THE performance critical activity in our application is a loop over a primitive-type array with barely any CPU activity? Not very likely, I would say. “Why did we measure it then?”, you might wonder. We did it because it is a piece of the puzzle. It is a demarcation line and represents a certain extreme situation against which you can compare other scenarios involving streams. The for-loop over an int[] is the best-case scenario: if a stream usage is as fast as that then it is really good.

If the for-loop over an int[] searching for the maximum element is one end of a spectrum, what are the performance figures for the other end of the spectrum? Let us look into the aforementioned CPU-intensive use case.

We modify our for-loop and the corresponding sequential stream usage a little bit in that we first map the array / stream elements to sine values and then search for the maximum sine value.

The for-loop looks like this:

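    int[] a = ints;
    int e = a.length;
    // Double.NEGATIVE_INFINITY is the identity for max;
    // Double.MIN_VALUE is the smallest positive double, not the most negative one.
    double m = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < e; i++) {
        double d = Sine.slowSin(a[i]);
        if (d > m) m = d;
    }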

The sequential stream usage looks like this:

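    // Double.NEGATIVE_INFINITY is the identity for max;
    // Double.MIN_VALUE is the smallest positive double, not the most negative one.
    Arrays.stream(ints)
          .mapToDouble(Sine::slowSin)
          .reduce(Double.NEGATIVE_INFINITY, (i, j) -> Math.max(i, j));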

A minimal change that has drastic consequences for the benchmark results, which look like this:

    for-loop   : 11.82 ms
    seq. stream: 12.15 ms

These figures were measured on the outdated hardware on which we had observed the slowdown factor of 15. Also, we got impatient and reduced the size of the sequences by a factor of 50 (from 500,000 to 10,000); it simply took too long to run the benchmark with 500,000 elements.

Looking at the resulting figures we conclude that the for-loop is still a little faster than the sequential stream (by roughly 3%), but the difference is no longer substantial. The difference still is statistically significant (if you run the samples through a t-test), but for all practical purposes it is negligible. Now we have shown the opposite of what our previous benchmark suggested: there is hardly any difference in the performance of sequential streams and for-loops.
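The article does not say which t-test was applied or show the individual samples. As a rough sketch of the idea, the following computes the test statistic of Welch's two-sample t-test (an assumed variant) from two arrays of timing samples. The class and method names are hypothetical, and the statistic still has to be compared against the critical value of the t-distribution for the appropriate degrees of freedom, which is not shown here.

```java
// Hypothetical helper; assumes Welch's unequal-variance t-test, which the article does not specify.
class WelchSketch {

    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1); requires at least two samples.
    static double variance(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return sumSq / (xs.length - 1);
    }

    // Welch's t statistic: difference of means divided by its estimated standard error.
    static double tStatistic(double[] a, double[] b) {
        double standardError = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }
}
```

A large absolute value of the statistic indicates that the difference between the two sets of timings is unlikely to be noise, even when, as here, the difference itself is too small to matter in practice.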

Why is this result so dramatically different from the previous benchmark’s result? It is because this time we used a scenario located at the opposite end of the CPU-friendly/CPU-unfriendly spectrum. In this benchmark, we do not simply compare sequence elements to each other as soon as we get hold of them. Instead we stuff each sequence element into the mysterious slowSin() method that calculates the sine value before we compare anything.

The slowSin() method is indeed very slow. It is a non-public method taken from the Apache Commons Mathematics Library (ACML), where it is used to fill a table of sine values that are later used for the public sin() method, which performs a fast interpolation based on the table values (see class FastMathCalc). The slowSin() method isn’t slow because it is waiting for anything; it is slow because it does a lot. It calculates the sine value by means of a Taylor approximation, which is a calculation that keeps the CPU very, very busy. No data need to be loaded from memory, except the initial parameter, and the CPU is happily occupied with itself and its registers for quite a bit of time. Very different from the petty pair-wise comparison the CPU had to perform in the previous benchmark.
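To give an impression of the kind of work involved, here is a minimal Taylor-series sine. It is a hypothetical stand-in and not the library's slowSin() implementation; the name taylorSin, the number of terms, and the naive range reduction are assumptions made for this sketch.

```java
import java.util.Arrays;

// Hypothetical stand-in for slowSin(); NOT the Apache Commons Math implementation.
class TaylorSineSketch {

    // sin(x) = x - x^3/3! + x^5/5! - ... evaluated with a fixed number of terms.
    static double taylorSin(double x) {
        // Naive reduction of the argument to [-pi, pi] so that the series converges quickly;
        // this loses accuracy for very large arguments.
        double r = Math.IEEEremainder(x, 2 * Math.PI);
        double term = r; // first term of the series
        double sum = r;
        for (int k = 1; k <= 20; k++) {
            // Each term is the previous one times -r^2 / ((2k) * (2k + 1)).
            term *= -r * r / ((2 * k) * (2 * k + 1));
            sum += term;
        }
        return sum;
    }

    // Sequential stream usage analogous to the one in the article.
    static double maxSineStream(int[] ints) {
        return Arrays.stream(ints)
                     .mapToDouble(TaylorSineSketch::taylorSin)
                     .reduce(Double.NEGATIVE_INFINITY, Math::max);
    }
}
```

All the work per element happens in registers: dozens of multiplications, divisions and additions for a single value loaded from memory, as opposed to a single comparison in the maximum search.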

Under these circumstances the benchmark results are dominated by the availability and speed of the CPU and all other aspects such as cache misses and JIT-compiled loops become insignificant.

Again: How likely is it that THE performance critical activity in our application is a loop over a primitive-type array with a CPU-intensive activity such as a Taylor approximation? It might be for some applications, but for most it is not very likely. Why did we measure it then? Because it is another piece of the puzzle – on the opposite end of a spectrum.

Conclusions

The point to take home is: a sequential stream can be significantly slower than a for-loop in certain situations, while there is no substantial performance difference in other situations. When you use a sequential stream then you use it because you like the style, not in order to improve your application’s performance. At the same time, there is no reason to shy away from streams for fear they might impair your application’s performance.

In most situations sequential streams are somewhat slower than for-loops; in very special situations they are substantially slower, but in general the performance difference will be tolerable – especially if you consider that in practice most stream sizes are on the order of dozens of elements, not hundreds of thousands of elements as in our benchmark. Even a slowdown factor of 15 does not hurt if you loop over a handful of elements. It’s a couple of nanoseconds more or less. Who cares?


The performance difference is only relevant for hot spots in your application, i.e. huge streams and long loops on the performance-critical path. Don’t worry about the performance of a non-critical stream operation on a short stream source. First you need proof that there is a performance bottleneck (e.g. by means of profiling figures) before you start worrying and fixing the problem by avoiding streams. Don’t optimize prematurely! If in doubt – measure, don’t guess!