When to use parallel streams

The java.util.stream framework supports data-driven operations on collections and other sources. Most stream methods apply the same operation to each data element. When multiple cores are available, "data-driven" can become "data-parallel" via the parallelStream() method of a collection. But when should you do this?
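For example, an independent, associative reduction can switch between sequential and parallel execution with a one-word change (the class and method names here are just illustrative):

```java
import java.util.List;

public class ParallelDemo {
    static int sequentialSum(List<Integer> xs) {
        // Elements are processed one at a time, in encounter order.
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    static int parallelSum(List<Integer> xs) {
        // Same pipeline, but the framework may split the work across cores.
        return xs.parallelStream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> nums = List.of(1, 2, 3, 4, 5);
        // An independent, associative reduction gives the same answer either way.
        System.out.println(sequentialSum(nums) + " " + parallelSum(nums));
    }
}
```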

Consider using S.parallelStream().operation(F) instead of S.stream().operation(F) when the operations are independent, and are either computationally expensive or applied to many elements of an efficiently splittable data structure, or both. In more detail:

F, the per-element function (usually a lambda), is independent: the computation for each element does not rely on or impact that of any other element. (See the stream package summary for further guidance about using stateless non-interfering functions.)

S, the source collection, is efficiently splittable. There are a few other readily parallelizable stream sources besides Collections, for example java.util.SplittableRandom (for which you can use the stream.parallel() method to parallelize). But most sources based on IO are designed primarily for sequential use.

The total time to execute the sequential version exceeds a minimum threshold. These days, the threshold is roughly (within a factor of ten of) 100 microseconds across most platforms. You don't need to measure this precisely, though. You can estimate it well enough in practice by multiplying N (the number of elements) by Q (the cost per element of F), in turn guesstimating Q as the number of operations or lines of code, and then checking that N * Q is at least 10000. (If you are feeling cowardly, add another zero or two.) So when F is a tiny function like x -> x + 1, it would require N >= 10000 elements for parallel execution to be worthwhile. Conversely, when F is a massive computation like finding the best next move in a chess game, the Q factor is so high that N doesn't matter, so long as the collection is completely splittable.
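The N * Q rule of thumb above can be sketched as a tiny helper (the method name and parameters are hypothetical; this just restates the back-of-envelope arithmetic from the text):

```java
public class NQHeuristic {
    // Crude version of the N * Q rule: n = number of elements,
    // q = guesstimated cost per element (operations or lines of code).
    static boolean worthParallelizing(long n, long q) {
        return n * q >= 10_000; // add a zero or two if you are feeling cowardly
    }

    public static void main(String[] args) {
        // Tiny function like x -> x + 1 (q ~ 1): needs ~10000 elements.
        System.out.println(worthParallelizing(10_000, 1)); // clears the bar
        System.out.println(worthParallelizing(100, 1));    // does not
        // Massive per-element cost (e.g., chess search): N barely matters.
        System.out.println(worthParallelizing(4, 1_000_000));
    }
}
```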

The streams framework does not (and cannot) enforce any of these criteria. If the computation is not independent, then running it in parallel makes no sense and may be harmfully wrong. The other criteria stem from three engineering issues and tradeoffs:
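For instance, a pipeline that funnels elements into a shared unsynchronized list is not independent, and in parallel it can silently lose or corrupt data; the equivalent collector-based pipeline keeps the per-element work independent (a minimal sketch; the class name is made up):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class IndependenceDemo {
    // WRONG under parallelism: forEach(sharedList::add) would mutate shared,
    // unsynchronized state from many threads at once, so elements are not
    // independent and results can be lost or the list corrupted:
    //   List<Integer> shared = new ArrayList<>();
    //   IntStream.range(0, n).parallel().forEach(shared::add); // data race!

    // RIGHT: keep the per-element work independent and let the framework
    // accumulate results through a collector.
    static List<Integer> collectParallel(int n) {
        return IntStream.range(0, n)
                .parallel()
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(collectParallel(100_000).size()); // prints 100000
    }
}
```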

Start-up. As processors have added cores over the years, most have also added power-control mechanisms that can make those cores slow to start up, sometimes with additional overhead imposed by JVMs, OSes, and hypervisors. The threshold roughly approximates the time it might take for enough cores to start processing parallel subtasks to be worthwhile. Once they get started, parallel computations can be more energy-efficient than sequential ones (depending on various processor and system details; see for example this article by Federova et al).

Granularity. Subdividing already-small computations is rarely worthwhile. The framework normally splits up problems so that parts may be processed by all available cores on a system. If there is practically nothing for each core to do after starting, then the (mostly sequential) effort of setting up the parallel computation is wasted. Considering that the practical range of cores these days is from 2 to 256, the threshold also stays away from over-partitioning effects.

Splittability. The most efficiently splittable collections include ArrayLists and {Concurrent}HashMaps, as well as plain arrays (i.e., those of form T[], split using static java.util.Arrays methods). The least efficient are LinkedLists, BlockingQueues, and most IO-based sources. Others fall somewhere in between. (Data structures tend to be efficiently splittable if they internally support random access, efficient search, or both.) If it takes longer to partition the data than to process it, the effort is wasted. So if the Q factor of a computation is high enough, you may get a parallel speedup even for a LinkedList, but this is not very common. Additionally, some sources cannot be split all the way down to single elements, so there may be limits on how finely tasks are partitioned.
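As a splittability example, SplittableRandom's bulk stream methods are built to split cleanly, so parallel() can distribute both generation and filtering across cores (a sketch; the method name, sizes, and seed are arbitrary choices for illustration):

```java
import java.util.SplittableRandom;

public class SplittableDemo {
    // Count how many of streamSize random ints in [0, bound) fall below limit.
    // The stream source splits efficiently, so parallel() can distribute
    // the work across cores without any shared mutable state.
    static long countBelow(long streamSize, int bound, int limit, long seed) {
        return new SplittableRandom(seed)
                .ints(streamSize, 0, bound)
                .parallel()
                .filter(i -> i < limit)
                .count();
    }

    public static void main(String[] args) {
        // Roughly half of one million ints in [0, 100) fall below 50.
        System.out.println(countBelow(1_000_000, 100, 50, 42));
    }
}
```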