March 25, 2014

Today, I want to continue talking about Java 8 and the performance improvements to the ForkJoin framework that can help you on your quest to better software development. For the tools part, I'm going to look at the Java Microbenchmark Harness (JMH) project, which allows you to write better benchmarks and measure the performance of your code in a sane way. We'll use it to compare Java 7 vs. Java 8 performance.

This post is also about the concurrency updates that landed in the JDK in the latest release. There were actually multiple changes to java.util.concurrent, but the focus here will be on the improvements introduced to the Fork-Join framework. We're going to discuss Fork-Join a bit and then implement a sample benchmark to compare the performance of FJ in Java 7 vs. 8. Let's get started with the basics.

What Is ForkJoin?

Fork-Join is a JVM framework for parallel computation of, usually, recursive tasks.

It was first introduced in Java 7 and has served its purpose well since then. The reason is that many large tasks are by nature recursive.

For example, consider the best-known MapReduce programming example: counting how many times different words occur in a document. Obviously, one can divide the document into parts, compute word counts part by part and combine the results in the end. Indeed, ForkJoin is an implementation of the general MapReduce principles, except that all the workers are threads in the same JVM instead of a cluster of machines.
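To make that concrete, here is a minimal sketch of such a divide-and-combine task (my own illustration, not code from this post's benchmark; the class name and the threshold are made up):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative word counting with ForkJoin: split the word array in half
// until the chunks are small, count each chunk, then combine the maps.
public class WordCountTask extends RecursiveTask<Map<String, Integer>> {
    private static final int THRESHOLD = 4; // made-up cut-off for splitting
    private final String[] words;
    private final int from, to;

    public WordCountTask(String[] words, int from, int to) {
        this.words = words;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Map<String, Integer> compute() {
        if (to - from <= THRESHOLD) { // small enough: count directly
            Map<String, Integer> counts = new HashMap<>();
            for (int i = from; i < to; i++) {
                counts.merge(words[i], 1, Integer::sum);
            }
            return counts;
        }
        int mid = (from + to) / 2;
        WordCountTask left = new WordCountTask(words, from, mid);
        WordCountTask right = new WordCountTask(words, mid, to);
        left.fork();                                  // process halves in parallel
        Map<String, Integer> counts = right.compute();
        for (Map.Entry<String, Integer> e : left.join().entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum); // combine results
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "to be or not to be".split(" ");
        System.out.println(new ForkJoinPool()
                .invoke(new WordCountTask(words, 0, words.length)));
    }
}
```

The fork/join pair in compute() mirrors the map and combine steps, only the workers are threads rather than machines.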

ForkJoinPool

The central piece of the ForkJoin framework is the ForkJoinPool. It is an ExecutorService, so it can accept asynchronous tasks and return you a Future object that lets you keep track of the status of the computation in question.
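For example (a trivial snippet of my own, not from this post's benchmark), you can hand a ForkJoinPool a plain Callable exactly as you would with any other ExecutorService:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool();
        // ForkJoinPool implements ExecutorService, so submit() returns a Future
        Future<Integer> future = pool.submit(new Callable<Integer>() {
            @Override
            public Integer call() {
                return 21 * 2;
            }
        });
        System.out.println(future.get()); // blocks until the result is ready, prints 42
        pool.shutdown();
    }
}
```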

What makes it different from other ExecutorServices is that worker threads that don't have anything useful to execute at the moment check out their buddies and borrow work from them. This technique is called work stealing. So what's the deal with work stealing?

What Is Work Stealing?

Work stealing is a way to manage workload in a decentralized fashion.

There's no one to assign units of work to all the available thread workers. Instead, every one of them manages its own personal queue of tasks. The trick is to manage those queues effectively.

There are two main questions about having every worker process its own queue.

Where do externally submitted tasks go?

How can we organise work stealing so that access to the queues is efficient?

Essentially, there should be no difference between an externally submitted task and a task created by a worker while executing some larger task. Both need to be executed and provide a result. However, the mechanics are different. The main difference is that tasks created by workers can be stolen: despite being put into a certain worker's queue, they may end up being executed by someone else.

The way the ForkJoin framework handles this is pretty straightforward: every worker has two queues, one for external tasks and another to satisfy the stealing mechanics. When an external submission happens, a random worker's queue is picked to receive the task. When a task is divided into smaller counterparts, the worker adds them to its own queue and hopes that its peers come to help.

The whole stealing idea is based on the fact that each worker's queue is a deque with two distinct ends. A worker pushes newly forked subtasks onto one end of its own deque and, during normal execution, pops its work from that same end, so it processes the most recently created tasks first. When a worker finds no work in its personal queue, it goes and steals a task from the opposite end of someone else's queue, taking the oldest task there. Because owners and thieves operate on different ends, locking can be omitted on most queue operations, which enhances performance.
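To visualise which end each party touches, here is a deliberately simplified, single-threaded model of that deque discipline using a plain ArrayDeque (the real queues are lock-free arrays manipulated via Unsafe, so treat this purely as a mental model):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Mental model of a worker's deque: the owner pushes and pops at one end
// (LIFO), while a thief steals the oldest task from the opposite end (FIFO).
public class StealModel {
    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        // the owner forks three subtasks, pushing each onto its own deque
        queue.addLast("subtask-1");
        queue.addLast("subtask-2");
        queue.addLast("subtask-3");
        // the owner works on the newest task, from the same end it pushes to
        String own = queue.pollLast();     // takes "subtask-3"
        // an idle worker steals from the other end, taking the oldest task
        String stolen = queue.pollFirst(); // takes "subtask-1"
        System.out.println("owner runs " + own + ", thief runs " + stolen);
    }
}
```

Since the two ends are touched by different parties, they rarely contend, which is why most operations can avoid locks.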

Another trick employed to make ForkJoin pools work faster: when a worker steals a task, it leaves a hint about which queue it took the work from, so the original owner of the work can find that out and help this particular worker, moving the progress of the parent task along faster.

All in all, it is an extremely complicated machine that took a lot of scientific background to make work. The properties and performance of this system are also very implementation-specific, so I doubt it will change dramatically without major refactoring.

What Was Wrong With ForkJoin in Java 7?

The ForkJoin framework was introduced in Java 7, and it actually worked well there. That didn't stop progress, however, and within the concurrency updates in Java 8 ForkJoin was improved further. From this Java Enhancement Proposal we can find the exact Java 8 performance improvements:

Added functionality and improved performance for ForkJoinPools that allow them to be used more effectively in the increasingly wider range of applications that users desire. New features include support for completion-based designs that are often most appropriate for IO-bound usages, among others.

Another source of knowledge is, of course, the authors of the improvements themselves; for example, a somewhat older update from Doug Lea:

Substantially better throughput when lots of clients submit lots of tasks. The idea is to treat external submitters in a similar way as workers -- using randomized queuing and stealing. This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.

However it is not that easy to find out what exactly has been changed and which scenarios are affected. Instead, let’s take another approach to the problem. I’ll try to create a benchmark that simulates a very simple ForkJoin computation and measures the overhead of handling ForkJoin tasks vs. computing the values sequentially in a single thread. Hopefully it will give us a starting point to speculate about improvements.

Java 8 vs. Java 7 Performance Comparison

So I've set up a small benchmark to figure out whether the difference between Java 7 and Java 8 is actually noticeable. Here is the GitHub repo if you want to gaze at the code or try it yourself.

Thanks to a recent effort by Oracle engineers, OpenJDK now contains the Java Microbenchmark Harness (JMH) project, which is specifically aimed at creating benchmarks that are less affected by the usual microbenchmarking problems and mistakes.

So let’s put it to use. JMH comes with a Maven archetype for a project, so setting everything up is even easier than one can imagine.

```xml
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>0.4.1</version>
</dependency>
```



At the moment of writing, the latest released version of JMH core is 0.4.1, which includes the @Param annotation for running a benchmark on a series of parameterized inputs. This removes the pain of manually running the same benchmark multiple times and really simplifies obtaining the results.

Now every benchmark run gets its personal ForkJoinPool instance; this also helps us lessen the differences in instantiating the common ForkJoinPool between Java 8 and prior releases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class FJPBenchmark {

    @Param({ "200", "400", "800", "1600" })
    public int N;

    public List<RecursiveTask<Double>> tasks;
    public ForkJoinPool pool = new ForkJoinPool();

    @Setup
    public void init() {
        Random r = new Random();
        r.setSeed(0x32106234567L);
        tasks = new ArrayList<RecursiveTask<Double>>(N * 3);
        for (int i = 0; i < N; i++) {
            tasks.add(new Sin(r.nextDouble()));
            tasks.add(new Cos(r.nextDouble()));
            tasks.add(new Tan(r.nextDouble()));
        }
    }

    @GenerateMicroBenchmark
    public double forkJoinTasks() {
        for (RecursiveTask<Double> task : tasks) {
            pool.submit(task);
        }
        double sum = 0;
        Collections.reverse(tasks);
        for (RecursiveTask<Double> task : tasks) {
            sum += task.join();
        }
        return sum;
    }

    @GenerateMicroBenchmark
    public double computeDirectly() {
        double sum = 0;
        for (RecursiveTask<Double> task : tasks) {
            sum += ((DummyComputableThing) task).dummyCompute();
        }
        return sum;
    }
}
```



Sin, Cos and Tan are subclasses of RecursiveTask. Sin and Cos don't actually recurse into anything but compute Math.sin(input) and Math.cos(input) respectively. Tan tasks recurse into a pair of Sin and Cos and return the result of a division.
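The actual task classes are in the repo linked above; reconstructed from the description (the shape of the DummyComputableThing interface is my guess), they could look roughly like this:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Reconstruction of the benchmark tasks from their description;
// the real versions live in the linked repository.
interface DummyComputableThing {
    double dummyCompute(); // direct, non-ForkJoin computation for the baseline
}

class Sin extends RecursiveTask<Double> implements DummyComputableThing {
    private final double input;
    Sin(double input) { this.input = input; }
    @Override protected Double compute() { return Math.sin(input); }
    @Override public double dummyCompute() { return Math.sin(input); }
}

class Cos extends RecursiveTask<Double> implements DummyComputableThing {
    private final double input;
    Cos(double input) { this.input = input; }
    @Override protected Double compute() { return Math.cos(input); }
    @Override public double dummyCompute() { return Math.cos(input); }
}

class Tan extends RecursiveTask<Double> implements DummyComputableThing {
    private final double input;
    Tan(double input) { this.input = input; }

    @Override
    protected Double compute() {
        // recurse into a pair of subtasks and divide the results
        Sin sin = new Sin(input);
        Cos cos = new Cos(input);
        sin.fork();
        return sin.join() / cos.compute();
    }

    @Override
    public double dummyCompute() {
        return Math.sin(input) / Math.cos(input);
    }
}

public class TrigTaskDemo {
    public static void main(String[] args) {
        // tan(x) = sin(x) / cos(x), computed via a small ForkJoin task tree
        System.out.println(new ForkJoinPool().invoke(new Tan(1.0)));
    }
}
```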

JMH processes the code of the project and generates benchmarks from the methods marked with the @GenerateMicroBenchmark annotation. The other annotations on top of the class specify benchmarking options: the number of warmup iterations, the number of iterations that contribute to the final result, whether we want to fork another JVM process for the benchmark, and the thing we're measuring. Here that is the throughput of the code: how many times these methods can be executed in a period of time.

@Param says that we want to run the benchmark with several input sizes. All in all, JMH is very straightforward, and creating benchmarks doesn't require fiddling with manually handling iterations, timing and crunching the results.

Java 7 vs. Java 8 Performance Test

Running this benchmark with Java 7 and 8 produces the following results. I used versions 1.7.0_40 and 1.8.0 respectively.

```
[email protected] ~/repo/blogposts/fork-join-blocking-perf » java -version
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

Benchmark                    (N)   Mode  Samples    Mean  Mean error  Units
o.s.FJPB.computeDirectly     200  thrpt       20  27.890       0.306  ops/ms
o.s.FJPB.computeDirectly     400  thrpt       20  14.046       0.072  ops/ms
o.s.FJPB.computeDirectly     800  thrpt       20   6.982       0.043  ops/ms
o.s.FJPB.computeDirectly    1600  thrpt       20   3.481       0.122  ops/ms
o.s.FJPB.forkJoinTasks       200  thrpt       20  11.530       0.121  ops/ms
o.s.FJPB.forkJoinTasks       400  thrpt       20   5.936       0.126  ops/ms
o.s.FJPB.forkJoinTasks       800  thrpt       20   2.931       0.027  ops/ms
o.s.FJPB.forkJoinTasks      1600  thrpt       20   1.466       0.012  ops/ms

[email protected] ~/repo/blogposts/fork-join-blocking-perf » java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

Benchmark                    (N)   Mode  Samples    Mean  Mean error  Units
o.s.FJPB.computeDirectly     200  thrpt       20  27.680       2.050  ops/ms
o.s.FJPB.computeDirectly     400  thrpt       20  13.690       0.994  ops/ms
o.s.FJPB.computeDirectly     800  thrpt       20   6.783       0.548  ops/ms
o.s.FJPB.computeDirectly    1600  thrpt       20   3.364       0.304  ops/ms
o.s.FJPB.forkJoinTasks       200  thrpt       20  15.868       0.291  ops/ms
o.s.FJPB.forkJoinTasks       400  thrpt       20   8.060       0.222  ops/ms
o.s.FJPB.forkJoinTasks       800  thrpt       20   4.006       0.024  ops/ms
o.s.FJPB.forkJoinTasks      1600  thrpt       20   1.968       0.043  ops/ms
```



The chart below puts the results, for your convenience, into graphical form.

Java 7 vs Java 8 Performance Test Results

One can see that the baseline results, which show the throughput of running the math directly in a single thread, do not differ between JDK 7 and 8. However, when we include the overhead of managing recursive tasks and going through a ForkJoin execution, Java 8 is much faster. The numbers for this simple benchmark suggest that handling ForkJoin tasks is around 35% faster in the latest release.
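As a quick sanity check on that figure, we can compute the Java 8 to Java 7 throughput ratios straight from the forkJoinTasks rows of the table above:

```java
// Throughput means for forkJoinTasks, taken from the benchmark table above;
// the per-N speedups all land in the mid-thirties of percent.
public class SpeedupCheck {
    public static void main(String[] args) {
        double[] java7 = {11.530, 5.936, 2.931, 1.466}; // ops/ms, N = 200..1600
        double[] java8 = {15.868, 8.060, 4.006, 1.968};
        for (int i = 0; i < java7.length; i++) {
            double gain = (java8[i] / java7[i] - 1) * 100;
            System.out.printf("N=%d: %.1f%% faster%n", 200 << i, gain);
        }
    }
}
```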

The difference between the baseline and the FJ computation arises because we deliberately made our recursive tasks extremely thin: they do nothing more than a single method call to the highly optimised Math class. Because of that, the direct application of Math is much faster. Beefier tasks would change the picture, but they would also dilute the overhead of ForkJoin management, which is what we wanted to measure in the first place. In general, though, you do want recursive tasks that do more than a single method call.

Also, there's a slight difference between the Java 7 and 8 results on the baseline tests. The difference is negligible and most probably isn't caused by any changes in the Math implementation between Java 7 and 8. It is probably a measurement artifact, however well JMH tries to balance such things out.


Disclaimer: of course these results are synthetic, and you should take them with a grain of salt. However, in addition to discussing Java performance, I wanted to show how easily JMH allows you to create a benchmark that avoids some of the common benchmarking issues, like not warming up the JVM. That doesn't save you if the semantics of the benchmark are flawed, but it definitely helps a lot. So if you see a flaw in the logic of the code, please mention it to me. I like getting smarter.

Learn More About ForkJoin

First of all, I want to say that the source code for the ForkJoinPool, ForkJoinPool.WorkQueue and ForkJoinTask classes is not exactly the easiest read; it features quite a lot of Unsafe mechanics and probably won't answer your questions about the ForkJoin framework in 15 minutes.

However, these classes are well documented and provide a lot of information via internal comments. It is also probably one of the most interesting places in the JDK to dig into, when you have free time.

Another relevant finding is that the performance of ForkJoin pools is better in Java 8, at least in some use cases. And while I cannot precisely describe the reason behind that, if I were map/reducing something with ForkJoin, I'd take my chances and upgrade.

If you see a flaw in logic or can offer another explanation of the results, or test to run, don’t hesitate to leave a comment below or message me @shelajev.

Additional Resources

Want to learn more about the latest features in Java? Be sure to visit our resource collection, Exploring New Features in Java. It's packed with great information about Java, OpenJDK, JEPs, and the big changes that shape Java as a language.

Looking for additional Java 8 resources? Check out our Java 8 Best Practices Cheat Sheet. It's chock-full of the go-to commands and best practices for those working in Java 8.
