217 Parallel Parking

Author: Dr. Heinz M. Kabutz Date: 2014-02-07 Java Version: 8 Category: Performance

Abstract: Which is faster: synchronized or atomic compare-and-set? In this newsletter we look at some surprising results when we run Java on the new i7 chip.

Welcome to the 217th issue of The Java(tm) Specialists' Newsletter. In my last newsletter, I mentioned we'd be heading off in search of snow on the Island of Crete. It took a bit of effort, but we were successful! Have a look at a short clip of our adventure (make sure to watch in HD).

From 25-29 August 2014, we are running the 4th Java Specialists' Symposium aka JCrete 2014. [Note that the venue has changed to the Kolymbari area of Chania - more information coming soon]. Here are some comments from previous attendees:

"In conferences with huge numbers of attendees you'll get great canned talks and lots of merchandise from the sponsors. You might even get a good meal at the hotel restaurant. If that's what you like don't go to JCrete. In JCrete it's extremely hot and sunny, you only get 3 talks a day and wifi is poor. You talk with other attendees, get to know them while having a salad at a small taverna by the beach (sometimes a taverna _in_ the beach). You get to meet the great people like I met Maurice Naftalin, Heinz Kabutz or Martin Thompson while drinking home-made Raki. You get to spend all-nighters discussing last Vimeo hit by Bret Victor with people worried about our future as developer/computer scientists. That and great small-group sessions on garbage collecting, language abusing, concurrency, java byte code and many other subjects is what you get at JCrete. But if you are the type that enjoys the Martinis at the Oracle Evening Party at QCon then JCrete is not for you." - Ignasi Marimon-Clos i Sunyol (Spain)

And another one by a regular attendee:

"JCrete is the best combination of multiple worlds. In the morning you are engaged in interesting topics with highly intelligent professionals, often learning new stuff or getting interesting views on existing technology. In the afternoon you spend time with your family or fellow geeks, snorkeling, hiking or just relaxing. Later you eventually gather with the rest of the hard-cores in the foo-bar discussing java, politics, railways or whatever. And at all times you are supported by beautiful surroundings, Greek hospitality and great local food!" - Jesper Udby (Denmark)

Oh and last, but not least, a short video to demonstrate the types of things we do at our Unconference. It is very different to other more traditional conferences. We might only have three sessions per day, but the discussions carry on whilst we are swimming in the sea or enjoying a delicious meal. There are no conference fees, but you need to cover your own travel expenses (flight, food, hotel, entertainment).


Parallel Parking

After waiting far too long, I finally upgraded to a quad-core i7. Proud to show off my shiny new hardware and hack programming skills to some Concurrency Specialist Course students, I fired it up and waited for them to be dazzled by the awesome speed. But let me go back a few steps. We ended day one of the concurrency course with various versions of NumberRange. The idea was inspired by Brian Goetz's excellent book Java Concurrency in Practice. We start with an interface:

package eu.javaspecialists.microbenchmarks.numberrange;

public interface NumberRange {
  boolean setLower(int newLower);
  boolean setUpper(int newUpper);
  boolean isInRange(int number);
  boolean checkCorrectRange();
}

In JCiP, Brian explains that this version of NumberRange is broken, because even though the lower and upper objects are in themselves threadsafe, we have an invariant across both fields that is not protected through some form of synchronization:

package eu.javaspecialists.microbenchmarks.numberrange;

import java.util.concurrent.atomic.*;

// does not work correctly - we could get upper < lower!
public class NumberRangeAtomicBroken implements NumberRange {
  private final AtomicInteger lower = new AtomicInteger(0);
  private final AtomicInteger upper = new AtomicInteger(0);

  public boolean setLower(int newLower) {
    if (newLower > upper.get()) return false;
    lower.set(newLower);
    return true;
  }

  public boolean setUpper(int newUpper) {
    if (newUpper < lower.get()) return false;
    upper.set(newUpper);
    return true;
  }

  public boolean isInRange(int number) {
    return number >= lower.get() && number <= upper.get();
  }

  public boolean checkCorrectRange() {
    int lower = this.lower.get();
    int upper = this.upper.get();
    return lower <= upper;
  }
}

It is easy to reason about why this does not work, but here is a unit test that demonstrates the problem (if you still have questions, read the book :-)):

private void testConcurrently(NumberRange range)
    throws InterruptedException {
  range.setLower(0);
  range.setUpper(200);
  ExecutorService pool = Executors.newCachedThreadPool();
  pool.submit(() -> {
    for (int i = 0; i < 100_000_000; i++) {
      range.setLower(100);
      range.setLower(0);
    }
  });
  pool.submit(() -> {
    for (int i = 0; i < 100_000_000; i++) {
      range.setUpper(40);
      range.setUpper(200);
    }
  });
  Future<Boolean> checker = pool.submit(() -> {
    for (int i = 0; i < 100_000_000; i++) {
      if (!range.checkCorrectRange()) return false;
    }
    return true;
  });
  try {
    assertTrue(checker.get());
  } catch (ExecutionException e) {
    fail(e.getCause().toString());
  }
  pool.shutdown();
}

There is a very easy fix and that is to use synchronized instead of AtomicInteger fields. For example:

package eu.javaspecialists.microbenchmarks.numberrange;

public class NumberRangeSynchronized implements NumberRange {
  // INVARIANT: lower <= upper
  private int lower = 0;
  private int upper = 0;

  public synchronized boolean setLower(int newLower) {
    if (newLower > upper) return false;
    lower = newLower;
    return true;
  }

  public synchronized boolean setUpper(int newUpper) {
    if (newUpper < lower) return false;
    upper = newUpper;
    return true;
  }

  public synchronized boolean isInRange(int number) {
    return number >= lower && number <= upper;
  }

  public synchronized boolean checkCorrectRange() {
    return lower <= upper;
  }
}

However, this means that we have synchronized methods and so the threads will only be able to call these one at a time.

I then proceeded to show another much cleverer solution that used long and bit masking, together with a compare-and-swap (CAS) to set upper and lower in a single operation. Here it is:

package eu.javaspecialists.microbenchmarks.numberrange;

import java.util.concurrent.atomic.*;

public class NumberRangeAtomic implements NumberRange {
  private final AtomicLong range = new AtomicLong(0);

  public boolean setLower(int newLower) {
    while (true) {
      long current = range.get();
      int upper = getUpper(current);
      if (newLower > upper) return false;
      long next = combine(newLower, upper);
      if (range.compareAndSet(current, next)) return true;
    }
  }

  public boolean setUpper(int newUpper) {
    while (true) {
      long current = range.get();
      int lower = getLower(current);
      if (newUpper < lower) return false;
      long next = combine(lower, newUpper);
      if (range.compareAndSet(current, next)) return true;
    }
  }

  public boolean isInRange(int number) {
    long current = range.get();
    int upper = getUpper(current);
    int lower = getLower(current);
    return number >= lower && number <= upper;
  }

  private long combine(long lower, long upper) {
    // mask upper to 32 bits, otherwise a negative upper would
    // sign-extend and clobber the lower half
    return (upper & 0xFFFF_FFFFL) | (lower << 32);
  }

  private int getLower(long range) {
    return (int) (range >>> 32);
  }

  private int getUpper(long range) {
    return (int) (range);
  }

  public boolean checkCorrectRange() {
    long range = this.range.get();
    int lower = getLower(range);
    int upper = getUpper(range);
    return lower <= upper;
  }
}
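To see the packing trick in isolation, here is a minimal standalone sketch (the class and method names are mine, not from the course code) that round-trips two ints through a single long. Note the mask on upper: without it, a negative upper would sign-extend to 64 bits and clobber the lower half.

```java
public class PackDemo {
    // lower lives in the high 32 bits, upper in the low 32 bits
    static long combine(int lower, int upper) {
        return (upper & 0xFFFF_FFFFL) | ((long) lower << 32);
    }

    static int getLower(long range) {
        return (int) (range >>> 32);
    }

    static int getUpper(long range) {
        return (int) range;
    }

    public static void main(String... args) {
        long packed = combine(10, 200);
        System.out.println(getLower(packed)); // prints 10
        System.out.println(getUpper(packed)); // prints 200
    }
}
```

Because both halves travel together in one long, a single compareAndSet on that long updates the pair atomically, which is the whole point of the exercise.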

We finished for the day and said our virtual goodbyes. My students were in Brazil(1), Luxembourg(1), USA(2), England(1), Austria(1), Romania(1), Sweden(1) and I was at home in Greece(1). So for some of them, the day had just begun and for others, it was beer time. These virtual classes are fun and we all learn a lot. We keep numbers small so that everyone gets a chance to ask their questions. Have a look at our Concurrency, Master and Patterns Courses.

Each morning, I like to start by giving my students an opportunity to ask questions that they had come up with in the night. Our brain needs a while to reset itself and so taking a complete break often brings doubts to the surface that they did not realize they had. Sometimes this Q&A session goes on for over an hour as we go freestyle off the script and answer the questions that students have.

In our particular Q&A session, James S. asked what the point of that bit shifting code was. It seemed so much more complicated to him and he could not justify writing code that had such a high chance of bugs. I explained that this style of programming would typically be reserved for identified bottlenecks. It would be much faster though, I confidently asserted. To prove it, I whipped up a quick microbenchmark. Now I've written hundreds of microbenchmarks and they usually give me the results I want pretty quickly. However, in this case, it showed me that the NumberRangeSynchronized class was actually several times faster than the far more complicated NumberRangeAtomic. I had seen similar results on other fast i7 laptops, but since I had only had my machine for a couple of days, I had not been able to investigate exactly why. I muttered something about "cache synchronization" and "data moving between cores" and "CAS being a known weakness", but was not satisfied with my answer. I also told them that there was something funny going on and that I would investigate further.

Instead of showing you the slapped-together benchmark that I coded live in class, I'd rather show you how to do it with OpenJDK's excellent Java Microbenchmark Harness (JMH). JMH takes care of all the benchmarking chores, such as statistics, starting and stopping threads, forking the processes, etc. One thing I really like is that JMH typically measures the amount of work done within a time interval, rather than measuring how long a certain number of steps takes. This is also my preferred way of measuring code. Of course, even JMH can sometimes be subject to weird HotSpot optimizations, so please check whatever results you get and see whether they make sense. If unsure, inspect the generated assembler code.

package eu.javaspecialists.microbenchmarks.numberrange;

import org.openjdk.jmh.annotations.*;
import static java.util.concurrent.TimeUnit.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(NANOSECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = SECONDS)
@Fork(5)
@State(Scope.Benchmark)
@Threads(8)
public class Microbenchmark1 {
  private final NumberRangeSynchronized nrSync =
      new NumberRangeSynchronized();
  private final NumberRangeAtomic nrAtomic =
      new NumberRangeAtomic();

  @GenerateMicroBenchmark
  public void test_1_1_synchronized() {
    nrSync.setUpper(100);
    nrSync.setLower(10);
    nrSync.setLower(50);
    nrSync.setUpper(200);
    nrSync.setLower(30);
  }

  @GenerateMicroBenchmark
  public void test_1_2_atomic() {
    nrAtomic.setUpper(100);
    nrAtomic.setLower(10);
    nrAtomic.setLower(50);
    nrAtomic.setUpper(200);
    nrAtomic.setLower(30);
  }
}

I will leave getting this to run on your machine as an exercise for the reader. It's really simple though - just follow the instructions on the JMH website. JMH could not have made it easier and I am embarrassed that I have so far been too lazy to bother learning it. Anyway, here are the results:

Benchmark                                  Mean  Mean error  Units
Microbenchmark1.test_1_1_synchronized  1552.081      15.408  ns/op
Microbenchmark1.test_1_2_atomic        3855.470      77.010  ns/op

As we can see in this run, our fancy NumberRangeAtomic was more than twice as slow as the NumberRangeSynchronized! All that effort for nothing!

However, the answer came in the form of Jack Shirazi's Java Performance Tuning Newsletter. In his latest bulletin, he explains that locking can outperform compare-and-swap (CAS), especially if there are a lot of retries. The trick is, on a CAS failure, to let the losing thread sleep a little in order to give the other threads a chance to get through more often. Even a constant sleep time can improve the situation.

After trying different settings, I settled on a figure of 0.18 ms as working well. But this could be completely different on your machine so it would be best to make this figure tunable or maybe auto-adjust at runtime. Thus we only need to add LockSupport.parkNanos(180_000) after the compareAndSet method has returned false, like so:

public boolean setLower(int newLower) {
  while (true) {
    long current = range.get();
    int upper = getUpper(current);
    if (newLower > upper) return false;
    long next = combine(newLower, upper);
    if (range.compareAndSet(current, next)) return true;
    // we lost the CAS race - now we sleep a constant time
    LockSupport.parkNanos(180_000);
  }
}

public boolean setUpper(int newUpper) {
  while (true) {
    long current = range.get();
    int lower = getLower(current);
    if (newUpper < lower) return false;
    long next = combine(lower, newUpper);
    if (range.compareAndSet(current, next)) return true;
    // we lost the CAS race - now we sleep a constant time
    LockSupport.parkNanos(180_000);
  }
}

Let's run another benchmark, this time with the new NumberRangeAtomicWithPark class that parks after the failure:

package eu.javaspecialists.microbenchmarks.numberrange;

import org.openjdk.jmh.annotations.*;
import static java.util.concurrent.TimeUnit.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(NANOSECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = SECONDS)
@Fork(5)
@State(Scope.Benchmark)
@Threads(8)
public class Microbenchmark2 {
  private final NumberRangeAtomicWithPark nrAtomicPark =
      new NumberRangeAtomicWithPark();

  @GenerateMicroBenchmark
  public void test_2_1_atomic_park() {
    nrAtomicPark.setUpper(100);
    nrAtomicPark.setLower(10);
    nrAtomicPark.setLower(50);
    nrAtomicPark.setUpper(200);
    nrAtomicPark.setLower(30);
  }
}

Here are the results with the parking atomic included:

Benchmark                                  Mean  Mean error  Units
Microbenchmark1.test_1_1_synchronized  1552.081      15.408  ns/op
Microbenchmark1.test_1_2_atomic        3855.470      77.010  ns/op
Microbenchmark2.test_2_1_atomic_park    430.649       3.125  ns/op

We see how the performance has increased almost 9x, simply by parking when we have a CAS failure due to parallel threads trying to write at the same time. So sleeping can actually make your program run faster?

We also see from those results that the NumberRangeAtomicWithPark class is now almost 4x faster than the NumberRangeSynchronized. Since I have 4 cores, these results are believable. The CAS code should be faster than the pessimistic, exclusive locking code.

A word of warning though. The 180 microseconds seemed to give good results on my machine, but may be quite bad on yours. Making it configurable would allow you to fine tune it for your target machine.
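As one sketch of making it configurable (the property name numberrange.park.nanos and the Backoff class are my invention, not from the original code), you could read the park time from a system property once at class-load time and call a shared helper instead of hard-coding the constant:

```java
import java.util.concurrent.locks.LockSupport;

public class Backoff {
    // tunable via -Dnumberrange.park.nanos=...; defaults to 180 microseconds
    static final long PARK_NANOS =
        Long.getLong("numberrange.park.nanos", 180_000L);

    // call this after a failed CAS instead of a hard-coded parkNanos()
    static void afterCasFailure() {
        LockSupport.parkNanos(PARK_NANOS);
    }

    public static void main(String... args) {
        System.out.println(PARK_NANOS); // prints 180000 unless overridden
    }
}
```

That way you can tune the figure per machine from the command line without recompiling; auto-adjusting it at runtime would take more work, such as measuring the CAS failure rate and adapting the sleep accordingly.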

Please note that I've also published a follow-up to this article, which shows what happens when we do some CPU work in between calls to the update methods.

Kind regards from Crete

Heinz

We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)

