Executive summary

The implementation of the concurrency primitive LockSupport.parkNanos(), the function that controls *every* concurrency primitive on the JVM, is flawed, and any NTP sync, or system time change backwards, can potentially break it with unexpected results across the board when running a 64bit JVM on Linux 64bit

What we need to do?

This is an old issue, and the bug was declared private. I somehow managed to have the bug reopened to the public, but it’s still a P4, that means that probably won’t be fixed. I think we need to push for a resolution ASAP, be sure that’s in for JDK9, make all the possible effort to make this fix for JDK8 or, at least, to include it in a later patch release. In an ideal world it would be nice to have a patch for JDK7

Why all this urgency?

If a system time change happens then all the threads parked will hang, with unpredictable/corrupted/useless results to the end user. Same applies to Future, Queue, Executor, and any other construct that it’s somehow related to concurrency. This is a big issue for us and for any near time application: please think about trading and betting, where the JVM is largely used. And please do not restrain yourself to the Java language: add Scala and any other JVM-based language to the picture.

All the details (spoiler: tech stuff!)

To be more clear about the issue, the extent of it and the concurrency library, let me introduce this very simple program:

import java.util.concurrent.locks.LockSupport; public class Main { public static void main(String[] args) { for (int i=100; i>0; i--) { System.out.println(i); LockSupport.parkNanos(1000L*1000L*1000L); } System.out.println("Done!"); } }

Run it with a 64bit 1.6+ JVM on 64bit Linux, turn the clock down one hour and wait until the counter stops… magic! I tested this on JDK6, JDK7 and latest JDK8 beta running on various Ubuntu distros. It’s not just a matter of (old?) sleep() and wait() primitives, this issue it affects the whole concurrency library.

To prove that this is fixable, I reimplemented the program above above substituting LockSupport.parkNanos() with a JNI call toclock_nanosleep(CLOCK_MONOTONIC…): works like a charm 😦

This is due to the fact that the CPP code is calling the pthread_cond_timedwait() using its default clock (CLOCK_REALTIME) which, unfortunately is affected by settime()/settimeofday() calls (on Linux): for that reason it cannot be used to measure nanoseconds delays, which is what the specification requires. CLOCK_REALTIME is not guaranteed to monotonically count as this is the actual “system time”: each time my system syncs time using a NTP server on the net, the time might jump forward or backward. The correct call (again on Linux) would require to use CLOCK_MONOTONIC as clock id, which are defined by POSIX specs since 2002. (or better CLOCK_MONOTONIC_RAW)

The POSIX spec is infact clear, as it states “…setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock…”: it definitely states “relative”. Having a look at the hotspot code, it appears that the park() is using compute_abstime() (which uses timeofday) and then waits on an absolute period: for that reason it’s influenced by the system clock change. Very wrong.

Next steps?

I am trying to raise the awareness of this issue, basically involving as much people as I can. I will continue to do that and escalate as soon as I get some solid response from Oracle. I am also working on a patch myself.

See also

The full saga, all the articles I published on the matter: