A Java™ Fork-Join Calamity

This article explains why (5500 words) [yes, it's getting loooong]

Included in the new Java™ SE 7 release is a so-called lightweight Fork-Join framework. When you take a careful, comprehensive look at what this framework does and how it does it, you will see that the framework is an inadequate academic experiment underpinning a research paper, not a lightweight framework.

Edward Harned (eh at coopsoft dot com)

Senior Developer, Cooperative Software Systems, Inc.

January, 2011 [updated July, 2016]

What is the Lightweight Fork-Join framework?

The lightweight Fork/Join (F/J) framework is a structure that supports parallel programming where problems are recursively split into smaller parts, solved in parallel and recombined, as described in this pseudo code from A Java Fork/Join Framework:



Result solve(Problem problem) {
    if (problem is small)
        directly solve problem
    else {
        split problem into independent parts
        fork new subtasks to solve each part
        join all subtasks
        compose result from subresults
    }
}
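In JDK terms, the pseudo code above maps onto RecursiveTask roughly as follows. This is an illustrative, runnable sketch; the class name, threshold, and the choice of array summation as the problem are this article's illustrations, not part of the original paper:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal, runnable sketch of the pseudo code: sum a long array
// by recursive decomposition. Names and threshold are illustrative.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // "problem is small"
    private final long[] array;
    private final int low, high;

    public SumTask(long[] array, int low, int high) {
        this.array = array; this.low = low; this.high = high;
    }

    @Override
    protected Long compute() {
        if (high - low <= THRESHOLD) {          // directly solve problem
            long sum = 0;
            for (int i = low; i < high; i++) sum += array[i];
            return sum;
        }
        int mid = (low + high) >>> 1;           // split into independent parts
        SumTask left  = new SumTask(array, low, mid);
        SumTask right = new SumTask(array, mid, high);
        left.fork();                            // fork new subtask
        long rightAns = right.compute();
        long leftAns  = left.join();            // join subtask
        return leftAns + rightAns;              // compose result from subresults
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long result = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(result); // 0 + 1 + ... + 99999 = 4999950000
    }
}
```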

If only the real solution were this simple. In order to make this work, the framework developer would have to build a parallel engine inside the Java™ Virtual Machine by ripping apart the current JDK source code, compilers, HotSpot, various runtimes, and implementations — not a likely scenario. Hence, the only practical solution is to build an application server.

If only the lightweight F/J Framework were a blue-ribbon application server. Regrettably, the framework is not a professional application server. It is not even an API. The framework is the academic experiment for the Java Fork/Join Framework research paper. It is inadequate, not lightweight.

Part two of this series added in January, 2013, deals with the problems of making this framework the default parallel engine for bulk operations in JDK1.8.

These articles are getting too lengthy for comfortable reading since they contain the details to support each point. Here is a downloadable consolidation in PDF format that uses the articles' points as references. Much easier to digest.

Part three of this series added in January, 2014, deals with the failure to improve the performance of parallel operations.

The F/J framework simply does not belong inside the JDK. The remainder of this article is a critique only as it applies to the framework being part of Standard Edition Java™.

Critique

There are major impediments to this framework being part of the JDK. The F/J framework is:

A Faulty Task Manager

Inefficient

Special Purpose

Slow and Unscalable

Exceedingly Complex

Lacking Industry Professional Attributes

An Academic Exercise

Inadequate in Scope

Not an API

Conclusion

A Faulty Task Manager

The Fork-Join technique splits the work into fragments and joins the results together. A subordinate practice that allows an intermediate join() for each fork() can only work successfully in a controlled environment. A fatal flaw with the F/J Framework is that it is trying to manage an intermediate join() in a task outside of a controlled environment. Only the Operating System (O/S) or a pseudo O/S can manage an intermediate join() in a task. (Whether they're called tasks or processes, it is all the same thing.)

Briefly: When a task requires a resource or service (memory, a file handle, a join()) it goes to the O/S, the ultimate controlled environment. If the request will put the running task in a waiting state, the O/S switches the task context to the suspended queue and switches in a new task context from the active queue for the processor to execute.

An Application Task Manager (pseudo O/S) like Cilk or jCilk can only work when the application goes through it (using a compiler and runtime that create a controlled environment) to request resources or services like join(). That way the manager knows to switch out/in on pseudo processors it controls. Basic application frameworks simply have no way of knowing the application execution environment; therefore, there is no way on this great, green earth an application framework can do a context switch to support an intermediate join().

The F/J Framework's JDK1.7 answer to an intermediate join() without a context switch is "continuation threads." That is, the framework creates additional, temporary threads to continue fetching application tasks from the deques and submission queue while the joining thread enters a wait state. That can result in a huge number of threads for each request. (StackOverflow has one report of over 700 continuation threads needed to complete one call.) An unmitigated failure.
The F/J Framework’s JDK1.8 answer to an intermediate join() without a context switch is “continuation fetching.” That is, the framework marks the joining task as logically waiting and the thread continues fetching and executing tasks from the deque. That can result in a stall (a thread in a wait state) when the recursion becomes deep. The proof is in the profiler. (You can download the source code for the MultiRecurSubmit.java demo below.) Additionally, there is the problem of errors/exceptions in subsequently fetched tasks. Since the stack holds the entire flow, can the F/J Framework handle stack-overflow problems?

Can the F/J Framework find the stack entries for the first joining task, or the second, or the third for error recovery?

Can the F/J Framework back out and recover other associated tasks in other deques? The answer is no to all of the above. Let's not forget the asynchronous use of outside resources by tasks. Often a resource will associate a request with the thread that made the request:

Thread caller = Thread.currentThread();

When the first task does a fork-join while the asynchronous call proceeds, the same thread will execute the forked task. If that new task fails, what happens to the asynchronous request to the outside resource since they're both associated with the same thread?

What happens to handles, and registers, and memory since there is no clean break (context switch) between tasks? The answer is uncertainty to all of the above. The F/J Framework’s current handling of tasks for JDK1.8 is an unmitigated failure. Soon to come is the Lambda-fication of the framework, which should prove interesting, but unsuccessful.

The intermediate join() isn't the only source of excessive thread creation with this framework.

Introduced with JDK1.7 is the Phaser Class. To quote the JavaDoc: “Phasers may be tiered (i.e., constructed in tree structures) to reduce contention. Phasers with large numbers of parties that would otherwise experience heavy synchronization contention costs may instead be set up so that groups of sub-phasers share a common parent.” Not mentioned in the JavaDoc is that when using a large number of parties the framework creates “compensation threads” to continue fetching application tasks from the deques and submission queue while each original thread waits for sub-phasers to arrive. These “compensation threads” can be so excessive they make the “continuation threads” problem, above, seem meek. See for yourself. (You can download the source code for the TieredPhaser.java demo below.)

Introduced with JDK1.8 is the CompletableFuture Class. To quote the JavaDoc: “A Future that may be explicitly completed (setting its value and status), and may include dependent functions and actions that trigger upon its completion.” Not mentioned in the JavaDoc is that when using a large number of dependent functions with a get() method, the framework creates “compensation threads” to continue fetching application tasks from the deques and submission queue. Once again, these “compensation threads” can be so excessive they make the “continuation threads” problem, above, seem meek. See for yourself. (You can download the source code for the MultiCompletables.java demo below.)

The problem is tasks needing to wait.
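What the framework does instead of letting a task wait is easy to observe. In this hedged sketch (not one of the downloadable demos; class name and depth are illustrative), a pool with parallelism 1 shows that a worker calling join() does not idle: it keeps fetching and executing forked tasks from its own deque, the "continuation fetching" behavior described above:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch: trace which thread runs each subtask. With one
// worker, the thread that calls join() executes the joined task itself
// rather than waiting, so every depth prints the same worker name.
public class JoinTrace extends RecursiveTask<Integer> {
    private final int depth;
    public JoinTrace(int depth) { this.depth = depth; }

    @Override
    protected Integer compute() {
        System.out.println("depth " + depth + " on " + Thread.currentThread().getName());
        if (depth == 0) return 1;
        JoinTrace child = new JoinTrace(depth - 1);
        child.fork();              // child goes into this worker's deque
        return child.join() + 1;   // worker runs 'child' itself while "waiting"
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(1); // one worker keeps the trace simple
        System.out.println("result = " + pool.invoke(new JoinTrace(5)));
    }
}
```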
When a task needs to wait for any resource, the framework must detach the task from the thread. This framework cannot detach the task from the thread, so it tries other means again and again and again. Patching the F/J Framework repeatedly is like playing whack-a-mole. There always comes a time when no amount of patching will work, since the patches never address the fundamental flaw: any wait without a context switch is an absolute, total, complete failure.

Is there an alternative? Certainly. Using a separate structure to hold intermediate results rather than using an intermediate join() is a simple alternative, and it has commendable benefits:

It doesn’t require the horrendous complexity of Task Management.

It can easily support cyclic dependencies.

It can easily handle synchronous and asynchronous completion processing.

It provides a way to track the request throughout its lifetime to aid in error detection/recovery, the ability to time and cancel requests, and the capability to gather statistics for tuning.
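As a rough illustration of the alternative (this sketch uses only JDK primitives and is not the TymeacDSE implementation), each subtask deposits its partial result into a separate structure and counts down a latch; the caller gathers once, and no task ever joins another:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative scatter-gather sketch: the results array is the
// "separate structure to hold intermediate results"; there is no
// intermediate join() anywhere, only one gather at the end.
public class ScatterGatherSum {
    public static long sum(long[] array, int parts) throws InterruptedException {
        AtomicLongArray results = new AtomicLongArray(parts); // separate result structure
        CountDownLatch done = new CountDownLatch(parts);
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        int chunk = (array.length + parts - 1) / parts;
        for (int p = 0; p < parts; p++) {
            final int low  = p * chunk;
            final int high = Math.min(array.length, low + chunk);
            final int slot = p;
            pool.execute(() -> {                 // scatter: one even slice per task
                long s = 0;
                for (int i = low; i < high; i++) s += array[i];
                results.set(slot, s);            // deposit intermediate result
                done.countDown();
            });
        }
        done.await();                            // gather: wait once, for everything
        pool.shutdown();
        long total = 0;
        for (int p = 0; p < parts; p++) total += results.get(p);
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[1_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(sum(data, 4)); // 0 + 1 + ... + 999 = 499500
    }
}
```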

The faulty F/J Framework has no place within the JDK.

Inefficient

Assume a java.util.concurrent.RecursiveTask that sums a long array in the compute() method:

Sum left = new Sum(array, low, mid);
Sum right = new Sum(array, mid, high);

left.fork();
long rightAns = right.compute();
long leftAns = left.join();

return leftAns + rightAns;

At level 1, we split the array, creating two new Tasks for the next level. We issue a fork() for the left and a compute() for the right. The fork() adds a Task to the same worker thread's deque. The compute() places an item on the stack and continues execution. In the second-level compute(), we do the same as above. … Eventually we get to the bottom, where compute() actually sums the array, returning a value for the last item on the stack. That last item can then issue a join() for the last fork() it did.

What we've accomplished is creating Tasks for subsequent fetching by the forking worker thread, and Tasks that other worker threads may steal. Since all forked Tasks go into the same worker thread's deque, the stealing worker threads will fight each other at the top of the deque over the forked Tasks. Not only is this woefully inefficient, it also involves contention between the stealing worker threads. Contention is exactly what work-stealing is supposed to avoid. But in reality, contention is what the work-stealing algorithm produces. What we really have is a single queue with a pool of threads fighting over elements. The larger the array, the more apparent this becomes.

In A Java Fork/Join Framework, section 4.5 Task Locality: “[the framework] is optimized for the case where worker threads locally consume the vast majority of the tasks they create. … As seen in the figure, in most programs, the relative number of stolen tasks is at most a few percent.” Even the original research paper shows work-stealing is inefficient. This framework can never function efficiently in a high-performance environment.

A hypothetical example: Let’s say there are 10 worker threads, the forking generates 100 Tasks, and it takes 1 millisecond to process the actual computation. Therefore, it would take 100 milliseconds to process the Tasks sequentially.
Using a load-balancing algorithm that evenly splits the work among all the worker threads, the time to complete the work is about 10 milliseconds. Using work-stealing, if one worker thread does 80% of the Tasks and the other worker threads do the rest, then the time to complete the work is about 80 milliseconds.

Then there is the problem of starvation. A few worker threads become saturated with work, leaving the other stealing worker threads starving. For a student doing calculations on a laptop — so what. For an enterprise server doing transaction processing — oops.

Is there a better way? Absolutely. It’s called load balancing with the scatter-gather method. In 2010 the author gave the F/J framework developer a proof-of-concept scatter-gather project. The F/J framework developer ignored it completely. Since then the author has embellished the proof-of-concept into a full-featured, open-source project on SourceForge.net. You can see for yourself. This convoluted, inefficient framework has no business in core Java.
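The arithmetic behind the hypothetical can be checked directly. Ignoring coordination overhead, the completion time (makespan) is simply the busiest worker's load; the numbers below are the article's hypothetical, not measurements:

```java
// Makespan arithmetic for the hypothetical: 100 tasks at 1 ms each
// on 10 workers, under balanced vs. skewed (work-stealing) distribution.
public class Makespan {
    public static void main(String[] args) {
        int tasks = 100, workers = 10, msPerTask = 1;

        // Load balancing: work split evenly across the workers.
        int balanced = (tasks / workers) * msPerTask;      // 10 ms

        // Skewed stealing: one worker ends up with 80% of the tasks,
        // so completion waits on that worker regardless of the others.
        int busiest = (int) (tasks * 0.80) * msPerTask;    // 80 ms
        System.out.println("balanced = " + balanced + " ms, skewed = " + busiest + " ms");
    }
}
```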

Special Purpose

Brian Goetz wrote Stick a Fork In It, introducing the ParallelArray classes and the Fork/Join framework. He recognized the limitations of these classes as being for “aggregate data operations” only. As Brian suggested, the F/J framework is exclusively for number crunching. The F/J framework has none of the attributes of a professional, multi-use application server (see above). That’s why it should not be used for anything other than its original purpose. But there is no way on this earth developers will adhere to unreasonable restrictions. (Some don’t even follow reasonable rules.) Recommended restrictions:

must be plain (between 100 and 10,000 basic computational steps in the compute method),

compute intensive code only,

no blocking,

no I/O,

no synchronization

Do you wonder why > 100 and < 10,000 computational steps?

> 100 has to do with the work-stealing problem. All forked Tasks go into the same deque, making other threads search for work. When the threads encounter contention, they back off and look somewhere else. Since there is no work anywhere else, they try the same deque again, and again, and again until the forking thread finally finishes the work all by itself. You can see the proof by downloading the source code for Class LongSum.java below. Hence, run slow or there will be no parallelism.

< 10,000 has to do with the join() problem. Since the F/J framework cannot do pure Task Management (see A Faulty Task Manager, above) with Tasks actually waiting independently of threads when they call join(), the framework has to create “continuation threads” to avoid a halt. There can only be a limited time before it all falls apart. Hence, run fast or die.

As for the other restrictions: any actual thread blocking, without the framework’s knowledge, would probably stall the entire framework. Since there is no way to view the internals, no way to cancel synchronous requests, and absolutely no way to time any request, the only alternative to deal with a stall is to kill the JVM. A real business-friendly solution. Misuse of the framework has already started:

Java Tip: When to use ForkJoinPool vs ExecutorService - JavaWorld

Soon people will find uses for this framework no one ever imagined. It’s the first law of application software development, which follows the Law of Unintended Consequences. The F/J framework does not belong in core Java.
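For work that does block or perform I/O, the restrictions above push developers toward a plain ExecutorService, where a blocked task merely occupies its own thread instead of destabilizing a work-stealing pool. A minimal hedged sketch (the task bodies are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: blocking work in a fixed thread pool. A sleeping task ties up
// only its own thread; the pool neither spawns compensation threads nor
// requires the no-blocking/no-I/O restrictions of the F/J framework.
public class BlockingWorkPool {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            futures.add(pool.submit((Callable<Integer>) () -> {
                Thread.sleep(10);   // simulated blocking I/O, forbidden in F/J tasks
                return id * id;
            }));
        }
        int total = 0;
        for (Future<Integer> f : futures) total += f.get();
        pool.shutdown();
        System.out.println(total); // 0 + 1 + 4 + ... + 49 = 140
    }
}
```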

Slow and Unscalable

Speed

Speed is relative to other Fork-Join products; in this case, speed compared to the TymeacDSE project. A single summation call puts the TymeacDSE product 3-4 times faster than the F/J Framework. (You can download the source code below.) Since the products cannot be compared directly — the F/J framework has limited scope, TymeacDSE is a full-feature application server — the comparison must include the internal structures.

The F/J Framework uses an inefficient (above) textbook work-stealing paradigm that makes threads go looking for work and unnecessarily splits the work into useless tasks. Let’s say there is an array of 1M elements and the sequential threshold is 32K; then it takes 32 tasks to sum the array in parallel. However, using the code example above, the method generates an extra 31 tasks in six levels just to recursively split the array into two parts and join(). That is almost double the toil for threads that could otherwise be doing useful work. Threads cannot pick up new requests until they completely finish all prior work. (That is, they have to exhaust all the deques of work before they can go to the submission queue.) Since there is no separate structure to hold intermediate results, each split (fork()) must wait (join()) until subsequent splits complete, loosely resembling this:

fork()
join()

TymeacDSE uses a load-balancing, scatter-gather algorithm that feeds work to threads immediately. There is no separate submission queue and there are no split/wait tasks. Thus, the 1M-element array example above takes simply 32 tasks to sum the array in parallel. TymeacDSE uses a separate structure for all calls, both to track the work and to hold intermediate results. Therefore, TymeacDSE allows this:

fork() as many times as necessary to spread the work amongst all processors, returning an intermediate result to the server for each computed value.

When complete, finish the work using all the intermediate results, which may comprise forking the results array.

What this means for applications like sorting and map/reduce is that TymeacDSE can use the full collection of processors to sort or map the data and then use the full set of processors to merge or reduce the intermediate results. Using all the processors without waiting (join()) makes TymeacDSE fast.

Unscalable

Generally, scaling implies that as you add more processors/threads to the equation, you decrease the time to completion. The F/J Framework structure precludes scaling. The entanglement of client call/server processing,

the spare threads necessary to help with the join() waiting problem (above),

as well as the work stealing code (threads need to serially search for work among all deques, threads consume the vast majority of the tasks they create so adding more processors/threads has no effect on parallelism)

mean the framework only works well on a small number of processors. This design can never scale to hundreds or thousands of processors. The overhead and thread starvation would nullify the benefits of additional processors.



TymeacDSE scales to thousands of processors. It’s simply a matter of design. When the architect separates the client caller, the user processing, and the server management then anything is possible. Building an index over the management structure of threads/queues is easy when there are no code entanglements. Speed and scalability are the lynchpins of parallelization. A slow and unscalable application service does not belong within the JDK.

Exceedingly complex

“Make things as simple as possible — but no simpler.” Albert Einstein

“Complicated is not a sign that you are clever. It is a sign that you failed.” Gilad Bracha

“There are two ways of constructing a software design.

One way is to make it so simple that there are obviously no deficiencies.

And the other way is to make it so complicated that there are no obvious deficiencies.” C. A. R. Hoare

The F/J framework classes:

have many, many levels of inheritance,

nested classes on top of nested classes,

instance variables used directly by other classes (known internally as “representation-level coupling among classes”),

code from Hacker's Delight without comments about what it does,

homegrown deques and queues instead of standard Java™ classes,

and so much more.

Programming techniques use bit switches (like assembler code) and direct memory manipulation (like C code). The code looks more like an old C-language program that was segmented into classes than an O-O structure. The framework embeds within the calling client, which also contains the user servicing code. This back and forth, one calling the other that calls the first, is the classic example of spaghetti code. For over forty years, programmers have known to stay away from entangling calls since the code becomes undecipherable.

The framework relies so heavily on the sun.misc.Unsafe class that the F/J framework is not a Java™ program. It is a pseudo-C program. The apparent necessity for writing in the pseudo-C language is to obtain a fair amount of speed by pre-optimizing the code, since work-stealing is so inefficient (above). Naturally, this adds another layer of complexity.

Why does this matter? The larger and more complex the software is, the more that can go wrong, the more fragile it becomes, and the shorter the time between failures.

The code is so horrendously complex most experienced developers will tell you that it is going to fail just for that reason alone.

It is the job of the JIT compilers to optimize code. No matter how smart and clever a developer may be, no individual can predict the runtime environment. Prematurely optimized code usually cannot be JIT optimized.

Unsafe class hackery used for performance is sensitive to processor architecture, memory and other hardware dependencies. The code can perform differently between releases of the same architecture which may affect portability and even security.

Junior Java™ programmers cannot understand the code, much less find workarounds when problems arise — which is what application maintenance is all about. The F/J framework is problematic to understand by anyone other than the original architects. It is doubtful the transient Concurrency Expert Group can maintain this immensely complex code over its lifetime. This extremely complex code does not belong within the JDK.

Lack of Industry Professional Attributes

Application programmers learn early on that:

Programs fail, usually at the worst possible time; consequently, error recovery is paramount

Threads stall, again, usually at the worst possible time

Software needs tuning (balancing of resources)

Administrators sometimes need notification (also called alerting)

Logs are a royal nuisance but necessary

Interfaces into the structure to monitor performance and control functionality are critical

The software should be useful across a wide range of computing needs

A shared copy should be available to conserve resources

The F/J framework has none of these:

There is no error detection/recovery/reporting (other than thrown exceptions)

There is no stall detection and of course, no stall recovery (requires monitoring)

There is no monitoring of the execution environment to ferret out anomalies

There is only minuscule statistics gathering for performance analysis and tuning

There is no alerting or logging (A rudimentary level of these is necessary for debugging and for error reporting. You need some way of knowing what happened.)

Since the internal structure is so entangled with the calling client, there is almost no way to alter control variables

There is no availability for general application usage. It is for compute-intensive tasks only. When used for other than pure, short-lived computations, serious problems with task management arise (see A Faulty Task Manager, above)

As currently written, the framework cannot be a remote object since there are entanglements of caller/service code.

An application service without industry professional attributes has no place in the JDK.

Academic exercise

The work-stealing algorithm using deques is only of academic interest. There are myriad papers devoted to work-stealing on the web, but none of them are for general-purpose application programming. Assuming locality of reference by placing sub-tasks back into the deque from which they came, and assuming that the same thread will process the sub-task, is totally without merit. It may look good on paper, but the Java™ compiler, JIT compiler, and garbage collector rearrange memory. Additionally, Java™ applications sometimes run under two virtual machines (mainframes run Linux® under VM, and Linux® runs the JVM) and there is no guarantee of a mapping of Java™ threads to O/S threads (see JRockit and others.)



The extolled work-stolen count is of theoretical interest only. To be of use in industry, the count must include whether the owner thread was running or blocked, the queue length at the time of the steal, and the time it takes to process a task.



The F/J framework internal method comments cite numerous scientific papers as the basis for their existence. Although the code adheres to the highest academic standards, those credentials are not foremost in building a fast, scalable, developer-friendly application service. Moreover, the “Implementation Overview” in ForkJoinPool describes complexity that is unintelligible to mere mortals, breaking the time-honored dictum that program documentation is for programmers, to help them find workarounds when bugs arise.

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” Martin Fowler

Why is this a concern? The focus is on prior academic research and adherence to scholarly principles, not on wide-area usefulness and maintainability.

The design must follow the textbook rather than being innovative. The work-stealing technique, while scholarly, is for very restricted practice (strict, fully-strict, terminally-strict computations) and requires compiler/runtime support (closed environment) to work well. There are no alternatives to work-stealing such as scatter-gather since alternatives would not conform to the curriculum.

The true audience is academia rather than the general programming community since the F/J framework is a modification of the original academic experiment. The F/J framework is only an academic venture and does not belong within the JDK.

Inadequate in Scope

The over-utilization of the Unsafe class precludes ever being able to downgrade the F/J framework to run on JavaME. The F/J framework cannot run in a significant portion of the JavaEE environment. EJB and Servlet applications usually call an RMI server when using threading frameworks. However, the F/J framework cannot function with RMI, IIOP, POA, or other networking technologies. The entanglement of caller/server code forestalls the F/J framework from running as a server.

The F/J framework is only designed to work in a trivial segment of the computing world:

on JavaSE only,

using humongous data structures,

on high-performance work-stations (16+ CPU's without hyper-threading)

doing compute-only work,

with no task I/O, no inter-task concurrency, and no networking,

and only for aggregate operations on collections of elements that have a regular structure. That is, you must be able to express things in terms of apply, reduce, filter, map, cumulate, sort, uniquify, paired mappings, and so on — no general-purpose application programming here.

The F/J framework is only designed to work for one request at a time.

A hypothetical example:

Let’s say there are multiple F/J worker threads and

there are many user threads submitting requests: A worker thread picks up the first request and eventually the other worker threads pick up the forked tasks from the first request.

Other submitted requests wait in turn for a worker thread to empty its own deque, find no work in other worker deques, find a waiting request in the submission queue, and start forking.

Other threads may do the same. That is, they pick up one request at a time, finish that request, and help finish other requests before looking for a new request.

As more and more requests fill up the submission queue, they back up. This may result in a significant backlog since there is no way to time any request, no way to cancel synchronous requests, and no way to move a hot request to the head of the queue. (Work-stealing as implemented with FIFO/LIFO deques is not conducive to inserting requests directly into a worker deque.)

Therefore, for multiple submissions there is little, if any, benefit to the F/J framework. Using one thread per request (like a thread pool) without forking is often more efficient for throughput. Don’t believe it? Try it yourself. (You can download the source code for a F/J vs. Thread Pool demo below.) This lack of scope only qualifies the F/J framework for a seat in a classroom, not as the core of Java™ parallel computing.

Not an API

The F/J Framework is part of the SE7 API Specification, but the package is deficient in the “characteristics of a good API” (Joshua Bloch). The package is actually an independent entity masquerading as an API. None of the base classes in the F/J framework package can live alone, nor do the base classes have any usage outside the framework.

The main pool class extends AbstractExecutorService, which makes the framework an application server — an independent entity. Other classes in the JDK extend the Executor Service (ThreadPoolExecutor, for one), but they cannot live alone. Those classes are components in user-built application services; they are not a service in themselves. The F/J framework is a stand-alone entity. Most base support methods are final, to prevent overriding. The base package classes are a server unto themselves; they are not individual APIs for programmers to extend and build another service. A clear example of independence is when non-Java™ languages such as Scala, Akka, Clojure, X10, Fortress, and others use the framework without extension.

An independent server does not belong within the JDK.

Conclusion

The F/J framework is lacking as an API:

It is deficient in the “characteristics of a good API” (Joshua Bloch)

The F/J framework is severely inadequate as an application service:

It is only available as an embedded server, with no remote object access such as RMI

It is only recommended for aggregate data operations.

It is only recommended for users with very large data structures (humongous arrays)

It is only useful for machines with profuse processors (16+ recommended)

It is only useful for one request at a time

It is only useful for strictly limited processing (recommended):

must be plain (between 100 and 10,000 basic computational steps in the compute method)

compute-intensive code only

no blocking

no I/O

no synchronization

It has no logging or alerting

It has no error/stall recovery

It has no ability to display/alter the current execution status

It relies on academic correctness rather than maintainability

It lacks scope

Its multi-tasking is faulty

It is inefficient

It cannot scale to hundreds of processors

It is complex to the point of being incomprehensible

There is no user manual, just JavaDoc

The minuscule benefits of the F/J framework do not outweigh:

the bloat this experiment adds to the Java™ runtime for non-fork-join users

the restrictions for developers looking to parallelize their software, especially with JDK1.8

The F/J framework is an inadequate academic experiment underpinning a research paper, not a lightweight framework. It does not belong in the JDK.


About the Author

Edward Harned is a software developer with over thirty years industry experience. He first led projects as an employee in major industries and then worked as an independent consultant. Today, Ed is a senior developer at Cooperative Software Systems, Inc., where, for the last fourteen years, he has used Java™ programming to bring fork-join solutions to a wide range of tasks.