Multicore CPUs and the concurrency changes they bring

Why thread-based application parallelism is trumped in the multicore era

Moore's Law, Gordon Moore's 1965 prediction (commonly restated as the number of components per integrated circuit doubling every 18 to 24 months), has held true, and it is expected to remain true until 2015-2020 (see Related topics). Until 2005, CPU clock rates also improved consistently, which by itself was sufficient to improve the performance of all applications executing on those CPUs. The application-development community enjoyed a free ride with respect to performance improvement while making little or no investment in algorithmic improvement.

Since 2005, however, clock-rate increases and transistor-count increases have been diverging. Because of the physical limits of processor materials, clock rates stopped increasing (and even dropped), and processor makers started packing more execution units (cores) into a single chip (socket). This trend, which seems likely to continue for the foreseeable future, has started to put pressure on the application-development and programming-language-development communities, in two broad senses:

Simply upgrading to a more powerful CPU no longer results in pre-2005 rates of performance increase for a single-threaded application. Single-threaded applications perform the same no matter how many cores are in the CPU. That is, throughput per core is more or less the same, regardless of how many cores the CPU has (assuming no breakthrough occurs in automatic parallelization techniques at the compiler, virtual-machine, or operating-system level).

Upgrading to multicore CPUs benefits only incremental load on the system (for example, additional concurrent requests), not the existing load: the work a system already handles does not run any faster.

The only way to exploit the available CPU cores efficiently is through parallelism. So far, parallelism is mainly being used by operating systems at the process level to provide a seamless multitasking, multiprocessing experience. On the application-development side, thread-based concurrent programming is the predominant mechanism for implementing parallelism.

Thread-based programming model

A thread is a lightweight process and the smallest unit of execution scheduled by an OS. All threads within a process share the same address space in memory, so they can share objects in memory. Technical details about how threads work are beyond the scope of this article.
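As a minimal sketch of what sharing an address space means in practice, the Java example below starts two threads that both write to the same list object. The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: two threads append to one list object on the shared heap. */
class SharedHeapDemo {

    /** Starts two threads that write to the same list and returns it after both finish. */
    static List<String> runTwoWriters() {
        // One object on the heap; both threads hold a reference to it.
        List<String> shared = new ArrayList<>();

        Thread first = new Thread(() -> {
            synchronized (shared) { shared.add("written by first"); }
        });
        Thread second = new Thread(() -> {
            synchronized (shared) { shared.add("written by second"); }
        });

        first.start();
        second.start();
        try {
            // join() waits for each thread and makes its writes visible here.
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return shared;
    }

    public static void main(String[] args) {
        // The order of the two entries depends on thread scheduling.
        System.out.println(runTwoWriters());
    }
}
```

Neither thread copies the list; each one mutates the same object, which is why the writes have to be guarded.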

Thread-based parallelism has these advantages:

It is a well-established programming model.

The application-development community has a solid understanding of how threads are created, scheduled, executed, and managed.

Developers are trained to think of algorithmic development in a sequential manner. The threading model simply extends the same approach for parallelism.

However, the problems with thread-based application parallelism outweigh its advantages. This article presents some reasons why explicit thread-based application parallelism might not be the best way to utilize CPU cores and why we need a different programming paradigm.

Call-stack depth

The call stack is an internal structure maintained by the OS or virtual machine to handle all method invocations, and each thread has its own. Every method call within the thread's execution pushes one stack frame (consisting of details about the current method call, such as parameters, return address, and local variables) onto the stack; the frame is popped when the method returns.

Figure 1 shows the internals of method invocation:

Figure 1. Call-stack internal structure and growth

No matter how you modularize an application into multiple logical layers (such as controller layer, facade layer, component layer, and data access object [DAO] layer), a thread is the ultimate weaver at runtime, and it has only one stack. The call stack is an awesome invention for handling source-code modularization at runtime. But as an application's complexity grows and load on the system increases, the current call-stack structure model limits application scalability, and it has inherent problems relating to memory size and object reachability.
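The growth of a thread's single stack across layers can be observed with a small sketch. The four layer methods below are hypothetical stand-ins for the controller, facade, component, and DAO layers named above, and the length of the stack trace is used as an approximation of the number of frames.

```java
/** Hypothetical four-layer call chain; layer names mirror the ones in the text. */
class CallStackDepthDemo {

    /** Number of frames reported for the calling thread, including this helper's own. */
    static int currentDepth() {
        return Thread.currentThread().getStackTrace().length;
    }

    // Each layer only delegates to the next one, as a thread would at runtime.
    static int controller() { return facade(); }
    static int facade()     { return component(); }
    static int component()  { return dao(); }
    static int dao()        { return currentDepth(); }

    /** Difference between the depth seen in the DAO layer and the depth at the entry point. */
    static int framesAddedByLayers() {
        int atEntry = currentDepth();
        int atDao = controller();
        return atDao - atEntry;
    }

    public static void main(String[] args) {
        System.out.println("frames added by the layers: " + framesAddedByLayers());
    }
}
```

Every layer adds a frame to the same stack, and none of those frames goes away until the deepest call returns.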

Object reachability

Another problem with the deep call stack is that object references can be held in the call stack but never used. In Figure 1, for example, it is unlikely that all the local variables and parameters of all the methods in the call stack are needed when the thread is executing the deepest method in the execution flow. (For example, when a thread executes DAO-layer code, it is unlikely that the application needs all of the local parameters and variables in the call stack pushed by the servlet-layer, controller-layer, facade-layer, and other layer method calls.) However, the objects they refer to won't be released or garbage collected, because the stack frames still hold live references to them.

The Java™ call-stack implementation is designed to release a frame's references automatically, but only when the method call returns. This might be acceptable when the JVM is not under high load. But it can be a problem when the JVM is operating with a high number of active threads. For example, if each thread holds up to 5MB of unused live references in the call stack, and 100 threads are active, the JVM will be unable to garbage collect 500MB of heap space because it is still being referenced by call-stack variables and parameters. On a 32-bit machine, this could amount to at least 25 percent of all available memory for that JVM, which is a considerable size.
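The arithmetic behind these figures can be written out as a short sketch. The 2,000MB heap ceiling used below is an assumption for illustration only; the heap actually available to a 32-bit JVM depends on the operating system and JVM and is often lower, which is why the share is described as at least 25 percent.

```java
/** Hypothetical helper that reproduces the estimate in the text. */
class RetainedHeapEstimate {

    /** Heap (in MB) kept alive by unused references across all active threads. */
    static long retainedMb(int activeThreads, long unusedMbPerThread) {
        return activeThreads * unusedMbPerThread;
    }

    /** Retained heap as a percentage of the heap available to the JVM. */
    static double percentOfHeap(long retainedMb, long heapMb) {
        return 100.0 * retainedMb / heapMb;
    }

    public static void main(String[] args) {
        long retained = retainedMb(100, 5);            // 100 threads, 5MB each
        long assumedHeapMb = 2000;                     // assumption, not from the article
        System.out.println(retained + "MB retained, "
                + percentOfHeap(retained, assumedHeapMb) + "% of the assumed heap");
    }
}
```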

Shared objects

Another critical problem with thread-based parallelism is the synchronization effort that is due to the mutability of objects shared by multiple threads, as shown in Figure 2:

Figure 2. Shared memory

Though the concept of synchronization is nothing new and has been widely adopted, it penalizes the performance of the application: a thread that tries to acquire a lock held by another thread might be forced to wait or sleep until the lock is released, which triggers a thread context switch. A context switch generally slows down thread execution, because the core's instruction pipeline is flushed and cached data is often displaced. In a JVM with lots of parallel threads, lock contention can therefore cause frequent thread context switches.
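A minimal sketch of the pattern in question: several threads update one mutable counter, and every update has to pass through the same lock. The class name and the thread and iteration counts are illustrative assumptions.

```java
/** Hypothetical shared, mutable counter guarded by its intrinsic lock. */
class SynchronizedCounterDemo {
    private long count = 0;

    // Only one thread at a time can be inside either synchronized method;
    // the others block until the lock is released.
    synchronized void increment() { count++; }
    synchronized long get() { return count; }

    /** Runs the given number of threads against one counter and returns the final value. */
    static long countWithThreads(int threads, int incrementsPerThread) {
        SynchronizedCounterDemo counter = new SynchronizedCounterDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10_000));
    }
}
```

The lock makes the total correct, but it also serializes the threads at the counter: the more threads contend for it, the more time they spend waiting rather than working.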

Sequential programming

Sequential programming is not necessarily a problem with threads themselves, but it is related to the way an application uses them. The logical concept of the OS process was devised in the early days of computing for executing the instructions (in a user-submitted job) sequentially. But the sequential-programming mindset still prevails, even though the complexity of some processes has increased manyfold since then. As complexity has increased, various system layers (back end, middle tier, front end) have come into existence. But within a layer, application use-cases are still executed in a sequential manner with a single thread as the weaver of all logic across a variety of components.

You could compare this to manufacturing processes in the era before Henry Ford's assembly line was introduced. Then, a single worker or team of workers would create an entire product. An assembly line enables workers to concentrate on a specific subtask within the overall manufacturing process. It improves productivity manyfold by saving the time workers would otherwise spend moving through the stages of product manufacturing.

A modern-day analogy to the assembly line is customer-order processing by a fast-food restaurant. A predefined number of workers, each specialized in a set of subtasks, process the order, with each worker doing only a portion of the overall work. Once that person's part of the work is done, the semi-finished product is handed to the next worker in the chain, and so on until the final product is complete. In contrast, consider a system in which each worker handles one customer at a time from start to end. Both are valid ways of executing orders, but the fast-food system is more productive. A single worker who processes an entire order will spend too much time moving from place to place instead of actually making the product. Movement among workers creates other problems, such as space contention and time delays.

Now think of the way a modern JEE application server executes a user request. It allots one dedicated thread to each user request. As illustrated in Figure 3, that thread executes all the instructions for the request, including logging, database interaction, web service invocation, network interaction, and logic computation:

Figure 3. Thread flow

No matter how well the source code is modularized in terms of controller, model, view, facade, and other layers, it is executed by a single thread. This type of execution creates a lot of contention for hardware resources, which shows up as, for example, frequent context switches.
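The thread-per-request flow can be sketched as follows. This is not how any particular application server is implemented; the pool size, class name, and stage list are assumptions that mirror the stages named in the text, and each stage only records which thread ran it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical request handler in which one pooled thread runs every stage. */
class ThreadPerRequestDemo {
    static final String[] STAGES = {
        "logging", "database interaction", "web service invocation",
        "network interaction", "logic computation"
    };

    /** Runs all stages in order and records the name of the thread that ran each one. */
    static List<String> handleRequest() {
        List<String> threadPerStage = new ArrayList<>();
        for (int i = 0; i < STAGES.length; i++) {
            // A real stage would do work here; the sketch only notes who ran it.
            threadPerStage.add(Thread.currentThread().getName());
        }
        return threadPerStage;
    }

    /** Hands one request to a pool, the way a server dedicates a thread to it. */
    static List<String> serveOneRequest() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Callable<List<String>> request = ThreadPerRequestDemo::handleRequest;
        try {
            return pool.submit(request).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("request failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> threads = serveOneRequest();
        for (int i = 0; i < STAGES.length; i++) {
            System.out.println(STAGES[i] + " ran on " + threads.get(i));
        }
    }
}
```

Although the pool has four threads, a single request never uses more than one of them, so the other cores can help only when more requests arrive.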

Conclusion

Multithreading is an excellent way of utilizing underlying CPU resources as efficiently as possible. But as systems have evolved, the development and OS communities have extended the use of multithreading for application-level parallelism as well. The application-development community started using thread-based programming to execute all application logic in a sequential manner. Since multicore CPUs started gaining ground, with the numbers of cores increasing gradually, sequential explicit thread-based programming has become less efficient.

Scalable, high-performance applications running on multicore hardware require a parallelism methodology that breaks application logic into slices of multiple interdependent work units and chains them together transparently (as opposed to tying them together explicitly with a single thread), so that each individual work unit can execute efficiently.
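One way to picture chained work units is the sketch below, which uses the standard CompletableFuture class to pass an order through three stages, each with its own worker thread. It illustrates the idea only and is not the methodology the article has in mind; the stage names and the order-processing steps are hypothetical.

```java
import java.util.Locale;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical three-stage pipeline; each stage has a dedicated worker. */
class StagePipelineDemo {

    /** Sends one order through the stages and returns the finished result. */
    static String process(String rawOrder) {
        ExecutorService cleanStage = Executors.newSingleThreadExecutor();
        ExecutorService prepareStage = Executors.newSingleThreadExecutor();
        ExecutorService packStage = Executors.newSingleThreadExecutor();
        try {
            return CompletableFuture
                    // Stage 1: tidy up the incoming order.
                    .supplyAsync(() -> rawOrder.trim(), cleanStage)
                    // Stage 2: receives stage 1's output on a different worker.
                    .thenApplyAsync(item -> item.toUpperCase(Locale.ROOT), prepareStage)
                    // Stage 3: finishes the product.
                    .thenApplyAsync(item -> "order ready: " + item, packStage)
                    // Wait for the last stage; the stages themselves are never
                    // tied together by one thread.
                    .join();
        } finally {
            cleanStage.shutdown();
            prepareStage.shutdown();
            packStage.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(process("  burger "));
    }
}
```

Creating the executors per call keeps the sketch self-contained; a long-running application would create the stage workers once and reuse them, so that each stage can already be working on the next order.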

Just as the assembly line revolutionized the manufacturing process and introduced efficiency in every layer, the right future programming model will change the way we design application software. One such abstraction model, actor-based programming (see Related topics), divides the entire application into multiple slices, so that underlying cores can be assigned to these slices and executed in parallel in an efficient manner.
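To give a feel for the actor idea, here is a deliberately small sketch and not a real actor framework: the actor owns its state, other threads can only send it messages, and a single-threaded executor plays the role of the mailbox. All names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical actor: state is touched only by the mailbox thread, so no locks are needed. */
class CounterActor {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private long total = 0;

    /** Fire-and-forget message; the sender does not wait. */
    void tell(long amount) {
        mailbox.execute(() -> total += amount);
    }

    /** Request-reply message, queued behind everything sent earlier. */
    long ask() {
        Callable<Long> read = () -> total;
        try {
            return mailbox.submit(read).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("actor failed", e.getCause());
        }
    }

    void stop() {
        mailbox.shutdown();
    }

    /** Several sender threads each send the value 1 repeatedly; returns the actor's total. */
    static long sumFromSenders(int senders, int messagesEach) {
        CounterActor actor = new CounterActor();
        Thread[] threads = new Thread[senders];
        for (int i = 0; i < senders; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < messagesEach; j++) {
                    actor.tell(1);
                }
            });
            threads[i].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
            return actor.ask();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            actor.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumFromSenders(4, 1_000));
    }
}
```

The senders never block on a lock held for the counter's state; they only enqueue messages, and the actor processes them one at a time.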

Disclaimer

All opinions and views in this article are solely mine and not necessarily those of my employer.

Acknowledgment

I would like to thank my colleagues Jesus Bello and Olga Raskin for their valuable suggestions.

Related topics