By the end of this year, more than 70 percent of all server, desktop, and mobile processors that Intel ships will be multicore. This represents a challenge for software developers. Applications must be threaded if they are to make efficient use of these multicore processors. In this way, work can be given to individual cores to make as much use as possible of all the available execution resources. In this article, we describe how to make existing single-threaded applications multithreaded by analyzing their runtime performance characteristics and relating that back to the app's source code.

Multicore CPUs

Intel multicore architecture consists of a single processor package that contains two or more execution cores and can deliver, with appropriate software, parallel execution of software threads. The operating system sees each core as a separate processor with all resources that are typically associated with individual processors. Each core in a package runs at the same frequency and shares the same interface with the chipset and external memory.

Dual-core processors are an extension of hyperthreading technology. A hyperthreaded processor executes two software threads using the execution resources of a single core. To complicate matters, multicore processors can also be hyperthreaded. A dual-core processor that is not hyperthreaded can execute two threads by running each thread on a different core. In a hyperthreaded dual-core processor, each core can execute two software threads. So, for example, two software threads could be executed either on separate cores or on the same core using hyperthreading technology.

A critical aspect of multicore technology is that realizing performance gains through threading becomes much more important than relying on new hardware to increase processor speed. To make processors run cooler, semiconductor manufacturers are slowing the rate at which processor speed increases with each generation of multicore processor. As a result, unmodified applications may not speed up on newer processors as much as they did in the past. However, applications that are coded to use the extra threading capability of multicore technology can continue to run faster as new generations of multicore processors are introduced.

One of the side effects of multicore technology is that CPU-bound applications can potentially run twice as fast on dual-core processors, whereas memory-bound applications may only run 50 percent faster. I/O-bound applications may not run any faster.

When threading applications for both multicore and hyperthreaded processors, it is important to make sure thread workload types are spread across cores as evenly as possible. For instance, if an application has four threads (two computationally intensive threads and two UI-intensive threads) and runs on a dual-core hyperthreaded machine, it would probably be more effective to run the two computationally intensive threads on separate cores and the two UI threads on separate cores. In this way, one UI thread and one compute thread are handled via hyperthreading technology on each core. This provides a more balanced workload for the two cores than having both compute threads on one core and both UI threads on the other.

A Threading Example

To illustrate the process of threading an application, we take a serial, single-threaded application and determine how to make it multithreaded. The application we use is a Microsoft Windows application that simply draws a fractal by manipulating individual pixels in organized separate horizontal scan lines. The platform we ran the application on was a four-socket, dual-core, hyperthreaded (HT) machine. This means there were a total of eight hyperthreaded cores, which the operating system saw as 16 processors (4 sockets × 2 cores per socket × 2 HT CPUs per core).

We analyze the fractal application using Intel's VTune performance analysis tool. We use it to take a close look at the execution characteristics of the application to guide our decisions in terms of how and at what points to thread the application.

Here's the process we follow: First, we find the main performance bottlenecks in the application. We then see if it makes sense to thread the application there. If there's not enough work at that point for individual threads, we can look farther up the program's calling sequences to find a more appropriate place. We then check to see if our newly created threads are well balanced. If not, we try to come up with a solution in which the threads all use roughly the same amount of CPU time.

Find the Hotspot

One of the first issues to consider is whether threading the application improves performance. If the work that the application does can be distributed either through functional or data decomposition, then some performance improvement can be achieved through threading.

To find out where our application is doing most of its work, we start by running the application and collecting hotspot information to find out where the application is consuming a lot of CPU time. We can find the modules, functions, and even the line of source code that consumes most of the CPU cycles using the VTune Analyzer, and we don't need a special build of the application.

Figure 1 is the Process View in the VTune Analyzer 7.2 for Windows after collecting sampling data on the platform. It shows the majority of CPU time being spent in the System Idle Process and our example program. The fact that the amount of CPU time spent in the System Idle Process is a large fraction of the total CPU time during the data collection run indicates that significant CPU resources are not being used. This is expected given that we ran only the example application, the application is single threaded (and therefore only ran on 1 CPU), and the application is running on a machine with 16 logical processors.


Figures 2, 3, and 4 show the executables that consumed time in the Mandelbrot process, all the threads in the Mandelbrot app (only one thread that does any significant computation), and the Source View of the application's main bottleneck. The Source View shows that about 99 percent of the CPU time used by the application was spent in the function GenerateScanLine. To improve performance, you could try to change the code in this function so it would run faster. However, since we are running the application on a multicore machine, let's see if it makes sense to create threads to distribute the workload. Fortunately, there is a single location of intense computational activity that lends itself to threading. If there hadn't been any dominant hotspots, it would be more difficult to figure out what work to distribute across the processors.


If you can distribute the work currently done on one processor onto two processors, you can theoretically double the performance of an application. Amdahl's familiar equation gives a formula for the improvement in performance from threading an application:

Speedup of the application=1/((1-P)+ (P/A)), where P is the fraction of the application that is parallelized, and A is the speedup of the parallelized portion.



For example, if you parallelize 25 percent of the application and, as a result, make that part twice as fast, the application speeds up by about 14 percent:

Speedup=1/((.75)+(.25/2))=

1/(.75+.125)=1/.875=1.14



This equation is also often written a different way in terms of the number of processors on the system:

Speedup=1/(S+(1-S)/N)



where S is the fraction of the application that is serial and N is the number of processors.

From the sampling results we know that about 99 percent of the CPU time was consumed in GenerateScanLine and the CalculatePixel calls it makes, so if we can spread that work across 16 threads, we could see a theoretical maximum speedup in our application of about 13.88 times (1388 percent). Again, using the equation:

Speedup=1/(S+(1-S)/N)



where S, the serial portion of the application, is the part we didn't thread, so S= 1-.99=.01.

Speedup=1/((1-.99)+((1-(1-.99))/16))

=1/(.01+.99/16)

=1/(.01+.062)

=1/.072

=13.88



Our application ran in 2156 milliseconds as a serial application. Running 13.88 times faster, it would take:

2156/13.88=155 milliseconds



Thus, the theoretical minimum runtime for our application on 16 processors would be about 155 ms.
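
The arithmetic above can be checked with a few lines of code. The following is a minimal C++ sketch, not part of the original application; the function names amdahl_speedup and predicted_runtime_ms are made up for illustration. Evaluated without intermediate rounding, the S=.01, N=16 case comes out at about 13.91 rather than 13.88 (the difference comes from rounding .99/16 to .062), which still gives a best-case runtime of about 155 ms.

```cpp
#include <cassert>

// Amdahl's law: speedup when a fraction `serial` (S) of the run time stays
// serial and the remaining (1 - S) is divided evenly among n processors.
double amdahl_speedup(double serial, int n) {
    assert(serial >= 0.0 && serial <= 1.0 && n >= 1);
    return 1.0 / (serial + (1.0 - serial) / n);
}

// Best-case run time predicted from a measured serial run time.
double predicted_runtime_ms(double serial_ms, double serial, int n) {
    return serial_ms / amdahl_speedup(serial, n);
}
```

Calling amdahl_speedup(0.75, 2) reproduces the earlier 25 percent example (about 1.14), and predicted_runtime_ms(2156.0, 0.01, 16) reproduces the 155 ms estimate.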

However, hyperthreading changes the calculation of speedup. Hyperthreading runs multiple threads through one core. Each thread runs more slowly than it would if it were executing alone, but the combined execution is faster than running one thread after the other, although not as fast as running each thread on a separate processor. Taking hyperthreading into account, the equation for speedup then becomes:

Speedup=1/(S+0.67*((1-S)/N)+hN)



where h is a constant that varies depending on the application and represents the overhead from hyperthreading, which consists of operating system overhead and time spent synchronizing threads (see Programming with Hyperthreading Technology, by Richard Gerber and Andrew Binstock, ISBN 0971786143).
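
To see how the overhead term affects the result, the formula can be written as a small function. This is a sketch in C++ that transcribes the equation exactly as it is given above; the name ht_speedup is hypothetical, and because the article gives no value for h, any value passed in is an assumption chosen by the caller rather than a measured constant.

```cpp
#include <cassert>

// Hyperthreading-adjusted speedup, transcribed from the formula in the text:
//   1 / (S + 0.67 * ((1 - S) / N) + h * N)
// `h` is the application-specific overhead constant. No value for it is
// given in the text, so the caller must supply an assumed one.
double ht_speedup(double serial, int n, double h) {
    assert(serial >= 0.0 && serial <= 1.0 && n >= 1 && h >= 0.0);
    return 1.0 / (serial + 0.67 * ((1.0 - serial) / n) + h * n);
}
```

Because h is multiplied by N, the overhead term grows with the thread count, so for any nonzero h the predicted speedup eventually stops improving as threads are added.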

We can get a significant increase in application speed by threading it, but where should we create the threads? It wouldn't be a good idea to create a thread for each call to CalculatePixel, because we would then be creating, synchronizing, and destroying a thread for each pixel on the screen. Thread management would take more time than the pixel calculations themselves and might actually decrease performance. The same can be said for creating a thread for each scanline. Instead, we should look farther up in the calling sequence to find a better place to create threads to distribute the work.

Creating a Call Graph to Find a Threading Point

You can use the VTune Analyzer's Call Graph feature to create a call graph of our application. By looking at the call graph, we can find a place farther up in the call tree from the hot function where it could make sense to create a thread. By recoding the call chain to the hot function to partition the work among several threads, we can take advantage of the parallel processing capability of the machine to improve the performance of the application.

Had there been no single hotspot, we could have used VTune to create a call graph and looked for a subgraph that took a proportionally large amount of time. The root function of such a candidate subgraph would have a large total time relative to the root functions of other subgraphs.

Like Sampling, the analysis technology we used earlier, VTune's Call Graph feature does not require a special build of the application. It needs only debug symbols, which can be generated even for fully optimized builds, so that we can see function names. The function GenerateScanLine works on the pixels in a scanline by calling the function CalculatePixel for each pixel; see Figure 5.
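
To make the call chain concrete, here is a hypothetical serial sketch in C++ of a fractal renderer with the same shape. The article does not list the application's source, so everything here is an assumption: the lowercase names calculate_pixel and generate_scan_line only stand in for the roles of CalculatePixel and GenerateScanLine, and the escape-time Mandelbrot iteration, the signatures, and the viewing window are illustrative choices.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Role of CalculatePixel (assumed): escape-time iteration count for point c.
int calculate_pixel(std::complex<double> c, int max_iter) {
    std::complex<double> z = 0.0;
    int i = 0;
    while (i < max_iter && std::norm(z) <= 4.0) {
        z = z * z + c;
        ++i;
    }
    return i;
}

// Role of GenerateScanLine (assumed): one call to calculate_pixel per pixel.
// The window [-2, 1] x [-1.5, 1.5] is an arbitrary choice for illustration.
std::vector<int> generate_scan_line(int y, int width, int height, int max_iter) {
    assert(width > 0 && height > 0 && y >= 0 && y < height);
    std::vector<int> line(width);
    const double im = -1.5 + 3.0 * y / height;
    for (int x = 0; x < width; ++x) {
        const double re = -2.0 + 3.0 * x / width;
        line[x] = calculate_pixel({re, im}, max_iter);
    }
    return line;
}

// Role of Generate_Display (assumed): the serial loop over scanlines, which
// is the loop a threaded version would divide among worker threads.
std::vector<std::vector<int>> generate_display(int width, int height, int max_iter) {
    std::vector<std::vector<int>> image;
    for (int y = 0; y < height; ++y)
        image.push_back(generate_scan_line(y, width, height, max_iter));
    return image;
}
```

In a renderer of this kind the iteration count differs from pixel to pixel, so scanlines that cross the fractal cost far more than scanlines near the edge of the window.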


By looking at the generated call tree, you can see that a good place to create separate threads is in the function Generate_Display, which calls GenerateScanLine. The strategy is to create as many threads as there are logical processors (16) and break the scanline drawing into pieces, giving each thread a set of scanlines to work on. After we make the source code changes to do this, the application retrieves the number of logical processors from the OS and creates that many threads to draw the fractal.

The Call Graph display tells us that Generate_Display and GenerateScanLine are on the time-critical path because the functions are on the most time-consuming calling sequence as indicated by the thick red arrows. This is a good indication that we are focusing on the right subgraph. Also, the Self Time for Generate_Display is low compared to the Self Time for GenerateScanLine. This means that most of the time is consumed in GenerateScanLine. By putting the calls to GenerateScanLine into separate threads, the application work can be partitioned among the available processors.

To have an application that can scale with different numbers of processors, it's a good idea to dynamically determine how many threads to create. In this case, we use the operating system to find out how many logical processors there are and create the same number of threads. Creating one thread per logical processor, and no more, helps limit thread-synchronization issues.

Figure 6 shows the modified code to determine how many scanlines each thread should calculate.
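
A comparable partitioning can be sketched in portable C++. This is an illustration only and is not the code in Figure 6: it assumes C++11 std::thread rather than the threading API of the original Windows application, and the names ScanlineRange, worker_count, block_partition, and draw_in_blocks are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Half-open range [first, last) of scanlines assigned to one thread.
struct ScanlineRange {
    int first;
    int last;
};

// One worker per logical processor. hardware_concurrency() may return 0
// when the count cannot be determined, so fall back to a single worker.
unsigned worker_count() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Split `height` scanlines into `threads` contiguous blocks whose sizes
// differ by at most one line.
std::vector<ScanlineRange> block_partition(int height, int threads) {
    assert(height >= 0 && threads >= 1);
    std::vector<ScanlineRange> ranges;
    const int base = height / threads;
    const int extra = height % threads;
    int start = 0;
    for (int t = 0; t < threads; ++t) {
        const int count = base + (t < extra ? 1 : 0);
        ranges.push_back({start, start + count});
        start += count;
    }
    return ranges;
}

// Call draw_line(y) for every scanline, one thread per block. draw_line is
// invoked concurrently, so it must be safe to call from several threads.
template <typename DrawLine>
void draw_in_blocks(int height, int threads, DrawLine draw_line) {
    std::vector<std::thread> workers;
    for (const ScanlineRange r : block_partition(height, threads)) {
        workers.emplace_back([r, &draw_line] {
            for (int y = r.first; y < r.last; ++y) draw_line(y);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

A caller would pass static_cast<int>(worker_count()) as the thread count. Each thread then owns one contiguous band of the image, which keeps the threads independent but assumes every band costs about the same to draw.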


After running the multithreaded application, there is a significant speedup, from 2156 milliseconds to 375 milliseconds, or about 5.7 times faster. The application runtime is shorter, but still well above the roughly 155 milliseconds that Amdahl's equation predicts is theoretically possible. Have we done all we can to improve performance?

Now that the application is multithreaded, we need to make sure that the threads are doing about the same amount of work and that they are running in parallel. If not, we won't be getting the maximum amount of benefit from threading the application.

Check Load Balancing

To get an idea of how well distributed and parallelized the workload is, we use VTune Analyzer's Samples Over Time feature. The Samples Over Time display shows how the activity for a particular application, module, or thread varies over the time the data was collected. By looking at the thread sampling data over time, we see if the CPU time consumed by each thread was about the same, providing evidence of whether the workload was evenly distributed. In addition, we can tell if the threads are parallelized by noticing whether there were a significant number of samples taken on each thread during the same block of time.

Figure 7 is the Samples Over Time View. Each square in the right panel represents a slice of time, and the red squares indicate heavier CPU usage in that slice of time. Many of the squares for each thread are colored red at the same instant in time, indicating that the threads are indeed running in parallel. However, the threads finish at different times, indicating that some threads have much more work than others, and that there will be idle CPUs when the first threads finish.


Fix the Load Balancing Problem and Rebuild

The Samples Over Time View reveals that the threads are doing different amounts of work. It turns out that some threads are drawing on parts of the screen that aren't covered by much of the fractal, so they don't consume much CPU time, while other threads are drawing on parts of the screen that are covered by a large part of the fractal and consequently consume a lot of CPU cycles. Instead of organizing the threads so that the first thread does the first several scanlines (which are the simplest in terms of the fractal), we interleave the threads and scanlines so that (on a 16-CPU system) the first thread does the first line, the 17th line, the 33rd line, and so forth, while the second thread does the second line, the 18th line, and so on. This should distribute the computation more evenly over all the threads.
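
The interleaved assignment can be sketched as follows. This is a minimal C++ illustration, not the application's actual code; the names interleaved_lines and interleaved_load are hypothetical. Line indices are 0-based here, so the first, 17th, and 33rd lines in the text are indices 0, 16, and 32, and the per-scanline cost vector is a made-up stand-in for how expensive each line is to draw.

```cpp
#include <cassert>
#include <vector>

// Scanlines owned by thread `t` of `threads` under interleaving:
// t, t + threads, t + 2 * threads, ... (0-based indices).
std::vector<int> interleaved_lines(int t, int threads, int height) {
    assert(threads >= 1 && t >= 0 && t < threads && height >= 0);
    std::vector<int> lines;
    for (int y = t; y < height; y += threads) lines.push_back(y);
    return lines;
}

// Total cost each thread would carry, given an assumed cost per scanline.
std::vector<long> interleaved_load(const std::vector<long>& line_cost, int threads) {
    std::vector<long> load(threads, 0);
    const int height = static_cast<int>(line_cost.size());
    for (int t = 0; t < threads; ++t)
        for (int y : interleaved_lines(t, threads, height))
            load[t] += line_cost[y];
    return load;
}
```

When neighboring scanlines have similar cost, each thread under this scheme gets a sample of cheap and expensive lines from across the whole image, so the per-thread totals end up close together. For example, with 32 lines whose assumed cost rises steadily from 0 to 31 and 4 threads, the interleaved totals are 112, 120, 128, and 136, whereas four contiguous blocks of 8 lines would total 28, 92, 156, and 220.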

The Samples Over Time View (Figure 8) shows the results after interleaving the threads. The amount of thread activity is balanced among the 16 threads that are drawing the fractal, and the time to draw the fractal is reduced from 375 to 297 milliseconds.


Final Results

By properly threading the single-threaded fractal application, we reduced its runtime from 2156 milliseconds to 297 milliseconds, a speedup of about 7.3 times. This was done by analyzing the performance characteristics of the application. We began by finding the performance bottleneck, then created multiple threads farther up the bottleneck's calling sequence. We then found that the threads were not properly balanced, so we came up with a way to partition the work among the threads more evenly.

Threading applications as a way to improve application performance is vital with the introduction of multicore processors. You need to first determine whether threading will actually improve performance, and if it's justified, you need to keep principles of proper threading in mind when introducing threading into applications. Such factors as load balancing and synchronization have to be examined to gain optimal performance from threaded applications, and performance analysis tools can be used to help with these decisions.

DDJ