The advantage of parallel programming over serial computing is increased computing performance. Parallel performance improvements can be achieved by reducing latency, increasing throughput, and reducing CPU power consumption. Because these three factors are often interrelated, a developer must balance all three to ensure that the efficiency of the whole is maximized. When optimizing performance, the measurement known as "speedup" enables a developer to track changes in the latency of a specific computational problem as the number of processors is increased. The goal of optimization may be to make a program run faster with the same workload (reflected in Amdahl's Law) or to run a program in the same time with a larger workload (Gustafson-Barsis' Law). This article explores the basic concepts of performance theory in parallel programming and how these elements can guide software optimization.

Latency and Throughput

The time it takes to complete a task is called "latency." It has units of time. The scale can be anywhere from nanoseconds to days. Lower latency is better.

The rate at which a series of tasks can be completed is called "throughput." This has units of work per unit time. Larger throughput is better. A related term is "bandwidth," which refers to throughput rates that have a frequency-domain interpretation, particularly when referring to memory or communication transactions.

Some optimizations that improve throughput may increase the latency. For example, processing of a series of tasks can be parallelized by pipelining, which overlaps different stages of processing. However, pipelining adds overhead since the stages must now synchronize and communicate, so the time it takes to get one complete task through the whole pipeline may be longer than with a simple serial implementation.
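The tradeoff can be made concrete with a small steady-state model. This is only a sketch under stated assumptions: the function names, the stage times, and the fixed per-stage `handoff_overhead` are hypothetical, and a real pipeline would also have fill and drain effects that the model ignores.

```python
def serial_metrics(stage_times):
    """Latency and throughput when one worker runs every stage back to back."""
    latency = sum(stage_times)
    return latency, 1.0 / latency


def pipeline_metrics(stage_times, handoff_overhead):
    """Steady-state model of a pipeline with one worker per stage.

    Assumption: each stage pays a fixed handoff cost for synchronization
    and communication with its neighbor.
    """
    padded = [t + handoff_overhead for t in stage_times]
    latency = sum(padded)            # one task still has to visit every stage
    throughput = 1.0 / max(padded)   # one task completes per slowest-stage interval
    return latency, throughput


if __name__ == "__main__":
    stages_ms = [2.0, 3.0, 1.0]      # hypothetical stage times in milliseconds
    print("serial   (latency, throughput):", serial_metrics(stages_ms))
    print("pipeline (latency, throughput):", pipeline_metrics(stages_ms, 0.5))
```

With these made-up numbers the pipeline completes tasks more often than the serial version, yet each individual task takes longer to get through, which is the effect described above.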

Related to latency is response time. This measure is often used in transaction-processing systems, such as web servers, where many transactions from different sources need to be processed. To maintain a given quality of service, each transaction should be processed in a given amount of time. However, some latency may be sacrificed even in this case in order to improve throughput. In particular, tasks may be queued up, and time spent waiting in the queue increases each task's latency. However, queuing tasks improves the overall utilization of the computing resources and so improves throughput and reduces costs.

"Extra" parallelism can also be used for latency hiding. Latency hiding does not actually reduce latency; instead, it improves utilization and throughput by quickly switching to another task whenever one task needs to wait for a high-latency activity.
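A rough illustration of latency hiding, assuming Python's `asyncio` and using `asyncio.sleep` as a stand-in for a high-latency operation such as a network or disk access. The task count and wait time are arbitrary, and the helper names are hypothetical.

```python
import asyncio
import time


async def task(wait_s):
    """One task: wait on a simulated high-latency operation; return its own latency."""
    start = time.perf_counter()
    await asyncio.sleep(wait_s)      # stands in for a slow memory, disk, or network access
    return time.perf_counter() - start


async def run_one_at_a_time(n, wait_s):
    """Start each task only after the previous one has finished."""
    start = time.perf_counter()
    for _ in range(n):
        await task(wait_s)
    return time.perf_counter() - start


async def run_with_latency_hiding(n, wait_s):
    """Switch to another task whenever one is waiting, so the waits overlap."""
    start = time.perf_counter()
    latencies = await asyncio.gather(*(task(wait_s) for _ in range(n)))
    return time.perf_counter() - start, latencies


def demo(n=4, wait_s=0.05):
    serial_total = asyncio.run(run_one_at_a_time(n, wait_s))
    hidden_total, latencies = asyncio.run(run_with_latency_hiding(n, wait_s))
    return serial_total, hidden_total, min(latencies)


if __name__ == "__main__":
    serial_total, hidden_total, best_latency = demo()
    print(f"one at a time: {serial_total:.3f} s total")
    print(f"overlapped:    {hidden_total:.3f} s total")
    print(f"fastest single task when overlapped: {best_latency:.3f} s")
```

The total time for the batch drops because the waits overlap, but no individual task finishes faster than its own wait: throughput improves while per-task latency does not.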

Speedup, Efficiency, and Scalability

Two important metrics related to performance and parallelism are speedup and efficiency. Speedup (Equation 1) compares the latency for solving the identical computational problem on one hardware unit ("worker") versus on P hardware units:



$$S_P = \frac{T_1}{T_P}$$

Equation 1.

where $T_1$ is the latency of the program with one worker and $T_P$ is the latency on $P$ workers.



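Efficiency (Equation 2) is the speedup divided by the number of workers:

$$\text{efficiency} = \frac{S_P}{P} = \frac{T_1}{P \, T_P}$$

Equation 2.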

Efficiency measures return on hardware investment. Ideal efficiency is 1 (often reported as 100%), which corresponds to a linear speedup, but many factors can reduce efficiency below this ideal.
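Both metrics are simple ratios, as the following sketch shows. The function names and the timings in the example are hypothetical and chosen only for illustration.

```python
def speedup(t1, tp):
    """Speedup: latency with one worker divided by latency with P workers."""
    return t1 / tp


def efficiency(t1, tp, workers):
    """Efficiency: speedup divided by the number of workers."""
    return speedup(t1, tp) / workers


if __name__ == "__main__":
    # Hypothetical measurement: 120 s on one worker, 20 s on eight workers.
    print("speedup:   ", speedup(120.0, 20.0))
    print("efficiency:", efficiency(120.0, 20.0, 8))
```

In this made-up case the program runs 6 times faster on 8 workers, so each worker is delivering 75% of the ideal return.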

If $T_1$ is the latency of the parallel program running with a single worker, then Equation 1 is sometimes called "relative speedup" because it shows relative improvement from using $P$ workers. This uses a serialization of the parallel algorithm as the baseline. However, sometimes there is a better serial algorithm that does not parallelize well. If so, it is fairer to use that algorithm for $T_1$ and report absolute speedup, as long as both algorithms are solving an identical computational problem. Otherwise, using an unnecessarily poor baseline artificially inflates speedup and efficiency.

In some cases, it is also fair to use algorithms that produce numerically different answers, as long as they solve the same problem according to the problem definition. In particular, reordering floating point computations is sometimes unavoidable. Since floating-point operations are not truly associative, reordering can lead to differences in output, sometimes radically different if a floating-point comparison leads to a divergence in control flow. Whether the serial or parallel result is actually more accurate depends on the circumstances.
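A small Python sketch of the reordering effect. The chunked summation below only mimics how partial sums might be grouped across workers; it does not run in parallel, and the input values are contrived to make the difference obvious.

```python
import math


def sum_left_to_right(values):
    """Plain serial summation in the given order."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_chunks(values, chunks):
    """Sum as `chunks` workers might: each sums a contiguous slice,
    then the partial sums are combined."""
    size = max(1, math.ceil(len(values) / chunks))
    partials = [sum_left_to_right(values[i:i + size])
                for i in range(0, len(values), size)]
    return sum_left_to_right(partials)


values = [1e20, 1.0, -1e20, 1.0]
serial = sum_left_to_right(values)     # ((1e20 + 1) - 1e20) + 1
chunked = sum_in_chunks(values, 2)     # (1e20 + 1) + (-1e20 + 1)
exact = math.fsum(values)              # correctly rounded sum of the inputs

if __name__ == "__main__":
    print("serial: ", serial)
    print("chunked:", chunked)
    print("exact:  ", exact)
```

Here the serial order yields 1.0, the regrouped order yields 0.0, and the correctly rounded sum is 2.0, so neither ordering is inherently the more accurate one.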

Speedup, not efficiency, is what you see in advertisements for parallel computers, because speedups can be large impressive numbers. Efficiencies, except in unusual circumstances, do not exceed 100% and often sound depressingly low. A speedup of 100 sounds better than an efficiency of 10%, even if both are for the same program and same machine with 1000 cores.

An algorithm that runs P times faster on P processors is said to exhibit linear speedup. Linear speedup is rare in practice, since there is extra work involved in distributing work to processors and coordinating them. In addition, an optimal serial algorithm may be able to do less work overall than an optimal parallel algorithm for certain problems, so the achievable speedup may be sublinear in P, even on theoretical ideal machines. Linear speedup is usually considered optimal since we can serialize the parallel algorithm, as noted above, and run it on a serial machine with a linear slowdown as a worst-case baseline.

However, as exceptions that prove the rule, an occasional program will exhibit superlinear speedup, that is, an efficiency greater than 100%. Some common causes of superlinear speedup include:

Restructuring a program for parallel execution can cause it to use cache memory better, even when run with a single worker! But if $T_1$ from the old program is still used for the speedup calculation, the speedup can appear to be superlinear. See Section 10.5 for an example of restructuring that often reduces $T_1$ significantly.

The program's performance is strongly dependent on having a sufficient amount of cache memory, and no single worker has access to that amount. If multiple workers bring that amount to bear, because they do not all share the same cache, absolute speedup really can be superlinear.

The parallel algorithm may be more efficient than the equivalent serial algorithm, since it may be able to avoid work that its serialization would be forced to do. For example, in search tree problems, searching multiple branches in parallel sometimes permits chopping off branches (by using results computed in sibling branches) sooner than would occur in the serial code.

However, for the most part, sublinear speedup is the norm.

Later, we discuss an important limit on speedup: Amdahl's Law. It considers speedup as P varies and the problem size remains fixed. This is sometimes called "strong scalability." Another section discusses an alternative, Gustafson-Barsis' Law, which assumes the problem size grows with P. This is sometimes called "weak scalability." But before discussing speedup further, we discuss another motivation for parallelism: power.

Power

Parallelization can reduce power consumption. CMOS (complementary metal–oxide–semiconductor) is the dominant circuit technology in modern computer hardware. CMOS power consumption is the sum of dynamic power consumption and static power consumption. For a circuit supply voltage V and operating frequency f, CMOS dynamic power dissipation is governed by the proportion in Equation 3:



$$P_{\text{dynamic}} \propto V^2 f$$

Equation 3.

The frequency dependence is actually more severe than the equation suggests because the highest frequency at which a CMOS circuit can operate is roughly proportional to the voltage. Thus dynamic power varies as the cube of the maximum frequency. Static power consumption is nominally independent of frequency but is dependent on voltage. The relation is more complex than for dynamic power, but, for sake of argument, assume it varies cubically with voltage. Because the necessary voltage is proportional to the maximum frequency, the static power consumption varies as the cube of the maximum frequency, too. Under this assumption, we can use a simple overall model where the total power consumption varies by the cube of the frequency.
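The chain of proportionalities in this argument can be written out as follows. This is a sketch that takes the dynamic-power relation to be the standard CMOS form $P \propto V^2 f$ and, as in the text, simply assumes a cubic voltage dependence for static power; the subscripted names are introduced here only for illustration.

```latex
\begin{align*}
P_{\text{dynamic}} &\propto V^2 f, \quad V \propto f_{\max}
  &&\Rightarrow\quad P_{\text{dynamic}} \propto f_{\max}^3 \\
P_{\text{static}} &\propto V^3 \ \text{(assumed)}, \quad V \propto f_{\max}
  &&\Rightarrow\quad P_{\text{static}} \propto f_{\max}^3 \\
P_{\text{total}} &= P_{\text{dynamic}} + P_{\text{static}}
  &&\Rightarrow\quad P_{\text{total}} \propto f_{\max}^3
\end{align*}
```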

ACTIVE CORES    MAXIMUM FREQUENCY (GHz)    BREAKEVEN EFFICIENCY
4               2.4                        34%
3               2.8                        39%
2               3.2                        52%
1               3.3                        100%

Table 1: The maximum core frequency for an Intel Core i5-2500T chip depends on the number of active cores. The right column shows the parallel efficiency over all four cores required to match the speed of using only one active core.
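The breakeven column can be reproduced with a short calculation. This sketch assumes that performance scales linearly with clock frequency and takes the single-core maximum to be 3.3 GHz, the value with which the 34%, 39%, and 52% entries are consistent. The function name is hypothetical.

```python
def breakeven_efficiency(single_core_ghz, active_cores, multi_core_ghz):
    """Parallel efficiency at which `active_cores` cores running at
    `multi_core_ghz` match one core running at `single_core_ghz`.

    Assumption: performance is proportional to clock frequency, so
    efficiency * active_cores * multi_core_ghz == single_core_ghz.
    """
    return single_core_ghz / (active_cores * multi_core_ghz)


SINGLE_CORE_GHZ = 3.3  # assumed single-core maximum frequency

# (active cores, maximum frequency in GHz) taken from the table
TABLE_ROWS = [(4, 2.4), (3, 2.8), (2, 3.2), (1, SINGLE_CORE_GHZ)]

if __name__ == "__main__":
    for cores, ghz in TABLE_ROWS:
        eff = breakeven_efficiency(SINGLE_CORE_GHZ, cores, ghz)
        print(f"{cores} active core(s) at {ghz} GHz: breakeven {eff:.0%}")
```

Any parallel efficiency above the breakeven value means the multicore configuration outperforms the single boosted core under this model.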

Suppose that parallelization speeds up an application by 1.5X on two cores. You can use this speedup either to reduce latency or to reduce power. If your latency requirement is already met, then reducing the clock rate of the cores by 1.5X will save a significant amount of power. Let $P_1$ be the power consumed by one core running the serial version of the application. Then the power consumed by two cores running the parallel version of the application will be given by:



$$P_2 = 2 \left(\frac{1}{1.5}\right)^3 P_1 \approx 0.6 \, P_1$$

Equation 4.

where the factor of 2 arises from having two cores. Using two cores running the parallelized version of the application at the lower clock rate has the same latency but uses (in this case) 40% less power. Unfortunately, reality is not so simple. Current chips have so many transistors that frequency and voltage are already scaled down to near the lower limit just to avoid overheating, so there is not much leeway for raising the frequency. For example, Intel Turbo Boost Technology enables cores to be put to sleep so that the power can be devoted to the remaining cores while keeping the chip within its thermal design power limits. Table 1 shows an example. Still, the table shows that even low parallel efficiencies offer more performance on this chip than serial execution.
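The two-core power estimate can be checked numerically under the simplified cubic model described earlier. This is a sketch: the function name is hypothetical, and the exponent of 3 is the modeling assumption from the text, not a measured property of any particular chip.

```python
def relative_power(cores, frequency_scale, exponent=3):
    """Power relative to one core at full clock speed.

    Assumption: each core's power varies as frequency ** exponent,
    and total power is the sum over identical cores.
    """
    return cores * frequency_scale ** exponent


if __name__ == "__main__":
    ratio = relative_power(cores=2, frequency_scale=1 / 1.5)
    print(f"two cores at 1/1.5 of the clock rate: {ratio:.3f} x single-core power")
    print(f"power saved: {1 - ratio:.1%}")
```

The ratio works out to 16/27, or about 0.59, which matches the roughly 40% saving quoted above.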