High Performance Computing is More Parallel Than Ever

The end of Moore’s Law and the explosion of machine learning are driving growth in parallel computing. So what is it?

In 1965 Gordon Moore made the observation that became known as Moore’s Law. Specifically, he noted that in the years between 1959 and 1965 the number of electronic components that could fit on an integrated circuit had doubled roughly every year. He later revised the pace to every two years, and the figure most often quoted today is a doubling every 18 months. Moore’s observation became a prediction, then an expectation, and for the years between 1965 and 2012 it was perhaps the guiding principle of the computer hardware industry. Components shrank at an exponential rate, and the ultimate result was that CPU speeds doubled roughly every 18 months.

During the era of Moore’s Law a common mantra in the software world was, “the programmer’s time is more valuable than the computer’s time.” This mentality brought plenty of wonderful things with it, including dynamic and expressive languages such as Python and JavaScript. Compared to their equivalents written in C, programs written in these languages are slow and use lots of memory. These trade-offs were easy to make when computers would be twice as fast, with more RAM and larger CPU caches, in a year and a half. These advances have also brought down the cost of creating software and made it much easier for beginners to learn how to program.

But Moore’s Law is dead.

CPU speeds have been stagnating for several years now, and experts in the field of computer hardware do not expect them to recover. To keep up with the growing demand for high performance computing, hardware engineers are looking to alternatives like Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), which rely on parallel computing to do more things at once rather than do one thing faster. As we exit the era of Moore’s Law, high performance computing will require that programmers take advantage of parallel processors. Furthermore, if the 2020s are going to be the era of machine learning, then demand for high performance computing will only continue to grow. If that demand is to be met, it will be met with parallel processors.

While individual CPUs aren’t getting faster, we have already done a lot of work using several of them at the same time. Operating systems have been quietly shuffling processes around multicore CPUs for years: the first consumer multicore processors hit the market in 2005, and supercomputers were using multicore processors as early as 2001.

Much of the perceived speed increase for consumer computers in the last 5–10 years has to do with improvements in coordinating multiple cores on a CPU, not with individual cores becoming faster. In fact, the 4th generation (2014) Core i7 has the same base processor frequency (3.60 GHz) as the 9th generation (2018) Core i7, but the newer version has twice as many cores: 8 instead of 4. GPUs and TPUs are a different beast entirely: a high-end GPU can have thousands of individual cores.

But adding more cores isn’t enough to increase speed by itself. Programmers need to write programs that explicitly take advantage of parallel processors’ capabilities for end users to really notice the improvement.

Three Kinds of Parallelism

There are three major kinds of parallel computing from a programmer’s perspective:

Multiprocessing: when two or more separate programs are run on separate processors.

Parallel processing: when a single program is broken into independent parts and the results of those individual parts are merged.

Data parallelism: when a large number of values must all be processed in the same way.

All three of these will play an important role in software speeds in the coming years, and operating systems already do tons of multiprocessing. In fact, CPUs have co-evolved closely with operating systems. As a result, CPUs are designed with more focus on multiprocessing workloads that improve OS efficiency than the newer GPUs and TPUs are.

Both GPUs and TPUs excel at processing workloads with data parallelism. Graphics processing units originally came about to solve problems associated with computer graphics. Here is an example: suppose we wanted to increase the brightness of every pixel within some boundary, essentially the Photoshop dodge tool. We need to run the same set of instructions on all of those pixels, but each pixel can be updated individually and independently. GPUs were originally designed around this sort of workload: running the same mathematical process on a large number of inputs all of the same type (increasing the brightness of several pixels, in this case).
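To make the brightness example concrete, here is a minimal sketch in plain Python. The names brighten and dodge, the rectangular boundary, and the 8-bit grayscale pixels are all assumptions made for illustration; a real graphics program would run this kind of per-pixel function on the GPU rather than in a Python loop.

```python
def brighten(value, amount):
    """Brighten one 8-bit grayscale pixel, clamping at the maximum of 255."""
    return min(255, value + amount)


def dodge(image, top, left, bottom, right, amount):
    """Return a copy of image with the pixels inside a rectangle brightened.

    Rows top..bottom-1 and columns left..right-1 count as inside the
    boundary. Each output pixel depends only on the matching input pixel,
    so the pixels could be computed in any order, or all at once.
    """
    return [
        [
            brighten(pixel, amount)
            if top <= r < bottom and left <= c < right
            else pixel
            for c, pixel in enumerate(row)
        ]
        for r, row in enumerate(image)
    ]


# A tiny hypothetical 3x4 image; the bottom-right 2x2 region gets brightened.
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 250, 120],
]
brighter = dodge(image, top=1, left=1, bottom=3, right=3, amount=20)
```

The important property is that brighten never looks at any pixel other than its own input, which is exactly what lets hardware process many pixels simultaneously.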

For many years GPUs were a niche product primarily for the gaming world. The emergence of virtual and augmented reality will continue to drive computational demand for GPUs in graphics contexts. Additionally, video games continue to become more mainstream, which will in turn drive demand as more consumers buy graphics cards (either for their computers or as part of a gaming console). But data parallelism has also found new consumers in emerging industries such as machine learning and cryptocurrency. The special requirements of machine learning tasks have already spawned a new branch of processors focused on data parallelism called Tensor Processing Units (TPUs). It’s no wonder that GPU companies such as Nvidia are focusing so much on AI in their marketing materials these days.

How are these paradigms different?

Multiprocessing is the most straightforward of these three paradigms: your web browser and your music player are entirely independent programs. They can be run on separate cores at the same time with no downside. Developers of individual applications get this kind of parallelism for free because it has been deeply embedded into modern operating systems. From phone apps to web browsers, no special work needs to be done to ensure that your program is being processed in parallel with the other programs that might be running on the same device.

Parallel processing is when a single process is split into multiple independent units. Here’s an example: suppose we want to sum up all the values in a list. With a single core we’d just scan through the list and add each item to the running total. If the list is very large we may want to break this computation up across multiple cores. Suppose we have two cores available. To process the list in parallel, we can start one core at the start of the list and have it keep a sum up to the halfway point. At the same time, we start the other core at the halfway point and have it sum the values to the end. After computing these two sums individually we have to do some additional work: specifically, we have to add those two values together.
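The split-then-merge structure of that sum can be sketched with Python’s standard concurrent.futures module. The function name parallel_sum and its parameters are made up for this illustration. One caveat: in CPython, threads share a global interpreter lock, so the thread pool used here shows the shape of the computation rather than a real speedup. Passing a ProcessPoolExecutor instead (typically under an `if __name__ == "__main__":` guard) would spread the work over separate cores.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, executor, parts=2):
    """Sum values by splitting them into up to `parts` contiguous slices."""
    # Ceiling division, so the slices cover the whole list; at least 1 so
    # that an empty list does not produce a zero step size.
    size = max(1, -(-len(values) // parts))
    chunks = [values[i:i + size] for i in range(0, len(values), size)]

    # Each slice is summed independently by a worker.
    partial_sums = list(executor.map(sum, chunks))

    # The extra merge step: add the partial sums together.
    return sum(partial_sums)


data = list(range(1, 1001))
with ThreadPoolExecutor(max_workers=2) as pool:
    total = parallel_sum(data, pool)  # one worker per half of the list
```

With parts=2 the first worker sums the first half of the list, the second worker sums the second half, and the final sum call does the one extra addition described above.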

We could use this same strategy to divide the list among any number of processors, but because there is overhead associated with this division (such as combining the results of each processor’s section) the returns are somewhat diminishing as the number of processors grows. Of course there is also a hard limit due to the number of processors available on the machine in question; most consumer CPUs on the market today have somewhere between 2 and 8 cores. If you have 4 cores, but split the list among 8 processes, you’ll have introduced additional overhead without any real benefit.
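A toy cost model shows why the returns diminish. It rests on assumptions that are not measurements of any real machine: every addition costs one step, all workers run fully in parallel, and the partial sums are merged one after another. The function name estimated_steps is invented for this sketch.

```python
import math


def estimated_steps(n_values, n_workers):
    """Estimate the steps needed to sum n_values with n_workers.

    Toy model: the busiest worker scans ceil(n_values / n_workers) items,
    then the partial sums are merged with n_workers - 1 extra additions.
    """
    per_worker = math.ceil(n_values / n_workers)
    merge = n_workers - 1
    return per_worker + merge


for workers in (1, 2, 4, 8, 16, 32, 64):
    print(workers, estimated_steps(1000, workers))
```

Under this model, summing 1000 values drops from 1000 steps on one worker to 501 on two and 132 on eight, but going from 32 workers to 64 makes the estimate rise again (from 63 to 79) because the merge work starts to dominate.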

Parallel processing can also be used to handle independent events in the same system. Web servers might create separate processes to respond to web requests from different users, or create a separate process for sending a welcome email. This sort of separation straddles the line between multiprocessing and parallel processing — as individual applications become large and complex it is sometimes more appropriate to think of them as an amalgamation of separate processes.

Data parallelism (processing a large number of data points in the same way) is usually handled differently. The list-summing example above is not data parallelism, because we are not processing each of the values in the list identically: each item gets added to the sum of all the items seen before it, and a single output value is produced. With data parallelism each input value (or values) produces its own output value (or values). We’re not reducing a list of values to a sum; we’re transforming each value according to some function. Such data can be processed using the same parallel processing approach described above, but GPUs (and even many modern CPUs) support another approach called vector processing.
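The difference between reducing and transforming is easy to see side by side. This small sketch uses Python’s built-in map and functools.reduce; the sample values and the doubling function are arbitrary choices for illustration.

```python
from functools import reduce

values = [3, 1, 4, 1, 5]

# Reduction: each step needs the running total from the step before it,
# and the whole list collapses into a single output value.
total = reduce(lambda running, value: running + value, values, 0)

# Data-parallel transformation: every output depends only on its own
# input, so there is one output per input and no ordering constraint.
doubled = list(map(lambda value: value * 2, values))
```

Here total is a single number, while doubled has exactly as many entries as values, and each entry could have been computed without knowing anything about the others.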

Vector processing requires special hardware that can process bigger inputs at once. On a traditional CPU, computations are done by loading values into small storage slots called registers and combining them using a circuit called an arithmetic logic unit (ALU). Registers hold data, and ALUs process those data. Both the ALU and the register have a fixed “width” corresponding to the amount of data they can hold or process. The phrases “32-bit architecture” and “64-bit architecture” are related to this width. On a 64-bit architecture, registers can hold 64 bits of data at once and ALUs can combine and process these 64-bit values; on a 32-bit architecture, the registers and ALUs can only hold and process 32-bit values at a time. We’re glossing over some details here, but this is the core idea.

With vector processing, registers and ALUs are made extra wide, and special instructions can be used to pack several individual 32-bit values into an extra wide register and process them all at once in an extra wide ALU. Assume we had 16 values, each 32 bits wide, all of which need to be processed in the same way. On a single 32-bit processor this would take 16 timesteps, one for each item. If we had a 128-bit vector processor, which fits four 32-bit values per register, we could do the same thing in only 4 timesteps. The vector processor’s timestep is typically longer than the single-value processor’s timestep, but a processor 4 times as wide is not 4 times as slow. This means that overall the vector processor ends up being faster in terms of throughput (values computed per unit time).
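The timestep arithmetic can be checked with a short simulation. This is only a model of the counting argument, not real vector code: the names timesteps and vector_apply are invented here, and Python still handles the values inside each chunk one at a time.

```python
def timesteps(n_values, value_bits, register_bits):
    """Count the steps needed when each step fills one register."""
    lanes = register_bits // value_bits  # values that fit in one register
    return -(-n_values // lanes)  # ceiling division


def vector_apply(func, values, lanes):
    """Apply func to values one register-sized chunk at a time.

    Returns the results along with the number of chunks ("timesteps").
    """
    results = []
    steps = 0
    for start in range(0, len(values), lanes):
        chunk = values[start:start + lanes]
        results.extend(func(v) for v in chunk)  # one wide operation in hardware
        steps += 1
    return results, steps


scalar_steps = timesteps(16, value_bits=32, register_bits=32)
vector_steps = timesteps(16, value_bits=32, register_bits=128)
results, steps = vector_apply(lambda v: v + 1, list(range(16)), lanes=4)
```

With these inputs scalar_steps is 16 and vector_steps is 4, matching the numbers in the text, and the simulated run also finishes in 4 chunks.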

GPUs and TPUs both make extensive use of vector processing, and both excel at data-parallel workloads. Compared to GPUs, TPUs focus explicitly on processing a larger number of smaller values at once, because many machine learning workloads do not need a high degree of precision. Making each individual value only 8 bits wide instead of 32 or 64 bits wide lets the vector units go even further, maximizing the number of values that can be computed during each cycle.

Finally, even for data-parallel workloads, GPUs and TPUs apply the parallel processing paradigm in addition to making use of vector processors. Huge numbers of values are split across multiple vector processors in order to make things even faster. Just to put things in perspective: a top-of-the-line Nvidia GPU has over 4000 cores.

Currently, most of this parallelization cannot be easily automated. Programmers need to specify precisely which operations can be done in parallel and write custom code to handle the parallelization. There are layers of abstraction built around this process as well. For example, TensorFlow can take ordinary Python code written using the TensorFlow library and turn it into appropriately parallelized code that can take advantage of a GPU or TPU. This can be automated in large part because of machine learning’s heavy use of matrix operations, which are easily parallelized.

As unparallelized programs continue to be outpaced by their parallel cousins, I expect more systems-level engineering efforts to focus on parallelism. Parallel programming languages and frameworks such as CUDA and OpenMP will likely gain popularity, especially in high performance areas like machine learning, graphics, and infrastructure. Compiler developers will likely endeavour to automate some forms of parallelism (similar to what TensorFlow does for machine learning workloads). And finally, programmers who can effectively parallelize their code will continue to become more valuable, especially within the world of high performance computing.