Parallel computing is a form of computation in which two or more computations are executed at the same time. The concept is well established in computer science and engineering, yet it is widely misunderstood by the general public in forums, blogs, social media, and the comment sections of news sites.

MYTH: There is only one type of parallelism.

I often see people claiming that multithreading is the only form of parallel computing.

FACT: There are several different forms of parallel computing: bit-level, instruction-level, data-level, and task-level.

Bit-level

A form of parallel computing based on increasing the processor word size. A larger word size reduces the number of instructions the processor must execute to operate on variables wider than the word. For example, a 32-bit processor can add two 32-bit integers with a single instruction, whereas a 16-bit processor requires two instructions to complete the same operation: it must first add the 16 lower-order bits of each integer, then add the 16 higher-order bits plus the carry. Each 32-bit integer:

10110101 10110101 10110101 10110101

would be split into two 16-bit halves:

10110101 10110101 (higher-order bits)

10110101 10110101 (lower-order bits)

The advantage of bit-level parallelism is that it is independent of the application, because it operates at the processor level. The programmer writes the operation, and the hardware executes it in one step or several.
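As a rough illustration, here is the two-step addition expressed in C. This is a sketch of what the hardware does, not actual hardware behavior; the function name and the test value are made up for the example.

#include <stdint.h>
#include <stdio.h>

/* Sketch: adding two 32-bit integers the way a 16-bit processor
   must, in two steps. A 32-bit processor needs a single ADD. */
uint32_t add32_on_16bit(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)(a & 0xFFFF), a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)(b & 0xFFFF), b_hi = (uint16_t)(b >> 16);

    /* Step 1: add the 16 lower-order bits and record the carry. */
    uint32_t lo    = (uint32_t)a_lo + b_lo;
    uint16_t carry = (uint16_t)(lo >> 16);

    /* Step 2: add the 16 higher-order bits plus the carry. */
    uint16_t hi = (uint16_t)(a_hi + b_hi + carry);

    return ((uint32_t)hi << 16) | (uint16_t)lo;
}

int main(void)
{
    uint32_t x = 0xB5B5B5B5u; /* 10110101 repeated four times */
    printf("%08X\n", add32_on_16bit(x, x));
    return 0;
}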

Instruction-level

The ability to execute two or more instructions at the same time. Consider the following arithmetic operations, executed one after the other:

$$ a = a + 10 $$

$$ b = m + 3 $$

Since the second operation does not depend on the result of the first, both operations can be executed in parallel:

$$ a = a + 10 \qquad b = m + 3 $$

halving the execution time.
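In C terms the example looks as follows; the parallelism happens in the hardware, not in the source code, so this is only an illustration:

#include <stdio.h>

int main(void)
{
    int a = 5, m = 7, b;

    /* No data dependence between these two statements, so a
       superscalar core is free to execute them in the same cycle. */
    a = a + 10; /* reads and writes only a        */
    b = m + 3;  /* reads m, writes b: independent */

    printf("%d %d\n", a, b);
    return 0;
}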

Data-level

Data parallelism is the simultaneous processing of multiple data elements by applying the same operation to all of them. It is implemented in SIMD (Single Instruction, Multiple Data) architectures.

Suppose we want to move a series of objects a fixed distance along the z-axis; this is equivalent to adding the distance to the z-coordinate of each object.

$$ z1 = z1 + 61 $$

$$ z2 = z2 + 61 $$

$$ z3 = z3 + 61 $$

$$ z4 = z4 + 61 $$

$$ z5 = z5 + 61 $$

$$ z6 = z6 + 61 $$

$$ z7 = z7 + 61 $$

$$ z8 = z8 + 61 $$

In a 4-way SIMD architecture, the operation can be applied to four objects at once, reducing the number of steps by a factor of four. First the coordinates are packed into 4-wide vectors, then vector additions are executed.

$$ (z1, z2, z3, z4) = (z1, z2, z3, z4) + (61,61,61,61) $$

$$ (z5, z6, z7, z8) = (z5, z6, z7, z8) + (61,61,61,61) $$

A wider architecture reduces the step count further: an 8-way architecture could perform all eight additions in a single step.
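The same computation can be written with x86 SSE intrinsics. This is a minimal sketch assuming an SSE-capable x86 CPU and compiler; an AVX version with 8-wide registers would perform all eight additions in one instruction:

#include <stdio.h>
#include <xmmintrin.h> /* SSE: 4-wide single-precision vectors */

int main(void)
{
    float z[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128 dist = _mm_set1_ps(61.0f); /* (61, 61, 61, 61) */

    /* Two vector additions instead of eight scalar additions. */
    __m128 v0 = _mm_add_ps(_mm_loadu_ps(&z[0]), dist); /* z1..z4 */
    __m128 v1 = _mm_add_ps(_mm_loadu_ps(&z[4]), dist); /* z5..z8 */
    _mm_storeu_ps(&z[0], v0);
    _mm_storeu_ps(&z[4], v1);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", z[i]); /* prints 62 63 64 65 66 67 68 69 */
    printf("\n");
    return 0;
}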

Task-level

Task parallelism is a mode of parallelism in which different tasks are distributed among processors and executed simultaneously. Thread-level parallelism is a form of task parallelism in which an application runs multiple threads at once.
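A minimal POSIX-threads sketch of two tasks divided among processors; the task names are invented for the example (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

/* Two unrelated tasks executed simultaneously on separate threads. */
static void *mix_audio(void *arg) { (void)arg; puts("mixing audio"); return NULL; }
static void *update_ai(void *arg) { (void)arg; puts("updating AI");  return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, mix_audio, NULL);
    pthread_create(&t2, NULL, update_ai, NULL);
    pthread_join(t1, NULL); /* wait for both tasks to finish */
    pthread_join(t2, NULL);
    return 0;
}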

MYTH: Everything can be made parallel.

People in forums often accuse game developers of being “lazy”, claiming this is the reason games do not scale to 32 cores.

FACT: There are limits to the amount of parallelism that can be applied to game-oriented routines.

Consider the quadratic equation

$$ ax^{2}+bx+c=0 $$

whose solutions are:

$$ x1=\frac{{-b}+\sqrt{b^{2}-4ac}}{2a} $$ $$ x2=\frac{{-b}-\sqrt{b^{2}-4ac}}{2a} $$

The elementary operations needed are:

$$ n1=b\times b $$

$$ n2=4\times a\times c $$

$$ n3=n1-n2 $$

$$ n4=\sqrt{n3} $$

$$ n5={-b}+n4 $$

$$ n6={-b}-n4 $$

$$ n7=2\times a $$

$$ x1=n5\div n7 $$

$$ x2=n6\div n7 $$

Some of those operations are independent, but others are not. For instance, we cannot compute n3 without first knowing the values of n1 and n2, and we cannot compute x1 and x2 without first knowing the value of the denominator n7. The maximum achievable parallelism is:

Step 1 (three independent operations): $$ n1=b\times b \qquad n2=4\times a\times c \qquad n7=2\times a $$

Step 2: $$ n3=n1-n2 $$

Step 3: $$ n4=\sqrt{n3} $$

Step 4 (two independent operations): $$ n5={-b}+n4 \qquad n6={-b}-n4 $$

Step 5 (two independent operations): $$ x1=n5\div n7 \qquad x2=n6\div n7 $$

Nine operations compress into five steps, and no further. The limit on parallelism illustrated by this simple example is not a consequence of lazy programming; the problem cannot be parallelized further because of the data dependences among the operations.
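The dependence chain is easy to see when the schedule is written out as C, with the steps marked in comments; the grouping is the point of the sketch, since the serial code itself runs the steps one after the other:

#include <math.h>
#include <stdio.h>

/* Operations inside one step are mutually independent and could run
   in parallel; the steps themselves must run in order. */
void solve(double a, double b, double c, double *x1, double *x2)
{
    double n1 = b * b;       /* step 1: three independent operations */
    double n2 = 4.0 * a * c; /* step 1 */
    double n7 = 2.0 * a;     /* step 1 */

    double n3 = n1 - n2;     /* step 2: needs n1 and n2 */
    double n4 = sqrt(n3);    /* step 3: needs n3        */

    double n5 = -b + n4;     /* step 4: two independent operations */
    double n6 = -b - n4;     /* step 4 */

    *x1 = n5 / n7;           /* step 5: two independent divisions */
    *x2 = n6 / n7;           /* step 5 */
}

int main(void)
{
    double x1, x2;
    solve(1.0, -3.0, 2.0, &x1, &x2);      /* x^2 - 3x + 2 = 0 */
    printf("x1 = %g, x2 = %g\n", x1, x2); /* prints 2 and 1   */
    return 0;
}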

Game developers are bound by similar limits. A game is essentially a sequential algorithm in which the state at any instant evolves as a function of the player's actions. The main algorithm is:

$$ State1\rightarrow UserAction1\rightarrow State2\rightarrow UserAction2\rightarrow State3\rightarrow\cdots $$

Some components of the game state can be split from the main algorithm and run as subtasks on separate threads; background music and artificial intelligence are two examples. However, those subtasks are not fully disconnected from the main algorithm, because they depend on the decisions the player makes during gameplay. For instance, going left in a shooter could mean the player finds fifty computer-controlled enemies in the next room, whereas going right could mean finding a weapons arsenal. The thread that runs the artificial-intelligence subtask must therefore stay synchronized with the thread that runs the main algorithm. Modern games usually consist of a main thread that runs the core of the game and four or more worker threads that run subtasks synchronized by the main thread. A CPU with twelve cores at 2 GHz could run more subtasks than a CPU with six cores at 4 GHz, but the CPU with faster cores executes the main thread much faster. In general, the CPU with six cores at 4 GHz is better for gaming.
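A toy sketch of the serial core of a game loop; every name here is hypothetical, and a real engine is vastly more elaborate, but the dependence structure is the same: each state needs the previous state plus the player's action, so the chain cannot be spread across cores:

#include <stdio.h>

typedef int GameState;
typedef int UserAction;

static UserAction poll_input(void) { return 1; } /* stands in for the player */
static GameState advance(GameState s, UserAction a) { return s + a; }

int main(void)
{
    GameState state = 0;
    for (int frame = 0; frame < 3; frame++) {
        UserAction act = poll_input(); /* depends on the player  */
        state = advance(state, act);   /* depends on prior state */
        printf("frame %d -> state %d\n", frame, state);
    }
    return 0;
}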

MYTH: CPUs have increased IPC by only about 5% per generation in recent years because Intel rested on its laurels. Now that AMD is competitive again, we will see giant jumps in IPC soon.

FACT: The x86 ISA has reached a limit and cannot evolve further.

The x86 ISA is a serial ISA: instructions are scheduled in linear order when the compiler transforms a program into x86 instructions. Consider the example:

$$ a=a+10 $$ $$ b=m+3 $$

The compiler would generate code such as:

mov ecx, 10   ; load the constant 10
mov edx, 3    ; load the constant 3
add eax, ecx  ; a = a + 10 (a held in eax)
add ebx, edx  ; b = m + 3  (m held in ebx, result is b)

This is a sequence of x86 instructions. Modern x86 cores such as Zen or Skylake are superscalar, out-of-order microarchitectures. Superscalar means the core can execute more than one instruction per cycle. Out-of-order means it can execute instructions in an order different from the one defined by the compiler. At run time, these cores load the above sequence of instructions from memory or cache, decode them, and analyze them to find dependences, generating a parallel schedule that reduces the execution time. Herein lies the problem: the hardware structures needed to transform a sequence of x86 instructions into an optimized parallel schedule are very complex and power-hungry. In a superscalar core, the IPC is given by:

$$ IPC=\alpha W^{\beta} $$

where W is the width of the core and α and β are empirical constants, with β below one, so IPC grows more slowly than the hardware width.
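To make the diminishing returns concrete, consider the ratio obtained when the width is doubled; the value β = 0.5 below is assumed purely for illustration, not a measured figure:

$$ \frac{IPC_{2W}}{IPC_{W}}=\frac{\alpha\,(2W)^{\beta}}{\alpha\,W^{\beta}}=2^{\beta}\approx 1.41\qquad(\beta=0.5) $$

Under this assumption, doubling the width of the core yields only about 41% more IPC, while the complexity and power cost of the scheduling hardware grow much faster.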