Parallel Processing in Python¶

author: Jacob Schreiber

contact: jmschr@cs.washington.edu

Simply put, parallel processing means splitting a task across many CPUs so that it finishes faster or uses resources more efficiently. In the big data setting this often means speeding up complex analyses by dividing the work across cores, with the speedup bounded by the number of cores added. This is made easier because many big data tasks are embarrassingly parallel, meaning they decompose into independent pieces that can be handed out to cores with little effort. For example, matrix-matrix multiplication can be time intensive for big matrices, but each row-column dot product is independent of the others and so can be given to a core without any need for mid-task communication between cores. This is great for parallel processing.
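Here is a minimal sketch of that idea: each row of the product `A @ B` depends only on one row of `A` and all of `B`, so rows can be computed in separate worker processes. This is purely illustrative (the serialization overhead makes it far slower than numpy's own `A @ B` in practice), and the helper `multiply_row` is just a name chosen for this example.

```python
import numpy as np
from multiprocessing import Pool

def multiply_row(args):
    """Compute one row of the product A @ B from one row of A and all of B."""
    row, B = args
    return row @ B

if __name__ == '__main__':
    A = np.random.randn(1000, 500)
    B = np.random.randn(500, 800)

    # Each (row, B) task is independent, so the pool can hand them out
    # to workers without any communication between cores mid-task.
    with Pool(4) as pool:
        rows = pool.map(multiply_row, [(row, B) for row in A])

    C = np.vstack(rows)
    assert np.allclose(C, A @ B)
```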

While languages like C, Java, and R make parallel processing fairly easy, life is harder for the Python programmer because of the Global Interpreter Lock (or GIL). This is a lock on a Python process that prevents more than one thread from executing Python bytecode at a time.

What does that even mean, though?

A process in a computer can be depicted as the following:

(ref: https://web.kudpc.kyoto-u.ac.jp/manual/en/parallel)

Your computer allocates some memory to each process, which can be thought of as basically a "program." Each process is isolated from the others, and no process is allowed to touch the memory space of another. If a process does try to reach into a neighbor's space, the result is everyone's favorite error, the segfault.
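A small sketch of this isolation, using the standard multiprocessing module: the child process works on its own copy of the parent's memory, so a change it makes never shows up in the parent. (The `counter` variable is just an illustrative name.)

```python
from multiprocessing import Process

counter = 0

def increment():
    global counter
    counter += 1                      # modifies only the child's own copy
    print("child sees:", counter)     # prints 1

if __name__ == '__main__':
    p = Process(target=increment)
    p.start()
    p.join()
    print("parent sees:", counter)    # still 0 -- the parent's memory is untouched
```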

Inside a process there can be multiple "threads" of execution. These threads share the underlying memory of the process, and can each be assigned to a different core for execution. This makes parallelization nice, because a process can load up some data and then process it using multiple cores, much like many people might eat a single birthday cake.
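In contrast to the process example above, threads live inside one process and see the same memory. A minimal sketch with the standard threading module, where every thread appends to the same list:

```python
import threading

results = []

def work(name):
    results.append(name)    # all threads see and modify the same list

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)    # e.g. [0, 1, 2, 3] (order may vary) -- one process, shared memory
```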