Have you ever been in a hurry to complete a task? Most likely yes. There are two ways of minimizing your time spent on problems: one of them is to run faster… and the other one is to ask for help.

In many cases where the task is number crunching on a computer, programmers will move from a prototype in Python to a faster implementation in a different language (C++, Fortran or whatever). But what about Python itself? We will show that it is often not necessary to move to C++ just for more performance. We solve our problem with Numba and Dask. Using these libraries in tandem makes Python a parallel number-crunching powerhouse.

Parallel vs serial

Conceptually, a computer is a machine that processes tasks. No more and no less. Most modern computers, including personal laptops, have many cores. Each core can process a task independently of the rest.

Most often, we use computers to perform collections of tasks. If those tasks are performed one after the other, we are talking about serial computation (see first figure).

If at least some tasks can be performed simultaneously (as in the second figure), we are talking about parallel computation. By assigning one task to each of three cores, we roughly triple the speed of the whole process.

Pea soup and task parallelization

Here’s a recipe for Dutch snert (it sounds even funnier when pronounced in English!), as it is given by one of the larger chains of supermarkets in the Netherlands (translated by Google’s translator app):

If you look closely, it says at point 2 “Gently simmer in about 60 minutes”, followed by point 3, telling you to cut vegetables, and points 4 & 5, adding the vegetables to the soup. Any reasonable cook would cut the vegetables while the soup is gently simmering for those 60 minutes. Most traditional computer languages lack the power to express this idea succinctly, and will not interleave these tasks for you. We can do it, but we need to pull some tricks out of a hat. We will show how this can be done in Python using Dask.

We put the recipe in terms of Python functions, and make sure all our functions are pure: they only take arguments and give back a new result. Pure functions do not mutate their arguments or any global state.

from dask import delayed
from pint import UnitRegistry, Quantity

units = UnitRegistry()
# Always use units, or else: https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_failure

Container = list  # we model the pan as a plain list of ingredients

To add vegetables to the pan, we define the add_vegetable action:

@delayed
def add_vegetable(pan: Container, ingredient) -> Container:
    return pan + [ingredient]

Assuming the pan is a list of ingredients that went into the soup, this creates a new pan with the ingredient added. Similarly, we can simmer the soup and cut vegetables:

@delayed
def simmer(pan: Container, time: Quantity) -> Container:
    print("simmering for {:~P}".format(time))
    return pan

@delayed
def cut_vegetable(ingredient):
    return "sliced " + ingredient

Given these actions we can write down the recipe (in reduced form):

def snert(pan):
    pan = add_vegetable(pan, "peas")
    pan = simmer(pan, 60 * units.min)
    veggies = cut_vegetable("vegetables")
    pan = add_vegetable(pan, veggies)
    pan = simmer(pan, 20 * units.min)
    return pan

Notice that each action is decorated with @delayed to tell Python that it is a task within a recipe. However, the recipe itself is not decorated: Dask builds the recipe from the elemental actions that we provide it with.

We can visualize the recipe as a directed acyclic graph (or DAG for short) by creating a node for each action, and an arrow for the data flow. In Dask this is done by calling snert([]).visualize().

We now have the recipe for snert in a more abstract form (a DAG), from which the parallelism that is inherent in the recipe is immediately visible.
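Building the graph does not cook anything yet: Dask only executes the tasks when we ask for the result with compute(). Here is a minimal self-contained sketch of that, assuming the pan is a plain Python list and using unitless minutes instead of pint quantities to keep it short:

```python
from dask import delayed

@delayed
def add_vegetable(pan, ingredient):
    # Pure function: returns a *new* pan instead of mutating the old one
    return pan + [ingredient]

@delayed
def simmer(pan, minutes):
    print("simmering for", minutes, "minutes")
    return pan

@delayed
def cut_vegetable(ingredient):
    return "sliced " + ingredient

def snert(pan):
    pan = add_vegetable(pan, "peas")
    pan = simmer(pan, 60)
    # cutting does not depend on simmering, so Dask may run them in parallel
    pan = add_vegetable(pan, cut_vegetable("vegetables"))
    pan = simmer(pan, 20)
    return pan

# snert([]) only builds the task graph; compute() actually runs it
result = snert([]).compute()
print(result)  # -> ['peas', 'sliced vegetables']
```

Because cut_vegetable has no arrow coming from simmer in the DAG, the scheduler is free to cut the vegetables while the soup simmers, exactly as a reasonable cook would.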

Our toy algorithm: π’s in the rain

In order to witness the advantages of parallelization we need an algorithm that is 1. parallelizable and 2. complex enough to take a few seconds of CPU time. In order not to scare away the interested reader, we need this algorithm to be understandable and, if possible, interesting. We chose a classical algorithm for demonstrating parallel programming: estimating the value of the number π.

The algorithm we are presenting is one of the classical examples of the power of Monte-Carlo methods. This is an umbrella term for several algorithms that use random numbers to approximate exact results. We chose this algorithm because of its simplicity and straightforward geometrical interpretation.

First, we have to refresh a couple of results from high school geometry. You probably remember that the area of a square of side L is given by Aₛ = L², and the area of a circle of radius R is given by Aₒ = πR².

A circle of radius R can be inscribed in a square of side L = 2R (see figure). The ratio between the areas is thus Aₒ/Aₛ = πR²/4R² = π/4.

This means that one possible (but very inefficient) way of estimating the number π is: π~4Aₒ/Aₛ.

We can now imagine the following mental experiment: if we let rain droplets fall on our figure, it is reasonable to expect the number of droplets inside the circle to be proportional to its area. The same is true for the square. Consequently, the ratio between the number of droplets inside the circle, M, and the number of droplets inside the square, N, should satisfy π ≈ 4M/N, where the approximation becomes better as the number of droplets grows.

An interactive version of the algorithm, used for teaching, can be found here.

Implementation (in Python)

Instead of using rain droplets, we’ll use a random number generator to simulate such a rain. In particular, we draw a coordinate x and a coordinate y from a uniform distribution, and use them as the coordinates of a “droplet” impact.

Depending on your level of Python expertise, you might implement such an algorithm in different ways.

In an introductory Python course we could write this function as follows.

import random

def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N

This implementation has serious shortcomings when it comes to performance. We learn how to fix this in the second week of learning Python, when we encounter NumPy! We show here another way of implementing the same idea, this time taking advantage of NumPy’s vectorized notation. Used smartly, it avoids the need for explicit for loops.

import numpy as np

def calc_pi_numpy(N):
    # Simulate impact coordinates
    pts = np.random.uniform(-1, 1, (2, N))
    # Count the number of impacts inside the circle
    M = np.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N

This implementation is a lot faster than the first one, but it still uses only a single core. Surely we can do better than that! We can parallelize this algorithm by running it several times and then computing the mean of the outputs.

The problem of computing π is representative of a family of problems known as embarrassingly parallel. Roughly, this means that parallelizing the algorithm should be easy. If we want to use, say, 80000 random points, and we have 8 cores, we can split our problem into 8 subproblems of 10000 points each. This gives 8 different estimates, which we then combine by taking their mean.
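With Dask, this split-and-average scheme takes only a few lines. The sketch below wraps each chunk of the calculation in a delayed task, makes the mean itself a task, and lets compute() schedule the independent chunks over the available cores (reusing the NumPy implementation from above):

```python
import numpy as np
from dask import delayed

def calc_pi_numpy(N):
    # Simulate N impact coordinates and count those inside the circle
    pts = np.random.uniform(-1, 1, (2, N))
    M = np.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N

# Build 8 independent tasks of 10000 points each (80000 points in total);
# nothing is computed yet, we only describe the work
chunks = [delayed(calc_pi_numpy)(10_000) for _ in range(8)]

# Taking the mean is itself a task depending on all 8 chunks;
# compute() runs the whole graph, scheduling chunks over the cores
pi_estimate = delayed(np.mean)(chunks).compute()
print(pi_estimate)
```

Note that the chunks have no dependencies on each other in the resulting DAG, which is exactly what makes the problem embarrassingly parallel: the only synchronization point is the final mean.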