10x Faster Parallel Python Without Python Multiprocessing

Faster Python without restructuring your code

While Python’s multiprocessing library has been used successfully for a wide range of applications, in this blog post, we show that it falls short for several important classes of applications including numerical data processing, stateful computation, and computation with expensive initialization. There are two main reasons:

Inefficient handling of numerical data.

Missing abstractions for stateful computation (i.e., an inability to share variables between separate “tasks”).

Ray is a fast, simple framework for building and running distributed applications that addresses these issues. For an introduction to some of the basic concepts, see this blog post. Ray leverages Apache Arrow for efficient data handling and provides task and actor abstractions for distributed computing.

This blog post benchmarks three workloads that aren’t easily expressed with Python multiprocessing and compares Ray, Python multiprocessing, and serial Python code. Note that it’s important to always compare to optimized single-threaded code.

In these benchmarks, Ray is 10–30x faster than serial Python, 5–25x faster than multiprocessing, and 5–15x faster than the faster of these two on a large machine.

On a machine with 48 physical cores, Ray is 9x faster than Python multiprocessing and 28x faster than single-threaded Python. Error bars are depicted, but in some cases are too small to see. Code for reproducing these numbers is available below. The workload is scaled to the number of cores, so more work is done on more cores (which is why serial Python takes longer on more cores).

The benchmarks were run on EC2 using the m5 instance types (m5.large for 1 physical core and m5.24xlarge for 48 physical cores). Code for running all of the benchmarks is available here. Abbreviated snippets are included in this post. The main differences are that the full benchmarks include 1) timing and printing code, 2) code for warming up the Ray object store, and 3) code for adapting the benchmark to smaller machines.

To learn more about how companies use Ray in production to speed up their Python applications, register for the free Ray Summit.

Learn how companies use Ray in production to speed up their Python applications at Ray Summit 2020.

Benchmark 1: Numerical Data

Many machine learning, scientific computing, and data analysis workloads make heavy use of large arrays of data. For example, an array may represent a large image or dataset, and an application may wish to have multiple tasks analyze the image. Handling numerical data efficiently is critical.

Each pass through the for loop below takes 0.84s with Ray, 7.5s with Python multiprocessing, and 24s with serial Python (on 48 physical cores). This performance gap explains why it is possible to build libraries like Modin on top of Ray but not on top of other libraries.

The code looks as follows with Ray.

Code for a toy image processing example using Ray.

By calling ray.put(image) , the large array is stored in shared memory and can be accessed by all of the worker processes without creating copies. This works not just with arrays but also with objects that contain arrays (like lists of arrays).

When the workers execute the f task, the results are again stored in shared memory. Then when the script calls ray.get([...]) , it creates numpy arrays backed by shared memory without having to deserialize or copy the values.

These optimizations are made possible by Ray’s use of Apache Arrow as the underlying data layout and serialization format as well as the Plasma shared-memory object store.

The code looks as follows with Python multiprocessing.

Code for a toy image processing example using multiprocessing.

The difference here is that Python multiprocessing uses pickle to serialize large objects when passing them between processes. This approach requires each process to create its own copy of the data, which adds substantial memory usage as well as overhead for expensive deserialization, which Ray avoids by using the Apache Arrow data layout for zero-copy serialization along with the Plasma store.

Benchmark 2: Stateful Computation

Workloads that require substantial “state” to be shared between many small units of work are another category of workloads that pose a challenge for Python multiprocessing. This pattern is extremely common, and I illustrate it hear with a toy stream processing application.

On a machine with 48 physical cores, Ray is 6x faster than Python multiprocessing and 17x faster than single-threaded Python. Python multiprocessing doesn’t outperform single-threaded Python on fewer than 24 cores. The workload is scaled to the number of cores, so more work is done on more cores (which is why serial Python takes longer on more cores).

State is often encapsulated in Python classes, and Ray provides an actor abstraction so that classes can be used in the parallel and distributed setting. In contrast, Python multiprocessing doesn’t provide a natural way to parallelize Python classes, and so the user often needs to pass the relevant state around between map calls. This strategy can be tricky to implement in practice (many Python variables are not easily serializable) and it can be slow when it does work.

Below is a toy example that uses parallel tasks to process one document at a time, extract the prefixes of each word, and return the most common prefixes at the end. The prefix counts are stored in the actor state and mutated by the different tasks.

This example takes 3.2s with Ray, 21s with Python multiprocessing, and 54s with serial Python (on 48 physical cores).

The Ray version looks as follows.

Code for a toy stream processing example using Ray.

Ray performs well here because Ray’s abstractions fit the problem at hand. This application needs a way to encapsulate and mutate state in the distributed setting, and actors fit the bill.

The multiprocessing version looks as follows.

Code for a toy stream processing example using multiprocessing.

The challenge here is that pool.map executes stateless functions meaning that any variables produced in one pool.map call that you want to use in another pool.map call need to be returned from the first call and passed into the second call. For small objects, this approach is acceptable, but when large intermediate results needs to be shared, the cost of passing them around is prohibitive (note that this wouldn’t be true if the variables were being shared between threads, but because they are being shared across process boundaries, the variables must be serialized into a string of bytes using a library like pickle).

Because it has to pass so much state around, the multiprocessing version looks extremely awkward, and in the end only achieves a small speedup over serial Python. In reality, you wouldn’t write code like this because you simply wouldn’t use Python multiprocessing for stream processing. Instead, you’d probably use a dedicated stream-processing framework. This example shows that Ray is well-suited for building such a framework or application.

One caveat is that there are many ways to use Python multiprocessing. In this example, we compare to Pool.map because it gives the closest API comparison. It should be possible to achieve better performance in this example by starting distinct processes and setting up multiple multiprocessing queues between them, however that leads to a complex and brittle design.

Benchmark 3: Expensive Initialization

In contrast to the previous example, many parallel computations don’t necessarily require intermediate computation to be shared between tasks, but benefit from it anyway. Even stateless computation can benefit from sharing state when the state is expensive to initialize.

Below is an example in which we want to load a saved neural net from disk and use it to classify a bunch of images in parallel.

On a machine with 48 physical cores, Ray is 25x faster than Python multiprocessing and 13x faster than single-threaded Python. Python multiprocessing doesn’t outperform single-threaded Python in this example. Error bars are depicted, but in some cases are too small to see. The workload is scaled to the number of cores, so more work is done on more cores. In this benchmark, the “serial” Python code actually uses multiple threads through TensorFlow. The variability of the Python multiprocessing code comes from the variability of repeatedly loading the model from disk, which the other approaches don’t need to do.

This example takes 5s with Ray, 126s with Python multiprocessing, and 64s with serial Python (on 48 physical cores). In this case, the serial Python version uses many cores (via TensorFlow) to parallelize the computation and so it is not actually single threaded.

Suppose we’ve initially created the model by running the following.

Code for saving a neural network model to disk.

Now we wish to load the model and use it to classify a bunch of images. We do this in batches because in the application the images may not all become available simultaneously and the image classification may need to be done in parallel with the data loading.

The Ray version looks as follows.

Code for a toy classification example using Ray.

Loading the model is slow enough that we only want to do it once. The Ray version amortizes this cost by loading the model once in the actor’s constructor. If the model needs to be placed on a GPU, then initialization will be even more expensive.

The multiprocessing version is slower because it needs to reload the model in every map call because the mapped functions are assumed to be stateless.

The multiprocessing version looks as follows. Note that in some cases, it is possible to achieve this using the initializer argument to multiprocessing.Pool . However, this is limited to the setting in which the initialization is the same for each process and doesn’t allow for different processes to perform different setup functions (e.g., loading different neural network models), and doesn’t allow for different tasks to be targeted to different workers.

Code for a toy classification example using multiprocessing.

What we’ve seen in all of these examples is that Ray’s performance comes not just from its performance optimizations but also from having abstractions that are appropriate for the tasks at hand. Stateful computation is important for many many applications, and coercing stateful computation into stateless abstractions comes at a cost.

Run the Benchmarks

Before running these benchmarks, you will need to install the following.

pip install numpy psutil ray scipy tensorflow

Then all of the numbers above can be reproduced by running these scripts.

If you have trouble installing psutil , then try using Anaconda Python.

The original benchmarks were run on EC2 using the m5 instance types (m5.large for 1 physical core and m5.24xlarge for 48 physical cores).

In order to launch an instance on AWS or GCP with the right configuration, you can use the Ray autoscaler and run the following command.

ray up config.yaml

An example config.yaml is provided here (for starting an m5.4xlarge instance).

More About Ray

While this blog post focuses on benchmarks between Ray and Python multiprocessing, an apples-to-apples comparison is challenging because these libraries are not very similar. Differences include the following.

Ray is designed for scalability and can run the same code on a laptop as well as a cluster (multiprocessing only runs on a single machine).

Ray workloads automatically recover from machine and process failures.

Ray is designed in a language-agnostic manner and has preliminary support for Java.

More relevant links are below.

To learn more about how companies are using Ray to scale and speed up their Python applications in production, join as at the upcoming (free) Ray Summit!