A key step in our data pipeline requires loading a few hundred thousand small JSON files from S3. Although Python doesn't have great support for concurrency (thank you, GIL), asynchronous I/O is one context where it can still significantly improve performance. We've experimented with the two obvious methods for asynchronous I/O: threading and asyncio. Although the "asyncio" name indicates that it's literally designed to solve our problem, we think the threading module is the superior choice. Asyncio does feel "more correct" and has marginally less overhead, but it's also less mature and requires deeper restructuring of code to use.

Let’s start off by looking at the core code for each of the two approaches. The aiobotocore library makes it easy to use async with Amazon's botocore library (more on that in a moment), while we can code up the threading approach from scratch. If you're interested, you can view all of the code for this post on GitHub. We've even created a public S3 bucket with 100,000 small text files in case you're interested in running the code yourself.
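The real code is in the linked repo; as a rough sketch of the two shapes, with `fetch_object_sync` and `fetch_object_async` standing in for the actual boto3 and aiobotocore calls (all names here are illustrative, not the repo's code), the threading version spawns one thread per key while the asyncio version gathers one coroutine per key:

```python
import asyncio
import threading

def fetch_object_sync(key, results, index):
    # Stand-in for creating a boto3 client and calling
    # client.get_object(Bucket=..., Key=key) inside a thread.
    results[index] = f"contents of {key}"

def download_threaded(keys):
    # One thread per object -- the naive approach discussed below.
    results = [None] * len(keys)
    threads = [
        threading.Thread(target=fetch_object_sync, args=(key, results, i))
        for i, key in enumerate(keys)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

async def fetch_object_async(key):
    # Stand-in for `await client.get_object(...)` with an aiobotocore client.
    await asyncio.sleep(0)
    return f"contents of {key}"

async def download_async(keys):
    # One coroutine per object; the event loop multiplexes the I/O.
    return await asyncio.gather(*(fetch_object_async(key) for key in keys))
```

Either entry point returns the object bodies in key order; the asyncio version is driven with `asyncio.run(download_async(keys))`.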

The function that uses async is slightly shorter, but that’s mostly because the aiobotocore module hides some of the boilerplate from us. Assuming you pick an appropriately high value for concurrency, both methods can download 30k objects in about a minute and a half, with asyncio performing slightly faster:

Downloading different numbers of objects

Next, let’s look at how performance varies with different numbers of objects to download:

Experimenting with different levels of concurrency

Creating threads has high overhead: about 0.01 seconds to create the thread itself, and a surprising 0.05 seconds to create a boto3 client. (All performance numbers in this article are from my older-than-I’d-like MacBook Pro working over a good-quality WiFi connection.) This led to a huge performance drag when retrieving small numbers of objects. With 100 threads, retrieving 100 objects takes 6 seconds, compared to 1.39 seconds for the async function and 1.70 seconds with a more reasonable 10 threads. We tried to ameliorate this by, for example, deferring client creation until the thread had a key to retrieve, but the real solution (of course) is simply not to use one thread per object: given the per-thread overhead, spawning far more threads than you need is a performance problem in its own right. By contrast, setting a high concurrency value for aiobotocore didn’t cause any problems for us.
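One way to stop paying the client-creation cost per object is a fixed pool of workers, each creating its client once and then draining a shared queue of keys. This is a sketch of that pattern, with a fake client standing in for `boto3.client("s3")` (the class and its `get` method are placeholders, not boto3's API):

```python
import queue
import threading

class FakeClient:
    """Placeholder for boto3.client("s3"); get() stands in for get_object."""
    def get(self, key):
        return f"contents of {key}"

def download_pooled(all_keys, num_threads=10):
    # Each worker creates its client once and then drains the queue,
    # so the ~0.05s client-creation cost is paid per thread, not per key.
    work = queue.Queue()
    for key in all_keys:
        work.put(key)
    results = {}
    lock = threading.Lock()

    def worker():
        client = FakeClient()  # real code: boto3.client("s3")
        while True:
            try:
                key = work.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker is done
            body = client.get(key)
            with lock:
                results[key] = body

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With this shape, the thread count becomes a tuning knob independent of the number of objects, which is exactly the decision the next section is about.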

On the other hand, creating threads forces an explicit decision about the degree of concurrency. Aiobotocore defaults to a pool of ten connections, which is too few for our use case. (For a while, we mistakenly believed that the default was 100 connections and that aiobotocore was just slower than threading.) Even though libraries like aiobotocore and aiohttp don’t require you to explicitly set the level of concurrency, you should still experiment with it.
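In aiobotocore the pool size comes from the client config (botocore's `max_pool_connections` setting), but you can also cap in-flight requests yourself with a semaphore, which works with any async library. A sketch, again with a stub coroutine in place of the real S3 call:

```python
import asyncio

async def fetch_object(key):
    # Stand-in for an aiobotocore get_object call.
    await asyncio.sleep(0)
    return f"contents of {key}"

async def download_bounded(keys, concurrency=100):
    # Cap in-flight requests explicitly rather than relying on a
    # library default (aiobotocore's default pool is ten connections).
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(key):
        async with sem:
            return await fetch_object(key)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded_fetch(k) for k in keys))

results = asyncio.run(download_bounded([f"key-{i}" for i in range(1000)]))
```

The semaphore makes the concurrency level a single explicit parameter, which is handy when you want to run the kind of experiments described above.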