This is a follow-up to my previous articles about moving pythonic asynchronous IO into a separate GIL-less thread. This time I have some benchmarks of the HTTP protocol, and the results are pretty inspiring. But first things first.

Overview

The objective of the stator library is to move the things that python is not very good at into a separate thread running Rust code. The things python is not very good at are:

- Parsing network protocols byte by byte
- Executing computations in parallel with IO

When I say “not very good at” I’m talking about performance. On the flip side, python is very good for writing business logic.

This means we need to handle HTTP in the Rust thread, but we can't construct Python objects in that thread, because that would mean holding the GIL. And to keep latency minimal, we have to avoid such coarse-grained locks there.

Another important aspect of the API design is that we don't want to make it too CPython-specific. We want stator to work well on different python implementations (PyPy) and for other programming languages as well (node.js, ruby). So we opt for a simple C API which can easily be used via ctypes.

API

All that was said in the previous section is a little bit contradictory: we want to parse data in the Rust thread, but not create Python objects in it.

The solution is pretty simple: we parse the HTTP message and serialize it with some faster protocol. I chose cbor because it's pretty efficient, it's an IETF standard, and there is a pretty fast python library for parsing cbor (if you're unfamiliar with the format, think of it as a successor to msgpack). It's also cheap to serialize to cbor in Rust.
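To make that concrete, here is a minimal round-trip in Python using the cbor package; the array layout below is made up for the example and is not the actual stator wire format:

```python
import cbor

# Illustrative only: a parsed HTTP request encoded as a CBOR array.
encoded = cbor.dumps(['GET', '/hello', {'Host': 'example.com'}, b''])
method, path, headers, body = cbor.loads(encoded)
```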

That last sentence is important: while there is some evidence that in some circumstances json may be faster than cbor in python, that isn't the case in Rust. This is mostly because we don't run an actual “serialization” process for writing data: we use cbor-codec, which allows writing the pieces of data into a buffer iteratively, without intermediate memory allocations.

While we haven’t actually compared different serialization formats, the results below are good enough, so we don’t need to change it, at least in the current stage of rotor’s development.

The low-level API looks like this:
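A minimal sketch of the idea, with illustrative names rather than the actual stator identifiers: the Python side is handed the address and length of a buffer owned by the Rust thread, and deserializes directly from that memory.

```python
import ctypes
import cbor

def parse_message(buf_ptr, buf_len):
    # Wrap the Rust-owned buffer without copying it...
    buf = (ctypes.c_char * buf_len).from_address(buf_ptr)
    # ...and deserialize the CBOR message straight from that memory,
    # getting back plain python objects (lists, dicts, bytes).
    return cbor.loads(memoryview(buf))
```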

In the code above, when a message comes from rotor, we call a Python function which receives a reference to the buffer, deserializes the data from it, and returns the data as a python object. This way we avoid copying the buffer that comes from rotor.

The response API is even simpler:
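Again a sketch under the same assumptions; the library name and the stator_reply symbol are hypothetical stand-ins for the real ones:

```python
import ctypes
import cbor

stator = ctypes.CDLL('libstator.so')  # hypothetical library name

def reply(sock_id, status, headers, body):
    # Serialize the response to CBOR and hand the buffer back to the
    # Rust thread, together with the socket id it belongs to.
    payload = cbor.dumps([status, headers, body])
    stator.stator_reply(ctypes.c_uint64(sock_id), payload, len(payload))
```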

The socket identifiers in the stator API are used similarly to operating system file descriptor numbers, except that they are private to a rotor. There are more differences in the details: for example, each request gets its own socket id, and socket ids are 64-bit values that are never reused (to aid debugging). But all of these should be hidden by a higher-level API anyway. The full source code is in the stator github repository.

Benchmarks (I)

Let’s start with the most impressive benchmark. It’s just a hello world HTTP server on a desktop-grade i7–4790K CPU @ 4.00GHz:
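For context, the handler under test has no application logic at all; in terms of the illustrative reply() above (and assuming the socket id arrives alongside the parsed request), it is roughly:

```python
def hello_world(sock_id, request):
    # Every request gets the same static body.
    reply(sock_id, 200, {'Content-Type': 'text/plain'}, b'Hello World!')
```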

For comparison, here is a similar test for tornado on the same machine:

And here is a test of stator’s HTTP implementation on a laptop i5–3230M CPU @ 2.60GHz:

Before proceeding, here is one more test to set a baseline: a single-core rotor-http benchmark (just rust, no python) on the i7:

All the benchmarks are in the examples folder in the stator repository.

Benchmarks (II)

Another interesting suite of benchmarks is to actually do something in python. Let’s ask redis on localhost:
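As a sketch of the kind of handler being measured, with the synchronous redis-py client standing in for whatever client the actual benchmark uses:

```python
import redis  # redis-py, used here as a stand-in

r = redis.Redis(host='localhost', port=6379)

def handler(sock_id, request):
    # One redis round-trip per HTTP request.
    value = r.get('greeting') or b'hello'
    reply(sock_id, 200, {'Content-Type': 'text/plain'}, value)
```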

And the results are considerably slower (i7):

The remarkable thing about this benchmark is that it takes around 130% CPU, while the original hello world test uses 170–195%.

Another interesting thing is that at this level of performance you get some speed improvement just by using a unix socket instead of localhost TCP:
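With redis-py that is a one-line change (the socket path depends on how your redis instance is configured):

```python
# Same client, but over a unix domain socket instead of TCP.
r = redis.Redis(unix_socket_path='/tmp/redis.sock')
```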

Let’s try two redis operations:

And another test with ten operations in a pipeline:
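A pipeline batches the commands so that all ten travel in a single round-trip; with redis-py the sketch looks like this:

```python
# Ten GETs issued as one pipelined request/response exchange.
pipe = r.pipeline()
for i in range(10):
    pipe.get('key-%d' % i)
values = pipe.execute()
```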

Conclusions

Well, the results of the hello world benchmark are pretty impressive. When starting this work, I never expected such performance. This is especially important in the microservices era, where a microservice might be small and lightweight but suffers from the overhead of HTTP anyway.

On the other hand, I did not expect such a huge drop in performance when using redis. My expectation was that the cost of asking redis on localhost is negligible in python (because of python’s inherent slowness). But it turns out that if you optimize other things well, that cost becomes apparent. It will be even higher when redis is not on localhost, and higher still when redis is shared by multiple clients instead of being spawned just for this python process, as in the benchmark.

This brings up another point. Rather than just receiving requests in Rust, we should build a full-blown asynchronous engine which can offload Redis calls to the Rust thread and switch to another task while waiting. Redis is just an example, of course; other databases and services should be supported too.
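Purely speculative, but such an engine might expose a coroutine-style interface to Python, where the IO call suspends the task and the Rust thread does the waiting; every name below is hypothetical:

```python
def handle(request):
    # The redis call is offloaded to the Rust thread; this coroutine
    # is suspended and another task runs until the value comes back.
    value = yield redis_get('greeting')
    yield http_reply(request, 200, value)
```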

This project requires a lot more work to be production-ready, but it allows you to keep using Python for projects that have 10x more traffic than current python servers can process, and it reduces the incentive to rewrite everything in a “faster language”.

Because Medium has no good commenting system, feel free to discuss the article in the github issues.