Since Stefan Behnel first presented high-performance AsyncIO with Cython and uvloop at EuroPython in 2016, Cython has gained traction for web frameworks. Various experiments have demonstrated how Cython, an extension of the Python language with efficient compilation, can match Golang and other high-performance system programming languages when building a fast HTTP server.

Cython is both an optimising static compiler and a hybrid language. It mainly gives the ability to:

write Python code that can call back and forth from and to C/C++;

add static typing using C declarations to Python code in order to boost performance;

release the GIL in some code sections.

Cython generates very efficient C code, which is then compiled into a module that Python can import. So it is an ideal language for wrapping external C libraries, and for developing C modules that speed up the execution of Python code.

However, all the experiments we are aware of that rely on Cython for system programming fall short in at least two ways:

as soon as some Python code is invoked (as opposed to pure Cython cdef code), performance degrades by one or two orders of magnitude;

benchmarks are most of the time provided for single-core execution only, which is somewhat unfair considering Golang's ability to scale up on multiple cores.

The first issue is related to Python's Global Interpreter Lock (GIL) and its highly dynamic nature. It will probably never change much. In a nutshell, the GIL prevents many forms of parallelism in Python and it can also degrade performance.
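
As a rough illustration of the GIL's effect, here is a plain Python sketch. It is not taken from the project, and the function names are made up for this example. It runs the same CPU-bound loop first sequentially and then on two threads. On a standard CPython build with the GIL enabled, the threaded run is usually no faster, because only one thread executes Python bytecode at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound pure-Python loop; it needs the GIL for every iteration."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


def run_sequential(n, jobs):
    """Run `jobs` loops one after the other; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [count_down(n) for _ in range(jobs)]
    return results, time.perf_counter() - start


def run_threaded(n, jobs):
    """Run `jobs` loops on as many threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(count_down, [n] * jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential_time = run_sequential(2_000_000, 2)
    _, threaded_time = run_threaded(2_000_000, 2)
    print(f"sequential: {sequential_time:.2f}s, 2 threads: {threaded_time:.2f}s")
```

The exact timings depend on the machine and the interpreter version, so none are quoted here.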

However, the second issue has no reason to remain, since Cython can release the GIL and run pure Cython code on multiple cores. Cython's multi-core support is already heavily used in data science to accelerate linear algebra calculations, and it supports the OpenMP framework.

We thus decided in early 2018 to start implementing a new proof of concept of an HTTP server, with the goal of reaching higher performance than Golang's HTTP servers on multiple cores.

Any good coroutine library in C?

Our first task consisted in looking for a high-performance coroutine library in C or C++. We started by evaluating a number of popular coroutine libraries, some of which are used by major Web infrastructures:

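Feature comparison of C coroutine libraries

| Feature                                    | cpc | libtask | lthread | libdill | libmill | libco |
|--------------------------------------------|-----|---------|---------|---------|---------|-------|
| non-blocking IO and network interface      | ?   | Y       | Y       | Y       | Y       | Y     |
| lightweight and fast context switch        | Y   | Y?      | ?       | ?       | ?       | ?     |
| contained memory impact                    | Y   | Y?      | ?       | ?       | ?       | ?     |
| efficient scheduler                        | N?  | N       | ?       | ?       | ?       | ?     |
| thread-safe                                | ?   | N       | Y       | Y       | Y       | Y     |
| multi-threads                              | Y   | N       | ~N      | N       | N       | N     |
| system to guarantee atomicity (mutex...)   | ?   | N       | Y?      | pthread | pthread | ?     |
| communications between tasks (channels...) | ?   | Y       | ?       | Y       | Y       | ?     |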

We were very surprised to observe that none of the libraries could fulfill our needs or match what Golang provided by default. Even libtask, created by one of the creators of Golang, still relies on global variables and thus does not support multi-threaded execution. Many libraries that claim to be multi-threaded actually are not. Even the C++ standard library for coroutines does not implement its own specification. The only prior work which seems to surpass Golang's concurrency is CPC by Gabriel Kerneis.

Obviously, someone should consider creating a feature complete coroutine library in C.

We then patched libtask so that it supports multi-threaded execution. For each coroutine library, we wrote a simple HTTP server that returns a static page without any processing. We compared the results (i.e. the number of requests per second) on a laptop with a 2-core i3 processor. We got quite surprising results:

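Performance comparison of C coroutine libraries (requests per second)

|               | 1 thread | 2 threads | Comment                                                                               |
|---------------|----------|-----------|---------------------------------------------------------------------------------------|
| Python+uvloop | 40k      | X         | Not multi-core, performance drastically decreases if server has to process a request |
| Go            | ~33k     | ~39k      | High performance and multi-core                                                       |
| libtask       | 6k - 7k  | 7k - 14k  | Slower than Go                                                                        |
| lthread       | ~4k      | ?         | Poor performance, not multi-core                                                      |
| libdill       | ~60      | ?         | Poor performance, not multi-core                                                      |
| libco         | 300 - 4k | ?         | Poor performance, not multi-core                                                      |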

Python+uvloop is indeed faster than Golang on a single core but does not scale on two cores. Its performance drastically decreases if the HTTP server has to process the request.
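
For readers unfamiliar with this kind of test server, here is a minimal sketch of an HTTP server that returns a static page, written with the standard asyncio module only. It is not the code used for the measurements above. The names handle and demo are made up for this example, and the server closes the connection after each reply to keep the sketch short. uvloop is a drop-in replacement for the asyncio event loop, so the same handler should run unchanged on top of it.

```python
import asyncio

BODY = b"Hello, World!"
RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: " + str(len(BODY)).encode() + b"\r\n"
    b"Connection: close\r\n"
    b"\r\n" + BODY
)


async def handle(reader, writer):
    """Read one request head, ignore it, and reply with the static page."""
    try:
        await reader.readuntil(b"\r\n\r\n")
    except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
        writer.close()
        return
    writer.write(RESPONSE)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo():
    """Serve on an ephemeral port, issue one request, return the raw reply."""
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        await writer.drain()
        reply = await reader.read()  # the server closes the connection after replying
        writer.close()
        await writer.wait_closed()
    return reply


if __name__ == "__main__":
    print(asyncio.run(demo()).decode())
```

Whatever the event loop, the handler itself is Python code, so it still runs under the GIL.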

Golang is surprisingly twice as fast as libtask on two cores.

All other libraries are much slower.

At that point we started losing hope.

LWAN is really fast

We then discovered by chance LWAN, a powerful HTTP server written in C by Leandro Pereira.

LWAN is a project focused on building a solid, high-performance and scalable web server. Our initial tests with LWAN exhibited much better performance than Golang: on a single core, LWAN could handle close to 90k requests per second, instead of 33k for Golang under the same conditions. Note that LWAN uses its own coroutines and scheduler.

We wrapped LWAN with Cython into a Python module (full source code). The end result is not pure Python code and contains C, but Python and C are strongly related anyway:

Python is an interpreted language; its most widely-used implementation, CPython, is written in C;

SciPy, a popular library to do mathematics, science, and engineering with Python, also takes advantage of C via Cython;

Cython is accepted by the Python community.

It is actually quite common in the Python world to combine Python, Cython and C libraries to solve a problem.

Explanation of the code

We are going to explain the code, i.e. lwan_wrapper.pyx, in more detail. First, we start at the top of the file with external declarations (to instruct Cython how to interact with C code):

    from libc.string cimport strlen

    cdef extern from "lwan/lwan.h" nogil:
        struct lwan:
            pass

        struct lwan_request:
            pass

        # Opaque buffer type, declared before lwan_response refers to it.
        struct lwan_strbuf:
            pass

        struct lwan_response:
            char *mime_type
            lwan_strbuf *buffer

        enum lwan_http_status:
            HTTP_OK

        struct lwan_url_map:
            lwan_http_status (*handler)(lwan_request *request, lwan_response *response, void *data)
            char *prefix

        void lwan_init(lwan *l)
        void lwan_set_url_map(lwan *l, lwan_url_map *map)
        void lwan_main_loop(lwan *l)
        void lwan_shutdown(lwan *l)

        bint lwan_strbuf_set_static(lwan_strbuf *s1, const char *s2, size_t sz)
        bint lwan_strbuf_printf(lwan_strbuf *s, const char *fmt, ...)

Cython does not parse header files; in any case, we need a place for annotations such as nogil, and for meaningful parameter names so that functions can be called with keyword arguments. We only need to declare the types and functions that we are going to use; struct declarations can even be empty (pass) when Cython does not need to know their members. Note that Cython already provides wrappers for the standard C library (e.g. libc.string).

Next, let's come to the handlers. Basically, handlers are functions that take care of a request and respond to the client. Our server contains two example handlers, handle_root and handle_fibonacci, which process the "/" and "/fibonacci" requests respectively. This is how they are defined:

    cdef lwan_http_status handle_root(lwan_request *request, lwan_response *response, void *data) nogil:
        ...

    cdef lwan_http_status handle_fibonacci(lwan_request *request, lwan_response *response, void *data) nogil:
        ...

Finally, we can define the url map and start the server after releasing the GIL:

    def run():
        cdef:
            lwan l
            lwan_url_map *default_map = [
                {"prefix": "/", "handler": handle_root},
                {"prefix": "/fibonacci", "handler": handle_fibonacci},
                {"prefix": NULL}
            ]

        with nogil:
            lwan_init(&l)
            lwan_set_url_map(&l, default_map)
            lwan_main_loop(&l)
            lwan_shutdown(&l)
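
To make the role of the url map more concrete, here is a plain Python sketch of prefix-based dispatch. It only mimics the idea of mapping URL prefixes to handlers. The resolve function and the Python handlers are hypothetical, and the longest-prefix rule is an assumption made for this example rather than a description of LWAN's internals.

```python
def handle_root(path):
    """Stand-in for the "/" handler."""
    return "Hello, World!"


def handle_fibonacci(path):
    """Stand-in for the "/fibonacci" handler."""
    return "Fibonacci handler"


# Same shape as default_map above: a list of (prefix, handler) pairs.
URL_MAP = [
    ("/", handle_root),
    ("/fibonacci", handle_fibonacci),
]


def resolve(path, url_map=URL_MAP):
    """Return the handler whose prefix is the longest match for `path`, or None."""
    best_handler = None
    best_length = -1
    for prefix, handler in url_map:
        if path.startswith(prefix) and len(prefix) > best_length:
            best_handler, best_length = handler, len(prefix)
    return best_handler


if __name__ == "__main__":
    for path in ("/", "/fibonacci", "/about"):
        print(path, "->", resolve(path).__name__)
```

With such a rule, "/" acts as a catch-all while "/fibonacci" takes precedence for the paths it covers.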

When everything is compiled, running the server is as easy as importing a module and calling a function in Python:

    import lwan
    lwan.run()

Have you noticed this recurring nogil annotation specified at the end of every function? This tells Cython that these functions can be called safely when the GIL is released. Thus, it means that such functions cannot interact with Python objects. However, this nogil annotation does not release the GIL; a with nogil clause must be used to do so.

Let's try it

Above are benchmarks where several HTTP servers run on 4 threads and respond to the client with a "Hello, World!" string. The Go server was implemented using two different packages:

The standard net/http is more focused on idiomaticness than performance.

github.com/valyala/fasthttp is focused on performance.

This graph shows that our server is about twice as fast as Golang net/http when doing full I/O. It is also about 1.4 times faster than fasthttp. The number of requests that servers can process in one second was measured on a machine with a 4-core i5 processor via:

wrk -t 4 -c 40 -d 20 http://127.0.0.1:8080

Here are the handlers in Golang (for both net/http and fasthttp):

    // net/http
    func handle_root(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "Hello, World!")
    }

    // fasthttp
    func fastHandleRoot(ctx *fasthttp.RequestCtx) {
        fmt.Fprint(ctx, "Hello, World!")
    }

and in Cython:

    cdef public int handle_root(lwan_request *request, lwan_response *response) nogil:
        cdef char *message = "Hello, World!"
        response.mime_type = "text/plain"
        lwan_strbuf_set_static(response.buffer, message, strlen(message))
        return HTTP_OK

Now, let's simulate real web servers by doing some work inside the HTTP handler. As an example workload, we chose to compute the 10^6-th number of the Fibonacci sequence before responding to the client. Here is the code for Cython:

    cdef unsigned int fibonacci(unsigned int n) nogil:
        cdef unsigned int i, a, b
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return a

    cdef public int handle_fibonacci(lwan_request *request, lwan_response *response) nogil:
        response.mime_type = "text/plain"
        lwan_strbuf_printf(response.buffer, "Fibonacci(10^6) = %u (with overflow)\n", fibonacci(1000000))
        return HTTP_OK

And the same one for Go:

    func fibonacci(n uint32) uint32 {
        var i, a, b uint32
        a, b = 0, 1
        for i = 0; i < n; i++ {
            a, b = b, a + b
        }
        return a
    }

    // net/http
    func handle_fibonacci(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Fibonacci(10^6) = %d (with overflow)\n", fibonacci(1000000))
    }

    // fasthttp
    func fastHandleFibo(ctx *fasthttp.RequestCtx) {
        fmt.Fprintf(ctx, "Fibonacci(10^6) = %d (with overflow)\n", fibonacci(1000000))
    }

In the end, the result of that computation does not matter: it's just to slow down the server and simulate real processing. Note that we used uint32 types in Go to keep consistency with the Python server.
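
The "(with overflow)" remark can be made explicit with a plain Python sketch. Python integers never overflow, so the sketch masks every sum to 32 bits to imitate the wraparound of a 32-bit unsigned integer. It assumes that unsigned int is 32 bits wide, as on most current platforms. The name fibonacci_u32 is made up for this example.

```python
MASK32 = 0xFFFFFFFF  # keeps the low 32 bits, like uint32 arithmetic


def fibonacci_u32(n):
    """Return the n-th Fibonacci number modulo 2**32."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, (a + b) & MASK32
    return a


if __name__ == "__main__":
    # F(47) still fits in 32 bits; F(48) is the first value to wrap around.
    print(fibonacci_u32(47), fibonacci_u32(48))
    print(fibonacci_u32(1_000_000))
```

The value returned for 10^6 is therefore only the true Fibonacci number modulo 2^32, which is enough for a workload whose sole purpose is to keep the CPU busy.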

This graph mainly shows that our Python server is capable of running on multiple cores. Performance was measured on the same machine via:

wrk -t X -c 40 -d 20 http://127.0.0.1:8080/fibonacci

where X is the number of threads that the server runs. In the Go case, the number of threads it can use was limited with runtime.GOMAXPROCS(X). In the Python case, this was specified in the LWAN config file.

To ensure the validity of this last figure, we looked at the generated assembly code for both Cython and Golang, mainly checking the section corresponding to the fibonacci() function. Indeed, the compilers could have optimized it too aggressively and computed the result (at least partially) at compile time, which would have made this benchmark unfair. In reality, no extreme or unfair optimization was done.
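
The concern behind this check is compile-time evaluation. As a loose analogy only, the same effect can be seen in miniature with CPython's own bytecode compiler, which folds constant arithmetic but does not evaluate a function call ahead of time. This sketch says nothing about the C and Go compilers inspected above. The helper name constants_of is made up, and the behaviour shown is specific to CPython.

```python
def constants_of(source):
    """Compile `source` and return the constants stored in its code object."""
    return compile(source, "<example>", "exec").co_consts


# Constant arithmetic is evaluated by the compiler: the result is stored
# directly in the code object and nothing is computed at run time.
folded = constants_of("x = 2 ** 10")

# A call is left alone: the compiler cannot know what fib will be bound to
# at run time, so the result does not appear among the constants.
not_folded = constants_of("x = fib(10)")

if __name__ == "__main__":
    print("2 ** 10 folded to 1024:", 1024 in folded)
    print("fib(10) folded to 55:", 55 in not_folded)
```

An optimizing C or Go compiler can go much further than this, which is why reading the generated assembly was the safer check.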

Want to try it out?

It should be trivial to run it on your own machine. You can find the code and installation guide in this GitLab repository. The Go servers are also provided for comparison.

Future directions

We still need to take a fresh look at Shrapnel, which was among the first libraries to rely on Cython to build high-performance HTTP servers, and at gevent, which provides a very good concurrency model in Python. We believe it should be possible to reach LWAN-level performance with gevent.

We will also evaluate LWAN's own coroutines and compare them to CPC's coroutines which are today the most efficient ones.

Acknowledgements

This article is primarily the work of Bryton Lacquement under the supervision of Hugo Ricateau at Nexedi. We would like to thank Dr. Stefan Behnel, Alexandre Gramfort, Julien Muchembled, Stéfane Fermigier, Kirill Smelkov and Leandro Pereira for their useful inputs. This article would also have been impossible to write without the existence of Leandro Pereira's LWAN library.