As anticipated, let's get into where it's all going wrong, especially for database-related code.

Remember, I wasn't even trying to prove that asyncio is significantly slower than threads; only that it wasn't any faster. The results I got were more dramatic than I expected. We also see that an extremely low-latency async approach, e.g. that of gevent, is slower than threads, but not by much. This confirms, first, that async IO is definitely not faster in this scenario; and second, since asyncio is so much slower than gevent, that it is the in-Python overhead of asyncio's coroutines and other Python constructs that is likely adding very significant latency on top of the already less efficient IO-based context switching.

Above, we see asyncio significantly slower for the first part of the run (Python 3.4 seemed to have some issue here in both the threaded and asyncio versions), and for the second part, fully twice as slow compared to both the Python 2.7 and Python 3.4 interpreters using threads. Even running 350 concurrent connections, which is far more than you'd usually ever want a single process to run, asyncio could hardly approach the efficiency of threads. Even with the very fast, pure-C psycopg2 driver, the overhead of the aiopg library, combined with the need for in-Python receipt of polling results from psycopg2's asynchronous API, added more than enough Python overhead to slow the script right down.

The results of several runs on different machines under different conditions are summarized at the bottom of the README. The best performance I could get was running the Python code on one laptop interfacing with the Postgresql database on another, but in virtually every test I ran, whether it was 15 threads/coroutines on my Mac or 350 (!) threads/coroutines on my Linux laptop, threaded code got the job done much faster than asyncio in every case (including the 350 threads case, to my surprise), and usually faster than gevent as well. Below are the results from running 120 threads/processes/connections on the Linux laptop networked to the Postgresql database on a Mac laptop:

The purpose of the test suite is to load a few million rows into a Postgresql database as fast as possible, using the same general set of SQL instructions in each case, so that we can see whether the GIL in fact slows us down so much that asyncio blows right past us with ease. The suite can use any number of connections simultaneously; at the highest, I boosted it up to 350 concurrent connections, which, trust me, will not make your DBA happy at all.
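The general shape of such a harness, minus the actual database, can be sketched with stdlib threads and a queue. The `run_sql` stand-in below is hypothetical; in the real suite each worker would hold its own DBAPI connection and execute the INSERT/SELECT statements against it:

```python
import queue
import threading

def run_sql(item):
    # stand-in for the per-row INSERT/SELECT work a real DBAPI connection would do
    return item * 2

def worker(q, results):
    # in the real suite, each worker owns one database connection
    while True:
        item = q.get()
        if item is None:
            q.task_done()
            break
        results.append(run_sql(item))
        q.task_done()

q = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(10)]
for t in threads:
    t.start()
for i in range(1000):
    q.put(i)
for _ in threads:
    q.put(None)  # one shutdown sentinel per worker
q.join()
for t in threads:
    t.join()
print(len(results))
```

The asyncio and gevent variants of the suite have the same shape; only the worker body and the scheduling mechanism change.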

So, I will here present a comprehensive test suite that pits traditional threads in Python against asyncio, as well as gevent-style nonblocking IO. We will use psycopg2, currently the only production DBAPI that even supports async, in conjunction with aiopg, which adapts psycopg2's async support to asyncio, and psycogreen, which adapts it to gevent.

To which many people said to me, "so what? Your database call is where most of the time is spent". Never mind that we're not talking here about optimizing existing code, but about not making perfectly fine code slower than it already is. The PyMySQL example should illustrate that Python overhead adds up very fast, even within a pure-Python driver, and in the overall profile dwarfs the time spent within the database itself. However, this argument still may not be convincing enough.

The result we get is that the do-nothing yield from plus StopIteration combination takes about six times longer than a plain function call:

That syntax is fantastic, I like it a lot, but unfortunately, the mechanism of that return conn statement is necessarily that it raises a StopIteration exception. This, combined with the fact that each yield from call carries roughly the overhead of a separate function call, adds up quickly. I tweeted a simple demonstration of this, which I include here in abbreviated form:
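The sketch below is my reconstruction of that kind of comparison rather than the original tweet; the function names are illustrative. It drives a do-nothing yield from chain by hand, catching the StopIteration that delivers the return value, and times it against a plain function call:

```python
import timeit

def plain():
    return 1

def call_plain():
    return plain()

def gen():
    # "return 1" inside a generator delivers the value by raising StopIteration
    return 1
    yield  # unreachable; present only to make this a generator

def call_gen():
    result = yield from gen()
    return result

def run_gen():
    # drive the outer generator; the final value arrives as StopIteration.value
    g = call_gen()
    try:
        next(g)
    except StopIteration as ex:
        return ex.value

t_plain = timeit.timeit(call_plain, number=100000)
t_gen = timeit.timeit(run_gen, number=100000)
print(t_plain, t_gen)  # the yield from path is typically several times slower
```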

At the core of asyncio is the @asyncio.coroutine decorator, which uses some generator tricks in order to have your otherwise synchronous-looking function defer to other coroutines. Central to this is the yield from technique, which causes the function to stop its execution at that point while other things go on, until the event loop comes back to that point. This is a great idea, and it can also be done using the more common yield statement as well. However, using yield from, we are able to maintain at least the appearance of return values:
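The mechanics can be sketched without an event loop at all, using plain generators; the names here are illustrative, not any library's API. The caller writes what looks like a normal assignment, while the value actually travels back through a StopIteration raised by the inner generator:

```python
def get_conn():
    # pretend this is where we'd wait on the socket; the event loop
    # would suspend us here and resume us later
    yield "waiting for IO"
    # the return value travels back to the caller inside StopIteration
    return "a connection"

def do_work(results):
    # reads like a normal assignment, despite the suspension point
    conn = yield from get_conn()
    results.append(conn)

results = []
for _ in do_work(results):
    pass  # a trivial stand-in for the event loop, just exhausting the generator
print(results)
```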

Let's be clear here: when using Python, calls to your database do not typically produce an IO-bound effect, unless you're making lots of complex analytical calls with enormous result sets that you would normally not be doing in a high-performing application, or unless you have a very slow network. When we talk to databases, we are almost always using some form of connection pooling, so the overhead of connecting is already mitigated to a large extent; the database itself can select and insert small numbers of rows very fast on a reasonable network. The overhead of Python itself, just to marshal messages over the wire and produce result sets, gives the CPU plenty of work to do, which removes any unique throughput advantage to be had with non-blocking IO. With real-world activities based around database operations, the proportion spent in CPU only increases.

To highlight the actual proportion of these runs spent in IO, the following two RunSnakeRun displays illustrate how much time goes to IO within the PyMySQL run, both for a local database and over a network connection. The proportion is less dramatic over a network connection, but even there the network calls take only a third of the total time; the other two-thirds is spent in Python crunching the results. Keep in mind this is just the DBAPI alone; a real-world application would have database abstraction layers, business logic, and presentation logic surrounding these calls as well:
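If you don't have RunSnakeRun handy, a similar CPU-time breakdown can be approximated with cProfile from the stdlib. The `workload` function here is a stand-in for result-set marshalling, not the actual benchmark code:

```python
import cProfile
import io
import pstats

def workload():
    # stand-in for result-set crunching; a real run would profile the DBAPI calls
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
profile_text = buf.getvalue()
print(profile_text)
```

Sorting by cumulative time makes it easy to see how much of the run is spent inside Python-level functions versus waiting on the socket.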

This script, adapted from the Openstack entry, illustrates a pretty straightforward set of INSERT and SELECT statements, with virtually no Python code other than barebones explicit calls into the DBAPI.

A great misconception I seem to encounter often is the notion that communication with the database takes up a majority of the time spent in a database-centric Python application. This perhaps is common wisdom in compiled languages such as C or maybe even Java, but generally not in Python. Python is very slow compared to such systems; and while PyPy is certainly a big help, the speed of Python is nowhere near that of your database when dealing in terms of standard CRUD-style applications (meaning: not running large OLAP-style queries, and of course assuming relatively low network latencies). As I worked up in my PyMySQL Evaluation for Openstack, a database driver (DBAPI) written in pure Python, as opposed to C, incurs significant additional Python-level overhead. For just the DBAPI alone, this can be as much as an order of magnitude slower. While network overhead will produce more balanced proportions between CPU and IO, the CPU time spent by the Python driver itself still takes up twice the time of the network IO, and that is without any additional database abstraction libraries, business logic, or presentation logic in place.

Update - redditor Riddlerforce found valid issues with this section, in that I was not testing over a network connection. Results here are updated. The conclusion is the same, but not as hyperbolically amusing as it was before.

I will address this only in terms of database access. For HTTP / "chat" server styles of communication, either listening as a server or making client calls, asyncio may very well be superior, as it allows many more sleepy or arbitrarily slow connections to be tended to in a simple way. But for local database access, this is just not the case.

Many (but certainly not all) within both the node.js community and the Python community continue to claim that asynchronous programming styles are innately superior for concurrent performance in nearly all cases. In particular, there's the notion that the context switching of explicit async systems such as asyncio can be had virtually for free, and since Python has a GIL, that all adds up in some unspecified/non-illustrated/apples-to-oranges way to establish that asyncio will totally, definitely be faster than any kind of threaded approach, or at the very least, no slower. Therefore any web application should as quickly as possible be converted to a front-to-back async approach for everything, from HTTP request to database calls, and performance enhancements will come for free.

Issue Two - Async as Making Coding Easier

This is the flip side to the "magic fairy dust" coin. This argument expands upon the "threads are bad" rhetoric, and in its most extreme form goes like this: if a program at some level happens to spawn a thread, such as if you wrote a WSGI application and happen to run it under mod_wsgi using a threadpool, you are now doing "threaded programming", of the caliber that is just as difficult as if you were doing POSIX threading exercises throughout your code. Despite the fact that a WSGI application should not have the slightest mention of in-process shared and mutable state within it, nope, you're doing threaded programming, threads are hard, and you should stop.

The "threads are bad" argument has an interesting twist (ha!), which is that it is being used by explicit async advocates to argue against implicit async techniques. Glyph's Unyielding post makes exactly this point very well. The premise goes that if you've accepted that threaded concurrency is a bad thing, then using the implicit style of async IO is just as bad, because at the end of the day, the code looks the same as threaded code, and because IO can happen anywhere, it's just as non-deterministic as using traditional threads. I happen to agree that, yes, the problems of concurrency in a gevent-like system are just as bad, if not worse, than in a threaded system. One reason is that concurrency problems in threaded Python are fairly "soft", because the GIL, as much as we hate it, already makes all kinds of normally disastrous operations, like appending to a list, safe. But with green threads, you can easily have hundreds of them without breaking a sweat, and you can stumble across pretty weird issues that are normally not possible to encounter with traditional, GIL-protected threads.
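The "soft" nature of threaded concurrency problems under the GIL is easy to demonstrate; in CPython, list.append is atomic, so the sketch below completes with no lost updates despite having no locking at all:

```python
import threading

items = []

def appender(count):
    for i in range(count):
        items.append(i)  # atomic in CPython thanks to the GIL; no lock needed

threads = [threading.Thread(target=appender, args=(10000,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(items))  # all 100000 appends survive
```

Note this only covers single-bytecode-ish operations; compound read-modify-write sequences still race, under threads and green threads alike.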

As an aside, it should be noted that Glyph takes a direct swipe at the "magic fairy dust" crowd:

Unfortunately, “asynchronous” systems have often been evangelized by emphasizing a somewhat dubious optimization which allows for a higher level of I/O-bound concurrency than with preemptive threads, rather than the problems with threading as a programming model that I’ve explained above. By characterizing “asynchronousness” in this way, it makes sense to lump all 4 choices together. I’ve been guilty of this myself, especially in years past: saying that a system using Twisted is more efficient than one using an alternative approach using threads. In many cases that’s been true, but: the situation is almost always more complicated than that, when it comes to performance, “context switching” is rarely a bottleneck in real-world programs, and it’s a bit of a distraction from the much bigger advantage of event-driven programming, which is simply that it’s easier to write programs at scale, in both senses (that is, programs containing lots of code as well as programs which have many concurrent users).

People will quote Glyph's post when they want to talk about how you'll have fewer bugs in your program when you switch to asyncio, but continue to promise greater performance as well, choosing for some reason to ignore this part of that very well-written post.

Glyph makes a great, and very clear, argument for the twin points that both non-blocking IO should be used, and that it should be explicit. But the reasoning has nothing to do with non-blocking IO's original beginnings as a reasonable way to process data from a large number of sleepy and slow connections. It instead has to do with the nature of the event loop and how an entirely new concurrency model, removing the need to expose OS-level context switching, is emergent.

While we've come a long way from writing callbacks and can now again write code that looks very linear with approaches like asyncio, the approach should still require that the programmer explicitly specify all those function calls where IO is known to occur. It begins with the following example:

```python
def transfer(amount, payer, payee, server):
    if not payer.sufficient_funds_for_withdrawal(amount):
        raise InsufficientFunds()
    log("{payer} has sufficient funds.", payer=payer)
    payee.deposit(amount)
    log("{payee} received payment", payee=payee)
    payer.withdraw(amount)
    log("{payer} made payment", payer=payer)
    server.update_balances([payer, payee])
```

The concurrency mistake here, from a threaded perspective, is that if two threads both run transfer(), they may both withdraw from payer, such that payer's balance drops below the sufficient-funds threshold without InsufficientFunds ever being raised.
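The race is easy to reproduce with real threads; this is a sketch, not Glyph's code, and the sleep stands in for an IO call that widens the window between the check and the withdrawal:

```python
import threading
import time

balance = 100

def transfer(amount):
    global balance
    # check-then-act: another thread can run between these two steps
    if balance >= amount:
        time.sleep(0.05)  # stand-in for an IO call inside the window
        balance -= amount

threads = [threading.Thread(target=transfer, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # almost always -100: both threads passed the funds check
```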

The explicit async version is then:

```python
@coroutine
def transfer(amount, payer, payee, server):
    if not payer.sufficient_funds_for_withdrawal(amount):
        raise InsufficientFunds()
    log("{payer} has sufficient funds.", payer=payer)
    payee.deposit(amount)
    log("{payee} received payment", payee=payee)
    payer.withdraw(amount)
    log("{payer} made payment", payer=payer)
    yield from server.update_balances([payer, payee])
```

Where now, within the scope of the process we're in, we know that we are only allowing anything else to happen at the bottom, when we call yield from server.update_balances() . There is no chance that any other concurrent calls to payer.withdraw() can occur while we're in the function's body and have not yet reached the server.update_balances() call.

He then makes a clear point as to why even the implicit gevent-style async isn't sufficient: because payee.deposit() and payer.withdraw() in the above program do not do a yield from, we are assured that no IO can be introduced in future versions of these calls that would break into our scheduling and potentially run another transfer() before ours is complete.

(As an aside, I'm not actually sure, in the realm of "we had to type yield from and that's how we stay aware of what's going on", why the yield from needs to be a real, structural part of the program and not just, for example, a magic comment consumed by a gevent/eventlet-integrated linter that tests callstacks for IO and verifies that the corresponding source code has been annotated with special comments, as that would have the identical effect without impacting any libraries outside of that system and without incurring all the Python performance overhead of explicit async. But that's a different topic.)

Regardless of the style of explicit coroutine, there are two flaws with this approach.

One is that asyncio makes it so easy to type out yield from that the idea that it prevents us from making mistakes loses a lot of its plausibility. A commenter on Hacker News made this great point about the notion of asynchronous code being easier to debug:

It's basically, "I want context switches syntactically explicit in my code. If they aren't, reasoning about it is exponentially harder." And I think that's pretty clearly a strawman. Everything the author claims about threaded code is true of any re-entrant code, multi-threaded or not. If your function inadvertently calls a function which calls the original function recursively, you have the exact same problem. But, guess what, that just doesn't happen that often. Most code isn't re-entrant. Most state isn't shared. For code that is concurrent and does interact in interesting ways, you are going to have to reason about it carefully. Smearing "yield from" all over your code doesn't solve it. In practice, you'll end up with so many "yield from" lines in your code that you're right back to "well, I guess I could context switch just about anywhere", which is the problem you were trying to avoid in the first place.

In my benchmark code, one can see this last point is exactly true. Here's a bit of the threaded version:

```python
cursor.execute(
    "select id from geo_record where fileid=%s and logrecno=%s",
    (item['fileid'], item['logrecno'])
)
row = cursor.fetchone()
geo_record_id = row[0]

cursor.execute(
    "select d.id, d.index from dictionary_item as d "
    "join matrix as m on d.matrix_id=m.id where m.segment_id=%s "
    "order by m.sortkey, d.index",
    (item['cifsn'],)
)
dictionary_ids = [row[0] for row in cursor]

assert len(dictionary_ids) == len(item['items'])

for dictionary_id, element in zip(dictionary_ids, item['items']):
    cursor.execute(
        "insert into data_element "
        "(geo_record_id, dictionary_item_id, value) "
        "values (%s, %s, %s)",
        (geo_record_id, dictionary_id, element)
    )
```

Here's a bit of the asyncio version:

```python
yield from cursor.execute(
    "select id from geo_record where fileid=%s and logrecno=%s",
    (item['fileid'], item['logrecno'])
)
row = yield from cursor.fetchone()
geo_record_id = row[0]

yield from cursor.execute(
    "select d.id, d.index from dictionary_item as d "
    "join matrix as m on d.matrix_id=m.id where m.segment_id=%s "
    "order by m.sortkey, d.index",
    (item['cifsn'],)
)
rows = yield from cursor.fetchall()
dictionary_ids = [row[0] for row in rows]

assert len(dictionary_ids) == len(item['items'])

for dictionary_id, element in zip(dictionary_ids, item['items']):
    yield from cursor.execute(
        "insert into data_element "
        "(geo_record_id, dictionary_item_id, value) "
        "values (%s, %s, %s)",
        (geo_record_id, dictionary_id, element)
    )
```

Notice how they look exactly the same? The presence of yield from is not in any way changing the code that I write or the decisions that I make; this is because in boring database code, we basically need to run the queries we need to run, in order. I'm not going to try to weave an intelligent, thoughtful system of in-process concurrency into how I call the database, or repurpose the points where I happen to need database data as a means of locking out other parts of my program; if I need data, I'm going to call for it.

Whether or not that's compelling doesn't actually matter - using async or mutexes or whatever inside our program to control concurrency is completely insufficient in any case. Instead, there is of course something we absolutely must always do in real-world boring database code in the name of concurrency, and that is: