Asyncio allows you to write high-performance network applications in Python. It has been a breath of fresh air, reducing the complexity and resource waste common in synchronous applications. However, bugs happen, and they happen to asyncio apps too.

Sailing into the last two weeks of 2017, which I fully intended to spend experimenting with various eggnog recipes, I was alerted by our DevOps team that our asyncio app was consuming 10GB of memory. That is approximately 100 times more than it should!

I reviewed all recent changes and nothing obvious jumped out at me. To make matters worse, I had no idea what was triggering the memory allocation, since a restart would fix the problem. It could take several hours for the app's memory use to grow again, but grow it did. And it ate all the memory it could.

Code reviews and minor bug fixes accomplished nothing to change this.

I knew I was facing the dreaded question.

What is the application actually doing?

I realized I needed an endpoint where I could examine such stats at runtime (since that is where the problem happens). So I wrote a simple view called /stats where I started displaying various useful information.

How many coroutines does the application use?

Here we are helped by the asyncio.Task.all_tasks method, which returns all tasks currently known to the event loop. Better yet, we can inspect these tasks to gain more insight: we can figure out whether a coroutine was cancelled, whether it raised an exception, and, most importantly in our case, what actually created it, by examining its stack.

from collections import OrderedDict

def show_coro(c):
    data = OrderedDict([
        ('txt', str(c)),
        ('type', str(type(c))),
        ('done', c.done()),
        ('cancelled', False),
        ('stack', None),
        ('exception', None),
    ])
    if not c.done():
        # Still running: capture where the coroutine currently is.
        data['stack'] = [format_frame(x) for x in c.get_stack()]
    else:
        if c.cancelled():
            data['cancelled'] = True
        else:
            data['exception'] = str(c.exception())
    return data

def format_frame(f):
    keys = ['f_code', 'f_lineno']
    return OrderedDict([(k, str(getattr(f, k))) for k in keys])

This will give us output that can look something like this:

{
    "txt": "<Task pending coro=<writer_process() running at /var/www/app/server/influx.py:52> wait_for=<Future pending cb=[Task._wakeup()]>>",
    "type": "<class 'asyncio.tasks.Task'>",
    "done": false,
    "cancelled": false,
    "stack": [
        {
            "f_code": "<code object writer_process at 0x7f04df9a6b70, file \"/var/www/app/server/influx.py\", line 36>",
            "f_lineno": "52"
        }
    ],
    "exception": null
},
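For completeness, here is a minimal sketch of how such a /stats view could expose this data, assuming an aiohttp application (the handler name, route, and response shape are illustrative, not our actual code):

import asyncio
import json
from aiohttp import web

async def stats(request):
    # Gather diagnostics for every task the event loop knows about.
    # asyncio.Task.all_tasks() is the Python 3.5-era API; newer Python
    # versions expose asyncio.all_tasks() instead.
    view_data = {'coroutines': [show_coro(t) for t in asyncio.Task.all_tasks()]}
    return web.Response(text=json.dumps(view_data, indent=4),
                        content_type='application/json')

app = web.Application()
app.router.add_get('/stats', stats)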

Where is the memory going? What is being allocated?

Here we are helped by the tracemalloc module, which will show us what is being allocated and where.

import tracemalloc

if tracemalloc.is_tracing():
    snapshot = tracemalloc.take_snapshot()
    # Group allocations by the source line that made them.
    top_stats = snapshot.statistics('lineno')
    view_data['top_10_trace'] = [str(x) for x in top_stats[:10]]

The output may look like this:

"top_10_trace": [

"/usr/lib/python3.5/site-packages/raven/breadcrumbs.py:47: size=40.6 KiB, count=100, average=416 B",

"/usr/lib/python3.5/logging/__init__.py:1270: size=35.3 KiB, count=468, average=77 B",

"/usr/lib/python3.5/json/decoder.py:355: size=28.5 KiB, count=449, average=65 B",

...
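Note that tracemalloc only has data to report if tracing was started beforehand, which is what the is_tracing() guard above checks for. A minimal sketch of enabling it at application startup (the frame depth of 25 is just an illustrative choice):

import tracemalloc

# Start tracing before the event loop runs; the argument sets how many
# stack frames are recorded per allocation. Tracing can also be enabled
# without code changes via the PYTHONTRACEMALLOC environment variable.
tracemalloc.start(25)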

The answers to these two questions were all I needed to solve my troubles.

First of all, tracemalloc tipped me off that I had an extremely high number of allocations tied up in search-results objects. This was unexpected, since the search results should be freed after being returned to the user.

Close examination of what happens with the search-results object made me realize that we also log the number of results to our statistics database, and we do so in a separate coroutine (so as not to block the response to the user with what is just our internal statistics collection).

That led me to the conclusion that perhaps something was wrong with those logging coroutines. Under normal circumstances, one node would run between 15 and 40 coroutines; when the huge memory increase happened, the number of coroutines was 3,000 or more.

Mystery solved! The whole problem was caused by our statistics database suffering temporary failures and becoming unresponsive for short periods of time. When that happened, our application had a large number of coroutines running, all waiting to write to the stats database, and each of them was holding a reference to a full result set, keeping it from being freed.

The optimization and fix were simple.

First, when logging the number of search results, we don't need to pass in the entire result set, just the number of results.

Second, and more importantly, we ensure we do not have long-lived logging coroutines running. We can specify a timeout by using asyncio.wait_for(coroutine, 5). By using this feature, occasional slowdowns of external services don't create performance problems in our application, since we time out after 5 seconds.
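A minimal sketch of what the fixed logging call might look like (write_to_stats_db and log_result_count are illustrative names, not our actual code):

import asyncio

async def write_to_stats_db(num_results):
    # Placeholder for the real write to the statistics database.
    ...

async def log_result_count(num_results):
    # Pass only the count, not the whole result set, so the results can be
    # garbage-collected as soon as the response has been sent to the user.
    try:
        await asyncio.wait_for(write_to_stats_db(num_results), 5)
    except asyncio.TimeoutError:
        # If the stats database is unresponsive, give up after 5 seconds
        # instead of letting waiting coroutines pile up in memory.
        pass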

The application has now been running for days, consuming only 100MB of memory.

May your code be leak-free in 2018, but if it does leak, you know what to do.