If you use Gevent, you know it’s great for concurrency, but alas, none of the Python performance profilers work on Gevent applications. So I’m taking matters into my own hands. I’ll show you how both cProfile and Yappi stumble on programs that use greenlets, and I’ll demonstrate GreenletProfiler, my solution.

cProfile Gets Confused by Greenlets

I’ll write a script that spawns two greenlets, then I’ll profile the script to look for the functions that cost the most. In my script, the foo greenlet spins 20 million times. Every million iterations, it yields to Gevent’s scheduler (the “hub”). The bar greenlet does the same, but it spins only half as many times.

import cProfile import gevent import lsprofcalltree MILLION = 1000 1000 def foo (): for i in range ( 20 MILLION): if not i % MILLION: # Yield to the Gevent hub. gevent . sleep( 0 ) def bar (): for i in range ( 10 * MILLION): if not i % MILLION: gevent . sleep( 0 ) profile = cProfile . Profile() profile . enable() foo_greenlet = gevent . spawn(foo) bar_greenlet = gevent . spawn(bar) foo_greenlet . join() bar_greenlet . join() profile . disable() stats = lsprofcalltree . KCacheGrind(profile) stats . output( open ( 'cProfile.callgrind' , 'w' ))

Let’s pretend I’m a total idiot and I don’t know why this program is slow. I profile it with cProfile, and convert its output with lsprofcalltree so I can view the profile in KCacheGrind. cProfile is evidently confused: it thinks bar took twice as long as foo , although the opposite is true:

cProfile also fails to count the calls to sleep . I’m not sure why cProfile’s befuddlement manifests this particular way. If you understand it, please explain it to me in the comments. But it’s not surprising that cProfile doesn’t understand my script: cProfile is built to trace a single thread, so it assumes that if one function is called, and then a second function is called, that the first must have called the second. Greenlets defeat this assumption because the call stack can change entirely between one function call and the next.

Yappi Stumbles Over Greenlets

Next let’s try Yappi, the excellent profiling package by Sumer Cip. Yappi has two big advantages over cProfile: it’s built to trace multithreaded programs, and it can measure CPU time instead of wall-clock time. So maybe Yappi will do better than cProfile on my script? I run Yappi like so:

yappi . set_clock_type( 'cpu' ) yappi . start(builtins = True ) foo_greenlet = gevent . spawn(foo) bar_greenlet = gevent . spawn(bar) foo_greenlet . join() bar_greenlet . join() yappi . stop() stats = yappi . get_func_stats() stats . save( 'yappi.callgrind' , type = 'callgrind' )

Yappi thinks that when foo and bar call gevent.sleep , they indirectly call Greenlet.run , and eventually call themselves:

This is true in some philosophical sense. When my greenlets sleep, they indirectly cause each other to be scheduled by the Gevent hub. But it’s wrong to say they actually call themselves recursively, and it confuses Yappi’s cost measurements: Yappi attributes most of the CPU cost of the program to Gevent’s internal Waiter.get function. Yappi also, for some reason, thinks that sleep is called only once each by foo and bar , though it knows it was called 30 times in total.

GreenletProfiler Groks Greenlets

Since Yappi is so great for multithreaded programs, I used it as my starting point for GreenletProfiler. Yappi’s core tracing code is in C, for speed. The C code has a notion of a “context” which is associated with each thread. I added a hook to Yappi that lets me associate contexts with greenlets instead of threads. And voilà, the profiler understands my script! foo and bar are correctly measured as two-thirds and one-third of the script’s total cost:

Unlike Yappi, GreenletProfiler also knows that foo calls sleep 20 times and bar calls sleep 10 times:

Finally, I know which functions to optimize because I have an accurate view of how my script executes.

Conclusion