Profiling Python Like a Boss

Bryan Helmig / December 2, 2013

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. ~Donald Knuth

It is unwise to embark on an optimization quest without first considering the famous quote by Knuth. However, sometimes that quick and hacky O(N^2) code you wrote (and promptly forgot about) to get some feature out the door rears its ugly head and demands more of your attention: this article is for that day.

So what follows are some useful tools and patterns for quick n' dirty profiling in Python. The goal here is simple: identify bottlenecks as quickly as possible, fix them and then confirm you've fixed them.

Write a Test

Before you embark on any of this journey, write a quick high-level test that demonstrates the slowness. You may need to introduce some minimum set of data to reproduce a significant enough slowdown. Usually a run time of a second or two is long enough to get a handle on an appreciable improvement when you find it.

It is also quite useful to have some standard tests to ensure your optimizations don't modify the behavior of the slow code. You might also be able to repurpose those tests as a bit of a "benchmark" by running them many times when profiling/benchmarking.
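A minimal sketch of what such a test might look like (Python 3 here; expensive_function is a hypothetical stand-in for your actual slow code):

```python
import time

def expensive_function():
    # Hypothetical stand-in for the slow code you want to profile.
    total = 0
    for x in range(500000):
        total += x ^ x ^ x  # x ^ x ^ x is just x; busywork on purpose
    return total

def test_behavior():
    # Guards correctness so optimizations don't change the result.
    assert expensive_function() == sum(range(500000))

def test_speed():
    # A loose time budget that documents the slowdown you want to fix.
    start = time.time()
    expensive_function()
    assert time.time() - start < 2.0

test_behavior()
test_speed()
```

The behavior test doubles as the safety net mentioned above, and the speed test is the "benchmark" you can re-run after each fix.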

The timing of this test case leads us to our first quick n' dirty profiling tool…

Simple Timers

Timers are simple, flexible ways to measure execution time. You can drop them in just about anywhere and they have very minimal side-effects. It is quite easy to roll your own timer and even customize it to work exactly how you like. For example, a simple timing decorator might look like:

import time

def timefunc(f):
    def f_timer(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print f.__name__, 'took', end - start, 'seconds'
        return result
    return f_timer

def get_number():
    for x in xrange(5000000):
        yield x

@timefunc
def expensive_function():
    for x in get_number():
        i = x ^ x ^ x
    return 'some result!'

# prints "expensive_function took 0.72583088875 seconds"
result = expensive_function()

Of course, you can get even more clever by using a context manager, giving you checkpoints and other good bits to work with:

import time

class timewith():
    def __init__(self, name=''):
        self.name = name
        self.start = time.time()

    @property
    def elapsed(self):
        return time.time() - self.start

    def checkpoint(self, name=''):
        print '{timer} {checkpoint} took {elapsed} seconds'.format(
            timer=self.name,
            checkpoint=name,
            elapsed=self.elapsed,
        ).strip()

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.checkpoint('finished')

def get_number():
    for x in xrange(5000000):
        yield x

def expensive_function():
    for x in get_number():
        i = x ^ x ^ x
    return 'some result!'

# prints something like:
# fancy thing done with something took 0.582462072372 seconds
# fancy thing done with something else took 1.75355315208 seconds
# fancy thing finished took 1.7535982132 seconds
with timewith('fancy thing') as timer:
    expensive_function()
    timer.checkpoint('done with something')
    expensive_function()
    expensive_function()
    timer.checkpoint('done with something else')

# or directly
timer = timewith('fancy thing')
expensive_function()
timer.checkpoint('done with something')

Timers require you to do some digging. Start wrapping a few of the higher level functions and confirm where the bottleneck is, then drill down into that function, repeating as you go. When you find the disproportionately slow bit of code, fix it, and work your way back out confirming that it is fixed.

Handy tip: Don't forget the handy timeit module! It tends to be more useful for benchmarking small pieces of code than for doing the actual investigation.
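For example, timeit can run a small snippet thousands of times and report the total, which smooths out timing noise (this sketch benchmarks a toy sum; works in Python 2 and 3):

```python
import timeit

# timeit runs the statement `number` times and reports total seconds.
setup = "data = list(range(1000))"
loop_time = timeit.timeit("sum(data)", setup=setup, number=10000)

# repeat() gives several samples; take the min to reduce noise.
best = min(timeit.repeat("sum(data)", setup=setup, number=10000, repeat=3))
print('10k runs took', loop_time, 'seconds; best of 3:', best)
```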

Timer Pros: easy to understand and implement. Also very simple to compare before and after fixes. Works across many languages.

Timer Cons: sometimes a little too simplistic for extremely complex codebases; you might spend more time placing and replacing boilerplate code than you will fixing the problem!

Built-in Profiler

Breaking out the built-in profiler is akin to breaking out the big guns. It is extremely powerful, but a little unwieldy and sometimes a bit difficult to interpret and act on.

You can read a bit more about the built-in profile module but the basics are quite simple: you enable and disable the profiler and it logs all function calls and execution times. It can then compile and print the output for you. A quick decorator simplifies this:

import cProfile

def do_cprofile(func):
    def profiled_func(*args, **kwargs):
        profile = cProfile.Profile()
        try:
            profile.enable()
            result = func(*args, **kwargs)
            profile.disable()
            return result
        finally:
            profile.print_stats()
    return profiled_func

def get_number():
    for x in xrange(5000000):
        yield x

@do_cprofile
def expensive_function():
    for x in get_number():
        i = x ^ x ^ x
    return 'some result!'

# perform profiling
result = expensive_function()

In the case of the above code block, you should see console output that looks something like this:

5000003 function calls in 1.626 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5000001    0.571    0.000    0.571    0.000 timers.py:92(get_number)
        1    1.055    1.055    1.626    1.626 timers.py:96(expensive_function)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

As you can see, it gives good numbers on various function calls but it omits a vital piece of information: what is it about a function that makes it so slow?

However, it is a great start to basic profiling. Sometimes it can even point you to the solution with very little fuss. I often use it as a gut check to start the debugging process before I dig deeper into the specific functions that are either slow or called way too often.
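One way to make those reports more digestible is to sort them with the standard pstats module before printing, so the hottest call paths float to the top. A sketch (Python 3 for io.StringIO; the slow functions are toy stand-ins):

```python
import cProfile
import io
import pstats

def get_number():
    for x in range(500000):
        yield x

def expensive_function():
    for x in get_number():
        x ^ x ^ x
    return 'some result!'

profile = cProfile.Profile()
profile.enable()
expensive_function()
profile.disable()

# Sort by cumulative time and show only the top entries,
# which usually surfaces the hot call paths first.
stream = io.StringIO()
stats = pstats.Stats(profile, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```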

Built-in Pros: No external dependencies and quite fast. Useful for quick high-level checks.

Built-in Cons: rather limited information that usually requires deeper debugging; reports are a bit unintuitive, especially for complex codebases.

Line Profiler

If the built-in profiler was a big gun, consider the line profiler an ion cannon. It is quite heavy and powerful (and a lot of fun to use!).

For this example, we'll use the great kernprof line profiler, available as the handy line_profiler PyPI package. For ease of use, we'll wrap it once again in a decorator that simplifies usage as well as provides a way to protect us from leaving it in production (because it is dog slow!).

try:
    from line_profiler import LineProfiler

    def do_profile(follow=[]):
        def inner(func):
            def profiled_func(*args, **kwargs):
                try:
                    profiler = LineProfiler()
                    profiler.add_function(func)
                    for f in follow:
                        profiler.add_function(f)
                    profiler.enable_by_count()
                    return func(*args, **kwargs)
                finally:
                    profiler.print_stats()
            return profiled_func
        return inner

except ImportError:
    def do_profile(follow=[]):
        "Helpful if you accidentally leave in production!"
        def inner(func):
            def nothing(*args, **kwargs):
                return func(*args, **kwargs)
            return nothing
        return inner

def get_number():
    for x in xrange(5000000):
        yield x

@do_profile(follow=[get_number])
def expensive_function():
    for x in get_number():
        i = x ^ x ^ x
    return 'some result!'

result = expensive_function()

If you run the above chunk of code, you should see a report that looks something like this:

Timer unit: 1e-06 s

File: test.py
Function: get_number at line 43
Total time: 4.44195 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    43                                           def get_number():
    44   5000001      2223313      0.4     50.1      for x in xrange(5000000):
    45   5000000      2218638      0.4     49.9          yield x

File: test.py
Function: expensive_function at line 47
Total time: 16.828 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    47                                           def expensive_function():
    48   5000001     14090530      2.8     83.7      for x in get_number():
    49   5000000      2737480      0.5     16.3          i = x ^ x ^ x
    50         1            0      0.0      0.0      return 'some result!'

As you can see, these are highly detailed reports that give unprecedented insight into code execution. Unlike the built-in cProfile, it can break out time spent in core language features like loops or imports and gives cumulative numbers for time spent on individual lines.

This sort of detail makes it much easier to get a quick understanding of a function's internals. Plus, if you need to dive down into some third party library, you can just import the relevant third party function and pass it directly into the decorator.

Handy tip: Just decorate your test function and pass the problem function in the follow argument!

Line Profiler Pros: has intuitive and finely detailed reports. Can follow functions in third party libraries.

Line Profiler Cons: because of the overhead it can be an order of magnitude slower than real execution time, so don't use it for benchmarking. It is also an external requirement.

Conclusion & Best Practices

You should use some of the simpler tools (like timers or the built-in profiler) to perform some gut checks around test cases (especially around some code you know quite well) and switch to a few of the slower but more detailed tools like line_profiler to get a close look at the internals of functions.

Nine times out of ten you'll notice a silly mistake, like repeatedly calling something inside a loop or using an incorrect data structure, that is chewing up 90%+ of the time spent in a function. A couple quick (and satisfying) tweaks and you'll be on your way.
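A classic example of the wrong-data-structure mistake is membership testing against a list inside a loop (a linear scan per check) instead of a set (a hash lookup). A quick sketch with made-up sizes:

```python
import time

haystack_list = list(range(10000))
haystack_set = set(haystack_list)
needles = range(0, 10000, 2)

start = time.time()
hits_list = sum(1 for n in needles if n in haystack_list)  # O(N) scan per lookup
list_time = time.time() - start

start = time.time()
hits_set = sum(1 for n in needles if n in haystack_set)    # O(1) average per lookup
set_time = time.time() - start

print('list:', list_time, 'seconds vs set:', set_time, 'seconds')
```

Either a timer or line_profiler will flag the list version immediately, and swapping in a set is exactly the kind of one-line fix described above.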

If your code is still too slow and you notice yourself doing esoteric things like comparing attribute access techniques or tweaking equality checks, you're likely past the point of diminishing returns. You might have to either:

1. Live with the slowness (or just pre-calculate/cache things).
2. Rethink the entire implementation.
3. Use more optimized data structures (via NumPy, Pandas, etc.).
4. Write a C extension.
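The caching option is often the cheapest of the four. In Python 3 the standard library does it for you with functools.lru_cache (the expensive_lookup function below is a hypothetical stand-in for your real slow call):

```python
import functools

@functools.lru_cache(maxsize=None)
def expensive_lookup(key):
    # Pretend this hits a database or does heavy math (hypothetical).
    return sum(i * i for i in range(key))

# First call computes; the repeat is served straight from the cache.
first = expensive_lookup(10000)
second = expensive_lookup(10000)
print(expensive_lookup.cache_info())  # hits=1, misses=1 after the two calls
```

This only works for functions whose output depends purely on hashable arguments, but when it applies, it turns a slow call into a dict lookup.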

Beware, optimizing code can be a guilty pleasure! It is pretty fun finding interesting ways to speed up Python, but be careful not to obfuscate what is going on under the hood with arcane speedups. Readable code is generally better than slightly faster code. It is almost always simpler to just cache it and move on.

That said, enjoy! :-)





PS: check out the code examples in https://github.com/zapier/profiling-python-like-a-boss. Also, learn how to Debug Python Like a Boss.