Profilers are tools designed to check how programs use various computing resources. They measure various program characteristics (memory usage, duration of function calls, etc.) while program is running and aggregate collected data to create a report that can aid program optimization.

In this post we will build two simple profilers that measure function call time (just “profilers” below), but solve this problem in slightly different manner. This post has emerged from various notes I made while working on vprof and I hope that it will be useful for you.

Python (more precisely CPython) is used for all code in this post, since it provides great instrumentation for writing profilers out of the box, but concepts discussed here can be easily applied to other mainstream high-level programming languages.

We will consider two types of profilers - deterministic and statistical.

Deterministic profilers record exact time of all function calls and returns and then process collected data to determine duration of function calls and other important parameters. While this method has greater accuracy, it adds significant overhead (especially to small functions) and programs run slower under profiling.

Statistical profilers periodically take snapshots of running program and deduce where time is being spent. This method has lesser overhead than deterministic profilers and programs run almost at full speed, but typically accuracy is lower, since collected profile provides just a statistical approximation of where time was spent.

Here is code that we will be inspecting with our profilers

The code generates N prime numbers in which sum of digits is even. While the code is far from being algorithmically perfect, it is simple enough and fits our needs well.

Deterministic profiler

Perhaps simplest way to measure execution time of our code is to use Python’s built-in time.time() function

If we run it, we’ll get something like this

Total time is 2.436735 s

Looks good. Now we know the time call to get_primes takes. There is one problem though — we’d like to see the time sum_of_digits and is_prime calls take in order to determine the bottlenecks.

Let’s improve our approach

Here we are using well known Python technique — profiling decorators.

For those who aren’t familiar with Python decorators — @decorator is just a shortcut for func = decorator(func) which allows altering Python functions (here you can find nice introduction to Python decorators).

Then we need to wrap our functions with new profile decorator

and run our code again

The output is

Function is_prime run time is 0.000003 s

Function sum_of_digits run time is 0.000002 s

Function is_prime run time is 0.000001 s

Function sum_of_digits run time is 0.000001 s

Function is_prime run time is 0.000001 s

Function is_prime run time is 0.000001 s

…

(many lines here)

…

Total time is 2.567238 s

Not quite that was expected. We can see individual call times, but, since there are so many function calls, we can’t see the big picture.

Also, you might notice that total time is greater now, because we introduced profiler overhead — each wrapped function calls time.time() twice, does subtraction, string formatting and prints resulting string to stdout.

Our profiling decorator can be improved. Printing call time on each function call is not necessary if we want to see the overview. Instead of printing call times, let’s store function name, it’s caller name and cumulative call time after each function call.

We replaced print call with accumulating function call time inside defaultdict that allows us storing cumulative call time per caller.

Wrapping functions with improved decorator

and running code again

Total time is 2.637101 s

Stats are stored in stats variable. Let’s print them

{‘get_n_primes’: defaultdict(<type ‘float’>, {‘<module>’: 2.7848520278930664}),

‘is_prime’: defaultdict(<type ‘float’>, {‘get_n_primes’: 2.7565882205963135}),

‘sum_of_digits’: defaultdict(<type ‘float’>, {‘get_n_primes’:

0.005716800689697266})}

Even from the raw stats we can see that call to is_prime takes most of the time.

Finally, let’s summarize collected stats and print nice-looking report

Total time: 2.637101 s

Function: is_prime, caller: get_n_primes, function run time: 2.611894 s, percentage: 99.044156 %

Function: get_n_primes, caller: <module>, function run time: 2.637097 s, percentage: 99.999855 %

Function: sum_of_digits, caller: get_n_primes, function run time: 0.005741 s, percentage: 0.217715 %

Now we see that most of the time is spent inside is_prime function and if we want to improve overall get_primes performance, is_prime is first candidate for optimization.

Accuracy of our deterministic profiler is determined by these factors:

1. Timer accuracy. No measurements can be more accurate than the underlying timer (~1 msec).

2. Measurement of time is not instant and there exists a delay between the time when user’s code is exited and time is measured and then user’s code is again executed. Small functions that do not do large number of calculations are especially prone to this kind of error, since it tends to accumulate. This error can be reduced by calibrating the profiler.

NOTE: Python provides sys.settrace() hook for writing profilers, so I recommend to use it for all serious use cases.

Statistical profiler

As it was mentioned above, statistical profilers operate by sampling the call stack at regular intervals. In order to do this, we need to define a function that will do call stack sampling and then schedule this function to be executed periodically. Python signal module will help us with that.

Here is the code for this approach

sample_stack callback is signal handler. It stores current call stack name along with how many times call stack was sampled and schedules a timer that emits new signal. We use ITIMER_PROF, because it decrements underlying timer both when the process executes and when the system is executing on behalf of the process. ITIMER_PROF emits SIGPROF on expiration. Other code just sets up our simple statistical profiler and measures total run time.

After we run the code and print stats variable , we’ll see this

{‘<module>-get_primes’: 312,

‘<module>-get_primes-is_prime’: 876,

‘<module>-get_primes-sum_of_digits’: 1}

We can see that get_primes has caught 312 samples, is_prime — 876 samples and sum of digits — just 1 sample. Since majority of samples belong to is_prime function, we can assume that’s where most of the time is spent. We’ll use total number of samples (312 + 876 + 1 = 1189) for calculations below.

Let’s summarize statistical profiler stats and compare them to deterministic profiler stats

Total time: 2.493037 s

Total samples: 1189

Function: <module>-get_primes-sum_of_digits, sample count: 1, percentage: 0.084104 %

Function: <module>-get_primes-is_prime, sample count: 876, percentage: 73.675357 %

Function: <module>-get_primes, sample count: 1189, percentage: 100.000000 %

First of all we can observe that total run time is less than with deterministic profiler, which can be attributed to fact that statistical profilers have less overhead. Also, percentage for is_prime (~73.7%) is lower than it was with deterministic profiler (~99%). This happens because resulting data from statistical profilers is an approximation. Furthermore, if we run profiler several times we can get pretty diverse results (70-99%). It means that lower accuracy should be taken into account when making optimization decisions. Nevertheless, these results can help us find bottlenecks in code.

Summary

We have built simple deterministic and statistical profilers for Python. Profilers measure function call time, but listed approaches can be applied to measuring other program characteristics such as memory usage. Also concepts discussed here are applicable to other mainstream high-level programming languages.

I hope this post will help you to interpret profiling results thoughtfully and build your own profilers if it’s necessary.

Code from this post is here and here.