As you may have heard, several Mozillians and I have been working on a built-in, cross-platform profiler aimed at diagnosing performance problems, and jank in particular. In this post I want to share my understanding of profilers.

This post doesn’t discuss how to profile; I will cover that soon, once the built-in profiler is ready to be used widely (for now you can still take a sneak peek if you ping me).

Instrumentation Vs. Sampling

Before going further, let’s discuss the two major categories of profilers. Instrumenting profilers modify the program to add performance counters. This is often done using a compiler flag (gcc’s ‘-pg’ or ‘-finstrument-functions’), a translation step, or at run time (Rational Quantify). Instrumenting profilers can be useful, but I generally find that the code modifications impact the performance data you are trying to measure. Typically they add significant overhead to fast functions that are called frequently.
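To make the overhead concrete, here is a minimal sketch of instrumentation done by hand in Python. The `instrument` decorator, the `counters` table and the `tiny` function are all hypothetical names made up for illustration; real instrumenting profilers inject similar bookkeeping at compile time or run time rather than through a decorator.

```python
import functools
import time
from collections import defaultdict

# Hypothetical counters, keyed by function name: [call count, total seconds].
counters = defaultdict(lambda: [0, 0.0])

def instrument(func):
    """Wrap func so that every call updates its performance counters."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = counters[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

@instrument
def tiny(x):
    # A fast function: the bookkeeping in wrapper does more work than this body.
    return x + 1

def run(n=10000):
    """Call the instrumented function n times, as a hot loop would."""
    total = 0
    for _ in range(n):
        total = tiny(total)
    return total
```

After `run()`, `counters["tiny"]` holds the call count and accumulated time. Two clock reads and a table update per call are negligible for a slow function, but for something as small as `tiny` they can easily dominate what is being measured.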

This leads me to sampling profilers. While an instrumenting profiler is similar to asking someone to keep detailed metrics on every task they perform, a sampling profiler is more like a brief, regular audit of their work. A sampling profiler has something known as a ‘sampling interval’, typically in the range of 1ms to 10ms, at which it will pause/interrupt the program and record its state, which typically consists of the backtrace at the current point. By collecting enough of these samples (execution audits) you begin to get a representative picture of where the program is spending its time. This is obvious for a long-running task such as blocking on IO for 1 sec: you will have sampled it about 1000 times if sampling every 1ms. While it’s a bit harder to wrap your head around, sampling also works for functions that execute orders of magnitude faster than the sampling interval. The key thing to remember is that the more overall execution time the CPU spends somewhere, the more likely it is to be paused and sampled there.
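The claim about fast functions can be checked with a small simulation. The sketch below is not a profiler; it models a program that loops over two made-up functions, `fast` (3 microseconds) and `slow` (7 microseconds), and samples it roughly every millisecond. All durations and the jitter range are assumptions chosen for illustration. The jitter matters in this model because a perfectly regular interval that is a multiple of the loop period would always land on the same phase of the loop.

```python
import random

def simulate(sample_interval_us=1000, total_us=10_000_000, seed=0):
    """Sample a simulated two-function loop; return each function's share of samples."""
    rng = random.Random(seed)
    durations = {"fast": 3, "slow": 7}   # microseconds per call (made up)
    period = sum(durations.values())     # one loop iteration takes 10 us
    hits = {"fast": 0, "slow": 0}
    t = 0
    while t < total_us:
        # Advance to the next sample, with some timer jitter.
        t += sample_interval_us + rng.randint(-50, 50)
        # Work out which function is running at time t.
        phase = t % period
        name = "fast" if phase < durations["fast"] else "slow"
        hits[name] += 1
    n = sum(hits.values())
    return {name: count / n for name, count in hits.items()}
```

Calling `simulate()` should return shares close to 0.3 and 0.7, matching the fraction of time each function occupies, even though neither function ever runs for more than a hundredth of the sampling interval.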

Implementing a Sampling Profiler

The design of a sampling profiler is very easy to explain but unfortunately tedious and complex to implement, because most of the code is non-portable: it depends on the ABI, the compiled binary format and the debug information, amongst other things. I’m going to outline each of the pieces with a quick mention of how we solve the problem on each platform. Feel free to follow along in ‘mozilla-central/tools/profiler/*’.

The first thing you want is a watchdog that is ready to collect and store samples (backtraces, amongst other things) at a fixed interval. This is implemented as a simple thread that sleeps between samples.

Now you want a way to stop a thread to give you a chance to get a snapshot of what it was doing. On Windows and Mac we use a platform-specific thread pause API, and on Linux and Android we use a signal.

Now that the thread is paused you need a way to collect data. Our profiler deviates from typical profilers and also records whether the process has been responding to its event queue. For unwinding we’ve been extending nsStackWalk.h, except on Android, where we are working towards using libunwind under Fennec profiling builds.

Now you can resume the thread you so rudely interrupted and let it do some more work before the next sample/audit.
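The loop described above (sleep, look at the target thread, record a backtrace, carry on) can be sketched in a few lines of Python. This is only an illustration of the shape of the watchdog, not the code in mozilla-central: the `Sampler` class, `busy_work` and `profile` are hypothetical names, and Python offers no direct way to suspend another thread, so the sketch reads the target’s current frame through `sys._current_frames()` instead of pausing and resuming it.

```python
import sys
import threading
import time
import traceback

class Sampler(threading.Thread):
    """Watchdog thread: sleep, grab the target thread's stack, repeat."""

    def __init__(self, target_thread_id, interval=0.001):
        super().__init__(daemon=True)
        self.target_thread_id = target_thread_id
        self.interval = interval
        self.samples = []
        self._stop_requested = threading.Event()

    def run(self):
        # Event.wait doubles as the sleep between samples.
        while not self._stop_requested.wait(self.interval):
            frame = sys._current_frames().get(self.target_thread_id)
            if frame is None:
                continue
            # Record function names, outermost frame first.
            stack = [f.name for f in traceback.extract_stack(frame)]
            self.samples.append(stack)

    def stop(self):
        self._stop_requested.set()
        self.join()

def busy_work(duration):
    """Stand-in workload: spin for roughly `duration` seconds."""
    deadline = time.perf_counter() + duration
    x = 0
    while time.perf_counter() < deadline:
        x += 1
    return x

def profile(duration=0.2):
    """Sample the calling thread while it runs busy_work."""
    sampler = Sampler(threading.get_ident())
    sampler.start()
    busy_work(duration)
    sampler.stop()
    return sampler.samples
```

`profile()` returns a list of backtraces, most of which should end in `busy_work`. A native profiler has to do the hard parts this sketch skips: actually suspending the thread and unwinding a machine stack.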

After you’ve collected a good number of samples you want to dump this data somewhere. In our case we pass the data to JS, where we can let fancy extensions and web apps with all the HTML5 features and buzzwords do their magic. Note that when saving you also want to record which libraries are loaded and where they live in memory for the next step:

Symbolicating

Now that you’ve collected thousands of backtraces you need to convert these addresses into something that can be traced back to the source code. Roughly speaking, of course, all source code is compiled into binary code and placed in libraries. For example, function foo(int) may get compiled into libBAR.so/dll as instructions (CPU opcodes) placed at range 0x100-0x200. To make matters more complicated, libraries can be loaded just about anywhere in memory. So libBAR.so could be placed at offset 0x8000, meaning that foo(int) is now at 0x8100-0x8200.

To translate a raw address you take this process backwards. You start with 0x8106 and you know that address falls inside the range of libBAR.so/dll, which is loaded at offset 0x8000. Thus you’re dealing with address 0x106 in that library, which is function foo(int)+6 (with source line information you can even translate that +6 to a line number).
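Here is the same lookup as a small Python sketch, using the libBAR.so numbers from the example above. The `LOADED_LIBRARIES` and `SYMBOLS` tables, the 0x1000 library size and the `symbolicate` function are assumptions for illustration; a real symbolicator reads this information from the process’s load map and the library’s symbol table or debug information.

```python
import bisect

# Hypothetical load map: (base address, size, library name).
LOADED_LIBRARIES = [
    (0x8000, 0x1000, "libBAR.so"),
]

# Hypothetical symbol tables: library -> list of (start, end, name),
# sorted by start offset within the library.
SYMBOLS = {
    "libBAR.so": [(0x100, 0x200, "foo(int)")],
}

def symbolicate(address):
    """Translate a raw address into 'function+offset (library)' when possible."""
    for base, size, lib in LOADED_LIBRARIES:
        if base <= address < base + size:
            offset = address - base          # address relative to the library
            table = SYMBOLS.get(lib, [])
            starts = [start for start, _, _ in table]
            # Find the last symbol that starts at or before the offset.
            i = bisect.bisect_right(starts, offset) - 1
            if i >= 0:
                start, end, name = table[i]
                if offset < end:
                    return f"{name}+{offset - start} ({lib})"
            # Inside the library, but no symbol covers this offset.
            return f"{lib}+{hex(offset)}"
    # Not inside any known library: leave the raw address.
    return hex(address)
```

`symbolicate(0x8106)` gives `foo(int)+6 (libBAR.so)`. Addresses that fall inside a library but outside any known symbol fall back to a library-relative offset, and anything else is left as a raw address.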

Presenting the Data

Now that you’ve collected the data you can present it in some useful way. This is typically some weighted call graph presentation. This is the most straightforward step but also a very important one. The profile data we collect is tens of MB in size, so good software is a must to translate this data dump into actionable fixes. I find that most profilers tend to be lacking in this area.
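As a rough idea of what a weighted call graph means in practice, the sketch below merges backtraces into a tree where every node counts how many samples passed through it. The function names in `EXAMPLE_SAMPLES` are invented, and `build_call_tree` and `format_tree` are hypothetical helpers, not part of our profiler.

```python
def build_call_tree(samples):
    """Merge backtraces (outermost frame first) into a tree with sample counts."""
    root = {"name": "(root)", "count": 0, "children": {}}
    for stack in samples:
        root["count"] += 1
        node = root
        for name in stack:
            child = node["children"].setdefault(
                name, {"name": name, "count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root

def format_tree(node, total=None, depth=0):
    """Return indented lines, heaviest children first, weighted by share of samples."""
    if total is None:
        total = node["count"] or 1  # avoid dividing by zero on empty input
    percent = 100.0 * node["count"] / total
    lines = [f"{'  ' * depth}{percent:5.1f}% {node['name']}"]
    children = sorted(node["children"].values(), key=lambda c: -c["count"])
    for child in children:
        lines.extend(format_tree(child, total, depth + 1))
    return lines

# Made-up backtraces standing in for symbolicated samples.
EXAMPLE_SAMPLES = [
    ["main", "paint", "draw_text"],
    ["main", "paint", "draw_text"],
    ["main", "paint", "draw_image"],
    ["main", "read_file"],
]
```

Printing `"\n".join(format_tree(build_call_tree(EXAMPLE_SAMPLES)))` shows at a glance that three quarters of the samples were under `paint`. Real front ends add things like inverting the tree, filtering and focusing on a subtree, which is where most of the usability comes from.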

What about the built-in profiler we’re working on? We’re working hard to have it ready for general use ASAP; I’ll blog more when it’s near.