Contents:

Subscribe to my mailing list, support me on Patreon or by PayPal donation.

This is the third part of my series about Intel Processor Traces (PT). In the first part of this series I showed the underlying mechanics of this HW feature. In the second part I showed 2 ways how PT can enhance your debugging experience.

Here is the list of all parts in the series of posts about Intel PT:

In this article we will focus more on the performance side of this CPU feature.

Analyzing performance glitches

Imagine a situation when you’re a developer of an application. Everything works smoothly in general, besides one little thing. Users of the application claim that sometimes they observe glitches in the application performance. Often times this is quite hard to fix, since it won’t be exposed as a hotspot when you profile the app. Also, it’s hard to isolate and extract it to the standalone benchmark, since performance glitches happen rarely, and the root cause may be not obvious.

To demonstrate how Intel PT can help in this case, I created a small example in which I calculate the sum of the first N natural numbers. There is intentionally fast and slow path of the algorithm. Here is the code:

a.cpp:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 #include <random> int goFastPath ( int * arr , int n ); int goSlowPath ( int * arr , int n ); int main () { int arr [ 1000 ]; for ( int i = 0 ; i < 1000 ; i ++ ) { arr [ i ] = i ; } const int min = 0 ; const int max = 999 ; std :: default_random_engine generator ; std :: uniform_int_distribution < int > distribution ( min , max ); // counting sum up to N for ( int i = 0 ; i < 100000 ; i ++ ) { int random_int = distribution ( generator ); if ( random_int < 999 ) goFastPath ( arr , random_int ); else goSlowPath ( arr , random_int ); } return 0 ; }

b.cpp:

1 2 3 4 5 6 7 8 9 10 int goFastPath ( int * arr , int n ) { return ( n * ( n + 1 )) / 2 ; } int goSlowPath ( int * arr , int n ) { int res = 0 ; for ( int i = 0 ; i <= n ; i ++ ) res += arr [ i ]; return res ; }

The idea is that 99.9% of the time we are using the fast path which only takes a few CPU cycles. But in 0.1% of the time we will fall back to the slow path which can take as much as 1000 cycles.

Here is how I build the executable:

$ g++ a.cpp -c -O2 -g $ g++ b.cpp -c -O2 -g $ g++ a.o b.o $ time ./a.out real 0m0.007s user 0m0.007s sys 0m0.000s

Now let’s collect the traces:

$ perf record -e intel_pt/cyc = 1/u ./a.out [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.548 MB perf.data ]

Notice that I used cyc=1 to have the finest granularity available. This option asks the CPU to produce timing packets every cycle.

To decode the traces into human readable form use the command below:

$ perf script --ns --itrace = i1t -F +srcline,+srccode > decoded.dump

Before going into the output, let me explain what those options mean. Option --ns will show the timestamps in nanoseconds. Option -F +srcline,+srccode will add source line and source code for each decoded assembly instruction. Option --trace=i1t is a little more difficult to explain. This option specifies a period on which decoder will sample our traces. In my case I asked to sample every clocktick, i.e. it will output each instruction separately and put a timestamp for it. See man perf script for more details and footnotes for more examples.

Here is what the decoded output (in decoded.dump ) might look like:

proc PID CPU timestamp sym+off srcline srccode ... a.out 3172 [007] 253.555413143: std::uniform_int_distribution<...> >+0xc2 uniform_int_dist.h:281 a.out 3172 [007] 253.555413143: main+0x78 a.cpp:21 if (random_int < 999) a.out 3172 [007] 253.555413143: main+0x7d a.cpp:22 goFastPath(arr, random_int); a.out 3172 [007] 253.555413143: main+0x7f a.cpp:22 a.out 3172 [007] 253.555413143: main+0x82 a.cpp:21 if (random_int < 999) a.out 3172 [007] 253.555413143: main+0x84 a.cpp:24 goSlowPath(arr, random_int); a.out 3172 [007] 253.555413143: goSlowPath+0x0 b.cpp:7 for (int i = 0; i <= n; i++) a.out 3172 [007] 253.555413143: goSlowPath+0x2 b.cpp:7 a.out 3172 [007] 253.555413143: goSlowPath+0x4 b.cpp:7 a.out 3172 [007] 253.555413143: goSlowPath+0x7 b.cpp:6 int res = 0; a.out 3172 [007] 253.555413143: goSlowPath+0x9 b.cpp:6 a.out 3172 [007] 253.555413143: goSlowPath+0xe b.cpp:6 a.out 3172 [007] 253.555413143: goSlowPath+0x10 b.cpp:8 res += arr[i]; a.out 3172 [007] 253.555413143: goSlowPath+0x12 b.cpp:8 a.out 3172 [007] 253.555413143: goSlowPath+0x16 b.cpp:7 for (int i = 0; i <= n; i++) a.out 3172 [007] 253.555413143: goSlowPath+0x19 b.cpp:7 a.out 3172 [007] 253.555413143: goSlowPath+0x10 b.cpp:8 res += arr[i]; a.out 3172 [007] 253.555413143: goSlowPath+0x12 b.cpp:8 a.out 3172 [007] 253.555413143: goSlowPath+0x16 b.cpp:7 for (int i = 0; i <= n; i++) a.out 3172 [007] 253.555413143: goSlowPath+0x19 b.cpp:7 <loop iterations> a.out 3172 [007] 253.555413508: goSlowPath+0x19 b.cpp:7 a.out 3172 [007] 253.555413508: goSlowPath+0x1b b.cpp:7 a.out 3172 [007] 253.555413508: main+0x89 a.cpp:19 for (int i = 0; i < 100000; i++) { ...

Having such a trace is very helpful for any kind of analysis. It allows you to zoom in the exact invocation where the glitch happened and analyze it in isolation. You may further use the timestamp to align it with the other event that was happening at the same moment in the system. Also, you can compare two different invocations of the same function: the one that was fast and the one that was slow. You may then go back in time and compare what was happening before the glitch in both cases and hopefully will find the thing that was different.

Decoded processor traces won’t be able to show the values of different variables but you will get precise control flow. With this information you will know exactly what lead you here and hopefully it will sched some light on what might be causing performance issues. I’m not showing the history here since in our case it’s obvious what caused the glitch of 365ns.

But… There is a caveat. When adding either +srcline or +srccode to output, decoding gets extremely slow. For this workload that runs for only 7ns and generated traces of less than 1MB, decoding took more than a day! I know this is nuts for most of the users. I’m not entirely sure why that happens, but I suspect the implementation might not be optimal.

The good news is that there is a workaround. You don’t have to decode the whole trace in advance and can do it lazily. First you can decode without adding +srcline or +srccode to the perf script command. Then look at the time range that you care about and then only decode this time range using --time start,stop option. For example:

$ perf script --ns --itrace = i1t -F +srcline,+srccode --time 253.555413140,253.555413520 > time_range.dump

If you’re also interested in assembly instructions, you can do this with the command:

$ perf script --ns --itrace = i1t -F +insn | ../xed -F insn: -I -64 > decoded_insn.dump

This type of decoding is much faster. Note that I also used xed to decode the raw bytes into instructions. Here is what the decoded output (in decoded_insn.dump ) might look like:

proc PID CPU timestamp sym+off instruction ... a.out 3172 [007] 253.555413143: main+0x78 cmp eax, 0x3e6 a.out 3172 [007] 253.555413143: main+0x7d mov esi, eax a.out 3172 [007] 253.555413143: main+0x7f mov rdi, r12 a.out 3172 [007] 253.555413143: main+0x82 jle 0x556830247700 a.out 3172 [007] 253.555413143: main+0x84 call 0x556830247a80 a.out 3172 [007] 253.555413143: goSlowPath+0x0 test esi, esi <== enter goSlowPath() a.out 3172 [007] 253.555413143: goSlowPath+0x2 js 0x556830247aa0 a.out 3172 [007] 253.555413143: goSlowPath+0x4 movsxd rsi, esi a.out 3172 [007] 253.555413143: goSlowPath+0x7 xor eax, eax a.out 3172 [007] 253.555413143: goSlowPath+0x9 lea rdx, ptr [rdi+rsi*4+0x4] a.out 3172 [007] 253.555413143: goSlowPath+0xe data16 nop a.out 3172 [007] 253.555413143: goSlowPath+0x10 add eax, dword ptr [rdi] a.out 3172 [007] 253.555413143: goSlowPath+0x12 add rdi, 0x4 a.out 3172 [007] 253.555413143: goSlowPath+0x16 cmp rdi, rdx a.out 3172 [007] 253.555413143: goSlowPath+0x19 jnz 0x556830247a90 a.out 3172 [007] 253.555413143: goSlowPath+0x10 add eax, dword ptr [rdi] a.out 3172 [007] 253.555413143: goSlowPath+0x12 add rdi, 0x4 a.out 3172 [007] 253.555413143: goSlowPath+0x16 cmp rdi, rdx a.out 3172 [007] 253.555413143: goSlowPath+0x19 jnz 0x556830247a90 <loop iterations> a.out 3172 [007] 253.555413508: goSlowPath+0x16 cmp rdi, rdx a.out 3172 [007] 253.555413508: goSlowPath+0x19 jnz 0x556830247a90 a.out 3172 [007] 253.555413508: goSlowPath+0x1b ret <== exit goSlowPath() a.out 3172 [007] 253.555413508: main+0x89 sub ebx, 0x1 ...

If you want combined output, use:

$ perf script --ns --itrace = i1t -F comm ,tid,pid,time,cpu,ip,sym,symoff,srcline,srccode,insn | ../xed -F insn: -I -64 > decoded_combined.dump

This is the output that you might see:

proc PID CPU timestamp sym+off srcline instruction srccode ... a.out 3172 [007] 253.555413143: main+0x84 a.cpp:24 call 0x35c goSlowPath(arr, random_int); a.out 3172 [007] 253.555413143: goSlowPath+0x0 b.cpp:7 test esi, esi for (int i = 0; i <= n; i++) a.out 3172 [007] 253.555413143: goSlowPath+0x2 b.cpp:7 js 0x1e a.out 3172 [007] 253.555413143: goSlowPath+0x4 b.cpp:7 movsxd rsi, esi ...

Finally, we can also calculate interesting statistics like “How much times we called goSlowPath function?”:

$ grep goSlowPath decoded_insn.dump > slow.trace $ grep ret slow.trace -c 93

Note that this is exact number of calls! This is not a wild guess. We know precisely that goSlowPath was called 93 times.

We can calculate average time for one invocation of goSlowPath function:

$ grep "test esi" slow.trace | cut -f1 -d ":" > start.dump $ grep "ret" slow.trace | cut -f1 -d ":" > end.dump $ paste -d- end.dump start.dump | bc -l > func.times $ awk '{ total += $1; count++ } END { printf("%.9f",total/count)}' func.times 0.000001092

And there are many other ways how to extract useful information from the Intel PT traces.

That’s it. Hope that you find this article useful and maybe will apply it in your analysis someday. I agree that there are not many tools that support Intel PT, and even those tools don’t perfectly handle all the cases. Anyway, I hope it will get better in the future.