This post is the fourth of a series examining GHC’s support for DWARF debug information and the tooling that this support enables:

Part 1 introduces DWARF debugging information and explains how its generation can be enabled in GHC.

Part 2 looks at a DWARF-enabled program in gdb and examines some of the limitations of this style of debug information.

and examines some of the limitations of this style of debug information. Part 3 looks at the backtrace support of GHC’s runtime system and how it can be used from Haskell.

Part 4 examines how the Linux perf utility can be used on GHC-compiled programs.

utility can be used on GHC-compiled programs. Part 5 concludes the series by describing future work, related projects, and ways in which you can help.

Profiling with perf

The final application of debug information that we will examine is performance analysis, specifically profiling with the Linux perf tool.

perf is a statistical profiling tool relying on, among other things, the underlying machine’s performance monitoring hardware. It can be used not only to profile time, but also details of the microarchitecture such as cache-misses, front-end stalls, etc. With a bit of cleverness, one can even use perf to profile allocations of a Haskell program.

Moreover, all of this profiling functionality comes with a negligible impact on the throughput of the profiled profiling. Of course, for all of these benefits one trades off precision and ease of interpretation of the resulting profile.

Summary of the trade-offs made by cost-center profiler and statistical profiling with perf . Attribute perf cost-center profiler Runtime impact negligible anywhere from moderate to high Relatability to source program often hard to relate back to source program profile structure directly reflects cost centers declared in source program Provides call graphs yes, with limited depth yes Provides call graphs on Haskell programs not currently yes Profiled variables time, allocations, micro-architectural counters, system calls, user- and kernel-space probe points time, allocations Determinism profile is statistical and will likely vary from run to run deterministic

To see this trade-off in action, let’s return to our vector-tests-O0 example. We can acquire a simple time profile by simply running the executable under perf record :

$ perf record vector-tests-O0

The resulting profile can be examined via perf report . This will show you a TUI interface along the lines of

Samples: 145K of event 'cycles:ppp', Event count (approx.): 98781024840 Overhead Command Shared Object Symbol 10.19% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_quotInteger_info 5.02% vector-tests-O0 vector-tests-O0 [.] evacuate 3.58% vector-tests-O0 vector-tests-O0 [.] stg_upd_frame_info+0xffffffffffc00000 3.28% vector-tests-O0 vector-tests-O0 [.] QuickCheckzm2zi13zi2zmac90a2a0d9e0dd2c227d795a9d4d9de22a119c3781b679f3b245300e1b658c43_TestziQuickCheckziGen_zdwzdsgamma_info 2.96% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_timesInteger_info 2.93% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_minusInteger_info 2.67% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_testBitInteger_info 2.34% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_eqIntegerzh_info 2.30% vector-tests-O0 vector-tests-O0 [.] randomzm1zi1zmc60864d5616c60090371cdf8e600240f388e8a9bd87aa769d8045bda89826ee2_SystemziRandom_zdwrandomIvalInteger_info 2.18% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_plusInteger_info 2.16% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_divInteger_info 1.71% vector-tests-O0 vector-tests-O0 [.] QuickCheckzm2zi13zi2zmac90a2a0d9e0dd2c227d795a9d4d9de22a119c3781b679f3b245300e1b658c43_TestziQuickCheckziGen_zdwilog2_info 1.63% vector-tests-O0 vector-tests-O0 [.] stg_ap_p_info+0xffffffffffc00010 1.52% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_remInteger_info 1.36% vector-tests-O0 vector-tests-O0 [.] hs_popcnt64 1.34% vector-tests-O0 vector-tests-O0 [.] stg_PAP_apply 1.23% vector-tests-O0 vector-tests-O0 [.] stg_ap_0_fast 1.22% vector-tests-O0 vector-tests-O0 [.] stg_gc_noregs 1.20% vector-tests-O0 vector-tests-O0 [.] stg_newByteArrayzh 1.15% vector-tests-O0 vector-tests-O0 [.] stg_ap_pp_info+0xffffffffffc00000 1.11% vector-tests-O0 vector-tests-O0 [.] _randomzm1zi1zmc60864d5616c60090371cdf8e600240f388e8a9bd87aa769d8045bda89826ee2_SystemziRandom_b1_siHV_entry 0.98% vector-tests-O0 vector-tests-O0 [.] integerzmwiredzmin_GHCziIntegerziType_geIntegerzh_info 0.98% vector-tests-O0 vector-tests-O0 [.] stg_BLACKHOLE_info+0xffffffffffc00069 0.92% vector-tests-O0 vector-tests-O0 [.] scavenge_block ...

This profile shows costs from both the runtime system and compiled Haskell. If one selects an entry from this list with the arrow keys and presses enter perf will show the annotated assembler of this function, along with the associated Haskell source.

In principle, perf record can also give us call-graph information. We will examine this in depth in the next section.

Paths to call-stack profiling

In addition to the flat profiles we saw earlier, perf can also produce call-graph profiles which sample not only the current location of execution but also callers to the enclosing function. Such call-graph profiles are produced with the perf record -g command.

Unlike flat profiles, call-graph capture tends to be quite language- and hardware-dependent. Consequently, perf record currently provides three mechanisms for collecting call-stacks:

via the frame pointer register: GHC doesn’t track the frame-pointer like many imperative languages do, so this is unusable on Haskell programs.

via the last branch record (LBR): This takes advantage of the LBR, a hardware feature of Intel CPUs, and is only really usable on modern (e.g. Skylake and later) hardware. However, even on Skylake it provides only a very limited stack depth (32 branches on Skylake, of which many will be uninteresting).

via DWARF unwinding: This uses DWARF unwinding information to decode the state of the stack, which is captured at sampling-time. This capture can result in non-trivial runtime overhead.

Recent Intel hardware also includes another, more expensive hardware mechanism, the Branch Trace Store, which could be useful for deeper call-graph acquisition but perf record support currently does not support using it for call-graph capture.

Summary of trade-offs made by various call-stack collection mechanisms supported by perf record . attribute frame pointer LBR DWARF maximum depth unlimited 32 on Skylake unlimited runtime overhead negligible negligible small compatible with GHC execution model no yes? not currently

Unfortunately, call-stack profiling support for Haskell is complicated by the fact that currently GHC uses machine registers in a rather non-standard way: rather than use the machine’s stack pointer register (e.g. %rsp on x86-64) to store the Haskell stack pointer GHC rather reserves another register ( %rbp on x86-64). This choice simplifies interaction with foreign code at the expense of one usable register. While DWARF’s representation of unwinding information is sufficiently flexible to encode GHC’s non-standard stack register, the unwinding logic in the Linux kernel used by perf record sadly is not. For this reason, the unwind information produced by ghc -g is of little use when profiling with perf record -g .

Other than DWARF unwinding, the only other viable call-graph option is the last-branch register (LBR). Unfortunately, the rather restrictive 32-branch depth limit and the fact that GHC does not use the traditional call instruction means that in practice the call-graphs produced by this method tend not to be very useful.