Kernel Tracing Using Ftrace

Ftrace is one of those useful “kernel debugging” tools which you turn to when other debugging tools fail to reveal the underlying problem. Ftrace is a Linux kernel internals tracing tool that was first included in the 2.6.27 kernel in 2008. The main developer was (and still is) Steven Rostedt who is currently a Red Hat employee with responsiblity for the real-time patches in the Linux kernel.

The name ftrace comes from the term function tracer, which was the original purpose of the tool, but nowadays it can do a lot more than just trace function calls. Over the years, various additional tracers have been added to look at things like context switches, how long interrupts are disabled, how long it takes for high-priority tasks to run after they have been woken up, and more. Ftrace includes a plugable framework that allows new tracers to be added relatively easily. Currently over 700 events throughout the kernel can be traced.

For my examples in this blog post, I am using ftrace on Fedora 20. Linux kernels must be built with tracing enabled. As you can see, Fedora 20 was built with tracing enabled.



# grep TRACER /boot/config-3.14.2-200.fc20.x86_64 # CONFIG_ATH5K_TRACER is not set CONFIG_NOP_TRACER=y CONFIG_HAVE_FUNCTION_TRACER=y CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y CONFIG_TRACER_MAX_TRACE=y CONFIG_CONTEXT_SWITCH_TRACER=y CONFIG_GENERIC_TRACER=y CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y # CONFIG_IRQSOFF_TRACER is not set CONFIG_SCHED_TRACER=y CONFIG_TRACER_SNAPSHOT=y # CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set CONFIG_STACK_TRACER=y



For some unknown reason, however, not all tracers were enabled. Red Hat Enterprise Linux 6.4 has the same config settings.. By way of contrast, Red Hat MRG provides the following tracers: events wakeup preemptirqsoff preemptoff irqsoff ftrace sched_switch none, once the trace variant of the MRG Realtime kernel is installed and in use.

The main ftrace documentation is available in the Linux kernel source at …/root/Documentation/trace/ftrace.txt. It was last updated for the 3.10 kernel. In the same directory, there are a number of other interesting documents including ftrace-design.txt by Mike Frysinger which focuses on architecture implementation details. The actual tracemoint implementation mechanism is detailed by Mathieu Desnoyers in tracepoints.txt.

A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be “on” (a probe is connected to it) or “off” (no probe is attached). When a tracepoint is “off” it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is “on”, the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site).

The default location for ftrace is /sys/kerenel/debug. You may have to mount the debug filesystem.



# mount -t debugfs nodev /sys/kernel/debug # cat /sys/kernel/debug/tracing/available_tracers blk function_graph wakeup_dl wakeup_rt wakeup function nop # cd /sys/kernel/debug # ls acpi cleancache dynamic_debug hid mce regmap tracing xen bdi clk extfrag ieee80211 pinctrl sched_features usb bluetooth dma_buf frontswap kprobes pkg_temp_thermal sleep_time wakeup_sources boot_params dri gpio lpfc pstate_snb suspend_stats x86 # cd tracing # ls available_events instances set_ftrace_notrace trace_options available_filter_functions kprobe_events set_ftrace_pid trace_pipe available_tracers kprobe_profile set_graph_function trace_stat buffer_size_kb max_graph_depth set_graph_notrace tracing_cpumask buffer_total_size_kb options snapshot tracing_max_latency current_tracer per_cpu stack_max_size tracing_on dyn_ftrace_total_info printk_formats stack_trace tracing_thresh enabled_functions README stack_trace_filter uprobe_events events saved_cmdlines trace uprobe_profile free_buffer set_event trace_clock function_profile_enabled set_ftrace_filter trace_marker

The ftrace framework is one of the many improvements to come out of the Linux kernel realtime effort. Unlike SystemTap, it does not attempt to be a comprehensive, scriptable facility. There is a set of virtual files in a debugfs directory which can be used to enable specific tracers and see the results. The function call tracer after which ftrace is named simply outputs each kernel function called as it happens. Other tracers can look at wakeup latency, events, enabling and disabling interrupts, preemption, task switches, and more.

Kernel function call tracing (the original purpose of ftrace) is enabled as follows:



# echo function > /sys/kernel/debug/tracing/current_tracer # echo 1 > /sys/kernel/debug/tracing/tracing_on



Note that older ftrace documentation examples refer to tracing_enabled rather than tracing_on

Events which are available for tracing are listed in /sys/kernel/debug/tracing/available_events. To enable a particular event, such as sched_wakeup, simply echo it to /sys/kernel/debug/tracing/set_event. For example:



# echo sched_wakeup >> /sys/kernel/debug/tracing/set_event



The append (“>>”) operator is required to avoid overwriting existing events.

To disable an event, echo the event name to the set_event file prefixed with an exclamation point:



# echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event

To disable all events, echo an empty line to the set_event file:



# echo > /sys/kernel/debug/tracing/set_event

To enable all events, echo ‘*:*’ or ‘*:’ to the set_event file:



# echo *:* > /sys/kernel/debug/tracing/set_event

Events are organized into subsystems, such as ext4, irq, or sched, and a full event name looks like this: <subsystem>:<event>. The subsystem name is optional, but it is displayed in the available_events file. All of the events in a subsystem can be specified via the syntax <subsystem>:*. For example, to enable all irq events:



# echo 'irq:*' > /sys/kernel/debug/tracing/set_event



You can drill down and enable or disable individual events under the events directory:



# pwd /sys/kernel/debug/tracing/events # ls block gpio kmem pagemap scsi vmscan btrfs hda kvm power signal vsyscall cfg80211 hda_intel kvmmmu printk skb workqueue compaction header_event mce random sock writeback drm header_page migrate ras sunrpc xen enable i915 module raw_syscalls swiotlb xhci-hcd exceptions iommu napi rcu syscalls ext4 irq net regmap task filemap irq_vectors nmi rpm timer ftrace jbd2 oom sched udp # cd sched # ls enable sched_process_exit sched_stat_wait filter sched_process_fork sched_stick_numa sched_kthread_stop sched_process_free sched_swap_numa sched_kthread_stop_ret sched_process_wait sched_switch sched_migrate_task sched_stat_blocked sched_wait_task sched_move_numa sched_stat_iowait sched_wakeup sched_pi_setprio sched_stat_runtime sched_wakeup_new sched_process_exec sched_stat_sleep # cd sched_wakeup # ls -ltotal 0 -rw-r--r--. 1 root root 0 May 8 11:23 enable -rw-r--r--. 1 root root 0 May 8 11:23 filter -r--r--r--. 1 root root 0 May 8 11:23 format -r--r--r--. 1 root root 0 May 8 11:23 id -rw-r--r--. 1 root root 0 May 8 11:23 trigger



For example to enable the sched_wakeup event:



# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable



To disable the sched_wakeup event:



# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable



There is lots more to events than I have room to detail in this blog. For example, you can control the output format of events, filter specific events, and invoke specific scriptable actions when events are triggered. Read the ftrace documentation for all the details.

Ftrace uses a custom kernel ring buffer to store event information. The ring buffer can be used in either an overwrite mode or in producer/consumer mode. Producer/consumer mode is where if the producer were to fill up the ring buffer before the consumer could free up anything, the producer will stop writing to the buffer. This will cause the most recent events to be lost. Overwrite mode is where if the producer were to fill up the buffer before the consumer could free up anything, the producer will

overwrite the older data. This will cause loss of the oldest events. The size of the ring buffer is configurable.



# pwd /sys/kernel/debug/tracing # cat buffer_total_size_kb 28 (expanded: 5632) ]# cat buffer_size_kb 7 (expanded: 1408)



A trace marker enables some coordination between what is happening in user space and inside the kernel by providing a way to write into the ftrace kernel ring buffer from user space. This marker will then appear in the trace output to give a location where a specific event occurred. Here is an example of setting a trace marker:



# echo "FINN" > /sys/kernel/debug/tracing/trace_marker

One of the annoying things about ftrace for most first time users of the tool is that it does not provide an easy mechanism to invoke an executable and just trace that executable. Here is a simple shell script which provides that functionality:



#!/bin/bash DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'` echo $$ > $DEBUGFS/tracing/set_ftrace_pid echo function > $DEBUGFS/tracing/current_tracer echo 1 > $DEBUGFS/tracing/tracing_on exec $* echo 0 > $DEBUGFS/tracing/tracing_on

The function call tracer (function) traces entry into all kernel functions.

Here is what is recorded in /sys/kernel/debug/tracing/trace when the above script is used to invoke ls ( ./script ls):



# tracer: function # # entries-in-buffer/entries-written: 364/27769 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | ls-5080 [003] .... 42387.930906: SyS_rt_sigaction < -system_call_fastpath ls-5080 [003] .... 42387.930907: do_sigaction



As you can see, quite a lot of output is produced even for a single invocation of ls.

The volume of date outputted by ftrace can be enormous. Consider the following simple Hello World C language program compiled with GCC 4.8.2:



#include <stdio.h> int main() { printf("hello world

"); }



When traced with the above ftrace script, over 11000 lines of tracing data was produced!

In the next example, the function call tracer is replaced with the function_graph call tracer. This tracer traces entry into and exit from kernel functions and provides the ability to draw a graph of kernal function calls that were invoked.



#!/bin/bash DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'` echo $$ > $DEBUGFS/tracing/set_ftrace_pid echo function_graph > $DEBUGFS/tracing/current_tracer echo 1 > $DEBUGFS/tracing/tracing_on exec $* echo 0 > $DEBUGFS/tracing/tracing_on

Here is the new output:



# tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) 0.038 us | mntput_no_expire(); 0) 0.310 us | } /* mntput */ 0) + 35.884 us | } /* path_openat */ 0) + 36.234 us | } /* do_filp_open */ 0) 0.038 us | __fsnotify_parent(); 0) 0.044 us | fsnotify(); 0) + 37.168 us | } /* do_open_exec */ 0) | sched_exec() { 0) 0.039 us | _raw_spin_lock_irqsave(); 0) | select_task_rq_fair() { 0) 0.043 us | source_load(); 0) 0.044 us | target_load(); 0) 0.044 us | target_load(); 0) 0.038 us | target_load(); 0) 1.990 us | } 0) 0.044 us | _raw_spin_unlock_irqrestore(); 0) | stop_one_cpu() { 0) | cpu_stop_init_done() { 0) 0.036 us | __init_waitqueue_head(); 0) 0.388 us | } 0) | cpu_stop_queue_work() { 0) 0.036 us | _raw_spin_lock_irqsave(); 0) | wake_up_process() { 0) | try_to_wake_up() { 0) 0.094 us | _raw_spin_lock_irqsave(); 0) 0.101 us | select_task_rq_stop(); 0) 0.042 us | _raw_spin_lock(); 0) | ttwu_do_activate.constprop.95() { 0) | activate_task() { 0) | enqueue_task() { 0) 0.066 us | update_rq_clock.part.83(); 0) 0.029 us | enqueue_task_stop(); 0) 0.660 us | } 0) 0.951 us | } 0) | ttwu_do_wakeup() { 0) | check_preempt_curr() { 0) 0.039 us | resched_task(); 0) 0.339 us | } 0) 0.753 us | } 0) 2.242 us | } 0) 0.046 us | _raw_spin_unlock_irqrestore(); 0) 4.537 us | } 0) 5.156 us | } 0) 0.044 us | _raw_spin_unlock_irqrestore(); 0) 6.171 us | } 0) | wait_for_completion() { 0) | _cond_resched() { 0) | __schedule() { 0) 0.049 us | rcu_note_context_switch(); 0) 0.038 us | _raw_spin_lock_irq(); 0) | put_prev_task_fair() { 0) | update_curr() { 0) 0.029 us | update_min_vruntime(); 0) 0.074 us | cpuacct_charge(); 0) 0.658 us | } 0) 0.052 us | __enqueue_entity(); 0) | update_curr() { 0) 0.036 us | update_min_vruntime(); 0) 0.326 us | } 0) 0.039 us | __enqueue_entity(); 0) 2.463 us | } 0) 0.053 us | pick_next_task_stop(); 1) 8.842 us | } /* enqueue_task */ 1) 9.424 us | } /* activate_task */ 1) | ttwu_do_wakeup() { 1) | check_preempt_curr() { 1) 0.152 us | resched_task(); 1) 0.617 us | } 1) 1.093 us | } 1) + 11.337 us | } /* ttwu_do_activate.constprop.95 */ 1) 0.068 us | _raw_spin_unlock_irqrestore(); 1) + 17.778 us | } /* try_to_wake_up */ 1) + 18.213 us | } /* default_wake_function */ 1) + 18.734 us | } /* child_wait_callback */ 1) + 19.313 us | } /* __wake_up_common */ 1) 0.076 us | _raw_spin_unlock_irqrestore(); 1) + 20.689 us | } /* __wake_up_sync_key */ 1) + 21.126 us | } /* __wake_up_parent */ 1) 0.067 us | _raw_spin_unlock_irqrestore(); 1) + 29.656 us | } /* do_notify_parent */ 1) 0.053 us | _raw_spin_lock(); 1) 0.052 us | exit_rcu(); 1) | schedule() { 1) | __schedule() { 1) 0.053 us | rcu_note_context_switch(); 1) 0.051 us | _raw_spin_lock_irq(); 1) | deactivate_task() { 1) | dequeue_task() { 1) 0.103 us | update_rq_clock.part.83(); 1) | dequeue_task_fair() { 1) | dequeue_entity() { 1) | update_curr() { 1) 0.056 us | update_min_vruntime(); 1) 0.058 us | cpuacct_charge(); 1) 0.933 us | } 1) 0.083 us | update_cfs_rq_blocked_load(); 1) 0.049 us | clear_buddies(); 1) 0.096 us | account_entity_dequeue(); 1) 0.072 us | update_min_vruntime(); 1) 0.051 us | update_cfs_shares(); 1) 3.705 us | } 1) 0.049 us | hrtick_update(); 1) 4.626 us | } 1) 5.519 us | } 1) 5.952 us | } 1) | idle_balance() { 1) | update_blocked_averages() { 1) 0.055 us | _raw_spin_lock_irqsave(); 1) | update_rq_clock() { 1) 0.114 us | update_rq_clock.part.83(); 1) 0.534 us | } 1) 0.089 us | update_cfs_rq_blocked_load(); 1) 0.114 us | update_cfs_rq_blocked_load(); 1) 0.100 us | update_cfs_rq_blocked_load(); 1) 0.199 us | update_cfs_rq_blocked_load(); 1) 0.080 us | update_cfs_rq_blocked_load(); 1) 0.063 us | _raw_spin_unlock_irqrestore(); 1) 5.311 us | } 1) | load_balance() { 1) | find_busiest_group() { 1) | update_group_power() { 1) 0.051 us | msecs_to_jiffies(); 1) 0.052 us | arch_scale_freq_power(); 1) 1.098 us | } 1) 0.056 us | target_load(); 1) 0.052 us | idle_cpu(); 1) 0.059 us | source_load(); 1) 0.102 us | idle_cpu(); 1) 0.068 us | source_load(); 1) 0.086 us | idle_cpu(); 1) 0.053 us | source_load(); 1) 0.058 us | idle_cpu(); 1) 6.578 us | } 1) 7.267 us | } 1) 0.054 us | msecs_to_jiffies(); 1) 0.051 us | _raw_spin_lock(); 1) + 14.520 us | } 1) 0.094 us | put_prev_task_fair(); 1) 0.077 us | pick_next_task_fair(); 1) 0.073 us | pick_next_task_stop(); 1) 0.094 us | pick_next_task_dl(); 1) 0.136 us | pick_next_task_rt(); 1) 0.073 us | pick_next_task_fair(); 1) 0.051 us | pick_next_task_idle();

To simplify the use of ftrace, Rostedt has written two tools – trace-cmd and kernelshark. These are available as standard repo RPMs in Fedora 20. Trace-cmd is a binary command line tool which acts as a user interface to ftrace whereas Kernelshark is a graphical consumer of trace-cmd output.

# trace-cmd record -e sched ls .... # ls -al trace.dat -rw-r--r--. 1 root root 4136960 May 8 16:44 trace.dat # wc -l trace.dat 94909 trace.dat



By default, output is written to a file called trace.dat in the current directory.

Here I show kernelsmark using the above trace.dat file to display the collected tracing data:

As you can see below, this data can be filtered in numerous ways to reduce screen clutter and eliminate extraneous output:

You can read a good tutorial on kernelshark here. It takes a bit of effort to understand either the raw output of ftrace or what is displayed using kernelshark but the effort is worth it when you are dealing with difficult kernel timing or locking issues.

Well, that is all for this particular blog post. Note that there is far more to ftrace than what I have managed to include in this blog. It is an incredibly powerful tool which most Linux system administrators and developers seem to be unaware of.