Profiling Android apps with Flamegraphs

Companion code for this post available on Github

TL;DR: I built a tool to turn Android trace output into flame graphs. You can check out the source code here, or get started immediately by uploading a trace file here.

If you’ve ever tried to debug a performance issue in an Android app, you’ve probably become familiar with Traceview, which reads the .trace files generated by the Debug.startMethodTracing API call and displays them in a more or less readable manner. However, for me at least, Traceview is less than ideal: the interface is clunky, the scrolling behaviour is questionable (zooming and scrolling down at the same time - just what I wanted), and it’s very difficult to pick out the call chains that are consuming most of your time, especially if multiple threads are involved.

Traceview. Isn't it obvious from the coloured bars what's going on?

One of the most useful performance visualization and analysis tools I am aware of is Brendan Gregg’s Flame Graphs, which make it easy to identify long-running sections of your code. However, I couldn’t find any existing tooling for converting the Android trace format to the folded stack format expected by the flamegraph generator, so it was time to get familiar with the internals of the format.

First things first, we need to gather a trace. In my case, I have an app that takes an appreciable amount of time to load the first Activity, so I’m going to add trace sections to onCreate, onStart and onResume, following this pattern:

private static final int MEGABYTE = 1024 * 1024;

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);

    // Call the resulting trace file 'onCreate.trace', and allow a 128MiB
    // buffer for collecting trace data.
    Debug.startMethodTracing("onCreate", 128 * MEGABYTE);

    // Existing onCreate code
    [...]

    // Stop method tracing
    Debug.stopMethodTracing();
}

Once that’s added, we can start up the app normally and wait for it to finish booting. Note that like all profilers, this trace mechanism adds overhead! Don’t treat the absolute timings in these traces as correct; instead, compare the relative differences between traces. Your app will also take noticeably longer to start while profiling is active - this is normal.

Once it has loaded, open up adb and take a look to check that your traces have been created:

ross@mjolnir:/h/ross$ adb shell ls -l /sdcard/Android/data/com.schlaikjer.cookbook/files/
total 112400
-rw-rw---- 1 u0_a122 sdcard_rw 57473027 2017-02-26 14:26 onCreate.trace
-rw-rw---- 1 u0_a122 sdcard_rw     6255 2017-02-26 14:26 onResume.trace
-rw-rw---- 1 u0_a122 sdcard_rw    60809 2017-02-26 14:26 onStart.trace

And then pull them all down to the host so that we can take a look at them:

ross@mjolnir:/h/ross$ for F in {Create,Start,Resume}; do adb pull /sdcard/Android/data/com.schlaikjer.cookbook/files/on$F.trace; done
4693 KB/s (57473027 bytes in 11.959s)
710 KB/s (60809 bytes in 0.083s)
79 KB/s (6255 bytes in 0.077s)

If we crack one open with our editor of choice, we can see that the files begin with three plain text sections, followed by what looks like a lot of binary data:

*version
3
data-file-overflow=false
clock=dual
elapsed-time-usec=18482984
num-method-calls=4086234
clock-call-overhead-nsec=767
vm=art
pid=17816
*threads
17816   main
17821   Jit thread pool worker thread 0
[...]
*methods
0x7b0   java.lang.BootClassLoader   getInstance   ()Ljava/lang/BootClassLoader;   ClassLoader.java
0x7ac   java.lang.ClassLoader   findLoadedClass   (Ljava/lang/String;)Ljava/lang/Class;   ClassLoader.java
[...]
*end
SLOW^C^@ ^@&<95> [...]

Some of these (elapsed time, number of calls, vm name &c) are pretty intuitive. For the rest of them, the best way to figure out what they are is to look at the code that generates them!

os << StringPrintf("%cversion\n", kTraceTokenChar);
os << StringPrintf("%d\n", GetTraceVersion(clock_source_));
os << StringPrintf("data-file-overflow=%s\n", overflow_ ? "true" : "false");
if (UseThreadCpuClock()) {
  if (UseWallClock()) {
    os << StringPrintf("clock=dual\n");
  } else {
    os << StringPrintf("clock=thread-cpu\n");
  }
} else {
  os << StringPrintf("clock=wall\n");
}
os << StringPrintf("elapsed-time-usec=%" PRIu64 "\n", elapsed);
if (trace_output_mode_ != TraceOutputMode::kStreaming) {
  size_t num_records = (final_offset - kTraceHeaderLength) / GetRecordSize(clock_source_);
  os << StringPrintf("num-method-calls=%zd\n", num_records);
}
os << StringPrintf("clock-call-overhead-nsec=%d\n", clock_overhead_ns_);
os << StringPrintf("vm=art\n");
os << StringPrintf("pid=%d\n", getpid());
if ((flags_ & kTraceCountAllocs) != 0) {
  os << StringPrintf("alloc-count=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_OBJECTS));
  os << StringPrintf("alloc-size=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_BYTES));
  os << StringPrintf("gc-count=%d\n", Runtime::Current()->GetStat(KIND_GC_INVOCATIONS));
}

Based on this, it looks like our version header consists of:

Version: The trace format version (3 for all devices I tested)

Data file overflow: The overflow_ flag seems to be set if the number of calls in the binary trace section overflows the buffer specified in the start trace call. If you see this set to true, you should re-run your trace with a larger buffer to ensure you aren’t missing any information.

Clock: Whether the trace data uses wallclock time, per-thread CPU time, or both. All tested devices reported both clock times.

Elapsed time: The total trace time, in microseconds.

Clock call overhead: The time it takes to read the clock, in nanoseconds. Any measurements close to or below this number should be assumed to be below the noise floor for this trace.

VM: Art or Dalvik

Pid: The process ID of the process under trace

There are also three fields that were not present in the trace we took a look at:

Alloc count: Number of allocated objects

Alloc size: Size of all allocated objects

GC count: The number of collections that have occurred
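The plain-text sections are simple enough that parsing them needs nothing fancy. As a minimal sketch (in Python rather than the Erlang used later, and with the section-handling simplified), the *version key=value pairs could be pulled out like this:

```python
def parse_version_section(text):
    """Parse the key=value lines of the *version section into a dict.

    `text` is the plain-text portion of a .trace file; section names
    ("*version", "*threads", ...) start with the token character '*'.
    The bare version number line ("3") has no '=', so it is skipped here.
    """
    fields = {}
    in_version = False
    for line in text.splitlines():
        if line.startswith("*"):
            # "*version" opens the section; any other "*" marker closes it
            in_version = (line == "*version")
            continue
        if in_version and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value
    return fields


header = """*version
3
data-file-overflow=false
clock=dual
elapsed-time-usec=18482984
*threads
17816\tmain
"""
print(parse_version_section(header))
```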

With that under our belt, let’s move on to the Threads section. This one is pretty simple:

void Trace::DumpThreadList(std::ostream& os) {
  Thread* self = Thread::Current();
  for (auto it : exited_threads_) {
    os << it.first << "\t" << it.second << "\n";
  }
  Locks::thread_list_lock_->AssertNotHeld(self);
  MutexLock mu(self, *Locks::thread_list_lock_);
  Runtime::Current()->GetThreadList()->ForEach(DumpThread, &os);
}

So each entry under the thread section is just a tuple of Thread ID and a human readable thread name. The method section is similar, but has a few more fields:

std::string Trace::GetMethodLine(ArtMethod* method) {
  method = method->GetInterfaceMethodIfProxy(kRuntimePointerSize);
  return StringPrintf("%#x\t%s\t%s\t%s\t%s\n",
                      (EncodeTraceMethod(method) << TraceActionBits),
                      PrettyDescriptor(method->GetDeclaringClassDescriptor()).c_str(),
                      method->GetName(),
                      method->GetSignature().ToString().c_str(),
                      method->GetDeclaringClassSourceFile());
}

So the method section is a list of tuples of:

Method ID

Declaring class

Method name

Method type signature

Declaring class’s source file.

Method declaration line number (only present for some runtimes)
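Since GetMethodLine emits tab-separated fields with a hex method ID, a line of the *methods table is easy to split apart. A small Python sketch (field names are my own, not from the trace format):

```python
def parse_method_line(line):
    """Split one tab-separated *methods line into a dict.

    The declaration line number is only present on some runtimes, so we
    accept either five or six columns and ignore any trailing ones.
    """
    parts = line.rstrip("\n").split("\t")
    return {
        "id": int(parts[0], 16),   # hex ID, lower two bits already zero
        "class": parts[1],
        "name": parts[2],
        "signature": parts[3],
        "source_file": parts[4] if len(parts) > 4 else None,
    }


line = ("0x7b0\tjava.lang.BootClassLoader\tgetInstance\t"
        "()Ljava/lang/BootClassLoader;\tClassLoader.java")
m = parse_method_line(line)
print(m["id"], m["name"])
```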

That’s all the text sections dealt with. Now we can take a look at the binary data at the end of the file. Luckily, this section is actually described at the top of trace.h:

// File format:
//   header
//   record 0
//   record 1
//   ...
//
// Header format:
//   u4  magic ('SLOW')
//   u2  version
//   u2  offset to data
//   u8  start date/time in usec
//   u2  record size in bytes (version >= 2 only)
//   ... padding to 32 bytes
//
// [...]
//
// Record format v3:
//   u2  thread ID
//   u4  method ID | method action
//   u4  time delta since start, in usec
//   u4  wall time since start, in usec (when clock == "dual" only)
//
// 32 bits of microseconds is 70 minutes.
//
// All values are stored in little-endian order.
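Reading that layout off with a struct unpack is straightforward. Here’s a Python sketch of the fixed header (the sample bytes are hypothetical, built to match the v3 dual-clock layout with its 14-byte records):

```python
import struct

def parse_binary_header(buf):
    # Header layout (v3, little-endian), per the comment in trace.h:
    #   u4 magic ('SLOW'), u2 version, u2 offset to data,
    #   u8 start date/time in usec, u2 record size in bytes
    magic, version, data_offset, start_time, record_size = struct.unpack_from(
        "<4sHHQH", buf, 0)
    assert magic == b"SLOW"
    return version, data_offset, start_time, record_size


# A hypothetical header: version 3, data at offset 32, 14-byte records
# (u2 + u4 + u4 + u4 for a dual-clock record), padded to 32 bytes.
header = struct.pack("<4sHHQH", b"SLOW", 3, 32, 0, 14).ljust(32, b"\x00")
print(parse_binary_header(header))
```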

The interesting thing to note here (and something that tripped me up for a while) is the method ID | method action section of the record format. If we take a look over in trace.cc, we can see how that’s implemented:

uint32_t Trace::EncodeTraceMethodAndAction(ArtMethod* method, TraceAction action) {
  uint32_t tmid = (EncodeTraceMethod(method) << TraceActionBits) | action;
  DCHECK_EQ(method, DecodeTraceMethod(tmid));
  return tmid;
}

Where TraceAction is defined as:

enum TraceAction {
  kTraceMethodEnter = 0x00,       // method entry
  kTraceMethodExit = 0x01,        // method exit
  kTraceUnroll = 0x02,            // method exited by exception unrolling
  // 0x03 currently unused
  kTraceMethodActionMask = 0x03,  // two bits
};

So with this data, we know that the method ID encoded in one of the trace records with the lower two bits masked off will match one of the method IDs in the *methods section of the plain text header. We can then use the lower two bits to work out whether each entry is a method entry or exit (via either return or stack unwind).
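In Python, that masking looks like this (a minimal sketch; the field value 0x7b1 is a hypothetical record encoding an exit from method 0x7b0 in the sample trace above):

```python
TRACE_ACTION_MASK = 0x03  # kTraceMethodActionMask: the lower two bits
ACTIONS = {0x00: "enter", 0x01: "exit", 0x02: "unwind"}

def decode_method_and_action(tmid):
    # Masking off the action bits yields the method ID exactly as it
    # appears in the *methods section, since those IDs are themselves
    # already shifted left by TraceActionBits.
    return tmid & ~TRACE_ACTION_MASK, ACTIONS[tmid & TRACE_ACTION_MASK]


method_id, action = decode_method_and_action(0x7b1)
print(hex(method_id), action)  # method 0x7b0, exiting
```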

Armed with this, let’s start writing a parser for these files. I chose Erlang as a learning exercise, and also because I intended to make use of its excellent binary matching syntax in conjunction with binary comprehensions. Since we know the magic for the binary section of the trace file (SLOW), let’s take a look at how we can easily parse out the header and the records using binary matching.

-define(TRACE_HEADER_MAGIC, "SLOW").

% Find the location of the trace header
{HeaderPos, _} = binary:match(Data, <<?TRACE_HEADER_MAGIC>>),

% Match out the entire header specification into variables
<<?TRACE_HEADER_MAGIC,
  VersionBin:2/binary,
  DataOffsetBin:2/binary,
  StartTimeBin:8/binary,
  RecordSizeBin:2/binary>> = binary:part(Data, {HeaderPos, 18}),

% Remember all numbers are little endian
DataOffset = binary:decode_unsigned(DataOffsetBin, little),
RecordSize = binary:decode_unsigned(RecordSizeBin, little),

% Now that we have the header start and header size, we can start parsing out
% the call records themselves. First, excerpt the section of the trace that
% contains the binary data
SectionStart = HeaderPos + DataOffset,
SectionEnd = byte_size(Data),
RecordSection = binary_part(Data, {SectionStart, SectionEnd - SectionStart}),

% Now that we have the records, we can break them up based on the RecordSize
% that the header specified and parse them
Records = [Record || <<Record:RecordSize/binary>> <= RecordSection],
ParsedRecords = [parse_trace_record(Record) || Record <- Records].

As you can see, extracting the records section and parsing the header was pretty simple using the binary syntax. We use the same approach to parse out the records themselves:

parse_trace_record(Record) ->
    <<ThreadId:2/binary,
      MethodIdActionBin:4/binary,
      TimeDelta:4/binary,
      WallTimeDelta:4/binary>> = Record,

    % Decode the method ID and action from a binary to an integer
    MethodIdAction = binary:decode_unsigned(MethodIdActionBin, little),

    % Now remember that this is a 4-byte integer, and that the top bits
    % are the actual method ID
    MethodId = MethodIdAction band 16#FFFFFFFC,

    % While the action is the lower two bits.
    % Convert to an atom for readability
    MethodAction = case MethodIdAction band 16#00000003 of
        16#00 -> enter;
        16#01 -> exit;
        16#02 -> unwind
    end,

    #call_record{
        thread_id = binary:decode_unsigned(ThreadId, little),
        method_id = MethodId,
        method_action = MethodAction,
        time_delta = binary:decode_unsigned(TimeDelta, little),
        wall_time_delta = binary:decode_unsigned(WallTimeDelta, little),
        child_time = 0
    }.

That’s most of the hard work! The full parser code, including the sections for the thread and method tables, can be seen in the final parser implementation here. Now that we have our call records, method IDs and thread IDs, we need to convert that data into the format the flame graph generator expects: ;-delimited stack frames, followed by a space and a number representing the time / samples / cycles spent in the final call of that stack. To calculate this, we iterate over the records per-thread and perform the following:

For a method entry: Push the method name onto a stack, so that we can keep track of what methods have been called. Also push the full method record onto a second stack, so that we can reference its timings later.

For a method exit/unwind: Here’s where the real logic happens. When a method exits, the stack should contain the matching method entry record. We can use the time on the two records to calculate how much time was spent in this method call altogether. If there is another parent on the stack, we update it to reflect how much time was spent in the current call - this allows for separate tracking of self and child call times. We then take the self time (subtracting any children from the current call), and update a map of method name list -> time with the self time. This deduplicates identical call chains.

Once we have iterated over all the calls, we should have a map of lists of method names to timings. From there, we can just join the names with semicolons, append a space and the timing, and it’s all set to be processed by flamegraph.pl.
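The per-thread fold described above can be sketched in Python (the real implementation is in Erlang; the method names and timings here are hypothetical):

```python
def fold_records(records, method_names):
    """Collapse (action, method_id, timestamp_usec) events, in trace order,
    into folded-stack self-time totals keyed by semicolon-joined call chain,
    ready for flamegraph.pl."""
    name_stack = []    # human-readable call chain for the current stack
    entry_stack = []   # [entry_timestamp, accumulated_child_time] per open call
    folded = {}
    for action, method_id, ts in records:
        if action == "enter":
            name_stack.append(method_names[method_id])
            entry_stack.append([ts, 0])
        else:  # "exit" or "unwind": pop the matching entry record
            start, child_time = entry_stack.pop()
            total = ts - start
            if entry_stack:
                # Credit our total time to the parent as child time, so the
                # parent can later subtract it to get its own self time
                entry_stack[-1][1] += total
            key = ";".join(name_stack)
            # Self time only; identical call chains are deduplicated here
            folded[key] = folded.get(key, 0) + (total - child_time)
            name_stack.pop()
    return folded


names = {0x7b0: "main", 0x7b4: "loadData"}
events = [("enter", 0x7b0, 0), ("enter", 0x7b4, 10),
          ("exit", 0x7b4, 40), ("exit", 0x7b0, 50)]
for stack, usec in fold_records(events, names).items():
    print(f"{stack} {usec}")
```

Each printed line ("main;loadData 30", "main 20") is one folded stack, exactly the shape flamegraph.pl consumes.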

The code implementing this logic can be found here.

Once we have the trace data, we can process the graph:

flamegraph.pl \
    --title "onCreate" \
    --hash \
    --countname "microseconds" \
    onCreate.flat > onCreate.svg

Et voilà!

In order to make this process a lot simpler and less manual, the trace parsing and graph generation have all been rolled into a simple server that you can run. Instructions are available on Github, and there is also a public copy of the server running at https://aflame.rhye.org/ if you want to try out uploading your own traces.