Memory leaks are a class of bugs where memory is not released even after it is no longer needed. They are often explicit, and highly visible, which makes them a great candidate to begin learning debugging. Go is a language particularly well suited to identifying memory leaks because of its powerful toolchain, which ships with amazingly capable tools (pprof) which make pinpointing memory usage easy.

I’m hoping this post will illustrate how to visually identify memory, narrow it down to a specific process, correlate the process leak with work, and finally find the source of the memory leak using pprof. This post is contrived in order to allow for the simple identification of the root cause of the memory leak. The pprof overview is intentionally brief and aims to illustrate what it is capable of and isn’t an exhaustive overview of its features.

The service used to generate the data in this post is available here.

What is a memory leak?

If memory grows unbounded and never reaches a steady state then there is probably a leak. The key here is that memory grows without ever reaching a steady state, and eventually causes problems through explicit crashes or by impacting system performance.

Memory leaks can happen for any number of reasons. There can be logical leaks where data-structures grow unbounded, leaks from complexities of poor object reference handling, or just for any number of other reasons. Regardless of the source, many memory leaks elicit a visually noticeable pattern: the “sawtooth”.

Debug Process

This blog post is focused on exploring how to identify and pinpoint root cause for a go memory leak. We’ll be focusing primarily on characteristics of memory leaks, how to identify them, and how to determine their root cause using go. Because of this our actual debug process will be relatively superficial and informal.

The goal of our analysis is to progressively narrow scope of the problem by whittling away possibilities until we have enough information to form and propose a hypothesis. After we have enough data and reasonable scope of the cause, we should form a hypothesis and try to invalidate it with data.

Each step will try to either pinpoint a cause of an issue or invalidate a non-cause. Along the way we’ll be forming a series of hypotheses, they will be necessarily general at first then progressively more specific. This is loosely based on the scientific method. Brandon Gregg does an amazing job of covering different methodologies for system investigation (primarily focused on performance).

Just to reiterate we’ll try to:

Ask a question

Form a Hypothesis

Analyze the hypothesis

Repeat until the root cause is found

Identification

How do we even know if there is a problem (ie memory leak)? Explicit errors are direct indicators of an issue. Common errors for memory leaks are: OOM errors or explicit system crashes.

OOM errors

Errors are the most explicit indicator of a problem. While user/application generated errors have the potential to generate false positives if the logic is off, an OOM error is the OS literally indicating something is using too much memory. For the error listed below this manifests as cgroup limits being reached and the container being killed.

dmesg

Question: Is the error a regular repeating issue?

Hypothesis: OOM errors are significant enough that they should rarely occur. There is a memory leak in one of the processes.

Prediction: Either the Process memory limit has been set too low and there was a uncharacteristic bump or there is a larger issue.

Test: Upon further inspection there are quite a few OOM errors suggesting this is a serious issue and not a one off. Check the system memory for historic view into memory usage.

System Memory

The next step after identifying a potential problem is to get an idea of system wide memory usage. Memory Leaks frequently display a “sawtooth” pattern. The spikes correspond to the application running while the dips correspond to a service restart.

Sawtooth characterizes a memory leak especially corresponding with a service deploy. I’m using a test project to illustrate memory leaks but even a slow leak would look like sawtooth if the range is zoomed out far enough. With a smaller time range it would look like a gradual rise and then a drop off on process restart.

The graph above shows the example of a sawtooth memory growth. Memory continually grows without flatlining. This is a smoking gun for memory issues.

Question: Which process (or processes) is (are) responsible for the memory growth?

Test: Analyze per process memory. There could be information in the dmesg logs to indicate a process or class of processes that are the targets of OOM.

Per Process Memory

Once a memory leak is suspected the next step is to identify a process that is contributing, or causing the system memory growth. Having per process historical memory metrics is a crucial requirement (container based system resources are available through the a tool like cAdvisor). Go’s prometheus client provides per process memory metrics by default, which is where the graph below gets its data.

The below graph shows a process which is very similar to the system sawtooth memory leak graph above: continual growth until process restarts.

Memory is a critical resource and can be used to indicate abnormal resource usage or could be used as a dimension for scaling. Additionally, having memory stats help inform how to set container based (cgroups) memory limits. The specifics of the graph values above can be found metric code source. After the process has been identified it’s time to dig in and find out which specific part of the code is responsible for this memory growth.

Root Cause Analysis / Source Analysis

Go Memory Analysis

Once again prometheus gives us detailed information into the go runtime, and what our process is doing. The chart shows that bytes are continually allocated to the heap until a restart. Each dip corresponds with the service process restart.

Question: Which part(s) of the application is(are) leaking memory?

Hypothesis: There’s a memory leak in a routine which is continually allocating memory to the heap (global variable or pointer, potentially visible through escape analysis)

Test: Correlate the memory usage with an event.

Correlation With Work

Establishing a correlation will help to partition the problem space by answering: is this happening online (in relation to transactions) or in the background?

One way to determine this could be to start the service and let it idle without applying any transactional load. Is the service leaking? If so it could be the framework or a shared library. Our example happens to have a strong correlation with transactional workload.

HTTP Request Count

The above graph show the count of HTTP requests. These directly match the system memory growth and time and establish diving into HTTP request handling as a good place to start.

Question: Which part of the application are responsible for the heap allocations?

Hypothesis: There is an HTTP handler that is continually allocating memory to the heap.

Test: Periodically analyze heap allocations during program running in order to track memory growth.

Go Memory Allocations

In order to inspect how much memory is being allocated and the source of those allocations we’ll use pprof. pprof is an absolutely amazing tool and one of the main reasons that I personally use go. In order to use it we’ll have to first enable it, and then take some snapshots. If you’re already using http, enabling it is literally as easy as:

import _ "net/http/pprof"

Once pprof is enabled we’ll periodically take heap snapshots throughout the life of process memory growth. Taking a heap snapshot is just as trivial:

curl http://localhost:8080/debug/pprof/heap > heap.0.pprof

sleep 30

curl http://localhost:8080/debug/pprof/heap > heap.1.pprof

sleep 30

curl http://localhost:8080/debug/pprof/heap > heap.2.pprof

sleep 30

curl http://localhost:8080/debug/pprof/heap > heap.3.pprof

The goal is to get an idea of how memory is growing throughout the life of the program. Let’s inspect the most recent heap snapshot:

This is absolutely amazing. pprof defaults to Type: inuse_space which displays all the objects that are currently in memory at the time of the snapshot. We can see here that bytes.Repeat is directly responsible for 98.60% of all of our memory!!!

The line below the bytes.Repeat entry shows:

1.28MB 0.31% 98.91% 410.25MB 98.91% main.(*RequestTracker).Track

This is really interesting, it shows that Track itself has 1.28MB or 0.31% but is responsible for 98.91% of all in use memory!!!!!!!!!!!!! Further more we can see that http has even less memory in use but is responsible for even more than Track (since Track is called from it).

pprof exposes many ways to introspect and visualize memory (in use memory size, in use number of objects, allocated memory size, allocated memory objects), it allows listing the track method and showing how much each line is responsible for:

This directly pinpoints the culprit:

1.28MB 410.25MB 24: rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))

pprof also allows visual generation of the textual information above:

(pprof) svg

Generating report in profile003.svg

This clearly shows the current objects occupying the process memory. Now that we have the culprit Track we can verify that it is allocating memory to a global without ever cleaning it up, and fix the root issue.

Resolution: Memory was being continually allocated to a global variable on each HTTP request.