There is also a version of the post in Russian.

pprof is the main tool for profiling Go applications. It’s included in the Go toolchain, and over the years many handy articles have been written about it.

Enabling the pprof profiler for an existing Go application is fairly simple. All we need is to add a single import line:

import _ "net/http/pprof"

The profiler will register its handlers on the default HTTP request multiplexer (net/http.DefaultServeMux), so a server that uses the default mux serves the profiling results under the “/debug/pprof” route. That’s it. One curl command and we have, for example, the results of CPU profiling:

curl -o cpu.prof "http://<server-addr>/debug/pprof/profile"

If you’re not familiar with pprof and performance analysis in Go, I encourage you to look at the “Debugging performance issues in Go programs” wiki page and the documentation for the net/http/pprof package.

Sure, enabling pprof seems trivial. But in practice, there are lots of hidden details we should take into consideration when profiling production code.

Let’s start with the fact that we absolutely don’t want to expose “/debug” routes to the internet. Profiling with pprof doesn’t add much overhead, but being cheap doesn’t mean it’s free. A malicious actor can send long-running profiling requests, degrading the overall application performance. Moreover, profiling results expose details about the application’s internals, which we never want to show to strangers. We must make sure that only authorised requests can reach the profiler. We can restrict access with a reverse proxy that runs in front of the application, or we can move the pprof server out of the main server to a dedicated TCP port with different access procedures. Either way, there are ways of doing it.

But what if the business logic of the application doesn’t imply any HTTP interactions? For example, when we build an offline worker.

Depending on the state of the infrastructure in the company, an “unexpected” HTTP server inside the application’s process can raise questions from your systems operations department ;) The server also limits how we can scale the service: processes that we could otherwise clone on the same host to scale the application up would conflict when trying to open the same TCP port for the pprof server.

This is another “easy to fix” issue. We can use different ports per instance, or wrap the application into a container. Nowadays, no one will be surprised by an application that runs on a fleet of servers spread across multiple data centres. And in a very dynamic infrastructure, application instances can come and go, reflecting the workload in real time. But we still need to access the pprof server inside the application. This means such a dynamic system has to provide extra service discovery mechanisms to allow a developer to find a particular instance (and its pprof server) to get the profiling data.

At the same time, depending on the nature of the company, the very ability to access something inside a production application that doesn’t directly relate to the application’s business logic can raise questions from the security department ;)) I used to work for a company with very strict internal security regulations. The only people who had access to production instances were those in production systems operations. The only way a developer could get profiling data was to open a ticket in the Ops bug tracker, describing “what command and on which cluster should be run”, “what results should be expected”, and “what should be done with the results”. Needless to say, the motivation to do production performance analysis was pretty low.

There is another common situation that a developer can stumble over. Imagine this: you open Slack in the morning and find that last night “something bad happened” to an app in production: maybe a deadlock, a memory leak, or a runtime panic. Nobody on call had the time or energy to delve deep into the problem, so they restarted the application, or rolled back the last release, leaving the rest to the morning.

Investigating such cases is a tough task. It’s great if one can reproduce the issue in the testing environment, or in an isolated part of production, where we have all the tools to get all the data we need. We can take our time delving through the collected data, figuring out what component has been causing problems.

From personal experience, understanding and reproducing the problem in testing is usually a challenge in itself, as in practice, the only artefacts that we are left with are metrics and logs. Wouldn’t it be great if we could go back in time to the point when the issue happened in production and collect all runtime profiles? Unfortunately, to my knowledge, we can’t do that.

But, because we know that profiling with pprof is computationally cheap, what if we, knowing all the possible pitfalls, periodically collected the profiling data and stored the results somewhere outside of production?

In 2010, Google published a paper titled “Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers”, which described their approach to continuous profiling. Several years later, Google also released Stackdriver Profiler, a continuous profiling service available to everyone.

The way Stackdriver works is fairly simple: instead of a pprof server, a developer includes a “stackdriver-agent” in the application, which (using the pprof API under the hood) periodically runs the profiler and sends the results to the cloud. All the developer needs to do is go to Stackdriver’s UI, choose the instance of the application and an availability zone, and then analyse the application’s performance at any point in the past.

Other SaaS companies provide similar functionality. But for various reasons, not everyone can or wants to send internal data to a cloud outside of the company’s own infrastructure ;)))

Nothing described above is new or specific to Go. Developers in most companies where I have worked faced similar obstacles in one form or another.