A few weeks ago, I discovered zap: a blazing fast, structured, leveled logger in Go. The provided benchmarks of zap are quite impressive compared to Logrus - what we use in Snap - and even to the standard library. My first thought was: What if we use it with Snap?

After a quick check, I figured out that switching from Logrus to zap would mean changing 300+ calls to Logrus in Snap. So my second question quickly became: would it be worth using zap in Snap?

Going into the logs.

The right tool: profiling

One option to gauge its worth was to benchmark zap inside Snap. However, such benchmarks would be arbitrary; in fact, the zap authors had already done that work, demonstrating zap's speed compared to Logrus. What I wanted was a global overview of CPU usage during a real use case of Snap, which benchmarking wouldn't give me.

Then I discovered Go’s profiling tools, which help you identify various types of hotspots (CPU, IO, memory), hotspots being the places that you need to concentrate on in order to significantly improve performance. Go’s existing profiling tools are the perfect means to determine the value of switching to zap in Snap for several reasons:

They're easy to implement

They visualize every subroutine called

They work on real use cases

I encourage you to take a look at the How-to Profile documentation in the Snap repository. If you want to know more about Go's profiling tools in general, I highly recommend Debugging performance issues in Go programs.

Configuration used for the tests

In a nutshell, here are the steps I took to profile Snap:

1. Implement a small code change in snapd.go to make it generate a profile when running
2. Build snapd and run it with the task below
3. Kill snapd, which generates the profile
4. Generate a view from the profile

To get something accurate, I decided to test snapd over a 24-hour run. So I ran two 24-hour tests on Snap: the first in debug mode and the second in info mode. To understand the difference between these two modes, here are the five logging levels:

Fatal — for critical errors (highest severity)

Error — for regular errors

Warning — for warning messages

Info — for informational messages

Debug — for debugging messages (lowest severity)

So basically, running Snap in debug mode generates far more logs than running it in info mode. To generate plenty of logs to work with, I chose a task that collects, processes, and publishes 48 metrics from the CPU collector plugin: the more we collect, the more we log.

{
    "version": 1,
    "schedule": {
        "type": "simple",
        "interval": "1s"
    },
    "workflow": {
        "collect": {
            "metrics": {
                "/intel/procfs/cpu/all/*": {},
                "/intel/procfs/cpu/0/*": {}
            },
            "process": [
                {
                    "plugin_name": "passthru",
                    "process": null,
                    "publish": [
                        {
                            "plugin_name": "file",
                            "config": {
                                "file": "/tmp/published_cpu.log"
                            }
                        }
                    ],
                    "config": null
                }
            ],
            "publish": null
        }
    }
}

Snap performance on a 24-hour run

In my research on profiling tools, I discovered go-torch (thanks to Matthew Brender). I used Go's profiling tools to generate the profiles, while go-torch provided an understandable view of those profiles with Flame Graphs.

These Flame Graphs show the percentage of time every subroutine call takes in Snap. Each layer calls the layer above it: the further up you go, the closer you are to the syscall functions, and vice versa.

Debug mode — full svg image: http://imgh.us/torch-debug.svg

The highlighted purple functions are the result of the Search option in the top right of the SVG file (click the link below the image to open the full SVG). In this example I searched for "logrus"; the 15.3% figure indicates the share of the Snap daemon's CPU time spent in Logrus.

In other words, if Snap used 1% of the machine's CPU on average, Logrus accounted for 15.3% of that 1%.

Info mode — full svg image: http://imgh.us/torch-info.svg

Surprisingly, Logrus used only 2.3% of Snap CPU resources in the info mode test.

Analysis

These tests were run on a single machine, but when we think about it at scale, the difference between 2.3% and 15.3% is huge. Let’s answer the question “Is it worth using zap in Snap?” from 2 points of view:

Info mode: the current performance of Logrus is OK. Implementing zap is definitely not a priority.

Debug mode: using Logrus is a huge loss of performance. We should think about changing it to zap.

The new question is now: in which mode do people run Snap in production?

I hope this blog post will create a discussion about it. I encourage people to run their own profiling tests and join the discussion on intelsdi-x Slack.