Logging is one of the most powerful tools we have as developers. It’s no accident that when things go wrong in production, one of a developer’s first questions is often: “can you send me the logs?”. Raw logs contain useful information, but they can be hard to parse. So, when operating systems at scale, structured logging can greatly increase the usefulness of your logs. A common structure makes the logs easier to search, and also makes automated processing of them much easier.

At Giant Swarm we use structured logging throughout our control plane to manage Kubernetes clusters for our customers. We use the EFK stack to do this, which consists of Elasticsearch, Fluent Bit and Kibana. The EFK stack is based on the widely used ELK stack which uses Logstash instead of Fluent Bit or Fluentd.

This post explains some of the best practices we follow for structuring our logs, and how we use the EFK stack to manage them. Coming soon we’ll also be providing managed logging infrastructure to our customers as part of the Managed Cloud Native Stack.

How we write logs

Our control plane consists of multiple microservices and Kubernetes operators. As a reminder, an operator in Kubernetes is a custom controller, paired with a CRD (Custom Resource Definition), that extends the Kubernetes API.

We develop our microservices with our microkit framework, which is based on Go-Kit, and our operators with our operatorkit framework. Both frameworks use our micrologger library. Since all our logs flow through a single library, we can enrich the logs with extra data, and the structure of our logs stays very consistent.

For example, we can add data like a tenant cluster’s ID to the Golang context we pass inside our operator code. We use this to create a self-link to the CR (custom resource) that the operator is processing. This is the same approach as the self-links exposed by the Kubernetes API and makes the logs easier to read.

What we log

We use a JSON format for our logs, which makes them easier for Fluent Bit to process. It also means the data is more structured when it’s stored in Elasticsearch. We use a fairly standard format with the log level (e.g. debug or error) and the log message. For errors, we add a stack entry with the full call stack.

We use the frameworks described earlier to enrich the log messages with extra information, such as the timestamp, self-link or the event the operator is processing (e.g create or update).
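Putting this together, a single log line from one of our operators might look something like the following. The field names here are illustrative rather than our exact schema:

```json
{
  "time": "2019-03-28T10:15:04.123Z",
  "level": "debug",
  "message": "creating tenant cluster",
  "event": "create",
  "selfLink": "/apis/cluster.giantswarm.io/v1alpha1/namespaces/default/awsclusters/al9qy"
}
```

For an error-level entry, a stack field with the full call stack would be included alongside the fields above.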

Elasticsearch

Elasticsearch is at the heart of the EFK stack. It’s a NoSQL database based on the Lucene search engine. Its origin as a search engine also makes it good at querying log data. It can ingest large volumes of data, store it efficiently and execute queries quickly. In the EFK stack, Elasticsearch is used for log storage, and receives log data from Fluent Bit, the log shipper. The log data is stored in an Elasticsearch index and is queried by Kibana.

As you’d expect we deploy Elasticsearch using Kubernetes. Each control plane we manage for our customers has its own deployment of Elasticsearch. This isolates it from all other control planes. On AWS and Azure, we use cloud storage with Persistent Volumes for storing the index data. In on-premise control planes, the data is stored on physical storage.

Fluent Bit

Logging is an area of Cloud Native applications where there are many options. We currently use Fluent Bit but we have previously evaluated many other options, including Logstash (the L in the very popular ELK stack), and Filebeat which is a lightweight log shipper from Elastic.co.

We initially ruled out Logstash and Filebeat, as the integration with Kubernetes metadata was not very advanced. So we started our implementation using Fluentd. Fluentd is a log shipper that has many plugins. It provides a unified logging layer that forwards data to Elasticsearch. It’s also a CNCF project and is known for its Kubernetes and Docker integrations which are both important to us.

Fluentd uses Ruby and Ruby Gems for configuration of its over 500 plugins. Since Ruby is an interpreted language, Fluentd makes heavy use of C extensions for parsing log files and forwarding data, to provide the necessary speed. However, due to the volume of logs we ingest, we hit performance problems, and so we evaluated the related Fluent Bit project.

Fluent Bit is implemented solely in C and has a restricted set of functionality compared to Fluentd. However, in our case it provides all the functionality we need, and we are much happier with the performance. We deploy Fluent Bit as a DaemonSet to all nodes in our control plane clusters. The Fluent Bit pods on each node mount the host’s Docker logs, which gives us access to logs for all containers across the cluster.
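A trimmed-down DaemonSet manifest for this pattern might look like the one below. This is a sketch rather than our production config: the namespace, labels, image tag, and the exact set of mounts are assumptions, but the key idea is the hostPath volumes that expose the node's log directories to the Fluent Bit pod.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:1.0
          volumeMounts:
            # Kubernetes symlinks container logs under /var/log.
            - name: varlog
              mountPath: /var/log
            # The symlinks resolve to Docker's own log files on the host.
            - name: docker-containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: docker-containers
          hostPath:
            path: /var/lib/docker/containers
```

Because a DaemonSet schedules one pod per node, adding a node to the cluster automatically brings its container logs into the pipeline.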

A shout out here to Eduardo Silva who is one of the Fluent Bit maintainers, and helped us a lot with answering questions while we worked on the integration.

Log Retention — Curator

Logging is great, but it can quickly use up a lot of disk space, so having a good log retention policy is essential. Fluent Bit helps here because it creates daily indices in Elasticsearch. A daily cron job in Kubernetes calls our curator component, which deletes indices older than n days.

There is a Curator component from Elastic.co, but we use our own simpler version that meets our requirements. It’s available on GitHub as giantswarm/curator. Deleting indices is disk I/O-intensive, so another trick we use is to run the cron job at an unusual time like 02:35 rather than at 02:00, which avoids conflicting with other scheduled tasks.

Kibana

Kibana provides the UI for the stack, with the front end and query engine for querying the logs in Elasticsearch. Kibana supports the Lucene query syntax as well as Elasticsearch’s JSON-based Query DSL. Another nice feature is the built-in support for visualizations for use in dashboards.
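The structured fields are what make these queries pleasant to write. For example, a Lucene query like the following finds error-level entries from a single namespace; the exact field names depend on your log format and Fluent Bit's Kubernetes metadata settings:

```
level:error AND kubernetes.namespace_name:giantswarm
```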

One challenge we faced was how to configure Kibana. We run an instance of Kibana in each control plane and we want them all to be kept in sync with the same base configuration. This includes setting the index pattern in Kibana for which Elasticsearch indices it should search.

We’re using Kibana 6, and until recently it didn’t have a documented configuration API. So instead we wrote a simple sidecar, giantswarm/kibana-sidecar, that sets the configuration. This is working pretty well, but needs to be adapted for each version of Kibana. Elastic.co has recently published documentation for the Saved Objects API for configuration, so we may move to this in future.
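For illustration, creating an index pattern through the Saved Objects API looks roughly like the request below. The index pattern name, time field, and Kibana address are assumptions, and this is a sketch of the API's shape rather than how our sidecar is implemented:

```
curl -X POST "http://kibana:5601/api/saved_objects/index-pattern/fluentbit" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{"attributes": {"title": "fluentbit-*", "timeFieldName": "@timestamp"}}'
```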

Conclusion

In this post we’ve shown you how we use structured logging and the EFK stack to manage Kubernetes clusters for our customers. There are many options for logging when building Cloud Native applications. We’ve evaluated several options and found a set of tools that work well for us.

However, one of the benefits of the EFK and ELK stacks is that they are very flexible. So if you want to use a different tool for log storage, like Kafka, you just configure Fluent Bit to ship to Kafka instead. This also works for third-party log storage providers, like Datadog and Splunk. You can also use a different log shipper, like Filebeat or Logstash, if it better suits your needs.

Coming soon we’ll also be offering a Managed Logging infrastructure to our customers as part of the Managed Cloud Native Stack. This will let our customers take advantage of the rich functionality provided by the EFK stack. However, it will be fully managed by us, using our operational knowledge of running the EFK stack in production. Request your free trial of the Giant Swarm Infrastructure here.

Written by Ross Fairbanks — Platform Engineer @ Giant Swarm