Last Updated on January 23, 2019


Monitoring, or to use the fancier term, ‘observability’, is a frequently misunderstood topic. I know this because I see forum posts where person A asks a question and persons B, C and D reply with some random tool they have used.

The landscape of monitoring tools is immense, so a comparison blog that covered everything would quickly get tedious.

What I’ll do instead is explain the three types of monitoring that you need, then suggest the best tool to use in each case. As you may have guessed from the title, we’ll cover time series, logging and tracing.

Why these three?

We divide along these boundaries because of the data collected. Each type of data must be gathered, processed and stored, and later queried. Each of the three data types uses a different pipeline and database schema.

Time series data is much more lightweight and requires less processing in the ingestion pipeline. This makes it the perfect tool for real-time monitoring of high-cardinality metrics. This mostly translates to watching graphs wiggle about.

Log events are much heavier but contain more information about the event that took place. Logging is therefore much better for historic searches and forensics: who did what, and when.
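The difference in weight is easiest to see in the data itself. Here is a sketch of the two shapes as plain Python dicts; every field name below is illustrative, not any tool’s actual schema:

```python
# A time series sample: a metric name, a few labels and one number.
# Cheap to ingest, cheap to store, easy to graph.
ts_sample = {
    "metric": "http_requests_total",
    "labels": {"method": "GET", "status": "200"},
    "timestamp": 1548200000,
    "value": 1027.0,
}

# A log event: heavier and free-form, but it preserves the context
# you need later for forensics (who did what, and when).
log_event = {
    "timestamp": "2019-01-23T10:13:20Z",
    "level": "error",
    "service": "checkout",
    "user_id": "u-4821",
    "message": "payment declined for order o-99213",
}
```

The sample compresses down to a couple of bytes; the free-form event cannot, which is exactly the storage trade-off examined further down.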

Tracing data is like log data, except that it conforms to a specification and, in practice, originates from a microservice transaction.

This produces an end-to-end picture of what happened across multiple services. In a microservice environment this lets you quickly pinpoint exactly where a problem lies. You can then focus on drilling into that one faulty service.

Now the funny thing is that you can fudge each of these to do the job of the others. I see people using logging systems to monitor data that is inherently time series based.

Similarly, you can construct a crude tracing system in your log monitoring tool by appending a transaction ID to each log line. This all works at small scale but quickly becomes a stupid idea past a certain volume.
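As a sketch of that transaction ID trick, here is one way to stamp an ID onto every log line with Python’s standard logging module; the `TxnFilter` name and the `txn=` line format are made up for illustration:

```python
import logging
import uuid

class TxnFilter(logging.Filter):
    """Stamp every log record with the current transaction's ID."""
    def __init__(self, txn_id):
        super().__init__()
        self.txn_id = txn_id

    def filter(self, record):
        record.txn_id = self.txn_id
        return True  # keep the record

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("txn=%(txn_id)s %(name)s: %(message)s"))
logger.addHandler(handler)
logger.addFilter(TxnFilter(uuid.uuid4().hex))

logger.info("order received")
logger.info("payment authorised")
# Grepping the aggregated logs for one txn= value reconstructs a crude trace.
```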

Sensible people will use the right tool for the job. And thus I recommend that you use three separate tools.

As a further example of why you need to pick the right tool for the job, let’s examine the storage cost of a single data point on disk. As the table below shows, Prometheus stores only time series metrics and, by using Gorilla compression, achieves around 1.3 bytes per data point. Event data stored in Elasticsearch takes up a comparatively massive 22 bytes per data point.
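A quick back-of-envelope calculation shows why those per-point numbers matter. The 1.3 and 22 bytes per point come from the table; the fleet size and scrape interval below are assumptions picked purely for illustration:

```python
# 100,000 series scraped every 15 seconds, retained for 30 days.
series = 100_000
samples = series * (30 * 24 * 3600 // 15)  # one sample per series per 15s

prometheus_gib = samples * 1.3 / 2**30     # Gorilla-compressed samples
elasticsearch_gib = samples * 22 / 2**30   # the same points stored as events

print(f"{samples:,} samples")
print(f"Prometheus:    ~{prometheus_gib:.1f} GiB")
print(f"Elasticsearch: ~{elasticsearch_gib:.1f} GiB")
```

At this fairly modest scale, keeping the same data as events costs roughly seventeen times more disk than keeping it as compressed samples.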

As with all of my spreadsheets you can view the full table here.

It’s not just storage costs that are affected. There are massive trade-offs with scaling writes and query performance.

Time Series

There are two types of time series monitoring tools. The first is the legacy sort (Nagios, Graphite and Sensu), which uses single-dimensional data. On Kubernetes it makes absolutely no sense to even evaluate these. Their data model makes them factually and objectively worse than the other options.

Edit: Somebody asked for evidence of this and I was shocked, as this should really be common sense. Here’s a thread for those unsure of why multi-dimensional metrics are better in every way.

So this leaves the second type: tools that support multi-dimensional time series data. This category is absolutely dominated by Prometheus. Prometheus will monitor your servers, Kubernetes itself and your applications. It will provide you with pretty real-time graphs that you can view in Grafana. You can also create complex queries and use them in Alertmanager.
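To make “multi-dimensional” concrete, this is roughly what Prometheus scrapes from an application’s /metrics endpoint: a plain-text exposition format in which every sample carries arbitrary key-value labels. The renderer below is a hand-rolled sketch for illustration; real applications should use an official client library such as prometheus_client:

```python
def render_metric(name, help_text, samples):
    """Render one counter in Prometheus-style text exposition format.

    samples: list of (labels_dict, value) pairs. Metric and label
    names here are illustrative, not from any real application.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

body = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(body)
```

Because the dimensions live in labels rather than being mangled into the metric name, queries can slice and aggregate across any of them.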

The only real consideration is build vs buy. If you want to buy time series monitoring, then DataDog is the most popular option. It’s expensive but works well enough. There are many other SaaS options, but none with the same market share.

My opinion is that you should get up to speed on Prometheus and invest the effort to learn and install it. Kubernetes and Prometheus are joined at the hip and, based on my experience of running it in production over the past 18 months, it doesn’t take that much effort to operate.

Logging

I’m sure some people will disagree with my opinion here, but for simplicity, just standardise on JSON logging. A good way to do this on Kubernetes is to use Fluentd on your hosts to collect the logs and convert them to JSON at source (if required).
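For instance, a minimal JSON formatter needs nothing beyond Python’s standard library. The field names below (`ts`, `level`, `msg`) are a common convention rather than any standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("user logged in")
```

One JSON object per line is exactly the shape a log shipper can forward without any parsing rules.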

Where you then send these logs is really a matter of opinion. The most common destination is Elasticsearch. I’ve had success using AWS managed Elasticsearch for all of our logs. If we were on GCE we’d probably investigate Stackdriver Logging.

Graylog is another good option and comes with more security-related dashboards. If you’re considering buying a product, then Splunk is the king on-premise and Sumo Logic rules over the SaaS options. The problem with all paid logging solutions is that they cost a lot of money and really slow down at scale. You’re better off scaling your own Elasticsearch cluster and working with developers to emit useful application logs to reduce the volume.

My advice for logging on Kubernetes is to set up Fluentd to write logs to Elasticsearch (ideally managed) and then use Kibana to perform searches.
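Once the logs are JSON documents in Elasticsearch, Kibana searches compile down to structured queries rather than greps. A sketch of the kind of query DSL body involved; the field names (`level`, `service`, `@timestamp`) assume the JSON log convention above and are not mandated by Elasticsearch:

```python
import json

# "All errors from the checkout service in the last hour", as a
# bool query with filters (no scoring needed for log search).
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "error"}},
                {"term": {"service": "checkout"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "size": 50,
}

print(json.dumps(query, indent=2))
```

This is the kind of request Kibana sends to the cluster’s _search endpoint on your behalf.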

Tracing

This is the area with the fewest choices. Currently the best open source tracing option for Kubernetes is Jaeger. You can use it standalone or as part of a service mesh like Istio. There are other solutions, like Zipkin and OpenAPM, but these aren’t anywhere near the maturity of Jaeger.

The daddy of tracing on the paid end of the scale is AppDynamics. It’s expensive but makes troubleshooting distributed systems absolutely idiot proof.

It’s the sort of thing you’d need to buy if you are in a large government organisation with 100 subcontractors all creating their own dodgy microservices that constantly break. AppDynamics will shine a light directly through the system and pinpoint exactly which companies need to be fired.

Not all of us work in such places. My recommendation, if you have good development teams, is to try out Jaeger. It will be considerably cheaper and provide much of the same value, albeit with a worse interface.
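For a feel of the data model Jaeger (and tracing in general) is built on, here is a toy span structure; real services would use a Jaeger or OpenTracing client library rather than anything hand-rolled, and every name here is illustrative:

```python
import time
import uuid

class Span:
    """A toy span: one timed operation, linked into a trace by IDs."""
    def __init__(self, operation, trace_id=None, parent=None):
        self.operation = operation
        self.trace_id = trace_id or uuid.uuid4().hex  # shared across services
        self.span_id = uuid.uuid4().hex
        self.parent_id = parent.span_id if parent else None
        self.start = time.time()
        self.duration = None

    def finish(self):
        self.duration = time.time() - self.start

# One request crossing two "services": the child span inherits the trace
# ID, which is what lets a UI stitch together the end-to-end picture.
root = Span("gateway: POST /orders")
child = Span("payments: charge_card", trace_id=root.trace_id, parent=root)
child.finish()
root.finish()
```

The trace ID is propagated between services in request headers; everything else (timings, parent links) is reported out-of-band to the tracing backend.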

Summary

I’m expecting the same random nonsense replies to this blog proposing tool X, Y or Z that I didn’t mention. The important thing to take away is that you need to cover time series, logging and tracing if you want full observability into your Kubernetes clusters and applications.

If you’ve been one of the people advocating to use only Splunk for everything then perhaps now is a time for quiet introspection.

I don’t really care what you end up choosing. Personally, I always go with Prometheus with Grafana, Elasticsearch with Kibana, and Jaeger.

There are tools that cost a lot of money and claim to do it all. This will be true to varying degrees. Again, it’s not really the point. Compare and select monitoring tools that best solve each of the three areas mentioned. If you’re at a tiny company then perhaps one tool will fit for a while. The vast majority should heed my warning and use separate tools.

If you would like to explore what Helm charts are available for Kubernetes for each of these categories you can browse them here.

Edit: I did get one reply from jpkroehling and learnt something from it. This blog extends my simplistic overview with an explanation of the overlaps between these types.