No, it doesn’t delete half your metrics.

Standalone Prometheus is pretty great: it provides a powerful query language along with a simple, unified way of collecting and exposing metrics. Making Prometheus highly available and scalable, however, can often be a bit of a challenge.

The key features we needed were:

Highly available Prometheus

Single place to query all of your metrics

Easily back up and archive data

This is where Improbable’s Thanos comes in.

Making Prometheus HA

Thanos at its most basic allows you to query multiple Prometheus instances at once and has the ability to deduplicate the same metric from multiple instances. This allows you to run multiple replicas of the same Prometheus setup without worrying about metric duplication.
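Deduplication hinges on each replica tagging its metrics with an external label. A minimal sketch, assuming Prometheus's `external_labels` config and a replica label named `replica` (the label names and values here are illustrative):

```yaml
# prometheus.yml on each replica of the same Prometheus setup.
# The configs are identical except for the value of the "replica" label.
global:
  external_labels:
    cluster: red        # illustrative: which cluster this Prometheus lives in
    replica: replica-0  # replica-1 on the second instance, and so on
```

Thanos Query is then told which label distinguishes replicas (via its `--query.replica-label` flag), so series that differ only in that label are merged into one at query time.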

One of Thanos’ components is a sidecar that runs alongside each Prometheus container; together these form a cluster. Instead of querying the Prometheis directly (this is the official plural, according to Prometheus), you query the Thanos Query component. The picture below helps to understand the relationship between Prometheus and Thanos.
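Wiring the sidecar and Query components together looks roughly like this. A sketch using the Thanos CLI; the addresses and hostnames are illustrative:

```shell
# Sidecar runs next to each Prometheus, exposing its TSDB over gRPC.
thanos sidecar \
  --tsdb.path=/var/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901

# Thanos Query fans out to every sidecar and deduplicates replicas
# using the external label that distinguishes them.
thanos query \
  --store=prometheus-0-sidecar:10901 \
  --store=prometheus-1-sidecar:10901 \
  --query.replica-label=replica
```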

This alone is great, as it allows you to easily build an HA Prometheus setup, but there’s even more you can do with these building blocks.

A single place to view metrics

Having too many places to look for metrics is not very efficient

The next problem we looked at was how to get all your metrics into one place.

We run multiple Kubernetes clusters, each with its own Prometheus. Historically we aggregated the metrics by having a special Prometheus that scraped the federate endpoint of each cluster’s Prometheus. This did the job, but it was wasteful, since we were simply duplicating all of our metrics, and this special Prometheus was a single point of failure.

A Thanos Query node can use another Query node as a source of data; if we expose the gRPC endpoint of our Thanos Query nodes in each cluster we can create a Thanos Query that aggregates them together by using them as stores. The picture below helps to illustrate this.
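Because a Query node speaks the same gRPC store API as a sidecar, the aggregating layer is just another `thanos query` pointed at the per-cluster Query nodes. A sketch with illustrative hostnames:

```shell
# A global Thanos Query that treats each cluster's Query node as a
# store, via the gRPC endpoints those clusters expose.
thanos query \
  --store=thanos-query.red.example.com:10901 \
  --store=thanos-query.black.example.com:10901 \
  --store=thanos-query.blue.example.com:10901
```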

This allows us to go to one Thanos Query and get all our metrics across all our clusters. In the following screenshot I’m querying the number of replicas for our fluentd Daemonset across our three clusters (red, black, blue) with just one query.
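A query along those lines might look like the following. This is a hypothetical PromQL example; the metric and label names depend on your setup (this one assumes kube-state-metrics is installed):

```promql
# One query, answered with a series per cluster thanks to the
# "cluster" external label added by each Prometheus.
kube_daemonset_status_number_ready{daemonset="fluentd"}
```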

However, there is still a problem with this setup: we have one special Thanos Query in a special cluster, and if it goes down it takes out our single metrics view. Instead, we’d like to run multiple Thanos Query nodes, one in each cluster, for users to query against via some kind of load balancing. Using Yggdrasil, our AWS multi-cluster load-balancing tool, we can spread this out across multiple Kubernetes clusters. The result is that users can perform queries against any cluster and receive all metrics.

When you put this all together it looks something like this:

Note that each Thanos Query layer will be a ReplicaSet of Thanos Query nodes for added resilience.

This gives us an incredibly resilient Prometheus setup, spread out across multiple clusters with multiple replicas, giving us several layers of redundancy. You have one place to go if you want to see your Prometheus metrics, with no need to worry about which cluster or namespace your application is running in.

Storage

Another common problem with Prometheus is backing up and retaining all your metrics; keeping long-term data on your Prometheus instances is often expensive in terms of storage and impacts query performance. Thanos solves this by having the sidecar continuously back up your data to a cloud storage provider such as S3, then exposing that data via a Store node. The Store node acts much like another Prometheus instance in your Thanos cluster, but with all of its data coming from the S3 bucket.
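The sidecar and Store node share an object-storage config. A minimal sketch, assuming Thanos's objstore config format; the bucket name and endpoint are illustrative:

```yaml
# bucket.yml: passed to both the sidecar (which uploads TSDB blocks)
# and the store node (which serves them back to Thanos Query).
type: S3
config:
  bucket: my-thanos-metrics
  endpoint: s3.eu-west-1.amazonaws.com
```

The sidecar is given this file via `--objstore.config-file` to enable uploads, and `thanos store --objstore.config-file=bucket.yml` exposes the bucket to Thanos Query as just another store.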

The Store node also gives you some nice resilience in the event of a cluster outage: if we lose an entire cluster, we can no longer query its most recent data, as those Prometheis are gone, but we can still query a Store node in another cluster, which has access to the historical data in S3.

Further Work

One of the things we’ve wanted to do for a while is split our Prometheis up per team or namespace, so that individual Prometheus instances don’t grow too large, and so that one team generating a very large volume of metrics can’t take out Prometheus for everyone else. This was always considered too much effort, since a separate endpoint for each team or namespace would have been a lot of overhead; with Thanos, we can simply add the team-based Prometheis to our Thanos cluster and keep the same single source for metrics. So we’d like to switch to having many small Prometheis, using the Prometheus Operator to make them simple to create.
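A per-team Prometheus could be declared with the Prometheus Operator roughly as follows. This is a sketch of the operator's Prometheus custom resource; the names, selector, and version are illustrative, and the `thanos` field asks the operator to inject the Thanos sidecar:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: team-red          # hypothetical team name
  namespace: team-red
spec:
  replicas: 2             # HA pair, deduplicated by Thanos Query
  serviceMonitorSelector:
    matchLabels:
      team: red           # scrape only this team's ServiceMonitors
  thanos:
    version: v0.7.0       # illustrative sidecar version
```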