At Improbable, we developed the widely-used, open source metrics project Thanos to enable reliable monitoring at a global scale. Here’s how we use it ourselves.

Dominic Green and Bartłomiej Płotka are Software Engineers working on Improbable's internal observability and analytics team.

Our SpatialOS platform runs infrastructure that supports hundreds of massive online games and simulations. That requires similarly scaled observability infrastructure to enable reliable monitoring of dozens of clusters around the globe.

Four years ago, when building this metric system, we were very early adopters of the Prometheus project. A year ago, however, we announced that we had outgrown our old federated Prometheus set-up and needed something that scaled globally - and so our OSS project Thanos was born.

Essentially, we created Thanos to meet a number of requirements:

Global querying of metrics

Long term retention of metrics

High availability of Prometheus instances

In this post, we share how we decided to architect and run the metrics pipeline within our platform using Thanos. We start with a top-down overview of the system, then go into more detail on each element, explaining some major design decisions along the way.

Whilst we will show the way we have deployed Thanos at Improbable, the resulting architecture is driven by our specific requirements - for example, we needed cloud-agnostic Kubernetes clusters deployed over multiple cloud providers (GCP, AWS). With this in mind, the Thanos project was designed from early on to allow many different deployment models. Given that, you should take this blog post as a reference architecture that can guide your decisions - not as a strict rule.

Moreover, we want to encourage other users and companies that are already running Thanos in production to share their own architecture for further reference! We, the other Thanos maintainers and the Thanos community would love to hear the details, so we can link that information in the project’s repository. Feel free to raise a Pull Request against the Thanos project with a link to your blog post, talk video or even a short high-level view of your project architecture!

Reference architecture

The above diagram presents the overall, high-level architecture of the Thanos deployment that serves as a centralized, global metric system for all internal engineers.

In our architecture, from Thanos’s perspective, we can distinguish two types of clusters: the “Observability” cluster and dozens of “Client” clusters. The key part is that those Client clusters do not need to be co-located geographically with the Observability one. Some of them are in the US, some in the EU, and some in other parts of the world. They don’t need to be running in the same cloud provider or using a homogeneous orchestration system. (However, we orchestrate most of the workloads using Kubernetes.)

The right-hand side, the client cluster side, is any cluster that is “monitored” using our Thanos deployment. It’s important to note that our observability team does not “own” those. We only provide software and templated infrastructure definitions for a user to deploy in order to join a Client cluster to the centralized Thanos system. Once the cluster owners have deployed those, the clusters expose all their metrics to the Observability cluster via gRPC and optionally by uploading them to object storage. (More details are covered in the “Metric Collection” section below.)

By contrast, the Observability cluster acts as a central entry point for our internal engineers. This is where the dashboards, alert routing and ad-hoc queries are run. Thanks to Thanos, our engineers can perform “global view” queries; that is, PromQL queries that require data from more than one Prometheus server. (You can find out more about this in the “Querying” section below.)

To connect clusters that can be running in different regions and zones, we use Envoy proxy to propagate traffic. When designing Thanos, we did not want to use VPN connections between clusters - and especially not between different cloud providers - to join our monitoring systems due to increased complexity and overhead. Envoy allows a reliable, controlled and very simple-to-configure connection.

It’s worth mentioning that this architecture is deployed in testing, staging, and production as separate systems, with everything isolated between environments, including permissions and storage. This lets us thoroughly test and iterate on changes to the architecture before we promote them to the next environment, giving us confidence in the set-up under increasing workload before we hit production.

Metric collection

Most of the metrics data is collected on remote clusters (labelled as EU and US clusters on the high-level diagram) via multiple replicas of Prometheus with a Thanos sidecar. Those clusters are called “client” clusters as our Thanos deployment is able to add or remove clusters with no downtime. The “Prometheus/Thanos replicas” (AKA “Scrapers”) and “Envoy ingress proxies” are the only things cluster owners need to deploy in order to join our centralized monitoring. As soon as Prometheus scrapes data from its targets, all of the client cluster’s metrics become part of the Thanos system, with high availability and full persistence of each metric sample.

These client clusters contain a large number of microservices for various purposes. All services are required to have a Prometheus metric endpoint (or have a dedicated exporter). Our “Client” Prometheus runs with a Thanos sidecar, as well as a Pushgateway sidecar. Each Thanos sidecar adds three external labels to the metrics, which together make the instance unique: cluster, environment and replica. On our client clusters, we encourage (but do not require) users to run more than one Scraper replica (Prometheus + sidecars). It’s important that replicas of the same Scraper differ only by a single special label, such as replica.
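To make the labelling rule concrete, here is a minimal sketch of a check that a set of Scraper replicas differ only by the replica label. The function name, the label values ("eu1", "production") and the check itself are illustrative assumptions, not part of Thanos or of our infrastructure definitions.

```python
from typing import Dict, List

REPLICA_LABEL = "replica"  # the special label described above


def differ_only_by_replica(label_sets: List[Dict[str, str]]) -> bool:
    """Return True if all external label sets are identical once the
    replica label is dropped, and every replica value is present and unique."""
    stripped = [
        {name: value for name, value in labels.items() if name != REPLICA_LABEL}
        for labels in label_sets
    ]
    replicas = [labels.get(REPLICA_LABEL) for labels in label_sets]
    same_identity = all(s == stripped[0] for s in stripped)
    unique_replicas = None not in replicas and len(set(replicas)) == len(replicas)
    return same_identity and unique_replicas


# Hypothetical external labels for two replicas of the same Scraper.
scraper_a = {"cluster": "eu1", "environment": "production", "replica": "0"}
scraper_b = {"cluster": "eu1", "environment": "production", "replica": "1"}
# A replica that also differs by cluster would not be deduplicated with the others.
misconfigured = {"cluster": "eu2", "environment": "production", "replica": "1"}

print(differ_only_by_replica([scraper_a, scraper_b]))
print(differ_only_by_replica([scraper_a, misconfigured]))
```

A check like this could run in a deployment pipeline before a new Scraper replica joins the system, since replicas that differ by more than the replica label would show up as separate series rather than as one highly available source.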

On top of this, our infrastructure definitions allow two different Thanos “Scraper” deployments:

Global view only

In this model, engineers deploy a number of Prometheus replicas with a configuration typical for a vanilla Prometheus deployment: 2 weeks of retention, a larger persistent disk and compaction enabled. Here, the Thanos sidecar is responsible only for proxying all the data via StoreAPI gRPC as requested by the Thanos Querier in the Observability cluster. This gives us a global view of all metrics within our environment and provides a central entry point for querying this data.

Global View with long term retention

Another deployment option unlocked by Thanos is backing up the TSDB metrics to bucket storage. In this scenario, Prometheus can (and should) have very short retention, which makes it almost stateless. This option requires local compaction to be disabled to avoid any potential races with the global compaction done by the Thanos Compactor. In this mode, the central cluster needs to query Prometheus in the client clusters only for fresh data (2h and newer), while the older data (~3h+) can be fetched directly from object storage.

Let’s talk about numbers. Thanks to a strict review policy and Prometheus education, we are succeeding in keeping our metrics cardinality low. Overall, our client clusters usually don’t go beyond 1 million series within a two-hour block. The exceptions are our testing clusters, where the high frequency of automatic rollouts causes higher-than-normal cardinality. (We all love the ‘pod_name’ label, don’t we?)

Querying

Let’s look more into the Observability cluster where most of the Thanos system lives.

Now that we have shown how clients ingest metrics into the Thanos ecosystem, we want to query the collected data and display it to our engineers. We use Grafana as the first entry point for visualising metrics in dashboards, whilst the Thanos Querier component is essentially the only datasource for our queries. When a Grafana dashboard queries for Prometheus metrics, it first hits our Thanos Querier, which fans out the query either to all StoreAPI components or to a certain selection known to the Querier. StoreAPIs within a cluster (such as Rulers, Sidecars and Store Gateways) are discovered using DNS discovery, and those in remote clusters by static configuration.

After the initial filtering process by Querier (based on external labels and time ranges available on each StoreAPI), requests now fan out to:

Scrapers in the Observability or remote Client clusters for fresh data (e.g. Thanos Sidecar, Ruler).

A replica of the Store Gateway for historical metrics that are older than 3 hours.
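The filtering step described above can be sketched roughly as follows. This is a simplified illustration, not the Querier's actual implementation: the StoreInfo type, the select_stores function, the store names and the use of whole hours as timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoreInfo:
    """What a Querier is assumed to know about one StoreAPI (hypothetical type)."""
    name: str
    min_time: int  # oldest data available, in hours (illustrative unit)
    max_time: int  # newest data available, in hours
    external_labels: Dict[str, str] = field(default_factory=dict)


def select_stores(stores: List[StoreInfo], query_min: int, query_max: int,
                  matchers: Dict[str, str]) -> List[str]:
    """Keep only stores whose time range overlaps the query and whose
    external labels do not contradict the equality matchers."""
    selected = []
    for store in stores:
        overlaps = store.min_time <= query_max and query_min <= store.max_time
        # A store is excluded only if it advertises the label with another value.
        labels_ok = all(
            store.external_labels.get(name, value) == value
            for name, value in matchers.items()
        )
        if overlaps and labels_ok:
            selected.append(store.name)
    return selected


now = 100  # "current time" in hours, purely illustrative
stores = [
    StoreInfo("sidecar-eu1", now - 3, now, {"cluster": "eu1"}),
    StoreInfo("sidecar-us1", now - 3, now, {"cluster": "us1"}),
    StoreInfo("store-gateway", 0, now - 3),  # historical data from object storage
]

print(select_stores(stores, now - 1, now, {"cluster": "eu1"}))
print(select_stores(stores, now - 10, now - 5, {}))
print(select_stores(stores, now - 10, now, {"cluster": "us1"}))
```

In this sketch a query for the last hour of one cluster touches a single sidecar, a purely historical query touches only the Store Gateway, and a query spanning both ranges fans out to both.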

For requests to a remote cluster, we use Envoy to securely proxy traffic between clusters. A request goes via an Envoy sidecar, an edge Envoy egress proxy, and over the public internet to an edge Envoy ingress proxy (all over a secure connection). The latter then forwards the request on to the Thanos Sidecar to retrieve the data for the time period and labels specified. All of this is done using server-streaming gRPC.

A core feature of the Thanos Querier is that when it sees metrics that differ from each other only by a single, special ‘replica’ label, it deduplicates them seamlessly unless requested otherwise. This allows transparent handling of multi-replica Prometheus instances, which in turn enables rolling restarts and higher availability.
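The idea behind deduplication can be illustrated with a deliberately naive sketch. It assumes both replicas report samples at identical timestamps and simply keeps the first value seen per timestamp; the real Querier has to cope with replicas scraping at different moments and uses its own algorithm, which this example does not reproduce. The deduplicate function and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]                # (timestamp in seconds, value)
Series = Tuple[Dict[str, str], List[Sample]]


def deduplicate(series_list: List[Series],
                replica_label: str = "replica") -> List[Series]:
    """Merge series that are identical apart from the replica label.

    Naive strategy (an assumption for illustration): take the union of
    samples by timestamp, letting the first replica seen win on conflicts.
    """
    merged: Dict[Tuple[Tuple[str, str], ...], Dict[int, float]] = {}
    for labels, samples in series_list:
        # Series identity is the label set without the replica label.
        key = tuple(sorted(
            (name, value) for name, value in labels.items()
            if name != replica_label
        ))
        bucket = merged.setdefault(key, {})
        for timestamp, value in samples:
            bucket.setdefault(timestamp, value)
    return [(dict(key), sorted(bucket.items())) for key, bucket in merged.items()]


# Replica 0 missed the scrape at t=30 (e.g. a rolling restart); replica 1 missed t=45.
series = [
    ({"__name__": "up", "cluster": "eu1", "replica": "0"},
     [(0, 1.0), (15, 1.0), (45, 1.0)]),
    ({"__name__": "up", "cluster": "eu1", "replica": "1"},
     [(0, 1.0), (15, 1.0), (30, 1.0)]),
]

for labels, samples in deduplicate(series):
    print(labels, samples)
```

The gap left by one replica is filled from the other, which is why a rolling restart of the Scrapers does not show up as missing data in a deduplicated query.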

On the Observability cluster, we also run the Compactor. This is an essential singleton component that operates on a single object storage bucket in order to compact, down-sample and apply retention to the TSDB blocks held inside.
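As a rough illustration of what down-sampling means, the sketch below reduces raw samples to one set of aggregates per fixed time window. The downsample function, the 300-second window and the choice of aggregates are assumptions for the example; it does not reproduce the block format or the exact behaviour of the Compactor.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def downsample(samples: List[Sample],
               window_seconds: int) -> Dict[int, Dict[str, float]]:
    """Group samples into fixed windows and keep count, sum, min and max
    per window, keyed by the window's start timestamp."""
    windows: Dict[int, Dict[str, float]] = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % window_seconds
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))


raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 5.0)]
for start, agg in downsample(raw, window_seconds=300).items():
    print(start, agg)
```

Keeping several aggregates per window, rather than a single average, means that queries over long ranges can still answer questions about minimums, maximums and rates without reading every raw sample.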

Last, but not least, we run multiple replicas of Thanos Ruler. This component is responsible for meta-monitoring - for example, it checks if all of the Scrapers on remote clusters are up and sends an alert if not. It’s also a useful tool to evaluate alerts and recording rules that require a global view or longer metric retention than a local Prometheus provides. The rest of our rules are evaluated on the local Prometheus instances in each replica. This helps reduce risk, as it removes network partitioning from the equation.

Summary

The Thanos project allowed us to redefine the global metrics collection we had initially without sacrificing the benefits of Prometheus monitoring. We feel extremely lucky and happy that so many companies and independent users decided to help us to define the way of scaling Prometheus, maintain the independent Thanos project, and enable:

The global query view

The ability for long-term retention of metrics

The seamless support for high availability of Prometheus instances

The gradual and incremental deployment model

This architecture was not designed in a single day. It resulted from mistakes, hours of brainstorming, discussion and experimentation. Since Thanos is a new project, we needed to try our best to explore the best practices and the design that best fits our requirements. We would like to explain our experimentation and migration process in “The Great Thanos Migration” post that will follow at a later date.

Don’t forget that if you have deployed Thanos within your system, we would love to share a link to your post in the Thanos repository. Please raise a PR or contact us! As always, feel free to join our lovely community via the Slack badge in the README.

If you’re interested in building a global platform that allows running massive online games and simulations, or if building out the underlying infrastructure or contributing to Open Source Software sounds interesting, we are hiring for Principal Engineering and Senior Software Engineering roles at Improbable.