Prometheus 2.0: New storage layer dramatically increases monitoring scalability for Kubernetes and other distributed systems

• By Fabian Reinartz

Prometheus is a monitoring system and time series database expressly designed for the highly distributed, automated, and scalable modern cluster architectures orchestrated by systems like Kubernetes. Prometheus has an operational model and a query language tailored for distributed, dynamic environments.

Nevertheless, the nature of these systems enables increasing dynamism in application loads and node membership that can strain the monitoring system itself. As Prometheus has been adopted throughout the Kubernetes ecosystem, even powering the monitoring services on Tectonic, CoreOS’s enterprise Kubernetes platform, performance demands on the Prometheus storage subsystem have compounded. Prometheus 2.0’s new storage layer is designed to accommodate these demands into the future.

Problem space

Time series databases, like that underpinning Prometheus, face some basic challenges. A time series system collects data points over time, linked together into a series. Each data point is a numeric value associated with a timestamp. A series is defined by a metric name and labeled dimensions, and serves to partition raw metrics into fine-grained measurements.

For example, we can measure the total number of received requests, divided by request path , method , and server instance in the following series:

requests_total{path="/status", method="GET", instance="10.0.0.1:80"} requests_total{path="/status", method="POST", instance="10.0.0.3:80"} requests_total{path="/", method="GET", instance="10.0.0.2:80"}

Prometheus allows us to query those series by their metric name and by specific labels. Querying for requests_total{path=”/status”} will return the first two series in the above example. We can constrain the number of data points we want to receive for those series to an arbitrary time range, for example the last two hours, or the last 30 seconds.

Any storage layer supporting collection of metrics series at the rate and volumes seen in cloud environments thus faces two key design challenges.

Vertical and horizontal

We can think of such data as a two-dimensional plane where the vertical dimension represents all of the stored series, while the horizontal dimension represents time through which samples are spread.

series ^ │ . . . . . . . . . . . . . . . . . . . . . . request_total{path="/status",method="GET"} │ . . . . . . . . . . . . . . . . . . . . . . request_total{path="/",method="POST"} │ . . . . . . . │ . . . . . . . . . . . . . . . . . . . ... │ . . . . . . . . . . . . . . . . . . . . . │ . . . . . . . . . . . . . . . . . . . . . errors_total{path="/status",method="POST"} │ . . . . . . . . . . . . . . . . . errors_total{path="/health",method="GET"} │ . . . . . . . . . . . . . . │ . . . . . . . . . . . . . . . . . . . ... │ . . . . . . . . . . . . . . . . . . . . v <-------------------- time --------------------->

Prometheus periodically collects new data points for all series, which means it must perform vertical writes at the right end of the time axis. When querying, however, we may want to access rectangles of arbitrary area anywhere across the plane.

This means time series read access patterns are vastly different from write patterns. Any useful storage layer for such a system must deliver high performance for both cases.

Series churn

Series are associated with deployment units, such as a Kubernetes pod. When a pod starts, new series are added into our data pool. If a pod shuts down, its series stops receiving new samples, but the existing data for the pod remains available. A new series begins for the pod automatically spun up to replace the terminated pod. Auto-scaling and rolling updates for continuous deployment cause this instance churn to happen orders of magnitude more often than in conventional environments. While Prometheus may usually be collecting data points for a roughly fixed number of active series, the total number of series in the database grows linearly over time.

Because of this, the series plane in dynamic environments like Kubernetes clusters typically looks more like the following:

series ^ │ . . . . . . │ . . . . . . │ . . . . . . │ . . . . . . . │ . . . . . . . │ . . . . . . . │ . . . . . . │ . . . . . . │ . . . . . │ . . . . . │ . . . . . v <-------------------- time --------------------->

To be able to find queried series efficiently, we need an index. But the index that works well for five million series may falter when dealing with 200 million or more.

New storage layer design

The Prometheus 1.x storage layer deals well with the vertical write pattern. However, environments imposing more and more series churn started to expose a few shortcomings in the index. In version 2.0, we’re taking the opportunity to address earlier design decisions that could lead to reduced predictability and inefficient resource consumption in such massively dynamic environments.

We designed a new storage layer that aims to address these shortcomings to make it even easier to run Prometheus in environments like Kubernetes, and to prepare Prometheus for the proliferating workloads of the future.

Sample compression

The sample compression feature of the existing storage layer played a major role in Prometheus’s early success. A single raw data point occupies 16 bytes of storage. When Prometheus collects a few hundred thousand data points per second, this can quickly fill a hard drive.

However, samples within the same series tend to be very similar, and we can exploit this fact to apply efficient compression to samples. Batch compressing chunks of many samples of a series, in memory, squeezes each data point down to an average 1.37 bytes of storage.

This compression scheme works so well that we retained it in the design of the new version 2 storage layer. You can read more on the specifics of the compression algorithm Prometheus storage employs in Facebook’s “Gorilla” paper.

Time sharding

We had the profound realization that rectangular query patterns are served well by a rectangular storage layout. Therefore, we have the new storage layer divide storage into blocks, each of which holds all the series from a range of time. Each block acts as a standalone database.

t0 t1 t2 t3 now ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │ │ │ │ │ │ │ ┌────────────┐ │ │ │ │ │ │ │ mutable │ <─── write ──── ┤ Prometheus │ │ │ │ │ │ │ │ │ └────────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘ ^ └──────────────┴───────┬──────┴──────────────┘ │ │ query │ │ merge ─────────────────────────────────────────────────┘

This allows a query to examine only the subset of blocks within the requested time range, and trivially addresses the problem of series churn. If the database only considers a fraction of the total plane, query execution time naturally decreases.

This layout also makes it trivially easy to delete old data, which was an expensive procedure in the old storage layer. Once a block’s time range completely falls behind the configured retention boundary, it can be dropped entirely.

| ┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ 1 │ │ 2 | │ │ 3 │ │ 4 │ │ 5 │ . . . └────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘ | | retention boundary

The Index

While reducing the queried data set is very efficient, it makes improving the overall index increasingly crucial as we expect series churn behavior to only intensify.

We want to query series by their metric names and labels. Those are completely arbitrary, user-configured, and vary by application and by use case. A column index like those in common SQL databases cannot be used. Instead, the new Prometheus storage layer borrows the inverted index concept used in full-text search engines. The inverted index method can efficiently retrieve documents by matching any of the words inside of them. The new storage layer treats each series descriptor as a tiny document. The series name and each label pair are then words in that document.

For example, the series requests_total{path="/status", method="GET", instance="10.0.0.1:80"} is a document containing the following words:

__name__="requests_total" path="/status" method="GET" instance="10.0.0.1:80"

For each unique label word across the series, the index now maintains a sorted list of IDs of the series in which it occurs. Those lists can be efficiently merged and intersected to quickly find the right time series for complex queries. For example, the selector requests_total{path!="/status",instance=~".*:80"} , retrieves all series for the requests_total metric of instances listening on port 80, excluding those requests for the /status path.

Benchmarks

Design and development are just one part of the equation. To reliably measure whether our efforts are paying off, we need to verify our theory using benchmarks. What could be better than simulating those series-churning Kubernetes workloads that initially motivated the new storage design? We wrote a benchmarking suite that spins up Kubernetes clusters and runs a microservice workload on them, with pods rapidly starting and stopping. Prometheus servers with different versions and configurations can monitor this workload, which allows for direct A/B testing of Prometheus performance.

But collecting data is just one part of what a time series database storage layer does. We also wanted to compare how the same queries perform on the old and new storage layers, respectively. To benchmark this case, Prometheus servers are continuously hit with queries that reflect typical use cases.

Overall, the benchmark simulates a workload far below the potential maximum sample throughput, but a lot more demanding in terms of querying and series churn than most real-world setups today. This may show larger performance improvements than typical production workloads, but it ensures the new storage layer can handle the next order of scale.

So what are the results? Memory consumption has historically been a pain point of Prometheus deployments. The new storage layer, queried as well as un-queried, shows significant memory savings as well as more stable, predictable allocation. This is especially useful in container environments where admins want to apply reasonable resource limitations.

Storage system CPU consumption also decreases in Prometheus 2.0 servers. Servers show increased usage when being queried, which mostly stems from the computations in the query engine itself, rather than the storage mechanism. This is a potential target for future optimization.

While we haven’t addressed it, we optimized the disk writing behavior of the storage layer. This is particularly useful for Kubernetes environments, where we are often using network volumes with limited I/O bandwidth. Overall, the bytes written to disk per second decreased by a factor of about 80.

As a surprising side effect, the new file layout and index structure also significantly decreases our storage size in addition to our previous gains.

Finally, what about the queries? The benchmarking environment causes significant series churn on top of the expensive queries we are running. The new storage layer aims to handle the linear increase of time series in storage without an equivalent increase in query latency. As shown in the graph below, Prometheus 1.5.2 exhibits this linear slowdown, whereas Prometheus 2.0 enters a stable state on startup with minimal spikes.

Try the Prometheus v2 alpha with the new storage layer

We have seen why today’s orchestrated container cluster environments like Kubernetes and CoreOS Tectonic bring new challenges to monitoring, and especially to efficiently storing and accessing masses of monitoring data. With the redesigned version 2 storage layer, Prometheus is prepared to handle monitoring workloads in increasingly dynamic systems for years to come.

Try out the third alpha release of Prometheus 2.0 today. You can participate in version 2.0 development in many ways, but as we work toward a stable release, testing the alpha releases and filing issues you encounter along the way are great ways to get started as a Prometheus contributor.

The Prometheus Operator, which makes deployment and configuration of Prometheus-powered monitoring in Kubernetes easy, also supports this version 2.0 alpha release. Just follow the Prometheus Operator getting started guide and set the Prometheus version to v2.0.0-alpha.2 .

There's a long list of people who contributed to this work through development, testing, and technical expertise. We deeply thank everyone involved in this critical next step for the Prometheus project.