For a long time, the StatsD + Graphite stack was the go-to solution for time-series collection and storage.

In recent years, with the increased adoption of Kubernetes, Prometheus has been gaining more and more attention as an alternative to the classic Graphite + StatsD stack. As a matter of fact, Prometheus was initially developed at SoundCloud for exactly that purpose: to replace the Graphite + StatsD stack they used for monitoring. Later, in 2016, the Cloud Native Computing Foundation (CNCF), the organization responsible for Kubernetes and multiple related projects (Helm, for example), adopted Prometheus as an official project of the foundation.

Like many other companies in the industry, we’ve been using the Graphite stack for almost 4 years now. Since we are long-time users of Kubernetes (in production since 2015, ~ v1.3), it was only natural for us to evaluate Prometheus as a more modern, community-driven and well-maintained monitoring stack. Listed below are the main differences between the two monitoring stacks, focusing on how Prometheus provides solutions in situations where Graphite has difficulties.

Notes:

This post refers to Prometheus’s stable Helm chart installation on Kubernetes. We’re using Grafana for visualization and alerting, so I didn’t cover visualization capabilities or Prometheus’s Alertmanager here.

Pull vs. Push

The first and most notable difference between Graphite and Prometheus is the way they receive metrics.

Graphite: metrics usually arrive at StatsD as UDP packets sent from the clients. StatsD aggregates the metrics over a time period called the “flush interval” and, at its end, sends them to Graphite for persistence. Graphite has “push” semantics: the client is the one pushing the data into the backend.

Prometheus: metrics arrive at the backend by “scraping”: the Prometheus server issues an HTTP call to each client once every scrape_interval (which is configurable, of course). Prometheus “pulls” the metrics directly from its clients. There is no aggregation component in the middle similar to StatsD.

Note: While you could push metrics to Prometheus via the Pushgateway, it is not the recommended approach, so it isn’t presented here as an option.
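For reference, a minimal scrape configuration outside of Kubernetes service discovery could look like the following sketch (the job name and target address are illustrative, not from our setup):

```yaml
# prometheus.yml (fragment) -- job name and target are illustrative
global:
  scrape_interval: 30s          # how often Prometheus pulls from each client

scrape_configs:
  - job_name: auth_service
    metrics_path: /metrics      # the default path
    static_configs:
      - targets: ["auth-service.example.com:3000"]
```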

Client Setup

StatsD clients require almost zero setup: all you need is a UDP socket to start sending metrics to the backend. It’s so easy that a simple bash one-liner is a valid StatsD client. For example, the following will increase a counter:

echo "auth_service.login.200.count:1|c" | nc -w 1 -u statsd.example.com 8125

Prometheus, on the other hand, requires a more complicated setup on the client side. Clients should run an HTTP server and serve the metrics on an exposed port and path. This means that even if your application is a simple, offline queue consumer, you’ll have to go through the hassle of importing HTTP capabilities into your project, configuring the server and setting up the networking needed for that server to serve the metrics to Prometheus.

Another requirement of Prometheus is the registry: an object that must be initialized on the client with the type, name and label set of every metric it would like to report. Reporting a metric that does not exist in the registry might even throw a runtime exception in some Prometheus clients.

Graphite requires almost zero setup on the client, while Prometheus’s client setup is a lot more complicated.
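To make the difference concrete, here is a rough sketch of what a Prometheus client has to provide: a registry of known metrics and an HTTP endpoint serving them in the text exposition format. This is a standard-library toy, not a real implementation; actual applications would use an official Prometheus client library, and all metric and label names here are illustrative.

```python
# Toy Prometheus "client": a registry plus an HTTP /metrics endpoint.
# Illustrative only; real apps should use an official client library.
from http.server import BaseHTTPRequestHandler, HTTPServer

REGISTRY = {}  # (metric name, sorted label pairs) -> current value

def inc_counter(name, labels, amount=1):
    key = (name, tuple(sorted(labels.items())))
    REGISTRY[key] = REGISTRY.get(key, 0) + amount

def render_metrics():
    # Produce Prometheus text exposition lines: name{label="v",...} value
    lines = []
    for (name, labels), value in sorted(REGISTRY.items()):
        label_str = ",".join('%s="%s"' % (k, v) for k, v in labels)
        lines.append("%s{%s} %s" % (name, label_str, value))
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

inc_counter("http_requests", {"service": "auth_service", "response_code": "200"})
print(render_metrics())
# http_requests{response_code="200",service="auth_service"} 1

# To actually serve the endpoint for Prometheus to scrape:
# HTTPServer(("", 3000), MetricsHandler).serve_forever()
```

Compare this with the one-line StatsD example above: the HTTP server, the registry and the exposition format are all extra moving parts on the client side.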

Discovery

Graphite: To be able to send metrics all your clients need is your StatsD host — there’s no client discovery taking place.

Prometheus: Prometheus has to be aware of all clients it would like to pull metrics from. That complicates things a bit since it means the Prometheus server must have discovery capabilities as well as scrape job configurations to be able to properly identify clients and fetch data from them.

Specifically for Kubernetes the default Prometheus installation includes a job that scrapes all pods with the following annotation:

prometheus.io/scrape: "true"

You can also specify the path and port to scrape data from with the following annotations:

prometheus.io/path: "/internal/metrics"

prometheus.io/port: "3000"

So while this is additional setup, most of the work is done by Prometheus’s Kubernetes service discovery plugins and the default scrape job configurations.
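Put together, a pod opting into scraping might look like this sketch (the pod name, image and port are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: auth-service                          # illustrative
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/internal/metrics"
    prometheus.io/port: "3000"
spec:
  containers:
    - name: auth-service
      image: example/auth-service:latest      # illustrative
      ports:
        - containerPort: 3000
```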

Graphite requires no discovery since it is not aware of its clients, while Prometheus requires discovery capabilities and scrape job configurations to be able to fetch data.

Monitoring Ephemeral Processes

Graphite: short running jobs can open a UDP socket and start sending metrics to StatsD. The fact that the process does not live for a long time does not affect its ability to send metrics.

Prometheus: As we’ve already learned, Prometheus requires a scrapable HTTP endpoint to pull data from, which makes getting metrics from short-running jobs problematic since they might not be available at the time Prometheus runs its scrape loop. Prometheus’s solution to this is the push gateway: a stable process that acts as a metrics cache and provides an endpoint for the Prometheus server to scrape. Short-running processes push their metrics to the gateway via HTTP requests.
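A short-lived job talking to the push gateway boils down to one HTTP request carrying text-format metrics. The sketch below uses only the standard library; the gateway URL, job name and metric name are illustrative.

```python
# Sketch: push a metric from an ephemeral process to the Pushgateway.
# Gateway address, job name and metric name are illustrative.
import urllib.request

def build_push_request(gateway, job, metric, value):
    # The Pushgateway accepts text-format metrics at /metrics/job/<job_name>.
    url = "%s/metrics/job/%s" % (gateway.rstrip("/"), job)
    body = ("%s %s\n" % (metric, value)).encode()
    return url, body

def push(gateway, job, metric, value):
    url, body = build_push_request(gateway, job, metric, value)
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "text/plain")
    with urllib.request.urlopen(req) as resp:  # needs a running gateway
        return resp.status

# e.g. push("http://pushgateway.example.com:9091", "nightly_backup",
#           "backup_duration_seconds", 42)
```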

Reporting metrics from ephemeral processes is very easy on Graphite. Prometheus requires a more complicated setup in the form of the push gateway.

Metrics Naming

Graphite’s metrics are dot-oriented, for example

<myservice>.<request_type>.200.count

specific example:

auth_service.login.200.count

is a typical counter of HTTP 200 responses for a specific request.

Prometheus has label-based metric names so the same metric as above would look like

http_requests{service="myservice",request="request_type",response_code="200"}

specific example: http_requests{service="auth_service",request="login",response_code="200"}

Prometheus’s naming system is a lot better in my opinion for the following reasons:

Graphite’s metric names imply a hierarchy which is not always intuitive. For example, say we have a service deployed in different AWS regions we would like to monitor: would we name the metric region.service.metric_name or service.region.metric_name? Graphite’s naming convention also makes querying more difficult, since it requires complete knowledge of the metric structure. A good example would be summing all 200 responses for all services in our system:

Graphite query: sumSeries(*.*.200.count). Notice how we’re completely aware of the metric structure: we know it has 4 period-separated parts, that the service name is the first part and the request type is the second.

Prometheus query: sum(rate(http_requests{response_code="200"}[1m]))

Notice how we’re completely unaware of the other labels the metric has, like service and request type. Querying does not require any knowledge of the metric’s structure, except for the names of the labels we would like to query on.

Graphite’s strict metric naming convention also becomes very cumbersome when the metric structure changes. Let’s assume we’re going multi-region and want to add the region to the beginning of the metric name, so instead of service.request.response_code.count we now have region.service.request.response_code.count

As a consequence, we have to change all existing queries to reflect the new metric structure: sumSeries(*.*.200.count) isn’t a valid query anymore since the metric now has 5 parts instead of 4. This makes adding data to existing metrics nearly impossible when the system is large and there are thousands of queries that require changing.

Prometheus, on the other hand, has no such problem. Adding a region label to the metric does not invalidate existing queries. It means that as long as we don’t change existing label names, we’re free to add data to our metrics without fear of breaking existing queries.

Prometheus’s metric naming system is more concise, flexible and tolerant of change.

Note: We haven’t used Graphite’s tagging ability, which was only added in version 1.1.x, so it is not covered in this section.

Query Language

Graphite has a set of functions you can use, while Prometheus came up with PromQL. It might be a matter of taste, but I feel PromQL is a bit more modern and conveys the intent of the query better than Graphite’s functions. I’ll try to demonstrate with an example: let’s take the classic case of per-service error rate. For each service we would like to divide the total number of errors by the total number of requests, where we define an error as a response with a status code ≥ 499.

Graphite: assuming metrics would be of the form

<service_name>.<request_type>.<response_code>.count

This is how the query would look:

applyByNode(*.*.*.count, 0, "asPercent(sumSeries(%.*.{499,5*}.count), sumSeries(%.*.*.count))", "%")

It looks a bit cryptic and very hard to reason about. This is one of these queries you write once and never touch again.

Prometheus: assuming we have metrics of the form

http_requests{service=<service>,request=<request>,response_code=<code>}

The PromQL version of per-service error rate would look like this:

sum(rate(http_requests{response_code=~"499|5.."}[1m])) by (service) /

sum(rate(http_requests[1m])) by (service)

In my opinion, the PromQL version of the query is a lot cleaner and conveys the purpose of the calculation in a more readable and manageable way.

This is just a single example of course but it demonstrates the power and simplicity of PromQL which is reflected in other cases as well.

Aggregations

StatsD is Graphite’s aggregator: it aggregates all metrics received in a flush interval and writes a single point to Graphite with the aggregated value. If 100 different processes each report a single increment on a counter service.errors to StatsD, StatsD sums all the increments every flush interval and writes to Graphite a single point with the value 100 and the series name service.errors. The same goes for timings, so percentiles are calculated over all of the data received in a flush interval. This also means that if we would like to have per-instance data on Graphite, we have to explicitly put an instance identifier inside the metric name.
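The flush-interval behavior can be sketched as a toy model: increments from many processes are summed in memory, and a single aggregated point per series is emitted when the interval elapses. Class and series names here are illustrative.

```python
# Toy model of StatsD's flush-interval aggregation (illustrative names).
from collections import defaultdict

class ToyStatsD:
    def __init__(self):
        self.counters = defaultdict(int)

    def incr(self, series, amount=1):
        # Increments from any number of processes land in the same counter.
        self.counters[series] += amount

    def flush(self):
        # One aggregated point per series is what actually reaches Graphite;
        # the in-memory stats are reset every flush interval.
        points = dict(self.counters)
        self.counters.clear()
        return points

statsd = ToyStatsD()
for _ in range(100):            # 100 processes each incrementing once
    statsd.incr("service.errors")
print(statsd.flush())           # {'service.errors': 100}
```

Note how the per-process origin of each increment is lost at flush time, which is exactly why per-instance data requires encoding the instance into the metric name.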

Prometheus works differently: since it pulls data from all of the instances, it can easily add a label with the instance id (pod name in Kubernetes) to every scraped metric. This means that Prometheus has per-instance metrics by default. Aggregations are done on the server side at query time via PromQL operators such as sum, avg and quantile. In the example above, we would have a series errors{kubernetes_pod_name="<pod_name>"} for each of the pods being scraped. To get the total error rate we would run the PromQL query sum(rate(errors[1m])).

Another subtlety worth mentioning is the way percentiles are calculated. StatsD’s percentile calculation is very straightforward, since it has all the data points at hand to calculate the exact percentile. However, the percentiles themselves have to be set before the metrics are received and can’t be calculated retroactively: if we decide at some point that we want the 99th percentile of some metric, we can’t have it for past data unless the metric was already configured to record the 99th percentile.

Prometheus has two means of calculating statistical aggregations: Histograms and Summaries. I strongly recommend reading this blog post and this one to get a better understanding of how both work, but the important points for our discussion, assuming the use of Histograms, are:

Histograms are a set of counters. Each counter has a preset upper bound (a bucket) and is incremented for any observation with a value lower than or equal to that bound. For example, if we have a Histogram with 3 buckets, 10ms, 50ms and 100ms, a 5ms observation increments all counters and a 40ms observation increments both the 50ms and 100ms counters. There are two additional counters: one for the number of observations and one for the sum of all observed values.

Percentiles do not have to be specified beforehand and can be calculated retroactively.

Since all Prometheus keeps is bucketed observation counts, percentiles can only be statistically approximated.
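These points can be illustrated with a small sketch of cumulative buckets and a crude quantile estimate derived from them. The bucket bounds are the 10ms/50ms/100ms example from above; the estimation here simply returns a bucket's upper bound, a simplification of what PromQL's histogram_quantile does (which additionally interpolates inside the bucket).

```python
# Sketch: cumulative histogram buckets and a quantile approximation.
# Bucket bounds follow the 10ms/50ms/100ms example; simplified vs PromQL.
BUCKETS = [0.010, 0.050, 0.100]   # upper bounds in seconds

def observe(counts, value):
    # A cumulative histogram increments every bucket whose bound >= value,
    # plus the observation count and the running sum.
    for bound in BUCKETS:
        if value <= bound:
            counts[bound] += 1
    counts["count"] += 1
    counts["sum"] += value

def quantile(counts, q):
    # Upper bound of the first bucket containing the q-th observation:
    # a statistical approximation, never an exact percentile.
    rank = q * counts["count"]
    for bound in BUCKETS:
        if counts[bound] >= rank:
            return bound
    return BUCKETS[-1]   # real Prometheus has an explicit +Inf bucket

counts = {b: 0 for b in BUCKETS}
counts.update(count=0, sum=0.0)
for v in [0.005, 0.040, 0.040, 0.120]:
    observe(counts, v)
print(quantile(counts, 0.5))   # 0.05
```

Notice that nothing about the chosen quantile is baked into the stored counters: we can ask for the 50th or the 99th percentile of the same buckets after the fact, at the cost of accuracy.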

To sum this section up:

1. Graphite’s data points are usually already aggregated across all clients, while Prometheus saves per-client data and aggregations are done via PromQL.

2. Graphite’s statistical aggregations are accurate but less flexible since the percentiles we would like to track have to be set beforehand. Prometheus’s statistical aggregations are less accurate, but more flexible since aggregation is done via PromQL and allows us to retroactively calculate different percentiles without specifying them anywhere.

Measuring Client Uptime

Uptime is a strong KPI in every monitoring system. Let’s assume we have a pod and would like to get alerted if it is not responsive.

Graphite: We would have to run an infinite loop on a dedicated thread on the client side to report a heartbeat to StatsD. The heartbeat metric could be a simple counter incremented by 1 on every cycle. We could then form a query to get the number of heartbeats in the last minute: summarize(service.pod_name.heartbeats, '1min', 'sum')

Prometheus has pull mechanics, so it is already sampling the client in an infinite loop to fetch metrics, which makes discovering downtime very natural. Every Prometheus scrape job produces an up series that is set to 0 if an instance did not reply to Prometheus’s HTTP request. This means that with zero effort we can get all instances that failed to reply with the simple PromQL query up == 0 and set up an alert.

Prometheus’s uptime monitoring requires zero setup because of its pull mechanics, while Graphite requires us to set up an infinite reporting loop on the client and query it.

Missing Data Points

As with any monitoring system, both Prometheus and Graphite are subject to data retrieval errors, be it because the Prometheus/Graphite server itself is down or because of a network error that prevents the server from receiving metrics from the client. As we are going to see, Prometheus is designed to be tolerant of missing data points (as much as possible, of course).

Graphite: StatsD writes a data point to Graphite every flush interval and resets its stored statistics. If a data point was not persisted to Graphite, that data point is lost forever. Let’s take an example data set of an errors counter, assuming StatsD flushes metrics to Graphite every minute:

Time: 08:01 08:02 08:03 08:04 08:05 08:06

Errors Counter: 2 5 100 170 2 1

Until 08:01 StatsD received 2 increments for the errors counter, wrote it to Graphite and reset it to 0. In the minute between 08:01 and 08:02 StatsD received 5 increments for the errors counter, wrote it to Graphite, reset it to 0 and so on.

If the data points of 08:03 and 08:04 could not be persisted for any reason, this is how the final data set on Graphite would look:

Time: 08:01 08:02 08:03 08:04 08:05 08:06

Graphite: 2 5 NULL NULL 2 1

The knowledge that we had a spike of 270 errors between 08:02 and 08:04 is lost and won’t be reflected in any way: the graph would show only the low values 2, 5, 2 and 1. We completely lost the occurrence of the error spike.

Prometheus: Prometheus is designed to handle missing data points, be it because of Prometheus downtime or a scrape failure, very well:

Metrics are saved on the client side and are never reset. Counters, for example, are an ever-increasing value.

Metrics and PromQL functions are designed around counters to allow extrapolation of missing data points.

Let’s take the same scenario described above and see how Prometheus handles it better:

Time: 08:01 08:02 08:03 08:04 08:05 08:06

Num. Errors: 2 5 100 170 2 1

Prometheus counter: 2 7 107 277 279 280

Notice how the counter, which is saved on the client side, is ever increasing and does not reflect a point-in-time value.

Assuming the scrapes at 08:03 and 08:04 have failed and the next one, at 08:05, has succeeded, we end up with the following data set persisted on the Prometheus server:

Time: 08:01 08:02 08:03 08:04 08:05 08:06

Prometheus counter: 2 7 NULL NULL 279 280

We still know that there were 272 errors between 08:02 and 08:05, because 279 - 7 = 272. We won’t know exactly in which minute we had the error surge, but it is easy to identify that the rate of errors during these minutes is higher than during the rest of the period.

To properly draw the graph of this counter we use the PromQL rate function, which approximates the rate during a time period by dividing the difference between the values by the length of the period: the rate between 08:01 and 08:02 is calculated as

(7 - 2) / (08:02 - 08:01) = 5 / 60 = 0.083 errors/sec

The rate for 08:02 to 08:05 would be

(279 - 7) / (08:05 - 08:02) = 272 / 180 = 1.51 errors/sec

An alert on error rate would have been triggered here even though we missed the data points of the event itself.
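The arithmetic above is simple enough to sketch as a two-line function, using the same numbers from the example (timestamps expressed as seconds since 08:01):

```python
# Worked sketch of the counter rate calculation, including the case where
# intermediate scrapes were lost. Timestamps are seconds since 08:01.
def rate(t1_seconds, v1, t2_seconds, v2):
    # rate = (later value - earlier value) / elapsed seconds
    return (v2 - v1) / (t2_seconds - t1_seconds)

# 08:01 -> 08:02, counter goes 2 -> 7
print(round(rate(0, 2, 60, 7), 3))       # 0.083 errors/sec
# 08:02 -> 08:05, counter goes 7 -> 279 (08:03 and 08:04 scrapes lost)
print(round(rate(60, 7, 240, 279), 2))   # 1.51 errors/sec
```

The second call shows why the missed scrapes don't hide the surge: the counter carries the missing increments forward, and the rate over the longer window still stands out.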

Now, this might look like a very specific case, but the fact is that the whole Prometheus ecosystem is built around counters to provide fault-tolerant metrics. Another great example of counter usage is CPU utilization. In most other metric systems, CPU utilization would be persisted the following way:

Time: 08:01 08:02 08:03 08:04 08:05 08:06

CPU%: 5% 10% 95% 95% 30% 5%

But here again, we are subject to data loss: if the data points of 08:03 and 08:04 are missing, we are unaware of the CPU surge we had in that period.

Prometheus takes another approach to measuring CPU utilization: it does not persist the CPU% for every point in time, but keeps a counter of the total number of seconds a process has used the CPU. If a process had 10% utilization during a 60-second period, it means it used the CPU for 6 seconds. The same data set above would look like this on Prometheus:

Time: 08:01 08:02 08:03 08:04 08:05 08:06

CPU%: 5% 10% 95% 95% 30% 5%

Seconds Used: 3 6 57 57 18 3

Prometheus Counter: 3 9 66 123 141 144

To go from the Prometheus counter back to CPU% we use the rate function again: between 08:01 and 08:02 the rate was (9 - 3) / 60 = 6 / 60 = 0.1 (10%)

If we lose the data points at 08:03 and 08:04 (the 95% CPU surge), we can still see the surge, because the data point at 08:05 is 141, so we get: (141 - 9) / 180 = 132 / 180 = 0.73 = 73%

We won’t have the original 95% CPU usage but we will still see a significant increase in CPU usage during that interval.

Prometheus’s use of ever-increasing counters, saved on the client, makes it more tolerant of missing data points than Graphite.

Exporters

This is where Prometheus really shines, in my opinion, and is one of the strongest incentives for migrating from Graphite. Exporters are components that fetch data from applications and expose Prometheus-compatible metrics. There are exporters for almost every application you can think of: RabbitMQ, PostgreSQL, Redis, Kubernetes and the list goes on.

Exporters are usually plug and play: you provide them the address of the application you’d like to fetch metrics from (Redis host, RabbitMQ host, etc.) and they fetch the data and expose it to Prometheus for scraping. This is awesome because:

1. It spares you the time of writing a component that fetches the data from each application and organizes it.

2. There are many community-driven Grafana dashboards built around these exporters’ metrics, providing very useful visualizations and KPIs for your applications.

3. Since applications now have conventional metrics, there is a lot of knowledge sharing and blog posts about creating alerts from these metrics.

An excellent example of exporter usage is the way we monitored RabbitMQ with Graphite and the way we do it with Prometheus:

Graphite: We had a Jenkins job that ran every minute. The job ran a Ruby script that used RabbitMQ’s HTTP API to fetch metadata on queues, exchanges and consumers. The script parsed the data, formed Graphite-compatible metrics out of it and shipped them to StatsD over a UDP socket. On Grafana, we built a dashboard around these metrics to visualize the data and set alerts.

Prometheus: There is an exporter ready to be used. We added it as a sidecar container to our RabbitMQ pod and added the proper annotations to flag to Prometheus that this pod should be scraped. We imported an existing Grafana dashboard to visualize the data. All that was left for us to do was to set alerts according to our monitoring needs.
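The sidecar setup amounts to a pod fragment along these lines (the image names and the exporter’s port are illustrative, not our exact configuration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rabbitmq
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9419"                 # the exporter's port
spec:
  containers:
    - name: rabbitmq
      image: rabbitmq:3-management
    - name: rabbitmq-exporter
      image: example/rabbitmq-exporter:latest  # illustrative exporter image
      ports:
        - containerPort: 9419
```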

The exporter ecosystem that grew around Prometheus provides an (almost) end-to-end monitoring solution: fetching the data, organizing it, serving it to Prometheus, visualizing it on Grafana and setting alerts. Where a few years ago every company had its own custom RabbitMQ monitoring stack, today there’s a widely used, community-driven exporter and Grafana dashboard. With almost zero knowledge of how RabbitMQ works, we already have great visibility into our cluster.

Some applications, RabbitMQ among them, took this approach even further and started exposing a Prometheus metrics endpoint from the core application, so there’s no need for an exporter at all: Prometheus can just scrape the application itself.

The idea of exporters could also be implemented with Graphite: there is nothing preventing applications from pushing metrics to a provided StatsD/Graphite host. In fact, there were some Graphite exporters around, collectd for example. collectd serves a purpose similar to Prometheus’s node exporter: it exports node metrics and had a great Graphite integration. But collectd was an exception, as most applications have never had an easy solution for exporting metrics the way exporters do with Prometheus today.

Conclusion

If I had to choose a monitoring stack today I would probably go with Prometheus. Its flexible metric naming system, ability to handle missing data points and the vast exporter ecosystem that grew around it are good enough reasons to overcome its client setup complexity. In addition to that, the fact that Prometheus has been adopted and is being developed by the CNCF makes me feel the project is in good hands and has a very bright future ahead of it.