We’re excited to announce our new metrics collection system, Rothko!

Metrics?

Metrics, logs, traces, telemetry, observability, monitoring, oh my! If you’ve been paying attention to how engineering organizations have increasingly been instrumenting their software to understand what is working and what isn’t at scale, you’ll have heard some of these terms. At the forefront is this idea that you can’t improve what you don’t measure, and measuring how your software is doing in the field is vital to improving it.

Many engineering organizations start with log aggregation, which collects logs from your systems and feeds them into some sort of log search tool so you can try to make sense of them. But an increasing number of organizations have moved to metrics. Metrics are generally counters and stats about software, and they are easier than logs to aggregate and to analyze for quantiles, medians, outliers, and so on. Systems report metrics, and operators store these metrics over time and plot them as time-series graphs.

An example time-series graph of whole-system data transfer rates over time

If you’ve heard of Graphite, InfluxDB, Prometheus, Atlas, etc., these are common tools for displaying and querying metrics over time (the graph above was generated with the Graphite frontend Grafana). These are all great systems, but a key rule for scaling any metric reporting system is sampling: at a certain point, it is impossible to save everything, so you must choose what to save and what to throw away, hopefully in a statistically unbiased way. If you have millions of devices reporting millions of metrics every second, it gets expensive really quickly to keep track of all of those metrics separately. Graphite, for example, lets you choose the time window for sampling, but its fundamental data model assumes you want to store data for every reporting service. Getting Graphite to scale to millions of reporting services means selecting some (maybe a lot) of the metrics to simply ignore. If you’re coming from a cloud-only environment, imagine your metrics system collecting time-series data from millions of servers!
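To make "bias-free" concrete, here is a minimal sketch of reservoir sampling (Algorithm R), one standard way to keep a fixed-size uniform sample of a stream without knowing its length in advance. It is a generic illustration, not something any of the tools above are claimed to do; the function name and parameters are our own.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream."""
    rng = rng or random.Random()
    sample = []
    for seen, item in enumerate(stream, start=1):
        if len(sample) < k:
            # The first k items always fill the reservoir.
            sample.append(item)
        else:
            # Item number `seen` replaces a random slot with probability k / seen.
            slot = rng.randrange(seen)
            if slot < k:
                sample[slot] = item
    return sample


# Keep 100 of 10,000 hypothetical device reports; memory use stays fixed at k.
kept = reservoir_sample(range(10_000), k=100, rng=random.Random(42))
print(len(kept))
```

Every report has the same chance of ending up in the sample, so statistics computed from it are not skewed toward early or late reporters.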

Here at Vivint, we’re on the front lines of putting devices into people’s homes. We have a massive number of devices, and at the scale we operate, measuring how our devices are behaving comes with new challenges. We weren’t satisfied with the trade-off of deciding to ignore some devices and sought another way. What we want are time-series graphs of each metric’s overall distribution across all reporting devices. So we built Rothko.

Quantiles

To understand Rothko, first we need to take a quick detour to explain quantiles. If you’ve heard of quartiles, percentiles, etc., these are all quantiles. A quantile describes what fraction of the reported values fall at or below a given value.

Let’s say you’re interested in 10 service requests and you kept track of how long each request took. The request times look like this:

12.3ms

37.8ms

53.1ms

18.2ms

5.7s

18.6ms

42.1ms

20.9ms

14.0ms

32.1ms

Perhaps you want some statistic to try and make sense of these values, so you start by averaging them. Even though 90% of the requests finished in under 55ms, the average response time is 594.9ms!
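If you want to check the arithmetic yourself, this short sketch averages the ten request times listed above (with 5.7s written as 5700ms) and then averages them again without the one slow request. The variable names are just for illustration.

```python
from statistics import mean

# The ten request times from the list above, in milliseconds (5.7s = 5700ms).
request_times_ms = [12.3, 37.8, 53.1, 18.2, 5700.0, 18.6, 42.1, 20.9, 14.0, 32.1]

average_ms = mean(request_times_ms)
share_under_55ms = sum(t < 55 for t in request_times_ms) / len(request_times_ms)

# Drop the single slow request to see how much it skews the average.
average_without_outlier_ms = mean(t for t in request_times_ms if t < 1000)

print(f"average: {average_ms:.1f}ms")
print(f"share under 55ms: {share_under_55ms:.0%}")
print(f"average without the 5.7s request: {average_without_outlier_ms:.1f}ms")
```

One request out of ten is responsible for nearly all of the average.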

To me, this means averages are kind of useless here. Haha, means. This is where quantiles come in.

100% of requests finished in 5.7s or less

90% of requests finished in 53.1ms or less

80% of requests finished in 42.1ms or less

70% of requests finished in 37.8ms or less

60% of requests finished in 32.1ms or less

50% of requests finished in 20.9ms or less

40% of requests finished in 18.6ms or less

30% of requests finished in 18.2ms or less

20% of requests finished in 14.0ms or less

10% of requests finished in 12.3ms or less

Here, 18.6ms is the 40th percentile, or the 0.4 quantile. You might notice that the 50th percentile, or the 0.5 quantile, is more commonly called the median.
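The list above uses what is usually called the nearest-rank definition: sort the values, then take the smallest one that has at least the requested percentage of values at or below it. A minimal sketch, with a helper name of our own choosing (other percentile definitions interpolate between values and give slightly different answers):

```python
import math


def percentile_nearest_rank(values, percent):
    """Smallest value with at least `percent`% of the values at or below it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# The same ten request times, in milliseconds.
request_times_ms = [12.3, 37.8, 53.1, 18.2, 5700.0, 18.6, 42.1, 20.9, 14.0, 32.1]

for percent in (100, 90, 50, 40, 10):
    value = percentile_nearest_rank(request_times_ms, percent)
    print(f"{percent}% of requests finished in {value}ms or less")
```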

Quantiles are really useful for getting a better picture of what’s going on with your system! By taking a look at the quantiles of request times, it’s immediately clear that just one of the requests took a lot longer than all the others.

It’s worth pointing out that you can read these from the other direction, too. At the 10th percentile, 90% of requests took longer than 12.3ms. At the 70th percentile, 30% of requests took longer than 37.8ms.

In real-world systems, the average is usually not only uninteresting but actively misleading. Keeping track of quantiles is a much better way to aggregate and keep track of metrics across a diverse array of sources.

Rothko

Instead of time-series graphs of a small number of discrete metrics over time, Rothko is a time-series quantile system, displayed somewhat like time-series histograms. Named metrics are assumed to be reported in from millions of individual devices, and instead of recording the values for each device separately, Rothko records the distribution of the seen values for each metric. Suffice it to say Rothko scales much, much better.
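To illustrate the idea (and only the idea, this is not Rothko’s actual code or storage format), here is a toy recorder that forgets which device sent a value and, at the end of each time window, keeps only a handful of percentiles per metric. The class, method, and metric names are hypothetical.

```python
import math
from collections import defaultdict


class DistributionRecorder:
    """Toy per-metric recorder: stores percentiles, never per-device series."""

    def __init__(self, percents=(10, 25, 50, 75, 90, 100)):
        self.percents = percents
        self._values = defaultdict(list)

    def observe(self, metric, value):
        # The reporting device's identity is deliberately not recorded.
        self._values[metric].append(value)

    def flush(self):
        """Summarize the current window as nearest-rank percentiles and reset."""
        summary = {}
        for metric, values in self._values.items():
            ordered = sorted(values)
            n = len(ordered)
            summary[metric] = {
                p: ordered[math.ceil(p * n / 100) - 1] for p in self.percents
            }
        self._values.clear()
        return summary


recorder = DistributionRecorder()
for device in range(1000):  # 1000 pretend devices report one value each
    recorder.observe("cpu.idle", device % 100)
window = recorder.flush()
print(window["cpu.idle"])
```

What gets stored per window is the same size whether 1,000 or 1,000,000 devices report. This toy version still buffers every value in memory until the flush; a real system would presumably replace that list with a bounded sample or sketch.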

So what does Rothko look like?

This is three-dimensional data represented as a colored heatmap. The x-axis (the bottom) is time, the y-axis (on the left) is the quantile (what percent of reporting devices have the given value or less), and the z-axis (the color, with a value given on the right side) is the value of the metric.

This is a graph of how much CPU is idle across thousands of devices, so the closer to dark purple, the less CPU is idle, and the closer to yellow, the more CPU is idle. On 3/13 around 05:00, we seemingly had a peak of idleness across most devices. The 20th percentile at that time seems to be about 80% idle! The most recent data on the right suggests we’re starting to see more CPU usage: the 20th percentile has dropped to 70% idle.
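If it helps to think of the heatmap as data, it is a grid with one row per quantile and one column per time slice, where each cell holds a metric value that then gets a color. The sketch below builds such a grid from made-up idle-CPU readings; the numbers and function names are illustrative assumptions, not Rothko output.

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile of a non-empty list of values."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def quantile_grid(time_slices, percents):
    """Return {percent: [value in slice 0, value in slice 1, ...]}."""
    return {p: [percentile(values, p) for values in time_slices] for p in percents}


# Made-up idle-CPU percentages from five devices in two consecutive time slices.
time_slices = [
    [95, 80, 92, 85, 90],
    [88, 70, 93, 78, 85],
]
grid = quantile_grid(time_slices, percents=(20, 50, 80))

# Print the highest quantile first so the rows read like the heatmap's y-axis.
for p in sorted(grid, reverse=True):
    print(f"p{p}: {grid[p]}")
```

In this invented data the 20th percentile falls from 80 to 70 between the two slices, the same kind of shift described above.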

Note that the values associated with color on the right are not linear: there’s a big drop from 57.7 to 0.00 here. That’s because we choose colors based on the most recent data, and while there are a few nodes that have no idle CPU, 90% of nodes have at least 57.7% idle CPU in the most recent time slice.
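One way a non-linear scale like this can come about is by placing color boundaries at evenly spaced quantiles of the latest time slice, so each color band covers the same share of devices rather than the same range of values. The sketch below shows that approach with invented numbers. It is our guess at the general technique, not a description of how Rothko actually picks its colors.

```python
import bisect
import math


def color_breakpoints(latest_values, steps):
    """Boundaries at the 0/steps, 1/steps, ..., steps/steps quantiles."""
    ordered = sorted(latest_values)
    n = len(ordered)
    return [ordered[max(1, math.ceil(i * n / steps)) - 1] for i in range(steps + 1)]


def color_index(value, breakpoints):
    """Index of the color band (0 = darkest) that a value falls into."""
    bands = len(breakpoints) - 1
    index = bisect.bisect_right(breakpoints, value) - 1
    return min(max(index, 0), bands - 1)


# Invented idle-CPU readings for the latest slice: one busy node, the rest idle.
latest = [0.0, 57.7, 60.1, 63.0, 70.2, 75.5, 80.0, 84.3, 90.9, 99.0]
breakpoints = color_breakpoints(latest, steps=5)
print(breakpoints)
print(color_index(30.0, breakpoints), color_index(99.0, breakpoints))
```

With one busy node and nine mostly idle ones, the first boundary after 0.0 is already 57.7, so the scale jumps even though the colors are evenly spread across devices.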

Performance

From the start, the design of Rothko has been focused on performance. The goal has been to very cheaply store useful information across large fleets, and good performance is critical to achieving that goal.

To give some numbers, we have an instance of Rothko running in production that is storing about 170,000 metrics per second on 50% of a single CPU core. It is configured to write out a histogram for every metric every 10 minutes, and it completes writing 30,000 distinct metrics in about 7 seconds on average, spending 260µs per metric. Since we only write for 7 seconds every 10 minutes, disk utilization is around 1.25%.
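For a back-of-the-envelope feel for these figures, the sketch below redoes the arithmetic from the rounded numbers quoted above. The rounded inputs give slightly lower results than the quoted 260µs and 1.25%, which presumably come from unrounded measurements (that is our assumption).

```python
values_per_second = 170_000
distinct_metrics = 30_000
flush_interval_s = 10 * 60  # a histogram is written every 10 minutes
flush_duration_s = 7        # "about 7 seconds" per flush

# How many reported values get folded into each written histogram.
values_per_window = values_per_second * flush_interval_s
values_per_histogram = values_per_window / distinct_metrics

# Write cost per metric and the fraction of time spent writing.
write_time_per_metric_us = flush_duration_s / distinct_metrics * 1e6
write_duty_cycle_pct = flush_duration_s / flush_interval_s * 100

print(f"values per 10-minute window: {values_per_window:,}")
print(f"values folded into each histogram: {values_per_histogram:,.0f}")
print(f"write time per metric: {write_time_per_metric_us:.0f}µs")
print(f"share of time spent writing: {write_duty_cycle_pct:.2f}%")
```

Put differently, each histogram written to disk stands in for thousands of individual reports.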

Since writes put so little pressure on the system, we can highly optimize reads. The data layout is set up so that reading a metric is just straight-line reads through files on disk. For reads, even on cold metrics, we’re observing 30ms response times for a fully rendered image.

Get Rothko!

Check out the GitHub readme for instructions on how to build and deploy your own. We believe the combination of cheap operation, good performance, and global analysis might strike a nice balance for your organization, too.