Monitoring At Yeller

This is a blog about the development of Yeller, the Exception Tracker with Answers. Read more about Yeller here.

Yeller is an infrastructure company. Its mission is to provide available, low-latency exception tracking and analysis. It’s also run by a single person, and as such, I’ve put a lot of work into designing Yeller’s monitoring systems so that debugging production incidents is easier, I get alerted at the right times, and so on.

What To Measure?

The first part of a monitoring system (in my book) is deciding what to measure (and, just as importantly, what not to measure). Typically I prefer measuring rates of high-level positive events, and most of Yeller’s alerting is built around those kinds of metrics. Other metrics are also measured, but they are intended as debugging aids only, not as detection that something has gone wrong. Here are some of the things Yeller measures:

For Alerting

The single most important metric that Yeller uses for alerting is the overall throughput of the system. I have several “fake” projects running on a few different cloud hosting providers that send exceptions once every 10 seconds or so (to ensure that I actually do have a constant minimum throughput, no matter what customers’ apps are doing). Ensuring that the rate of exceptions tracked at the last stage (when writing into the database) stays above the threshold provided by these fake projects is the main metric that Yeller alerts on.
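The actual check runs as a Riemann stream (in Clojure), but the logic amounts to a floor check on the summed write rate. Here’s the idea sketched in Python - the project count, interval, and safety factor are illustrative, not Yeller’s real configuration:

```python
# Heartbeat-based throughput check: the fake projects each send an
# exception roughly every 10 seconds, so the pipeline should always
# write at least that combined rate (with some slack for jitter).
FAKE_PROJECTS = 3
SEND_INTERVAL_S = 10.0
SAFETY_FACTOR = 0.5  # tolerate some jitter before paging

MIN_WRITES_PER_S = (FAKE_PROJECTS / SEND_INTERVAL_S) * SAFETY_FACTOR

def throughput_ok(writes_per_second_by_host):
    """True if the cluster-wide write rate is above the heartbeat floor."""
    total = sum(writes_per_second_by_host.values())
    return total >= MIN_WRITES_PER_S
```

If `throughput_ok` ever returns false, something is wrong somewhere on the ingest path, regardless of what customers’ apps are doing - that’s the whole point of the synthetic traffic.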

Other metrics I alert on are more preemptive - for example, I want to know that my disks are filling up well before they’re actually full. Most of those metrics send emails at a warning threshold, then page me at a more critical one.
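A two-tier threshold like that is simple to express. A minimal sketch, using Python’s standard library and invented threshold values (the real checks live in Riemann):

```python
import shutil

def disk_used_fraction(path="/"):
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return (usage.total - usage.free) / usage.total

def alert_level(used_fraction, warn=0.75, page=0.90):
    """None while healthy, 'warn' (email) early, 'page' only when critical."""
    if used_fraction >= page:
        return "page"
    if used_fraction >= warn:
        return "warn"
    return None
```

The point of the gap between the two thresholds is time: the warning email arrives while there’s still room to act calmly, and the page only fires if that window was missed.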

For Debugging

I monitor many, many more metrics than I alert on (right now I track about 3000 metrics a second). Nearly all of those are for making debugging production incidents easier, not for alerting at all. Here are some of the things I track:

OS Level Visibility

All of Yeller’s servers report memory, network, CPU utilization, and free disk space. They emphatically do not report load average - that metric is highly confusing on modern multicore Linux machines, and measuring it doesn’t teach you all that much.

Application Level Metrics

Every unit of code Yeller runs in production is heavily instrumented. For example, the API handler tracks latency percentiles, throughput, and the rate of each HTTP status code returned.
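The shape of that per-handler instrumentation is small: latency samples for percentile reporting, plus a counter per status code. A self-contained Python sketch (the class and its names are mine, not Yeller’s actual code, which is Clojure on the JVM):

```python
import math
from collections import Counter

class HandlerStats:
    """Minimal per-handler instrumentation: latency samples for
    percentile reporting, plus a count per HTTP status code."""

    def __init__(self):
        self.latencies_ms = []
        self.status_counts = Counter()

    def record(self, latency_ms, status_code):
        self.latencies_ms.append(latency_ms)
        self.status_counts[status_code] += 1

    def percentile(self, p):
        """Nearest-rank percentile (p in 0..100) over recorded samples."""
        samples = sorted(self.latencies_ms)
        index = max(0, math.ceil(len(samples) * p / 100.0) - 1)
        return samples[index]
```

In production you’d report these periodically and reset, rather than accumulating forever - a reservoir or sliding window keeps memory bounded.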

Data Storage

Yeller uses Riak as its primary data store. I track all of the metrics that Riak exposes from its stats endpoints, and I track each JVM’s interaction with each Riak bucket.

The most important data storage metric (for Yeller) is 99th percentile modify latency per bucket (recorded by the client). Because of Amdahl’s Law, the latency of Yeller’s Riak modifications is (right now) the bottleneck on the overall throughput of the system - it’s the highest-latency operation on the data ingest path.
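The bottleneck arithmetic is straightforward: if every tracked exception needs one store modify on the ingest path, sustainable throughput is bounded by concurrency divided by modify latency. A rough illustration, with invented numbers:

```python
def max_throughput(concurrency, modify_latency_s):
    """Upper bound on events/s when every event needs one store modify,
    given `concurrency` writers each blocked for `modify_latency_s`."""
    return concurrency / modify_latency_s

# e.g. 8 concurrent writers against a 20ms worst-case modify latency:
# 8 / 0.020 = 400 events/s - which is why p99 modify latency is the
# number to watch, not the mean.
```

This is why tail latency per bucket is the metric worth the dashboard real estate: a regression there directly caps the whole system’s throughput.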

I also record metrics for the less critical data stores: Datomic, Kafka, and ZooKeeper.

Where does it all go?

The next part of designing a monitoring system is figuring out where the metrics go: who tracks them, who decides when to alert, who sends out alerts, and so on.

The brains: Riemann

All of the metrics mentioned above are delivered to Riemann. Riemann does stream processing using arbitrary Clojure code. As an example, the API handler metrics mentioned above let me calculate the percentage of HTTP status codes that are non-200, and alert on that, inside the Riemann process. Likewise, the rate of exceptions tracked is monitored by Riemann, summed across all hosts, and reported if it drops below a certain threshold.
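The non-200 check is a good example of why having arbitrary code in the stream processor matters - it’s a computed ratio across several metrics, not a threshold on any single one. The real version is a Riemann stream in Clojure; here’s the equivalent logic sketched in Python, with an invented threshold:

```python
def non_200_fraction(status_counts):
    """Fraction of responses whose HTTP status code is not 200."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    bad = sum(n for code, n in status_counts.items() if code != 200)
    return bad / total

def should_alert(status_counts, threshold=0.05):
    # Illustrative threshold; the real check lives in the Riemann config.
    return non_200_fraction(status_counts) > threshold
```

The same pattern - fold a window of events into a number, compare it to a threshold, emit an alert event - covers most of Yeller’s alerting rules.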

Riemann is pretty fast: Yeller’s instance tracks about 3k events/s right now, but I’ve benchmarked it (with my config) at about 30k events/s without any tuning whatsoever. I’m not worried about being able to tune it when I need to, either - it’s just Clojure code, and I have a modicum of experience tuning that.

Real time debugging: riemann-dash

During a production incident, you need up-to-date data. It’s no good waiting 5 minutes to see if your Nagios checks (for example) will turn green - computer systems can do a heck of a lot in 5 minutes. Riemann has a realtime dashboard designed for debugging production incidents (it’s *very* configurable, though maybe not the easiest thing to get started with).

One of the things I love most about riemann-dash is how much information density you can get out of it. I just checked, and my most dense dashboard presents 110 metrics on a single page. That can be hard to read at first, but once you’re used to the metrics it lets you debug very quickly - you can glance over the page and see what’s wrong.

Here’s an example of my least dense dashboard: the one that monitors Riemann itself (note that my Riemann TCP latency doesn’t matter so much - it’s all pushed from a background thread or two, and I push large batches (around 100 events at a time) into Riemann for throughput reasons):

Each of the systems in Yeller has a high-level dashboard via riemann-dash, and there are a whole heap of dashboards designed for debugging at a lower level (right now there are 14 dashboards in total, with 4 service-specific dashes and one that gives an overview of machine and process health).

Riemann also operates as a distributed alternative to top, df -h, and so on - I run small programs that report these measurements into Riemann, and put them all on a single dashboard. Then I can see these measurements across the cluster without having to ssh into individual machines, open tonnes of tmux windows, and so on.
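The reporters themselves are tiny. The real ones push events into Riemann via a client library; a stripped-down sketch of the collection side, using only the Python standard library (so just hostname and disk here - portable CPU and memory utilization need something beyond the stdlib):

```python
import shutil
import socket
import time

def collect_host_metrics(path="/"):
    """Gather the kind of per-host measurements that get reported into
    Riemann. Tagging each sample with the hostname is what lets one
    dashboard show the whole cluster side by side."""
    disk = shutil.disk_usage(path)
    return {
        "host": socket.gethostname(),
        "time": time.time(),
        "disk free bytes": disk.free,
        "disk total bytes": disk.total,
    }
```

Run one of these in a loop every second or so per machine, ship each dict as a Riemann event, and a single dashboard becomes a cluster-wide top.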

It’s hard to get this across from a static image, but these metrics update about once a second, so I can see exactly what the machines are doing, with the same sort of update frequency top and htop give you, but over the entire cluster.

The history: graphite

One thing Riemann expressly does not do is provide any kind of historical data storage. For that, I send metrics from Riemann directly to Graphite. I’ve used Graphite a tonne in the past, and it fits my storage needs well enough. The UI gets confusing/weird sometimes, but most of the time I don’t use much historical data when debugging problems - just riemann-dash.

The alerting: OpsGenie

OpsGenie is one of many commercial services that fill the following need:

I want to send an email and have it call my phone, very reliably

Right now, because Yeller is just me, I don’t use many OpsGenie features. Eventually I’ll need things like on-call rotation, but for now that doesn’t matter (and OpsGenie does most of that stuff anyway).

Conclusion

The real measure of a monitoring system is how it fares in production. Yeller’s been running in a production environment for 6 months now (since well before launch), and has been subject to many kinds of load and stress testing during that time. I’ve made some tweaks along the way, but overall I’ve been very happy with Riemann as the brain behind all of Yeller’s metrics. The soft real-time nature of Riemann’s monitoring vision is such a refreshing change from antiquated tools that poll every minute, or every 5 minutes.

If you liked this article, you’ll probably like my hosted Exception Analytics service.

Yeller is a different take on exception tracking, offering you more insight on the actual cause of your exceptions, with all the usual features of a traditional exception tracking service. Exception analytics is an important part of any production service’s monitoring stack, and Yeller helps you fix your production exceptions faster.
