In this series of posts I will talk about how we’re leveraging machine learning algorithms at Overseer Labs to help our customers perform faster root cause analysis. We’ve worked with Fortune 500 companies and demonstrated significant value.

In this first post I’d like to share a case study that we performed with one of our customers — Rainforest QA.

The Problem

It all started when we realized that companies cared a lot about uptime and reliability. Part of this problem was addressed by AWS, but bad code still gets shipped, and sometimes unexpected failures happen.

So if downtime is inevitable, the best we can do is address the problem as fast as possible, learn from the mistake, and try to prevent it from happening again.

To facilitate the debugging process, companies may expend a lot of energy instrumenting their systems with logs, metrics, exceptions, and other pieces of data that could be useful. They may use different monitoring products for different parts of their tech stack with each tool producing its own set of metrics. In total, the number of metrics collected can range from tens of thousands to billions, every minute.

While this is great, it can lead to data overload during an incident. Currently, SRE teams need to manually dig through all of this data and sift through an endless number of charts on a dashboard. This is grueling, painful work. But thinking about it abstractly, what they’re really doing is correlating data across different data sources. They’re hoping to gain insights and uncover clues that will help them find the root cause of the problem.

The Theory

Our theory is that with machine learning, we can automate some of this data correlation and assist the operators in finding those insights and clues. We do not claim to perform automated root cause analysis. Instead, we want to guide the operators and help them get to the answer faster (i.e. machine assistance).

And if we’re successful, we’d be able to save companies time, money, and customer annoyance.

Problem Formulation

Now that we’ve decided on the problem we want to solve, we need to formulate it as something that we can build.

The current machine learning approaches to this problem consist of alerting on anomalies detected on univariate time series data. While this is useful, individual metrics can be very noisy and thus trigger a lot of false positives. Furthermore, it’s not uncommon for there to be a lot of anomalies in the individual metrics despite the system being healthy. Thus alerting on these anomalies can lead to alert fatigue. And in the face of an incident, this approach may even confuse the operator by leading them down the wrong path.
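To make the false-positive problem concrete, here is a minimal sketch of per-metric alerting. It assumes a rolling z-score detector, which is just one common univariate approach and not necessarily what any particular monitoring product does. The metric names, window size, and threshold are all made up for illustration, and the data is synthetic noise standing in for a perfectly healthy system.

```python
import random
import statistics


def zscore_alerts(series, window=30, threshold=3.0):
    """Return indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev > 0 and abs(series[i] - mean) / stdev > threshold:
            alerts.append(i)
    return alerts


# Hypothetical "healthy" system: 200 metrics that are nothing but Gaussian noise.
rng = random.Random(0)
metrics = {
    f"metric_{k}": [rng.gauss(100, 5) for _ in range(500)]
    for k in range(200)
}

alerts_per_metric = {name: zscore_alerts(vals) for name, vals in metrics.items()}
total_alerts = sum(len(a) for a in alerts_per_metric.values())
noisy_metrics = sum(1 for a in alerts_per_metric.values() if a)

print(f"{total_alerts} alerts across {noisy_metrics} of {len(metrics)} healthy metrics")
```

Nothing is wrong in this synthetic system, yet alerts still fire simply because so many noisy metrics are being checked independently. Scale that up to thousands of real metrics and the alert fatigue problem becomes easy to picture.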

Before going any further, I want to introduce the concept of system health. The premise is that companies collect many metrics that reflect the state of the overall system. An individual metric may reflect the state of one aspect of the system, but by looking at all metrics as a unit, we can interpret that as the state of the entire system. While we may care about the health of one aspect of the system, we generally won’t care until the health of the entire system is degrading.

Thus, an anomaly in a given metric may indicate degraded health in one aspect of the system, yet still reflect an overall system state that we don’t need to worry about. However, if there were a way to analyze all the metrics as a unit, we could identify true system health degradation. And if system health is what we care about, I claim that this methodology will give us a more accurate proxy with fewer false positives.

So my thesis is that we want to analyze logical groups of metrics instead of individual metrics. This approach has a few benefits:

Eliminates the need to come up with a specific KPI to gauge the performance of the system (though having one is still a good idea)

Gives us a better summary of the overall system state

Enables us to interrogate the algorithm and understand the relative importance of each metric in a group for any point in time

And if used for alerting, this will reduce false positives and minimize the alert fatigue issue
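As a rough illustration of what analyzing a group as a unit could look like, the sketch below collapses a group of metrics into a single health score and then asks which metrics contributed most to it. To be clear, this is a deliberately simple stand-in (the root mean square of per-metric z-scores against a baseline) and not the algorithm we actually use. It also ignores correlations between metrics. The metric names and values are hypothetical.

```python
import math
import statistics


def group_score(baseline, current):
    """Score one group of metrics as a unit.

    baseline: {metric_name: [historical values]}
    current:  {metric_name: latest value}
    Returns (score, contributions), where contributions sum to 1.0 and
    show each metric's share of the total squared deviation.
    """
    squared = {}
    for name, history in baseline.items():
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history) or 1.0  # guard against flat metrics
        squared[name] = ((current[name] - mean) / stdev) ** 2
    total = sum(squared.values())
    score = math.sqrt(total / len(squared))
    contributions = {
        name: (value / total if total else 0.0)
        for name, value in squared.items()
    }
    return score, contributions


# Hypothetical metrics for a single app.
baseline = {
    "sqs.messages_sent": [100, 104, 96, 102, 98],
    "web.latency_ms": [200, 210, 190, 205, 195],
    "worker.jobs_done": [50, 52, 48, 51, 49],
}
healthy = {"sqs.messages_sent": 101, "web.latency_ms": 203, "worker.jobs_done": 50}
incident = {"sqs.messages_sent": 0, "web.latency_ms": 207, "worker.jobs_done": 12}

healthy_score, _ = group_score(baseline, healthy)
incident_score, contributions = group_score(baseline, incident)
top_metric = max(contributions, key=contributions.get)

print(f"healthy score:  {healthy_score:.2f}")
print(f"incident score: {incident_score:.2f}")
print(f"largest contributor during the incident: {top_metric}")
```

The single score is what you would alert on, and the contributions are what you would look at once the alert fires, which is the "interrogate the algorithm" idea from the list above.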

The Dataset

We were able to acquire a dataset of a real incident from Rainforest QA. They had several apps hosted on Heroku, but depended on AWS SQS for message passing. The specific incident here was an SQS outage. They had over 7,000 metrics (across their Heroku logs, CloudWatch metrics, and custom metrics), all stored in Librato. While they were alerting on too many messages being passed through SQS, they had not anticipated an outage, so they were not alerting on too few messages being passed through the service. Setting up an alert on each of the 7,000 metrics would not only require them to maintain all of those thresholds, but would ultimately lead to alert fatigue.

Thus, in the face of an incident, they were forced to dig through all their metrics before they uncovered the root cause. They had all the necessary data to diagnose the issue, but because there were so many metrics, it took them over 45 minutes to discover the problem.
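The alerting gap described above is easy to reproduce. The sketch below compares an upper-bound-only alert with a two-sided one on a made-up series of SQS message counts that drops to zero. The thresholds and values are hypothetical and not taken from Rainforest QA's actual configuration.

```python
def upper_only_alert(value, upper):
    """Fire only when the metric climbs above the upper threshold."""
    return value > upper


def band_alert(value, lower, upper):
    """Fire when the metric leaves the [lower, upper] band in either direction."""
    return value < lower or value > upper


# Hypothetical messages-per-minute readings; the last three simulate an outage.
messages_per_minute = [120, 115, 130, 0, 0, 0]
LOWER, UPPER = 20, 500

upper_only_fired = [i for i, v in enumerate(messages_per_minute)
                    if upper_only_alert(v, UPPER)]
band_fired = [i for i, v in enumerate(messages_per_minute)
              if band_alert(v, LOWER, UPPER)]

print("upper-only alert fired at:", upper_only_fired)
print("two-sided alert fired at: ", band_fired)
```

The two-sided alert catches the outage, but only because someone thought to add a lower bound for this particular metric. Doing that by hand for thousands of metrics is exactly the maintenance burden that makes per-metric thresholds impractical.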

So my task was to figure out if we could’ve done better. Here’s what we did.

Generating the metric groups

The first thing I did was sit down with the CTO to partition their 7,000+ metrics into logical groups that reflect different aspects of their system. He chose to partition the metrics into six groups, with each group representing a different app.

This step essentially gives me a specification of their system and enables me to understand which metrics should be analyzed as a group. In this case, the CTO is interested in understanding the health of each of their six apps.
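In our case the grouping was done by hand with the CTO, but the resulting specification is just a mapping from metrics to groups. A minimal sketch of that data structure follows, assuming (purely for illustration) that metric names carry a prefix identifying the app. The prefixes, group names, and metric names here are all hypothetical.

```python
from collections import defaultdict


def partition_metrics(metric_names, prefix_to_group, default="ungrouped"):
    """Assign each metric to the first group whose prefix matches its name.

    Metrics that match no prefix land in `default` so they can be
    reviewed by hand rather than silently dropped.
    """
    groups = defaultdict(list)
    for name in metric_names:
        group = next(
            (g for prefix, g in prefix_to_group.items() if name.startswith(prefix)),
            default,
        )
        groups[group].append(name)
    return dict(groups)


# Hypothetical metric names and app prefixes.
metric_names = [
    "web.request.latency",
    "web.request.count",
    "worker.jobs.done",
    "billing.invoices.sent",
    "sqs.queue.depth",
]
prefix_to_group = {
    "web.": "web_app",
    "worker.": "worker_app",
    "billing.": "billing_app",
}

groups = partition_metrics(metric_names, prefix_to_group)
for group, names in groups.items():
    print(f"{group}: {names}")
```

Keeping an explicit "ungrouped" bucket is a design choice: it surfaces metrics nobody has classified yet, which is useful when the grouping is meant to be a faithful specification of the system.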