

When Rob Ewaschuk – a former SRE at Google – jotted down his philosophy on alerting, it resonated with us almost immediately. We had been trying to figure out our alerting strategy around our then relatively new Service-Oriented Architecture – the term microservices hadn’t quite entered the zeitgeist at the time.

It’s not that we didn’t we didn’t have any alerting. In fact, we had too many – running the gamut from system alerts like high cpu, low memory to health check alerts. However, these weren’t doing the job for us. In a system that is properly load balanced, a single node having high cpu does not necessarily mean the customer is impacted. More so, in an SOA architecture, a single bad node in one service is extremely unlikely to result in a customer-impacting issue. It’s no surprise then that with all the alerting we had, we still ended up having multiple customer-impacting issues that were either detected too late or – even worse – by customer support calls.

Rob’s post hit the nail on the head with his differentiation of “symptom-based monitoring” vs “cause-based monitoring”:

I call this “symptom-based monitoring,” in contrast to “cause-based monitoring”. Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you’re cringing already, in love with your Nagios rules for MySQL servers? Your users don’t even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.

It was obvious to us that we had to change course and focus on the symptoms rather than the causes. We started by looking at what tools we had at our disposal to get symptom-based monitoring up and running as soon as possible. At the time, we were using Nimbus for alerting, Open TSD for time series data and then we had Splunk. Splunk is an industry leader for aggregating machine data – typically log files – and deriving business and operational intelligence from that data. We had always used Splunk for business analytics and for searching within logs while investigating production issues but we had never effectively used Splunk for alerting us to those issues in the first place. For a symptom-based monitoring tool, Splunk now stood out as an obvious candidate for the following reasons:

Since Splunk aggregates logs from multiple nodes, it is possible to get a sense of the scale and scope of the issue.

It also allowed us to set up alerting based on our existing logs without requiring code changes. Though, over time, based on what we learnt, we did enhance our logging to enable additional alerts.

Since the objective was to alert on issues that impact the user, we started by identifying user flows that were of most importance to us, e.g., add to cart, place order, and add a payment method. For each flow, we then identified possible pain points like errors, latency and timeouts, and defined appropriate thresholds. Rob talks about alerting from the spout, indicating that the best place to set up alerts is from the client’s perspective in a client server architecture. For us, that was the front end web service and the API layer that our mobile apps talk to. We set up most of our symptom-based alerts in those layers.

When our symptom-based alerts first went live, we used a brand-spanking new technology called email – we simply sent these alerts out to a wide distribution of engineering teams. Noisy alerts had to be quickly fine-tuned and fixed since there is nothing worse than your alerts being considered as spam. Email worked surprisingly well for us as a first step. Engineers would respond to alerts and either investigate it themselves or escalate to other teams for resolution. It also had an unintentional benefit because there was greater visibility among different teams about the problems in the system. But alerts by email only goes so far – they don’t do well when issues occur outside of business hours they are easy to miss amidst the deluge that can hit an inbox, and there is no reliable tracking.

We decided to use PagerDuty as our incident management platform. Setting up on-call schedules and escalation policies in PagerDuty was a breeze and our engineers took to it right away – rather unexpected for something meant to wake you up in the middle of the night. Going to email allowed us to punt on a pesky conundrum – in a service oriented architecture, who do you page? But we now need to solve that problem. For some issues, we can use the error code in the alert to determine which service team has to be paged. But other symptom based alerts – for example latency in add to cart – could be caused by any one of the services participating in that flow. We ended up with somewhat of a compromise: For each user flow, we identified a primary team and a secondary team based on which of the services had the most work in that flow. For example, for the add to cart flow, the Cart Service could have been primary and the Inventory Service might be secondary. In PagerDuty, we then set up escalation policies that looked like this:

Another key guideline – nay, rule that Rob calls out – is that pages must be actionable. An issue we’ve occasionally had is that we get a small spike of errors that is enough to trigger an alert but doesn’t continue to occur. These issues need to be tracked and looked into, but they don’t need the urgency of a page. This is another instance where we haven’t really found the best solution, but we found something that works for us. In Splunk, we set the trigger condition based on the rate of errors:

The custom condition in the alert is set to:

stats count by date_minute|stats count|search count>=5

The “stats count by date_minute” tabulates the count of errors for each minute. The next “stats count” counts the number of rows in the previous table. And finally, since we’re looking at a 5 minute span, we trigger the alert when the number of rows is 5 implying that there was at least one error in each minute. This obviously does not work well for all use cases. If you know of other ways to determine if an error is continuing, do let us know in the comments.

This is just the beginning and we’re continuing to evolve our strategies based on what we learn in production. We still have work to do around improving tracking and accountability of our alerts. Being able to quickly detect the root cause once an alert fires is also something we need to get better at. Overall, our shift in focus to symptom-based alerting has paid dividends and has allowed us to detect issues and react faster, making the site more stable and providing a better experience for our fans. Doing this while ensuring that our developers don’t get woken up by noisy alerts also makes for happier developers.