Let’s jump straight to the conclusion: CloudWatch Anomaly Detection is awesome, but it’s an addition to, not a replacement for, ‘classic’ monitoring. In this post we’ll dive into proper and improper uses of Anomaly Detection.

What is anomaly detection?

There is a guideline in the AWS Well-Architected Framework: “Stop guessing your capacity needs”. What AWS means by this is that your infrastructure sizing should be data-driven. There are two ways to do this: manually or automatically.

In manual data-driven sizing, you deploy a database, check its performance in a real-world scenario, and then manually increase or decrease its size based on utilization.

In automatic data-driven sizing, a system dynamically adjusts the amount of resources based on demand. In AWS, EC2 Auto Scaling groups are probably the best-known example of automatic data-driven sizing.

Just as autoscaling lets you automatically size your resources to demand, Anomaly Detection lets you automatically adjust your monitoring to changing demand.

This is achieved by not defining your alarms with hard limits (do you see the analogy with guessing your capacity needs?), but by looking at historical data and patterns instead.

From the historical data, CloudWatch generates a ‘band’ around the current data. Anything within the band is considered normal, and anything outside the band is an anomaly.

This works with many types of patterns, such as a sine wave or an upward trend.
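The idea behind the band can be sketched in a few lines of plain Python. This is a deliberately simplified model (a rolling mean plus or minus a number of standard deviations); CloudWatch’s actual machine-learning model also accounts for trends and seasonality, so treat this as an illustration of the concept, not of the algorithm:

```python
import math

def anomaly_band(history, width=2.0):
    """Toy anomaly band: mean +/- width * stddev of the history.
    CloudWatch's real model is more sophisticated (trend/seasonality aware)."""
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    std = math.sqrt(variance)
    return mean - width * std, mean + width * std

def is_anomaly(value, history, width=2.0):
    lower, upper = anomaly_band(history, width)
    return not (lower <= value <= upper)

# A sine-like pattern hovering around 100 +/- 10, e.g. requests per minute
history = [100 + 10 * math.sin(i / 4) for i in range(96)]

print(is_anomaly(105, history))  # inside the band -> False
print(is_anomaly(200, history))  # far outside the band -> True
```

Widening the band (a larger `width`) means fewer values count as anomalies; CloudWatch exposes the same trade-off as the number of standard deviations on the band.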

This is a great feature! Additionally, it’s offered at an extremely low price (about $0.30 per Anomaly Detection alarm per month). This might lead you to think: “Let’s apply Anomaly Detection everywhere! I will never need to specify alarm thresholds again!” This blog will tell you why you shouldn’t.

Two types of metrics

There are two types of metrics:

Hard metrics (e.g. CPU load at 100%)

Soft metrics (e.g. 200,000 requests per minute)

For the first type it’s easy to determine absolute ‘good’ and ‘bad’ values. For example, a CPU load below 70% is good, 75% might be a first warning, and 90% might be a critical warning.

For the second type, defining thresholds becomes a lot harder. Is 200,000 requests per minute normal? Or is it an attack, or the result of a successful marketing campaign? Additionally, what is normal now might be completely different in a year. This is where Anomaly Detection comes in.

Let’s say 200,000 requests per minute is normal behaviour. Because of the increasing popularity of your website, the average requests per minute grow by 10,000 per month, so in a year your website will be handling 320,000 requests per minute.

If you had set an alarm threshold at 250,000 requests per minute, the alarm would go off in about five months. With CloudWatch Anomaly Detection, however, this is considered natural growth.
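The arithmetic behind that false alarm is simple enough to sketch:

```python
# Traffic grows linearly from a 200,000 req/min baseline by 10,000 req/min
# per month. A static threshold of 250,000 req/min will eventually fire on
# perfectly healthy traffic.
baseline = 200_000
growth_per_month = 10_000
static_threshold = 250_000

months_until_false_alarm = (static_threshold - baseline) / growth_per_month
print(months_until_false_alarm)  # 5.0 -> the alarm fires in about five months

after_a_year = baseline + 12 * growth_per_month
print(after_a_year)  # 320000
```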

Anomaly detection needs patterns

Anomaly Detection is strongly dependent on predictable patterns: sine waves based on office hours, peaks every weekday at 17:00, or maybe higher loads during the holidays.

Anomaly Detection becomes very difficult when there is no clear baseline. Take this blog, for example; it might average 1,000 visitors per day, but that number is a lot higher when a blog post is published and shared on social media.

Then again, publishing a post that goes viral might be considered an anomaly… The question is what you do when an anomaly occurs.

Relevancy of alerts

For both metrics with static thresholds and metrics based on anomaly detection, you can define alerts.

For static thresholds, this might be ‘with a sample every minute, when 3 out of 5 consecutive samples exceed threshold X, send an alert’.

For Anomaly Detection thresholds, it might be ‘with a sample every minute, when 3 out of 5 consecutive samples are above or below the band, send an alert’.
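In CloudWatch terms, ‘3 out of 5 samples outside the band’ maps to the `DatapointsToAlarm` and `EvaluationPeriods` parameters of `put_metric_alarm`, with `ThresholdMetricId` pointing at an `ANOMALY_DETECTION_BAND` metric math expression. A sketch of the parameters, assuming an Application Load Balancer request metric (the metric names are illustrative, and the actual call requires AWS credentials):

```python
# Anomaly-based alarm: 3 out of 5 one-minute samples outside the band.
# Pass the dict to boto3.client("cloudwatch").put_metric_alarm(**alarm)
# to actually create it (a real ALB metric also needs a LoadBalancer
# dimension; omitted here for brevity).
alarm = {
    "AlarmName": "requests-outside-band",
    "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,
    "ThresholdMetricId": "band",
    "Metrics": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "RequestCount",
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            # 2 = number of standard deviations; wider band, fewer alerts.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "ReturnData": True,
        },
    ],
}
print(alarm["DatapointsToAlarm"], "out of", alarm["EvaluationPeriods"])
```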

But to quote Jurassic Park:

Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.

From an operational point of view, you probably don’t want to send an alert to your engineers when the number of visitors is above the band. Likewise, you probably don’t want to receive an alert when the number of database queries is out of the ordinary. Instead, you want to monitor whether the database and the web servers are holding up, and alert on that.

From a marketing perspective, however, it might be very relevant to know the number of visitors is peaking.

This leads us to the question you should continuously ask yourself when implementing a monitoring system: which thresholds are relevant, and who should know when they are breached?

Define your KPIs

To know what your thresholds should be, you should know your technical and business key performance indicators (KPIs). A few examples of KPIs:

Average web server latency

Maximum database CPU load

Maximum web server memory usage

Elasticsearch availability

Number of visitors on a Saturday

The thresholds should then match the KPIs:

Average web server latency cannot be above 200ms for 5 consecutive minutes.

Maximum database CPU load cannot be above 80% for 10 consecutive minutes.

Maximum web server memory usage cannot be above 70% for 5 consecutive minutes.

The number of Elasticsearch ‘red status’ alarms should always be 0.

Number of visitors on a Saturday should be roughly equal to that of other Saturdays.

What should jump out from this list is that the first four thresholds are absolute values, and as such don’t fit Anomaly Detection very well. The fifth example does fit Anomaly Detection, but it’s a business KPI, not a technical one.
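Those absolute KPIs translate directly into classic static-threshold alarms. A sketch for the latency KPI, assuming an Application Load Balancer (the namespace and metric name are illustrative; a real call also needs a LoadBalancer dimension and AWS credentials):

```python
# Static-threshold alarm for the latency KPI: average latency above 200 ms
# for 5 consecutive one-minute periods. Pass the dict to
# boto3.client("cloudwatch").put_metric_alarm(**alarm) to create it.
alarm = {
    "AlarmName": "web-latency-above-200ms",
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "TargetResponseTime",
    "Statistic": "Average",
    "Period": 60,
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 5,  # 5 consecutive breaches, per the KPI
    "Threshold": 0.2,  # TargetResponseTime is reported in seconds
    "ComparisonOperator": "GreaterThanThreshold",
}
print(alarm["AlarmName"])
```

Note there is no band and no history involved: the KPI itself defines the threshold, which is exactly why these alarms don’t need Anomaly Detection.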

Anomaly Detection generally belongs on a dashboard

The previous sections make clear that Anomaly Detection is undoubtedly valuable for the business side of your infrastructure. But you might wonder what the technical benefit for us engineers is.

The answer is pretty straightforward: the value of Anomaly Detection for engineers is insight.

Start with all the hard metrics for your database, your load balancer, your web servers and your Elasticsearch cluster. Make sure you have alerting in place that sends an email, push message, or pager call to your standby engineers. This might wake them up in the middle of the night.

Now they’re all groggy-eyed at 3am, looking at a laptop screen that’s way too bright, trying to figure out why the website is down. Without Anomaly Detection, they would have to scroll and click through every service, looking for anything out of the ordinary that might be a root cause. This is a hassle in any environment.

Now imagine having a dashboard that gives you the following insights:

The number of requests coming from China is way out of the ordinary.

The number of requests entering the load balancer doesn’t match any normal pattern.

The number of instances in your autoscaling group is at its maximum.

Even your sleep-drunk 3am on-call self will quickly figure out that this is probably a DDoS attack. You update your WAF to block requests coming from China, see all metrics go back to normal, and it’s back to bed.
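A dashboard like that is just metrics plotted next to their bands. A sketch of one widget in the `put_dashboard` body format, again with illustrative metric names (a real ALB metric would also carry a LoadBalancer dimension):

```python
import json

# One dashboard widget plotting a request metric together with its anomaly
# band. Pass the JSON string to boto3.client("cloudwatch").put_dashboard(
# DashboardName="ops-overview", DashboardBody=body) to create it.
widget = {
    "type": "metric",
    "properties": {
        "title": "Requests vs. expected band",
        "region": "eu-west-1",
        "metrics": [
            [{"expression": "ANOMALY_DETECTION_BAND(m1, 2)", "id": "band"}],
            ["AWS/ApplicationELB", "RequestCount", {"id": "m1", "stat": "Sum"}],
        ],
    },
}
body = json.dumps({"widgets": [widget]})
print("ANOMALY_DETECTION_BAND" in body)  # True
```

Because the band is rendered on the graph itself, the on-call engineer sees at a glance which metrics have left their normal range, without any of these widgets ever paging anyone.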

Conclusion

CloudWatch Anomaly Detection is a very powerful tool to:

Gain insights into business performance

Quickly drill down to the root cause of technical issues

CloudWatch Anomaly Detection is also very affordable and accessible, especially compared to building a solution like this yourself.

However, it’s not a replacement for existing monitoring. Instead, it is an addition that greatly enhances the monitoring system you already have.