CloudWatch is the most undervalued service on AWS. It’s like an empty control room. All data is there, but no one is looking at it.

Together with IAM and VPC, CloudWatch provides the basis for modern infrastructure. CloudWatch combines an extensive set of functionality that could also be divided into three dedicated services: Metrics, Logging, and Events. Let me explain why you should take CloudWatch more seriously and make use of your control room.

Metrics

A metric represents a time series such as CPU utilization, network usage, or AWS costs. A metric stores numeric data together with a timestamp. Most AWS services report data to CloudWatch, where it is aggregated by the minute and persisted. You can retrieve the minute-by-minute data, or you can retrieve statistics such as a 10-minute sum, a 1-day average, or a 1-hour 99th percentile.
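To make the statistics mentioned above concrete, here is a minimal sketch that prepares requests for the GetMetricStatistics API using boto3. The helper names, the instance ID, and the 24-hour window are assumptions for illustration only. Note that percentiles are requested through ExtendedStatistics, and one call accepts either Statistics or ExtendedStatistics, not both.

```python
from datetime import datetime, timedelta, timezone


def build_statistics_request(namespace, metric_name, dimensions, period_seconds,
                             statistic, hours_back=24, now=None):
    """Build keyword arguments for CloudWatch's GetMetricStatistics API."""
    end = now or datetime.now(timezone.utc)
    request = {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end - timedelta(hours=hours_back),
        "EndTime": end,
        "Period": period_seconds,  # length of each aggregation window in seconds
    }
    # Percentiles such as "p99" belong in ExtendedStatistics; the classic
    # statistics (Sum, Average, ...) belong in Statistics.
    if statistic.startswith("p"):
        request["ExtendedStatistics"] = [statistic]
    else:
        request["Statistics"] = [statistic]
    return request


def fetch_datapoints(request):
    """Send the request to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    response = boto3.client("cloudwatch").get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Hypothetical instance ID, used only for illustration.
INSTANCE = {"InstanceId": "i-0123456789abcdef0"}

# A 10-minute sum and a 1-hour 99th percentile, as described in the text.
ten_minute_sum = build_statistics_request("AWS/EC2", "NetworkIn", INSTANCE, 600, "Sum")
hourly_p99 = build_statistics_request("AWS/EC2", "CPUUtilization", INSTANCE, 3600, "p99")
```

Passing either request to fetch_datapoints(...) would return the data points sorted by time, provided credentials and a region are configured.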

The CloudWatch Management Console provides a graphical way to represent metrics in charts. The following figure shows such a chart.

Besides the many AWS services that send data to CloudWatch, you can also send your own data, which is stored in so-called custom metrics. A custom metric is similar to the metrics provided by AWS; the only difference is that you send the data yourself (e.g. using an SDK or the CLI).
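As a rough sketch of sending a custom metric with an SDK, the following uses boto3's put_metric_data call. The namespace "MyApp", the metric name "QueueLength", and the "Queue" dimension are made-up names for illustration, not anything CloudWatch prescribes.

```python
from datetime import datetime, timezone


def build_custom_metric(namespace, metric_name, value, unit="Count",
                        dimensions=None, timestamp=None):
    """Build keyword arguments for CloudWatch's PutMetricData API."""
    datum = {
        "MetricName": metric_name,
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(request):
    """Send the data point to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_data(**request)


# Hypothetical application metric: the current length of an "orders" queue.
request = build_custom_metric("MyApp", "QueueLength", 42, dimensions={"Queue": "orders"})
```

Calling publish(request) once a minute from the application would produce a time series that behaves like any AWS-provided metric.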

For the first 15 days, CloudWatch keeps the minute-by-minute data. For the next 48 days, it keeps a resolution of 5 minutes. For the following 392 days, it keeps a resolution of 1 hour. After that (455 days in total), the data is deleted.
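The retention schedule can be expressed as a small lookup, which is handy when deciding which period to request for older data. This is only a sketch of the numbers stated above; the function name and the handling of exact boundary ages are my own simplifications.

```python
from datetime import timedelta

# (maximum age of the data, finest stored resolution in seconds),
# taken from the retention schedule described above.
RETENTION_TIERS = [
    (timedelta(days=15), 60),               # minute-by-minute data
    (timedelta(days=15 + 48), 300),         # 5-minute resolution
    (timedelta(days=15 + 48 + 392), 3600),  # 1-hour resolution
]


def finest_resolution(age):
    """Return the finest period in seconds still stored for data of the
    given age, or None if the data has already been deleted."""
    for max_age, resolution in RETENTION_TIERS:
        if age <= max_age:
            return resolution
    return None
```

For example, finest_resolution(timedelta(days=30)) returns 300, because data that is 30 days old is only available in 5-minute steps.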

Available statistics are:

SampleCount: Number of data points (actual value does not matter)

Average

Sum

Minimum / Maximum

Percentile (values between p0.0 and p100): p0.0 corresponds to the Minimum, p50 to the median, and p100 to the Maximum
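To illustrate how the statistics listed above relate to each other, here is a small self-contained sketch that computes them for a handful of sample values. The percentile uses the simple nearest-rank method, which is an assumption for illustration; CloudWatch's own percentile calculation may differ in detail.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for 0 <= p <= 100 (illustrative method only)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(values):
    """Compute the statistics named in the list above for one period."""
    return {
        "SampleCount": len(values),  # how many data points, regardless of value
        "Sum": sum(values),
        "Average": sum(values) / len(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "p50": percentile(values, 50),
    }


# Five hypothetical CPU utilization samples within one period.
cpu_samples = [12.0, 35.0, 20.0, 80.0, 18.0]
stats = summarize(cpu_samples)
```

With these samples, the p0 and p100 percentiles coincide with the Minimum (12.0) and the Maximum (80.0), and p50 is the median (20.0).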



Looking at charts can be helpful, but you may also want to automate this process.

Alarm

A CloudWatch Alarm observes a metric. As soon as the metric (or a statistic of the metric) crosses a threshold, the alarm triggers an action. One popular action is to send a message to an SNS topic. You can subscribe to the topic via email to get notified if an alarm is triggered. You can also trigger a scale-up action to react automatically to capacity shortages or execute more sophisticated logic in a Lambda function.
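An alarm of this kind might be defined with boto3's put_metric_alarm call, roughly as sketched below. The alarm name, the instance ID, the SNS topic ARN, and the 80% threshold are placeholders chosen for illustration.

```python
def build_cpu_alarm(instance_id, sns_topic_arn, threshold_percent=80.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",        # the statistic of the metric that is observed
        "Period": 300,                 # evaluate 5-minute averages
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notified when the alarm fires
    }


def create_alarm(alarm):
    """Create or update the alarm (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**alarm)


# Hypothetical instance ID and SNS topic ARN.
alarm = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
```

Subscribing an email address to the SNS topic would then deliver a notification whenever create_alarm(alarm) has been applied and the average CPU utilization exceeds the threshold.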

A basic alarm is shown in the following figure.

When defining an alarm, you can also set more sophisticated rules than just a threshold. For example, you can specify that the threshold must be crossed multiple times in a row, and how missing data should be interpreted. Imagine a machine that sends a custom metric. When this machine breaks, the metric is no longer published, which should be treated as an error. On the other hand, you may only publish a metric when something happens, in which case no data means 0.
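The two missing-data scenarios map onto the TreatMissingData parameter of put_metric_alarm, as the following sketch suggests. The function and the labels "heartbeat" and "event_count" are hypothetical names of my own. Also note that "notBreaching" only approximates "no data means 0" as long as a value of 0 would not cross the threshold.

```python
def evaluation_settings(metric_kind, periods_in_a_row=3):
    """Return the alarm parameters that control repetition and missing data.

    metric_kind is a hypothetical label:
      "heartbeat"   - a machine reports regularly; silence means it is broken
      "event_count" - data is only published when something happens
    """
    treat_missing = {
        "heartbeat": "breaching",        # missing data counts as crossing the threshold
        "event_count": "notBreaching",   # missing data counts as being within the threshold
    }[metric_kind]
    return {
        # The threshold must be crossed this many periods in a row.
        "EvaluationPeriods": periods_in_a_row,
        "DatapointsToAlarm": periods_in_a_row,
        "TreatMissingData": treat_missing,
    }


heartbeat_settings = evaluation_settings("heartbeat")
event_settings = evaluation_settings("event_count", periods_in_a_row=1)
```

These dictionaries are meant to be merged into the other put_metric_alarm arguments (alarm name, metric, threshold, and so on).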

Back to visuals. Humans are good at finding patterns in data. Let’s explore better ways to visualize metrics.

Dashboard

CloudWatch stores a large number of metrics, but only a few of them matter to you. Why not keep the most important ones in one place? That place can be shared across your team, which gives everyone more visibility into the running infrastructure and is a real motivation to feel responsible for it. A CloudWatch Dashboard is a board with 24x24 tiles that you can fully configure to display CloudWatch metrics. You can display the latest value of a metric, a simple line graph of one or more metrics, or a stacked area graph of multiple metrics. All metrics display the same time range. The following figure shows one of my dashboards.
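A dashboard can also be defined in code instead of being clicked together in the console. The sketch below builds the JSON body for boto3's put_dashboard call with a line graph and a single-value widget. The widget titles, the instance ID, the region, and the helper names are assumptions for illustration.

```python
import json


def metric_widget(title, metrics, x, y, width=12, height=6,
                  view="timeSeries", stacked=False, region="us-east-1"):
    """Describe one metric widget placed at grid position (x, y)."""
    properties = {
        "title": title,
        "metrics": metrics,  # each entry: [namespace, metric, dim name, dim value]
        "view": view,        # "timeSeries" for graphs, "singleValue" for the latest value
        "stat": "Average",
        "period": 300,
        "region": region,
    }
    if view == "timeSeries":
        properties["stacked"] = stacked  # True renders a stacked area graph
    return {"type": "metric", "x": x, "y": y, "width": width,
            "height": height, "properties": properties}


def build_dashboard_body(widgets):
    """Serialize the widgets into the JSON string CloudWatch expects."""
    return json.dumps({"widgets": widgets})


def publish_dashboard(name, body):
    """Create or update the dashboard (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)


# Hypothetical instance ID, used only for illustration.
cpu = ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]
body = build_dashboard_body([
    metric_widget("CPU over time", [cpu], x=0, y=0),
    metric_widget("CPU right now", [cpu], x=12, y=0, view="singleValue"),
])
```

Keeping the dashboard definition in version control makes it easy to share the same view across a team.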