What happens when cloud-based application infrastructure slows down?

Twelve years ago, I attended a meetup at the San Francisco Perl Mongers group where an engineer from Amazon introduced the Elastic Compute Cloud service (EC2). At the time, anyone involved with computing infrastructure either racked their own servers or used managed hosting where someone else took care of the pipes, power, and ping. Either way, you had a high degree of observability and flexibility for infrastructure troubleshooting — but at the cost of manual operation and systems administration toil.

The AWS EC2 service offered a level of flexibility that was previously unheard of. EC2 exposes many of the same opportunities for system observability as bare-metal servers do. It also supports rapid, dynamic provisioning of infrastructure, which lets operators deploy resources in minutes or hours instead of weeks or months.

Unfortunately, this came with a new set of challenges. While AWS performance engineers can use tools such as eBPF to performance-tune the base images, the output of those tools is not available to end users. As a result, esoteric system-level metrics, such as syscall latency, high-resolution block I/O latency, and other high-frequency, low-level indicators, are not available in EC2 or, for that matter, in other AWS-hosted services.

All AWS infrastructure offerings provide service-level metrics through the AWS CloudWatch monitoring API. This telemetry interface exposes five-minute-average service metrics, such as request latency in the Elastic Load Balancer (ELB) service. Note that CloudWatch offers a one-minute average in the Detailed Monitoring offering. However, these metrics are aggregates provided by AWS, and the raw data that generates them is not available to inspect.

Today, AWS application infrastructure is generally composed of a mix of different service offerings in addition to EC2, such as ELB load balancers, DynamoDB key/value stores, and the well-known Simple Storage Service (S3). When an application composed of these services is not performing well, identifying which services are responsible becomes the problem to solve.

Latency, She Wrote

At Circonus, many of our customers use our CloudWatch integration to monitor their AWS infrastructure. For services such as ELB and the Relational Database Service (RDS), the API provides a number of metrics, latency being one. Although these metrics are only one-minute averages, they offer some useful introspection into system performance.

The ELB CloudWatch integration provides dozens of metrics, including latency dimensions.

Diagnostics

When diagnosing systems exhibiting anomalous behavior, it nearly always pays dividends to approach the problem in a methodical manner.

Let’s walk through a recent real-world case that we encountered. One part of our AWS infrastructure showed an increase in average latency. In our setup, ELB load balancers send requests to EC2 instances running various web services. This is a common AWS implementation pattern. We capture latency telemetry from the ELBs as well as system-level metrics, such as CPU usage and disk I/O, from the EC2 instances using the Circonus Agent.

ELB request latency averages graphed in seconds, clearly visible across all services

What methodology is optimal for determining the source of these latencies? We applied the USE method, which checks each resource for utilization, saturation, and errors. First, we examined the CPU usage for the EC2 hosts servicing the ELB requests and then the CPU idle state for each host.

EC2 host CPU idle metrics — note the decreased idle values where latency was observed

Each of the eight hosts showed CPU idle values upwards of 90% except for when increased latency was observed. The CPUs were busy but not fully utilized. What was the CPU doing during those periods? We examined the user, system, and iowait CPU metrics and found that the CPUs were busy handling I/O.

EC2 host CPU iowait metrics — iowait peaked during observed service latencies
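
For readers who want to reproduce this breakdown by hand, the sketch below shows one way to derive idle and iowait percentages from two samples of the aggregate "cpu" line in /proc/stat on Linux. It is an illustration only, not a description of how the Circonus Agent collects these metrics, and the sample values are invented.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(CPU_FIELDS, (int(v) for v in fields[1:9])))


def cpu_percentages(before, after):
    """Return the share of CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples taken a few seconds apart (values invented for illustration).
sample_1 = "cpu 1000 0 500 8000 400 0 100 0 0 0"
sample_2 = "cpu 1100 0 560 8500 720 0 120 0 0 0"

usage = cpu_percentages(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
print(f"idle={usage['idle']:.1f}% iowait={usage['iowait']:.1f}%")
```

A host that is "busy but not fully utilized" in this sense shows a drop in idle time that is matched mostly by a rise in iowait rather than in user or system time.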

No single root cause

It’s rare to have a single root cause. Often, a series of factors add up to an incident, and our case was no exception. During the initial investigation of the log files on the hosts with a high iowait, we discovered that the application Apache/mod_wsgi processes had reached the MaxClients limit. We also discovered that those processes were waiting on an expensive backend process to send and decode huge 36MB JSON payloads, so no capacity remained to handle quicker and smaller processes. The library parsing the JSON was out of date. While we had a newer one that was 700x faster, we had not deployed it yet. To provide immediate relief, we increased the number of processes and began deploying the new JSON code.

As so often happens, removing one bottleneck reveals another. When we looked at the disk reads, we saw one host capping out at about 2,000 IOPS and 106MB/s, neither of which should have been a limit. Based on the size of the attached volumes, we believed AWS could handle 7,200 IOPS.

Host disk read rates overlaid with ELB latency — note the disk read-rate ceilings

Overlaying the observed ELB latencies from CloudWatch with the disk read rates from the Circonus Agent on the EC2 hosts showed that the disk read rates hit a ceiling at the same time the increased request latencies were observed. We clearly demonstrated this by adding a guideline to the graph at 106MB/s on the disk reads axis.

Host disk read rates with 106MB/s guideline

It became clear that at least one node maxed out before it should have. Much of this was because huge 36MB payloads of JSON had to be fetched. The CloudWatch monitoring infrastructure also provides a view of disk read usage, but the ceiling is less pronounced there because each point is a five-minute average. This phenomenon is known as “spike erosion.”

Host disk read rates in CloudWatch with five-minute averages
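
Spike erosion is easy to reproduce with a few numbers. The sketch below averages a series of invented one-minute read rates into five-minute windows; the values and the helper name window_averages are assumptions made for illustration, not data from the incident.

```python
from statistics import mean


def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Invented one-minute disk read rates in MB/s: two minutes pinned at the ceiling.
one_minute_rates = [20, 20, 106, 106, 20, 20, 20, 20, 20, 20]

five_minute_rates = window_averages(one_minute_rates, window=5)

print("peak at one-minute resolution: ", max(one_minute_rates))
print("peak at five-minute resolution:", max(five_minute_rates))
```

In this example, the two minutes spent at 106 MB/s shrink to a 54.4 MB/s point once averaged, so the ceiling that is obvious in the higher-resolution data no longer looks like a ceiling at all.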

Artificial limitations

Putting two and two together, one of our SREs realized:

The disk I/O we were paying for was artificially limited because of the size of the host we were running on.

EC2 instances can be configured with varying options for CPU, memory, and disk throughput. How do you size an EC2 instance given the plethora of combinations? The initial selection is usually done through experience and a tendency to err on the side of overprovisioning. Amazon offers Tips for Right Sizing to give general guidance, which includes using operational instance metrics to determine if instances are underprovisioned.

Once we identified the constraint, removing it required the following procedure:

Update the kernel

Install the ENA kernel network module, which is required for instance upgrade

Stop the instance and change to the new type (m5.2xlarge, up to 437.5 MB/s of throughput)

Restart it

Reinstall ZFS

| Instance type | EBS-optimized by default | Maximum bandwidth (Mbps) | Maximum throughput (MB/s, 128 KB I/O) | Maximum IOPS (16 KB I/O) | Note |
|---|---|---|---|---|---|
| i3.xlarge | Yes | 850 | 106.25 | 6,000 | |
| m5.2xlarge | Yes | 3,500 | 437.5 | 18,750 | Max burst |
| m5.2xlarge | No | 1,700 | 212.5 | 12,000 | Baseline |
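
The figures in the table line up with what we observed. As a rough check, the sketch below assumes that the bandwidth column is in megabits per second (the throughput values are exactly one eighth of it) and that units are decimal; it converts bandwidth to throughput and estimates the average read size implied by the observed ceiling of about 2,000 IOPS at 106MB/s.

```python
def mbps_to_mbytes_per_s(megabits_per_s):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_s / 8


# Bandwidth figures from the table, assumed to be in megabits per second.
ebs_bandwidth_mbps = {
    "i3.xlarge": 850,
    "m5.2xlarge (max burst)": 3500,
    "m5.2xlarge (baseline)": 1700,
}

throughput_mb_s = {
    name: mbps_to_mbytes_per_s(mbps) for name, mbps in ebs_bandwidth_mbps.items()
}

# Observed ceiling on the saturated host, as reported in the text.
observed_mb_s = 106
observed_iops = 2000

# Average size of each read implied by those two numbers, in KB (decimal units).
average_read_kb = observed_mb_s * 1000 / observed_iops

for name, mb_s in throughput_mb_s.items():
    print(f"{name}: {mb_s} MB/s")
print(f"implied average read size: {average_read_kb} KB")
```

Under these assumptions, the i3.xlarge limit works out to 106.25 MB/s, which matches the observed ceiling, while 2,000 IOPS is well below the instance's IOPS limit. That suggests the reads were large and the host ran out of throughput rather than IOPS.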

Using math to prevent a repeat incident

Once we understood the limitations of the workloads being applied, we set out to apply math to provide a more structured analysis of these metrics. Using the Circonus Analytics Query Language (CAQL), we created the following expression to calculate the number of nodes near disk bandwidth capacity. This expression takes the disk read rate from each host, evaluates whether it is greater than 100,000,000 bytes per second (about 95.4 MiB/s), and adds up the number of hosts that exceed it.

find:derivative('disk`nvme1n1`nread') | each:gt(100000000) | op:sum()
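
For readers unfamiliar with CAQL, the same three steps (rate of change, threshold test, sum) can be sketched in plain Python. This mirrors the description above rather than CAQL's exact semantics; the host names, counter values, and function names are hypothetical.

```python
THRESHOLD_BYTES_PER_S = 100_000_000  # same threshold as in the CAQL expression


def read_rate(previous_bytes, current_bytes, interval_s):
    """Rate of change of a cumulative bytes-read counter, in bytes per second."""
    return (current_bytes - previous_bytes) / interval_s


def hosts_near_capacity(counters, interval_s, threshold=THRESHOLD_BYTES_PER_S):
    """Count hosts whose disk read rate exceeds the threshold."""
    return sum(
        1
        for previous, current in counters.values()
        if read_rate(previous, current, interval_s) > threshold
    )


# Hypothetical (previous, current) counter readings taken 60 seconds apart.
counters = {
    "host-a": (0, 6_300_000_000),  # 105 MB/s
    "host-b": (0, 1_200_000_000),  # 20 MB/s
    "host-c": (0, 6_060_000_000),  # 101 MB/s
}

saturated = hosts_near_capacity(counters, interval_s=60)
threshold_mib_s = THRESHOLD_BYTES_PER_S / 2**20

print(f"{saturated} hosts above {threshold_mib_s:.1f} MiB/s")
```

The threshold sits slightly below the 106MB/s ceiling on purpose, so a host is counted as it approaches saturation rather than only once it is pinned at the limit.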

This metric is overlaid with the latency values from the ELBs, expressed as quartiles. We created a histogram from the set of latency values and calculated a value for each quartile (quartiles are the 0th, 25th, 50th, 75th, and 100th percentiles).

find('Latency') | histogram:create() | histogram:percentile(0,25,50,75,100)
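
As a point of comparison, the same five values can be computed from raw samples with Python's standard library. This is only a sketch: CAQL derives percentiles from a histogram, whereas the code below interpolates over a small list of invented latency values, so results on real data may differ slightly.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the 0th, 25th, 50th, 75th, and 100th percentiles of `values`."""
    ordered = sorted(values)
    # The "inclusive" method treats the data as the full population, so the
    # cut points never fall outside the observed minimum and maximum.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return [ordered[0], q1, q2, q3, ordered[-1]]


# Invented latency samples in milliseconds, including one slow outlier.
latencies_ms = [120, 80, 950, 100, 200]

summary = five_number_summary(latencies_ms)
print(dict(zip(["p0", "p25", "p50", "p75", "p100"], summary)))
```

Note how the single slow request moves only the 100th percentile. That is why the quartiles tell more of the story than an average does: the mean of these samples is 290 ms, a value that none of the requests actually experienced.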

On a graph displaying the calculated metrics, we see that the number of hosts with increased disk usage rates (red bars) clearly precedes the increase in service latency (blue lines).

Overlay of number of disk saturated nodes with service latency. Red bars are number of nodes (left Y-axis), blue lines are latency quartiles in milliseconds (right Y-axis).

We created rulesets based on the number of nodes exceeding the disk read-rate threshold of 95MB/s, with an alert that fires when a ruleset is triggered. This way, we are notified of disk capacity issues as they develop. This kind of statistics-based analysis sets Circonus apart from other monitoring solutions.

Reflections on tuning for AWS infrastructure

No two application workloads are the same. By using established analytical methods, we quickly profiled the load experienced by our ELB/EC2 application infrastructure instance and identified exactly which part of the system was the performance constraint. This performance tuning approach requires accurate operational telemetry data to accomplish.

Without operational metrics, one must resort to “throwing instances” at the problem and hoping for the best. This approach may yield results occasionally, but it is costly in operator hours and almost always results in a markedly over-provisioned infrastructure, for which Amazon is happy to charge you.

The ability to use statistical constructs, such as histograms and quartiles, to set discrete performance levels with the appropriate alerts is a vital component of infrastructure monitoring. It gives operators well-defined thresholds for judging how near their service is to capacity and, consequently, to a degraded user experience.

Any operator’s toolkit should include these approaches when using CloudWatch metrics to tune complicated systems composed of AWS services.