On our cloud migration journey, we have had great fun exploring the strengths and limitations of the platform. We chose Amazon's managed Kubernetes offering, EKS, as one of our workhorses. Since we run a popular marketplace product, the need to accommodate high loads was a given; for one service we were looking at a minimum request load of 2000 RPS.

Request flow in our service

Nature of the service in this scenario

The service in question is an image-processing service that picks up original images from an S3 bucket.

Request flow for image service

The image above shows how the incoming request depends on an externally available resource, an S3 bucket. To get around that, we initially started to proxy the S3 bucket via a CloudFront distribution, thus avoiding any credential negotiation between the cluster and the bucket. The idea was to remove any dependency between the S3 bucket and the EKS cluster, thus eliminating the need for KIAM or kube2iam for this use case.

The issue at high request load

When we had our application and CloudFront distribution set up, we began to load test the solution. Everything was working great and with the degree of automation packed right into Kubernetes, we were sailing ahead in full force.

And then we hit a problem!

About the load test

We used locust.io on a separate EKS cluster to generate the load we needed, since we had already hit limitations on managed load-testing tools like Load Impact.

Request flow for initial load test

To monitor the load tests, we used Datadog for APM, Sumologic for logs, and a nifty little tool, KubeOpsView, for overall cluster visualisation.

The Problem

Some interesting findings surfaced during the initial load tests. We saw an invisible throttling when the load on the cluster reached the 1000 RPS mark; besides that, we observed that the 95th percentile of response times (yellow line below) was acting weird.

Throttling observed in locust during the load test

Other variations of load tests

We did multiple load tests to isolate the problem in the request flow. We changed the request chain to replicate the issue with individual components. This made it easy to target the problem.

Test Case 1 — Eliminate EKS

Load test against Cloudfront distribution and S3 bucket

CloudFront has a limit of 100,000 RPS per distribution, and in our test we found no issues with CloudFront; the WAF policy attached to the distribution had no rule to throttle the traffic. We achieved 2500 RPS with the base test.

Test Case 2 — Eliminate Cloudfront, S3 bucket, and Image service

Load test against S3 mock inside EKS

We focused this test on EKS itself, without external dependencies, and on the limits of the image service. With a load test against an S3 mock, we could touch 2500 RPS with the base test.

This helped rule out ELB and the Nginx ingress controller from the investigation.

Test Case 3 — Eliminate Cloudfront and S3 bucket

Load test against image service and S3 mock inside EKS

This test focused on the internal limits of the image service when dealing with local traffic, i.e. the S3 mock. We could see that the application could handle high load and even surpassed the 2500 RPS mark.

This was very helpful, as it gave us a direction for starting a more focused investigation, ruling out several components in the request chain.

Investigation begins

By performing multiple load tests on the EKS cluster, we could rule out certain components in our request chain from a detailed investigation. However, we made sure to back this conclusion with hard evidence; we went through various logs and metrics and found the outcome of the load tests to be fairly accurate.

I will try to summarise our investigation approach across a series of questions.

Maybe it’s a resource limitation?

Initially, we thought this just might be a resource limitation that could easily be solved by throwing more resources at the problem; I wish it were that simple. But that was not the case in our scenario; we had plenty of resources available for the pods. We used the CPU Manager to allocate resources systematically and ensured that the pods ran with the Guaranteed QoS class.
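For reference, the Guaranteed QoS class requires every container in the pod to declare limits equal to its requests; a minimal sketch (the pod name, image, and sizes are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-service            # hypothetical name
spec:
  containers:
    - name: app
      image: example/image-service:latest   # hypothetical image
      resources:
        requests:
          cpu: "2"
          memory: 2Gi
        limits:
          cpu: "2"               # limits == requests => Guaranteed QoS
          memory: 2Gi
```

Note that the CPU Manager's static policy only gives exclusive cores to Guaranteed pods with integer CPU requests, which is why the whole-number CPU value matters here.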

Are we hitting any limitation in AWS?

Since our request chain was hitting various AWS resources, we shifted our focus to figure out if we hit any limitation on any of the corresponding resources.

We started by looking at the following resources:

ELB Limitations

During our investigation, we found that, for optimum performance during a ramp-up test, warming up the ELB is a good option, as we can then expect consistent performance from the ELB end. You can read more about this here. Besides that, we did not exhaust any limit on the ELB.

Nginx ingress controller limitations

We monitored the load on the Nginx ingress controllers and found they had no stress handling the required traffic; we also reviewed whether any rate limiting existed in the active configuration. Everything seemed fine in this area.

CloudFront distribution limitations

As mentioned previously, CloudFront has a limit of 100,000 requests per second per distribution; we were well below this mark, and we did not observe any CloudFront-related errors during the test.

Why do we observe the issue when only making external calls through the EKS cluster?

After performing the load tests, this question intrigued us the most!

Since we could serve high load when resources were present locally, we wondered what caused this throttling when we made external calls from the cluster (the initial load test).

NAT related issues

SNAT (Source NAT) Issues

We came across this wonderful article that discusses the issue related to higher tail latencies. We spotted a similar trend in our load tests. But since our pods already existed on a private subnet, this scenario was not applicable.

But for others who have pods scheduled on public subnets, the page over here might be an interesting read.

NAT Gateway Issues

A NAT gateway can support up to 55,000 simultaneous connections to each unique destination. With rapid connection churn, this translates to roughly 900 new connections per second to a single destination (about 55,000 connections per minute). Exhausting this budget would show up as port allocation errors.
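The per-second figure quoted above is simple back-of-the-envelope arithmetic on the per-minute budget:

```python
# Back-of-the-envelope budget for new connections through a NAT gateway
# to a single unique destination, from the 55,000-connections-per-minute figure.
CONNECTIONS_PER_MINUTE = 55_000

budget_per_second = CONNECTIONS_PER_MINUTE / 60
print(int(budget_per_second))  # 916, i.e. the "about 900 per second" above
```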

To monitor this, we checked the CloudWatch metric for the NAT gateway, ErrorPortAllocation, but everything was normal here. So we had to push our investigation even further.
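For anyone wanting to automate the same check, here is a sketch using boto3. The gateway ID and time window are hypothetical, and the actual CloudWatch call is wrapped in a function so the parameters can be inspected without AWS access:

```python
from datetime import datetime, timedelta, timezone

# Parameters for the NAT gateway port-allocation check.
# The NatGatewayId value is a placeholder; substitute your own gateway.
params = {
    "Namespace": "AWS/NATGateway",
    "MetricName": "ErrorPortAllocation",
    "Dimensions": [{"Name": "NatGatewayId", "Value": "nat-0123456789abcdef0"}],
    "StartTime": datetime.now(timezone.utc) - timedelta(hours=1),
    "EndTime": datetime.now(timezone.utc),
    "Period": 60,
    "Statistics": ["Sum"],
}

def fetch_port_allocation_errors():
    """Query CloudWatch; any non-zero Sum means the gateway ran out of ports."""
    import boto3  # imported here so the sketch itself runs without AWS access
    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.get_metric_statistics(**params)["Datapoints"]
```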

CoreDNS

CoreDNS is an essential part of Kubernetes. It is often overlooked because it works so damn well. It is responsible for both internal and external DNS resolution. We observed our problem when doing external resolution from the cluster, so we started performing sanity checks on CoreDNS.

Verifying the version of CoreDNS deployment.

When EKS provisions a cluster, the CoreDNS, kube-proxy and aws-node components come pre-installed. However, it is worth checking whether you have the recommended version of CoreDNS for your cluster:



Kubernetes 1.14: 1.3.1

Kubernetes 1.13: 1.2.6

Kubernetes 1.12: 1.2.2

Kubernetes 1.11: 1.1.3

You can check the same on your cluster with the following command:

kubectl describe deployment coredns --namespace kube-system | grep Image | cut -d "/" -f 3

Output

coredns:v1.3.1

Check for other known issues with CoreDNS

CoreDNS has several known issues, out of which the ndots:5 issue made the most sense in our scenario. Kubernetes has a long DNS search path, and with the default ndots:5 value, any name containing fewer than 5 dots will cycle through all the search domains before being resolved as an absolute name.

This is a bad strategy when performing a load test or handling a high request load, as every resolution results in several extra DNS queries.

An alternative way to reach an external endpoint is to use an FQDN like external.endpoint.com. in the applications. Note the trailing dot: it makes the name absolute, ensuring a single DNS query and bypassing the search-domain cycle.
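To make the extra-query behaviour concrete, here is a small simulation of the resolver's search-list logic. This is a sketch, not the real resolver, and the search domains are only illustrative of a typical in-cluster resolv.conf:

```python
# Simulate which names a resolver with a search list will try, per resolv.conf
# semantics: a name with fewer than `ndots` dots is tried against every search
# domain first; a name ending in "." is absolute and tried exactly once.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local",
          "eu-west-1.compute.internal"]  # illustrative in-cluster search path

def queries_attempted(name, ndots=5, search=SEARCH):
    if name.endswith("."):                 # FQDN: single absolute query
        return [name]
    if name.count(".") >= ndots:           # enough dots: try as-is first
        return [name + "."] + [f"{name}.{d}." for d in search]
    # fewer than ndots dots: cycle through the search domains first
    return [f"{name}.{d}." for d in search] + [name + "."]

print(len(queries_attempted("external.endpoint.com")))   # 5 candidate names
print(len(queries_attempted("external.endpoint.com.")))  # 1 candidate name
```

In practice each candidate name may also be queried for both A and AAAA records, multiplying the load on CoreDNS further.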

We tried to see how the performance would vary if we just scale the CoreDNS pods. Not to our surprise, the cluster exceeded the performance expectations by a huge margin.

A wise man once said — It’s 𝗇̶𝖾̶𝗏̶𝖾̶𝗋̶ always DNS!

Since we could pinpoint the issue to CoreDNS, we dug deeper into it and found some interesting stuff.