Last Updated on May 10, 2019


I’ve been using EKS in production for a few months now and, so far, so good. I’m really impressed by the simplicity of getting a cluster up and running and ready for workloads. AWS provide a great Getting Started Guide on their website, which is super duper for getting your head around the components and glue required to stand EKS up.

EKS is a very vanilla service, giving users a cluster that conforms to CNCF standards, which Kubernetes purists will be very happy with. However, don’t think that because AWS provides Kubernetes as a service, you no longer have to worry about optimising your nodes and getting them ready for heavy workloads. You should consider an EKS worker node to be the same as a standard, out-of-the-box EC2 node. If you commonly make optimisations, do hardening, or install software that your company requires for its standards, you should still do all of that on EKS.

Fortunately, AWS provides the means to do that in a very straightforward way. The AMIs AWS provides for standing up EKS workers contain a bootstrap script at /etc/eks/bootstrap.sh, which is called from the UserData when an instance boots, whether launched by hand or by an AutoScalingGroup. You can use the UserData in your LaunchConfiguration to edit the arguments passed to this script.

A busy Kubelet has a lot of things to do. Not only is it running your actual, very mixed workloads; it’s also collecting data and metadata from your applications, dealing with security and auth, managing your network stack and, of course, telling docker how and when to run containers.

So, with no further ado, these are optimisations, suggestions and considerations for people looking at getting EKS into production. I can’t take credit for having the brain power to come up with a lot of these things, so credit is given inline for specific things I’ve found in the wild.

Reserving Resources For The System and Kubelet

Viewing kubectl get nodes during a busy period, I noticed nodes in the NotReady state, which I believed was caused by docker itself being starved of system resources. I needed to set a few flags when starting the kubelet to ensure there was always enough memory, CPU and disk for vital system processes, and for docker itself, to run. The figures suggested by the Kubernetes docs guided us here and seemed completely reasonable.

You’re about to start reserving non-trivial amounts of RAM before the kubelet gets a shot at running your applications, so it’s probably best not to use anything smaller than a large instance type.

If you want to use smaller instances, you could write some bash to parse /proc/meminfo and calculate the reservation based on how much RAM the system has.
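If you went down that route, a minimal sketch could look like the following. The 25%-capped-at-1Gi policy and the `reserved_mem_mi` helper are my own assumptions for illustration, not an official recommendation.

```shell
#!/bin/bash
# Hypothetical sizing helper: reserve 25% of RAM for --kube-reserved,
# capped at 1Gi. The policy here is an illustrative assumption.
reserved_mem_mi() {
  local total_kb="$1"                  # MemTotal in kB, as in /proc/meminfo
  local total_mi=$(( total_kb / 1024 ))
  local quarter=$(( total_mi / 4 ))
  if [ "$quarter" -gt 1024 ]; then
    echo 1024                          # cap the reservation at 1Gi
  else
    echo "$quarter"
  fi
}

# On a real node you'd feed it the live value:
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# echo "--kube-reserved cpu=250m,memory=$(reserved_mem_mi "$total_kb")Mi"

reserved_mem_mi 2097152    # 2GiB node  -> 512
reserved_mem_mi 16777216   # 16GiB node -> 1024
```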

These lines can be placed into the UserData as arguments for bootstrap.sh.

# Capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc.
--kube-reserved cpu=250m,memory=1Gi,ephemeral-storage=1Gi

# Capture resources for vital system functions, such as sshd, udev.
--system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi

# Start evicting pods from this node once these thresholds are crossed.
--eviction-hard memory.available<0.2Gi,nodefs.available<10%
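To see what these flags leave for your workloads: the kubelet derives a node’s Allocatable by subtracting both reservations and the hard eviction threshold from capacity. A rough back-of-the-envelope sketch, with figures rounded to MiB:

```shell
#!/bin/bash
# Back-of-the-envelope Allocatable calculation, mirroring how the kubelet
# derives it: Allocatable = capacity - kube-reserved - system-reserved
#                           - eviction-hard threshold.
node_allocatable_mi() {
  local capacity_mi="$1" kube_mi="$2" system_mi="$3" eviction_mi="$4"
  echo $(( capacity_mi - kube_mi - system_mi - eviction_mi ))
}

# An 8GiB node with the flags above (0.2Gi rounded to 205Mi):
node_allocatable_mi 8192 1024 205 205   # -> 6758 Mi left for pods
```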

Network Stack Optimisation

Rather than reinventing the wheel I went hunting the web for good examples.

A great resource I found was here:

https://blog.codeship.com/running-1000-containers-in-docker-swarm/

Docker Swarm, at scale, yeah, that works for my purposes. Thank you, Tit Petric.

cat <<EOF > /etc/sysctl.d/99-kubelet-network.conf
# Have a larger connection range available
net.ipv4.ip_local_port_range=1024 65000

# Reuse closed sockets faster
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=15

# The maximum number of "backlogged sockets". Default is 128.
net.core.somaxconn=4096
net.core.netdev_max_backlog=4096

# 16MB per socket - which sounds like a lot,
# but will virtually never consume that much.
net.core.rmem_max=16777216
net.core.wmem_max=16777216

# Various network tunables
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_max_tw_buckets=400000
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_syn_retries=2
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_wmem=4096 65536 16777216
#vm.min_free_kbytes=65536

# Connection tracking to prevent dropped connections (usually issue on LBs)
net.netfilter.nf_conntrack_max=262144
net.ipv4.netfilter.ip_conntrack_generic_timeout=120
net.netfilter.nf_conntrack_tcp_timeout_established=86400

# ARP cache settings for a highly loaded docker swarm
net.ipv4.neigh.default.gc_thresh1=8096
net.ipv4.neigh.default.gc_thresh2=12288
net.ipv4.neigh.default.gc_thresh3=16384
EOF

# Don't forget to...
systemctl restart systemd-sysctl.service

DNS lookup scaling

Out of the box, AWS provides a kube-dns deployment with a replica count of 1. After a week or so in production, I was skimming our logs and came across this beauty, which reinforced something I had seen in our exception handling system.

dnsmasq[14]: Maximum number of concurrent DNS queries reached (max: 150)

Wow, we were hitting the cap of 150 concurrent DNS queries?? As it turns out, we were. This is a copy of the resolv.conf I found in our containers:

nameserver 172.20.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local ourdomain.com us-west-2.compute.internal
options ndots:5

For each name our busy applications resolved, the resolver tried A and AAAA lookups against every entry in the search field. I counted 10 lookups for each resolution, followed by 2 more for the actual recursive lookup. tcpdump confirmed this.
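That multiplication can be reproduced with a short sketch. This is a hypothetical illustration of the resolver’s search-list expansion under ndots:5, not the actual glibc code; the `candidates` function and sample name are made up.

```shell
#!/bin/bash
# Hypothetical sketch of the resolver's candidate-name expansion under
# ndots:5: a name with fewer than 5 dots is tried against every search
# domain before being tried as-is, and each candidate costs both an A
# and an AAAA query.
candidates() {
  local name="$1"; shift            # remaining args: the search domains
  local dots="${name//[^.]/}"       # strip everything except the dots
  if [ "${#dots}" -lt 5 ]; then
    local domain
    for domain in "$@"; do
      echo "${name}.${domain}"
    done
  fi
  echo "$name"                      # finally, the name as written
}

candidates db.ourdomain.com \
  kube-system.svc.cluster.local svc.cluster.local cluster.local \
  ourdomain.com us-west-2.compute.internal
# 6 candidate names, i.e. 12 queries once A and AAAA are both tried
```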

I undertook the following actions, some of which will annoy the purists, I’m sure.

Scale up the kube-dns deployment.

kubectl -n kube-system scale --replicas=5 deployment/kube-dns

Knowing that we don’t use Kubernetes short names in our cluster, we curated our own resolv.conf. The kubelet merges the system DNS configuration into the resolv.conf it hands to pods, so we decided to shrink it to use only the Kubernetes-generated config, rather than the externally defined search path from our VPC DHCP options. This shortened the list in the search line to 3 entries.

echo "; generated by bootstrap.sh" > /var/lib/kubelet/resolv.conf

# ...and then add this argument to the --kubelet-extra-args:
--resolv-conf=/var/lib/kubelet/resolv.conf

Fully qualified DNS names where we could (note the trailing dot), meaning the getaddrinfo() call would not walk the full search list.

- internal-api-123456789.eu-west-1.elb.amazonaws.com
+ internal-api-123456789.eu-west-1.elb.amazonaws.com.

Where we were happy for it to happen, we updated the dnsPolicy in deployments we knew had a high frequency of resolutions.

- dnsPolicy: ClusterFirst
+ dnsPolicy: Default

Disabled IPv6. (This won’t disable resolution attempts, but it had come to our attention that IPv6 might confuse things in our fairly old applications.)

cat <<EOF > /etc/sysctl.d/11-no-ipv6.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF

# Don't forget to...
systemctl restart systemd-sysctl.service

Extra Bootstrapping Steps

Authentication

When you create an EKS cluster, the user that actually creates the cluster is the only one that can access it by default. Even IAM Administrator users can’t log in. This seems to be due to the way AWS bootstraps the cluster behind the scenes in AWS-land. Once the cluster is up, you can add users as you would in any Kubernetes setup.

My recommendation, in a production environment, is to use an IAM role to create the cluster.

Once the cluster is created, this auth configuration creates an admin role called mycluster-admin inside your cluster, mapped to an IAM role outside your cluster.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: ${node_role_arn}
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: ${admin_role_arn}
      username: mycluster-admin
      groups:
        - system:masters

Once you’ve done this, should you have a catastrophe and lose individual user credentials, you’ll still be able to get access to the cluster.

Extra Kubelet Args

As mentioned, the UserData is the place to configure your extra Kubelet settings, appending your bespoke arguments to the /etc/eks/bootstrap.sh call.

Here’s how ours ended up looking, using some of the options mentioned here. I add a good few labels to make sure pods end up in the right place.

/etc/eks/bootstrap.sh {cluster_name} \
  --kubelet-extra-args \
  "--node-labels=cluster=${cluster_name},nodegroup=${node_name},environment=${env},workload=${workload} \
  --resolv-conf=/var/lib/kubelet/resolv.conf \
  --kube-reserved cpu=250m,memory=1Gi,ephemeral-storage=1Gi \
  --system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
  --eviction-hard memory.available<0.2Gi,nodefs.available<10%"

Subnet Design

EKS uses the amazon-vpc-cni-k8s network plugin which assigns an IP address from the host ENI (Amazon lingo for a network interface) to each pod running on that node. There are a couple of things to consider.

The instance type you use determines the number of ENIs available, and therefore the maximum number of pods: the maximum pods you can schedule on an instance is roughly the number of interfaces multiplied by the IP addresses per interface. The Kubelet bootstrap script uses a file at /etc/eks/eni-max-pods.txt (link) to make sure your Kubelet doesn’t try to run more pods than there are IPs available.
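That ceiling can be sketched directly. The formula below (ENIs × (IPv4 addresses per ENI − 1) + 2) reproduces the figures AWS ships in eni-max-pods.txt as I understand it; the m5.large numbers (3 ENIs, 10 IPv4 addresses each) are from the EC2 instance-type tables, so treat this as an illustrative cross-check rather than the authoritative source.

```shell
#!/bin/bash
# Sketch of the pod ceiling the VPC CNI plugin imposes:
#   maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
# The -1 accounts for each ENI's own primary address; the +2 covers
# pods that run with host networking and so need no pod IP.
max_pods() {
  local enis="$1" ips_per_eni="$2"
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods 3 10   # m5.large -> 29, matching /etc/eks/eni-max-pods.txt
```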

Part of the design of the CNI plugin means that there’s a cooling-off period after pod termination before IPs are returned to the pool and become available again. In your design, you should specify new subnets for your workers that are a good deal larger than the default /24 range. A decent plan might be a /16 VPC, with /19 internal subnet ranges for your worker nodes, and /21 ranges for your public-facing ELB/ALB subnets. This should give you plenty of scope to run pods that you haven’t even thought of running yet.
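To put numbers on that sizing: AWS reserves five addresses in every subnet (the first four plus the broadcast address), so a quick sketch of usable addresses per prefix length looks like this. The `cidr_hosts` helper is illustrative only.

```shell
#!/bin/bash
# Usable addresses in an AWS subnet of a given prefix length:
# total addresses minus the 5 that AWS reserves in every subnet.
cidr_hosts() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 5 ))
}

cidr_hosts 24   # the default /24 -> 251 usable addresses
cidr_hosts 19   # a /19          -> 8187, plenty of pod headroom
```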

The takeaway here is that you’ll use more IPs than you think.

Our cluster subnets look like this. I quite like the idea of having a data subnet. I don’t assign any internet route for it, but if there’s anything that I don’t want to have internet access, then it can go in there.

eks-public-1a   10.0.0.0/21
eks-public-1b   10.0.8.0/21
eks-public-1c   10.0.16.0/21
reserved        10.0.24.0/21
eks-private-1a  10.0.32.0/19
eks-private-1b  10.0.64.0/19
eks-private-1c  10.0.96.0/19
reserved        10.0.128.0/19
eks-data-1a     10.0.144.0/23
eks-data-1b     10.0.146.0/23
eks-data-1c     10.0.148.0/23
reserved        10.0.150.0/23

UserData Configuration In Full

This is the final user-data we used in our launch configuration for the EKS worker nodes.

#!/bin/bash

# Sysctl changes
## Disable IPv6
cat <<EOF > /etc/sysctl.d/10-disable-ipv6.conf
# disable ipv6 config
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF

## Kube network optimisation.
## Stolen from this guy: https://blog.codeship.com/running-1000-containers-in-docker-swarm/
cat <<EOF > /etc/sysctl.d/99-kube-net.conf
# Have a larger connection range available
net.ipv4.ip_local_port_range=1024 65000

# Reuse closed sockets faster
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=15

# The maximum number of "backlogged sockets". Default is 128.
net.core.somaxconn=4096
net.core.netdev_max_backlog=4096

# 16MB per socket - which sounds like a lot,
# but will virtually never consume that much.
net.core.rmem_max=16777216
net.core.wmem_max=16777216

# Various network tunables
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_max_tw_buckets=400000
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_syn_retries=2
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_wmem=4096 65536 16777216
#vm.min_free_kbytes=65536

# Connection tracking to prevent dropped connections (usually issue on LBs)
net.netfilter.nf_conntrack_max=262144
net.ipv4.netfilter.ip_conntrack_generic_timeout=120
net.netfilter.nf_conntrack_tcp_timeout_established=86400

# ARP cache settings for a highly loaded docker swarm
net.ipv4.neigh.default.gc_thresh1=8096
net.ipv4.neigh.default.gc_thresh2=12288
net.ipv4.neigh.default.gc_thresh3=16384
EOF

systemctl restart systemd-sysctl.service

## DNS change to limit the 'search' path
echo "; generated by bootstrap.sh" > /var/lib/kubelet/resolv.conf

if [[ $(ec2-metadata --instance-type) =~ 'large' ]]; then
  mem_reserved=1Gi
else
  mem_reserved=0.5Gi
fi

/etc/eks/bootstrap.sh {cluster_name} \
  --kubelet-extra-args \
  "--node-labels=cluster=${cluster_name},nodegroup=${node_name},environment=${env},workload=${workload} \
  --resolv-conf=/var/lib/kubelet/resolv.conf \
  --kube-reserved cpu=250m,memory=${mem_reserved},ephemeral-storage=1Gi \
  --system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
  --eviction-hard memory.available<500Mi,nodefs.available<10%"

In Summary….

EKS has worked out well for us over the past few months and we’ve had no problems with stability or control plane performance. I know that many people are waiting for EKS to improve before jumping on. AWS has a history of releasing early and iterating based on feedback.

Some major concerns that we had were:

A proven upgrade process

Access to the control plane logs

Managed workers

Better integration with IAM

EKS only supports Kubernetes 1.10 but has released two updates since launch. The first was to add the aggregation APIs so that HPA and metrics server would work. The second fixed the critical API server vulnerability. Both were seamless, hands-off upgrades and I’m hopeful that when Kubernetes 1.11 is available it will be just as easy.

Update: EKS control plane upgrades are seamless.

Not having access to the master or etcd (or equivalent) logs is a bit annoying although, that said, I’ve coped without them thus far. AWS have stated that they will make these available via CloudWatch soon.

Update: Logs are now available in Cloudwatch logs.

As written already in this blog, I did need to perform a bit of manual worker tuning that perhaps isn’t required on GKE and AKS. These are all settings that could theoretically be added to the EKS optimised AMI or are trivially added via launch configuration. You could even build your own! Overall the worker management isn’t too difficult and certainly isn’t a showstopper.

We’re using the IAM Authenticator to authenticate to the API server and kube2iam to give pods access to AWS resources. We’re also using the ALB ingress controller which makes configuring routes into the cluster easy. This is all basic but is enough to get by for now.

About the Author

This is a guest post by Graham Moore, a senior DevOps and certified AWS architect who has worked on contracts for numerous high profile technology companies in and around London. Add him on LinkedIn if you’d like to discuss cloud consulting projects.