We've been running Kubernetes for deep learning research for over two years. While our largest-scale workloads manage bare cloud VMs directly, Kubernetes provides a fast iteration cycle, reasonable scalability, and a lack of boilerplate which makes it ideal for most of our experiments. We now operate several Kubernetes clusters (some in the cloud and some on physical hardware), the largest of which we've pushed to over 2,500 nodes. This cluster runs in Azure on a combination of D15v2 and NC24 VMs.

On the path to this scale, many system components caused breakages, including etcd, the Kube masters, Docker image pulls, network, KubeDNS, and even our machines' ARP caches. We felt it’d be helpful to share the specific issues we ran into, and how we solved them.

etcd

After passing 500 nodes in our cluster, our researchers started reporting regular timeouts from the kubectl command line tool. We tried adding more Kube masters (VMs running kube-apiserver). This seemed to solve the problem temporarily, but once we passed 10 replicas we knew we were treating symptoms and not the cause (by comparison, GKE uses a single 32-core VM for 500 nodes).

This made us strongly suspect our etcd cluster, which is the central store of state for the Kube masters. Looking in Datadog, we saw write latency spiking to hundreds of milliseconds on the DS15v2 machines running our etcd replicas, despite each machine using a P30 SSD capable of 5,000 IOPS.

Benchmarking performance with fio , we saw etcd was only able to use about 10% of the available IOPS because the write latency was 2ms and etcd does sequential I/O, making it latency-bound.

We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!

Our cluster ran well until we passed about 1,000 nodes, at which point we once again saw high commit latency from etcd. This time, we noticed the kube-apiservers were reading more than 500MB/s from etcd. We set up Prometheus to monitor the apiservers, and also set the --audit-log-path and --audit-log-maxbackup flags to enabled more logging on the apiserver. This surfaced a number of slow queries and excessive calls to the LIST API for Events.

The root cause: the default setting for Fluentd's and Datadog's monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:

Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances. To do this, we just set the --etcd-servers-overrides flag to something like this: --etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381

Another post-1,000 nodes failure was to hit etcd's hard storage limit (by default 2GB), which causes it to stop accepting writes. This triggered a cascading failure: all our Kube nodes failed their health checks, and our autoscaler decided it thus needed to terminate all the workers. We've increased the max etcd size with the --quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.

Kube masters

We colocate the kube-apiserver, kube-controller-manager, and kube-scheduler processes on the same machines. For high availability, we always have at least 2 masters, and set the --apiserver-count flag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).

We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly. The default kube-scheduler policy is to spread out load evenly among nodes, but we want the opposite so that unused nodes can be terminated and also so that large pods can be scheduled quickly. So we switched to the following policy:

{ "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ {"name" : "GeneralPredicates"}, {"name" : "MatchInterPodAffinity"}, {"name" : "NoDiskConflict"}, {"name" : "NoVolumeZoneConflict"}, {"name" : "PodToleratesNodeTaints"} ], "priorities" : [ {"name" : "MostRequestedPriority", "weight" : 1}, {"name" : "InterPodAffinityPriority", "weight" : 2} ] }

We use KubeDNS extensively for service discovery, but soon after rolling out the new scheduling policy it started having reliability issues. We found that the failures were only happening on certain pods of KubeDNS. With the new scheduling policy some machines ended up running 10+ copies of KubeDNS, creating hotspots, and we had exceeded the ~200QPS that’s allowed from each Azure VM for external domains lookups.

We fixed this by adding an anti-affinity rule to our KubeDNS pods:

affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - weight: 100 labelSelector: matchExpressions: - key: k8s-app operator: In values: - kube-dns topologyKey: kubernetes.io/hostname

Docker image pulls

Our Dota project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting in Pending for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while — but this was true for other containers as well. Digging in, we found that kubelet has a --serialize-image-pulls flag which defaults to true , meaning the Dota image pull blocked all other images. Changing to false required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.

Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message: rpc error: code = 2 desc = net/http: request canceled . The kubelet and Docker logs also contained messages indicating that the image pull had been canceled, due to a lack of progress. We tracked the root to large images taking too long to pull/extract, or times when we had a long backlog of images to pull. To address this, we set kubelet's --image-pull-progress-deadline flag to 30 minutes, and set the Docker daemon's max-concurrent-downloads option to 10. (The second option didn’t speed up extraction of large images, but allowed the queue of images to pull in parallel.)

Our last Docker pull issue was due to the Google Container Registry. By default, kubelet pulls a special image from gcr.io (controlled by the --pod-infra-container-image flag) which is used when starting any new container. If that pull fails for any reason, like exceeding your quota, that node won’t be able to launch any containers. Because our nodes go through a NAT to reach gcr.io rather than having their own public IP, it's quite likely that we’ll hit this per-IP quota limit. To fix this, we simply preloaded that Docker image in the machine image for our Kubernetes workers by using docker image save -o /opt/preloaded_docker_images.tar and docker image load -i /opt/preloaded_docker_images.tar . To improve performance, we do the same for a whitelist of common OpenAI-internal images like the Dota image.

Networking

As our experiments grow larger, they also become increasingly complex distributed systems which rely heavily on the network for their operation. When we first started running distributed experiments, it became immediately obvious that our networking wasn’t configured well. Directly between machines we got 10-15Gbit/s of throughput, but our Kube pods using Flannel were maxing out at ~2Gbit/s. Machine Zone's public benchmarks show similar numbers, meaning the issue wasn't likely to just be bad config, but instead something inherent to our environment. (By contrast, Flannel does not add this overhead on our physical machines.)

To work around this, users can add two different settings to disable Flannel for their pod: hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet . (Though read the warnings in the Kubernetes documentation before doing this.)

ARP Cache

Despite our DNS tuning, we still saw intermittent issues with DNS resolution. One day an engineer reported that nc -v to their Redis server was taking over 30 seconds to print that the connection was established. We tracked the issue to the kernel’s ARP stack. Initial investigation of the Redis pod's host showed something seriously wrong with the network: communication on any port was hanging for multiple seconds, and no DNS names could be resolved via the local dnsmasq daemon, with dig just printing a cryptic failure message: socket.c:1915: internal_send: 127.0.0.1#53: Invalid argument . The dmesg log was more informative: neighbor table overflow! which meant that the ARP cache had run out of space. ARP is used for mapping a network address such as an IPv4 address, to a physical address, such as a MAC address. Fortunately, this was easy to fix by setting a few options in /etc/sysctl.conf :

net.ipv4.neigh.default.gc_thresh1 = 80000 net.ipv4.neigh.default.gc_thresh2 = 90000 net.ipv4.neigh.default.gc_thresh3 = 100000

It's common to tune this setting in HPC clusters, and is particularly relevant in Kubernetes clusters since every pod has its own IP address which consumes space in the ARP cache.

Our Kubernetes clusters have been incident-free for about 3 months now, and we're planning to scale to even larger clusters in 2018. We recently upgraded to version 1.8.4, and are excited to see that it now officially supports 5,000. If you’re interested in building large scale compute clusters, we’re hiring!