At Blend, we make extensive use of Kubernetes on AWS to power our infrastructure. Kubernetes has many moving parts, and most of these components are swappable, allowing us to customize clusters to our needs. An important component of any cluster is the Container Network Interface (CNI), which handles the networking for all pods running on the cluster. Choosing the right CNI for each use case is critically important, and changing it once a cluster is serving production traffic can be painful. Blend had several problems with the CNI we initially chose (Weave), leading us to explore alternatives. We eventually decided to switch, and in this post we describe the challenges of migrating without downtime and how we solved them.

We’ll cover the following topics:

Why Changing Overlays is Hard

Changing CNIs at Blend: Weave to Calico

The Migration: Mirroring Clusters

The Migration: Switching CNIs

Gotchas: Why Practice is Important

Takeaways

Why Changing Overlays is Hard

The CNI handles networking between pods, establishing an overlay network for the cluster on top of the existing network. Any pod running on Kubernetes that isn’t using host networking is managed by the overlay: the CNI assigns each pod an IP address and routes its traffic. Pod addresses are not visible outside the cluster, so the CNI is vital to allowing pods to communicate across the network. If the CNI pod on a host isn’t available, any pods running on that host can’t reach the network, causing connectivity issues within the cluster. Normally this isn’t a problem, because replicating pods and moving them to healthy nodes is a strength of Kubernetes. But if all of the network overlay pods are unavailable, no service running on the cluster can serve traffic, which is a major outage. This type of failure is rare, but it is exactly the failure you’ll cause if you need to change the CNI on a running cluster. During the transition, the cluster has no network, which is the crux of the challenge with changing overlays.
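To make this concrete: the CNI typically runs as a DaemonSet in the kube-system namespace, one pod per node, so you can see exactly which pods the cluster’s networking depends on. A minimal sketch, assuming a stock Weave Net install (whose DaemonSet and pod label are weave-net and name=weave-net):

```bash
# The overlay runs as one CNI pod per node via a DaemonSet.
# If the CNI pod on a node is down, workloads on that node lose networking.
kubectl -n kube-system get daemonset weave-net
kubectl -n kube-system get pods -l name=weave-net -o wide
```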

Changing CNIs at Blend: Weave to Calico

We’ve been using Kubernetes on AWS for over two years, and we manage our clusters with tooling built around kops. Because we manage our own clusters¹, we’ve experimented with a number of configuration options and have arrived at a set that works well for us. By the time we moved to production with Kubernetes, we had settled on Weave. As load on our cluster grew, we started seeing problems: low network throughput, dropped connections, timeouts, and exhausted IP space on some nodes. For a while we worked through these and tried to alleviate the stress on Weave with suggestions from the community, upgrades, and resource tuning, but we eventually began searching for other options. Calico is another common CNI, so we put it to the test in one of our higher-load environments, our sandbox cluster. Calico performed significantly better for our use case, so we made the decision to switch. We hope to follow up with another post discussing our experiences with different CNIs in more detail.

We didn’t want to accept the downtime to migrate our existing clusters to Calico because we were already running most of our SLA’d services on Kubernetes. We wanted to avoid interrupting service as much as we could, not just for our customers but also for services internal to Blend. Since we run several clusters, standing up new ones has become relatively simple for us. This gave us plenty of room to experiment with the best way to change our CNI.

The general process for switching a cluster to Calico was:

1. Change the kops cluster configuration to install Calico, and update the cluster
2. Remove Weave from the cluster
3. Run a kops rolling update to bring new machines into the cluster
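As a rough sketch, the commands look something like the following. The state store and cluster name are placeholders, and the exact spec edit depends on your kops version; treat this as an outline of the steps above rather than a recipe we ran verbatim:

```bash
# Placeholders: substitute your own state store and cluster name.
export KOPS_STATE_STORE=s3://example-kops-state
CLUSTER=cluster.example.com

# 1. Edit the cluster spec, changing the networking provider, e.g.
#      networking:
#        weave: {}
#    to:
#      networking:
#        calico: {}
kops edit cluster $CLUSTER

# Push the updated configuration (this alone does not replace nodes).
kops update cluster $CLUSTER --yes

# 2. Remove Weave. From here until the rolling update completes,
#    the cluster has no pod network.
kubectl -n kube-system delete daemonset weave-net

# 3. Roll every node so it comes back up with Calico installed.
kops rolling-update cluster $CLUSTER --yes
```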

This process leaves the cluster without a network from the start of the second step until the end of the third. The more machines in the cluster, the longer it takes, so simply eating the downtime was out of the question, especially for our larger clusters. Since there was no way to avoid downtime on the cluster being migrated, we needed to direct traffic elsewhere while we performed the switch. We do not currently run multiple clusters serving production traffic, so we could not fail over to another region. Instead, we concluded that we should stand up another cluster to mirror the one we wanted to work on.
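The mirroring approach is described in the sections that follow. As one illustration of what a cutover between clusters can look like, DNS-weighted routing lets you move traffic gradually. This is a hypothetical sketch using Route 53 with placeholder zone IDs and hostnames, not necessarily the mechanism we used:

```bash
# Hypothetical: send all traffic for api.example.com to the mirror cluster
# by re-weighting a Route 53 weighted record set (all values are placeholders).
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "mirror-cluster",
        "Weight": 100,
        "TTL": 60,
        "ResourceRecords": [{"Value": "ingress.mirror.example.com"}]
      }
    }]
  }'
```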