Lyft is pleased to announce the initial open source release of our IPvlan-based CNI networking stack for running Kubernetes at scale in AWS.

cni-ipvlan-vpc-k8s provides a set of CNI and IPAM plugins implementing a simple, fast, and low latency networking stack for running Kubernetes within Virtual Private Clouds (VPCs) on AWS.

Background

Today Lyft runs in AWS with Envoy as our service mesh but without using containers in production. We use a home-grown, somewhat bespoke stack to deploy our microservice architecture onto service-assigned EC2 instances with auto-scaling groups that dynamically scale instances based on load.

While this architecture has served us well for a number of years, there are significant benefits to moving toward a reliable and scalable open source container orchestration system. Given our previous work with Google and IBM to bring Envoy to Kubernetes, it should be no surprise that we’re rapidly moving Lyft’s base infrastructure substrate to Kubernetes.

We’re handling this change as a two phase migration — initially deploying Kubernetes clusters for native Kubernetes applications such as TensorFlow and Apache Flink, followed by a migration of Lyft-native microservices where Envoy is used to unify a mesh that spans both the legacy infrastructure as well as Lyft services running on Kubernetes. It’s critical that both Kubernetes-native services as well as Lyft-native services be able to communicate and share data as first class citizens. Networking these environments together must be low latency, high throughput, and easy to debug if issues arise.

Kubernetes networking in AWS: historically a study in tradeoffs

Deploying Kubernetes at scale on AWS is not a simple or straightforward task. While much work in the community has been done to easily and quickly spin up small clusters in AWS, until recently, there hasn’t been an immediate and obvious path to mapping Kubernetes networking requirements onto AWS VPC network primitives.

The simplest path to meeting Kubernetes' network requirements is to assign a /24 subnet to every node, providing more than enough addresses for the default maximum of 110 schedulable Pods per node. As nodes join and leave the cluster, a central VPC route table is updated. Unfortunately, AWS's VPC product has a default maximum of 50 non-propagated routes per route table, which can be raised to a hard limit of 100 routes at the cost of potentially reduced network performance. This effectively limits you to 50 Kubernetes nodes per VPC using this method.
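The arithmetic behind that ceiling, as a quick sketch using the default limits above:

```shell
# Illustrative arithmetic only: with one VPC route per node's /24 Pod
# subnet, the default route-table limit caps the cluster size.
ROUTE_LIMIT=50          # default non-propagated routes per VPC route table
MAX_PODS_PER_NODE=110   # Kubernetes default maximum schedulable Pods per node

echo "max nodes: ${ROUTE_LIMIT}"
echo "max pods:  $((ROUTE_LIMIT * MAX_PODS_PER_NODE))"
```

Raising the limit to the hard maximum of 100 routes only doubles these numbers, which is still far short of a large cluster.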

When considering clusters larger than 50 nodes in AWS, you'll quickly find recommendations for more exotic networking techniques such as overlay networks (IP in IP) and BGP for dynamic routing. All of these approaches add massive complexity to your Kubernetes deployment, effectively requiring you to administer and debug a custom software-defined network stack running on top of Amazon's native VPC software-defined network stack. Why would you run an SDN on top of an SDN?

Simpler solutions

After staring at the AWS VPC documentation, the CNI spec, Kubernetes networking requirement documents, kube-proxy iptables magic, along with all the various Linux network driver and namespace options, it’s possible to create simple and straightforward CNI plugins which drive native AWS network constructs to provide a compliant Kubernetes networking stack.

Lincoln Stoll’s k8s-vpcnet and, more recently, Amazon’s amazon-vpc-cni-k8s CNI stacks use Elastic Network Interfaces (ENIs) and secondary private IPs to achieve overlay-free, AWS VPC-native solutions for Kubernetes networking. While both of these solutions achieve the same base goal of drastically simplifying the network complexity of deploying Kubernetes at scale on AWS, they do not focus on minimizing network latency and kernel overhead while implementing a compliant networking stack.

A simple and low-latency solution

We developed our solution using IPvlan, bypassing the cost of forwarding packets through the default namespace to connect host ENI adapters to their Pod virtual adapters. We directly tie host ENI adapters to Pods.

Network flow to/from VPC over IPvlan

In IPVLAN — The Beginning, Mahesh Bandewar and Eric Dumazet discuss needing an alternative to forwarding as a motivation for writing IPvlan:

Though this solution [forwarding packets from and to the default namespace] works on a functional basis, the performance / packet rate expected from this setup is much lesser since every packet that is going in or out is processed 2+ times on the network stack (2x Ingress + Egress or 2x Egress + Ingress). This is a huge cost to pay for.

We also wanted the system to be host-local with minimal moving components and state; our network stack contains no network services or daemons. As AWS instances boot, CNI plugins communicate with AWS networking APIs to provision network resources for Pods.

Lyft’s network architecture for Kubernetes, a low level overview

The primary EC2 boot ENI with its primary private IP is used as the IP address for the node. Our CNI plugins manage additional ENIs and private IPs on those ENIs to assign IP addresses to Pods.

ENI assignment

Each Pod contains two network interfaces: a primary IPvlan interface and an unnumbered point-to-point virtual ethernet interface. These interfaces are created via a chained CNI execution.

CNI chained execution
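The chained execution can be pictured as a CNI network configuration list; the plugin names and keys below are an illustrative sketch, not the project's exact configuration:

```json
{
  "cniVersion": "0.3.1",
  "name": "cni-ipvlan-vpc-k8s",
  "plugins": [
    {
      "type": "cni-ipvlan-vpc-k8s-ipvlan",
      "mode": "l2",
      "ipam": { "type": "cni-ipvlan-vpc-k8s-ipam" }
    },
    {
      "type": "cni-ipvlan-vpc-k8s-unnumbered-ptp"
    }
  ]
}
```

The runtime invokes each plugin in order, so the IPvlan interface exists before the point-to-point plugin adds the second interface.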

IPvlan interface: The IPvlan interface with the Pod’s IP is used for all VPC traffic and provides minimal overhead for network packet processing within the Linux kernel. The master device is the ENI of the associated Pod IP. IPvlan is used in L2 mode with isolation provided from all other ENIs, including the boot ENI handling traffic for the Kubernetes control plane.

Unnumbered point-to-point interface: A pair of virtual ethernet interfaces (veth) without IP addresses is used to interconnect the Pod’s network namespace to the default network namespace. The interface is used as the default route (non-VPC traffic) from the Pod, and additional routes are created on each side to direct traffic between the node IP and the Pod IP over the link. For traffic sent over the interface, the Linux kernel borrows the IP address from the IPvlan interface for the Pod side and the boot ENI interface for the Kubelet side. Kubernetes Pods and nodes communicate using the same well-known addresses regardless of which interface (IPvlan or veth) is used for communication. This particular trick of “IP unnumbered configuration” is documented in RFC 5309.
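A rough idea of the equivalent configuration in ip(8) terms (a sketch of what gets set up, with hypothetical interface names, namespace, and addresses; not the plugins' actual code, and it requires root):

```shell
# Illustrative only: hypothetical names and addresses.
POD_IP=10.0.1.25    # lives on the Pod's IPvlan interface
NODE_IP=10.0.0.10   # primary private IP of the boot ENI

# Create the veth pair; veth0 stays in the default namespace while
# veth1 moves into the Pod's network namespace ("pod1" here).
ip link add veth0 type veth peer name veth1
ip link set veth1 netns pod1
ip link set veth0 up
ip netns exec pod1 ip link set veth1 up

# Neither veth end gets an IP address ("IP unnumbered").
# Host side: reach the Pod IP over the veth link.
ip route add ${POD_IP}/32 dev veth0

# Pod side: reach the node over the veth link, and use it as the
# default route for non-VPC traffic; src borrows the IPvlan address.
ip netns exec pod1 ip route add ${NODE_IP}/32 dev veth1
ip netns exec pod1 ip route add default via ${NODE_IP} dev veth1 src ${POD_IP}
```

Because both ends borrow addresses that already exist elsewhere on the host, no extra subnets or addresses are consumed by the interconnect.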

Internet egress

For applications where Pods need to directly communicate with the Internet, our stack can source NAT traffic from the Pod over the primary private IP of the boot ENI by setting the default route to the unnumbered point-to-point interface; this, in turn, enables making use of Amazon’s Public IPv4 addressing attribute feature. When enabled, Pods can egress to the Internet without needing to manage Elastic IPs or NAT Gateways.

Internet egress w/ SNAT over boot ENI
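In iptables terms, the source NAT might look something like the following (hypothetical addresses; a sketch of the idea, not the stack's actual rules, and it requires root):

```shell
# Illustrative only: hypothetical CIDRs and address.
POD_CIDR=10.0.1.0/24    # Pod addresses on this node
VPC_CIDR=10.0.0.0/16    # traffic staying inside the VPC is untouched
BOOT_ENI_IP=10.0.0.10   # primary private IP of the boot ENI

# SNAT Pod traffic leaving the VPC to the boot ENI's primary private IP,
# which AWS then maps to the instance's public IPv4 address.
iptables -t nat -A POSTROUTING -s "${POD_CIDR}" ! -d "${VPC_CIDR}" \
         -j SNAT --to-source "${BOOT_ENI_IP}"
```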

Host namespace interconnect

Kubelets and Daemon Sets have high bandwidth, host-local access to all Pods running on the instance; traffic doesn’t transit ENI devices. Source and destination IPs are the well-known Kubernetes addresses on either side of the connection.

kube-proxy: We use kube-proxy in iptables mode, and it functions as expected: Kubernetes Services see connections from a Pod’s source IP, as we loop traffic back through the requesting Pod using policy routing in the default namespace following kube-proxy DNAT resolution.

kube2iam: Traffic from Pods to the AWS Metadata service transits over the unnumbered point-to-point interface to reach the default namespace before being redirected via destination NAT. The Pod’s source IP is maintained, as kube2iam runs as a normal Daemon Set.
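The metadata redirection kube2iam relies on is typically a single DNAT rule in the default namespace, along these lines (hypothetical node IP; 8181 is kube2iam's usual default port; requires root):

```shell
# Illustrative only: redirect Pod traffic destined for the EC2 metadata
# service to the kube2iam agent listening on the node.
iptables -t nat -A PREROUTING -p tcp -d 169.254.169.254 --dport 80 \
         -j DNAT --to-destination 10.0.0.10:8181
```

Since the Pod's traffic reaches the default namespace over the veth link with its source IP intact, kube2iam can map the request back to the Pod and its IAM role.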

VPC optimizations

Our design is heavily optimized for intra-VPC traffic where IPvlan is the only overhead between the instance’s ethernet interface and the Pod network namespace. We bias toward traffic remaining within the VPC and not transiting the IPv4 Internet where veth and NAT overhead is incurred. Unfortunately, many AWS services require transiting the Internet; however, both DynamoDB and S3 offer VPC gateway endpoints.

While we have not yet implemented IPv6 support in our CNI stack, we have plans to do so in the near future. IPv6 can make use of the IPvlan interface for both VPC traffic as well as Internet traffic, due to AWS’s use of public IPv6 addressing within VPCs and support for egress-only Internet Gateways. NAT and veth overhead will not be required for this traffic.

We’re planning to migrate to a VPC endpoint for DynamoDB and use native IPv6 support for communication to S3. Biasing toward extremely low overhead IPv6 traffic with higher overhead for IPv4 Internet traffic seems like the right future direction.

Ongoing work and next steps

Our stack is composed of a slightly modified upstream IPvlan CNI plugin, an unnumbered point-to-point CNI plugin, and an IPAM plugin that does the bulk of the heavy lifting. We’ve opened a pull request against the CNI plugins repo in the hope of unifying our changes with the upstream IPvlan plugin; our modification lets the IPAM plugin communicate back to the IPvlan driver which interface (ENI device) holds the allocated Pod IP address.

Short of adding IPv6 support, we’re close to being feature complete with our initial design. We’re very interested in hearing feedback on our CNI stack, and we’re hopeful the community will find it a useful addition that encourages Kubernetes adoption on AWS. Please reach out to us via GitHub, email, or Gitter.

Thanks

cni-ipvlan-vpc-k8s is a team effort combining engineering resources from Lyft’s Infrastructure and Security teams. Special thanks to Yann Ramin who coauthored much of the code and Mike Cutalo who helped get the testing infrastructure into shape.