Cilium’s multi-cluster implementation

In the age of dynamically changing and highly complex microservice ecosystems, traditional IP- and port-based management poses problems from both a management and a scaling perspective. Cilium uses BPF, which can forward data within the Linux kernel, and it can be combined via proxy injection with Kubernetes Service-based load balancing or with a service mesh such as Istio.

BPF, short for Berkeley Packet Filter, was initially conceived in 1992 to provide a way to filter packets and to avoid useless packet copies from kernel to user space. Extended BPF (eBPF) is an enhancement over the original BPF (now called cBPF, for classic BPF) with enhanced resources, such as 10 registers and 1–8 byte load/store instructions. Whereas cBPF allows only forward jumps, eBPF allows both backward and forward jumps, so loops are possible, and the kernel's verifier ensures they terminate properly.
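To make this concrete, here is a minimal sketch of an eBPF program (compiled with clang -target bpf) whose loop would be impossible to express in cBPF. The program name and trip count are arbitrary illustrations, not anything Cilium ships:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int loop_demo(struct xdp_md *ctx)
{
    volatile __u32 sum = 0;

    /* A backward jump: illegal in cBPF, legal in eBPF. The verifier
     * accepts this loop because it can prove the trip count is bounded
     * (bounded loops are supported since kernel 5.3). */
    for (__u32 i = 0; i < 16; i++)
        sum += i;

    return XDP_PASS; /* pass every packet; this demo only exercises the loop */
}

char _license[] SEC("license") = "GPL";
```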

Today, architectures such as x86_64, arm64, ppc64 and s390 can JIT-compile an eBPF program into a native opcode image, so that instead of executing through the in-kernel eBPF interpreter, the resulting image runs natively like any other kernel code. tc then installs the program into the kernel’s networking data path, and with a capable NIC, the program can also be offloaded entirely into the hardware.
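As a rough illustration of that pipeline, the sketch below uses libbpf's TC API (available since libbpf 0.6) to load a compiled object and attach it at tc ingress. “prog.o”, “tc_prog” and “eth0” are placeholder names, not Cilium's:

```c
#include <bpf/libbpf.h>
#include <net/if.h>
#include <errno.h>

int main(void)
{
    /* Load the compiled eBPF object into the kernel. */
    struct bpf_object *obj = bpf_object__open_file("prog.o", NULL);
    if (!obj || bpf_object__load(obj))
        return 1;

    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "tc_prog");
    if (!prog)
        return 1;

    /* Hook at tc ingress of eth0; once attached (and JIT-compiled on
     * x86_64/arm64/...), the program runs natively for every packet. */
    LIBBPF_OPTS(bpf_tc_hook, hook,
                .ifindex = if_nametoindex("eth0"),
                .attach_point = BPF_TC_INGRESS);
    LIBBPF_OPTS(bpf_tc_opts, opts,
                .prog_fd = bpf_program__fd(prog));

    int err = bpf_tc_hook_create(&hook);
    if (err && err != -EEXIST)      /* a clsact qdisc may already exist */
        return 1;
    return bpf_tc_attach(&hook, &opts) ? 1 : 0;
}
```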

Kubernetes networking has traditionally relied on iptables (kube-proxy and many network-policy implementations build on it). There are significant issues with iptables under large, diverse traffic conditions: iptables rules are matched sequentially, and updates require recreating and reinstalling the entire rule set in a single transaction, which fits poorly with a highly dynamic container environment. Performance suffers in particular when packets frequently hit rules in the lower parts of the table. BPF, on the other hand, matches against the most specific (“closest”) entry, typically via a single map lookup, rather than iterating over the entire rule set, which makes it a natural fit for network-policy implementation.
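The difference is easy to see in code. In the hedged sketch below, the verdict comes from a single hash-map lookup whose cost does not grow with the number of installed rules; the map layout and the use of skb->mark as an identity carrier are illustrative assumptions, not Cilium's actual policy-map format:

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);   /* source security identity */
    __type(value, __u8);  /* 1 = allow */
} policy_map SEC(".maps");

SEC("tc")
int policy_enforce(struct __sk_buff *skb)
{
    /* Assumption for this sketch: an earlier stage stored the
     * sender's identity in the skb mark. */
    __u32 identity = skb->mark;

    /* O(1) lookup, independent of how many rules are installed. */
    __u8 *allow = bpf_map_lookup_elem(&policy_map, &identity);
    if (allow && *allow)
        return TC_ACT_OK;

    return TC_ACT_SHOT;  /* default deny */
}

char _license[] SEC("license") = "GPL";
```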

Cilium’s multi-cluster capability is built in layers (Pod IP routing, service discovery, load balancing, and so on), and users can adopt all layers or pick only the ones they need.

Cilium CNI Implementation

The Cilium agent, the Cilium CLI client and the CNI plugin run on every node in the cluster (deployed as a DaemonSet). The Cilium CNI plugin performs all tasks related to network plumbing: creating link devices (veth pairs), allocating an IP for the container, configuring the IP address, route table, sysctl parameters, etc. The Cilium agent compiles BPF programs and makes the kernel run them at key points in the network stack.
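As a flavor of the first plumbing step, here is a minimal sketch using libnl-route-3 to create a veth pair with one end moved into a container's network namespace (identified by PID here, a simplification); the interface names and PID are placeholders:

```c
#include <netlink/netlink.h>
#include <netlink/socket.h>
#include <netlink/route/link/veth.h>
#include <stdio.h>

int main(void)
{
    pid_t container_pid = 12345;   /* placeholder: identifies target netns */

    struct nl_sock *sk = nl_socket_alloc();
    if (!sk || nl_connect(sk, NETLINK_ROUTE))
        return 1;

    /* "lxc-xx" stays on the host; "eth0" lands in the container netns. */
    int err = rtnl_link_veth_add(sk, "lxc-xx", "eth0", container_pid);
    if (err)
        fprintf(stderr, "veth add failed: %s\n", nl_geterror(err));

    /* IP allocation, routes and sysctl tuning would follow here. */
    nl_socket_free(sk);
    return err ? 1 : 0;
}
```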

Cilium offers two networking modes:

Overlay Network Mode: The most commonly used and the default network mode. All nodes in the cluster form a mesh of tunnels using a UDP-based encapsulation protocol: VXLAN (the default) or Geneve. In this mode Cilium can form the overlay network automatically, without any configuration by the user, using the “--allocate-node-cidrs” option of kube-controller-manager.

Direct/Native Routing Mode: In this configuration Cilium hands all packets that are not addressed to a local endpoint over to the routing subsystem of the Linux kernel. This setting requires an additional routing daemon such as Bird, Quagga, BGPD or Zebra to announce each node’s allocation prefix to all other nodes via the node’s IP. The BGP solution performs better than the VXLAN overlay and, more importantly, makes container IPs routable without any extra mesh configuration (a sketch of the resulting route installation follows below).
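For direct routing, the net effect of the routing daemon's announcements is a plain kernel route per remote node: the remote node's Pod CIDR reachable via that node's IP. A minimal sketch with libnl-route-3, using placeholder addresses:

```c
#include <netlink/netlink.h>
#include <netlink/socket.h>
#include <netlink/route/route.h>
#include <netlink/route/nexthop.h>
#include <netlink/addr.h>
#include <sys/socket.h>
#include <stdio.h>

int main(void)
{
    struct nl_sock *sk = nl_socket_alloc();
    if (!sk || nl_connect(sk, NETLINK_ROUTE))
        return 1;

    struct nl_addr *dst, *gw;
    /* Placeholder: remote node's Pod CIDR, reachable via its node IP. */
    if (nl_addr_parse("10.2.0.0/24", AF_INET, &dst) ||
        nl_addr_parse("192.168.10.2", AF_INET, &gw))
        return 1;

    struct rtnl_route *route = rtnl_route_alloc();
    rtnl_route_set_dst(route, dst);

    struct rtnl_nexthop *nh = rtnl_route_nh_alloc();
    rtnl_route_nh_set_gateway(nh, gw);
    rtnl_route_add_nexthop(route, nh);

    int err = rtnl_route_add(sk, route, 0);
    if (err)
        fprintf(stderr, "route add failed: %s\n", nl_geterror(err));

    nl_socket_free(sk);
    return err ? 1 : 0;
}
```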

In the default network mode, Cilium’s Kubernetes CNI implementation supports both veth and IPVLAN as link devices.

Cilium creates three virtual interfaces in the host network namespace: cilium_host, cilium_net and cilium_vxlan. Upon starting, the Cilium agent creates a veth pair named ‘cilium_host ←> cilium_net’ and assigns the first IP address of the node’s Pod CIDR to cilium_host, which then acts as the gateway for that CIDR. The CNI plugin generates BPF rules, compiles them and injects them into the kernel to bridge the gap between the veth pairs.
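The gateway setup step can be sketched the same way with libnl-route-3: assign the first address of the Pod CIDR to cilium_host. The CIDR below is a placeholder:

```c
#include <netlink/netlink.h>
#include <netlink/socket.h>
#include <netlink/route/addr.h>
#include <netlink/addr.h>
#include <sys/socket.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
    struct nl_sock *sk = nl_socket_alloc();
    if (!sk || nl_connect(sk, NETLINK_ROUTE))
        return 1;

    /* Placeholder: first address of this node's Pod CIDR 10.1.0.0/24. */
    struct nl_addr *local;
    if (nl_addr_parse("10.1.0.1/24", AF_INET, &local))
        return 1;

    struct rtnl_addr *addr = rtnl_addr_alloc();
    rtnl_addr_set_local(addr, local);
    rtnl_addr_set_ifindex(addr, if_nametoindex("cilium_host"));

    int err = rtnl_addr_add(sk, addr, 0);
    if (err)
        fprintf(stderr, "addr add failed: %s\n", nl_geterror(err));

    nl_socket_free(sk);
    return err ? 1 : 0;
}
```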

Cilium CNI Kubernetes — Implementation

As shown above, when a pod is created, a veth pair is created: one end, named “lxc-xx”, lives in the root namespace, and the other end is attached to the pod namespace. The pod’s default gateway points to the cilium_host IP, and the Cilium agent installs the BPF program required to reply to ARP requests, using the LXC interface’s MAC address for the reply. The next L3 hop of pod-generated traffic is therefore cilium_host, and the next L2 hop is the host end of the veth pair.
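A hedged sketch of what such an ARP responder can look like as a tc BPF program; the gateway MAC and IP are placeholders, and Cilium's real datapath is more involved:

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/if_arp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Placeholder gateway identity; Cilium would use the cilium_host IP
 * and the lxc device's MAC here. */
#define GW_IP bpf_htonl(0x0A000001)   /* 10.0.0.1 */
static const unsigned char gw_mac[ETH_ALEN] = {0x02, 0xaa, 0xbb, 0xcc, 0xdd, 0xee};

struct arp_eth {                      /* ARP payload for Ethernet/IPv4 */
    unsigned char ar_sha[ETH_ALEN];
    __be32        ar_sip;
    unsigned char ar_tha[ETH_ALEN];
    __be32        ar_tip;
} __attribute__((packed));

SEC("tc")
int arp_responder(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_ARP))
        return TC_ACT_OK;

    struct arphdr *arp = (void *)(eth + 1);
    if ((void *)(arp + 1) > data_end)
        return TC_ACT_OK;

    struct arp_eth *p = (void *)(arp + 1);
    if ((void *)(p + 1) > data_end)
        return TC_ACT_OK;

    /* Only answer "who-has <gateway>" requests coming from the pod. */
    if (arp->ar_op != bpf_htons(ARPOP_REQUEST) || p->ar_tip != GW_IP)
        return TC_ACT_OK;

    /* Rewrite the request into a reply in place. */
    __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, gw_mac, ETH_ALEN);
    arp->ar_op = bpf_htons(ARPOP_REPLY);
    __builtin_memcpy(p->ar_tha, p->ar_sha, ETH_ALEN);
    p->ar_tip = p->ar_sip;
    __builtin_memcpy(p->ar_sha, gw_mac, ETH_ALEN);
    p->ar_sip = GW_IP;

    /* Send the reply back out the same veth, i.e. into the pod. */
    return bpf_redirect(skb->ifindex, 0);
}

char _license[] SEC("license") = "GPL";
```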

In multi-host networking, if VXLAN is used, Cilium creates a cilium_vxlan device on each host and performs VXLAN encapsulation/decapsulation in software. Cilium uses the VXLAN device in metadata mode, which means one device can send to and receive from multiple remote addresses. Cilium uses the first public IPv4 address it finds as the VXLAN VTEP. Cilium uses BPF for LWT (Lightweight Tunnel) encapsulation: every network packet emitted by a pod is encapsulated in a VXLAN or Geneve frame, which is transmitted inside a standard UDP packet.
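The encapsulation step can be sketched as a tc BPF program that attaches tunnel metadata and redirects to the VXLAN device; the ifindex, VTEP address and VNI below are placeholders, not Cilium's actual tunnel-map logic:

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define VXLAN_IFINDEX 42   /* placeholder: ifindex of cilium_vxlan */

SEC("tc")
int vxlan_encap(struct __sk_buff *skb)
{
    struct bpf_tunnel_key key = {};

    key.remote_ipv4 = bpf_htonl(0xC0A80102); /* placeholder VTEP 192.168.1.2 */
    key.tunnel_id   = 4096;                  /* placeholder VNI */
    key.tunnel_ttl  = 64;

    /* Attach the tunnel metadata; the metadata-mode VXLAN device builds
     * the outer UDP/VXLAN header from it on transmit. */
    if (bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_ZERO_CSUM_TX))
        return TC_ACT_SHOT;

    return bpf_redirect(VXLAN_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";
```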

Cilium CNI Kubernetes — Interfaces

As shown above, in a multi-host networking setup Cilium creates tunnels between nodes, where the tunnel endpoint map is programmed by the Cilium agent.

Cilium Cluster Mesh

ClusterMesh is Cilium’s multi-cluster implementation. It provides Pod IP routing across multiple Kubernetes clusters at native performance, via tunneling or direct routing, without requiring any gateways or proxies.

Trying out ClusterMesh with three all-in-one (master+worker) Kubernetes clusters deployed on Vultr in different regions:

Sample Topology — Multi-region Clusters

Cilium CNI is installed on each cluster with a unique, non-overlapping Pod CIDR, as shown below: