An illustrated guide to Kubernetes Networking [Part 3]

Everything I learned about the Kubernetes Networking

This is the third part of the series about Kubernetes Networking. If you haven’t read the Part 1 and Part 2 yet, I recommend checking those out first.

Cluster dynamics

Due to the every-changing dynamic nature of Kubernetes, and distributed systems in general, the pods (and consequently their IPs) change all the time. Reasons could range from desired rolling updates and scaling events to unpredictable pod or node crashes. This makes the Pod IPs unreliable for using directly for communications.

Label selector in Kubernetes Service Object

Enter Kubernetes Services — a virtual IP with a group of Pod IPs as endpoints (identified via label selectors). These act as a virtual load balancer, whose IP stays the same while the backend Pod IPs may keep changing.

The whole virtual IP implementation is actually iptables (the recent versions have an option of using IPVS, but that’s another discussion) rules, that are managed by the Kubernetes component — kube-proxy. This name is actually misleading now. It used to work as a proxy pre-v1.0 days, which turned out to be pretty resource intensive and slower due to constant copying between kernel space and user space. Now, it’s just a controller, like many other controllers in Kubernetes, that watches the api server for endpoints changes and updates the iptables rules accordingly.

Iptables DNAT

Due to these iptables rules, whenever a packet is destined for a service IP, it’s DNATed (DNAT=Destination Network Address Translation), meaning the destination IP is changed from service IP to one of the endpoints — pod IP — chosen at random by iptables. This makes sure the load is evenly distributed among the backend pods.

5-tuple entries in conntrack table

When this DNAT happens, this info is stored in conntrack — the Linux connection tracking table (stores 5-tuple translations iptables has done: protocol, srcIP, srcPort, dstIP, dstPort). This is so that when a reply comes back, it can un-DNAT, meaning change the source IP from the Pod IP to the Service IP. This way, the client is unaware of how the packet flow is handled behind the scenes.

So by using Kubernetes services, we can use same ports without any conflicts (since we can remap ports to endpoints). This makes service discovery super easy. We can just use the internal DNS and hard-code the service hostnames. We can even use the service host and port environment variables preset by Kubernetes.

Protip: Take this second approach and save a lot of unnecessary DNS calls!

Outbound traffic

The Kubernetes services we’ve talked about so far work within a cluster. However, in most of the practical cases, applications need to access some external api/website.

Generally, nodes can have both private and public IPs. For internet access, there is some sort of 1:1 NAT of these public and private IPs, especially in cloud environments.

For normal communication from node to some external IP, source IP is changed from node’s private IP to it’s public IP for outbound packets and reversed for reply inbound packets. However, when connection to an external IP is initiated by a Pod, the source IP is the Pod IP, which the cloud provider’s NAT mechanism doesn’t know about. It will just drop packets with source IPs other than the node IPs.

So we use, you guessed it, some more iptables! These rules, also added by kube-proxy, do the SNAT (Source Network Address Translation) aka IP MASQUERADE. This tells the kernel to use IP of the interface this packet is going out from, in place of the source Pod IP. A conntrack entry is also kept to un-SNAT the reply.

Inbound traffic

Everything’s good so far. Pods can talk to each other, and to the internet. But we’re still missing a key piece — serving the user request traffic. As of now, there are two main ways to do this:

NodePort/Cloud Loadbalancer (L4 — IP and Port)

Setting the service type to NodePort assigns the service a nodePort in range 30000-33000 . This nodePort is open on every node, even if there’s no pod running on a particular node. Inbound traffic on this NodePort would be sent to one of the pods (it may even be on some other node!) using, again, iptables.

A service type of LoadBalancer in cloud environments would create a cloud load balancer (ELB, for example) in front of all the nodes, hitting the same nodePort.

Ingress (L7 — HTTP/TCP)

A bunch of different implements, like nginx, traefik, haproxy, etc., keep a mapping of http hostnames/paths and the respective backends. This is entry point of the traffic over a load balancer and nodeport as usual, but the advantage is that we can have one ingress handling inbound traffic for all the services instead of requiring multiple nodePorts and load balancers.

Network Policy

Think of this like security groups/ACLs for pods. The NetworkPolicy rules allow/deny traffic across pods. The exact implementation depends on the network layer/CNI, but most of them just use iptables.