And how this will help bring highly available services to your entire company

ID Analytics has been running workloads on Kubernetes for a little over two years. We have both an on-premise and a cloud footprint, and a pain point has always been that running our on-premise bare-metal environment does not offer the same mature out-of-the-box APIs that we get when running on AWS, Google Cloud, and Azure. Kubernetes fills that gap, providing us a set of powerful, open, and extensible APIs that allow us to interact with our underlying network, compute, and storage.

One of the challenges we faced when running Kubernetes on bare metal is one that nearly everyone faces: how do we access our Kubernetes services from outside of the cluster?

First, a refresher. Kubernetes services are, like everything else in Kubernetes, API objects that can be expressed in YAML. For example, the following manifest:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: mysql
spec:
  selector:
    type: read-db
  ports:
    - name: jdbc
      protocol: TCP
      port: 3306
```

This will produce a service and a corresponding ClusterIP that applications within the Kubernetes cluster can call by name (in this case, simply “mysql”) or by ClusterIP.
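To make the ClusterIP assignment concrete, the service can be applied and inspected with kubectl. A quick sketch, assuming the manifest above is saved as mysql-service.yaml (a hypothetical filename); the ClusterIP used in this post's examples is 10.100.1.5, but your cluster will assign its own:

```shell
# Create the service from the manifest above.
kubectl apply -f mysql-service.yaml

# Show the service and its assigned ClusterIP.
kubectl get service mysql
```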

MySQL ClusterIP listening on native port of 3306

This is really cool! Now any application within the Kubernetes cluster can just set mysql as the name of the SQL server (or the clusterIP address 10.100.1.5), and Kubernetes magically handles getting the traffic to the right MySQL pod. But what about applications outside of the Kubernetes cluster?

No luck. There are generally three solutions to this:

1. NodePort: this will expose a port on each of your hosts that you can use to reach your service. The downside of this approach is twofold. First, you are back to dealing with port management: apps can no longer assume sane things like HTTPS being on port 443, or MySQL running on port 3306. Instead, the service may live on port 32042 in PROD and 32012 in DEV. Second, you either need to add a load balancer in front of this setup or tolerate a single point of failure in your application.

2. LoadBalancer: this construct currently only works on cloud providers such as AWS and GKE. If you set this type on an on-premise system, Kubernetes will create a ClusterIP and NodePort for you, then wait for you to create and provision your own load balancer. Not very helpful for us.

3. Ingress: with ingress, you can run a software load balancer such as nginx, expose it on ports 80/443 on all your hosts, and then control the routing of HTTP traffic to Kubernetes services. This is fantastic for HTTP/HTTPS layer 7 traffic, but it doesn't currently work for layer 4 traffic such as MySQL.
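To make option 1 concrete, here is a sketch of the same MySQL service exposed as a NodePort (the nodePort value is illustrative; if you omit it, Kubernetes picks one at random from the node-port range, 30000-32767 by default):

```yaml
kind: Service
apiVersion: v1
metadata:
  name: mysql
spec:
  type: NodePort
  selector:
    type: read-db
  ports:
    - name: jdbc
      protocol: TCP
      port: 3306        # port on the ClusterIP inside the cluster
      nodePort: 32042   # port exposed on every host; must fall in the node-port range
```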

From the above, only option 1 (NodePort) was viable for our needs, but the drawbacks were too steep for us to be satisfied. So after some digging through GitHub issues, I stumbled upon this comment by the brilliant Tim Hockin:

Wait, so there is a way to send traffic to the service IP (ClusterIP) directly?

It appeared the key to being able to access any Kubernetes service from anywhere on our network was some weird networking thing called ECMP. But what is it?

We’ll need to understand how networking works in Kubernetes. Kubernetes requires that each pod be assigned a unique IP that is addressable by any other pod in the cluster. There are many different networking plugins that satisfy this requirement in different ways. In our case, we are using flannel with the host-gw backend, which assigns a /24 network (254 usable IPs) to each host running the Kubelet service. Other backends do this via different methods, but the results are all the same. Imagine the following set-up:

Hypothetical server/pod network setup

In this case, the first pod scheduled on coreos-1 (let's call it Pod A) will get the IP 10.2.1.1, and the first one scheduled on coreos-2 (let's call it Pod B) will get the IP 10.2.2.1. For Pod A to reach Pod B, it sends the traffic to the server coreos-2, which routes it internally to Pod B.

We have long known that we could let systems outside of Kubernetes access pod IPs by adding static routes on our networking gear, such that the routers are made aware that coreos-1 is the next hop for traffic destined to 10.2.1.0/24, and that coreos-2 is the next hop for traffic destined to 10.2.2.0/24. Once we did that, any server on our network could route traffic to Pod A and Pod B. But what Tim Hockin is explaining is that our servers will also accept traffic destined for the ClusterIP that Kubernetes creates whenever you create a service. To understand the details of why this works, this article is a good starting point.
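For illustration, on a Linux machine acting as the router those static routes would look something like this (the next-hop addresses for coreos-1 and coreos-2 are hypothetical; dedicated routing gear expresses the same idea in its own syntax):

```shell
# coreos-1 is the next hop for its pod subnet.
ip route add 10.2.1.0/24 via 192.168.0.101

# coreos-2 is the next hop for its pod subnet.
ip route add 10.2.2.0/24 via 192.168.0.102
```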

So how do we get external traffic destined for our ClusterIP range into our cluster? Our first temptation was to use static routes, which had worked for sending traffic to pod IPs. But if you recall, a static route sends destination traffic to one particular server: we told the router that traffic for pod IPs 10.2.1.0/24 should go to the server coreos-1. We want traffic to be load balanced across multiple servers, so that services remain accessible even if an individual server goes down. This is where ECMP comes in.

ECMP stands for equal-cost multi-path routing. In this setup, you tell your router that every server in your cluster is a valid route to the ClusterIP range that you defined in your kube-controller-manager settings (--service-cluster-ip-range). This is definitely one of those times you are going to want to sit down with your networking team. Basically, you will enter routes that tell your router the following:

You are telling your router that the next hop for the ClusterIP range you specified in --service-cluster-ip-range is every server in the cluster.
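As a sketch, on a Linux router this would be a single multipath route covering the service CIDR, with every cluster node as an equal-cost next hop (the 10.100.0.0/16 range and the node addresses here are assumptions; real routing gear has its own configuration language for the same idea):

```shell
# One route for the whole --service-cluster-ip-range,
# load balanced across all three nodes.
ip route add 10.100.0.0/16 \
    nexthop via 192.168.0.101 weight 1 \
    nexthop via 192.168.0.102 weight 1 \
    nexthop via 192.168.0.103 weight 1
```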

Once this has been set up, when a developer wants to access the MySQL service from their development environment, traffic they send to the static ClusterIP (example from above was 10.100.1.5) will be routed in a round robin fashion to the three CoreOS nodes.

The upshot is that when coreos-1 performs an automatic update and reboot, the router can remove it from the routing tables and send traffic only to coreos-2 and coreos-3. How we orchestrate that process is a large topic I may delve into in a future post, but depending on your routing gear, notifying your routers that hosts are available or unavailable can be either automatic or, as in our case, somewhat challenging. The result of that work is that our developers will not notice this routine maintenance from their dev machines, all without resorting to the management nightmare of NodePorts fronted by physical load balancers. We get to use the native Kubernetes service primitive, the ClusterIP, and expose it to the rest of our network.

This opens up a lot of possibilities. Here's one fun example to get your thoughts going: your kube-dns service can now be reached from outside your cluster on port 53 of its ClusterIP:

Note: The ClusterIP of your kube-dns service will be different
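A sketch of what such a lookup looks like from a machine outside the cluster, assuming the kube-dns ClusterIP of 10.100.0.10 and the dev.cluster.local domain used later in this post:

```shell
# Ask kube-dns directly for a service record.
dig +short @10.100.0.10 kubernetes.default.svc.dev.cluster.local
```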

You can now tell your company's internal DNS provider that kube-dns (10.100.0.10) should be the DNS server responsible for serving your Kubernetes zone (in our cluster, that would be dev.cluster.local; to find yours, look at your kube-dns deployment for the --domain property. If none is specified, it is likely just cluster.local).

Now our MySQL service (at 10.100.1.5) is automatically resolvable and reachable by the rest of your environment at the DNS name mysql.default.svc.dev.cluster.local:

Notice the response to the ClusterIP comes from the host coreos-1, but in subsequent requests any of the other hosts may respond. Native ClusterIP load balancing!

Going forward, exposing Kubernetes services to the rest of the network will be just a matter of creating new services against the Kubernetes API. Your network, kube-proxy, and kube-dns will handle the rest.