In this article I want to talk about a problem I ran into with graceful shutdown of applications during rolling updates in Kubernetes, and how to fix it by adding a short wait time.

About graceful shutdown

This is how most applications run inside Kubernetes:

We have a stateless callee application that provides a REST API. We create a Deployment with 3 replicas, and also a Service service-callee for the Deployment, so all containers inside the cluster can consume the API at http://service-callee/api.
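For reference, a minimal sketch of these manifests might look like the following (names, labels, and the port are assumptions inferred from the URL above):

```yaml
# Hypothetical manifests for the setup described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-callee
spec:
  replicas: 3
  selector:
    matchLabels:
      app: callee
  template:
    metadata:
      labels:
        app: callee
    spec:
      containers:
        - name: callee
          image: callee:v1   # placeholder image name
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: service-callee
spec:
  selector:
    app: callee
  ports:
    - port: 80
      targetPort: 3000
```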

When we want to deploy a new version of callee, Kubernetes uses the RollingUpdate strategy: it stops the old Pods and at the same time creates new Pods running the newer version of our deployment.

In more detail, what Kubernetes does is:

Send SIGTERM to the to-be-deleted Pod

Remove the Pod IP from the Service’s Endpoints

Start the new Pod, wait for it to be Ready

Add the new Pod IP to the Service’s Endpoints

To make sure we don’t fail any request from our API consumer (the caller), the callee application has to shut down gracefully. More specifically, when it receives the SIGTERM signal, it has to:

stop accepting any new requests

wait for all in-flight requests to finish

In a simple NodeJS app, the code looks like this:
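(The original snippet was an embedded gist; below is a minimal sketch of it, assuming an Express app listening on port 3000.)

```js
const express = require('express');
const app = express();

app.get('/api', (req, res) => res.json({ ok: true }));

const server = app.listen(3000);

process.on('SIGTERM', () => {
  // server.close() stops accepting new connections and fires its
  // callback once all in-flight requests have finished.
  server.close(() => process.exit(0));
});
```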

This is normally how we do graceful shutdown in Kubernetes. It calls server.close() immediately after receiving SIGTERM, and server.close() does exactly the two things we listed above.

But this approach has a big problem: it does not prevent failed requests in Kubernetes.

The problem

To show the problem, I made a simple demo project on GitHub. It creates a small Kubernetes cluster on GKE, deploys our callee app from above, and also a caller app:

The caller is just a simple NodeJS process. It calls the callee’s API at 10 requests/second. We want to see if the caller gets any failed requests during the callee’s rolling update.
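For reference, the caller can be as simple as the following sketch (the actual code in the repo may differ; the URL matches the Service name above):

```js
const http = require('http');

// Hit the callee’s API 10 times per second and log every failure.
setInterval(() => {
  http
    .get('http://service-callee/api', (res) => {
      res.resume(); // drain the response body
      if (res.statusCode !== 200) console.error('failed:', res.statusCode);
    })
    .on('error', (err) => console.error('failed:', err.message));
}, 100); // every 100ms = 10 requests/second
```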

Let’s trigger a rolling deployment and see what happens on the caller side. We can simply add a random environment variable to the callee’s container; this triggers a rolling update without changing the container image:

kubectl set env deployment/service-callee RANDOM=12345

And let’s look at both the callee’s and the caller’s logs in the GCP logging viewer:

As we can see, during the rolling update there are a lot of failed requests on the caller side. This shows the callee is not shutting down gracefully. Why did this happen?

How Kubernetes does a rolling update

To understand why those network failures happen, we have to understand how Kubernetes does a rolling update.

First, let’s see how Kubernetes routes network traffic from a Service to Pods. When we create a Service in Kubernetes, it creates an Endpoints object for this Service, finds all Pods that match the Service’s selector, and puts all the Pods’ IP addresses and ports into the Endpoints object. On every node there is a daemon process, kube-proxy, which watches the apiserver for updates to all Services and Endpoints, and finally creates iptables rules (SNAT and DNAT) to route TCP packets from the Service IP to the Pod IPs.
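To illustrate, the NAT rules look roughly like this (a simplified, hypothetical sketch; the real chain names are hashes generated by kube-proxy, and the IPs here are made up):

```
# Traffic to the Service IP jumps to a per-Service chain:
-A KUBE-SERVICES -d 10.0.0.50/32 -p tcp --dport 80 -j KUBE-SVC-CALLEE
# The per-Service chain load-balances across one chain per endpoint:
-A KUBE-SVC-CALLEE -m statistic --mode random --probability 0.333 -j KUBE-SEP-POD1
-A KUBE-SVC-CALLEE -m statistic --mode random --probability 0.500 -j KUBE-SEP-POD2
-A KUBE-SVC-CALLEE -j KUBE-SEP-POD3
# Each endpoint chain DNATs to one Pod IP and port:
-A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.4.1.12:3000
```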

Then, let’s see the rolling update. When we use kubectl to deploy a new version of our application, this is what happens inside Kubernetes:

(diagram: rolling update)

① The deployment-controller receives the update from the apiserver; it creates a new ReplicaSet and decreases the replica count of the old ReplicaSet (by calling the apiserver’s API).

② The replicaset-controller receives the replica-decrease update; it picks one Pod from the replicas and deletes it (also by calling the apiserver).

③ Both kubelet and the endpoints-controller watch the apiserver for Pod deletions. They receive the update independently, and we don’t know which one receives it first; that’s why I put them at the same number (③). kubelet sends SIGTERM to the containers in that Pod, and the endpoints-controller removes the Pod’s IP and updates the Endpoints object (by calling the apiserver).

④ kube-proxy receives the Endpoints update, then updates the iptables rules accordingly: it deletes the Pod’s IP from the NAT table. After that, the caller’s requests won’t be routed to that Pod anymore.

Note here: all these controllers and daemon processes work independently; they all get change events from one source, the apiserver, using the watch API.

The problem here is that we cannot determine who receives the Pod deletion event first: kubelet or kube-proxy. It depends on when the apiserver sends the events and on the network condition of each process. What’s more, kube-proxy does not update the iptables rules immediately after it receives the event. It uses a BoundedFrequencyRunner to batch all updates and apply them within a minSyncPeriod; in GKE this period is 10 seconds. This means in the worst case, the iptables rules will be updated 10 seconds after the Pod deletion event arrives.
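For reference, minSyncPeriod is part of kube-proxy’s configuration; the relevant fields look roughly like this (the values are illustrative, using the GKE period mentioned above):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
iptables:
  minSyncPeriod: 10s   # updates are batched and applied at most once per this interval
  syncPeriod: 30s
```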

Now we can explain why those request failures happen. The callee application closes the server immediately when it receives SIGTERM, but at that moment kube-proxy may not have received the Pod deletion event yet, or it may still be waiting for the sync period. So the iptables rules are not updated and still route traffic to our Pod, but the callee rejects all new requests (due to server.close()), so the caller gets connection refused errors.

A proper solution for a better graceful shutdown

Now we know why the graceful shutdown does not work. How can we fix it? Instead of closing the server immediately, we can simply wait for a period of time to give kube-proxy a chance to update the iptables rules.

So how long should we wait? Since in GKE the minSyncPeriod is 10s, we can wait for 10s and see if it works. Let’s change the callee’s code a little bit:
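Here is a sketch of the change (the exact code lives in the demo repo; it builds on the Express sketch from earlier):

```js
process.on('SIGTERM', () => {
  console.log('SIGTERM received, waiting 10s before close');
  // Keep serving requests while kube-proxy catches up with the
  // Endpoints change and removes this Pod from the iptables rules.
  setTimeout(() => {
    server.close(() => process.exit(0));
  }, 10000);
});
```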

And we trigger the rolling update (twice), same as before. Here is the callee’s log, followed by the caller’s log.

(screenshot: callee — wait 10s before close)

(screenshot: caller)

We can see each callee now waits 10 seconds before server.close(), then exits normally. And the caller has no failure logs, which means no failed requests. Just waiting for a period mitigates the network issue.

How about Readiness?

Some articles recommend failing the health endpoint when the application receives SIGTERM, so that Kubernetes detects the readiness probe failure and removes the Pod IP from the Service’s Endpoints immediately.

But according to our analysis above, this is not necessary, mainly for two reasons:

kubelet does not touch the iptables rules regarding Service routing; that is done solely by kube-proxy.

kube-proxy only listens to Endpoints change events, which in our case are triggered by the Pod deletion. The readiness failure will not have any effect on kube-proxy.

To verify this assumption, it’s easy to run another test using our demo project. We can set a readiness probe for the callee container and make its period short enough:

readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 2
  periodSeconds: 1
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 1

And in the callee’s code, we make the health endpoint fail immediately when we receive SIGTERM, so Kubernetes will detect the readiness failure within at most 1 second. With that, let’s shorten the waiting time to 5s (we could set it to 1s, but let’s be nice and wait longer):
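A sketch of this variant (again assuming the Express app from earlier; the exact code is in the demo repo):

```js
let shuttingDown = false;

// The readiness probe hits this endpoint; it starts failing as soon as
// shutdown begins, so Kubernetes marks the Pod NotReady within ~1 second.
app.get('/health', (req, res) => {
  res.status(shuttingDown ? 500 : 200).end();
});

process.on('SIGTERM', () => {
  shuttingDown = true;
  // Wait only 5s this time before closing the server.
  setTimeout(() => server.close(() => process.exit(0)), 5000);
});
```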

If the readiness assumption were right, Kubernetes would change the Service routing immediately once it detects the readiness failure. Then we should not see any failed requests during the waiting time, because we wait much longer than Kubernetes needs to detect the readiness failure.

Here are the logs of one of the callees, followed by the caller’s log:

We see the same failures happen again. This shows that the readiness setting has no effect on preventing the failed requests.

Conclusion

As we can see from the experiments, for a better graceful shutdown, the application needs to wait for a period after receiving SIGTERM before closing the server, to avoid rejecting requests that are still being routed to it.

The 10s is not a guaranteed value, but rather a convenient choice. It’s better to test with your own Kubernetes setup and verify that it works. In large clusters (say, more than 1000 nodes), the wait might need to be longer, because we have to wait for the kube-proxy on every node to finish updating its iptables rules.

References

You can try out and test different scenarios using my GitHub repo here: https://github.com/justlaputa/kubernetes-graceful-shutdown-poc

Also, though not exactly about graceful shutdown, KubeCon has a great talk about the scaling problem with Endpoints:

https://www.youtube.com/watch?v=Y5JOCCbJ_Fg