A few months ago I noticed weird connection timeouts when updating a Deployment within Kubernetes. When updating a Deployment, there would be a window of anywhere from 30 seconds to two minutes during which connections to the Service backed by the Deployment would time out or fail completely. At first, it sounded like the application in question wasn’t shutting down gracefully. This theory was quickly disproven by manually testing (via curl) the liveness and readiness endpoints as new pods were added and old pods were removed during a Deployment update. I also noticed that other applications were experiencing the same delay. Back to the drawing board.
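
For reference, the manual check looked roughly like the following. The probe path and port here are illustrative assumptions; substitute whatever your containers actually expose.

kubectl port-forward pod/[pod name] 8080:8080 &
curl -i http://localhost:8080/healthz
curl -i http://localhost:8080/readyz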

After a few hours of questioning my career choice, I remembered that a Service doesn’t work directly with a Deployment but is purely a logical abstraction over a group of pods selected by label and, more importantly, that a Service is backed by an Endpoints object which is updated as pods transition between ready and unready. This sounded like a good place to continue the investigation. I used the handy watch command to watch the Endpoints for the Service in question while updating the Deployment:

watch kubectl describe endpoints [endpoint name]

I noticed that as old pods were removed, they remained in the Endpoints ready list for anywhere from 30 seconds to a few minutes after the pod had been deleted. Additionally, new pods weren’t shown anywhere in the Endpoints until a few minutes after they had started. This seemed like the source of the connection failures, but now the question was: why?
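
Trimmed down, the output looked something like this (the names, IPs, and port are made up for illustration). The point is that an address under Addresses can belong to a pod that has already been deleted:

Name:         my-service
Namespace:    default
Subsets:
  Addresses:          10.2.1.14,10.2.3.7
  NotReadyAddresses:  10.2.4.21
  Ports:
    Name  Port  Protocol
    ----  ----  --------
    http  8080  TCP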

After a few days of frustration, I had a moment of clarity and realized that I should debug the component responsible for updating Endpoints: kube-controller-manager. After turning up the logging verbosity, I started seeing interesting entries.
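
One way to do this, assuming a kubeadm-style cluster where kube-controller-manager runs as a static pod (both assumptions, as is the verbosity level, which varies by Kubernetes version), is to edit the manifest and bump the klog verbosity:

sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
# add --v=4 to the container command; the kubelet restarts the static pod automatically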

I0412 22:59:59.914517 1 request.go:638] Throttling request took 2.489742918s, request: GET:https://10.3.0.1:443/api/v1/namespaces/[some namespace]/endpoints/[some endpoints]

That was interesting, but two seconds is a lot less than a few minutes. After reading the kube-controller-manager code, I saw the problem. kube-controller-manager is responsible for reconciling the current state of the cluster with the desired state. Within the context of our issue, kube-controller-manager runs an endpoints controller that watches pod lifecycle events and updates Endpoints based on these events. The endpoints controller runs a static set of workers that process these events and do the actual Endpoints updating. If enough endpoint requests are throttled, all of the workers end up busy waiting on throttled requests and new events queue up behind them. If the queue gets deep enough, Endpoints updates can take many seconds (or, in our case, minutes).
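
If you'd rather confirm the queueing directly than infer it from logs, kube-controller-manager exposes workqueue metrics on its metrics endpoint. The port, the metric names, and whether the endpoint is reachable without extra auth all depend on your Kubernetes version and configuration, so treat this as a sketch:

curl -ks https://localhost:10257/metrics | grep 'workqueue_depth{name="endpoint"}'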

My first attempt at resolving this issue was adding more endpoints controller workers by tweaking the --concurrent-endpoint-syncs flag of kube-controller-manager. This had little effect. After more code reading and research, I discovered the --kube-api-qps flag and its sister --kube-api-burst. These flags set the client-side rate limit for all Kubernetes API requests made by any controller within kube-controller-manager (including the endpoints controller). The default QPS is 20, which is clearly not enough for the workload our cluster runs. After some trial and error, I determined more suitable values that made the above log entry disappear. In our case, 300 and 325 were the magic numbers. With that change, Endpoints started to update immediately as pods were added and removed.
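
A sketch of what the final invocation looks like, with everything except the two flags elided (where exactly you set these depends on how kube-controller-manager is run, e.g. a static pod manifest or a systemd unit):

kube-controller-manager --kube-api-qps=300 --kube-api-burst=325 [other flags unchanged]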

Just a fair warning: the larger those values are, the more load is placed on the API server and etcd. In our case, the extra load was worth solving this problem once and for all.