My team attempted to migrate one application tier to Kubernetes this week. Unfortunately we aborted because of increased timeouts between applications and our own inability to debug and resolve the issue. This post is about what went wrong, what we know, and what may be the cause. My team needs help debugging this issue, so please respond (beers on me!) if you can help.

The Setup

Twenty-odd applications compose our entire product. We use Apache Thrift for communication between internal services. The migration goal was to move a subset of them to Kubernetes.

Our approach was to create one Kubernetes LoadBalancer service to expose all Thrift services to our existing infrastructure. I’ll simply refer to these two things as “the ELB” and “apollo” going forward. Unique port numbers on the ELB identify each service. We use Thrift with the binary protocol over TCP. Our Kubernetes cluster also lives in a separate VPC, which is peered with apollo’s VPC. Once the service was created and everything reported healthy, we updated apollo’s application configuration to find the services in Kubernetes. Technically this happens by setting environment variables like FOO_THRIFT_URL=tcp://the-k8s-load-balancer-host-name:FOO_SERVICE_PORT. This worked fine in our initial tests, so we decided to move forward with switching over more production traffic. If things caught fire it would be easy to roll back.
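
A minimal sketch of what such a Service might look like, applied straight from the shell. The names, ports, and selector below are placeholders rather than our real manifest, and the sketch glosses over how traffic on each port reaches the right backend pods:

# One LoadBalancer Service exposing several Thrift services on distinct TCP ports.
# Everything here (names, ports, selector) is a placeholder.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: thrift-elb
spec:
  type: LoadBalancer
  selector:
    tier: thrift            # hypothetical label on the pods backing these ports
  ports:
  - name: foo-service
    protocol: TCP
    port: 9090              # the FOO_SERVICE_PORT clients dial on the ELB
    targetPort: 9090
  - name: bar-service
    protocol: TCP
    port: 9091
    targetPort: 9091
EOF

On AWS this produces a single classic ELB with one TCP listener per port in the spec, which is the “ELB” referenced throughout this post.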

Lastly, some particulars about the cluster. We used Kops 1.6.0 to build it. Here’s the version:

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.6", GitCommit:"114f8911f9597be669a747ab72787e0bd74c9359", GitTreeState:"clean", BuildDate:"2017-03-28T13:54:20Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T09:42:05Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
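
For completeness, standing up a cluster with Kops inside an existing, peered VPC looks roughly like this. The cluster name, state bucket, zones, VPC ID, and CIDR below are placeholders, not our actual values:

# Rough shape of a Kops 1.6.0 cluster build into an existing (peered) VPC.
# All values below are placeholders.
kops create cluster \
  --name=k8s.example.com \
  --state=s3://example-kops-state \
  --zones=us-east-1a,us-east-1b \
  --vpc=vpc-0a1b2c3d \
  --network-cidr=10.100.0.0/16 \
  --node-count=3
kops update cluster k8s.example.com --yes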

What Happened

We triggered an apollo puppet update to switch traffic over to Kubernetes. We noticed an absurdly high BackendConnectionErrors count on the ELB. AWS’ definition of this metric is:

The number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also includes any connection errors related to health checks.

Here’s our data from this time period:

ELB metrics during the production migration

The data starts when traffic switched over (~12:00). The purple line is BackendConnectionErrors and the blue line is RequestCount. AWS defines RequestCount as:

The number of requests completed or connections made during the specified interval (1 or 5 minutes).
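
Both numbers come straight from CloudWatch, so you can pull them with the AWS CLI if you want to follow along. A sketch, with a placeholder load balancer name and time window:

# Sum BackendConnectionErrors and RequestCount for the ELB over the incident window.
# "my-thrift-elb" and the timestamps are placeholders.
for metric in BackendConnectionErrors RequestCount; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ELB \
    --metric-name "$metric" \
    --dimensions Name=LoadBalancerName,Value=my-thrift-elb \
    --statistics Sum \
    --period 300 \
    --start-time 2017-06-01T12:00:00Z \
    --end-time 2017-06-02T00:00:00Z
done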

My understanding is that only ~10% of connection attempts succeeded, depending on the time. In other words, there were a ton of failed connections. Our backend architecture exposes RESTful JSON APIs to clients (web application, Android, iOS); those API processes act as proxies/orchestrators in front of multiple internal Thrift services. Thus Thrift RPC failures usually propagate as 5xx errors in the HTTP API tier.

Initially these bad connections did not correlate with those metrics. We let things run for about 12 hours to see if anything changed or if we could debug it. The next afternoon we realized that 5xx rates in the HTTP API tier and latency had both increased since the switch over. We rolled back at that point.

Observations

We attempted to debug the situation while the cluster served traffic. We could not determine the root cause, but we did make some interesting observations. Consider the following graph.