Earlier this month we finished migrating the majority of our production components from our hand-rolled container infrastructure to Kubernetes. Our previous post discussed the problems we hit getting this far. This post covers recent production issues: how we got into trouble, how we got out, and how we plan to stay out.

Our first migration strategy exposed Kubernetes components through a single LoadBalancer Service with ~14 ports. We could not figure out why this caused immediate connection errors on those ELBs. Switching to multiple LoadBalancer Services with a single port each “solved” the problem. I quote “solved” because we never determined the root cause; we moved on since the setup was only relevant during the migration phase. Now that all components run in Kubernetes, we’ve switched to ClusterIP Services.
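As a sketch, a single-port ClusterIP Service of the kind we ended up with looks something like this (the name, labels, and port are illustrative, not our actual manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: search-service    # illustrative name
spec:
  type: ClusterIP         # cluster-internal only; no ELB involved
  selector:
    app: search-service   # matches the Deployment's pod labels
  ports:
    - port: 9090          # illustrative Thrift port
      targetPort: 9090
```

Because ClusterIP Services are only reachable inside the cluster, they sidestep the ELB behavior entirely once nothing outside the cluster needs to call the component directly.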

We also changed our migration approach to mitigate the risk of a big-bang rollout. We used the existing HAProxy in our old infrastructure to do percentage-based load balancing between the container running in the old infrastructure and the matching Kubernetes LoadBalancer. That worked like a charm! Moral of the story: I’m now suspicious of Services with a large number of ports (say, more than two).
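The percentage-based split can be sketched with HAProxy server weights; the backend name, hostnames, and the 90/10 split below are illustrative assumptions, not our actual config:

```
backend core_api
    balance roundrobin
    # 90% of traffic stays on the old infrastructure,
    # 10% goes to the Kubernetes LoadBalancer endpoint.
    server legacy old-infra.internal:8080    weight 90
    server kube   k8s-elb.example.com:8080   weight 10
```

Shifting traffic is then just a matter of adjusting the weights and reloading HAProxy, which makes the rollout (and any rollback) gradual instead of all-or-nothing.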

Where are we now after about a month in production?

Situation Report

We have ~20 Deployments in our application. Two of them, the Core API and the Search Service, consume ~80% of cluster CPU capacity.

The Core API powers our web site and our Android and iOS applications. We make classifieds sites, so the most common product interaction is searching for ads or viewing an ad. Thus the majority of API calls include calls to the Search Service, which translates application-level searches into Elasticsearch queries. The horizontal scale (replica counts) of the Core API and Search Service has increased by 2x or more since migrating to Kubernetes. We’re seeing severe latency in these two components, which directly impacts customer-facing flows: our SERP (Search Engine Results Page) availability is ~75% during peak hours. Here’s the flow:

1. An application (web site, Android, or iOS) requests GET /v1/serp from the Core API.
2. The Core API validates parameters and does some transformation.
3. The Core API makes a Thrift RPC call to the Search Service.
4. The Search Service makes a query to Elasticsearch.
5. The Search Service generates an appropriate Thrift RPC response.
6. The Core API generates an appropriate JSON response.
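The flow above can be sketched as a short handler. Everything here is a hypothetical stand-in: validate_params, SearchService, and the field names are illustrative, not our actual Core API or Thrift-generated code; the stub stands in for the real Search Service and its Elasticsearch query.

```python
import json

def validate_params(raw):
    """Core API: validate and transform incoming query parameters."""
    if "q" not in raw:
        raise ValueError("missing query parameter")
    return {"query": raw["q"].strip(), "page": int(raw.get("page", 1))}

class SearchService:
    """Stand-in for the Thrift client. The real service translates the
    request into an Elasticsearch query and returns a Thrift response."""
    def search(self, params):
        # Pretend Elasticsearch returned two ads for this query.
        return {"total": 2, "ads": [{"id": 1}, {"id": 2}]}

def handle_serp(raw_params, search=SearchService()):
    params = validate_params(raw_params)   # step 2: validate/transform
    result = search.search(params)         # steps 3-5: RPC to Search Service
    return json.dumps(result)              # step 6: render JSON response
```

The point of the sketch is the shape of the dependency chain: every SERP request ties up a Core API worker for the full duration of the Search Service call, so latency in either component compounds.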

This particular flow (or similar) accounts for ~85% of API requests. Here’s what this looks like in numbers: