This is the 3rd blog post for Mercari’s bold challenge month.

At Mercari, while migrating our monolithic backend to a microservices architecture, we felt the need for a service mesh and understood its importance in the long run. Most of our incident post-mortem reports had actionable items such as implementing rate limits, building a better canary release flow, and defining better network policies. This is exactly what a service mesh promises.

Last quarter, we finally decided to take on this challenge and started investing our time in Istio. Since then, we have introduced Istio into our production environment, a multi-tenant single Kubernetes cluster hosting more than 100 microservices, without any major incident. In this blog post, I will explain our Istio adoption strategy and our overall journey so far. This post assumes that readers have a fair knowledge of what a service mesh and Istio are; Istio’s official documentation does a great job of explaining both.

Motivation

In 2017, due to steep growth in both the business and the number of engineers, we realized our monolithic backend was not scalable and decided to migrate to a microservices architecture. With microservices, we entered a new era of network-related problems: load balancing, traffic control, observability, security… For observability, we succeeded in building a centralized ecosystem for tracing, logging and metrics using DataDog, but traffic control and security are still at a primitive stage. One of the prime reasons to migrate to microservices was to decouple different application components, but due to a lack of modern traffic control policies between services, we still face cascading failures. Introducing circuit breakers and rate limits can potentially solve these failures in the future.

Another big network-related issue for us was gRPC load balancing. We use gRPC for inter-microservice communication, and if you have used gRPC in Kubernetes, it is very likely that you have faced this issue too. In short, gRPC is based on HTTP/2 and multiplexes RPC calls over a single long-lived connection. A Kubernetes Service works at the L4 layer and therefore cannot load balance HTTP/2. A common workaround is client-side load balancing, which requires applications to embed load-balancing logic, coupling them to the underlying infrastructure. This is something we really do not want to do, as our beliefs match the envoyproxy vision: “The network should be transparent to applications.” Our microservices ecosystem is already polyglot, and maintaining infrastructure-specific libraries for each language, and keeping services updated with them, is not something developers like to do.

A service mesh helps in solving these issues by running a proxy sidecar along with each application without changing the application code.

Why Istio?

Istio service mesh promises to provide a single solution for all network-related problems for modern microservices architecture: connect, secure, control and observe. It tries to do so transparently, without modifying any application code using envoyproxy as a sidecar. It is open-sourced and backed by Google.

Although it has been a year since Istio was announced as production-ready, it is still hard to find many production success stories. We believe this is because of the initial challenges required to set up Istio reliably. But once that is done correctly, onboarding developers is low-cost, because Istio’s CRDs provide the Kubernetes-native user experience our developers are already used to. This led us to start investing our time in Istio.

Mercari microservices architecture and cluster ownership model

Before going into Istio, let’s have a look at our microservices architecture and our Kubernetes cluster ownership model.

Mercari’s Microservices Architecture

We are on GCP and use their managed Kubernetes service (GKE) to run all our stateless workloads. All client requests come through the API gateway which routes requests to microservices based on the request’s path. If the endpoint is not yet migrated to a microservice, the gateway just proxies the request to the monolithic backend running in our datacenter. This is how we are slowly migrating monolithic chunks to microservices. Depending on the endpoint, services talk to the gateway either using HTTP or gRPC. All inter-service communication happens through gRPC.

In order to minimize cluster operation cost, we use a multi-tenancy model. A multi-tenant cluster is shared by multiple users or teams. In our model, each service has its own namespace, and each service team has full access to its namespace. We also have a cluster operator team called the Microservices Platform Team that manages the cluster and system namespaces (kube-system, istio-system) — this is my team. This model ensures that responsibilities are well defined based on focus area: a microservice backend team (whose focus is its service) manages microservice namespaces, whereas the platform team (whose focus is the overall infrastructure) manages the cluster. Introducing cluster-wide functionality, and ensuring its reliability, is the responsibility of the platform team.

You can check the “Microservices Platform at Mercari” presentation from MTC2018 to learn more about our architecture in detail.

Multi-Tenant Cluster Ownership Model

Some numbers

API Gateway receives 4M RPM during peak time

100+ microservices (100+ namespaces)

200+ developers have direct access to some namespace

Mercari, Merpay, Internal Services — all run in the same cluster

As can be seen from these numbers, this cluster is very important to us: any problem in it would affect every team. It has the highest SLO, so we need to be very cautious when introducing anything that can affect the whole cluster, and ensuring Istio’s reliability is the foremost task. Also, in a multi-tenant cluster it is entirely possible for one team to make a configuration mistake. Our Istio setup should be well-guarded against such mistakes: a misconfiguration in one service or namespace should not affect others. And believe me, without vigilance, things can quickly go awry in Istio! More on this later…

Istio Adoption Strategy

In short, our strategy was:

Do one thing at a time!

Istio Adoption Strategy

Introduce one Istio feature at a time. Enable Istio in one namespace at a time.
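As a sketch, per-namespace rollout maps naturally onto Istio’s automatic sidecar injection, which is opted into with a namespace label (the namespace name here is hypothetical, and this may differ from the exact mechanism we use internally):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-service   # hypothetical microservice namespace
  labels:
    # Istio's mutating webhook injects the envoy sidecar
    # into Pods created in labeled namespaces only
    istio-injection: enabled
```

Namespaces without the label are untouched, which keeps the blast radius of the rollout to one team at a time.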

Istio’s feature selection

Istio’s mission is to become a Swiss Army knife for all networking problems in microservices: traffic control, observability, and security. It comes with lots of features: the last time I counted the Istio-related CRDs, there were around 53. And each feature has some unknown unknowns (e.g. all of a sudden, some services’ egress requests stopped working because of an outbound port-name conflict). It is good to first narrow the scope of these unknown unknowns by limiting features. Fortunately, Istio’s components are designed so that only the components necessary for the required features need to be installed; installing everything is not required.

Out of the aforementioned feature categories (traffic control, observability, and security), we decided to go for traffic control first. Even within the traffic control category, there is a long list of features: load balancing, rate-limit, retries, canary release, circuit breakers, fault injection, fault tolerance… To be frank, we need most of these features ASAP — but for our initial Istio release, we narrowed down our feature requirements to just load balancing (gRPC load balancing, to be precise, as is already explained in the motivation section above).

With this, we decided our first Istio release goal:

Enable the istio-sidecar proxy (envoy) in application Pods, and extend the service mesh gradually to all namespaces

Istio Feasibility Investigation

After deciding our release goal, we started testing Istio feasibility in the sandbox cluster to ensure that we can actually do what we had strategized. Our investigation approach and feasibility test requirements were to make sure:

Just installing Istio should not create any side effects in the cluster.

Istio can be introduced gradually, one microservice (namespace) at a time. It must be backward compatible: downstream services should keep working without any modification, whether or not they have the sidecar enabled. This is because microservice namespaces are managed by developer teams, and we cannot ask all teams to introduce Istio at once.

Istio works well for all the protocols we have inside the cluster: gRPC, HTTP, HTTPS.

There is no noticeable performance degradation in latency, and no new errors.

There is no big impact on the cluster if any of Istio’s control plane components goes down, i.e. we have a backup plan.

As is often the case, our investigation journey was not so smooth. We encountered a few blockers that might not have been a big hurdle if our cluster’s SLO were lower. Because of this, we had to figure out workarounds and strategies to satisfy the feasibility test requirements.

Challenges

Below are a few initial challenges that we faced. Explaining them in detail would require dedicated posts for each, but I will provide brief explanations here.

1. Managing istio-proxy lifecycle

Network traffic flow in a Pod with istio sidecar enabled

When Istio’s sidecar is enabled in a Pod, all inbound and outbound traffic passes through the sidecar container. It is very important to make sure that:

The sidecar starts and is healthy before the application container starts; otherwise any outbound request during startup, such as a database connection, will fail.

The application container terminates before the sidecar starts terminating; otherwise outbound requests during that period will fail.

Unfortunately, there is no easy way to control this lifecycle in Kubernetes, as a sidecar is not a first-class citizen: start and termination order is effectively random, and any container can start or terminate first. There is an accepted proposal to solve this issue, but the work is still in progress and it will take a few releases for the feature to land in Kubernetes. Until then, we use the following workarounds.

1. Make sure the application container starts after the sidecar



```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - command: ["/bin/sh", "-c", "while true; do STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:15020/healthz/ready); if [ \"$STATUS\" -eq 200 ]; then exec /app; break; else sleep 1; fi; done;"]
```

This ensures that the application container’s process starts only after the envoy sidecar is healthy and ready to take traffic.

2. Make sure the envoy sidecar starts terminating after the application container has terminated

```yaml
containers:
- name: istio-proxy
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do sleep 1; done"]
```

This needs to be added to Istio’s sidecar-injector-configmap.yaml. It ensures that the envoy sidecar waits until all connections with the application container are terminated. This workaround is taken from “Envoy shutting down before the thing it’s wrapping can cause failed requests” (#7136).

2. Zero downtime rolling updates

If you are following Istio closely, you know that sporadic 503s during rolling updates are a very common complaint from Istio users. However, it isn’t Istio’s fault; the cause lies in the design of Kubernetes itself. Kubernetes is not consistent, it is “eventually consistent.” Istio just adds a little more complexity, making the inconsistency window longer and causing more 503 errors than usual. The popular answer to this issue is to retry these requests, but if downstream services have not enabled Istio, or you are not sure about a service’s idempotency, retrying is not feasible.
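For reference, this is roughly how such retries are expressed in Istio, via a VirtualService. The service name here is hypothetical, and this only helps when the calling side has a sidecar and the endpoint is safe to retry:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: example-service   # hypothetical service name
spec:
  hosts:
  - example-service
  http:
  - route:
    - destination:
        host: example-service
    retries:
      attempts: 3                    # retry a failed request up to 3 times
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure   # Envoy retry conditions
```

For services where neither condition holds, we rely on the preStop hooks described next instead.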

Kubernetes provides container lifecycle hooks such as preStop that developers can use to reduce the side effects of this inconsistency. We also configure preStop in application Pods based on protocol. Explaining this in detail would require a separate post in itself.

for services with gRPC

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]
```

In both scenarios, whether downstream has Istio enabled or not, client-side load balancing is used. In client-side load balancing, the client maintains a connection pool and refreshes it when the upstream service’s endpoints are updated. This sleep period ensures that the upstream service keeps listening for new connections until the endpoints are updated.

for services with HTTP

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 75
      containers:
      - lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "wget -qO- --post-data '' localhost:15000/healthcheck/fail; sleep 30"]
```

If the downstream service has Istio enabled, the scenario is the same as the gRPC one, and sleep 30 is sufficient, since the Istio sidecar refreshes connections when endpoints are updated. When downstream does not have Istio enabled and client-side load balancing is not used, which is the case for HTTP, upstream needs to close the connection gracefully; otherwise the client will never know the connection has been closed and will keep sending requests on it. In Istio, the sidecar creates connections with clients, not the application container, so connections need to be closed from the sidecar itself. By calling envoy’s healthcheck/fail endpoint, we can forcefully drain all connections from upstream during rolling updates.

3. Istio’s Kubernetes Service port-name convention

A Kubernetes Service works at the L4 layer and does not know about L7 protocols. Istio needs to know the application-level protocol beforehand so that it can configure the sidecar correctly. To convey this, it uses a naming convention for Kubernetes Service port names:

<protocol>[-<suffix>]
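For example, a Service fronting a gRPC backend would name its port with the grpc prefix (the service name here is hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-service   # hypothetical service name
spec:
  selector:
    app: example-service
  ports:
  - name: grpc-api        # <protocol>[-<suffix>]: tells Istio this port speaks gRPC
    port: 8080
    targetPort: 8080
```

A plain HTTP port would instead be named something like http-api, and a port with no recognized prefix is treated as opaque TCP.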

If a service does not follow this convention, it can create conflicts and affect many other services in the mesh. The situation is even worse when you have headless services.

In a multi-tenant cluster, we cannot trust that every service will follow the correct convention. To solve this, we use stein, a YAML validator with custom policies, in our centralized manifest repo. A custom Kubernetes validating admission webhook is a work in progress.