Update: I wrote this blog post when Istio was at version 1.2 and 1.3, and some of the things I describe have changed since then. The Istio team is making a great effort to fix issues and make Istio better with every release.

When someone talks about Istio, it’s all bells and whistles, but nobody talks about the difficulties that may arise during integration into an existing project. I’m a software engineer @Exponea, and I was responsible for integrating Istio into our infrastructure. I’ve postponed the integration because we weren’t able to take the last step: deploying Istio to production. Why? Let me describe the whole journey from the beginning.

We run a big part of our infrastructure on Google Kubernetes Engine (GKE). We expect Istio to simplify our application layer, give us more insight into our traffic, and increase the security of our applications. We want to spend as little time as possible managing Istio, so we decided to try the managed Istio on GKE. After a few experiments, we found out that we couldn’t use it. That was a big disappointment, because (like most managed services) it would have solved some troubles for us. Managed Istio on GKE didn’t support any configuration besides the mTLS mode. We didn’t want anything super-advanced, just the options available in the Helm chart provided by Istio.

Brace yourselves, the adventure begins. We deployed Istio to the development environment first; I’ll explain the production deployment at the end of this article.

The application couldn’t start; nearly all pods were in the crash-loop-back-off state. Without much debugging, I suspected that the port names didn’t have the right prefix. Istio determines what type of traffic flows through a port based on the port’s name (HTTP, gRPC, TCP, …). Therefore, if ports aren’t named correctly, your traffic may not work in some cases, for example if you tag a port as HTTP but it actually carries gRPC. I went through all the services and fixed the names, and finally most of the pods were able to start and run. If you are not familiar with how Istio works and you are not sure why this happens, check the diagram below.

Istio proxies intercept traffic going into and out of your service. The proxies are configured by Istio Pilot.
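For illustration, here is a minimal sketch of the port-naming convention on a plain Kubernetes Service; the service name, ports, and protocols are made up for this example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: recommendations        # hypothetical service name
spec:
  selector:
    app: recommendations
  ports:
    - name: grpc-api           # the "grpc-" prefix tells Istio this port carries gRPC
      port: 50051
      targetPort: 50051
    - name: http-metrics       # the "http-" prefix marks plain HTTP traffic
      port: 8080
      targetPort: 8080
```

In the Istio versions we were running (1.2/1.3), this name prefix was how Pilot learned the protocol; as far as I know, more recent releases can also detect the protocol automatically, so treat the convention as version-dependent.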

After a few days of debugging, new issues started to pop up. Our jobs were not finishing because Istio sidecars kept running after the job’s work was done. The reason? As you may know, Kubernetes Jobs run one-shot programs or scripts: when the program in the job exits, Kubernetes considers the job finished. The Istio proxy, on the contrary, is a server that runs forever, so as long as it keeps running, the job hangs.

Our solution to this problem was disabling Istio proxy injection for jobs. Jobs without proxies aren’t a problem for us, as we don’t use mutual TLS in the first stage of the integration, and a sidecar in jobs wouldn’t bring us any advantage. Istio provided a workaround for this problem in the latest release: you can shut down the proxy sidecar by calling its API after your job finishes.
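A minimal sketch of the annotation-based approach we used; the job name and image are placeholders, and the `sidecar.istio.io/inject` annotation is the standard way to opt a pod out of injection:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: send-report                         # hypothetical job name
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"    # don't inject the Istio proxy into this pod
    spec:
      restartPolicy: Never
      containers:
        - name: send-report
          image: example/send-report:latest # hypothetical image
```

If I remember correctly, the API-based workaround mentioned above boils down to making a POST request to the sidecar’s /quitquitquit endpoint (on the pilot-agent port, 15020 by default in recent releases) as the last command of the job.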

Now, things get exciting. Problems hide deeper, and they are not easy to debug. Connections between services were breaking, and we had no idea why. We observed that this happened when services were connecting to services set up as StatefulSets. Istio doesn’t support StatefulSets, but they can be attached to the service mesh using ServiceEntry resources.

A ServiceEntry can be used to add a StatefulSet to the Istio service mesh
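A rough sketch of such a ServiceEntry, assuming a Cassandra-like StatefulSet exposed through a headless service; the hostname, port, and resolution mode are illustrative:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cassandra                           # hypothetical name
spec:
  hosts:
    - cassandra.default.svc.cluster.local   # the headless service's DNS name
  location: MESH_INTERNAL                   # the workloads run inside the mesh
  ports:
    - number: 9042
      name: tcp-cql
      protocol: TCP
  resolution: DNS                           # resolve endpoints through the headless service
```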

Unfortunately, there is a bug in Istio’s service discovery: it doesn’t update the IP addresses of headless services added through a ServiceEntry. As a result, if you recreate a pod published by a headless service, the services connecting to it keep using the old IP address, which no longer exists. I tried to fix this issue myself, but unfortunately I wasn’t successful, so I opened an issue instead. Our temporary fix is disabling sidecar injection for stateful services.

Another problem related to headless services concerned gRPC services. Some of them were using headless services to implement client-side load balancing. As you may know, gRPC can handle load balancing on the client side, so we had to add normal (ClusterIP) services alongside the headless services to ensure backward compatibility. If you ask yourself why we need this kind of backward compatibility, the answer is that we’ve enabled Istio for only a part of our services. Therefore, some services still use client-side load balancing, while others leverage Envoy load balancing.

A setup that supports both client-side and Istio (Envoy) gRPC load balancing. Services connecting to service X without Istio use client-side gRPC load balancing against the “service-x-0” and “service-x-1” backends. Services with Istio connect just to the “service-x” backend, and Istio ensures that traffic is load balanced across all the pods.
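A sketch of the dual-service setup from the figure, assuming the pods come from a StatefulSet labeled `app: service-x`; the headless service name is my own invention:

```yaml
# Headless service: gives each pod its own DNS record (service-x-0, service-x-1, ...)
# so that non-Istio clients can keep doing client-side gRPC load balancing.
apiVersion: v1
kind: Service
metadata:
  name: service-x-headless
spec:
  clusterIP: None
  selector:
    app: service-x
  ports:
    - name: grpc
      port: 50051
---
# Regular ClusterIP service: one stable "service-x" backend for clients running
# with Istio, letting Envoy spread the traffic across all pods.
apiVersion: v1
kind: Service
metadata:
  name: service-x
spec:
  selector:
    app: service-x
  ports:
    - name: grpc
      port: 50051
```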

The last problem (being optimistic here) that we found in the development environment is how proxy sidecars handle graceful shutdowns. The sidecar exits as soon as it receives the termination signal from Kubernetes. However, this becomes an obstacle when your service has a longer graceful-shutdown period and needs to finish its work first (for example, sending a batch of emails): the service won’t be able to do so, because its traffic goes through the proxy sidecar, which has already shut down.

Again, we encountered a known issue. We took inspiration from the comments and edited the sidecar injection template. The modified template adds a preStop hook to the proxy container when the pod carries a specific annotation. The preStop hook polls the application’s liveness probe endpoint and lets the sidecar proxy exit only once the liveness probe stops responding. This approach worked better for cases where the service still made several HTTP calls during shutdown and the number of outgoing connections could momentarily fall to 0 during the graceful-shutdown period.
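A rough sketch of what such a preStop hook could look like, assuming the application exposes its liveness probe on /healthz at port 8080 and the proxy image ships a shell and curl; all of these details are illustrative, not our exact template:

```yaml
# Added to the istio-proxy container by the modified injection template
# when the pod carries the opt-in annotation.
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Keep the sidecar alive while the app's liveness endpoint still responds,
          # so outgoing traffic keeps working during the graceful-shutdown period.
          while curl -sf http://127.0.0.1:8080/healthz > /dev/null; do
            sleep 1
          done
```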