
At Redbox, we are continuing our cloud native journey in modernizing our platform to support our new streaming platform — Redbox On Demand — and handle our digital transformation.

After much consideration, we are working on building/extending our delivery platform to support Kubernetes, which enables us to leverage some exciting new technologies such as Istio! With that being said, a journey would not be as exciting without a few bumps in the road. In this article, we are talking Kubernetes jobs, Istio Sidecars, and you!

Kubernetes Jobs

For those new to Kubernetes, Jobs are a resource type for short-lived tasks. They consist of one or more containers that perform a specific task, run to completion, and return an exit code when that task has finished or an error has occurred. They are meant for deploying containers that don’t need to run indefinitely. At Redbox, we found them great for executing integration tests, as well as one-time syncs of data.
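As a rough illustration, a minimal Job manifest can be built as a plain Python dictionary and printed as JSON, which kubectl accepts alongside YAML. The name integration-tests, the image, and the command are hypothetical placeholders, not anything from our platform.

```python
import json


def build_job_manifest(name, image, command):
    """Return a minimal batch/v1 Job manifest as a dictionary."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run
            "template": {
                "spec": {
                    # Job Pods must use "Never" or "OnFailure"
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command},
                    ],
                },
            },
        },
    }


# Hypothetical values, for illustration only.
manifest = build_job_manifest(
    "integration-tests",
    "example.com/integration-tests:latest",
    ["./run-tests.sh"],
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to kubectl apply -f - to create the Job.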

Istio

Istio High Level Architecture — https://istio.io/docs/concepts/security/

Istio is a service mesh with support for Kubernetes. If you are unfamiliar with Istio, it can be seen simply as a proxy that your applications talk to for service-to-service routing within a Kubernetes cluster. Offloading routing to an external layer enables Istio to do things like weighted routing for canary deployments and adding mutual TLS. Check out the Istio Concepts page for details on everything it can do.

To understand the challenge we faced in deploying Kubernetes Jobs, it’s important to know how Istio does this routing. Whenever a new Pod is created in Kubernetes, Istio injects a sidecar container that proxies all traffic in and out of the Pod. This is necessary to ensure any routing rules configured in Istio are applied to cluster traffic automatically.

The problem with Kubernetes Jobs and Istio Sidecars

Istio’s sidecar concept lets you do all sorts of networking and routing magic without updating any of the code you are deploying to Kubernetes. In fact, your container(s) have no idea Istio is even there. That’s a major factor in the issue with sidecars in Jobs:

A Job is not considered complete until all containers have stopped running, and Istio Sidecars run indefinitely.

This means that while your task may have completed, the Job as a whole will not appear as completed in Kubernetes. If you use a Continuous Deployment system, the Job will show as running, and your devs will most likely assume something is wrong with their code or tests.
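One way to see this in practice is to inspect the Pod’s container statuses, for example from the output of kubectl get pod -o json. The sketch below lists the containers that are still running; the Pod data is made-up sample data that mimics the shape of the Kubernetes Pod status, and the container name integration-tests is hypothetical.

```python
def running_containers(pod):
    """Return the names of containers in a Pod object that are still running."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [s["name"] for s in statuses if "running" in s.get("state", {})]


# Made-up sample: the task container has exited, the sidecar has not.
example_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "integration-tests",
                "state": {"terminated": {"exitCode": 0, "reason": "Completed"}},
            },
            {
                "name": "istio-proxy",
                "state": {"running": {"startedAt": "2019-01-01T00:00:00Z"}},
            },
        ]
    }
}

print(running_containers(example_pod))  # ['istio-proxy']
```

As long as that list is not empty, the Pod keeps running and the Job is not marked complete.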

Stopping an Istio Sidecar Automatically

The solution to this problem is easy in concept: stop the sidecar container when your container(s) are finished. Of course, it’s a little trickier than that in practice. We found others had this problem too, and that led to a few options:

Option 1: Disabling Istio Sidecar injection

Istio Sidecar injection can be disabled in a few ways, including on individual Pods using annotations. As a temporary workaround, adding sidecar.istio.io/inject: "false" is possible, but this disables Istio for any traffic to or from the annotated Pod. As mentioned, we use Kubernetes Jobs for integration testing, which means some tests need to access the service mesh. Disabling Istio essentially means breaking routing, which is a show stopper.
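For reference, the annotation belongs on the Pod template inside the Job, not on the Job’s own metadata. A minimal sketch, assuming the Job manifest is held as a Python dictionary; the one-time-sync Job and its image are hypothetical.

```python
import copy

INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_sidecar_injection(job):
    """Return a copy of a Job manifest whose Pod template opts out of injection."""
    patched = copy.deepcopy(job)
    template = patched["spec"]["template"]
    annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
    # Annotation values are strings, so this is "false" and not a boolean.
    annotations[INJECT_ANNOTATION] = "false"
    return patched


# Hypothetical Job, for illustration only.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "one-time-sync"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "sync", "image": "example.com/sync:latest"}],
            }
        }
    },
}

patched = disable_sidecar_injection(job)
print(patched["spec"]["template"]["metadata"]["annotations"])
```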

Option 2: Use pkill to stop the Istio Process

We did have success with this method, but it required developers to add pkill -f /usr/local/bin/pilot-agent to the command in their Dockerfile and shareProcessNamespace: true to the Pod spec (not the Job spec). The downside is that we would have to teach our developers how to do this, and they lose the ability to easily return an exit code, which is how we determine whether integration tests passed or failed. Kubernetes relies on the exit code too: it marks the Job as Successful or Failed based on it. We likely could have fixed the exit code issue with some bash-fu, but that’s not a great solution either.
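To make the exit code problem concrete, here is a minimal sketch of a wrapper that runs the task, remembers its exit code, stops the sidecar with the pkill command above, and then hands the original exit code back. The function name run_then_stop_sidecar is made up for this example, and the sketch assumes shareProcessNamespace: true is set so that pkill can see the sidecar’s processes.

```python
import subprocess
import sys

PILOT_AGENT = "/usr/local/bin/pilot-agent"


def run_then_stop_sidecar(command, stop_command=("pkill", "-f", PILOT_AGENT)):
    """Run the task, stop the sidecar, and return the task's exit code."""
    result = subprocess.run(list(command))
    try:
        # The result of the stop command is deliberately ignored so that
        # it can never overwrite the task's own exit code.
        subprocess.run(list(stop_command))
    except OSError as err:
        print(f"could not run stop command: {err}", file=sys.stderr)
    return result.returncode


# Demo with a harmless stand-in for pkill so it can run anywhere.
code = run_then_stop_sidecar(
    [sys.executable, "-c", "raise SystemExit(3)"],
    stop_command=[sys.executable, "-c", "pass"],
)
print(code)  # 3
```

In a real container entrypoint, the returned value would be passed to sys.exit so that Kubernetes sees the task’s exit code rather than pkill’s.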

Option 3: envoy-preflight

This solution, from the team at Monzo Bank, was posted in the original GitHub Issue linked above. As Istio uses Envoy under the hood, the ability to stop Envoy sounded promising. envoy-preflight also added another really clever feature:

Waiting for Envoy to start before starting your container’s task

When running integration tests against a canary deployment, we use routing rules based on an HTTP header so that Istio routes traffic to the relevant canary deployment. If Istio hasn’t started yet, or fails to start, the integration tests end up being routed natively by Kubernetes, which sends traffic randomly to canary and prod containers and leads to inaccurate test results. This feature therefore solved a problem we might otherwise have encountered along the way. Awesome!
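The waiting step can be sketched as a simple polling loop. This is not envoy-preflight’s actual code: it assumes Envoy’s admin API is reachable on 127.0.0.1:15000 and that its /server_info endpoint reports a state of LIVE once Envoy is ready. The function names and retry settings are made up for the example.

```python
import http.client
import json
import time
import urllib.request


def envoy_is_live(admin_url="http://127.0.0.1:15000"):
    """Return True if Envoy's admin API reports the server state as LIVE."""
    try:
        with urllib.request.urlopen(admin_url + "/server_info", timeout=1) as resp:
            return json.load(resp).get("state") == "LIVE"
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or an unparsable reply: not ready yet.
        return False


def wait_until(check, attempts=30, delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times; return True as soon as it passes."""
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False


# Usage inside a Job container (not executed here):
#   if not wait_until(envoy_is_live):
#       raise SystemExit("Envoy never became ready")
```

Failing fast when the proxy never becomes ready avoids running the tests against unrouted traffic.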

We figured since Istio uses Envoy this should just work out of the box! When trying it against the Istio sidecars, however, we noticed some unusual behavior. The Istio sidecar runs Envoy and other Istio specific services, but stopping only Envoy didn’t lead to the entire container halting as expected.

To figure out what was going on, we added additional logging and did some testing to confirm:

$ envoy-preflight <command>

Once our command finished, the following was logged:

envoy-preflight: quitquitquit sent, response: Response(200) # /quitquitquit is Envoy's Admin API endpoint to trigger a shutdown

We confirmed that Envoy had shut down by shelling into the sidecar container:

$ curl 127.0.0.1:15000

curl: (7) Failed to connect to 127.0.0.1 port 15000: Connection refused

# 127.0.0.1:15000 is Envoy's Admin API port (found in the Istio docs)

But unfortunately, the Istio sidecar container kept running and the Job was not marked as completed:

istio-proxy docker.io/istio/proxyv2:1.2.2 Running

In the end, envoy-preflight did not work for us with Istio sidecars specifically, although it did everything properly within the context of Envoy. If you are using Envoy or another Envoy-based product, it could still be worth trying! Regardless, we still needed a solution.

Short Term Solution — Scuttle

We have officially open sourced our solution for handling this issue within Istio: Scuttle! It is built on top of envoy-preflight, which at its core:

- Polls the Envoy sidecar proxy for readiness before starting the Job’s task.
- Stops Envoy when the Job is finished.
- Forwards the exit code from whatever command is used for the Job’s task.

Scuttle extends this functionality by replacing the call to Envoy’s /quitquitquit with the pkill -f /usr/local/bin/pilot-agent from Option 2!
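The overall flow can be sketched in a few lines. This is an illustration of the wait, run, stop, and forward sequence described above, not Scuttle’s actual source. The three callables are stand-ins supplied by the caller, and returning 1 when the proxy never becomes ready is a choice made for this example.

```python
PROXY_NOT_READY = 1  # assumption: exit code used when the proxy never becomes ready


def run_job(wait_for_proxy, run_task, stop_proxy):
    """Wait for the proxy, run the task, always stop the proxy, return the exit code."""
    try:
        if not wait_for_proxy():
            return PROXY_NOT_READY
        return run_task()
    finally:
        # Runs on every path so the sidecar never keeps the Pod alive.
        stop_proxy()


# Stand-in callables that only record the order of events.
events = []
exit_code = run_job(
    wait_for_proxy=lambda: events.append("ready") or True,
    run_task=lambda: events.append("task") or 3,
    stop_proxy=lambda: events.append("stop"),
)
print(events, exit_code)  # ['ready', 'task', 'stop'] 3
```

In a real wrapper, stop_proxy would be the pkill call and the returned value would become the process exit code.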

Long Term Solution — Istio 1.3 and Core k8s Support

Starting in Istio 1.3, there is a /quitquitquit endpoint on Istio that works similarly to Envoy’s implementation: users can send a POST request and Istio will shut down. As Istio 1.3 was released the same day we created Scuttle, we are still evaluating and rolling out the upgrade across our infrastructure. If all works as expected in our testing, that endpoint can be used instead of pkill.
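Calling that endpoint from the Job’s container might look like the sketch below. The address http://127.0.0.1:15020 is an assumption (the sidecar agent’s status port in the Istio versions we are aware of), so check it against your own installation. The function names are made up for this example.

```python
import urllib.request


def build_quit_request(base_url="http://127.0.0.1:15020"):
    """Build the POST request that asks the Istio sidecar to shut down."""
    # base_url is an assumption; adjust it to match your sidecar's configuration.
    return urllib.request.Request(base_url + "/quitquitquit", data=b"", method="POST")


def stop_istio_sidecar(base_url="http://127.0.0.1:15020", timeout=2):
    """Send the shutdown request; return True if the sidecar answered with 200."""
    try:
        with urllib.request.urlopen(build_quit_request(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status.
        return False


request = build_quit_request()
print(request.get_method(), request.full_url)
```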

More importantly, bigger discussions are underway to add support for sidecar containers within Kubernetes. Currently, a sidecar is just a normal container in a Pod, but changes are being discussed for future versions of Kubernetes to handle everything envoy-preflight and Scuttle are doing automatically.

We Are Hiring!

Building a truly cloud native platform requires hard work. We are always looking for great engineers to join us and help build! Feel free to check out our Careers page for job opportunities!