Google released a new feature, “Using Preemptible VMs with Kubernetes Engine”, and we decided to give it a go. Using preemptible VMs in your Kubernetes cluster dramatically reduces the cost of running your infrastructure. In this post, I’ll show how I’ve adapted our Kubernetes deployment and Google Cloud configuration to run microservices on preemptible instances. It’s important to know that this feature is still in beta. However, we have been running it in production for two weeks and haven’t had any problems so far.

What is a preemptible instance?

In a few words, these are the same compute instances that you are already using, but much cheaper! There is a trick, though: these instances can be shut down at any time. Cool, huh?

A bit more detail from Google:

Preemptible VMs are Google Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. Preemptible VMs are priced lower than standard Compute Engine VMs and offer the same machine types and options. You can use preemptible VMs in your Kubernetes Engine clusters or node pools to run batch or fault-tolerant jobs that are less sensitive to the ephemeral, non-guaranteed nature of preemptible VMs. To learn more about preemptible VMs, refer to Preemptible VMs in the Compute Engine documentation.

To feel the difference (the region is Oregon):

So, a preemptible VM is ~4.7 times cheaper than a regular VM.

As stated above, preemptible VMs are ideal for batch jobs. At pixboost, however, we don’t have many batch jobs, so I decided to play around with our busiest service, which optimizes and transforms images. We thought it would be possible to run it on preemptible VMs with a fallback to normal instances. Long story short, let’s put together our requirements.

Requirements

Before starting development, I defined some requirements:

“Images” microservice pods should run on preemptible VMs.

If there is an autoscaling event and no preemptible VM is available, then the pod should be scheduled on a standard VM.

No pods other than the “Images” service should run on the preemptible VMs.

“Images” service pods should not fail any requests when an instance becomes unavailable.

With the requirements in place, we can move on and plan the architecture of the solution.

Architecture

So, this is how our system looked before:

Our VMs live in node pools. Each node pool defines a configuration of VMs and their behavior, such as autoscaling. We can attach more than one node pool to the cluster. In our case, the new node pool should consist of preemptible VMs.

The updated architecture should then look like this:

It’s important to mention that at the application level, all our microservices run behind Nginx:

Finally, our Kubernetes deployment for the “Images” service:
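A minimal sketch of what such a deployment looks like; the service name, image, port, and health-check path are assumptions, but the 10-second readiness-probe period matters later:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: images
spec:
  replicas: 2
  selector:
    matchLabels:
      app: images
  template:
    metadata:
      labels:
        app: images
    spec:
      containers:
        - name: images
          image: gcr.io/pixboost/images:latest   # placeholder image name
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 10
```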

Let’s Rock

First of all, let’s set up a new node pool and try to schedule the service’s pods on it. Go to the Kubernetes Engine console and edit the cluster. Click the “Add node pool” button and select your options as needed, but with the “Pre-emptible nodes” option enabled:
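The same pool can be created from the command line as well; a sketch, with the cluster name, zone, and autoscaling bounds as placeholders:

```shell
gcloud container node-pools create preemptible-pool \
  --cluster=my-cluster \
  --zone=us-west1-a \
  --preemptible \
  --enable-autoscaling --min-nodes=0 --max-nodes=5
```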

We also need to add a taint in order to avoid scheduling pods other than the “Images” service on this pool:
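One way to do that is to set the taint at pool-creation time, so every node that autoscaling brings up gets it automatically; the taint key and value below are our own choice:

```shell
gcloud container node-pools create preemptible-pool \
  --cluster=my-cluster \
  --zone=us-west1-a \
  --preemptible \
  --node-taints=dedicated=preemptible:NoSchedule
```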

Also, keep in mind that GKE adds the label cloud.google.com/gke-preemptible to each preemptible node, so we can use it in our deployment setup.

Now that we have a new node pool, we need to update our deployment configuration so our pods can run on it. The easiest way to achieve that is to use a node selector in the pod spec:
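Using the GKE label, the relevant part of the pod spec would look roughly like this:

```yaml
spec:
  nodeSelector:
    cloud.google.com/gke-preemptible: "true"
```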

That node selector says that the pod must be scheduled only on nodes that have a specific label and value. Given the nature of preemptible VMs, an instance might not be available when it’s needed. We are running a live web service and can’t wait for a new instance to become available.

So, the nodeSelector approach doesn’t meet our requirements. Luckily, there is another way. The node affinity feature allows us to set up rules that Kubernetes follows during scheduling or execution of pods:
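A preferred (soft) node affinity rule for the preemptible label looks roughly like this:

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: In
                values:
                  - "true"
```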

The rule above is quite simple: when Kubernetes schedules a new pod, it should prefer a preemptible instance. If no preemptible instance is available, the pod is scheduled on a standard one. Node affinity gives you many other options for placing your pods on the right nodes, and I would definitely recommend checking it out.

As you remember, we set up a taint on the node pool that prevents Kubernetes from scheduling new pods on it. We need to add a toleration for this taint to the “Images” service in order to run it on that node pool:
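Assuming the taint on the pool was dedicated=preemptible:NoSchedule (the key, value, and effect must match whatever was actually set on the pool), the toleration in the pod spec looks like:

```yaml
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: preemptible
      effect: NoSchedule
```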

The taint prevents any pods other than the “Images” service, which now has a toleration, from being scheduled on preemptible instances. The initial setup is done, so we can start playing with it to make sure we meet the last requirement: no failed requests.

Testing

I set up a simple load test using Apache Benchmark that sends requests to the service in 8 parallel threads. While the test is running, I randomly stop instances and write down the times of the stoppages. After the test finishes, I align the instance termination times with the load balancer logs and see when failures happened. The goal is to have no failed requests, as per the requirements. But nothing works the first time. Still. After 16 years of my bright career. But yep, I kept trying :) On the first iteration, many requests failed. That didn’t surprise me, and after a bit of investigation I found out that my service did not support graceful shutdown.
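For reference, such an Apache Benchmark run looks roughly like this; the URL and request count are placeholders:

```shell
# 100,000 requests, 8 concurrent connections
ab -n 100000 -c 8 "https://images.example.com/test-image"
```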

Fixing Graceful Shutdown

When a preemptible instance is reclaimed, the shutdown protocol kicks in: Google Cloud sends an ACPI shutdown signal to the instance, which then has 30 seconds to finish its business. After 30 seconds, the instance is terminated.

At the moment, Kubernetes Engine doesn’t support shutdown scripts, so the only thing we can do is handle the termination signal correctly inside our containers. The good thing is that we can easily emulate it locally by running the microservice’s Docker container and executing “docker stop” on it.
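The local emulation looks like this (image and container names are placeholders); docker stop sends SIGTERM first and SIGKILL after a grace period, which mirrors what happens on preemption:

```shell
docker run -d --name images images:latest
docker stop images   # SIGTERM now, SIGKILL after the 10-second default grace period
docker logs images   # verify the app logged a clean shutdown
```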

The first mistake I uncovered was the wrong way I had set up the entrypoint. In my case I did:
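The Dockerfile line was the shell form, roughly:

```dockerfile
ENTRYPOINT /entrypoint.sh
```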

Here I’m using the shell form of ENTRYPOINT, which is executed as `sh -c /entrypoint.sh`. In this case, signals won’t be passed to our script, because the shell doesn’t forward them.

What we should use instead is the exec form:
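That is, with brackets:

```dockerfile
ENTRYPOINT ["/entrypoint.sh"]
```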

We’ve just added brackets around the ENTRYPOINT arguments. Now signals go to our entrypoint.sh script. What I had inside the bash script was wrong as well:
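The script started the service as a regular child process, along these lines (run_service stands in for the real binary):

```bash
#!/bin/bash
# run_service runs as a child of this shell; the shell stays PID 1
# and does not forward SIGTERM to it.
run_service
```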

Here again, signals won’t reach the run_service process, so our app won’t receive them. We should use bash’s exec builtin, which replaces the shell with our process, to pass signals through:
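With exec, the script becomes:

```bash
#!/bin/bash
# exec replaces the shell process with run_service, so the app
# itself becomes PID 1 and receives SIGTERM directly.
exec run_service
```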

Now signals should go to the app.

The last thing we need to arrange is handling SIGTERM in our web app: finish all HTTP requests that are in progress and reject all new connections. Your implementation will depend on the language and frameworks you are using. Golang is quite neat in this sense: since version 1.8, the http server supports graceful shutdown, so all we need to do is add logic to finish all in-flight requests.

Lucky me, there is an existing package github.com/TV4/graceful that did it for me.

Final Polish

After implementing graceful shutdown, the number of failed requests decreased. However, I still saw some. What I found is that requests kept failing for a long time (3–4 seconds) after I stopped an instance. The reason was a too-high period on my readiness probe. If you look above, you’ll see it is set to 10 seconds, which means Kubernetes does a health check and marks the pod as healthy for the next 10 seconds. If, say, SIGTERM arrives 1 second after a health check, requests will still be sent to the pod for the next 9 seconds. I changed the period to the minimum allowed, 1 second:
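The relevant probe fragment then becomes (path and port are placeholders):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 1
```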

After this change, only a few requests failed when I stopped instances. That was fair enough, because there is still a window where the app is in shutdown mode but the pod is marked as healthy.

So, I had to find a way to retry the failed requests. Nginx doesn’t have retry functionality out of the box, but there is a trick you can use: set up an upstream and add the same backend record several times.

Then, if a request fails, nginx sends it to the next server in the upstream:
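A sketch of that trick; the upstream name, service name, and ports are placeholders:

```nginx
upstream images {
    # The same backend listed several times: when a request fails,
    # nginx retries it on the "next" server, which is the same service.
    server images-service:8080 max_fails=0;
    server images-service:8080 max_fails=0;
    server images-service:8080 max_fails=0;
}

server {
    listen 80;
    location / {
        proxy_pass http://images;
        # Retry only on connection errors and timeouts.
        proxy_next_upstream error timeout;
    }
}
```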

Failover

As the final step, I created another deployment of the “Images” service that runs on standard instances. I didn’t add any autoscaling and just ran one instance of the service. I updated the nginx upstream configuration and added it as a backup server:
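A sketch of the resulting upstream, with a hypothetical name for the standard-instance deployment:

```nginx
upstream images {
    server images-service:8080 max_fails=0;
    server images-service:8080 max_fails=0;
    # Deployment on standard VMs; only used when the servers above are unavailable.
    server images-service-standard:8080 backup;
}
```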

Conclusion

We have been running this configuration in production for about two weeks and haven’t had any problems with it so far.

Our bills are becoming much smaller!

I reckon it’s not possible to run every web service on preemptible instances, especially transactional ones. However, there are plenty of use cases where we can optimize the deployment to run on preemptible instances and save a considerable amount of money.