Exploring Multi-level Weaknesses using Automated Chaos Experiments

Kubernetes Pod Disruption Budgets through the lens of Chaos Engineering

Chaos engineering’s goal is to discover and help you overcome weaknesses across the entire sociotechnical system of software development.

This means it is about far more than merely breaking infrastructure and seeing how your application responds in production. Chaos engineering explores weaknesses across all the attack vectors on resiliency, from infrastructure, through platform and application levels, and finally the people, practices and processes as well.

How chaos engineering typically helps you explore weaknesses across the levels in your sociotechnical system of software development

Automated chaos engineering experiments usually explore weaknesses at the more technical levels, such as the platform and applications, leaving the People/Practices & Process level to Game Days. However, there’s no reason they need to be so limited.

In this article I’m going to explore how multi-level automated chaos experiments can be used to explore system weaknesses that cross the boundaries between the technical and people/process/practices levels.

Kubernetes Cluster Administration and Application Resiliency

In many system contexts there is a dividing line between those responsible for managing a Kubernetes cluster (including all the real costs associated with its resources), let’s call them the Cluster Admins, and those trying to deploy and manage applications upon the cluster, the Application Team. The Cluster Admins would be interested in such tasks as:

Managing Nodes.

Managing Persistent Disks.

Keeping an eye on the costs if these resources are hosted on a Cloud Provider.

Ensuring the cluster is not the problem, so they can go home for the weekend.

The Application Team would be more focussed on:

Ensuring their application is healthy.

Ensuring they have enough redundancy and capacity across pods.

Ensuring they claim the persistent storage they need for their application to maintain its state.

Ensuring they can go home at 5:30 (other times are available).

The trick here is that in a Cluster Admin’s regular working day they may execute operations that cause issues with the goals of the Application Team. A common example here is where a Cluster Admin is looking to remove a node from the cluster, for maintenance perhaps, but this leaves an application in a state where it cannot get the resources, in this case the number of pods, it needs. Through no fault of their own the Cluster Admins and the Application Team can end up in conflict and this can lead to system outages.

If that sounds like a system weakness, you’re right! It’s a system weakness at the people and process level, as the solution lies in how the two roles operate, coordinate and compromise during their daily work. For a chaos engineer, the whole job starts with suspecting there is a weakness, and then exploring whether that is really the case…

Exploring the Attack on Resiliency through Node Maintenance

As we have a potential People & Process system weakness it makes sense that we can explore and surface this weakness through executing a chaos experiment.

For demonstration purposes, I’ve set up a Kubernetes cluster with three nodes and, taking the role of the Application Team, I’ve deployed a simple three-pod application as shown below:

NAME                          READY   STATUS    RESTARTS   AGE
my-service-6dc649f897-22p7f   1/1     Running   0          2h
my-service-6dc649f897-7m72z   1/1     Running   0          46m
my-service-6dc649f897-xkmpd   1/1     Running   0          2h

The specification for this deployment is:
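As a rough sketch of what such a deployment could look like, here it is built as a Python dictionary and printed as JSON. Only the my-service name, the three replicas and the biz-app-id=retail label come from this article; the container name, image and port are placeholders assumed for illustration.

```python
import json

# Hypothetical deployment manifest. The container name, image and port are
# placeholders; the name, replica count and label follow the article.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "my-service"},
    "spec": {
        "replicas": 3,  # the Application Team wants three pods
        "selector": {"matchLabels": {"biz-app-id": "retail"}},
        "template": {
            "metadata": {"labels": {"biz-app-id": "retail"}},
            "spec": {
                "containers": [
                    {
                        "name": "my-app",                   # placeholder
                        "image": "example/my-app:latest",   # placeholder
                        "ports": [{"containerPort": 8080}],  # assumed port
                    }
                ]
            },
        },
    },
}

# Kubernetes accepts JSON manifests as well as YAML ones.
deployment_json = json.dumps(deployment, indent=2)
print(deployment_json)
```

Writing the printed JSON to deployment.json and applying it with kubectl apply -f deployment.json would be one way to use it.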

These pods are fronted by a Kubernetes service so that they can be accessed from a static IP:

my-service LoadBalancer 10.15.250.81 35.189.85.252 80:31724/TCP 2h

The specification of the service is:
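A service along these lines might be sketched as follows, again as a Python dictionary printed as JSON. The LoadBalancer type and port 80 match the listing above; the selector and the target port of 8080 are assumptions made for illustration.

```python
import json

# Hypothetical service manifest. The type and port 80 follow the listing in
# the article; the selector and targetPort are assumed for illustration.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "my-service"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"biz-app-id": "retail"},  # assumed to match the pod label
        "ports": [{"protocol": "TCP", "port": 80, "targetPort": 8080}],
    },
}

service_json = json.dumps(service, indent=2)
print(service_json)
```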

As you can see from the pod deployment.json specification, I want there to be three pods for my application. What is not specified, but implicitly agreed (hence the weakness), is that the Application Team knows there must be three pods under all known conditions (such as upgrades), otherwise Bad Things Could Happen(TM).

Node Drain can lead to “Pending” Pods…

The problem at the moment is that the need for my application’s pods to never go below three is an implicit limitation on any Cluster Admin’s tasks. It’s implicit, therefore it can be accidentally ignored! This means it’s quite possible for a Cluster Admin to lower this level through regular maintenance operations against the nodes in the cluster and have no immediate knowledge that this is a problem for the Application Team.

The Application Team might be forgiven for hoping that the system would survive at the right level during such operations, but they’d be right to be nervous. I’d be nervous. I don’t have trust and confidence that things actually will be ok. When I lack that trust and confidence, it’s time to roll out a chaos experiment using the free and open source Chaos Toolkit.

Using the Chaos Toolkit’s experiment format, first I define my steady-state hypothesis for my experiment. This helps me define what normality is supposed to look like, including my hard limit on the minimum number of my application’s pods that should be in the Ready state at a given moment in time:

This starter snippet of my experiment has two probes. The first expects the application to respond normally when I hit it from its public URL. The second probes the Kubernetes control plane to make sure that all of my pods, as labelled by biz-app-id=retail, are in the Running state.
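A minimal sketch of a hypothesis with those two probes is shown here as a Python dictionary that serialises to the experiment’s JSON format. The titles, probe names, URL path and namespace are assumptions for illustration. The pods_in_phase probe comes from the chaostoolkit-kubernetes extension, so check its documentation for the exact arguments your version supports.

```python
import json

# Sketch of a steady-state hypothesis in the Chaos Toolkit experiment format.
# Titles, probe names, the URL path and the namespace are illustrative.
steady_state_hypothesis = {
    "title": "Application is available and all its pods are running",
    "probes": [
        {
            "type": "probe",
            "name": "application-must-respond-normally",
            "tolerance": 200,  # expected HTTP status code
            "provider": {
                "type": "http",
                # External IP taken from the service listing in the article.
                "url": "http://35.189.85.252/",
            },
        },
        {
            "type": "probe",
            "name": "pods-must-be-running",
            "tolerance": True,
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.probes",
                "func": "pods_in_phase",
                "arguments": {
                    "label_selector": "biz-app-id=retail",
                    "phase": "Running",
                    "ns": "default",  # assumed namespace
                },
            },
        },
    ],
}

fragment = json.dumps({"steady-state-hypothesis": steady_state_hypothesis}, indent=2)
print(fragment)
```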

The Chaos Toolkit will execute the steady-state hypothesis twice. Once at the beginning of the experiment’s execution to make sure we’re dealing with a system that at least looks ok at that point. Then once at the end of the experiment’s execution to assess if the experiment caused the system to deviate from this definition of “normal” and has therefore surfaced a weakness.
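The order of operations can be pictured with a toy model in plain Python. This is not the Chaos Toolkit’s implementation: the run_experiment function, the status strings and the in-memory cluster are all hypothetical, and only the sequence (hypothesis, method, hypothesis again, then rollbacks) mirrors the description above.

```python
def run_experiment(steady_state_ok, method, rollbacks):
    """Toy model of the run order: hypothesis, method, hypothesis, rollbacks."""
    # First check: don't experiment on a system that already looks unwell.
    if not steady_state_ok():
        return "failed"
    try:
        for action in method:
            action()
        # Second check: did the method push the system away from "normal"?
        status = "completed" if steady_state_ok() else "deviated"
    finally:
        # Best effort to put things back, whatever the outcome.
        for rollback in rollbacks:
            rollback()
    return status


# Hypothetical in-memory stand-in for the cluster: three Ready pods.
cluster = {"ready_pods": 3}


def three_pods_ready():
    return cluster["ready_pods"] >= 3


def drain_node():
    cluster["ready_pods"] -= 1  # one pod ends up Pending


def uncordon_node():
    cluster["ready_pods"] = 3  # the pod is rescheduled once the node is back


status = run_experiment(three_pods_ready, [drain_node], [uncordon_node])
print(status)  # deviated
```

A deviated result here is the toy equivalent of the experiment surfacing a weakness.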

This is why we celebrate when a chaos experiment “fails”, because we’ve found an area for improvement, hurrah! Compare this to our reaction when a regular test fails and you’ll appreciate the subtle difference in perspectives between chaos experimentation and chaos testing…

Next we need an experimental method to actually cause the real-world events that an unaware Cluster Admin might cause:

Here we’re taking advantage of the Chaos Toolkit’s Kubernetes driver to tinker with our cluster just as a Cluster Admin rightly would. We have asked the cluster to drain (that is, evacuate the pods from) a particular node, hoping that the pods we need will be kept alive the way we specified. Then we run the experiment using chaos run …
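A method of this kind might be sketched as below, again as a Python structure that serialises to the experiment’s JSON. The action name, node name and timeout are placeholders assumed for illustration. The drain_nodes action belongs to the chaostoolkit-kubernetes extension, and its arguments should be checked against the version you have installed.

```python
import json

# Sketch of an experiment method with a single node-drain action.
# "my-node-name" and the timeout are placeholders, not values from the article.
method = [
    {
        "type": "action",
        "name": "drain-node",
        "provider": {
            "type": "python",
            "module": "chaosk8s.node.actions",
            "func": "drain_nodes",
            "arguments": {
                "name": "my-node-name",  # placeholder node name
                "timeout": 120,          # seconds to wait for the drain
            },
        },
    }
]

method_json = json.dumps({"method": method}, indent=2)
print(method_json)
```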

Things did not go as we expected! The experiment’s output shows us what has happened. When we drained a node we left Kubernetes, specifically the Replica Set looking after our requested pod instances, in a very sticky situation.

By draining the node we’ve left the Replica Set unable to meet its specification, as shown by the fact that our application’s three pods are not all in the Ready state, one is now in Pending and will remain so until a suitable node is made available.

Being good chaos engineers we are in fact putting the node back at the end of the experiment with the following entry in the Rollbacks section:

Rollbacks in chaos experiments are not exactly rollbacks: there’s no absolute guarantee that the system has been effectively rolled back to a definite known state. In fact, when you think about it, a chaos experiment is, by definition, trying to find unknown-unknown weaknesses. Expecting automatic rollback implies we know exactly what those weaknesses are beforehand… which kinda defeats the point!

Instead, in the Rollbacks section of the Chaos Toolkit’s experiment we’re merely trying to put things back the way they were before we ran the experiment. In this case we are un-cordoning the node that we cordoned off previously when we asked it to drain itself. Given a few moments (sometimes a few more than that…) Kubernetes will respond and all three of our application’s pods will be back up, happy and in their Ready state.
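A rollback entry that un-cordons the node could look something like this sketch. The action name and node name are placeholders assumed for illustration, and uncordon_node is the action provided by the chaostoolkit-kubernetes extension.

```python
import json

# Sketch of a rollbacks section that makes the drained node schedulable again.
# "my-node-name" is a placeholder and should match the node that was drained.
rollbacks = [
    {
        "type": "action",
        "name": "uncordon-node",
        "provider": {
            "type": "python",
            "module": "chaosk8s.node.actions",
            "func": "uncordon_node",
            "arguments": {"name": "my-node-name"},
        },
    }
]

rollbacks_json = json.dumps({"rollbacks": rollbacks}, indent=2)
print(rollbacks_json)
```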

The problem is that now we know that there’s a weakness. It’s just too easy for a well-meaning Cluster Admin to reduce our application’s minimum pods by accident (we assume they are not nefarious individuals making things difficult on purpose; after all, they are the admins!).

Enter, stage left, Kubernetes Pod Disruption Budgets!

Overcoming the Weakness with a Pod Disruption Budget

This particular weakness is catered for by a Kubernetes resource, the Pod Disruption Budget.

By specifying a disruption budget the Application Team can provide a policy that dictates that, for a controlled disruption such as node drainage, our minimum number of pods will be protected.

Pod Disruption Budgets are pretty simple resources to specify:

The disruption budget states the minimum number of pod instances we’re able to accept at any time, along with the labels present on the pods that we want the budget to apply to.
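A budget of this shape might be sketched as follows, built as a Python dictionary and printed as JSON. The my-app-pdb name, the minimum of three pods and the biz-app-id=retail label follow this article; the policy/v1 API version is an assumption about your cluster, since older clusters used policy/v1beta1.

```python
import json

# Sketch of a Pod Disruption Budget that protects three pods with the
# biz-app-id=retail label from voluntary disruptions such as a node drain.
pod_disruption_budget = {
    "apiVersion": "policy/v1",  # assumed; older clusters used policy/v1beta1
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "my-app-pdb"},
    "spec": {
        "minAvailable": 3,  # minimum number of pods we can accept
        "selector": {"matchLabels": {"biz-app-id": "retail"}},
    },
}

pdb_json = json.dumps(pod_disruption_budget, indent=2)
print(pdb_json)
```

The printed JSON can be saved to a file and applied with kubectl apply -f, just like the deployment.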

But how do we prove that this budget does what we want? Yep, you guessed it, chaos run experiment.json …

Building Trust and Confidence in Overcoming a Weakness

The whole point of automating your chaos experiments is so that you can run them again and again to build trust and confidence in your system. Not just surfacing new weaknesses, but also ensuring that you’ve overcome a weakness in the first place.

Systems are changing quickly these days, and running chaos continuously (Continuous Chaos is, for me, right up there after Continuous Integration and Continuous Delivery) is a great way to build trust today and also build confidence that you’ll detect reintroduced (can you say “regression”?) weaknesses tomorrow.

After applying our new disruption budget it’s time to gain some trust and confidence that the change has overcome the weakness our experiment surfaced earlier. Let’s run the experiment again with chaos run experiment.json:

The experiment’s status this time is completed, which means the steady-state hypothesis passed both before and after the experimental method was executed. However, what is that Error statement in the output saying…

The clue is in the “The disruption budget my-app-pdb needs 3 healthy pods and has 3 currently” message towards the end of that line.

Our Pod Disruption Budget worked! The cluster couldn’t handle a node being drained and still meet our needs, so the drain was blocked! (Now there’s an image…)
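The arithmetic behind that message can be sketched in a couple of lines. The eviction_allowed function is a hypothetical simplification, not Kubernetes’ own code: it only captures the idea that a voluntary eviction is refused when it would take the number of healthy pods below the budget’s minimum.

```python
def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Allow a voluntary eviction only if the budget still holds afterwards."""
    return healthy_pods - 1 >= min_available


# Our situation: the budget needs 3 healthy pods and there are 3 currently,
# so evicting one during a drain would break the budget.
print(eviction_allowed(healthy_pods=3, min_available=3))  # False

# With a fourth healthy pod there would be room for one disruption.
print(eviction_allowed(healthy_pods=4, min_available=3))  # True
```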

If this situation occurred “for real”, now would be a good time for the Cluster Admin to come talk to the Application Team, or maybe bring another node online so the Application Team can maintain the minimum number of pods. Either way this particularly gnarly People, Practices and Process weakness has been overcome!

Sharing your Experimental Outcomes

You are not alone. Chaos engineering is about collaboration and so it’s useful to make everyone aware of the sorts of weaknesses you’re exploring, and how and when they’ve been overcome.

For this purpose the Chaos Toolkit comes with the report command to help you disseminate your learnings to everyone, even management ;) In the past this command was a bit of a pain to install, as the PDF support from Python required all sorts of operating-system-specific libraries to be installed, but we’ve now got a neat Docker image that we can use instead to create a report that can be consumed by other human beings.

To create the report based on your experiment’s findings all you need to do is execute the following command:

$ docker run \
    --user `id -u` \
    -v `pwd`:/tmp/result \
    -it \
    chaostoolkit/reporting

Bingo, you now have a ready-made PDF report that shows how your Pod Disruption Budget protects your application from painful downtime and unexpected situations while the Cluster Admins go about their business. Weakness overcome, it’s time for a coffee…