Many engineers take a very optimistic approach and assume that their services will work flawlessly. They aim for the final shape from the very beginning, without thinking about potential changes or unpredictable events, which can then cause disaster instead of being taken advantage of to make their services even stronger…

And here the question arises: how can we benefit from volatility, randomness, and disorder?

There are a lot of blog posts about antifragile systems, but not many of them refer to existing technologies. So let’s wrap everything up :)

The Antifragile

The term antifragile was introduced by Nassim Nicholas Taleb in his book “Antifragile: Things That Gain from Disorder”.

Some things benefit from shocks, they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.

Antifragility in Kubernetes

How can we measure the fragility of our services? What happens in case of failure? And how does Kubernetes help?

This boils down to a few key concepts of Antifragility.

Simplicity

Complex systems are difficult to monitor and maintain. The bigger a system is, the harder it becomes to change. Also, any unexpected event can lead to side effects or cascading failures which might be hard to trace and debug.

Kubernetes provides a deployment unit called a Pod, which is a group of containers running on the same node with a common life cycle. We can safely assume that one container should be responsible for one particular thing (keep it simple).

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: web
      containerPort: 80
      protocol: TCP
  - name: database
    image: postgres
    ports:
    - name: psql
      containerPort: 5432
      protocol: TCP

Obviously, there are more types of Kubernetes resources like Deployment, DaemonSet, StatefulSet, etc.
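For example, a minimal Deployment sketch (the name and labels below are illustrative) declares a desired number of replicas, and Kubernetes keeps replacing failed Pods to maintain that state:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3  # desired state; the controller replaces any Pod that dies
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        ports:
        - containerPort: 80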

Observability

Monitoring and logging are key mechanisms for understanding how things work and perform, especially in dynamic and distributed environments like Kubernetes.

Every container running on Kubernetes should log application output to stdout or stderr. This decouples log storage from the container itself and keeps the output visible even if our container crashed.

I0221 00:40:02.060314 version: 1.14.8
I0221 00:40:02.133217 Using configuration read from directory: /kube-dns-config with period 10s
I0221 00:40:02.133299 FLAG: --alsologtostderr="false"
I0221 00:40:02.133312 FLAG: --config-dir="/kube-dns-config"
I0221 00:40:02.133320 FLAG: --config-map=""
I0221 00:40:02.133325 FLAG: --config-map-namespace="kube-system"
I0221 00:40:02.133331 FLAG: --config-period="10s"
I0221 00:40:02.133338 FLAG: --dns-bind-address="0.0.0.0"
I0221 00:40:02.133344 FLAG: --dns-port="10053"
I0221 00:40:02.133353 FLAG: --domain="cluster.local."
I0221 00:40:02.133361 FLAG: --federations=""
I0221 00:40:02.133368 FLAG: --healthz-port="8081"
I0221 00:40:02.133373 FLAG: --initial-sync-timeout="1m0s"
I0221 00:40:02.133378 FLAG: --kube-master-url=""
I0221 00:40:02.133385 FLAG: --kubecfg-file=""
I0221 00:40:02.133390 FLAG: --log-backtrace-at=":0"
I0221 00:40:02.133400 FLAG: --log-dir=""
I0221 00:40:02.133406 FLAG: --log-flush-frequency="5s"
I0221 00:40:02.133411 FLAG: --logtostderr="true"
I0221 00:40:02.133416 FLAG: --nameservers=""
I0221 00:40:02.133421 FLAG: --stderrthreshold="2"
I0221 00:40:02.133426 FLAG: --v="2"
I0221 00:40:02.133431 FLAG: --version="false"
I0221 00:40:02.133440 FLAG: --vmodule=""
I0221 00:40:02.133520 Starting SkyDNS server (0.0.0.0:10053)
I0221 00:40:02.133827 Skydns metrics enabled (/metrics:10055)
I0221 00:40:02.133836 Starting endpointsController
I0221 00:40:02.133841 Starting serviceController
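Output like the above can be fetched with kubectl logs <pod-name>; adding the -p (--previous) flag shows the logs of the previous, crashed container instance.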

Prometheus is an open-source system monitoring and alerting toolkit originally built at SoundCloud. In combination with Grafana, it gives full visibility into the health of a Kubernetes cluster.

Kubernetes Pod Metrics by Ore Olarewaju
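A common convention (not a Kubernetes built-in, and only effective if your Prometheus scrape configuration looks for these annotations) is to mark Pods for scraping; the port below is a hypothetical metrics endpoint:

apiVersion: v1
kind: Pod
metadata:
  name: web-app
  annotations:
    prometheus.io/scrape: "true"  # opt-in convention used by common scrape configs
    prometheus.io/port: "9102"    # hypothetical port where /metrics is exposed
spec:
  containers:
  - name: web
    image: nginx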

Fault tolerance and errors

Consider that Mother Nature is not just “safe.” It is aggressive in destroying and replacing, in selecting and reshuffling. Given the unattainability of perfect robustness, we need a mechanism by which the system regenerates itself continuously by using, rather than suffering from, random events, unpredictable shocks, stressors, and volatility.

Exactly the same principle applies to a Kubernetes environment: when a node dies for some reason, all of its containers are rescheduled onto other healthy nodes, rebalancing resource utilization along the way.
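One safe way to watch this rescheduling in action is to drain a node manually with kubectl cordon <node> followed by kubectl drain <node> --ignore-daemonsets; the evicted Pods are recreated on the remaining nodes.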

In terms of containers, there are liveness and readiness probes. A liveness probe takes care of restarting the container if the application is unable to make progress; a readiness probe determines when a container is ready to start accepting traffic.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
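For HTTP services, an httpGet probe is often a better fit than exec; a minimal sketch that would replace the livenessProbe above (the /healthz path and port 8080 are assumptions about the application):

livenessProbe:
  httpGet:
    path: /healthz  # assumed health endpoint exposed by the application
    port: 8080      # assumed container port
  initialDelaySeconds: 3
  periodSeconds: 3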

Another important technique is fault injection, often applied as part of a Chaos Engineering approach, which lets systems evolve and survive chaos.

From the Principles of Chaos Engineering:

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

kube-monkey is an implementation of Netflix’s Chaos Monkey for Kubernetes clusters. It randomly deletes Kubernetes pods in the cluster, encouraging and validating the development of failure-resilient services.
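As a sketch of how targets opt in (the label keys follow kube-monkey’s README; the identifier and schedule values below are illustrative), a victim Deployment carries labels like these in its metadata and Pod template:

metadata:
  labels:
    kube-monkey/enabled: enabled    # opt this Deployment in to chaos
    kube-monkey/identifier: web-app # identifier shared by the Deployment and its Pods
    kube-monkey/mtbf: '2'           # mean time between failures, in days
    kube-monkey/kill-mode: "fixed"  # terminate a fixed number of pods per run
    kube-monkey/kill-value: '1'     # namely, one pod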

Decentralization and isolation

Distributed systems resemble living organisms and respond better to unexpected events, provided some overcompensation (spare capacity) is in place. In combination with proper isolation, the blast radius of a failure stays limited.

Kubernetes provides the concept of namespaces; we can think of them as scoped virtual clusters with access control policies based on attribute-based access control (ABAC) or the more granular role-based access control (RBAC).

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
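A Role by itself grants nothing until it is bound to a subject with a RoleBinding; a minimal sketch (the user jane is hypothetical):

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane  # hypothetical user who gets read access to pods
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader  # the Role defined above
  apiGroup: rbac.authorization.k8s.io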

A namespace can also be limited on the network layer using network policies, which specify which groups of pods are allowed to communicate with each other and with other network endpoints.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978
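A useful isolation baseline is a default-deny policy: it selects every Pod in the namespace and, with no allow rules, blocks all traffic until more specific policies open it up:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  podSelector: {}  # an empty selector matches all Pods in the namespace
  policyTypes:
  - Ingress
  - Egress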

Nonprediction

It is sometimes hard to make decisions in uncertain situations. For example, how many resources should we allocate in order to handle unexpectedly high traffic? How many instances running in the cloud do we need? Unfortunately, we cannot predict the occurrence of rare events.

One of the Kubernetes advantages is the ability to scale up or down depending on the circumstances, e.g. resource usage, using the cluster-autoscaler, which adjusts the number of nodes and also deletes underutilized ones.

Kubernetes Cluster Autoscaler (via Prometheus) by bookchair

In terms of autoscaling on the application layer, we can use the Horizontal Pod Autoscaler, which automatically scales the number of Pods based on resource utilization or custom metrics.
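A minimal sketch, assuming a Deployment named web-app and scaling on CPU utilization:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app  # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70  # add Pods when average CPU exceeds 70%

The same can be created imperatively with kubectl autoscale deployment web-app --min=2 --max=10 --cpu-percent=70.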

Summary

Let’s be honest here :) It’s impossible to be perfectly antifragile, but failing often and fast can make our systems more resistant to errors.

In order to reduce fragility, you should constantly stress your deployment process and services running on top of Kubernetes.