To Be Stateless or Not

Most of the components that Kubernetes offers (Pods, Deployments, Jobs, CronJobs, etc.) assume that you are working with a stateless application. A stateless app is made up of several identical components that can be easily replaced. If you’ve heard the famous example of pets vs. cattle, stateless apps are regarded as cattle. If not, let me quickly shed some light on it: when you have several nodes hosting your application (those could be physical nodes, virtual nodes, or containers), you can treat them either as pets or as cattle. When you have a pet, you name it and you feed it; it’s unique among your other pets (you have a dog, a cat, and a turtle, each with distinctive characteristics). If one of your pets gets sick, it’s immediately noticeable, and you must do something about it.

On the other hand, if you have a herd of cattle with hundreds of heads, you don’t give them names. Nothing is unique about any of them as all of them look and behave the same way. If one of them gets sick or dies, you can easily replace it by purchasing a new head. Such an action is not noticeable.

Stateless apps tend to follow the twelve-factor app principles, which make them suitable for deployment to a cloud environment. However, almost all environments need some stateful component, typically used for data storage. When Kubernetes first came out, there was no support for stateful components. At that time, you’d place all your stateless components inside the cluster and keep the stateful ones outside of it, typically on on-prem hardware or dedicated cloud infrastructure. To address the needs of stateful apps, Kubernetes introduced StatefulSets in version 1.5.

Why Do ReplicaSets Fall Short When Dealing With “Pets”?

When I first learned about StatefulSets, I wondered: why bother creating (and using) a new controller? Why not just use a ReplicaSet or a Deployment (which uses ReplicaSets internally) to serve our stateful app? Let’s assume that you need to run a MySQL database inside your Kubernetes cluster. The database needs to run on only one instance (provided that you are not using a MySQL cluster) that serves as the data store. You can create a ReplicaSet and set the number of replicas to 1. The Pod template would have one container running the MySQL image, and a Persistent Volume Claim that points to a Persistent Volume that was provisioned beforehand. You’d also create a Service that leads to the Pod and provides a DNS name for your MySQL database. The following diagram shows how this setup would look:

ReplicaSets were designed to favor availability over consistency. Although you set the number of Pod replicas to 1 in the ReplicaSet definition, there is a good possibility that the ReplicaSet spawns an additional Pod (colored in red in the diagram), for example, if the first Pod does not respond to health or readiness probes, or if the Pod needs to be scheduled to another node. To ensure that no outage occurs, a ReplicaSet does not kill a Pod unless the new one proves to be up and running. A second MySQL Pod receiving DB requests and accessing the same data volume may corrupt the data. The problem becomes even worse if you are using multiple MySQL Pods that form a cluster among themselves. Clusters have special requirements such as quorum, well-defined hostnames, and unique networking, among others. Let’s have a look at the typical needs of a stateful application.

Dedicated Volumes

A stateful app needs its own dedicated storage. Remember the cattle vs. pets analogy? What a dog eats is different from what a cat eats, and both are different from what a turtle should be fed.

On the other hand, all cattle eat the same food. In a database cluster, each node must have its own storage volume, and the cluster software handles replication among the nodes. In a ReplicaSet, all Pods are created from the same template, so they all reference the same Persistent Volume Claim and end up sharing one volume. While you can use application logic to partition the shared volume so that each Pod uses a dedicated portion, this solution does not work when scaling. You can also create a separate, one-replica ReplicaSet for each Pod so that each of them has its dedicated volume. However, not only does this method add a lot of administrative burden, but it also prevents Kubernetes from handling scaling up or down. The cluster administrator would be responsible for scaling by manually creating or removing ReplicaSets.
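To see where the shared volume comes from, here is a minimal sketch of the single-replica MySQL ReplicaSet described earlier. The names mysql, mysql-data, and mysql-secret, as well as the image tag, are assumptions made for illustration, and the claim and the Secret are assumed to exist already. The point to notice is that the Pod template refers to one claim by name, so every Pod stamped out from this template mounts the very same volume:

```yaml
# Sketch only: a one-replica ReplicaSet for MySQL plus the Service in front of it.
# "mysql-data" and "mysql-secret" are hypothetical, pre-created objects.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: mysql
spec:
  replicas: 1               # raising this would make all Pods share the claim below
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: root-password
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mysql-data   # one fixed claim name for every Pod
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  selector:
    app: mysql
  ports:
  - port: 3306
```

There is no field in this template that lets each replica ask for a claim of its own, which is the gap that StatefulSets fill.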

Stable Network Identity

If you have a pet, you must give it a name so that you can call it. Similarly, a stateful application node must have a well-defined hostname and IP address so that other nodes in the same application know how to reach it. A ReplicaSet does not offer this functionality, as each Pod receives a random hostname and IP address when it starts or is restarted. In stateless applications, we use a Service that load-balances across the Pods behind it and offers a single DNS name through which you can reach any of the stateless Pods. In a stateful app, each node may need to connect to a specific node. A ReplicaSet cannot serve this purpose.

Start and Termination Order

Stateful applications often also require a specific order in which their nodes start (or terminate). This is especially important when working with a clustered application. A ReplicaSet does not follow a specific order when starting or killing its Pods.

To address all those requirements, Kubernetes offers the StatefulSet primitive.

StatefulSet Example

The following definition file creates two Pods, each hosting a container that runs Nginx:

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: nginx
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
    spec:
      serviceName: "nginx"
      replicas: 2
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: k8s.gcr.io/nginx-slim:0.8
            ports:
            - containerPort: 80
              name: web
            volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
      volumeClaimTemplates:
      - metadata:
          name: www
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi

Let’s apply the above definition:

    $ kubectl apply -f statefulset.yml
    service/nginx created
    statefulset.apps/web created

Dedicated Volumes

A stateful application often needs persistent storage that’s specific to each of its nodes. In Kubernetes, storage is provisioned through a Persistent Volume and a Persistent Volume Claim that accesses it. The StatefulSet definition guarantees that each Pod has its own storage by using the volumeClaimTemplates parameter. The volumeClaimTemplates automatically create a Persistent Volume Claim for each Pod. Let’s have a look.

    $ kubectl get pvc
    NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    www-web-0   Bound    pvc-01c744cc-cbff-11e9-9e03-025000000001   1Gi        RWO            hostpath       51s
    www-web-1   Bound    pvc-1c8e6c2e-cbff-11e9-9e03-025000000001   1Gi        RWO            hostpath       6s

    $ kubectl get pv
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS   REASON   AGE
    pvc-01c744cc-cbff-11e9-9e03-025000000001   1Gi        RWO            Delete           Bound    default/www-web-0   hostpath                49s
    pvc-1c8e6c2e-cbff-11e9-9e03-025000000001   1Gi        RWO            Delete           Bound    default/www-web-1   hostpath                13s

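We have two volume claims created for us. The volume claim names were not randomly selected; they follow a naming convention: the volume claim template name, the StatefulSet name, and the Pod ordinal, joined by hyphens (www-web-0, www-web-1). We also have two volumes that back the PVCs, one volume for each Pod.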
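To make the claim names in the output above concrete, the claim generated for web-0 is roughly equivalent to the hand-written manifest below. This is an illustration only: the StatefulSet controller creates the claim itself from volumeClaimTemplates, so you would not normally apply such a file, and the real object carries extra metadata that is left out here.

```yaml
# Illustrative only: approximately what gets generated for Pod web-0.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: www-web-0          # <claim template name>-<StatefulSet name>-<ordinal>
  namespace: default
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 1Gi         # copied from the volumeClaimTemplates entry
```

Because the name is derived from the Pod ordinal, a rescheduled web-0 finds the same claim again and therefore the same data.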

Important Things to Notice About How StatefulSets Deal With Volumes:

A StatefulSet does not create a volume for you. Volumes should either be pre-provisioned by the administrator or be provisioned dynamically by a PersistentVolume provisioner based on the requested storage class.

When a StatefulSet is deleted, the respective volumes are not deleted with it. This behavior is intentional to avoid accidental data loss.
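Both points map to fields in the StatefulSet spec. The fragment below is a sketch, not a complete manifest: it assumes the hostpath storage class seen in the earlier output, and it assumes a Kubernetes release recent enough to support the persistentVolumeClaimRetentionPolicy field (older clusters reject or ignore it, so check your version before relying on it).

```yaml
# Fragment of the "web" StatefulSet; selector and template are omitted for brevity.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  # Retain is the default for both settings: claims survive deletion and scale-down.
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      storageClassName: hostpath   # assumption: the class that provisions the volumes
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
```

Leaving storageClassName out makes the claim fall back to the cluster's default storage class, if one is defined.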

Stable Network Identity

Let’s have a look at the Pods that the StatefulSet created:

    $ kubectl get pods
    NAME    READY   STATUS    RESTARTS   AGE
    web-0   1/1     Running   0          3h7m
    web-1   1/1     Running   0          3h6m

If we had used a ReplicaSet instead, we’d have a different output concerning the Pod names. ReplicaSets give random names to the Pods they create. StatefulSets, however, use ordinal, predictable names for their Pods, starting with 0. The Pod name consists of the StatefulSet name followed by a hyphen and the ordinal number. So, our Pods have well-defined names. How can we access each one of them?

The first part of the definition file (lines 1 to 13) defines a Service, but its clusterIP is set to None. A Service that does not expose a cluster IP address is called a Headless Service. When a client resolves the DNS name of a headless Service, the DNS server replies with the endpoints of the individual Pods instead of a single Service IP. Let’s have a demonstration:

Create an ephemeral Ubuntu Pod to run our tests:

kubectl run -i --tty ubuntu --image=ubuntu:18.04 --restart=Never -- bash -il

We’ll need to install dig to issue DNS queries:

root@ubuntu:/# apt update && apt install dnsutils

Now, let’s test the headless service by issuing an SRV query:

    root@ubuntu:/# dig SRV nginx.default.svc.cluster.local

    ; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV nginx.default.svc.cluster.local
    ;; global options: +cmd
    ;; Got answer:
    ;; WARNING: .local is reserved for Multicast DNS
    ;; You are currently testing what happens when an mDNS query is leaked to DNS
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41247
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 3
    ;; WARNING: recursion requested but not available

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ; COOKIE: 5c4ff0df2ede50ff (echoed)
    ;; QUESTION SECTION:
    ;nginx.default.svc.cluster.local. IN SRV

    ;; ANSWER SECTION:
    nginx.default.svc.cluster.local. 5 IN SRV 0 50 80 web-0.nginx.default.svc.cluster.local.
    nginx.default.svc.cluster.local. 5 IN SRV 0 50 80 web-1.nginx.default.svc.cluster.local.

    ;; ADDITIONAL SECTION:
    web-0.nginx.default.svc.cluster.local. 5 IN A 10.1.2.192
    web-1.nginx.default.svc.cluster.local. 5 IN A 10.1.2.193

    ;; Query time: 0 msec
    ;; SERVER: 10.96.0.10#53(10.96.0.10)
    ;; WHEN: Sat Aug 31 18:30:42 UTC 2019
    ;; MSG SIZE rcvd: 354

The relevant information is in the ANSWER and ADDITIONAL sections: the per-Pod DNS names and the IP addresses of the backend Pods. So, if our Ubuntu Pod needs to connect to the first Pod of the StatefulSet, it can address it as web-0.nginx.default.svc.cluster.local.
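The same per-Pod name can be used by any workload in the cluster, not just from an interactive shell. As a minimal sketch, the one-off Pod below fetches the default page from web-0 specifically. The Pod name web-client and the busybox image tag are assumptions chosen for illustration, and the StatefulSet is assumed to run in the default namespace as in the example above.

```yaml
# Sketch: a one-off client that talks to one specific StatefulSet Pod by name.
apiVersion: v1
kind: Pod
metadata:
  name: web-client           # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: client
    image: busybox:1.36
    # <pod name>.<headless service>.<namespace>.svc.cluster.local
    command: ["wget", "-qO-", "http://web-0.nginx.default.svc.cluster.local"]
```

After the Pod completes, kubectl logs web-client shows whatever web-0 returned. Pointing the URL at web-1 instead reaches the second Pod, which a regular load-balanced Service name could not guarantee.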

Notice that you are required to create a Headless Service before creating a StatefulSet because you need to link them through the serviceName parameter.

Start and Termination Order

Since stateless apps are identical by nature, there should not be any particular order for their start or termination. A ReplicaSet starts and kills all of its Pods at once. That’s fine for Pods that do not need to maintain their state or synchronize data among each other. But a stateful application may not behave correctly if all of its Pods started or were terminated at once. Think of a clustered application that elects a master node to operate. In this app, nodes should start one after another to give a chance for data synchronization and other in-app behavior.

In a StatefulSet, Pods are started one after the other. A new Pod is not started until the previous one has reported that it is healthy and running. When a new Pod is spawned, it is numbered according to its order of start. In our example, web-0 started before web-1. When you delete a StatefulSet or scale it down, Pods get terminated in reverse order. So, web-1 is killed before web-0.

If your stateful application does not need this feature (it does not require its nodes to start or shut down in order), you can disable it for faster deployment. Setting the .spec.podManagementPolicy parameter to Parallel (the default is OrderedReady) makes the controller start and kill Pods without waiting for one another. Note that this setting affects only scaling operations, not updates.
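In the example StatefulSet, this is a one-line addition to the spec. The fragment below is a sketch with the rest of the definition left out. Note that the value is case-sensitive, and, as far as I know, the field cannot be changed on an existing StatefulSet, so it has to be set when the object is created.

```yaml
# Fragment of the "web" StatefulSet; selector, template, and volumeClaimTemplates omitted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  podManagementPolicy: Parallel   # default: OrderedReady
  serviceName: "nginx"
  replicas: 2
```

The Pods keep their ordinal names and dedicated volumes; only the waiting between them is dropped.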

Canary Releases Using Partitioned Updates

Canary Deployment refers to updating only a sample of the nodes hosting a particular software to the new version. The remaining nodes keep hosting the old version of the application. The primary use case of Canary Deployments is to guarantee the gradual release of a feature or bug fix and a faster rollback if needed.

StatefulSets allow you to roll out your updates in a Canary Release manner by using partitioned updates. The .spec.updateStrategy.rollingUpdate.partition parameter controls which Pods are updated when you make changes to the Pod template (for example, change the image tag). If omitted, this parameter defaults to 0, which means that all Pods are updated. However, if you set it to a number, only Pods with an ordinal number higher than or equal to that number are affected by the update while the rest remain untouched. For example, consider the following snippet of a StatefulSet definition file:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
    spec:
      updateStrategy:
        rollingUpdate:
          partition: 1
      serviceName: "nginx"
      replicas: 2

If we make changes to the Pod template of this StatefulSet, for example, change the image from nginx to httpd, then only Pods with an ordinal number greater than or equal to 1 are updated. In our example, this means only web-1. web-0 keeps using nginx. Even if web-0 is deleted, it is recreated using the old Pod template (nginx image).

If you want to avoid any updates to the StatefulSet Pods, you can set the rollingUpdate partition to a number that is equal to or greater than the replicas count. In our example, we can set this number to 2 since we have only 2 replicas (remember, the ordinal numbering starts at 0, so no Pod has an ordinal of 2 or higher). In this case, any updates to the Pod template are not reflected in any of the Pods. You can think of this option as a failsafe button to prevent any other administrator from making accidental Pod changes to a critical StatefulSet deployment.
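For our two-replica example, the failsafe setting would look like the fragment below (a sketch that leaves out the unchanged parts of the definition):

```yaml
# Fragment of the "web" StatefulSet with template updates effectively frozen.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2      # ordinals are 0 and 1, so no Pod is >= 2
  serviceName: "nginx"
  replicas: 2
```

To release a staged change, lower the partition step by step: 1 updates only web-1 (the canary), and 0 rolls the change out to every Pod.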

TL;DR

Many legacy applications were not designed to be cloud-native. They have special requirements regarding how their data should be stored, how their identity should be preserved on the network, and how they should be handled when started or shut down. Through StatefulSets, Kubernetes offers support for most of those requirements.

StatefulSets are meant to deploy applications that maintain their state. Their replicas are not identical as each one needs its own identity and storage.

ReplicaSets fall short when used to deploy a stateful application because they treat all the Pods the same and give them random hostnames and IP addresses that change on restarts.

A StatefulSet guarantees that every Pod has a unique hostname and dedicated storage. Using a StatefulSet, you ensure that no duplicate Pods get created. If a Pod needs to be moved to a different node, the new Pod does not start until Kubernetes has confirmed that the old one has completely shut down. This is not guaranteed when using ReplicaSets.

StatefulSets offer built-in support for the Canary Release strategy. They allow you to choose which Pods are affected by an update while the rest keep using the old version. This makes changes easier to roll back and limits the damage that a faulty update can do to the application.

StatefulSets are typically used with databases and clustered applications that have specific requirements like quorum.

*The outline of this article is inspired by the book Kubernetes Patterns by Roland Huss and Bilgin Ibryam.