Kubernetes and storage

I’ve been learning Kubernetes for a few months now, and one of the areas where I have spent a lot of time testing and experimenting is storage. I have extensively tested a few open source solutions, namely Rook, Rancher’s Longhorn, and OpenEBS. I haven’t really done much testing with closed source offerings like Portworx or StorageOS because I prefer open source and because I couldn’t afford them anyway :)

After several weeks of testing, aimed at defining a complete stack for a project I am working on, I settled on OpenEBS.

Rook is a solid option overall, and with Ceph it offers a battle-tested storage solution that has been used in production for years, powering very large clusters. Unfortunately, I ran into some serious stability issues with the latest versions (1.x) and eventually gave up after spending a considerable amount of time trying to solve them. One of my tests was the migration of some data from an existing deployment to persistent Ceph volumes in Kubernetes. With Ubuntu as the operating system for the nodes, the system load would very often climb wildly while copying the data, to the point that most of the time the nodes would become unresponsive and I had to forcefully reboot them. Tests were somewhat more “stable” with another OS like CentOS or Fedora, but even then I would at times see high iowait that didn’t make the nodes unresponsive but still forced me to reboot because things had stalled. I have tried many, many times to use Rook 1.x (I don’t recall having the same issues with 0.9.x), but as of today I haven’t been able to complete all my tests in a way that makes me feel confident about actually using Rook for my project.

Rancher’s Longhorn is really interesting. It has a great UI that makes management of storage very easy, as well as built in backups to S3 compatible storage. Unfortunately, CPU usage is quite high even when the system is idle, and when for example upgrading a workload or restoring from backup, attach/mount times for the volumes are often too long compared to the other options. Backups and restores work very reliably, but they are very, very slow. Backing up the same volume with just 25 GB of data would take minutes with Velero but up to one hour with Longhorn’s built in backup functionality (same cluster, same backup target). And restoring the same volume from backup worked great, but was also painfully slow.

OpenEBS

What about OpenEBS? Well, everything with OpenEBS “just works”. It’s super easy to install and configure, and I didn’t run into any serious issues during my testing. I did find a few issues but in most cases these were promptly fixed by the devs and the team is always readily available to help on the Slack channel, which is something I am impressed with. OpenEBS offers three storage engines:

Jiva is actually based on the same technology that powers Longhorn, and works out of the box without requiring any particular configuration. By default it stores the data in a directory on the main disk of the node, although the location can be configured in the storage class.

cStor requires raw disks and has more advanced features than Jiva, especially concerning snapshots and cloning; it also has a very handy plugin for Velero that allows backups of volumes based on point-in-time snapshots. I have tested this successfully even with volumes containing databases (e.g. MySQL) without any data corruption.

Local PV is based on Kubernetes local persistent volumes but it has a dynamic provisioner. It can store data either in a directory or on disks; in the first case the hostpath can be shared by multiple persistent volumes, while when using disks each persistent volume requires a separate device. Local PV offers extremely high performance, close to what you get by reading from and writing to the disk directly, but it doesn’t offer features such as replication, which are built into Jiva and cStor. It’s an excellent choice for workloads like database systems that already have replication built in, so you don’t need replication at the storage level. These workloads can therefore benefit from much better performance compared to Jiva and cStor.

I recommend you visit the docs for more information on the storage engines and how to install OpenEBS; here I would like to share a few things that I have learned while using it.

Tips

Jiva: See status of the replicas

For each persistent volume, Jiva creates one controller pod and one pod for each replica in the same namespace as the volume, by default. It is possible to check the status of the replicas of a volume by running the following commands:

pvc=<name of the persistent volume claim>
pod_name=$(kubectl -n openebs get pod -l "openebs.io/persistent-volume-claim=$pvc" -o jsonpath='{.items[0].metadata.name}')
container_name=$(kubectl -n openebs get pod "$pod_name" -o jsonpath='{.spec.containers[0].name}')
kubectl -n openebs exec -it "$pod_name" -c "$container_name" -- curl localhost:9501/v1/replicas | jq

Jiva: Clean up scrub jobs

Whenever a Jiva volume is deleted, a “scrub” job is created to remove the data. These jobs remain in the namespace of the volume but they can be removed once completed. To do this, you can run the following:

kubectl -n <namespace> delete job $(kubectl -n <namespace> get job -o=jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}')

This will only remove the jobs that have completed successfully. <namespace> here can be either the namespace of the volume or the openebs namespace, depending on where Jiva creates its pods - see next tip.
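One caveat with the one-liner above: if no job has completed yet, the inner command prints nothing and kubectl delete job is called without a name, which fails. Below is a minimal sketch of a wrapper that skips the delete in that case. The function name cleanup_scrub_jobs is my own invention, not part of OpenEBS. Like the one-liner, it removes every successfully completed job in the namespace, not only the scrub jobs.

```shell
# Hypothetical helper (not part of OpenEBS): delete the successfully completed
# jobs in a namespace, doing nothing when there are none.
cleanup_scrub_jobs() {
  local namespace="$1"
  local jobs

  # jsonpath prints the matching job names separated by spaces.
  jobs=$(kubectl -n "$namespace" get job \
    -o=jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}')

  if [ -z "$jobs" ]; then
    echo "No completed jobs found in namespace $namespace"
    return 0
  fi

  # $jobs is left unquoted on purpose so that each name becomes its own argument.
  kubectl -n "$namespace" delete job $jobs
}

# Usage:
#   cleanup_scrub_jobs <namespace>
```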

Jiva: Create pods in the openebs namespace

By default, Jiva creates its pods in the same namespace as the volume. This kinda clutters the namespace, and can cause a problem when using Restic backups with Velero: when a namespace that has Jiva volumes is backed up, Velero includes the state of the Jiva pods in the backup, so when you restore the backup Velero restores the Jiva pods too. But Jiva also creates its own pods when the volume to be restored is created, so you end up with twice as many Jiva pods as there should be. To work around this issue, you can change the Jiva storage class so that the Jiva pods are created in the openebs namespace instead of the volume’s namespace. This way, when Velero backs up the volume’s namespace, it won’t include a reference to the Jiva pods and won’t attempt to restore them when restoring the backup. You can change the storage class as follows:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    cas.openebs.io/config: |-
      - name: ReplicaCount
        value: "3"
      - name: StoragePool
        value: default
      - name: DeployInOpenEBSNamespace
        enabled: "true"
    openebs.io/cas-type: jiva
  name: openebs-jiva-default
provisioner: openebs.io/provisioner-iscsi
reclaimPolicy: Delete
volumeBindingMode: Immediate

Note the DeployInOpenEBSNamespace setting.

cStor: Run ZFS commands

cStor uses ZFS under the hood, so at times you may need to run some zfs commands for troubleshooting. You can do this inside the cstor-pool container of the pod for the specific cStor pool you are looking into. For example:

pod_name=$(kubectl -n openebs get pod -l app=cstor-pool -o jsonpath='{.items[0].metadata.name}')
kubectl -n openebs exec -it "$pod_name" -c cstor-pool -- zfs list

Get the size of a volume

You can find information on a volume by using the mayactl utility in the maya-apiserver pod:

pv=$(kubectl -n <namespace> get pvc <pvc name> -o=jsonpath='{.spec.volumeName}')
pod_name=$(kubectl -n openebs get pod -l name=maya-apiserver -o jsonpath='{.items[0].metadata.name}')
kubectl -n openebs exec -it "$pod_name" -- mayactl volume stats --volname "$pv" -n openebs

In this example, mayactl looks in the openebs namespace for the volume. This assumes that you are using either cStor or Jiva with DeployInOpenEBSNamespace set to true.

Mount and use a volume on a node bypassing Kubernetes

During my testing, like I mentioned earlier, I’ve had to migrate some data from an existing non-Kubernetes deployment to persistent volumes in Kubernetes. I tried with kubectl exec into the pod using the volume, but the session would be terminated if no output was sent to STDOUT for a few minutes (for example when downloading large files). So, instead, I mounted the volume directly on a directory on a node so that I could do the transfer outside Kubernetes. Since this requires quite a few commands, I wrote a simple script for it which you can adapt to your case:

#!/bin/bash

namespace=$1
volume_name=$2
action=$3
mount_path=$4

# Login user for SSH on the node; set SSH_USER before running the script.
ssh_user=${SSH_USER:?Set SSH_USER to the SSH login user on the node}

# Use the first node of the cluster unless a node IP is passed as fifth argument.
detected_node_ip=$(kubectl get nodes -o=jsonpath='{.items[0].status.addresses[0].address}')
node_ip=${5:-$detected_node_ip}
node_name=$(kubectl get nodes -o=jsonpath='{.items[0].metadata.name}')

# Look up the iSCSI target of the persistent volume bound to the claim.
pv=$(kubectl -n "$namespace" get pvc "$volume_name" -o=jsonpath='{.spec.volumeName}')
iqn=$(kubectl get pv "$pv" -o=jsonpath='{.spec.iscsi.iqn}')
targetPortal=$(kubectl get pv "$pv" -o=jsonpath='{.spec.iscsi.targetPortal}')

# Run a command on the node over SSH.
function run() {
  ssh "${ssh_user}@${node_ip}" "$1"
}

if [[ $action == 'mount' ]]; then
  # login session
  run "sudo iscsiadm -m discovery -t sendtargets -p $targetPortal"
  run "sudo iscsiadm -m node -l -T $iqn -p $targetPortal"

  # Find the block device attached for this target.
  device=/dev/$(run "sudo iscsiadm -m session -P 3 | grep -E 'iqn|disk' | grep $iqn -A 2 | tail -n 1 | cut -f 4 | cut -d ' ' -f 4")

  run "sudo mkdir -p $mount_path"
  run "sudo mount $device $mount_path"

  echo "Mounted $device to $mount_path on $node_name"
else
  run "sudo umount $mount_path"
  run "sudo rmdir $mount_path"

  # logout session
  run "sudo iscsiadm -m node -u -T $iqn -p $targetPortal"

  echo "Unmounted."
fi

E.g. to mount a volume to a directory:

./script.sh <namespace> <pvc name> mount /mnt/myvolume

To unmount:

./script.sh <namespace> <pvc name> umount /mnt/myvolume
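Note that the script treats any action other than mount as an unmount, and it doesn’t check how many arguments it received, so a typo sends it straight to the umount/rmdir branch. If you adapt it, a small guard at the top may be worth adding. What follows is just a sketch; check_args is a hypothetical helper that I am introducing here, it is not part of the original script.

```shell
# Hypothetical helper: validate the script's arguments before touching iSCSI.
# Prints a message to stderr and returns 1 on bad input.
check_args() {
  if [ "$#" -lt 4 ]; then
    echo "Usage: $0 <namespace> <pvc name> mount|umount <mount path> [node ip]" >&2
    return 1
  fi

  case "$3" in
    mount|umount)
      return 0
      ;;
    *)
      echo "Unknown action '$3' (expected mount or umount)" >&2
      return 1
      ;;
  esac
}

# In the script, call it before anything else:
#   check_args "$@" || exit 1
```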

cStor: Add storage

One great thing about OpenEBS is thin provisioning, which means that you can create volumes whose size is much bigger than the storage actually available, and then add more storage as the actual usage grows. You can add storage either by adding disks to a cStor pool or, if you can resize the underlying disk (some cloud providers allow this), by expanding the pool.

You can find instructions on how to add a disk to a pool here and how to expand a single disk pool here.

cStor and Velero backups

Like I mentioned earlier, cStor offers a plugin for Velero that allows a backup to include snapshots of the volumes. It’s very handy, but there are a few things to keep in mind and some extra commands to run in order to successfully do backups and restores - see the plugin’s README on GitHub. To simplify things, I wrote a couple of wrapper scripts that automate the process. The scripts require Ruby and you can find them in this GitHub repo, so take a look at the code if you are familiar with Ruby.

For backups, first you need to define a snapshot location for Velero. I use Wasabi as S3 compatible storage and I usually just delete the default snapshot location and then kubectl apply the following:

apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: openebs.io/cstor-blockstore
  config:
    bucket: <bucket name>
    prefix: openebs
    provider: aws
    region: <bucket region>
    s3Url: <s3 endpoint>

Then, to perform a backup:

./backup.rb --backup <backup name> --include-namespaces <namespaces>

and to restore:

./restore.rb --backup <backup name> --include-namespaces <namespaces>

Local PV

I don’t have much to say about Local PV (other than that I love it because of the performance), because it’s very simple and doesn’t have all the features that Jiva and cStor have. One thing I would like to mention, though, is that the default storage class for the hostpath option, openebs-hostpath, stores the data in /var/openebs/local. If you wish to use a different path (for example on a larger disk), do not change that storage class, because it gets overwritten when the maya-apiserver pod restarts. Instead, create a new storage class. For example, I use a disk mounted at /mnt/openebs:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    cas.openebs.io/config: |
      - name: StorageType
        value: "hostpath"
      - name: BasePath
        value: "/mnt/openebs"
    openebs.io/cas-type: local
  name: openebs-hostpath-mount
provisioner: openebs.io/local
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
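To actually use the new storage class, a persistent volume claim only needs to reference it by name. Here is a small sketch that prints such a claim so it can be piped to kubectl. The pvc_manifest function, the claim name and the size are all placeholders of mine; only the storage class name comes from the example above.

```shell
# Hypothetical helper: print a PersistentVolumeClaim manifest.
# Arguments: <claim name> <size> <storage class>
pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: $3
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $2
EOF
}

# Example (claim name and size are placeholders):
#   pvc_manifest my-data 10Gi openebs-hostpath-mount | kubectl -n <namespace> apply -f -
```

Keep in mind that with volumeBindingMode set to WaitForFirstConsumer the claim stays in the Pending state until a pod that uses it is scheduled, so that is expected and not an error.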

Wrapping up

OpenEBS is an excellent solution for storage in Kubernetes and offers several engines to fit every need. I am currently using Local PV because I don’t need replication at the storage level and I love its performance, but the other storage engines have interesting features, especially cStor with the plugin for Velero backups.

If you are looking to evaluate options to manage storage in your Kubernetes clusters, do take a look at OpenEBS. It just works and you won’t be disappointed.