In my last post about Rook & Ceph, we talked generally about storage options on Kubernetes and how Rook and Ceph work at a high level. Today I want to dive a bit deeper into day-to-day operations and share some of the things we've learned while managing storage on our clusters!

A storage system should primarily be judged on how resilient it is to failure and how it behaves during outages. Rook and Ceph do quite well at keeping cluster storage online, with the caveat that some manual effort is required to keep the system "healthy" rather than "warning".

Before we get started, we need to set up the ceph tool!

The ceph-toolbox is more or less vital for running a Rook/Ceph storage system, so it should always be installed.

We use the ceph command-line utility so often that we store an alias for the following:

```shell
alias ceph='kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath="{.items[0].metadata.name}") -- ceph'
```

Note the single quotes: they defer the pod lookup until you actually run ceph, so the alias keeps working after the tools pod restarts. The -- separates kubectl's own flags from the command run inside the container.

This allows us to enter the ceph-tools container by typing ceph while using the appropriate kubectl context. For this tutorial, I'll be using ceph> to indicate commands given in the ceph shell.

Using the ceph status command, we'd usually see something like this:

```
ceph> status
  cluster:
    health: HEALTH_OK
  services:
    osd: 2 osds: 2 up (since 1d), 2 in (since 1d)
  data:
    pgs: 400 active+clean
```

HEALTH_OK really means it - there's nothing to do here and we're all green!

OSD Failure:

A common cause of issues is OSD failure. Let's say some catastrophic hardware failure means a server is simply never coming back and needs to be removed from the storage system.

In the status output above, you can see 2 up, 2 in - those are our OSDs, the processes that manage bits of storage in our cluster. They're not always mapped one-to-one with the servers in your cluster, but for this example let's assume they are. One day, you check on the cluster and notice:

```
ceph> status
  cluster:
    health: HEALTH_WARN
            Reduced data availability: 60 pgs inactive
            Degraded data redundancy: 642/1284 objects degraded (50.000%), 60 pgs degraded, 60 pgs undersized
  services:
    osd: 2 osds: 1 up (since 2d), 1 in (since 2d)
  data:
    pgs: 100.000% pgs not active
         642/1284 objects degraded (50.000%)
         60 undersized+degraded+peered
```

Ack! 1 up, 1 in out of 2! degraded (50.000%)! What happened!? Let's look at our nodes:

```
> kubectl get nodes
NAME      STATUS     ROLES    AGE   VERSION
worker1   Ready      <none>   1d    v1.16.3
worker2   NotReady   <none>   1d    v1.16.3
```

In this case, I unplugged one of my Raspberry Pis, but in other cases that server is never coming back, so let's replace it! Assuming we can just plug in another Pi, we get a glass of water, and then:

```
> kubectl get nodes
NAME      STATUS     ROLES    AGE   VERSION
worker1   Ready      <none>   1d    v1.16.3
worker2   NotReady   <none>   1d    v1.16.3
worker3   Ready      <none>   5m    v1.16.3
```

Adding a new OSD:

Let's edit our CephCluster resource with kubectl edit cephcluster -n rook-ceph. Under spec -> storage -> nodes, you can add your new node.
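For reference, here's a sketch of what that part of the spec might look like after the edit. The node and device names here (worker3, sda) are placeholders for this example; yours will depend on your hardware:

```yaml
# Fragment of the CephCluster resource; only the storage section is shown.
# worker3 and the device name "sda" are example values for this sketch.
spec:
  storage:
    useAllNodes: false
    nodes:
      - name: worker1
        devices:
          - name: "sda"
      - name: worker3   # the replacement node we just joined
        devices:
          - name: "sda"
```

Once the edit is saved, the Rook operator notices the new node and begins provisioning an OSD on it.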

You can check either the operator logs:

```shell
kubectl -n rook-ceph logs -l app=rook-ceph-operator
```

Or watch the namespace for the new OSD pod to come up:

```shell
kubectl -n rook-ceph get pods
```

Once the OSD starts, you'll see it start to replace the fallen server:

```
ceph> status
  services:
    osd: 3 osds: 2 up (since 2m), 2 in (since 3m)
  data:
    pgs: 40.000% pgs not active
         177/1288 objects degraded (13.742%)
         36 active+clean
         16 undersized+degraded+remapped+backfill_wait+peered
         4  peering
         3  activating
         1  undersized+degraded+remapped+backfilling+peered
```

And, eventually, the cluster will return to HEALTH_OK once it's finished peering! You'll be left with a situation like the one above: osd: 3 osds: 2 up (since 2m), 2 in (since 3m). So, let's remove that now-dead OSD!

Removing an old OSD:

So our worker2 is never coming back. Let's make sure we know which OSDs existed on that system:

```
ceph> osd tree down
ID  CLASS  WEIGHT   TYPE NAME       STATUS
-9         0.07570  host worker2
 0  ssd    0.07570      osd.0       down
```

So let's go ahead and remove osd.0 for good. We can do that with the following sequence of commands:

1. ceph osd out osd.0
2. ceph status - ensure the cluster is healthy and recovery is complete
3. kubectl -n rook-ceph delete deployment rook-ceph-osd-0
4. ceph auth del osd.0
5. ceph osd crush remove osd.0
6. ceph osd rm osd.0
7. kubectl delete node node-with-osd-0
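The sequence above can be collected into a small helper that just prints the plan for a given OSD, so you can review and paste each command yourself rather than running anything destructive automatically. This is a sketch of our own - remove_osd_plan and its arguments are hypothetical, not part of Rook or Ceph:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the removal sequence for one OSD so each step
# can be reviewed and run by hand. Nothing here executes against the cluster.
remove_osd_plan() {
  local id="$1"    # numeric OSD id, e.g. 0 for osd.0
  local node="$2"  # the Kubernetes node that hosted the OSD
  cat <<EOF
ceph osd out osd.${id}
# wait for 'ceph status' to show the cluster healthy and recovery complete
kubectl -n rook-ceph delete deployment rook-ceph-osd-${id}
ceph auth del osd.${id}
ceph osd crush remove osd.${id}
ceph osd rm osd.${id}
kubectl delete node ${node}
EOF
}

remove_osd_plan 0 worker2
```

Running it with 0 and worker2 prints the exact list from above, with osd.0 and worker2 substituted in.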

Make sure you're checking ceph status often and confirming the "recovery" io is nicely finished before moving on to other OSDs. Rook and Ceph are excellent at preventing data loss, but it's important to get comfortable with the platform before you start doing things in parallel!

Wrapping up

Rook and Ceph are fantastic tools for operators. They also serve as the backend for our storage on hosted KubeSail clusters, so you don't need to worry about any of the above! For those of you running your own clusters, here are a couple of my favorite Rook/Ceph resources and videos:

Thanks for reading, and as always, feel free to reach out on gitter if you have any questions or comments!