3. CI & Matrixed Builds

Of course these tests — especially running against Kind — can be run in your CI (see our GitHub Actions).

There are a bunch of versions of k8s, so we need to be explicit about which ones we’re targeting — and test appropriately. Given your test suite is using Kind, you can pass the version in via an env var and even matrix the builds, so you can be sure that a given change doesn’t compromise support for a version of k8s you weren’t thinking about at the time (see main.yaml):

Yaml for matrixed builds

Matrixed builds in GitHub actions
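As a sketch of what such a workflow can look like (the job name, env var, make target and Kind node image tags here are illustrative, not taken from the actual main.yaml):

```yaml
name: test
on: [push, pull_request]

jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false   # keep testing the other versions if one fails
      matrix:
        # Illustrative list -- pin the k8s versions you actually support
        k8s-version: ["v1.27.3", "v1.28.0", "v1.29.0"]
    steps:
      - uses: actions/checkout@v4
      - name: Create Kind cluster
        uses: helm/kind-action@v1
        with:
          node_image: kindest/node:${{ matrix.k8s-version }}
      - name: Run e2e tests
        env:
          K8S_VERSION: ${{ matrix.k8s-version }}
        run: make test-e2e
```

Each matrix entry gets its own job, so a regression against any one k8s version fails its build independently.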

4. Observability

Without reasonable observability, your operator becomes a black box sitting inside another black box. This is crucial not just for ‘getting to prod’, but in developing the thing in general — so make sure to schedule ample time to flesh this out.

Logs. The most obvious and basic aspect of observability. Be generous with logging. This is a nice way to capture everything the operator is doing in one place, and allows you to stream logs whilst you watch it run: kubectl logs my-pod-name -c manager -f

Log all the things

Events. Events allow you to post messages back onto the custom resource your operator is watching. The nice aspect of this is that the end user will likely not have access to the operator logs, but may be able to run kubectl describe MyCustomResource, which will show them the events relevant to their resource. Beware, however: by default events expire after a short TTL (the kube-apiserver’s --event-ttl defaults to one hour).

Metrics. Prometheus has become a standard for monitoring k8s, so why not expose your own metrics for it to scrape? Use the Prometheus client (github.com/prometheus/client_golang/prometheus) to output any metrics you feel are relevant.

Push metrics to the endpoint

Graphs. Metrics are good, but only really useful when you can understand them and use them to make decisions. Here’s where Grafana comes in. It plays very well with Prometheus and helps you visualise your operator from a distance. This is key for day 2 operations. How do you know your operator is still doing the thing it’s supposed to be doing?

Single pane of glass for your operator

5. Stability & Recovery

We have to remember that the operator is likely running in a single thread in a single pod (you can run multiple replicas, but a single one is the default). This means that if it falls over, anything dependent on it will also fail. How dangerous this is again depends on what your operator is actually doing — but it needs to be considered. We had to think about:

Input Validation. A person is going to fire a text file at your operator (likely yaml). What could they put in there that would break your code? Fortunately schema validation takes place before your code is called — so you can be confident that an int field will contain an int value. However, what if you’re expecting 1-10 and someone sends you 1000000? How will your code react?
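Even with schema validation in front of you, a defensive check in the reconciler is cheap insurance. A self-contained sketch (the Spec struct and field are hypothetical):

```go
package main

import "fmt"

// Spec mirrors the relevant part of a hypothetical custom resource.
type Spec struct {
	Replicas int // schema guarantees an int, not that it's sensible
}

// validate enforces the semantic rules the schema might not:
// here, that Replicas falls in the range we actually support.
func validate(s Spec) error {
	if s.Replicas < 1 || s.Replicas > 10 {
		return fmt.Errorf("spec.replicas must be between 1 and 10, got %d", s.Replicas)
	}
	return nil
}

func main() {
	fmt.Println(validate(Spec{Replicas: 3}))       // nil: accepted
	fmt.Println(validate(Spec{Replicas: 1000000})) // error: rejected early
}
```

Better still, push such bounds into the CRD itself where you can — kubebuilder’s validation markers (e.g. `+kubebuilder:validation:Maximum=10`) generate the OpenAPI constraints so the API server rejects bad input before your code ever sees it.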

Idempotency. This is important not just for recovery — but consider what will happen if (when) your operator gets restarted, and k8s gives it all the custom resources to process again. What happens then? You need to consider code paths where the object you want to create already exists. How do you know it exists? How do you know it’s valid?
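The usual shape is check-then-create-or-correct — controller-runtime packages it up as controllerutil.CreateOrUpdate for real objects. A self-contained sketch of the pattern, with a map standing in for the k8s API:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// store stands in for the k8s API; in a real reconciler these would be
// client.Get / client.Create / client.Update calls.
type store map[string]string

func (s store) get(name string) (string, error) {
	v, ok := s[name]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

// ensure is idempotent: re-running it against an already-converged
// state is a no-op, and a drifted object is corrected, not duplicated.
func ensure(s store, name, desired string) error {
	current, err := s.get(name)
	switch {
	case errors.Is(err, errNotFound):
		s[name] = desired // doesn't exist yet: create
	case err != nil:
		return err // genuine failure: surface it, retry later
	case current != desired:
		s[name] = desired // exists but drifted: update
	}
	return nil // exists and matches: nothing to do
}

func main() {
	s := store{}
	ensure(s, "mycr-config", "v1")
	ensure(s, "mycr-config", "v1") // restart replays the resource: no-op
	fmt.Println(len(s), s["mycr-config"])
}
```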

Liveness / Readiness Probes. How does k8s know that your operator is unhappy? Can it tell if something in the runtime has become unresponsive? Here’s where liveness and readiness become important. Ours were nice and simple, and could be extended if the project became more complex:

livenessProbe in manager.yaml
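The manager container’s probe config might look something like this (the paths and port here match kubebuilder’s default scaffold; adjust to whatever main.go actually serves, and the timing numbers are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```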

Returning 200 when we’re alive — main.go

6. Pod Metadata

Your operator will run in a pod just like any other. With this in mind some good hygiene is important:

Pod Priority Class. In a resource-constrained cluster, what happens if your operator fails, or more pods get deployed? If this is a ‘core’ service in your cluster, you want it running — it’s more important than other pods. So set the priority.

Creating a super-high priority class for your operator pod
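A sketch of what that looks like (the class name and value are illustrative; user-defined classes can go up to a value of one billion):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: my-operator-critical   # illustrative name
value: 1000000                 # higher value = preferred for scheduling, evicted last
globalDefault: false
description: "Keeps the operator running when the cluster is under pressure."
```

Then reference it from the operator’s pod spec with priorityClassName: my-operator-critical.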

Resource Limits & Requests. Tell k8s what it needs to run your operator.

resources in manager.yaml
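Something like the following on the manager container; these numbers are illustrative, so profile your own operator under load before settling on them:

```yaml
resources:
  requests:            # what the scheduler reserves for the pod
    cpu: 100m
    memory: 128Mi
  limits:              # the ceiling before throttling / OOM-kill
    cpu: 500m
    memory: 256Mi
```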

Final Thoughts…

Don’t push these activities to the last minute just to get the project ‘over the line’. Adding decent tests and observability will actually help you accelerate your development, as paying the price upfront leads to clearer, simpler and safer development.