Introducing the etcd Operator: Simplify etcd cluster configuration and management

By Hongchao Deng

Today, CoreOS introduced a new class of software in the Kubernetes community called an Operator. An Operator builds upon the basic Kubernetes resource and controller concepts but includes application domain knowledge to take care of common tasks. Operators reduce the complexity of running distributed systems and help you focus on the desired configuration, not the details of manual deployment and lifecycle management.

etcd is a distributed key-value store. In fact, etcd is the primary datastore of Kubernetes, storing and replicating all Kubernetes cluster state. Because etcd is such a critical component of a Kubernetes cluster, a reliable, automated approach to its configuration and management is imperative.

As a distributed consensus-based system, etcd can be complicated to configure as a cluster. Bootstrapping, maintaining quorum, reconfiguring cluster membership, creating backups, handling disaster recovery, and monitoring critical events are tedious tasks that require etcd-specific expertise.

Today we are introducing the etcd Operator and the Prometheus Operator to show how applications like these can be made easier to run on Kubernetes. In this post, we'll outline the importance of an Operator for etcd. Let's dive in.

The etcd Operator: The best way to manage etcd clusters

The etcd Operator installs with a single command and lets users create, configure, and manage etcd clusters through simple declarative configuration, hiding much of the complexity of running etcd.

The etcd Operator provides the following features:

Create/Destroy: Instead of specifying tedious configuration settings for each etcd member, users need only specify the size of the cluster.

Resize: Users need only modify the size in the spec, and the etcd Operator will take care of deploying, destroying, and/or reconfiguring cluster members, e.g. from 3 to 5, or from 5 to 3.

Backup: The etcd Operator performs backups automatically and transparently. Users need only specify the backup policy, for example, to back up every 30 minutes and keep the last 3 backups.

Upgrade: Upgrading etcd without downtime is a critical but difficult task. Doing it with the etcd Operator not only simplifies operations, but also avoids common upgrade pitfalls and errors.
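
To make the backup policy above concrete, here is a minimal Python sketch of "back up every 30 minutes and keep the last 3 backups". It is an illustration only, not the etcd Operator's code: BackupPolicy, is_backup_due, and prune_backups are hypothetical names, and backups are modeled as plain timestamps in minutes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupPolicy:
    """Hypothetical policy object; the field names are illustrative only."""
    interval_minutes: int = 30  # take a backup every 30 minutes
    max_backups: int = 3        # keep only the last 3 backups


def is_backup_due(policy: BackupPolicy, last_backup_min: float, now_min: float) -> bool:
    """Return True once a full interval has passed since the last backup."""
    return now_min - last_backup_min >= policy.interval_minutes


def prune_backups(policy: BackupPolicy, backup_times: List[float]) -> List[float]:
    """Keep the newest `max_backups` timestamps, ordered oldest first."""
    if policy.max_backups <= 0:
        return []
    return sorted(backup_times)[-policy.max_backups:]


policy = BackupPolicy()
backups = [0, 30, 60, 90, 120]           # minutes at which backups were taken
print(prune_backups(policy, backups))    # [60, 90, 120]
print(is_backup_due(policy, 120, 140))   # False: only 20 minutes have passed
print(is_backup_due(policy, 120, 150))   # True: a full interval has passed
```

The point of the sketch is that the user states only the two numbers in the policy; deciding when to snapshot and which old snapshots to drop is left to the automation.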

How it works

The etcd Operator simulates human operator behaviors in three steps: Observe, Analyze, and Act.

First, it observes the current cluster state by using the Kubernetes API. Second, it finds the differences between the desired state and the current state. Last, it fixes those differences through the etcd cluster management API, the Kubernetes API, or both.

etcd Operator logic loop in action

For example, let's say we have an etcd cluster of 3 members. Unfortunately, one member is down. The etcd Operator observes that the current cluster has 2 running pods. It diffs against the desired state, which should have 3 members. The Operator then acts to recover one member, by removing the dead one and adding a new one. Now the etcd cluster is back to a healthy state.
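
The recovery example above can be sketched as a single pass of an Observe, Analyze, Act loop. This is a simplified illustration under stated assumptions, not the Operator's actual logic: FakeCluster is a hypothetical in-memory stand-in for the Kubernetes and etcd APIs, and reconcile is an illustrative name.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes and etcd APIs."""
    members: Set[str]   # members registered in the etcd cluster
    running: Set[str]   # members whose pods are currently running
    next_id: int = 0    # counter used to name new members
    log: List[str] = field(default_factory=list)

    def remove_member(self, name: str) -> None:
        self.members.discard(name)
        self.running.discard(name)
        self.log.append(f"remove {name}")

    def add_member(self) -> str:
        name = f"etcd-cluster-{self.next_id:04d}"
        self.next_id += 1
        self.members.add(name)
        self.running.add(name)
        self.log.append(f"add {name}")
        return name


def reconcile(cluster: FakeCluster, desired_size: int) -> None:
    """Run one pass of the Observe, Analyze, Act loop."""
    # Observe: compare registered members with running pods.
    dead = sorted(cluster.members - cluster.running)

    # Analyze and Act: remove members whose pods are gone.
    for name in dead:
        cluster.remove_member(name)

    # Act: grow or shrink until the cluster matches the desired size.
    while len(cluster.members) < desired_size:
        cluster.add_member()
    while len(cluster.members) > desired_size:
        cluster.remove_member(max(cluster.members))


cluster = FakeCluster(
    members={"etcd-cluster-0000", "etcd-cluster-0001", "etcd-cluster-0002"},
    running={"etcd-cluster-0000", "etcd-cluster-0001"},  # one member is down
    next_id=3,
)
reconcile(cluster, desired_size=3)
print(cluster.log)  # ['remove etcd-cluster-0002', 'add etcd-cluster-0003']
```

Calling reconcile again with a different desired_size covers the resize case as well: the same loop adds or removes members until the observed size matches the spec.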

Testing the etcd Operator with "Chaos Monkey"

It is important to ensure the Operator is robust. We developed a tool similar to Netflix's Chaos Monkey that can kill pods randomly.

We use this tool to test the major Operator features: creating, recovering, and backing up etcd clusters. The Chaos Monkey is enabled in our continuous soak testing. It stresses the Operator by killing random etcd pods, so that we can see how the Operator reacts in real time.
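
The idea behind this kind of test can be sketched in a few lines of Python. This is a toy model, not the tool described above: kill_random_pods and recover are hypothetical names, pods are plain strings in a set, and recover merely stands in for the Operator restoring the cluster size.

```python
import random
from typing import List, Set


def kill_random_pods(running: Set[str], count: int, rng: random.Random) -> List[str]:
    """Remove up to `count` randomly chosen pods from `running`; return the victims."""
    victims = rng.sample(sorted(running), min(count, len(running)))
    running.difference_update(victims)
    return victims


def recover(running: Set[str], desired_size: int, next_id: int) -> int:
    """Stand-in for the Operator: add replacement pods until the size is restored."""
    while len(running) < desired_size:
        running.add(f"etcd-cluster-{next_id:04d}")
        next_id += 1
    return next_id


rng = random.Random(42)  # seeded so that a failing run can be replayed
pods = {f"etcd-cluster-{i:04d}" for i in range(3)}
next_id = 3
for round_no in range(5):
    victims = kill_random_pods(pods, count=1, rng=rng)
    next_id = recover(pods, desired_size=3, next_id=next_id)
    # The invariant under test: the cluster always returns to its desired size.
    assert len(pods) == 3, "cluster did not return to the desired size"
    print(f"round {round_no}: killed {victims}, running {sorted(pods)}")
```

Killing one pod per round in a 3-member cluster leaves a majority (2 of 3) alive, so the sketch exercises the recovery path described earlier rather than a full disaster-recovery scenario.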

Try it out

Deploy the etcd Operator

Deploying the etcd Operator is simple on any Kubernetes cluster. There is an example deployment manifest in the etcd Operator source repo:

$ kubectl create -f https://coreos.com/operators/etcd/latest/deployment.yaml

This command creates an etcd Operator deployment on the Kubernetes cluster. The etcd Operator is now ready to manage etcd clusters.

Create a new etcd cluster with the Operator

Now, we'll create a 3-member etcd cluster with backup support. (Note that backup only works if your Kubernetes cluster supports Persistent Volumes.) Once again, we'll use an example manifest from the etcd Operator repo:

$ kubectl create -f https://coreos.com/operators/etcd/latest/example-etcd-cluster.yaml
$ kubectl get pods
NAME                             READY     STATUS    RESTARTS   AGE
etcd-cluster-0000                1/1       Running   0          23s
etcd-cluster-0001                1/1       Running   0          16s
etcd-cluster-0002                1/1       Running   0          8s
etcd-cluster-backup-tool-rhygq   1/1       Running   0          18s

Interested? To experiment with more examples and explore more features, check out the etcd Operator documentation.

The etcd Operator is under active development. A lot of exciting features are planned and being developed. We'd love to see your feedback and contributions!

Join CoreOS at KubeCon

We're hosting a number of events at the Kubernetes conference, KubeCon in Seattle, November 8 and 9, 2016. Watch a keynote with Brandon Philips for more details on the etcd Operator on Wednesday, November 9 at 3:50 p.m. PT. Check out the full schedule of CoreOS KubeCon events, stop by and visit our engineers at the CoreOS booth with your Kubernetes and container questions, or request an on-site sales meeting with a specialist.