Wherein I explore the requirements of, and develop a repeatable process for, standing up a moderately opinionated, production-ready Docker cluster using the community standard Engine, Machine, Swarm and Compose. HashiCorp Consul will be used as the key-value store for Swarm as well as providing a common discovery mechanism across all nodes.

Bleeding Edge?

When overlay networking was released with Docker 1.9 and Swarm 1.0, I noticed a mini explosion of articles describing how to set up Swarm clusters to leverage this excellent new feature, e.g.

These articles were pretty good introductions to clustering with Docker and Swarm, but they were just that: introductions. I found myself searching for more information on the underlying key-value stores, on which Swarm has an external dependency via docker/libkv, and always came up short. What about high availability and resiliency? Does the KVS actually need to be external to the cluster? What are the most common best practices and assumptions one should emulate when standing up a KVS and/or Swarm cluster? How does one stand up clusters in a repeatable fashion?

Welp, here goes!

Guiding Principles

Highly available and fault-tolerant key-value store

Highly available Swarm masters

Fully functional overlay networking

Repeatable, automation-ready setup

All cluster and node services delivered as containers

Smarter-than-default logging

Memory accounting configured in kernel

Secure communication between Consul nodes

If you care to follow along I have made the sources available on GitHub.

Architectural Assumptions

Inspired by the single-node diagram at Docker’s Engine Overview page, our setup will look something like this:

A Docker Swarm cluster leveraging Consul

Consul will provide the KVS that undergirds our cluster. We will be running Consul agents on every node in the cluster, which means that we have a consistent, node-local address to provide to the Docker Engine/Swarm running on each node. We will also be leveraging Consul's DNS support to enable service discovery via SRV records served to clients within the cluster as well as to clients on the cluster edge.

As I have attempted to show in the diagram above, Consul is deployed as a common pillar across the entire stack. This means that service discovery is bootstrapped into the cluster from the ground up. How is this achieved? By pointing all containers, as well as the Docker daemon, to Consul for DNS resolution. Yes, you read that right: even the Docker daemon uses Consul to resolve, via multiple SRV records returned per query, its necessary key-value store. No need for an external KVS when it is fully baked into the cluster and addressable in the exact same way from every node within.
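To make this concrete: Consul serves service records under names of the form `<service>.service[.<datacenter>].<domain>`. The helper below is a hypothetical illustration (not part of the repository) that composes such a name, assuming the agents are configured with the custom domain we set via MACHINE_DOMAIN later in this article:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compose the DNS name under which Consul serves SRV
# records for a service. Consul's naming scheme is:
#   <service>.service[.<datacenter>].<domain>
consul_srv_name() {
  local service=$1 datacenter=$2 domain=$3
  echo "${service}.service.${datacenter}.${domain}"
}

# With the datacenter and domain used later in this article, the Docker
# daemon would resolve the Consul servers via:
consul_srv_name consul sfo1 example.com
```

A query such as `dig @127.0.0.1 -p 8600 consul.service.sfo1.example.com SRV` (8600 being Consul's default DNS port) should then return one SRV record per Consul server.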

Setup

The concepts presented here were initially developed with a series of Makefiles and utilized Docker 1.9, Compose 1.5, and Machine 0.5. I have since adapted them into much simpler Bash scripts which were tested with Docker 1.10, Compose 1.6, and Machine 0.6. Please install these versions (or newer). Please also create an API access token at Digital Ocean and add it to your environment, e.g.

export DIGITALOCEAN_ACCESS_TOKEN="my-super-cool-access-token-hash"

Additionally, you will need to set MACHINE_STORAGE_PATH to the directory that you cloned into. I have supplied a .bashrc that can be sourced to set this up for you:

export MACHINE_STORAGE_PATH="$( cd "$(dirname "${BASH_SOURCE:-$0}")" ; pwd -P )"

Provisioning Phase One

Before we can build an overlay-ready cluster with built-in service discovery, we need some nodes to commandeer via the Docker Machine generic driver. This first pass, in effect, pre-allocates all of the nodes, and hence IP addresses, that we will need to stand up our multi-master Consul/Swarm combo cluster via an automated script. Running the provided setup-machines.sh script will get you Bash source-able output that looks like:

declare -x MACHINE_BRIDGE_ADDRESS="172.17.0.1"
declare -x MACHINE_BRIDGE_INTERFACE="docker0"
declare -ax MACHINE_CLUSTER_ADDRESSES='(
[0]="10.134.13.209"
[1]="10.134.13.210"
[2]="10.134.13.212"
[3]="10.134.13.213"
[4]="10.134.13.214"
[5]="10.134.13.216"
[6]="10.134.4.80"
)'
declare -x MACHINE_CLUSTER_INTERFACE="eth1"
declare -x MACHINE_DATACENTER="sfo1"
declare -x MACHINE_DOMAIN="example.com"
declare -x MACHINE_DRIVER="digitalocean"
declare -ax MACHINE_NAMES='(
[0]="doccur1"
[1]="doccur2"
[2]="doccur3"
[3]="doccur4"
[4]="doccur5"
[5]="doccur6"
[6]="doccur7"
)'
declare -ax MACHINE_PUBLIC_ADDRESSES='(
[0]="104.236.169.124"
[1]="159.203.255.168"
[2]="107.170.254.146"
[3]="192.241.208.232"
[4]="159.203.224.172"
[5]="159.203.224.174"
[6]="104.236.151.245"
)'
declare -x MACHINE_SSH_USER="root"
declare -x MACHINE_STORAGE_PATH="/home/jacob/Projects/docker-swarm-consul"
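Because the output is plain `declare` statements, it can be saved and sourced directly into a shell (or consumed via `eval`, an assumed usage) and the arrays indexed as usual. A self-contained sketch, with the arrays declared inline so the snippet stands alone:

```shell
#!/usr/bin/env bash
# In practice you would source the captured output of setup-machines.sh;
# here a subset of the arrays is declared inline so the snippet stands alone.
declare -ax MACHINE_NAMES=([0]="doccur1" [1]="doccur2" [2]="doccur3")
declare -ax MACHINE_PUBLIC_ADDRESSES=(
  [0]="104.236.169.124" [1]="159.203.255.168" [2]="107.170.254.146"
)

# Pair each node name with its public address.
for i in "${!MACHINE_NAMES[@]}"; do
  echo "${MACHINE_NAMES[$i]} -> ${MACHINE_PUBLIC_ADDRESSES[$i]}"
done
```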

Provisioning Phase Two

The setup-swarm.sh script leverages the output of setup-machines.sh to commandeer the nodes set up therein. The meat of it is below:

# Executed once per node, with the first three assumed to be masters.
# The json-file log settings are only slightly smarter than the defaults;
# the --swarm-master and two --swarm-opt lines are included only when
# provisioning masters.
docker-machine ${MACHINE_OPTS} create --driver generic \
  --engine-opt "cluster-advertise ${MACHINE_CLUSTER_INTERFACE}:2376" \
  --engine-opt "cluster-store consul://${MACHINE_CLUSTER_CONSUL}" \
  --engine-opt "dns ${MACHINE_CLUSTER_PARTICIPANT_ADDRESS}" \
  --engine-opt "dns-search ${MACHINE_DOMAIN}" \
  --engine-opt "log-driver json-file" \
  --engine-opt "log-opt max-file=10" \
  --engine-opt "log-opt max-size=10m" \
  --generic-ip-address ${MACHINE_PUBLIC_ADDRESS} \
  --generic-ssh-key ${MACHINE_SSH_KEY} \
  --generic-ssh-user ${MACHINE_SSH_USER} \
  --swarm \
  --swarm-discovery "consul://${MACHINE_CLUSTER_CONSUL}" \
  --swarm-master \
  --swarm-opt replication \
  --swarm-opt advertise=${MACHINE_CLUSTER_PARTICIPANT_ADDRESS}:3376 \
  --tls-san ${MACHINE_NODE_NAME} \
  --tls-san ${MACHINE_CLUSTER_PARTICIPANT_ADDRESS} \
  --tls-san ${MACHINE_PUBLIC_ADDRESS} \
  ${MACHINE_NAME}
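The loop that drives the command above is elided; a minimal sketch of how the master/agent split might be decided, with `provision_node` standing in as a hypothetical wrapper around the docker-machine invocation:

```shell
#!/usr/bin/env bash
# Sketch: iterate the phase-one node list, treating the first three nodes
# as Swarm/Consul masters. provision_node is a hypothetical wrapper around
# the docker-machine create command shown above; here it just reports.
provision_node() {
  echo "provisioning $1 as $2"
}

MACHINE_NAMES=(doccur1 doccur2 doccur3 doccur4 doccur5 doccur6 doccur7)

for i in "${!MACHINE_NAMES[@]}"; do
  if (( i < 3 )); then
    MACHINE_TYPE=master
  else
    MACHINE_TYPE=agent
  fi
  provision_node "${MACHINE_NAMES[$i]}" "${MACHINE_TYPE}"
done
```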

But how can this work? I see an apparent reference to Consul, but it hasn't yet been installed! As I have chosen to use the tools provided by Docker, I am somewhat bound by their inadequacies (I am looking at you, Docker Machine).

Fortunately, the Docker daemon will happily retry connecting to the cluster-store, aka the KVS, every so often; this gives us time to underlay it via Docker Compose.

# upload the Consul configuration that will be bind-mounted
docker-machine ${MACHINE_OPTS} scp -r \
  ${MACHINE_STORAGE_PATH}/compose/consul/config \
  ${MACHINE_NAME}:/tmp/consul

# move the uploaded configuration into place
docker-machine ${MACHINE_OPTS} ssh ${MACHINE_NAME} \
  "sudo mv -vf /tmp/consul /etc"

# deploy the composition
eval $(docker-machine env ${MACHINE_NAME})
docker-compose -f \
  ${MACHINE_STORAGE_PATH}/compose/consul/${MACHINE_TYPE}.yml \
  up -d
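The referenced ${MACHINE_TYPE}.yml compositions are not reproduced here. As a rough sketch only — the image, flags, and paths below are assumptions, not the repository's actual file — a master.yml in the Compose v2 format of that era might resemble:

```yaml
version: '2'
services:
  consul:
    image: consul:0.6   # assumed image tag; the repository may pin another
    network_mode: host  # share the node's interfaces so Consul binds eth1
    restart: always
    volumes:
      - /etc/consul:/etc/consul        # the config uploaded via scp above
      - /var/lib/consul:/consul/data   # assumed data path
    command: >
      agent -server -bootstrap-expect 3
      -config-dir /etc/consul
```

Running Consul with host networking keeps its addresses identical from the node's perspective and from inside containers, which is what allows the node-local DNS story described earlier.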

At this point you might imagine that once the Consul masters achieve quorum and cluster, the Swarm is good to go. Not quite yet. The reason is that the Swarm masters default to the Google resolvers (because Machine gives no hooks for injecting the dns options into the Swarm master/agent containers during node "creation"), which prevents them from successfully looking up Consul from … Consul.

# replace /etc/resolv.conf
docker-machine ${MACHINE_OPTS} ssh ${MACHINE_NAME} \
  "sudo rm /etc/resolv.conf"
docker-machine ${MACHINE_OPTS} ssh ${MACHINE_NAME} \
  "echo 'nameserver ${MACHINE_CLUSTER_PARTICIPANT_ADDRESS}' \
  | sudo tee /etc/resolv.conf"

# restart the Docker daemon so as to restart all containers
docker-machine ${MACHINE_OPTS} ssh ${MACHINE_NAME} \
  "sudo systemctl restart docker || sudo service docker restart"

An alternate way to solve this would be to recreate the swarm master/agents via composition, making sure to pass the appropriate dns options to all containers. This is more attractive, and I plan to tackle it once I have an automated way to convert a run-time inspection of container(s), e.g. `docker inspect swarm-agent-master`, into a composition, aka a docker-compose.yml.

Within a minute or two of completing the provisioning for master nodes you should be able to point your Docker client to any of the Swarm masters:

$ eval $(docker-machine env --swarm doccur1); docker info
Containers: 17
 Running: 17
 Paused: 0
 Stopped: 0
Images: 14
Server Version: swarm/1.1.2
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 7
 doccur1: 104.236.169.124:2376
  └ Status: Healthy
  └ Containers: 3
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:31:58Z
 doccur2: 159.203.255.168:2376
  └ Status: Healthy
  └ Containers: 3
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:32:10Z
 doccur3: 107.170.254.146:2376
  └ Status: Healthy
  └ Containers: 3
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:31:41Z
 doccur4: 192.241.208.232:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:31:56Z
 doccur5: 159.203.224.172:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:32:01Z
 doccur6: 159.203.224.174:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:32:27Z
 doccur7: 104.236.151.245:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 513.4 MiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.0-27-generic, operatingsystem=Ubuntu 15.10, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-02T11:32:29Z
Plugins:
 Volume:
 Network:
Kernel Version: 4.2.0-27-generic
Operating System: linux
Architecture: amd64
CPUs: 7
Total Memory: 3.509 GiB
Name: doccur1

After provisioning the non-master nodes you should see them participating in the Swarm within about a minute, often faster.

Room to Grow

The cluster as presented works fairly well. I can reboot some or all of the nodes and it will re-cluster without any manual intervention, although sometimes with a good deal of patience while waiting for the various underlying services to perform their retries and sync up. Much like Michael Abrash, however, I am a big fan of computers performing their tasks near-instantaneously. How to speed this up? The biggest gains are likely to be had by eliminating the retries the Docker daemon performs while establishing connections, first to the KVS and then to the Swarm … which also waits on the KVS. Yeah, we have an order-of-operations issue.

Stupid Docker Tricks, Docker-in-Docker

So you can discover services in your daemon

The first feasible idea I had was to just use RancherOS and run Consul as a system service. This would make it run as a peer to the User Docker, as everything in RancherOS is a container, with PID 1 being the System Docker. The fine folks at Rancher Labs are pouring a lot of know-how into RancherOS, but it is not yet ready for production, nor is it on the small list of approved operating systems where I work. Why can't such a setup be approximated?

Caveats

There are a few well-understood issues with running Docker-in-Docker that result in either degraded performance and/or outright corruption of data. Jérôme Petazzoni put together a nice, informative article discussing the how-tos and why-fors. Provided there are no other pitfalls, this should be doable.

Conclusion

We have a first cut at a Docker cluster that is ready for some “real” work. There are some pain points during startup and recovery of nodes, but patience will get you past them. We also have an interesting proposal for eliminating said pain points.