Mesosphere DC/OS is a data center operating system based on Apache Mesos and Marathon. It’s designed to run tasks and containers on a distributed architecture, and it can be provisioned on bare metal machines, within virtual machines, or on a hosting provider (what some people like to call “the cloud”). I wanted to see what was involved in setting up my own DC/OS instance, both locally and with a provider, for running some of my own projects in containers. I wanted to keep this cluster as low cost as possible, and I ran into some issues with the Terraform installation in the DC/OS documentation. The following is a brief look at setting up a minimal DC/OS cluster on Digital Ocean.

Provisioning

For one of my projects, I created vSense, a devops provisioning system built around Vagrant and Ansible. It’s used for creating both development and production environments for BigSense, an open source sensor network system. Vagrant boxes can vary between providers, meaning the scripts need to be adjusted to handle differences between VirtualBox images for development and KVM base boxes for production. Thankfully, DC/OS has an official Vagrant project and supports deploying to hosted providers using a Terraform script.

The following can be used to bring up a local four node cluster (boot, manager, private agent and public agent) using local VirtualBox VMs:

git clone https://github.com/dcos/dcos-vagrant
cd dcos-vagrant
vagrant plugin install vagrant-hostmanager
vagrant up m1 a1 p1 boot

DC/OS Nodes Running in Vagrant

DC/OS provides documentation for installing nodes on several hosted platforms as well. The following is taken from their documentation for using Digital Ocean as a provider:

git clone https://github.com/jmarhee/digitalocean-dcos-terraform
cd digitalocean-dcos-terraform
cp sample.terraform.tfvars terraform.tfvars
# adjust your settings and API token
eval $EDITOR terraform.tfvars
terraform apply

DC/OS Nodes Running on Digital Ocean Droplets

It’s important to note that DC/OS, despite its name, is not really an operating system. It simply installs Docker and other packages to bootstrap itself on another Linux distribution. When using the Vagrant/VirtualBox installation above, it uses CentOS 7 for its individual virtual machines. Curiously for Digital Ocean, it installs itself onto CoreOS virtual machines.

Authentication

If you start with a fresh install of DC/OS and connect to the master node via HTTP, you’ll get an authentication page allowing the first account to become the administration account. By default, you cannot create this account directly. You are required to use one of the three default identity providers: Github, Microsoft or Google. The DC/OS community edition has no built-in authentication system. In order to integrate with LDAP, Active Directory or another identity provider, you must purchase the enterprise edition. The community edition allows you to override the default configuration, but it only supports OAuth providers and only provides documentation for using the non-free service Auth0.

DC/OS Initial Login Screen
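If you’d rather skip the OAuth flow entirely, the configuration override mentioned above can be applied before installation. The following is a sketch, not the documented procedure: it assumes the genconf-style config.yaml that the DC/OS installer reads, and the oauth_enabled flag should be verified against the configuration reference for your DC/OS version.

```shell
# Hypothetical override: disable the default OAuth login by adding
# oauth_enabled to the installer's config.yaml before running genconf
mkdir -p genconf
cat >> genconf/config.yaml << 'EOF'
oauth_enabled: 'false'
EOF
```

With OAuth disabled, the community edition simply stops asking for a login, so this trades the external identity provider for no authentication at all; you’d want to keep the master behind a firewall or VPN.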

I really hesitated here. I rarely ever use external authentication, opting for strong passwords with e-mail based registration instead. I considered figuring out how to override the default, but then caved to my impatience and authenticated via Github. This was a bad idea. I started getting unsolicited SPAM from Mesosphere at the e-mail address associated with my Github account.

I also started getting SPAM for a secondary account I created within DC/OS.

Furthermore, the e-mail for the new user I manually created didn’t come from a locally running mail server that was part of DC/OS. It was relayed via a completely different third party:

From: DC/OS <help@dcos.io>
Subject: You've been added to a DC/OS cluster
Received: from [54.163.223.191] by mandrillapp.com id dafa457b3e374123b427c283824bfa0f; Sat, 26 Nov 2016 06:58:25 +0000
X-SWU-RECEIPT-ID: log_aad8fa045b93454cca9d5a9ccabc3504-3
Reply-To: <help@dcos.io>
To: <--->

Also, by default, DC/OS has telemetry enabled. If you’re using the Terraform script for installation, it can be disabled by adding telemetry_enabled: 'false' to the make-files.sh script, in the section where it creates the config.yml. I highly recommend disabling telemetry before starting up a cluster, even locally with Vagrant.
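In make-files.sh, the change amounts to one extra line in the generated config. A sketch of what the result should contain; the surrounding keys here are illustrative, not the script’s exact contents:

```shell
# Illustrative fragment: the generated config.yml with telemetry disabled.
# Only the telemetry_enabled line is the actual required change.
cat > config.yml << 'EOF'
---
cluster_name: 'dcos-digitalocean'
telemetry_enabled: 'false'
EOF
```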

The SPAM didn’t start arriving until a couple of days after I experimented with DC/OS. However, it still bothers me that the official DC/OS provisioning tools enable telemetry by default. It’s not as bad as Alfresco, where the tracking is hard coded and difficult to remove, but it is unnecessary and is most likely used for marketing purposes.

Minimal Cost

As I’ve mentioned, the minimum number of DC/OS nodes required by default is four. From talking to other DC/OS administrators, I’ve found that it’s not necessary to separate out public and private nodes. If you run with only public nodes, your minimum drops to three VMs. By default, the Terraform script mentioned above provisions all of its nodes as 4gb droplets, which currently run $40 USD/month each on Digital Ocean.

If you’re a startup with funding, that isn’t an unreasonable price, even when you start scaling up for redundancy. However, if you’re a small shop trying to get off the ground with limited funding, or if, like me, you just want to host your personal projects cheaply, this can seem prohibitively expensive. The smallest size that Digital Ocean offers is a 512mb instance for $5 USD/month, which seems like it’d be more than adequate for the boot node.

Unfortunately, the management node must be at least a 1gb instance. Anything less leads to an unstable master. As we’ll see below, we can enable swap space on these nodes, but the master’s processes are heavy enough that they will cause thrashing and lockups on anything less than 1gb of physical memory.
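The arithmetic for a minimal layout is straightforward. A quick sanity check, assuming Digital Ocean’s pricing at the time: $5/month for 512mb as mentioned above, and $10/month for 1gb, which is my assumption from their then-current tiers:

```shell
# boot (512mb) + management (1gb) + one public agent (1gb), in USD/month
boot=5; management=10; public=10
monthly=$((boot + management + public))
yearly=$((monthly * 12))
echo "minimal cluster: \$${monthly}/month, \$${yearly}/year"
# prints: minimal cluster: $25/month, $300/year
```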

Boot    Management   Public    Price Per Month   Price Per Year
4gb     4gb          4gb       $120              $1440
512mb   1gb          2gb       $35               $420
512mb   1gb          2gb x2    $55               $660
512mb   1gb          1gb       $25               $300
512mb   1gb          1gb x2    $30               $360

Keep in mind that by not creating any private nodes, you are trading off the security offered by having non-public-facing containers (such as load balancers or web servers) running on nodes only connected to a private network. This is also a minimal, non-redundant solution. Redundancy requires either 3 or 5 master nodes, as well as additional agent nodes.

Startup Issues

I wanted to use the smallest images possible to save on hosting costs. Unfortunately, both master and agent nodes refuse to start on anything smaller than 2gb images. If you have failures, you can SSH into the individual nodes using your SSH key, the IP address from the Digital Ocean web interface, and the user core like so:

ssh -i do-key -lcore <node_ip>

The failures seem to occur during the bootstrapping process, in dcos-download.service:

journalctl -u dcos-download.service
-- Logs begin at Thu 2016-12-08 06:37:51 UTC, end at Thu 2016-12-08 07:31:18 UTC. --
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 systemd[1]: Starting Pkgpanda: Download DC/OS to this host....
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: *   Trying 104.131.142.20...
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: * TCP_NODELAY set
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: * Connected to 104.131.142.20 (104.131.142.20) port 4040 (#0)
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > GET /bootstrap/e73ba2b1cd17795e4dcb3d6647d11a29b9c35084.bootstrap.tar.xz HTTP/1.
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > Host: 104.131.142.20:4040
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > User-Agent: curl/7.50.2
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > Accept: */*
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: >
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < HTTP/1.1 200 OK
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Server: nginx/1.11.6
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Date: Thu, 08 Dec 2016 06:38:21 GMT
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Content-Type: application/octet-stream
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Content-Length: 581561548
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Last-Modified: Thu, 08 Dec 2016 06:37:03 GMT
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Connection: keep-alive
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < ETag: "5848ff8f-22a9eccc"
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Accept-Ranges: bytes
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: <
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: { [13032 bytes data]
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: * Failed writing body (456 != 16384)
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: * Curl_http_done: called premature == 1
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: * Closing connection 0
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: curl: (23) Failed writing body (456 != 16384)
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: dcos-download.service: Control process exited, code=exited status=23
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: Failed to start Pkgpanda: Download DC/OS to this host..
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: dcos-download.service: Unit entered failed state.
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: dcos-download.service: Failed with result 'exit-code'.

If I tried to download this file manually within the node, I could retrieve it successfully. The file is over 500MB in size, and even the smallest node option of 512mb (memory) has 20GB of disk space. Then I looked at the individual partition tables:

1GB Image:

$ df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
devtmpfs        483M     0   483M    0%  /dev
tmpfs           499M     0   499M    0%  /dev/shm
tmpfs           499M  324K   499M    1%  /run
tmpfs           499M     0   499M    0%  /sys/fs/cgroup
/dev/vda9        27G  579M    26G    3%  /
/dev/vda3       985M  588M   347M   63%  /usr
tmpfs           499M  499M   4.0K  100%  /tmp
/dev/vda1       128M   39M    90M   30%  /boot
tmpfs           499M     0   499M    0%  /media
/dev/vda6       108M   64K    99M    1%  /usr/share/oem
tmpfs           100M     0   100M    0%  /run/user/500

2GB Image:

$ df -h
Filesystem      Size   Used  Avail  Use%  Mounted on
devtmpfs        987M      0   987M    0%  /dev
tmpfs          1003M      0  1003M    0%  /dev/shm
tmpfs          1003M   428K  1003M    1%  /run
tmpfs          1003M      0  1003M    0%  /sys/fs/cgroup
/dev/vda9        37G   1.9G    34G    6%  /
/dev/vda3       985M   588M   347M   63%  /usr
tmpfs          1003M   320K  1003M    1%  /tmp
tmpfs          1003M      0  1003M    0%  /media
/dev/vda1       128M    39M    90M   30%  /boot
/dev/vda6       108M    64K    99M    1%  /usr/share/oem
tmpfs           201M      0   201M    0%  /run/user/500
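The listings above can be reduced to a quick check before kicking off an install. A sketch using GNU df, which CoreOS ships; the ~600MB threshold is simply headroom over the 581561548-byte bootstrap tarball seen in the log:

```shell
# Warn early if /tmp can't hold the DC/OS bootstrap download
need_kb=$((600 * 1024))
avail_kb=$(df --output=avail -k /tmp | tail -n1 | tr -d ' ')
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "/tmp too small: ${avail_kb}KB free, need ${need_kb}KB" >&2
fi
```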

The installation service is using the /tmp partition, which is obviously too small to hold the downloaded bootstrap image. By default, tmpfs allocates half the size of available memory to its filesystem. The easy solution is to modify the section of make-files.sh that creates the do-install.sh script, to ensure we have enough room on /tmp prior to installation. The Digital Ocean instances also don’t come with any swap, so we should create some to ensure we don’t run into errors from running out of memory.

...
cat > do-install.sh << FIN
#!/usr/bin/env bash
mkdir /tmp/dcos && cd /tmp/dcos
# resize the tmpfs to ensure there's space for the dcos install
sudo mount -t tmpfs -o remount,size=1G /tmp
# setup swap
if [ ! -f /swapfile ]; then
  sudo fallocate -l 2G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
fi
printf "Waiting for installer to appear at Bootstrap URL"
...

We’re not making the /tmp changes permanent by modifying /etc/fstab, so rebooting an instance will set the /tmp allocation back to normal, as well as clear out the installation files.
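If you did want a larger /tmp to survive reboots, editing the fstab isn’t the idiomatic route on CoreOS, where /tmp is mounted by systemd’s tmp.mount unit; a drop-in override is. The following is a sketch based on my assumption about that mechanism, to be run on the node itself; note that Options= replaces the unit’s defaults entirely, so the standard mode and strictatime options are repeated:

```shell
# Override tmp.mount so /tmp is permanently allocated 1G
sudo mkdir -p /etc/systemd/system/tmp.mount.d
cat << 'EOF' | sudo tee /etc/systemd/system/tmp.mount.d/size.conf
[Mount]
Options=mode=1777,strictatime,size=1G
EOF
sudo systemctl daemon-reload
```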

IPv6

It is 2016, and a sustainable Internet means we need to start using IPv6. By default, the DC/OS Terraform scripts do not enable it. Adding the following setting to dcos.tf allows the public nodes to have IPv6 addresses. You may want to add it to the other node types as well if you wish to have them accessible via IPv6.

...
resource "digitalocean_droplet" "dcos_public_agent" {
  name       = "${format("${var.dcos_cluster_name}-public-agent-%02d", count.index)}"
  ipv6       = "true"
  depends_on = ["digitalocean_droplet.dcos_bootstrap"]
...

Thoughts on DC/OS

This tutorial only covered the installation of DC/OS. We have barely touched the surface, and haven’t discussed running application containers, using marathon-lb for load balancing, volume management, or security and firewall settings for individual nodes. None of these tasks is trivial, and each deserves a tutorial of its own.

Also, we only looked at Digital Ocean, but DC/OS does have official documentation for deployments on AWS, Azure, GCE and Packet. I’d recommend comparing them to see which offers the lowest service cost.

I’ve seen DC/OS deployed in the wild in full production environments. Its ability to schedule and manage tasks is very powerful, but that power comes at the cost of a dedicated support and development team. If you’re a startup with strong development and operations engineers, setting up some kind of task or container orchestration, whether it’s DC/OS or something else, can help ease the pain of scaling out later. For smaller side projects, DC/OS seems prohibitive in both time and service costs.