The guide I wish I had.

TL;DR — It works. GitHub repo.

As a DevOps consultant, I often see companies working with outdated architectures and long-forgotten “best practices” that no longer apply today.

This is the story of one of those companies. They had a static cluster of RabbitMQ servers deployed by some old Terraform code.

Where is the state for this Terraform code? Probably on the hard drive of some old, discarded Mac belonging to a developer who left 3 years ago, collecting dust somewhere in the office storage room.

So basically: four AWS EC2 instances alone in the wilderness (one of them the designated master node), talking to each other with no alerts or monitoring.

This RabbitMQ setup worked pretty well for almost 3 years. The “bad craftsmanship” went unnoticed, as it was “only” the staging environment. To be honest, the chances of something going wrong were slim.

After 3 years, their luck ran out and the cluster began malfunctioning, something I would describe as hiccups.

Before I continue to explain these hiccups, you need to know that this company had grown very fast in the last year or so. The business side was doing great, and lots and lots of traffic came through their servers. With it came a lot of money, new developers, a drastic shift to a micro-services architecture, and lots of new features. Pretty great for the company; not so good for the poor, unscalable, unmanaged RabbitMQ cluster.

The hiccups started as unexplained behavior across the entire system: HTTP requests just dropped randomly whenever RabbitMQ was overloaded. Sometimes it happened at the beginning of the process, sometimes in the middle, and sometimes everything was fine.

It took them some time to figure out that the Rabbit cluster was the issue; one of the new developers finally tracked it down. The cluster was simply too small for their current operation and needed 1 or 2 extra nodes. That was my starting point, which was not that bad. After some debate on how to approach the issue, I decided to start from scratch and redeploy the cluster with auto-scaling, self-healing, a proper log pipeline, and monitoring.

I chose Elastic Beanstalk for this venture, as about half of the company’s code already ran on Beanstalk: we had ready-made YAMLs for deployment and a pretty cool CI/CD pipeline that had been working great for a long time. Oh yeah, and Docker. When thinking about this cluster, I wouldn’t even consider not using Docker; it was apparent to me that I had to Dockerize Rabbit. I noticed there is an official Docker image created by the RabbitMQ people. Easy.

Well; not that easy.

The official RabbitMQ Docker image is very good. It covers a lot of options via Docker environment variables. I used this config file (the cluster_formation lines at the end are the important part):

management.load_definitions = /etc/rabbitmq/definitions.json
loopback_users.guest = false
listeners.tcp.default = 5672
hipe_compile = false
management.listener.port = 15672
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_aws
cluster_formation.aws.use_autoscaling_group = true
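The first line, management.load_definitions, loads users, vhosts, queues, and policies from a JSON file baked into the image. The company’s actual definitions are not shown here; as a rough sketch of the format (names and the password are placeholders — in practice you would export this file from the Management UI rather than write it by hand):

```shell
# Minimal definitions.json sketch; real files come from the management
# plugin's "export definitions" feature and will contain more fields.
cat > definitions.json <<'EOF'
{
  "users": [
    {"name": "admin", "password": "change-me", "tags": "administrator"}
  ],
  "vhosts": [{"name": "/"}],
  "permissions": [
    {"user": "admin", "vhost": "/", "configure": ".*", "write": ".*", "read": ".*"}
  ],
  "queues": [],
  "exchanges": [],
  "bindings": []
}
EOF
# Sanity-check that the file is valid JSON before baking it into the image
python3 -m json.tool definitions.json > /dev/null && echo "definitions.json OK"
```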

RabbitMQ, as of version 3.7.0, ships with lots of built-in plugins, one of which is rabbit_peer_discovery_aws, a discovery backend plugin for AWS. Basically, the plugin, if enabled and configured correctly, searches through the EC2 instances in your account, in a specified region, for signs of other RabbitMQ nodes. You can configure it to look for certain tags, such as:
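One detail worth calling out: the plugin ships with RabbitMQ but is not enabled by default. When building a custom image on top of the official one, one way to enable it is the enabled_plugins file (the plugin list below is a sketch; I assume you also want the management plugin):

```shell
# Write an enabled_plugins file (Erlang-term syntax, note the trailing dot).
# In a Dockerfile you would COPY this to /etc/rabbitmq/enabled_plugins,
# or alternatively RUN: rabbitmq-plugins enable --offline rabbitmq_peer_discovery_aws
cat > enabled_plugins <<'EOF'
[rabbitmq_management,rabbitmq_peer_discovery_aws].
EOF
cat enabled_plugins
```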

cluster_formation.aws.instance_tags.region = us-east-1
cluster_formation.aws.instance_tags.service = rabbitmq
cluster_formation.aws.instance_tags.environment = staging

In my case, I used cluster_formation.aws.use_autoscaling_group = true, which means each node looks up its own EC2 auto-scaling group and searches for other nodes within that group. A much simpler solution, in my opinion, and one that fits my needs.
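For that lookup to work, the node has to be allowed to ask AWS about its auto-scaling group and the instances in it, so the Beanstalk instance profile needs a few read-only permissions. A minimal policy sketch (this is my reconstruction from the plugin’s documented requirements, not the company’s actual policy):

```shell
# IAM policy sketch for the AWS peer discovery plugin: describe the node's
# auto-scaling group and the EC2 instances in it. Attach to the instance profile.
cat > rabbitmq-discovery-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeAutoScalingGroups",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
EOF
# Validate the JSON before attaching it
python3 -m json.tool rabbitmq-discovery-policy.json > /dev/null && echo "policy OK"
```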

Here is a quick breakdown of what is supposed to happen. An Elastic Beanstalk node starts up with the RabbitMQ Docker image in it; Rabbit looks in its own EC2 auto-scaling group for other rabbits; if it is the only rabbit (the first), it creates a RabbitMQ cluster and proclaims itself the leader (master). Then another node spins up (per the auto-scaling rules), looks for other rabbits, and finds the master rabbit. It registers with the master using the hostname from within its Docker container; that is, the slave rabbit tells the master rabbit how to find it by handing over its own hostname.

And at that point, there is a problem: the container’s internal hostname is just a random string assigned by the Docker daemon, and it means nothing for discovery. This is a big problem, and it took me too long to figure out. Possible solutions I thought of:

1. Get a functional hostname (the AWS private IP) into the container as an environment variable at runtime (the only time I actually know the IP address), then override Rabbit’s own hostname with the one passed in from outside the container. This meant messing around with the Elastic Beanstalk startup scripts to make them hand the host IP to the Docker container. Working with those Beanstalk scripts was so frustrating that I moved on to option 2.

2. Change the way RabbitMQ calculates its hostname, replacing it with the much-appreciated AWS EC2 metadata API:

curl -s http://169.254.169.254/latest/meta-data/local-hostname

This API call, made from inside an EC2 instance, returns the instance’s local hostname. Very, very handy in my situation. All I needed to do now was dive into the Rabbit code and find where to stick this API call.
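To see what the swap buys us at the string level, here is a small sketch. The hostname below is a made-up example value; on a real instance it would come from the metadata call above (the 169.254.169.254 endpoint only answers from inside EC2):

```shell
# Stand-in for: curl -s http://169.254.169.254/latest/meta-data/local-hostname
local_hostname="ip-10-0-1-23.ec2.internal"   # hypothetical example value

# RabbitMQ derives its Erlang node name as rabbit@<hostname>, so overriding
# the hostname with the EC2-internal DNS name yields a node name that other
# instances in the auto-scaling group can actually resolve and reach:
echo "rabbit@${local_hostname}"
```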

Luckily, the people who wrote RabbitMQ made my job much easier. The configuration docs state that the HOSTNAME environment variable defaults to the output of this simple command:

env hostname #for linux/unix machines

Also, there is a configuration file named rabbitmq-env.conf where you can define environment variables. Amazing. My rabbitmq-env.conf looked like this:

HOSTNAME=`curl -s http://169.254.169.254/latest/meta-data/local-hostname`
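Written out as the file that gets baked into the image, it would look like the sketch below. The USE_LONGNAME line is my own assumption, not part of the original setup: EC2-internal hostnames are fully qualified, and Erlang distinguishes short from long node names, so it may be needed; drop it if your setup works without it.

```shell
# Sketch of a rabbitmq-env.conf for the image. The file is sourced as shell
# at startup, so the backticks run the metadata call on the instance itself.
cat > rabbitmq-env.conf <<'EOF'
# Override the hostname with the EC2-internal DNS name from instance metadata
HOSTNAME=`curl -s http://169.254.169.254/latest/meta-data/local-hostname`
# Assumption: fully-qualified hostnames typically require long Erlang node names
USE_LONGNAME=true
EOF
cat rabbitmq-env.conf
```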

That was the final piece of the puzzle! With everything put together, I had a functioning Docker image ready for deployment.

GitHub repo for the whole project

In the end, the company could focus on developing the things that matter, instead of worrying about the scale or health of their RabbitMQ cluster.