We’ve all been there: your production system is running great and uptime has never been better. One Wednesday morning you wake up, enjoy a nice breakfast, and head off to work, when a few dreaded notifications from your phone tell you that production has just gone down. You buckle the racing strap and hit the nitrous. When you finally get to the office, you see that the load on one of your core production systems is way too high. You quickly fix the issue by adding more CPUs to the system.

During the post-mortem you discuss what caused the instability, but you can’t figure out how or why the number of CPUs on the production instance was so low. Sheepishly, someone admits that they were just trying to save some money, so they changed the production system to use a cheaper EC2 instance type.

Cloud computing for the win!

Cloud computing has made it simple to change your compute environment with a few clicks of a button. The downside is that details can get lost when you have lots of people making changes. This is where “Infrastructure as Code” comes into play. Infrastructure as Code (IAC) is the idea that your infrastructure configuration should be maintained in revision control, the same way that source code is.

Inconsistencies are inconsistent.

For several years it was easy to keep the production environment stable here at Lucid because there were only a few individuals making changes to the environment. But as our team grew, instabilities became more frequent. We decided it was time to automate the configuration of our compute environment and adopt the IAC model.

Implementing IAC (Infrastructure as Code)

We first looked into CloudFormation. CloudFormation is Amazon’s tool for automatically configuring an AWS environment. We quickly realized a few major issues with it:

- VERY verbose. To stand up a single instance you need around 300 lines of JSON . . . OUCH!
- Managing existing resources is very problematic. We wanted to keep our existing scale groups, ELBs, static instances, EBS volumes, security groups, etc., and also manage new resources.
- Resource limits. CloudFormation limits you to 200 resources per template. You can use “nesting” to increase the resources that you specify in a single template, but that leads into the next issue.
- CloudFormation is destructive. Let’s say that you have an EC2 instance and you want to attach an additional security group to it. CloudFormation would do this by terminating the EC2 instance and then creating a new one, rather than simply attaching the security group. And the biggest issue with “nesting” is that destructive actions cascade. VPC change, anyone? No thanks!

We also looked into several other services, but found that they use CloudFormation under the covers, which means they share the same limitations and issues that CloudFormation has.

After several weeks of investigation we came up with a list of priorities:

- Consistent config with revision history. We wanted to be able to see, for example, how many instances were in a scale group two months ago.
- Change notifications. We wanted to be notified of every change, including who made it and why.
- Synchronization status. We wanted to be able to quickly identify whether our production environment is ever out of sync with our IAC.
- Safe changes. We never wanted a tool to automatically take a “destructive” action; we would rather manually terminate an EC2 instance, delete EBS volumes, delete Route 53 records, etc. than have something do those actions for us.
- Efficiency. We wanted to be able to quickly find “lost” instances and other unused resources to save money.

In walks Cumulus.

We looked for a solution to meet our requirements. Every solution that we looked at either used CloudFormation under the covers or was not yet mature in development. We made the decision to create an open source tool called Cumulus which uses the AWS SDKs directly and always notifies rather than taking a destructive action.

The Cumulus command line tool allows us to have very short, templatable configurations for each resource. It also has no resource limits, so you can define everything about your AWS account in a single location. At Lucid Software we keep our Cumulus config in a Git repo. When someone makes a change, they push it to the repo and run a Cumulus diff, which compares the local configuration against the AWS configuration and highlights any differences. If the changes look acceptable, that individual runs a Cumulus sync. Cumulus commands can operate on an entire module, like EC2 or IAM, or on an individual resource within that module, such as an EC2 instance or an IAM role.

We also have an hourly job that runs a complete Cumulus diff for all modules across all of our AWS accounts. This lets us see whether the configuration in the Git repo is out of sync with the actual AWS configuration. We can also use Git hooks to send notifications any time someone changes the Cumulus configuration.
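A minimal sketch of such an hourly check, run from cron. The repo path, the list of modules, and the `cumulus <module> diff` invocation are assumptions; adjust them to match your checkout and your Cumulus version:

```shell
#!/bin/sh
# Hourly sanity check: diff every Cumulus module against what is
# actually configured in AWS, and flag anything out of sync.
# Path and module names below are placeholders, not from the article.
cd /path/to/cumulus-config-repo || exit 1
git pull --quiet

for module in ec2 security iam route53; do
  # A non-empty/failing diff means the repo and AWS disagree;
  # hook your notification mechanism of choice in here.
  cumulus "$module" diff || echo "Module $module is out of sync!"
done
```

Wiring the same repo’s Git hooks to a chat or email notifier covers the “who changed what, and why” requirement from the priority list above.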

The Cumulus documentation is remarkably helpful even though it is an open source project 🙂

While there are many examples in the documentation, below are two common use cases for Cumulus that an organization of any size can appreciate.

Configuring Security Groups

Security groups are an important AWS resource that lets you control which AWS objects are allowed to communicate with each other and with the outside world. For any security-minded AWS customer, managing security groups can be painful when becoming PCI compliant or when adding new components to your setup. Cumulus makes configuring security groups much cleaner, with reusable rules and human-readable configuration.

Let’s say we want to set up two security groups called “foo” and “bar”. Both should allow inbound SSH from the subnet “10.1.0.0/16” and from the security group called “default”, and both should allow all outbound traffic. In addition, foo should allow ports 9000-9009 and bar should allow ports 9010-9019, each from the IP “72.22.22.22”. Doing this in the AWS console is not necessarily hard, but the two groups’ rules overlap, and updating the shared rules for both groups quickly becomes tedious.

In Cumulus we can handle this with a few configuration files. The first one is the configuration that has rules common to both groups. Let’s call it “ssh-access”:

{
  "inbound": [
    {
      "security-groups": ["default"],
      "protocol": "tcp",
      "ports": [22],
      "subnets": ["10.1.0.0/16"]
    }
  ]
}

That is fairly succinct, but we can make it a little better. Cumulus lets you define a subnets file that assigns names to groups of subnets so you can reference them in config in a readable way. We do this by creating a subnets.json file that gives names to “10.1.0.0/16”, “72.22.22.22”, and the special “0.0.0.0/0” subnet.

{
  "all": ["0.0.0.0/0"],
  "vpn": ["10.1.0.0/16"],
  "service-1": ["72.22.22.22/32"]
}

With that done, configuring foo and bar becomes very simple:

{
  "description": "foo",
  "tags": {},
  "rules": {
    "includes": ["ssh-access"],
    "inbound": [
      {
        "protocol": "tcp",
        "ports": ["9000-9009"],
        "subnets": ["service-1"]
      }
    ],
    "outbound": [
      {
        "protocol": "all",
        "subnets": ["all"]
      }
    ]
  }
}

{
  "description": "bar",
  "tags": {},
  "rules": {
    "includes": ["ssh-access"],
    "inbound": [
      {
        "protocol": "tcp",
        "ports": ["9010-9019"],
        "subnets": ["service-1"]
      }
    ],
    "outbound": [
      {
        "protocol": "all",
        "subnets": ["all"]
      }
    ]
  }
}

Syncing both groups is as easy as running a single command:
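Something along these lines, assuming the groups live in Cumulus’s security-group module; the exact module and subcommand names are assumptions, so check `cumulus help` for your version:

```shell
# One sync of the security module pushes both "foo" and "bar"
# (and any other changed security groups) to AWS.
cumulus security sync
```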

Say I messed up, and the vpn subnet should actually be “10.2.0.0/16”. Making this change is as easy as editing that single value in subnets.json, checking the diff to make sure the change is what I want, and running another sync:
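A sketch of that review-then-apply loop (again assuming a security module with diff and sync subcommands, which is not spelled out in this post):

```shell
# Review what the one-line edit to subnets.json will change in AWS...
cumulus security diff

# ...and once the diff shows only the intended rule updates, apply it:
cumulus security sync
```

Because both foo and bar include the shared “ssh-access” rules, the single edit propagates to both groups in one sync.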

Launching an EC2 Instance

Launching an EC2 instance in the AWS console is a 7-step process with many configuration options. Launching several EC2 instances with the same config requires a lot of concentration and back-and-forth to make sure the options match. And if you want to attach EBS volumes to an instance, don’t forget to create them individually and manually attach them to the instance afterwards.

With Cumulus, launching an EC2 instance is much simpler. Let’s try creating an Amazon Linux instance that has 4 x 100GB EBS volumes. First, we define the configuration of the EBS volumes. In Cumulus, EBS volumes are defined in what we call volume groups. A volume group is simply a set of EBS volumes that all have the “Group” tag set to the same value: the name of the group. Defining volumes this way makes it much easier to launch instances with a whole set of EBS volumes attached. Our example instance will have a volume group saved in a file called “example-instance-volumes”:

{
  "availability-zone": "us-east-1a", // the zone each volume is launched in
  "volumes": [
    {
      "size": 100,       // size in GiB of each volume
      "type": "gp2",     // type of volume
      "count": 4,        // number of volumes
      "encrypted": false
    }
  ]
}

Before we create our instance, we must create these EBS volumes so they are ready to be attached. In Cumulus, it is as simple as running the following command:
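Something like the following, where the `ec2 ebs` subcommand path is an assumption based on the post’s description of an ebs module; consult the Cumulus docs for the exact invocation:

```shell
# Create the four 100GB gp2 volumes defined in the
# "example-instance-volumes" volume group, tagging each with
# Group=example-instance-volumes so Cumulus can find them later.
cumulus ec2 ebs sync example-instance-volumes
```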

Now that we have our EBS volumes ready, it’s time to define the configuration for the instance. Cumulus offers a lot of options when configuring any AWS resource, many of which will be the same between your EC2 instances. Below is a sample config that uses the Amazon Linux AMI, and the command used to launch it.

{
  "ebs-optimized": false,
  "placement-group": null,            // optional
  "profile": "example-instance-role", // the ARN of the IAM instance profile
  "image": "ami-8fcee4e5",            // if set, overrides the default-image-id in configuration.json
  "key-name": null,                   // optional
  "monitoring": false,                // whether monitoring is enabled
  "network-interfaces": 1,            // how many network interfaces should be on the instance
  "source-dest-check": false,         // whether source/dest checking is enabled on the instance and network interfaces
  "private-ip-address": null,         // if not set, one is picked from the subnet by AWS
  "security-groups": [                // security group names
    "default"
  ],
  "subnet": "public-us-east-1a",      // name or id of a subnet
  "tenancy": "default",               // default or dedicated
  "type": "m3.medium",                // one of the EC2 instance types
  "user-data": null,                  // optional, file name in user-data-scripts/
  "volume-groups": [                  // names of volume groups from the ebs module
    "example-instance-volumes"
  ],
  "tags": {
    "Name": "example-instance"
  }
}
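With the config above saved as “example-instance”, launching it might look like this; the `ec2 instances` subcommand path is an assumption, so verify it against the Cumulus documentation:

```shell
# Create the instance from the "example-instance" config, wait for it
# to reach the running state, then attach the volumes from the
# "example-instance-volumes" volume group.
cumulus ec2 instances sync example-instance
```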

When you run the Cumulus command, a few things happen:

1. The instance is created.
2. Cumulus waits for the instance to enter the running state.
3. Cumulus attaches the EBS volumes from the specified volume groups to /dev/xvd[f-z] (configurable).

The entire process takes around three minutes, most of which is spent waiting for the instance to reach the running state. The best part is that you can save your configuration in your favorite version control system and reuse it to launch your next instance.

Conclusion: our AWS management solution

Cumulus provides a manageable, lightweight layer of abstraction on top of AWS that allows you to have IAC. Organizations of any size can get started managing all or part of their AWS environment with Cumulus. It is also very easy to see differences between the running environment and what is in IAC. Syncing individual resources or all of the resources in a module is very simple and fast as well. At Lucid, using Cumulus has increased production stability, allowed us to identify misconfigurations, and saved us money.

Check it out at http://lucidsoftware.github.io/cumulus