Waterbear Cloud was born from an ambition to build better cloud orchestration solutions. Our first experience with Infrastructure-as-Code (IaC) used CloudFormation, a templating technology that defined, configured, and associated AWS resources with each other. Implementing and managing AWS environments became much more manageable. Storing CloudFormation templates in Git gave us versioned infrastructure that we could pair up with specific versions of the applications hosted there.

As we built out solution after solution for various clients, a tool made up of BASH scripts, CloudFormation templates, and configuration files began to rise from the ether. But more importantly, we started to see orchestration and infrastructure design patterns that were shared between different projects. A new vision for infrastructure orchestration and management was beginning to form.

Reviewing our existing orchestration solution, we identified two major features we dreamed of having available in infrastructure-as-code projects:

A high-level view of the environments, networks, accounts, and applications.

Adherence to the Don’t Repeat Yourself (DRY) software principle to eliminate duplication of configuration and infrastructure code.

Our discussions on solving these problems led us to the idea of semantic infrastructure configuration files that could act as a high-level roadmap for environments created by an IaC project. Using the design patterns we developed while implementing different environments, we discovered infrastructure architecture could be separated into networks and applications. We also noticed that infrastructure is ultimately hierarchical in nature, so we designed our configuration system to reflect this. Network environments hosted applications, but when developing web applications, networks and application infrastructure needed to be duplicated to support development, testing, staging, and production environments.

Using a hierarchical configuration format, we were able to define individual environments for each software development stage with minimal configuration by overriding a collection of network and application global settings. This eliminated duplication of configuration and gave rise to semantic cloud infrastructure.

Semantic Cloud Projects

Our first itch, to implement “a high-level view of the environments, networks, accounts, and applications,” was born from two problems we routinely encountered: business leaders asking high-level questions about IaC projects, and the difficulty of onboarding new engineers to cloud infrastructure projects.

Business leaders would ask, “What environments and applications do I have in the Cloud? How are those resources governed?” Under current best practices in infrastructure projects, these questions could only be answered by an engineer reading through the project’s Git repo, grokking thousands of lines of code, and manually translating that into business documents. We would often hear, “I thought the cloud was supposed to simplify my infrastructure. Why is it so difficult to get visibility into how my cloud is provisioned and governed?”

An engineer new to an infrastructure project asks, “How is this infrastructure project organized? What is the process for making modifications to the infrastructure? How do I run the IaC code to provision resources I’ve added or changed?” It could often take several weeks before a new engineer understood a given infrastructure project well enough to safely begin modifying it. When the original engineers left a project, or a solution architect handed it over to the Ops team, simple one-hour configuration updates could involve extensive engineering training followed by validation and sanity checking. An Ops rule of thumb was, “Four hours for the original engineer to develop a feature, and four days for a new engineer to make a small, subsequent change to it.”

We wanted a configuration file that started with high-level logical concepts such as networks, applications, and environments, and that grouped cloud resources in ways that made sense semantically. So our configuration file started with hierarchical YAML and looked something like this:

network:
  availability_zones: 2
  vpc:
    enable_internet_gateway: true
    nat_gateway:
      my-app:
        segment: public
applications:
  my-app:
    resources:
      loadbalancer:
        type: LBApplication
      webserver:
        type: AutoScalingGroup
environments:
  development:
    applications:
      my-app:
  production:
    applications:
      my-app:

Because this configuration organizes components hierarchically, we could track which application and environment every AWS resource belonged to. This made tasks such as enforcing correct AWS tagging and running per-application or per-environment cost analysis trivial. By creating a fixed file format that could be validated, we were able to ensure the semantic metadata’s correctness.
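As a minimal sketch of why the hierarchy makes tagging straightforward, the Python below walks a configuration shaped like the example above and derives an environment and application tag set for every resource. The configuration is written as a plain dict rather than parsed from YAML. The function name `resource_tags` and the tag keys are hypothetical illustrations, not part of our tooling.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical in-memory form of the YAML example above.
CONFIG = {
    "applications": {
        "my-app": {
            "resources": {
                "loadbalancer": {"type": "LBApplication"},
                "webserver": {"type": "AutoScalingGroup"},
            }
        }
    },
    "environments": {
        # An empty YAML value loads as None: no environment-specific settings.
        "development": {"applications": {"my-app": None}},
        "production": {"applications": {"my-app": None}},
    },
}


def resource_tags(config: dict) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (resource path, tags) for every resource in every environment."""
    for env_name, env in config["environments"].items():
        for app_name in env["applications"]:
            resources = config["applications"][app_name]["resources"]
            for res_name, res in resources.items():
                # The path encodes where the resource sits in the hierarchy.
                path = f"{env_name}.{app_name}.{res_name}"
                yield path, {
                    "Environment": env_name,
                    "Application": app_name,
                    "ResourceType": res["type"],
                }
```

Calling `dict(resource_tags(CONFIG))` gives one entry per environment, application, and resource combination. Each entry carries the metadata a cost report or tagging policy would need.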

DRY: Don’t Repeat Yourself

The second key feature we added to our configuration file was inherited configuration. Because we could define the full default configuration for networks and applications once, they could be provisioned into different environments as complete packages, and only the settings that differed for each environment needed to be overridden. Configuration no longer needed copy/pasting between environments and was truly DRY.

The duplication of configuration settings was a problem we had struggled with when working on CloudFormation and Terraform driven projects. For example, if you wanted to deploy an application to both development and production environments, with smaller instance sizes in development and larger ones in production, configuration files needed to be copied. Duplicating every setting for your entire application once for development and again for production meant managing updates to infrastructure in multiple places. If you wanted to see which settings were different between the environments, you could only do so by running a diff between files. Ad-hoc solutions to this problem had been proposed, but by having a fixed, validatable file format, we were able to ensure that invalid settings were reported before any attempt to provision infrastructure from broken configuration.
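To illustrate what reporting invalid settings before provisioning can look like, here is a small Python sketch. The `SCHEMA` table and the `validate_resource` function are assumptions made for this example. They are not our actual schema or validator, and the field names are simply taken from the examples in this post.

```python
from typing import List

# Hypothetical schema: allowed settings and their types per resource type.
SCHEMA = {
    "AutoScalingGroup": {
        "instance_type": str,
        "min_instances": int,
        "max_instances": int,
    },
    "LBApplication": {},
}


def validate_resource(name: str, resource: dict) -> List[str]:
    """Return human-readable errors; an empty list means the resource is valid."""
    res_type = resource.get("type")
    if res_type not in SCHEMA:
        return [f"{name}: unknown resource type {res_type!r}"]

    errors = []
    fields = SCHEMA[res_type]
    for key, value in resource.items():
        if key == "type":
            continue
        if key not in fields:
            # Catches typos such as 'instance_typ' before anything is provisioned.
            errors.append(f"{name}: unknown setting {key!r}")
        elif not isinstance(value, fields[key]):
            errors.append(f"{name}: {key} must be {fields[key].__name__}")
    return errors
```

A misspelled key or a quoted number then becomes a validation error instead of a failed or subtly wrong deployment.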

The copy/paste strategy for configuration files is manageable for a single application provisioned in a single AWS region with only two environments, but as cloud infrastructure grows in complexity, the number of configuration files multiplies with every environment, application, and region added:

2 environments (development/production) for 1 application: 2 configuration files.

3 environments (development/staging/production) for 2 applications: 6 configuration files.

4 environments (development/QA/pre-production/production) for 3 applications across 4 AWS Regions: 48 configuration files.
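The counts in the list above are the product of environments, applications, and regions. A short Python sketch makes the arithmetic explicit (the function name is hypothetical):

```python
def config_file_count(environments: int, applications: int = 1, regions: int = 1) -> int:
    """Files needed when every environment/application/region combination
    gets its own copy of the configuration."""
    return environments * applications * regions


# The three scenarios from the list above.
scenarios = [
    config_file_count(2, 1),     # development/production, 1 application
    config_file_count(3, 2),     # 3 environments, 2 applications
    config_file_count(4, 3, 4),  # 4 environments, 3 applications, 4 regions
]
```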

We’ve seen enterprise customers with over 800 orchestration files. If configuration changes were needed, a developer would first configure the development environment and test it there, but they might forget to copy/paste that setting to the other environments such as testing, staging, and production. This created a need to review all configuration files at play to understand the differences between them before changes could be safely made. Simple deployments that were expected to take 30 minutes often took 3 or 4 hours as configuration drift issues were manually hashed out.

Our final design for inheritable configuration has all the default resources and settings for an application described in an application section. Then, when an application is provisioned into a specific environment, its configuration can override just what is unique to that environment. For example, changing the size and number of instances running in a web server scaling group looks like this:

applications:
  my-app:
    resources:
      webapp:
        type: AutoScalingGroup
        instance_type: t3.large
        min_instances: 2
        max_instances: 4
environments:
  development:
    applications:
      my-app:
        resources:
          webapp:
            instance_type: t2.small
            min_instances: 1
            max_instances: 1
  production:
    applications:
      my-app:

After our tools parse this configuration file format, they merge the global settings and environment overrides together, and then unroll all of the configuration into a full data model where every environment has a complete set of configuration:

environments:
  development:
    applications:
      my-app:
        resources:
          webapp:
            type: AutoScalingGroup
            instance_type: t2.small
            min_instances: 1
            max_instances: 1
  production:
    applications:
      my-app:
        resources:
          webapp:
            type: AutoScalingGroup
            instance_type: t3.large
            min_instances: 2
            max_instances: 4
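The merge-and-unroll step can be sketched in a few lines of Python. This is an illustrative recursive merge over plain dicts, not the actual implementation in our tools. The names `deep_merge` and `unroll` are hypothetical, and the example assumes the application is called `my-app` in both environments.

```python
import copy
from typing import Optional

CONFIG = {
    "applications": {
        "my-app": {"resources": {"webapp": {
            "type": "AutoScalingGroup", "instance_type": "t3.large",
            "min_instances": 2, "max_instances": 4,
        }}}
    },
    "environments": {
        "development": {"applications": {"my-app": {"resources": {"webapp": {
            "instance_type": "t2.small", "min_instances": 1, "max_instances": 1,
        }}}}},
        # An empty YAML value loads as None: use the defaults unchanged.
        "production": {"applications": {"my-app": None}},
    },
}


def deep_merge(defaults: dict, overrides: Optional[dict]) -> dict:
    """Return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if value is None and key in merged:
            continue  # Assumption: an empty value means "no override here".
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def unroll(config: dict) -> dict:
    """Give every environment a complete copy of each application's settings."""
    result = {"environments": {}}
    for env_name, env in config["environments"].items():
        apps = {}
        for app_name, overrides in (env.get("applications") or {}).items():
            apps[app_name] = deep_merge(config["applications"][app_name], overrides)
        result["environments"][env_name] = {"applications": apps}
    return result
```

`unroll(CONFIG)` returns a structure shaped like the unrolled YAML above. Because the defaults are deep-copied, overriding a setting in one environment cannot leak into another.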

With all of this configuration validated and ready to go, we then built the Application Infrastructure Manager (AIM) engine to take this configuration, generate CloudFormation templates, and provision them to the cloud.

The wonders of semantic cloud configuration

As we developed the Waterbear Cloud platform and started provisioning even more complex applications, we felt euphoric at times. Problems that had previously been thorny pain points in infrastructure projects simply went away once every provisioned resource was organized into logical concepts. Previous projects we’d worked on tackled semantic metadata only partially, and metadata wasn’t available throughout 100% of the system. For example, the application and environment of a resource could be determined by the directory structure that configuration files and code were placed in, but access to these informal standards was ad hoc. Scripts that were expected to “walk up two directories and read the parent directory name” to find the environment to provision into were fragile. Attempts to re-use this ad-hoc metadata in other tooling, such as installing and configuring agents or setting alarms, were error-prone. Complex infrastructure projects often felt like the left hand didn’t know what the right hand was doing: changes to these brittle systems often had unexpected consequences.

Having built this base configuration system, we’re very excited about the solutions we will build with it in the future. For example, we have been working on a web application for visualizing the configuration and plainly displaying your networks, applications, and environments with high-level views of the infrastructure’s health, performance, and configuration.