Since the advent of infrastructure-as-code, there has been a steady push to open source projects and share them with the community at large. Today Neiman Marcus is open sourcing its Terraform Jenkins module. The module deploys a highly available Jenkins implementation on AWS, out of the box.

In this blog I will take you through the module and the decisions behind its development. Please feel free to check out the GitHub repo or the module in the Terraform Module Registry.

Features

Before going into specifics, let us discuss some of the features this module provides.

A highly available architecture, placing the master node and agents in autoscaling groups.

Agents are ephemeral, while the master node connects to an EFS volume for data and configuration persistence.

Completely managed infrastructure-as-code project that will spin up the necessary resources for a full Jenkins deployment.

Custom user-data is available for additional instance configuration.

Spot instance pricing for Jenkins agents.

Agents connect via an API key rather than a username and password, allowing for external user authentication sources.

Definitions

To clear things up, here are short descriptions of both tools. This blog assumes you are familiar with the AWS products and services that will be created.

Jenkins — an open-source automation server that enables developers to reliably build, test, and deploy software. The legacy of Jenkins for software deployment and development is undeniable.

Terraform — an open-source infrastructure-as-code tool. It enables users to define and provision datacenter infrastructure using a high-level configuration language known as HashiCorp Configuration Language (HCL).

Architecture and Design

Now for a review of the architecture and design.

The architecture looks simple on the surface, but there is a lot going on under the hood. Similar to a basic web-application architecture, an Elastic Load Balancer sits in front of the master autoscaling group, which connects directly to the agents sitting in their own autoscaling group.

Master Node Details

The master autoscaling group uses the Amazon Linux 2 AMI. This ASG is set to a minimum and maximum of one instance and does not scale out or in. The instance can run in one of two availability zones, and the ASG replaces it based on the ELB health check.

The name of the master ASG is identical to that of the master launch configuration. This is intentional: if the launch configuration is updated, its name changes, which forces the master autoscaling group to be recreated with the new launch configuration.
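A minimal sketch of this pattern, assuming simplified resource and variable names (not the module's exact code):

```hcl
resource "aws_launch_configuration" "master" {
  name_prefix     = "jenkins-master-"
  image_id        = var.master_ami_id
  instance_type   = var.master_instance_type
  security_groups = [aws_security_group.master.id]

  lifecycle {
    # Build the replacement launch configuration before destroying the old one.
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "master" {
  # Naming the ASG after the launch configuration means any change to the
  # launch configuration renames the ASG, forcing Terraform to recreate it.
  name                 = aws_launch_configuration.master.name
  launch_configuration = aws_launch_configuration.master.name
  min_size             = 1
  max_size             = 1
  vpc_zone_identifier  = var.subnet_ids

  lifecycle {
    create_before_destroy = true
  }
}
```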

Data is persisted on an EFS volume, with a mount target in each availability zone.

Agent Node Details

The agent ASG also uses the Amazon Linux 2 AMI and is placed in the same availability zones.

Agents connect to the master node through the Jenkins SWARM plugin.

Agents are configured as spot instances to lower cost.

Agent Scaling Details

Agents scale based on CPU and on the Jenkins build queue. If the number of available executors is too low, the ASG will scale out; if executors are idle, it will scale in. This is configured in the cloud-init user data (see the Scaling Agents snippets below).

Interesting Points

The design, as straightforward as it is, has some interesting tricks up its sleeve.

API Key Generation

During initial launch, the master will generate an API key and publish it to SSM Parameter Store. This is needed because the agents authenticate with that key when they connect through the SWARM plugin.

As we can see in the sketch below, if the api_key.txt file does not exist, the script will (1) generate a crumb, (2) generate an API key, (3) write it to a file on the EFS volume, and (4) publish it to SSM Parameter Store. Finally, it uses sed to place the API key in the scaling scripts.
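This reconstruction is a sketch under assumptions: the admin credential source, file paths, token name, and SSM parameter name are illustrative, not the module's exact script.

```bash
#!/usr/bin/env bash
# Requires: awscli, curl, jq
set -euo pipefail

EFS_MOUNT=/mnt/efs
KEY_FILE="$EFS_MOUNT/api_key.txt"
JENKINS_URL=http://localhost:8080
ADMIN_USER=admin
ADMIN_PASS=$(cat /var/lib/jenkins/secrets/initialAdminPassword)

if [ ! -f "$KEY_FILE" ]; then
  # (1) Generate a crumb so Jenkins will accept our POST requests.
  CRUMB=$(curl -s -u "$ADMIN_USER:$ADMIN_PASS" \
    "$JENKINS_URL/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,%22:%22,//crumb)")

  # (2) Generate an API token for the admin user.
  API_KEY=$(curl -s -u "$ADMIN_USER:$ADMIN_PASS" -H "$CRUMB" -X POST \
    "$JENKINS_URL/user/$ADMIN_USER/descriptorByName/jenkins.security.ApiTokenProperty/generateNewToken" \
    --data 'newTokenName=swarm-agents' | jq -r '.data.tokenValue')

  # (3) Write the key to the EFS volume so it survives instance replacement.
  echo "$API_KEY" > "$KEY_FILE"

  # (4) Publish the key to SSM Parameter Store for the agents to read.
  aws ssm put-parameter --name /jenkins/api-key --type SecureString \
    --value "$API_KEY" --overwrite
fi

# Substitute the key into the scaling scripts with sed.
sed -i "s|__API_KEY__|$(cat "$KEY_FILE")|" /opt/jenkins/scale-out.sh /opt/jenkins/scale-in.sh
```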

Agent Connectivity

The agents are smart enough to look up the master's IP address with the AWS CLI and to fetch the API key from the Parameter Store.

Agents launch, configure themselves, and connect to the master. If agents cannot connect, or get disconnected, the agent will self-terminate, causing the ASG to create a new instance. This covers the case where agents launch before the master has published the API key to the Parameter Store. Here is a sketch of the snippet that enables this on the agents:
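(Again a reconstruction under assumptions: the parameter name, ASG name, and swarm-client flags are illustrative.)

```bash
#!/usr/bin/env bash
# No "set -e": we want to reach the self-terminate step if the client fails.
set -uo pipefail

# Derive the region and this instance's ID from the metadata service.
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/.$//')
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Fetch the API key the master published to SSM Parameter Store.
API_KEY=$(aws ssm get-parameter --name /jenkins/api-key --with-decryption \
  --query 'Parameter.Value' --output text --region "$REGION")

# Look up the master's private IP through its autoscaling group.
MASTER_ID=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names jenkins-master \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' --output text --region "$REGION")
MASTER_IP=$(aws ec2 describe-instances --instance-ids "$MASTER_ID" \
  --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text --region "$REGION")

# Connect to the master with the SWARM client. This blocks while connected;
# it exits non-zero if the key is not yet published or the master is gone.
java -jar /opt/jenkins/swarm-client.jar \
  -master "http://$MASTER_IP:8080" \
  -username admin -password "$API_KEY" \
  -executors 2 -disableClientsUniqueId

# On connection failure or disconnect, self-terminate so the ASG replaces
# this agent with a fresh instance that retries the whole sequence.
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --region "$REGION"
```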

After the API key is published, the agents and master will sync up. If the master is terminated, the agents will automatically terminate as well.

Scaling Agents

The master node has a cron job set to poll its busy executors. If the number of available executors falls below half of the defined minimum, an alarm is triggered, scaling out the agent ASG.
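A sketch of that scale-out check, assuming an illustrative CloudWatch metric name and namespace (the alarm and scaling policy themselves would be defined in Terraform):

```bash
#!/usr/bin/env bash
set -euo pipefail

JENKINS_URL=http://localhost:8080
API_KEY=$(cat /mnt/efs/api_key.txt)

# Poll Jenkins for its executor counts.
STATS=$(curl -s -u "admin:$API_KEY" "$JENKINS_URL/computer/api/json")
TOTAL=$(echo "$STATS" | jq '.totalExecutors')
BUSY=$(echo "$STATS" | jq '.busyExecutors')

# Publish the available-executor count; a CloudWatch alarm on this metric
# fires when it drops below half the minimum, scaling out the agent ASG.
aws cloudwatch put-metric-data --namespace Jenkins \
  --metric-name AvailableExecutors --value "$((TOTAL - BUSY))"
```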

With the next snippet, the master polls for idle executors. Since the ASG is not aware of the application, we take a very conservative approach to scaling in: only after executors have been idle for 10 minutes, no jobs are running, and the number of executors is greater than the minimum will it terminate an instance.
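A sketch of that scale-in logic; the idle-tracking marker file, ASG name, and minimum are illustrative assumptions:

```bash
#!/usr/bin/env bash
set -euo pipefail

JENKINS_URL=http://localhost:8080
API_KEY=$(cat /mnt/efs/api_key.txt)
REGION=us-east-1
ASG_NAME=jenkins-agents
MIN_EXECUTORS=4                         # illustrative minimum
IDLE_MARKER=/tmp/executors_idle_since

STATS=$(curl -s -u "admin:$API_KEY" "$JENKINS_URL/computer/api/json")
TOTAL=$(echo "$STATS" | jq '.totalExecutors')
BUSY=$(echo "$STATS" | jq '.busyExecutors')
QUEUE=$(curl -s -u "admin:$API_KEY" "$JENKINS_URL/queue/api/json" | jq '.items | length')

if [ "$BUSY" -eq 0 ] && [ "$QUEUE" -eq 0 ]; then
  # Record when the executors first went idle.
  [ -f "$IDLE_MARKER" ] || date +%s > "$IDLE_MARKER"
  IDLE_FOR=$(( $(date +%s) - $(cat "$IDLE_MARKER") ))

  # Scale in only after 10 idle minutes and only above the minimum.
  if [ "$IDLE_FOR" -ge 600 ] && [ "$TOTAL" -gt "$MIN_EXECUTORS" ]; then
    DESIRED=$(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names "$ASG_NAME" \
      --query 'AutoScalingGroups[0].DesiredCapacity' --output text --region "$REGION")
    aws autoscaling set-desired-capacity --auto-scaling-group-name "$ASG_NAME" \
      --desired-capacity $((DESIRED - 1)) --region "$REGION"
    rm -f "$IDLE_MARKER"
  fi
else
  # Work showed up; reset the idle timer.
  rm -f "$IDLE_MARKER"
fi
```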

Updating Jenkins and SWARM versions

The Jenkins and SWARM versions are controlled as variables, so it is simple to update a variable to whatever latest version is available. For this module, we will increment the module version number with each Jenkins/SWARM version update. We standardize on LTS, but if you need to leverage a specific version, the variable can be overridden, for example:
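(The source address and variable names here are illustrative; check the registry documentation for the exact inputs.)

```hcl
module "jenkins" {
  source = "neiman-marcus/jenkins/aws"

  # Pin specific Jenkins and SWARM versions instead of the module defaults.
  jenkins_version = "2.204.2"
  swarm_version   = "3.17"
}
```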

Custom User Data

Terraform’s implementation of generating cloud-init user data is fairly robust. As we can see below, a template is rendered, with variables passed in to be replaced upon rendering.
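A sketch of that rendering step, assuming illustrative file and variable names:

```hcl
data "template_file" "master_user_data" {
  template = file("${path.module}/init/master.tpl")

  vars = {
    efs_id            = aws_efs_file_system.jenkins.id
    api_key_parameter = var.api_key_parameter_name
  }
}
```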

Then the template parts are put together to generate the cloud-init configuration.
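Something like the following, using the template_cloudinit_config data source (the second part is the hook for caller-supplied user data; the variable name is an assumption):

```hcl
data "template_cloudinit_config" "master" {
  gzip          = true
  base64_encode = true

  # The module's own bootstrap script.
  part {
    content_type = "text/x-shellscript"
    content      = data.template_file.master_user_data.rendered
  }

  # Optional extra part supplied by the caller.
  part {
    content_type = "text/x-shellscript"
    content      = var.extra_master_user_data
  }
}
```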

Since the configuration is assembled in parts, we are able to pass an externally rendered template into the module as a variable. Simply define another data source outside the module and pass it in, like below.
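(The module variable name here matches the sketch above and is an assumption; see the module's documented inputs.)

```hcl
data "template_file" "extra_user_data" {
  template = file("${path.root}/templates/extra.tpl")

  vars = {
    environment = "production"
  }
}

module "jenkins" {
  source = "neiman-marcus/jenkins/aws"

  # Pass the externally rendered template into the module.
  extra_master_user_data = data.template_file.extra_user_data.rendered
}
```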

If you are not familiar with cloud-init, leaving the extra user data off will still accomplish everything needed to get Jenkins up and running.

FAQ

I’ll go ahead and get some of the obvious questions out of the way.

Why Terraform?

Terraform has become the industry leader in cloud provisioning. Terraform provides flexibility in its template and cloud-init implementation. Additionally, the module registry enables open sourcing a project with ease.

Why not use ECS or Fargate?

ECS still requires managing instances with an autoscaling group, in addition to the ECS configuration. Using autoscaling groups alone means less management overhead.

Fargate cannot be used with the master node as it cannot currently mount EFS volumes. It is also more costly than spot pricing for the agents.

Why not use a plugin to create agents?

The goal is to define the deployment completely in code. If a plugin were used and configured for agent deployment, defining the solution as code would be more challenging. With the SWARM plugin and the current configuration, the infrastructure deploys instances, and the instance user-data makes all the connections happen, with no manual configuration. The master is only involved in scaling the agents in and out based on executor load.

Where did you start?

Neiman Marcus was using a heavily modified open-source CloudFormation template from Cloudonaut, which was the basis of this project. It has since been redesigned to ease the cloud-init configuration, as well as moved to Terraform. These changes have helped remove some of the administrative overhead.

Conclusion

I hope this blog has been informative. For further code review, or to help improve this project, please head over to the GitHub repo or check out the module in the Terraform Module Registry.

Thank you for your time!

CRD