
Things drift. For a car, drifting can be good, but in AWS CloudFormation, drift is bad.

What is drift?

Drift is when one of your resources has changed from what your original CloudFormation template specifies.

An example:

You’ve deployed an EC2 instance. Your template specifies a t2.medium, and that is what gets deployed. At some point afterwards, someone else with access decides the instance is running slowly and changes it to an m5.2xlarge. You don’t notice, since no notification is sent for this change by default. A few months go by before you glance at the higher cost on the bill and finally spot it. Not a good day.

This is a relatively minor change without huge consequences, but things can go badly wrong when resources diverge from their desired state. A worse example: someone ‘fixes’ something in dev but doesn’t update the CF template, so the change never makes it to prod, and prod breaks.

The CloudFormation console shows a drift status column in the ‘old’ version of the console and hides the drift status in the ‘new’ version. A stack can be in one of four states: IN_SYNC, DRIFTED, UNKNOWN, or NOT_CHECKED (individual resources can be IN_SYNC, MODIFIED, DELETED, or NOT_CHECKED). The console and the API let you trigger drift detection, but only on a per-stack basis. If you have 153 stacks in your environment, have fun clicking!

Once you’ve kicked off a drift detection, you have to wait an undetermined amount of time, anywhere from a few seconds to several minutes. It depends on how many resources are in your stack and therefore how many have to be checked for differences.

After the detection has finished you can check to see if there are any issues with the stack resources. If the stack is IN_SYNC, you are good. Any other status means you have a problem. You can then go into the drift details for the stack and see what all the differences are.
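The trigger-and-poll cycle for a single stack can be sketched with boto3. The helper name and polling delay are my own; the two API calls are the real CloudFormation operations:

```python
import time

def detect_drift(cfn, stack_name, delay=5):
    """Kick off drift detection on one stack and poll until it finishes."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            # DetectionStatus is now DETECTION_COMPLETE or DETECTION_FAILED;
            # StackDriftStatus tells you IN_SYNC vs DRIFTED
            return status
        time.sleep(delay)

if __name__ == "__main__":
    import boto3  # needs AWS credentials to actually run
    cfn = boto3.client("cloudformation", region_name="us-east-1")
    print(detect_drift(cfn, "my-stack")["StackDriftStatus"])
```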

As I don’t like spending a lot of time on a tedious, repetitive task like this, I’ve already automated this whole detection and reporting process. Here’s how it works:

One lambda/python script can do it all. The basic flow:

Get a list of stacks in a region

Kick off the drift detection process on each stack which is not in a bad state

Wait

If the stack’s drift status is DRIFTED, gather a list of resources not in the IN_SYNC state

Publish an SNS message with the appropriate info to Slack
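The last two steps of that flow might look like this. The helper names and the message format are mine, the environment variable names are assumptions, and the boto3 calls are the real APIs:

```python
def drifted_resources(cfn, stack_name):
    """Return the resource drift records that are not IN_SYNC."""
    resp = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED", "NOT_CHECKED"],
    )
    return resp["StackResourceDrifts"]

def build_message(account, stack_name, drifts):
    """Format a plain-text summary for the SNS-to-Slack message."""
    lines = ["Drift detected in %s, stack %s:" % (account, stack_name)]
    for d in drifts:
        lines.append("  %s: %s" % (d["LogicalResourceId"], d["StackResourceDriftStatus"]))
    return "\n".join(lines)

if __name__ == "__main__":
    import os
    import boto3  # needs AWS credentials to actually run
    cfn = boto3.client("cloudformation")
    sns = boto3.client("sns")
    drifts = drifted_resources(cfn, "my-stack")
    if drifts:
        sns.publish(
            TopicArn=os.environ["topicArn"],  # hypothetical env var names
            Subject="CloudFormation drift detected",
            Message=build_message(os.environ["accountName"], "my-stack", drifts),
        )
```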

All of this took me four files:

requirements.txt — the boto3 and botocore versions bundled with Lambda were not current, so I had to require newer ones

serverless.yml — the CF template of the lambda function using the AWS SAM model

lookForDriftedStacksDeployer.sh — the file I use to deploy across all my accounts

lookForDriftedStacks.py — the actual python code doing the work

The directory structure I used is simple: the four files above sit at the top level, with the requirements.txt file in a packages subdirectory.

requirements.txt:

boto3>=1.9

botocore>=1.12

While in the packages directory, run the following:

pip install -r requirements.txt -t .

This will cause pip to download all the required files for boto3 & botocore. There will be a bunch, so placing them in this directory keeps it neat and tidy.

The serverless.yml file:

I’ve set the timeout to 5 minutes, giving the function some time to run. It takes around 70 seconds to run against our dev environment with its 153 stacks, but we don’t have any particularly large stacks (ones with lots of resources).

You’ll see two environment variables: one is the ARN of the SNS topic used, the other is the name of the account, which is included in the SNS message to help us identify where the message is coming from.

I’ve also set a cron timer to run it every Monday.
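A minimal SAM-style template along these lines would cover the pieces just described; the handler name, runtime, parameter names, and the Monday cron hour are my assumptions, not the article’s actual values:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Parameters:
  topicArn:          # hypothetical parameter names, set via --parameter-overrides
    Type: String
  accountName:
    Type: String

Resources:
  lookForDriftedStacks:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lookForDriftedStacks.handler
      Runtime: python3.7
      Timeout: 300                        # 5 minutes
      Environment:
        Variables:
          topicArn: !Ref topicArn
          accountName: !Ref accountName
      Events:
        everyMonday:
          Type: Schedule
          Properties:
            Schedule: cron(0 8 ? * MON *)   # Mondays; the hour is a guess
```

The function would also need an IAM policy allowing CloudFormation drift calls and sns:Publish on the topic, omitted here for brevity.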

The lookForDriftedStacksDeployer.sh file:

I’ve named the profiles I use for our various accounts based on the level of access, so my ‘kernel’ profiles have full access and allow me to deploy. To reuse this bash script for another stack, all that really needs to change is the stackName at the top, assuming the stack has the same need for environment variables, which are set using the “--parameter-overrides” feature. My credentials are based out of the ‘master’ account, hence the stripping of the $profileString when deploying to that account.
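A sketch of what such a deployer could look like; the profile names, bucket name, and topic ARN below are placeholders, not the article’s actual values:

```shell
#!/bin/bash
stackName="lookForDriftedStacks"
profiles="dev-kernel prod-kernel master-kernel"

profile_flag() {
  # credentials live in the master account, so strip the --profile flag there
  if [ "$1" = "master-kernel" ]; then
    echo ""
  else
    echo "--profile $1"
  fi
}

deploy_all() {
  for profile in ${profiles}; do
    flag=$(profile_flag "${profile}")
    aws cloudformation package \
      --template-file serverless.yml \
      --s3-bucket "deploy-artifacts-${profile}" \
      --output-template-file packaged.yml ${flag}
    aws cloudformation deploy \
      --template-file packaged.yml \
      --stack-name "${stackName}" \
      --capabilities CAPABILITY_IAM \
      --parameter-overrides topicArn="arn:aws:sns:us-east-1:111111111111:driftAlerts" \
                            accountName="${profile}" ${flag}
  done
}

# only deploy when explicitly asked, so the file is safe to source
if [ "$1" = "deploy" ]; then
  deploy_all
fi
```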

The last file is lookForDriftedStacks.py:

I’m not a seasoned Python programmer, so this is what it looks like, and it works.

On line 5 there is the following:

sys.path.insert(0,os.environ["LAMBDA_TASK_ROOT"]+"/packages")

This puts the packages directory at the front of the import path, so the next two import statements resolve against it before Lambda’s built-in copies. Maybe at some point in the future, when Lambda ships a more current boto3 and botocore, this can be removed, but it doesn’t hurt anything.

As we use policy to prevent using regions besides us-east-1, us-east-2, and us-west-2, I only have to look for stacks in those three regions.

There are no account dependencies in this script and this is exactly what I run.
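The first step of the flow, restricted to those three regions, might look like this. The function name and the list of “good” statuses are my own choices, not necessarily the exact filter the script uses:

```python
# stack statuses worth checking for drift (an assumption, not an exhaustive list)
GOOD_STATUSES = ["CREATE_COMPLETE", "UPDATE_COMPLETE", "UPDATE_ROLLBACK_COMPLETE"]

def healthy_stacks(cfn):
    """List the names of stacks that are in a state worth drift-checking."""
    names = []
    for page in cfn.get_paginator("list_stacks").paginate(
        StackStatusFilter=GOOD_STATUSES
    ):
        names.extend(s["StackName"] for s in page["StackSummaries"])
    return names

if __name__ == "__main__":
    import boto3  # needs AWS credentials to actually run
    for region in ["us-east-1", "us-east-2", "us-west-2"]:
        cfn = boto3.client("cloudformation", region_name=region)
        print(region, healthy_stacks(cfn))
```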

Good luck and I hope this helps other people automate their drift detection and reporting.