In this post we’ll walk through the AWS services and features that enable canary deployments of Lambda functions. If you just want to deploy your functions safely and are not interested in the details, you can go straight to the Canary Deployments Serverless Plugin.

Deployment in a Serverless application is an all-at-once process: when we release a new version of any of our functions, every single user hits the new version. We must be really confident about the new version, because if anything goes wrong and the function contains an error, all of our users will experience ugly issues. However, AWS recently introduced a feature that can make our deployment process much more reliable and secure: traffic shifting using aliases.

How can alias traffic shifting help us?

Usually, deploying a Lambda function means that all the function invocations will execute the new code, either because we are updating $LATEST or because we are pointing an alias to a new version. That means that if anything goes wrong, 100% of the invocations will fail and we will have to quickly roll back to the previous version or, what is even worse, we might not notice the bug and leave our system in an inconsistent state. However, with the introduction of alias traffic shifting, we can now specify version weights on an alias, so that only a certain percentage of the invocations is routed to the new release while the rest keep hitting the previous one. Lambda automatically load balances requests between the two versions configured on the alias, allowing us to check how the new release behaves before completely replacing the previous one, which minimizes the impact of a possible bug.
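To make the idea of alias weights concrete, here is a minimal sketch of how a weighted alias could be configured with boto3. The alias name `live`, the version numbers and the helper names are assumptions made up for illustration; `update_alias` and its `RoutingConfig` parameter are part of the Lambda API, but the call itself needs AWS credentials and is not executed here.

```python
def build_alias_routing(alias_name, stable_version, new_version, new_version_weight):
    """Build the keyword arguments for Lambda's UpdateAlias call.

    Hypothetical helper: the weight is the fraction (0.0 to 1.0) of
    invocations that should reach the new version.
    """
    if not 0.0 <= new_version_weight <= 1.0:
        raise ValueError("new_version_weight must be between 0.0 and 1.0")
    return {
        "Name": alias_name,
        # The alias keeps pointing at the stable version...
        "FunctionVersion": stable_version,
        # ...while a fraction of the invocations goes to the new one.
        "RoutingConfig": {
            "AdditionalVersionWeights": {new_version: new_version_weight}
        },
    }


def shift_traffic(function_name, **routing):
    """Apply the routing with boto3 (requires AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("lambda")
    return client.update_alias(FunctionName=function_name, **routing)


# Assumed example: send 10% of the traffic to version 2, keep 90% on version 1.
routing = build_alias_routing("live", "1", "2", 0.1)
print(routing)
```

Calling `shift_traffic("my-function", **routing)` would then apply those weights, where `my-function` is again a placeholder name.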

As we see that the new version behaves correctly, we could then gradually update its weight, increasing the load it receives. We can do that in the AWS Console, through the CLI or with some open source tools that handle weight updates automatically, but there’s a better way to handle Lambda function deployments.

Automating the deployment process

Even though being able to do traffic shifting is a huge leap forward, let’s admit that having to update weights manually (or deploying our own management system) is not really convenient. Don’t worry, as usual AWS has us covered. With CodeDeploy we can just specify how we want traffic to be shifted over time and it automatically adjusts the weights. There are three different types of deployment preferences:

Canary: we specify the percentage of traffic we want to shift and the time we want the deployment to last. So, if we pass 10% and 30 minutes, for example, 10% of the traffic will be routed to the new version for half an hour. When that time has passed, all the traffic will be shifted to the new version.

Linear: the amount of traffic routed to the new version will be incremented according to the provided percentage and interval. So, if we configure it to increment 10% of the traffic every 5 minutes, CodeDeploy will update the alias weights, adding 10% to the new version at intervals of 5 minutes, until all the traffic has been shifted.

All-at-once: all the traffic is shifted to the new version straight away.
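The three preferences can be compared by writing out the weight schedule each one produces. The following is only a local model of the behaviour described above, not CodeDeploy code; the function names and the `(minute, weight)` representation are assumptions chosen for illustration.

```python
def canary(percentage, minutes):
    """Shift `percentage` of the traffic first, then everything after `minutes`."""
    return [(0, percentage), (minutes, 100)]


def linear(percentage, interval_minutes):
    """Add `percentage` of the traffic every `interval_minutes` until 100%."""
    steps = []
    weight, minute = percentage, 0
    while weight < 100:
        steps.append((minute, weight))
        weight += percentage
        minute += interval_minutes
    # The last step is capped at 100% even if the increments overshoot.
    steps.append((minute, 100))
    return steps


def all_at_once():
    """Shift all the traffic straight away."""
    return [(0, 100)]


# Each entry is (minutes since the start, % of traffic on the new version).
print(canary(10, 30))
print(linear(10, 5))
print(all_at_once())
```

With the numbers used in the text, the canary schedule has two steps, while the linear one takes ten steps and reaches 100% after 45 minutes.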

This way, we let CodeDeploy do all the heavy lifting in the deployment process, and change alias weights according to our preferences. Traffic will be shifted gradually and, in the meantime, we can check if our new function is behaving correctly and cancel the deployment if we see anything weird. How could we automate the roll back process, so that we don’t have to manually check how the system is performing? CodeDeploy has thought about that as well.

Rolling back to the previous Lambda Function version

CodeDeploy allows us to configure a pre-traffic and a post-traffic shifting hook, which are in fact Lambda functions that are triggered before and after the traffic shifting process. They’re suited for performing tasks like running integration tests. CodeDeploy expects to get notified about the success or failure of the hooks within one hour, otherwise it’ll assume they failed. In any case, if a hook fails, either by not calling CodeDeploy or by explicitly calling it with a failure response, the deployment will be aborted and all the traffic will be shifted back to the old function version.
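A hook function could look roughly like the sketch below. It assumes the event CodeDeploy passes to the hook carries `DeploymentId` and `LifecycleEventHookExecutionId`, and it reports back through the CodeDeploy API call `put_lifecycle_event_hook_execution_status`. The `run_integration_tests` placeholder and the `make_hook_handler` factory are hypothetical names; the client is injected so the logic can be exercised without AWS access.

```python
def run_integration_tests():
    """Hypothetical placeholder: replace with real checks on the new version."""
    return True


def make_hook_handler(codedeploy_client, run_tests=run_integration_tests):
    """Build a Lambda handler that reports the test result to CodeDeploy."""

    def handler(event, context):
        try:
            status = "Succeeded" if run_tests() else "Failed"
        except Exception:
            # A crashing test suite counts as a failed hook.
            status = "Failed"
        # Without this call CodeDeploy eventually treats the hook as failed.
        codedeploy_client.put_lifecycle_event_hook_execution_status(
            deploymentId=event["DeploymentId"],
            lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
            status=status,
        )
        return status

    return handler
```

In a deployed function the handler would be created once at module level, for example `handler = make_hook_handler(boto3.client("codedeploy"))`.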

Hooks are not the only way we can check that our function is behaving as expected, since we can provide CodeDeploy with a list of CloudWatch Alarms to monitor the deployment process. As the traffic shifting begins, CodeDeploy will track those alarms, cancelling the deployment and rolling back to the previous function version if any of them is triggered.

Deployment process with CodeDeploy and Lambda weighted alias

Hooks and alarms allow us to monitor the whole deployment process. We can perform some tests before routing any traffic to the new function version, track alarms during the traffic shifting process and run more tests right after all the traffic is hitting the new version. If CodeDeploy notices something wrong in any of those steps, it will automatically roll back to the old, stable function version.
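The order of those checks can be summarised in a few lines of code. This is a simplified local model of the flow described above, not how CodeDeploy is implemented; `run_deployment`, its callbacks and the status strings are assumptions made for the example.

```python
def run_deployment(schedule, pre_hook, post_hook, alarm_triggered):
    """Model the flow: pre-traffic hook, weighted shifting, post-traffic hook.

    Returns (status, % of traffic left on the new version).
    """
    # 1. Tests run before any traffic reaches the new version.
    if not pre_hook():
        return ("rolled_back", 0)

    # 2. Traffic is shifted step by step while alarms are being watched.
    new_version_weight = 0
    for minute, weight in schedule:
        new_version_weight = weight
        if alarm_triggered(minute, new_version_weight):
            return ("rolled_back", 0)

    # 3. More tests run once all the traffic hits the new version.
    if not post_hook():
        return ("rolled_back", 0)
    return ("succeeded", new_version_weight)


# Canary 10% for 30 minutes, with no failing hook and no alarm.
canary_schedule = [(0, 10), (30, 100)]
result = run_deployment(
    canary_schedule,
    pre_hook=lambda: True,
    post_hook=lambda: True,
    alarm_triggered=lambda minute, weight: False,
)
print(result)
```

Any failing step ends the run with all the traffic back on the old version, which is the behaviour the hooks and alarms are meant to guarantee.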

CloudFormation all the things

All that sounds really good, but how do we set it up? Well, doing it manually doesn’t seem to be a reasonable option, but fortunately AWS has an awesome service for defining and provisioning infrastructure: CloudFormation. This tool gives us a way to model our system in YAML or JSON templates, which we can use to create a collection of related AWS resources. As you can guess, the syntax of a template where we can define almost any imaginable resource in the AWS ecosystem, along with its configuration, is intricate. AWS even built a simplified syntax to define Serverless applications, the Serverless Application Model, although it only supports a tiny subset of the AWS resources. The best way to deal with CloudFormation in a Serverless environment is, hands down, the Serverless Framework, which provides a nice and easy DSL that is then turned into a CloudFormation template, so that we don’t have to deal with its complexity. When the DSL falls short, we can still include chunks of CloudFormation template syntax to create the resources we need. It turns out that the framework has not implemented the canary deployments feature, so we’ll have to specify the resources ourselves. We’ll need to include the following:

A CodeDeploy::Application.

An IAM::Role with AWSCodeDeployRoleForLambda and AWSLambdaFullAccess permissions for CodeDeploy.

A CodeDeploy::DeploymentGroup for every function, where we’ll specify the deployment preference type and alarms.

A Lambda::Alias for every function including CodeDeployLambdaAliasUpdate, where we specify the CodeDeploy Application and DeploymentGroup it belongs to, and the associated hooks.

In addition, the Serverless Framework always triggers the $LATEST Lambda function version upon any event, so we must replace every reference to the function in the event sources with the newly created alias.
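As a rough sketch, the resources listed above could look like the following CloudFormation fragment, written here as a Python dictionary and printed as JSON. Every logical ID (`HelloLambdaFunction`, `HelloLambdaVersion`, `HelloDeploymentGroup`, the hook functions), the alias name `live` and the alarm name are hypothetical, and the function, version and hook resources are assumed to be defined elsewhere in the template. The policy names are the ones mentioned in the text and may have been superseded since.

```python
import json

FUNCTION = "HelloLambdaFunction"  # hypothetical function logical ID
VERSION = "HelloLambdaVersion"    # hypothetical AWS::Lambda::Version logical ID

resources = {
    "CodeDeployApplication": {
        "Type": "AWS::CodeDeploy::Application",
        "Properties": {"ComputePlatform": "Lambda"},
    },
    "CodeDeployServiceRole": {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {"Service": ["codedeploy.amazonaws.com"]},
                    "Action": ["sts:AssumeRole"],
                }],
            },
            # Policies named in the text; check the current IAM docs.
            "ManagedPolicyArns": [
                "arn:aws:iam::aws:policy/service-role/AWSCodeDeployRoleForLambda",
                "arn:aws:iam::aws:policy/AWSLambdaFullAccess",
            ],
        },
    },
    "HelloDeploymentGroup": {
        "Type": "AWS::CodeDeploy::DeploymentGroup",
        "Properties": {
            "ApplicationName": {"Ref": "CodeDeployApplication"},
            "ServiceRoleArn": {"Fn::GetAtt": ["CodeDeployServiceRole", "Arn"]},
            # Deployment preference: 10% canary for 30 minutes.
            "DeploymentConfigName": "CodeDeployDefault.LambdaCanary10Percent30Minutes",
            "DeploymentStyle": {
                "DeploymentType": "BLUE_GREEN",
                "DeploymentOption": "WITH_TRAFFIC_CONTROL",
            },
            # Alarm name is an assumption; the alarm must exist in CloudWatch.
            "AlarmConfiguration": {
                "Enabled": True,
                "Alarms": [{"Name": "hello-function-errors"}],
            },
            "AutoRollbackConfiguration": {
                "Enabled": True,
                "Events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
            },
        },
    },
    "HelloAliasLive": {
        "Type": "AWS::Lambda::Alias",
        "Properties": {
            "FunctionName": {"Ref": FUNCTION},
            "FunctionVersion": {"Fn::GetAtt": [VERSION, "Version"]},
            "Name": "live",
        },
        # Tells CloudFormation to update the alias through CodeDeploy.
        "UpdatePolicy": {
            "CodeDeployLambdaAliasUpdate": {
                "ApplicationName": {"Ref": "CodeDeployApplication"},
                "DeploymentGroupName": {"Ref": "HelloDeploymentGroup"},
                "BeforeAllowTrafficHook": {"Ref": "PreTrafficHookFunction"},
                "AfterAllowTrafficHook": {"Ref": "PostTrafficHookFunction"},
            }
        },
    },
}

template_fragment = json.dumps({"Resources": resources}, indent=2)
print(template_fragment)
```

The printed JSON is only a fragment: it would still need the function, version, hook and alarm definitions, plus the event sources pointing at the alias, before it could be deployed.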

If this sounds like a lot of hassle… it’s just because it is. Luckily, the Serverless Framework is really modular and there are tons of plugins to complement its features, so you can use the Serverless Plugin Canary Deployments to create all those resources in a much more convenient way. (Note: I’m the author of the plugin. Any contribution, comment or feature request is welcome.)

Happy safe deployments!