In this age of microservices and fast-changing functionality, we don’t always put enough thought into the negative effects that a new feature might have. Continuous delivery is mandatory if you need speed. However, if you have millions of customers, then unit tests, integration tests, or even manual tests will not be enough to catch all potential issues.

Canary release is a technique that reduces the impact of bad changes by rolling them out gradually. If a problem with the new release is detected during the rollout, it can be rolled back, and only a subset of the traffic will have been impacted.

On a related historical note:

Well into the 20th century, coal miners brought canaries into coal mines as an early-warning signal for toxic gases, primarily carbon monoxide. The birds, being more sensitive, would become sick before the miners, who would then have a chance to escape or put on protective respirators.

My team investigated the numerous options for implementing canary releases on AWS. We wanted a solution that was simple and that didn’t introduce new components to our stack.

There are a few assumptions to keep in mind. It might be easy to take these for granted, but they’re essential, so it is worth being explicit:

Both the old and the new (A and B) versions should be able to run in parallel at the same time without any side effects.

You should be able to go back to the old state at any point in time.

There is no stickiness in the session, meaning if a user had one request processed by version A there is no guarantee that they will always get version A.

Before going into the details on what we ended up with, let’s see what we used before the canary setup.

Managing change with CloudFormation and RollingUpdate:

The building blocks are (sketched in CloudFormation below):

A Route 53 ALIAS record to have a static name in front of the ELB

An Elastic Load Balancer (ELB) with one autoscaling group

3 EC2 instances placed in the autoscaling group

AWS setup (note: the 3 nodes are mostly for availability in different zones)
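To make that concrete, here is a minimal CloudFormation sketch of those building blocks. The hosted zone, AMI, ports and health-check endpoint are placeholders, not our actual template.

Resources:
  LoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      AvailabilityZones: !GetAZs ''
      Listeners:
        - LoadBalancerPort: 80          # placeholder listener; yours will differ
          InstancePort: 8080
          Protocol: HTTP
      HealthCheck:
        Target: HTTP:8080/health        # hypothetical health endpoint
        Interval: 10
        Timeout: 5
        HealthyThreshold: 3
        UnhealthyThreshold: 2
      ConnectionDrainingPolicy:
        Enabled: true
        Timeout: 60

  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-12345678             # placeholder AMI baked with the service
      InstanceType: t2.medium

  ActiveGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 3
      MaxSize: 4
      DesiredCapacity: 3                # the 3 nodes, spread across zones
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref LaunchConfig
      LoadBalancerNames: [!Ref LoadBalancer]
      HealthCheckType: ELB              # replace instances the ELB marks unhealthy
      HealthCheckGracePeriod: 300

  DnsAlias:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.      # placeholder hosted zone
      Name: service.example.com.
      Type: A
      AliasTarget:
        DNSName: !GetAtt LoadBalancer.DNSName
        HostedZoneId: !GetAtt LoadBalancer.CanonicalHostedZoneNameID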

The setup looks pretty straightforward, but there are quite a few things that need to be in place (a sketch follows the list):

A robust health-check mechanism for the load balancer.

An UpdatePolicy that defines how the new instances are added and old ones removed.

A setup for our CloudFormation templates with support for Rolling Deployments of Auto Scaling Groups.

A way for the EC2 instance to tell the ELB “YES I am healthy — use me!”.
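As a rough illustration of the last three points, this is the same ActiveGroup and LaunchConfig from the sketch above with an UpdatePolicy and a health signal added. The batch size, pause time and start script are illustrative placeholders:

  ActiveGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: 1                 # replace one instance at a time
        MinInstancesInService: 3        # never drop below normal capacity
        PauseTime: PT10M                # wait up to 10 minutes per batch
        WaitOnResourceSignals: true     # only continue once the new instance signals success
    Properties:
      MinSize: 3
      MaxSize: 4
      DesiredCapacity: 3
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref LaunchConfig
      LoadBalancerNames: [!Ref LoadBalancer]
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300

  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-12345678             # placeholder AMI with the new application version
      InstanceType: t2.medium
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          start-my-service              # hypothetical start script for the application
          # wait until the local health-check answers, i.e. "YES I am healthy, use me!"
          until curl -sf http://localhost:8080/health; do sleep 5; done
          # tell CloudFormation the instance is ready so the rolling update can continue
          /opt/aws/bin/cfn-signal --success true \
            --stack ${AWS::StackName} --resource ActiveGroup --region ${AWS::Region}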

After having all of this in place, adding a new version triggers a rolling update where instances are taken out in batches of MaxBatchSize and new ones are spun up. Once all original instances have been replaced with new ones, the stack update is complete:

For a system running version v1 going to v2:

A new v2 instance is added to the autoscaling group

A v1 node is slowly taken out through connection draining and another v2 instance is added

After repeating the process, all the nodes are running v2

The downside of this is that you typically find an issue midway through the rolling upgrade, or a few minutes later. If you have the right setup you can roll back to an older version. This isn’t terrible, but the time it takes to fully roll back is instance_number × (time_to_start_instance + time_to_start_app), which can add up to 15–20 minutes depending on the variables. During this period you know that you are running faulty instances and serving a faulty version to your users for no good reason.

One could argue that ending up with a faulty version should never happen if we had a health-check covering every scenario that defines a healthy system. This is a valid point, and we should try to include as much as possible in the health-check. However, for systems with a sufficient number of dependencies, covering all cases imposes a significant overhead and is difficult to pull off in practice.

Our active-passive dual autoscaling groups

Active/Passive setup

We picked a similar approach, but instead of having a single autoscaling group we run two.

While the service is running normally, the “active” autoscaling group has a size of 3 and the “passive” (canary) group is disabled.

When we want to deploy a new version, it is added to the passive autoscaling group. Once the new instance is in service, it takes 25% of the traffic, since in total we have 4 nodes and the ELB uses a round-robin strategy.

Canary mode where we have one v2 and three v1
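Building on the earlier sketch (same LoadBalancer, and one launch configuration per version defined like LaunchConfig above), the two groups hang off the same ELB, with the canary size driven by a parameter. The names and sizes here are illustrative, not our production template:

Parameters:
  CanarySize:
    Type: Number
    Default: 0                          # 0 = passive group disabled, 1 = canary mode

Resources:
  ActiveGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 3
      MaxSize: 4
      DesiredCapacity: 3
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref ActiveLaunchConfig  # runs the current version (v1)
      LoadBalancerNames: [!Ref LoadBalancer]
      HealthCheckType: ELB

  CanaryGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 0
      MaxSize: 1
      DesiredCapacity: !Ref CanarySize
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref CanaryLaunchConfig  # runs the candidate version (v2)
      LoadBalancerNames: [!Ref LoadBalancer]            # same ELB, so round robin gives the canary 1 node in 4, i.e. 25%
      HealthCheckType: ELB

Promoting the canary then comes down to updating the active group’s launch configuration to v2 (which triggers the rolling update described earlier) and setting CanarySize back to 0.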

This is also the main limitation of the approach: the share of traffic sent to the canary group is determined by the number of instances you have in production, so you cannot start at 1% and gradually increase the traffic. For our needs this is good enough, as we need a fair bit of traffic to detect the problems that were missed by all the levels of testing and health-checks. We also liked the simplicity.

After monitoring the new version in production for a period of time, we gain the confidence to start sending it 100% of the traffic. The next step is to disable the canary autoscaling group by draining all its active connections and then do a more standard rolling upgrade on the “active” group.

Active mode on a new version

You might wonder why we do a rolling upgrade node by node instead of replacing all 3 at once after the canary has proven that the new version works. While the release configs and build are immutable, the external dependencies are not, so this is an extra precaution. In practice, this step never fails for any reason other than AWS-related issues such as hitting the allowed instance limit or similar provisioning failures.

Other options and why we did not pick them

One thing that really screams load balancing is DNS; you can set a low TTL and give 70% of the users version A and 30% version B. In AWS you can do this with Route53: you create a hosted zone and define how traffic is routed for that domain.

AWS Route53 offers an even better option called Weighted Round-Robin. You configure a weight between 0–255 for each entry, and when processing a DNS lookup Route53 picks one answer with a probability based on those weights. This is a lot more fine-grained than using the number of instances to distribute the load.
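For completeness, weighted records look roughly like this in the same CloudFormation style; the zone, record names, weights and load balancers below are hypothetical:

  WeightedRecordA:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.      # placeholder hosted zone
      Name: service.example.com.
      Type: CNAME
      TTL: '60'                         # low TTL so traffic shifts reasonably quickly
      SetIdentifier: version-a          # each weighted record needs a unique identifier
      Weight: 90                        # 90% of lookups
      ResourceRecords:
        - !GetAtt LoadBalancerA.DNSName # ELB serving version A

  WeightedRecordB:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: service.example.com.
      Type: CNAME
      TTL: '60'
      SetIdentifier: version-b
      Weight: 10                        # 10% of lookups go to the canary
      ResourceRecords:
        - !GetAtt LoadBalancerB.DNSName # ELB serving version B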

The big disadvantage of this approach (and ultimately the deal breaker for us) is that propagating DNS changes can take some time, so you have no control over when users will perceive the change. In our case, due to the infrastructure setup and some of our external integrations, traffic shaping through DNS would not have the desired effect. In cases where low TTL values are acceptable, this is a great option.

We could have easily added another component like NGINX, HAProxy, Vamp or really anything that supports load balancing. All of these are great options for doing traffic shaping and canary releases. The problem is that they are yet another component you need to add and maintain: more components increase the chance of failure and add maintenance burden. I am a strong believer in the KISS principle, postponing added complexity until it is really the only option.

To sum up

Canary is a great technique to add to your pipeline, whatever solution you use; it adds a huge level of confidence. For us the GO or NO-GO decision is a manual “button press”. Others, notably Netflix, have taken a more automated approach to testing the “confidence in the canary” by using tons of metrics. In the end, the important point is to gain speed by having fast deploys, fast rollbacks, and real production tests. Keep in mind that canary releases are not necessarily needed for every project since, just like every other component, they add complexity. However, if you’re on a project that needs to move quickly but has a low tolerance for failure, give it a try.

How about you? Are you running canary releases, or doing something else to minimize risk without sacrificing development speed?


Acknowledgements

Many thanks to my Klarna colleagues for their feedback: Jonas Lundberg, Ben Maraney, Case Taintor and Jose Ordiales.