Imagine a project that’s eight years old. No matter what language or framework it uses, the code is very likely to be hard to maintain. Things we take for granted now didn’t even exist eight years ago. We had a challenging experience along these lines while delivering a Rails 2 to Rails 3.0 upgrade.

It took us a year and a half of dedicated work to upgrade the app, while still adding features to it. The team upgraded the versions of the language, the framework, gems and other external libraries.

The first attempt to deploy the upgraded version used a Big Bang release approach. The major benefit of this was that the process was well tested, since the team was doing daily releases to pre-production environments, and it didn’t require many changes to the process already in place. On the other hand, the biggest risk was that it was all or nothing: if some critical, unexpected issue came up, we would have to roll back the entire upgrade.

After the first release of the upgraded app, we faced issues with poorly tested integration points and other external dependencies. In the end, the client asked for a rollback, even though we weren’t sure about the root causes of the issues.

After the incident, the team decided to work on a new release approach that would reduce risk, simplify the rollback strategy and regain the client’s confidence. That’s when we decided to try out a Canary Release approach.

Why Canary Release?

When we started exploring approaches other than the Big Bang release, we found Canary Release to be ideal. To quote Danilo Sato:

Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.

With that in mind, we started to work on our deployment process and infrastructure to support the two versions of the app (Rails 2 and Rails 3) in the same environment. This posed an interesting challenge of making different versions, and even different gems, communicate well with each other.

One thing worth mentioning is our deployment architecture. We had many servers/boxes with different roles, but the most important ones in this case were the web boxes, web service boxes and worker boxes. The others didn’t run app code, so upgrading them wasn’t really hard, except for the memcache boxes. You’ll understand why we single those out later on.

The web boxes, as the name suggests, are responsible for handling the web traffic, and they are where we focused our efforts to split the traffic.
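To illustrate the idea of splitting web traffic between the two versions, here is a minimal sketch of one common canary technique: sending a fixed percentage of traffic to the new version, hashed on a stable identifier so a given user consistently sees the same version. The percentage, names and hashing choice are all illustrative, not our exact setup.

```ruby
require 'zlib'

# Hypothetical deterministic canary split: a fixed percentage of traffic
# goes to the Rails 3 boxes, and a given user always gets the same version
# because the decision hashes a stable identifier.
CANARY_PERCENT = 5

def canary?(user_id)
  # CRC32 gives a stable, roughly uniform hash of the identifier.
  Zlib.crc32(user_id.to_s) % 100 < CANARY_PERCENT
end

def upstream_for(user_id)
  canary?(user_id) ? :rails3_boxes : :rails2_boxes
end
```

Hashing on the user rather than picking randomly per request matters here: it keeps each user pinned to one version, so session state never bounces between Rails 2 and Rails 3 mid-visit.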

The web service boxes respond to external calls from third-party applications. For these boxes, we decided to have half of them running Rails 2 and the other half running Rails 3. The traffic was split between them by serving each request from the least loaded server.
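The least-loaded policy itself is simple to express. A sketch in Ruby, where the box names, versions and connection counts are made up for illustration (in practice the load balancer tracks active connections itself):

```ruby
# Sketch of the "serve from the least loaded server" policy used for the
# web service boxes. The pool mixes Rails 2 and Rails 3 backends.
Backend = Struct.new(:name, :rails_version, :active_connections)

def pick_backend(pool)
  # Route each request to the server with the fewest active connections,
  # regardless of which Rails version it runs.
  pool.min_by(&:active_connections)
end

pool = [
  Backend.new("ws-1", 2, 7),
  Backend.new("ws-2", 2, 3),
  Backend.new("ws-3", 3, 5),
  Backend.new("ws-4", 3, 2),
]

pick_backend(pool) # picks "ws-4", the box with 2 active connections
```

A side effect of this policy is that the Rails 2/Rails 3 split of web service traffic is driven purely by load, not by an explicit percentage.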

The worker boxes are responsible for all the asynchronous work, such as sending emails and purging old data. For some workers, it is mandatory that only one instance is running at a time; we call these singleton workers. We had two worker boxes that only ran singleton workers, so we decided to keep one running Rails 2 and the other running Rails 3. For the other workers, which didn’t have the single-instance requirement, we migrated only one box to Rails 3.
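One common way to enforce the single-instance constraint is to wrap each run of a singleton worker in a non-blocking lock. A minimal sketch, assuming a per-worker lock file (the path and helper name are hypothetical; across multiple boxes a shared lock, such as a database advisory lock, would be needed instead):

```ruby
# Run the given block only if no other process currently holds this
# worker's lock; otherwise skip the run entirely.
def run_singleton(name)
  lock = File.open("/tmp/#{name}.lock", File::RDWR | File::CREAT)
  if lock.flock(File::LOCK_EX | File::LOCK_NB)
    begin
      yield # only one process at a time gets here
    ensure
      lock.flock(File::LOCK_UN)
    end
    true
  else
    false # another instance already holds the lock
  end
end

run_singleton("purge_old_data") do
  # ... the actual worker body goes here ...
end
```

Skipping the run (rather than blocking and queueing) is the usual choice for scheduled singleton workers: the next scheduled run will pick the work up.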