One of the lean startup techniques I’ll be discussing at this week’s session at the Web 2.0 Expo is called continuous deployment. It’s a process whereby all code that is written for an application is immediately deployed into production. The result is a dramatic lowering of cycle time and freeing up of individual initiative. It has enabled companies I’ve worked with to deploy new code to production as often as fifty times every day.

Continuous deployment is controversial. Most people, when they first hear about continuous deployment, think I’m advocating low-quality code or an undisciplined cowboy-coding development process. On the contrary, I believe that continuous deployment requires tremendous discipline and can greatly enhance software quality, by applying a rigorous set of standards to every change to prevent regressions, outages, or harm to key business metrics. (This criticism is a variation of the “time, quality, money – pick two” fallacy)

Another common reaction I hear to continuous deployment is that it’s too complicated, time-consuming, or hard to prioritize. It’s this latter fear that I’d like to address head-on in this post. While it is true that the full system we use to support deploying fifty times a day at IMVU is elaborate, it certainly didn’t start that way. By making a few simple investments and process changes, any development team can be on their way to continuous deployment. It’s the journey, not the destination, that counts. Here’s the why and how, in five steps.

Continuous integration server. This is the backbone of continuous deployment. We need a centralized place where all automated tests (unit tests, functional tests, integration tests, everything) can be run and monitored upon every commit. There are many fine free software tools to make this easy – I have had success with BuildBot. Whatever tool you use, it’s important that it be able to run all the tests your organization writes, in all languages and frameworks. If you only have a few tests (or even none at all), don’t despair. Simply set up the CI server and agree to one simple rule: we’ll add a new automated test every time we fix a bug. Following that rule will start to immediately get testing where it’s needed most: in the parts of your code that have the most bugs and, therefore, drive the most waste for your developers. Even better, these tests will start to pay immediate dividends by propping up that most-unstable code and freeing up a lot of time that used to be devoted to finding and fixing regressions (aka “firefighting”). If you already have a lot of tests, make sure that the total time the CI server spends on a full run is a small amount of time, 10-30 minutes at the maximum. If that’s not possible, simply partition the tests across multiple machines until you get the time down to something reasonable. For more on the nuts-and-bolts of setting up continuous integration, see Continuous integration step-by-step.

Source control commit check. The next piece of infrastructure we need is a source control server with a commit-check script. I’ve seen this implemented with CVS, subversion or Perforce and have no reason to believe it isn’t possible in any source control system. The most important thing is that you have the opportunity to run custom code at the moment a new commit is submitted but before it is accepted by the server. Your script should have the power to reject a change and report a message back to the person attempting to check in. This is a very handy place to enforce coding standards, especially those of the mechanical variety. But its role in continuous deployment is much more important. This is the place you can control what I like to call “the production line” to borrow a metaphor from manufacturing. When something is going wrong with our systems at any place along the line, this script should halt new commits. So if the CI server runs a build and even one test breaks, the commit script should prohibit new code from being added to the repository. In subsequent steps, we’ll add additional rules that also “stop the line,” and therefore halt new commits. This sets up the first important feedback loop that you need for continuous deployment. Our goal as a team is to work as fast as we can reliably produce high-quality code – and no faster. Going any “faster” is actually just creating delayed waste that will slow us down later. This feedback loop is also discussed in detail elsewhere.

Simple deployment script. At IMVU, we built a serious deployment script that incrementally deploys software machine-by-machine and monitors the health of the cluster and the business along the way so that it can do a fast-revert if something looks amiss. We call it a cluster immune system. But we didn’t start out that way. In fact, attempting to build a complex deployment system like that from scratch is a bad idea. Instead, start simple. It’s not even important that you have an automated process, although as you practice you will get more automated over time. Rather, it’s important that you do every deployment the same way and have a clear and published process for how to do it that you can evolve over time. For most websites, I recommend starting with a simple script that just rsync’s code to a version-specific directory on each target machine. If you are facile with unix symlinks, you can pretty easily set this up so that advancing to a new version (and, hence, rolling back) is as easy as switching a single symlink on each server. But even if that’s not appropriate for your setup, have a single script that does a deployment directly from source control. When you want to push new code to production, require that everyone use this one mechanism. Keep it manual, but simple, so that everyone knows how to use it. And, most importantly, have it obey the same “production line” halting rules as the commit script. That is, make it impossible to do a deployment for a given revision if the CI server hasn’t yet run and had all tests pass for that revision.

Real-time alerting. No matter how good your deployment process, bugs can still get through. The most annoying variety are bugs that don’t manifest until hours or days after the code that caused them is deployed. To catch those nasty bugs, you need a monitoring platform that can let you know when things have gone awry, and get a human being involved in debugging them. To start, I recommend a system like the open source nagios. Out of the box, it can monitor basic system stats like load average and disk utilization. For continuous deployment purposes, we want to be able to have it monitor business metrics like simultaneous users or revenue per unit time. At the beginning, simply pick one or two of these metrics to use. Anything is fine to start, and it’s important not to choose too many. The goal should be to wire the nagios alerts up to a pager, cell phone, or high-priority email list that will wake someone up in the middle of the night if one of these metrics goes out of bounds. If the pager goes off too often, it won’t get the attention it deserves, so start simple. Follow this simple rule: every time the pager goes off, halt the production line (which will prevent checkins and deployments). Fix the urgent problem, and don’t resume the production line until you’ve had a chance to schedule a five whys meeting for root cause analysis, which we’ll discuss next.