We treat all our virtual servers as immutable. When we upgrade our system we create brand new servers and destroy the old ones, rather than upgrading them in place. This is a logical extension of the phoenix server approach, in which servers are regularly recreated from scratch. Our adoption of immutable servers was inspired by an anecdote that when physical servers misbehave in Google’s data centers they are pulled out of the racks and discarded rather than fixed.

Why immutable servers?

The main benefit of this approach is that it allows us to be certain of the state of a server once it has been provisioned. This is the promise of declarative configuration management tools like Puppet and Chef, but it’s not one that they can reliably deliver on. It’s impractical to specify every single detail of a system, so the end state that it reaches is inevitably somewhat dependent on what has gone before. For example, consider removing packages: if you no longer need a package that was previously installed, deleting it from your configuration is not enough; you have to add an explicit removal step and keep it around until every existing server has run it. If you simply start again from scratch, you can be certain that there’s nothing installed but what you need.
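The package example can be made concrete with a small sketch. It is a deliberately simplified model, not how Puppet or Chef work internally: a server is represented as a set of package names, `converge` stands in for an in-place run that only ensures declared packages are present, and `rebuild` stands in for provisioning a new server. All names, including the package names, are hypothetical.

```python
from __future__ import annotations

# Hypothetical model: a server is just the set of installed package names.


def converge(installed: set[str], declared: set[str]) -> set[str]:
    """Mimic an in-place run that only ensures declared packages are present.

    Nothing is removed unless a removal is declared explicitly, which this
    simplified model does not support.
    """
    return installed | declared


def rebuild(declared: set[str]) -> set[str]:
    """Mimic provisioning a brand new server from the declared state alone."""
    return set(declared)


old_server = {"nginx", "openjdk", "legacy-agent"}
declared = {"nginx", "openjdk"}  # legacy-agent is no longer wanted

upgraded = converge(old_server, declared)
fresh = rebuild(declared)

# Packages still present on the upgraded server that nobody declared.
leftover = upgraded - declared
```

Here `leftover` contains `legacy-agent` for the upgraded server, while the rebuilt server matches the declaration exactly.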

We have also found that immutable servers simplify several aspects of our system. The deployment infrastructure is simpler because it doesn’t have to support different cases for upgrades and new environments (this fits well with our principle of ad hoc environments). Our testing is simpler because we don’t need different tests for upgrading servers and creating them from scratch. And we are much more likely to be able to fix deployment problems by rolling forward, because any problematic servers will be destroyed.
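As a rough illustration of why there is only one deployment case to support, the sketch below uses a single `deploy` function for both a brand new environment and an upgrade. The `Environment` class, `create_server` and the dictionary shape of a server are assumptions made for this example, not a description of our actual tooling.

```python
from dataclasses import dataclass, field
from itertools import count

_server_ids = count(1)  # hypothetical source of unique server ids


@dataclass
class Environment:
    """Hypothetical environment: just the list of servers currently in it."""
    servers: list = field(default_factory=list)


def create_server(version: str) -> dict:
    """Stand-in for provisioning a new virtual server from scratch."""
    return {"id": next(_server_ids), "version": version}


def deploy(env: Environment, version: str, size: int) -> list:
    """Replace every server in the environment with freshly created ones.

    The same code path handles an empty environment and an upgrade.
    Returns the servers that were retired so the caller can destroy them.
    """
    new_servers = [create_server(version) for _ in range(size)]
    old_servers, env.servers = env.servers, new_servers
    return old_servers
```

Calling `deploy(env, "1.0", 2)` on an empty environment returns an empty list; calling `deploy(env, "1.1", 2)` afterwards returns the two version 1.0 servers to be destroyed. Rolling forward after a bad deploy is just another call to `deploy`.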

Generally we find that the decisions we make in order to support immutable servers in our system have a simplifying and clarifying effect on the architecture. They often help (or force) us into doing the right thing and have unanticipated benefits later on.

The implications of using immutable servers

This approach does have some implications that need to be considered. System upgrades are somewhat slower because creating virtual servers from scratch takes a bit of extra time, and any change to a server requires a redeploy because there is no mechanism for modifying servers in place.

The fact that servers are recreated from scratch means that any data on them which changes between deploys will be lost. We need to take this into account when designing the system architecture. The most obvious kind of data to consider is application data. We deal with this by separating our architecture into three layers: a volatile layer (which contains the servers), a persistent layer (which has various services for storing data persistently) and an identity layer (which contains the persistent addresses of entry points into the system).

Every application needs to be carefully designed to ensure that important data is stored in the persistent layer. We have found it simplifies matters to use event-sourcing, with the events stored in the persistent layer and replayed on deployment so that applications can recreate the state that they need. It is important to clearly separate persistent from volatile data and ensure that you are only persisting exactly the data that you need to; this clarifies the application design, ensures that you are not paying the cost of persisting data unnecessarily and simplifies the process of discarding out-of-date volatile data when you deploy.
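A minimal sketch of this event-sourcing pattern follows. The `EventStore` class is an in-memory stand-in for a service in the persistent layer, and `BalanceService` is a hypothetical application in the volatile layer; the names and the account-balance domain are assumptions chosen only to illustrate replay on deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One persisted fact: an amount credited to (or debited from) an account."""
    account: str
    amount: int


class EventStore:
    """In-memory stand-in for a store that lives in the persistent layer."""

    def __init__(self):
        self._events = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def all(self) -> list:
        return list(self._events)


class BalanceService:
    """Application on a volatile server; its state is rebuilt at startup."""

    def __init__(self, store: EventStore):
        self.store = store
        self.balances = {}  # volatile: lost whenever the server is destroyed
        for event in store.all():  # replay on deployment
            self._apply(event)

    def _apply(self, event: Event) -> None:
        current = self.balances.get(event.account, 0)
        self.balances[event.account] = current + event.amount

    def record(self, account: str, amount: int) -> None:
        event = Event(account, amount)
        self.store.append(event)  # persist first, then update volatile state
        self._apply(event)
```

If one `BalanceService` records a few events and a second one is later constructed against the same store, the second instance ends up with the same balances without any state being copied between servers.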

Apart from application data there are two other important bits of state which need to be handled. Logs need to be preserved across deployments so that we can retrospectively investigate problems without worrying whether there has been an intervening deployment. The best way to do this is to ship all logs straight off to a central server with something like rsyslog; that server then treats the logs as application data and stores them in the persistent layer.
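On the application side, shipping logs off the server can be as small as attaching a syslog handler. The sketch below uses Python's standard `logging.handlers.SysLogHandler` over UDP; the function name, logger name, host and port are assumptions for illustration, and the central server is assumed to be configured to accept remote syslog messages.

```python
import logging
import logging.handlers
import socket


def build_shipping_logger(host: str, port: int = 514) -> logging.Logger:
    """Return a logger that sends every record to a remote syslog endpoint.

    host and port are hypothetical; point them at the central log server.
    """
    logger = logging.getLogger("app.shipped")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of any local root handlers

    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        socktype=socket.SOCK_DGRAM,  # UDP; fire-and-forget delivery
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as `build_shipping_logger("logs.internal.example").info("deployed version 1.1")` then leaves nothing on the local disk that would be lost at the next deploy. UDP is used here for simplicity; it does not guarantee delivery, so a real setup may prefer TCP or a local forwarding agent.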

And finally, servers need to know the addresses of other servers within the system. Having immutable servers means that you can’t update system configuration (such as the addresses of other servers) in place; you have to redeploy in order to change that configuration. So if server addresses change across deployments you end up with dependency problems, where servers have to be updated in a certain order, and possibly with unresolvable circular dependencies. The solution is to use something that decouples addresses from the physical servers; our preference is to use DNS.
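The indirection that DNS provides can be sketched with a toy name registry. `NameRegistry` below is an in-memory stand-in for a DNS zone, and the host name and IP addresses are made up; in a real system the application would resolve the name through the normal resolver rather than through a class like this.

```python
class NameRegistry:
    """In-memory stand-in for a DNS zone in the identity layer."""

    def __init__(self):
        self._records = {}

    def point(self, name: str, address: str) -> None:
        """Create or update the record for a stable name."""
        self._records[name] = address

    def resolve(self, name: str) -> str:
        return self._records[name]


dns = NameRegistry()

# First deployment: the database server comes up on some address.
dns.point("db.internal.example", "10.0.0.5")

# The application server is baked with the stable name, never the address.
app_config = {"database_host": "db.internal.example"}
before = dns.resolve(app_config["database_host"])

# Next deployment: the database server is recreated on a new address.
# Only the record changes; app_config is untouched, so the application
# server does not need to be redeployed in any particular order.
dns.point("db.internal.example", "10.0.1.9")
after = dns.resolve(app_config["database_host"])
```

Because every server refers to its peers by name, two servers that depend on each other can both be replaced without either needing the other's new address in advance.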

Check out the other parts in our series on "Rethinking the way we build on the Cloud"