Over the past decade, building large-scale online applications has become a pretty well-understood science with numerous books, papers, periodicals, forums, and conferences devoted to the subject. The Web overflows with advice and prescriptions for achieving high reliability at massive scale.

Trouble is, implementing the best scaling practices is not free, and that cost is often overlooked early in a product's lifecycle. Small teams use modern frameworks to quickly develop useful applications, with little need to worry about scale: today you can run a successful application on very little infrastructure... at least, you can up to a point. Past this point lies an uncomfortable middle ground, where small teams face scaling challenges as their system becomes successful, often without the benefit of an ideal design or lots of resources to implement one. This article will lay out some pragmatic advice for getting past this point in the real world of limited foresight and budgets.

Learning lessons from Second Life

Most of this information is based on my experience working on Second Life at Linden Lab from 2001 to 2009. SL is a highly complex virtual world, incorporating the features of Web services, online games, 3D modeling and programming tools, IM and VOIP, and so on. Between 2006 and 2007, the userbase grew dramatically, and while growth has since become more manageable, it continues today. We ran into all manner of scaling challenges, and had mixed success meeting them; ultimately SL did grow to meet the new levels of demand, but we certainly made some mistakes, and there were periods where the reliability of the system really suffered.

As I lay out my advice to teams facing scaling challenges, I'll be referring to these experiences; if I had known then what I know now, SL customers would have had a better experience—and I would have gotten a lot more sleep.

Second Life usage grew by a factor of ten between 2006 and 2007 (Courtesy Linden Lab)

So how do you get from here (a simple system on a commodity stack) to there (a robust system that can be confidently expanded to meet any level of demand)? Plenty of pixels have been spilled on the subject of where you should be headed: to single out one resource at random, Microsoft presented a good paper ("On Designing and Deploying Internet-Scale Services" [PDF]) with no fewer than 71 distinct recommendations. Most of them are good ("Use production data to find problems"); few are cheap ("Document all conceivable component failure modes and combinations thereof"). Some of the paper's key overarching principles: make sure all your code assumes that any component can be in any failure state at any time, version all interfaces so that they can safely communicate with newer and older modules, practice a high degree of automated fault recovery, and auto-provision all resources. This is wonderful advice for very large projects, but herein lies a trap for smaller ones: the belief that you can "do it right the first time." (Or, in the young-but-growing scenario, "do it right the second time.") This is unlikely to be true in the real world, so successful scaling depends on adapting your technology as the system grows.

While developers should certainly try to make scaling-friendly design choices early on, there are many cases where taking the best advice on scalability can drastically increase development cost (assuming time = money). As a simple example, consider the common notion that a system should tolerate all failures in all its internal components. To accomplish this, all interface code everywhere in the system must check for a variety of failure conditions and (presumably) do something intelligent with them. Do they retry? What if the problem component is overloaded? Can the client detect that? Should the user be given an error, or simply queued up? What if there is a partial-failure condition, where responses take far longer than expected? Does all of this interface code need to be non-blocking? And so forth. Even attempting to answer all these questions can eat up a lot of engineering time—time that your team may not have. Developers do not always want to admit this (especially early on in a project), but implementing everything "correctly" can risk not finishing the project at all, or having to rush through the later stages in order to get something out the door. In these cases, it's better to ship, or retain, a design with known deficiencies.
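To make that cost concrete, here is a minimal sketch of what even the simplest answer to those questions looks like at a single call site: bounded retries with jittered backoff, and the failure still ultimately landing back on the caller. The names here (ServiceUnavailable, flaky_lookup) are invented for illustration, not taken from any real system:

```python
import random
import time

class ServiceUnavailable(Exception):
    """Raised when a downstream component cannot serve the request."""

def call_with_retries(operation, max_attempts=3, base_delay=0.05):
    """Bounded retries with exponential backoff and jitter; on the final
    failure, the error is re-raised for the caller to deal with."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # out of retries: the caller still has to decide
            # Jittered backoff so many retrying clients don't stampede
            # an already-overloaded component at the same instant.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())

# A fake flaky component: fails twice, then succeeds.
attempts = {"count": 0}
def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceUnavailable("component overloaded")
    return "ok"

result = call_with_retries(flaky_lookup)
print(result, "after", attempts["count"], "attempts")  # ok after 3 attempts
```

And this covers only one failure mode at one call site, with none of the harder questions (overload detection, partial failures, non-blocking I/O) answered; multiply by every interface in the system and the engineering cost becomes clear.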

In the rest of the article I'll survey some of the big issues likely to arise during scale-out, along with strategies for prioritization and mitigation.

Requirements

The first area where most projects get into trouble is correctly identifying the business need. How large does the system have to become? This is generally a tough question to answer, but looking at basic constraints can be informative. If a recurring billing system needs to touch each user annually, and the product is available only to Internet users in the US and Europe, and by the biggest estimates will achieve no more than 10% penetration, then it needs to handle about 2-3 events per second (1bn * 75% * 10% / (365 * 86,400)). Conversely, a chat system with a similar userbase averaging 10 messages per user per day, concentrated during work hours, might need to handle 20,000 messages per second or more (1bn * 75% * 10% * 10 * 2 / 86,400). The difference may seem obvious in this example, but in more nuanced scenarios it is easy to make a bad assumption about volume, which can lead to inadequate designs or testing, followed by nasty surprises in production.
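The arithmetic behind these estimates can be written out explicitly. The constants (1bn addressable users, 75% reachable, 10% penetration, a roughly 2x work-hours peak factor) are the example's illustrative assumptions, not measurements:

```python
# Back-of-envelope capacity estimates for the two hypothetical systems.
users = 1_000_000_000 * 0.75 * 0.10        # ~75M active users (assumed)

# Billing: one event per user per year, spread evenly across the year.
billing_events_per_sec = users / (365 * 86_400)

# Chat: 10 messages per user per day, doubled for work-hours concentration.
chat_msgs_per_sec = users * 10 * 2 / 86_400

print(f"billing: {billing_events_per_sec:.1f} events/sec")   # ~2.4
print(f"chat:    {chat_msgs_per_sec:,.0f} messages/sec")     # ~17,361
```

The point of the exercise is the four-orders-of-magnitude gap between the two answers, not the precision of any one constant.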

Just as important are reliability targets: can the system be shut down at regular intervals? What are the consequences of failing the various types of requests? (Potentially severe for the billing system, minor for the chat system.) If some of these requirements are very stringent, it's even more important to compare them to the business reality: will reaching a midway growth milestone give you the luxury of additional time or resources to then produce a larger, more expensive design? This is a simple exercise, but many teams fail to do it thoroughly. If a team were building both the hypothetical billing and chat systems above, and put in the time to give the chat system a million-message-per-second capacity while making the biller rock-solid up to 10,000 transactions per day, they'd end up scrambling to keep up with billing load while chat sat under-utilized. Aim for the wrong goalposts and you will over-invest in some parts of the system while under-investing in others, guaranteeing a painful catch-up down the road.