Ross Snyder, a senior software engineer at craft e-commerce site Etsy, recounted the story of the evolution of his company's technical architecture to a roomful of fellow travelers at the Surge conference in Baltimore. It was a story that, by his admission, is not entirely his own—he's only been with Etsy for a year and a half, which accounts for the “after” phase of the company's architectural picture.

But, as he put it, history is written by the victors—or at least those left around to write it. And his version of Etsy's history is part cautionary tale and part DevOps case study. Snyder's presentation was entitled “Scaling Etsy: What Went Wrong, What Went Right.” And it seems there was a lot that fell into the first bucket during the company's six-year history.

After starting off with just a single web server and database in 2005, the company's IT architecture evolved over the next two years to rely heavily on business logic written as stored procedures in a back-end Postgres database. The presentation side was driven mostly by PHP on Lighttpd web servers, chosen at the time because the Etsy team felt Lighttpd was less common and less likely to be hacked.

And the organization mirrored the architecture: Etsy's engineering organization was siloed into developers, database administrators, and operations, with little cohesion between them. Site software deployments happened infrequently and in large chunks, so they ended up being clumsy—with some site features disappearing in the process.

Scaling nowhere

While some of the database was partitioned by feature, it was largely monolithic. And, Snyder said, the reliance on stored procedures meant the database did not scale up well. The site's uptime was “not that great,” he said, and "regular maintenance windows and site deploys often dissolved into outages." After weighing options that included rewriting the site's entire code base, he said, the team decided in the fall of 2007 to help the site scale by writing some middleware: a software stack that Etsy called "Sprouter," a portmanteau of "stored procedure router."

The Sprouter middleware was written largely in Python, and sat between the front-end PHP and the Postgres database. Snyder said it mapped requests to stored procedures in the database, returned results to the front end and did some caching of them. There was some thought of sharding the database to help scale it up, but those ideas were never implemented. The idea was that "the dev team writes code, the DBAs write SQL, and they meet in the middle," Snyder explained. The middleware, it was hoped, would help scale up the performance of the site, since it couldn't be easily scaled on the database end. And it would, in theory, prevent developers from having to write SQL calls. There were even hopes of turning Sprouter into an open-source project to help support it in the long term.
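To make the routing idea concrete, here is a minimal sketch of a stored-procedure router of the kind Snyder described: it maps a front-end request name to a stored procedure, returns the result, and caches it. This is not Sprouter's actual code, which was never released. The class name, the request and procedure names, and the in-memory stand-in for the database are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Hashable, Tuple

# Hypothetical stand-in for the database: stored procedure name -> callable.
# A real router would issue a call to the database server instead.
ProcedureTable = Dict[str, Callable[..., Any]]


class StoredProcedureRouter:
    """Illustrative middleware that routes named requests to stored procedures."""

    def __init__(self, procedures: ProcedureTable, routes: Dict[str, str]) -> None:
        self._procedures = procedures  # procedure name -> callable
        self._routes = routes          # front-end request name -> procedure name
        self._cache: Dict[Tuple[str, Tuple[Hashable, ...]], Any] = {}
        self.db_calls = 0              # how many calls reached the "database"

    def handle(self, request: str, *args: Hashable) -> Any:
        """Resolve a request to its procedure, serving repeats from the cache."""
        if request not in self._routes:
            raise KeyError(f"no stored procedure mapped for request {request!r}")
        key = (request, args)
        if key in self._cache:
            return self._cache[key]
        procedure = self._procedures[self._routes[request]]
        self.db_calls += 1
        result = procedure(*args)
        self._cache[key] = result
        return result


def get_listing(listing_id: int) -> dict:
    """Hypothetical stored procedure, written here as a plain function."""
    return {"listing_id": listing_id, "title": f"Listing {listing_id}"}


router = StoredProcedureRouter(
    procedures={"sp_get_listing": get_listing},
    routes={"listing.get": "sp_get_listing"},
)
first = router.handle("listing.get", 42)
second = router.handle("listing.get", 42)  # served from the cache
```

In a design like this, every new piece of site functionality needs both a new stored procedure and a new route before the front end can use it.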

Crop failure

It didn't turn out that way. Sprouter was released into production a year later, around the same time that Chad Dickerson arrived from Yahoo as Etsy's chief technology officer. (Dickerson has since become Etsy's CEO.) In the spring of 2009, about six months later, Etsy decided to abandon it.

Snyder explained that while Sprouter did a good job of centralizing access to the database and hiding the data store implementation from the applications, it was still highly bound to Postgres. Sprouter also "created substantial developer friction," Snyder added, because it required DBAs to write stored procedures for nearly every piece of site functionality and created a bureaucracy developers had to go through to get functionality built.

The community support for Sprouter never materialized, as the code was never open-sourced. As a result, it became a "homegrown daemon with dependencies that ops had to maintain," Snyder said. And after the decision was made to stop updating the code, an update to Python broke it.

Sprouter also made deploys much, much worse. Because it was so closely tied to the database, any change to a stored procedure required recompiling procedures and changing Sprouter as well. And the database remained a single point of failure.

A shift in culture

Things began to change as the engineering culture shifted, Snyder said. "We went from lockdown to DevOps," he explained, and the boundaries between teams were largely removed. With Dickerson's arrival, Etsy's team started to take on a significant amount of "Flickr's DNA," he said, and shifted toward a more agile, DevOps-style approach to development.

That culture includes frequent, small software releases, and giving developers access to operations' monitoring tools to watch performance and access to systems to help tweak them. "Not every developer has root on every box," Snyder said, "but by and large you can get everywhere and look at things. The core platform team needs a certain level of access. I think we err on the side of too much access instead of too little."

The first step out of the architectural hole Etsy had dug for itself was stabilizing Sprouter and the rest of the infrastructure. That included improving metrics and monitoring of performance, which, Snyder joked, could be improved by “having any metrics and monitoring.” The engineering team also upgraded Etsy's database hardware as far as was practical. “We upgraded the master database to the limits of what was possible,” Snyder said. “It still wasn't enough, but it bought us breathing room.”

With that little bit of breathing room (and the accompanying downtime for the upgrade), Etsy began to shift to a new architecture—still based on PHP on the front end, but now running on Apache web servers with connections to databases directly through object-relational mapping.

And the team started to shift feature by feature away from a semi-monolithic Postgres back-end to sharded MySQL databases. “It's a battle-tested approach,” Snyder said. “Flickr is using it on an enormous scale. It scales horizontally, basically, to near infinity, and there's no single point of failure—it's all master to master replication.”
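A minimal sketch can illustrate the two properties Snyder highlighted: rows are spread across shards, and each shard is a pair of masters that replicate to each other, so either one can serve a request. The talk as reported does not say how Etsy assigns data to shards, so the hash-based routing, the class names, and the host names below are all assumptions.

```python
import zlib
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Shard:
    name: str
    hosts: Tuple[str, str]  # hypothetical master-master pair


class ShardRouter:
    """Illustrative router: pick a shard by key, then a live master in it."""

    def __init__(self, shards: List[Shard]) -> None:
        if not shards:
            raise ValueError("at least one shard is required")
        self._shards = shards
        self._down: Set[str] = set()

    def mark_down(self, host: str) -> None:
        """Record that a host is unavailable so requests avoid it."""
        self._down.add(host)

    def shard_for(self, user_id: int) -> Shard:
        # crc32 gives the same value in every process, so all web servers
        # agree on which shard holds a given user's rows.
        digest = zlib.crc32(str(user_id).encode("ascii"))
        return self._shards[digest % len(self._shards)]

    def host_for(self, user_id: int) -> str:
        """Return the first available master in the user's shard."""
        shard = self.shard_for(user_id)
        for host in shard.hosts:
            if host not in self._down:
                return host
        raise RuntimeError(f"all masters in shard {shard.name} are down")


shard_router = ShardRouter([
    Shard("shard0", ("db0a.example", "db0b.example")),
    Shard("shard1", ("db1a.example", "db1b.example")),
])
home_shard = shard_router.shard_for(7)
primary_host = shard_router.host_for(7)
shard_router.mark_down(primary_host)        # simulate losing one master
fallback_host = shard_router.host_for(7)    # the other master takes over
```

Plain modulo hashing is the simplest choice for a sketch, but it forces most keys to move when a shard is added. A lookup table that records each key's shard avoids that at the cost of an extra read.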

With frequent small releases, and incremental migration of features away from Sprouter, it took until spring of this year for Etsy to completely move off the middleware and turn it off for good. “I got to be the one to remove it from source control,” Snyder said. The Postgres database, however, still remains—and likely will for some time.

One of the lessons learned from Sprouter, Snyder said, was that “if you're doing something 'clever,' you're probably doing it wrong.” At the same time, he admitted, he and the others at Etsy today are probably making architectural decisions that others will look back on with hindsight and find fault with.