Google CIO Ben Fried bared his soul to systems and software engineers and other IT pros gathered at OmniTI's Surge scalability conference in Baltimore Thursday, sharing the story of his greatest IT failure and how it informed how Google runs its IT operations. While he didn't call it by that name, Fried's keynote was as much a manifesto for the "cult of DevOps" as it was “disaster porn.”

There were plenty of other cautionary tales from Surge presenters, many of which promoted DevOps in some way. But they also highlighted just how fickle public cloud services—and Amazon's EC2 in particular—can be.

DevOps is the growing practice of forging tight collaboration between application developers and IT operations staff to continually improve performance, automation, and scalability of software and systems. The philosophy is also the force behind scripting language-based infrastructure automation tools, such as Puppet and Chef. But it's also a reflection of the necessities that come from trying to provide reliable systems based on increasingly complex and unreliable stacks of software and infrastructure.

A failure to communicate

Fried described a catastrophic failure of an institutional trading application whose development he led seven years ago at his previous employer, Morgan Stanley (which he identified in his presentation only as “a large investment banking firm”)—a failure that cost the company millions and took 18 months to correct. While a number of contributing technical errors led to the failure, Fried said the source of the problem was in how the IT organization scaled up as the application became a bigger business success.

The application, which Fried said was the source of much back-patting at the time of its launch, was a desktop app for Morgan Stanley's large institutional customers, who made large volumes of trades. It reused the infrastructure of the investment bank's Web-based trading tool, using SSL connections to feed the application realtime market data, send trades, and pass back reports on how the trades were being executed. While the application used the Web infrastructure, all of its customers were “high value,” said Fried, so the decision was made to move them over to private connections to help ensure quality of service—which many of them did.

Soon, there were complaints about the application's performance. But Fried said his response, when asked about the problems, was to “ask for data.” This typically led to the complainers leaving him alone. That ended when “a very important person in the company called me,” he said, “and told me it was in my best interest to go look at what was going on on the trading floor, or the consequences would not be pleasant.” He found the trader support team slammed with calls from customers. And at that moment, he received a page: “There was a hard failure in the system, and it was going down.”

The reasons for the failure were legion. “The most interesting thing is that big disasters rarely happen because of just one thing,” Fried said. In this case, the first of them was that the dedicated load balancer that had been put in front of the application had a gigabit Ethernet port but was rated for only 45 megabits per second of throughput. “And what made it worse was that in the sea of configuration changes that had been made, someone had blocked the SNMP port” for the load balancer, he added—so it showed up as “green” on the network management console until it failed completely. But the solution required more than just a new load balancer—it required a change in network architecture. The leased lines that had been sold to customers were routed through the system's public-facing Internet connection—totally defeating any effort at quality of service.
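The mismatch Fried described is easy to quantify. Taking the figures from his account, the device could forward less than five percent of what its port could physically accept:

```python
# Back-of-the-envelope check of the bottleneck Fried described: a
# load balancer with a gigabit Ethernet port but only 45 Mbit/s of
# rated throughput. Figures are taken from the article.
port_speed_mbps = 1000       # gigabit Ethernet line rate
rated_throughput_mbps = 45   # what the device could actually forward

fraction_usable = rated_throughput_mbps / port_speed_mbps
print(f"{fraction_usable:.1%} of line rate")  # 4.5% of line rate
```

Anything beyond the rated throughput would be queued or dropped regardless of the port's speed; with SNMP blocked, nothing on the monitoring console would reveal it.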

As if that wasn't enough, the application itself was causing network issues. While its communications had originally been based on small HTTP messages, that had changed because of what Fried called “the organization's love affair with XML.” The messages had grown to as much as 3,000 bytes—twice the 1,500-byte payload limit of a standard Ethernet frame—so there was a “hockey-stick spike” in traffic and dropped packets.
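The packet-count effect follows directly from the numbers: once a message exceeds the Ethernet payload limit, each one must be split across multiple frames, and losing any piece forces a retransmission of the whole message. A minimal sketch of the arithmetic:

```python
import math

ETHERNET_MTU = 1500   # typical Ethernet payload limit, in bytes
message_size = 3000   # the bloated XML messages described in the article

# Each message now spans multiple frames (via IP fragmentation or TCP
# segmentation), roughly doubling the packets on the wire per message.
frames_per_message = math.ceil(message_size / ETHERNET_MTU)
print(frames_per_message)  # 2
```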

The opposite of DevOps

The real root of the problem, Fried said, was the way the organization around the system had been built. "Without even thinking about it, the way we scaled up was through specialization," Fried explained. "We added people to specialized teams, each operating within a functional boundary. We never said understanding how everything works is important." Because none of them had knowledge of how the application worked beyond their area of expertise, the teams made decisions that led to a "hard failure" of the application.

As companies strive to scale up applications to handle larger tasks, Fried said, it's increasingly important to have IT generalists on the team who can look cross-functionally at systems. "Scalability is pushing the boundaries of the possible,” he said. “We operate at the interface of the known and unknown. Normal industrial style thinking doesn't work, because specialists' expertise is not good at dealing with the unknown."

Fried said the process of fixing the problems at Morgan Stanley “forced me to rethink how we do operations, and what the culture of operations should be. Operations is engineering. We need generalists in operations, and we can't allow the tech barriers to separate us because that will result in failure.” He said that it's important to reward and recognize generalist skills and broad understanding of systems, and added that he thinks Google gets this right. “We go to great lengths to hire people with engineering skills, put engineers in operational roles and give them power and accountability."

That sounds a lot like an embrace of the philosophy of DevOps, and it was a message that many in the audience who are responsible for Web applications received warmly. In many ways, the DevOps style is something endemic to Web startups—especially small ones, where the developers end up being responsible for operations as well.

That was the case at Ruby-based platform-as-a-service provider Heroku, which was acquired earlier this year by Salesforce. As Heroku's cloud operations director Mark Imbriaco said in a presentation on the company's approach to responding to system failures, "A year ago, Heroku had no ops at all." The operations team is still small, so every engineer on staff participates in the company's on-call incident response. “It gives us a sense of shared suffering,” he added, “and lets everyone see the problems—particularly the people who wrote the code.”

It also means that every engineer at Heroku has sysadmin privileges—Imbriaco admitted that this is something he'd rather not have.

Amazon and other disasters

Some of the other "disaster porn" at Surge yielded practical advice that Google's CIO couldn't give, particularly about the dark arts of dealing with Amazon's EC2 cloud infrastructure services. EC2 was the platform of choice for most of the cloud service players at the conference; Heroku, for example, runs completely on EC2. But that's a choice that doesn't come without pain.

Imbriaco said that Heroku has seen "so many different errors from Amazon" that its engineers have become experts at diagnosing them; Heroku's own monitoring usually beats Amazon's by 15 minutes in identifying problems. And when the problems are related to ephemeral disk failures, Amazon does little to deal with them other than occasionally sending a message. "We will get an email saying, 'Your host is in a degraded state and you need to move your stuff,'" Imbriaco noted.

The majority of the issues that Heroku encounters with its services on Amazon are related to disk I/O, including failures of the "ephemeral" disks attached to instances, which crash the virtual machine. Most of Heroku's "playbooks" for dealing with system failures and degradations include resetting or "destroying" EC2 instances. Imbriaco said that he'd like to automate most of these responses, but "we're too afraid to right now." Automation, he said, is also a great way to rapidly distribute failure.
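One way to get a playbook's repeatability without the risk Imbriaco described is to codify each response as an explicit, human-triggered step. A minimal sketch, with hypothetical failure names and stubbed-out actions standing in for real EC2 API calls (these are not Heroku's actual procedures):

```python
# Illustrative only: each "action" would call the EC2 API in practice.
def reset_instance(instance_id: str) -> str:
    return f"reset {instance_id}"

def destroy_and_replace(instance_id: str) -> str:
    return f"destroyed {instance_id}, launched replacement"

# The playbook: a versioned, reviewable table mapping a diagnosed
# failure to its scripted response.
PLAYBOOKS = {
    "ephemeral_disk_failure": destroy_and_replace,
    "degraded_host_notice": destroy_and_replace,
    "high_io_latency": reset_instance,
}

def run_playbook(failure: str, instance_id: str) -> str:
    # Deliberately not wired to an alerting system: a human invokes
    # this, matching the article's caution that automation can
    # "rapidly distribute failure."
    return PLAYBOOKS[failure](instance_id)

print(run_playbook("high_io_latency", "i-1234"))  # reset i-1234
```

Keeping a human in the loop preserves the judgment call, while the table of responses stays scripted, versioned, and testable.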

Andy Parsons, formerly of hyperlocal news service Outside.in and now CTO of a startup called Bookish, provided a long history of war stories from his serial startups, and many of them included lessons about EC2. "Machines disappear, you don't know why. Sometimes it's network availability—the infrastructure at Amazon's data centers is immense; people are plugging things in all the time, and there are network outages." At Outside.in, he said, "We went through a period where we lost an instance a day. In any week, we were doing 10 emergency reboots."

Parsons also had I/O problems, which he said stemmed in part from the fact that "you have no idea where the actual SAN is" that supports virtual systems. The storage area network might be on the other side of one of Amazon's EC2 campuses, practically in a separate data center. He also warned against using ephemeral storage for anything critical.

Both Parsons and Imbriaco also cited frequent problems related to the Domain Name System (DNS) within Amazon's cloud. "Local DNS failure is a common problem," Imbriaco said, resulting in failed connections. Parsons added that the private IP addresses of instances in the Amazon cloud change without warning, so it's important to use DNS for communications between them.
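Parsons' advice amounts to resolving a peer's hostname at connection time rather than caching its IP address. A minimal sketch using Python's standard library (the helper name, hostname, and port are illustrative placeholders):

```python
import socket

def connect_to_peer(hostname: str, port: int) -> socket.socket:
    """Open a connection to a peer, resolving its name fresh each time.

    Caching the resolved private IP would break when EC2 reassigns
    addresses; a per-connection lookup picks up the change.
    """
    addr = socket.gethostbyname(hostname)  # fresh DNS lookup each call
    return socket.create_connection((addr, port), timeout=5)
```

In practice you would also want a low TTL on the records involved, so that stale answers age out quickly after an address changes.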

In the end, Parsons said, with Amazon instances, "Failure is assured." He recommended keeping a hardened base system image, using tools like Puppet or Chef to configure and patch instances, and replacing instances early, before they fail. When asked if he had considered switching to another cloud provider, he replied that he still thinks Amazon is the best option. "For all my complaints, I think EC2 is fantastic."