All systems fail, eventually. But with thoughtful site reliability engineering and sane DevOps practices, sometimes you can fail gracefully, learn from mistakes, and make IT systems more resilient.

Sometimes, system uptime is paramount, whether you’re thinking about a business website, corporate network, or enterprise application. OK, maybe uptime is paramount all the time if it’s your business’s servers that crashed, your telephone that rings in the middle of the night, and your toes curling into the CTO’s carpet the next day as you try to explain why the data-transfer lights all went out.

However, the uptime requirement—and system reliability—is more urgent when millions of people rely on your website or platform, or when its failure brings the country to a screeching halt. Few of us can survive for long without Internet access, now that everything we do relies on online applications and organizational datasets.

“We write the code that keeps the Internet online,” explained panelists at a Grace Hopper Celebration session titled “When 99.9 Percent Isn’t Good Enough.” Or, as site reliability engineer (SRE) and moderator Anne Cesa Klein put it, “We avoid feeding blood to the machines.”

Whether you call the role SRE or DevOps—there’s an ongoing debate about the distinction between the two and whether the distinction matters—the role is the same. It’s the job of an SRE to apply software engineering principles to design, build, deploy, monitor, and maintain massive systems. During sessions at the recent conference, these experienced SREs shared hard-earned wisdom about defining service-level objectives, identifying the metrics to monitor, designing systems to degrade gracefully, and learning from mistakes after a catastrophic failure.

What does “up” look like?

Among an IT department’s first considerations are its service-level objectives. How much downtime is acceptable? If a server has 99.9 percent uptime, it is unavailable for 1 minute 26.4 seconds per day, or a little more than 10 minutes per week. For a mundane business server, that’s good enough. But if the demand is so critical that 10 minutes is too much downtime, you need to build a system that stays up 99.99 percent of the time, or maybe 99.999 percent. Each added nine makes the task harder and more expensive.
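The arithmetic behind those nines is easy to check. A quick sketch, not tied to any particular monitoring stack:

```python
# Downtime allowed per day and per week for a given availability target.
SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

def downtime_budget(availability: float) -> tuple[float, float]:
    """Return allowed downtime (seconds per day, seconds per week)."""
    down_fraction = 1.0 - availability
    return down_fraction * SECONDS_PER_DAY, down_fraction * SECONDS_PER_WEEK

for target in (0.999, 0.9999, 0.99999):
    per_day, per_week = downtime_budget(target)
    print(f"{target:.3%}: {per_day:6.2f} s/day, {per_week / 60:6.2f} min/week")
```

At 99.9 percent, that works out to 86.4 seconds per day; each extra nine shrinks the budget tenfold.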

It’s critical for the team to agree on how many nines users expect, said Jennifer Petoff, program manager for Google’s site reliability engineering team and author of “Site Reliability Engineering: How Google Runs Production Systems.” That discussion might be less about technology than it is about human motivations. Developers want to get features out the door at any cost—including reliability—while the operations team is apt to push back. “Focus on the user, not the emotion,” Petoff urged.

Reliability is important, but often it’s not the only design criterion. “We sometimes have to make the decision to not make something [super-]reliable to save on costs,” said Leslie Carr, engineering manager at Quip. It’s all a balance: ultimately, you might have to pay three times as much—or hire three more engineers, she said.


Predicting system reliability is complex, pointed out Emily Burns, a senior software engineer at Netflix who also works on the open source deployment tool Spinnaker. Sudden market success can overwhelm an application, even when you use load balancing and other best practices. And then how do you update the software? “If you are small, you can go down for 20 minutes every Sunday [for a system refresh], but that doesn’t work when the application gets big and serious,” she said.

All the more reason to set your uptime goals carefully, the panelists agreed. How you monitor the system, roll back a code change, or respond to an anomaly helps the system stay up longer. Think through the process, both technical and human. At Netflix, for example, the SRE team is responsible for diagnosing issues and paging the appropriate person.

Degrade gracefully

Failures do happen. But with thoughtful design and proper engineering, the most important elements of a system can keep operating while the IT staff fixes an underlying problem. You can shut down smaller pieces of the system so that the larger, more popular pieces stay up and working.

“Think of ways to degrade the application without people noticing,” said Carr. “At Wikipedia, we turned off editing.”
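In code, that kind of graceful degradation often takes the shape of feature flags that shed optional functionality as load climbs, while the core path stays up. A minimal sketch—the flag names and thresholds here are illustrative, not Wikipedia’s actual mechanism:

```python
# Illustrative feature-flag degradation: shed optional features under load.
FLAGS = {"editing": True, "recommendations": True, "reading": True}

def degrade(load: float) -> None:
    """Turn off optional features as load climbs; keep the core path up."""
    FLAGS["recommendations"] = load < 0.8   # first feature to go
    FLAGS["editing"] = load < 0.9           # next to go
    # "reading" (the core product) stays on no matter what

degrade(0.95)
print(FLAGS)  # only the core "reading" path remains enabled
```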

At Netflix, the streaming service is designed to degrade rather than go down. “We could fail out of an entire AWS region with no impact on streaming traffic,” said Burns. In less than 10 minutes, the server streaming a customer’s movie will have moved, and the customer won’t even know it.

You can do that only if the SREs are involved early in the design process, pointed out Petoff, and if they are embedded with the software development team. At a minimum, added Klein, an SRE should be involved when you kick off any IT project, so they can ask questions about design goals for reliability. “That saves time on the back end,” Klein said.


That design conversation touches on the tension between the developers creating something new and the operations staff focused on system reliability. The answer, said the panelists, is for the project planners to build in an error budget.

“Error budgets enable developers to take calculated risks,” said Petoff. “Let them innovate and rock out with the new features.” That is, make sure the project plan has set aside time in which to try new things.

But once you burn the error budget, the emphasis has to go back onto “get the project done,” which generally means making safe (if boring) choices.
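One way to make that policy concrete is to gate risky launches on how much of the error budget remains in the current window. A minimal sketch, with an illustrative 30-day window and go/no-go rule:

```python
def error_budget_remaining(slo: float, window_seconds: int,
                           downtime_seconds: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, 0.0 = spent)."""
    budget = (1.0 - slo) * window_seconds   # total allowed downtime in the window
    return max(0.0, 1.0 - downtime_seconds / budget)

def can_ship_risky_change(slo: float, window_seconds: int,
                          downtime_seconds: float) -> bool:
    # Illustrative policy: allow risky launches only while budget remains.
    return error_budget_remaining(slo, window_seconds, downtime_seconds) > 0.0

# A 30-day window at 99.9% allows about 43 minutes of downtime.
WINDOW = 30 * 86_400
print(can_ship_risky_change(0.999, WINDOW, downtime_seconds=10 * 60))  # True
print(can_ship_risky_change(0.999, WINDOW, downtime_seconds=50 * 60))  # False
```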

Automation, metrics, and tools to the rescue

To track what’s happening with any software system, SREs put a lot of attention on metrics, log files, and the tools to glue together all the pieces. “When something is broken, you want as much information as possible to fix it,” Carr said.

The metrics and log files also inform the team’s road maps. For example, error logs are ammunition the SRE can take to a VP who wants to work on shiny new features instead of improving the software. Ditto for customer support tickets, mapped to situations where a well-constructed logging system might help answer issues. The dispassionate system failure numbers can convey the importance of fixing bugs before adding new functionality.

The data can also inspire team projects, such as an “internal hack day,” Carr suggested: “Everyone fix logging on Thursday, and we’ll have ice cream!”

She advised that whenever possible, you should “depend on others’ tools so you don’t have to create your own.” The panelists mentioned several in passing, such as PagerDuty, the open source tools from Netflix, and Chaos Monkey. Burns highly recommended that SREs do canary testing to test system fragility.
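Canary testing, in outline, means routing a small slice of traffic to the new build and comparing its error rate against the stable baseline before rolling out further. A sketch under illustrative assumptions—the traffic split and tolerance are made up for the example:

```python
import random

def pick_backend(canary_fraction: float = 0.01) -> str:
    """Route a small slice of traffic to the canary build."""
    return "canary" if random.random() < canary_fraction else "stable"

def canary_healthy(canary_errors: int, canary_requests: int,
                   stable_errors: int, stable_requests: int,
                   tolerance: float = 0.005) -> bool:
    """Keep the canary only if its error rate stays near the stable baseline."""
    canary_rate = canary_errors / max(canary_requests, 1)
    stable_rate = stable_errors / max(stable_requests, 1)
    return canary_rate <= stable_rate + tolerance
```

If `canary_healthy` returns False, the deployment tool rolls the canary back instead of promoting it.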

Ultimately, you work toward the perhaps idealistic goal of self-healing services. “When you get paged in the middle of the night, there is a human element: ‘Crud, I have to get up and deal with this,’” said Klein. What we really want is a system that recognizes it has a problem, collects useful data, and restarts itself, she added.
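Such a self-healing service can be sketched as a loop that checks health, captures diagnostics, and restarts before paging anyone. All of the hooks below are hypothetical placeholders a service owner would supply:

```python
import time

def self_heal(check_health, collect_diagnostics, restart, max_attempts=3):
    """Illustrative self-healing loop: detect a problem, capture data, restart.

    The three callables are hypothetical hooks, not a real library API.
    """
    for attempt in range(1, max_attempts + 1):
        if check_health():
            return True                      # healthy: nothing to do
        collect_diagnostics(attempt)         # grab logs/metrics before restarting
        restart()
        time.sleep(0)                        # real code would back off here
    return check_health()                    # still unhealthy: time to page a human
```

Only when the loop gives up does the pager fire, with the collected diagnostics already in hand.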

Don’t make the same mistake a second time

Each panelist emphasized the importance of holding review meetings—and doing so in a blameless way.

“Blameless post-mortems are a software engineering principle,” said Petoff. During the meeting, the team should discuss “what went well, what went wrong, and where we got lucky.” For every problem encountered, do your best to answer, “What will we do to keep the same bad thing from happening?”

“Never be afraid to fail,” said Carr. “Just never fail the same way twice.”

She would know. “I took down Twitter for 20 minutes in 2010,” Carr said. It should have been a simple change to a router. “I used > instead of <,” she explained. “OK, that looks good…let’s push it out to all the routers.” Twitter users everywhere saw the “fail whale.”

At the time, no automated test could have caught that situation. “There was no peer review,” Carr said. “No test beds.” Not until afterwards, anyway.

“Blameless is important,” Klein emphasized. “If the boss came after you with a pitchfork, you’re incentivized to cover up issues and to hide mistakes.” A blameless culture has to permeate every corner of the organization, she said.

For example, when a LinkedIn SRE brought down the site, its blameless culture empowered the staff to talk about the inevitable failures in any complex system. “Everyone understood that if one person can bring down the site, there must be a lot of other issues involved. So, instead of placing the blame on me, our engineering organization made some changes to prevent that from ever happening again,” wrote the SRE.

The panelists highlighted the recent “nuclear alert” in Hawaii as a case study. The alert was meant to be a test, but the operator thought it was real. It is easy to point fingers at the operator, but, Petoff noted, it’s unclear that SRE principles were applied. “The question should have been, ‘What was there about the system that allowed the operator to make that error and to send the alert?’” Petoff said. That question should be asked until the answer is not about a person.

Petoff offered overall guidelines that may help anyone new to site reliability engineering. Think of a problem to solve with software, she said, and answer these questions: