Nothing has the potential to ruin a product or even an organization more than software instability. Most of us get excited by developing new features or services, and we neglect their operation. Agile methodologies mostly focus on product development. In large organisations with hundreds of systems this is not enough. You need more than just a methodology to achieve resilience. This is where Organisational Practices to improve system resilience come in.

Mike Murphy is undoubtedly an expert when it comes to resilience. An expert for me is someone who has the scars of years of sleepless nights and ruined weekends because of a system being down AND who has the theoretical background on the subject matter. Mike is such a person.

In this third part of three articles we will focus on the organisational practices that help make systems more reliable. Consider it a cheat sheet to improve your engineering. We covered the causes of system instability in Towards System Resilience (Part 1) and the Engineering practices in our Engineering Practices Cheat Sheet. Over to Mike for the operational practices.

Operational practices

Engineering alone does not lead to guarantees of higher systems stability. It is only when augmented by robust and resilient operations practices that systems stability improves over time. This section covers those operations practices in Group IT that are being significantly enhanced in order to support the efforts to improve systems stability.

You Build it, you run it

Traditional IT models divide implementation and support into separate teams, sometimes referred to as “build” and “run” or “development” and “support”. The result of this is to isolate the people building from any problems they create. It leaves the people who run the system with little ability to make meaningful improvements as they discover problems. Because the most valuable learning opportunities come from how a system is used in production, this dev/ops divide limits continuous improvement.

Another way to view this is that historically, developers designed and produced software, then “threw it over the wall” to operations people, who deployed and supported the software in production. When things broke, as they invariably did, it was the operations folk who would be summoned to fix them at the proverbial midnight hour.

In organizations where this takes place, developers essentially have no skin in the game in operating the software that they write; in fact, they have essentially transferred all the downside to operations while keeping all the upside (e.g., uninterrupted sleep) to themselves. There is no downside to them writing and deploying defective software.

This problem is solved by giving development teams end-to-end responsibility for the development and operational support of their systems. Segregation of duties is still maintained in this model: those who build do not review and promote their own code into production. When asking ourselves whether this is a good idea we should contemplate the following questions: “… who knows more about an application than the team that created it?”; “Who is better able to find the root cause of performance problems than the team that wrote the code?”

More skin in the game equals better ownership equals more responsive teams equals more stable systems.

Blameless post mortems (retrospectives)

Failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that result from the actions (or lack of action, in some cases) of individuals? If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. Having a Just Culture means balancing safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would have been had it simply punished the actors involved as a remediation.

A post mortem is a written record of the incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. It is a process intended to inform improvements by determining aspects that were successful or unsuccessful. A blameless post-mortem assumes that everyone involved in the incident had good intentions and did the right things with the information that they had. Systematically learning from past problems is essential to long term systems reliability.
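The written record described above can be sketched as a small data structure. This is only an illustration: the class names, field names and the sample incident below are hypothetical and not taken from any particular tool or from the author's own process. Follow-up actions are assigned to a team rather than a named person, in keeping with the blameless approach.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FollowUpAction:
    description: str
    owner_team: str  # a team, not an individual (assumption made for this sketch)
    done: bool = False


@dataclass
class PostMortem:
    """One record per incident: what happened, its impact, the fix, the causes, next steps."""
    title: str
    started_at: datetime
    resolved_at: datetime
    impact: str
    mitigation_steps: list[str] = field(default_factory=list)
    contributing_causes: list[str] = field(default_factory=list)
    follow_ups: list[FollowUpAction] = field(default_factory=list)

    def duration(self) -> timedelta:
        """How long the incident lasted, from start to resolution."""
        return self.resolved_at - self.started_at

    def open_follow_ups(self) -> list[FollowUpAction]:
        """Follow-up actions that have not yet been completed."""
        return [action for action in self.follow_ups if not action.done]


# Entirely made-up incident, used only to show how the record is filled in.
example = PostMortem(
    title="Checkout service unavailable",
    started_at=datetime(2024, 1, 10, 2, 15),
    resolved_at=datetime(2024, 1, 10, 3, 45),
    impact="Customers could not complete purchases",
    mitigation_steps=["Rolled back the latest release"],
    contributing_causes=["Untested schema change", "Alert threshold set too high"],
    follow_ups=[
        FollowUpAction("Add schema change to release checklist", "platform", done=True),
        FollowUpAction("Review alert thresholds", "operations"),
    ],
)

print(example.duration())
print([action.description for action in example.open_follow_ups()])
```

Keeping the follow-up actions in the same record as the causes makes it easy to check, some weeks later, whether the lessons from an incident were actually acted on.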

Blameless post mortems are instrumental in creating a culture where people can openly discuss their mistakes and learn from them. Having a “blameless” post-mortem process means that engineers whose actions have contributed to an accident can give a detailed account of: what actions they took at what time; what effects they observed; expectations they had; assumptions they had made; and their understanding of the timeline of events as they occurred, and that they can give this detailed account without fear of punishment. Engineers who think they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat, if not with the original engineer, then with another one in the future.

When the results of an outage are especially bad it is easy to make the mistake of thinking that the correct steps to prevent or shorten an outage are equally obvious before, during, and after the outage (essentially a manifestation of hindsight bias). The worse the outage, the stronger the tendency to blame the human committing the error. People become “root causes” of failure. However, humans are very rarely the root cause of failure. Failure is more likely to be triggered by things such as ineffective processes, a culture of recklessness, overly complex systems, organizational silos, role and responsibility confusion, etc. By focusing on learning from incidents, and on improving the environment that shapes how people behave both during normal work and during stressful situations, we will increase the likelihood that we are able to anticipate and respond to failure more effectively.

Blameless does not equal no consequences, nor does it mean that people get off the hook for making mistakes. People involved in the failure should be charged with improving the systems & processes and educating the rest of the organization on how not to make similar mistakes in the future.

Root cause analysis

Our intuitive notions of cause-and-effect, where each outage is attributable to a direct root cause, are a poor fit to the reality of modern systems.

Sooner or later, any complex system will fail. Failure can occur anytime and almost anywhere and the complexity of today’s systems ensures there are multiple flaws, latent bugs, present at any given moment. We can’t fix all of these, both for economic reasons and because it’s hard to picture how individual failures might contribute to a larger incident. We’re prone to think of these individual defects as minor factors, but seemingly minor factors can come together in a catastrophe.

Most, if not all, production outages have lots of contributing factors that come into play. Decisions that were made months or years ago can eventually help trigger an issue that no one could foresee at the time. When you add other factors that may or may not be in your control, those decisions can eventually contribute to a production outage.

Complex systems run as broken systems by default. Most of the time, they continue to work thanks to various resiliency measures: database replicas, redundant components, etc. And of course, thanks to good monitoring and alerting, coupled with knowledgeable engineers who fix problems as they arise. But at some point systems will fail.

Oftentimes in complex systems, there is no single root cause. Single point failures alone are not enough to trigger an incident. Instead, incidents require multiple contributors, each necessary but only jointly sufficient. It is the combination of these causes that is the prerequisite for an incident. We therefore can’t always isolate a single root cause. One of the reasons we tend to look for a single, simple cause of an outcome is that the failure is too complex to keep in our heads. Thus we oversimplify without really understanding the failure’s nature, and seize on a single factor as the root cause. This can be dangerous because it allows us and others to feel better about improving the reliability of our systems, when we may not be.
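The idea of contributors that are “each necessary but only jointly sufficient” can be made concrete with a toy model. The three factors below are invented purely for illustration, and real incidents are rarely this tidy. The sketch assumes the incident happens only when all factors are present at the same time.

```python
# Hypothetical contributing factors, invented purely for illustration.
CONTRIBUTING_FACTORS = frozenset({"config drift", "replica lag", "muted alert"})


def incident_occurs(present: set[str]) -> bool:
    """The incident happens only when all contributing factors are present together."""
    return CONTRIBUTING_FACTORS <= present


def preventing_removals() -> list[str]:
    """Return each factor whose removal alone would have prevented the incident."""
    return sorted(
        factor
        for factor in CONTRIBUTING_FACTORS
        if not incident_occurs(set(CONTRIBUTING_FACTORS) - {factor})
    )


# No single factor is sufficient on its own: every entry here is False.
single_factor_outcomes = [incident_occurs({factor}) for factor in CONTRIBUTING_FACTORS]

print(incident_occurs(set(CONTRIBUTING_FACTORS)))  # all factors together
print(any(single_factor_outcomes))                 # any factor alone
print(preventing_removals())                       # every factor is a place to intervene
```

In this model every factor appears in the list of preventing removals, so naming just one of them as “the” root cause would hide the other places where the incident could have been stopped.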

Because complex systems have emergent behaviors, not resultant ones, finding the root cause of a failure is like finding a root cause of a success.

The focus of Root cause analysis (RCA) has shifted from looking for a singular “root cause” towards identifying a system of causes. This way we open multiple opportunities to mitigate risk and prevent problems and don’t limit the solutions set, resulting in the exclusion of viable solutions.

Root cause analysis remains a crucial method for improving the resilience of systems by systematically uncovering those (often dormant) issues that lead to systems fragility.

Incident and crisis management

The aim of incident management is to restore the service to the customer as quickly as possible, often through a work around or temporary fixes, rather than through trying to find a permanent solution. Incidents are just unplanned events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident.

Crisis Management is defined as: the plans for and actions taken to protect and defend the reputation of the organization, its brand and its products/services. In the IT context a crisis is generally triggered by the protracted outage of a critical service that impacts a large portion of the customer base.

A ‘crisis’ may be the result of an ‘incident’, but not necessarily, and not every incident will result in a crisis. However, having an effective Incident Management process can reduce the chance of incidents escalating into a crisis.

The key to incident management is having a process, a good one, and sticking to it. However, defining the process is the easy part; sticking to it religiously is not.

Being excellent at incident & crisis management is a pre-requisite to improving the long-term resilience of the IT systems. When failure occurs, restoration of service is greatly improved through the execution of a well-honed process.

Disaster recovery

Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery is therefore a subset of business continuity.

Recovery point objective (RPO) and recovery time objective (RTO) are two important measurements in disaster recovery and downtime. A recovery point objective is the maximum targeted period in which data might be lost from an IT service due to a major incident. The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster. RPOs and RTOs are two of the constraints used in system design. The lower the tolerance for data loss in the event of a failure (RPO) and the shorter the desired recovery time (RTO), the more resilient (and by inference redundant) a system needs to be, and the more complex and expensive it is likely to be to build and operate.
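As a rough illustration of how these two objectives act as design constraints, the sketch below checks a backup schedule and a measured recovery time against them. The names and numbers are hypothetical. It also makes a simplifying assumption: with periodic backups, the worst-case data loss is taken to be one full backup interval, ignoring how long the backup itself takes and any replication lag.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore the service


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Simplifying assumption: a failure just before the next backup loses one full interval."""
    return backup_interval


def meets_objectives(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_recovery_time: timedelta,
) -> dict[str, bool]:
    """Compare a backup schedule and a measured recovery time against the RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= objectives.rpo,
        "rto_met": measured_recovery_time <= objectives.rto,
    }


# Made-up figures: a 15-minute RPO and a 1-hour RTO, checked against
# backups every 30 minutes and a recovery that took 45 minutes in a DR test.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
result = meets_objectives(
    objectives,
    backup_interval=timedelta(minutes=30),
    measured_recovery_time=timedelta(minutes=45),
)
print(result)
```

With these figures the RTO is met but the RPO is not: backups every 30 minutes can lose up to 30 minutes of data, so meeting a 15-minute RPO would need more frequent backups or continuous replication, which is where the extra cost and complexity come from.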

Testing is critical to change management in DR planning, helping to identify gaps and providing a chance to rehearse actions in the event of a crisis. DR testing is a critical process for determining whether systems are able to recover after induced failure.

Building great engineering teams

Arguably the most effective way to improve the long term resilience of IT systems is by systematically recruiting, training and developing great engineering teams. However, building great teams is hard and there is no perfect recipe for success. Software development is a team sport and high functioning teams make a real difference. Therefore recruitment practices need to focus on hiring people who, first and foremost, are at their best working in teams. This doesn’t mean that technical acumen and experience are not important; it is that they are not more important than those things that make a person a great team player. To quote Joel Spolsky, “People who are Smart but don’t Get Things Done often have PhDs and work in big companies where nobody listens to them because they are completely impractical.” And “people who get things done but are not smart, will do stupid things, seemingly without thinking about them, and somebody else will have to come clean up their mess later. This makes them net liabilities to the company because not only do they fail to contribute, they also soak up good people’s time.”

Bibliography

While this is not an exhaustive list, there are a number of sources that have been drawn on for inspiration in the compilation of this and the subsequent posts. Some of the content has been used verbatim, some quoted and others used simply to frame an argument or position.

Books

Online Papers & Blogs

This was a guest post by Mike Murphy. Mike is the Chief Technology Officer of the Standard Bank Group. You can also listen to a podcast featuring Mike by clicking here.