This is the first post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 2, Understanding The Role Of The Incident Manager On-Call (IMOC), and part 3, Understanding The Role Of The Technical Lead On-Call (TLOC).

Introduction

High severity incident management is the practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems. The purpose of establishing a program is to enhance the customer experience by improving your infrastructure reliability and upskilling your team. In this guide, we will share how to establish and measure the success of your own high severity incident management program.

What is High Severity Incident Management?

The management of high severity incidents encompasses high severity incident (SEV) detection, diagnosis, mitigation, prevention, and closure. SEV prevention includes SEV review and SEV correlation.

What are SEVs?

SEV is a term used to refer to an incident, it is derived from the word severity.

What are common types of SEVs?

Availability Drop

Product Issue / Feature Broken

Data Loss

Security Risk

What are examples of SEVs?

SEVs are common across all industries. They impact our ability to catch flights, access our bank accounts, buy the latest game we are excited about and much more.

The following are examples of SEVs that have occurred:

More examples have been collected and shared by Outage Report.

What are SEV levels?

SEV levels are designed to be simple for everyone to quickly understand the amount of urgency required in a situation. Launching your high severity program with SEV levels is important because it will make it easier to inform everyone across your engineering team how you classify incidents.

SEV levels empower and educate your entire team to feel confident lodging SEVs and effectively prioritising them.

An example of SEV levels are described in the table below:

SEV Level Description Target resolution time Who is notified SEV 0 Catastrophic Service Impact Resolve within 15 min Entire company SEV 1 Critical Service Impact Resolve within 8 hours Teams working on SEV & CTO SEV 2 High Service Impact Resolve within 24 hours Teams working on SEV

How do your resolution times impact SLOs/SLAs?

There are three service level terms often used to measure the level of service that will be provided to customers. These terms are; service level indicators (SLIs), service level objectives (SLOs) and service level agreements (SLAs). Companies will often set an SLO which is higher than their SLA, for example the SLA provided to customers would be 99.99% but the internal unpublished SLO would be 99.999%.

An SLA level of 99.99 % uptime/availability gives the following periods of potential downtime/unavailability:

Daily: 8.6s

8.6s Weekly: 1m 0.5s

1m 0.5s Monthly: 4m 23.0s

4m 23.0s Yearly: 52m 35.7s

You can calculate the error budget you have available based on your SLO and SLA at uptime.is.

What is the full lifecycle of a SEV?

The full lifecycle of a SEV involves detection, diagnosis, mitigation, prevention and closure. There is a glossary of SEV terms just below.

When we are reporting on SEVs that have occurred in the past to look for trends and patterns, it is useful to understand the overall mean measurements for SEVs. For example, the mean time to detection (MTTD) for all SEVs. This measurement is useful to dive deeper and understand which types of systems or services take much longer to detect incidents for. Engineers will often take the time to tag incidents because it enables the creation of visualizations by cause and magnitude.

How are SEVs measured?

High Severity Incidents (SEVs) are measured by the availability error rate and total time of impact. We use the formula below to identify the number of customer requests which were impacted:

% loss * outage duration

This also lends itself to prioritization. For example, you can assign a value to each customer request and the result will be visibility of the most important requests. You may choose to assign a higher value to requests related to new product launches, e.g. the launch of a new TV series that is being streamed.

Revenue impact for Freemium products can occur due to outages impacting customer upgrade flows. For example, a Freemium photo storage service’s revenue could be impacted by a mail campaign outage. This could then result in campaign mail not sending when a user hits their storage quota and is prompted to upgrade.

Glossary of common industry SEV terms

The following acronyms are used to measure and track SEVs. This data is important to collect in your SEV tracking tool. It is best to fill in all data when the SEV has been resolved. When you store this data it can then be used to automatically generate monthly SEV metrics reports.

SEV Term Description TTI Total time of impact TTD Time to detection TTE Time to engagement TTR Time to resolution TTP Time to prevention TBF Time between failures

How do you create SEV levels for free and paid products?

When you are establishing your high severity incident management program it is important to consider the pricing model of your product. If your product pricing uses a freemium model you likely have different SEV levels than a product with subscription pricing and no free tier.

If your product has tiers such as business and enterprise with different costs you could classify requests for these products with different severity ratings. This is very useful if you have VIP customers who require higher levels of availability than the majority of your customers.

Example: SEV levels for data loss

Any SEV which involves loss of customer data should be classified as a SEV 0.

SEV Level Data Loss Impact SEV 0 Loss of customer data SEV 1 Loss of primary backup SEV 2 Loss of secondary backup

Example: SEV levels for a free or freemium product

The following diagram illustrates the way to measure SEV levels for a free or freemium product. For example, a 90% loss of customer requests * 10min is classified as a SEV 1 incident. This is considered a critical service impact but we do not classify it as catastrophic because of the length of the outage and insignificant impact on revenue.

The diagram below demonstrates a SEV 1 impacting WWW and API with no impact to client. This is classified as a critical service impact. We measure this SEV as 0.2% * 30 min (6) for WWW and 0.11% * 30 min (3.3) for API. The 99.99 SLA has been breached. In the example below the SLA is 99.99 for availability.

Example: SEV levels for a product with subscription or complex pricing

The following diagram illustrates the way to measure SEV levels for a product with subscription pricing. For example, a 90% loss of customer requests * 10min is classified as a SEV 0 incident. This is considered a catastrophic service impact due to the impact to revenue.

How do you ensure your team operates effectively during a SEV 0?

There are two roles which are important to establish when you create your high severity incident management program. These are critical to ensure your team can effectively diagnose, mitigate and prevent SEV 0s. These roles are the Incident Manager On-Call (IMOC) and the Tech Lead On-Call (TLOC).

Incident Manager On-Call (IMOC) / Call Leader

Small rotation of Engineering Leaders ( <10 )

One person is on-call in this role at any point in time.

Wide knowledge of services and engineering teams

Tech Lead On-Call (TLOC)

The engineer responsible for resolving the SEV

Deep knowledge of own service area

Deep knowledge of upstream and downstream dependencies

How do you set up IMOCs for success during SEV 0s?

IMOCs play a critical role in managing SEV 0s. It is important to train IMOCs before they do their first rotation because they are the sole person responsible for coordination and communication of the SEV 0 with the entire company.

IMOC training is best conducted in a one hour face-to-face training session with time for questions. It involves gaining an understanding of the following:

SEV levels

The full lifecycle of SEVs

Examples of previous SEV 0s that have occurred within your company

Access to an IMOC runbook with communications templates

To ensure IMOCs can be as effective as possible, the following practices should be in place:

A 24/7 rotation for IMOCs with one Primary IMOC and one Secondary IMOC on-call for a week

Automatically page the Primary IMOC for SEV 0s.

Set up automatic escalation to the Secondary IMOC if the Primary IMOC does not ack the page in 1min

A monthly sync for all IMOCs

When your IMOC is paged, the first thing they will need to do is look at the recorded SEV details and then open the Critical Services Dashboard. Next we will discuss how to categorize SEVs to enable the IMOC to quickly grok the impact of SEVs.

How do you categorise SEVs?

When establishing your high severity incident management program it is useful to be able to view requests split by different categories such as services or new products. For example, if your team is responsible for a website, API and desktop client, being able to determine the impact based on the requests for WWW, API and Client will be very useful. This will enable you to more quickly work through the full SEV lifecycle.

The dashboard below is an example of how your dashboard would appear during a SEV impacting WWW and API with no impact to Client. There is also a red line to assist in determining if the SLA has been breached. In the example below the SLA is 99.99 for availability.

How do you empower everyone in your company to record SEVs?

There is a significant benefit to enabling and empowering everyone in your company to have the understanding and confidence to record SEVs.

At the Toyota Motor Manufacturing plant in Kentucky they “pull the Andon cord” 5000 times per day. The Andon cord is a physical rope that follows the assembly line; it can be pulled at any time to stop the manufacturing line. When someone pulls the cord and the problem is serious enough, everyone stops what they’re doing until the problem is fixed.

To empower your team, set up tooling and automation to make it simple for everyone to record, discuss, recall, and correlate SEVs. Many teams are now distributed around the world with multiple offices. Giving your team an easy to use web-based tool to record SEVs will increase the likelihood your high severity incident management program will be successful.

We have created a simple Google Form for you to use with your engineering teams. This form will enable all engineers to lodge SEVs. You can create a copy of this RecordSEV tool Google Form. The RecordSEV tool can be seen below:

How should you name SEVs?

To enable effective recall of SEVs and facilitate discussions it is useful to give SEVs descriptive names. A name generator that enables you to match a descriptive word and animal can be used to choose a name for the SEV. For example, “SEV 0 Cheeky Kangaroo”.

Descriptive SEV names will improve communication for your SEV program. Alternatively, SEVs can be named using numerical codes or timestamps.

What causes SEVs?

From industry analysis of SEVs we can classify the causes of SEVs as either technical issues or cultural issues. Some SEVs are caused by both technical and cultural issues. See the table below for examples:

Type of Issue Example Technical Issue Incorrect configuration change Cultural Issue Lack of on-call training

What are some of the expected failures are you likely to experience?

Technical Issues

Dependency Failure

Region/Zone Failure

Provider Failure

Overheating

PDU failure

Network upgrades

Rack failures

Core Switch failures

Connectivity issues

Flaky DNS

Misconfigured machines

Bugs

Corrupt or unavailable backups

Cultural Issues

Lack of knowledge sharing

Lack of knowledge handover

Lack of on-call training

Lack of Chaos Engineering

Lack of a high severity incident management program

Lack of documentation and playbooks

Lack of alerts and pages

Lack of effective alerting thresholds

Lack of backup strategy

How do you prevent SEVs from repeating?

It is possible to prevent SEVs from occurring again but to do this you must be able to record, correlate, and track SEVs.

We recommend making sure you have the following in place:

Descriptive SEV names

A form for all employees to record SEVs, e.g. RecordSEV tool

A postmortem tradition

Tracking if SEVs repeat, e.g. it’s another “Cheeky Kangaroo”

Capturing, tracking and resolving SEV Action Items

Chaos Engineering

Observability tools

How can you use Chaos Engineering to empower your teams to prevent SEVs?

Chaos Engineering is the facilitation of controlled experiments to uncover weaknesses of your infrastructure and team. Having your entire engineering team routinely performing chaos experiments on your systems will help you identify problems before they become incidents, and along the way increase your confidence that your systems and teams can handle failure. When used alongside the prevention methods described above, chaos empowers your teams to proactively uncover the technical and cultural issues that impact your company.

Here is a list of resources you can use to further explore how Chaos Engineering can be reduced to reduce the severity and number of incidents at your company:

Conclusion

A high severity incident management program is an important subset of reliability engineering, focused on ensuring that a team is prepared to manage incidents. This might seem complex; however, it can greatly improve your customer experience and empower you to meet SLAs. This will also empower you to be better prepared for compliance and auditing events as they arise. By following this guide you will be able to establish your own high severity incident management program and measure its success.

Join the Chaos Engineering Slack to connect with 5,000+ SREs and engineers about establishing SLOs and reducing incidents.