During an incident it is critical to understand whether clients are being affected and to what degree. This is the most important inflection point in any incident, and it can be used to guide an appropriate and proportional response. SLOs (as defined by Google SRE) are an effective way to describe the client experience; they connect engineers and clients and help characterize client impact during incidents.

A client impacting incident necessitates a full and immediate response proportional to the impact, whereas an internal, non-client impacting incident requires a much lower level of response, perhaps not even an immediate one:

This decision exists along two logical dimensions. The first represents the client experience: clients' perception of latency and of availability. The second is the internal level, which most engineers operate at. The internal/implementation level is the level of services, processes and machines: the things that are orchestrated in order to deliver the client experience:

Everything below the client experience is an implementation detail. The how isn’t as important as ensuring that the value the client expects is actually delivered. There are two worlds: the client’s and ours. Have you ever felt like the world was melting and everything was on fire, only to be told at an all hands that clients couldn’t be happier? The difficulty during incidents is characterizing and reasoning about these two perspectives, and understanding what the actual client impact is for any given incident. These two perspectives can influence us in unhelpful ways during incidents. As an operator who is focused on implementations, it’s easy to get lost in the system state, treating any unhealthy service, host or increase in latency as a cause for concern.

Service Level Objectives (SLOs) are a middle path that helps operators and clients communicate. SLOs can be leveraged to inform the initial impact evaluation and help us understand if clients are being impacted.

To illustrate this imagine that we have a reactive job queue and worker pool that is processing events:

The Notification service instances have the following SLOs around the service they provide:

80% Message Processing Availability (i.e. successfully processed messages / total processed messages > 80%) — Are instances able to send messages for our clients?

99% of messages should be processed in < 1 second (ingested, processed, sent for mailing, and ACKed) — How long are clients waiting for their notification requests to be registered with a gateway?
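To make the two objectives concrete, here is a minimal sketch of evaluating them from raw measurements. The function names, counter values and the nearest-rank percentile method are illustrative assumptions, not part of the service described above:

```python
# Sketch: evaluating the two Notification SLOs from raw measurements.
# All names and sample values here are hypothetical.

def availability_slo_met(successful: int, total: int, target: float = 0.80) -> bool:
    """80% Message Processing Availability: successes / total >= target."""
    if total == 0:
        return True  # no traffic means no violation
    return successful / total >= target

def latency_slo_met(latencies_s: list[float],
                    target_s: float = 1.0,
                    quantile: float = 0.99) -> bool:
    """99% of messages processed (ingest -> ACK) in under target_s seconds."""
    if not latencies_s:
        return True
    ordered = sorted(latencies_s)
    # index of the quantile sample (conservative nearest-rank variant)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx] < target_s

print(availability_slo_met(successful=850, total=1000))  # True: 85% >= 80%
print(latency_slo_met([0.2] * 99 + [5.0]))               # False: p99 sample is 5.0s
```

In practice these ratios would be computed over a rolling window by a metrics system rather than from in-memory lists, but the comparison against the target is the same.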

Choosing which signals to build objectives around, and supporting the data collection, is non-trivial and out of scope for this article. Additionally, these SLOs may need refinement in their definitions; the goal of this article is to illustrate how SLOs can be used to distinguish client-impacting incidents from non-client-impacting ones. While the most accurate way to characterize the latency of the service is to generate synthetic transactions as a publishing client, we can create a proxy of the client experience using notification and queue based metrics.
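One way such a proxy might be built (a sketch under assumed names, not the service's actual instrumentation): derive end-to-end latency from the enqueue and ACK timestamps the queue already records, instead of instrumenting clients directly:

```python
# Sketch: approximating client-perceived latency from queue-side timestamps.
# Record when a message enters the queue and when the worker ACKs it;
# the difference approximates how long the client's request waited.
import time

class LatencyProxy:
    def __init__(self) -> None:
        self.enqueued: dict[str, float] = {}
        self.samples: list[float] = []

    def on_enqueue(self, message_id: str) -> None:
        self.enqueued[message_id] = time.monotonic()

    def on_ack(self, message_id: str) -> None:
        start = self.enqueued.pop(message_id, None)
        if start is not None:  # ignore ACKs we never saw enqueued
            self.samples.append(time.monotonic() - start)

proxy = LatencyProxy()
proxy.on_enqueue("msg-1")
proxy.on_ack("msg-1")  # proxy.samples now holds one latency sample
```

This undercounts time spent before enqueue (client network, gateway), which is why synthetic transactions remain the more faithful measurement.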

Phrasing the problem in terms of client impact

Suppose that alerts for an unhealthy Notification Sender machine begin to fire. What’s the appropriate response? Start an incident? Risk our reputation? ACK it and wait until morning? Given this alert, how do we respond and ensure that our response is proportional to the problem?

How do we solve for client impact?

Taking an engineering perspective, we may divert resources to immediately figure out the source of the machine failure and how it got into that state; we may start an incident. Another approach is to use SLOs, metrics which proxy the client experience and sit at the top of the stack, to guide the decision making:
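The SLO-driven approach can be sketched as a simple triage rule. The thresholds come from the two SLOs above; the function name and response labels are hypothetical:

```python
# Sketch: letting SLO state, not raw machine health, decide the response level.
# An unhealthy host alone does not page anyone; measured client impact does.

def triage(availability: float, p99_latency_s: float) -> str:
    """Map current SLO measurements to a proportional response."""
    AVAILABILITY_TARGET = 0.80   # from the 80% processing availability SLO
    LATENCY_TARGET_S = 1.0       # from the 99% < 1 second SLO

    client_impact = (availability < AVAILABILITY_TARGET
                     or p99_latency_s >= LATENCY_TARGET_S)
    if client_impact:
        return "start-incident"      # full, immediate response
    return "ticket-for-morning"      # internal only: fix at a normal pace

print(triage(availability=0.99, p99_latency_s=0.4))  # ticket-for-morning
print(triage(availability=0.65, p99_latency_s=0.4))  # start-incident
```

Under this rule, the unhealthy Notification Sender machine only escalates to an incident if the remaining instances can no longer keep the SLOs within target.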