Prioritize Product Reliability on Your Terms

Use SRE as a data-driven approach for prioritizing reliability features on the product backlog.

Today’s product managers rely on data to make tactical decisions that shape and prioritize a product’s backlog. From Google Analytics to A/B experiments that test product assumptions (fail early and often), Product Managers use data to react quickly to competition and changing markets. The pressures of rapid delivery, experimentation, and building adaptive products can jeopardize system reliability, erode user trust, and tarnish the brand. Site Reliability Engineering (SRE) is a balanced approach that addresses these product concerns by facilitating a safe delivery cadence while providing a high level of system reliability.

Master Juggler

Managing delivery expectations, keeping users happy, and protecting the product’s brand is a juggling act that Product Managers face daily. Brand is the “expectation of an experience,” and users simply don’t think about the reliability of the products they use unless, of course, it’s missing or there’s a perception of unreliability. Performance degradation, errors, UX bugs, and even spelling errors, when left unchecked, damage the brand and plant doubts in the minds of users about the reliability and trustworthiness of your products. Users expect products to work on their terms.

But we did all our testing before we shipped

Product shops and SaaS providers use all forms of QA resources including performance and user acceptance testing (UAT) that verify the functional aspects of deliverables against “known” outcomes in labs. This approach is an insufficient strategy for measuring product reliability. How confident are we about the reliability of our products after go-live in a volatile world? How resilient are products when critical cloud services become unavailable?

SRE saves the day

Another trove of data is available to the Product Manager providing insights into the reliability of products and services in live systems. SRE, when collectively practiced by Engineering and Product Management, equips these teams with a powerful data-driven approach for prioritizing reliability features on the product backlog. Proactively investing dollars in reliability, at the right time, lowers future development costs and protects the product brand.

Show me the data

We need to measure something before Product Managers can know when to slow down feature delivery and prioritize product reliability. Slowing down in this context means that the normal feature velocity of an engineering team may change in favor of reliability feature intake. This outcome provides a balanced approach that “right-sizes” system stability while facilitating the shipment of consistently reliable applications as product requirements change.

I’m sold. How do we get there?

Define Service Level Indicators (SLIs) — approximate the user’s experience

Define Service Level Objectives (SLOs) — measure the user’s experience

Gain executive sponsorship for error budgets — manage risk

Measure and report on objectives — provide visibility

Review, prioritize, and remediate failing objectives — invest in reliability

The secret weapon

The secret weapon of reliability engineering is the Service Level Indicator or SLI. It is the lifeblood of the SRE practice. An SLI is a specification that approximates the user’s experience of, and happiness with, your product. No one is more qualified than a Product Manager when it comes to understanding the joys and frustrations of how users interact with products. In addition to the analytics frameworks that PMs rely on to guide product feature decisions, PMs use other techniques, including surveys, interviews, and client advisory boards, that can provide valuable input into the development of reliability requirements.

How do I start thinking about reliability?

Product managers can ground their thinking about product reliability around two SLI types: request/response and data processing.

A request/response SLI type best approximates user experiences of those who directly interact with websites and APIs.

A data processing SLI type best approximates the experience of system personas that execute backend workflows such as ETLs, data streams, and batch processes.

SLI specifications should focus on the major user journeys, features, and workflows the system and applications support — the fewer, the better.

Pass the salt, please

SLI types support a limited set of ingredients or dimensions for measuring the reliability of products and services. This is ideal as it provides a simple methodology for expressing consistent reliability requirements across Engineering and Product Management teams. It also reduces the chances of producing the “metric soup” that often plagues reporting dashboards.

Web-based request/response SLIs can be expressed with these dimensions:

Availability: What is the proportion of valid requests served successfully?

Latency: What is the proportion of valid requests served faster than a threshold?

Quality: What is the proportion of valid requests served without degrading quality?

Data processing SLIs can be expressed with these dimensions:

Correctness: What is the proportion of valid data producing correct output?

Throughput: What is the proportion of time where the data processing is faster than a threshold?

Coverage: What is the proportion of data processed successfully and available as output?

Freshness: What is the proportion of valid data updated more frequently than a threshold?

Each SLI dimension specifies a “proportion” of valid events which, when implemented by an SRE engineer, expresses a metric value as a percentage. This is ideal because it provides a consistent way of calculating SLIs and outputs a standard measurement format that other reliability tooling can depend on later.
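To make the “proportion of valid events” idea concrete, here is a minimal sketch of an availability SLI expressed as a percentage. The `Request` record, the `valid` flag, and the choice to count any status below 500 as a success are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the caller
    valid: bool = True   # assumption: False marks traffic excluded from the SLI (e.g. health checks)


def availability_sli(requests):
    """Return the percentage of valid requests that were served successfully."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # no valid events, so the SLI is undefined for this period
    # Assumption: server errors (5xx) are the only failures that count against the SLI.
    good = sum(1 for r in valid if r.status < 500)
    return 100.0 * good / len(valid)


sample = [Request(200), Request(200), Request(503), Request(200), Request(200, valid=False)]
print(availability_sli(sample))  # 3 good out of 4 valid requests -> 75.0
```

The other dimensions follow the same shape: count the good events, divide by the valid events, and report a percentage.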

We can’t possibly fail

A 100% reliability target is the wrong number for a service or product, and it is untenable. A Service Level Objective or SLO is a threshold for acceptable product failure based on a dimension specified in the SLI specification.

SLOs provide transparent insights into the reliability of products in live systems. Engineering and Product Management both have skin in the game as these teams work together to enhance and protect the product brand. Engineering teams embrace software releases knowing that reliability engineering will be prioritized, while Product Managers play a key role in defining the overall reliability goals.

What would Goldilocks do?

This SLO is too low. This SLO is too high. This SLO is just right.

SLO definitions have subtle tradeoffs based on an organization’s failure tolerance.

A lower tolerance for failure may divert more resources to reliability engineering and therefore, slow down feature delivery.

A lower tolerance for failure places additional burdens on Engineering teams as there is less time to invest in automation and operations projects — more operations toil.

A higher tolerance for failure may divert more resources to product development; however, you may be unaware that users are becoming irritated by degrading performance.

A higher tolerance for failure increases system technical debt, increases regressions, slows down feature throughput, and places additional burden on the Engineering teams, who end up reacting to fires.

Aspirational SLOs can be implemented side-by-side with production SLOs and then used as a data decision point for tuning an SLO’s failure threshold. Just right.

Some sanity, please

Product Managers should consider using the “rule of nines” to balance the availability of products and services over rolling time windows.

For example, a 95% available service (1.5 nines) may be unavailable for 8.4 hours during a 7-day rolling window, whereas a 99% available service (2 nines) may be unavailable for only 1.68 hours in a 7-day rolling window.

Product Managers may find it useful to have an unavailability crib sheet close by that describes the amount of time products can be in failing states over 7, 14, and 30-day rolling windows. This puts a stake in the ground about an aspect of product reliability but may require renegotiation with the SRE team later.

Rule of Nines and Unavailability
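The crib sheet described above is simple arithmetic: the allowed unavailability is the failure proportion multiplied by the length of the window. A minimal sketch follows; the function name and the particular availability levels printed are illustrative choices.

```python
def allowed_downtime_hours(availability_pct, window_days):
    """Hours a service may be unavailable in a rolling window at a given availability."""
    return (100 - availability_pct) / 100 * window_days * 24


windows = (7, 14, 30)  # rolling windows, in days
print("avail    " + "  ".join(f"{d:>3}d (h)" for d in windows))
for pct in (95.0, 99.0, 99.5, 99.9):
    row = "  ".join(f"{allowed_downtime_hours(pct, d):8.2f}" for d in windows)
    print(f"{pct:5.1f}%  {row}")
```

For the 7-day window this reproduces the 8.4 hours (95%) and 1.68 hours (99%) quoted in the example above.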

Should I promise too much or too little?

Consider a strategy of under-delivery based on what the users can tolerate, but recognize that this is a balancing act, especially with the brand in play. Users may not notice outages and may even wait for the issue to resolve itself if a suitable workaround exists, such as an offline mode. Over-delivery means that users will always expect the same level of high-performing products in the future, regardless of how the product evolves. If a product’s revenue model supports a lower tolerance for failure, which is expensive, then this is a completely acceptable reliability approach. Any service level agreements (SLAs) published by competitors may provide additional clues about the level of availability you should target for users.

Now you’re talking my language — requirements!

There is no standard for how to express an SLO, but careful thought should be given to “where” the SLO is measured. Technical PMs may have working knowledge of a system architecture and may know where to specify the measurement location. We want to be as close to the user as we can; however, things get interesting when we try to instrument the browser, as this location may include other latencies that muddy the waters, such as ISP upload/download restrictions. Measuring near a load balancer or gateway is a good approximation of the user experience.

Whatever the SLO specification looks like, it should be consistent and implementable by an SRE engineer. An SRE engineer may ask clarifying questions, push back if the SLO spec is too arduous to implement, and recommend different measurement approaches.

Latency Recipe

Here’s an example of a request/response latency recipe that approximates how long a user is willing to wait for a homepage.

Experience approximation: As a customer, I expect to login and land on the homepage within 5s.

Type: Request/Response

Dimension: Latency

Measurement Location: Load balancer

SLI: Proportion of valid logins served faster than 5 seconds

SLO: 95% of valid logins (1.5 nines) are faster than 5 seconds

Rolling Window: 7 days

Error Budget: When a 7-day error budget of 8.4 hrs. (5% of 168 hours) has been spent, Engineering teams have priority to fix the reliability issue. Or, a priority reliability story shall be added to the product backlog.

Additional Notes: An SSO login may require a number of interstitial steps to authenticate the user so we will need to aggregate latency of the operations we own.
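As a rough illustration of how an SRE engineer might evaluate this recipe, the sketch below computes the latency SLI from login durations and compares it with the 95% objective. The 5-second threshold and 95% target come from the recipe; the function names and sample durations are hypothetical, and the durations are assumed to have been collected at the load balancer over the rolling window.

```python
THRESHOLD_S = 5.0  # from the recipe: logins should complete within 5 seconds
SLO_PCT = 95.0     # from the recipe: 95% of valid logins


def latency_sli(durations_s, threshold_s=THRESHOLD_S):
    """Percentage of valid logins served faster than the threshold."""
    if not durations_s:
        return None  # nothing measured in this window
    fast = sum(1 for d in durations_s if d < threshold_s)
    return 100.0 * fast / len(durations_s)


def meets_slo(sli_pct, slo_pct=SLO_PCT):
    """True when the measured SLI is at or above the objective."""
    return sli_pct is not None and sli_pct >= slo_pct


# Hypothetical sample: 19 quick logins and one slow one.
durations = [1.2] * 19 + [7.5]
sli = latency_sli(durations)
print(sli, meets_slo(sli))  # 95.0 True
```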

Availability Recipe

Here’s an example of a request/response availability recipe that approximates the number of search errors a user can tolerate.

Experience approximation: As an operations person, I expect a 99.5% success rate for all employee search requests.

Type: Request/Response

Dimension: Availability

Measurement Location: Load Balancer

SLI: Proportion of valid employee search requests that return a 200 response

SLO: 99.5% (2.5 nines) of valid employee search requests return a 200 response

Rolling Window: 7 days

Error Budget: When the proportion of failed requests (500 response codes) exceeds 0.5% of valid requests in a 7-day rolling time window, Engineering teams have priority to fix the reliability issues. Or, a priority reliability story shall be added to the product backlog.

Notes: Other error codes may be relevant for this measurement.

We can budget for errors?

The latency and availability recipes above include the concept of an error budget. Error budgets require strong C-Level or D-Level sponsorship within the organization. Error budgets empower an autonomous decision-making process among Engineering and Product Management stakeholders that supports the prioritization and remediation of product reliability issues.

Once SLOs have been agreed upon and implemented by the SRE team, requests that fail to meet their SLO targets or thresholds are aggregated and counted against the error budget. When the error budget is spent, or burnt, over a rolling time window, the SLO has failed to meet its objective.
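The error budget arithmetic can be sketched in a few lines. This example uses an event-based budget (a count of failed requests) rather than a time-based one; the function name, the report fields, and the request counts are hypothetical.

```python
def error_budget_report(total_valid, bad, slo_pct):
    """Summarize error budget consumption for one rolling window."""
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_bad = total_valid * (100 - slo_pct) / 100
    if allowed_bad > 0:
        consumed = bad / allowed_bad
    else:
        consumed = 0.0 if bad == 0 else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "bad": bad,
        "consumed_fraction": consumed,
        "slo_failed": bad > allowed_bad,
    }


# Hypothetical week: 200,000 valid search requests against a 99.5% SLO.
report = error_budget_report(total_valid=200_000, bad=400, slo_pct=99.5)
print(report)
```

With these numbers the budget is 1,000 failed requests, of which 400 (40%) have been spent, so the SLO is still within its objective.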

Failing SLOs are the key data points used by Product Managers to prioritize reliability work on the product backlog.

Can my organization do this now?

An organization will need to investigate current monitoring systems or invest in tooling that supports SLO and error budget concepts. Most monitoring tools offer query and expression languages that express SLOs as a proportion or percentage of valid events. Also, evaluate how difficult or easy it is to set up reporting dashboards that display the objectives you care about. Commercial tools such as Datadog and the open source project Prometheus support these concepts natively. Remember, the data will need to cook and simmer for a while, but the sooner it’s collected, the faster you’ll gain insights into your product’s reliability.

How do we make this real at my company?

SLOs and reliability measurements serve no one if the organization fails to hold itself accountable to what’s happening in a live system. If you’re a scrum shop, consider using the sprint retrospective as a place for reviewing the short and long-term SLO trends with Product Management. Otherwise, schedule regular reliability reviews with Engineering and Product Management teams.

Fudging the numbers

The organization should resist “gaming” the system by fiddling with SLO thresholds. This will almost always backfire. For new or emerging products, initial SLOs may be too aggressive out of fear of losing users and customers. Instead, implement “aspirational SLOs” that operate side-by-side with production SLOs. After they cook, aspirational SLOs provide an independent set of data you can evaluate for tuning the production SLOs. Conversely, other SLOs may be too loose and may need tightening up.
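One way to picture aspirational and production SLOs running side by side is to evaluate the same SLI history against both targets. The sketch below is illustrative only: the daily SLI values and the two targets are made-up numbers, and the function name is hypothetical.

```python
def days_meeting_targets(daily_sli_pct, targets):
    """Count, per target, the days on which the measured SLI met or exceeded it."""
    return {
        name: sum(1 for sli in daily_sli_pct if sli >= target)
        for name, target in targets.items()
    }


# Hypothetical daily availability SLI values for one week.
daily = [99.7, 99.6, 99.2, 99.8, 99.1, 99.9, 99.6]
targets = {"production": 99.0, "aspirational": 99.5}
print(days_meeting_targets(daily, targets))  # {'production': 7, 'aspirational': 5}
```

If the aspirational target is met most of the time after the data has cooked, that is evidence the production SLO could be tightened; if it is rarely met, the production SLO may already be about right.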

We can be proactive about failure too!

Measuring an abnormal error budget burn can be an effective strategy for heading off service outages. For example, if 50% of an SLO error budget has been consumed during the first day of a seven-day rolling window, consider alerting the team. Left unchecked, a burn at that rate will exhaust the budget well before the window closes. This strategy gives operations and engineers time to investigate negative reliability trends and take corrective action sooner rather than later.
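A burn-rate check like the one described above might look like the following sketch. The burn rate is the budget consumed relative to the budget you would expect to have consumed at an even pace; the alert threshold of 2.0 and the function name are assumptions for illustration, not a recommended setting.

```python
def budget_burn_alert(consumed_fraction, elapsed_days, window_days=7, max_burn_rate=2.0):
    """Return (burn_rate, should_alert) for an error budget in a rolling window.

    A burn rate of 1.0 means the budget would be exactly spent at the end of
    the window; anything higher means it would run out early.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    burn_rate = consumed_fraction * window_days / elapsed_days
    return burn_rate, burn_rate >= max_burn_rate


# The example from the text: 50% of the budget spent after 1 day of a 7-day window.
print(budget_burn_alert(0.5, 1))  # (3.5, True)
```

At a burn rate of 3.5, the budget would be gone after two days, which is why alerting early is worthwhile.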

I don’t always think about reliability, but when I do, I prefer SRE.

Implementing a data-driven approach that measures the reliability of your products and services may require organizational or structural team changes. Organizations that have a strong DevOps presence can use SRE tenets as a roadmap for implementing the changes, tools and processes needed for measuring and raising awareness of reliability engineering within the organization.

Reliability specifications may be new to some product managers but with practice and using the guidelines outlined here, PMs can develop robust product stories that include reliability thresholds as part of the exit criteria or definition of done. Reliability data gathered in live environments give PMs another decision-making tool for how hard to push the feature pedal.

Prioritize reliability features on your terms and have peace of mind about the product brand you’ve worked hard to build!