Service level objectives, or SLOs, are a key part of the site reliability engineering toolkit. SLOs provide a framework for defining clear targets around application performance, which ultimately help teams provide a consistent customer experience, balance feature development with platform stability, and improve communication with internal and external users.

In this post, we’ll learn how Datadog enables all the teams within an organization to track, manage, and monitor their SLOs in one place. You can search, sort, and filter all your SLOs in a comprehensive list view, and easily visualize the status of individual SLOs on your application dashboards. Datadog’s features for tracking and visualizing SLOs make it simple to monitor the real-time status of all your SLOs and communicate that status to your teams, executives, or external customers.

As we saw in Part 1, SLOs set precise targets for your SLIs, which are the metrics that reflect the health and performance of a service. By managing your SLOs in Datadog, you have seamless access to your monitoring data—including trace metrics from APM, custom metrics, synthetic data, and metrics generated from logs—to use as SLIs. For instance, if you want to ensure that typical user requests are serviced quickly, you might use your service’s median latency from APM as an SLI. You could then define an SLO as “the median latency of all user requests (as computed every minute) will be less than 250 milliseconds 99 percent of the time in any calendar month.”

To accurately track how actual performance compares to the objectives you’ve set, you need a way to not only monitor real-time performance (e.g., computing the median latency every 60 seconds and comparing it against the 250 ms threshold) but also to measure how often that threshold has been breached over longer timespans (to ensure that the 99 percent objective is met for every calendar month). Datadog tracks your SLIs and visualizes their status in relation to your established SLOs, so you can see immediately how actual performance compares to your objectives for a given time period.
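
To make the two timescales concrete, here is a minimal sketch in plain Python of how such an SLO could be evaluated by hand. It is an illustration only, not how Datadog computes SLO status internally; the function names and the sample latencies are hypothetical. Each inner list stands for the request latencies observed in one minute.

```python
from statistics import median

def minute_is_good(latencies_ms, threshold_ms=250.0):
    """Return True if the median latency for one minute is under the threshold."""
    return median(latencies_ms) < threshold_ms

def slo_status(per_minute_latencies, threshold_ms=250.0):
    """Return the fraction of minutes whose median latency met the threshold."""
    good = sum(minute_is_good(m, threshold_ms) for m in per_minute_latencies)
    return good / len(per_minute_latencies)

# Hypothetical sample: four one-minute buckets of request latencies (ms).
buckets = [
    [120, 180, 240],  # median 180 -> good
    [200, 260, 900],  # median 260 -> bad
    [90, 110, 130],   # median 110 -> good
    [230, 245, 400],  # median 245 -> good
]

status = slo_status(buckets)   # share of good minutes in the window
meets_target = status >= 0.99  # compare against the 99 percent objective
```

Over a real calendar month the list would hold one entry per minute, and the same comparison against 0.99 would tell you whether the objective was met.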

If your organization is committed to a variety of SLOs across multiple products and teams, visualizing the status of all of your SLOs in one place can help you set priorities and address issues. Datadog’s Service Level Objectives view allows you to see the status of all of your SLOs, along with the remaining error budget for each of them.

You can filter your list of SLOs by facets to see only the ones owned by a specific team or scoped to a service, time window, or any tag. The SLOs you create based on Datadog monitors automatically inherit the tags associated with those monitors. You can also apply custom tags to make it easier to organize your SLOs by team, environment, or any other dimension.

Once you’ve drilled down to a subset of SLOs, you can save your query as a view. Saved Views let you easily access your most frequently used SLOs without having to perform a manual query each time. In our example, we have saved the SLOs tagged with the checkout user journey, so we can share the status of this critical part of our ecommerce application with any internal or external stakeholders.

Creating a new monitor-based SLO to track the latency of requests to add items to a shopping cart

In the SLO list view, you can start tracking a new SLO by clicking the New SLO button. SLOs in Datadog can be based either on existing monitors (e.g., a monitor comparing p90 latency against a target threshold) or on real-time status computed from metrics. Metric-based SLOs are useful for tracking the proportion of events that meet a certain definition, such as the number of non-5xx responses from your load balancer divided by the total number of responses.

Creating a new metric-based SLO to track the number of non-5xx responses from AWS Elastic Load Balancer
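
The ratio behind a metric-based SLO can be sketched in a few lines of Python. This is a hedged illustration of the arithmetic only; the function name and the request counts are hypothetical, and in practice the counts would come from your load balancer metrics.

```python
def metric_based_sli(good_events, total_events):
    """Return the percentage of good events, or None if there was no traffic."""
    if total_events == 0:
        return None  # no requests in the window, so the ratio is undefined
    return 100.0 * good_events / total_events

# Hypothetical counts over the SLO window.
total_responses = 1_000_000
responses_5xx = 800

# Good events are the non-5xx responses.
sli = metric_based_sli(total_responses - responses_5xx, total_responses)
```

With these sample counts the SLI comes to roughly 99.92 percent, which would satisfy a 99.9 percent target but not a 99.95 percent one.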

Clicking on an SLO opens up a side panel which displays details of the SLO, such as its status, target value, and remaining error budget. Datadog automatically calculates an error budget for each SLO based on the target and time window you specify; the budget indicates how much unreliability you can afford before your SLO is breached. This is useful for quickly understanding whether you are on track to meet your targets, and whether your development velocity is appropriate for your stated performance and stability goals. For example, a 99 percent SLO target for a seven-day period would give you an error budget of approximately one hour and 41 minutes of substandard performance over that period (1 percent of 168 hours).
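
The error budget arithmetic is simple enough to check by hand: the budget is the window length multiplied by the allowed failure fraction. The sketch below assumes a time-based SLO and uses hypothetical function names; it is not Datadog's implementation. For a 99 percent target over seven days it yields 1 percent of 168 hours, or about 1 hour 41 minutes.

```python
from datetime import timedelta

def error_budget(target_percent, window):
    """Return the total allowed bad time for a target over a time window."""
    return window * (1 - target_percent / 100)

def remaining_budget(target_percent, window, bad_time):
    """Return the budget left after subtracting bad time already observed."""
    return error_budget(target_percent, window) - bad_time

# 99 percent target over a seven-day window: roughly 1 hour 41 minutes.
budget = error_budget(99.0, timedelta(days=7))

# Hypothetical: 30 minutes of substandard performance so far this window.
remaining = remaining_budget(99.0, timedelta(days=7), timedelta(minutes=30))
```

A negative result from `remaining_budget` would mean the SLO has already been breached for that window.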

To track the status of your SLOs in context with detailed data about the relevant services or infrastructure components, you can add SLO widgets to your Datadog dashboards. You can then share your dashboards internally or externally to communicate the real-time status of your SLOs to anyone who depends on your service.

With an SLO widget, you can also visualize how often an SLO’s threshold has been breached over common SLO baselines such as the previous week, previous month, week to date, or month to date. If you set your target to be 99 percent over the past 30 days and a warning target of 99.5 percent, your SLO status will be displayed in green while it is above 99.5 percent, yellow when it dips below 99.5 percent, and red when it falls below 99 percent.
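
The color rule described above amounts to a pair of threshold comparisons. The following is a minimal sketch with a hypothetical function name; treating a status exactly equal to the warning target as green is an assumption, since the text does not say how boundary values are handled.

```python
def status_color(status_percent, target=99.0, warning=99.5):
    """Map an SLO status percentage to a display color."""
    if status_percent < target:
        return "red"     # the SLO target itself is breached
    if status_percent < warning:
        return "yellow"  # target still met, but below the warning level
    return "green"       # at or above the warning level (boundary is an assumption)

colors = [status_color(s) for s in (99.7, 99.2, 98.9)]
```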

A monitor-based SLO for the response latency of our payments endpoint, grouped by availability zone

In the SLO side panel, you can not only visualize the overall status of your SLOs, but you can also see at a glance how different segments of your infrastructure are contributing to performance. For instance, you can see the status of your SLO for an entire service, and break down the status by specific groups—such as customer cohort or availability zone—to easily isolate localized issues. In the examples here, we’re monitoring the response latency (above) and response success (below) of our store’s payments endpoint, broken down by availability zone, so we can quickly zero in on any issues that arise.

A metric-based SLO for the response success of our payments endpoint, grouped by availability zone
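
To illustrate why a per-group breakdown helps isolate localized issues, the sketch below aggregates good and total request counts per availability zone and flags the zones that fall below the target. The function name, zone names, and counts are all hypothetical sample data, and this is not a description of how Datadog groups SLOs internally.

```python
from collections import defaultdict

def sli_by_group(records):
    """Aggregate (group, good, total) records into a success percentage per group."""
    totals = defaultdict(lambda: [0, 0])
    for group, good, total in records:
        totals[group][0] += good
        totals[group][1] += total
    # Skip groups with no traffic, where the ratio would be undefined.
    return {g: 100.0 * good / total for g, (good, total) in totals.items() if total}

# Hypothetical request counts for a payments endpoint, per availability zone.
records = [
    ("us-east-1a", 9990, 10000),
    ("us-east-1b", 9800, 10000),
    ("us-east-1a", 4995, 5000),
]

per_zone = sli_by_group(records)
breaching = sorted(zone for zone, sli in per_zone.items() if sli < 99.0)
```

In this sample the combined success rate across both zones still looks healthy at first glance, yet one zone on its own is well below a 99 percent target, which is exactly the kind of issue a grouped view surfaces.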

Datadog makes it simple to monitor and manage your SLOs in the same place that you already monitor your applications, infrastructure, user experience, and more. Perhaps just as importantly, Datadog enables you to provide transparency to any stakeholders or users who depend on those SLOs being met. If you aren’t yet using Datadog to monitor the health and performance of your services, you can start with a free trial account.

Head over to the next and final part of the series, where we will share best practices for making the most of your SLOs in Datadog.