Datadog APM Trace Search from Zero to One covers the first part of the advanced techniques we use at Zendesk to plan capacity and keep the performance characteristics of the multi-tenant system at expected levels.

Response times vary across API endpoints and customers as the data set grows rapidly and new products are launched. Even if we can scale today, how sustainable is that in the long term?

In terms of SLIs:

How much traffic can our system handle before falling/failing over?

When the rate of incoming requests increases by an order of magnitude, how does it affect our SLOs?

This article covers practical examples of benchmarking a test system in an isolated, controlled environment, changing only a limited number of parameters at a time. I also tried to eliminate most of the factors that influence response time, such as downstream dependencies and network saturation. I demonstrate what happens to overall response time when the throughput of incoming requests exceeds the available resources and requests start to queue. Visualisations and animated GIFs help in understanding the low-level interactions.
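To make the queueing effect concrete before looking at real benchmarks, here is a minimal sketch of a toy simulation. It is not the test system used in this article: the function name `simulate_fifo`, evenly spaced arrivals, and a fixed service time are all simplifying assumptions chosen for illustration.

```python
import heapq


def simulate_fifo(arrival_rate, service_time, n_requests, workers=1):
    """Toy FIFO queue: evenly spaced arrivals, fixed service time.

    Returns the response time (queueing delay + service time) of each
    request, in seconds. All parameters are hypothetical inputs.
    """
    interval = 1.0 / arrival_rate      # seconds between arrivals
    free_at = [0.0] * workers          # when each worker is next free
    heapq.heapify(free_at)
    response_times = []
    for i in range(n_requests):
        arrival = i * interval
        # Take the worker that becomes free first.
        worker_free = heapq.heappop(free_at)
        # The request waits in the queue if no worker is free yet.
        start = max(arrival, worker_free)
        finish = start + service_time
        heapq.heappush(free_at, finish)
        response_times.append(finish - arrival)
    return response_times


if __name__ == "__main__":
    # One worker with a 100 ms service time can handle 10 requests/s.
    for rate in (5, 10, 20):
        times = simulate_fifo(rate, 0.1, 100)
        print(f"{rate:>3} req/s: first={times[0]:.2f}s last={times[-1]:.2f}s")
```

While the arrival rate stays at or below what the workers can serve, every request sees only the service time. Once it exceeds that capacity, each new request waits longer than the previous one, so response time keeps growing for as long as the overload lasts.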

A Practical Look at Performance Theory by Kavya Joshi covers the theoretical aspect of the Performance Modelling problem very well. I highly recommend watching this video or reading the slide deck before reading further.

Little’s Law (Wikipedia)

The average number of customers in a system (over some interval) is equal to their average arrival rate, multiplied by their average time in the system… The average time in the system is equal to the average time in queue plus the average time it takes to receive service.

The average time it takes to receive service is what we call Service Time, and the average time in the queue is the Queueing Delay. From the user’s perspective, it doesn’t matter whether the bottleneck is Service Time or Queueing Delay: the impact is the same. For us, though, it is very important to understand why a response time SLO is not being met.
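As a quick worked example of the law quoted above, the sketch below multiplies an arrival rate by the time in the system, where the time in the system is Queueing Delay plus Service Time. The function name and the numbers are made up for illustration and are not measurements from our system.

```python
def avg_requests_in_system(arrival_rate, queueing_delay, service_time):
    """Little's Law: L = arrival rate * average time in the system.

    arrival_rate is in requests per second; queueing_delay and
    service_time are averages in seconds.
    """
    time_in_system = queueing_delay + service_time
    return arrival_rate * time_in_system


if __name__ == "__main__":
    # Hypothetical numbers: 100 req/s, 150 ms in queue, 50 ms of service.
    in_flight = avg_requests_in_system(100, 0.150, 0.050)
    print(f"average requests in the system: {in_flight:.1f}")
```

With these hypothetical numbers, three quarters of the response time is spent waiting in the queue rather than being served, which is exactly the distinction that matters when diagnosing a missed SLO.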

Let’s begin with this animated GIF to demonstrate the visualisation tools we use for analysis. The Datadog Agent is required to collect logs from the Nginx container: