Providing a uniform, consistent experience is critical to debugging issues quickly: getting the necessary information should be low friction and familiar. One way to accomplish this is to have each service provide a uniform, standardized view into its operation. An aggregate view can be used to characterize the overall health of a system by answering three questions: How much work is the system doing? What’s the result of that work? How long is the work taking? I think it’s critically important to find a convention for service dashboards in order to facilitate quick responses and uniform views into our systems, and this is a convention that I personally like and have had success with.

Why views?

During incidents, time is critical because the client experience, money, and reputation are at stake. Uniform views into a system save time and promote client-centric debugging. If views were OOP primitives they would be interfaces. Views should be parametrizable (by service, environment, etc.) and show the same structure of metrics. While parametrizable dashboards are ideal for reusability and uniformity, they are not always achievable; in that case, views are conventions. Each service dashboard should expose these aggregates in order to create a more uniform experience. Anyone who understands how to interpret and correlate these aggregate signals will be able to do so for any service.
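To make the interface analogy concrete, here is a minimal Python sketch of a view parametrized by service and environment. The names `ViewParams`, `Panel`, `View`, and `AggregateView`, along with the metric names, are hypothetical and chosen only for illustration; they don't correspond to any particular dashboarding tool.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ViewParams:
    """Parameters a view is instantiated with (hypothetical names)."""
    service: str
    environment: str


@dataclass(frozen=True)
class Panel:
    """One dashboard panel: a title plus the metric and labels it plots."""
    title: str
    metric: str
    labels: dict[str, str]


class View(Protocol):
    """The 'interface': a view returns the same panels for any service."""

    def panels(self, params: ViewParams) -> list[Panel]: ...


class AggregateView:
    """Satisfies View structurally; metric names are illustrative assumptions."""

    def panels(self, params: ViewParams) -> list[Panel]:
        labels = {"service": params.service, "environment": params.environment}
        return [
            Panel("Throughput", "requests_total", labels),
            Panel("Availability", "requests_succeeded_ratio", labels),
            Panel("Latency", "request_duration_ms", labels),
        ]
```

Pointing the same view at `ViewParams("checkout", "prod")` or `ViewParams("search", "staging")` yields panels with the same titles in the same order; only the labels differ. That is what lets a reader carry their intuition from one service to the next.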

There are many different views a service might benefit from having:

Aggregate View (discussed here)

SLO View: visualize the SLOs for a given service

Component View: language runtime stats (GC, event loop, ticks, OS threads; for Go: goroutines, heap, GC) and client libraries

Service View: service-specific metrics, queue depths, branch rates, implementation stats, etc.

System View: resources of the system executing the service (CPU, disk, memory, network); these could be Docker container resources or virtual instance resources

Resources View: load balancers, external queues, etc., which could also be expressed as their own Service Views

Views are consistent windows into a particular dimension of an application. Full visibility into a system comes from the sum of its views. Views also closely match levels of abstraction. It’s often helpful to choose a single level of abstraction and view the system in terms of that. Sometimes, though, it is important to correlate events across levels of abstraction. Consider an event where client latencies are increasing: correlating this with machine network latencies or disk latencies may be helpful. Views are consistent, uniform, and standardized. They are a primitive for viewing a specific dimension of an application.
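In practice this kind of correlation is usually done by eye, with panels lined up on a shared time axis. As a rough sketch of the same idea in code, the snippet below computes a Pearson correlation between two equally sampled series. The `pearson` helper and the sample numbers are made up for illustration, and a high coefficient only suggests where to look next; it doesn't establish a cause.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally sampled series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up samples, one per minute, in milliseconds.
client_latency = [100.0, 110.0, 150.0, 300.0, 320.0]   # aggregate view
disk_latency = [5.0, 6.0, 9.0, 20.0, 22.0]             # system view
network_latency = [2.1, 1.9, 2.0, 2.1, 1.9]            # system view

disk_r = pearson(client_latency, disk_latency)
network_r = pearson(client_latency, network_latency)
```

With these samples the disk series tracks client latency closely while the network series does not, which would point the investigation toward disk first.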

So what’s an Aggregate View?

An aggregate view is a top-level view into the system and provides a starting point for all debugging. Aggregate views help determine whether there is currently a client problem and whether an incident is online or offline. Aggregate views should be the uppermost sections of dashboards, containing 3 of the 4 golden signals: throughput, availability, and latency.
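To show how the three signals relate to raw request data, here is a minimal Python sketch that derives them from a window of request records. The `Request` record, the `aggregate` function, and the choice of a nearest-rank 99th percentile for latency are my assumptions for illustration; a real dashboard would normally compute these in the metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One completed request (hypothetical record shape)."""
    ok: bool
    duration_ms: float


def aggregate(requests: list[Request], window_seconds: float,
              percentile: int = 99) -> dict[str, float]:
    """Summarize a window of requests as throughput, availability, latency."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value with at least p% of samples
    # at or below it.
    rank = max(1, math.ceil(percentile * len(durations) / 100))
    return {
        # How much work is the system doing?
        "throughput_rps": len(requests) / window_seconds,
        # What is the result of the work?
        "availability": sum(r.ok for r in requests) / len(requests),
        # How long is the work taking?
        "latency_ms": durations[rank - 1],
    }


# Made-up window: ten requests over five seconds, one of them failed.
window = [Request(ok=(i != 9), duration_ms=10.0 * (i + 1)) for i in range(10)]
summary = aggregate(window, window_seconds=5.0)
```

Each key answers one of the three questions from the introduction, so the same summary shape can sit at the top of every service dashboard.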