Incident response (IR) by human responders requires context about the systems and services that are encountering issues. Getting this context becomes harder as an organization and its number of services grow. Often the incident responder bears the burden of forming complex mental models of the systems they are trying to assess. As an industry, our incident response tools aren’t keeping pace with our microservice architectures. A new class of tool is required to keep track of essential services in their business context. LinkedIn’s Third Eye proves that these tools are technically possible, but they carry a huge implementation overhead. This post proposes a simple, low-friction approach to centralizing critical events related to services (such as deploys), which reduces the burden on the IR engineer, reduces MTTR, and makes querying complex system data and event state trivial.

An Example

Imagine that you’re on call for a service and a PagerDuty alert fires: increased latency on all requests for the service! What do you do? What information do you need to begin understanding the system, its current state, and how it got there? One of the most common heuristics in this situation is to find the last deploy.

In common IR models, engineers may have to look at Slack or Jenkins to determine this information. Build information may also be stored in a metrics system (like Datadog) as events, which can be overlaid onto time series.

While this is extremely valuable, it is not dependency aware and requires statically defining events (no dynamic queries), meaning there is no way to model more than a single-degree relationship using events and overlays. In comparison, IR Knowledge Graphs enable engineers to dynamically see events across their service and all of its dependencies, which gives the on-call IR engineer rich context around incidents and their potential influences. IR Knowledge Graphs can provide a rich picture of the current, real-time state of the system and the events that contributed to that state. This differs significantly from overlaying known events on a graph!
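To make the contrast with static overlays concrete, here is a minimal sketch of a dependency-aware event query. Everything in it is an assumption for illustration: the service names, the `DEPENDS_ON` map, the `Event` record, and the in-memory lists stand in for whatever graph store and event feed an organization actually uses.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    service: str
    kind: str        # e.g. "deploy" or "config_change"
    at: datetime
    detail: str


# Hypothetical dependency edges: service -> services it calls.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "cart": [],
    "ledger": [],
    "search": [],
}


def transitive_dependencies(service, depends_on):
    """Breadth-first walk: the service plus everything it depends on."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def recent_events(service, depends_on, events, since):
    """Events on the service or any of its dependencies, newest first."""
    scope = transitive_dependencies(service, depends_on)
    hits = [e for e in events if e.service in scope and e.at >= since]
    return sorted(hits, key=lambda e: e.at, reverse=True)


alert_time = datetime(2020, 1, 1, 12, 0)
EVENTS = [
    Event("ledger", "deploy", alert_time - timedelta(minutes=10), "build 42"),
    Event("search", "deploy", alert_time - timedelta(minutes=5), "build 7"),
    Event("checkout", "deploy", alert_time - timedelta(hours=30), "build 311"),
    Event("cart", "config_change", alert_time - timedelta(minutes=45), "cache ttl"),
]

# What changed in the hour before the alert, anywhere checkout depends on?
for event in recent_events("checkout", DEPENDS_ON, EVENTS, alert_time - timedelta(hours=1)):
    print(event.at.isoformat(), event.service, event.kind, event.detail)
```

In this sketch the `ledger` deploy surfaces even though `checkout` only reaches it through `payments`, while the unrelated `search` deploy is left out. That two-hop relationship is the kind a single-service overlay cannot express.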

So What is an Incident Response (IR) Knowledge Graph?

An IR Knowledge Graph (not to be confused with Google’s Knowledge Graph) stores structured, graph-based data and makes it easy to query that data and its relationships. It provides a centralized location to query the current system state and the events that led to that state. It is a graph in the traditional sense, and it also incorporates the information that is critical to developing an understanding of the system along a dimension of time.

Storing the system state and its associated events over time supports understanding which events led to the current state, and tracing the causal events that affect state over time. As the system changes state, it’s critical to have as much context around those changes available as possible. In the current state of the industry, state changes are fractured across multiple tools, teams, and mental models (Jenkins, Jira, GitHub, Slack, PagerDuty, and so on). Having so many disparate stores fractures system understanding, which handicaps human incident responders, discourages automated response, and prolongs incidents. In contrast, IR Knowledge Graphs centralize information from disparate sources to provide a sane view into system state, while maintaining system structure and exposing it for ad hoc analysis.
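As a rough sketch of what centralizing state changes along time could look like, the example below normalizes events from two tools into one record type and replays them to answer "what was the state at time T, and which event put it there?". The payload shapes, field names, and the `Timeline` class are hypothetical. They are not the real Jenkins or PagerDuty schemas, and a real implementation would sit on a graph database rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    at: datetime
    service: str
    source: str   # which tool reported the change
    key: str      # which piece of state changed
    value: str


def from_ci(payload):
    # Hypothetical CI payload shape, not a real Jenkins schema.
    return Change(datetime.fromisoformat(payload["finished"]),
                  payload["job"], "ci", "version", payload["build"])


def from_pager(payload):
    # Hypothetical paging payload shape, not a real PagerDuty schema.
    return Change(datetime.fromisoformat(payload["ts"]),
                  payload["svc"], "pager", "alert", payload["status"])


class Timeline:
    """One central, time-ordered store for changes from every tool."""

    def __init__(self):
        self._changes = []

    def record(self, change):
        self._changes.append(change)
        self._changes.sort(key=lambda c: c.at)  # tolerate out-of-order arrival

    def state_at(self, service, when):
        """Replay changes up to `when`; return the state and the change behind each key."""
        state, cause = {}, {}
        for c in self._changes:
            if c.service == service and c.at <= when:
                state[c.key] = c.value
                cause[c.key] = c
        return state, cause


timeline = Timeline()
timeline.record(from_ci({"job": "checkout", "build": "311", "finished": "2020-01-01T09:00:00"}))
timeline.record(from_pager({"svc": "checkout", "status": "triggered", "ts": "2020-01-01T12:00:00"}))
timeline.record(from_ci({"job": "checkout", "build": "312", "finished": "2020-01-01T11:50:00"}))

state, cause = timeline.state_at("checkout", datetime(2020, 1, 1, 12, 5))
print(state)
print(cause["version"].source, cause["version"].at.isoformat())
```

Keeping the raw changes and deriving state by replay, rather than storing only the latest state, is one way to preserve the link between a state and the events that produced it. The same query can then be asked for any point in the past.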