Support Engineer: Hi, welcome to Red Hat support, how can we help?

Caller: Our web server stopped responding and we had to reboot to restore it. We need to find the root cause.

Support Engineer: Sure, was anything changed recently on this server?

[...]

The above is an example of a typical root cause analysis (RCA) request Red Hat support receives. RCA methodology has been used systematically for the past 30 or so years to help the IT industry find the origin of problems and, ultimately, how to fix them. In this blog post I argue - I'm surely not the first and won't be the last - that the current RCA process is not suitable for the future of the IT industry, and that a different approach is needed.

The origin

RCA can be traced all the way back to Newton's third law of motion: "For every action, there is an equal and opposite reaction." In the modern age, though, it is linked primarily to Toyota's 5-whys process, developed back in 1958, which requires asking a five-level-deep "why?"; the answers to the whys should eventually reveal the root of all evil, AKA the root cause. The example provided on this Wikipedia page is straightforward:

The vehicle will not start. (the problem)
Why? - The battery is dead. (First why)
Why? - The alternator is not functioning. (Second why)
Why? - The alternator belt has broken. (Third why)
Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

The RCA process for the IT world is no different:

The web server is not responding. (the problem)
Why? - The process was killed. (First why)
Why? - The kernel killed it. (Second why)
Why? - The server was running out of memory. (Third why)
Why? - Too many child processes were spawned and consumed all memory and swap. (Fourth why)
Why? - "MaxRequestsPerChild" was set to zero, which stopped recycling of child processes. (Fifth why, a root cause)
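For illustration, the fix for that particular root cause is a one-line Apache httpd configuration change (the recycle threshold below is an arbitrary example, not a recommendation):

```apache
# A value of 0 disables child-process recycling, letting leaked memory
# accumulate indefinitely. A finite value makes httpd replace each child
# after it has served that many requests, bounding slow memory leaks.
MaxRequestsPerChild 10000
```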

The problem(s)

Let’s check some of the reasons why the current RCA process won’t be the best fit for modern IT systems.

Binary state of mind

RCA implies that there are two states for our systems: working and not working. This might be true for a monolithic legacy system like the web server example above, which is either serving our web pages properly or not, but a modern microservices system is far more complicated.

This complexity makes it more likely that the system keeps operating, somehow, with a broken component; as Dr. Richard Cook puts it in his paper "How Complex Systems Fail," "complex systems run in degraded mode."

And it comes down to all the talented DevOps engineers, self-healing, load-balancing and failover mechanisms we have built throughout the years to keep those systems up and running. Reaching a failed state in a modern microservices system requires multiple failures to align.

For example, consider the following scenario. In a CI/CD OpenShift environment, poorly tested functionality in an application is pushed to production OpenShift pods. The application, and that functionality in particular, receives a large volume of traffic due to the holiday season. Slow writes to a busy SAN storage array lead to increasing CPU load, which triggers autoscaling of pods. Finally, autoscaling hits the namespace's underestimated resource quota; the cluster cannot scale any further, and the website becomes unresponsive to its visitors.
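The last domino in that scenario is a namespace quota along the lines of the following sketch (the names and limits are made up for illustration), which caps how far the autoscaler can go:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: webshop-quota        # hypothetical quota for a hypothetical namespace
  namespace: webshop
spec:
  hard:
    pods: "20"               # autoscaling silently stops once 20 pods exist
    requests.cpu: "8"
    requests.memory: 16Gi
```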

What would be the root cause here? The poorly written functionality, the busy SAN storage, the resource quotas, or all of them combined?

You might think this is an overcomplicated scenario, but if you read some post-mortems, like those of Monzo bank, Slack and Amazon, you will see this is not science fiction at all.

What You Look For Is What You Find

Or WYLFIWYF, as resilience engineers call it, is the principle that an RCA usually finds what it looks for. The assumptions made about the nature of the incident guide the analysis, and this predetermination sometimes hinders the ability to address secondary factors contributing to the outage.

Systems are static

RCA is like a witch hunt for the "change" that caused the system to fail. This might work with a monolithic legacy system, where changes were scarce and infrequent, but the whole microservices model we are moving to is about rapidly changing, loosely coupled components.

The domino effect

In simple linear systems - think of your favorite three-tier application - malfunctions and their causes were perceived with a "domino effect" type of thinking. The IT industry is moving to a nonlinear - microservices - model, where failures are instead governed by the resonance and amplitude of failures. That is to say, the alignment of various components' failures, and their magnitude, will be the culprit behind major incidents.

What’s next

Hopefully, by now you're convinced that the current RCA process is not ideal for complex microservices systems. Adapting to these new system models requires changes on different levels. I will try to list here what I think might be helpful.

Learning

Invest in training DevOps engineers to understand the microservices system inside out. During or after a major incident you don’t want to hear “we have no clue where to start looking.” Microservices systems are complex by nature and require deep understanding.

Propagate the learning from an incident. Build an incident database accessible to the wider organization. Don’t rely on emails and newsletters.

Adapting

Adopt a post-incident review process rather than an RCA process. A post-incident review aims to keep a record of an incident's timeline, its impact, and the actions taken, and to provide the context of contributing factors - most major incidents in microservices systems will have multiple, none more important than the others.

Avoid a finger-pointing culture; instead, adopt a blameless post-incident review process that encourages reporting errors, even human errors. You can't "fix" people, but you can fix systems and processes to better support people in making the right decisions.

Introduce chaos engineering principles to your workflow. Tools such as Netflix's Chaos Monkey, or kube-monkey for containers, help you validate the development of failure-resilient services and build confidence in your microservices system.
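For instance, kube-monkey is opt-in per workload via labels - a sketch based on kube-monkey's documented label scheme, with example names and values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                      # hypothetical workload
  labels:
    kube-monkey/enabled: enabled      # opt this workload in to chaos testing
    kube-monkey/identifier: frontend  # must also appear on the pod template
    kube-monkey/mtbf: "2"             # mean time between kills, in days
    kube-monkey/kill-mode: fixed
    kube-monkey/kill-value: "1"       # kill one pod per scheduled run
```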

A shameless plug: acquiring TAM services helps with progressing microservices adoption, transferring knowledge, and handling incident analysis, especially when multiple vendors are involved.

Monitoring

Microservices systems have different, and more intensive, monitoring requirements than monolithic systems, and call for correlating data from different sources. Prometheus and Grafana make a great combination of data aggregation and visualization tools.
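For instance, assuming a service exposes a request counter such as http_requests_total with a status-code label (the metric and label names are assumptions; any labeled counter works the same way), Prometheus can express a service's failure ratio directly:

```promql
# Fraction of requests over the last 5 minutes that returned a 5xx status.
sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```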

Use monitoring dashboards providing both business and system metrics.

Sometimes the signal-to-noise ratio in microservices metrics makes it hard to decide what to monitor. Here are some pointers that can help with anticipating or analyzing a problem:

Success/failure ratio for each service.

Average response time for each service endpoint.

Average execution time for the slowest 10% of requests.

Average execution time for the fastest 10% of requests.
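As a minimal sketch, assuming request records that carry an endpoint, a duration, and a success flag (a hypothetical data shape - real systems would pull this from access logs or a metrics pipeline), these four signals can be computed directly:

```python
from collections import defaultdict

def summarize(requests):
    """Compute per-service health signals from raw request records.

    `requests` is a list of (endpoint, duration_seconds, succeeded) tuples.
    Returns (success_ratio, avg_time_by_endpoint, slowest_10pct_avg,
    fastest_10pct_avg).
    """
    by_endpoint = defaultdict(list)
    successes = 0
    for endpoint, duration, ok in requests:
        by_endpoint[endpoint].append(duration)
        successes += bool(ok)

    # Success/failure ratio across all requests.
    success_ratio = successes / len(requests)

    # Average response time for each service endpoint.
    avg_by_endpoint = {ep: sum(d) / len(d) for ep, d in by_endpoint.items()}

    # Average execution time of the slowest / fastest 10% of requests.
    durations = sorted(d for _, d, _ in requests)
    tail = max(1, len(durations) // 10)
    slowest_avg = sum(durations[-tail:]) / tail
    fastest_avg = sum(durations[:tail]) / tail

    return success_ratio, avg_by_endpoint, slowest_avg, fastest_avg
```

In a real deployment these aggregates would be computed continuously by the monitoring stack rather than in batch, but the definitions are the same.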

Map monitoring functions to the organizational structure by reflecting the microservices' structure in the teams monitoring them. That means smaller, loosely coupled teams with autonomy, yet still focused on the organization's strategic objectives.

Responding

Establish a regular cadence of incident review meetings to work through incident review reports, close out any ongoing discussions and comments, capture ideas, and finalize each report's state.

Sometimes, political pressures push for producing an RCA as early as possible. In a complex microservices system, such an RCA might address only the symptoms of the real problems. Take adequate time to absorb, reflect, and act after an incident.

Final thoughts

The IT industry is going through a paradigm shift from monolithic to microservices systems. Microservices provide huge benefits, allowing rapid software development and decreasing time to market (TTM), but they also require a shift in processes and mindsets, as we have seen. Adapting to microservices' needs is crucial, and delaying your adaptation will only increase your technical debt; if there is one thing we have learned over the past years, it is that technical debt finds a way to be repaid, with interest.

That being said, I want to leave you with what led me to write this blog post: a wonderful research paper titled "How Complex Systems Fail" by Dr. Richard I. Cook, MD, and his presentation from the Velocity conference.

Ahmed Nazmy is a Senior Technical Account Manager (TAM) in the EMEA region. He has expertise in various industry domains like Security, Automation and Scalability. Ahmed has been a Linux geek since late 90s, having spent time in the webhosting industry prior to joining Red Hat. Find more posts by Ahmed at https://www.redhat.com/en/blog/authors/ahmed-nazmy

Header image provided courtesy of Kurt:S via Flickr under the Attribution 2.0 Generic (CC BY 2.0) Creative Commons license.