Jeramiah Dooley, Sr. Cloud Advocate at Microsoft, gave an insightful talk at Microsoft Ignite that everyone can learn from, whether you’re in the trenches handling the outage or overseeing the whole team and process. Coming from personal experience on the Operations side, I knew how important it was for Production to always be available, but that obviously isn’t always possible. Incidents happen, plain and simple, but I really didn’t enjoy the following while in Ops:

Troubleshooting an outage on multiple bridges at 2 am
The blame game that would ensue via the RCA

Jeramiah mentions that post-mortems end up being a chore, or something forced upon people, rather than providing value, especially when the same incidents happen over and over and we’re not learning anything from the process. This is where we need to take a step back and look at how we are performing our post-incident review. In many cases, incidents don’t fall neatly into a specific checkbox or category for the root cause analysis.

A good example he brought up was a specific aircraft in WW2 that was involved in a significant number of identical accidents. The aircraft would land successfully, but as the plane was taxiing, the landing gear would retract without warning. Obviously, this was a big issue, since every plane was needed during the war.