
Abstract (TL;DR): Cloud-native applications are a type of complex system that depends on the continuous effort of the professionals who, every day, combine the best of their expertise to keep them running. In other words, their reliability doesn’t come from the systems themselves but results from the interactions of all the different actors engaged in their design, build, and operation. Over the years, the collection of those interactions has evolved together with the systems they were designed to maintain, which have themselves become increasingly sophisticated and complex. The IT service management model, once designed to maintain control and stability, is now fading and giving way to a model designed to improve velocity while maintaining stability. Although the combination of those goals might seem contradictory at first, this series of articles tries to reveal the reasons why the collection of practices that today we know as DevOps and SRE (Site Reliability Engineering) is becoming the norm for modern systems.

Table of Contents:

Chapter 1 — When innovation becomes mainstream (released: 12/09/2019)
Chapter 2 — How to cope with complexity (this document)
Chapter 3 — Models for cultural change (release: 12/23/2019)
Chapter 4 — How innovation becomes mainstream (coming soon)
Chapter 5 — Accelerate: The Science of Lean Software and DevOps (coming soon)
Chapter 6 — Signals of change (coming soon)
to be continued…

Chapter 2: How to cope with complexity

Complexity is everywhere, and there’s no indication that the number of system features will shrink or that users’ expectations will become less demanding in the future. As a new approach to managing systems complexity is required, the field of systems theory emerges as a reasonable one, as it has in the past for other industries (e.g., aviation safety).

Systems theory is the interdisciplinary study of systems, which are, in their turn, cohesive conglomerations of interrelated and interdependent parts. The systems that are the subject of this article are the engineering systems assembled to automate and improve several aspects of our lives. Whether we are aware of it or not, we constantly rely on a number of them. It could be when we withdraw money from an ATM or when we do our online shopping on Amazon; the fact is that technology has become so pervasive that, sometimes unintentionally, while simply living our lives, we’re likely to be interacting with multiple parts that together form the so-called IT system.

Such a system comprises thousands, if not millions, of interdependent parts that interact with each other in ways so unique that they end up revealing synergies and emergent properties not seen in any of the parts individually. IT systems are a great example of a complex adaptive system, where a complete understanding of each of the parts won’t create a perfect understanding of the whole.
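To get a feel for why the whole exceeds any single view of the parts, consider a rough back-of-the-envelope sketch (the numbers are illustrative, not from the article): even counting only the possible pairwise interactions among n parts, the total grows as n(n − 1)/2, and real systems also exhibit higher-order interactions among triples, quadruples, and so on.

```python
from math import comb

# Back-of-the-envelope: the number of possible pairwise interactions
# among n parts is C(n, 2) = n * (n - 1) / 2. Higher-order interactions
# (triples, quadruples, ...) grow even faster, which is why exhaustively
# predicting every interaction in a large system is infeasible.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} parts -> {comb(n, 2):>12,} possible pairwise interactions")
```

Already at ten thousand parts there are roughly fifty million pairs to reason about, which is why no single agent can hold an accurate model of the whole.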

Figure 1: Systems are so complex that it is virtually impossible to predict all the interactions between their parts.

As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases rapidly (Woods’ Theorem)

The mindset of my generation, although I recognize we might be reaching an inflection point soon, was largely influenced by the industrial-age way of thinking enabled by Taylor’s scientific management. It is psychologically comfortable to believe that if we invest in understanding all the intrinsic aspects of all the parts that compose our systems, we will, in the end, be able to understand the whole and, with that, predict its behavior perfectly.

Figure 2: Traditional analytical decomposition won’t work on complex systems, where an event isn’t simply the direct consequence of its predecessor.

This is not the case for complex systems. We must change the way we manage them: stop seeking a perfectly controlled system and instead start finding ways to observe the properties that emerge from the very fast and dynamic interactions between the parts, both human and non-human. One important aspect to consider is the ever-changing nature of those systems. Observing the synergies and emergent properties must be a continuum, as things may… and will change. This is the very description of complexity.

Anything that is complex will necessarily be changing. Anything that is continuously changing will necessarily be complex. They’re synonyms. (Richard Cook)

Back in 2017, a group of organizations at the leading edge of technology, including IBM, Etsy, IEX, and Ohio State University, formed a workgroup called the SNAFUcatchers consortium. The group reviewed postmortems of major technology-related incidents with the objective of developing a better understanding of how engineers cope with the complexity of their systems. The conclusion was that current internet-facing technology platforms are very prone to brittle failures and that, without the continuous efforts of the engineers who keep them running, some of them would stop working within days and most within a year. Although the report reveals some themes regarding the factors that produce resilient performance, we still know little about how engineers accomplish this vital work of keeping technology platforms operational, and even less about how to support them better in doing it.
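The consortium’s finding that platforms decay without continuous human effort can be made intuitive with a toy availability model (the 99.9% figure below is an assumption chosen for illustration, not a number from the report): if a platform depends on n components that must all work, and each is independently healthy 99.9% of the time, the chance that the whole system is up is 0.999^n, which erodes quickly as n grows.

```python
# Toy model: whole-system availability when n independent components
# must all be healthy, each available 99.9% of the time. The steep
# decay hints at why large platforms need constant adaptive work by
# engineers rather than a one-time "perfect" design.
for n in (10, 100, 1_000):
    availability = 0.999 ** n
    print(f"{n:>5} components -> {availability:.1%} whole-system availability")
```

Real components are neither independent nor uniformly reliable, so this sketch understates some risks and overstates others; its only point is that naive composition of many parts is fragile.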

The Stella Report is available at http://stella.report/ and offers very important insights on these themes and on how they represent clear opportunities for improvement in the field. The second round of the consortium is underway; it focuses on the cost of coordination and has IBM, KeyBank, New Relic, and Salesforce as R&D partners. Details are available at https://www.snafucatchers.com/projects.

Key takeaway: IT systems are increasingly complex organisms; they’ve become so complex that it is impossible to anticipate all the interactions between their parts (the unknown unknowns). Recognizing that trying to develop a perfect understanding of the individual parts is in vain, we should invest in a broader approach to systems control and apply our collective expertise to understanding the system as a whole and, most importantly, the emergent properties that will continually arise from the dynamic relationships between its elements. We still know little about the most effective practices that keep complex systems operational. We know that this is the result of the continuous efforts of the engineers who keep these systems available, and that themes like observability, collaboration, and blameless incident reviews, as well as techniques like STAMP (System-Theoretic Accident Model and Processes), are evolving very rapidly and will certainly become a great source of help going forward.

Follow-up:

Chapter 3 — Models for cultural change (release: 12/16/2019)
Chapter 4 — How innovation becomes mainstream (coming soon)
Chapter 5 — Accelerate: The Science of Lean Software and DevOps (coming soon)
Chapter 6 — Signals of change (coming soon)
to be continued…


Published By

Ricardo Coelho is a CTO at IBM specialized in technology transformation in banking and financial services through the adoption of Lean, Agile, Cloud, DevOps, SRE, and Microservices. He has 25 years of experience and is currently engaged in helping customers understand that, to take advantage of the Cloud, they need to embrace a new way of working, one that leverages new engineering practices and is focused on collaboration and continuous learning. Connect with him on https://www.linkedin.com/in/rcsousa1/ and https://twitter.com/ricardo_c_sousa

Originally published at https://www.linkedin.com