We still live in an era where most computer behavior is dictated by human-written code.

One of the biggest errors one can make is to assume that such code is bulletproof, based on myths that people take for granted, e.g. “it works on my machine”, “it has 100% test coverage”, and so on.

Is it still working?

Assuming that code which ran once will work forever leaves us unaware of unexpected behaviour. To some extent, this should already be covered by current testing methodologies, considering that code already fails some of the time.

A quick Google search for “software bug” demonstrates how badly we fail at proving the quality of our code, from low-level bugs such as Meltdown up to self-driving car issues.

Writing code for professional usage presents a very overwhelming challenge:

How does one ensure one’s code works the way it was designed to?

Innumerable answers to that question have emerged over the course of software development history, from out-of-the-box tools to philosophies on how we should ensure the quality of our systems.

Most software teams today will say they ensure quality by having QA engineers, 100% coverage on their test suite, TDD, BDD, smoke tests, … any of those silver-bullet buzzword methodologies.

While some of them do help improve quality to some extent, most are good for creating software but not great for ensuring quality over time. There are plenty of articles demystifying coverage and other testing practices.

Bottleneck

Writing integration tests across multiple distributed services, databases, third-party APIs, environments, browsers, operating systems, device versions, screen sizes, … covering a huge number of scenarios creates a big operational bottleneck. And we haven’t even talked about performance yet (speed, scalability, …).
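To get a feel for why this is a bottleneck, note that these dimensions multiply. A quick sketch, with entirely illustrative numbers (none of these counts come from a real project):

```python
from math import prod

# Hypothetical dimensions of a cross-cutting test matrix.
# The counts below are made up purely for illustration.
dimensions = {
    "browsers": 4,
    "operating_systems": 3,
    "device_versions": 6,
    "screen_sizes": 5,
    "environments": 3,
}

# Every scenario would need to run once per combination.
combinations = prod(dimensions.values())
print(combinations)  # 4 * 3 * 6 * 5 * 3 = 1080 runs per scenario
```

Add one more dimension, or a few more values to an existing one, and the matrix grows multiplicatively, not additively.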

Sure, one can spend a lot of human effort implementing test cases for all of that, but at what cost? Both creating and running those tests directly impact time to market.

Besides, browsers change, devices change, servers change… there is just too much that can change; errors will happen.

In the end, our goal is to find ways to detect anomalies or unexpected behaviour in our code over time.

Perspective is everything

Just as we have started using software development practices to manage infrastructure resources (IaC), couldn’t we draw inspiration from other practices or areas to improve our awareness of unexpected behaviour in our systems?

Let us define what we need

Consider that one has run one’s code at least once, in a single environment, and it proved correct. A bug is then nothing but unexpected behaviour of one’s code that happens over time under a certain condition or group of conditions.

What we want is to detect when and under which conditions such a thing can happen. But a bug doesn’t always present itself in the company of a clear error or exception.

Since we don’t have a clear event signalling the bug, we need to observe the state of our system at all times, so that we have a baseline for how it operates under normal conditions. That way we can assume that anomalies in the external state are probably related to bugs or unexpected behaviour recently introduced.

If we think about it, we’ve seen this somewhere else… yes, that’s basically monitoring, but with a goal-related focus: not limited to monitoring the infrastructure, but extended to the behaviour of its internal components.

Happily, that is quite similar to a known concept from electrical systems called observability, which is part of control theory.

Control theory

Control theory in control systems engineering deals with the control of continuously operating dynamical systems in engineered processes and machines. The objective is to develop a control model for controlling such systems using a control action in an optimum manner without delay or overshoot and ensuring control stability. — Wikipedia

Note that, at this point, we only care about one specific part of the control theory, Observability:

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs — Wikipedia
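For the curious, this definition has a precise form for linear systems (not needed for the rest of the article, but it shows the idea is rigorous, not hand-wavy):

```latex
% Linear time-invariant system: internal state x, external output y
\dot{x} = A x, \qquad y = C x

% The system is observable iff the observability matrix has full rank:
\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{\,n-1}
\end{bmatrix},
\qquad
\operatorname{rank}(\mathcal{O}) = n
```

In words: if the observability matrix has full rank, the internal state can be fully reconstructed from the outputs alone, which is exactly the property we want our software systems to have.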

It is quite self-explanatory. Both control theory and observability might seem complex, but they’re actually quite simple. Basically, it implies that a system should be aware of its external state in order to validate its integrity.

If we consider the external state to be the measure of success when accomplishing an interaction with your system, the internal state to be all the layers of your code/infrastructure that interaction touches, and integrity to be the analysis of all those states and the conditions that define normal operation, then we can define an observable system.
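One way to read that definition in code: record, per interaction, whether it accomplished its goal (external state) and which layers it touched (internal state), then derive an integrity measure from the records. A minimal sketch with hypothetical interaction and layer names:

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """One externally observable interaction with the system."""
    name: str
    succeeded: bool                              # external state: did it accomplish its goal?
    layers: list = field(default_factory=list)   # internal state: layers it touched
    duration_ms: float = 0.0

def success_rate(interactions):
    """Integrity measure: fraction of interactions that succeeded."""
    if not interactions:
        return 1.0
    return sum(i.succeeded for i in interactions) / len(interactions)

# Illustrative log of a hypothetical "checkout" interaction.
log = [
    Interaction("checkout", True, ["api", "payments-db"], 120.5),
    Interaction("checkout", True, ["api", "payments-db"], 98.0),
    Interaction("checkout", False, ["api"], 5000.0),
]

print(success_rate(log))  # 2 of 3 succeeded
```

Note that the failed interaction never reached the `payments-db` layer: correlating the external failure with the internal layers touched is what makes the system observable rather than merely monitored.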