Brian Kelley

Building Resilient Microservices from the Fallacies of Distributed Computing

Microservices are the latest hot trend in software architecture, and with good reason. They create a path to Continuous Deployment in cloud-native environments, giving organizations increased business velocity and flexibility. However, that speed can lead engineers to focus too much on the business features they can now churn out more easily, and to forget that their new, highly distributed system could in fact be more prone to failure than a similarly scoped monolith. Resilience is an oft-overlooked concern in microservices development.

A wish to make systems more resilient was at the heart of the “Fallacies of Distributed Computing”, originally penned by L. Peter Deutsch in 1994 when he was at Sun Microsystems, and augmented by a few others since then. The fallacies read as follows:

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

For many years, software engineers were freed from having to worry about many of these, thanks to wholly separate teams of operational staff doing much of the worrying for them. Those operational teams would keep their in-house production networks and servers healthy and running with sufficient bandwidth and resources. Especially for monolithic applications running at moderate scale, a set of deployment and scalability patterns arose that made the operational work fairly predictable. That left developers to worry about other things, such as the technical debt in their enormous monolithic code bases.

In recent years, however, that has all changed. Developers are now building microservices to create hundreds, and sometimes thousands, of small, interconnected software components, very often using different technology stacks. Having orders of magnitude more components in a system requires much more sophistication if they are to remain up and running as reliably as a simple monolith would.

Operational teams once tasked with basic maintenance tasks and simple upgrades are now being transformed into DevOps groups or Infrastructure teams with a much more complex set of problems to solve: zero-downtime upgrades, blue/green deployments, canary testing, constant technology shifts, and more. Engineers focused on microservices can no longer be isolated from these concerns, because the DevOps folk can’t do it all on their own. Engineers can start using some of the fallacies to help guide their designs and coding in such a way that resilience becomes a natural outcome. Let’s focus on some of the fallacies most relevant to building resilient, scalable microservices.

The network is reliable

Defensive Programming is a well-regarded technique in software engineering. For example, a defensive engineer will try to prevent divide-by-zero calculations, or trap exceptions thrown when trying to open a non-existent file. But those are obvious examples with simple, deterministic, grokkable reasons for failure.

Network failures, and the many ways they can manifest themselves, are less obvious, and engineers are less commonly defensive when making remote calls. That’s somewhat understandable, given that many middleware technologies have tried to make the experience of writing client code very close to the experience of calling a local function. And when an engineer spends their time writing and testing a microservice against local mock data that stands in for the services it depends on, they aren’t exposed to any of the potential problems within their production system’s network. Since they won’t see any issues in their local tests, they’re less likely to defend against them.

Middleware systems and standards have historically tried to hide the fact that there’s a network sitting between each service, and that has had the unfortunate effect of making resilience less obvious as a concern for developers as they write their code. In fact, using a middleware or services layer that forces engineers to think about their resilience strategies in the face of network failures is quite valuable. After all, the engineers are the best people to decide how a system should behave when things go wrong.
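
To make that concrete, here is a minimal sketch in Python of what defending a remote call might look like: a few retries with exponential backoff, and an explicit error once the attempts run out. The names call_with_retries and RemoteCallFailed are hypothetical, chosen for illustration, and the choice of which exceptions count as retryable is an assumption that would need to match your client library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RemoteCallFailed(Exception):
    """Raised when every attempt at a remote call has failed (hypothetical name)."""


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on network-style errors with backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as error:
            # Assumption: only these built-in errors are treated as transient.
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff plus random jitter, so that many
                # clients do not all retry at the same instant.
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    raise RemoteCallFailed(f"gave up after {attempts} attempts") from last_error
```

The caller still has to decide what happens when RemoteCallFailed is raised (serve cached data, degrade a feature, surface an error), which is exactly the decision the engineer is best placed to make. Retries are also only safe for operations that can be repeated without side effects.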

Latency is zero

One failure scenario that is more commonly tested in the microservices world is when dependent services simply aren’t running. An engineer might run the tests of their new microservice right after they kill its dependent services and see how things behave. That’s a useful test, but it’s less likely to replicate real operational issues than the scenario where dependent services just run a lot slower. Why? Because operational teams are usually pretty good at setting up services to restart automatically upon failure, and unless something is fundamentally wrong, those services generally come back healthy after the restart.

Much more unexpected, however, is for a sudden increase in load to hit a running system, causing operations to return much more slowly than anticipated. Without timeouts or circuit breakers in place, this increase in latency will begin to compound and may even look like total unavailability to the system’s users. Ironically, testing for microservice slowness is quite easy: inject sleeps or waits and see how the system behaves overall. It just doesn’t get verified as often as it should because, by its very nature, it takes time.
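
As an illustration of the circuit breaker idea, here is a minimal Python sketch. It is not any particular library’s API: the CircuitBreaker class, its parameters and the CircuitOpenError exception are hypothetical names, and the thresholds are arbitrary. After a number of consecutive failures the breaker “opens” and rejects calls immediately instead of letting them pile up behind a slow dependency; after a cool-down it lets one trial call through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means "closed"

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not add load to a struggling dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The operation passed to call should itself enforce a timeout, so that a slow response counts as a failure rather than hanging the caller.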

Techniques like TDD have shown us that getting instant green bar results from extremely fast unit tests causes engineers to run them very often and keep them all passing, and that in turn leads to higher feature quality and broader test coverage. But that “immediate assessment” approach makes it unlikely that the same unit tests will be used to verify timeouts kicking in from some undesirable slowness of downstream services. The engineer most familiar with the service will then deliver something working in isolation, but verifying true inter-service resilience is left to others running full integration QA tests, and they might not know how each service should react when encountering slow dependencies.
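
Slowness tests don’t have to be slow themselves. The following Python sketch shows one possible way to keep them in the fast unit test suite: the stubbed dependency blocks on an event rather than sleeping for a fixed time, so the test only waits as long as the (very short) timeout. The function fetch_with_timeout and the test names are hypothetical, and using a thread pool to enforce the timeout is just one assumption about how the service calls its dependency.

```python
import concurrent.futures
import threading
from typing import Callable

# Shared worker pool used to run dependency calls off the caller's thread.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fetch_with_timeout(dependency: Callable[[], str], timeout: float, fallback: str) -> str:
    """Call `dependency`; return `fallback` if it takes longer than `timeout` seconds."""
    future = _pool.submit(dependency)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The abandoned call still occupies a worker thread until it returns.
        return fallback


def test_slow_dependency_triggers_fallback() -> None:
    release = threading.Event()

    def slow_dependency() -> str:
        # Injected slowness: block until the test releases us.
        release.wait(timeout=5)
        return "fresh"

    try:
        assert fetch_with_timeout(slow_dependency, timeout=0.05, fallback="cached") == "cached"
    finally:
        release.set()  # let the worker thread finish promptly


def test_fast_dependency_returns_real_value() -> None:
    assert fetch_with_timeout(lambda: "fresh", timeout=1.0, fallback="cached") == "fresh"
```

Tests written this way let the engineer who knows the service best state how it should react to a slow dependency, rather than leaving that to a later integration phase.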

Bandwidth is infinite

Explosive growth in the use of microservices could quickly reduce bandwidth available in the production network, especially if the system uses centralized load balancers that handle the request traffic for a number of microservices. Load balancers usually only cause request latency degradation after they’ve reached complete bandwidth saturation, so knowing that they’re close to reaching that point will be more advantageous than having to react when they cross that threshold and start to cause major interruptions.
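
Knowing how close a load balancer is to saturation is mostly arithmetic on counters it already exposes. Here is a small Python sketch, assuming you can sample a cumulative byte counter and know the link capacity; the Sample type, the function names and the 80% warning level are all illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float   # seconds since some fixed reference point
    bytes_total: int   # cumulative bytes that have passed through the balancer


def utilization(previous: Sample, current: Sample, capacity_bits_per_sec: float) -> float:
    """Fraction of link capacity used between two samples (1.0 means saturated)."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    bits_sent = (current.bytes_total - previous.bytes_total) * 8
    return bits_sent / elapsed / capacity_bits_per_sec


def should_warn(value: float, warn_at: float = 0.8) -> bool:
    """True once utilization reaches the (assumed) early-warning level."""
    return value >= warn_at
```

For example, 1,000,000,000 bytes in 10 seconds over a 1 Gbit/s link works out to 8 x 10^9 bits / 10 s = 0.8 Gbit/s, or 80% utilization, which would trip the warning well before latency starts to degrade.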

Client-side load balancers are far less likely to become saturated chokepoints when overall system traffic increases, so they are definitely more appropriate for environments that have many microservices that are growing in number to handle increased load.
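
A client-side load balancer can be very simple. The Python sketch below rotates through a list of known instances and moves on to the next one when a connection fails. RoundRobinClient and NoHealthyInstance are hypothetical names, and the send callable stands in for whatever actually performs the request in your system.

```python
import itertools
from typing import Callable, Iterable, List


class NoHealthyInstance(Exception):
    """Raised when every known instance refused the request."""


class RoundRobinClient:
    def __init__(self, instances: Iterable[str]) -> None:
        self._instances: List[str] = list(instances)
        if not self._instances:
            raise ValueError("need at least one instance")
        self._cycle = itertools.cycle(self._instances)

    def call(self, send: Callable[[str], str]) -> str:
        """Send one request, trying each instance at most once."""
        for _ in range(len(self._instances)):
            instance = next(self._cycle)
            try:
                return send(instance)
            except ConnectionError:
                continue  # this instance is unreachable; try the next one
        raise NoHealthyInstance("all instances failed")
```

Because each client spreads its own traffic, there is no shared balancer for the combined traffic to saturate. In practice the instance list would come from a discovery service rather than being fixed at construction time.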

Topology doesn’t change

Discovery and routing tools are hugely important when building a resilient microservices system. Without them, the topology shifts that inevitably occur in production environments will cause nasty ripple effects, as services have to be reconfigured to reflect the new realities.

Worse still, many topology changes occur not because of a planned change, but rather due to unexpected failures occurring in production. Those reconfigurations then have to be done at stressful times, which can make them error-prone or cause further unintended downtime. A resilient system will use discovery and routing services to allow for topology changes to be done at any time, and without any interruption to the overall system’s availability.
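
To show the core idea behind a discovery service, here is a toy in-memory registry in Python. It is a sketch under simplifying assumptions (a single process, no persistence or replication), and ServiceRegistry with its heartbeat and lookup methods is a hypothetical interface rather than the API of any real discovery tool. Instances announce themselves periodically, and any instance that stops sending heartbeats drops out of lookups once its time-to-live expires.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceRegistry:
    def __init__(self, ttl: float = 15.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        # (service name, address) -> time of the most recent heartbeat
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, service: str, address: str) -> None:
        """Register an instance, or refresh an existing registration."""
        self._last_seen[(service, address)] = self._clock()

    def lookup(self, service: str) -> List[str]:
        """Return addresses whose last heartbeat is within the TTL, sorted."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )
```

Callers ask the registry for current addresses on each request (or cache them briefly), so an instance that fails or moves disappears from rotation without anyone editing configuration under pressure.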

There is one administrator

With a monolithic system, the operational team supporting it can be small, perhaps only a couple of individuals. But in a growing microservices-based system (especially one using many different technologies), there may be a much larger team of administrators. If a company is using the increasingly popular “developers are responsible for pushing to, testing on, and managing their own services in production” approach, every developer would also be playing the role of an administrator.

Without taking care to build sensible operational controls to enable this kind of end-to-end responsibility, a production environment could quickly become chaotic and messy. Chaos in a distributed system is rarely desired (unless it’s in simian form). So even though there isn’t one administrator who is responsible for everything, a sensible use of discovery services, cloud governance systems, and monitoring tools will at least give you a unified view of your system.

Having a microservices-focused engineering team take these fallacies into account during development is simple to put into practice. Train the team on all of the fallacies (not just the ones covered in detail here), and conduct post mortems to discuss incidents where, say, a service failed because it made an incorrect assumption about the availability of another service or of the network. Enhance code review processes to explicitly look for places where a fallacy is being permanently codified. Work with QA teams to have situations like slowness, limited bandwidth and topology changes explicitly tested. Doing all of these things will help you build more resilient microservices, and further increase your development velocity.