The complexity in complex distributed systems isn’t in the code; it’s between the services or functions. Testing means balancing finding problems against delivering value, said Sarah Wells at the European Testing Conference. Testers often have the best understanding of what the system does; they tend to have a good hypothesis about what went wrong, and can validate it quickly.

In her keynote, Wells explored what changes when systems are complex and distributed. With a monolith, it is probably hard to work out where the code that did a particular thing lives, but it is easy to work out the flow of requests through the system, and most communication is in process. With distributed systems, the complexity moves from being within a system to being between systems, argued Wells.

With microservices, the code is simpler, but the routing is much more complex, and happens over HTTP or via a queue. More things can go wrong there, said Wells; you often get transient errors, where a request fails but succeeds if repeated a few seconds later. She argued that it’s sensible to build in backoff and retry, but that makes it harder to follow exactly what happened while a request was being handled.
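As an illustration of the backoff-and-retry idea, here is a minimal Python sketch. It is not taken from the talk: the names TransientError and call_with_retry, and the default delays, are assumptions chosen for the example. It returns the number of attempts alongside the result, so that a caller can record how many tries a request needed, which speaks to the point about retries making it harder to follow what happened.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    Returns (result, attempts) so the caller can log how many tries it took.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The base delay doubles on each attempt: 0.5s, 1s, 2s, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The sleep function is passed in as a parameter so that the behaviour can be exercised in a test without actually waiting.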

Wells suggested using a risk-based approach and focusing testing efforts on the things where it really matters. You need to be able to find out quickly when things are going wrong and fix them quickly, and you have to build your systems for observability so that you can work out what went wrong, she said.

Wells mentioned that for testing complex distributed systems you need to balance two things: finding problems as early as possible, versus delivering value as early as possible. It might be better to accept that some problems won’t be found until you hit production, and you should optimise for identifying and fixing those quickly, she argued.

InfoQ is covering the European Testing Conference 2019 and spoke with Sarah Wells after her keynote about testing complex distributed systems.

InfoQ: What’s your advice for testing complex distributed systems?

Sarah Wells: What we found when we started building complex distributed systems is that it doesn’t really work to try to spin up a complete replica of the system locally, either for developers or for testers. You spend a lot of time trying to create a good replica of production but you never quite manage that. Tied to this is that these are complex systems. If you can’t evaluate the likely blast radius for a particular change, you can spend a lot of time doing regression testing, and we found this was a bottleneck that rarely found issues. Like lots of things, it’s all about communication. If developers talk through what they have just done, often a tester can pinpoint the most risky possible consequences of the change. The main things that help are around continuous delivery and microservices. The changes are small and self-contained - microservices have very clear boundaries. The research (see for example the book Accelerate) shows that organisations that can release small independent changes frequently have a much lower failure rate for those changes. I think establishing a contract between systems can help, but I wonder if maintaining contract tests might cost more than the risk of getting it wrong - I’d probably aim to do this at boundaries between teams rather than within a system.
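To make the idea of a contract between systems concrete, the following Python sketch shows one very small form such a check could take. It is an illustration only, not something described in the interview: CONSUMER_CONTRACT, its field names, and contract_violations are hypothetical. The consumer lists the fields it relies on, and the check reports any that are missing or of the wrong type while ignoring fields the consumer does not use.

```python
# Hypothetical contract: the fields one consumer relies on, with expected types.
CONSUMER_CONTRACT = {"id": str, "title": str, "published": bool}


def contract_violations(response, contract):
    """Return a list of readable violations; an empty list means the contract holds."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems
```

Because extra fields are ignored, the provider can add to its responses without breaking this check; only removing or retyping a field the consumer depends on is flagged.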

InfoQ: How can monitoring and logging support testing, or even replace it?

Wells: With distributed systems, a lot of the problems have little to do with code that just got released. They could be related to the environment the code is running in, or they could be because of the dependencies between services - i.e. a change to one causing unexpected errors for another. That service may not be owned by the same team. You may not even know about it, if it calls an API you own. So once the change makes it to production it’s all about spotting any issue early and rolling back. That can be via something like synthetic monitoring, where you constantly test key business functionalities, or business-capability-level monitoring, where you check that some real event completed successfully - our example is for content publishing, where we check that every relevant data store in both our regions has the correct update, or alert otherwise. We run this in lower environments too. This replaced some acceptance tests that were fairly brittle - amending the setup of fixtures was painful when anything changed, and things change a lot with these kinds of architectures. You have to build your systems for observability, so that you can work out what just went wrong. That means for example that you absolutely need all the logs to be in a single aggregated log store, and you need those logs to be structured so you can easily run queries. You may end up sampling logs (because you now likely have an order of magnitude more logs than you would have done with the monolith, since requests pass through multiple services), but in that case you want to make sure that if any logs for a particular event are stored, they all are, and you want to be able to tie them together via a unique transaction id. You also want to capture metrics, but it’s easy to capture too much - you really want metrics for the top level of services, i.e. those customer calls, reporting on the request rate, error rate (probably as a proportion of request rate), and request duration (these are often referred to as the RED metrics).
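The requirement that sampled logs are kept for a whole event or not at all can be met by deriving the sampling decision from the transaction id itself. The Python sketch below shows one way this might be done; it is an assumption for illustration, not the approach Wells described, and the function names, field names and default sample rate are hypothetical. Each service hashes the transaction id and compares it against the sample rate, so every service reaches the same verdict for the same transaction without coordinating.

```python
import hashlib
import json
import time


def is_sampled(transaction_id, sample_rate):
    """Decide from the transaction id alone, so all services agree."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


def log_event(service, transaction_id, message, sample_rate=0.1, **fields):
    """Return one structured (JSON) log line, or None if not sampled."""
    if not is_sampled(transaction_id, sample_rate):
        return None
    record = {
        "timestamp": time.time(),
        "service": service,
        "transaction_id": transaction_id,  # ties lines from all services together
        "message": message,
        **fields,
    }
    return json.dumps(record)
```

In an aggregated log store, a query on the transaction_id field would then return either every line for a request or none of them.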

InfoQ: How can testers contribute when things go wrong, and what value do they bring?