What is flakiness and how we deal with it

Difficulties of QA engineering excellence

Here at Azimo, we put a lot of effort into making sure that our software is reliable and free of bugs. Each mission team and every technical platform has dedicated QA engineers who stand guard over the quality of the developed software.

One of our software quality rules says that automated tests must cover each piece of technology that is released to customers.

In a perfect world, all of those tests would pass, but in reality, that doesn’t happen very often. What is the reason? Test flakiness.

In this article, we will show you how we deal with it at Azimo, in a tech stack built as a composition of microservices, a monolithic core, and dozens of external partners, everything tied together via REST APIs and Kafka messages.

What does “flaky test” mean?

As it grows bigger, a test suite is rarely all green, and not only because of bugs and regressions. Sometimes you run your tests multiple times in a row with no code change, and the results still differ. This instability is called flakiness. A flaky test is a test that can fail or pass with no changes in the application or infrastructure.

Why do we need stable tests?

Nobody wants customers to wait months for a bug fix. We want to go live with the change as soon as possible.

If your tests execute quickly and you can trust them, you can deploy code to production whenever you want without being scared that your changes are going to make the product explode!

By reducing test execution time and flakiness, we allowed Azimo engineers to deploy the monolithic system up to five times a day, not to mention the number of microservice releases (up to a few dozen a day).

This pace isn’t possible to achieve when a software engineer has to ask QA why some tests fail during each deployment. If an investigation ends in constant test reruns or the statement “it’s a known issue, please ignore,” only one thing can happen: software engineers start ignoring test results and praying for successful deployments with no production incidents instead.

Reasons for the flakiness

Flakiness caused by application issues

Two main reasons stand behind test suite instability: bugs in your application, or defects in your testing code. Here are some examples of failures in our app that have led to flakiness in our tests.

One of them was a test that checked whether the application returns the proper ID for a newly created money transfer. In most cases it did, but sometimes we got the ID of some other transaction created by the same user. Long story short, it turned out that transfer events weren’t partitioned correctly (they were keyed by transfer ID rather than user ID). Kafka guarantees ordering only within a partition, so if some events go to a different partition, they might be consumed in a different order.

When one user created multiple transfers within milliseconds, the tested service could consume the second transaction before the first one, which caused the failure.

Here is more about partitions: https://stackoverflow.com/questions/38024514/understanding-kafka-topics-and-partitions.
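The ordering guarantee above can be made concrete with a small sketch. The modulo scheme below stands in for Kafka’s real default partitioner (which hashes the key with murmur2); the property that matters is the same: equal keys always map to the same partition, so only events sharing a key have a guaranteed relative order.

```scala
// Simplified stand-in for Kafka's default partitioner: hash the record key,
// then take it modulo the partition count. Equal keys => equal partition.
object PartitioningSketch {
  def partitionFor(key: String, numPartitions: Int): Int =
    // Double modulo keeps the result non-negative even for negative hash codes.
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
}
```

Two events keyed by the same ID always land in the same partition and keep their production order; events with different keys may land in different partitions, across which Kafka promises nothing.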

What other application issues can make tests flaky? From our experience:

A missing index on a DB table, causing timeouts in more complex queries,

Failing infrastructure or a broken deployment process, causing random 50x errors,

Unstable external APIs.

There are different ways of fighting those problems. If the issue is known (like a missing index on a DB table), we create a ticket to fix it, mark the test as “ignored until fixed,” and push the domain’s software engineering team to make it happen.

If the flakiness is caused by the instability of something bigger (like our infrastructure or deployment process), we very rarely introduce mechanisms for automatic test reruns. While we don’t recommend this approach, some circumstances justify it (e.g., a cloud provider migration). Sometimes a test that passes two out of three times is better than no test (again, very rarely!).

Finally, there is the case where flakiness is caused by a failing component that isn’t part of your system at all (an external API). In most situations, we build our tests to be as independent as possible, but sometimes we have to rely on an external integration. To overcome this problem, you can build a test proxy service running between the application and the external partners. Such a service records successful HTTP calls when everything works as expected and replays them when the real responses are unavailable.
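The record-and-replay idea can be sketched in a few lines. This is a minimal, in-memory illustration, not a real proxy: the “call” here is any function from a request to a response that can throw, standing in for an HTTP round trip.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Record successful calls; when the real call fails, replay the last
// recorded response for the same request (if we have one).
object ReplayProxy {
  private val recorded = mutable.Map.empty[String, String]

  def call(request: String)(real: String => String): Try[String] =
    Try(real(request)) match {
      case Success(response) =>
        recorded(request) = response // record the real, successful response
        Success(response)
      case Failure(e) =>
        // Replay if a recording exists, otherwise propagate the failure.
        recorded.get(request).map(Success(_)).getOrElse(Failure(e))
    }
}
```

A real proxy would additionally key recordings by method, headers, and body, and persist them between test runs.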

At Azimo, we decided not to use such an approach because of its drawbacks. The main disadvantage is that you’re no longer testing the real, production-like system, but a system with some mocks instead. Unfortunately, without a proxy server, there is not much you can do except contact the support teams and wait for the issues to be fixed.

Flakiness caused by problems with tests

Based on our experience, in most cases flakiness is caused by issues in the testing code. One of the reasons may be using the application in the wrong way, which isn’t that hard to do when you have to deal with loads of services depending on each other.

As an example: in some of our tests, we changed the transaction status to check some side effects. However, what we missed was a scheduled job (executed quite rarely) doing the same thing for real. It might have changed the transfer’s status to something else right after we changed it in the test, so some of the expected side effects were not applied.

It took us months to figure it out 🤯.

Flakiness in testing code can also come from improper usage of a testing tool. In our case, tests failed because they didn’t receive the expected Kafka messages. It turned out that because the Kafka consumer was created asynchronously, it might have been initialized only after the event had already arrived at the topic.

Because the testing code had to use `latest` for `auto.offset.reset`, consumption sometimes started too late, right after the messages had been delivered. To overcome this, we assigned partitions and offsets manually, which guaranteed that reading started just before the expected events appeared.
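The race can be illustrated with a toy single-partition “topic.” A consumer that starts at the latest offset misses anything appended before it was ready, while recording the end offset up front and reading from it (the kind of thing `KafkaConsumer`’s `assign` and `seek` make possible) cannot miss the event.

```scala
import scala.collection.mutable.ArrayBuffer

// An in-memory stand-in for one Kafka partition: an append-only log
// addressed by offsets.
object OffsetSketch {
  private val log = ArrayBuffer.empty[String]

  def endOffset: Int = log.size                       // "latest" at this moment
  def append(event: String): Unit = log += event
  def readFrom(offset: Int): Seq[String] = log.drop(offset).toSeq
}
```

The test flow: capture `endOffset` before triggering the action under test, then read from that offset. A consumer that only asks for “latest” after the event was appended sees nothing.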

Another timing issue we faced: if the timeout for consuming a Kafka message is 10 seconds and the job that generates the messages is scheduled every 10 seconds as well (or less often), the test might not wait long enough before the assertion fails.

Also, waiting for the event for a fixed period might make the test execution time too long. It’s worth trying something like `eventually` from ScalaTest, or implementing a mechanism that re-checks whether the condition holds every X milliseconds until a timeout.
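Such a re-checking mechanism fits in a few lines. This is a hand-rolled, simplified stand-in for ScalaTest’s `eventually`, not its actual implementation:

```scala
// Re-evaluate `condition` every `intervalMillis` until it holds or
// `timeoutMillis` has elapsed. Returns whether the condition was satisfied,
// so the test fails fast once it is, instead of sleeping the whole timeout.
object Eventually {
  def await(timeoutMillis: Long, intervalMillis: Long)(condition: => Boolean): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMillis
    var satisfied = condition
    while (!satisfied && System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMillis)
      satisfied = condition
    }
    satisfied
  }
}
```

A test would then assert on, say, `Eventually.await(10000, 200)(consumedIds.contains(expectedId))` (both names hypothetical), completing as soon as the message arrives rather than after a full fixed sleep.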

Randomized data used in tests can also be a reason for flakiness. In our case, when we tested the transactional flow using random data generators, from time to time the suite would fail unexpectedly. As we discovered later, the randomly generated data sometimes happened to trigger some of our compliance engine’s rules, which interrupted the transactional flow.

Because our compliance engine is well tested separately, we decided to disable it in our pre-production environment, which noticeably improved the stability of our tests.

Finally, there is concurrency, probably the most difficult cause to discover. If you run tests in parallel and they, for example, make changes to a business configuration, two tests might touch the same config at the same moment, leading to failures. One of our solutions was to synchronize tests on Consul’s KV store: when a test changes the configuration, the others have to wait for the lock to be released.
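The synchronization pattern looks roughly like this. The sketch below uses an in-memory atomic reference instead of a real Consul session (with Consul, acquire and release are `PUT /v1/kv/<key>?acquire=<session>` and `?release=<session>`); only the shape of the pattern is the point.

```scala
import java.util.concurrent.atomic.AtomicReference

// One shared lock guarding a piece of business configuration. A test that
// wants to change the config must hold the lock for the duration of the test.
object ConfigLock {
  private val holder = new AtomicReference[String](null)

  def tryAcquire(session: String): Boolean = holder.compareAndSet(null, session)
  def release(session: String): Boolean    = holder.compareAndSet(session, null)

  // Spin until the lock is ours, run the body, always release afterwards.
  def withLock[A](session: String)(body: => A): A = {
    while (!tryAcquire(session)) Thread.sleep(10)
    try body finally release(session)
  }
}
```

With a real Consul session the lock also survives process crashes, because the session’s TTL expires and the key is released automatically.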

Where to start?

If your testing infrastructure becomes more and more complex, it’s just a matter of time before flakiness appears. It is nothing unusual; you only need to find a way to deal with it effectively.

Here are some hints that may be helpful.

Start with the randomly failing tests that frustrate you the most. Everyone has such a test in their suite. Run them and verify what causes the problem. If it’s a bug in the app, report it and ignore the test until the problem is solved. If it’s a bug in the test itself, fix it!

When you have gotten rid of the flaky tests whose cause is known, use tools that will help you track the others: find something that runs your tests multiple times and analyzes the results to point out flakiness. Because most of our backend tests are written in Scala, we use https://github.com/otrebski/sbt-flaky/ .

If this isn’t enough, you might also create your own job that runs the tests several times. At Azimo, a Jenkins job runs the test suite three times in a row every night, collects all distinct failures to calculate the percentage of flaky tests, and notifies us on Slack. That gives us a number that can be tracked over time and used to set long-term goals.
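The metric behind that number could be derived as follows, assuming each run reports the set of failed test names (the exact aggregation our job uses may differ): a test counts as flaky when it fails in some runs but not all of them, while a test failing every time is a genuine regression rather than flakiness.

```scala
// Given the failed-test sets of several identical runs, separate flaky tests
// (failed sometimes) from consistent failures (failed every time).
object FlakinessReport {
  def flakyTests(failedPerRun: Seq[Set[String]]): Set[String] = {
    val everFailed   = failedPerRun.reduce(_ union _)     // failed at least once
    val alwaysFailed = failedPerRun.reduce(_ intersect _) // failed in every run
    everFailed diff alwaysFailed
  }

  def flakinessPercent(failedPerRun: Seq[Set[String]], totalTests: Int): Double =
    100.0 * flakyTests(failedPerRun).size / totalTests
}
```

For example, three runs failing {a, b}, {a}, and {a, c} out of 200 tests give two flaky tests (b and c), i.e. 1% flakiness, while a is a consistent failure to be reported as a bug.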

Test flakiness is one of the KPIs of the Azimo QA guild: the average flakiness across all platforms (backend, mobile, web) has to stay below 1%.

Zero is hard, and every big tech company knows it. Even Google struggles with flakiness issues.

But while the last missing percent of test stability isn’t easy to win, there is one thing we are sure of. As a QA engineer, it’s your duty to keep eliminating flakiness and to make your software engineering teams trust your tests as much as possible.