Tips on how to make and keep UI tests green

I think the only way to maintain a 100% test success rate is to run the tests as a blocking build on every pull request.

That is how you avoid merging changes that break everything.

However, it is very hard to make tests stable enough to run on every PR. Performance is another story: a pull request check shouldn’t take forever. I think 30–60 minutes is okay.

You cannot expect to have many 100% stable E2E tests right after you start writing tests. You need a stable test suite, stable CI, and a way to run tests fast. Be prepared that this can take time.

We started running a decent number of tests (100–300) as a blocking build on PRs after nearly a year of developing our testing tools and infrastructure. Before that, we ran a fixed set of tests on a pull request, and later we also started running modified tests. You should try to run tests on pull requests as early as possible; even one test will suffice. Then focus on increasing the number of tests while maintaining the stability and duration of the PR checks.

There are some techniques that we use to improve the performance and success rate of our tests, which lets us run more tests on PR and, in turn, improves the stability of our full regression suite.

Stable tools

We use tools that are fault-tolerant. For example, if the simulator stops responding, it is recreated and the test restarts. If some of the machines in our farm become unresponsive, the test runner blocks them and the tests restart.
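As an illustration (a sketch, not Emcee’s actual code), the recovery loop boils down to treating an unresponsive simulator as an infrastructure failure rather than a test failure:

```python
class SimulatorUnresponsive(Exception):
    """Raised when the simulator stops responding (an infrastructure failure)."""

def run_with_recovery(run_test, create_simulator, max_attempts=3):
    """Recreate the simulator and restart the test on infrastructure failures."""
    last_error = None
    for _ in range(max_attempts):
        simulator = create_simulator()  # a fresh simulator for every attempt
        try:
            return run_test(simulator)
        except SimulatorUnresponsive as error:
            last_error = error  # not the test's fault: retry on a new simulator
    raise last_error
```

The key design point is that infrastructure errors never count against the test itself; only a genuine assertion failure does.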

We rely heavily on backend-driven UI and a lot of A/B tests, so the UI can differ from one moment to the next. We don’t want to fix the tests every time someone changes something in the backend, and we don’t want to add “if”s to the code. So we find UI elements by their IDs (and use the same IDs for the same elements, even if a feature has multiple implementations), always scroll to them automatically, make tests independent of each other (cleaning every possible state), give almost every action or check a lot of retries and fallbacks, and so on.
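A lookup helper with retries and automatic scrolling could look like this minimal sketch; `screen.query` and `screen.scroll_to` stand in for whatever the UI driver provides and are assumptions, not our real API:

```python
import time

def find_element(screen, element_id, timeout=10.0, poll=0.5):
    """Find a UI element by its ID, retrying and scrolling instead of failing fast."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        element = screen.query(element_id)  # hypothetical UI-driver call
        if element is not None:
            screen.scroll_to(element)       # always scroll to the element first
            return element
        time.sleep(poll)                    # the UI may still be loading: retry
    raise AssertionError(f"element {element_id!r} not found within {timeout}s")
```

Because every lookup polls until a deadline, the same test works whether the screen renders instantly or after a slow backend round trip.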

Stable test backend

We don’t have one. It is too expensive to deploy everything for every pull request, or even for every run of the full regression suite. Instead, we have a few instances of the test backend that are shared by the product teams for their tests (web, mobile web, mobile apps, etc.).

Because the test backend is shared, it is not as stable as production. One could say we should make it more stable, but there are workarounds that let us ignore the problems, and the problems aren’t that critical. We are not proud of this situation, but it doesn’t bother us much.

Retries

We retry every test a number of times when it runs on a pull request. We don’t want to bother developers by failing their builds unless they have actually broken something.

We don’t want flaky tests either, and retries allow tests to stay flaky. So our colleagues on the Android team implemented an improvement for the pull request check: every new or modified test is required to pass 5 times out of 5.

Profit: +10% success rate in our case. Flakiness is eliminated. Useful on PR.

The 10% here means that 1 out of every 10 successful tests had to be retried at least once. We still have some test stability problems, and we’ll keep working on them.
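The two rules combined can be sketched as a single verdict function (the name and shape are ours, not the actual implementation):

```python
def pr_verdict(results, is_new_or_modified):
    """Decide whether a test passes the PR check, given its run results.

    `results` is the list of pass/fail outcomes of one test on this PR.
    An ordinary test may be retried: a single pass anywhere is enough.
    A new or modified test must pass 5 out of 5 runs, so flakiness
    cannot sneak into the suite through the retry mechanism.
    """
    if is_new_or_modified:
        return len(results) == 5 and all(results)
    return any(results)
```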

Restarts

Sometimes some services are unavailable, and sometimes they remain so for a long time. We want to mitigate this kind of problem. The night before regression testing, we start several builds at 3-hour intervals. Every subsequent build reuses the previous results and restarts only the failed tests. It has turned out to be useful: once, it saved 12% of the tests.

Profit: +2% to +12% to the success rate at release (up to 95% in our case; the remaining 5% are long-forgotten tests that don’t work).
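The reuse of results between the nightly builds can be sketched like this (a simplification of what our tooling does):

```python
def restart_failed(previous_results, run_tests):
    """Rerun only the tests that failed in the previous build, merging results.

    `previous_results` maps test name -> passed (bool); `run_tests` runs
    the given list of tests and returns the same kind of mapping.
    """
    failed = [name for name, passed in previous_results.items() if not passed]
    merged = dict(previous_results)
    merged.update(run_tests(failed))  # passing tests are not re-executed
    return merged
```

Chaining several such builds hours apart gives temporarily broken services a chance to recover before the final report.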

Trusted tests

A VERY, VERY important thing for us. It is what let us enter the world of “running tests on PR”: it makes PR checks stable and fast.

In short: we analyze the run history and select the tests that were neither failing nor flaky in the past N full runs. We call them trusted and run them on pull requests. These tests are extremely stable, so developers’ PRs are rarely blocked for no reason.

It is rather straightforward: you only need to store the history somewhere, then query it and pass the result to the test runner. That’s all. We were lucky, as we had developed our reporting tool and test runner with something like this in mind, so it was very easy to make everything work.

This mechanism also lets you automatically ignore outdated and broken tests, and automatically unignore them once they are fixed. So you can do just fine with a partially green suite.

One of the possible variants of this algorithm
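For example, the selection could be as simple as this sketch (assuming the history is stored as a map from test name to its pass/fail outcomes, newest last):

```python
def trusted_tests(history, n=20):
    """Select tests that passed every one of their last `n` recorded runs.

    `history` maps test name -> list of booleans, newest last. A test
    that failed, or was even flaky, in the last `n` full runs is not
    trusted and therefore is not run on pull requests.
    """
    return {
        name for name, runs in history.items()
        if len(runs) >= n and all(runs[-n:])
    }
```

Note that a brand-new test is not trusted either, simply because it has no history yet; it earns trust by surviving N full runs.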

Comparing to the target branch

The aforementioned “restarts” technique does not work for pull requests, because it requires a long waiting period. We don’t want to, and can’t, add another hour to our 15-minute UI test check on pull requests. But we can still eliminate infrastructure problems.

The algorithm we currently use is as follows (assume we create a pull request from the “source” to the “target” branch):

If the test passes on the source branch, it passes

If the test fails on the source branch, restart it on the target branch

If the test passes on the target branch, it was broken by the current PR changes

If the test fails on the target branch, the code (probably) didn’t affect it, do not block PR

If the test was changed in the PR, we cannot compare it to the target branch; it simply has to pass.
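The steps above can be sketched as a single decision function (`run_on_target` stands for a callback that reruns the failed test on the target branch; the names are illustrative, not our actual code):

```python
def should_block_pr(passed_on_source, run_on_target, changed_in_pr):
    """Decide whether a test result should block the pull request."""
    if changed_in_pr:
        # A changed test cannot be compared to the target branch:
        # it simply has to pass on the source branch.
        return not passed_on_source
    if passed_on_source:
        return False                    # green on source: nothing to check
    passed_on_target = run_on_target()  # rerun the failed test on target
    # Green on target but red on source means the PR broke the test.
    # Red on both means the code (probably) didn't affect it: don't block.
    return passed_on_target
```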

Here is what we got after implementing this algorithm. The chart shows how many tests were skipped after failing on the source branch. We were able to unblock about 40% of the pull requests that didn’t actually break any tests.

How many tests were ignored, among only the builds with tests that failed on the source branch

We have not integrated this feature into Emcee yet; for now it is purely external, and we ran into a problem with it. We started all the tests first and only then restarted the failed ones, so there was a long interval between running a test on the source branch and on the target branch. A service can be down during the run on the source branch but up and running again by the time we compare to the target branch. This sometimes resulted in a PR being blocked for no reason. Example: a service was down and 37 tests failed with the same error (e.g., a 500 status code from an integration API); then 36 of them also failed on the target branch, but one passed. Ideally, after a failure, tests should be run simultaneously on the source and target branches, and more than once, to eliminate any possibility of the situation described above.

We solved this issue simply by moving some of the source-branch runs after the run on the target branch. Here is what changed in the algorithm:

We assumed that most of the tests would pass at the first stage, the longest one. Then some tests would be compared to the target branch, and then a very small number of tests would be checked again on the source branch. The interval between running tests on the target branch and the source branch becomes shorter, so the probability of the issue described above is lower.

It seems that our assumption was correct. Here is a chart after we updated the algorithm:

How many tests were ignored, among only the builds with tests that failed on the source branch

It is a very good way to stop blocking people’s PRs for no reason. Obviously, it can’t be used in regression testing, because there is no such thing as a “target branch” when we run tests on some specific branch.

Impact analysis

This method allows you to skip running every test on a pull request while still getting a lot of things checked.

In short: impact analysis is a process of detecting tests that could be affected by changes in code.

Our Android team has an impressive impact analysis system and great tools for it. They identify which tests should be run based on changes in the app code. That is a really big deal, because black-box tests are separate from the app.

On iOS we use a very low-tech, very simple, but very helpful way to detect changes: we just take the git diff and run new or modified tests. We require those tests to pass; they can’t be ignored. So red tests cannot be added to the repo, and tests cannot be broken when someone modifies them.
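A minimal sketch of this approach (the target branch name and the test-file naming convention are assumptions for illustration):

```python
import subprocess

def changed_files(target_branch="origin/develop"):
    """Files added or modified in the PR, relative to the target branch.

    `A...B` diffs against the merge base, which is what you want for PRs;
    `--diff-filter=AM` keeps only added and modified files.
    """
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=AM",
         f"{target_branch}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def select_impacted_tests(files, test_suffix="Tests.swift"):
    """Keep only test files; these are required to pass and cannot be ignored."""
    return [path for path in files if path.endswith(test_suffix)]
```

The selected tests are then fed to the runner as the mandatory, non-ignorable part of the PR check.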