Flaky Tests at Google

Google has around 4.2 million tests that run on our continuous integration system. Of these, around 63 thousand have a flaky run over the course of a week. While this represents less than 2% of our tests, it still causes significant drag on our engineers.

Test size - Large tests are more likely to be flaky

We categorize our tests into three general sizes: small, medium, and large. Every test has a size, but the choice of label is subjective. The engineer chooses the size when they initially write the test, and the size is not always updated as the test changes, so for some tests it no longer reflects the nature of the test. Nonetheless, the label has some predictive value. Over the course of a week, 0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14% of our large tests were flaky [1]. There's a clear increase in flakiness from small to medium and from medium to large. But this still leaves a lot of questions open; there's only so much we can learn from just three sizes.

Correlation between metric and likelihood of test being flaky

Metric         r2
Binary size    0.82
RAM used       0.76

Certain tools correlate with a higher rate of flaky tests

Some tools get blamed for being the cause of flaky tests. For example, WebDriver tests (whether written in Java, Python, or JavaScript) have a reputation for being flaky [7]. For a few of our common testing tools, I determined the percentage of all the tests written with that tool that were flaky. Of note, all of these tools tend to be used with our larger tests. This is not an exhaustive list of all our testing tools, and it represents around a third of our overall tests. The remainder of the tests use less common tools or have no readily identifiable tool.
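The two per-tool figures reported below measure different things: the share of a tool's own tests that are flaky, and the tool's share of all flaky tests. A minimal sketch of that computation, assuming a hypothetical list of (tool, is_flaky) records (the post does not describe the underlying data format):

```python
from collections import Counter

def tool_flakiness(tests):
    """Per tool, compute the share of that tool's tests that are flaky
    ("% of tests that are flaky") and that tool's share of all flaky
    tests ("% of all flaky tests").

    `tests` is an iterable of (tool, is_flaky) pairs -- a hypothetical
    record shape used for illustration only.
    """
    total_by_tool = Counter()
    flaky_by_tool = Counter()
    for tool, is_flaky in tests:
        total_by_tool[tool] += 1
        if is_flaky:
            flaky_by_tool[tool] += 1
    total_flaky = sum(flaky_by_tool.values())
    return {
        tool: {
            "pct_of_tool_tests_flaky": 100.0 * flaky_by_tool[tool] / total_by_tool[tool],
            "pct_of_all_flaky_tests": 100.0 * flaky_by_tool[tool] / total_flaky,
        }
        for tool in total_by_tool
    }
```

Note the denominators differ: a tool with few tests can have a high flaky rate yet account for a small slice of all flaky tests, as Python WebDriver does in the table below.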



Flakiness of tests using some of our common testing tools

Category                      % of tests that are flaky   % of all flaky tests
All tests                     1.65%                       100%
Java WebDriver                10.45%                      20.3%
Python WebDriver              18.72%                      4.0%
An internal integration tool  14.94%                      10.6%
Android emulator              25.46%                      11.9%

Size is more predictive than tool

We can combine tool choice and test size to see which is more important. For each tool above, I isolated tests that use the tool and bucketed those based on memory usage (RAM) and binary size, similar to my previous approach. I calculated the line of best fit and how well it correlated with the data (r2). I then computed the predicted likelihood a test would be flaky at the smallest bucket [8] (which is already the 48th percentile of all our tests) as well as the 90th and 95th percentile of RAM used.
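The bucketing-and-regression step described above can be sketched as follows. The equal-count bucketing scheme, the function names, and the predict-at-a-percentile interface are all assumptions for illustration; the post does not specify the exact method used.

```python
import numpy as np

def fit_flaky_rate(ram_mb, is_flaky, n_buckets=10):
    """Bucket tests by RAM use, fit a line of best fit to the per-bucket
    flaky rate, and return r^2 plus a predictor for the flaky likelihood
    at a given RAM percentile. A sketch, not the original analysis code.
    """
    ram = np.asarray(ram_mb, dtype=float)
    flaky = np.asarray(is_flaky, dtype=float)
    # Assign each test to an equal-count bucket by its RAM rank
    # (an assumed bucketing scheme).
    ranks = np.argsort(np.argsort(ram))
    idx = ranks * n_buckets // len(ram)
    bucket_ram = np.array([ram[idx == b].mean() for b in range(n_buckets)])
    bucket_rate = np.array([flaky[idx == b].mean() for b in range(n_buckets)])
    # Line of best fit over the buckets, and how well it fits (r^2).
    slope, intercept = np.polyfit(bucket_ram, bucket_rate, 1)
    pred = slope * bucket_ram + intercept
    ss_res = ((bucket_rate - pred) ** 2).sum()
    ss_tot = ((bucket_rate - bucket_rate.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot

    def predict(percentile):
        """Predicted flaky likelihood for a test at the given RAM percentile."""
        return slope * np.quantile(ram, percentile / 100.0) + intercept

    return r2, predict
```

Because the predictor is a straight line extrapolated down to the smallest bucket, it can go below zero, which is why some of the "smallest bucket" entries in the tables below are negative percentages.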

Predicted flaky likelihood by RAM and tool

Category                      r2     Smallest bucket (48th percentile)   90th percentile   95th percentile
All tests                     0.76   1.5%                                5.3%              9.2%
Java WebDriver                0.70   2.6%                                6.8%              11%
Python WebDriver              0.65   -2.0%                               2.4%              6.8%
An internal integration tool  0.80   -1.9%                               3.1%              8.1%
Android emulator              0.45   7.1%                                12%               17%

Predicted flaky likelihood by binary size and tool

Category                      r2     Smallest bucket (33rd percentile)   90th percentile   95th percentile
All tests                     0.82   -4.4%                               4.5%              9.0%
Java WebDriver                0.81   -0.7%                               14%               21%
Python WebDriver              0.61   -0.9%                               11%               17%
An internal integration tool  0.80   -1.8%                               10%               17%
Android emulator              0.05   18%                                 23%               25%

Conclusions

Engineer-selected test size correlates with flakiness, but within Google there are not enough test size options to be particularly useful.

Footnotes