At Dropbox, we run more than 35,000 builds and millions of automated tests every day. With so many tests, a few are bound to fail non-deterministically or “flake.” Some new code submissions are bound to break the build, which prevents developers from cutting a new release. At this scale, it’s critical we minimize the manual intervention necessary to temporarily disable flaky tests, revert build-breaking commits, and notify test owners of these issues. We built a system called Athena to manage build health and automatically keep the build green.

What we used to do

To ensure basic correctness, all code at Dropbox is subjected to a set of pre-submit tests that run before a change is merged. A larger suite of end-to-end tests, like Selenium/UI tests, is too flaky, slow, and costly to run on every change before submission, so these run only after code lands on the master branch; we call them “post-submit” tests. We require both pre-submit and post-submit tests to pass before cutting a new release.

To keep the build green, we initially established a rotation that reverted commits that broke post-submit tests and temporarily disabled, or “quarantined,” flaky tests. Over time, the operational load of that rotation became too high, and we distributed the responsibility across multiple teams, all of which felt the burden of managing build health. Worse, slow and ineffective manual quarantining of flaky tests made pre-submit testing a frustrating experience. So we started looking for a sustainable solution.

Enter Athena

We landed on a new service, called Athena, which manages the health of our builds and requires minimal human intervention.

Athena reduces the human effort required to keep the build green by:

- Identifying commits that make a test deterministically fail, and notifying the author to revert the commit
- Identifying tests that are flaky and unreliable, and automatically quarantining them

What makes this tricky?

It can be challenging to determine whether a single test failure is a deterministic breakage or a spurious failure. Ultimately, a test is arbitrary user code, and tests can fail in various ways. The three main classes of non-deterministic test failures we see are non-hermetic tests, flaky tests, and infrastructural flakiness.

Non-hermetic tests

Hermetic tests only use declared dependencies and have no dependencies outside the build and test environment, which makes their results reproducible.

A few tests at Dropbox depend on external resources that are hard to fake, like time. We often see tests that start failing when it’s a new UTC day, or at the end of every month. For example, code that tests whether a particular discount is valid starts failing after the discount expires. We call these environmental failures.
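To make the failure mode concrete, here is a hypothetical illustration of such a time-dependent test (the names `DISCOUNT_EXPIRES` and `discount_is_valid` are ours, not Dropbox's): the code under test reads the real clock, so it passes right up until the expiry date and then starts failing with no code change.

```python
import datetime

# Hypothetical expiry date for a promotional discount.
DISCOUNT_EXPIRES = datetime.date(2030, 1, 1)

def discount_is_valid(today=None):
    """True while the promotional discount is still active."""
    if today is None:
        today = datetime.date.today()  # hidden dependency on wall-clock time
    return today < DISCOUNT_EXPIRES

# The hermetic fix is to inject the date instead of reading the clock:
assert discount_is_valid(today=datetime.date(2029, 12, 31))
assert not discount_is_valid(today=datetime.date(2030, 1, 2))
```

A test that calls `discount_is_valid()` with no argument is non-hermetic: its result depends on when it runs, not just on its inputs.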

We mitigate these by keeping track of the latest “stable” commit, one where all tests have passed. Every time a new commit has all tests passing, we mark that commit as stable. If a test fails when run on a stable commit, it can indicate non-hermetic behavior or an environmental failure.
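The stable-commit check can be sketched as follows (function and label names here are illustrative, not Athena's actual API): rerun a failing test at the latest stable commit, where it is known to have passed. If it fails there too, the new commit cannot be to blame, so the failure is treated as environmental rather than a breakage.

```python
def classify_failure(test, candidate_commit, stable_commit, run_test):
    """Classify a test failure using the latest stable commit as a control.

    run_test(test, commit) -> True if the test passes at that commit.
    """
    if run_test(test, candidate_commit):
        return "passed"
    if not run_test(test, stable_commit):
        # The test fails even at a commit where it previously passed,
        # so the environment (e.g. wall-clock time), not the code, changed.
        return "environmental"
    # Fails on the candidate but passes on stable: the commit is suspect.
    return "suspect-breakage"
```

For example, with a fake runner where the test fails at both commits, the classifier returns `"environmental"`; if it fails only at the candidate commit, it returns `"suspect-breakage"`.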

Flaky tests

Flaky tests are tests that behave non-deterministically with no change in input. Potential sources of non-determinism are dependence on random numbers, thread scheduling, improperly selected timeouts, and concurrency. With flaky tests, it’s impossible to say with 100% confidence whether a commit truly broke a test.

We mitigate this by retrying the same test up to ten times, and if the result isn’t consistent across the runs, we’re confident that the test is exhibiting a flake. We settled on ten retries after some experimentation.
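A minimal sketch of that retry heuristic (names are illustrative): run the failing test up to ten times; a mix of passes and failures across runs means the test is flaky, while ten consistent failures suggest a real breakage.

```python
MAX_RETRIES = 10  # settled on after experimentation, per the text above

def is_flaky(run_test, retries=MAX_RETRIES):
    """run_test() -> True on pass, False on failure.

    Returns True if results are inconsistent across up to `retries` runs.
    """
    results = set()
    for _ in range(retries):
        results.add(run_test())
        if len(results) > 1:  # saw both a pass and a failure
            return True
    return False
```

Note the early exit: as soon as both outcomes have been observed, no further runs are needed to establish flakiness.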

Infrastructural flakiness

We have done a lot of work over the years to ensure that our tests have reliable resource guarantees so that their behavior is consistent. We run tests in a container, give them consistent CPU and memory via resource quotas, and do CPU and NUMA pinning. Unfortunately, with a thousand-node cluster, we’re bound to have performance variations and straggling hosts.

The most common case of infrastructural flakiness we see is a test timing out on a poorly performing host. We mitigate this by retrying the test on a different host.
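A host-aware retry can be sketched like this (the scheduler API here is assumed, not Athena's actual interface): when a test times out, rerun it on a host it has not used yet, so one straggling machine cannot fail the test by itself.

```python
def retry_on_fresh_host(test, hosts, run_on_host):
    """Rerun a test across distinct hosts until it stops timing out.

    run_on_host(test, host) -> "pass", "fail", or "timeout".
    Returns (final_result, list of (host, result) attempts).
    """
    attempts = []
    for host in hosts:  # each attempt goes to a host not yet tried
        result = run_on_host(test, host)
        attempts.append((host, result))
        if result != "timeout":
            return result, attempts  # a definitive pass/fail on some host
    # Timed out on every host: likely a genuine hang, not a slow machine.
    return "timeout", attempts
```

A timeout on one host followed by a pass on a second host is thus attributed to the host, not the test.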

How does Athena work?

To apply the mitigations listed above, we created a service, Athena, that watches test results, reruns failing tests to identify if they’re flaky or broken, and takes actions like quarantine.

Athena watches test results for all new code submissions. In post-submit testing, it marks tests that fail more than once within a few hours as “noisy,” a temporary state used while the system is still unsure whether the failures are flakes, infrastructural problems, or a real breakage.
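The “noisy” marking rule can be sketched as a sliding window over failure timestamps (class name, window size, and threshold here are illustrative assumptions, not Athena's actual values):

```python
from collections import defaultdict

WINDOW_SECONDS = 3 * 60 * 60  # "a few hours" (assumed value)

class NoisyTracker:
    """Marks a test noisy once it fails more than once within the window."""

    def __init__(self):
        self.failures = defaultdict(list)  # test name -> failure timestamps

    def record_failure(self, test, now):
        window = self.failures[test]
        window.append(now)
        # Drop failures that have aged out of the sliding window.
        self.failures[test] = [t for t in window if now - t <= WINDOW_SECONDS]
        return self.is_noisy(test)

    def is_noisy(self, test):
        return len(self.failures[test]) > 1
```

A single failure leaves the test in its normal state; a second failure inside the window flips it to noisy, and old failures age out so a long-ago blip does not count against the test forever.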