In previous articles we’ve looked at Airbnb’s product architecture, the mocking system we built for it, and how this ecosystem enables us to run automated tests on our features. One notable missing piece, though, has been a discussion of how and when these tests are run.

We’ll now take a look at the tooling that ties all of our testing infrastructure together, and how we’ve designed it to be a pleasant experience for our engineers.

Generating Test Files

As mentioned previously in this series, we define mock data for our Fragments, but don’t require our engineers to write conventional tests for those mocks. Instead, the test files are generated. Here we’ll take a closer look at what that means in practice.

First, we use Kotlin scripting to parse our app directory and pull out the names of all MvRx Fragments.

The script leverages the Kotlin Compiler to generate an AST of each Kotlin file, which makes it easier to detect class declarations.

Through a lint rule we enforce a naming convention on our Fragments, so the script can recognize Fragment classes by name.

For simplicity, we gather the names of all Fragment classes. If a Fragment is not a MvRx Fragment it is skipped at test runtime.
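To make the detection step concrete, here is a minimal sketch. It swaps the Kotlin Compiler AST for a plain regular expression over the file text, and it assumes the lint-enforced convention is that Fragment class names end in "Fragment"; both are simplifications for illustration, and the function name is hypothetical.

```kotlin
// Simplified stand-in for AST parsing: regular expressions over the file text.
// Assumes the naming convention is that Fragment class names end in "Fragment".
val packageRegex = Regex("""^\s*package\s+([\w.]+)""", RegexOption.MULTILINE)
val fragmentClassRegex = Regex("""\bclass\s+(\w+Fragment)\b""")

/** Returns the fully qualified names of classes in [source] whose names end in "Fragment". */
fun findFragmentClassNames(source: String): List<String> {
    val pkg = packageRegex.find(source)?.groupValues?.get(1)
    return fragmentClassRegex.findAll(source)
        .map { match -> match.groupValues[1] }
        // Prefix the package when the file declares one.
        .map { name -> if (pkg != null) "$pkg.$name" else name }
        .toList()
}
```

A regex like this will also match class names inside comments or strings, which is one reason a real AST is the more robust choice.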

Once we have the Fragment class names, the script uses KotlinPoet to write a JUnit test file to our test sources directory, in the androidTest folder, so that a test build will include those tests. The generated file contains one test function per Fragment for each test type.

The name of each test is based on the fully qualified name of the Fragment, and each test function simply starts the test by providing a Fragment name. All of the code to run the test lives in the base class so that it doesn’t need to be included in the generated file. See the previous article for details about this base class and the Activity that runs the tests.
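As a rough sketch of this generation step, the function below renders such a test file with a plain string template rather than KotlinPoet. The names `GeneratedFragmentTests`, `BaseFragmentTest`, and `runScreenshotTest`, as well as the `_screenshot` suffix, are assumptions chosen for illustration and are not necessarily the identifiers in our codebase.

```kotlin
/** Converts a fully qualified class name into a valid Kotlin function name. */
fun testFunctionName(fragmentName: String, suffix: String): String =
    fragmentName.replace('.', '_') + "_" + suffix

/**
 * Renders the source of a JUnit test class with one test function per Fragment.
 * BaseFragmentTest and runScreenshotTest are hypothetical names.
 */
fun renderGeneratedTests(fragmentNames: List<String>): String {
    // Sort so the generated file is stable from run to run.
    val functions = fragmentNames.sorted().joinToString("\n\n") { name ->
        val functionName = testFunctionName(name, "screenshot")
        "    @Test\n" +
            "    fun $functionName() = runScreenshotTest(\"$name\")"
    }
    return "// Generated file. Do not edit.\n" +
        "import org.junit.Test\n\n" +
        "class GeneratedFragmentTests : BaseFragmentTest() {\n" +
        functions + "\n" +
        "}\n"
}
```

Adding a new test type in this sketch means emitting one more function per Fragment with a different suffix and base class method.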

There are several advantages to this code generation approach:

Removes overhead of manually creating and maintaining tests for each Fragment

Makes it trivial to split up our test work into whatever chunks the script is capable of. This is very important for supporting test sharding and ensuring the tests are scalable. A simple approach would run all Fragments together in a single test, which is not scalable.

Makes it trivial to add new test types. When we added interaction tests in addition to screenshot tests, all we had to do was add a few lines to our script to generate an additional function for each Fragment.

It is also worth noting some design decisions we made:

Since our mock framework uses a custom DSL it would be more complex for the script to detect all of the mock variant names for each Fragment. Instead, we just collect the Fragment’s name. This means that each test function must run all mocks for a Fragment, instead of being able to shard each mock into its own test. This is generally fine, but can lead to longer test times for Fragments with many mock variants. To combat this we have implemented a basic system for splitting mocks into groups, which was designed to be easily parsed via the script.

The script’s AST representation of each Kotlin file is limited — it views each file in isolation and cannot follow reference chains. For example, we can’t easily see the type of a property if it’s defined in another file. This means that the script needs to make some potentially error prone best guesses. In an ideal world we could compile and run the whole app, and programmatically access any information we need — this would allow us to gather details on all MvRxFragments and declared mocks in a guaranteed way (similar to how the MvRx launcher works). However, this would add significant complexity and run time to the process, both of which we strive to reduce. In practice, the Kotlin script approach works very well for our needs, and runs in just a few seconds.

JUnit supports parameterized tests, and these can be declared programmatically at runtime. This is almost a perfect alternative to our Kotlin script approach for generating tests. However, parameterized tests do not work with test sharding, so it would hamper our ability to easily scale our tests.

Similarly, a naive approach could be used to manually define tests that group Fragments, such as a test for all Fragments starting with each letter of the alphabet. This works to some extent, but also does not scale easily and doesn’t evenly distribute tests across shards.

A custom annotation processing system could be built, where Fragments are annotated with an annotation that is detected at compile time and used to generate the test files. The accuracy of this would be fantastic, but it has a few downsides: 1) It requires boilerplate to annotate each Fragment, instead of allowing automatic detection. 2) Annotation processors increase compile time. 3) Test files can’t be generated statically; they would rely on compilation.

This test generation script is run at the start of our CI testing job, so that the test source file exists in the project directory before we build the project. See the section about our CI job below for more details on CI configuration.

Overall, our scripting approach achieves the following goals:

Programmatically detects Fragments and their mock data with an acceptable amount of granularity and accuracy

Enables efficient scaling of tests, and fairly equal distribution of test time across test shards

Doesn’t add much time to the CI job when configuring the test run

Enables us to easily add new test types

Does not add much complexity to the project

Allows for minimal overhead on the part of developers when creating mocks for a Fragment

Enables us to only run tests on Fragments that are affected by changes

CI Infrastructure for Integration Tests

Our test suite runs on every GitHub Pull Request to our app’s repository. We use Buildkite for running CI, which makes it easy to add as many separate pipelines as we want for different test types. For example, unit tests and integration tests run in different pipeline jobs so they operate in parallel. Since this article series has focused on integration tests, we’ll just look at how that pipeline works.

After pulling our app repository, we first run the test generation script that was explained above. This parses our project to find Fragments and mocks that should be used in tests, and generates a JUnit test source file in our androidTest directory. What wasn’t mentioned before is that this script is also smart about which Fragments are included in the test.

Detecting Changed Fragments

Our pipeline compares the PR branch with the branch it is being merged into, and checks which files have changed. The changed files are then used to determine which modules were affected by changes. Our test generation script uses this to exclude Fragments if they are in modules that weren’t affected by changes, allowing us to only run tests on Fragments that may have had their behavior changed. This allows us to run tests more quickly, and to save money on Firebase costs.
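The first half of this step, mapping changed files to modules, can be sketched as follows. The sketch assumes each module is identified by its directory path relative to the repository root (for example `feat/listing`); that layout and the function name are hypothetical.

```kotlin
/**
 * Maps changed file paths to the modules that contain them.
 * Assumes each module is a directory path relative to the repository root;
 * files outside every known module are ignored.
 */
fun changedModules(changedFiles: List<String>, modulePaths: Set<String>): Set<String> =
    changedFiles.mapNotNull { file ->
        // Pick the longest matching prefix so nested modules win over their parents.
        modulePaths
            .filter { module -> file.startsWith("$module/") }
            .maxByOrNull { it.length }
    }.toSet()
```

The list of changed files itself would come from comparing the two branches, for instance with a `git diff --name-only` against the merge target.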

There are a few tricky things to watch out for if you take this approach:

Module dependencies must be considered. If Module A contains a file that was changed, and Module B depends on Module A, then we must make sure Module B is included in our tests as well. This requires using a module dependency graph to determine changes.

External dependency changes can also affect modules. If you change the version of a library you are using then all modules depending on that library should be tested. For simplicity, we have a single file where all our dependency versions are declared and we run all tests if there are any changes to that file.
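Both caveats can be illustrated with a small sketch: a breadth-first walk over the inverted dependency graph, plus a check for the shared versions file. The file name `dependencies.gradle` and the function names are placeholders, not the names used in our project.

```kotlin
/**
 * Returns every module that must be tested: the changed modules plus all modules
 * that depend on them, directly or transitively.
 * [dependencies] maps a module to the modules it depends on.
 */
fun affectedModules(
    changed: Set<String>,
    dependencies: Map<String, Set<String>>
): Set<String> {
    // Invert the graph so we can walk from a module to its dependents.
    val dependents = mutableMapOf<String, MutableSet<String>>()
    for ((module, deps) in dependencies) {
        for (dep in deps) dependents.getOrPut(dep) { mutableSetOf() }.add(module)
    }
    val affected = changed.toMutableSet()
    val queue = ArrayDeque(changed)
    while (queue.isNotEmpty()) {
        val current = queue.removeFirst()
        for (dependent in dependents[current].orEmpty()) {
            // add() returns false for modules we have already visited.
            if (affected.add(dependent)) queue.addLast(dependent)
        }
    }
    return affected
}

/** Decides which modules to test; a change to the versions file selects everything. */
fun modulesToTest(
    changedFiles: List<String>,
    changed: Set<String>,
    dependencies: Map<String, Set<String>>,
    versionsFile: String = "dependencies.gradle" // hypothetical file name
): Set<String> {
    val allModules = dependencies.keys + dependencies.values.flatten()
    return if (versionsFile in changedFiles) allModules else affectedModules(changed, dependencies)
}
```

With dependencies B -> A and C -> B, a change in A selects A, B, and C, while a change in an unrelated module D selects only D.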

Handling Firebase Outages

At this point, we have generated a test file with Espresso entry points to test the Fragments affected by the PR’s changes. The next step is to run commands to build the app APK and test APK, but before continuing we run a sanity check to ensure that Firebase is not down.

Our integration tests all run on Firebase, and if it is down because of an incident the tests may fail. This has some negative consequences on developer productivity:

Confusion from developers about why it failed and what they should do

An inundation of help requests to our team

Unfortunately, Firebase has been unstable frequently enough for this to be an issue for us, so we have an automated approach to handle it. Our pipeline uses the Firebase status API to check for ongoing incidents with Firebase Test Lab, and posts a comment back to the PR if an incident is detected. The comment includes a link to the incident and instructions about how the developer should handle it.
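The decision logic of this check might look like the sketch below. It deliberately leaves out the network call and the parsing of the status feed; the `Incident` type, its fields, and the comment wording are all hypothetical and do not reflect the actual schema of the Firebase status API.

```kotlin
/** Minimal view of a status incident; the field names here are hypothetical. */
data class Incident(val title: String, val url: String, val resolved: Boolean)

/**
 * Returns the PR comment to post when Test Lab has an ongoing incident,
 * or null when no comment is needed.
 */
fun outageComment(incidents: List<Incident>): String? {
    val ongoing = incidents.filter { incident ->
        !incident.resolved && incident.title.contains("Test Lab", ignoreCase = true)
    }
    if (ongoing.isEmpty()) return null
    val links = ongoing.joinToString("\n") { "- ${it.title}: ${it.url}" }
    return "Firebase Test Lab is reporting an ongoing incident, so test failures on this PR " +
        "may be unrelated to your changes.\n$links\n" +
        "Retry the integration test job once the incident is resolved."
}
```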

Sharding with Flank

Now, we have our test file generated, our APKs built, and we’re ready to actually run our tests on Firebase. This is fairly straightforward with Firebase’s gcloud command line support, but that will run all tests serially in a single test matrix. This can take a long time if you have many tests, and does not scale well over time.

Thankfully, a great open source library called Flank is available to help us split our tests into shards, which run in parallel on multiple Firebase test matrices. Our full test suite takes about two hours of total test time, but runs in just a few minutes when sharded (excluding setup and teardown time).

Flank’s documentation is fairly straightforward so I won’t go into detail on our setup. Our pipeline simply uses a script to generate a flank.yml file containing the configuration we want, and Flank handles the rest.
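For readers who want a starting point, a config generator can be as small as the sketch below. The YAML keys (`gcloud.app`, `gcloud.test`, `gcloud.device`, `flank.max-test-shards`) are written from memory of Flank's documented layout and should be verified against the Flank version you use; the function name and parameters are hypothetical, and this is not our actual configuration.

```kotlin
/**
 * Renders a minimal flank.yml.
 * Key names are assumed from Flank's documented YAML layout;
 * verify them against the Flank version you use.
 */
fun renderFlankConfig(
    appApk: String,
    testApk: String,
    deviceModel: String,
    deviceVersion: Int,
    maxShards: Int
): String = """
    gcloud:
      app: $appApk
      test: $testApk
      device:
        - model: $deviceModel
          version: $deviceVersion
    flank:
      max-test-shards: $maxShards
""".trimIndent() + "\n"
```

The pipeline script would write the returned text to flank.yml before invoking Flank.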

One thing to note is that the device you choose to test with can make a big difference in test time. For us, emulators are much slower than physical devices, so we only test on physical devices to reduce test time. Additionally, the type of device can drastically affect test time. We initially used a Google Pixel for our tests, without thinking much of it. Later experimentation with a Pixel 3 showed that it ran our tests about twice as fast. This isn’t surprising in retrospect, but is a good reminder to be intentional about the test device you use.

Handling Results

Once tests are running, our last responsibility in the pipeline is to communicate results back to the PR.

Failures

On test failure, a naive approach is to simply let the CI job fail if Flank exits with any error. However, this leaves the end developer with little insight into why it failed, and they then have to spend time diving into the CI logs to find links to the correct Firebase test matrices — this is even more difficult with sharded tests because it isn’t necessarily clear which shards failed.

Automated tooling can again help here, so that we can reduce friction for the developer as much as possible. Our pipeline script does the following:

Uploads all of Flank’s output files as Buildkite artifacts so that they are easily accessible for debugging

Parses Flank’s JUnit report to collect a list of test matrices containing failures

Posts a comment back to the PR with links to the failed matrices

[Image: a PR comment linking to the Firebase Test Lab matrices that had test failures]

This allows developers to easily access Firebase failures directly from the PR. It also allows us to include links to documentation on how the developer should handle the failures and how to work with Firebase, which is important as our contributor count continues scaling.
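The report-parsing step can be sketched with the XML parser that ships with the JDK. The sketch relies only on the conventional JUnit XML attributes `name`, `failures`, and `errors` on each `<testsuite>` element. How a suite maps to a Firebase matrix link depends on Flank's output, so here that mapping is passed in as a hypothetical `matrixUrls` parameter.

```kotlin
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

/**
 * Returns the names of the <testsuite> elements in a JUnit XML report that
 * recorded at least one failure or error.
 */
fun failedSuites(junitXml: String): List<String> {
    val document = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(ByteArrayInputStream(junitXml.toByteArray()))
    val suites = document.getElementsByTagName("testsuite")
    return (0 until suites.length)
        .map { index -> suites.item(index) as Element }
        .filter { suite ->
            // Missing attributes come back as "", which we treat as zero.
            val failures = suite.getAttribute("failures").toIntOrNull() ?: 0
            val errors = suite.getAttribute("errors").toIntOrNull() ?: 0
            failures + errors > 0
        }
        .map { suite -> suite.getAttribute("name") }
}

/** Builds the PR comment; [matrixUrls] maps a suite name to its matrix link (hypothetical). */
fun failureComment(failed: List<String>, matrixUrls: Map<String, String>): String? {
    if (failed.isEmpty()) return null
    val lines = failed.joinToString("\n") { suite ->
        val link = matrixUrls[suite] ?: "link unavailable"
        "- $suite: $link"
    }
    return "Integration tests failed in these Firebase Test Lab matrices:\n$lines"
}
```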

Generating Happo Reports

Previous articles discussed in depth our use of Happo to generate diff reports for both screenshots and interaction details. This Happo integration is done as one of the last steps of our CI job: the script compares the Happo report that was just generated in the tests to the latest report on the master branch. If there are any differences then a comment is posted back to the PR with details.

[Image: a Happo PR comment calling out visual changes that were detected]

The developer must follow the link to inspect the diff results and acknowledge that the changes are intentional before merging the PR (see previous articles in this series for more details on this Approval Testing approach).

It’s worth mentioning that Happo works well with sharded tests. Each shard uploads its own report, and Happo combines all of these partial reports into a final report representing all Fragments. One complication, however, is that the final report will be incomplete when our tests only include Fragments affected by changes. Unchanged Fragments are not included in the testing and are thus missing from the Happo reports, so the final diff would show them as having been removed.

Our solution here is a script that pulls the latest Happo report from master, and uses the list of unchanged Fragments to copy the details from those screens into the new report. This works well, but there is one gotcha: we must make sure the master report has been fully created. It’s possible that the CI job to generate a new master report for the latest commit is still running if the PR was recently rebased, and to avoid that race condition we save this step for the very end.
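The copy step can be illustrated with a small sketch. Reports are modeled here as maps from Fragment name to a snapshot identifier, which is a hypothetical simplification of a real Happo report, and the function name is likewise made up.

```kotlin
/**
 * Fills in a partial report with snapshots from the master report.
 * Reports are modeled as maps from Fragment name to a snapshot identifier;
 * this is a hypothetical simplification of a real Happo report.
 */
fun completeReport(
    partial: Map<String, String>,
    master: Map<String, String>,
    unchangedFragments: Set<String>
): Map<String, String> {
    // Copy only Fragments that were skipped; tested Fragments keep their new snapshots.
    val carriedOver = master.filterKeys { it in unchangedFragments && it !in partial }
    return partial + carriedOver
}
```

Fragments that are in the master report but neither tested nor listed as unchanged stay out of the result, so a genuinely deleted Fragment still shows up as removed in the diff.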

Code Coverage Reports

A final responsibility of the integration test pipeline is to pull code coverage data, and compute a code coverage report for the PR. We use standard JaCoCo tooling, which is straightforward to integrate with Flank and Firebase. However, some complexity arises since we run integration tests and unit tests in separate pipelines, and need to combine those reports to get absolute coverage data.

A coverage report is posted back to the PR to provide awareness to developers. We are still in the early stages of building our code coverage tooling and hope to improve our features here over time.

PR Comments

In several places I have mentioned that our CI pipeline posts comments back to the PR to clearly surface information to the developer. This is done via a tool we built to easily enable any pipeline to post a comment via a simple API. It handles updating a comment if we want to change its content, or deleting it if it is no longer applicable, both of which are important when new commits are pushed and the job is rerun.

This tool builds on top of GitHub’s API to create and delete comments, and abstracts away the details of working with that API directly. Additionally, GitHub doesn’t provide an easy way to update a comment, so we instead need to delete a previous comment and add a new one when we want to “update” it. To do this, our tool requires that each message be associated with a String key, and it associates a GitHub comment id with this String key in an AWS database. This way it can look up what comment may already exist for that key, and get the id needed to delete it.
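The keyed delete-then-create behavior described above can be sketched as follows. The two function parameters stand in for calls to GitHub's API and the in-memory map stands in for the database; the class and method names are hypothetical.

```kotlin
/**
 * Posts at most one comment per key, replacing any earlier comment for that key.
 * [createComment] returns the id of the new comment; both function parameters
 * stand in for real GitHub API calls.
 */
class KeyedComments(
    private val createComment: (body: String) -> Long,
    private val deleteComment: (id: Long) -> Unit,
    // Stand-in for the database that maps keys to comment ids.
    private val idsByKey: MutableMap<String, Long> = mutableMapOf()
) {
    /** "Updates" a comment by deleting the old one and creating a new one. */
    fun post(key: String, body: String) {
        remove(key)
        idsByKey[key] = createComment(body)
    }

    /** Deletes the comment for [key] if one exists, for example when a rerun passes. */
    fun remove(key: String) {
        idsByKey.remove(key)?.let { id -> deleteComment(id) }
    }
}
```

A pipeline would call `post("happo-diff", body)` on every run and `remove("happo-diff")` once the diff is clean, without needing to know any comment ids.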

This tool for managing comments has been extremely helpful to us. It makes it simple for our CI pipelines to surface information to the user in a clean way, while reducing complexity within the pipeline itself.

Closing Thoughts

By now you hopefully have a good understanding of our philosophy on testing Android code, and how we have built systems to make testing easier and more comprehensive. Most of these systems have been built in just the past year, and while we are very pleased with our progress, we always have an eye towards what’s next.

Future Improvements

Our existing suite of tests covers a high percentage of code paths, but isn’t perfect. We are planning ways to expand the use cases it can test. Thankfully, our mock architecture makes it easy to build new test systems on top of it.

Some areas we’d like to explore are:

Automated testing of deep link handling

End-to-end tests that run through multiple screens and hit production APIs

Support for manual, custom Espresso tests that leverage the mocking framework

Automated performance benchmarking via the new Jetpack Benchmark library

Automated support for testing other common code paths, such as EditText inputs and onActivityResult implementations

Testing optimized builds (R8/ProGuard) to catch issues that surface only in production

Open Source Plans

We are proud of the testing work we’ve done, and are excited to share it! It was designed to integrate with MvRx, and so is a natural extension to that open source library. We are in the process of releasing these testing frameworks as an addition to MvRx (starting with the 2.0.0 alpha release) and are looking forward to feedback and contributions from the community.

Series Index

This is a seven-part article series on testing at Airbnb.

Part 1 — Testing Philosophy and a Mocking System

Part 2 — Screenshot Testing with MvRx and Happo

Part 3 — Automated Interaction Testing

Part 4 — A Framework for Unit Testing ViewModels

Part 5 — Architecture of our Automated Testing Framework

Part 6 — Obstacles to Consistent Mocking

Part 7 (This article) — Test Generation and CI Configuration

We’re Hiring!

Want to work with us on these and other Android projects at scale? Airbnb is hiring for several Android engineer positions across the company! See https://careers.airbnb.com for current openings.