The Rust Infrastructure team is evaluating whether it’s worth migrating away from Travis CI in the near future. This discussion started in autumn 2018, after a summer full of serious Travis CI issues and outages (more on that below), and the rest of the year was still bad enough that we’re considering migrating to another CI platform.

We’re going to discuss this at the Rust All Hands, taking place next week in Berlin. We’re researching alternative CI platforms on our own, but we’ll likely miss some of them, so please suggest the ones you know or use here in the thread (see our requirements below)! We’re looking forward to reading through your suggestions, and we will consider them when making a decision!

Please note that, even though we’re an open source project, we pay a lot of money to Travis CI and we would like to make sure that we’re getting the best value for money. If an alternative CI platform requires us to pay, that’s fine.

– The Rust Infrastructure team.

What problems did we have with Travis CI

Still present

Sometimes scripts included by Travis in the build fail to execute, causing an otherwise-good build to spuriously fail. We don’t have any control over those scripts, so we can’t prevent those failures. There is not really a tracking issue for this; it just happens sometimes.

Resolved problems since May 2018

2018-05-24 → 2018-12-18: Broken networking on Docker at ~6:30 UTC. This caused a spurious failure almost every day: a cronjob running on the images disabled IPv4 forwarding every day at around 6:30 UTC. It took Travis CI 208 days to investigate and resolve this issue, despite repeated pings both in the issue tracker and on the support email.

2018-06-08: Travis CI outage that prevented any build from starting

2018-07-25 → 2018-07-26: macOS builders skipped due to a bug in the configuration parser. Travis updated their configuration parser that day, but a bug in it ignored jobs in the build matrix if they used a specific if: syntax. Since we were using that syntax for our macOS jobs, they were ignored. This caused PRs to land without actually being tested on macOS, a Tier 1 platform. We noticed it because users reported a missing beta on macOS.

2018-07-26: Travis “lost” our macOS images. According to Travis Support, along with the previous incident they “lost” our macOS image. We use a custom image called xcode9.3-moar for our builds, which gives us more cores.

2018-07-27 → 2018-08-24: Spurious shutdowns of Travis CI builders in the middle of a build. A bug in their software marked our VMs as TERMINATE instead of MIGRATE when a hypervisor needed maintenance. It took Travis CI 28 days to deploy a fix for this issue, which caused multiple failures a day in the meantime.

2018-09-12 → 2018-09-13: Travis CI reduced our timeout from 3 hours to 50 minutes. A refactoring of their software removed the piece of code that increased the job timeout for allowed repositories, including rustc. It took Travis CI a day to deploy a fix, which blocked the whole queue in the meantime.

2018-10-04: Travis CI failed to generate build scripts. A few builds spuriously failed due to “some network slowness inside our systems” (Travis Support). The rate of spurious failures decreased after we reported it, and I don’t think we tracked it after that.
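The resolution times quoted above (208 days, 28 days, one day) can be double-checked from the date ranges in the list. This is only a small sketch: the short incident labels are ours, and it assumes the first date of each range is when the problem started and the second is when the fix landed.

```python
from datetime import date

# Date ranges copied from the incident list above.
# The short labels are our own summaries, not official incident names.
INCIDENTS = [
    ("Broken networking on Docker", date(2018, 5, 24), date(2018, 12, 18)),
    ("Spurious builder shutdowns", date(2018, 7, 27), date(2018, 8, 24)),
    ("Timeout reduced to 50 minutes", date(2018, 9, 12), date(2018, 9, 13)),
]


def days_to_resolve(start, end):
    """Number of days between the start of an incident and its fix."""
    return (end - start).days


durations = {label: days_to_resolve(start, end) for label, start, end in INCIDENTS}

for label, days in durations.items():
    print(f"{label}: {days} days")
```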



The migration to travis-ci.com

Due to the GitHub Services sunset happening on January 31st, Travis CI was forced to migrate existing repositories away from GitHub Services, and they initially decided to do that at the same time as the migration of public repositories away from travis-ci.org to their unified platform on travis-ci.com. (Note: that decision was reverted, and repos on .org are migrated to webhooks instead.)

The way they handled the migration was far from ideal: one month before the cutoff date there had still been no notification about the migration, and the migration changed the way build results are reported to GitHub, so it required manual action from everyone with custom infrastructure based on Travis (like us). We learned about the migration when one infrastructure team member happened to notice the GitHub Services sunset and we asked Travis Support ourselves.

On top of that, Travis Support gave us wrong information when we talked to them: they said encrypted secret variables were not migrated (while they were migrated perfectly fine), and they said there was no way for us to keep using travis-ci.org or commit statuses after January 31st, even though that’s now the plan for everyone who didn’t migrate.

Also, the migration process was rough (it was marked as beta, so it’s sort of expected, even though we were one month away from the migration…). Cronjobs were not migrated and were broken after we manually migrated them with the API (turns out we hit a bug in their API), branch protection had to be updated on every repository, and even today build badges are not working, since there is no redirect in place.

What requirements we have for a replacement

These are the requirements we have for a Travis CI replacement. We aren’t looking for an AppVeyor alternative at the moment.

Hard requirements

The service must be operated by a company we can contact directly for support. Building and maintaining a CI system in a reliable way for a project as big as ours takes a lot of time, especially to test on macOS, and most of the Rust infrastructure team is not paid to work on infrastructure. If the service requires us to use our own servers, the maintenance work we have to do should be minimal. Support should be direct, prompt (as appropriate), and helpful.

The service must provide both Linux and macOS machines. Windows support could be nice, but switching away from AppVeyor is not a priority for us.

The service must allow us to increase timeouts and the available resources in the VMs. Anecdotally, we need at least 4-core machines. We currently run 57 builders per PR: 14 Windows + 5 Mac + 38 Linux.

The service must be able to build and execute Docker containers (or a comparable system for enabling a level of reproducible builds).

The service should be somewhat established, in the sense that we won’t have to go looking for a new solution in a year.
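To put the builder numbers from the resources requirement in perspective, here is a quick tally. It is only a sketch, and it assumes the 14 Windows builders stay on AppVeyor (we are not looking to replace it for now), so a Travis CI replacement would only need to cover the Linux and Mac builders.

```python
# Builder counts per PR, as quoted in the hard requirements above.
BUILDERS_PER_PR = {"Windows": 14, "Mac": 5, "Linux": 38}

# Total number of builders currently started for each PR.
total = sum(BUILDERS_PER_PR.values())

# Assumption: Windows builders remain on AppVeyor, so only the Linux and
# Mac builders would move to the replacement service.
replacement_builders = BUILDERS_PER_PR["Linux"] + BUILDERS_PER_PR["Mac"]

print(f"Builders per PR: {total}")
print(f"Builders a Travis CI replacement would need to run: {replacement_builders}")
```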

Nice to have