The TaskCluster Platform team worked very hard in Q2 to support the migration off Buildbot, bring new projects into our CI system, and experiment with approaches that might enable fully automated VM deployment on hardware in the future.

We also brought on 5 interns. For a team of 8 engineers and one manager, this was a tremendous team accomplishment. We are also working closely with interns on the Engineering Productivity and Release Engineering teams, resulting in a much higher communication volume than in months past.

We continued our work with RelOps to land Windows builds, and those are available in pushes to Try. This means people can use “one-click loaners” for Windows builds as well as Linux (through the Inspect Task link for jobs)! Work on Windows tests is proceeding.

We also created try pushes for Mac OS X tests and integrated them with the Mac OS X cross-compiled builds. This also meant a deep dive into the cross-compiled builds to green them up in Q3 after some compiler changes.

A big part of the work for our team and for RelEng was preparing to implement a new kind of signing process. Aki and Jonas spent a good deal of time on this, as did many other people across PlatformOps. What came out of that work was a detailed specification for TaskCluster changes and for a new service from RelEng. We expect to see prototypes of these ideas by the end of August, and the major blocking changes to the workers and provisioner to be complete then too.

This all leads to being able to ship Linux Nightlies directly from TaskCluster by the end of Q3. We’re optimistic that this is possible, with the knowledge that there are still a few unknowns and a lot has to come together at the right time.

Much of the work on TaskCluster is like building a 747 in flight. The microservices architecture enables us to ship small changes quickly and without much pre-arranged coordination. As time has gone on, we have consolidated some services (the scheduler is deprecated in favor of the “big graph” scheduling done directly in the queue), separated others (we’ve moved Treeherder-specific services into their own component, and are working to deprecate mozilla-taskcluster in favor of a taskcluster-hg component), and refactored key parts of our systems (in-tree scheduling last quarter was an important change for usability going forward). This kind of change is starting to slow down as the software and the team adapt and mature.

I can’t wait to see what this team accomplishes in Q3!

Below is the team’s partial list of accomplishments and changes. Please drop by #taskcluster or send an email to our tools-taskcluster mailing list on lists.mozilla.org with questions or comments!

Things we did this quarter:

initial investigation and timing data around using sccache for linux builds

released an update for sccache to allow working in a more modern Python environment

created taskcluster-managed S3 buckets with appropriate policies

tested linux builds with a patched version of sccache

tested docker-worker on packet.net for on-hardware testing

worked with jmaher on talos testing with docker-worker on releng hardware

created livelog plugin for taskcluster-worker (just requires tests now)

added reclaim logic to taskcluster-worker

converted gecko and gaia in-tree tasks to use new v2 treeherder routes

updated gaia-taskcluster to allow GitHub repos to use new taskcluster-treeherder reporting

move docs, schemas, references to https

refactor documentation site into tutorial / manual / reference

add READMEs to reference docs

switch from a wildcard certificate to a SAN certificate for taskcluster.net

increase accessibility of AWS provisioner by separating bar-graph stuff from workerType configuration

use roles for workerTypes in the AWS provisioner, instead of directly specifying scopes
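As an illustration of the idea, a role lets the provisioner grant a bundle of scopes to every worker of a given type instead of copying the scopes into each workerType definition. The roleId and scopes below are hypothetical examples, not our production configuration:

```yaml
# Hypothetical role definition; the workerType name and scopes are
# illustrative, not real production values.
roleId: worker-type:aws-provisioner-v1/example-builder
scopes:
  - queue:claim-task:aws-provisioner-v1/example-builder
  - secrets:get:project/example/build-secrets
```

Roughly speaking, a workerType then only needs to assume its role; changing the role updates the effective scopes of every worker of that type without editing the workerType itself.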

allow non-employees to login with Okta, improve authentication experience

named temporary credentials
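To give a flavor of what “named” means here (this is a simplified sketch, not TaskCluster’s actual certificate format or signing scheme): a temporary credential is a certificate, signed with the issuer’s long-lived access token, that names the temporary client and bounds its scopes and lifetime. All identifiers below are invented for illustration.

```python
# Simplified illustration of named temporary credentials.
# NOT the real TaskCluster wire format -- just the general shape:
# an HMAC-signed certificate naming the temporary client.
import base64
import hashlib
import hmac
import json
import time

def issue_temp_credentials(issuer_id, issuer_token, temp_client_id,
                           scopes, lifetime_s=3600):
    """Issue a signed, time-limited certificate for a named temporary client."""
    now = int(time.time())
    cert = {
        "issuer": issuer_id,         # who vouches for the temporary client
        "clientId": temp_client_id,  # the "name" of the temporary credential
        "scopes": scopes,
        "start": now,
        "expiry": now + lifetime_s,
    }
    payload = json.dumps(cert, sort_keys=True).encode()
    sig = hmac.new(issuer_token.encode(), payload, hashlib.sha256).digest()
    cert["signature"] = base64.b64encode(sig).decode()
    return cert

def verify(cert, issuer_token):
    """Check the signature and that the certificate has not expired."""
    unsigned = {k: v for k, v in cert.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(issuer_token.encode(), payload, hashlib.sha256).digest()
    ok = hmac.compare_digest(base64.b64encode(expected).decode(),
                             cert["signature"])
    return ok and time.time() < cert["expiry"]
```

Because the temporary clientId is embedded in the signed certificate, audit logs can attribute actions to the named credential rather than to the issuing client.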

use npm shrinkwrap everywhere

enable coalescing

reduce the artifact retention time for try jobs (to reduce S3 usage)
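A retention policy like this can be expressed as an S3 lifecycle rule; the prefix and day count below are illustrative guesses, not our actual settings:

```yaml
# Hypothetical S3 lifecycle configuration (illustrative values only).
Rules:
  - ID: expire-try-artifacts
    Prefix: try/       # only artifacts from try pushes
    Status: Enabled
    Expiration:
      Days: 28         # delete objects 28 days after creation
```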

support retriggering via the treeherder API

document azure-entities

start using queue dependencies (big-graph-scheduler)
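With queue dependencies, the task graph lives directly in the task definitions: each task lists the taskIds it depends on, and the queue schedules it once they resolve. A sketch (the taskId and comments are placeholders):

```yaml
# Fragment of a task definition using queue dependencies.
# The taskId below is a placeholder.
dependencies:
  - fZ42aBcDQ-qd3t9kVkJdSg   # e.g. the build task this test task waits on
requires: all-completed      # schedule only if all dependencies succeeded
```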

worked with NSS team to have tasks scheduled and displayed within treeherder

improved information within docker-worker live logs to include environment information (IP address, instance type, etc.)

added hg fingerprint verification to decision task

responded to security incidents discovered in Q2 and deployed patches

taskcluster-stats-collector running with SignalFx

most major services using SignalFx and Sentry via new monitoring library taskcluster-lib-monitor

experimented with QEMU/KVM and libvirt for powering a taskcluster-worker engine

QEMU/KVM engine for taskcluster-worker

implemented Task Group Inspector

organized efforts around front-end tooling

rewrote and generalized the build process for taskcluster-tools and future front-end sites

created the Migration Dashboard

organized efforts with contractors to redesign and improve the UX of the taskcluster-tools site

first Windows tasks in production – NSS builds running on Windows 2012 R2

Windows Firefox desktop builds running in production (currently shown on staging treeherder)

new features in generic worker (worker type metadata, retaining task users/directories, managing secrets in secrets store, custom drive for user directories, installing as a startup item rather than service, improved syscall integration for logins and executing processes as different users)

many firefox desktop build fixes including fixes to python build scripts, mozconfigs, mozharness scripts and configs

CI cleanup: https://travis-ci.org/taskcluster

support for relative definitions in jsonschema2go

schema/references cleanup

paying down technical debt