Lots of progress for Debian's reproducible builds

Over the last year or two, there has been a lot of talk about "reproducible builds": the ability for two builds of a given source package to produce byte-for-byte identical binaries. Projects like Bitcoin and Tor have a strong interest in allowing their users to verify that the binaries they distribute correspond exactly to the published source code. For Linux distributions, doing the same for their repositories is a much bigger job: hundreds or thousands of source code packages would need to be built in a reproducible way.

As it turns out, at least one distribution is taking that job on. The Debian Reproducible Builds project has recently gotten more than 80% of packages to build reproducibly, as Jérémy Bobbio (aka Lunar) reported. Doing so requires an experimental toolchain, but the effort now covers more than 17,000 packages. Given that Debian's package repository is generally a superset of other distributions' repositories (or close), the work the project is doing should, at minimum, provide other interested distributions with pointers toward ... well ... reproducing this work for themselves.

There are a number of issues that stand in the way of reproducible (or deterministic) builds. First off, the contents of the binaries built for each package are dependent on the build environment, which includes things like tool versions, system time, build paths, host names, and so on. There are also a few more subtle factors. For example, both the ordering of file names in the filesystem and the locale affect how tar creates an archive file, so two seemingly identical filesystem trees can produce different tar files on different systems. Once all of those factors have been handled, though, it is also necessary to record that information with the package so that others can duplicate the results.

The solution to the latter problem for Debian is the .buildinfo file, which is based on the format of the .changes file (which indicates what has changed in a new version of a package). The .buildinfo file records all of the packages required to build the package, along with the version numbers of each. It also has some basic information about the package, its version, hashes of the .deb files, the build path used, and so on. For a build to count as reproduced, .deb files of the same package and version that are built on separate machines must all match the hashes in .buildinfo.
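The comparison step can be illustrated with a minimal Python sketch. It is not the project's tooling and does not parse a real .buildinfo file; the `recorded` dictionary, the function names, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large packages need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(recorded, build_dir):
    """Return the names whose rebuilt file is missing or hashes differently.

    `recorded` is a hypothetical dict of file name -> expected hex digest,
    standing in for the hashes listed in a .buildinfo file.
    """
    mismatches = []
    for name, expected in sorted(recorded.items()):
        candidate = Path(build_dir) / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            mismatches.append(name)
    return mismatches
```

An empty list from `find_mismatches()` would mean that every rebuilt file matched its recorded hash; any name in the list points at a package that did not reproduce.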

The .buildinfo files can then be signed by Debian developers (DDs). The signature asserts that each signing DD was able to reproduce the package exactly using the information found in the file. Those signatures will be kept in separate files that are referenced from a "Build-Signed-Off-By" entry in the "Packages" files. The presence of those signatures will allow users to have confidence in the packages without actually rebuilding them (using the reproducible mechanism, of course) themselves.

For package maintainers who want to make their package reproducible, the project has a How-to page. It contains a recommendation that packagers use the debhelper packaging style, but has tips for those using other styles (including "roll your own"). The experimental toolchain contains modified versions of debhelper and cdbs to incorporate the changes needed for deterministic builds.

There is also a list of the kinds of problems a maintainer may encounter when trying to make their package build reproducibly. This includes issues like the members of the data.tar file (which is the core of a .deb package) being stored in an order that varies from build to build. The solution to that is to set the locale appropriately and to sort directory listings before handing them off to tar. There are also examples for dealing with timestamps in a whole raft of different kinds of generated files, as well as handling a number of other build problems that lead to non-deterministic packages.
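The idea of sorting before archiving can be sketched in Python with the standard tarfile module. This is an illustration under stated assumptions rather than what the Debian toolchain does: the function name `deterministic_tar` and the decision to zero out timestamps and ownership are choices made here, and file permissions are passed through unchanged.

```python
import io
import os
import tarfile


def deterministic_tar(root, mtime=0):
    """Return the bytes of an uncompressed tar archive of `root`.

    Members are added in sorted order with fixed metadata, so the result
    does not depend on directory iteration order, file timestamps, or
    file ownership on the build machine.
    """
    def normalize(info):
        # Replace per-machine metadata with fixed values.
        info.mtime = mtime
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as archive:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # in-place sort fixes the order os.walk descends in
            for name in sorted(dirnames + filenames):
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, root)
                # recursive=False: children are added when os.walk reaches them.
                archive.add(full, arcname=arcname, recursive=False, filter=normalize)
    return buffer.getvalue()
```

Python's `sorted()` compares strings by code point, so the order here does not depend on the locale; a shell pipeline would need the locale pinned to get the same guarantee.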

Beyond the changes to debhelper and cdbs, the project has also changed a variety of other pieces of the Debian build infrastructure, including dpkg, build tools for various languages (e.g. Java, Python, R, Haskell), and certain library bindings (e.g. Qt for Python). Most of that work was to handle either timestamps or file-name-ordering problems. All of the changes are making their way upstream so that the normal toolchain can hopefully be used down the road.
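As one concrete instance of the timestamp problem, the gzip format stores a modification time in its header, so compressing the same data at two different moments can yield different bytes. The following Python sketch (the helper name `gzip_bytes` is made up for this example and is not part of any Debian tool) pins that field to a fixed value:

```python
import gzip
import io


def gzip_bytes(data, mtime=0):
    """Compress `data` with a fixed header timestamp instead of the current time."""
    buffer = io.BytesIO()
    # Writing to an in-memory buffer also keeps a file name out of the header.
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as stream:
        stream.write(data)
    return buffer.getvalue()
```

Calling `gzip_bytes()` twice on the same input returns identical output, while varying `mtime` changes the header and therefore the hash of the resulting file.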

While Debian is currently focused on the jessie (8.0) release, Bobbio would like to see reproducible builds become a focus for the following release:

Reproducible builds are not going to change anything for most of our users. They simply don't care how they get software on their computer. But they care to get the right software without having to worry about it. That's our responsibility, as developers. Enabling users to trust their software is important and a major contribution, we as Debian, can make to the wider free software movement. Once Jessie is released, we should make a collective effort to make reproducible builds [a] highlight of our next release.

It is clear that a lot of work has gone into the project over the last few months, with eye-opening results. A look at the project history shows that the whole effort has really only been going for a year and a half or so. There is undoubtedly a long tail of packages that will strongly resist reproducibility, so there is still lots of work to do. Given the progress so far, though, having Debian 9.0 be entirely reproducible doesn't seem out of reach.