Improving .deb


Debian Linux and its family of derivatives (such as Ubuntu) are partly characterized by their use of .deb as the packaging format. Packages in this format are produced not only by the distributions themselves, but also by independent software vendors. The last major change to the format internals happened back in 1995. However, Adam Borowski recently started a discussion of possible changes on the debian-devel mailing list.

As documented in the deb(5) manual page, modern Debian packages are ar archives containing three members in a particular order. The first file is named debian-binary and contains the format version number, currently "2.0", as one line of text. The second archive member is control.tar.xz, containing the package metadata files and the scripts that are executed before and after package installation or removal. Then comes the data.tar.xz file, the archive with the actual files installed by the package. For both the control and data archives, gzip, not xz, was used for compression historically and is still a valid option. The Debian tool for dealing with package files, dpkg, has gained support for other decompressors over time. At present, xz is the most popular one both for Debian and Ubuntu.
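To make the layout concrete, here is a minimal Python sketch that builds a tiny ar archive in memory and lists its members. It assumes the common ar header layout (a 60-byte header with a ten-character decimal size field); the helper names build_ar and list_ar_members are made up for this illustration, and the payloads are placeholders rather than real tarballs.

```python
import io

AR_MAGIC = b"!<arch>\n"   # global header that starts every ar archive
HEADER_LEN = 60           # fixed-size per-member header


def build_ar(members):
    """Build a minimal ar archive from (name, payload) pairs (demo only)."""
    out = io.BytesIO()
    out.write(AR_MAGIC)
    for name, payload in members:
        header = (
            name.encode("ascii").ljust(16)                  # member name
            + b"0".ljust(12)                                # modification time
            + b"0".ljust(6)                                 # owner id
            + b"0".ljust(6)                                 # group id
            + b"100644".ljust(8)                            # file mode (octal)
            + str(len(payload)).encode("ascii").ljust(10)   # size, decimal digits
            + b"`\n"                                        # header terminator
        )
        assert len(header) == HEADER_LEN
        out.write(header)
        out.write(payload)
        if len(payload) % 2:          # members start on even offsets
            out.write(b"\n")
    return out.getvalue()


def list_ar_members(blob):
    """Return [(name, size), ...] in archive order."""
    if blob[:8] != AR_MAGIC:
        raise ValueError("not an ar archive")
    pos, found = 8, []
    while pos + HEADER_LEN <= len(blob):
        header = blob[pos:pos + HEADER_LEN]
        if header[58:60] != b"`\n":
            raise ValueError("bad header terminator")
        # Some ar implementations append "/" to names; strip it if present.
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        found.append((name, size))
        pos += HEADER_LEN + size + (size % 2)   # skip payload and padding
    return found


# Placeholder payloads standing in for the real member contents.
deb = build_ar([
    ("debian-binary", b"2.0\n"),
    ("control.tar.xz", b"<control placeholder>"),
    ("data.tar.xz", b"<data placeholder>"),
])
for name, size in list_ar_members(deb):
    print(name, size)
```

The size field is the part that matters for the rest of the discussion: it is ten characters wide, so a member's length has to fit in ten decimal digits.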

The choice to use ar as the outer archive format might seem strange. After all, the only other modern application of this format is for static libraries (they are ar archives with object code files inside), and the de-facto standard for archives in the Unix world is tar, not ar. The reason for this historical decision is, according to Ian Jackson, that "handwriting a decoder for ar was much simpler than for tar".

Before 1995, a different format, not based on ar, was used for Debian packages. It was, instead, a concatenation of two ASCII lines (format version and the length of the metadata archive) and two gzip-compressed tar archives, one with metadata, similar to the modern control.tar.gz, and one with files, just like data.tar.gz. Even though old-format packages are not in active use now, modern dpkg can still create and install them.
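The old layout is simple enough to sketch in a few lines of Python. This is only an illustration of the description above: the function names are invented here, the payloads are gzip-compressed placeholders rather than real tar archives, and the version string "0.939000" is the value given in the deb-old(5) manual page, not something stated in this article.

```python
import gzip


def build_old_deb(control_gz, data_gz, version=b"0.939000"):
    """Concatenate: version line, control length line, control archive, data archive."""
    length_line = str(len(control_gz)).encode("ascii")
    return version + b"\n" + length_line + b"\n" + control_gz + data_gz


def split_old_deb(blob):
    """Return (version, control_gz, data_gz) from an old-format package."""
    version, rest = blob.split(b"\n", 1)       # first ASCII line
    length_line, rest = rest.split(b"\n", 1)   # second ASCII line
    control_len = int(length_line)
    # The data archive has no length field; it simply runs to end of file.
    return version.decode("ascii"), rest[:control_len], rest[control_len:]


# Placeholders: real packages carry gzip-compressed tar archives here.
control_gz = gzip.compress(b"placeholder for the control tarball")
data_gz = gzip.compress(b"placeholder for the data tarball")

package = build_old_deb(control_gz, data_gz)
version, control_part, data_part = split_old_deb(package)
print(version, len(control_part), len(data_part))
```

Note that the length is a line of decimal text of arbitrary width, and that only the control archive needs one at all.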

Borowski started the discussion about changing the internals of the package format because he saw a few possible improvements that can easily be implemented. For example, while the xz compressor yields the smallest package size, switching to zstd for compression would improve the unpacking time by a factor of eight while still beating the venerable gzip in terms of compression ratio. As Borowski suggested:

Thus, even though we'd want to stick with xz for the official archive, speed gains from zstd are so massive that it's tempting to add support for it, at least for non-official uses, possibly also for common Build-Depends.

To be fair, this is not the first time developers have proposed zstd compression support for inclusion into Debian's dpkg. Also, Ubuntu 18.04 ships with zstd support already enabled in its version of dpkg.

Beyond recommending support for a new compressor, Borowski suggested returning to the old format. The reason was that ar archives (and thus modern deb packages) store the size of their members as a string of no more than ten decimal digits. It means that data.tar.xz can be at most 9,999,999,999 bytes long, or approximately 9.3GiB. While there are no packages of this size in the Debian archive (the largest package is flightgear-data-base, taking "only" 1,178,833,172 bytes), this limitation is indeed a problem for some communities producing unofficial packages, as confirmed by Sam Hartman. The old format does not have a fixed-size length field and thus does not have such a limitation. In addition, in the benchmarks performed by Borowski, even in the apples-to-apples comparison using the gzip compressor for both format versions, the old format was slightly faster to decompress.
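The size ceiling follows directly from the width of the field, as this small Python calculation shows. The variable names are illustrative; the two input figures (ten digits, and the size of flightgear-data-base) are the ones quoted above.

```python
SIZE_FIELD_DIGITS = 10                      # width of the ar size field
LARGEST_OFFICIAL_PACKAGE = 1_178_833_172    # flightgear-data-base, in bytes

# Largest value that fits in ten decimal digits.
max_member_bytes = 10 ** SIZE_FIELD_DIGITS - 1
max_member_gib = max_member_bytes / 2 ** 30

print(f"limit: {max_member_bytes:,} bytes = {max_member_gib:.2f} GiB")
print(f"headroom over largest official package: "
      f"{max_member_bytes / LARGEST_OFFICIAL_PACKAGE:.1f}x")
```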

Jackson, as the developer who introduced the currently used format, responded that Borowski's suggestion is "an interesting proposal". He acknowledged that the size limitation is indeed a problem and explained the rationale behind the current format. Namely, the old format was not easy to extract without dpkg (e.g. on non-Debian systems) and was not easily extensible. A short discussion thereafter confirmed that people do routinely extract .deb files on "foreign" Linux distributions by hand and perceive this ability as an important property of the .deb package format. Extensibility, on the other hand, in practice amounted to the addition of new decompressors and new fields in files that are in the control tarball. All of that could be done with the old format just as well.

However, switching away from the current "ar with tar files inside" format does not necessarily mean returning to the old format. And that's exactly the objection raised by Ansgar Burchardt. He mentioned the use case of extracting only a few data files (such as the Debian changelog, or a pristine copy of the configuration files), which is currently slow. This operation is slow not only because of a slow decompressor, but also because, in order to get to a file in the middle of a compressed tar archive, one has to decompress and discard everything before it. In other words, fixing this slowness would require switching away from a "compressed tar" format for the data archive to something that supports random access. According to Burchardt, if the Debian project were to introduce one incompatible change to the package format anyway, it would also be a chance to move away from tar, or to tack on other improvements that require incompatible changes. Jackson, however, expressed disagreement with the idea of bundling several incompatible changes together.
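The sequential-access problem can be demonstrated with Python's standard tarfile module. The sketch below is not how dpkg works internally; it opens a synthetic xz-compressed tar in forward-only stream mode ("r|xz") and counts how many members the reader has to pass before it reaches the one file it wants. The file names and contents are invented for the demonstration.

```python
import io
import tarfile


def build_tar_xz(files):
    """Create an xz-compressed tar archive in memory from (name, payload) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def extract_one(archive_bytes, wanted):
    """Return (payload, members_read), reading the archive as a forward-only stream."""
    members_read = 0
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|xz") as tar:
        for member in tar:
            # Every earlier member is decompressed and discarded on the way.
            members_read += 1
            if member.name == wanted:
                return tar.extractfile(member).read(), members_read
    raise KeyError(wanted)


# Fifty synthetic data files, followed by the one small file we care about.
files = [(f"./usr/lib/pkg/blob{i:03d}.bin", bytes([i]) * 4096) for i in range(50)]
files.append(("./usr/share/doc/pkg/changelog.Debian",
              b"pkg (1.0-1) unstable; urgency=low\n"))

archive = build_tar_xz(files)
payload, members_read = extract_one(archive, "./usr/share/doc/pkg/changelog.Debian")
print(f"read {members_read} members to reach the changelog")
```

Because the compression wraps the whole tar stream, there is no index that would let the reader jump straight to the member it wants.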

Borowski measured the overhead of switching to a seekable archive format by compressing each file in the /usr directory and the Linux kernel source individually and comparing the total size of the compressed files with the size of a traditional compressed tar.xz archive. As it turns out, individually compressed files, which are needed for a seekable archive, took 1.8x more space, thus making the proposal too expensive. Burchardt suggested retesting with the 7z archiver, because it can do something in between compressing files individually and compressing the whole archive. Namely, to get a file from the middle of the archive, one needs to decompress everything not from the very beginning, but only from the beginning of a so-called "solid block" containing the file in question. The "solid block" size is tunable. Still, even with 16MiB solid blocks, according to Borowski's measurement, "the space loss is massive" (1.2x). This experiment convinced Burchardt that switching to a format that allows random access is just not worth it.
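The kind of comparison Borowski ran can be approximated in a few lines of Python with the standard lzma module. This is a toy version under stated assumptions: the input is a set of synthetic, highly similar files rather than /usr or the kernel source, tar headers are ignored, and the resulting ratio is therefore not comparable to the 1.8x and 1.2x figures above. It only shows the direction of the effect.

```python
import lzma


def compressed_sizes(files):
    """Return (per_file_total, solid_total) in bytes for a list of byte strings."""
    # Seekable layout: every file is its own xz stream.
    per_file = sum(len(lzma.compress(data)) for data in files)
    # Solid layout: one xz stream over all the data concatenated.
    solid = len(lzma.compress(b"".join(files)))
    return per_file, solid


# Synthetic stand-ins for small, similar files (for example, config fragments).
files = [
    f"# config fragment {i}\noption = value\n".encode("ascii") * 20
    for i in range(200)
]

per_file, solid = compressed_sizes(files)
print(f"per-file: {per_file} bytes, solid: {solid} bytes, "
      f"ratio: {per_file / solid:.1f}x")
```

Compressing each file separately pays the stream overhead once per file and cannot exploit redundancy between files, which is why a seekable layout costs space.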

Replacing ar with uncompressed tar as the outer archive format was also proposed. This would eliminate the package-size limitation, while keeping the advantage that Debian packages can be examined and unpacked by low-level shell tools. This is actually the same as the opkg format used by embedded Linux distributions.

Guillem Jover (the maintainer of dpkg) acknowledged the problems with both old and current .deb package formats and, after examining possible alternatives, concluded that the proposal to switch the outer archive format to tar is "the most straightforward and simple of the options". He promised to present a diff to the .deb format documentation and to start adding support in dpkg version 1.20.x. However, Borowski objected to any "archive in archive" format design and especially did not like uncompressed tar as the outer archive, because it wastes bytes on so-called "blocks" that are only relevant for tape drives. Also, optional features of the tar archive format, such as sparse file support, would unnecessarily complicate the implementation.
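The bytes spent on blocks can be estimated with simple arithmetic. The Python sketch below assumes the usual layouts: ar uses an 8-byte magic string plus a 60-byte header per member and pads members to an even length, while tar uses a 512-byte header per member, pads data to 512-byte blocks, ends with two zero blocks, and is commonly padded to a 10240-byte record. The function names and the member sizes are invented for the example, and optional tar extension headers are ignored.

```python
def ar_container_size(member_sizes):
    """Total size of an ar archive holding members of the given sizes."""
    return 8 + sum(60 + size + size % 2 for size in member_sizes)


def tar_container_size(member_sizes, record_size=10240):
    """Total size of a plain tar archive holding members of the given sizes."""
    block = 512
    # One header block per member, data rounded up to whole blocks.
    total = sum(block + -(-size // block) * block for size in member_sizes)
    total += 2 * block                              # end-of-archive marker
    return -(-total // record_size) * record_size   # pad to a full record


# Hypothetical member sizes: version file, control archive, data archive.
sizes = [4, 1000, 5000]
payload = sum(sizes)
print("payload:", payload)
print("ar overhead: ", ar_container_size(sizes) - payload)
print("tar overhead:", tar_container_size(sizes) - payload)
```

With only three members, the extra bytes are bounded by a fixed amount that does not grow with the size of the package, so the objection is more about elegance than about large packages getting noticeably larger.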

Jackson suggested that it is possible to support only a strict subset of the tar format, without the problematic features. He noted that it is already the case for the usage of ar as the outer archive format, "to the point that it is awkward to *create* a legal .deb with a normal ar utility". He also brought up his old idea on how to deal with the data.tar.xz size limit: just split it into multiple files and store them in the ar archive as extra members. This proposal has the advantage that it is still compatible with third-party tools and amounts to absolutely no change if the existing package size limit is not hit.
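Jackson's splitting idea can be sketched as a size calculation in Python. The thread did not settle on how extra members would be named, so the naming scheme below (data.tar.xz, data.tar.xz.1, and so on) is purely hypothetical, as are the function names; only the ten-digit limit comes from the ar format.

```python
AR_MAX_MEMBER = 10 ** 10 - 1   # largest size that fits in ten decimal digits


def split_sizes(total_size, limit=AR_MAX_MEMBER):
    """Return the sizes of the chunks needed to store total_size bytes."""
    if total_size < 0 or limit <= 0:
        raise ValueError("sizes must be non-negative and limit positive")
    full_chunks, remainder = divmod(total_size, limit)
    sizes = [limit] * full_chunks
    if remainder or not sizes:
        sizes.append(remainder)
    return sizes


def member_names(count, base="data.tar.xz"):
    """Hypothetical names for the chunks; the first keeps the existing name."""
    return [base] + [f"{base}.{i}" for i in range(1, count)]


# A package under the limit is stored exactly as it is today.
small = split_sizes(1_178_833_172)
print(dict(zip(member_names(len(small)), small)))

# A 25 GB data archive would need three members under this scheme.
large = split_sizes(25_000_000_000)
print(dict(zip(member_names(len(large)), large)))
```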

At this point, the discussion had accumulated quite a large number of conflicting proposals and opinions. Because the issue proved too contentious, Jover retracted his promise to work on changing the format documentation. The thread died off without any conclusions or action items. Still, at this time no official Debian packages come close to the limitations of the current .deb format, so no urgent action is needed. And, if someone needs to unofficially package something really big, they can do it right now, thanks to Borowski's idea about the old format, which is still supported.