Speeding up the Debian installer using eatmydata and dpkg-divert

The Debian installer could be a lot quicker. When we install more than 2000 packages in Skolelinux / Debian Edu using tasksel in the installer, unpacking the binary packages takes forever. Part of the slow I/O issue was discussed in bug #613428 about too much file system syncing done by dpkg, which is the package responsible for unpacking the binary packages. Other parts (like code executed by postinst scripts) might also sync to disk during installation. All this syncing to disk does not really make sense to me. If the machine crashes half-way through, I start over; I do not try to salvage the half-installed system. So the failure that syncing is supposed to protect against, a hardware or system crash, is not really relevant while the installer is running.

A few days ago, I thought of a way to get rid of all the file system sync()-ing in a fairly non-intrusive way, without the need to change the code in several packages. The idea is not new, but I have not heard anyone propose the approach using dpkg-divert before. It depends on the small and clever package eatmydata, which uses LD_PRELOAD to replace the system functions for syncing data to disk with functions doing nothing, thus allowing programs to live dangerously while speeding up disk I/O significantly. Instead of modifying the implementation of dpkg, apt and tasksel (which are the packages responsible for selecting, fetching and installing packages), it occurred to me that we could just divert the programs away and replace them with a simple shell wrapper calling 'eatmydata $program "$@"' to get the same effect. Two days ago I decided to test the idea, and wrapped up a simple implementation for the Debian Edu udeb.
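The wrapper idea can be sketched as a small shell function. This is only a minimal illustration of the technique, not the code used in Debian Edu; the function name write_wrapper and the example paths are assumptions made here.

```shell
# write_wrapper PATH REAL: create an executable wrapper script at PATH that
# runs the program REAL under eatmydata, passing all arguments through
# unchanged. (Hypothetical helper, for illustration only.)
write_wrapper() {
    cat > "$1" <<EOF
#!/bin/sh
exec eatmydata "$2" "\$@"
EOF
    chmod 755 "$1"
}
```

Calling write_wrapper /tmp/dpkg-wrapper /usr/bin/dpkg.distrib would produce a two-line script. Using exec means no extra shell process stays around while the wrapped program runs, and quoting "$@" keeps arguments containing spaces intact.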

The effect was stunning. In my first test it reduced the running time of the pkgsel step (installing tasks) from 64 to less than 44 minutes (more than 20 minutes shaved off the installation) on an old Dell Latitude D505 machine. I am not quite sure what the optimised time would have been, as I messed up the testing a bit, causing the debconf priority to get low enough for two questions to pop up during installation. As soon as I saw the questions I moved the installation along, but I do not know how long the questions were holding up the installation. I did some more measurements using Debian Edu Jessie, and got the results below. The time measured is the difference between the time stamps of the "pkgsel: starting tasksel" and "pkgsel: finishing up" lines in /var/log/syslog, if you want to do the same measurement yourself. In Debian Edu, the tasksel dialog does not show up, so the timing does not depend on how quickly the user handles the tasksel dialog.
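The measurement described above can be automated along these lines. This is a sketch under a few assumptions: the log uses the traditional syslog line format ("Mon DD HH:MM:SS host tag: message"), both marker lines were logged on the same day, and the function name pkgsel_minutes is made up for this example.

```shell
# pkgsel_minutes [LOGFILE]: print the whole minutes elapsed between the
# "pkgsel: starting tasksel" and "pkgsel: finishing up" lines.
# Reads /var/log/syslog unless another file is given.
# Returns non-zero if either marker line is missing.
pkgsel_minutes() {
    awk '
        # Convert an HH:MM:SS time stamp to seconds since midnight.
        function secs(t,   p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        /pkgsel: starting tasksel/ { start = secs($3); have_start = 1 }
        /pkgsel: finishing up/     { stop = secs($3); have_stop = 1 }
        END {
            if (!have_start || !have_stop) exit 1
            printf "%d\n", (stop - start) / 60
        }
    ' "${1:-/var/log/syslog}"
}
```

Run it as pkgsel_minutes with no arguments, or pass the path of a saved copy of the installer log.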

Machine/setup                   Original tasksel       Optimised tasksel       Reduction
Latitude D505 Main+LTSP LXDE    64 min (07:46-08:50)   <44 min (11:27-12:11)   >20 min  >31%
Latitude D505 Roaming LXDE      57 min (08:48-09:45)   34 min (07:43-08:17)    23 min   40%
Latitude D505 Minimal           22 min (10:37-10:59)   11 min (11:16-11:27)    11 min   50%
Thinkpad X200 Minimal           6 min (08:19-08:25)    4 min (08:04-08:08)     2 min    33%
Thinkpad X200 Roaming KDE       19 min (09:21-09:40)   15 min (10:25-10:40)    4 min    21%
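The arithmetic behind the table is simple enough to script. The two helpers below are hypothetical names introduced for this sketch: one turns a pair of HH:MM time stamps into minutes, the other expresses the time saved as a whole percentage of the original run time.

```shell
# minutes_between START END: minutes from START to END, both given as HH:MM
# on the same day.
minutes_between() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        split(s, a, ":"); split(e, b, ":")
        print (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
    }'
}

# reduction_percent ORIGINAL OPTIMISED: time saved as a percentage of
# ORIGINAL, rounded down to a whole number.
reduction_percent() {
    awk -v o="$1" -v n="$2" 'BEGIN { printf "%d\n", (o - n) * 100 / o }'
}
```

For the Roaming LXDE row on the Latitude D505, minutes_between 08:48 09:45 gives 57, and reduction_percent 57 34 gives 40.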

The tests were done using a netinst ISO on a USB stick, so some of the time was spent downloading packages. The connection to the Internet was 100 Mbit/s during testing, so downloading should not be a significant factor in the measurement. Downloads typically took from a few seconds to a few minutes, depending on the number of packages being installed.

The speedup is implemented using two hooks in Debian Installer: the pre-pkgsel.d hook to set up the diverts, and the finish-install.d hook to remove the diverts at the end of the installation. I picked the pre-pkgsel.d hook instead of the post-base-installer.d hook because I test using an ISO without the eatmydata package included, and the post-base-installer.d hook in Debian Edu can only operate on packages included in the ISO. The negative effect of this is that I am unable to activate this optimization for the kernel installation step in d-i. If the code were moved to the post-base-installer.d hook, the speedup for the entire installation would be larger.

I've implemented this in the debian-edu-install git repository, and plan to provide the optimization as part of the Debian Edu installation. If you want to test this yourself, you can create two files in the installer (or in a udeb). One shell script needs to go into /usr/lib/pre-pkgsel.d/, with content like this:

    #!/bin/sh
    set -e
    . /usr/share/debconf/confmodule
    info() {
        logger -t my-pkgsel "info: $*"
    }
    error() {
        logger -t my-pkgsel "error: $*"
    }
    override_install() {
        apt-install eatmydata || true
        if [ -x /target/usr/bin/eatmydata ] ; then
            for bin in dpkg apt-get aptitude tasksel ; do
                file=/usr/bin/$bin
                # Test that the file exists and has not been diverted already.
                if [ -f /target$file ] ; then
                    info "diverting $file using eatmydata"
                    printf "#!/bin/sh\neatmydata $bin.distrib \"\$@\"\n" \
                        > /target$file.edu
                    chmod 755 /target$file.edu
                    in-target dpkg-divert --package debian-edu-config \
                        --rename --quiet --add $file
                    ln -sf ./$bin.edu /target$file
                else
                    error "unable to divert $file, as it is missing."
                fi
            done
        else
            error "unable to find /usr/bin/eatmydata after installing the eatmydata package"
        fi
    }

    override_install

To clean up, another shell script should go into /usr/lib/finish-install.d/ with code like this:

    #! /bin/sh -e
    . /usr/share/debconf/confmodule
    error() {
        logger -t my-finish-install "error: $*"
    }
    remove_install_override() {
        for bin in dpkg apt-get aptitude tasksel ; do
            file=/usr/bin/$bin
            if [ -x /target$file.edu ] ; then
                rm /target$file
                in-target dpkg-divert --package debian-edu-config \
                    --rename --quiet --remove $file
                rm /target$file.edu
            else
                error "Missing divert for $file."
            fi
        done
        sync # Flush file buffers before continuing
    }

    remove_install_override

In Debian Edu, I placed both code fragments in a separate script edu-eatmydata-install and call it from the pre-pkgsel.d and finish-install.d scripts.
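One way a shared script could serve both hooks is to pick the fragment to run from its first argument. The sketch below is purely hypothetical: the argument names and the dispatch function are invented here and are not taken from the real edu-eatmydata-install script, and the two functions are stand-ins for the fragments shown earlier.

```shell
# Hypothetical stand-ins for the two code fragments from the article.
override_install()        { echo "set up diverts"; }
remove_install_override() { echo "remove diverts"; }

# dispatch ACTION: run the fragment matching the hook that called the script.
# The action names are assumptions made for this example.
dispatch() {
    case "$1" in
        pre-pkgsel)     override_install ;;
        finish-install) remove_install_override ;;
        *)
            echo "usage: edu-eatmydata-install pre-pkgsel|finish-install" >&2
            return 1
            ;;
    esac
}
```

With this layout, each hook script shrinks to a single line that calls the shared script with its own hook name as the argument.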

By now you might ask if this change should get into the normal Debian installer too. I suspect it should, but I am not sure the current debian-installer coordinators find it useful enough. It also depends on the side effects of the change. I'm not aware of any, but I guess we will see if the change is safe after some more testing. Perhaps there is some package in Debian depending on sync() and fsync() having an effect? Perhaps it should go into its own udeb, to allow those of us wanting to enable it to do so without affecting everyone.

Update 2014-09-24: Since a few days ago, enabling this optimization will break installation of all programs using gnutls because of bug #702711. An updated eatmydata package in Debian will solve it.

Update 2014-10-17: The bug mentioned above is fixed in testing and the optimization works again. I have also discovered that the dpkg-divert trick is not really needed, and have implemented a slightly simpler approach as part of the debian-edu-install package. See tools/edu-eatmydata-install in the source package.

Update 2014-11-11: Unfortunately, a new bug in eatmydata, #765738, which only triggers on i386, made it into testing and broke this installation optimization again. If unblock request 768893 is accepted, it should be working again.