November 12th, 2009 edited by ana

Article submitted by ERSEK Laszlo.

lbzip2 is a multi-threaded bzip2 compressor/decompressor utility that can be used on its own, in pipelines, or passed to GNU tar with the --use-compress-program option (or with the --use shorthand).

The main motivation for writing lbzip2 was that I didn’t know about any parallel bzip2 decompressor that would exercise multiple cores on a single-stream bz2 file (i.e. the output of a single bzip2 run) and/or on a file read from a non-seekable source (e.g. a pipe or socket). Thus lbzip2 started out as lbunzip2, but with time it gained multiple-workers compression and single-worker decompression features. Due to the input-bound splitter of its multiple-workers decompressor, it should scale well to many cores even when decompressing.

Target audience

Originally, the target audience for lbzip2 was experienced users and system administrators: up to version 0.15, lbzip2 deliberately worked only as a filter. Now at 0.17, lbzip2 is mostly command-line compatible with bzip2, except that it doesn’t remove or overwrite files it didn’t create. If lbzip2 gets a chance to enter the Debian alternatives system as an alternative for bzip2, I’ll add this feature. In any case, you are encouraged always to verify lbzip2’s output manually before removing its input (or instead of removing it automatically), both when compressing and when decompressing. I also recommend perusing the README, installed as /usr/share/doc/lbzip2/README.gz on Debian, before eventually switching over to lbzip2.
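The "verify before removing" advice can be sketched as a small POSIX shell function. This is only an illustration, not something shipped with lbzip2: the name compress_verified is hypothetical, and it relies solely on the filter-mode invocations shown in this article (lbzip2 for compression, lbzip2 -d for decompression).

```shell
# compress_verified FILE (hypothetical helper, not part of lbzip2):
# compress FILE to FILE.bz2 in filter mode, then remove FILE only if
# decompressing FILE.bz2 reproduces the original byte for byte.
compress_verified() {
  src=$1
  dst=$src.bz2

  # Never overwrite an existing archive.
  if [ -e "$dst" ]; then
    echo "$dst already exists, not touching it" >&2
    return 1
  fi

  # Compress; on failure, drop the partial output and keep the input.
  lbzip2 <"$src" >"$dst" || { rm -f -- "$dst"; return 1; }

  # Decompress again and compare the stream with the original file.
  if lbzip2 -d <"$dst" | cmp -s - "$src"; then
    rm -- "$src"
  else
    echo "verification failed, keeping $src" >&2
    return 1
  fi
}
```

Calling compress_verified tree.tar would leave tree.tar.bz2 behind and delete tree.tar only after the round trip matched.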

Usage examples

As lbzip2 was chiefly created for speeding up decompression of single-stream bz2 files and/or for speeding up decompression from a pipe, I’ll provide examples of decompression first. Basically all free software tarballs should be available on the net as tar.bz2 files; I’ll choose (not surprisingly) a kernel tarball.

The “traditional” method:

wget \
  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.31.1.tar.bz2
tar --use=lbzip2 -x -f linux-2.6.31.1.tar.bz2

The overlapped method:

wget -O - \
  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.31.1.tar.bz2 \
| tee -i linux-2.6.31.1.tar.bz2 \
| tar --use=lbzip2 -x

If wget fails to download the tarball for some reason (at which point at least tar will complain), you should remove the partially decompressed tree and fall back to the traditional method. To avoid losing the already downloaded part, pass -c to wget.

Another example might be the import of a Wikimedia Dump file, perhaps with a pipeline like this:

lbzip2 -d < enwiki-latest-pages-articles.xml.bz2 \
| php importDump.php

Finally, a compression/backup example with verification at the end:

tar --format=pax --use=lbzip2 -c -f tree.tar.bz2 tree
tar --use=lbzip2 --compare -f tree.tar.bz2 -v -v

Hypothetically, with lbzip2 as the configured bzip2 alternative, we should be able to replace --use=lbzip2 with the well-known -j GNU tar option.

Comparison with other bzip2 utilities

I posted a longish mail with feature analyses and performance measurements to the debian-mentors mailing list. To reiterate what I said there: fundamentally, lbzip2 was created to fill a performance gap left by pbzip2.

After working on lbzip2 for a while, I found out that p7zip can decompress single-stream bz2 files in parallel, but (the last time I checked) it couldn’t scale above four threads, and it refused to read bz2 files from a pipe.

Bzip2 compression and decompression performance is very sensitive to the cache size that is dedicated to a single worker thread (i.e. a single CPU core). To my limited knowledge, this implies that among commodity desktops, lbzip2 performs best on multi-core AMD processors.

lbzip2 does have shortcomings. They are either inherent in the design or I deem them unimportant. I tried to document them all. Please read the debian-mentors post linked above, the README file, and the manual page.

As said above, I didn’t originally intend lbzip2 as a drop-in replacement for bzip2. Even though it is almost there now, you should nonetheless get to know it thoroughly before deciding to switch over to it.

Availability

Various versions of lbzip2 are available for Debian (squeeze and sid) and Ubuntu (karmic and lucid).

You should be able to install lbzip2 on lenny too; it shouldn’t break anything. I used the following commands:

cat >>/etc/apt/sources.list <<EOT
deb http://security.debian.org/ testing/updates main
deb http://ftp.hu.debian.org/debian/ testing main
EOT
apt-get update
apt-get install lbzip2

Upstream releases are announced on the project’s Freshmeat page. I distribute the upstream version to end-users from my recently moved home page, which also links to other distributions’ lbzip2 packages.

A development library version is very unlikely. You can work around this by communicating with an lbzip2 child process over pipes via select(), and by checking its exit status via waitpid() after receiving EOF. This is not an unusual method; see, for example, gpg’s many --[^-]*-fd options.
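A C program would use pipe(), select() and waitpid() directly. As a rough shell counterpart of the same pattern, the sketch below lets a named pipe stand in for the pipe plumbing, uses a plain blocking read instead of select(), and uses the shell's wait in place of waitpid(). The function name decompress_checked is hypothetical, and mktemp is assumed to be available.

```shell
# decompress_checked FILE (hypothetical helper): stream the decompressed
# contents of FILE to stdout through an lbzip2 child process, and return
# the child's exit status once EOF has been seen on the pipe.
decompress_checked() {
  dir=$(mktemp -d) || return 1
  fifo=$dir/out
  mkfifo "$fifo" || { rm -rf -- "$dir"; return 1; }

  lbzip2 -d <"$1" >"$fifo" &   # child process writes into the pipe
  child=$!

  cat <"$fifo"                 # parent reads until EOF
  wait "$child"                # collect the child's exit status
  status=$?

  rm -rf -- "$dir"
  return "$status"
}
```

A caller would treat the data read so far as valid only if decompress_checked returns zero, just as a C caller would inspect the status reported by waitpid().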

End-user stress-testing

I encourage you to test lbzip2. The upstream README describes the test method in general; let me instantiate that description here specifically for Debian.

Necessary packages, in alphabetical order:

bzip2

dash

gcc

lbzip2

perl

Recommended packages, in alphabetical order:

p7zip-full

pbzip2

Create a test directory (you will need lots of free space under that directory), and under it a well-compressible big file. For example:

mkdir -m 700 -v -- "$TMPDIR"/testdir
tar -c -v -f "$TMPDIR"/testdir/testfile.tar /usr/bin/ /usr/lib/

Then issue the following commands, utilizing the test file created above. As this could take several hours, I suggest entering a screen session first. Your machine should be otherwise unloaded during the test, both IO- and CPU-wise.

cd /usr/share/lbzip2
dash test.sh "$TMPDIR"/testdir/testfile.tar

Any errors encountered during the test should be either handled or fatally rejected. In particular, utilities refusing to decompress from a pipe are handled.

Estimated disk space usage: when writing this article, I executed the above commands with a 100 MB test file. (You should aim for at least 1 GB.) The test directory ended up being 250 MB in size. M stands for 2^20, G stands for 2^30.

Estimated time span: supposing

your machine has N cores (each with a dedicated L2 cache),

the file you use for testing lbzip2 is S GB big, and

bzip2 takes T seconds to compress a 1 GB test file with similar contents,

then the full test should take around

S * (1879 + 2098 * 2 / N) * T / 240

seconds.

Estimated peak memory usage: N * 50 MB should be a very safe bet.
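To plug your own numbers into the two estimates above, a small sketch like the following may help. The function names are hypothetical; the arithmetic is exactly the formula and the N * 50 MB rule given above, evaluated with awk so that the division is not truncated.

```shell
# estimate_seconds N S T: rough duration of the full test run, in seconds.
#   N = number of cores
#   S = size of the test file in GB
#   T = seconds bzip2 needs to compress a similar 1 GB file
estimate_seconds() {
  awk -v n="$1" -v s="$2" -v t="$3" \
    'BEGIN { printf "%.0f\n", s * (1879 + 2098 * 2 / n) * t / 240 }'
}

# estimate_peak_mb N: the "very safe bet" for peak memory usage, in MB.
estimate_peak_mb() {
  echo $(( $1 * 50 ))
}
```

For instance, with four cores, a 1 GB test file and an assumed T of 120 seconds, estimate_seconds 4 1 120 prints 1464 (roughly 24 minutes), and estimate_peak_mb 4 prints 200.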

To view the test report:

less -- "$TMPDIR"/testdir/results/report

The only obscure entries in the table should be the “ws” ones. They mean “workers stalled” and give a percentage of how many times the (de)compressor worker threads tried to start munching a block but had to go to sleep because there was no block to munch. Anything above 1-2% usually implies some bottleneck and shows that lbzip2 couldn’t fully exhaust your cores. This shouldn’t occur, but if it does and lbzip2 and pbzip2 have performed similarly in the compression tests, then the bottleneck is in your system, not lbzip2.

The lbzip2 package has been available since Debian squeeze and Ubuntu karmic.