Peeking into archives

A little insight goes a long way. I often say that I get my best ideas when I'm in the shower. I relax and sometimes my brain makes some pretty neat connections.

One example of this is CPAN::Mini::Webserver, which allows you to search and browse a MiniCPAN. One insight was that the 02packages file in CPAN mirrors contains enough information to be useful for searching. The other was that browsing through distributions didn't actually require the distributions to be unpacked - they could be unpacked on the fly. That led to Archive::Peek.

A few weeks ago I noticed, and was quite impressed with, CPAN grep, a neat website by David Leadbeater which allows you to use a regular expression to search the whole of CPAN. Check out the example searches. It's based on some pretty neat code, including RE2, "a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python". RE2 also has memory limits, which makes it useful for applying user-generated regular expressions.

However, the CPAN grep code is not particularly lightweight. It requires a beefy multicore machine with lots of memory to unpack, index and search CPAN.

I wondered about using Archive::Peek (which uses Archive::Tar and Archive::Zip behind the scenes) to index a local CPAN mirror without unpacking it. I wrote some code and it took 40 minutes to index all distributions by authors whose PAUSE ID starts with A.
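To make "peeking without unpacking" concrete, here is a minimal sketch of the idea. It is not Archive::Peek: it is a Python stand-in built on the standard tarfile module, and the function names (build_sample_archive, list_files, read_file) and the sample Foo-1.0 distribution are made up for illustration. Nothing is written to disk; the archive is listed and read member by member in memory.

```python
import io
import tarfile


def build_sample_archive():
    """Build a tiny gzipped tarball in memory (hypothetical sample data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in [
            ("Foo-1.0/README", "Foo is a module\n"),
            ("Foo-1.0/lib/Foo.pm", "package Foo;\n1;\n"),
        ]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_files(archive):
    """Return the names of the regular files in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def read_file(archive, name):
    """Return the contents of one member without extracting anything to disk."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        handle = tar.extractfile(name)  # raises KeyError if the name is missing
        if handle is None:  # directories and special files have no content
            raise ValueError(name + " is not a regular file")
        return handle.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    archive = build_sample_archive()
    for name in list_files(archive):
        print(name)
    print(read_file(archive, "Foo-1.0/README"), end="")
```

An indexer built this way never needs a scratch directory; the cost is all in how quickly the archive library can decompress and walk each tarball.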

Cue a shower idea: Archive::Peek::External, which uses external tools "tar" and "unzip" to peek into archives. That reduced the time taken to 13 minutes.

While investigating Tarsnap, an online backup service, a few months ago, I had noticed that it used libarchive, a "C library and command-line tools for reading and writing tar, cpio, zip, ISO, and other archive formats".

Cue another shower idea, which involved writing some XS: Archive::Peek::Libarchive, which wraps libarchive. That reduced the time taken to 16 seconds.

Sixteen seconds! That's so fast I wrote search_cpan.pl, which allows you to search a local CPAN mirror for a Perl regular expression while unpacking distributions on the fly. Takes about a minute.
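The search-while-unpacking idea can be sketched roughly as follows. This is not search_cpan.pl: it is a Python illustration using the standard tarfile and re modules, so patterns follow Python's regular expression syntax rather than Perl's, and make_tarball and search_archive are hypothetical names. Each member is decompressed in memory, scanned line by line, and then discarded.

```python
import io
import re
import tarfile


def make_tarball(files):
    """Build an in-memory .tar.gz from a {name: text} dict (sample data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def search_archive(archive, pattern):
    """Return (member name, line number, line) for every matching line."""
    regex = re.compile(pattern)
    hits = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        for member in tar:  # walk the archive in order
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((member.name, lineno, line))
    return hits


if __name__ == "__main__":
    tarball = make_tarball({
        "Foo-1.0/lib/Foo.pm": "package Foo;\nuse strict;\n1;\n",
        "Foo-1.0/README": "Foo docs\n",
    })
    for name, lineno, line in search_archive(tarball, r"^use\s+\w+"):
        print("%s:%d: %s" % (name, lineno, line))
```

Searching a whole mirror would amount to calling something like search_archive on every distribution file in turn, which is why the speed of the underlying archive reader dominates the total time.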

Yay for showers and superior technology!

