deepgrep: grep nested archives with one command / / at 02:00 / / by abe

from the grep-revisited dept.

Several months ago, I wrote about “grep everything” and listed grep-like tools which can grep through compressed files or specific data formats. The blog posting sparked several magazine articles and talks by Frank Hofmann and me.

Frank recently noticed that we had so far missed one more or less mighty tool. We missed it because it’s mostly unknown, undocumented, and hidden behind a package name which doesn’t suggest a real recursive “grep everything”:

deepgrep

deepgrep is part of the Debian package strigi-utils, a package which contains utilities related to the KDE desktop search Strigi.

deepgrep especially eases searching through tarballs, even nested ones, but can also search through zip files and OpenOffice.org/LibreOffice documents (which are actually zip files).

deepgrep seems to support at least the following archive and compression formats:

tar

ar, and hence deb

rpm (but not cpio)

gzip/gz

bzip2/bz2

zip, and hence jar/war and OpenOffice.org/LibreOffice documents

MIME messages (i.e. files attached to e-mails)

A search in an archive which is deeply nested looks like this:

$ deepgrep bar foo.ar
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:foobar
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:bar
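Because each hit is printed as a path:line pair, the output lends itself to post-processing with standard tools. Here is a minimal sketch, assuming that neither the archive paths nor the member names contain a colon; the function name matching_files is made up for this example. It reduces the hits to the list of files that matched, roughly what grep -l would give you:

```shell
# Hypothetical helper: read "path:matching line" records on stdin and
# print each path once. Assumes paths contain no ':' characters.
matching_files() {
    cut -d: -f1 | sort -u
}

# Sample input taken from the deepgrep run above:
printf '%s\n' \
    'foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:foobar' \
    'foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:bar' \
    | matching_files
```

With a real archive, the pipeline would read deepgrep bar foo.ar | matching_files.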

However, deepgrep seems to support neither LZMA-based compression (lzma, xz, lzip, 7z) nor lzop, rzip, compress (.Z suffix), cab, cpio, xar, or rar.

Further current drawbacks of deepgrep:

Hardly any command-line options, especially none of the common grep options

No man page or other documentation

Exit code not related to search results; you have to check the output to see if something has been found
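The exit code issue can be worked around in scripts by testing whether anything was printed. This is a minimal sketch, assuming that deepgrep writes its hits to standard output and nothing else there; the function name deepgrep_status and the variable DEEPGREP_CMD are inventions of this example, not part of strigi-utils:

```shell
# Hypothetical wrapper: return 0 if the search printed anything, 1 otherwise.
# DEEPGREP_CMD is an invention of this sketch, used to swap in another command.
deepgrep_status() {
    # The tool's own exit code is ignored on purpose.
    _dg_output=$("${DEEPGREP_CMD:-deepgrep}" "$@") || true
    [ -n "$_dg_output" ] || return 1
    printf '%s\n' "$_dg_output"
}

# Usage (not executed here):
#   if deepgrep_status bar foo.ar; then echo "found"; else echo "nothing found"; fi
```

Note that the wrapper buffers the complete output before printing it, so on large archives nothing shows up until the search has finished.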

deepfind

If you just need the names of the files in nested archives, the package also contains the tool deepfind, which does nothing but list all files and directories in a given set of archives or directories:

$ deepfind foo.ar
foo.ar
foo.ar/foo.tar
foo.ar/foo.tar/foo.tar.gz
foo.ar/foo.tar/foo.tar.gz/foo.zip
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt

As with deepgrep, deepfind does not implement any of the common options of its normal sister tool find.
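Since deepfind prints one path per line, some of what find would do with -name can be approximated by filtering that list afterwards. This is a minimal sketch, assuming no member name contains a newline; the function name filter_by_name is made up for this example:

```shell
# Hypothetical helper: keep only the paths on stdin whose last component
# matches the given shell glob, similar in spirit to find -name.
filter_by_name() {
    _fbn_pattern=$1
    while IFS= read -r _fbn_path; do
        case "${_fbn_path##*/}" in
            $_fbn_pattern) printf '%s\n' "$_fbn_path" ;;
        esac
    done
}

# Sample input taken from the deepfind run above:
printf '%s\n' \
    'foo.ar/foo.tar/foo.tar.gz' \
    'foo.ar/foo.tar/foo.tar.gz/foo.zip' \
    'foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz' \
    | filter_by_name '*.gz'
```

With a real archive, the pipeline would read deepfind foo.ar | filter_by_name '*.gz'. The filter only looks at the last path component, so member names with spaces in them pass through unharmed.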

[The following part has been added on 17-Nov-2012]

As with deepgrep, it also doesn’t seem to support any of the more modern or more exotic compression formats, e.g. it fails on modern Debian binary packages which use xz compression for the data part:

$ deepfind xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/debian-binary
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/triggers
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/preinst
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/md5sums
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/postinst
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/control
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/data.tar.xz

[End of part added on 17-Nov-2012]
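For such packages, dpkg itself can step in where deepfind stops. This is a minimal sketch, assuming dpkg-deb and tar are installed; the function name deb_data_files is made up here. It relies on dpkg-deb --fsys-tarfile, which writes the data part as an uncompressed tar stream regardless of how it was compressed inside the package:

```shell
# Hypothetical helper: list the files in the data part of a .deb.
# dpkg-deb handles the decompression (gz, xz, ...) itself.
deb_data_files() {
    dpkg-deb --fsys-tarfile "$1" | tar -tf -
}

# Usage (not executed here):
#   deb_data_files some-package.deb
```

Unlike deepfind, this prints plain member paths without the package name as prefix, and it does not descend into archives nested inside the data part.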

Dependencies

The package strigi-utils doesn’t pull in the complete Strigi framework (i.e. no daemon), just a few libraries (libstreams, libstreamanalyzer, and libclucene). On Wheezy it also pulls in some audio/video decoding libraries which may make some server administrators less happy.

Conclusion

Both tools are limited to a few basic use cases, but can be worth a fortune if you have to work with nested archives. Nevertheless, the claim in the Debian package description of strigi-utils that they’re “enhanced” versions of their well-known counterparts is IMHO exaggerated.

Most of the missing features and documentation can be explained by the primary purpose of these tools: being a backend for desktop searches. I guess there wasn’t much need for proper command-line usage yet. Until now. ;-)

42.zip

And yes, I was curious enough to let deepfind have a look at 42.zip (the one from SecurityFocus; unzip seems unable to unpack the 42.zip from unforgettable.dk due to a missing version compatibility). Since it just traverses the archive sequentially, it has no problem with that, needing just about 5 MB of RAM and a lot of time:

[…]
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page e.zip
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page e.zip/0.dll
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page f.zip
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page f.zip/0.dll
deepfind 42.zip 11644.12s user 303.89s system 97% cpu 3:24:02.46 total