Every so often something happens in my work that makes me think, even if I don't know what conclusions to really draw from it. I recently mentioned that we'd found a bug in GNU Tar, and the story of how that happened is one of those times.

We back up our fileservers through Amanda and GNU Tar. For a long time, we've had a problem where every so often, fortunately quite rarely, tar would freak out while backing up the filesystem that held /var/mail, producing huge amounts of output. Most of the time this would go on forever and we'd eventually have to kill the dump; other times it would finish on its own, having produced terabyte(s) of output that fortunately seemed to compress very well. At one point we captured such a giant tar file and I inspected it, which revealed that the runaway area was a giant sea of null bytes that 'tar -t' didn't like, although after a while things returned to normal.

(This led to me wondering if null bytes were naturally occurring in people's inboxes. It turns out that hunting for null bytes in text files is not quite as easy as you'd like, and yes, people's inboxes have some.)
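
One way to sidestep text-oriented tools for this sort of hunt is to read the file as raw bytes and count the zero bytes directly. Here is a minimal sketch in Python; the function names and the sample mailbox contents are made up for illustration, and for a really big mail spool you would probably want something faster.

```python
import os
import tempfile

def count_null_bytes(path, chunk_size=1 << 20):
    """Return the number of NUL (0x00) bytes in the file at path."""
    total = 0
    # Binary mode, so nothing is decoded, translated, or silently dropped.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # b"" means end of file
                break
            total += chunk.count(b"\x00")
    return total

def make_sample_mailbox():
    """Write a small, made-up mbox-style file that contains three NUL bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"From someone@example.org\n\nhello\x00\x00 world\x00\n")
    return path

sample = make_sample_mailbox()
print(count_null_bytes(sample))
os.remove(sample)
```

Reading in fixed-size chunks keeps memory use flat even for very large inboxes, and counting per chunk is safe because a single byte can't straddle a chunk boundary.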

We recently moved the filesystem with /var/mail to our new Linux fileservers, which are on Ubuntu 18.04 and so have a more recent and more mainline version of GNU Tar than our OmniOS machines. We hoped that this would solve our GNU Tar issues, but then we almost immediately had one of these runaway tar incidents occur. This time around, with GNU Tar running on an Ubuntu machine where I felt fully familiar with all of the debugging tools available, I did some inspection of the running tar process. This inspection revealed that tar was issuing an endless stream of read()s that were all returning 0 bytes:

  read(6, "", 512) = 0
  read(6, "", 512) = 0
  [...]
  read(6, "", 512) = 0
  write(1, "\0\0\0\0\0"..., 10240) = 10240
  read(6, "", 512) = 0
  [...]

lsof said that file descriptor 6 was someone's mailbox.

Using 'apt-get source tar', I fetched the source code to Ubuntu's version of GNU Tar and went rummaging around through it for read() system calls that didn't check for end of file. Once I decoded some levels of indirection, there turned out to be one obvious place that seemed to skip the check, in the sparse_dump_region function in sparse.c. A little light went on in my head.

A few months ago, we ran into an NFS problem with Alpine. While working on that bug, I strace'd an Alpine process and noticed, among other things, that it was using ftruncate() to change the size of mailboxes; sometimes it extended them, temporarily creating a sparse section of the file until it filled it in, and perhaps sometimes it shrank them too. This seemed to match what I'd spotted; sparseness was involved, and shrinking a file with ftruncate() would create a situation where tar hit end of file before it was expecting to.

(This even provides an explanation for why tar sometimes recovered; if something later delivered more mail to the mailbox, taking it back to or above the size tar expected, tar would stop getting this unexpected end of file.)
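
The underlying mechanism is easy to see in isolation. The following is a small sketch in Python, not anything taken from tar or Alpine; the sizes, offsets, and the function name are arbitrary choices for the demonstration. It notes a file's size, shrinks the file with ftruncate(), reads from beyond the new end, and then grows the file back.

```python
import os
import tempfile

def read_after_shrink():
    """Shrink a file behind a reader's back, then grow it again.

    Returns (bytes read while shrunk, bytes read after it grew back).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        # The size a reader would note up front: 4096 bytes.
        expected_size = os.fstat(fd).st_size

        os.ftruncate(fd, 1024)            # something shrinks the file
        os.lseek(fd, 2048, os.SEEK_SET)   # the reader is partway through the old size
        while_shrunk = os.read(fd, 512)   # past end of file: read() returns 0 bytes

        # Later "mail delivery" appends data, taking the file back to 4096 bytes.
        os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, b"y" * (expected_size - 1024))
        os.lseek(fd, 2048, os.SEEK_SET)
        after_regrow = os.read(fd, 512)   # the same offset has data again
        return while_shrunk, after_regrow
    finally:
        os.close(fd)
        os.remove(path)

shrunk, regrown = read_after_shrink()
print(len(shrunk), len(regrown))
```

A read() at or past the end of a regular file is not an error; it just returns zero bytes, which is why a reader that only checks for errors never notices that anything is wrong.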

I did some poking around in GDB, using Ubuntu's debugging symbols and the tar package source code I'd fetched, and I can now reproduce the bug, although it's somewhat different from my initial theory. It turns out that sparse_dump_region is not dumping sparse regions of a file, it's dumping non-sparse ones (of course), and it's used on all files (sparse or not) if you run tar with the --sparse argument. So the actual bug is that if you run GNU Tar with --sparse and a file shrinks while tar is reading it, tar fails to properly handle the resulting earlier than expected end of file. If the file grows again, tar recovers.

(Except if a file that is sparse at the end shrinks purely in that sparse section. In that case you're okay.)
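
To make the failure mode concrete, here is a loose model of this kind of read loop in Python. This is not GNU Tar's code and it is not the upstream fix; the function names, the 512-byte block size (taken from the read() calls in the strace output), and the max_blocks safety cap are my assumptions for the sketch. The naive version counts down the bytes it expects to copy and never checks for a zero-byte read, so once the file is shorter than expected it writes zero-filled blocks without making progress. The checked version treats an unexpected end of file as an error.

```python
import io

BLOCK = 512  # block size seen in the strace output

def dump_region_naive(src, bytes_left, out, max_blocks=None):
    """Copy bytes_left bytes from src to out as zero-padded blocks.

    Never checks for a 0-byte read. max_blocks exists only so that this
    demonstration terminates. Returns the number of blocks written.
    """
    blocks = 0
    while bytes_left > 0:
        if max_blocks is not None and blocks >= max_blocks:
            break
        data = src.read(min(BLOCK, bytes_left))
        out.write(data.ljust(BLOCK, b"\0"))  # a short or empty read is padded with NULs
        bytes_left -= len(data)              # no progress when len(data) == 0
        blocks += 1
    return blocks

def dump_region_checked(src, bytes_left, out):
    """Same loop, but an early end of file is reported instead of padded over."""
    blocks = 0
    while bytes_left > 0:
        data = src.read(min(BLOCK, bytes_left))
        if not data:
            raise EOFError(f"file shrank by {bytes_left} bytes")
        out.write(data.ljust(BLOCK, b"\0"))
        bytes_left -= len(data)
        blocks += 1
    return blocks

# A "file" that was 4096 bytes when its size was recorded but is now 1024 bytes.
out = io.BytesIO()
print(dump_region_naive(io.BytesIO(b"m" * 1024), 4096, out, max_blocks=100))
```

Without the cap, the naive loop would keep writing null blocks for as long as the file stayed short, which is at least consistent with the sea of null bytes in the captured tar file. If the source grows back to the expected size, the reads start returning data again and the loop finishes normally.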

What is interesting to me about this is that there's nothing here I could not have done years ago on our OmniOS fileservers, in theory. OmniOS has ways of tracing a program's system call activity, and it has general equivalents of lsof, and I could have probably found and looked at the source code for its version of GNU Tar and run it under some OmniOS debugger (although we don't seem to have any version of GDB installed), and so on. But I didn't. Instead we shrugged a bit and moved on. It took moving this filesystem to an Ubuntu based environment to get me to dig into the issue.

(It wasn't just an issue of tools and environment, either; part of it was that we automatically assumed that the OmniOS version of GNU Tar was some old unsupported version that there was no reason to look at, because surely the issue was fixed in a newer one.)

PS: Our short term fix is likely to be to tell Amanda to run GNU Tar without --sparse when backing up this filesystem. Mailboxes shouldn't be sparse, and if they are we're compressing this filesystem's backups anyway so all those null bytes will compress really well.

PPS: I haven't tried to report this as a bug to the GNU Tar people because I only confirmed it Friday and the university is now on its winter break. Interested parties should feel free to beat me to it.

Update: The bug has been reported to the GNU Tar people and is now fixed in commit c15c42c.