POSIX close(2) is broken

In the world of POSIX, everything is a file. Well, sort of. There's sockets and pipes, which behave rather like files except that you can't seek on them and they have some extra metadata. And there's devices, where sometimes you can only read and write appropriately-sized blocks, not individual bytes. And then there's terminals, which are all sorts of weird. But in all these cases, you've got a file descriptor, and when you're finished you release the resource by calling the close(2) system call.

There's just one small problem: The way POSIX has defined close(2) is completely and utterly broken.

A few days ago, Taylor R Campbell — the same guy who reported Tarsnap's incredibly stupid crypto bug back in January 2011 — sent me an email pointing out some peculiar language in the standard:

If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to [EINTR]...

Now, we're used to dealing with EINTR: Almost every system call we make can be interrupted and return EINTR. No problem: We just reissue the system call — keep trying until we get through without a signal interrupting us. That would be fine, except for these last few words:

... and the state of fildes is unspecified.

If close(2) returns EINTR and you call it again, you might get an EBADF ("this is not an open file descriptor") error back. Even worse, if you are running in a threaded process, a different thread might have opened a file and been assigned the same descriptor value, at which point your second close(2) call can succeed... at closing the wrong file. Throw in another file being opened and you've now got silent data corruption. On the other hand, if close(2) returns EINTR and you don't call it again, you might be leaking an open socket. For a short-lived process this might not matter, but it's certainly not something you want to do in a long-lived server.
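To make the descriptor-reuse hazard concrete, here is a minimal single-threaded sketch. The helper names fd_is_reused_after_close and errno_of_double_close are hypothetical (they are not from the standard or from this post), and the sketch assumes /dev/null can be opened. It relies only on the POSIX rule that open(2) returns the lowest-numbered unused descriptor, which is why a just-closed number comes straight back.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: returns 1 if the descriptor number freed by
 * close(2) is handed out again by the very next open(2), 0 if not, and
 * -1 if the demo itself could not run.
 */
int
fd_is_reused_after_close(void)
{
	int first, second, reused;

	if ((first = open("/dev/null", O_RDONLY)) == -1)
		return (-1);
	if (close(first) == -1)
		return (-1);

	/* open(2) returns the lowest unused descriptor: the one just freed. */
	if ((second = open("/dev/null", O_RDONLY)) == -1)
		return (-1);
	reused = (second == first);

	/* A blind "retry" of close(first) here would have closed `second`. */
	close(second);
	return (reused);
}

/*
 * Hypothetical demo helper: returns the errno produced by closing a
 * descriptor a second time (0 if the second close unexpectedly succeeds,
 * -1 if the demo itself could not run).
 */
int
errno_of_double_close(void)
{
	int fd;

	if ((fd = open("/dev/null", O_RDONLY)) == -1)
		return (-1);
	if (close(fd) == -1)
		return (-1);
	if (close(fd) == 0)
		return (0);
	return (errno);
}
```

In a multi-threaded process the second open(2) would come from another thread, which is exactly the case where retrying close(2) after EINTR can close somebody else's file.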

After several days, I see a couple of imperfect solutions. The first one is a refinement of the standard EINTR loop:

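    #include <errno.h>
    #include <unistd.h>

    int
    threadunsafe_close(int fd)
    {
    	if (close(fd) == 0)
    		return (0);
    	if (errno != EINTR)
    		return (-1);

    	/* Interrupted: retry until close(2) stops reporting EINTR. */
    	while (close(fd)) {
    		/* EBADF: the interrupted call had already closed fd. */
    		if (errno == EBADF)
    			return (0);
    		if (errno != EINTR)
    			return (-1);
    	}
    	return (0);
    }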

Obviously that first solution, while working fine on single-threaded processes, is not safe with multiple threads due to the potential race against open(2) or some other system call reallocating the descriptor. A second option avoids the problem by preventing EINTR:

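    #include <errno.h>
    #include <signal.h>
    #include <unistd.h>

    int
    blocksignals_close(int fd)
    {
    	sigset_t set;
    	sigset_t oset;
    	int rc;
    	int errno_saved;

    	/* Block all signals so that close(2) cannot be interrupted. */
    	if (sigfillset(&set))
    		return (-1);
    	if (pthread_sigmask(SIG_SETMASK, &set, &oset))
    		return (-1);

    	if ((rc = close(fd)) != 0)
    		errno_saved = errno;

    	/* Restore the old mask, then the errno from close(2). */
    	if (pthread_sigmask(SIG_SETMASK, &oset, NULL))
    		return (-1);
    	if (rc)
    		errno = errno_saved;
    	return (rc);
    }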

Unfortunately this second solution is also less than ideal. It can fail in several different ways, at least theoretically: POSIX doesn't define any errors which sigfillset or pthread_sigmask could return here, but implementations are allowed to invent other reasons to fail. That makes deciphering errno harder if -1 is returned (pthread_sigmask, in particular, reports failure through its return value rather than through errno). More importantly, by blocking signals while calling close(2), it stops us from hitting ^C to interrupt the process if close(2) blocks for a long time (e.g., if it is causing data to be flushed to an NFS-mounted filesystem over a failing network).

What makes this problem particularly annoying is that there is no need for POSIX to have this ambiguity. Many operating systems solve this problem by simply not allowing close(2) to return EINTR; this is always going to be possible by the simple tactic of having close(2) mark the descriptor as deceased and then garbage collecting asynchronously, if nothing else. But even if close(2) can fail with EINTR, there's no reason to leave the descriptor state ambiguous: Immediately before returning to userland, the kernel simply needs to look at the descriptor and ask itself "is this descriptor still open?", and then return EINTR if it is or success if it is not.
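For code that is willing to assume more than POSIX guarantees, there is a much simpler wrapper. The sketch below uses a hypothetical name, close_once, and ASSUMES that the kernel always releases the descriptor, even when close(2) reports EINTR. Linux documents this behaviour in its close(2) manual page, but the standard does not promise it, so on a platform where the assumption is false this wrapper leaks the descriptor.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: call close(2) exactly once and never retry.
 *
 * ASSUMPTION (not guaranteed by POSIX): the kernel releases the
 * descriptor even if close(2) fails with EINTR, so EINTR is treated as
 * "closed". Returns 0 if the descriptor is believed closed, or -1 with
 * errno set for any other error.
 */
int
close_once(int fd)
{
	if (close(fd) == 0)
		return (0);

	/* Assumed already released: retrying could close a reused fd. */
	if (errno == EINTR)
		return (0);

	return (-1);
}
```

Because it never calls close(2) twice on the same number, this wrapper cannot close another thread's file; what it gives up is portability to systems where an interrupted close(2) leaves the descriptor open.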

I hope that a future revision of the standard fixes this. In the meantime, anyone wanting to safely close a file descriptor without assuming more than the standard specifies has a lot of work to do.
