Sorry for the long posting hiatus. I have enough readers now that it’s hard to write the usual random nonsense.

I was recently reminded of an old problem using CVS over SSH, which was an interesting example of various different instances of reasonable behaviour adding up to a bug. It’s possible that this bug has been fixed, but I’ll assume that it hasn’t. The bug is that “cvs diff 2>&1 | less” will sometimes drop data, leaving you looking at an incomplete diff.

When CVS invokes SSH, it sets up file descriptors 0 (standard input) and 1 (standard output), but not 2 (standard error). Thus SSH inherits file descriptor 2 from CVS. This means that any SSH errors get reported to the standard error passed to CVS, which is what you want to have happen. However, when using 2>&1, this means that SSH’s file descriptor 2 will be the same as CVS’s file descriptors 1 and 2.

SSH puts its file descriptors 0, 1, and 2 into nonblocking mode, so that it can use select to send data back and forth without blocking. This means that SSH puts CVS’s file descriptor 2 into nonblocking mode. When using 2>&1, file descriptors 1 and 2 are the same, so this puts CVS’s file descriptor 1 into nonblocking mode.

CVS uses stdio to output data to standard output. When writing to a pipe, the buffer can fill up. CVS naturally never puts the descriptor into nonblocking mode, but when SSH has done it indirectly, and the buffer fills, the data being written will be discarded. This is what causes the bug: the discarded data is never seen by the user.

So what’s the fix? It’s reasonable for SSH to put its descriptors into nonblocking mode. It’s reasonable for CVS to pass its file descriptor 2 to SSH. It’s reasonable for CVS to use stdio to output data. It’s reasonable for stdio to not specially handle a nonblocking file descriptor—any program which wants to use a nonblocking descriptor needs to handle I/O retries itself. It’s reasonable for 2>&1 to mean that file descriptors 1 and 2 refer to the same underlying pipe. It’s reasonable for the user to use 2>&1 when piping cvs diff output to less.

I think the only remaining link in the sequence leading to the bug is that when SSH sets its file descriptor 2 to nonblocking mode, this affects the file in CVS. This is a consequence of the Unix file model, in which file descriptors refer to underlying files. A file descriptor has only one flag: whether it is closed if the exec system cal is run. All the other information is attached to the underlying file. Using 2>&1 means that two file descriptors point to the same file. Forking and execing SSH does not change this–in fact, it adds two more file descriptors, in the SSH process, which point to the same file. Any change in the flags associated with that file is seen by all the associated file descriptors.

This separation of file descriptor and file is what makes 2>&1 work. It’s also what makes >> work; >> opens a file in append mode, and the append flag is inherited by other processes which refer to that file. In any case, what really counts here is not the exec, but the fork; forking a process should not change the flags associated with a file. Further, I’m sure there programs which depend on the fact that changing the flags on a file after a fork affects the file as seen by the parent process.

It’s possible to imagine that file descriptors point to a new shared structure which then points to the underlying file. The file position and some flags would stay with the underlying flie. The new shared structure would just hold some flags which need not always be shared: O_NONBLOCK, O_ASYNC, etc. Calling fork would not create a new shared structure, but calling exec would, copying the existing structure. That would let some flags not be copied across exec process boundaries.

However, that would be a significant change to the Unix file model, a model which has lasted for decades and is not clearly broken. Absent that change, we are left with a complex bug, in which all choices are reasonable.

The workaround for the problem is to invoke SSH with 2>/dev/null, and assume that SSH never writes anything useful to standard error. The 2>/dev/null disassociates SSH file descriptor 2 from CVS file descriptor 2, so CVS file descriptor 1 is not accidentally set into nonblocking mode.