Two new ways to read a file quickly


System calls on Linux are relatively cheap, though the mitigations for speculative-execution vulnerabilities have made them more expensive than they once were. But even cheap system calls add up if one has to make a large number of them. Thus, developers have been working on ways to avoid system calls for a long time. Currently under discussion is a pair of ways to reduce the number of system calls required to read a file's contents, one of which is rather simpler than the other.

readfile()

LWN recently looked at the proposed fsinfo() system call, which is intended to return information about mounted filesystems to an application. One branch of the discussion delved into whether that information could be exported via sysfs instead; one concern that was expressed about that approach is that the cost of reading a lot of little files would be too high. Miklos Szeredi argued that it would not be, but suggested that, if people were concerned, they could reduce this cost by introducing a new system call to read the contents of a file:

ssize_t readfile(int dfd, const char *path, char *buf, size_t bufsize, int flags);

The dfd and path arguments would identify a file in the usual way. A successful readfile() would read the contents of the indicated file into buf, up to a maximum of bufsize bytes, returning the number of bytes actually read. On its face, readfile() adds nothing new; an application can obtain the same result with calls to openat(), read(), and close(). But it reduces the number of system calls required from three to one, and that turns out to be interesting to some users.

In particular, Karel Zak, the maintainer of the util-linux project, offered "many many beers" for the implementation of readfile(). Many of the utilities in util-linux (tools like lsblk and lscpu, for example) spend a lot of time reading information from small /proc and sysfs files; having a readfile() call would make them quite a bit more efficient.

People who complain that it's hard to get kernel developers to pay attention to their problems have clearly missed an important technique; Greg Kroah-Hartman quickly responded with enthusiasm: "Unlimited beers for a 21-line kernel patch? Sign me up!". He provided a first implementation, and went on to say that this system call might actually make sense to have. Naturally, the patch grew past 21 lines once all of the details that needed to be taken into account had been dealt with, and there is still a manual page to write. But it seems likely that there will be a submission of readfile() in the near future.

Of course, some people are already talking about the need for a writefile() as well.

readfile() on the ring

As the conversation progressed, Jann Horn pointed out that the developers working on io_uring have also expressed interest in adding a readfile()-like capability. The whole point of io_uring is to be able to perform system-call actions asynchronously and without having to actually call into the kernel, so it might seem like a good fit for this use case. He did note that truly supporting that feature in io_uring is "a bit complicated", since there is no way to propagate a file descriptor returned by openat() to a subsequent read() operation queued in the ring. Without that, the read() cannot be queued until after the file has been opened, defeating the purpose of the exercise.

The fact of the matter, though, is that "a bit complicated" is a good description of io_uring in general. It seems unlikely that the author of a tool like ps is going to want to go through all of the effort needed to set up an io_uring instance, map it into the address space, queue some operations, and start things running just to avoid some system calls when reading /proc. But the developers of other, more complex applications would, it seems, like to have this sort of capability.

In short order, perhaps in the hope of tapping into that "unlimited beer" stream, io_uring maintainer Jens Axboe posted a patch that fills in the missing piece. It works by remembering the file descriptor returned by the last openat() call in a given chain of operations. To implement a readfile(), an application could set up an io_uring chain with three operations, corresponding to the openat(), read(), and close() calls. For the latter two, though, the usual file-descriptor argument would be provided as the special constant IOSQE_FD_LAST_OPEN, which would be replaced by the descriptor for the last opened file when the operation executes.

This approach works, at the cost of complicating the interface and implementation with the magic file-descriptor substitution. Josh Triplett had a different idea, which he first posted in an LWN comment in January: let applications specify which file descriptor they would like to use when opening a file. He filled out that idea in March with a patch series adding a new O_SPECIFIC_FD flag to the openat2() system call. This feature is available independently of io_uring; if an application really wants to open a file on descriptor 42 and no other, the new flag makes that possible.
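For comparison, the closest an application can get to "open this file on descriptor 42" with existing interfaces is roughly the following sketch. It does not use the proposed O_SPECIFIC_FD flag; the helper name open_on_fd() is hypothetical, and the behavior is only an approximation, since dup2() silently closes whatever was already on the target descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path and make it available as wanted_fd,
 * using only long-standing calls. Returns wanted_fd, or -1 with
 * errno set.
 */
int open_on_fd(const char *path, int flags, int wanted_fd)
{
    assert(path != NULL && wanted_fd >= 0);

    int fd = open(path, flags);        /* kernel picks the lowest free fd */
    if (fd < 0)
        return -1;
    if (fd == wanted_fd)
        return fd;                     /* already where we want it */

    /* Move it; note that dup2() replaces anything already on wanted_fd. */
    if (dup2(fd, wanted_fd) < 0) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
    close(fd);                         /* drop the temporary descriptor */
    return wanted_fd;
}
```

This takes up to three system calls, and the descriptor number is only known after the first one completes. That ordering is the problem for io_uring: a single open that lands on a caller-chosen descriptor lets the subsequent read and close be queued in advance.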

The patch set also adds a new prctl() operation to set the minimum file descriptor to use when the application has not requested a specific one. This minimum defaults to zero, preserving the "lowest available descriptor" semantics that Unix has guaranteed forever. A developer wanting to control the file descriptors used could raise this minimum and know that the kernel would not use any of the descriptors below that minimum without an explicit request.

It only took Axboe about three hours to come up with a new patch series integrating this work. It mostly consists of delaying the checks of file-descriptor validity so that they don't happen ahead of the call that actually makes a given descriptor valid. There seems to be a general agreement that this approach makes more sense than magic file-descriptor substitution, so this is the version that seems likely to go ahead.

At this point, though, this work has only circulated on the io_uring list, which has a relatively limited readership. Axboe has said that he plans to post it more generally in the near future, and that merging for 5.7 is within the realm of possibility. So it may well be that there will soon be two different ways for an application to read the contents of a file with a minimum of system calls — and Karel Zak may end up buying a lot of beer.

