Improving Linux performance by preserving Buffer Cache State

The file system cache (buffer cache) helps programs get to their data faster by keeping recently used file blocks in memory. Copying a large file tree has a devastating effect on the cache: all the copied data ends up in the cache too, forcing other data blocks out. This is very bad for system performance, because all the other processes that had their data blocks in the cache before the copying started suddenly have to read that data from disk again. Using posix_fadvise you can hint to the OS that it should drop certain file blocks from the cache. Together with information from mincore, which tells us which blocks are currently cached, we can alter applications to work without disturbing the buffer cache. This article shows how this works, using rsync as an example.

The posix_fadvise function

The posix_fadvise function allows you to give the OS advice regarding your expected use of the data associated with an open file handle. The calling convention looks like this:

#include <fcntl.h>

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
int posix_fadvise64(int fd, off64_t offset, off64_t len, int advice);

The offset gives the start of the area you are giving advice on. The len is the length of the area. If len is zero, all bytes from offset to the end of the file are affected by the call. The advice parameter specifies the type of advice.

The advice we are interested in here is called POSIX_FADV_DONTNEED. It tells the OS that we will not need the specified bytes again. As a result, the bytes are released from the file system cache. The following mini program tells the OS to release all data associated with a particular file from the cache.

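#define _XOPEN_SOURCE 600
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;
    fdatasync(fd);  /* flush dirty pages so they can be released */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  /* len 0: up to end of file */
    close(fd);
    return 0;
}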

As you can see, we call fdatasync right before calling posix_fadvise. This makes sure that all data associated with the file handle has been committed to disk. It is not done because there is any danger of losing data, but because it makes sure that the posix_fadvise call has an effect. Since posix_fadvise is advisory, the OS will simply ignore it if it cannot comply. At least on Linux, the effect of calling posix_fadvise(fd,0,0,POSIX_FADV_DONTNEED) is immediate. This means that if you write a file and call posix_fadvise right after writing a chunk of data, it will probably have no effect at all, since the data in question has not been committed to disk yet and therefore cannot be released from the cache.

As of this writing (2.6.21) Linux does not remember POSIX_FADV_DONTNEED advice for an open file. It acts when the advice is given, and when it can not comply it forgets the advice. So it is up to you to make sure Linux can comply.
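
The ordering constraint described above (commit first, advise second) can be sketched as a small helper. This is only an illustration, not code from rsync or from any library: the name write_chunk_and_drop is hypothetical, and the sketch assumes a regular file opened for writing.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one chunk at the given offset, commit it to
 * disk, then advise the kernel to drop exactly that byte range.
 * Returns 0 on success, -1 if writing or syncing failed.
 */
int write_chunk_and_drop(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;
    size_t done = 0;

    /* A len of 0 would mean "to end of file" for posix_fadvise. */
    if (len == 0)
        return 0;

    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, offset + (off_t)done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }

    /* Dirty pages cannot be released, so commit them first. */
    if (fdatasync(fd) != 0)
        return -1;

    /* Advisory only: the kernel is free to keep the pages anyway. */
    (void)posix_fadvise(fd, offset, (off_t)len, POSIX_FADV_DONTNEED);
    return 0;
}
```

Calling such a helper once per chunk keeps a long sequential write from piling up in the cache, at the price of one fdatasync per chunk. Larger chunks reduce that overhead.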

The mincore function

Being able to tell the OS to drop a file from the cache is nice, but if the file was already cached before our program touched it, we should not drop it, since it was cached for a reason: most likely some other application is using it.

Since the whole point of the exercise is to NOT disturb the file system cache, we need a way to figure out which blocks of a file are present in the cache before we touch the file.

The mincore function tells us just this. Its usage is a bit complicated, since it works on memory in general and not only on files.

int mincore(void *memory_pointer, size_t file_length, unsigned char *vec);

The first step is to memory-map the file. Then call mincore on the memory pointer to get information about the cache state of each block in the file. Here is a small example program that lists which blocks of a file are in the cache.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
    int fd;
    struct stat file_stat;
    void *file_mmap;
    unsigned char *mincore_vec;
    size_t page_size = getpagesize();
    size_t page_count;
    size_t page_index;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &file_stat) != 0)
        return 1;
    /* one vector entry per page, rounding the last partial page up */
    page_count = (file_stat.st_size + page_size - 1) / page_size;
    file_mmap = mmap((void *)0, file_stat.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (file_mmap == MAP_FAILED)
        return 1;
    mincore_vec = calloc(1, page_count);
    if (mincore_vec == NULL)
        return 1;
    mincore(file_mmap, file_stat.st_size, mincore_vec);
    printf("Cached Blocks of %s: ", argv[1]);
    for (page_index = 0; page_index < page_count; page_index++) {
        /* bit 0 of each entry is set when the page is resident */
        if (mincore_vec[page_index] & 1) {
            printf("%lu ", (unsigned long)page_index);
        }
    }
    printf("\n");
    free(mincore_vec);
    munmap(file_mmap, file_stat.st_size);
    close(fd);
    return 0;
}

Teaching rsync to use posix_fadvise and mincore

I use rsync with its hard-link feature for snapshot-like backups. In that context it is very bad when the backup process evicts data from the file system cache, since this reduces the performance of the other programs accessing the file system. Given the information from the previous sections, it was quite simple to implement a patch for rsync that drops data from the cache after each read or write operation. The resulting version of rsync has virtually no impact on the file system cache contents.

Calling fdatasync before closing a file, as in the example above, is quite expensive, especially when dealing with small files. Therefore the patch introduces a file-handle cache where the files only get synced after some time has passed. This gives the kernel a chance to write data to disk at its own pace and thus reduces the performance hit we take from syncing.
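
To make the idea of deferring the sync concrete, here is a minimal sketch of such a file-handle queue. It is not taken from the rsync patch: struct pending_fds, pending_push, pending_flush and the depth PENDING_MAX are all hypothetical names, and the sketch assumes that "some time has passed" can be approximated by "a fixed number of other files have been handled since".

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define PENDING_MAX 8   /* arbitrary queue depth, chosen for illustration */

/* Hypothetical FIFO of descriptors whose sync and cache drop are deferred. */
struct pending_fds {
    int fd[PENDING_MAX];
    int head;   /* index of the oldest entry */
    int count;  /* number of queued descriptors */
};

/* Sync, drop from the cache, and close a single descriptor. */
static void retire_fd(int fd)
{
    fdatasync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
}

/* Queue fd instead of closing it; retire the oldest entry when full. */
void pending_push(struct pending_fds *q, int fd)
{
    if (q->count == PENDING_MAX) {
        retire_fd(q->fd[q->head]);
        q->head = (q->head + 1) % PENDING_MAX;
        q->count--;
    }
    q->fd[(q->head + q->count) % PENDING_MAX] = fd;
    q->count++;
}

/* Retire everything that is still queued, oldest first. */
void pending_flush(struct pending_fds *q)
{
    while (q->count > 0) {
        retire_fd(q->fd[q->head]);
        q->head = (q->head + 1) % PENDING_MAX;
        q->count--;
    }
}
```

A caller would start with a zeroed queue (struct pending_fds q = { {0}, 0, 0 };), call pending_push wherever it would otherwise close a file, and call pending_flush once before exiting. By the time a descriptor reaches the front of the queue, the kernel has probably written most of its data already, so the fdatasync tends to be cheap.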

The goal of this patch is for rsync to disturb the file system cache as little as possible. Actively dropping data from the cache when it is no longer used helps, but it can also be counterproductive if the data had been in the cache before rsync even ran. In that case the data should not be touched.

So before rsync reads anything from a file it asks the kernel which pages of the file it already has in the cache. It will then only drop the pages that had not been in the cache before.
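
The "remember first, drop selectively later" approach can be sketched with the two calls introduced earlier. The code below is an illustration under stated assumptions, not the code of the rsync patch: snapshot_residency and drop_new_pages are hypothetical names, and the sketch works on whole pages of a regular file whose size is known in advance.

```c
#define _DEFAULT_SOURCE   /* mincore() is not part of POSIX */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Record which pages of fd are currently cached. Returns a malloc'd vector
 * with one byte per page (bit 0 set = resident) and stores the page count
 * in *npages. Returns NULL on error or for an empty file.
 */
unsigned char *snapshot_residency(int fd, off_t size, size_t *npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *vec;
    void *map;
    size_t n;

    *npages = 0;
    if (size <= 0)
        return NULL;
    n = ((size_t)size + page - 1) / page;
    map = mmap(NULL, (size_t)size, PROT_NONE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return NULL;
    vec = calloc(n, 1);
    if (vec != NULL && mincore(map, (size_t)size, vec) != 0) {
        free(vec);
        vec = NULL;
    }
    munmap(map, (size_t)size);
    if (vec != NULL)
        *npages = n;
    return vec;
}

/*
 * Advise the kernel to drop every page that was NOT resident in the
 * snapshot. Returns the number of pages for which advice was accepted.
 */
size_t drop_new_pages(int fd, const unsigned char *vec, size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t i, advised = 0;

    for (i = 0; i < npages; i++) {
        if (vec[i] & 1)
            continue;   /* cached before we came along: leave it alone */
        if (posix_fadvise(fd, (off_t)(i * page), (off_t)page,
                          POSIX_FADV_DONTNEED) == 0)
            advised++;
    }
    return advised;
}
```

The intended order is: take the snapshot right after opening the file, do the reading, then call drop_new_pages and free the vector. Issuing one posix_fadvise call per page keeps the sketch simple; merging runs of adjacent pages into a single call would reduce the number of system calls.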

The rsync patch

The new rsync functionality has been contributed to the rsync mainline and will appear in the patch directory of the rsync source archive along with the next version of rsync. The original patch for rsync 2.6.9 is available from here.

Testing the new rsync functionality

To see the amount of file system cache currently in use, run

> grep ^Cached: /proc/meminfo

To see the effect of a large write operation, use dd to generate a 67 MB file filled with zeros.

> dd if=/dev/zero bs=64k count=1024 of=largefile.tmp
1024+0 records in
1024+0 records out
67108864 bytes transferred in 0.753085 seconds (89111922 bytes/sec)

Now check the cache usage, remove the file and check the cache again.

> grep ^Cached: /proc/meminfo
Cached: 742340 kB
> rm largefile.tmp
> grep ^Cached: /proc/meminfo
Cached: 676792 kB

The difference in cache usage (65548 kB) matches the file size (65536 kB) quite closely. This indicates that the whole file had been in the file system cache. This is also the reason for the rather impressive transfer rate dd has reported. Now let's see what happens when running rsync on a large file.

> dd if=/dev/zero bs=64k count=1024 of=largefile.tmp
> grep ^Cached: /proc/meminfo
Cached: 742340 kB
> rsync largefile.tmp largefile2.tmp
> grep ^Cached: /proc/meminfo
Cached: 807876 kB
> rm largefile.tmp largefile2.tmp

Again the whole file landed in the cache. Twice, actually: once when it was written by dd and a second time when it was copied by rsync. So finally, let's do the same thing using the new rsync cache dropping feature.

> dd if=/dev/zero bs=64k count=1024 of=largefile.tmp
> grep ^Cached: /proc/meminfo
Cached: 741940 kB
> rsync --drop-cache largefile.tmp largefile2.tmp
> grep ^Cached: /proc/meminfo
Cached: 741940 kB