In response to my last post about dd, a friend of mine noticed that GNU cp always uses a 128 KB buffer size when copying a regular file; this is also the buffer size used by GNU cat. If you use strace to watch what happens when copying a file, you should see a lot of 128 KB read/write sequences:

$ strace -s 8 -xx cp /dev/urandom /dev/null
...
read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
...

As you can see, each copy is operating on buffers 131072 bytes in size, which is 128 KB. GNU cp is part of the GNU coreutils project, and if you go diving into the coreutils source code you'll find this buffer size is defined in the file src/ioblksize.h. The comments in this file are really fascinating. The author of the code in this file (Jim Meyering) did a benchmark using dd if=/dev/zero of=/dev/null with different values of the block size parameter, bs. On a wide variety of systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these results, shown below. Higher transfer rates are better, and the different symbols represent different system configurations.

Most of the systems get faster transfer rates as the buffer size approaches 128 KB. After that, performance generally degrades slightly.

The file includes a cryptic but interesting explanation of why 128 KB is the best buffer size. Normally with these system calls it's more efficient to use larger buffer sizes, because the larger the buffer, the fewer system calls need to be made. So why the drop-off in performance when a buffer larger than 128 KB is used?

When copying a file, GNU cp will first call posix_fadvise(2) on the source file with POSIX_FADV_SEQUENTIAL as the "advice" flag. As the name implies, this gives a hint to the kernel that cp plans to scan the source file sequentially. This causes the Linux kernel to use "readahead" for the file. On Linux you can also initiate readahead using madvise(2). There's also a system call actually called readahead(2), but it has a slightly different use case.

When you read(2) data from a regular file, if you're lucky some or all of the data you plan to read will already be in the kernel's page cache. The page cache is a cache of disk pages stored in kernel memory. Normally this works on an LRU basis, so when you read a page from disk the kernel first checks the page cache, and if the page isn't in the cache it reads it from disk and copies it into the page cache (possibly evicting an older page from the cache). This means the first access to a disk page actually requires going to disk, but subsequent accesses can simply copy the data from main memory if the disk page is still in the page cache.

When the kernel initiates readahead, it makes a best effort to prefetch pages that it thinks will be needed imminently. In particular, when accessing a file sequentially, the kernel will attempt to prefetch upcoming parts of the file as the file is read. When everything is working correctly, one can get a high cache hit rate even if the file contents weren't already in the page cache when the file was initially opened. In fact, if the file is actually accessed sequentially, there's a good chance of getting a 100% hit rate from the page cache when the kernel is doing readahead.

There's a trade-off here, because if the kernel prefetches pages more aggressively there will be a higher cache hit rate; but if the kernel is too aggressive, it may wastefully prefetch pages that aren't actually going to be read. What actually happens is the kernel has a readahead buffer size configured for each block device, and the readahead kernel thread will prefetch at most that much data for files on that block device. You can see the readahead buffer size using the blockdev command:

# Get the readahead size for /dev/sda
$ blockdev --getra /dev/sda
256

The units returned by blockdev are 512-byte "sectors" (even though my Intel SSD doesn't actually have true disk sectors). Thus a return value of 256 corresponds to a 128 KB buffer size. You can see how this is actually implemented by the kernel in the file mm/readahead.c, in particular in the function ondemand_readahead(), which calls get_init_ra_size(). From my non-expert reading of the code, it appears that the code tries to look at the number of pages in the file, and for large files a maximum value of 128 KB is used. Note that this is highly specific to Linux: other Unix kernels may or may not implement readahead, and if they do there's no guarantee that they'll use the same readahead buffer size.

So how is this related to disk transfer rates? As noted earlier, typically one wants to minimize the number of system calls made, as each system call has overhead. In this case that means we want to use as large a buffer size as possible. On the other hand, performance is best when the page cache hit rate is high. A buffer size of 128 KB satisfies both of these constraints: it's the largest buffer size that can be used before readahead stops being effective. If a larger buffer size is used, read(2) calls will block while the kernel waits for the disk to actually return new data.

In the real world a lot of other things will be happening on the host, so there's no guarantee that the stars will align perfectly. If the disk is very fast, the effect of readahead is diminished, so the penalty for using a larger buffer size might not be as bad. It's also possible to race the kernel here: a userspace program could try to read a file faster than the kernel can prefetch pages, which will make readahead less effective. But on the whole, we expect a 128 KB buffer size to be most effective, and that's exactly what the benchmark above demonstrates.