Or: A tale about Linux file write patterns.

So I once wrote a custom core dump handler to be used with Linux’s core_pattern. What it does is take a core dump on STDIN plus a few arguments, and then write the core to a predictable location on disk with a time stamp and suitable access rights. Core dumps tend to be rather large, and in general you don’t know in advance how much data you’ll write to disk. So I built a functionality to write a chunk of data to disk (say, 16MB) and then check with fstatfs() if the disk has still more than threshold capacity (say, 10GB). This way, a rapidly restarting and core-dumping application cannot lead to “disk full” follow up failures that will inevitably lead to a denial of service for most data handling services.

So… how do we write a lot of data to disk really fast? – Let us maybe rephrase the question: How do we write data to disk in the first place? Let’s assume we have already opened file descriptors in and out , and we just want to copy everything from in to out .

One might be tempted to try something like this:

ssize_t read_write(int in, int out) { ssize_t n, t = 0; char buf[1024]; while((n = read(in, buf, 1024)) > 0) { t += write(out, buf, n); } return t; }

“But…!”, you cry out, “there’s so much wrong with this!” And you are right, of course:

The return value n is not checked. It might be -1 . This might be because e.g. we have got a bad file descriptor, or because the syscall was interrupted.

is not checked. It might be . This might be because e.g. we have got a bad file descriptor, or because the syscall was interrupted. A call to write(out, buf, 1024) will – if it does not return -1 – write at least one byte, but we have no guarantee that we will actually write all n bytes to disk. So we have to loop the write until we have written n bytes.

An updated and semantically correct pattern reads like this (in a real program you’d have to do real error handling instead of assertions, of course):

ssize_t read_write_bs(int in, int out, ssize_t bs) { ssize_t w = 0, r = 0, t, n, m; char *buf = malloc(bs); assert(buf != NULL); t = filesize(in); while(r < t && (n = read(in, buf, bs))) { if(n == -1) { assert(errno == EINTR); continue; } r = n; w = 0; while(w < r && (m = write(out, buf + w, (r - w)))) { if(m == -1) { assert(errno == EINTR); continue; } w += m; } } free(buf); return w; }

We have a total number of bytes to read ( t ), the number of bytes already read ( r ), and the number of bytes already written ( w ). Only when t == r == w are we done (or if the input stream ends prematurely). Error checking is performed so that we restart interrupted syscalls and crash on real errors.

What about the bs parameter? Of course you may have already noticed in the first example that we always copied 1024 bytes. Typically, a block on the file system is 4KB, so we are only writing quarter blocks, which is likely bad for performance. So we’ll try different block sizes and compare the results.

We can find out the file system’s block size like this (as usual, real error handling left out):

ssize_t block_size(int fd) { struct statfs st; assert(fstatfs(fd, &st) != -1); return (ssize_t) st.f_bsize; }

OK, let’s do some benchmarks! (Full code is on GitHub.) For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each version three times and printing the best time in the first column. In parantheses you’ll see the percentage increase in comparison to the best run of all methods:

read+write 16bs 164ms 191ms 206ms read+write 256bs 167ms 168ms 187ms (+ 1.8%) read+write 4bs 169ms 169ms 177ms (+ 3.0%) read+write bs 184ms 191ms 200ms (+ 12.2%) read+write 1k 299ms 317ms 329ms (+ 82.3%)

Mh. Seems like multiples of the FS’s block sizes don’t really matter here. In some runs, the 16x blocksize is best, sometimes it’s the 256x. The only obvious point is that writing only a single block at once is bad, and writing fractions of a block at once is very bad indeed performance-wise.

Now what’s there to improve? “Surely it’s the overhead of using read() to get data,” I hear you saying, “Use mmap() for that!” So we come up with this:

ssize_t mmap_write(int in, int out) { ssize_t w = 0, n; size_t len; char *p; len = filesize(in); p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0); assert(p != NULL); while(w < len && (n = write(out, p + w, (len - w)))) { if(n == -1) { assert(errno == EINTR); continue; } w += n; } munmap(p, len); return w; }

Admittedly, the pattern is simpler. But, alas, it is even a little bit slower! (YMMV)

read+write 16bs 167ms 171ms 209ms mmap+write 186ms 187ms 211ms (+ 11.4%)

“Surely copying around useless data is hurting performance,” I hear you say, “it’s 2014, use zero-copy already!” – OK. So basically there are two approaches for this on Linux: One cumbersome but rather old and known to work, and then there is the new and shiny sendfile interface.

For the splice approach, since either reader or writer of your splice call must be pipes (and in our case both are regular files), we need to create a pipe solely for the purpose of splicing data from in to the write end of the pipe, and then again splicing that same chunk from the read end to the out fd:

ssize_t pipe_splice(int in, int out) { size_t bs = 65536; ssize_t w = 0, r = 0, t, n, m; int pipefd[2]; int flags = SPLICE_F_MOVE | SPLICE_F_MORE; assert(pipe(pipefd) != -1); t = filesize(in); while(r < t && (n = splice(in, NULL, pipefd[1], NULL, bs, flags))) { if(n == -1) { assert(errno == EINTR); continue; } r += n; while(w < r && (m = splice(pipefd[0], NULL, out, NULL, bs, flags))) { if(m == -1) { assert(errno == EINTR); continue; } w += m; } } close(pipefd[0]); close(pipefd[1]); return w; }

“This is not true zero copy!”, I hear you cry, and it’s true, the ‘page stealing’ mechanism has been discontinued as of 2007. So what we get is an “in-kernel memory copy”, but at least the file contents don’t cross the kernel/userspace boundary twice unnecessarily (we don’t inspect it anyway, right?).

The sendfile() approach is more immediate and clean:

ssize_t do_sendfile(int in, int out) { ssize_t t = filesize(in); off_t ofs = 0; while(ofs < t) { if(sendfile(out, in, &ofs, t - ofs) == -1) { assert(errno == EINTR); continue; } } return t; }

So… do we get an actual performance gain?

sendfile 159ms 168ms 175ms pipe+splice 161ms 162ms 163ms (+ 1.3%) read+write 16bs 164ms 165ms 178ms (+ 3.1%)

“Yes! I knew it!” you say. But I’m lying here. Every time I execute the benchmark, another different approach is the fastest. Sometimes the read/write approach comes in first before the two others. So it seems that this is not really a performance saver, is it? I like the sendfile() semantics, though. But beware:

In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.

Strangely, sendfile() works on regular files in the default Debian Squeeze Kernel (2.6.32-5) without problems. (Update 2015-01-17: Przemysław Pawełczyk, who in 2011 sent Changli Gao’s patch which re-enables this behaviour to stable@kernel.org for inclusion in Linux 2.6.32, wrote to me explaining how exactly it ended up being backported. If you’re interested, see this excerpt from his email.)

“But,” I hear you saying, “the system has no clue what your intentions are, give it a few hints!” and you are probably right, that shouldn’t hurt:

void advice(int in, int out) { ssize_t t = filesize(in); posix_fadvise(in, 0, t, POSIX_FADV_WILLNEED); posix_fadvise(in, 0, t, POSIX_FADV_SEQUENTIAL); }

But since the file is very probably fully cached, the performance is not improved significantly. “BUT you should supply a hint on how much you will write, too!” – And you are right. And this is where the story branches off into two cases: Old and new file systems.

I’ll just tell the kernel that I want to write t bytes to disk now, and please reserve space (I don’t care about a “disk full” that I could catch and act on):

void do_falloc(int in, int out) { ssize_t t = filesize(in); posix_fallocate(out, 0, t); }

I’m using my workstation’s SSD with XFS now (not my laptop any more). Suddenly everything is much faster, so I’ll simply run the benchmarks on a 512MB file so that it actually takes time:

sendfile + advices + falloc 205ms 208ms 208ms pipe+splice + advices + falloc 207ms 209ms 210ms (+ 1.0%) sendfile 226ms 226ms 229ms (+ 10.2%) pipe+splice 227ms 227ms 231ms (+ 10.7%) read+write 16bs + advices + falloc 235ms 240ms 240ms (+ 14.6%) read+write 16bs 258ms 259ms 263ms (+ 25.9%)

Wow, so this posix_fallocate() thing is a real improvement! It seems reasonable enough, of course: Already the file system can prepare an – if possible contiguous – sequence of blocks in the requested size. But wait! What about Ext3? Back to the laptop:

sendfile 161ms 171ms 194ms read+write 16bs 164ms 174ms 189ms (+ 1.9%) pipe+splice 167ms 170ms 178ms (+ 3.7%) read+write 16bs + advices + falloc 224ms 229ms 229ms (+ 39.1%) pipe+splice + advices + falloc 229ms 239ms 241ms (+ 42.2%) sendfile + advices + falloc 232ms 235ms 249ms (+ 44.1%)

Bummer. That was unexpected. Why is that? Let’s check strace while we execute this program:

fallocate(1, 0, 0, 134217728) = -1 EOPNOTSUPP (Operation not supported) ... pwrite(1, "\0", 1, 4095) = 1 pwrite(1, "\0", 1, 8191) = 1 pwrite(1, "\0", 1, 12287) = 1 pwrite(1, "\0", 1, 16383) = 1 ...

What? Who does this? – Glibc does this! It sees the syscall fail and re-creates the semantics by hand. (Beware, Glibc code follows. Safe to skip if you want to keep your sanity.)

/* Reserve storage for the data of the file associated with FD. */ int posix_fallocate (int fd, __off_t offset, __off_t len) { #ifdef __NR_fallocate # ifndef __ASSUME_FALLOCATE if (__glibc_likely (__have_fallocate >= 0)) # endif { INTERNAL_SYSCALL_DECL (err); int res = INTERNAL_SYSCALL (fallocate, err, 6, fd, 0, __LONG_LONG_PAIR (offset >> 31, offset), __LONG_LONG_PAIR (len >> 31, len)); if (! INTERNAL_SYSCALL_ERROR_P (res, err)) return 0; # ifndef __ASSUME_FALLOCATE if (__glibc_unlikely (INTERNAL_SYSCALL_ERRNO (res, err) == ENOSYS)) __have_fallocate = -1; else # endif if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP) return INTERNAL_SYSCALL_ERRNO (res, err); } #endif return internal_fallocate (fd, offset, len); }

And you guessed it, internal_fallocate() just does a pwrite() on the first byte for every block until the space requirement is fulfilled. This is slowing things down considerably. This is bad. –

“But other people just truncate the file! I saw this!”, you interject, and again you are right.

void enlarge_truncate(int in, int out) { ssize_t t = filesize(in); ftruncate(out, t); }

Indeed the truncate versions work faster on Ext3:

pipe+splice + advices + trunc 157ms 158ms 160ms read+write 16bs + advices + trunc 158ms 167ms 188ms (+ 0.6%) sendfile + advices + trunc 164ms 167ms 181ms (+ 4.5%) sendfile 164ms 171ms 193ms (+ 4.5%) pipe+splice 166ms 167ms 170ms (+ 5.7%) read+write 16bs 178ms 185ms 185ms (+ 13.4%)

Alas, not on XFS. There, the fallocate() system call is just more performant. (You can also use xfsctl directly for that.) –

And this is where the story ends.

In place of a sweeping conclusion, I’m a little bit disappointed that there seems to be no general semantics to say “I’ll write n bytes now, please be prepared”. Obviously, using posix_fallocate() on Ext3 hurts very much (this may be why cp is not employing it). So I guess the best solution is still something like this:

if(fallocate(out, 0, 0, len) == -1 && errno == EOPNOTSUPP) ftruncate(out, len);

Maybe you have another idea how to speed up the writing process? Then drop me an email, please.

Update 2014-05-03: Coming back after a couple of days’ vacation, I found the post was on HackerNews and generated some 23k hits here. I corrected the small mistake in example 2 (as pointed out in the comments – thanks!). – I trust that the diligent reader will have noticed that this is not a complete survey of either I/O hierarchy, file system and/or hard drive performace. It is, as the subtitle should have made clear, a “tale about Linux file write patterns”.