Greetings,

This post is dedicated to Ted, who raised doubts a while back about

whether Tux3 can ever have a fast fsync:

https://lkml.org/lkml/2013/5/11/128 "Re: Tux3 Report: Faster than tmpfs, what?"

Ted suggested that Tux3's inherently asynchronous design would be a

limitation when it comes to synchronous operations like fsync. As he

put it, "any advantage of decoupling the front/back end is nullified"

because of temporal coupling. We found the opposite to be true: our

asynchronous design works as well for synchronous operations as it does

for any other, and Tux3 now has a high performance fsync to prove it.

Until now, Tux3 handled fsync as a full delta commit equivalent to

syncfs, just like Ext3. This is semantically correct but sometimes

commits more data than necessary and creates a bottleneck by serializing

all fsyncs. To optimize it, we added a mechanism for committing any

subset of dirty inodes separately from a full delta.

Like our normal delta commit, the design is asynchronous: front end

tasks produce fsync work and a backend task commits it. We now have two

backends, one to commit fsyncs and another to commit deltas, serialized

so the combination is single threaded and mostly lockless. Each fsync

moves an inode's dirty blocks from the front delta to the back delta,

then queues the inode for the backend. The backend spins away committing

fsyncs, opportunistically batching them up into group commits. The fsync

backend gives priority to the delta backend: whenever a full delta flush

is needed because of cache pressure or cache age, it finishes its fsync

work and gets out of the way.

This approach required only minimal changes to our existing delta commit

mechanism, mainly to support crash consistency. In particular, our delta

model did not need to be changed at all, and block forking protects

fsync commits in the same way as normal delta commits. That is, an inode

can be freely updated (even via mmap) while it is being fsynced. The

fsync data is isolated from those changes and the frontend does not

stall.

I measured fsync performance using a 7200 RPM disk as a virtual drive

under KVM, configured with cache=none so that asynchronous writes are

cached and synchronous writes translate into direct writes to the

block device. To focus purely on fsync, I wrote a small utility (at the

end of this post) that forks a number of tasks, each of which

continuously appends to and fsyncs its own file. For a single task doing

1,000 fsyncs of 1K each, we have:

Ext4: 34.34s XFS: 23.63s Btrfs: 34.84s Tux3: 17.24s

Tux3 has a nice advantage for isolated fsyncs. This is mainly due to

writing a small number of blocks per commit, currently just five blocks

for a small file. As for a normal delta commit, we do not update bitmap

blocks or index nodes, but log logical changes instead, flushing out

the primary metadata with occasional "unify" deltas to control the log

size and keep replay fast. The 1,000 fsync test typically does about

three unifies, so we do in fact update all our metadata and pay that

cost, just not on every commit.

The win is bigger than it appears at first glance, because writing the

commit block takes about 16.5 ms. That would be similar for all the

filesystems tested, so factoring that out, we see that Tux3 must be

doing 10 times less work than XFS, 24 times less than Ext4 and 25 times

less than Btrfs. Over time, that gap should widen as we reduce our small

file commit overhead towards just two blocks.

In this single threaded test, we pay the full price for "communication

delays" with our asynchronous backend, which evidently does not amount

to much. Given that a context switch is on the order of microseconds

while writing the commit block is on the order of milliseconds, it is

unsurprising that two extra context switches just disappear in the

noise.

Things get more interesting with parallel fsyncs. In this test, each

task does ten fsyncs and task count scales from ten to ten thousand. We

see that all tested filesystems are able to combine fsyncs into group

commits, with varying degrees of success:

Tasks: 10 100 1,000 10,000 Ext4: 0.79s 0.98s 4.62s 61.45s XFS: 0.75s 1.68s 20.97s 238.23s Btrfs 0.53s 0.78s 3.80s 84.34s Tux3: 0.27s 0.34s 1.00s 6.86s (lower is better)

We expect sub-linear scaling with respect to tasks as opportunities to

combine commits increase, then linear scaling as total write traffic

begins to dominate. Tux3 still shows sub-linear scaling even at 10,000

tasks. XFS scales poorly, and also suffers from read starvation at the

high end, sometimes taking tens of seconds to cat a file or minutes to

list a directory. Ext4 and Tux3 exhibit no such issues, remaining

completely responsive under all tested loads. The bottom line for this

test is, Tux3 is twice as fast at fsync with a modest task count, and

the gap widens to nine times faster than its nearest competitor as task

count increases.

Is there any practical use for fast parallel fsync of tens of thousands

of tasks? This could be useful for a scalable transaction server that

sits directly on the filesystem instead of a database, as is the

fashion for big data these days. It certainly can't hurt to know that

if you need that kind of scaling, Tux3 will do it.

Of course, a pure fsync load could be viewed as somewhat unnatural. We also need to know what happens under a realistic load with buffered operations mixed with fsyncs. We turn to an old friend, dbench: Dbench -t10 Tasks: 8 16 32 Ext4: 35.32 MB/s 34.08 MB/s 39.71 MB/s XFS: 32.12 MB/s 25.08 MB/s 30.12 MB/s Btrfs: 54.40 MB/s 75.09 MB/s 102.81 MB/s Tux3: 85.82 MB/s 133.69 MB/s 159.78 MB/s (higher is better)

Tux3 and Btrfs scale well and are way ahead of Ext4 and XFS, while Ext4

and XFS scale poorly or even negatively. Tux3 is the leader by a wide

margin, beating XFS by more than a factor of 5 at the high end.

Dbench -t10 -s (all file operations synchronous)

Tasks: 8 16 32 Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s (higher is better)

With a pure synchronous load (O_SYNC) the ranking is not changed but the

gaps widen, and Tux3 outperforms XFS by a factor of 7.5.

At the risk of overgeneralizing, a trend seems to be emerging: the new,

write-anywhere designs run synchronous operations faster, combine

synchronous and asynchronous operations more efficiently, and scale

better to high task counts than the traditional journaling designs. If

there is to be an Ext5, it might be worth considering the merit of

abandoning the journal in favor of something along the lines of Tux3's

redirect-on-write and logging combination.

Getting back to Ted's question, perhaps an asynchronous design really is

a better idea all round, even for synchronous operations, and perhaps

there really is such a thing as an improved design that is not just a

different set of tradeoffs.

In the full disclosure department, Tux3 is still not properly optimized

in some areas. One of them is fragmentation: it is not very hard to

make Tux3 slow down by running long tests. Our current allocation

algorithm is completely naive - it just allocates the next available

block and wraps at the top of volume. After a few wraps, it makes a big

mess. So today we are not claiming victory in the benchmark department,

we still have some work to do. Today is just about fsync, for which it

is fair to say that Tux3 sets a new performance standard.

Regards, Daniel

Footnote: While I was doing this work, Hirofumi set about improving our

full filesystem sync to support merging parallel full syncs into single

delta commits. That one change improved fsync performance so much that

I almost abandoned my separate fsync work. However, a true, separate

fsync with aggressive group commit eventually proved superior, so now we

have both: a high performance fsync and a full filesystem sync that is

nearly as fast under many loads.

/* * syncs.c

*

* D.R. Phillips, 2015

*

* To build: c99 -Wall syncs.c -o syncs

* To run: ./syncs [<filename> [<syncs> [<tasks>]]] */ #include <unistd.h> #include <stdlib.h> #include <stdio.h> #include <fcntl.h> #include <sys/wait.h> #include <errno.h> #include <sys/stat.h> char text[1024] = { "hello world!

" }; int main(int argc, const char *argv[]) { const char *basename = argc < 1 ? "foo" : argv[1]; char name[100]; int steps = argc < 3 ? 1 : atoi(argv[2]); int tasks = argc < 4 ? 1 : atoi(argv[3]); int err, fd; for (int t = 0; t < tasks; t++) { snprintf(name, sizeof name, "%s%i", basename, t); if (!fork()) goto child; } for (int t = 0; t < tasks; t++) wait(&err); return 0; child: fd = creat(name, S_IRWXU); for (int i = 0; i < steps; i++) { write(fd, text, sizeof text); fsync(fd); } return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/