From: Jens Axboe <>
Subject: [PATCHSET v3][RFC] Make background writeback not suck
Date: Wed, 30 Mar 2016 09:07:48 -0600

Hi,

This patchset isn't so much a final solution as it is a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try to start Chrome, it basically won't start
before the buffered writeback is done. The same goes for server
oriented workloads, where installation of a big RPM (or similar)
adversely impacts database reads or sync writes. When that happens, I
get people yelling at me.

Last time I posted this, I used flash storage as the example. But
this works equally well on rotating storage. Let's run a test case
that writes a lot. This test writes 50 files, each 100M, on XFS on
a regular hard drive. While this happens, we attempt to read
another file with fio.

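The write-files script itself isn't included in this posting;
reconstructed from the description above (exact options approximate),
the writer boils down to:

$ for i in $(seq 1 50); do dd if=/dev/zero of=file-$i bs=1M count=100; done

and the competing reader is a buffered fio read job along the lines
of (again approximate):

$ fio --name=reader --filename=readfile --rw=read --bs=4k --size=1G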

Writers:

$ time (./write-files ; sync)
real    1m6.304s
user    0m0.020s
sys     0m12.210s

Fio reader:

read : io=35580KB, bw=550868B/s, iops=134, runt= 66139msec
  clat (usec): min=40, max=654204, avg=7432.37, stdev=43872.83
   lat (usec): min=40, max=654204, avg=7432.70, stdev=43872.83
  clat percentiles (usec):
   |  1.00th=[    41],  5.00th=[    41], 10.00th=[    41], 20.00th=[    42],
   | 30.00th=[    42], 40.00th=[    42], 50.00th=[    43], 60.00th=[    52],
   | 70.00th=[    59], 80.00th=[    65], 90.00th=[    87], 95.00th=[  1192],
   | 99.00th=[254976], 99.50th=[358400], 99.90th=[444416], 99.95th=[468992],
   | 99.99th=[651264]

Let's run the same test, but with the patches applied, and wb_percent
set to 10%.

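Setting the tunable isn't shown here: wb_percent is one of the sysfs
additions in this series (see blk-sysfs.c in the diffstat), so with
the patches applied, setting it presumably looks something like this,
with the device name and exact attribute path assumed:

$ echo 10 > /sys/block/sda/queue/wb_percent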

Writers:

$ time (./write-files ; sync)
real    1m29.384s
user    0m0.040s
sys     0m10.810s

Fio reader:

read : io=1024.0MB, bw=18640KB/s, iops=4660, runt= 56254msec
  clat (usec): min=39, max=408400, avg=212.05, stdev=2982.44
   lat (usec): min=39, max=408400, avg=212.30, stdev=2982.44
  clat percentiles (usec):
   |  1.00th=[   40],  5.00th=[   41], 10.00th=[   41], 20.00th=[   41],
   | 30.00th=[   42], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
   | 70.00th=[   43], 80.00th=[   45], 90.00th=[   56], 95.00th=[   60],
   | 99.00th=[  454], 99.50th=[ 8768], 99.90th=[36608], 99.95th=[43264],
   | 99.99th=[69120]

Much better, looking at the P99.x percentiles, and of course on the
bandwidth front as well. It's the difference between this:

---io---- -system-- ------cpu-----
    bi    bo   in    cs us sy id wa st
 20636 45056 5593 10833  0  0 94  6  0
 16416 46080 4484  8666  0  0 94  6  0
 16960 47104 5183  8936  0  0 94  6  0

and this

---io---- -system-- ------cpu-----
    bi    bo   in    cs us sy id wa st
   384 73728  571   558  0  0 95  5  0
   384 73728  548   545  0  0 95  5  0
   388 73728  575   763  0  0 96  4  0

in the vmstat output. It's not quite as bad as on deeper queue depth
devices, where we get hugely bursty IO, but it's still very slow.

If we don't run the competing reader, the dirty data writeback
proceeds at normal rates:

# time (./write-files ; sync)
real    1m6.919s
user    0m0.010s
sys     0m10.900s

The above was run without scsi-mq, and using the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO
scheduling is in place for this test; it's not pure blk-mq without
scheduling.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of
things, so we get nice and big extents on the file system end. But we
don't need to flood the device with THOUSANDS of requests for
background writeback. For most devices, we don't need a whole lot to
get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases
that need to clean memory at or near device speed get to do just
that; we still don't need thousands of requests to accomplish it. And
for the cases where we don't need to be near device limits, we can
clean at a more reasonable pace. See the last patch in the series for
a more detailed description of the change, and the tunable.

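To make the behavior concrete, here's a toy sketch of the scaling
idea in C. This is not the actual blk-wb.c code; the function and
parameter names are made up purely for illustration:

#include <stdbool.h>

/*
 * Toy illustration of the scaling described above, not the real
 * blk-wb.c logic. Background writeback gets a small slice of the
 * device queue depth; cases that need to clean memory at or near
 * device speed (sync, balance_dirty_pages() throttling, reclaim)
 * get the full depth.
 */
static unsigned int wb_inflight_limit(unsigned int queue_depth,
				      unsigned int wb_percent,
				      bool sync, bool throttled,
				      bool reclaim)
{
	unsigned int limit = queue_depth * wb_percent / 100;

	if (sync || throttled || reclaim)
		return queue_depth;

	/* always allow at least one background request in flight */
	return limit ? limit : 1;
}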

I welcome testing. If you are sick of Linux bogging down when
buffered writes are happening, then this is for you, laptop or
server. The patchset is fully stable; I have not observed any
problems. It passes full xfstest runs, and a variety of benchmarks as
well. It works equally well on blk-mq/scsi-mq, and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

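For testing, the branch can be cloned directly with:

$ git clone -b wb-buf-throttle git://git.kernel.dk/linux-block.git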

Note that I rebase this branch when I collapse patches. Patches are
against current Linus' git, 4.6.0-rc1, but I can make them available
against 4.5 as well, if there's any interest in that for test
purposes.

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change between
  wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the
  writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

 block/Makefile                   |   2
 block/blk-core.c                 |  15 ++
 block/blk-mq.c                   |  31 ++++-
 block/blk-settings.c             |  20 +++
 block/blk-sysfs.c                | 128 ++++++++++++++++++++
 block/blk-wb.c                   | 238 +++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h                   |  33 +++++
 drivers/nvme/host/core.c         |   1
 drivers/scsi/scsi.c              |   3
 drivers/scsi/sd.c                |   5
 fs/block_dev.c                   |   2
 fs/buffer.c                      |   2
 fs/f2fs/data.c                   |   2
 fs/f2fs/node.c                   |   2
 fs/fs-writeback.c                |  13 ++
 fs/gfs2/meta_io.c                |   3
 fs/mpage.c                       |   9 -
 fs/xfs/xfs_aops.c                |   2
 include/linux/backing-dev-defs.h |   2
 include/linux/blk_types.h        |   2
 include/linux/blkdev.h           |  18 ++
 include/linux/writeback.h        |   8 +
 mm/page-writeback.c              |   2
 23 files changed, 527 insertions(+), 16 deletions(-)

--
Jens Axboe