Dynamic writeback throttling

Please consider subscribing to LWN Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.

Writeback is the process of writing dirty memory pages (i.e. those which have been modified by applications) back to persistent storage, saving the data and potentially freeing the pages for other use. System performance is heavily dependent on getting writeback right; poorly-done writeback can lead to poor I/O rates and extreme memory pressure. Over the last year, it has become increasingly clear that the Linux kernel is not doing writeback as well as it should; several developers have been putting time into improving the situation. The dynamic dirty throttling limits patch from Wu Fengguang demonstrates a new, relatively complex approach to making writeback better.

One of the key concepts behind writeback handling is that processes which are contributing the most to the problem should be the ones to suffer the most for it. In the kernel, this suffering is managed through a call to balance_dirty_pages() , which is meant to throttle a process's memory-dirtying behavior until the situation improves. That throttling is done in a straightforward way: the process is given a shovel and told to start digging. In other words, a process which has been tossed into balance_dirty_pages() is put to work finding dirty pages and arranging to have them written to disk. Once a certain number of pages have been cleaned, the process is allowed to get back to the vital task of creating more dirty pages.

[PULL QUOTE: So, when the system is under memory pressure and very much needs optimal performance from its block devices, it goes into a mode which makes that performance worse. END QUOTE] There are some problems with cleaning pages in this way, many of which have been covered elsewhere. But one of the key ones is that it tends to produce seeky I/O traffic. When writeback is handled normally in the background, the kernel does its best to clean substantial numbers of pages of the same file at the same time. Since filesystems work hard to lay out file blocks contiguously whenever possible, writing all of a file's pages together should cause a relatively small number of head seeks, improving I/O bandwidth. As soon as balance_dirty_pages() gets into the act, though, the block layer is suddenly confronted with writeback from multiple sources; that can only lead to a seekier I/O pattern and reduced bandwidth. So, when the system is under memory pressure and very much needs optimal performance from its block devices, it goes into a mode which makes that performance worse.

Fengguang's 17-part patch makes a number of changes, starting with removing any direct writeback work from balance_dirty_pages() . Instead, the offending process simply goes to sleep for a while, secure in the knowledge that writeback is being handled by other parts of the system. That should lead to better I/O performance, but also to more predictable and controllable pauses for memory-intensive applications.

Much of the rest of the patch series is aimed at improving that pause calculation. It adds a new mechanism for estimating the actual bandwidth of each backing device - something the kernel does not have a good handle on, currently. Using that information, combined with the number of pages that the kernel would like to see written out before allowing a dirtying process to continue, a reasonable pause duration can be calculated. That pause is not allowed to exceed 200ms.

The patch set tries to be smarter than that, though. 200ms is a long time to pause a process which is trying to get some work done. On the other hand, without a bit of care, it is also possible to pause processes for a very short period of time, which is bad for throughput. For this patch set, it was decided that optimal pauses would be between 10ms and 100ms. This range is achieved by maintaining a separate " nr_dirtied_pause " limit for every process; if the number of dirtied pages for that process is below the limit, it is not forced to pause. Any time that balance_dirty_pages() calculates a pause time of less than 10ms, the limit is raised; if the pause turns out to be over 100ms, instead, the limit is cut in half. The desired result is a pause within the selected range which tends quickly toward the 10ms end when memory pressure drops.

Another change made by this patch series is to try to come up with a global estimate of the memory pressure on the system. When normal memory scanning encounters dirty pages, the pressure estimate is increased. If, instead, the kswapd process on the most memory-stressed node in the system goes idle, then the estimate is decreased. This estimate is then used to adjust the throttling limits applied to processes; when the system is under heavy memory pressure, memory-dirtying processes will be put on hold sooner than they otherwise would be.

There is one other important change made in this patch set. Filesystem developers have been complaining for a while that the core memory management code tells them to write back too little memory at a time. On a fast device, overly small writeback requests will fail to keep the device busy, resulting in suboptimal performance. So some filesystems (xfs and ext4) actually ignore the amount of requested writeback; they will write back many more pages than they were asked to do. That can improve performance, but it is not without its problems; in particular, sending massive write operations to slow devices can stall the system for unacceptably long times.

Once this patch set is in place, there's a better way to calculate the best writeback size. The system now knows what kind of bandwidth it can expect from each device; using that information, it can size its requests to keep the device busy for one second at a time. Throttling limits are also based on this one-second number; if there are not enough dirty pages in the system for one second of I/O activity, the backing device is probably not being used to its full capacity and the number of dirty pages should be allowed to increase. In summary: the bandwidth estimation allows the kernel to scale dirty limits and I/O sizes to make the best use of all of the devices in the system, regardless of any specific device's performance characteristics.

Getting this code into the mainline could take a while, though. It is a complicated set of changes to core code which is already complex; as such, it will be hard for others to review. There have been some concerns raised about the specifics of some of the heuristics. A large amount of performance testing will also be required to get this kind of change merged. So we may have to wait for a while yet, but better writeback should be coming eventually.

