PostgreSQL pain points


The kernel has to work for a wide range of workloads; it is arguably unsurprising that it does not always perform as well as some user communities would like. One community that has sometimes felt left out in the cold is the PostgreSQL relational database management system project. In response to an invitation from the organizers of the 2014 Linux Storage, Filesystem, and Memory Management Summit, PostgreSQL developers Robert Haas, Andres Freund, and Josh Berkus came to discuss their worst pain points and possible solutions.

PostgreSQL is an old system, dating back to 1996; it has a lot of users running on a wide variety of operating systems. So the PostgreSQL developers are limited in the amount of Linux-specific code they can add. It is based on cooperating processes; threads are not used. System V shared memory is used for interprocess communication. Importantly, PostgreSQL maintains its own internal buffer cache, but also uses buffered I/O to move data to and from disk. This combination of buffering leads to a number of problems experienced by PostgreSQL users.

Slow sync

The first problem described is related to how data gets to disk from the buffer cache. PostgreSQL uses a form of journaling that they call "write-ahead logging". Changes are first written to the log; once the log is safely on disk, the main database blocks can be written back. Much of this work is done in a "checkpoint" process; it writes log entries, then flushes a bunch of data back to various files on disk. The logging writes are relatively small and contiguous; they work fairly well, and, according to Andres, the PostgreSQL developers are happy enough with how that part of the system works on Linux.

The data writes are another story. The checkpoint process paces those writes to avoid overwhelming the I/O subsystem. But, when it gets around to calling fsync() to ensure that the data is safely written, all of those carefully paced writes are flushed into the request queue at once and an I/O storm results. The problem, they said, is not that fsync() is too slow; instead, it is too fast. It dumps so much data into the I/O subsystem that everything else, including read requests from applications, is blocked. That creates pain for users and, thus, for PostgreSQL developers.

Ted Ts'o asked whether the ability to limit the checkpoint process to a specific percentage of the available I/O bandwidth would help. But Robert responded that I/O priorities would be better; the checkpoint process should be able to use 100% of the bandwidth if nothing else wants it. Use of the ionice mechanism (which controls I/O priorities in the CFQ scheduler) was suggested, but there is a problem: it does not work for I/O initiated from an fsync() call. Even if the data was written from the checkpoint process — which is not always the case — priorities are not applied when the actual I/O is started by fsync().

Ric Wheeler suggested that the PostgreSQL developers need to better control the speed with which they write data; Chris Mason added that the O_DSYNC option could be used to give better control over when the I/O requests are generated. The problem with such approaches is that they require PostgreSQL to know about the speed of the storage device.

The discussion turned back to I/O priorities, at which point it was pointed out that those priorities can only be enforced in the request queue maintained by the I/O scheduler. Some schedulers, including those favored by PostgreSQL users (who tend to avoid the CFQ scheduler), do not implement I/O priorities at all. But, even those that do support I/O priorities place limits on the length of the request queue. A big flush of data will quickly fill the queue, at which point I/O priorities lose most of their effectiveness; a high-priority request will still languish if there is no room for it in the request queue. So, it seems, I/O priorities are not the solution to the problem.

It's not clear what the right solution is. Ted asked if the PostgreSQL developers could provide a small program that would generate the kind of I/O patterns created by a running database. Given a way to reproduce the problem easily, kernel developers could experiment with different approaches to a solution. This "program" may take the form of a configuration script for PostgreSQL initially, but a separate (small) program is what the kernel community would really like to see.

Double buffering

PostgreSQL needs to do its own buffering; it also needs to use buffered I/O for a number of reasons. That leads to a problem, though: database data tends to be stored in memory twice, once in the PostgreSQL buffer, and once in the page cache. That increases the amount of memory used by PostgreSQL considerably, to the detriment of the system as a whole.

Much of that memory waste could conceivably be eliminated. Consider, for example, a dirty buffer in the PostgreSQL cache. It is more current than any version of that data that the kernel might have in the page cache; the only thing that will ever happen to the page cache copy is that it will be overwritten when PostgreSQL flushes the dirty buffer. So there is no value to keeping that data in the page cache. In this case, it would be nice if PostgreSQL could tell the kernel to remove the pages of interest from the page cache, but there is currently no good API for that. Calling posix_fadvise() with POSIX_FADV_DONTNEED can, according to Andres, actually cause pages to be read in; nobody quite understood this behavior, but all agreed it shouldn't work that way. They can't use madvise() without mapping the files; doing that in possibly hundreds of processes tends to be very slow.

It would also be nice to be able to move pages in the opposite direction: PostgreSQL might want to remove a clean page from its own cache, but leave a copy in the page cache. That could possibly be done with a special write operation that would not actually cause I/O, or with a system call that would transfer a physical page into the page cache. There was some talk of the form such an interface might take, but this part of the discussion eventually wound down without any firm conclusions.

Regressions

Another problem frequently experienced by PostgreSQL users is that recent kernel features tend to create performance problems. For example, the transparent huge pages feature tends not to bring much benefit to PostgreSQL workloads, but it can slow them down significantly. Evidently a lot of time goes into the compaction code, which is working hard without actually producing a lot of free huge pages. On many systems, terrible performance problems simply vanish when transparent huge pages are turned off.

Mel Gorman answered that, if compaction is hurting performance, it's a bug. That said, he hasn't seen any transparent huge page bugs for quite some time. There is, he said, a patch out there which puts a limit on the number of processes that can be performing compaction at any given time. It has not been merged, though, because nobody has ever seen a workload where too many processes running compaction was a problem. It might, he suggested, be time to revisit that particular patch.

Another source of pain is the "zone reclaim" feature, whereby the kernel reclaims pages from some zones even if the system as a whole is not short of memory. Zone reclaim can slow down PostgreSQL workloads; usually the best thing to do on a PostgreSQL server is to simply disable the feature altogether. Andres noted that he has been called in as a consultant many times to deal with performance problems related to zone reclaim; it has, he said, been a good money-maker for him. Still, it would be good if the problem were fixed.

Mel noted that the zone reclaim mode was written under the assumption that all processes in the system would fit into a single NUMA node. That assumption no longer makes sense; it's long past time, he said, that this option's default changed to "off." There seemed to be no opposition to that idea in the room, so a change may happen sometime in the relatively near future.

Finally, the PostgreSQL developers noted that, in general, kernel upgrades tend to be scary. The performance characteristics of Linux kernels tend to vary widely from one release to the next; that makes upgrades into an uncertain affair. There was some talk of finding ways to run PostgreSQL benchmarks on new kernels, but no definite conclusions were reached. As a whole, though, developers for both projects were happy with how the conversation came out; if nothing else, it represents a new level of communication between the two projects.

[Your editor would like to thank the Linux Foundation for supporting his travel to the LSFMM summit.]

