Transparent huge pages in the page cache

As was discussed at the recent Linux Storage, Filesystem, and Memory-Management Summit, there is interest in adding transparent huge page (THP) support to filesystems, particularly tmpfs but ultimately others as well. Huge pages are important when mapping large files into a process's virtual memory as they reduce the per-access overhead. Supporting huge pages transparently is essential as the tradeoffs between huge and non-huge pages depend on system-level parameters such as the amount of free memory and its level of fragmentation. Individual applications are not able to assess these parameters and so cannot be expected to make appropriate decisions.
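To put a rough number on that overhead, the following sketch counts how many page-sized translations are needed to map a file. It assumes 4 KiB base pages and 2 MiB huge pages, which is typical of x86-64 but is not stated above; entries_needed is a hypothetical helper used only for illustration.

```python
# Assumed page sizes (typical of x86-64; other architectures differ).
BASE_PAGE = 4 * 1024            # 4 KiB small page
HUGE_PAGE = 2 * 1024 * 1024     # 2 MiB huge page


def entries_needed(mapping_bytes: int, page_size: int) -> int:
    """Number of page-sized translations needed to cover a mapping."""
    return -(-mapping_bytes // page_size)  # ceiling division


file_size = 1024 * 1024 * 1024  # a 1 GiB file, chosen arbitrarily
small = entries_needed(file_size, BASE_PAGE)
huge = entries_needed(file_size, HUGE_PAGE)
print(small, huge, small // huge)
```

Under these assumptions each huge page stands in for 512 small pages, so the same mapping needs far fewer page-table and TLB entries.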

The level of interest in THP for filesystems is such that two competing patch sets have emerged that provide similar functionality with quite different approaches. The discussions at the summit were largely about engineering issues, including the amount of testing, general maintainability, and the level of intrusiveness into other parts of the kernel. While these are appropriate for that forum, they can leave the technically minded wondering what the deep technical differences are, what these new "team pages" are, and whether they are, or are not, a good idea.

Before delving into those differences, it will be useful to understand how the requirements for THP support in filesystems differ from those for the existing THP support for anonymous memory. The different requirements primarily come from the fact that files can be accessed concurrently and in different sized units, and particularly how this can cause the file to grow.

Anonymous memory is only accessed by memory mapping (i.e. with mmap()) and the size of this mapping is usually fixed on allocation. Sharing between processes only happens as the result of a fork() call and, for the process-private mappings that support THP, huge pages will only remain shared as long as they are unchanged. So every mapping of an anonymous transparent huge page will be the same size.

Memory used for file-backed data can be accessed by direct reads and writes and can be mapped concurrently by multiple processes, which may request differently sized mappings. Preventing one process from getting a huge mapping because some other process has only mapped one or two (small) pages is not nearly so acceptable as it might be with anonymous memory. For this reason, THP for filesystems must be able to support concurrent mapping as both huge and non-huge pages.

Unlike anonymous memory, space in files isn't allocated by mapping it: that happens as the result of write() or fallocate() calls. The allocated size may not be a multiple of the huge-page size, and the space allocated may not even be contiguous if holes are left. If huge pages were only used once a file was large enough to completely fill them, then a file that was not initially allocated to its final size would need to be migrated from individual pages to huge pages before a huge-page mapping would be possible. This would often be unnecessary work. So THP for files must allocate huge pages before the file is known to be big enough to fully utilize them.

Finally, a file may be used without being mapped into process memory at all, while anonymous memory is always mapped. So any changes to a filesystem to support transparent huge page mapping must not negatively impact normal read/write performance on an unmapped file.

The two patch sets approach these challenges from two different directions. One seems to say "we know how to do filesystems, let's build on that to enable huge mappings". The other starts with "we know how to do huge mappings" and proceeds to "how can we adjust those to work with filesystems?" Both patch sets support only tmpfs. This allows for valuable simplifications but means they cannot easily be assessed as fully general solutions.

Page teams: extending filesystems toward huge pages

The first patch set is in use at Google and is being championed by Hugh Dickins. It starts with current filesystem use of the page cache and asks "how can we encourage huge page mappings?". This involves allocating a "high order" physically contiguous set of pages when possible, annotating them so that the memory-management code can easily detect that a huge-page mapping will work, and allowing the set to be broken into individual pages when, but only when, necessary.

The allocation step is relatively straightforward: when allocating memory in the page cache, if there are not currently any allocated pages in the region that a huge page would fill, request a huge page. If that works, good. If not, just fall back to individual small pages.
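The allocation policy just described can be modeled in a few lines. This is a toy user-space sketch, not code from the patch set: ToyPageCache, its fields, and the 512-pages-per-huge-page constant are all assumptions made for illustration.

```python
HPAGE_NR = 512  # small pages per huge page (assumed, as on x86-64)


class ToyPageCache:
    """Toy model of one file's page cache with a limited huge-page supply."""

    def __init__(self, free_huge_pages: int):
        self.free_huge_pages = free_huge_pages
        self.teams = set()    # aligned start indexes backed by a huge allocation
        self.present = set()  # file page indexes instantiated so far

    def allocate(self, index: int) -> str:
        """Instantiate one file page and report how it is backed."""
        start = index - index % HPAGE_NR
        region_empty = not any(
            start <= i < start + HPAGE_NR for i in self.present
        )
        # Only an empty region triggers a huge-page request; if none is
        # available, the region quietly falls back to small pages.
        if region_empty and self.free_huge_pages > 0:
            self.free_huge_pages -= 1
            self.teams.add(start)
        self.present.add(index)
        return "team" if start in self.teams else "small"


cache = ToyPageCache(free_huge_pages=1)
print(cache.allocate(3), cache.allocate(4), cache.allocate(512))
```

In this model a region that starts out with small pages stays that way, which matches the rule that a huge page is only requested when nothing in the region has been allocated yet.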

The annotation step is where the term "team pages" comes in; the large allocation is referred to as a "team" or "team of pages." The first page of the high-order set of pages, and every page in that set that has been instantiated in the file, gets a new page flag: PG_team. This makes it easy, when looking at a page, to know if it is part of a larger team allocation. By itself this is not quite enough to allow a huge page to be mapped since, when that mapping happens, any "holes" within that page must be instantiated as they could be written to. So every small page must get PG_team set. This instantiation happens the first time that a huge mapping is attempted, and an extra page flag, PG_checked, is set on the head page. From then on, a huge mapping can be created after just checking for PG_team and PG_checked.
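The interplay of the two flags can be sketched as a small state machine. This is a simplified model under stated assumptions, not the kernel's data structures: ToyTeam and its methods are hypothetical, and the flags are plain Python booleans rather than bits in a page structure.

```python
HPAGE_NR = 512  # small pages per team (assumed)


class ToyTeam:
    """Toy model of one team: per-page PG_team plus PG_checked on the head."""

    def __init__(self):
        self.team_flag = [False] * HPAGE_NR  # models PG_team on each page
        self.team_flag[0] = True             # the head page is always flagged
        self.checked = False                 # models PG_checked on the head

    def instantiate(self, offset: int) -> None:
        """A page of the file is instantiated inside the team."""
        self.team_flag[offset] = True

    def map_huge(self) -> int:
        """Prepare a huge mapping; return how many holes had to be filled."""
        if self.checked:
            return 0  # fast path: the flags already say the team is complete
        filled = 0
        for offset in range(HPAGE_NR):
            if not self.team_flag[offset]:
                self.team_flag[offset] = True  # instantiate the hole
                filled += 1
        self.checked = True
        return filled


team = ToyTeam()
team.instantiate(5)
print(team.map_huge(), team.map_huge())
```

The first call does the expensive walk over the team; every later call only looks at the flags.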

Finally, it must be possible to break up the team when free memory is short. There are two sides to this. If a team has never been mapped as a huge page, it may have some unallocated pages (PG_team not set) that are invisible to memory management (as they are not part of the file), but must be made visible so they can be released when memory pressure is high. Pages in the team that are part of the file will be tracked on an LRU (least recently used) list so that the oldest pages can be written out to swap should that become necessary. Either of these will require disbanding the team before individual pages are written or freed.

Unallocated pages are marked in the file's radix tree using a tag, SHMEM_TAG_HUGEHOLE, which reuses the PAGECACHE_TAG_DIRTY tag used by normal filesystems. All files with "huge holes" are kept in a new LRU list that is scanned by a shrinker when memory reclaim tries to recover memory. The basic approach of this shrinker is, quoting Dickins, to "find [the] least occupied huge page in [the] older half of shrinklist, and migrate its cachepages into the most occupied huge page with enough space to fit, again chosen from [the] older half of shrinklist."

Rather than just freeing some scattered pages, it attempts to free a whole huge page. This generally helps reduce fragmentation of free space, so there are more likely to be huge pages free the next time one is needed.
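The selection rule quoted above can be illustrated with a toy function. This is only a sketch of the stated heuristic, not the shrinker from the patch set: pick_migration, the (name, occupied) tuples, and the way the "older half" is rounded are all assumptions.

```python
HPAGE_NR = 512  # small pages per huge page (assumed)


def pick_migration(shrinklist):
    """Choose a (source, destination) pair from the older half of the list.

    shrinklist holds (name, occupied_pages) tuples, oldest first.
    Returns None when no destination has room for the source's pages.
    """
    older = shrinklist[: (len(shrinklist) + 1) // 2]
    if not older:
        return None
    # Source: the least occupied huge page, so the fewest pages must move.
    source = min(older, key=lambda entry: entry[1])
    # Destination: the most occupied huge page that can still absorb them.
    candidates = [
        entry for entry in older
        if entry is not source and HPAGE_NR - entry[1] >= source[1]
    ]
    if not candidates:
        return None
    destination = max(candidates, key=lambda entry: entry[1])
    return source[0], destination[0]


shrinklist = [("a", 400), ("b", 10), ("c", 500), ("d", 100), ("e", 3), ("f", 7)]
print(pick_migration(shrinklist))
```

With this input only "a", "b", and "c" are in the older half; "b" is emptied into "c", which leaves one huge page entirely free.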

When dismantling a team of pages that has no holes, it is necessary to get a page lock on the head page first in order to avoid racing with an attempt to map the whole team into an address space. Consequently, attempting to disband a team when a tail page has reached the top of the LRU list is problematic since the reclaim process is not a good time to lock some other page. The page-teams patch set addresses this by refusing to write a tail page to swap before the head page has been written. When the head page is processed, it disbands the team, writes out that head, and thus frees all the tail pages to be written when their time comes. As pages tend to be written out in the same order as in the file, this usually works quite well.

When it doesn't work, though, the very large number of pages that refuse to be evicted (up to 511 for every 512) can distort the active and inactive page counts that are used to guide page eviction. To address this, slightly creative accounting is used to not count the tail pages, but instead to count the head pages as though they were multiple pages. This means that, as long as the head page is active, the whole team is accounted as active, and the memory reclaim code will be keen to transition more pages from active to inactive so they can then be freed.
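A toy version of that accounting might look like the following. It is a simplified model, not the patch set's implementation: ToyPage, its fields, and lru_counts are hypothetical names, and the head's weight is assumed to be one plus its count of allocated tail pages.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    kind: str        # "head", "tail", or "small"
    active: bool
    tails: int = 0   # allocated tail pages; meaningful on a head page only


def lru_counts(pages):
    """Return (active, inactive) counts with team-aware weighting."""
    active = inactive = 0
    for page in pages:
        if page.kind == "tail":
            continue  # tail pages are not counted individually
        # A head page is counted as itself plus all of its tail pages.
        weight = 1 + page.tails if page.kind == "head" else 1
        if page.active:
            active += weight
        else:
            inactive += weight
    return active, inactive


pages = [ToyPage("head", True, tails=511)]
pages += [ToyPage("tail", False)] * 511
pages.append(ToyPage("small", False))
print(lru_counts(pages))
```

Even though every tail page in this example sits on the inactive list, the team is accounted as 512 active pages because its head is active.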

While this approach seems to have minimal impact on filesystems by allowing them to continue to use pages in the page cache much as they already do, there are problems. As already noted, the PAGECACHE_TAG_DIRTY tag is being reused to record where in the radix tree there are holes that can be reclaimed. This is not a problem for tmpfs as its pages don't have a distinction between "clean" and "dirty". Anything that is in memory can be written to swap and then removed from memory; it need never be marked "clean". It is a problem for most other filesystems, though, so some other way of finding "huge holes" would need to be found.

Another conflict is that, to achieve the creative accounting mentioned, the head page holds a count of the number of allocated tail pages. This number is stored in the same place in the page description structure that most filesystems use to store a pointer to their own private data. This is not a problem for tmpfs, which doesn't use the private field, but would be for other filesystems. And then there is the fact that the head page of a team imposes a particular meaning on the PG_checked flag, which would interfere with filesystems that use that flag themselves.

It is quite possible that these issues can all be addressed with sufficient ingenuity. The patch set is deliberately aimed at supporting only tmpfs, not at general filesystem support, to keep the complexity manageable. But they are issues to keep in mind when considering the long-term future of THP support in the page cache.

THP-enabled tmpfs using compound pages

The second patch set under consideration comes from Kirill Shutemov. It starts with the premise that the memory-management code already knows how to map compound pages into a process's address space, because they are used for anonymous THP and for files in the hugetlbfs filesystem. It tries to make such compound pages (collections of pages that largely behave like a single page) usable by a filesystem, particularly tmpfs.

Having individual compound pages, rather than teams of small pages, means that the active/inactive LRUs just see the one page rather than multiple related pages on different LRUs, so the creative accounting needed for page teams is irrelevant here. However, when memory is mapped with MAP_LOCKED, it must be accounted as "unevictable" rather than active or inactive. If just part of a compound page is mapped and locked, some extra care is needed, once again, to get that accounting right.

Unlike the page-team approach, compound pages provide no obvious mechanism for keeping track of the free space that might be at the end of a compound page after the end of the file. The current patch set doesn't address this at all, treating that space as in-use until the compound page is disbanded for some other reason, typically to write it out to swap. There are suggestions that this will be addressed in a future patch set. Similarly, this patch set makes no effort to support page-sized holes within a compound page.

Processes expect to be able to map individual small pages. If the region they want to map is within a compound page, this becomes more complex: it is necessary to find the compound page that contains the target region and then map part of that. The current approach for finding that compound page stores all the small pages within the compound page in the page-cache radix tree. This is similar to how page teams operate, but doesn't make best use of having a single compound page. This solution is expected only to be interim. Matthew Wilcox, one of the DAX developers, has some patches that enhance the radix tree to be able to store a single entry covering a large range of addresses. This will help with DAX, which uses huge, and even larger, pages and would help with mapping compound pages too.
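The difference between the interim scheme and a single range entry can be shown with a plain dictionary standing in for the radix tree. This is a deliberately simplified sketch: the function names are hypothetical, the real radix tree is not a hash table, and the range-entry lookup assumes every entry is an aligned huge page.

```python
HPAGE_NR = 512  # small pages per compound page (assumed)


def insert_per_small_page(tree, head_index, page):
    """Interim scheme: one entry for every small page in the compound page."""
    for index in range(head_index, head_index + HPAGE_NR):
        tree[index] = page


def lookup_per_small_page(tree, index):
    return tree.get(index)


def insert_range_entry(tree, head_index, page):
    """Range scheme: a single entry stands for the whole aligned range."""
    tree[head_index] = page


def lookup_range_entry(tree, index):
    # Round down to the huge-page boundary to find the covering entry.
    return tree.get(index - index % HPAGE_NR)


per_page, ranged = {}, {}
insert_per_small_page(per_page, 1024, "compound-A")
insert_range_entry(ranged, 1024, "compound-A")
print(len(per_page), len(ranged))
print(lookup_per_small_page(per_page, 1300), lookup_range_entry(ranged, 1300))
```

Both lookups find the same compound page for small-page index 1300, but the range scheme stores one entry where the interim scheme stores 512.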

Allowing, eventually, only one radix tree entry per huge page is a real benefit, and there are other related benefits. Any operation that works a page at a time could fairly easily operate on a huge page at a time, thus reducing per-page overhead. Andrea Arcangeli, architect of the original huge-page support in Linux, sees this as a key issue, noting that:

My view is that in terms of long-lived computation from userland point of view, both models are malleable enough and could achieve everything we need in the end, but as far as the overall kernel efficiency is concerned the compound model will always retain a slight advantage in performance by leveraging a native THP compound refcounting that requires just one atomic_inc/dec per THP mapcount instead of 512 of them.

In terms of compatibility with other filesystems, it is quite clear that a traditional filesystem would need a lot of change to work with compound pages. A filesystem will typically attach some private data to each page, possibly a list of buffer_head structures describing the filesystem blocks in that page. With compound pages that private data structure will now have to describe a much larger range of blocks. A linear search as currently used would no longer be appropriate. On the positive side, at least there is somewhere to store that private data, which is not currently the case with team pages.

Six of one, half a dozen of the other

The two approaches are clearly different and they both have costs and benefits. Probably the two key questions are: which provides the lowest overheads for large-scale operations, and which integrates more easily with other filesystems. Neither patch set is really sufficiently mature to provide answers to those questions and it wouldn't be at all surprising if each question were answered best by a different approach. Continuing on without a clear decision does not seem likely to be in the best interests of Linux in the short term, but making one now does not seem at all easy. I'm certainly glad that it isn't my responsibility to make that call.