In-kernel memory compression

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

Amdahl's law tells us that there is always a bottleneck in any computing system. Historically, the bottleneck in many workloads on many systems is the CPU and so system designers have made CPUs faster and more efficient and also continue to increase the number of CPU cores even in low-end systems. So now, increasingly, RAM is the bottleneck; CPUs wait idly while data is moved from disk to RAM and back again. Adding more RAM is not always a timely or cost-effective option and sometimes not an option at all. Faster I/O buses and solid-state disks reduce the bottleneck but don't eliminate it.

Wouldn't it be nice if it were possible to increase the effective amount of data stored in RAM? And, since those CPUs are waiting anyway, perhaps we could use those spare CPU cycles to contribute towards that objective? This is the goal of in-kernel compression: We keep more data — compressed — in RAM and use otherwise idle CPU cycles to execute compression and decompression algorithms.

With the recent posting of zswap, there are now three in-kernel compression solutions proposed for merging into the kernel memory management (MM) subsystem: zram, zcache, and zswap. While a casual observer might think that only one is necessary, the three have significant differences and may target different user bases. So, just as there are many filesystems today co-existing in the kernel, someday there may be multiple compression solutions. Or maybe not... this will be decided by Linus and key kernel developers. To help inform that decision, this article compares and contrasts the different solutions, which we will call, for brevity, the "zprojects". We will first introduce some of the key concepts and challenges of compression. Then we will identify three design layers and elaborate on the different design choices made by the zprojects. Finally, we will discuss how the zprojects may interact with the rest of the kernel and then conclude.

Compression basics

For in-kernel compression to work, the kernel must take byte sequences in memory and compress them, and then keep the compressed version in RAM until some point in the future when the data is needed again. While the data is in a compressed state, it is impossible to read any individual bytes from it or write any individual bytes to it. Then, when the data is needed again, the compressed sequence must be decompressed so that individual bytes can again be directly accessed.

It is possible to compress any number of sequential bytes but it is convenient to use a fixed "unit" of compression. A standard storage unit used throughout the kernel is a "page" which is made up of a fixed constant PAGE_SIZE bytes (4KB on most architectures supported by Linux). If this page is aligned at a PAGE_SIZE address boundary, it is called a "page frame"; the kernel maintains a corresponding " struct page " for every page frame in system RAM. All three zprojects use a page as the unit of compression and allocate and manage page frames to store compressed pages.

There are many possible compression algorithms. In general, achieving a higher "compression ratio" takes a larger number of CPU cycles, whereas less-effective compression can be done faster. It is important to achieve a good balance between time and compression ratio. All three zprojects, by default, use the LZO(1X) algorithm in the kernel's lib/ directory, which is known to deliver a good balance. However, it is important that the choice of algorithm remain flexible; possibly the in-CPU algorithm might be replaced in some cases by an architecture-specific hardware compression engine.

In general, the number of cycles required to compress, and to later decompress, a sequence of bytes is roughly proportional to the number of bytes in the sequence. Since the size of a page is fairly large, page compression and decompression are expensive operations and, so, we wish to limit the number of those operations. Thus we must carefully select which pages to compress, choosing pages that are likely to be used again but not likely to be used in the near future, lest we spend all the CPU's time repeatedly compressing and then decompressing pages. Since individual bytes in a compressed page are inaccessible, we must not only ensure that normal CPU linear byte addressing is not attempted on any individual byte, but also ensure that the kernel can clearly identify the compressed page so that it can be found and decompressed when the time comes to do so.

When a page is compressed, the compression algorithm is applied and the result is a sequence of bytes which we can refer to as a "zpage". The size of a zpage, its "zsize", is highly data-dependent, hard to predict, and highly variable. Indeed the expected distribution of zsize drives a number of key zproject design decisions so it is worthwhile to understand it further. The "compression ratio" for a page is zsize divided by PAGE_SIZE. For nearly all pages, zsize is less than PAGE_SIZE so the compression ratio is less than one, but odd cases may occur where compression actually results in the zsize being larger than PAGE_SIZE. A good compression solution must have a contingency plan to deal with such outliers. At the other extreme, a data page containing mostly zeroes or ones may compress by a factor of 100x or more. For example, LZO applied to an all-zero page results in a zpage with zsize equal to 28, for a compression ratio of 0.0068.

As a rough rule of thumb, "on average", compression has a ratio of about 0.5 (i.e. it reduces a page of data by about a factor of two), so across a wide set of workloads, it is useful to envision the distribution as a bell curve, centered at PAGE_SIZE/2. We will refer to zpages with zsize less than PAGE_SIZE/2 as "thin" zpages, and zpages with zsize greater than PAGE_SIZE/2 as "fat" zpages. For any given workload, of course, the distribution may "skew thin", (if many mostly-zero pages are compressed, for example) or "skew fat", as is true if the data is difficult to compress (already-compressed data like a JPEG image would be an example). A good compression solution must thus be able to handle zsize distributions from a wide range of workloads.

With this overview and terminology, we can now compare and contrast the three zprojects. At a high level, each uses a "data source layer" to provide pages to compress and a "data management layer" to organize and store the zpages. This data management layer has two sublayers, the first for data organization, which we will call the "metadata layer", and the second for determining how best to utilize existing kernel memory and kernel allocation code (such as alloc_page() ) to optimally store the zpages, which we will call the "zpage allocator layer".

Data source layer — overview

As previously noted, the nature of compression limits the types and quantities of pages to which it can be applied. It doesn't make sense to compress a page for which there is a high probability of imminent reuse. Nor does it make sense to compress a page for which there is a high probability the data it contains will never again be reused. All three zprojects assume one or more data sources that provide a "pre-qualified" sequence of data pages. In one zproject, an existing kernel subsystem is the data source. For the other two zprojects, previous kernel hooks added as part of the "transcendent memory" projects known as cleancache and frontswap, source the data page stream.

Each data source must also provide a unique identifier, or "key", for each page of data so that the page, once stored and associated with that unique key, can be later retrieved by providing the key. It's also important to note that different data sources may result in streams of pages with dramatically different contents and thus, when compressed, different zsize distributions.

Swap/anonymous pages as a data source

The Linux swap subsystem provides a good opportunity for compression because, when the system is under memory pressure, swap acts as a gatekeeper between (a) frequently used anonymous pages which are kept in RAM, and (b) presumably less frequently used anonymous pages that "overflow", or can be "swapped", to a very-much-slower-than-RAM swap disk. If, instead of using a slow swap disk, we can cause the gatekeeper to store the overflow pages in RAM, compressed, we may be able to avoid swapping them entirely. Further, the swap subsystem already provides a unique key for each page so that the page can be fetched from the swap disk. So, unsurprisingly, all three zprojects consider swap pages as an ideal data source for compression.

One zproject, zram, utilizes the existing swap device model. A zram swap device is explicitly created and enabled in user space (via mkswap and swapon ) and this zram device is prioritized (presumably highest) against other configured swap devices. When the swap subsystem sends a data page to the zram device, the page goes through the block I/O subsystem to the zram "driver". All swap pages sent to zram are compressed and associated with a key created by concatenating the swap device identifier and page offset. Later, when the swap subsystem determines (via a page fault) that the page must be brought back in from the swap device to fill a specific page frame, the zram device is notified; it then decompresses the zpage matching the swap device number and page offset into that page frame.

The other two zprojects, zcache and zswap, use the kernel's frontswap hooks (available since Linux 3.5) in the swap subsystem. Frontswap acts as a "fronting store" — a type of a cache — to an existing swap device. It avoids the block I/O subsystem completely; swapped pages instead go directly to zcache/zswap where they are compressed. On the fetch path, when the page fault occurs, the block I/O subsystem is again bypassed and then, as with zram, the page is decompressed directly into the faulting page frame.

There are some subtle but important side effects of the different approaches:

The block I/O subsystem is designed to work with fixed-size block devices and is not very happy when a block write results in a failure. As a result, when a poorly-compressible page ("very fat" zpage) is presented to zram, all PAGE_SIZE bytes of the uncompressed page are stored by zram in RAM, for no net space savings. The frontswap-based zprojects have more flexibility; if a very fat zpage is presented, zcache/zswap can simply reject the request, in which case the swap subsystem will pass on the original data page to the real swap disk.

Since zram presents itself as a swap disk, user-space configuration is required, but this also means that zram can work on systems with no physical swap device, a configuration common for embedded Linux systems. On the other hand, zcache and zswap depend entirely on already-configured swap devices and so are not functional unless at least one physical swap device is configured and sized adequately. As a result, one might think of zram as being a good match for an embedded environment, while the others are better suited for a more traditional server/data-center environment.

It is important to note that all three zprojects must never discard any of these data pages unless explicitly instructed, otherwise user-space programs may experience data loss. Since all three zprojects have a finite capacity on any given system, it is prudent to have a "pressure relief valve". Both zcache and zswap have the ability to "writeback" data to the swapdisk to which the data was originally intended to be written; this cannot be done with zram.

Clean page cache pages as a data source

It is not uncommon for the vast majority of page frames in a running system to contain data pages that are identical to pages already existing on a disk in a filesystem. In Linux, these pages are stored in the page cache; the data in these pages is kept in RAM in anticipation that it will be used again in the future. When memory pressure occurs and the kernel is looking to free RAM, the kernel will be quick to discard some of this duplicate "clean" data because it can be fetched from the filesystem later if needed. A "cleancache" hook, present in the kernel since Linux 3.0, can divert the data in these page frames, sending the pages to zcache where the data can be compressed and preserved in RAM. Then, if the kernel determines the data is again needed in RAM, a page fault will be intercepted by another cleancache hook which results in the decompression of the data.

Because modern filesystems can be large, unique identification of a page cache page is more problematic than is unique identification of swap pages. Indeed, the key provided via cleancache must uniquely identify: a filesystem; an "exportable" inode value representing a file within that filesystem; and a page offset within the file. This combination requires up to 240 bits.

Zcache was designed to handle cleancache pages, including the full range of required keys. As a result, the data management layer is much more complex, consisting of a combination of different data structures which we will describe below. Further, the location of some cleancache hooks in the VFS part of the kernel results in some calls to the data management layer with interrupts disabled which must be properly handled.

Even more than with swap data, filesystem data can proliferate rapidly and, even compressed, can quickly fill up RAM. So, as with swap data, it is prudent to have a pressure-relief valve for page cache data. Unlike swap data, however, we know that page cache data can simply be dropped whenever necessary. Since zcache was designed from the beginning to manage page cache pages, its data management layer was also designed to efficiently handle the eviction of page cache zpages.

Zram can maintain entire fixed-size filesystems in compressed form in RAM, but not as a cache for a non-RAM-based filesystem. In essence, part of RAM can be pre-allocated as a "fast disk" for a filesystem; even filesystem metadata is compressed and stored in zram. While zram may be useful for small filesystems that can entirely fit in RAM, the caching capability provided via cleancache combined with the compression provided by zcache is useful even for large filesystems.

Zswap is singularly focused on swap and so does not make use of the page cache at all or filesystem data as a data source. As a result, its design is simpler because it can ignore the complexities required to handle the tougher page cache data source.

The metadata layer

Once the zproject receives a data page for compression, it must set up and maintain data structures so that the zpage can be stored and associated with a key for that page, and then later found and decompressed. Concurrency must also be considered to ensure that unnecessary serialization bottlenecks do not occur. The three zprojects use very different data structures for different reasons, in part driven by the requirements of data sources to be maintained.

Since the number of swap devices in a system is small, and the number of pages in each is predetermined and fixed, a small, unsigned integer representing the swap device combined with a page offset is sufficient to identify a specific page. So zram simply uses a direct table lookup (one table per configured zram swap device) to find and manage its data. For example, for each data page stored, the zram table contains a zsize, since zsize is useful for decompression validation, as well as information to get to the stored data. Concurrency is restricted, per-device, by a reader-writer semaphore.

Zswap must manage the same swap-driven address space, but it utilizes one dynamic red-black tree (rbtree) per swap device to manage its data. During tree access, a spinlock on the tree must be held; this is not a bottleneck because simultaneous access within any one swap device is greatly limited by other swap subsystem factors. Zswap's metadata layer is intentionally minimalist.

Zcache must manage both swap data and pagecache data, and the latter has a much more extensive key space which drives the size and complexity of zcache data structures. For each filesystem, zcache creates a "pool" with a hash table of red-black tree root pointers; each node in the rbtree represents a filesystem inode — the inode space in an exportable filesystem is often very sparsely populated which lends itself to management by an rbtree. Then each rbtree node contains the head of a radix tree — the data structure used throughout the kernel for page offsets — and this radix tree is used to look up a specific page offset. The leaf of the radix tree points to a descriptor which, among other things, references the zpage. Each hash table entry has a spinlock to manage concurrency; unlike swap and frontswap, filesystem accesses via cleancache may be highly concurrent so contention must be aggressively avoided.

Impact of writeback on the metadata layer

Earlier it was noted that zcache and zswap support "writeback"; this adds an important twist to the data structures in that, ideally, we'd like to write back pages in some reasonable order, such as least recently used (LRU). To that end, both zcache and zswap maintain a queue to order the stored data; zswap queues each individual zpage and zcache queues the page frames used to store the zpages. While lookup requires a key and searches the data structures from the root downward, writeback chooses one or more zpages from the leaves of the data structure and then must not only remove or decompress the data, but must also remove its metadata from the leaf upward. This requirement has two ramifications: First, the leaf nodes of data structures must retain the key along with information necessary to remove the metadata. Second, if writeback is allowed to execute concurrently with other data structure accesses (lookup/insert/remove), challenging new race conditions arise which must be understood and avoided.

Zswap currently implements writeback only as a side effect of storing data (when kernel page allocation fails) which limits the possible race conditions. Zcache implements writeback as a shrinker routine that can be run completely independently and, thus, must handle a broader set of potential race conditions. Zcache must also handle this same broader set when discarding page cache pages.

Other metadata layer topics

One clever data management technique is worth mentioning here: Zram deduplicates "zero pages" — pages that contain only zero bytes — requiring no additional storage. In some zsize distributions, such pages may represent a significant portion of the zpages to be stored. While the data of the entire page may need to be scanned to determine if it is a zero page, this overhead may be small relative to that of compression and, in any case, it pre-populates the cache with bytes that the compression algorithm would need anyway. Zcache and zswap should add this feature. More sophisticated deduplication might also be useful. Patches are welcome.

Zpage allocator layer

Memory allocation for zpages can present some unique challenges. First, we wish to maximize the number of zpages stored in a given number of kernel page frames, a ratio which we will call "density". Maximizing the density can be challenging as we do not know a priori the number of zpages to be stored, the distribution of zsize of those zpages, or the access patterns which insert and remove those zpages. As a result, fragmentation can be a concern. Second, allocating memory to store zpages may often occur in an environment of high-to-severe memory pressure; an effective allocator must handle frequent failure and should minimize large (i.e. multiple-contiguous-page) allocations. Third, if possible, we wish the allocator to support some mechanism to enable ordered writeback and/or eviction as described above.

One obvious alternative is to use an allocator already present and widely used in the kernel: kmalloc() , which is optimized for allocations that nicely fit into default sizes of 2N byte chunks. For allocations of size 2N+x, however, as much as 50% of the allocated memory is wasted. kmalloc() provides an option for "odd" sizes but this requires pre-allocated caches, often of contiguous sets of pages ("higher order" pages) which are difficult to obtain under memory pressure. Early studies showed that kmalloc() allocation failures were unacceptably frequent and density was insufficient, so kmalloc() was discarded and other options were pursued, all based on alloc_page() , which always allocates a single page frame.

To improve density, the xvmalloc allocator was written, based on the two-level sequential fit (TLSF) algorithm. Xvmalloc provided high density for some workloads and was the initial default allocator for zram and for the frontswap-driven portion of the initial zcache. It was found, however, that the high density led to poor fragmentation characteristics, especially for zsize distributions that skewed fat. Xvmalloc has since been discarded (though remnants of it survive through Linux 3.8.)

A new allocator, zbud, was written for the cleancache-driven portion of the original zcache code. The initial version of zbud had less focus on high density and more on the ability to efficiently evict cleancache-provided zpages. With zbud, no more than two zpages are contained in a page frame, buddying up a pair of zpages, one fat and one thin. Page frames, rather than zpages, are entered into an LRU queue so that entire page frames can be easily "freed" to the system. On workloads where zsize skews thin, a significant amount of space is wasted, but fragmentation is limited, and eviction is still easy.

Zsmalloc, an allocator intended to "rule them all", was then written. Zsmalloc takes the best of kmalloc() but with the ability to provide the complete range of "size classes" suitable to store zpages; zsmalloc also never requires higher-order allocations. All zpages with zsize within a "class" (e.g. between 33-48 bytes) are grouped together and stored in the same "zspage". A clever feature (especially useful for fat zpages) allows discontiguous page frames to be "stitched" together, so that the last bytes in the first page frame contain the first part of the zpage and the first bytes of the second page frame contain the second part. A group of N of these discontiguous pages are allocated on demand for each size class, with N chosen by the zsmalloc code to optimize density.

As a result of these innovations, zsmalloc achieves great density across a wide range of zsize distributions: balanced, fat, and thin. Alas, zsmalloc's strengths also proves to be its Achilles heel: high density and page crossings make it very difficult for eviction and writeback to concurrently free page frames, resulting in a different kind of fragmentation (and consequent lower density) and rendering it unsuitable for management of cleancache-provided pages in zcache. So zbud was revised to utilize some of the techniques pioneered by zsmalloc. Zbud is still limited to two zpages per page frame, but when under heavy eviction or writeback, its density compares favorably with zsmalloc. The decision by zcache to use zbud for swap pages sadly led to a fork of zcache, first to "old zcache" (aka zcache1) and "new zcache" (aka zcache2) and then to a separate zproject entirely, zswap, because the author prefers the density of zsmalloc over the ability to support cleancache pages and the predictability and page frame-reclaim of zbud.

So, today, there are two zpage allocators, both still evolving. The choice for best zpage allocator depends on use models and unpredictable workloads. It may be possible that some hybrid of the two will emerge, or perhaps some yet unwritten allocator may arise to "rule them all".

Memory management subsystem interaction

In some ways, a zproject is like a cache for pages temporarily removed from the MM subsystem. However it is a rather unusual cache in that, to store its data, it steals capacity (page frames) from its backing store, the MM subsystem. From the point-of-view of the MM subsystem, the RAM is just gone, just as if the page frames were absorbed by a RAM-voracious device driver. This is a bit unfortunate, given that both the zproject and the MM subsystem are respectively, but separately, managing compressed and uncompressed anonymous pages (and possibly compressed and uncompressed page cache pages as well). So it is educational to consider how a zproject may "load balance" its memory needs with the MM subsystem, and with the rest of the kernel.

Currently, all three zprojects obtain page frames using the standard kernel alloc_page() call, with the "GFP flags" parameter chosen to ensure that the kernel's emergency reserve pages are never used. So each zproject must be (and is) resilient to the fairly frequent situation where an alloc_page() call returns failure.

Zram makes no other attempt to limit its total use of page frames but, as we've seen, it is configured by the administrator as a swap device, and swap parameters do limit the number of swap pages, and thus zpages, that can be accepted. This indirectly but unpredictably limits the number of page frames used. Also, the configuration is independent of and not cognizant of the total RAM installed in the system, so configuration errors are very possible.

Zswap uses a slightly modified version of zsmalloc to track the total number of page frames allocated. It ensures this number does not exceed a certain percent of total RAM in the system, and this limit is configurable via sysfs. As previously noted, using writeback, zswap can reduce the number of (anonymous) zpages stored and this, indirectly, can reduce the number of page frames, though at the likely cost of fragmentation.

Zcache and zbud were designed from the beginning with much more dynamic constraints on page frame utilization in mind, intended eventually to flexibly mesh with the dramatically time-varying memory balancing algorithms that the MM subsystem uses. While the exact interaction model and best future control policy are not yet determined, zcache's and zbud's underlying page frame-based writeback and eviction mechanisms support a shrinker-like interface that can be told to, for example, reduce the number of page frames used for storing anonymous zpages from 10000 to 8937 and the number of page frames used for storing page cache zpages from 3300 to 463, and zcache can do that. Since zbud is much more predictable (two zpages per page frame), the MM subsystem can estimate and manage both the total number of anonymous and pagecache pages (compressed and uncompressed) in the system and the total number of page frames used to manage them.

Status and conclusions

Zram, zcache, and zswap are all advancing the concept of in-kernel compression in different ways. Zram and zcache are in-tree but both still in the staging tree. While the staging tree has served to expose the concept of in-kernel compression to a wide group of kernel developers and even to the users of a few leading-edge distributions, as Andrew Morton says, the staging tree "is where code goes to be ignored." Staging tree maintainer Greg Kroah-Hartman has become reluctant to accept any new zproject functionality into the staging tree. This creates a conundrum for these zprojects: evolution in the staging tree has resulted in rapid improvements in the design and implementation and exposure of the zprojects, but because of this evolution they are not stable enough for promotion into the core kernel and further evolution is now stymied.

Zswap is attempting to circumvent the staging tree entirely by proposing what is essentially a simplified frontswap-only fork of zcache using zsmalloc instead of zbud, for direct merging into the MM subsystem. Since it is also much simpler than zcache, it has garnered more reviewers which is a valuable advantage for to-be-merged code. But it is entirely dependent on the still-in-staging zsmalloc, does not support page cache pages at all, and does not support page-frame-based writeback, so its simplicity comes at a cost. If zswap is merged, it remains to be seen if it will ever be extended adequately.

So, in-kernel compression has clear advantages and real users. It remains unclear though when (or if) it will be merged into the core kernel and how it should interact with the core kernel. If you are intrigued, the various zproject developers encourage your ideas and contributions.

Acknowledgments

Nitin Gupta was the original author of compcache and I was the original author of transcendent memory. While both projects were initially targeted to improve memory utilization in a multi-tenant virtualized environment, their application to compression in a single kernel quickly became apparent. Nitin designed and wrote the Linux code for zram, xvmalloc, and zsmalloc, and wrote an early prototype implementation of zcache. I wrote frontswap and cleancache (with the cleancache hook placement originally done by Chris Mason), and the Linux code for zcache, zbud, and ramster. Seth Jennings contributed a number of improvements to zcache and is the author of zswap. Kernel mm developer Minchan Kim was a helpful influence for all the zprojects, and none of the zprojects would have been possible without Greg Kroah-Hartman's support — and occasional whippings — as maintainer of the staging drivers tree.

As with any open-source project, many others have contributed ideas, bug fixes, and code improvements and we are thankful for everyone's efforts.

