The 2007 Linux Storage and File Systems Workshop


Fifty members of the Linux storage and file system communities met February 12 and 13 in San Jose, California to give status updates, present new ideas and discuss issues during the 2007 Linux Storage and File Systems Workshop. The workshop was chaired by Ric Wheeler and sponsored by EMC, NetApp, Panasas, Seagate and Oracle.

Day 1: Joint Session

Ric Wheeler opened with an explanation of the basic contract that storage systems make with the user: the complete set of data will be stored, bytes are correct and in order, and raw capacity is utilized as completely as possible. It is so simple that it seems that there should be no open issues, right?

Today, this contract is met most of the time, but Ric posed a number of questions. How do we validate that no files have been lost? How do we verify that the bytes are correctly stored? How can we utilize disks efficiently for small files? How do errors get communicated between the layers?

Through the course of the next two days some of these questions were discussed, others were raised and a few ideas proposed. Continue reading for the details.

Ext4 Status Update

Mingming Cao gave a status update on ext4, the recent fork of the ext3 file system. The primary goal of the fork was the move to 48-bit block numbers; this change allows the file system to support up to 1024 petabytes of storage. This feature was originally designed to be merged into ext3, but was seen as too disruptive. The patch is also built on top of the new extents system. Support for more than 32,000 subdirectories will also be merged into ext4.
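
The arithmetic behind the 1024-petabyte figure is straightforward, assuming the common 4KB block size (the block size is an assumption on my part, not a number from the talk):

```python
# 48-bit block numbers, 4 KiB (2^12 byte) blocks
blocks = 2 ** 48
block_size = 4096
capacity = blocks * block_size        # 2^60 bytes total
petabytes = capacity / 2 ** 50        # 1 PiB = 2^50 bytes
print(petabytes)                      # 1024.0
```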

On top of these changes a number of ext3 options will be enabled by default including: directory indexing which improves file access for large directories, "resize inodes" which reserve space in the block group descriptor for online growing, and 256-byte inodes. Ext3 users can use these features today with a command like:

mkfs.ext3 -I 256 -O resize_inode,dir_index /dev/device

A number of other features are also being considered for inclusion into ext4 and have been posted to the list as RFCs. These include a patch adding nanosecond timestamps and one implementing persistent file preallocation, which will be similar to posix_fallocate() but won't waste time writing zeros to the disk.
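
Today's glibc posix_fallocate() produces the same visible result by falling back to writing the zeros itself when the file system has no native support, which is exactly the waste the proposed ext4 feature avoids. A minimal user-space sketch:

```python
import os
import tempfile

# Reserve 1 MiB up front. With native preallocation (as proposed for
# ext4) this reserves blocks without writing zeros; without it, the
# glibc fallback writes zeros to achieve the same effect.
with tempfile.TemporaryFile() as f:
    os.posix_fallocate(f.fileno(), 0, 1024 * 1024)
    size = os.fstat(f.fileno()).st_size

print(size)   # 1048576: the file size covers the reserved range
```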

Ext4 currently stores a limited number of extended attributes in-inode and has space for one additional block of extended attribute data, but this may not be enough to satisfy xattr-hungry applications. For example, Samba needs additional space to support Vista's heavy use of ACLs, and eCryptFS can store arbitrarily large keys in extended attributes. This led everyone to the conclusion that data needs to be collected on how extended attributes are being used, to help developers decide how best to implement them. Until larger extended attributes are supported, application developers need to pay attention to the limits that exist on current file systems, e.g. one block on ext3 and 64K on XFS.
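
Until larger attributes arrive, an application has to check its values against these per-file-system ceilings. A trivial sketch using the limits quoted above (the 4KiB figure assumes ext3's common block size):

```python
# Illustrative xattr value limits from the discussion: roughly one
# block (assumed 4 KiB here) on ext3, 64 KiB on XFS.
XATTR_LIMITS = {"ext3": 4096, "xfs": 64 * 1024}

def xattr_fits(fs: str, value: bytes) -> bool:
    """Would this xattr value fit within the file system's limit?"""
    return len(value) <= XATTR_LIMITS[fs]

print(xattr_fits("ext3", b"x" * 8192))   # False: over one 4 KiB block
print(xattr_fits("xfs", b"x" * 8192))    # True: well under 64 KiB
```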

Online shrinking and growing was briefly discussed, and it was suggested that online defragmentation, which is a planned feature, will be the first step toward online shrinking. A bigger issue, however, is storage management; Ted Ts'o suggested that the Linux file system community can learn from ZFS about how to create easy-to-manage systems. Christoph Hellwig sees disk management as a user-space problem that can be solved with kernel hooks, and sees ZFS as a layering violation. Either way, it is clear that disk management should be improved.

The fsck Problem

Zach Brown and Valerie Henson were slated to speak on the topic of file system repair. While Val booted her laptop, she introduced us to the latest fashion: laptop rhinestones, a great discussion piece if you are waiting on a fsck. If Val's estimates for fsck time in 2013 come true, having a way to pass the time will become very important.

Val presented an estimate of 2013 fsck times. She first measured a fsck of her 37GB /home partition (with 21GB in use), which took 7.5 minutes and read 1.3GB of file system data. Next, she used projections of disk technology from Seagate to estimate the time to fsck a circa-2013 home partition, which will be 16 times larger. Although 2013 disks will have a five-fold bandwidth increase, seek times will only improve by about 1.2 times (to 10ms), leading to an increase in fsck time from about 8 minutes to 80 minutes! The primary reason for long fscks is seek latency: fsck spends most of its time seeking over the disk discovering and fetching dynamic file system data like directory entries, indirect blocks and extents.

Reducing seeks and avoiding the seek latency punishment is key to reducing fsck times. Val suggested one solution would be keeping a bitmap on disk that tracks the blocks that contain file system metadata; this would allow for reading all data in a single arm sweep. This optimization, in the best case, would make a single sequential sweep over the disk and, on the future disk, reading all file system metadata would only take around 134 seconds, a large improvement over 80 minutes. A full explanation of the findings and possible solutions can be found in the paper Repair-Driven File System Design [PDF]. Also, Val announced that she is working full time on a file system called chunkfs [PDF] that will make speed and ease of repair a primary design goal.
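
A rough back-of-the-envelope using the numbers above shows where both figures come from (the 155MB/s 2013 streaming rate is an assumption, chosen to be consistent with the quoted sweep time, not a number from the talk):

```python
# 2007 baseline from Val's measurement
fsck_2007_min = 7.5          # minutes to fsck the 37 GB /home
metadata_gb = 1.3            # file system metadata read by fsck

# Projected 2013 disk: 16x capacity, 5x bandwidth, only 1.2x seek speed
capacity_growth = 16
seek_improvement = 1.2

# Seek-bound fsck: 16x the metadata to chase, seeks barely faster.
fsck_2013_min = fsck_2007_min * capacity_growth / seek_improvement
print(round(fsck_2013_min))  # ~100 minutes, the same ballpark as 80

# Metadata-bitmap sweep: read 16x the metadata sequentially instead.
seq_mb_per_s = 155           # assumed 2013 streaming rate
sweep_s = metadata_gb * capacity_growth * 1024 / seq_mb_per_s
print(round(sweep_s))        # ~137 seconds, close to the quoted 134
```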

Zach Brown presented some blktrace output from e2fsck. The trace showed that, while the disk can stream data at 26MB/s, fsck achieves only 12MB/s. This situation could be improved to some degree without on-disk layout changes if the developers had a vectorized I/O call. Zach explained that in many cases you know the block locations that you need, but with the current API you can only read one at a time.

A vectorized read would take a number of buffers and a list of blocks to read as arguments; the application could then submit all of the reads at once. Such a system call could save a significant amount of time since the I/O scheduler can reorder requests to minimize seeks and merge requests that are nearby. Also, reads to blocks that are located on different disks could be parallelized. Although a vectorized read could speed up fsck, file system layout changes will eventually be needed to make it faster still.
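
From user space today the idea can only be emulated one read at a time; the point of the proposed call is that the kernel would get the whole batch. A sketch of the interface (the function name and shape are hypothetical):

```python
import os
import tempfile

def read_vector(fd, requests):
    """Emulated vectorized read: 'requests' is a list of (offset,
    length) pairs.  A real vectorized syscall would hand the whole
    batch to the kernel, letting the I/O scheduler sort, merge and
    parallelize it; from user space we can only loop over os.pread()."""
    return [os.pread(fd, length, offset) for offset, length in requests]

# Usage: fetch three scattered "blocks" from a file in one call.
with tempfile.TemporaryFile() as f:
    f.write(b"0123456789abcdef")
    f.flush()
    chunks = read_vector(f.fileno(), [(12, 4), (0, 4), (10, 2)])

print(chunks)   # [b'cdef', b'0123', b'ab']
```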

libata: bringing the ATA community together

Jeff Garzik gave an update on the progress of libata, the in-kernel library to support ATA hosts and devices. He first presented the ATAPI/SATA features that libata now supports including: PATA+C/H/S, NCQ, FUA, SCSI SAT, and CompactFlash. The growing support for parallel ATA (PATA) drives in libata will eventually deprecate the IDE driver; Fedora developers are helping to accelerate testing and adoption of the libata PATA code by disabling the IDE driver in Fedora 7 test 1.

Native Command Queuing (NCQ) is a new command protocol introduced in the SATA II extensions and is now supported under libata. With NCQ the host can have multiple outstanding requests on the drive at once; the drive can reorder and reschedule these requests to improve disk performance. A useful feature of NCQ drives is the force unit access (FUA) bit: a write command with this bit set will not return success until the data has reached the disk media. This has the potential of enabling the kernel to have both synchronous and non-synchronous commands in flight. There was a recent discussion about both NCQ FUA and SATA FUA in libata on the linux-ide mailing list.

Jeff briefly discussed libata's support for SCSI ATA translation (SAT) which lets an ATA device appear to be a SCSI device to the system. The motivation for this translation is the reuse of error handling and support for distribution installers which already know how to handle SCSI devices.

There are also a number of items slated as future work for libata. Many drivers need better suspend/resume support, and the driver API is due for a sane initialization model using an allocate/register/unregister/free sequence and "Greg blessed" kobjects. Currently libata sits under the SCSI layer, and debate continues on how to restructure it to minimize or eliminate its SCSI dependence. Error handling has been substantially improved by Tejun Heo and his changes are now in mainline. If you have had issues with SATA or libata error handling, try an updated kernel to see if those issues have been resolved. Tejun and others continue to add features and tune the libata stack.

Communication Breakdown: I/O and File Systems

During the morning a number of conversations sprang up about communication between the I/O and file system layers. A hot topic was getting information from the block layer about non-retryable errors that affect an entire range of bytes, and passing that data up to user space. There are situations where retries are happening on a large range of bytes even when the I/O layer knows that the entire range of blocks is missing or bad.

A "pipe" abstraction was discussed to communicate data on byte ranges that are currently in error, under performance strain (because of a RAID5 disk failure), or temporarily unplugged. If a file system were aware of ranges that are currently handling a recoverable error, have unrecoverable errors or are temporarily slow, it may be able to handle the situation more gracefully.

File systems currently do not receive unplug events and handling unplug situations can be tricky. For example, if a fibre channel disk is pulled for a moment and plugged back in it may be down for only 30 seconds but how should the file system handle the situation? Ext3 currently remounts the entire file system as read only. XFS has a configurable timeout for fibre channel disks that must be reached before it sends an EIO error. And what should be done with USB drives that are unplugged? Should the file system save state and hope the device gets plugged back in? How long should it wait and should it still work if it is plugged into a different hub? All of these questions were raised but there are no clear answers.

The Filesystems Track

The workshop split into different tracks; your author decided to follow the one dedicated to filesystems.

Security Attributes

Michael Halcrow, eCryptFS developer, presented an idea to use SELinux to make file encryption/decryption dependent on application execution. For example, a policy could be defined so that the data would be unencrypted when OpenOffice is using the file but encrypted when the user copies the file to a USB key. After presenting the mechanism and mark-up language for this idea Michael opened the floor to the audience. The general feeling was that SELinux is often disabled by users and that per-mount-point encryption may be a more useful and easy to understand user interface.

Why Linux Sucks for Stacking

Josef Sipek, Unionfs maintainer, went over some of the issues involved with stacking file systems under Linux. A stacking file system, like Unionfs, provides an alternative view of a lower file system. For example, Unionfs takes a number of mounted directories, which could be NFS/ext3/etc, as arguments at mount time and merges their name space.

The big unsolved issue with stacking file systems is handling modifications to the lower file systems in the stack. Several people suggested that leaving the lower file system available to the user is just broken and that by default the lower layers should only be mounted internally.

The new fs/stack.c file was discussed too. This file currently contains simple inode-copying routines used by Unionfs and eCryptfs; in the future, more stackable file system routines should be pushed into this file.

Future work for Unionfs includes getting it working under lockdep and additional experimentation with an on-disk format. The on-disk format for Unionfs is currently under development; it will store white-out files (representing files which have been deleted by a user but which still exist on the lower-level filesystems) and persistent Unionfs inode data.

B-trees for a Shadowed FS

Many file systems use B-trees to represent files and directories. These structures keep data sorted, are balanced, and allow for insertion and deletion in logarithmic time. However, there are difficulties in using them with shadowing. Ohad Rodeh presented his approach to using B-trees and shadowing in an object storage device, but the methods are general and useful for any application.

Shadowing may also be called copy-on-write (COW); the basic idea is that when a write is made the block is read into memory, modified, and written to a new location on disk. Then the tree is recursively updated starting at the child and using COW until the root node is atomically updated. In this way the data is never in an inconsistent state; if the system crashes before the root node is updated then the write is lost but the previous contents remain intact.
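
The recursive update can be sketched with a toy in-memory tree in which nodes are never modified in place; a write copies the leaf and every node on the path up to a new root (the structure is purely illustrative, not Rodeh's on-disk format):

```python
class Node:
    def __init__(self, key=None, value=None, children=()):
        self.key, self.value, self.children = key, value, tuple(children)

def cow_update(node, path, value):
    """Shadowed update: copy every node along 'path' (a tuple of child
    indexes) instead of modifying in place.  The old root, and hence a
    consistent pre-crash view, survives until the new root is swapped
    in atomically."""
    if not path:
        return Node(node.key, value, node.children)    # shadowed leaf
    i, rest = path[0], path[1:]
    kids = list(node.children)
    kids[i] = cow_update(kids[i], rest, value)         # shadow the child
    return Node(node.key, node.value, kids)            # shadow this node

# Two leaves under one root; update the second leaf via shadowing.
old_root = Node(children=[Node("a", 1), Node("b", 2)])
new_root = cow_update(old_root, (1,), 99)
print(old_root.children[1].value, new_root.children[1].value)  # 2 99
```

Note that the untouched subtree is shared between the two roots, which is also what makes cheap snapshots and clones fall out of this design.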

Replicating the details of his presentation would be a wasted effort as his paper, B-trees, Shadowing and Clones [PDF], is well written and easy to read. Enjoy!

eXplode the code

Storage systems have a simple and important contract to keep: given user data they must save that data to disk without loss or corruption even in the face of system crashes. Can Sar gave an overview of eXplode [PDF], a systematic approach to finding bugs in storage systems.

eXplode systematically explores all possible choices that can be made at each choice point in the code to make low-probability events, or corner cases, just as probable as the main running path. And it does this exploration on a real running system with minimal modifications.
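
The core idea can be modeled by a driver that re-runs the code once per combination of choice-point outcomes (a toy model of the approach, not the eXplode implementation):

```python
import itertools

def explore(test, choice_counts):
    """Run 'test' once for every combination of choice outcomes.
    'test' receives a chooser function; each call returns the outcome
    forced for that choice point, so a rare corner-case path is
    exercised exactly as often as the common path."""
    results = []
    for combo in itertools.product(*(range(n) for n in choice_counts)):
        it = iter(combo)
        results.append(test(lambda: next(it)))
    return results

# A toy "storage system" with two choice points: did the write reach
# the platter, and did the machine crash before the sync completed?
def scenario(choose):
    wrote = choose() == 0        # choice point 1: write reached disk?
    crashed = choose() == 1      # choice point 2: crash before sync?
    return "lost-data" if crashed and not wrote else "ok"

results = explore(scenario, [2, 2])
print(results)   # ['ok', 'ok', 'ok', 'lost-data']
```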

This system has the advantage of being conceptually simple and very effective. Bugs were found in every major Linux file system, including a fsync bug that can cause data corruption on ext2. This bug can be produced by doing the following: create a new file, B, which recycles an indirect block from a recently truncated file, A; then call fsync on file B and crash the system before file A's truncate gets to disk. The disk now contains inconsistent data, and when e2fsck tries to fix the inconsistency it corrupts file B's data. A discussion of the bug has been started on the linux-fsdevel mailing list.

NFS

The second day of the file systems track started with a discussion of an NFS race. The race appears when a client opens a file between two writes that occur during the same second. The client that just opened the file will be unaware of the second write and will keep an out-of-date version of the file in cache. To fix the problem, a "change" attribute was suggested. This number would be consistent across reboots, unitless, and would increment on every write.
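
The race and the proposed fix can be seen in a toy model: two writes in the same second leave the timestamp unchanged, so timestamp-based cache validation misses the second write, while a per-write change counter does not (a sketch of the idea, not the NFS wire protocol):

```python
class File:
    def __init__(self):
        self.mtime = 0      # one-second granularity, as on many file systems
        self.change = 0     # proposed attribute: bumped on every write
        self.data = b""

    def write(self, data, now):
        self.data, self.mtime = data, now
        self.change += 1

f = File()
f.write(b"v1", now=100)
mtime_seen, change_seen = f.mtime, f.change   # client opens, caches here
f.write(b"v2", now=100)                       # second write, same second

print(f.mtime == mtime_seen)    # True:  the timestamp cannot see the write
print(f.change == change_seen)  # False: the change counter can
```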

In general, everyone agreed that a change attribute is the right solution; however, Val Henson pointed out that implementing it on legacy file systems will be expensive and will require on-disk format changes.

Discussion then turned to NFSv4 access control lists (ACLs). Trond Myklebust said they are becoming a standard and Linux should support them. Andreas Gruenbacher is working on patches to add NFSv4 support to Linux but currently only ext3 is supported; more information can be found on the Native NFSv4 ACLs on Linux page. A possibly difficult issue will be mapping current POSIX ACLs to NFSv4 ACLs, but a draft document, Mapping Between NFSv4 and Posix Draft ACLs, lays out a mapping scheme.

GFS Updates

Steven Whitehouse gave an overview of the recent changes in the Global File System 2 (GFS2), a cluster file system where a number of peers share access to the storage device. The important changes include a new journal layout that can support mmap(), splice() and other system calls on journaled files, page cache level locking, readpages() and partial writepages() support, and the standard ext3 lsattr and chattr ioctls.

readdir() was discussed at some length, particularly the ways in which it is broken. A directory insert on GFS2 may cause a reorder of the extensible hash structure GFS2 uses for directories. In order to support readdir() every hash chain must be sorted. The audience generally agreed that readdir() is difficult to implement and Ted Ts'o suggested that someone should try to go through committee to get telldir/seekdir/readdir fixed or eliminated.

OCFS2

A brief OCFS2 status report was given by Mark Fasheh. Like GFS2, OCFS2 is a cluster file system, designed to share a file system across nodes in a cluster. The current development focus is on adding features, as the basic file system features are working well.

After the status update the audience asked a few questions. The most requested OCFS2 feature is forced unmount, and several people suggested that this should be a future virtual file system (VFS) feature. Mark also said that users really enjoy the easy setup of OCFS2 and the ability to use it as a local file system. A performance hot button for OCFS2 is its large inodes, which occupy an entire block.

In the future Mark would like to mix extent and extended attribute data in-inode to utilize all of the available space. However, as the audience pointed out, this optimization can lead to some complex code. Mark would also like to move to GFS2's distributed lock manager.

DualFS: A New Journaling File System for Linux

DualFS is a file system by Juan Piernas that separates data and metadata into two separate file systems. The on-disk format of the data file system is similar to ext2 without metadata blocks. The metadata file system is a log-structured file system, a design that allows for very fast writes since they are always made at the head of the log, reducing expensive seeks. A few performance numbers were presented; under a number of micro- and macro-benchmarks DualFS performs better than other Linux journaling file systems. In its current form, DualFS uses separate partitions for data and metadata, forcing the user to answer a difficult question: how much metadata do I expect to have?

More information, including performance comparisons, can be found on the DualFS LKML announcement and the project homepage. The currently available code is a patch on top of 2.4.19 and can be found on SourceForge.

pNFS Object Storage Driver

Benny Halevy gave an overview of pNFS (parallel NFS), which is part of the IETF NFSv4.1 draft and tries to solve the single-server performance bottleneck of NFS storage systems. pNFS is a mechanism that lets an NFS client talk directly to the storage devices without sending requests through the NFS server, fanning I/O out across the SAN devices. There are many proprietary systems that do a similar thing, including EMC's High Road, IBM's TotalStorage SAN, SGI's CXFS and Sun's QFS. Having an open protocol would be a good thing.

However, Jeff Garzik was skeptical of including pNFS in the NFSv4.1 draft particularly because to support pNFS the kernel will need to provide implementations of all three access protocols: file storage, object storage and block storage. This will add significant complexity to the Linux NFSv4 implementation.

Benny explained that the pNFS implementation in Linux is modular, supporting multiple optional, layout-type-specific drivers. Each layout driver dynamically registers itself for its layout type, and the NFS client calls it across a well-defined API. In the absence of a driver for some specific layout type, the NFS client falls back to doing I/O through the server.

After this overview Benny turned to the topic of OSDs, or object-based storage devices. These devices provide a more abstract view of the disk than the classic "array of blocks" abstraction seen in today's disks. Instead of blocks, objects are the basic unit of an OSD, and each object contains both metadata and data. The disk manages the allocation of the bytes on disk and presents the object data as a contiguous array to the system. Having this abstraction in hardware would make file system implementation much simpler. To support OSDs in Linux, Benny and others are working to get bi-directional SCSI command support into the kernel, along with support for variable-length command descriptor blocks (CDBs).
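
The object abstraction can be sketched as a device that hands out flat byte arrays keyed by object ID, hiding block allocation entirely (a toy model; real OSDs speak a SCSI command set):

```python
class ObjectStorageDevice:
    """Toy OSD: objects, not blocks, are the unit of storage.  The
    'device' owns allocation; callers see each object as a flat,
    growable byte array plus a metadata dictionary."""
    def __init__(self):
        self.objects = {}

    def write(self, oid, offset, data):
        obj = self.objects.setdefault(oid, {"data": bytearray(), "meta": {}})
        buf = obj["data"]
        if len(buf) < offset + len(data):
            buf.extend(b"\0" * (offset + len(data) - len(buf)))  # "allocation"
        buf[offset:offset + len(data)] = data

    def read(self, oid, offset, length):
        return bytes(self.objects[oid]["data"][offset:offset + length])

osd = ObjectStorageDevice()
osd.write(42, 0, b"hello")
osd.write(42, 5, b" osd")
print(osd.read(42, 0, 9))   # b'hello osd'
```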

Hybrid Disks

Hybrid disks with an NVCache (flash memory) will be in consumers' hands soon. Timothy Bisson gave an overview of this new technology. The NVCache will have 128-256MB of non-volatile flash memory that the disk can manage as a cache (unpinned) or the operating system can manage by pinning specified blocks to the non-volatile memory. This technology can reduce power consumption or increase disk performance.

To reduce power consumption the block layer can enable the NVCache Power Mode, which tells the disk to redirect writes to the NVCache, reducing disk spin-up operations. In this mode the 10 minute writeback threshold of Linux laptop mode can be removed. Another strategy is to pin all file system metadata in the NVCache, but spin-ups will still occur on non-metadata reads. An open question is how this pinning should be managed when two or more file systems are using the same disk.
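
Pinning can be pictured as a set of block numbers the operating system reserves in flash: writes to pinned blocks land in the NVCache and the platters stay spun down (a sketch of the bookkeeping only, not the ATA interface):

```python
class NVCache:
    """Toy model of OS-managed pinning: writes to pinned blocks are
    absorbed by flash; anything else needs the platters, and hence a
    spin-up if the disk is idle."""
    def __init__(self):
        self.pinned = set()
        self.flash = {}
        self.spinups = 0

    def pin(self, *blocks):
        self.pinned.update(blocks)

    def write(self, block, data):
        if block in self.pinned:
            self.flash[block] = data      # absorbed by the NVCache
        else:
            self.spinups += 1             # the disk must spin up

cache = NVCache()
cache.pin(7, 8)            # e.g. file system metadata blocks
cache.write(7, b"inode")   # hits flash, no spin-up
cache.write(100, b"data")  # unpinned block forces a spin-up
print(cache.spinups)       # 1
```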

Performance can be increased by using the NVCache as a cache for writes requiring a long seek. In this mode the block layer would pin the target blocks ensuring a write to the cache instead of incurring the expensive seek. Also, a file system can use the NVCache to store its journal and boot files for additional performance and reduced system start-up time.

If Linux developers decide to manage the NVCache there are many open questions. Which layer should manage the NVCache, the file system or the block layer? And what type of API should be created to leverage the cache? Another big question is how much punishment these caches can take: according to Timothy it takes about a year (under a desktop workload) to fry the cache if it is used as a write cache.

Scaling Linux to Petabytes

Sage Weil presented Ceph, a network file system that is designed to scale to petabytes of storage. Ceph is based on a network of object-based storage devices; complete copies of each object are distributed across multiple nodes using an algorithm called CRUSH. This distribution makes it possible for nodes to be added and removed from the system dynamically. More information on the design and implementation can be found on the Ceph homepage.
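
The flavor of such a placement function can be shown with a deterministic hash that maps an object to several distinct nodes, so every client computes locations without consulting a central server (CRUSH itself is considerably more sophisticated; this rendezvous-hashing sketch only conveys the idea, and all names are illustrative):

```python
import hashlib

def place(object_id: str, nodes: list, replicas: int):
    """Deterministically pick 'replicas' distinct nodes for an object
    by ranking nodes on a hash of (object, node).  Any client running
    the same function computes the same placement, with no central
    lookup table to query or keep consistent."""
    def score(node):
        h = hashlib.sha256(f"{object_id}/{node}".encode()).hexdigest()
        return int(h, 16)
    return sorted(nodes, key=score)[:replicas]

nodes = ["osd0", "osd1", "osd2", "osd3"]
primary_set = place("object-123", nodes, replicas=2)
print(len(primary_set), len(set(primary_set)))   # 2 2: two distinct nodes
```

Because placement is a pure function of the object name and the node list, adding or removing a node only changes the placement of objects whose top-ranked nodes change, which is what makes dynamic membership workable.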

Conclusion

The workshop concluded with the general consensus that bringing together SATA, SCSI and file system people was a good idea and that the status updates and conversations were useful. However, the workshop was a bit too large for code discussion, and more targeted workshops will need to be held to work out the details of some of the issues discussed at LSF'07. Topics for future workshops include virtual memory and file system issues, and extensions that are needed to the VFS.