Run ZFS on Linux

Linux has an interesting relationship with file systems. Because Linux is open, it tends to be a key development platform both for next-generation file systems and for new, innovative file system ideas. Two interesting recent examples are the massively scalable Ceph and the continuous-snapshotting file system nilfs2 (along with, of course, evolutions in workhorse file systems such as the fourth extended file system [ext4]). Linux is also an archaeological site for file systems of the past: DOS VFAT, Macintosh (HFS), VMS ODS-2, and Plan 9's remote file system protocol. But with all of the file systems you'll find supported within Linux, there's one that generates considerable interest because of the features it implements: Oracle's Zettabyte File System (ZFS).

ZFS was designed and developed by Sun Microsystems (under Jeff Bonwick) and was first announced in 2004, with integration into Sun Solaris occurring in 2005. Although pairing the most popular open operating system with the most talked-about, feature-rich file system would be an ideal match, licensing issues have restricted the integration. Linux is licensed under the GNU General Public License (GPL), while ZFS is covered by Sun's Common Development and Distribution License (CDDL). These licenses have different goals and introduce restrictions that conflict. Fortunately, that doesn't mean that you as a Linux user can't enjoy ZFS and the capabilities it provides.

This article explores two methods for using ZFS in Linux. The first uses the Filesystem in Userspace (FUSE) system to push the ZFS file system into user space to avoid the licensing issues. The second method is a native port of ZFS for integration into the Linux kernel while avoiding the intellectual property issues.

Where can you find ZFS? Today, you can find ZFS natively within OpenSolaris (also covered under the CDDL) but also in other operating systems that have complementary licenses. For example, you can find ZFS in FreeBSD (since 2007). ZFS was once part of Darwin (a derivative of Berkeley Software Distribution [BSD], NeXTSTEP, and CMU's Mach 3 microkernel) but has since been removed.

Introducing ZFS

Calling ZFS a file system is a bit of a misnomer, as it is much more than a file system in the traditional sense. ZFS combines the concepts of a logical volume manager with a feature-rich and massively scalable file system. Let's begin by exploring some of the principles on which ZFS is based. First, ZFS uses a pooled storage model instead of the traditional volume-based model. This means that ZFS views storage as a shared pool that can be dynamically allocated (and shrunk) as needed. This is advantageous over the traditional model, where file systems reside on volumes and an independent volume manager is used to administer these assets. ZFS also builds in an important set of features, such as snapshots, copy-on-write clones, continuous integrity checking, and data protection through RAID-Z. Going further, it's possible to use your own favorite file system (such as ext4) on top of a ZFS volume. This means that you get ZFS features such as snapshots for an independent file system (that likely doesn't support them directly).

But ZFS isn't just a collection of features that make up a useful file system. Rather, it's a collection of integrated and complementary features that make it an outstanding file system. Let's look at some of these features, and then see some of them in action.

Storage pools

As discussed earlier, ZFS incorporates a volume-management function to abstract underlying physical storage devices to the file system. Rather than viewing physical block devices directly, ZFS operates on storage pools (called zpools), which are constructed from virtual drives that can physically be represented by drives or portions of drives. Further, these pools can be constructed dynamically, even while the pool is actively in use.

Copy-on-write

ZFS uses a copy-on-write model for managing data on the storage. This means that data is never written in place (never overwritten); instead, new blocks are written and the metadata is updated to reference them. Copy-on-write is advantageous for a number of reasons (not only for capabilities like the snapshots and clones that it enables). By never overwriting data, it's simpler to ensure that the storage is never left in an inconsistent state (as the older data remains after the new write operation is complete). This allows ZFS to be transaction based, and it's much simpler to implement features like atomic operations.
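The idea can be sketched with ordinary files. The following is a toy illustration only, not how ZFS lays out data on disk: each write goes to a fresh block file, and a pointer file (standing in for the metadata) is switched with an atomic rename. The names store, cow_write, and cow_read are invented for this example.

```shell
# Toy copy-on-write store (illustrative only; ZFS itself works on disk blocks).
store=$(mktemp -d)      # hypothetical scratch directory
next=0                  # number of the next block file to create

cow_write() {
    # Write the new contents to a brand-new block; existing blocks are never touched.
    new="block.$next"
    next=$((next + 1))
    printf '%s\n' "$1" > "$store/$new"
    # Update the "metadata" last, using a rename, so a reader sees either
    # the old block or the new one, never a half-written state.
    printf '%s\n' "$new" > "$store/current.tmp"
    mv "$store/current.tmp" "$store/current"
}

cow_read() {
    # Follow the pointer to the current block and print its contents.
    cat "$store/$(cat "$store/current")"
}

cow_write "version 1"
cow_write "version 2"
cow_read                  # prints: version 2
cat "$store/block.0"      # the old block is still intact: version 1
```

Because block.0 survives the second write, preserving the older state is only a matter of keeping a reference to it, which is why copy-on-write makes snapshots inexpensive.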

An interesting side effect of the copy-on-write design is that all writes to the file system become sequential writes (because remapping is always occurring). This behavior avoids hot spots in the storage and exploits the performance of sequential writes (faster than random writes).

Data protection

Storage pools made up of virtual devices can be protected using one of ZFS's numerous protection schemes. You can mirror a pool across two or more devices (RAID 1) or protect it with parity (similar to RAID 5) but across dynamic stripe widths (more on this later). ZFS supports a variety of parity schemes based on the number of devices in the pool. For example, you can protect three devices with single-parity RAID-Z (RAID-Z 1); with four devices, you can use RAID-Z 2 (double parity, similar to RAID 6). For even greater protection, you can use RAID-Z 3 with larger numbers of disks for triple parity.

For speed (but no data protection other than error detection), you can employ striping across devices (RAID 0). You can also create striped mirrors (to mirror striped drives), similar to RAID 10.

An interesting attribute of ZFS comes with the combination of RAID-Z, copy-on-write transactions, and dynamic stripe widths. In a traditional RAID 5 architecture, all disks must have their data within the stripe, or the stripe is inconsistent. Because there's no way to update all disks atomically, it's possible to produce the well-known RAID 5 write hole problem (where a stripe is inconsistent across the drives of the RAID set). Given ZFS transactions and never having to write in place, the write hole problem is eliminated. Another convenient quality of this approach is what happens when a disk fails and a rebuild is required. A traditional RAID 5 system uses data from other disks in the set to rebuild data for the new drive. RAID-Z traverses the available metadata to read only the data that's relevant for the geometry and avoids reading the unused space on the disk. This behavior becomes even more important as disks become larger and rebuild times increase.
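The parity arithmetic that makes such a rebuild possible fits in a few lines. This is a minimal sketch of single-parity XOR reconstruction as used by RAID 5 style schemes, with one-byte values standing in for blocks. The values are arbitrary, and the sketch does not model RAID-Z's dynamic stripe widths or on-disk layout.

```shell
# Three data "blocks" (one byte each; arbitrary example values).
d0=165; d1=60; d2=15

# Parity is the XOR of all data blocks in the stripe.
parity=$(( d0 ^ d1 ^ d2 ))

# Suppose the device holding d1 fails. XOR of the surviving blocks and
# the parity yields the missing block.
rebuilt=$(( d0 ^ d2 ^ parity ))

echo "parity=$parity rebuilt=$rebuilt original=$d1"
```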

Checksums

Although data protection provides the ability to regenerate data on a failure, it says nothing about the validity of the data in the first place. ZFS solves this issue by generating a 32-bit checksum (or 256-bit hash) for each block written and storing it in the metadata. When a block is read, its checksum is verified to avoid the problem of silent data corruption. In a volume that has data protection (mirroring or RAID-Z), the alternate data can be read or regenerated automatically.
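The verify-on-read and self-heal behavior can be approximated with two files acting as a mirror. This is a sketch under stated assumptions rather than ZFS's implementation: the POSIX cksum utility (a 32-bit CRC) stands in for the ZFS checksum, and the names work, sum_of, and read_block are invented here.

```shell
work=$(mktemp -d)                            # hypothetical scratch directory
printf 'important data\n' > "$work/copy_a"
cp "$work/copy_a" "$work/copy_b"             # second side of the "mirror"

# Print only the CRC field of cksum's output.
sum_of() { cksum < "$1" | cut -d' ' -f1; }

# The checksum is kept apart from the data it protects.
stored=$(sum_of "$work/copy_a")

# Simulate silent corruption of one copy.
printf 'imp0rtant data\n' > "$work/copy_a"

read_block() {
    for copy in "$work/copy_a" "$work/copy_b"; do
        if [ "$(sum_of "$copy")" = "$stored" ]; then
            # Found a copy that verifies; rewrite any copy that differs (self-heal).
            for other in "$work/copy_a" "$work/copy_b"; do
                cmp -s "$copy" "$other" || cp "$copy" "$other"
            done
            cat "$copy"
            return 0
        fi
    done
    echo "no valid copy found" >&2
    return 1
}

read_block                                   # prints: important data
cmp -s "$work/copy_a" "$work/copy_b" && echo "mirror copies match again"
```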

Standard approaches for integrity: T10 provides a similar mechanism for end-to-end integrity called Data Integrity Field (DIF). This mechanism proposes a field containing a cyclic redundancy check of a block and other metadata stored on disk to avoid silent data corruption. An interesting attribute of DIF is that you'll find hardware support for it in a number of storage controllers, so that the process is completely offloaded from the host processor.

Checksums are stored with metadata in ZFS, so phantom writes can be detected and—if data protection is provided (RAID-Z)—corrected.

Snapshots and clones

Given the copy-on-write nature of ZFS, features like snapshots and clones become simple to provide. Because ZFS never overwrites data but instead writes to a new location, older data can be preserved (although in the nominal case it is marked for removal to conserve disk space). A snapshot preserves older blocks to maintain the state of a file system at a given instant in time. This approach is also space efficient, because no copy is required (unless all data in the file system is rewritten). A clone is a writable file system created from a snapshot. In this case, the original unwritten blocks are shared by each clone, and blocks that are written are available only to the specific file system clone.
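As a rough guide to what this looks like at the command line, the sketch below walks through the zfs snapshot, clone, and rollback subcommands. It assumes a dataset named myzpool/myzdev like the one created in the demonstration later in this article; the snapshot name before and the clone name myzpool/myclone are hypothetical. The run wrapper and the ZFS_DEMO_LIVE variable are inventions of this example: by default the commands are only printed, because they need root and an existing pool.

```shell
# Dry run by default: print each command; set ZFS_DEMO_LIVE=1 to execute it.
run() {
    echo "+ $*"
    if [ -n "${ZFS_DEMO_LIVE:-}" ]; then "$@"; fi
}

fs=myzpool/myzdev                                    # assumed dataset name

run sudo zfs snapshot "${fs}@before"                 # read-only, point-in-time view
run sudo zfs list -t snapshot                        # list existing snapshots
run sudo zfs clone "${fs}@before" myzpool/myclone    # writable clone sharing unwritten blocks
run sudo zfs rollback "${fs}@before"                 # discard changes made since the snapshot

# Clean up: a clone must be destroyed before the snapshot it depends on.
run sudo zfs destroy myzpool/myclone
run sudo zfs destroy "${fs}@before"
```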

Variable block sizes

Traditional file systems are made up of statically sized blocks that match the back-end storage (512 bytes). ZFS implements variable block sizes for a variety of uses (commonly up to 128KB in size, but you can change this value). One important use of variable block sizes is compression (because the resulting block size when compressed will ideally be less than the original). This functionality minimizes waste in the storage system in addition to providing better utilization of the storage network (because less data emitted to storage requires less time in transfer).

Outside of compression, supporting variable block sizes also means that you can tune the block size for the particular workload expected for improved performance.
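To make the compression point concrete, the sketch below compresses roughly 124KB of repetitive text with the gzip utility and compares the sizes. It only illustrates why a compressed block needs less space than the original; ZFS compresses each block internally, and the sample data and variable names are made up for this example.

```shell
work=$(mktemp -d)                     # hypothetical scratch directory

# Generate repetitive sample text, close to one 128KB block in size.
awk 'BEGIN { for (i = 0; i < 4096; i++) printf "device %04d: character special\n", i }' \
    > "$work/sample.txt"

orig=$(( $(wc -c < "$work/sample.txt") ))              # bytes before compression
comp=$(( $(gzip -c "$work/sample.txt" | wc -c) ))      # bytes after compression

echo "original=$orig bytes, compressed=$comp bytes"
echo "compression ratio (x100): $(( orig * 100 / comp ))"
```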

Other features

ZFS incorporates many other features, such as de-duplication (to minimize copies of data), configurable replication, encryption, an adaptive replacement cache for cache management, and online disk scrubbing (to identify latent errors and, when data protection is in use, fix them while they can still be fixed). It does this with immense scalability, supporting 16 exabytes of addressable storage (2^64 bytes).

Using ZFS on Linux today

Now that you've seen some of the abstract concepts behind ZFS, let's look at some of them in practice. This demonstration uses ZFS-FUSE. FUSE is a mechanism that allows you to implement file systems in user space without kernel code (other than the FUSE kernel module and existing file system code). The module provides a bridge from the kernel file system interface to user space for user and file system implementations. First, install the ZFS-FUSE package (the following demonstration targets Ubuntu).

Installing ZFS-FUSE

Installing ZFS-FUSE is simple, particularly on Ubuntu using apt. The following command line installs everything you need to begin using ZFS-FUSE:

$ sudo apt-get install zfs-fuse

This command line installs ZFS-FUSE and all other dependent packages (mine also required libaio1), performs the necessary setup for the new packages, and starts the zfs-fuse daemon.

Using ZFS-FUSE

In this demonstration, you use the loop-back device to emulate disks as files within the host operating system. To begin, create these files (using /dev/zero as the source) with the dd utility (see Listing 1). With your four disk images created, use losetup to associate the disk images with the loop devices.

Listing 1. Setup for working with ZFS-FUSE

$ mkdir zfstest
$ cd zfstest
$ dd if=/dev/zero of=disk1.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 1.235 s, 54.3 MB/s
$ dd if=/dev/zero of=disk2.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.531909 s, 126 MB/s
$ dd if=/dev/zero of=disk3.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.680588 s, 98.6 MB/s
$ dd if=/dev/zero of=disk4.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.429055 s, 156 MB/s
$ ls
disk1.img  disk2.img  disk3.img  disk4.img
$ sudo losetup /dev/loop0 ./disk1.img
$ sudo losetup /dev/loop1 ./disk2.img
$ sudo losetup /dev/loop2 ./disk3.img
$ sudo losetup /dev/loop3 ./disk4.img
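On a system where /dev/loop0 through /dev/loop3 are already in use, the hard-coded device names above may not be available. As a hedged alternative, the sketch below creates the four images in a loop as sparse files and prints losetup commands that use --find --show, which asks losetup to pick the first free loop device and report its name. The scratch directory and variable names are assumptions of this example, and the losetup commands are printed rather than executed because they require root.

```shell
workdir=$(mktemp -d)                 # hypothetical scratch directory
size=$((64 * 1024 * 1024))           # 64MB per image, as in Listing 1

for i in 1 2 3 4; do
    img="$workdir/disk$i.img"
    # count=0 with seek= sets the file length without writing data (a sparse file).
    dd if=/dev/zero of="$img" bs=1 count=0 seek="$size" 2>/dev/null
    # Printed, not executed: attaching a loop device requires root.
    echo "sudo losetup --find --show $img"
done

ls -l "$workdir"
```

If you attach devices this way, substitute the device names that losetup reports for /dev/loop0 through /dev/loop3 in the commands that follow.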

With four devices available to use as your block devices for ZFS (totaling 256MB in size), create your pool using the zpool command. You use the zpool command to manage ZFS storage pools, but as you'll see, you can use it for a variety of other purposes, as well. The following command requests a ZFS storage pool to be created with four devices and provides data protection with RAID-Z. You follow this command with a list request to provide data on your pool (see Listing 2).

Listing 2. Creating a ZFS pool

$ sudo zpool create myzpool raidz /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
$ sudo zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
myzpool  96.5K   146M  31.4K  /myzpool
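The AVAIL figure is well below the 256MB of raw space, and part of the reason is parity. As a back-of-the-envelope estimate, the sketch below subtracts the parity devices from the device count. The function name raidz_usable_mb is invented here, and the formula is only a rough upper bound: the remaining gap between the 192MB it yields and the reported 146M is presumably labels, metadata, and reserved space, which weigh heavily on devices this small.

```shell
# Rough upper bound on usable space: (devices - parity devices) * device size.
raidz_usable_mb() {
    devices=$1; parity=$2; size_mb=$3
    echo $(( (devices - parity) * size_mb ))
}

raidz_usable_mb 4 1 64    # RAID-Z 1 over four 64MB devices: prints 192
raidz_usable_mb 4 2 64    # RAID-Z 2 over the same devices:  prints 128
```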

You can also investigate some of the attributes of your pool, as shown in Listing 3; the values shown are the defaults. Among other things, you can see the available capacity and the portion used. (The output has been shortened for brevity.)

Listing 3. Reviewing the attributes of the storage pool

$ sudo zfs get all myzpool
NAME     PROPERTY              VALUE                  SOURCE
myzpool  type                  filesystem             -
myzpool  creation              Sat Nov 13 22:43 2010  -
myzpool  used                  96.5K                  -
myzpool  available             146M                   -
myzpool  referenced            31.4K                  -
myzpool  compressratio         1.00x                  -
myzpool  mounted               yes                    -
myzpool  quota                 none                   default
myzpool  reservation           none                   default
myzpool  recordsize            128K                   default
myzpool  mountpoint            /myzpool               default
myzpool  sharenfs              off                    default
myzpool  checksum              on                     default
myzpool  compression           off                    default
myzpool  atime                 on                     default
myzpool  copies                1                      default
myzpool  version               4                      -
...
myzpool  primarycache          all                    default
myzpool  secondarycache        all                    default
myzpool  usedbysnapshots       0                      -
myzpool  usedbydataset         31.4K                  -
myzpool  usedbychildren        65.1K                  -
myzpool  usedbyrefreservation  0                      -

Now, let's actually use the ZFS pool. First, create a file system within your pool (using the zfs create command), and then enable compression within it (using the zfs set command). Next, copy a file into it. I've selected a file that's around 120KB in size to see the effect of ZFS compression. Note that your pool is mounted at the root, so treat it just like a directory within your root file system. Once the file is copied, you can list it to see that the file is present (and reports the same size as the original). Using the du command, you can see that the space the file occupies is about half the original size, indicating that ZFS has compressed it. You can also look at the compressratio property to see how much your pool has been compressed (using the default compressor, lzjb). Listing 4 shows the compression.

Listing 4. Demonstrating compression with ZFS

$ sudo zfs create myzpool/myzdev
$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
myzpool          139K   146M  31.4K  /myzpool
myzpool/myzdev  31.4K   146M  31.4K  /myzpool/myzdev
$ sudo zfs set compression=on myzpool/myzdev
$ ls /myzpool/myzdev/
$ sudo cp ../linux-2.6.34/Documentation/devices.txt /myzpool/myzdev/
$ ls -la ../linux-2.6.34/Documentation/devices.txt
-rw-r--r-- 1 mtj mtj 118144 2010-05-16 14:17 ../linux-2.6.34/Documentation/devices.txt
$ ls -la /myzpool/myzdev/
total 5
drwxr-xr-x 2 root root      3 2010-11-20 22:59 .
drwxr-xr-x 3 root root      3 2010-11-20 22:55 ..
-rw-r--r-- 1 root root 118144 2010-11-20 22:59 devices.txt
$ du -ah /myzpool/myzdev/
60K     /myzpool/myzdev/devices.txt
62K     /myzpool/myzdev/
$ sudo zfs get compressratio myzpool
NAME     PROPERTY       VALUE  SOURCE
myzpool  compressratio  1.55x  -

Finally, let's look at the self-repair capabilities of ZFS. Recall that when you created your pool, you requested RAID-Z over the four devices. You can check the status of your pool using the zpool status command, as shown in Listing 5. As shown, you can see the elements of your pool (RAID-Z 1 with four devices).

Listing 5. Checking your pool status

$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0
errors: No known data errors

Now, let's force an error into the pool. For this demonstration, go behind the scenes and corrupt the disk file that makes up the device (your disk4.img, represented in ZFS by the loop3 device). Use the dd command to simply zero out the entire device (see Listing 6).

Listing 6. Corrupting the ZFS pool

$ dd if=/dev/zero of=disk4.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 1.84791 s, 36.3 MB/s

ZFS is currently unaware of the corruption, but you can force it to see the problem by requesting a scrub of the pool. As shown in Listing 7, ZFS now recognizes the corruption (of the loop3 device) and suggests an action to replace the device. Note also that the pool remains online, and you can still get to your data, as ZFS self-corrects through RAID-Z.

Listing 7. Scrubbing and checking the pool

$ sudo zpool scrub myzpool
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
status: One or more devices could not be used because the label is missing
        or invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 20 23:15:03 2010
config:
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   UNAVAIL      0     0     0  corrupted data
errors: No known data errors
$ wc -l /myzpool/myzdev/devices.txt
3340 /myzpool/myzdev/devices.txt

As recommended, introduce a new device to your RAID-Z set to act as the new container. Begin by creating a new disk image and representing it as a device with losetup. Note that this process is similar to adding a new physical disk to the set. You then use zpool replace to exchange the corrupted device (loop3) with the new device (loop4). Checking the status of the pool, you can see your new device with a message indicating that data was rebuilt on it (called resilvering), along with the amount of data moved there. Note also that the pool remains online with no errors (visible to the user). To conclude, you scrub the pool again; after checking its status, you'll see that no issues exist, as shown in Listing 8.

Listing 8. Repairing the pool using zpool replace

$ dd if=/dev/zero of=disk5.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.925143 s, 72.5 MB/s
$ sudo losetup /dev/loop4 ./disk5.img
$ sudo zpool replace myzpool loop3 loop4
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Nov 20 23:23:12 2010
config:
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop4   ONLINE       0     0     0  59.5K resilvered
errors: No known data errors
$ sudo zpool scrub myzpool
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 20 23:23:23 2010
config:
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop4   ONLINE       0     0     0
errors: No known data errors
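When you're done experimenting, you may want to tear the demonstration down. The sketch below lists the steps, assuming the pool name, loop devices, and image files used in the listings above and that you are in the zfstest directory. The run wrapper and the ZFS_DEMO_LIVE variable are inventions of this example: by default the commands are only printed, because zpool destroy permanently removes the pool and everything stored in it.

```shell
# Dry run by default: print each command; set ZFS_DEMO_LIVE=1 to execute it.
run() {
    echo "+ $*"
    if [ -n "${ZFS_DEMO_LIVE:-}" ]; then "$@"; fi
}

run sudo zpool destroy myzpool            # irreversibly destroys the pool and its data

for n in 0 1 2 3 4; do
    run sudo losetup -d "/dev/loop$n"     # detach each loop device used in the demonstration
done

# Remove the backing image files.
run rm -f disk1.img disk2.img disk3.img disk4.img disk5.img
```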

This short demonstration explores the consolidation of volume management with a file system and shows how easy it is to administer ZFS (even in the face of failures).

Other Linux-ZFS possibilities

The advantage of ZFS on FUSE is that it's simple to begin using ZFS, but it has the downside of not being as efficient as it could be. This lack of efficiency is the result of the multiple user-kernel transitions required per I/O. But given the popularity of ZFS, there is another option that provides greater performance.

A native port of ZFS to the Linux kernel is well under way at Lawrence Livermore National Laboratory. This port still lacks some elements, such as the ZFS POSIX Layer (ZPL), but this is under development. The port provides a number of useful features, particularly if you're interested in using ZFS with Lustre. (See Related topics for details.)

Going further

Hopefully, this article has whetted your appetite to dig further into ZFS. From the earlier demonstration, you can easily get ZFS up and running on most Linux distributions, and even in the kernel, with some limitations. Topics such as snapshots and clones were not demonstrated here, but the Related topics section provides links to interesting articles on these topics. In the end, Linux and ZFS are state-of-the-art technologies, and it will be difficult to keep them apart.

Related topics