
OpenZFS Development Roadmap

This document serves as a single point for interested developers, admins, and users to look at what's coming, when, and to where in OpenZFS. Information was gathered at the OpenZFS Dev Summit in 2016 and collated here.

The roadmap is currently maintained primarily by Jim Salter. Feel free to contact Jim if you're working on a project that you'd like to see updated here; if you're a wiki user with edit privileges, update it yourself.

Upcoming Features

ZFS Compatibility Layer

Primary dev/contact: Paul Dagnelie

Currently, ZFS code is littered with Solaris-isms that are a poor fit on other platforms such as Linux or BSD and have to be translated through the SPL (Solaris Porting Layer). The goal is to subsume and replace the SPL with a platform-neutral ZCL, which does not favor any one platform and will be able to handle native feature sets better on Linux and elsewhere.

Example: memory allocation via the SPL can cause kernel-mode errors, such as trying to free a 16K cache from 512K pages, which makes ztest under Linux throw far too many errors.

Status: The thread and process libraries are almost done and almost build internally, hopefully passing tests by October 2016. They will then be pushed internally at Delphix, and upstream in the Nov/Dec timeframe. After thread/process, the next step will be adding new pieces such as a generic atomic store interface, the ZIO layer, etc.

How to Help: Paul is looking for volunteers for other bits, such as replacing atomics calls. He is also looking for input on what the ZCL APIs should look like before people start actually hacking them into existence.

ZFS on Linux downstream porting

Primary dev/contact: Brian Behlendorf

Features in master or close to it, targeted at the next stable release (end-2016 timeframe):

zol #3983 - user and group dnode accounting - not quite ready for master yet; next few weeks, waiting for review

zfs send/receive resume after interruption - done in master

preserve on-disk compression in zfs send - done in master

delegation (zfs allow command) - done in master

ZFS At-Rest Encryption

Primary dev/contact: Tom Caputi

At-rest encryption currently using AES-CCM and AES-GCM; pluggable for future algorithm changes

Encrypts data, not metadata - e.g. you can run zfs list -rt all without needing the key

Key wrapping - the master key used to encrypt data is wrapped (encrypted) with a key derived from a changeable user passphrase, so the passphrase can be changed without needing to re-encrypt data; the master key can only be obtained by way of a kernel debugger on an unlocked in-flight operation

raw send - zfs send updated to send raw (still-encrypted) data, can be received by untrusted remote pool which does not need user passphrase or master key to accept full or incremental receive!

raw send is nearly feature complete; Tom expects to merge that functionality into the PR in late March 2017
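The key-wrapping idea above can be illustrated with a toy sketch. This is not the ZFS implementation: the XOR-based wrap is a stand-in for real AES key wrapping, PBKDF2 is an assumed key-derivation function, and all function names are hypothetical. It only shows why changing the passphrase never touches the master key, and therefore never requires re-encrypting data.

```python
import hashlib
import hmac
import os


def derive_wrapping_key(passphrase: str, salt: bytes) -> bytes:
    # Assumption: PBKDF2-HMAC-SHA256 stands in for whatever KDF is really used.
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 10_000)


def xor_wrap(wrapping_key: bytes, key_material: bytes) -> bytes:
    # Toy stand-in for AES key wrapping, NOT secure. XOR is its own inverse,
    # so the same function both wraps and unwraps a 32-byte key.
    pad = hmac.new(wrapping_key, b"wrap", hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(key_material, pad))


def change_passphrase(wrapped: bytes, salt: bytes, old: str, new: str):
    # Unwrap with the old passphrase, then rewrap the SAME master key
    # under a key derived from the new passphrase and a fresh salt.
    master = xor_wrap(derive_wrapping_key(old, salt), wrapped)
    new_salt = os.urandom(16)
    return xor_wrap(derive_wrapping_key(new, new_salt), master), new_salt


master_key = os.urandom(32)  # random; this is what would encrypt the data
salt = os.urandom(16)
wrapped = xor_wrap(derive_wrapping_key("old passphrase", salt), master_key)

new_wrapped, new_salt = change_passphrase(
    wrapped, salt, "old passphrase", "new passphrase"
)
recovered = xor_wrap(derive_wrapping_key("new passphrase", new_salt), new_wrapped)
print(recovered == master_key)  # prints True: data keys are untouched
```

Only the small wrapped blob is rewritten on a passphrase change; everything encrypted under the master key stays as it is on disk.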

How to Help: Tom is desperately seeking code review so patches can get accepted into upstream master! Need standard code review and, ideally, crypto review from accomplished cryptographer(s).

Top-level vdev removal

Primary dev/contact: Matt Ahrens

in-place removal of top-level vdev from pool

use cases: recovering from an accidental add of singletons instead of mirrors; migrating a pool in-place from many smaller mirror vdevs to a few larger mirror vdevs

no block pointer rewrite; accomplished by use of in-memory map table from blocks on removed vdevs to blocks on remaining vdevs

remap table is always hot in RAM; minimal performance impact but may add seconds to zpool import times

repeated vdev removal results in increasingly large remap table size, with longer remap chains: eg block was on removed vdev A, remapped to now-removed vdev B, remapped to now-removed vdev C, remapped to current vdev D. Chains are not compressed after the fact.

technique could in principle be used for defrag, but even a single use would result in maximally large remap tables, and continual defrag runs would very rapidly scale out of control
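The remap chains described above can be sketched with a small in-memory table. This is only an illustrative model: the dictionary layout, the (vdev, offset) tuples, and the resolve function are hypothetical and do not reflect the actual on-disk or in-core structures.

```python
# Hypothetical remap table: (removed vdev, offset) -> (new vdev, offset).
# Each vdev removal adds entries; older entries are never rewritten,
# so a block moved several times is reached through a chain of lookups.
remap_table = {
    ("A", 100): ("B", 7),   # vdev A removed, block copied to B
    ("B", 7): ("C", 42),    # later, vdev B removed as well
    ("C", 42): ("D", 9),    # later still, vdev C removed
}


def resolve(table, location):
    """Follow the remap chain until reaching a location on a live vdev."""
    hops = 0
    while location in table:
        location = table[location]
        hops += 1
    return location, hops


final_location, hops = resolve(remap_table, ("A", 100))
print(final_location, hops)  # ('D', 9) 3
```

Because chains are not compressed after the fact, the number of hops in this model grows with every removal that touches the same block.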

Current status: feature complete for singleton vdevs only; in internal production at Delphix. Expected to be extended to removal of mirror vdevs next. Removal of top-level RAIDZ vdevs is technically possible, but ONLY for pools of identical raidz vdevs - i.e. four 6-disk RAIDZ2 vdevs, etc. You will not be able to remove a raidz vdev from a "mutt" pool.

How to help: work on mirror vdev removal.

Parity Declustered RAIDZ (draid)

Primary dev/contact: Isaac Hehuang

40,000 foot overview: draid is a new top-level vdev topology which looks much like an entire standard pool. For example, a draid vdev might look functionally like three 7-disk raidz1 vdevs with three spares. Just like a similarly-constructed pool, the draid vdev would have 18 data blocks and three parity blocks per stripe, and could replace up to three failed disks with spares. However, data is interleaved differently from row to row across the entire draid in much the same way it's interleaved from row to row inside a single raidz1 or raid5, presenting similar performance benefits to those gained by raid5 (interleaved single parity) over raid3 (dedicated single parity) when degraded or rebuilding.

As an example, disk 17 of a 30-disk draid vdev might contain the fourth data block of the first internal raidz1-like grouping on one row, but contain the parity block for the fourth raidz1-like grouping on the following row. Again, this is very similar to the way raid5/raidz1 already interleaves parity within its own structure.
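The row-to-row interleaving can be pictured with a deliberately simplified sketch. It assumes a plain one-disk rotation per row, which is NOT the real draid placement scheme; the function name and role labels are hypothetical. It models the example of three 7-disk raidz1-like groups plus three spares.

```python
def draid_row(row, groups=3, width=7, spares=3):
    """Return the role each physical disk plays on a given row.

    Simplifying assumption: logical slots are rotated by one disk per row.
    """
    ndisks = groups * width + spares
    layout = [None] * ndisks
    for slot in range(ndisks):
        disk = (slot + row) % ndisks  # rotate the logical layout by `row`
        if slot >= groups * width:
            role = "spare"
        elif slot % width == 0:
            role = f"g{slot // width}-parity"
        else:
            role = f"g{slot // width}-data{slot % width}"
        layout[disk] = role
    return layout


row0 = draid_row(0)
row1 = draid_row(1)
print(sum("data" in r for r in row0))    # 18 data blocks per stripe
print(sum("parity" in r for r in row0))  # 3 parity blocks per stripe
print(row0[0], "->", row1[0])            # g0-parity -> spare
```

The point of the model is that no disk is permanently a parity or spare disk, so rebuild I/O is spread across every member of the vdev.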

Isaac Hehuang explains draid1 internal structure at OpenZFS Dev Summit, Sep 27 2016

Isaac live demonstrated a 31-disk draid vdev constructed as six 5-disk raidz1 internal groups plus one spare rebuilding after removal of one disk. The vdev rebuilt the missing disk at a rate of greater than 1GB/sec. This is possible because draid can read sequentially (like a conventional RAID rebuild) while still skipping free space (like a RAIDZ vdev resilvering).

Note that for draid vdevs, spares are added at the *vdev* level, not the pool level!

Status: draid is functional with single-parity groupings now. Double-parity, triple-parity, and mirror internal grouping support is planned but not yet implemented. Drivers are working for both members and spares at the single-parity level. Thorough testing is still needed to flush out potential bugs.

Notable gotchas:

draid rebuild is not a resilver - checksums/parity are not verified during rebuild, and therefore should be verified with a scrub immediately after a rebuild. Scrub performance is not notably different for a draid than it would be for a similarly-constructed pool of raidz vdevs.

draid vdev topology is immutable once created, like other parity vdev types. If you create a 30-disk draid vdev, it will be a 30-disk draid vdev for the lifetime of the pool!

How to help: Isaac is looking for code review, testing, and eventually platform porting (the development platform is Linux).

SPA Metadata Allocation Classes

Primary dev/contact: Don Brady

Overview: SPAMAC allows the targeting of metadata to high-IOPS vdevs within a pool using lower-IOPS vdevs for data storage.

Example:

zpool create dozer \
    raidz1 sda sdb sdc sdd sde \
    raidz1 sdf sdg sdh sdi sdj \
    metadata mirror sdk sdl \
    log mirror sdm sdn

This creates a pool with two low-performance, high storage efficiency RAIDZ1 vdevs for data storage, plus one high-performance mirror for storage of MOS, DMU, and DDT and another high-performance mirror for SLOG.

It is possible to further individually target MOS, DDT, DMU to individual vdevs as either primary or secondary (overflow) targets, which can be cataloged with zpool list -cv:

zpool list -cv
NAME        ANY  MOS  DMU  DDT  LOG
dozer
  raidz1-0  Pri  -    -    -    -
  raidz1-1  Pri  -    -    -    -
  mirror-2  -    Pri  Pri  Pri  Sec
  mirror-3  -    -    -    -    Pri

In this example, data goes to the two raidz1 vdevs only; MOS, DMU, DDT goes to mirror-2 only; and SLOG goes primarily to mirror-3 but can overflow to mirror-2 if necessary.
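The primary/secondary targeting in this example can be modeled with a short sketch. The table mirrors the zpool list -cv output above; the pick_vdevs function and the notion of a vdev being "full" as the overflow trigger are assumptions made for illustration, not the actual allocator logic.

```python
# Allocation class table taken from the example pool "dozer".
ALLOCATION_CLASSES = {
    "raidz1-0": {"ANY": "Pri"},
    "raidz1-1": {"ANY": "Pri"},
    "mirror-2": {"MOS": "Pri", "DMU": "Pri", "DDT": "Pri", "LOG": "Sec"},
    "mirror-3": {"LOG": "Pri"},
}


def pick_vdevs(kind, full=frozenset()):
    """Return candidate vdevs for a block of the given class.

    Primary targets are preferred; secondary targets are used only when
    no primary target has room (assumed overflow condition).
    """
    for level in ("Pri", "Sec"):
        candidates = [
            vdev
            for vdev, roles in ALLOCATION_CLASSES.items()
            if roles.get(kind) == level and vdev not in full
        ]
        if candidates:
            return candidates
    return []


print(pick_vdevs("ANY"))                     # ['raidz1-0', 'raidz1-1']
print(pick_vdevs("DDT"))                     # ['mirror-2']
print(pick_vdevs("LOG"))                     # ['mirror-3']
print(pick_vdevs("LOG", full={"mirror-3"}))  # ['mirror-2']
```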

This is currently a working prototype, with API and CLI tools well fleshed out, being pushed upstream from Intel internal to ZFS on Linux. Expected push to master in Q1 2017.

Don Brady diagrams allocation of SPA metadata to designated vdevs, OpenZFS Dev Summit Sep 27 2016

How to help: code review, testing.

Eager Zero

Primary dev/contact: George Wilson

Executive overview: when running ZFS on top of a virtual storage environment the admin does not control (such as leased AWS infrastructure), the underlying storage may be allocated sparse/thin, rather than being fully preallocated. This can impose a performance penalty of 200%+ on "first writes" to blocks which have not yet been allocated.

Eager Zero adds the ability to forcibly write data to the raw blocks occupying empty metaslabs - currently, never-ending repetitions of DEADBEEF. This should cause the underlying storage to preallocate those blocks, removing the first-write penalty when those blocks are later used for actual data in the pool. The rest of ZFS does not consider those writes to be CoW data it cares about; the empty metaslabs remain empty until actually written to.

Gotchas:

if the underlying storage uses inline compression (like ZFS with compression=lz4, etc) the stream of DEADBEEF will compress by several orders of magnitude, effectively meaning that real data writes will still require initial allocation later.

Presumably, your vendor knew precisely what they were doing when they thin-provisioned you, and may be budgeting their own resources based on the proposition that most clients will not actually be fully allocated - it's possible that eventually, vendors may charge more for actually-allocated storage than they do for sparse storage capacity, in response to tricks like this. YMMV.
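The compression gotcha is easy to demonstrate. The sketch below uses zlib from the Python standard library as a stand-in for whatever inline compression the underlying storage applies (lz4 is not in the standard library, so exact ratios will differ); the 1 MiB fill size is arbitrary.

```python
import os
import zlib

# 1 MiB of the repeating DEADBEEF fill pattern.
pattern = bytes.fromhex("deadbeef")
fill = pattern * (1024 * 1024 // len(pattern))

# The same amount of incompressible data, standing in for real writes.
real_data = os.urandom(len(fill))

fill_compressed = zlib.compress(fill)
real_compressed = zlib.compress(real_data)

fill_ratio = len(fill) / len(fill_compressed)
real_ratio = len(real_data) / len(real_compressed)

print(f"fill pattern: {len(fill)} -> {len(fill_compressed)} bytes")
print(f"random data:  {len(real_data)} -> {len(real_compressed)} bytes")
print(f"ratios: {fill_ratio:.0f}x vs {real_ratio:.2f}x")
```

A compressing backend would store the fill pattern in a tiny fraction of the space it nominally occupies, so most of the backing blocks would still be unallocated when real data arrives.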

Redacted Send/Receive

Primary dev/contact: Paul Dagnelie

Executive overview: Redacted Send/Receive allows you to clone a dataset containing sensitive data, wipe out / obfuscate / modify the sensitive data in the clone, then replicate the clone to an untrusted target. For example, you might use this to replicate the working configuration and system of a VM containing HIPAA data, but with the actual HIPAA data in the VM removed.

(this entry is a stub, and I need to contact Paul to further flesh it out)

Persistent L2ARC / TRIM

Primary dev/contact: Saso Kiselkov

Executive overview: this is the easiest improvement to understand in the current set of proposed features; exactly what it says on the tin: L2ARC that persists across reboots, and TRIM support for the underlying devices in a pool. TRIM function supports both Trim and Unmap, as appropriate to underlying storage.

pL2ARC status: Code has been out for review on illumos for several years, waiting on TRIM to land. I suspect ZFS on Linux will merge it once they merge TRIM.

TRIM status: Likewise, code has been out for review on illumos and OpenZFS for a while, but is not yet merged. ZFS on Linux was waiting on OpenZFS to merge it first, but it now looks like it's just going to merge it shortly.

How to help: push the code the last few inches across the finish line by refreshing the patches against the latest master branches and fixing any issues reviewers identify.

SPA import and pool recovery

Primary dev/contact: Pavel Zakharov

Executive overview: with Pavel's patches, it becomes possible to import a pool that is missing a top-level vdev. Such a pool will be missing user data, but should still have all metadata intact if only one top-level vdev is missing (due to metadata copies=2). Interesting corollary: any datasets with copies=2 set on data should have ALL data intact and available, although the pool itself cannot be made healthy by adding a new vdev to replace the missing one. However, any intact data (again, including ALL data in any dataset with copies=2 or more set) can be extracted to a healthy pool (or other storage).
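The reasoning about which blocks survive can be sketched with a toy model. It assumes that redundant copies of a block always land on different top-level vdevs; the block names, vdev names, and the readable_after_loss function are hypothetical.

```python
# Hypothetical block placement: block name -> vdevs holding a copy.
blocks = {
    "metadata": ["vdev0", "vdev1"],          # metadata gets copies=2
    "data (copies=1)": ["vdev1"],            # single copy on the lost vdev
    "data (copies=2)": ["vdev1", "vdev2"],   # one copy survives elsewhere
    "data on other vdev": ["vdev0"],         # never touched the lost vdev
}


def readable_after_loss(placement, missing_vdev):
    """A block stays readable if at least one copy is on a surviving vdev."""
    return {
        name: any(vdev != missing_vdev for vdev in vdevs)
        for name, vdevs in placement.items()
    }


for name, ok in readable_after_loss(blocks, "vdev1").items():
    print(f"{name}: {'readable' if ok else 'LOST'}")
```

In this model only the single-copy data that lived on the missing vdev is lost, which matches the corollary that copies=2 datasets remain fully recoverable.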

zdb is also improved to allow modification of tunables directly, rather than requiring direct kernel interaction.

Code was merged into illumos and OpenZFS in Feb 2018.

How to help: Integration and PRs for other platforms