
Back in 2008, I wrote a post about recovering a removed file on a ZFS disk. That post links to a paper here (see page 36) and a set of slides here.

Over time, I have received email from various people asking for help recovering files, pools, or datasets, or asking for the tools I talked about in that blog post and at the OpenSolaris Developers Conference in Prague in 2008. These tools were a modified mdb(1) and a modified zdb(1M). It is time to revisit that work.

In this post, I'll create a ZFS pool, add a file to the pool, destroy the pool, and then recover the file. To do this, I'll use a modified mdb and a tool I wrote to uncompress ZFS compressed data/metadata (zuncompress). Since zdb does not seem to work with destroyed zpools (in fact, much of zdb does not work with pools that do not import), I will not be using it. The code for what I am using is available at mdbzfs. Please read the README file for instructions on how to set things up.

For those of you who are running ZFS on Linux, at the end of this blog post, I have a suggestion on how you might try this on your ZFS on Linux file system.

Before you try this on your own, please back up the disk(s) in question. Use the technique I am showing at your own risk. (Note that nothing I am doing should change any data in the zpool.) If you are using a file the way I do here, there is of course no need to make a backup.

First, we'll create a ZFS pool backed by a file, add a file to the pool, and then destroy the pool:

# mkfile 100m /var/tmp/zfsfile
# zpool create testpool /var/tmp/zfsfile
# touch /testpool/foo
# cp /usr/dict/words /testpool/words
# sync
# zpool destroy testpool
#

Note that the first time I tried this, I did not do the sync. I created the pool, added the file, and destroyed the pool before ZFS got around to committing the transactions to disk, resulting in the file not showing up.

The steps we'll take to get the words file back from the destroyed pool start at the uberblock and walk the (compressed) metadata structures on disk until we get to the file. If I (or someone else) ever get around to adding a "zfs on disk" target to mdb, this will be much simpler.

# mdb /var/tmp/zfsfile
> ::walk uberblock u | ::print zfs`uberblock_t ub_txg ! sort -r
ub_txg = 0xe
ub_txg = 0xd
ub_txg = 0xc
ub_txg = 0xb
ub_txg = 0xa
ub_txg = 0x9
ub_txg = 0x6
ub_txg = 0x5
ub_txg = 0x4
ub_txg = 0x14
ub_txg = 0x11
ub_txg = 0
ub_txg = 0
...

The uberblock walker is in the rawzfs.so dmod (see the source on github). I have also added the following lines to ~/.mdbrc:

::load zfs.so
::load rawzfs.so
::loadctf

The zfs.so and rawzfs.so files are built when you build mdb from my github repo. If you gmake world, you may not need to do the two loads. So, in this case, the highest transaction group id is 0x14. Note that I am assuming this is the last active uberblock_t; if it doesn't work, try the next lowest id. Let's print out the uberblock_t for that transaction group id.

> ::walk uberblock u | ::print zfs`uberblock_t ub_txg | ::grep ".==14" | ::eval "
...

The rootbp blkptr_t in the above takes us to an objset_phys_t for the meta object set (MOS) of the pool. Let's look at that blkptr_t.

> 25028::blkptr
DVA[0]=<0:84800:200> DVA[1]=<0:1284800:200> DVA[2]=<0:2484800:200>
[L0 OBJSET] FLETCHER_4 LZJB LE contiguous unique triple
size=800L/200P birth=20L/20P fill=39
cksum=126da42f4f:6be7bf74635:145b828e81ab7:2a37bf50847b59
> $q
#

So, there are 3 copies of the objset_phys_t specified by the blkptr, at 0x84800, 0x1284800, and 0x2484800 bytes into the first (and only) vdev (the leading 0 in 0:84800:200). The three copies are compressed via lzjb compression. On disk, each is 0x200 bytes; decompressed, the objset_phys_t is 0x800 bytes. Currently, mdb has no way to decompress the data, so we'll use the new tool zuncompress to uncompress the data into a file.

# ./zuncompress -p 200 -l 800 -o 84800 /var/tmp/zfsfile > /tmp/mos_objset
#
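The decompression zuncompress performs is not magic. LZJB, the algorithm named in the ::blkptr output above, is small enough to sketch in a few lines of Python. This is written from my understanding of the algorithm (a copymap byte followed by eight items, each either a literal byte or a 2-byte length/offset copy); it is illustrative, not a replacement for zuncompress.

```python
def lzjb_decompress(src, dstsize):
    """Decompress an LZJB buffer into dstsize bytes."""
    MATCH_BITS, MATCH_MIN = 6, 3
    OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1      # 10-bit back-offset
    dst = bytearray()
    i = 0
    copymap, copymask = 0, 1 << 7
    while len(dst) < dstsize and i < len(src):
        copymask <<= 1
        if copymask == (1 << 8):        # start of a new 8-item group
            copymask = 1
            copymap = src[i]
            i += 1
        if copymap & copymask:          # copy item: length + back-offset
            mlen = (src[i] >> (8 - MATCH_BITS)) + MATCH_MIN
            offset = ((src[i] << 8) | src[i + 1]) & OFFSET_MASK
            i += 2
            pos = len(dst) - offset
            for _ in range(mlen):
                if len(dst) >= dstsize:
                    break
                dst.append(dst[pos])
                pos += 1
        else:                           # literal item: one byte
            dst.append(src[i])
            i += 1
    return bytes(dst)
```

The -p/-l arguments to zuncompress correspond to the compressed (physical) and decompressed (logical) sizes here.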

The decompressed objset_phys_t is now in /tmp/mos_objset. Now we'll run mdb on the file to look at the objset_phys_t.

# mdb /tmp/mos_objset
> 0::print -a -t zfs`objset_phys_t
0 objset_phys_t {
    0 dnode_phys_t os_meta_dnode = {
        0 uint8_t dn_type = 0xa
        1 uint8_t dn_indblkshift = 0xe
        2 uint8_t dn_nlevels = 0x1
        3 uint8_t dn_nblkptr = 0x3
        ...
        40 blkptr_t [1] dn_blkptr = [
            40 blkptr_t {
                40 dva_t [3] blk_dva = [
                    40 dva_t {
                        40 uint64_t [2] dva_word = [ 0x5, 0x41f ]
                    },
                    50 dva_t {
                        50 uint64_t [2] dva_word = [ 0x5, 0x941f ]
                    },
                    60 dva_t {
                        60 uint64_t [2] dva_word = [ 0x5, 0x1241f ]
                    },
                ]
                70 uint64_t blk_prop = 0x800a07030004001f
                78 uint64_t [2] blk_pad = [ 0, 0 ]
                88 uint64_t blk_phys_birth = 0
                90 uint64_t blk_birth = 0x14
                98 uint64_t blk_fill = 0x1f
                a0 zio_cksum_t blk_cksum = {
                    a0 uint64_t [4] zc_word = [ 0xbc335cdf82, 0xee3d3a7c1fc4, 0xc355cf13639994, 0x78d0d2289454a408 ]
                }
            },
        ]
        c0 uint8_t [192] dn_bonus = [ 0x3, 0, 0, 0, 0, 0, 0, 0, 0x2b, 0x4, 0, 0, 0, 0, 0, 0, 0x3, 0, 0, 0, 0, 0, 0, 0, 0x2b, 0x94, 0, 0, 0, 0, 0, 0, ... ]
        ...
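It is worth seeing by hand what is packed into those raw words before letting ::blkptr do it for us. The following Python sketch decodes a dva_word pair and blk_prop; the bit layouts are assumptions taken from the ZFS on-disk format specification, not something mdb produced. Decoding the first dva_word pair above ([ 0x5, 0x41f ]) yields the DVA 0:83e00:a00, and blk_prop 0x800a07030004001f decodes to a level-0, lzjb-compressed DNODE block, 0x4000 logical / 0xa00 physical bytes. (The actual byte offset within the backing file is 0x400000 plus the DVA offset; zuncompress appears to apply that internally, while the raw mdb reads later in this post add 400000 explicitly.)

```python
def decode_dva(word0, word1):
    """Decode a raw dva_word pair from a blkptr_t.  Assumed layout: vdev
    in the top 32 bits of word 0, allocated size in its low 24 bits,
    offset in the low 63 bits of word 1; all sizes/offsets are stored
    in units of 512-byte sectors."""
    vdev = word0 >> 32
    asize = (word0 & 0xffffff) << 9
    offset = (word1 & ((1 << 63) - 1)) << 9
    return vdev, offset, asize

def decode_blk_prop(prop):
    """Split blk_prop into (level, type, cksum, comp, lsize, psize).
    lsize/psize are stored as sector counts minus one."""
    lsize = ((prop & 0xffff) + 1) << 9
    psize = (((prop >> 16) & 0xffff) + 1) << 9
    comp = (prop >> 32) & 0xff      # 3 = lzjb
    cksum = (prop >> 40) & 0xff     # 7 = fletcher4
    dtype = (prop >> 48) & 0xff     # 0xa = DMU_OT_DNODE
    level = (prop >> 56) & 0x1f
    return level, dtype, cksum, comp, lsize, psize
```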

Let's get the blkptr_t in the objset_phys_t. This will be either a block containing the dnode_phys_t array for the meta object set (MOS) of the pool, or an indirect block containing blkptr_ts, which in turn may point at the dnode_phys_t array or at more indirect blocks.

> ::status
debugging file '/tmp/objset' (object file)
> 40::blkptr
DVA[0]=<0:83e00:a00> DVA[1]=<0:1283e00:a00> DVA[2]=<0:2483e00:a00>
[L0 DNODE] FLETCHER_4 LZJB LE contiguous unique triple
size=4000L/a00P birth=20L/20P fill=31
cksum=bc335cdf82:ee3d3a7c1fc4:c355cf13639994:78d0d2289454a408
> $q
#

In this case, the blkptr is for a block containing the MOS (an array of dnode_phys_t). The L0 DNODE in the above output shows that there are 0 levels of indirection. (A case with multiple levels of indirection from a blkptr_t is shown below.) We'll decompress the block.

# ./zuncompress -p a00 -l 4000 -o 83e00 /var/tmp/zfsfile > /tmp/mos
#

As mentioned earlier, the MOS is an array of dnode_phys_t . The decompressed block is 0x4000 bytes large.

# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200
> 4000%200=K
                20
>

There are 32 (0x20) entries in the array. Let's dump them.

> 0,20::print -a -t zfs`dnode_phys_t
0 dnode_phys_t {
    0 uint8_t dn_type = 0
...

An "object directory" (DMU_OT_OBJECT_DIRECTORY) is a "ZAP" object containing information about the meta objects. Meta objects in the MOS include the root of the pool, snapshots, clones, the space map, and other information. The ZAP object is contained in the data specified by the blkptr_t at location 0x240 in the above output.

> 240::blkptr
DVA[0]=<0:4000:200> DVA[1]=<0:1204000:200> DVA[2]=<0:2400000:200>
[L0 OBJECT_DIRECTORY] FLETCHER_4 LZJB LE contiguous unique triple
size=400L/200P birth=4L/4P fill=1
cksum=f38ae7fee:6064734c9bd:13a8cd3126a75:2bfdd306beb1a2
> $q
#

Let's decompress and look at the ZAP.

# ./zuncompress -p 200 -l 400 -o 4000 /var/tmp/zfsfile > /tmp/objdir
# mdb /tmp/objdir
> 0/K
0:              8000000000000003

The 8000000000000003 tells us this is a microzap (as opposed to a "fat ZAP"). Fat ZAPs are used when the amount of data in the ZAP exceeds 1 block (and hence needs indirect blocks).

> 0::print -a -t zfs`mzap_phys_t
0 mzap_phys_t {
    0 uint64_t mz_block_type = 0x8000000000000003
    8 uint64_t mz_salt = 0x16c04723
    10 uint64_t mz_normflags = 0
    18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    40 mzap_ent_phys_t [1] mz_chunk = [
        40 mzap_ent_phys_t {
            40 uint64_t mze_value = 0x2
            48 uint32_t mze_cd = 0
            4c uint16_t mze_pad = 0
            4e char [50] mze_name = [ "root_dataset" ]
        },
    ]
}
> $q
#

There are more entries, but this is the one we want ("root_dataset"). The value of 2 for mze_value is an object id: basically, an index into the MOS array of dnode_phys_ts where the root dataset is described.
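Since we'll be reading several microzaps in this walk, here is a rough Python parser for a decompressed microzap block. The 64-byte entry layout (8-byte mze_value, 4-byte mze_cd, 2-byte mze_pad, then the name) mirrors the mzap_ent_phys_t dump above; treat it as a sketch under those assumptions, not a full ZAP implementation (it ignores fat ZAPs entirely).

```python
import struct

MZAP_BLOCK_TYPE = 0x8000000000000003
MZAP_ENT_LEN = 64           # each entry occupies 64 bytes

def parse_microzap(buf):
    """Parse a decompressed microzap block into {name: mze_value}."""
    (block_type,) = struct.unpack_from('<Q', buf, 0)
    assert block_type == MZAP_BLOCK_TYPE, "not a microzap"
    entries = {}
    # entries start after the 0x40-byte mzap_phys_t header
    for off in range(0x40, len(buf) - MZAP_ENT_LEN + 1, MZAP_ENT_LEN):
        (value,) = struct.unpack_from('<Q', buf, off)
        name = buf[off + 14 : off + MZAP_ENT_LEN].split(b'\0')[0].decode()
        if name:                        # skip empty (unused) chunks
            entries[name] = value
    return entries
```

Run against /tmp/objdir, this should recover the "root_dataset" → 2 mapping found with mdb above.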

# mdb /tmp/mos
> 2*200::print -a -t zfs`dnode_phys_t

Here, the blkptr_t is not used. Instead, the information we need is in the "bonus buffer" (dn_bonus, at offset 0x4c0).

> 4c0::print -a -t zfs`dsl_dir_phys_t
4c0 dsl_dir_phys_t {
    4c0 uint64_t dd_creation_time = 0x5203bf9a
    4c8 uint64_t dd_head_dataset_obj = 0x15
    4d0 uint64_t dd_parent_obj = 0
    4d8 uint64_t dd_origin_obj = 0x12
    4e0 uint64_t dd_child_dir_zapobj = 0x4
    4e8 uint64_t dd_used_bytes = 0x5d400
    4f0 uint64_t dd_compressed_bytes = 0x4b200
    4f8 uint64_t dd_uncompressed_bytes = 0x4b200
    500 uint64_t dd_quota = 0
    508 uint64_t dd_reserved = 0
    510 uint64_t dd_props_zapobj = 0x3
    518 uint64_t dd_deleg_zapobj = 0
    520 uint64_t dd_flags = 0x1
    528 uint64_t [5] dd_used_breakdown = [ 0x48400, 0, 0x15000, 0, 0 ]
    550 uint64_t dd_clones = 0
    558 uint64_t dd_filesystem_count = 0
    560 uint64_t dd_snapshot_count = 0
    568 uint64_t [11] dd_pad = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
}

From here, we'll go to the dd_head_dataset_obj , 0x15.

> 15*200::print -a -t zfs`dnode_phys_t
2a00 dnode_phys_t {
    2a00 uint8_t dn_type = 0x10
...

The data for the DMU_OT_DSL_DATASET is in the bonus buffer. Let's dump that out.

> 2ac0::print -a -t zfs`dsl_dataset_phys_t
2ac0 dsl_dataset_phys_t {
    2ac0 uint64_t ds_dir_obj = 0x2
    ...
    2b40 blkptr_t ds_bp = {
        2b40 dva_t [3] blk_dva = [
            2b40 dva_t {
                2b40 uint64_t [2] dva_word = [ 0x1, 0x2d6 ]
            },
            2b50 dva_t {
                2b50 uint64_t [2] dva_word = [ 0x1, 0x90d6 ]
            },
            2b60 dva_t {
                2b60 uint64_t [2] dva_word = [ 0, 0 ]
            },
        ]
        ...

And look at the blkptr_t .

> 2b40::blkptr
DVA[0]=<0:5ac00:200> DVA[1]=<0:121ac00:200>
[L0 OBJSET] FLETCHER_4 LZJB LE contiguous unique double
size=800L/200P birth=11L/11P fill=9
cksum=15955ae455:7d0aed4c6f5:17b63dc48793f:3202bc4dfa3b58
> $q
#

This is another objset_phys_t , this time for the root dataset instead of the MOS. We'll decompress and take a look.

# ./zuncompress -p 200 -l 800 -o 5ac00 /var/tmp/zfsfile > /tmp/ds_objset
# mdb /tmp/ds_objset
> 0::print -a -t zfs`objset_phys_t
0 objset_phys_t {
    0 dnode_phys_t os_meta_dnode = {
        0 uint8_t dn_type = 0xa
...

We grab the blkptr_t, as we did for the MOS objset.

> 40::blkptr
DVA[0]=<0:5a200:400> DVA[1]=<0:121a200:400>
[L6 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a33c7bab6:3e5fa32d9ea0:16d3626ce1ceee:5d94da91be37c8d
> $q
#

For the dataset object set, there are 2 copies of the metadata (unlike the three copies for the MOS). And the "L6" says there are 6 levels of indirection. Indirect blocks are blocks containing blkptr_ts of blocks containing block pointers... of blocks containing data. In this case, 6 levels deep. We'll look at the first blkptr_t in each of these. Note that if this were a large file system with lots of data, we would probably still need the beginning (the root of the file system) to get started. In this particular case, the only blkptr_t in use in each of the indirect blocks is the first one; the rest are "holes" (placeholders for when/if the file system has more objects). Given an object id, the arithmetic needed to find the correct path through the indirect blocks for that object id is covered in the papers mentioned at the beginning of this post.
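That arithmetic is small enough to sketch here. Assuming the sizes seen in this walk (0x200-byte dnode_phys_t, 0x4000-byte data and indirect blocks, 0x80-byte blkptr_t, so 32 dnodes per data block and 128 blkptrs per indirect block), a hypothetical helper can compute which blkptr to follow at each level for a given object id:

```python
DNODE_SIZE = 0x200      # sizeof (dnode_phys_t)
BLKPTR_SIZE = 0x80      # sizeof (blkptr_t)

def dnode_blkptr_path(obj_id, nlevels, datablksz=0x4000, indblkshift=0xe):
    """Return the blkptr index to follow at each indirect level (top
    level first) to reach the level-0 block holding obj_id's dnode."""
    epbs = indblkshift - 7                  # log2(blkptrs per indirect block)
    blkid = obj_id // (datablksz // DNODE_SIZE)
    return [(blkid >> (epbs * (lvl - 1))) & ((1 << epbs) - 1)
            for lvl in range(nlevels - 1, 0, -1)]
```

For object id 9 with 7 levels (L6 down to L0), every index comes out 0, which is why following the first blkptr_t at each step of this walk is the right path.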

At this point we'll follow a sequence of decompressing and following the block pointers until we get to level 0 (the dnode_phys_t array for the objects in the (root) dataset).

# ./zuncompress -p 400 -l 4000 -o 5a200 /var/tmp/zfsfile > /tmp/l6_dnode
# mdb /tmp/l6_dnode
> 0::blkptr
DVA[0]=<0:59e00:400> DVA[1]=<0:1219e00:400>
[L5 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a4813c63a:3e6e4b21ab12:16d82a6aab7196:5da1fa71471b3a2
> $q
# ./zuncompress -p 400 -l 4000 -o 59e00 /var/tmp/zfsfile > /tmp/l5_dnode
# mdb /tmp/l5_dnode
> 0::blkptr
DVA[0]=<0:59a00:400> DVA[1]=<0:1219a00:400>
[L4 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a07a23ca3:3e2ff2ae47d2:16b9e360815f88:5d048405cb59ba5
> $q
# ./zuncompress -p 400 -l 4000 -o 59a00 /var/tmp/zfsfile > /tmp/l4_dnode
# mdb /tmp/l4_dnode
> 0::blkptr
DVA[0]=<0:59600:400> DVA[1]=<0:1219600:400>
[L3 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=594127027c:3d7854dc4336:1664aa2337fdfd:5b5d2ad4907d3f2
> $q
# ./zuncompress -p 400 -l 4000 -o 59600 /var/tmp/zfsfile > /tmp/l3_dnode
# mdb /tmp/l3_dnode
> 0::blkptr
DVA[0]=<0:59200:400> DVA[1]=<0:1219200:400>
[L2 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a6c9eaf90:3e918a332bce:16e93bc40842e1:5dfa7ee35affc19
> $q
# ./zuncompress -p 400 -l 4000 -o 59200 /var/tmp/zfsfile > /tmp/l2_dnode
# mdb /tmp/l2_dnode
> 0::blkptr
DVA[0]=<0:58e00:400> DVA[1]=<0:1218e00:400>
[L1 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=573ebf43bc:3c03ae3ccfbe:15cc559f914849:58921efeca0c341
> $q
# ./zuncompress -p 400 -l 4000 -o 58e00 /var/tmp/zfsfile > /tmp/l1_dnode
# mdb /tmp/l1_dnode
> 0::blkptr
DVA[0]=<0:58800:600> DVA[1]=<0:1218800:600>
[L0 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/600P birth=11L/11P fill=9
cksum=87a454f048:6092818ca2a1:2d0688f6b70082:104b023c565fb938
> $q
# ./zuncompress -p 600 -l 4000 -o 58800 /var/tmp/zfsfile > /tmp/dnodes
#

Now we're at level 0. This is an array of dnode_phys_t for files and directories in the root of the ZFS file system. Let's dump the array.

# mdb /tmp/dnodes
> 0,20::print -a -t zfs`dnode_phys_t
0 dnode_phys_t {
    0 uint8_t dn_type = 0
...

The second entry is the "master node" for the file system. Let's look at the blkptr_t.

> 240::blkptr
DVA[0]=<0:0:200> DVA[1]=<0:1200000:200>
[L0 MASTER_NODE] FLETCHER_4 LZJB LE contiguous unique double
size=400L/200P birth=4L/4P fill=1
cksum=a23da9de2:4526f62c71b:f0b3b5fb1f03:239ad9c427b988
> $q
#

This is another ZAP block. We'll decompress and take a look.

# ./zuncompress -p 200 -l 400 -o 0 /var/tmp/zfsfile > /tmp/master
# mdb /tmp/master
> 0/K
0:              8000000000000003
> 0::print -a -t zfs`mzap_phys_t
0 mzap_phys_t {
    0 uint64_t mz_block_type = 0x8000000000000003
    8 uint64_t mz_salt = 0x16d68b53
    10 uint64_t mz_normflags = 0
    18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    40 mzap_ent_phys_t [1] mz_chunk = [
        40 mzap_ent_phys_t {
            40 uint64_t mze_value = 0
            48 uint32_t mze_cd = 0
            4c uint16_t mze_pad = 0
            4e char [50] mze_name = [ "normalization" ]
        },
    ]
}
>

Let's look at additional entries in the ZAP object. We want the entry for "ROOT".

> .::print -a -t zfs`mzap_ent_phys_t
...
> .::print -a -t zfs`mzap_ent_phys_t
c0 mzap_ent_phys_t {
    c0 uint64_t mze_value = 0
    c8 uint32_t mze_cd = 0
    cc uint16_t mze_pad = 0
    ce char [50] mze_name = [ "casesensitivity" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
100 mzap_ent_phys_t {
    100 uint64_t mze_value = 0x5
    108 uint32_t mze_cd = 0
    10c uint16_t mze_pad = 0
    10e char [50] mze_name = [ "VERSION" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
140 mzap_ent_phys_t {
    140 uint64_t mze_value = 0x2
    148 uint32_t mze_cd = 0
    14c uint16_t mze_pad = 0
    14e char [50] mze_name = [ "SA_ATTRS" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
180 mzap_ent_phys_t {
    180 uint64_t mze_value = 0x3
    188 uint32_t mze_cd = 0
    18c uint16_t mze_pad = 0
    18e char [50] mze_name = [ "DELETE_QUEUE" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
1c0 mzap_ent_phys_t {
    1c0 uint64_t mze_value = 0x4
    1c8 uint32_t mze_cd = 0
    1cc uint16_t mze_pad = 0
    1ce char [50] mze_name = [ "ROOT" ]
}
> $q
#

The root directory for the file system is object id 4 (mze_value above). This is the 5th entry (counting from 0) in the array of dnode_phys_t for the file system. Let's take a look.

> ::status
debugging file '/tmp/dnodes' (object file)
> 4*200::print -a -t zfs`dnode_phys_t
800 dnode_phys_t {
    800 uint8_t dn_type = 0x14
...

Directories are ZAP objects. We'll dump the blkptr_t , decompress if necessary, and find the words file that we copied into the file system at the beginning of this post.

> 840::blkptr
DVA[0]=<0:58600:200> DVA[1]=<0:1218600:200>
[L0 DIRECTORY_CONTENTS] FLETCHER_4 OFF LE contiguous unique double
size=200L/200P birth=11L/11P fill=1
cksum=27626ee8e:109d3b9097a:395d35f5c237:8703c96b7bd4c
> $q
#

Notice that compression is turned off, and there are no indirect blocks ("L0").

# mdb /var/tmp/zfsfile
> 400000+58600::print -a -t zfs`mzap_phys_t
458600 mzap_phys_t {
    458600 uint64_t mz_block_type = 0x8000000000000003
    458608 uint64_t mz_salt = 0x16d68999
    458610 uint64_t mz_normflags = 0
    458618 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    458640 mzap_ent_phys_t [1] mz_chunk = [
        458640 mzap_ent_phys_t {
            458640 uint64_t mze_value = 0x8000000000000008
            458648 uint32_t mze_cd = 0
            45864c uint16_t mze_pad = 0
            45864e char [50] mze_name = [ "foo" ]
        },
    ]
}
> .::print -a -t zfs`mzap_ent_phys_t
458680 mzap_ent_phys_t {
    458680 uint64_t mze_value = 0x8000000000000009
    458688 uint32_t mze_cd = 0
    45868c uint16_t mze_pad = 0
    45868e char [50] mze_name = [ "words" ]
}
> $q
#

The "words" file is at object id 9. Let's look at that dnode_phys_t .

# mdb /tmp/dnodes
> 9*200::print -a -t zfs`dnode_phys_t
1200 dnode_phys_t {
    1200 uint8_t dn_type = 0x13
...

Let's look at the blkptr_t .

> 1240::blkptr
DVA[0]=<0:4fe00:400> DVA[1]=<0:120fe00:400>
[L1 PLAIN_FILE_CONTENTS] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=9L/9P fill=2
cksum=5d1e925d95:3ed351070323:16995992c8e96c:5b9701a2a4ef414
> $q
#

This is a single indirect block (L1 in the above output). That makes sense, as the size of the words file is ~256K. We'll decompress it and look at the resulting blkptr_ts.

# ./zuncompress -p 400 -l 4000 -o 4fe00 /var/tmp/zfsfile > /tmp/l1_file
# mdb /tmp/l1_file
> 0::blkptr
DVA[0]=<0:fe00:20000>
[L0 PLAIN_FILE_CONTENTS] FLETCHER_4 OFF LE contiguous unique single
size=20000L/20000P birth=9L/9P fill=1
cksum=2f6c9bcce37c:bd82a253b632bb1:acb0037ee619745c:5e7c6fc8adcedccd
> 80::blkptr
DVA[0]=<0:2fe00:20000>
[L0 PLAIN_FILE_CONTENTS] FLETCHER_4 OFF LE contiguous unique single
size=20000L/20000P birth=9L/9P fill=1
cksum=1bae53745c3f:9dc2421d31452d3:658d66823cf4fb0:11c158edbbfcc0f3
> 100::blkptr
> $q
#

Now we'll go to the location specified by these block pointers to get our data.

# mdb /var/tmp/zfsfile
> 400000+fe00,20000/c
0x40fe00:       10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
...

And there are the contents of the first 128KB of the file. The remainder of the file is in the block specified by the blkptr_t at offset 80 in the ::blkptr output.

If this were a binary file, it would be simple enough to use dd(1M) to seek to the correct location on the device and dump from there. For instance,

> (400000+fe00)%200=E
                8319
> 20000%200=E
                256
> $q
# dd if=/var/tmp/zfsfile iseek=8319 bs=512 count=256
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
...

That's a lot of work. Is there a way to just "see" all of this information? Yes: zdb(1M). But zdb is not interactive, and it does not work with destroyed pools (or pools that won't import). Also, I find that using mdb this way forces you to understand the on-disk format, which for me is much preferable to having it all done for me.

I mentioned at the beginning of this post that this will only work on illumos-based systems, i.e., systems with mdb. I cannot include Solaris 11 or newer because there is no way to build mdb without source code. But what if you are using ZFS on Linux?

You could upload your devices (or files) to Manta, along with the modified mdb, the zfs.so and rawzfs.so modules, and the zuncompress program. Then use mlogin to log into the Manta instance and try from there. I've included built copies of mdb, the modules, and zuncompress in the github repo. Note that I have not yet tried this, but it will likely be in a blog post in the next week or so.

Have fun!