                            HAMMER2 DESIGN DOCUMENT

                                Matthew Dillon
                            dillon@backplane.com

                               08-Dec-2018 (v6)
                               24-Jul-2017 (v5)
                               09-Jul-2016 (v4)
                               03-Apr-2015 (v3)
                               14-May-2013 (v2)
                               08-Feb-2012 (v1)

                        Current Status as of document date

* Filesystem Core               - operational
  - bulkfree                    - operational
  - Compression                 - operational
  - Snapshots                   - operational
  - Deduper                     - live operational, batch specced
  - Subhierarchy quotas         - scrapped (still possible on a limited basis)
  - Logical Encryption          - not specced yet
  - Copies                      - not specced yet
  - fsync bypass                - not specced yet
  - FS consistency              - operational

* Clustering core
  - Network msg core            - operational
  - Network blk device          - operational
  - Error handling              - under development
  - Quorum Protocol             - under development
  - Synchronization             - under development
  - Transaction replay          - not specced yet
  - Cache coherency             - not specced yet

                        Recent Document Changes

* Reorganized the feature list to indicate currently operational features
  first, moving the future features to another section (since they are
  taking so long to implement).

                        Current Features List

* Standard filesystem semantics with full hardlink and softlink support.
  64-bit hardlink count field.

* The topology is indexed with a dynamic radix tree rooted in several
  places: the super-root, the PFS root inode, and any inode boundary.
  Index keys are 64 bits.  Each element is referenced with a blockref
  structure (described below) that is capable of referencing a
  power-of-2 sized block.  The block size is currently capped at 64KB
  to play nice(r) with the buffer cache and SSDs.

  The dynamic radix tree pushes elements into new indirect blocks only
  when the current level fills up, and will delete empty indirect blocks
  when a level is cleaned out.

* Block-copy-on-write filesystem mechanism for both the main topology
  and for the freemap.
  Media-level block frees are deferred and flushes rotate between
  (up to) 4 volume headers (capped at 4 if the filesystem is > ~8GB).
  Recovery will choose the most recent fully-valid volume header and
  can thus work around failures which cause partial volume header
  writes.

  Modifications issue copy-on-write updates up to the volume root.

* Utilizes a fat blockref structure (128 bytes) which can store up to
  64 bytes (512 bits) of check code data for each referenced block.
  In the original implementation I had gone with 64-byte blockrefs,
  but I eventually decided that I wanted to support up to a 512-bit
  hash (which eats 64 bytes), so I bumped it up to 128 bytes.  This
  turned out to be fortuitous because it made it possible to store
  most directory entries directly in the blockref structure without
  having to reference a separate data block.

* 1KB 'fat' inode structure.  The inode structure directly embeds four
  blockrefs so small files and directories can be represented without
  requiring an indirect block to be allocated.  The inode structure can
  also overload the same space to store up to 512 bytes of direct
  file data (for files which are <= 512 bytes long).

  The super-root and PFS root inodes are directly represented in the
  topology, without the use of directory entries.  A combination of
  normal directory entries and separately-indexed inodes is implemented
  under each PFS.

  Normal filesystem inodes (other than inode 1) are indexed under the
  PFS root inode by their inode number.  Directory entries are indexed
  under the same PFS root by their filename hash.  Bit 63 is used to
  distinguish and partition the two.  Filename hash collisions are
  handled by incrementing reserved low bits in the filename hash code.

* Directory entries representing filenames that are less than 64 bytes
  long are directly stored AS blockrefs.
  This means that an inode representing a small directory can store up
  to 4 directory entries in the inode itself before resorting to
  indirect blocks, and then those indirect blocks themselves can
  directly embed up to 512 directory entries.  Directory entries with
  long filenames reference an indirect data block to hold the filename
  instead of directly embedding it.

  This results in *very* compact directories in terms of I/O bandwidth.
  Not as compact as e.g. UFS's variable-length directory entries, but
  still very good with a nominal 128 real bytes per directory entry.

  Because directory entries are represented using a dynamic radix tree
  via its blockrefs, directory entries can be randomly looked up
  without having to scan the whole directory.

* Multiple PFSs.  In HAMMER2, all PFSs are implemented the same way,
  with the kernel choosing a default PFS name for the mount if none is
  specified.  For example, "ROOT" is the default PFS name for a root
  mount.  You can create as many PFSs as you like and you can specify
  the PFS name in the mount command using the <device_path>@<pfs_name>
  notation.

* Snapshots are implemented as PFSs.  Due to the copy-on-write nature
  of the filesystem, taking a snapshot is a trivial operation requiring
  only a normal filesystem sync and a copy of the PFS root inode (1KB),
  and that's it.

  On the minus side, snapshots can complicate the bulkfree operation
  that is responsible for freeing up disk space.  It can take
  significantly longer when many snapshots are present.

* SNAPSHOTS ARE READ-WRITE.  You can mount any PFS read-write,
  including snapshots.  For example, you can revert to an earlier
  'root' that you made a snapshot of simply by changing what the system
  mounts as the root filesystem.

* Full filesystem coherency at both the radix tree level and the
  filesystem semantics level.
  This is true for all filesystem syncs, recovery after a crash, and
  snapshots.

  Filesystem syncs fully vfsync the buffer cache for the files that
  are part of the sync group, and keep track of dependencies to ensure
  that all inter-dependent inodes are flushed in the same sync group.
  Atomic filesystem ops such as write()s are guaranteed to remain
  atomic across a sync, snapshot, and crash.

* Flushes and syncs are almost entirely asynchronous and will run
  concurrently with frontend operations.  This feature is implemented
  by adding inodes to the sync group currently being flushed on-the-fly
  as new dependencies are created, and by reordering inodes in the sync
  queue to prioritize inodes which the frontend is stalled on.

  By reprioritizing inodes in the syncq, frontend stalls are minimized.

  The only synchronous disk operation is the final sync of the volume
  header which updates the ultimate root of the filesystem.  A disk
  flush command is issued synchronously, then the write of the volume
  header is issued synchronously.  All other writes to the disk,
  regardless of the complexity of the dependencies, occur
  asynchronously and can make very good use of high-speed I/O and SSD
  bandwidth.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the
  filesystem.

* Compression support.  Multiple algorithms are supported and can be
  configured on a subdirectory hierarchy or individual file basis.
  Block compression up to 64KB will be used.  Only compression ratios
  at powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc.) will
  work in this scheme because physical block allocations in HAMMER2
  are always power-of-2.
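  The power-of-2 rounding this scheme implies can be sketched in C.
  This is a minimal illustration only; the function name and the
  store-raw cutoff are assumptions, not the in-tree code:

```c
#include <stddef.h>

/*
 * Hypothetical helper: given the byte count a 64KB logical block
 * compressed down to, return the power-of-2 physical block size the
 * scheme above would allocate.  Ratios under 2:1 gain nothing in a
 * power-of-2 allocator, so such blocks are stored raw (64KB).
 */
static size_t
h2_comp_block_size(size_t comp_bytes)
{
	size_t psize = 1024;		/* minimum allocation is 1KB */

	while (psize < comp_bytes)
		psize <<= 1;		/* round up to a power of 2 */
	if (psize > 32768)
		psize = 65536;		/* < 2:1 ratio: store uncompressed */
	return psize;
}
```

  For example, a block that compresses to 16KB+1 bytes still occupies a
  32KB physical block, while one that only reaches 40000 bytes is
  stored raw.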
  Modest compression can be achieved with low overhead, is turned on
  by default, and is compatible with deduplication.

  Compression is extremely useful and often gives you anywhere from 25%
  to 400% more logical storage than you have physical blocks, depending
  on the data.  Of course, .tgz and other pre-compressed files cannot
  be compressed further by the filesystem.

  The usefulness should not be underestimated; our users are constantly
  surprised at the things the filesystem is able to compress that just
  make life a lot easier.  For example, 30GB core dumps tend to contain
  a great deal of highly compressible data.  Source trees, web files,
  executables, general data... this is why HAMMER2 turns modest
  compression on by default.  It just works.

* De-duplication support.  HAMMER2 uses a relatively simple freemap
  scheme that allows the filesystem to discard block references
  asynchronously.  The same scheme allows essentially unlimited
  references to the same data block in the hierarchy.  Thus, both live
  de-duplication and bulk de-duplication are relatively easy to
  implement.

  HAMMER2 currently implements only live de-duplication.  This means
  that typical situations such as copying files or whole directory
  hierarchies will naturally de-duplicate.  Simply reading filesystem
  data in makes it available for deduplication later.  HAMMER2 will
  index a potentially very large number of blocks in memory, even
  beyond what the buffer cache can hold, for deduplication purposes.

* Zero-fill detection on write (writing all-zeros), which requires the
  data buffer to be scanned, is fully supported.  This allows the
  writing of 0's to create holes.

  Generally speaking, pre-writing zeroed blocks to reserve space
  doesn't work well on copy-on-write filesystems.
  However, if both compression and check codes are disabled on a file,
  H2 will also disable zero-detection, allowing the file blocks to be
  pre-reserved (by actually zeroing them and reusing them later on),
  and allowing data overwrites to write to the same sector.  Please be
  aware that DISABLING THE CHECK CODE IN THIS MANNER ALSO MEANS THAT
  SNAPSHOTS WILL NOT WORK.  The snapshot will contain the latest data
  for the file and not the data as-of the snapshot.  This is NOT turned
  on by default in HAMMER2 and is not recommended except in special
  well-controlled circumstances.

* Multiple supporting kernel threads, breaking up frontend VOP
  operation from backend I/O, compression, and decompression operation.
  Buffer cache I/O and VOP ops message the backend.  Actual I/O is
  handled by the backend and not by the frontend, which will
  theoretically allow us to survive stalled devices and nodes when
  implementing multi-node support.

                            Pending Features
                         (not yet implemented)

* Constructing a filesystem across multiple nodes.  Each low-level H2
  device would be able to accommodate nodes belonging to multiple
  cluster components as well as nodes that are simply local to the
  device or machine.

  CURRENT STATUS: Not yet operational.

* Incremental synchronization via highest-transaction-id propagation
  within the radix tree.  This is a queueless, incremental design.

  CURRENT STATUS: Due to the flat inode hierarchy now being employed,
  the current synchronization code, which silently recurses indirect
  nodes, will be inefficient due to the fact that all the inodes are
  at the same logical level in the topology.  To fix this, the code
  will need to explicitly iterate indirect nodes and keep track of the
  related key ranges to match them up on an indirect-block basis, which
  would be incredibly efficient.
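  The key-range matching the status note calls for could key off the
  (key, keybits) pair each blockref carries.  A hedged sketch, with a
  hypothetical struct and names; each blockref is assumed to cover the
  key range [key, key + 2^keybits):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative key range covered by one indirect-block blockref. */
struct h2_krange {
	uint64_t key;		/* base key of the sub-tree */
	int	 keybits;	/* number of significant low key bits */
};

/*
 * Two indirect blocks on different nodes need to be compared only when
 * their key ranges overlap; disjoint ranges can be skipped entirely
 * instead of being blindly recursed.
 */
static bool
h2_krange_overlap(struct h2_krange a, struct h2_krange b)
{
	uint64_t aend = a.key + ((uint64_t)1 << a.keybits);
	uint64_t bend = b.key + ((uint64_t)1 << b.keybits);

	return (a.key < bend && b.key < aend);
}
```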
* Background synchronization and mirroring occurs at the logical layer
  rather than the physical layer.  This allows cluster components to
  have differing storage arrangements.

  In addition, this mechanism will fully correct any out-of-sync nodes
  in the cluster as long as a sufficient number of other nodes agree on
  what the proper state should be.

  CURRENT STATUS: Not yet operational.

* Encryption.  Whole-disk encryption is supported by another layer, but
  I intend to give H2 an encryption feature at the logical layer which
  works approximately as follows:

  - Encryption controlled by the client on an inode/sub-tree basis.
  - Server has no visibility to decrypted data.
  - Encrypt filenames in directory entries.  Since the filename[] array
    is 256 bytes wide, the client can add random bytes after the normal
    terminator to make it virtually impossible for an attacker to
    figure out the filename.
  - Encrypt file size and most inode contents.
  - Encrypt file data (holes are not encrypted).
  - Encryption occurs after compression, with random filler.
  - Check codes are calculated after encryption & compression (not
    before).

  - Blockrefs are not encrypted.
  - Directory and file topology is not encrypted.
  - Encryption does not provide sub-topology validation.  The client
    would have to keep track of that itself.  The server or other
    clients can still e.g. remove files, rename, etc.

  In particular, note that even though the file size field can be
  encrypted, the server does have visibility on the block topology and
  thus has a pretty good idea how big the file is.  However, a client
  could add junk blocks at the end of a file to make this less
  apparent, at the cost of space.

  If a client really wants a fully validated H2-encrypted space, the
  easiest solution is to format a filesystem within an encrypted file
  by treating it as a block device, but I digress.
  CURRENT STATUS: Not yet operational.

* Device ganging, copies for redundancy, and file splitting.

  Device ganging - The idea here is not to gang devices into a single
  physical volume but to instead format each device independently
  and allow crossover-references in the blockref to other devices in
  the set.

  One of the things we want to accomplish is to ensure that a failed
  device does not prevent access to radix tree elements in other
  devices in the gang, and that the failed device can be reconstructed.
  To do this, each device implements complete reachability from the
  node root to all elements underneath it.  When a device fails, the
  synchronization code can theoretically reconstruct the missing
  material from the other devices making up the gang.  New devices can
  be added to the gang and existing devices can be removed from the
  gang.

  Redundant copies - This is actually a fairly tough problem.  The
  solution I would like to implement is to use the device ganging
  feature to also implement redundancy, so that if a device fails
  within the gang there's a good chance that it can still remain
  completely functional without having to resynchronize.  But making
  this work is difficult, to say the least.

  CURRENT STATUS: Not yet operational.

* MESI cache coherency for multi-master/multi-client clustering
  operations.  The servers hosting the MASTERs are also responsible
  for keeping track of the cache state.

  This is a feature that we would need to implement coherent
  cross-machine multi-threading and migration.

  CURRENT STATUS: Not yet operational.

* Implement unverified de-duplication (where only the check code is
  tested, avoiding having to actually read data blocks to verify a
  de-duplication).  This would make use of the blockref structure's
  widest check field (512 bits).
  Out of necessity this type of feature would be settable on a file or
  recursive directory tree basis, but should only be used when the data
  is throw-away or can be reconstructed, since data corruption
  (mismatched duplicates with the same hash) is still possible even
  with a 512-bit check code.

  The unverified dedup feature is intended only for those files where
  occasional corruption is ok, such as in a web-crawler data store or
  other situations where the data content is not critically important
  or can be externally recovered if it becomes corrupt.

  CURRENT STATUS: Not yet operational.

                             GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the
filesystem, which is very different from HAMMER1's B-Tree design.
Because the design is copy-on-write it can be trivially snapshotted
simply by making a copy of the block table we desire to snapshot.
Snapshotting the root inode effectively snapshots the entire
filesystem, whereas snapshotting a file inode only snapshots that one
file.  Snapshotting a directory inode is generally unhelpful since it
only contains directory entries and the underlying files are not
arranged under it in the radix tree.

The copy-on-write design implements a block table as a radix tree,
with a small fan-out in the volume header and inode (typically 4x) and
a large fan-out for indirect blocks (typically 128x and 512x,
depending).  The table is built bottom-up.  Intermediate radii are only
created when necessary, so small files and directories will have a much
shallower radix tree.

HAMMER2 implements several space optimizations:

 1. Directory entries with filenames <= 64 bytes will fit entirely
    in the 128-byte blockref structure and do not require additional
    data block references.
    Since blockrefs are the core elements making up block tables, most
    directories should have good locality of reference for directory
    scans.

    Filenames > 64 bytes require a 1KB data-block reference, which is
    clearly less optimal, but very few filenames in a filesystem tend
    to be longer than 64 bytes so it works out.  This also simplifies
    the handling of large filenames, as we can allow filenames up to
    1023 bytes long with this mechanism with no major changes to the
    code.

 2. Inodes embed 4 blockrefs, so files up to 256KB and directories
    with up to four directory entries (not including "." or "..") can
    be accommodated without requiring any indirect blocks.

 3. Indirect blocks can be sized to any power of two up to 65536
    bytes, and H2 typically uses 16384 and 65536 bytes.  The smaller
    size is used for initial indirect blocks to reduce storage
    overhead for medium-sized files and directories.

 4. The file inode itself can directly hold the data for small files
    <= 512 bytes in size, overloading the space also used by its four
    128-byte blockrefs (which are not needed if the file is <= 512
    bytes in size).  This works out great for small files and
    directories.

 5. The last block in a file will have a storage allocation in powers
    of 2 from 1KB to 64KB as needed.  Thus a small file in excess of
    512 bytes but less than 64KB will not waste a full 64KB block.

 6. When compression is enabled, small physical blocks will be
    allocated when possible.  However, only reductions in powers of 2
    are supported.  So if a 64KB data block can be compressed to
    between (16KB+1) and 32KB, then a 32KB block will be used.  This
    gives H2 modest compression at very low cost without too much
    added complexity.

 7. Live de-dup will attempt to share data blocks when file copying is
    detected, significantly reducing actual physical writes to storage
    and the storage used.
    Bulk de-dup (when implemented) will catch other cases of
    de-duplication.

Directories contain directory entries which are indexed using a hash of
their filename.  The hash is carefully designed to maintain some
natural sort ordering.  The directory entries are implemented AS
blockrefs, so an inode can contain up to 4 before requiring an indirect
block, and each indirect block can contain up to 512 entries, with
further data block references required for any directory entry whose
filename is > 64 bytes.  Because the directory entries are blockrefs,
random access lookups are maximally efficient.  The directory hash is
designed to very loosely retain some alphanumeric sorting, to bundle
similarly-named files together and reduce random lookups.

The copy-on-write nature of the filesystem means that any modification
whatsoever will have to eventually synchronize new disk blocks all the
way to the super-root of the filesystem and then to the volume header
itself.  This forms the basis for crash recovery and also ensures that
recovery occurs on a completed high-level transaction boundary.  All
disk writes are to new blocks except for the volume header (which
cycles through 4 copies), thus allowing all writes to run
asynchronously and concurrently prior to and during a flush, and then
just doing a final synchronization and volume header update at the
end.  Many of HAMMER2's features are enabled by this core design
element.

The freemap is also implemented using a radix tree via a set of
pre-reserved blocks (approximately 4MB for every 2GB of storage), and
also cycles through multiple copies to ensure that crash recovery can
restore the state of the filesystem quickly at mount time.
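The hashed directory-entry keying described above (bit 63 partitioning
directory hashes from inode numbers, and collisions resolved by bumping
reserved low bits) can be sketched as follows.  This is a toy
illustration: the hash itself, the width of the reserved low bits, and
all names here are assumptions, not the in-tree code:

```c
#include <stdint.h>

#define H2_KEY_DIRHASH	((uint64_t)1 << 63)	/* bit 63: dirent hash vs inode number */
#define H2_HASH_LOWBITS	0xFFULL			/* reserved low bits (width assumed) */

/* Demo lookup predicate: pretend exactly one key is already present. */
static int
demo_key_exists(uint64_t key)
{
	return (key == ((0x1234500ULL & ~H2_HASH_LOWBITS) | H2_KEY_DIRHASH));
}

/*
 * Compute the radix key for a new directory entry: mask off the
 * reserved low bits of the filename hash, set the partition bit, then
 * increment the low bits until an unused key is found.
 */
static uint64_t
h2_dirent_key(uint64_t namehash, int (*exists)(uint64_t))
{
	uint64_t key = (namehash & ~H2_HASH_LOWBITS) | H2_KEY_DIRHASH;

	while (exists(key))
		key++;		/* bump reserved low bits on collision */
	return key;
}
```

Because the high bits of the key still come from the (roughly
alphanumeric-ordered) hash, colliding entries land adjacent to each
other in the radix tree.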
HAMMER2 tries to maintain a small footprint, and one way it does this
is by using the normal buffer cache for data and meta-data, allowing
the kernel to asynchronously flush device buffers at any time (even
during synchronization).  The volume root is flushed separately,
separated from the asynchronous flushes by a synchronizing
BUF_CMD_FLUSH op.  This means that HAMMER2 has very low resource
overhead from the point of view of the operating system, and is very
much unlike HAMMER1, which had to lock dirty buffers into memory for
long periods of time.  HAMMER2 has no such requirement.

Buffer cache overhead is very well bounded and can handle filesystem
operations of any complexity, even on boxes with very small amounts
of physical memory.  Buffer cache overhead is significantly lower with
H2 than with H1 (and orders of magnitude lower than with ZFS).

At some point I intend to implement a shortcut to make fsync()'s run
fast, and that is to allow deep updates to blockrefs to shortcut to
auxiliary space in the volume header to satisfy the fsync requirement.
The related blockref is then recorded when the filesystem is mounted
after a crash, and the update chain is reconstituted when a matching
blockref is encountered again during normal operation of the
filesystem.

                      FILESYSTEM SYNC SEQUENCING

HAMMER2 implements a filesystem sync mechanism that allows the frontend
to continue doing modifying operations concurrently with the sync.  The
general sync mechanism operates in four phases:

 1. Individual file and directory inodes are fsync()d to disk,
    updating the blockrefs in the parent block above the inode, and
    are then removed from the syncq.

    Once removed from the syncq, the frontend can do a modifying
    operation on these file and directory inodes without further
    affecting the filesystem sync.  These modifications will be
    flushed to disk on the next filesystem sync.
    To reduce frontend stall times, an inode being blocked on by the
    frontend which is on the syncq will be reordered to the front of
    the syncq to give the syncer a shot at it more quickly, in order
    to unstall the frontend ASAP.

    If a frontend operation creates an unavoidable dependency between
    an inode on the syncq and an inode not on the syncq, both inodes
    are placed on (or back onto) the syncq as needed to ensure
    filesystem consistency for the filesystem sync.  This can extend
    the filesystem sync time, but even under heavy loads syncs are
    still able to be retired.

 2. The PFS ROOT is fsync()d to storage along with the subhierarchy
    representing the inode index (whose inodes were flushed in (1)).
    This brings the block copy-on-write up to the root inode.

 3. The SUPER-ROOT inode is fsync()d to storage along with the
    subhierarchy representing the PFS ROOTs for the volume.

 4. Finally, a physical disk flush command is issued to the storage
    device, and then the volume header is written to disk.  All I/O
    prior to this step occurred asynchronously.  This is the only
    step which must occur synchronously.

                  MIRROR_TID, MODIFY_TID, UPDATE_TID

In HAMMER2, the core block reference is a 128-byte structure called a
blockref.  The blockref contains various bits of information including
the 64-bit radix key (typically a directory hash if a directory entry,
an inode number if a hidden hardlink target, or a file offset if a
file block), the number of significant key bits for ranged recursion
of indirect blocks, a 64-bit device seek that encodes the radix of the
physical block size in the low bits (the physical block size can be
different from the logical block size due to compression), three
64-bit transaction ids, type information, and up to 512 bits worth of
check data for the block being referenced, which can be anything from
a simple CRC to a strong cryptographic hash.
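The field inventory in the paragraph above can be collected into a
struct to show how the 128 bytes are spent.  This is an illustrative
layout only: the real on-media ordering, the exact reserved space, and
the field names differ in the hammer2 source.

```c
#include <stdint.h>

/*
 * Sketch of the 128-byte blockref, per the description above.
 * NOT the actual on-media layout -- the reserved[] padding stands in
 * for fields (check method, compression method, flags, ...) that the
 * text does not enumerate.
 */
struct h2_blockref_sketch {
	uint64_t key;		/* 64-bit radix key */
	uint8_t	 keybits;	/* significant key bits, ranged recursion */
	uint8_t	 type;		/* inode, indirect, data, dirent, ... */
	uint8_t	 reserved[22];	/* methods, flags, etc. (placeholder) */
	uint64_t data_off;	/* device seek, block-size radix in low bits */
	uint64_t mirror_tid;	/* media-centric flush transaction id */
	uint64_t modify_tid;	/* cluster-centric modification id */
	uint64_t update_tid;	/* cluster synchronization id */
	uint8_t	 check[64];	/* up to 512 bits of check code data */
};

_Static_assert(sizeof(struct h2_blockref_sketch) == 128,
	       "blockref must stay 128 bytes");
```

The 64-byte check[] area is what allows a directory entry (filename
< 64 bytes) to be embedded directly in the blockref instead.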
mirror_tid - This is a media-centric (as in physical disk partition)
        transaction id which tracks media-level updates.  The
        mirror_tid can be different at the same point on different
        nodes in a cluster.

        Whenever any block in the media topology is modified, its
        mirror_tid is updated with the flush id and will propagate
        upward during the flush all the way to the volume header.

        mirror_tid is monotonic.  It is primarily used for on-mount
        recovery and volume root validation.  The name is historical,
        from H1; it is not used for nominal mirroring.

modify_tid - This is a cluster-centric (as in across all the nodes
        used to build a cluster) transaction id which tracks
        filesystem-level updates.

        modify_tid is updated when the front-end of the filesystem
        makes a change to an inode or data block.  It does NOT
        propagate upward during a flush.

update_tid - This is a cluster synchronization transaction id.
        Modifications made to the topology will clear this field to 0
        as they propagate up to the root.  This gives the synchronizer
        an easy way to determine what needs revalidation.

        The synchronizer revalidates the cluster bottom-up by
        validating a sub-topology and propagating the highest
        modify_tid in the validated sub-topology up via the update_tid
        field.

        Updates to this field may be optimized by the HAMMER2 VFS to
        avoid the double-transition.

The synchronization code updates an out-of-sync node bottom-up and will
dynamically set update_tid as it goes, but media flushes can occur at
any time and these flushes will use mirror_tid for flush and freemap
management.  The mirror_tid for each flush propagates upward to the
volume header on each flush.  modify_tid is set for any chains modified
by a cluster op but does not propagate up, instead serving as a seed
for update_tid.
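The propagation rules for the three ids can be modeled on a
leaf-to-root path.  This is a minimal model of the rules above, not
the kernel code; the chain struct and function names are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* One element on a leaf-to-root path of the topology. */
struct h2_chain {
	struct h2_chain *parent;	/* NULL at the volume root */
	uint64_t mirror_tid;		/* media flush id: propagates up */
	uint64_t modify_tid;		/* cluster change id: does NOT propagate */
	uint64_t update_tid;		/* sync validation id: zeroed by changes */
};

static void
h2_frontend_modify(struct h2_chain *chain, uint64_t cluster_tid)
{
	chain->modify_tid = cluster_tid;	/* stays on modified chain */
	for (; chain != NULL; chain = chain->parent)
		chain->update_tid = 0;		/* flag path for revalidation */
}

static void
h2_media_flush(struct h2_chain *chain, uint64_t flush_tid)
{
	for (; chain != NULL; chain = chain->parent)
		chain->mirror_tid = flush_tid;	/* monotonic, reaches root */
}

/* Demo: a frontend write followed by a media flush on a 2-deep path. */
static int
h2_tid_demo(void)
{
	struct h2_chain root = { NULL, 0, 0, 5 };
	struct h2_chain leaf = { &root, 0, 0, 5 };

	h2_frontend_modify(&leaf, 100);
	h2_media_flush(&leaf, 7);

	return (leaf.modify_tid == 100 && root.modify_tid == 0 &&
		leaf.update_tid == 0 && root.update_tid == 0 &&
		root.mirror_tid == 7 && leaf.mirror_tid == 7);
}
```

After the write, update_tid is zeroed along the whole path while
modify_tid marks only the changed chain; the flush then stamps
mirror_tid all the way to the root.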
* The synchronization code is able to determine that a sub-tree is
  synchronized simply by observing the update_tid at the root of the
  sub-tree, on an inode-by-inode basis and also on a
  data-block-by-data-block basis.

* The synchronization code is able to do an incremental update of an
  out-of-sync node simply by skipping elements with a matching
  update_tid (when not 0).

* The synchronization code can be interrupted and restarted at any
  time, and is able to pick up where it left off with very low
  overhead.

* The synchronization code does not inhibit media flushes.  Media
  flushes can occur (and must occur) while synchronization is ongoing.

There are several other stored transaction ids in HAMMER2.  There is a
separate freemap_tid in the volume header that is used to allow freemap
flushes to be deferred, and inodes have a pfs_psnap_tid which is used
in conjunction with CHECK_NONE to allow blocks without a check code
which do not violate the most recent snapshot to be overwritten
in-place.

Remember that since this is a copy-on-write filesystem, we can
propagate a considerable amount of information up the tree to the
volume header without adding to the I/O we already have to do.

                       DIRECTORIES AND INODES

Directories are hashed.  In HAMMER2, the PFS ROOT directory (aka inode
1 for a PFS) can contain a mix of directory entries AND embedded
inodes.  This was actually a design mistake, so the code to deal with
the index of inodes vs the directory entries is slightly convoluted
(but not too bad).

In the first iteration of HAMMER2 I tried really hard to embed actual
inodes AS the directory entries, but it created a mass of problems for
implementing NFS export support and dealing with hardlinks, so in a
later iteration I implemented small independent directory entries
(which wound up mostly fitting in the blockref structure, so WIN WIN!).
However, 'embedded' inodes AS the directory entries still survive for
the SUPER-ROOT and the PFS-ROOTs under the SUPER-ROOT.  They just
aren't used in the individual filesystem that each PFS represents.

Hardlinks are now implemented normally, with multiple directory entries
referencing the same inode and that inode containing an nlinks count.

                              RECOVERY

H2 allows freemap flushes to lag behind topology flushes.  This
improves filesystem sync performance.  The freemap flush tracks a
separate transaction id (via mirror_tid) in the volume header.

On mount, HAMMER2 will first locate the highest-sequenced
check-code-validated volume header from the 4 copies available (if the
filesystem is big enough, e.g. > ~8GB or so, there will be 4 copies of
the volume header).

HAMMER2 will then run an incremental scan of the topology for
mirror_tid transaction ids between the last freemap flush tid and the
last topology flush tid in order to synchronize the freemap.  Because
this scan is incremental, the time it takes to run will be relatively
short and well-bounded at mount-time.  This is NOT an fsck.  Freemap
flushes can be avoided for any number of normal topology flushes but
should still occur frequently enough to avoid long recovery times in
case of a crash.

The filesystem is then ready for use.

                        DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution.  Each 2MB segment
managed by the freemap is zoned and has a tendency to collect inodes,
small data, indirect blocks, and larger data blocks into separate
segments.  The idea is to greatly improve I/O performance
(particularly by laying inodes down next to each other, which has a
huge effect on directory scans).

The current implementation of HAMMER2 implements a fixed 64KB physical
block size in order to allow the mapping of hammer2_dio's in its I/O
subsystem to consumers that might desire different sizes.
This way we don't have to worry about matching the buffer cache / DIO
cache to the variable block size of underlying elements.  In addition,
64KB I/Os allow compatibility with physical sector sizes up to 64KB in
the underlying physical storage with no change in the byte-by-byte
format of the filesystem.  The DIO layer also prevents ordering
deadlocks between unrelated portions of the filesystem hierarchy whose
logical blocks wind up in the same physical block.

The biggest issue we are avoiding by having a fixed 64KB I/O size is
not a nominal front-end access issue, but rather the complexity of
having to deal with mixed block sizes in the buffer cache,
particularly when blocks are freed and then later reused with a
different block size.  HAMMER1 had to have specialized code to check
for and invalidate buffer cache buffers in the free/reuse case.
HAMMER2 does not need such code.

That said, HAMMER2 places no major restrictions on mixing logical
block sizes within a 64KB block.  The only restriction is that a
logical HAMMER2 block cannot cross a 64KB boundary.  The soft
restrictions the block allocator puts in place exist primarily for
performance reasons (i.e. to try to collect 1K inodes together).  The
2MB freemap zone granularity should work very well in this regard.

HAMMER2 also utilizes OS support for ganging 64KB buffers together
into even larger blocks for I/O (OS buffer cache 'clustering'),
OS-supported read-ahead, OS-driven asynchronous retirement, and other
performance features typically provided by the OS at the block level
to ensure smooth system operation.

By avoiding wiring buffers/memory and allowing the OS's buffer cache
to run normally, HAMMER2 winds up with very low OS overhead.
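The no-straddle restriction above reduces to a one-line check
(hypothetical helper name; offsets and sizes are assumed to be
power-of-2 as the allocator guarantees):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * A logical block (power-of-2 size, 1KB..64KB) may sit anywhere inside
 * a 64KB device I/O block but may not straddle one: the first and last
 * byte must fall in the same 64KB-aligned window.
 */
static bool
h2_block_fits(uint64_t data_off, uint64_t bytes)
{
	return ((data_off / 65536) == ((data_off + bytes - 1) / 65536));
}
```

Note that power-of-2 sized allocations placed at naturally aligned
offsets satisfy this automatically, so in practice the allocator never
produces a straddling block.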

                        FREEMAP NOTES

The freemap is stored in the reserved blocks situated in the ~4MB reserved
area at the base of every ~1GB level-1 zone of physical storage. The current
implementation reserves 8 copies of every freemap block and cycles through
them in order to make the freemap operate in a copy-on-write fashion.

    - Freemap is copy-on-write.
    - Freemap operations are transactional, same as everything else.
    - All backup volume headers are consistent on-mount.

The Freemap is organized using the same radix blockmap algorithm used for
files and directories, but with fixed radix values. For a maximally-sized
filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
but the top level is embedded in the volume header, so insofar as performance
goes it is really just a 4-level blockmap.

The freemap radix allocation mechanism is also the same, meaning that it is
bottom-up and will not allocate unnecessary intermediate levels for smaller
filesystems. The number of blockmap levels, not including the volume header,
for various filesystem sizes is as follows:

        up-to           #of freemap levels
        1GB             1-level
        256GB           2-level
        64TB            3-level
        16PB            4-level
        4EB             5-level
        16EB            6-level

The Freemap has bitmap granularity down to 16KB and a linear iterator that
can linearly allocate space down to 1KB. Due to fragmentation it is possible
for the linear allocator to become marginalized, but it is relatively easy
to reallocate small blocks every once in a while (like once a year if you
care at all) and, once the old data cycles out of the snapshots or you also
rewrite the snapshots (which you can do), the freemap should wind up
relatively optimal again. Generally speaking I believe that algorithms can
be developed to make this a non-problem without requiring any media structure
changes.
However, touching all the freemaps will replicate meta-data whereas
the meta-data was mostly shared in the original snapshot. So this is a
problem that needs solving in HAMMER2.

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations. All the freemap does is
keep track of 100% free blocks, plus some extra bits for staging the bulkfree
scan. The lack of ref-counting makes it possible to:

    - Completely trivialize HAMMER2's snapshot operations.
    - Completely trivialize HAMMER2's de-dup operations.
    - Allow any volume header backup to be used trivially.
    - Allow whole sub-trees to be destroyed without having to scan them.
      Deleting PFSs and snapshots is instant (though space recovery still
      requires two bulkfree scans).
    - Simplify normal crash recovery operations by not having to reconcile
      a ref-count.
    - Simplify catastrophic recovery operations for the same reason.

Normal crash recovery is simply a matter of doing an incremental scan
of the topology between the last flushed freemap TID and the last flushed
topology TID. This usually takes only a few seconds and allows:

    - Freemap flushes to be deferred for any number of topology flush
      cycles (with some care to ensure that all four volume headers
      remain valid).
    - The freemap to not be flushed for fsync, reducing fsync overhead.

                        FREEMAP - BULKFREE

Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
Blocks are first marked as being possibly free and then finalized in the
second scan. Live filesystem operations are allowed to run during these
scans, and any freemap block that is allocated or adjusted after the first
scan will simply be re-marked as allocated, and the second scan will not
transition it to being free.
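
The two-stage transition described above can be sketched abstractly as
follows. This is an illustrative sketch only; the state names and the
flat list are hypothetical and do not reflect HAMMER2's actual per-zone
bitmap encoding:

```python
# Three states per block: allocated, possibly-free (stage-1 candidate),
# and free.  Live allocations between the two scans veto the free.
ALLOCATED, POSSIBLY_FREE, FREE = "alloc", "maybe", "free"

class Freemap:
    def __init__(self, nblocks):
        self.state = [ALLOCATED] * nblocks

    def allocate(self, blk):
        # live filesystem activity re-marks the block as allocated,
        # which prevents scan2 from transitioning it to free
        self.state[blk] = ALLOCATED

    def scan1(self, reachable):
        # stage 1: blocks not reachable from the meta-data topology
        # become candidates only
        for blk in range(len(self.state)):
            if blk not in reachable and self.state[blk] == ALLOCATED:
                self.state[blk] = POSSIBLY_FREE

    def scan2(self):
        # stage 2: only blocks still marked possibly-free are freed
        for blk in range(len(self.state)):
            if self.state[blk] == POSSIBLY_FREE:
                self.state[blk] = FREE
```

For example, if block 2 is allocated by a live operation between the two
scans, scan2 leaves it allocated even though scan1 had marked it as a
candidate; only blocks that stayed unreferenced across both scans become
free.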

The cost of not doing ref-count tracking is that HAMMER2 must perform two
bulkfree scans of the meta-data to determine which blocks can actually be
freed. This can be complicated by the volume header backups and snapshots,
which cause the same meta-data topology to be scanned over and over again,
but is mitigated somewhat by keeping a cache of higher-level nodes to detect
when we would scan a sub-topology that we have already scanned. Due to the
copy-on-write nature of the filesystem, such detection is easy to implement.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost). In other words, it's an area where
improvements can be made without any media format changes.

Another advantage of operating the freemap like this is that some future
version of HAMMER2 might decide to completely change how the freemap works
and would be able to make the change with relatively low downtime.

                        CLUSTERING

Clustering, as always, is the most difficult bit, but we have some advantages
with HAMMER2 that we did not have with HAMMER1. First, HAMMER2's media
structures generally follow the kernel's filesystem hierarchy, which allows
cluster operations to use topology cache and lock state. Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster. This is typically local media. It can even be a ram-disk
that only contains placemarkers that help HAMMER2 connect to a fully
networked cluster.

With HAMMER2 you mount a directory entry under the super-root.
This entry
will contain a cluster identifier that helps HAMMER2 identify and integrate
with the nodes making up the cluster. HAMMER2 will automatically integrate
*all* entries under the super-root when you mount one of them. You have to
mount at least one for HAMMER2 to integrate the block device in the larger
cluster.

For cluster servers, every HAMMER2-formatted partition has a "LOCAL" MASTER
which can be mounted in order to make the rest of the elements under the
super-root available to the network. (In a prior specification I emplaced
the cluster connections in the volume header's configuration space, but I no
longer do that.)

Connecting to the wider networked cluster involves setting up the /etc/hammer2
directory with appropriate IP addresses and keys. The user-mode hammer2
service daemon maintains the connections and performs graph operations
via libdmsg.

Node types within the cluster:

    DUMMY       - Used as a local placeholder (typically in ramdisk).
    CACHE       - Used as a local placeholder and cache (typically on an SSD).
    SLAVE       - A SLAVE in the cluster, can source data on quorum agreement.
    MASTER      - A MASTER in the cluster, can source and sink data on quorum
                  agreement.
    SOFT_SLAVE  - A SLAVE in the cluster, can source data locally without
                  quorum agreement (must be directly mounted).
    SOFT_MASTER - A local MASTER but *not* a MASTER in the cluster. Can source
                  and sink data locally without quorum agreement; intended to
                  be synchronized with the real MASTERs when connectivity
                  allows. Operations are not coherent with the real MASTERs
                  even when they are available.

    NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
          SLAVE. A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
          synchronized against current masters.

    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
          by giving it a unique cluster id, taking it out of the cluster that
          originally spawned it.

There are four major protocols:

    Quorum protocol

        This protocol is used between MASTER nodes to vote on operations
        and resolve deadlocks.

        This protocol is used between SOFT_MASTER nodes in a sub-cluster
        to vote on operations, resolve deadlocks, determine what the latest
        transaction id for an element is, and to perform commits.

    Cache sub-protocol

        This is the MESI sub-protocol which runs under the Quorum
        protocol. This protocol is used to maintain cache state for
        sub-trees to ensure that operations remain cache coherent.

        Depending on administrative rights, this protocol may or may
        not allow a leaf node in the cluster to hold a cache element
        indefinitely. The administrative controller may preemptively
        downgrade a leaf with insufficient administrative rights
        without giving it a chance to synchronize any modified state
        back to the cluster.

    Proxy protocol

        The Quorum and Cache protocols only operate between MASTER
        and SOFT_MASTER nodes. All other node types must use the
        Proxy protocol to perform similar actions. This protocol
        differs in that proxy requests are typically sent to just
        one adjacent node, and that node then maintains state and
        forwards the request or performs the required operation.
        When the link to the proxy is lost, the proxy automatically
        forwards a deletion of the state to the other nodes based on
        what it has recorded.

        If a leaf has insufficient administrative rights it may not
        be allowed to actually initiate a quorum operation and may only
        be allowed to maintain partial MESI cache state, or perhaps none
        at all (since cache state can block other machines in the
        cluster).
        Instead, a leaf with insufficient rights will have to
        make do with a preemptive loss of cache state, and any allowed
        modifying operations will have to be forwarded to the proxy, which
        continues forwarding them until a node with sufficient administrative
        rights is encountered.

        To reduce issues and give the cluster more breathing room,
        sub-clusters made up of SOFT_MASTERs can be formed in order to
        provide full cache coherency within a subset of machines and yet
        still tie them into a greater cluster that they normally would not
        have such access to. This effectively makes it possible to create
        a two- or three-tier fan-out of groups of machines which are
        cache-coherent within the group, but perhaps not between groups,
        and to use other means to synchronize between the groups.

    Media protocol

        This is basically the physical media protocol.

                        MASTER & SLAVE SYNCHRONIZATION

With HAMMER2 I really want to be hard-nosed about the consistency of the
filesystem, including the consistency of SLAVEs (snapshots, etc). In order
to guarantee consistency we take advantage of the copy-on-write nature of
the filesystem by forking consistent nodes and using the forked copy as the
source for synchronization.

Similarly, the target for synchronization is not updated on the fly but
instead is also forked, and the forked copy is updated. When synchronization
is complete, forked sources can be thrown away and forked copies can replace
the original synchronization target.

This may seem complex, but 'forking a copy' is actually a virtually free
operation. The top-level inode (under the super-root), on-media, is simply
copied to a new inode and poof, we have an unchanging snapshot to work with.

    - Making a snapshot is fast... almost instantaneous.

    - Snapshots are used for various purposes, including synchronization
      of out-of-date nodes.

    - A snapshot can be converted into a MASTER or some other PFS type.

    - A snapshot can be forked off from its parent cluster entirely and
      turned into its own writable filesystem, either as a single MASTER,
      or this can be done across the cluster by forking a quorum+ of
      existing MASTERs and transferring them all to a new cluster id.

More complex is reintegrating the target once the synchronization is complete.
For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
However, if the SLAVE is mounted and not optioned as a static mount (that is,
the mounter wants to see updates as they are synchronized), a reconciliation
must occur on the live mount to clean up the vnode, inode, and chain caches
and shift any remaining vnodes over to the updated copy.

    - A mounted SLAVE can track updates made to the SLAVE, but the
      actual mechanism is that the SLAVE PFS is replaced with an
      updated copy, typically every 30-60 seconds.

Reintegrating a MASTER which has fallen out of the quorum due to being out
of date is also somewhat more complex. The same updating mechanic is used;
we actually have to throw the 'old' MASTER away once the new one has been
updated. However, if the cluster is undergoing heavy modifications the
updated MASTER will be out of date almost the instant its source is
snapshotted. Reintegrating a MASTER thus requires a somewhat more complex
interaction.

    - If a MASTER is really out of date we can run one or more
      synchronization passes concurrent with modifying operations.
      The quorum can remain live.

    - A final synchronization pass is required with quorum operations
      blocked to reintegrate the now up-to-date MASTER into the cluster.

                        QUORUM OPERATIONS

Quorum operations can be broken down into HARD BLOCK operations and NETWORK
operations.
If your MASTERs are all local mounts, then failures and
sequencing are easy to deal with.

Quorum operations on a networked cluster are more complex. The problems:

    - Masters cannot rely on clients to moderate quorum transactions.
      Apart from the reliance being unsafe, the client could also
      lose contact with one or more masters during the transaction and
      leave one or more masters out-of-sync without the master(s) knowing
      they are out of sync.

    - When many clients are present, we do not want a flaky network
      link from one to cause one or more masters to go out of
      synchronization and potentially stall the whole works.

    - Normal hammer2 mounts allow a virtually unlimited number of modifying
      transactions between actual flushes. The media flush rolls everything
      up into a single transaction id per flush. Detection of 'missing'
      transactions in a concurrent multi-client setup when one or more
      clients temporarily lose connectivity is thus difficult.

    - Clients have a limited amount of time to reconnect to a cluster after
      a network disconnect before their MESI cache states are lost.

    - Clients may proceed with several transactions before knowing for sure
      that earlier transactions were completely successful. Performance is
      important; we won't be waiting for a full quorum-verified synchronous
      flush to media before allowing a system call to return.

    - Masters can likewise decide that a client's MESI cache states were
      lost (i.e. that the transaction was too slow).

The solutions (for modifying transactions):

    - Masters handle quorum confirmation amongst themselves and do not rely
      on the client for that purpose.

    - A client can connect to one or more masters regardless of the size of
      the quorum and can submit modifying operations to a single master if
      desired. The master will take care of the rest.

      A client must still validate the quorum (and obtain MESI cache states)
      when doing read-only operations in order to present the correct data
      to the user process for the VOP.

    - Masters will run a 2-phase commit amongst themselves, often
      concurrently with other non-conflicting transactions, and will
      serialize operations and/or enforce synchronization points for
      2-phase completion on serialized transactions from the same client
      or when cache state ownership is shifted from one client to another.

    - Clients will usually allow operations to run asynchronously and return
      from system calls more or less ASAP once they own the necessary cache
      coherency locks. The client can select the validation mode to wait for
      with mount options:

        (1) Fully async                 (mount -o async)
        (2) Wait for phase-1 ack        (mount)
        (3) Wait for phase-2 ack        (mount -o sync) (fsync - wait p2ack)
        (4) Wait for flush              (mount -o sync) (fsync - wait flush)

      Modifying system calls cannot be told to wait for a full media
      flush, as full media flushes are prohibitively expensive. You
      still have to fsync().

      The fsync wait mode for network links can be selected, either to
      return after the phase-2 ack or to return after the media flush.
      The default is to wait for the phase-2 ack, which at least guarantees
      that a network failure after that point will not disrupt operations
      issued before the fsync.

    - Clients must adjust the chain state for modifying operations prior to
      releasing chain locks / returning from the system call, even if the
      masters have not finished the transaction. A late failure by the
      cluster will result in desynchronized state which requires erroring
      out the whole filesystem or resynchronizing somehow.

    - Clients can opt to keep a record of transactions through the phase-2
      ack or the actual media flush on the masters.

      However, replaying/revalidating the log cannot necessarily guarantee
      success. If the masters lose synchronization due to network issues
      between masters (or if the client was mounted fully-async), or if
      enough masters crash simultaneously such that a quorum fails to flush
      even after the phase-2 ack, then it is possible that by the time a
      client is able to replay/revalidate, some other client has squeezed
      in and committed something that would conflict.

      If the client crashes, it works similarly to a crash with a local
      storage mount... many dirty buffers might be lost. And the same
      happens in the cluster case.

                        TRANSACTION LOG

Keeping a short-term transaction log, much less being able to properly replay
it, is fraught with difficulty and I've made it a separate development task.
For now HAMMER2 does not have one.
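
The quorum voting and 2-phase commit flow described under QUORUM
OPERATIONS can be sketched roughly as follows. This is an illustrative
sketch only; the Master class, method names, and ack handling are
hypothetical and do not reflect the actual libdmsg-based implementation:

```python
class Master:
    """Toy stand-in for a cluster MASTER node."""
    def __init__(self, up=True):
        self.up = up              # reachable over the network?
        self.committed = []
    def prepare(self, txn):       # phase 1: ack if reachable
        return self.up
    def commit(self, txn):        # phase 2: apply the transaction
        if self.up:
            self.committed.append(txn)
    def abort(self, txn):
        pass

def quorum(n_masters):
    """Majority needed for agreement among n_masters."""
    return n_masters // 2 + 1

def two_phase_commit(masters, txn):
    """Phase 1 collects prepare-acks; phase 2 commits only if a quorum
    of masters acknowledged phase 1.  A master that misses phase 2 is
    later reintegrated via the snapshot-based synchronization mechanism
    described under MASTER & SLAVE SYNCHRONIZATION."""
    acks = [m.prepare(txn) for m in masters]
    if sum(acks) < quorum(len(masters)):
        for m in masters:
            m.abort(txn)
        return False
    for m in masters:
        m.commit(txn)   # phase-2 ack point (what 'mount -o sync' waits on)
    return True
```

With three masters, a quorum is two, so the transaction still commits
when one master is unreachable; the down master is resynchronized later
rather than blocking the cluster.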