HAMMER2 PRELIMINARY DESIGN DOCUMENT Matthew Dillon 02 May 2011 - Version 2.0 11 May 2011 - Version 2.1 (c)Copyright 2011, Matthew Dillon, All Rights Reserved The HAMMER2 design is essentially a complete rewrite of HAMMER1 based on everything I learned from doing V1, with an eye towards a fully cache coherent logical clustering capability. V2 has the following major goals: * Cluster friendly layout and support. HAMMER1 only got half way there and it wasn't enough. In particular, the use of logical clustering as a means of providing RAID-like redundancy, supporting what is effectively multi-copies and quorum protocol in a combination local & networked storage environment. * Real-time and semi-real-time non-queued mirroring. HAMMER1 had this. * A logical layout that can be easily reblocked for performance. HAMMER1's layout could be reblocked but performance for random lookups was elusive, mainly due to the difficulty of managing inode numbers verses directory scans in long-lived directory sub-hierarchies. * Infinitely recursive pseudo-filesystem support. PFSs are used for NFS, mirror & backup management, etc. HAMMER1 had PFSs but they weren't entirely transparent and they weren't recursive. Solves the 'ROOT is a PFS' problem in HAMMER2 and then some. HAMMER2 PFSs can give you block & inode usage totals simply by virtue of the fact that such totals are available for any sub-directory hierarchy in a HAMMER2 filesystem. HAMMER2 PFSs can be used to create writable mount points based on snapshots. Unlike HAMMER1, HAMMER2 PFSs are built into the filesystem without the use of a special softlink. A hammer2 directive is available to create these babies. * Writable snapshots & unionfs equivalents. A writable snapshot can be taken which makes unionfs-style operations almost trivial. Basically you just take a normal read-only snapshot and then you create a writable mount point on top of that using the PFS mechanic. Poof, it just works. * Automatic fine-grained snapshots, high level snapshot management. HAMMER v1 had this. * Built-in directory/sub-tree based quotas, block, and inode accounting (these aren't uid/gid based). HAMMER v1 did not have this. This also means that the total size of any subdirectory hierarchy, particularly in the case of a snapshot, is immediately available. Overprovisioning is allowed if desired. A large drive can be effectively cut-up with this mechanism. * Chained CRCs from ROOT gives us a full-filesystem consistency check. These are small cryptographic hashes (smaller than what ZFS uses) designed for integrity verification and automatic healing via alternative paths. Unlike ZFS HAMMER2's CRCs are not designed for blind data validation. * HAMMER2 will support live-rebuild on disk or CRC failure and is capable of using any other mirror master or slave (or several together) as the data source, including off-site mirrors. The filesystem will remain live through through the process. * HAMMER2 implements a root directory which is ABOVE the nominal mount point for the filesystem. That is, the nominal mount point is typically a file inside this directory instead of the directory itself. This feature can be replicated for any subdirectory, where the parent holds multiple snapshots of said directory. There is no global snapshot table per-say. This makes it possible to trivially construct and maintain multiple mirroring domains within any subdirectory structure. For example you can construct a HAMMER2 filesystem which holds multiple roots and then mount the desired one based on a boot menu item, and you can work within these roots as if they were the root of the whole filesystem (even though they are not). HAMMER2 TRADEOFFS * HAMMER2's hardlinks are very expensive, if a file is hardlinked more than a few times it isn't going to be a pretty sight. This is partly by design, but also simply because there is no way around the issue if we wish to implement fine-grained cache coherency in a clustered filesystem setup. The initial work will probably not even *have* hardlink support. * HAMMER2 has no 'atime' support. atime is simply too expensive to support due to the way blocks are updated and I do not want to add hacks as I did in HAMMER1 to support atime updates without having to roll new copies. The OS is expected to implement a devfs for the device space (the only space where atime is really used). * HAMMER2 utilizes block-level snapshot and history mechanics. HAMMER1's snapshots and history utilized the B-Tree and had lower overheads. However, HAMMER2's snapshots are still going to be very fast. IMPLEMENTATION DETAILS (I) General directory and file topology HAMMER2 implements files and directories as arrays of blocks, with recursive indirect blocks, similar to conventional filesytems with some caveats. These blocks represent an abstracted block size for recursion or data leafs and each actual block is variable-sized up to the abstracted maximum. The initial release utilizes a 64K block size with variability down to 64 bytes for the last block in the file. The directory layout uses the layout slightly differently. Each 64K 'block' in a directory represents a variable-sized directory entry (up to 64K) containing the filename and an embedded inode. Blocks are laid out using an extended hash code and recurse through an indirect block table when required to expand the size of the directory. The inode itself can also contain direct data so it is possible to entirely embed a small file in its directory entry. It is also theoretically possible for a directory inode to embed severy directory entries (each with its own embedded inode) using this mechanism, for highly compressed sparse directory layouts, but for the moment we do not attempt to do this. The directory layout is a radical change relative to conventional filesystems. Though some in the past have used the location of the directory entry as the inode number what HAMMER2 does is significantly more involved. Inodes are embedded in directories for several reasons. (1) First and foremost the cache coherency model relies on transaction ids propagating to the root inode, which means that any modification to a file or its inode MUST propagate up the directory tree anyway (and all related trees in the hardlink case if hardlinks were to be supported). (2) The cache coherency model can collapse its memory representation trivially as well, allowing it to use a bounded amount of memory. (3) Writeable snapshots can be trivially implemented when the inode is embedded in the directory entry. (4) This removes the need to index by inode number in the normal case, assuming a reasonable non-disconnective namecache topology (which is how DragonFly's namecache works). However, NFS servers may still require such an index. (5) Read-only snapshots will replicate the inode & inode number while writable snapshots allocate a new inode number and use replicative references for the remainder in the initial snapshot (a shared ref, NOT copy-on-write). (II) 64K block I/O size, 16/64K logical filesystem buffer size. HAMMER1 used mixed buffer cache block sizes. HAMMER2 uses a 64K block size for all block device operations and mixed 16K/64K (based on offset) for all logical file operations. The actual allocation on-media may be smaller (down to 64 bytes). (III) Fat inodes HAMMER2 implements fairly fat 256-byte inodes. Each inode contains 128 bytes of inode related information and 128 bytes worth of block table or embedded direct data. Inodes are embedded in the directory entry structure with up to 16 characters of the filename built into the inode structure. Longer filenames wind up using 512-byte directory entries. The filename is not really logically part of the inode even though it is partially stored within it. A block table reference is made up of two 64-bit quantities, a CRC and a byte offset into the block device. Because the data is CRCd HAMMER2 does not support large data extents. The low 6 bits of the 64-bit block offset encodes a base-2 exponent with the number of bytes represented from 64 bytes to 2^63 bytes. The remainig bits (with the low 6 bits masked to 0) represent the block device byte offset. Exponents from 2^6 to 2^16 represent leaf data while any exponent greater than that represents an indirect block table. The inode can embed 128 bytes of direct data or up to 64K x 8 (524288) bytes using direct blocks. Indirect blocks must be used beyond that point. Data at the file EOF can be represented in powers of 2 from 64 bytes to 64KB. The filesystem supports block devices up to 2^63-1 bytes, which is around 16 Exabytes (16384 petabytes). Remember that inodes are directly embedded in directory entries, and the data for a small file (up to 128 bytes) is directly embedded in the inode. This makes HAMMER2 accesses for small files extremely efficient. (IV) Hardlinks - the only really horrible tradeoff Hardlinks are HAMMER2's most difficult problem to overcome, and I've decidde not to even try to solve it for the initial work. There are a multitude of reasons why hardlinks are difficult: * In order for snapshots (and writable snapshots) to work properly we really can't have parent pointers embedded in media structures. HAMMER2's meta-data topology has to be strictly top-down with the OS caching the intermediate path elements reaching any cached inode or open file. * The clustered cache coherency model is extremely difficult to manage with hardlinks, or at least hardlinks that are not concentrated in a single directory (for any given file). * Hardlinks will be expensive no matter how we twist it due to the need to traverse upward from all hardlinks related to a hardware to update cache state. * Quotas operate on any directory sub-tree, with definitive information available at any directory node for its entire sub-tree. * And lots of other things depend on coherent directory sub-trees, such as mirroring. (V) Physical block allocation Modifications to blocks are always allocate-on-write, recursing upward all the way to the root of the filesystem. The allocation status for a filesystem block is abstracted using a top-down perspective based on the root (the root-above-the-root) for the filesystem, which is stored in the volume header. Any block not indirectly referenceable from a top-down perspective, in abstract, is considered free and available for reallocation. -- The block allocation topology mimics a file block map but is rooted in the volume header instead of a file inode. In addition, because the topology uses algorithmically calculates blocks the normal block pointer found in a allocmap (data_offset) is used to hold a free-space count value instead. The flags field specifies early-termination and indicates whether we indirect to the primary or the alternative block. struct hammer2_allocmap_node { uint32_t icrc; /* icrc of target buffer */ uint16_t flags; /* termination, other flags */ uint8_t big_hint[2]; /* log2(biggest_alloc) */ uint64_t free_bytes; /* aggregate free block count */ }; struct hammer2_allocmap_leaf { uint32_t append_offset; /* fine-grained allocations */ uint16_t flags; /* collision, other flags */ uint8_t big_hint[2]; /* log2(biggest_alloc) */ uint64_t mask; /* pruner collection mask */ }; Each 64K allocmap buffer holds 4096 (2^12) blocks. The 2^64 storage space is thus laid out as a base-4096 radix tree: 0 16-byte allocmap entry in volume header 12 indirect allocmap (level 3) (represents 16EB) 12 indirect allocmap (level 2) (represents 4PB) 12 indirect allocmap (level 1) (represents 1TB) 12 direct allocmap (level 0) (represents 256M) 16 64KB of storage The layout of the allocmap in memory is broken down into 1GB sections. The allocation topology uses a fixed 4-level allocmap regardless of how small the filesystem is and simply cuts off any algorithmically generated offsets which are beyond the end-of-storage. Each 1G block of storage begins with 16x64K reserved blocks as shown below: [VH][l3][l2][l1][l0][l0][l0][l0][a3][a2][a1][a0][a0][a0][a0] VH - Volume header in first 1G section (1G_SECTION[0]). l3 - indirect allocmap representing 16 Exabytes (note 1) l2 - indirect allocmap representing 4 Petabytes (note 1) l1 - indirect allocmap representing 1 Terrabyte (note 1) l0 - 4x64K direct allocmaps representing 1GB a* - Alternative allocmaps select w/flag in parent (note 2) note 1: These buffers only exist in the 1G sections aligned to their representitive size. So e.g. all indirect buffers exist in 1G_SECTION[0]. A l1 buffers then ALSO exist at each 1TB mark, the l2 buffer at each 4PB mark, and the l3 buffers at each 16EB mark (meaning it only exists in 1G_SECTION[0] since the filesystem cannot be larger than 16 Exabytes). This leaves a certain amount of dead space in larger filesystems. note 2: When an allocmap is modified a copy is made, ping-ponging between the primary an alternative buffer locations in such a way as to guarantee consistency when flushed. The ping-pong only occurs when we desire an update of the root volume header. Even though the allocation map is treated as a base-4096 object the kernel breaks this down into a two-level base-64 topology. big_hint[1] is overlayed on the same structures to represent the higher level of the two-level topology, making allocation searches much faster. SUB-64K ALLOCATIONS The append_offset field may be used to make allocations down to a 64-byte granularity within the 64K block. The mask field is used by the pruner (see below). SHORTCUTTING ALLOCATIONS The allocation space does not need to be entirely initialized by newfs_hammer2. Early termination via the allocmap flags field may be used to mark wholely free and wholely allocated areas. A few blocks do need to be initialized to deal with the storage EOF edge case (no more than 4). Blocks will be initialized as needed. Note that this allows linear allocations and linear freeing operations to avoid read-before-write ops in many cases. MOUNT INCONSISTENCIES HAMMER2 resolves inconsistencies on mount (whether a crash occured or not) by using an optimized scan of the directory tree up to the allocation map's transaction id (similar to a mirror stream's scan), re-marking blocks as allocated. Since the optimized scan stops once it hits an older block (by transaction id), the actual number of blocks scanned is typically minimal, and zero for a clean mount. FREEING BLOCKS The pruner is responsible for freeing blocks in the allocation map. The pruner accomplishes this by traversing the entire directory/file topology using a parallel traversal which identifies shared references and re-collapses the recursion for such references. Since most roots share a large number of blocks the pruner's algorithm winds up being almost linear. More importantly the pruner is able to defer traversals of differentiating branches to reduce random disk accesses when a large number of snapshots are present. Since HAMMER2 has no idea how many shared references a block might have it clears bits in the mask field whenever any deallocation occurs as an indication that the related block *MIGHT* be free. The pruner pass then runs through and re-sets any mask bits for blocks which are found not to be free. When the prune run is finished the pruner is able to issue the actual block frees and/or adjust append_off (up to a point). We do this instead of doing a full wipe prior to the rescan in order to be SSD-friendly. A pruner pass for which no changes occur does not need to make any modifications to that portion of the allocmap. Any deletion which occurs while a pruning run is active sets a collision flag in the flags field and does not mess with the mask bits. The pruner will detect this flag and (a) clear the flag and the mask field and (b) not deallocate anything from the block in question. Freeing the related storage will be resolved the next time the pruner is run. REBLOCKING Reblocking is also a significant problem since, again, HAMMER2 has no idea how many shared references a block might have. To solve this problem HAMMER2 records block offsets and CRCs as it runs and integrates the de-dup functionality into the reblocker. The reblocker does a recursive breath-first directory scan. In order to guarantee functionality the reblocker does not throw away any CRCs. Instead the reblocker limits the range of CRCs recorded to guarantee a bounded size capable of reasonably fitting in kernel memory. The default is around 512K entries (~8MB) which represents around 32G of storage. The fill ratio defaults to 80% and can be changed. Since HAMMER2 is doing a topological scan it can determine if files and directories have reasonable linearity and thus determine if reblocking is necessary. An excessively full filesystem may not be able to reblock optimally. XXX this needs more work. (VI) De-dup, Snapshot & Union management Unionfs mounts are stored as writable snapshots and appear as a subdirectory somewhere in the filesystem. The current root is also stored as a writable snapshot. Typically the mount point for a physical HAMMER2 filesystem specifies a top-level directory name in the root-above-the-root directory. The 'root' directory is thus one directory level down from the filesystem's true root directory. This allows any filesystem to implement multiple native roots and makes mirror/backup/cluster setups all that much easier. Deduplication, snapshots, and union roots (writable snapshots) are all trivialized by this abstraction. Since snapshots are represented as a directory, snapshots can be named by virtue of that directory. named (writable) roots can be mounted with a mount spec that includes the name, making it possible for the boot loader to control which root the kernel tries to mount. References to duplicate blocks can be collapsed by the dedup algorithm. The dedup algorithm works about the same as it does in HAMMER1 (with online dedup and live dedup features). Essentially CRCs are scanned and recorded and potential duplicates are resolved against their actual data. If the number of CRCs exceeds available ram then a bounded amount of ram can be used along with multiple passes restricting the CRC range being checked, so it is still possible to de-dup the entire filesystem in a deterministic manner. A snapshot can be mounted read-only or read-write. When mounted read-write modifications propagate up to and modify the snapshot root in the volume header. A union-style mount can be implemented by snapshotting a snapshot and then mounting it writable. However, the unionfs mount will not see any modifications made to the original snapshot. All dependencies on a snapshot remain intact (see no modifications) if the snapshot is modified. (VII) Mirroring and Clustering HAMMER2 rolls up transaction ids all the way to the root, making it possible for mirrors to determine whether an entire directory subhierarchy on any master OR slave matches simply by checking the directory inode's mirror_tid. This knowledge is stored in the PERSISTENT state of a master or slave and does not disappear on reboot. In otherwords, ANY MASTER OR SLAVE can mirror data to localize it and then serve any of the data once the mirror_tid is validated against the cluster. struct hammer2_blockref { uint32_t icrc; uint32_t reserved01; uint64_t mirror_tid; uint64_t modify_tid; uint64_t data_offset; /* offset|log2(size) */ } mirror_tid represents the highest aggregate modifying transactiond id and propagates all the way up to the root. modify_tid is updated only for blocks actually modified (non-inclusive of the block reallocations). It should be noted that the inode mtime is updated for any modifying write to a file or directory so the inode modify_tid always reflect the most recent modification to that directory or file. A mirror scan determines what needs to be traversed by observing mirror_tid, and determines what needs to be transmitted by observing modify_tid. Sparse deletions within files are handled the same way, with the mirror_tid and modify_tid both updated for a block pointer whos data_offset has been cleared. File truncations are handled logically simply by observing that the size field in the inode (embedded in the directory entry) has changed. Directory deletions and renames are significantly more complex. A directory can be viewed as a sparse 2^63 byte (maximally sized) file. Directory entries are essentially hashed within this sparse space. The mirroring target will thus view the deletion of one or more entries as a ranged deletion similar to a sparse deletion in a file. A rename works similarly except the mirror target must match the deleted entry with the new entry by inode number. Both the deletion and the new entry will occur in the same transaction and thus be in the same mirroring segment, but the two events can occur in any order in the stream so the mirror target must record the fact when the first event is encountered for later match-up. Since any directory sub-tree can be mirrored independently any rename that moves a file from outside the space to inside the space will cause the mirror target to detect incomplete inode information. When this occurs the mirror target must request the missing information and effectively perform a copy operation even if the operation on the originator was a rename and not a copy. Similarly any file which is moved from inside to outside the bounds of a mirror will be physically deleted on the target and potentialy reconstructed elsewhere, depending on what you are mirroring. Only masters can take part in modifying operations (requiring a quorum to agree on what the current state of the data is and a quorum to acknowledge a write), but once a mirror (master or slave) syncs up with that transaction id it can serve up data to requestors. Let me be clear about that... even a SLAVE (not taking part in the quorum protocol) can serve up coherent bulk data if it happens to have it. Clustering consists of network connections over which masters and slaves communicate with each other. The quorum protocol only occurs between masters in order to agree on the transaction id governing a coherent data request, but any member of the cluster can serve up the actual data. Modifying operations transmit bulk data over these connections, typically only to the masters. Mirroring consists of a separate set of connections, usually ssh based, networked in whatever manner the system administrator desires to synchronize all or part of the directory topology between machines. Masters use mirroring to synchronize any data they missed (due to not being part of a quorum operation), while slaves use mirroring to acquire all modifications made to the filesystem since they never take part in the quorum protocol. The DragonFly kernel maintains file and directory topology caches which record the transaction id. These caches are capable of re-validating transaction ids without having to flush the related entries in case of a connection failure and otherwise receive appropriate invalidations via the cluster network protocol. In otherwords, the caches can be made fully coherent on a page-by-page basis even before data is synchronized to the filesystem proper. (This is theoretically the basis for being able to perform process migration, and most certainly being able to perform virtual kernel migration). (VIII) Inode numbers Logical mirroring, read-only and writable snapshots, and the fact that inodes are embedded in directory entries all present problems for the implementation. Generally speaking local HAMMER2 filesystem operations are easy since the filesystem has a direct reference to the inodes related to the operation and can track the changes. But NFS and mirroring operations are significantly more difficult to deal with. * NFS can be disconnective, meaning that due to the lack of a deterministic knowledge on whether a node is opened or not by the client the server can throw away cache information and then be presented with a file-handle (inode number) by the client. NFS will have severe problems in a cluster framework as operations on uncached portions of the cluster will be decidedly asynchronous (having to wait for the bulk mirroring op), and might not even be noticed at all (if exporting the cluster without a local mirror of the data). * The incremental nature of a mirror stream can wind up presenting only incremental modifications for the inode related to a renamed file or snapshot. The path to the original inode being renamed or snapshotted is not available since it was deleted at some point in the past. The inode number and related data typically can be matched up since the delete-portion of the rename occurs in the same transaction as the create-portion. * There can be many different versions of inodes depending on your viewpoint (writable snapshot, read-only snapshot, original, etc). Indexing the inode space as a base feature of the filesystem is not likely to be a viable option. HAMMER1 did not have this problem because HAMMER1 indexed all versions of an inode. HAMMER2 has a simpler but highly maleable data structure which results in potentially much more complex interactions, so HAMMER2 cannot easily index anything by inode number. In both cases just locating the inode can be a problem. For the mirroring case the target must locate an inode whos modify_tid is within the range of transaction ids being updated or it will not have a sufficient base to apply incremental adjustments to (if adjustments are made), and otherwise must find the correct version of the inode for the PFS/snapshot/whatever being mirrored. * The mirroring case is solvable with only moderate difficulty, because a renamed or deleted inode will be detected by the mirroring target by virtue of the mirroring target having to delete the old directory entry and/or the inode being incomplete (due to not being newly created). This may occur in any order, either before or after the renamed target is streamed, so in such cases the mirroring target must record the information on-the-fly and reassociate the deleted inode with the newly renamed file when it has both pieces of data. As long as the mirroring code does not acknowledge the mirroring segment until after reconciling such information it can maintain the record in-memory. * If a file or directory outside a mirrored space is renamed to the inside of a mirrored space some of the data related to the renamed inode will be outside the mirror's transaction id range. In this case the mirror target must recognize at the end of the mirroring segment that it has insufficient information and request the missing information before allowing the stream to complete. * The NFS case effectively requires either indexing or requires an exhaustive directory scan of the exported directory structure in order to locate the inode number. In the latter case the NFS server becomes heavily reliant on its vnode cache for peformance. A NFS server which exports a portion of the cluster fs but has no local mirror for the data requires a mirroring stream (even without a filesystem to write the data to) in order to keep its index up to date.