Networked rebuild and self-healing in HAMMER2

The idea is to be able to automate it, at least so long as spare nodes are available. So if one had a cluster of 3 masters (quorum is thus 2 nodes) and 2 additional nodes operating as slaves, then if one of the masters fails the cluster would continue to be able to operate with 2 masters until the failed master is replaced. But the cluster would also be able to promote one of the slaves (already mostly synchronized) to become a master, returning the system to the full 3 masters and making the timing of the replacement less critical.

This alone does not really replace RAID. For a very large storage subsystem, each node would be made up of many disks, so another layer is needed to manage those disks. The design documentation describes a 'copies' mechanism that is meant to address this, where redundancy is built within each node to handle disk failures and to manage a pool of hot replacements. If a disk fails and is taken out, the idea is for there to be sufficient copies to be able to rebuild the node without having to access other nodes. But if for some reason there is not a sufficient number of copies, then the node could in fact get the data from other nodes as well.

For smaller storage systems the cluster component is probably sufficient. But for larger storage systems both the cluster component and the copies component would be needed.

One important consideration here is how spare disks or spare nodes are handled. I think it is relatively important for spare disks and spare nodes to be 'hot' ... that is, fully live in the system and usable to improve read fan-out performance. So the basic idea for all spares (both at the cluster level and the copies level) is for the spare drives to be fully integrated into the filesystem as extra slaves.

Right now I am working on the clustering component. Getting both pieces operational is going to take a long time. I'm not making any promises on the timing. The clustering component is actually the easier piece to do.
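The 3-master, 2-slave scenario above can be sketched as a toy model. This is only an illustration and not HAMMER2 code: the `Node` and `Cluster` classes, the `sync_tid` field, and the rule that promotion only happens while quorum holds are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str           # "master" or "slave"
    alive: bool = True
    sync_tid: int = 0   # hypothetical: how far this node has synchronized


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)
    target_masters: int = 3   # configured master count

    def quorum(self) -> int:
        # Strict majority of the configured masters: 3 masters -> 2.
        return self.target_masters // 2 + 1

    def live_masters(self) -> list:
        return [n for n in self.nodes if n.role == "master" and n.alive]

    def has_quorum(self) -> bool:
        return len(self.live_masters()) >= self.quorum()

    def fail(self, name: str) -> None:
        # Mark the named node as failed.
        for n in self.nodes:
            if n.name == name:
                n.alive = False

    def heal(self) -> list:
        """Promote live slaves until the master count is back at target.

        Assumption for this sketch: nothing is promoted without quorum.
        Returns the names of the promoted nodes.
        """
        promoted = []
        if not self.has_quorum():
            return promoted
        while len(self.live_masters()) < self.target_masters:
            spares = [n for n in self.nodes if n.role == "slave" and n.alive]
            if not spares:
                break
            # Prefer the slave that is closest to fully synchronized.
            best = max(spares, key=lambda n: n.sync_tid)
            best.role = "master"
            promoted.append(best.name)
        return promoted


cluster = Cluster(nodes=[
    Node("m1", "master", sync_tid=100),
    Node("m2", "master", sync_tid=100),
    Node("m3", "master", sync_tid=100),
    Node("s1", "slave", sync_tid=97),
    Node("s2", "slave", sync_tid=99),
])
cluster.fail("m2")
print(cluster.has_quorum())          # True: 2 of 3 masters still live
print(cluster.heal())                # ['s2']: the most synchronized slave
print(len(cluster.live_masters()))   # 3
```

Because the slaves are hot and already mostly synchronized, promotion in this model is just a role change; a real implementation would also have to finish synchronizing the promoted node before counting it toward quorum.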
-Matt

On Wed, Mar 25, 2015 at 3:12 AM, PeerCorps Trust Fund <ipc at peercorpstrust.org> wrote:

> Hi,
>
> If I understand the HAMMER2 design documents, one of the benefits that it
> brings is the ability to rebuild a failed disk using multiple networked
> mirrors? It seems that it also uses this capability to provide data healing
> in the event of corruption.
>
> If this is the case, are these processes transparent to the user based on
> some pre-defined failover configuration, or must they be manually set off
> in the event of a disk failure/corruption?
>
> Also, would RAID controllers still be necessary in the independent nodes
> if there is sufficient and reliable remote replication? Or could a HAMMER2
> filesystem span the disks in a particular node and have the redundancy of
> the remote replication provide features that otherwise would come from a
> RAID controller?
>
> Thanks for any clarifying statements on the above!
>
> --
> Mike