Decentralized data fabric with blockchain-like architecture.

The conventional tradeoff for distributed storage clusters is between transactional consistency and eventual consistency. Eventual consistency is usually viewed as a cheaper solution, both in terms of desirability and system cost. The critical cost of transactional consistency is the need to reach a consensus on ordering updates.

Eventual consistency is usually portrayed as simply tolerance for inconsistency on the presumption that momentary contradictions are acceptable as long as they go away eventually.

The decentralized data fabric core, open-sourced EdgeFS project takes a different, blockchain-like approach. All stored chunks, whether metadata or payload, are unique, immutable and self-validated. References to these chunks do not include the locations where they are stored but still enable those chunks to be efficiently retrieved.

This strategy allows EdgeFS to provide guarantees beyond making thing consistent “eventually”:

Any client will never retrieve a version older than the most recent version that the client has put itself. The changes in any version will never be automatically erased.

A version will only be expunged according to policy and after a version that is a successor to it is published.

No network partition will prevent a client from putting a new object version. Indeed no client will ever prevent another client from putting a new object version.

Never blocking a new object version because of the actions of another client is a feature, not a bug or limitation. Transactional systems can only guarantee non-overlapping edits after reaching a consensus. The consensus may be on which updater has the exclusive right to update an object now (distributed locking) or on which of multiple conflicting updates can be committed (MVCC or multi-version consensus control). Effectively the cluster must be serialized either before the update is initiated or before it can be completed. Distributed locking is more efficient when conflicting attempted edits are common, MVCC for the far more common situation where conflicts are rare.

So eventual consistency is actually what end consumers want for versioned document archives. Not accepting, or even delaying, the ability to record a new version is not good. What they would prefer is to reliably know when other versions are created and to minimize the time when a new update is not visible to someone wanting to further edit the same document.

What decentralized segmented EdgeFS installation offers is eventual consistency with the benefits of immutability and knowledge of what range of possible object versions could exist but which have not yet been propagated.

What lies behind this capability is simple, EdgeFS has defined its metadata so that no consensus algorithm is needed. Other storage solutions may have clever consensus algorithms (RAFT etc), but you cannot be more clever than no consensus algorithm at all.

Needless to say that the capability to sustain network partitioning for hours and days required for many Edge/IoT computing applications where connectivity to a core or cloud isn’t guaranteed, and where sub-millisecond latency is crucially important.

Consensus is Expensive

The fundamental issue is that it is impossible to update the same information at multiple locations at exactly the same time. This has been expressed in many ways, including CAP Theorem.

Distributed storage systems that offer transactional consistency by requiring a cluster-wide consensus before the put transaction creating a new object version can complete. This may be based upon a priori locks or optimistic locking which detects conflicting edits and immediately applies the conflicting edits before reapplying the attempted edit.

Either strategy requires end-to-end communications covering at least a relevant quorum of node members. Of course, a quorum based consensus is dependent on agreement about how many votes are needed, which is why consensus algorithms always get complex. If a quorum consensus on either the lock or the specific edit cannot be achieved then the requested operation cannot proceed or complete. Disallowing puts of new versions is the last thing that a storage cluster supporting versioned documents should do.

Unique Chunks Do Not Require Consensus

EdgeFS defines object versions in metadata chunks called Version Manifests with unique Version Manifest Content Hash Identifier (VMCHID) “Top Hash”. These chunks include the fully qualified object name and a unique version identifier.

Chunks are located using a cryptographic hash identifier of the chunk. For Version Manifests this is the Name Hash Identifier (NHID). All Version Manifests for a given object are stored within target servers, across geographically decentralized segments, addressed by a FlexHash group number, derived from the NHID. Payload “hash(L1-L4)” and non-version manifest chunks “(Hash 0–1, Hash 0–0 .. 1–1, etc)”, by contrast, is located based on the Content Hash Identifier (CHID).

Further, because the Version Manifests include their unique identifier, their Content Hash Identifiers (CHIDs) are also unique. Depending on options selected the cryptographic hashes maybe 256 or 512-bits.

EdgeFS Always Accepts New Versions

All EdgeFS chunks are unique. They either have unique payload identified by a 256 or 512-bit cryptographic hash, or they have a Version Manifest that includes a unique identifier of the version.

Two nodes can both put the same payload chunks without harm. Because the unique version identifier includes the IP address of the originating node there is only a single source for any new version. This does impose the onerous constraint that no single source can put two versions of the same object within a single system tick, currently 1/10,000th of a second. This is not a significant constraint as the round trip times to negotiate and confirm putting a new chunk will take longer than that.

Because the same Version Manifest has a unique identifier (VMCHID) the source creating it does not need to consult with any other node before creating it. The only entity that is qualified to have an opinion on whether it created a new version of a given object at a given timestamp is itself. Instant consensus.

Namespace Manifest and Snapshots

Because Version Manifests are unique they can always be created. EdgeFS collects and processes the transaction log entries noting each new Version Manifest to create a permanent registry of all Version Manifest that we call a Namespace Manifest. The Namespace Manifest can support complex metadata queries and makes it possible to take true point-in-time snapshots of without requiring any consensus deriving blockage.

Summary

EdgeFS is open-sourced on the terms of Apache License v2 and our community welcoming users, Edge/IoT enthusiasts to join and experiment with it. Learn more at EdgeFS Github pages.

Thanks to Caitlin Bestler for the original write up on this article.