Securing and Deduplicating the Edge with EdgeFS

Data security is a major challenge to the growth of Edge/Fog computing. Learn how to overcome it with EdgeFS, a modern decentralized data layer.

We’ve been making the case for the Edge/Fog computing transformation with several arguments: significant cost savings from reduced bandwidth consumption, more efficient analytics thanks to faster reaction to events in the physical world, maximum uptime by not relying on a WAN (Wide Area Network) link such as cellular, which can be less reliable than wired, and improved security.

In terms of security, the benefits come from protecting assets close to the source of data, ones that were never intended to connect to broader networks, much less the internet.

However, as applications, data, and computing services are pushed away from centralized locations, data fragments have to be replicated across an increasing number of distributed networks. With that in mind, data security still presents a major challenge to the growth of Edge/Fog computing.

To address these challenges, we need conceptually new decentralized data distribution and access layers designed with Edge/Fog security in mind.

Let’s compare the designs of two open-source decentralized storage layers that might fit the requirements from a data security standpoint: EdgeFS (http://edgefs.io, Apache Licensed) and IPFS (the InterPlanetary File System, https://ipfs.io, MIT Licensed).

When we designed EdgeFS, data security was our top priority. In EdgeFS, once recorded, data in any given block cannot be altered retroactively as this would invalidate all SHA-3 hashes in the previous blocks in a blockchain-like n-ary tree and break the consensus agreed among decentralized locations. The same can be said about IPFS.
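To illustrate why a retroactive change is detectable, here is a minimal sketch of an n-ary hash tree built with SHA-3. The function names, the fanout of 4, and the sample chunks are assumptions made for illustration only; this is not the EdgeFS on-disk format.

```python
import hashlib

def sha3(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def tree_root(chunks, fanout=4):
    """Hash every chunk, then repeatedly hash groups of `fanout` child digests
    until a single root digest remains (hypothetical layout)."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    level = [sha3(c) for c in chunks]
    while len(level) > 1:
        level = [sha3(b"".join(level[i:i + fanout]))
                 for i in range(0, len(level), fanout)]
    return level[0]

chunks = [b"chunk-%d" % i for i in range(10)]
original_root = tree_root(chunks)

# Altering any single chunk changes its digest, every parent digest above it,
# and therefore the root that the sites agreed on.
tampered = list(chunks)
tampered[3] = b"chunk-3-modified"
tampered_root = tree_root(tampered)
```

Comparing `original_root` with `tampered_root` is enough for any site holding the agreed root to reject the altered data.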

EdgeFS is built upon an architecture in which immutable, self-validating, location-independent metadata references self-validating, location-independent payload.

While the two storage solutions handle payload chunks very similarly, how they name and find objects could hardly be more different. IPFS was designed primarily for peer-to-peer ledger crypto transactions, while EdgeFS makes no such assumption and focuses on extremely high performance for many-to-many local or remote content-addressable network operations.

Immutable Payload Chunks

The end result of placing a data chunk into IPFS is that it is identified and validated with a strong cryptographic hash, and that hash can be used to find the chunk for retrieval. This is very similar to EdgeFS, but there are a few differences:

IPFS accepts the chunk and then generates its cryptographic hash. An EdgeFS client (via the CCOW “Cloud-Copy-On-Write” gateway library API) cryptographically hashes the chunk before requesting that it be put. This avoids transmission of duplicate payload chunks, also known as inline data deduplication.

IPFS routing is a consistent hashing solution. EdgeFS instead routes an I/O request to a Target Group and then does rapid negotiations within the group to find and dynamically put new chunks on the least burdened targets. This improves capacity balancing and utilization of storage devices, also known as dynamic data placement.
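The hash-before-put idea can be sketched in a few lines. `ChunkStore` and `client_put` below are hypothetical stand-ins invented for this example and are not the CCOW API; the point is only that the client computes the content hash first and skips the transfer when the target already holds that chunk.

```python
import hashlib

class ChunkStore:
    """Hypothetical in-memory stand-in for a storage target."""
    def __init__(self):
        self.chunks = {}          # content hash -> payload
        self.bytes_received = 0   # payload bytes actually transferred

    def has(self, chunk_id):
        return chunk_id in self.chunks

    def put(self, chunk_id, payload):
        # The target re-hashes the payload to validate what it received.
        if hashlib.sha3_256(payload).hexdigest() != chunk_id:
            raise ValueError("payload does not match its content hash")
        self.bytes_received += len(payload)
        self.chunks[chunk_id] = payload

def client_put(store, payload):
    # Hash on the client first, then send the payload only if it is new.
    chunk_id = hashlib.sha3_256(payload).hexdigest()
    if not store.has(chunk_id):
        store.put(chunk_id, payload)
    return chunk_id

store = ChunkStore()
first = client_put(store, b"sensor reading batch 1")
second = client_put(store, b"sensor reading batch 1")  # duplicate: not resent
```

After both calls the store holds one chunk and has received its payload once, even though the client wrote it twice.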

The EdgeFS FlexHash table is a local site construct. It is automatically discovered and resides in the local site’s server memory. FlexHash is responsible for I/O routing and plays an important role in the dynamic load balancing logic. Based on the discovered site topology, it defines so-called Negotiating Target Groups, which are typically formed across 8–24 zoned storage devices to ensure proper failure domain distribution.
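As a rough sketch of routing followed by in-group negotiation: the table layout, the `load` metric, the replica count of 3, and the one-replica-per-zone rule below are all assumptions chosen for illustration, not the actual FlexHash implementation or its UDP protocol.

```python
import hashlib

def pick_group(chunk_id: bytes, num_groups: int) -> int:
    """Map a content hash to a Negotiating Target Group (hypothetical rule)."""
    return int.from_bytes(chunk_id[:8], "big") % num_groups

def negotiate(group, replicas=3):
    """Pick the least burdened targets, at most one per failure zone."""
    chosen, zones = [], set()
    for target in sorted(group, key=lambda t: t["load"]):
        if target["zone"] not in zones:
            chosen.append(target["name"])
            zones.add(target["zone"])
        if len(chosen) == replicas:
            break
    return chosen

# Two tiny groups for illustration; the article cites 8-24 devices per group.
groups = [
    [{"name": "t0", "zone": "a", "load": 0.7},
     {"name": "t1", "zone": "a", "load": 0.1},
     {"name": "t2", "zone": "b", "load": 0.4},
     {"name": "t3", "zone": "c", "load": 0.2}],
    [{"name": "t4", "zone": "a", "load": 0.3},
     {"name": "t5", "zone": "b", "load": 0.9},
     {"name": "t6", "zone": "c", "load": 0.5}],
]

chunk_id = hashlib.sha3_256(b"example payload").digest()
group_index = pick_group(chunk_id, len(groups))
placement = negotiate(groups[group_index])
```

Unlike pure consistent hashing, the hash here only selects the group; which devices inside the group receive the chunk depends on their current load.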

Differences in Metadata Philosophy

The IPFS naming system is still a work-in-progress, and examples suggest that IPFS uses a very different method for publishing content accessible by name.

IPFS takes the cryptographic hash of the atomic object and embeds those references in other named objects, which basically function as directories.

Each of these directory objects is also immutable, referencing specific frozen-in-time content. The directory object itself has a cryptographic hash, which can be referenced in higher-layer directories. Finally, a “root” directory is published, which is then pointed to by a mutable name-to-directory-object mapping. I suspect that this design was heavily influenced by the need to provide a highly secure persistent layer for cryptocurrency ledger algorithms, at the expense of generic storage flexibility and performance.
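The directory scheme described above can be sketched generically as follows. This is not the IPFS API or its object encoding; the JSON serialization, the SHA-256 hex digests, and the names `put_object`, `put_directory`, and `my-site` are assumptions made for the example.

```python
import hashlib
import json

objects = {}   # content hash -> immutable bytes
names = {}     # mutable name -> hash of the current root directory

def put_object(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    objects[digest] = data
    return digest

def put_directory(entries: dict) -> str:
    # A directory is itself an immutable object listing child name -> hash.
    return put_object(json.dumps(entries, sort_keys=True).encode())

# Version 1: file -> directory -> root, then publish the mutable name.
v1 = put_object(b"hello v1")
root1 = put_directory({"docs": put_directory({"readme.txt": v1})})
names["my-site"] = root1

# Version 2: a changed file produces new directory objects all the way up;
# only the mutable name mapping is updated in place.
v2 = put_object(b"hello v2")
root2 = put_directory({"docs": put_directory({"readme.txt": v2})})
names["my-site"] = root2
```

Both roots remain stored and addressable by hash; publishing a new version touches only the single mutable name entry.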

EdgeFS takes a different approach, with the objective of enabling a shared data repository for versioned content that can be accessed and updated simultaneously by thousands of tenant-approved users, with support for cross-site consistency groups.

In EdgeFS, information that supports finding a stored object, by name or by other search criteria, is always recorded as metadata separate from the payload. EdgeFS treats stored payload as opaque blobs, with an on-disk organization that never requires finding references within the chunk itself, thus allowing client-driven end-to-end encryption. That is, it presumes that all payload is encrypted and never tries to analyze it. Mutable metadata is always stored locally (in the local site cluster), enabling an always-local, immediately consistent I/O policy without sacrificing flexibility or performance.

Immutable Version Metadata

By definition, most metadata about a specific version of an object must be immutable. Certain metadata can be independent of the version contents, such as metadata controlling retention of the object version, local site replication overrides, ACLs, and so on.

One of the strong points of IPFS is that it does not change the storage for a directory object when the mutable naming reference is changed to point at a new version. This is very similar to how EdgeFS handles mutable naming references and, in my opinion, is far preferable to the practice of creating an explicitly versioned name. In EdgeFS, mutable naming is always assumed to be local and gets “re-hydrated” upon remote site-to-site transfer. As a result, everything transferred across geographies is globally immutable, which lets global replication efficiently skip unnecessary network transfers, a.k.a. inline data deduplication over WAN.
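A simplified sketch of that site-to-site behavior is shown here. The `Site` class, the flat manifest (a list of chunk hashes rather than a tree), and the `replicate` function are hypothetical simplifications for illustration, not the EdgeFS replication protocol.

```python
import hashlib

def chunk_id(payload: bytes) -> str:
    return hashlib.sha3_256(payload).hexdigest()

class Site:
    """Hypothetical site: immutable chunks plus a local, mutable name table."""
    def __init__(self):
        self.chunks = {}   # content hash -> payload (immutable)
        self.names = {}    # object name -> version manifest (mutable, local)

    def write(self, name, payloads):
        manifest = [chunk_id(p) for p in payloads]
        for cid, payload in zip(manifest, payloads):
            self.chunks[cid] = payload
        self.names[name] = manifest
        return manifest

def replicate(src, dst, name):
    """Send only the chunks the destination lacks; return how many were sent."""
    manifest = src.names[name]
    missing = [cid for cid in dict.fromkeys(manifest) if cid not in dst.chunks]
    for cid in missing:
        dst.chunks[cid] = src.chunks[cid]
    # "Re-hydrate" the mutable name on the destination from immutable data.
    dst.names[name] = list(manifest)
    return len(missing)

site_a, site_b = Site(), Site()
site_a.write("video", [b"A", b"B", b"C"])
sent_first = replicate(site_a, site_b, "video")    # all three chunks are new

site_a.write("video", [b"A", b"B", b"D"])          # new version, one new chunk
sent_second = replicate(site_a, site_b, "video")   # only chunk D crosses the WAN
```

Because chunks are addressed by content hash, the second transfer moves only the chunk that actually changed.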

Summary

There are many other features of a modern metadata storage subsystem required for versioned content storage that IPFS simply does not yet seem to address:

The ability to quickly find names that fall within any given folder/directory

Predictable directory or bucket search times

Tenant control over access-to and modification-of tenant metadata

Metadata-driven retention of referenced payload

While I admit that the primary and original goals of IPFS were not to serve Edge/Fog computing use cases, its security and global scalability benefits do fit the profile. And perhaps it will catch up one day on the rest of the requirements. But why wait? EdgeFS is available today, and it fits the most important requirements for Edge/Fog computing: data security, cost reduction, and performance.

EdgeFS exploits locally available site resources and presents them as a highly available cluster segment that is part of the geographically decentralized data layer. Its outstanding local-site performance comes from its immutable data structure design, dynamic data placement via a low-latency UDP-based protocol, built-in multi-protocol storage gateways (S3, NoSQL DB, NFS, iSCSI, etc.), and a highly scalable shared-nothing architecture, which together can be a true enabler of applications designed for the Edge/Fog computing era.

Give it a try today! Star our GitHub repository and let me know what you think.

Find out more by joining our growing community at http://edgefs.io and http://rook.io