Raise High the Merkle Tree, Programmers

AKA Principles of Content Addressing, Part II

2015-10-26

Ben Trask

Since I wrote the Principles of Content Addressing (Part I)[#] over a year ago, content addressing has continued slowly gaining steam. There are many different projects out there trying to create a distributed and/or decentralized web. Some use DHTs, some use advanced client-side JavaScript, many hash their own app-specific data (making their hashes incompatible with other projects; see Part I). While I have strong opinions about all of these things and more, this article is about the core design of content addressing that applies to every distributed system, no matter what it’s used for.

What is content addressing?

Let’s start by looking at the content address’s closest relative, the Universally Unique ID. A UUID (or GUID) is just a long random string generated to refer uniquely to something, for example a document. Because the ID is long and random, there’s a sufficiently low chance of a collision (two documents being assigned the same ID).

A content address is an enhanced UUID. Instead of being random, the “randomness” is generated from the document itself (through a hash function). That gives the content address some useful properties:

The content address can be used to verify that the right document was retrieved, without necessarily trusting the resolver.

The same document will always be assigned the same content address, without any extra coordination.

The second property means that you can assign content addresses without concern for whether the documents in question have already had different addresses assigned. It doesn’t matter since you’ll always choose the same address.

By contrast, UUIDs or GUIDs are what I like to call “random addressing.” Still useful, but not as useful.
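
To make the contrast concrete, here is a minimal sketch in Python (an illustrative choice; nothing here is tied to any language), using SHA-256 as the hash function:

    import hashlib
    import uuid

    document = b"a family photo, or any other sequence of bytes"

    # Random addressing: a fresh identifier every run, unrelated to the
    # document. Two machines naming the same photo will disagree.
    random_address = str(uuid.uuid4())

    # Content addressing: the identifier is derived from the bytes
    # themselves. Anyone hashing the same document gets the same
    # address, with no coordination.
    content_address = hashlib.sha256(document).hexdigest()

    print(random_address)   # different every time
    print(content_address)  # identical on every machine, every run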

Why is agreeing on the same addresses important?

By giving all copies of a document the same address, there are two primary benefits:

Duplicate documents don’t have to be stored or transmitted, saving space and bandwidth.

Documents and associated meta-data can be shared and synced reliably.

The first (performance) benefit is significant but fairly straightforward, so I’ll focus on the second.

Say you have a large collection of family photos. It’d be nice if your relatives could tag them and share the tags with you. In order to do that, you have to agree on the “identity” of a photo being tagged, regardless of whose computer it’s on. One answer is a centralized service that keeps all of your photos for you. Another answer is content addressing. (Of course having shared identities for your photos doesn’t mean you have to see everyone’s tags on your photos, but it gives you the option.)

A system might index backlinks so you can find everything that links to a given document. However, if copies of the document are fragmented across several different addresses, any backlink query can only ever find a subset of them.
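
As a sketch of how shared identity enables shared metadata, the hypothetical in-memory indexes below key both tags and backlinks by content address (a real system would persist them):

    import hashlib
    from collections import defaultdict

    def address(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    tags = defaultdict(set)       # content address -> tags
    backlinks = defaultdict(set)  # target address -> addresses linking to it

    photo = b"raw bytes of a family photo"
    tags[address(photo)].add("beach-2014")

    # A page that embeds the photo by its content address:
    page = ("see photo " + address(photo)).encode()
    backlinks[address(photo)].add(address(page))

    # Because every copy of the photo has the same address, tag and
    # backlink sets from different machines merge with a simple union.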

Content addressing doesn’t magically make data “permanent,” but it does make fully-transparent mirroring possible, which would help organizations like the Internet Archive and even individuals to keep content that they care about alive.

Technical tidbit: You Cannot Have Exactly-Once Delivery in a distributed system. Content addressing is actually the simplest and most foolproof way of working around this problem. Many distributed databases (and even RSS) use ad-hoc, informally specified content addressing or random addressing without realizing it, to their detriment. You also need to be careful that operations are idempotent, which in the case of simply adding files is straightforward.
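
A rough sketch of why adding files is naturally idempotent (the function and layout here are assumptions for illustration, not any particular system):

    import hashlib
    import os

    def store(directory: str, data: bytes) -> str:
        # The blob's address doubles as its storage path, so re-adding
        # the same bytes is a no-op: at-least-once delivery converges
        # on exactly-once storage.
        addr = hashlib.sha256(data).hexdigest()
        path = os.path.join(directory, addr)
        if not os.path.exists(path):
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, path)  # atomic rename; last writer wins
        return addr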

Why does content addressing have to be done carefully?

The hard part of content addressing is the concept of identity. The less data you use to generate a content address, the more focused its identity will be. If you change a document just slightly, its address changes. If you include extraneous or random data in the document, the address effectively becomes random.

In order to keep addresses useful, follow a simple rule of thumb: Content addressing is for content.

The definition of content can differ based on context, but here is a partial list of things that should usually or always be kept out of content addresses:

File names

File meta-data

Most timestamps (depends on use case; keep granularity to a reasonable level)

System-specific storage information

File wrappers

Templated or synthesized data

Other files (use one address per file, rather than a shared address for a whole directory)

All of this data can and should be stored, just not as a part of the hashed content for the file in question.
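
As an illustration of that separation (the record layout is invented for this example), metadata lives beside the content, keyed by the content’s address, rather than inside the hashed bytes:

    import hashlib

    content = b"raw pixel data of a photo"

    # The address covers the content and nothing else.
    addr = hashlib.sha256(content).hexdigest()

    # File name, timestamps, etc. live in a separate record keyed by
    # that address. Renaming the file or touching its mtime changes
    # this record, but the content address stays stable.
    metadata = {
        addr: {
            "filename": "IMG_1234.jpg",
            "modified": "2015-10-26T12:00:00Z",
        },
    }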

Proprietary hashes versus portable hashes

One of the key ideas behind content addressing is that you can resolve the same hash against potentially many different resolvers. These resolvers could use different storage techniques (one using DHT, one offline), offer different services (paid, free, or personal hosting), or simply be made by different organizations (the way we have several compatible web browsers today). If hashes contain “unreproducible” data, other resolvers can’t use them.

Another way to think of this is like name mangling in C++. C++ wants to assign extra attributes to symbols, and it does so by glomming extra data onto symbol names, which makes them difficult to use from other languages. The solution in that case is extern "C", which tells C++ to keep good “name hygiene.”

When your data storage system is immutable and decentralized, many of the typical long term archival concerns go away. One of the big ones that remain is that the software itself changes or stops being supported. These non-portable hashes are effectively proprietary, and bring with them all of the problems of vendor lock-in.

If you use any software that exposes content addresses, demand a way to automatically convert its hashes to/from plain hex-encoded hashes of the original file content and nothing else, without being forced to re-hash the underlying data. Portable hashes should be equivalent to what you could compute with standard tools like sha256sum(1). Don’t get locked into proprietary hashes.
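
For reference, a portable hash is nothing more exotic than this (a sketch; the streaming chunk size is arbitrary):

    import hashlib

    def portable_hash(path: str) -> str:
        # Hash the file's raw bytes, streaming so large files don't
        # have to fit in memory. The result matches `sha256sum <path>`.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()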

Block deduplication

Content addressing can be used at different layers. When used at the storage or file system layer, it often takes the form of block deduplication. Files are broken into blocks that are given individual content addresses, allowing chunks of files to be deduplicated and verified.
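
A sketch of the mechanics (fixed-size chunking is the simplest scheme; real systems often use content-defined chunking instead):

    import hashlib

    BLOCK_SIZE = 4096  # arbitrary, for illustration

    def block_addresses(data: bytes) -> list:
        # Split the data into blocks and address each one, so identical
        # blocks shared by different files are stored only once.
        return [
            hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)
        ]

Note that no arithmetic on these block hashes will reproduce sha256(data) for the whole file; that is exactly why block-level hashes need to stay an internal detail.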

The biggest problem with block dedup is the way it’s usually implemented. You need to be very careful to keep it separate from the user-facing hashes so that your internal implementation details don’t leak out. Unfortunately, in the case of BitTorrent (including Project Maelstrom), IPFS, and Camlistore, the high-level, user-facing hashes are of the block data rather than the original files. That means all of these hashes are project-specific. There is no way to convert them to standard hashes without re-hashing.

Block storage schemes also often allow flexibility in how exactly the data is broken up, which fragments the addresses even within one system.

Whether block deduplication is a feature you care about or not, it’s perfectly possible for a high level content addressing specification to allow it but not demand it, which makes the spec more flexible. Defining it into every hash is short-sighted.

Raise High the Merkle Tree, Programmers

As opposed to block deduplication, content addressing is much more interesting and useful as a way of referring to whole files. Ironically, many content addressing systems use ordinary file names, not content addresses, when referring to files embedded in other files (for example, resources in a web page). That, in turn, necessitates or encourages hashing file names, which makes content addressing less robust.

Conversely, when you use content addressing to embed files, each file contains the hashes of the files it depends on. That means the files themselves become Merkle trees, and you can verify each file and its dependencies recursively just by knowing the initial file’s hash. Content addressing works best when you lift it into the highest storage layer, application data.
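
A minimal sketch of that recursive verification (assuming, purely for illustration, that embedded references appear as literal 64-character hex digests and that store maps addresses to raw bytes):

    import hashlib
    import re

    def verify(store: dict, addr: str) -> bool:
        # Check this file against its address, then recurse into every
        # file it embeds. Cycles are impossible: a file would have to
        # contain its own hash.
        data = store.get(addr)
        if data is None or hashlib.sha256(data).hexdigest() != addr:
            return False
        refs = re.findall(rb"[0-9a-f]{64}", data)
        return all(verify(store, ref.decode()) for ref in refs)

Knowing only the root file’s hash, verify() checks the entire dependency tree, which is what makes it a Merkle tree.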

An entire web built on content addressing would be more resilient and robust, but I believe we have to get the details right first.

Read on for StrongLink: An Introduction on GitHub.