Mediachain ❤ Ethereum

Adventures in off-chain storage

New Parts Bins by Randomskk via Attribution Engine. Licensed under CC BY-SA

Over the past couple of months, we’ve been experimenting with integrating Mediachain into the Ethereum world as performant, scalable off-chain storage. Today, we’re excited to announce some of these early experiments: our goal is to help decentralized app (dApp) developers skip thinking about storage and focus on their product and community instead.

For the uninitiated, Mediachain is:

a scalable, decentralized database

indexed and structured in feeds, with realtime notifications via PubSub

discoverable through a directory system for dynamic query routing

content addressed and IPFS compatible (immutable/location independent)

integrated with various identity providers (Blockstack, Keybase, soon uPort)

We believe that dApps, on top of Ethereum and other platforms, can create a fairer distribution of value between participants in the media ecosystem, appropriately compensate content creators, and maybe even challenge the dominant surveillance-capitalism mode of the social/user-generated content web. A perfect fit with the Mediachain vision.

The nascent crop of exploratory apps in this space shows promise: the early success of Steemit has been encouraging, Resonate is on a slow, steady climb, and Userfeeds and Akasha are not far from launch. Some others (notably Ujo Music) didn’t quite hit mainstream adoption with their prototypes, but are coming back around for a serious attempt on top of a more mature ecosystem. Many more are on the horizon.

One thing that unites these, and future, projects: they need to store data somewhere — specifically metadata about the songs, photos, “medium” posts, and so on. Let’s talk about challenges surrounding this (scroll down to “Enter Mediachain” if you’re impatient).

A Brief History of ⛓Storage

Just Hardcode It

Some early projects, like the first iteration of Ujo, created very inspiring demos around what were essentially toy datasets — in the case of Ujo, the splits for the one song you could buy through the original contract were hardcoded right in the constructor:

Setting splits information in the original Ujo contract’s constructor

This worked to showcase the mechanics, but of course it was quite limited in practice.

The State We’re In

The obvious next step was to add some state mutation methods to your contract, so that the owner could call it with CRUD operations. This is pretty easy — the constructor above writes to a public state variable already — so now you could have your own little database replicated on every Ethereum node.

Unfortunately, the cost of storing data this way ends up being around $140k/GB at current ETH/gas prices, or enough to buy about 500,000 gigabyte-years of S3 storage. Ouch. The same problem comes up in any direct on-chain storage system: the design parameters of transaction processing and data storage are inherently at odds.
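That figure is easy to reproduce as back-of-envelope arithmetic. A minimal sketch, assuming 20,000 gas per fresh 32-byte SSTORE slot, a 20 gwei gas price, and roughly $10/ETH (all three numbers are assumptions that drift with the market):

```python
# Back-of-envelope cost of storing 1 GB in Ethereum contract storage.
# Assumed figures: 20,000 gas per new 32-byte SSTORE slot,
# 20 gwei gas price, ~$10/ETH. All three drift with the market.
GAS_PER_SSTORE = 20_000      # gas to write one new 32-byte slot
GAS_PRICE_ETH = 20e-9        # 20 gwei, expressed in ETH
USD_PER_ETH = 10.0           # assumed spot price

slots = (1024 ** 3) // 32    # 32-byte slots in one gigabyte
total_gas = slots * GAS_PER_SSTORE
usd_per_gb = total_gas * GAS_PRICE_ETH * USD_PER_ETH
print(round(usd_per_gb))     # on the order of $134,000 per GB
```

Nudge the gas price or ETH price and the headline number moves proportionally, but it stays astronomical.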

Many Happy Returns

The problem of on-chain storage costs exceeding San Francisco real estate prices was touched upon, indirectly, at least as far back as 2013 (please send me earlier examples), when Proof of Existence began storing SHA-256 hashes of documents in Bitcoin OP_RETURN outputs to “notarize” them.

An example PoE transaction containing the DOCPROOF marker and the SHA of a document

So here we’re writing a 32-byte hash instead of the full document, whatever its size. Not bad! However, these are some very expensive bytes: approx $500k/GB (in 2017 BTC), or roughly the per-megabyte price of an expensive hard drive from the late 70s. We’re also subject to Bitcoin’s <10 tps global rate limit and multi-minute confirmation times.
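The mechanics are simple enough to sketch in a few lines. This assumes PoE’s published scheme, an ASCII DOCPROOF marker followed by the SHA-256 digest, and elides the transaction construction itself:

```python
import hashlib

# Build an OP_RETURN payload the way Proof of Existence does:
# an ASCII "DOCPROOF" marker followed by the SHA-256 of the document.
document = b"my very important contract, v1"
digest = hashlib.sha256(document).digest()   # always 32 bytes
payload = b"DOCPROOF" + digest               # 8 + 32 = 40 bytes

# Comfortably within OP_RETURN's standard relay limit (80 bytes
# at the time of writing).
assert len(payload) == 40
print(payload.hex()[:16])  # the marker bytes: 444f4350524f4f46
```

Anyone can later recompute the hash of a document they hold and scan the chain for a matching payload, which is the entire “notarization” trick.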

Worse, there is not much we can do with this data outside the proof of existence application: the underlying document has to be located somehow.

Off The Chain

The next logical step, taken by projects like Ascribe, was to place the referred-to document in a well-known location like a particular S3 bucket, so it could (usually) be retrieved based on the identifier recorded on-chain. This approach was formalized by BitSpark, a protocol for recording the URL of a “delivery server”, which will hopefully serve the document, alongside the hash.

The same approach can be taken with values written to Ethereum state, which further drives costs down about 5x and makes the reference easier to deal with. We’re doing better now: the document can be located from the blockchain record, though we still can’t do much with it.
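Reading it back, a client fetches the document from the delivery server and checks it against the hash recorded on chain. A sketch of the verification step, assuming SHA-256 is the recorded hash and eliding the fetch itself:

```python
import hashlib

def verify(content: bytes, onchain_hash_hex: str) -> bool:
    """Check bytes fetched from a delivery server against the
    hash recorded on chain (assumed here to be SHA-256)."""
    return hashlib.sha256(content).hexdigest() == onchain_hash_hex

doc = b"song metadata v1"
recorded = hashlib.sha256(doc).hexdigest()  # what went on chain
assert verify(doc, recorded)                # server returned the real doc
assert not verify(b"tampered", recorded)    # or something else entirely
```

So the hash keeps the server honest, but nothing keeps the server up.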

The Achilles’ heel of this approach is the delivery server mechanism: it’s location-addressed, which means the moment the domain lapses, the origin server goes down, or ICE gets involved, the link breaks irrecoverably. We’ve welded a single point of failure to our decentralized network.

What may be awaiting at the destination URL

Intergalactic, Planetary

Fortunately, this is exactly the sort of problem that the IPFS project is intended to solve (I’ll assume the reader is familiar with IPFS, but here is a nice intro post). Briefly, using IPFS gives us location-independent addressing and per-byte costs approximating the underlying storage.

By replacing location-based references with IPFS multihash content addresses, we get a robust, wholly decentralized system without astronomical storage costs. Similar approaches, like Swarm, offer comparable guarantees. Sounds good, right?
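A multihash is just a digest prefixed with a hash-function code and length, so the reference stays valid no matter where the bytes live. A minimal sketch for sha2-256 (function code 0x12), with a small base58 encoder inlined for illustration:

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(raw: bytes) -> str:
    """Minimal base58 (Bitcoin alphabet), enough for multihashes."""
    n = int.from_bytes(raw, "big")
    out = ""
    while n:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    for byte in raw:          # each leading zero byte encodes as '1'
        if byte:
            break
        out = "1" + out
    return out

def multihash_sha256(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    # 0x12 = sha2-256 function code, then the digest length (32)
    return bytes([0x12, len(digest)]) + digest

mh = multihash_sha256(b"hello world")
assert len(mh) == 34
# sha2-256 multihashes always base58-encode to a string starting "Qm"
print(b58encode(mh)[:2])  # Qm
```

That familiar Qm prefix on IPFS addresses is nothing more than the 0x12 0x20 header showing through the base58 encoding.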

Oracles and Logs (a brief diversion)

If you’re familiar with the EVM, you’ve probably noticed that so far I haven’t talked about the other storage mechanism (ignoring memory and stack): events. The events system, which started out as a kind of abuse of the EVM logging facilities, offers some intriguing upsides: it’s much less expensive on a per-byte basis ($1.8K/GB, firmly in the mid-90s), has fairly robust support in tooling like Truffle, and currently doesn’t get pruned, so (in theory) its longevity is the same as the main chain’s.
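The same back-of-envelope arithmetic as before lands right around that figure, again assuming 8 gas per byte of LOG data, a 20 gwei gas price, and roughly $10/ETH:

```python
# Cost of 1 GB written as event log data, versus ~$134k/GB for SSTORE.
# Assumed figures: 8 gas per LOG data byte, 20 gwei, ~$10/ETH.
GAS_PER_LOG_BYTE = 8
GAS_PRICE_ETH = 20e-9
USD_PER_ETH = 10.0

usd_per_gb = (1024 ** 3) * GAS_PER_LOG_BYTE * GAS_PRICE_ETH * USD_PER_ETH
print(round(usd_per_gb))  # roughly $1,700 per GB
```

Roughly two orders of magnitude cheaper than contract storage, which is why logs are the natural place to tap in.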

The downside of events is that, from the contract’s perspective, they act like a kind of garbage chute leading to a landfill: you can never get back what you’ve written, and it all ends up in a giant pile together with all the other events, transaction receipts, internal events, etc.

The events system is also an essential feature underlying oracles, which are currently the best way to get data in and out of Ethereum. This is where we tap in.

Enter Mediachain

We started building Mediachain because our consumer-facing apps Mine (2015) and Attribution Engine (2016) put us face to face with the limitations of on-chain storage (read more about our thinking at that time here). IPFS, and the greater libp2p ecosystem, felt like the right direction but we realized we needed a few more pieces:

scale/cardinality: media datasets are large (think 155MM objects in the Library of Congress or 30MM+ in Spotify’s catalog), and advertising and routing these as top-level objects in IPFS is not quite feasible due to DHT limitations (post with numbers forthcoming)

structure/discoverability: we needed to be able to use existing identifiers, and to answer queries like “give me all the songs the system knows about” efficiently

collaboration: related to the above, we think that reuse, remixing, and collaboration are among the most exciting aspects of distributed apps + open data, and they need first-class support

Mediachain builds on the ideas (and internals: Mediachain nodes are fully compatible libp2p nodes) of IPFS, adding namespaces for discovery, the ability to route hundreds of millions or billions of objects, and a flexible permissions model to enable collaboration. We think this is the right level of abstraction and feature set to build the next SoundCloud, Reddit, or something we haven’t even imagined yet.

MC ❤ ETH

Over the past few months, we’ve focused primarily on building Mediachain Core: the infrastructure that holds and routes the data. Now, we’re releasing some early experiments in connecting Mediachain to the Ethereum universe. The first is a Registrar contract in Solidity, which you can inherit from to create events that get “captured” by listening Mediachain nodes:

event Write(
    address payer,
    string namespace,
    bytes body,
    uint fee
);

where namespace is a namespace you have permission to write to (an extended discussion of possible permission models, particularly token-incentivized namespaces and how they relate to fee, will be the subject of an upcoming post) and body is a CBOR/IPLD metadata blob to save. There is also a much more extensive “beatcoin” prototype (developed together with Zeppelin) that allows registration and lookup of payment information for something like a decentralized Bandcamp.
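On the listening side, a Mediachain node (or any watcher) decodes Write events as they arrive and ingests the bodies. A hypothetical sketch of just the handler logic, fed a plain dict shaped like a decoded web3-style event; the subscription wiring and the node’s real ingestion code are omitted:

```python
# Hypothetical handler for a decoded Write event. The dict shape
# mirrors what a web3-style library hands back after ABI decoding;
# the subscription/transport wiring is omitted.
def handle_write(event: dict) -> tuple:
    args = event["args"]
    namespace = args["namespace"]  # must be writable by the payer
    body = args["body"]            # CBOR/IPLD metadata blob (raw bytes)
    # a real node would check permissions, then persist `body`
    # under `namespace` and announce it to peers
    return namespace, body

# mock event, shaped like a decoded log entry
mock = {
    "event": "Write",
    "args": {
        "payer": "0x" + "00" * 20,
        "namespace": "music.example/releases",
        "body": b"\xa1cfoocbar",   # a tiny CBOR map, for illustration
        "fee": 0,
    },
}
ns, blob = handle_write(mock)
print(ns)  # music.example/releases
```

The namespace here (music.example/releases) is made up for illustration; the point is that the contract emits, the node captures, and the metadata lands in Mediachain without ever touching contract storage.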

The second is an embeddable version of aleph, which you can include in your dApp UI together with something like Metamask to provide querying, logic and payment, all against decentralized systems. We hope that this will eventually feel much like writing an app on Firebase: a fluent API on top of a “magic” database and identity layer.

Please join our Slack if you want to experiment with these, are building a dApp in the media space, or are just curious about the future of decentralized applications.

Thanks to Simon de la Rouviere, Maciej Olpinski, Doug Petkanics, Demian Brener from Zeppelin and Andy and Nikolai from Nexus for feedback on drafts of this post.