Files as Capabilities in us

Previously, we saw how to manually add encryption and redundancy to our files. What we didn’t see was how to manage the additional metadata required by those features. And there is a lot of metadata: we need to know which hosts the file’s shards are stored on; which encryption algorithm was used to encrypt the shards; which erasure code parameters were used; and so on. If we want to retrieve our file later, we need to save all this metadata somewhere. While there are many ways to accomplish this, us provides a “blessed” format for doing so, called a metafile.

In this post, we’ll examine the structure of a metafile and how it fits into the us system. We’ll also take a look at another us format: the format that stores your file contracts. Although us provides types and functions for working with these formats, I’d like to focus on the design decisions rather than the API. So we’ll wrap up with a discussion of why metafiles and contracts in us are fundamentally different from their siad counterparts.

Part 1: Metafiles

A metafile consists of an index and a set of shards. The metafile format itself is a gzipped tar archive of the index and shards, each stored as a separate file. The suggested extension for metafiles is .usa – “a” for “archive.”

The index contains what we typically think of as file metadata: the size, mode bits, modtime, etc. It also contains Sia-specific information: the encryption key, the erasure code parameters, and the public keys of each host the file was stored on. The index is encoded as a simple JSON object; you can view it directly with tar xzf [metafile] index -O.
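
Here is a rough Python equivalent of that tar command. It is only a sketch: read_index is a hypothetical helper, not part of us, and it assumes nothing about the index beyond the entry name "index" and the JSON encoding described above.

```python
import json
import tarfile


def read_index(metafile_path):
    """Return the parsed JSON index stored inside a metafile.

    Assumes the archive entry is named "index", as in the tar command above.
    """
    with tarfile.open(metafile_path, mode="r:gz") as archive:
        # extractfile raises KeyError if there is no entry with this name.
        member = archive.extractfile("index")
        if member is None:
            raise ValueError("the 'index' entry is not a regular file")
        return json.load(member)
```

Because the index is plain JSON inside a plain tarball, any language with tar, gzip, and JSON support can do the same thing.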

The shards collectively describe the actual bytes that comprise the encrypted, redundant file. Each shard is associated with a single host. Shards consist of a series of binary-encoded “sector slices,” each identifying a sector (by its Merkle root) and an offset and length within that sector. For example, a shard might refer to the first 512 bytes of sector A, followed by 1024 bytes from the middle of sector B; the “logical shard” is then the concatenation of these two slices, for a total of 1536 bytes.
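
To make the arithmetic concrete, here is a small Python sketch of that example. The record layout (a 32-byte Merkle root followed by the offset and length as little-endian 32-bit integers) and the function names are assumptions made for illustration; they are not the real binary encoding.

```python
import struct

# Hypothetical record layout, for illustration only:
# 32-byte Merkle root, then offset and length as little-endian uint32 values.
SLICE_FORMAT = "<32sII"
SLICE_SIZE = struct.calcsize(SLICE_FORMAT)  # 40 bytes


def encode_slice(merkle_root, offset, length):
    """Pack one sector slice into its fixed-size binary record."""
    if len(merkle_root) != 32:
        raise ValueError("Merkle root must be 32 bytes")
    return struct.pack(SLICE_FORMAT, merkle_root, offset, length)


def decode_shard(data):
    """Split a shard's bytes into (merkle_root, offset, length) tuples."""
    if len(data) % SLICE_SIZE != 0:
        raise ValueError("shard length is not a multiple of the slice size")
    return [struct.unpack_from(SLICE_FORMAT, data, pos)
            for pos in range(0, len(data), SLICE_SIZE)]


def logical_length(slices):
    """Total number of bytes the slices refer to, in order."""
    return sum(length for _root, _offset, length in slices)
```

Encoding a 512-byte slice of sector A followed by a 1024-byte slice of sector B, then decoding the result, gives two slices whose logical length is 1536 bytes, matching the example above.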

To upload a file, we first create the index and initialize a shard file for each host. Each time we upload a sector of the file, we append a slice to the corresponding host’s shard file. When we’re done, we bundle the index and shard files into a .tar.gz. To download, we first decompress and unpack that archive; next, we use the sector slices in each shard to retrieve the (encrypted, redundant) data stored on hosts; finally, we use the metadata in the index to turn that raw data back into our original file.
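
The bundling step might look roughly like this in Python. This is a sketch, not the us implementation: build_metafile and the shard entry names are hypothetical, and the shard bytes are assumed to be already encoded.

```python
import io
import json
import tarfile


def build_metafile(path, index, shards):
    """Write a gzipped tar archive holding the index and one entry per shard.

    `index` is a JSON-serializable dict; `shards` maps a hypothetical entry
    name (one per host) to that shard's already-encoded bytes.
    """
    if "index" in shards:
        raise ValueError("'index' is reserved for the index entry")
    entries = {"index": json.dumps(index).encode("utf-8")}
    entries.update(shards)
    with tarfile.open(path, mode="w:gz") as archive:
        for name, data in entries.items():
            # TarInfo needs the size up front; the bytes come from memory.
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
```

Keeping each shard as its own archive entry means the upload loop only ever appends to one small buffer or file per host, and the archive is assembled once at the end.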

The metafile format was designed to be easy to grok and easy to work with. It uses existing popular formats – tar, gzip, JSON – for everything but the shards, which must be binary-encoded for performance reasons. The shards themselves are intuitive if you are familiar with erasure-coding; this is why the format explicitly makes each shard its own “thing,” rather than having one giant binary blob where the boundaries of each shard are determined by fixed offsets. (Don’t get me wrong, the latter approach has many advantages – but it’s also harder to work with.)

For a more technical description of the metafile format, see formats.md.

Part 2: Contracts

Contract files are much simpler than metafiles. They have two parts. First is an immutable header, containing the host’s public key, the contract ID, and the secret key that can sign revisions. Following this is the most recent revision of the contract, which contains information like the total Merkle root of the contract data, how many coins are allocated to the renter and host, etc. We overwrite the revision each time we modify the contract.

In principle, a contract could consist of just a header, because the Sia protocol allows us to request the most recent revision from the host. We keep a local copy of the revision for performance and convenience; it means we don’t need to perform any network I/O to answer simple questions like “how many coins are left in this contract?” But since the revision is non-essential, we can play fast and loose with it. If it becomes corrupted or lost, we can always ask the host for their copy. Consequently, we don’t need to worry about updating the revision atomically or calling fsync after each update, both of which can be major performance bottlenecks. The header, on the other hand, is essential and must never be modified. If we accidentally overwrite our secret key, for example, we’ll never be able to revise the contract again. To avoid this, we write the header exactly once, fsync it, and never touch it again.
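
The two write paths could be sketched like this in Python. The fixed HEADER_SIZE and the function names are assumptions for illustration, not the real contract layout; the point is only the difference in how carefully each part is written.

```python
import os

HEADER_SIZE = 128  # hypothetical fixed header length, for illustration


def create_contract_file(path, header):
    """Write the immutable header exactly once and flush it to disk."""
    if len(header) != HEADER_SIZE:
        raise ValueError("header must be exactly HEADER_SIZE bytes")
    # Mode "xb" fails if the file already exists, so this call can never
    # overwrite an existing header (and the secret key inside it).
    with open(path, "xb") as f:
        f.write(header)
        f.flush()
        os.fsync(f.fileno())


def write_revision(path, revision):
    """Overwrite the revision that follows the header, without fsync.

    A torn or lost write is tolerable here because the host can supply
    its copy of the latest revision.
    """
    with open(path, "r+b") as f:
        f.seek(HEADER_SIZE)
        f.write(revision)
        f.truncate()  # drop leftover bytes from a longer, older revision


def read_contract_file(path):
    """Return (header, revision) as raw bytes."""
    with open(path, "rb") as f:
        return f.read(HEADER_SIZE), f.read()
```

Note that write_revision never seeks before HEADER_SIZE, so even a buggy revision update cannot clobber the header.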

Contract files also used to contain the Merkle roots of each sector comprising the contract data, but a later upgrade to the Sia protocol made this unnecessary. The renter now only needs to store a single Merkle root – one covering the entire contract – in order to verify that the host has processed its requests correctly. (This root is stored in the revision.)

For a more technical description of the contract file format, see formats.md.

Part 3: Capabilities

What do contracts and metafiles have in common? The answer lies not so much in what they are as in what they allow you to do: possessing them inherently bestows access rights. If you possess a contract file, you can revise that contract; if you possess a metafile, you can download that file.

The technical term for this is a capability. A capability is “a communicable, unforgeable token of authority.” Contracts and metafiles contain both a reference to an object and a key that permits access to that object – the key is the “unforgeable token of authority.” And both files are communicable: if you send one to your friend, they gain the exact same access rights.

In siad, the contract and siafile formats contain references and keys, just like their us counterparts. But crucially, these files are not communicable. If you send a siad contract to a friend, they will not be able to load it into their own siad; the system was simply not designed to accommodate this sort of operation. Likewise with siafiles, although siad plans to support some form of filesharing soon.

By contrast, us encourages treating these files as first-class citizens. When you finish uploading a file with user, it doesn’t just print “done”; it directly returns a capability, in the form of a metafile. What you do with that metafile is up to you: you can stick it in a folder, rename it, delete it, compress it, rsync it to a backup server, whatever. The important thing is that it’s sitting out in the open for you to manipulate, and you manipulate it directly via the filesystem instead of through a custom API.

A great example of this is how user manages “enabled” and “disabled” contracts: to enable a contract, just create a symlink to it in the appropriate directory. To disable the contract, just delete the symlink. (This should sound familiar to anyone who has used nginx.) This sort of functionality could have been accomplished with an “enabled” list in a config file, or by adding a bool to the contract format, but punting it to the filesystem gives us this feature for free, and with semantics the user is already familiar with.
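
In Python, that workflow might look like the sketch below. The directory layout and function names are hypothetical; user may organize its contract directories differently, and in practice a plain ln -s and rm do the same job.

```python
import os


def enable_contract(contract_path, enabled_dir):
    """Enable a contract by symlinking it into the enabled directory."""
    os.makedirs(enabled_dir, exist_ok=True)
    link = os.path.join(enabled_dir, os.path.basename(contract_path))
    os.symlink(os.path.abspath(contract_path), link)
    return link


def disable_contract(contract_path, enabled_dir):
    """Disable a contract by removing its symlink; the contract file stays."""
    os.unlink(os.path.join(enabled_dir, os.path.basename(contract_path)))


def enabled_contracts(enabled_dir):
    """List the names of contracts that currently have a symlink."""
    return sorted(name for name in os.listdir(enabled_dir)
                  if os.path.islink(os.path.join(enabled_dir, name)))
```

Disabling only removes the link, so the contract file itself (and the secret key in its header) is never at risk.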

Of course, there are downsides to shoving these files in the user’s face. Managing contracts is hard work; you need to pick your hosts carefully and make sure you renew on time. These tasks are best handled by a sophisticated program, not a human. That’s why siad abstracts your contracts into an “allowance,” so that you can focus on the high-level goals: how much to spend, and over what period. Same with files: by ceding control of your files to siad, you allow it to automatically repair them when hosts go offline. user, on the other hand, forces you to make all of these decisions explicitly, which is both empowering and overwhelming. The good news is that you can write your own siad! That is, you can write programs in any language that automatically manage your contracts and files. For example, you could set up a cron job that automatically renews your contracts, or a Python script that regularly scans your hosts and sorts them by latency/throughput/price. So in the long term, I don’t expect many people to invoke user directly. Instead, they’ll build more sophisticated systems on top of user (or us) that are tailored to their specific needs.
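
As an illustration of how small such a tool can be, here is a sketch of the sorting half of that host-scanning script. The record fields and the rank_hosts helper are made up for the example, and actually measuring hosts is left out entirely.

```python
def rank_hosts(hosts, by="price"):
    """Return host records sorted best-first by the chosen metric.

    Each record is a dict with hypothetical "price", "latency", and
    "throughput" fields. Lower is better for price and latency; higher
    is better for throughput.
    """
    if by not in ("price", "latency", "throughput"):
        raise ValueError("unknown metric: " + by)
    higher_is_better = (by == "throughput")
    return sorted(hosts, key=lambda host: host[by], reverse=higher_is_better)
```

A cron job could run a script like this on a schedule and feed the top entries into whatever forms or renews your contracts.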

Conclusion