TL;DR: Contracts for decentralised storage make people feel safer than incentives do, but the statistics tell a totally different story.

One of the questions we often receive when explaining Arweave’s proof of access mechanism is: ‘So the storage of my data is probabilistic? Doesn’t that mean that it can get lost?’ In this post we outline some of the numbers around this issue, and explain why an incentive based approach (like Arweave’s) is actually statistically safer in practice than the contractual approaches taken by most other decentralised storage projects.

Incentive based storage

In Arweave, miners compete to store as much of the blockweave as they can, in order to receive block and transaction reward pool tokens in the network. The more blocks they store, the higher their likelihood of receiving a reward while mining (more information in our lightpaper here). In this post we will be referring to the average proportion of blocks that each miner stores as the ‘replication rate’. For example, if we had a blockweave with 10 blocks, of which each miner held 5, we would have a replication rate of 0.5.

Our goal is to find the probability that, at any given time, there exists a block that no node in the network has access to. First, we need to calculate the probability that a given node will have access to a specific block:

P(can_access(node, block)) = replication_rate

Simple enough, but we want the inverse of this — the probability that a given node will not have access to a given block:

P(not can_access(node, block)) = 1-replication_rate

Fine. Now we need to calculate the probability that none of the nodes in a given set has access to an individual block (assuming each node selects its blocks independently of the others):

P(not can_access(nodes, block)) = (1-replication_rate)^count(nodes)

To make this step simpler to digest, let’s imagine a network with 2 nodes and a replication rate of 0.5. From the above, we know that the likelihood of a node not having the data is 50%. We can then see that the likelihood of both nodes not having the data is 50% of 50% — 25%. If we add a third node, this trend continues — 50% of 50% of 50% — 12.5%. When we think about this in terms of the probabilities, we can see that this pattern (0.5*0.5*0.5=0.125) extrapolates to the formula above as we add more nodes.
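The two and three node cases above can also be checked numerically. The following is a minimal sketch, not Arweave code: the function names are hypothetical, and it assumes that every node holds a given block independently with probability equal to the replication rate.

```python
import random


def p_block_missing(replication_rate, node_count):
    """Closed-form probability that none of the nodes holds a given block."""
    return (1 - replication_rate) ** node_count


def simulate_block_missing(replication_rate, node_count, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability.

    Assumption: each node holds the block independently with
    probability `replication_rate`.
    """
    rng = random.Random(seed)
    missing = 0
    for _ in range(trials):
        # The block is unavailable in this trial only if every node lacks it.
        if all(rng.random() >= replication_rate for _ in range(node_count)):
            missing += 1
    return missing / trials


if __name__ == "__main__":
    for nodes in (2, 3):
        exact = p_block_missing(0.5, nodes)
        estimate = simulate_block_missing(0.5, nodes)
        print(f"{nodes} nodes: exact={exact}, simulated={estimate:.4f}")
```

The simulated values should land close to 0.25 and 0.125. Simulation only works for toy networks like these, because the probabilities for realistic node counts are far too small to observe by sampling.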

Next up, let’s calculate the probability that a given block is on at least one node in the network:

P(can_access(nodes, block)) = 1-((1-replication_rate)^count(nodes))

Simple. And finally, the probability that every block in the blockweave is available from at least one node in the network:

P(can_access(nodes, blocks)) =

(1-((1-replication_rate)^count(nodes)))^count(blocks)

Theory aside, here is how the probabilities come out. If we imagine a modestly sized Arweave network with 200 nodes, a replication rate of 50% and 200,000 blocks, the probability that a given block in that network is not available is 6.223*10^-61 (a decimal point followed by sixty consecutive zeroes and then a six). In the current Arweave network, the probability of a given block not being available is 4.498*10^-290. Not very likely.
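For readers who want to reproduce the 200 node figure, here is a rough sketch in Python. The function names are hypothetical, and it again assumes independent random block selection. It also shows one practical wrinkle: computing the whole-weave probability naively as 1-(1-p)^count(blocks) rounds to zero in floating point, so the sketch uses log1p and expm1 instead.

```python
import math


def p_block_unavailable(replication_rate, node_count):
    """Probability that no node holds one given block."""
    return (1 - replication_rate) ** node_count


def p_any_block_unavailable(replication_rate, node_count, block_count):
    """Probability that at least one block in the weave is on no node.

    Equals 1 - (1 - p)^block_count, evaluated via log1p/expm1 so that a
    very small p is not lost to floating-point rounding.
    """
    p = p_block_unavailable(replication_rate, node_count)
    if p >= 1.0:
        return 1.0  # nobody stores anything, so every block is unavailable
    return -math.expm1(block_count * math.log1p(-p))


if __name__ == "__main__":
    # The example network from the post: 200 nodes, 50% replication, 200,000 blocks.
    print(p_block_unavailable(0.5, 200))               # about 6.223e-61
    print(p_any_block_unavailable(0.5, 200, 200_000))  # about 1.24e-55
```

Under these assumptions, even the probability that any one of the 200,000 blocks is missing stays around 10^-55.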

Even if the replication rate were to drop dramatically to just 20% (remember: storage is cheap relative to hashing, so miners are highly incentivised to optimise storage before spending money on improving their hash rate), the probability of a single block being unavailable would still be tiny: 0.000000000000000829.

All of these calculations assume the default Arweave node’s random block selection mechanism. When we move to a more intelligent block selection mechanism in the future, as long as available network storage > weave size (our storage pricing and TX reward pool mechanics take care of this), the probability of data unavailability theoretically drops to zero. In practice, however, it is far more likely that we would accidentally introduce a subtle error into the block selection agent that causes block unavailability, hence our reluctance to make this change long before it is necessary.

Contract based storage

Most other decentralised storage networks take a different approach. In these trustless decentralised networks (Sia, Storj, Filecoin, etc.) users create a storage contract with the (potentially malicious) miners in the network that specifies the duration and the number of replications of the data that the storer desires.

The probability that malicious actors in these contract based networks will be able to withhold or delete your data is quite simple to calculate. We will assume random assignment of storage to the nodes in the network, as is normal.

P(can_perform_withholding_attack) = (unfaithful_actors/(faithful_actors+unfaithful_actors))^replications

In the simple case where we have requested only one copy of the data to be stored in the network, the likelihood that the data can be withheld from us (or deleted) is simply equal to the proportion of bad actors in the network. For each additional replication, we need to take into account the likelihood that every node assigned to store our data is a bad actor (hence the exponent).

If we ask for 6 replications of a file in such a network (seems safe, right?), and the network suffers from a 10% bad actor rate, we can see that the probability of an attacker being able to deny access to our file is 0.000001.
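As a quick check of this arithmetic, the formula can be written out directly. This is a sketch only: the function name is hypothetical, and it assumes, as the formula does, that each replica is assigned to a node independently and uniformly at random.

```python
def p_withholding_attack(unfaithful_actors, faithful_actors, replications):
    """Probability that every replica of one file lands on an unfaithful node.

    Assumption: each replica is assigned independently and uniformly at
    random across all nodes in the network.
    """
    bad_fraction = unfaithful_actors / (faithful_actors + unfaithful_actors)
    return bad_fraction ** replications


if __name__ == "__main__":
    # 10% bad actors (10 of 100 nodes), 6 replications of the file.
    print(p_withholding_attack(10, 90, 6))  # about 1e-06
    # With a single replication the risk is just the bad actor rate.
    print(p_withholding_attack(10, 90, 1))  # 0.1
```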

This situation becomes worse when one considers that these statistics apply only to one specific file in the network. The likelihood that there exists at least one file in the network that is susceptible to a data withholding attack like this grows with the number of files that the network contains (roughly in proportion, for as long as the result remains small). Most contract based storage networks employ a system of collateral to dissuade miners from performing such attacks. In practice, however, the value of this collateral is typically far smaller than the amount that would satisfy the user if their data were lost: most 1 TB chunks of data are worth more to the storer than the collateral of ~$0.21.
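To see how the per-file figure scales across a whole network, the sketch below computes the chance that at least one file is exposed. The function name and the file counts are hypothetical, chosen purely for illustration, and the calculation assumes that files are assigned to nodes independently of one another.

```python
import math


def p_any_file_withholdable(p_single_file, file_count):
    """Probability that at least one of `file_count` files has all of its
    replicas on unfaithful nodes.

    Equals 1 - (1 - p)^file_count, assuming files are assigned independently.
    """
    if p_single_file >= 1.0:
        return 1.0
    return -math.expm1(file_count * math.log1p(-p_single_file))


if __name__ == "__main__":
    p_file = 1e-6  # per-file figure from the 6 replication, 10% bad actor example
    for files in (1_000, 1_000_000):  # hypothetical network sizes
        print(files, p_any_file_withholdable(p_file, files))
```

With a thousand files the result is close to 0.001, which is the roughly proportional regime. With a million files it is about 0.63, so under these illustrative assumptions it becomes more likely than not that some file in the network could be withheld.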

Perhaps the key takeaway here is not so much that incentives > contracts (although we do think that, too), but that all truly decentralised networks operate on probabilities, and that is just fine. To be clear, we are not attempting to imply that the contract based networks are likely to lose your data, since all of the probabilities in this post are exceptionally small, but we do think the probabilities around these networks are interesting and often misunderstood.

What are your thoughts on incentives vs contracts? Leave a comment below! Try out the Arweave network and grab your free tokens now.