Storj has just released White Paper v3 at https://storj.io/storjv3.pdf. The White Paper runs to 90 pages and takes quite a lot of time to read. If you don’t have time to read it yourself, you can take a look at my brief technical interpretation.

Storj’s previous White Paper, v2, was released on December 15, 2016; White Paper v3 was released in November 2018, nearly two years later.

My general feeling after reading White Paper v3 is that it is substantially more practical than White Paper v2 and that it explains the details of many implementations. The entire White Paper is about the architectural details of Storj’s decentralized storage; the blockchain aspect is mentioned only rarely.

After reading the Storj White Paper, several points were clear to me:

Storj is still an ERC20 token. Stored data is not written to the blockchain; only the asset data (essentially “money”) is written to the blockchain. Because White Paper v3 rarely mentions blockchain matters, Storj’s current use of the ERC20 token can be expected to continue for a long time.

Storj still uses a centralized settlement method that pays out “wages” every month to motivate miners. These “wages” are paid in the Storj ERC20 token.

The “farmers” of White Paper v2 have all been renamed “storage nodes” in White Paper v3. This also implies that Storj is on a route leading away from the blockchain. In this sense, Storj has moved closer to traditional, centralized services such as Amazon’s AWS S3.

In the following section I will give a thorough summary and interpretation of Storj.

Storj History

Here are several important dates in the history of Storj:

In July 2014, the Storj project was established and held its first token sale, raising 910 bitcoins, worth about $500K USD at the time. After almost two years of development, the beta version was launched in April 2016; at the end of 2016, the second edition of the Storj White Paper was released. From February to July 2017, another round of token sales was held; this one was effectively an ICO and raised about $30M USD. In March 2018, Ben Golub, the founder and former CEO of Docker, joined Storj as the new CEO. Finally, White Paper v3 was released in November 2018.

Storj design constraints

In Storj White Paper V3, the first thing mentioned is the set of design constraints. These constraints include: security and privacy; decentralization; marketplace and economics; compatibility with Amazon’s AWS S3; durability, device failure, and churn; latency; bandwidth; object size; Byzantine fault tolerance; and coordination avoidance.

Here is the interpretation for several key points of information:

1. It can be seen that “security and privacy” is the first of Storj’s design constraints. If other principles conflict with it, they must still be implemented in accordance with this constraint.

Based on this foundational principle, decentralized storage must be designed from the ground up to support not only end-to-end encryption but also enhanced security and privacy at all levels of the system, including compliance with regulations such as HIPAA and GDPR.

2. Storj is looking for ways to drive down infrastructure costs for maintenance, utilities, and bandwidth.

3. The Storj White Paper V3 is the first to mention AWS S3 compatibility. This allows developers to quickly port programs previously written for AWS S3 over to Storj. Storj now supports the seven core APIs of AWS S3 (see the sketch after this list).

4. Storj defines a series of QoS metrics, with a particular focus on latency. Previous feedback from the Storj community indicated that it sometimes took several hours to store data in the Storj network, and Storj now appears very concerned about this feedback. In addition to latency, another very important parameter has been defined: durability. Durability is the probability that data is not lost, even in the event of a large number of hardware failures or a large number of storage nodes going offline. Durability is generally measured in terms of the number of nines. For example, 99.99% durability is 4 nines of durability, indicating that roughly one object in ten thousand may be lost.

5. Storj’s White Paper v3 thoroughly explains its economic system, which is designed around four different roles: end users, storage node operators, demand providers, and the network operator (Storj Labs).

6. The White Paper establishes that stored objects are expected to be a minimum of 4MB. Smaller files are supported, but will be stored at the cost of a 4MB file, encouraging users to store larger files.

7. Storage nodes are generally classified into three categories:

Byzantine nodes, which may deviate arbitrarily from the suggested protocol.

Altruistic nodes, which, aside from inevitable hardware failures, are good nodes that fully comply with the rules and selflessly serve users.

Rational nodes, which are neutral actors that comply with the rules only when it is in their best interest.

In general, most nodes are rational. Byzantine and altruistic nodes are quite uncommon.

8. In order to achieve the largest possible scale, Storj’s strategy strongly favors local protocols and minimal coordination.
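Since the paper highlights S3 compatibility (point 3 above), here is a minimal sketch of what that means in practice for a developer. It assumes a Storj S3 gateway running locally; the endpoint URL and credentials are placeholders, not values from the White Paper.

```python
import boto3

# Point an ordinary S3 client at a (hypothetical) local Storj S3 gateway.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:7777",   # placeholder gateway address
    aws_access_key_id="ACCESS_KEY",          # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# The same S3 calls an application already makes against AWS keep working.
s3.create_bucket(Bucket="demo")
s3.put_object(Bucket="demo", Key="hello.txt", Body=b"hello storj")
print(s3.get_object(Bucket="demo", Key="hello.txt")["Body"].read())
```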

Actors

Storj’s system includes the following roles:

Client: A user or application that will upload or download data from the network.

Peer classes:

Uplinks: This peer class represents any application or service that implements libuplink and wants to store and/or retrieve data. This peer class is not expected to remain online like the other two classes and is relatively lightweight.

Libuplink

Gateway

Uplink CLI

Satellites: This peer class caches node address information, stores per-object metadata, maintains storage node reputation, aggregates billing data, pays storage nodes, performs audits and repair, and manages authorization and user accounts.

Users have accounts on and trust specific Satellites. Any user can run their own Satellite, but Storj expects that many users will choose a Satellite run by a trusted third party to avoid the operational complexity.


Framework

Storj is designed so that everything within its framework does the following things: store data, retrieve data, maintain data, and pay for usage.

The individual components include: storage nodes; peer-to-peer communication and discovery; redundancy; metadata; encryption; audits and reputation; data repair; and payments.

Storage nodes

The storage node’s role is to store and return data. Storage nodes are selected to store data based on various criteria including ping time, latency, throughput, bandwidth caps, sufficient disk space, geographic location, uptime, history of responding accurately to audits, and so forth. This is very close to the way traditional P2P projects choose nodes.

Pieces may be stored with a specific TTL expiry, where data is expected to be deleted after the expiration date. Storage nodes must also keep track of signed bandwidth allocations to send to Satellites for later settlement and payment. Both TTLs and bandwidth allocations are stored in an SQLite database.

Storage nodes can choose which Satellites to work with. If they work with multiple Satellites (the default behavior), payment may come from multiple sources with varying payment schedules. Storage nodes that fail random audits will be removed from the pool, can lose funds held in escrow to cover additional costs, and will receive limited to no future payments.

Storage nodes support three methods: get, put, and delete. Each method takes a piece ID, a Satellite ID, a signature from the associated Satellite instance, and a bandwidth allocation. Storage nodes allow administrators to configure the maximum allowed disk space and per-Satellite bandwidth usage over the last rolling 30 days.
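To make the interface concrete, here is a hypothetical sketch of those three methods in Python. The class and field names are mine, the signature check is a stub, and the real node additionally enforces TTLs and persists its state in SQLite.

```python
from dataclasses import dataclass, field

@dataclass
class StorageNode:
    pieces: dict = field(default_factory=dict)       # piece_id -> bytes
    allocations: list = field(default_factory=list)  # kept for settlement

    def _authorized(self, satellite_id: str, signature: bytes) -> bool:
        # Placeholder: a real node verifies the Satellite's signature here.
        return bool(signature)

    def put(self, piece_id, satellite_id, signature, allocation, data: bytes):
        if not self._authorized(satellite_id, signature):
            raise PermissionError("invalid Satellite signature")
        self.allocations.append(allocation)  # sent to the Satellite later
        self.pieces[piece_id] = data

    def get(self, piece_id, satellite_id, signature, allocation) -> bytes:
        if not self._authorized(satellite_id, signature):
            raise PermissionError("invalid Satellite signature")
        self.allocations.append(allocation)
        return self.pieces[piece_id]

    def delete(self, piece_id, satellite_id, signature, allocation):
        if not self._authorized(satellite_id, signature):
            raise PermissionError("invalid Satellite signature")
        self.pieces.pop(piece_id, None)
```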

Peer-to-peer communication and discovery

All peers on the network communicate via a standardized protocol. This protocol supports the following:

● Provides peer reachability, even in the face of firewalls and NATs where possible. This may require techniques like STUN, UPnP, NAT-PMP, etc.

● Provides authentication as in S/Kademlia, where each participant cryptographically proves the identity of the peer with whom they are speaking to avoid man-in-the-middle attacks.

● Provides complete privacy: all communications are private by default.

● Allows a network overlay, such as Chord, Pastry, or Kademlia, to be built on top of Storj’s chosen peer-to-peer communication protocol to provide discovery services.

Redundancy

Redundancy was mentioned in Storj’s White Paper v2, but the redundancy used at that time was simple replication. Storj’s White Paper V3 explains the shortcomings of simple replication and uses erasure codes instead. I will now briefly cover the advantages and disadvantages of each.

Simple replication

The creation of identical copies of data across different locations of the storage system

Typically 2 or 3 copies, configurable based on the accepted risk level

If a drive fails, data is re-created from the copy on another drive

Pros:

Less CPU intensive = faster writing performance

Simple restoration = faster rebuilding performance

Cons:

Requires 2x or more the original storage space and bandwidth

Erasure Codes

A protection technique based on parity checking

Data is broken into fragments and encoded

Uses a configurable number of redundant pieces and does not require pieces to be stored across different locations

Pros:

Consumes less storage than replication — good for cheap/deep storage

Allows for the failure of two or more elements of a storage system

Cons:

Parity calculation is CPU-intensive

Increased latency can slow production writes and rebuilds

Reed-Solomon Erasure codes

If a block of data is encoded with a (k, n) erasure code, there are n total generated erasure shares, any k of which are sufficient to recover the original block of data. If a block of data is s bytes, each of the n erasure shares is roughly s/k bytes. Unlike replication (k = 1), all erasure shares are unique. For example, for 1MB of data with a (10, 16) erasure code, each erasure share is roughly 0.1MB and the total stored data is 1.6MB.
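A few lines of Python, just to make the share-size arithmetic above concrete (the numbers mirror the example in the text; nothing here is specific to Storj):

```python
def erasure_overhead(s_bytes: int, k: int, n: int):
    """Share size and total storage for a (k, n) erasure code."""
    share = s_bytes / k         # each of the n shares is roughly s/k bytes
    total = share * n           # bytes stored across all n shares
    return share, total, n / k  # n/k is the expansion factor

print(erasure_overhead(1_000_000, 10, 16))  # (100000.0, 1600000.0, 1.6)
```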

Reed–Solomon codes were developed in 1960 by Irving S. Reed and Gustave Solomon, who were then staff members of MIT Lincoln Laboratory. Decades later, Reed-Solomon remains the classic erasure coding algorithm, with mature open-source implementations; other erasure coding algorithms are broadly similar.

Upload and download

Storj doesn’t necessarily add redundancy as soon as it detects a shortfall; rather, it analyzes the situation to determine whether redundancy should be added. Four parameters govern this:

● K: the minimum number of pieces required for reconstruction

● M: the minimum safety threshold, essentially a safety buffer for data reconstruction

● O: the optimal value, able to act as a buffer against node fluctuation

● N: the long tail tolerance

The value of K is set such that if the number of available pieces falls below K, the data will be lost. In other words, K determines whether or not the data survives.

The value of M is set such that if a Satellite notices the number of available pieces has fallen below M, it immediately triggers a repair in order to ensure that Storj always maintains K or more pieces. In other words, M is the safety line.

O is the desired degree of redundancy. Its value is set such that during uploads and repairs, as soon as O pieces have finished uploading, the remaining uploads (up to N) are canceled.

The value of N, therefore, is set such that storing N pieces will exceed the redundancy goal.
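A minimal sketch of that decision logic, with the thresholds as plain function arguments (the rules are my paraphrase of the text above, not code from the White Paper):

```python
def segment_status(available: int, k: int, m: int) -> str:
    """Classify a segment by how many of its pieces are still available."""
    if available < k:
        return "lost"      # below K the segment can no longer be rebuilt
    if available < m:
        return "repair"    # below the safety line M: trigger a repair
    return "healthy"

def uploads_done(finished: int, o: int) -> bool:
    # During uploads and repairs, cancel the long tail (up to N)
    # as soon as O pieces have finished uploading.
    return finished >= o

print(segment_status(available=25, k=20, m=30))  # -> "repair"
```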

Durability

Mathematically, time-dependent processes are modeled according to the Poisson distribution, where it is assumed that events are observed in a given unit of time. As a result, durability is modeled as the cumulative distribution function (CDF) of the Poisson distribution with mean λ = pn, where pn pieces of the file are expected to be lost monthly. To estimate durability, we consider the CDF up to n − k, i.e. the probability that at most n − k pieces of the file are lost in a month so that the file can still be rebuilt. The CDF is given by:

$$P(X \le n-k) = \sum_{i=0}^{n-k} \frac{e^{-\lambda}\,\lambda^{i}}{i!}, \qquad \lambda = pn$$

Storj made the following assumptions:

p is the monthly piece loss rate, which Storj assumes to be 10%

n and k are the parameters of the erasure code

λ (lambda) is the mean of the Poisson distribution, which is p·n

The expansion factor is the redundancy multiple, n/k

As you can see, Storj is able to achieve very high monthly durability.
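For illustration, the CDF above is easy to evaluate directly. The snippet below uses p = 0.1 as the paper assumes; the (k, n) pair is just an example, not necessarily a row from the White Paper’s own table.

```python
import math

def monthly_durability(p: float, k: int, n: int) -> float:
    """P(at most n - k pieces are lost in a month), Poisson with mean pn."""
    lam = p * n
    return sum(math.exp(-lam) * lam**i / math.factorial(i)
               for i in range(n - k + 1))

# e.g. p = 10% monthly loss with a (20, 40) code: many nines of durability
print(monthly_durability(0.1, 20, 40))
```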

However, this calculation process is somewhat problematic.

● Storj did not consider the time needed to recover lost erasure shares. As mentioned above, Storj has designed the parameters K, M, and O precisely so that lost shares are repaired, which this calculation ignores.

● The P obtained from this calculation is only the monthly durability of data, whereas the storage industry generally quotes annual durability; the durability announced by AWS S3 is annual durability.

● The Poisson CDF is an approximation of the underlying loss process, and it is only suitable when p is small and n is large. Because Storj’s assumed p is 0.1, which is not small, the numbers calculated from the CDF formula are not completely accurate.

● The Storj White Paper V3 also calculates the case of k = 1 (that is, simulating simple replication). For that case in particular, the error in the calculation is relatively large.

Data

Here is each of the data units used by Storj:

1. Bucket:

A bucket is an unbounded but named collection of files identified by paths. Every file has a unique path within a bucket. This is the same as the definition of the bucket in AWS S3.

2. Path:

A path is a unique identifier for a file within a bucket. Unless otherwise requested, Storj encrypts paths before they ever leave the customer’s application’s computer. This is the same as the definition of the path in AWS S3.

3. File or Object:

A file is referred to by a path, contains an arbitrary amount of bytes, and has no minimum or maximum size. A file is represented by an ordered collection of one or more segments. Segments have a fixed maximum size. This is the same as the definition of Object in AWS S3.

4. Extended attribute:

An extended attribute is a user-defined key/value field that is associated with a file. Like other per-file metadata, extended attributes are stored encrypted.

5. Segment:

A segment represents a single array of bytes, between 0 and a user-configurable maximum segment size.

6. Remote Segment:

A remote segment is a segment that will be erasure encoded and distributed across the network. A remote segment is larger than the metadata required to keep track of its bookkeeping, which includes information such as the IDs of the nodes that the data is stored on.

7. Inline Segment:

An inline segment is a segment that is small enough that the data it represents takes less space than the metadata a remote segment would need in order to track which nodes hold the data. In such cases, the data is stored “inline” instead of being stored on nodes.

8. Stripe:

A stripe is a further subdivision of a segment. A stripe is a fixed number of bytes that is used as an encryption and erasure encoding boundary size. Erasure encoding happens on stripes individually. A stripe is the unit on which audits are performed.

9. Erasure Share:

When a stripe is erasure encoded, it generates multiple pieces called erasure shares. Only a subset of the erasure shares is needed to recover the original stripe. Each erasure share has an index identifying which erasure share it is.

10. Piece:

When a remote segment’s stripes are erasure encoded into erasure shares, the erasure shares for that remote segment with the same index are concatenated together, and that concatenated group of erasure shares is called a piece. The ith piece is the concatenation of all of the ith erasure shares from that segment’s stripes.

11. Pointer:

A pointer is a data structure that either contains the inline segment data or keeps track of which storage nodes the pieces of a remote segment were stored on, along with other per-file metadata.

[Figure: schematic diagram of the data units in Storj]
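To tie the units together, here is a small sketch computing how a file breaks down into segments, stripes, and pieces. All of the sizes and the (k, n) values are assumptions for illustration, not figures from the White Paper.

```python
import math

def layout(file_bytes: int, segment_size: int, stripe_size: int, k: int, n: int):
    """How many segments, stripes, and pieces a file turns into."""
    segments = math.ceil(file_bytes / segment_size)
    stripes_per_segment = math.ceil(segment_size / stripe_size)
    # Each stripe is encoded into n shares of ~stripe_size/k bytes; the
    # i-th piece of a segment concatenates the i-th share of every stripe.
    piece_size = stripes_per_segment * stripe_size / k
    return segments, stripes_per_segment, n, piece_size

# e.g. a 64MB file, 64MB segments, 4KB stripes, a (20, 40) code:
print(layout(64 * 2**20, 64 * 2**20, 4096, 20, 40))
```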

Storj’s White Paper V2 only briefly mentioned metadata; White Paper V3 now describes the metadata in detail.

● Every time an object is added, edited, or removed, one or more entries in this metadata storage system will need to be adjusted.

● The metadata system may experience large amounts of data loss, and across the entire user base the metadata itself will likely undergo significant changes.

● Storj expects the platform to incorporate multiple implementations of metadata storage that users will be allowed to choose between.

● The platform has Amazon S3 compatibility and is able to Put (store metadata at a given path), Get (retrieve metadata at a given path), List (paginated, deterministic listing of existing paths), and Delete (remove a path), as sketched below.
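A toy, in-memory version of those four operations, just to pin down their semantics (in particular the paginated, deterministic List); a real implementation would be a durable, distributed store:

```python
class MetadataStore:
    """Toy metadata store keyed by (encrypted) path."""
    def __init__(self):
        self._db = {}  # path -> pointer/metadata

    def put(self, path: str, pointer: dict):
        self._db[path] = pointer

    def get(self, path: str) -> dict:
        return self._db[path]

    def list(self, after: str = "", limit: int = 1000):
        # Deterministic pagination: sorted paths strictly after a cursor.
        return sorted(p for p in self._db if p > after)[:limit]

    def delete(self, path: str):
        self._db.pop(path, None)
```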

Encryption

All data and metadata will be encrypted. Data must be encrypted before it ever leaves the original computer; the Amazon S3-compatible client library and the user’s application should run on the same computer. Storj’s encryption choice is authenticated encryption, used so that the user can tell if anything has tampered with the data. Encryption should use a pluggable mechanism, where various encryption algorithms can be used. A hierarchical encryption scheme based on BIP32 will allow subtrees to be shared without sharing their parents; in other words, it will allow some files to be shared without sharing other files.

The same encryption key should not be used for every file, as having access to one file would then yield the decryption keys for all files. Therefore, each Storj file is encrypted with a different key. Data is encrypted in blocks of small batches of stripes, preferably 4KB or less in size. Paths are also encrypted. Like BIP32, the encryption is hierarchical and deterministic, and each path component is encrypted separately.
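To illustrate the hierarchical, deterministic idea, here is a BIP32-flavored sketch in which each path component derives a child key from its parent via HMAC. This mirrors the property described above (sharing a subtree key reveals nothing about parents or siblings); it is not Storj’s actual construction.

```python
import hashlib
import hmac

def child_key(parent_key: bytes, component: str) -> bytes:
    """Derive a child key for one path component from its parent key."""
    return hmac.new(parent_key, component.encode(), hashlib.sha256).digest()

def key_for_path(root_key: bytes, path: str) -> bytes:
    key = root_key
    for component in path.split("/"):
        key = child_key(key, component)
    return key

root = b"\x00" * 32  # placeholder root secret
# Handing out the "videos" subtree key lets the recipient derive keys for
# everything under videos/... but not for the root or sibling subtrees.
videos_key = key_for_path(root, "videos")
assert key_for_path(root, "videos/2018") == child_key(videos_key, "2018")
```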

Audits

Audits are simply a mechanism used to determine a node’s degree of stability.

Auditors, such as Satellites, send a challenge to a storage node and expect a valid response. As in the HAIL system, Storj uses erasure coding to read a single stripe at a time as a challenge and then validates the erasure share responses. This allows Storj to run arbitrary audits without pre-generated challenges. Storj requests that stripe’s erasure shares from all responsible storage nodes and then runs the Berlekamp-Welch algorithm [39, 73] across all of the erasure shares. When enough storage nodes return correct information, any faulty or missing responses can be easily identified.

Storage node reputation

Storage node uptime and overall health are the primary metrics used to determine which files need repair. A reputation system is needed to persist the history of audit outcomes for given node identities. Storage node reputation can be divided into four subsystems.

● A proof-of-work identity system: requires a short proof, in the form of time, stake, or resources, that the storage node operator is invested.

● An initial vetting process: slowly allows nodes to join the network.

● A filtering system: blocks bad storage nodes from participating.

● A preference system: the remaining statistics collected during audits are used to establish a preference for better storage nodes during uploads.

Data repair

To repair the data, Storj will recover the original data via an erasure code reconstruction from the remaining pieces and then regenerate the missing pieces and store them back in the network on new storage nodes.

Payments

● A storage system that performs sufficiently well cannot wait on the slow performance of blockchain operations.

● Storj’s framework instead works more like a game theory model, ensuring that participants in the network are properly incentivized so that they will remain in the network and behave rationally to get paid.

● Storage nodes in Storj framework should limit their exposure to untrusted payers.

● Storj currently uses Ethereum-based Storj ERC20 tokens as the default mechanism for payment, but other alternative payment types may be implemented in the future.

Satellite

The collection of services that holds this metadata is called the Satellite. Users of the network will have accounts on a specific Satellite instance, which will store their file metadata, manage authorization to data, keep track of storage node reliability, repair and maintain data when redundancy is reduced, and issue payments to storage nodes on the user’s behalf.

Note that in Storj, the user does not pay the storage node directly, rather the user pays the Satellite first, and then the Satellite pays the storage node.

The Satellite service is being developed and will be released as open source software, so any individual or organization can run their own Satellite to facilitate network access. The Satellite is never given unencrypted data and does not hold encryption keys. A Satellite instance is made up of these components:

● A full node discovery cache

● A per-object metadata database indexed by encrypted path

● An account management and authorization system

● A storage node reputation, statistics, and auditing system

● A data repair service

● A storage node payment service

[Figure: diagram of a put operation]

[Figure: diagram of a get operation]

Authorization

Metadata operations will be authorized. Users will authenticate with their Satellite, which will allow them access to various operations according to their authorization configuration. Once the Uplink is authorized with the Satellite, the Satellite will approve and sign for operations to storage nodes, including bandwidth allocations. The Uplink must retrieve valid signatures from the Satellite prior to operations with storage nodes.

Bandwidth allocation

The Satellite will only create a bandwidth allocation if the Uplink is authorized for the request. At the beginning of a storage operation, the Uplink can transfer the bandwidth allocation to a storage node. The storage node can validate the Satellite’s signature and perform the requested operation up to the allowed bandwidth limit, storing and later sending the bandwidth allocation to the Satellite for payment. In the case of a Get operation, assume the Satellite-signed bandwidth allocation allows up to x bytes total. The Uplink will start by sending a restricted allocation for some small amount (y bytes), will then send another allocation where y is larger, continuing to send allocations for data until y has grown to the full x value.
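A sketch of that incremental pattern: the Uplink reveals bandwidth in growing steps y until it reaches the Satellite-signed cap x. The doubling schedule and starting size are my choices for illustration; the White Paper only requires that y grows toward x.

```python
def allocation_steps(x: int, start: int = 64 * 1024):
    """Yield the growing restricted-allocation sizes sent to a node."""
    y = start
    while y < x:
        yield y            # send a restricted allocation for y bytes
        y = min(2 * y, x)  # grow the revealed amount toward the cap
    yield x                # final allocation for the full x bytes

print(list(allocation_steps(1_000_000)))
# [65536, 131072, 262144, 524288, 1000000]
```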

Garbage collection

Every time data is deleted, storage nodes that are online and reachable will receive delete notifications right away. Storage nodes that are temporarily unavailable will sometimes miss these delete messages; in those cases, the unneeded data is considered garbage.

Uplink

Uplink is the client-side middleware of Storj.

An Uplink is any software or service that invokes libuplink in order to interact with Satellites and storage nodes:

Libuplink — a library that provides access to storing and retrieving data in the Storj network.

Gateway — a simple service layer on top of libuplink. Storj’s first gateway is an Amazon S3 gateway.

Uplink CLI — a command line application that invokes libuplink.

Future work

Storj’s White Paper V3 also briefly covers what they are going to do in the future.

Hot files and content delivery: if necessary, Satellites are able to temporarily restrict access while increasing the redundancy of a file over more storage nodes, and then re-allow access.

Improving the user experience around metadata: in the long term, Storj plans to construct a platform out of Satellites, and hopes to eliminate Satellite control of metadata entirely via a viable Byzantine-fault-tolerant consensus algorithm.

Conclusion



Pros

● As a company targeting real products and markets, Storj truly values Quality of Service (QoS) and sincerely hopes to implement decentralized storage in practical scenarios.

● Has a strong focus on security and privacy, its foundational design principle.

● Is compatible with AWS S3 and supports several key AWS S3 APIs. For developers, this is a good thing.

● Avoids Byzantine distributed consensus, which greatly improves efficiency.

● Designs the redundancy details carefully and measures durability, the most important QoS metric for a commercially available storage system.

● Designs metadata handling with a dedicated process, since metadata is different from ordinary data and changes very frequently.

● Implements audits and reputation, in particular reputation values, which have a positive impact on the overall economic system.

Cons

● Very dependent on erasure coding, so the data repair overhead is very large.

● There is still little mention of blockchain technology; Storj gives blockchain very little consideration. For a project that has been ongoing for more than four years and yet still uses a centralized settlement method, this is a bit disappointing.

● The Satellite is a very centralized design: except for storing data, everything else is done by it.

● The monthly payment cycle is too long and is not friendly to storage nodes, especially new storage nodes, which must wait a very long time for positive feedback.

● Bandwidth allocation, using bytes as the unit of accounting, is not expressive enough.

● The distribution of popular files and content is rarely considered.

Why did I write this article?

As mentioned in my other articles, I designed and launched the PPIO Storage Public Chain project, which is a decentralized data storage and delivery platform for developers that values affordability, speed, and privacy. Although the PPIO project I designed is somewhat similar to Storj, I wrote this article from a neutral perspective. I think decentralized storage is a brand new track compared to centralized storage (e.g. AWS S3, Google Cloud, Microsoft Azure, etc.). The development of this new track, and the creation of value through decentralized storage, is something that everyone needs to explore together. I hope we can make progress together.

In the process of designing PPIO, many of my early design ideas were very similar to Storj v3, including compatibility with AWS S3, valuing QoS, viewing decentralized storage and the blockchain system as two independent submodules, using erasure shares, measuring durability, separately processing frequently changing metadata, and designing a verifier node (similar to Storj’s Satellite audits). Of course, PPIO also differs from Storj in many ways. Check out our White Paper and my articles for more information.

My email is wayne@pp.io. If you have any questions, feel free to reach out to me.

Article author: Wayne Wong

If you want to reprint this article, please indicate the source.

If you would like to discuss blockchain learning, you can contact me at wayne@pp.io.