With great throughput comes a great need for storage. During our load tests, which push ~1300 transactions per second, we’re generating more than 500GB of data per month. That much data is a big problem, especially for a blockchain, where many nodes must keep a full copy of the entire chain. After some time, storing the entire chain on one node would require massive, expensive hard drives or some kind of complicated storage setup. What if you don’t store the entire chain on a single node?

GoChain is doing just that: we’re storing data that a node doesn’t currently need offchain, in the cloud. This is cheaper, easier and better than the other options, since it essentially gives us unlimited space to grow.

There are a few things we’re doing in terms of storage:

Reducing data size of each block and the state tree

Different storage schemes

Offchain storage

Reducing Data Size of Each Block and the State Tree

This is exactly what it sounds like. We’re looking for every opportunity to reduce the size of the data on disk for each block and in the state tree. These are small things, but they add up to a non-trivial difference.

Different Storage Schemes

Many of the storage optimizations have to do with the underlying embedded database: Parity uses RocksDB and Go Ethereum (geth) uses LevelDB. One such optimization is to skip the database entirely for some data and store it in regular files instead. For instance, batches of old blocks can be stored in a simple binary file rather than an indexed database. Little tricks like this can make a huge difference in both the size on disk and the efficiency of storage and retrieval.
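As a rough illustration of the flat-file idea, here is a minimal sketch in Go. It is not GoChain's actual on-disk format. The names WriteBatch and ReadBatch, the file name and the record layout (a 4-byte big-endian length followed by the raw block bytes) are all assumptions made for this example.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

// WriteBatch stores a batch of already-encoded blocks in one flat file.
// Hypothetical layout: each record is a 4-byte big-endian length followed
// by the raw bytes. The file is written sequentially and has no index.
func WriteBatch(path string, blocks [][]byte) error {
	f, err := os.Create(path) // creates the file, or truncates an existing one
	if err != nil {
		return err
	}
	w := bufio.NewWriter(f)
	var hdr [4]byte
	for _, b := range blocks {
		binary.BigEndian.PutUint32(hdr[:], uint32(len(b)))
		if _, err := w.Write(hdr[:]); err != nil {
			f.Close()
			return err
		}
		if _, err := w.Write(b); err != nil {
			f.Close()
			return err
		}
	}
	if err := w.Flush(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// ReadBatch reads every record back in the order it was written.
func ReadBatch(path string) ([][]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	r := bufio.NewReader(f)
	var blocks [][]byte
	var hdr [4]byte
	for {
		_, err := io.ReadFull(r, hdr[:])
		if err == io.EOF {
			return blocks, nil // clean end of file: no more records
		}
		if err != nil {
			return nil, err
		}
		b := make([]byte, binary.BigEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(r, b); err != nil {
			if err == io.EOF {
				err = io.ErrUnexpectedEOF // header present but body missing
			}
			return nil, err
		}
		blocks = append(blocks, b)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "blocks")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	// Placeholder payloads; a real node would write its encoded blocks here.
	path := filepath.Join(dir, "blocks-000000.bin")
	batch := [][]byte{[]byte("block 0"), []byte("block 1"), []byte("block 2")}
	if err := WriteBatch(path, batch); err != nil {
		log.Fatal(err)
	}
	got, err := ReadBatch(path)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read back %d blocks\n", len(got))
}
```

Because old blocks never change and are usually read in order, a sequential file like this avoids the per-key overhead and background compaction of an indexed database. The trade-off is that random access to a single block needs a separate offset table or a scan of the file.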

Offchain Storage

This is where we get the biggest reduction in the storage required to run a node, because we’re simply not storing the entire chain on each node. The basic idea is to store old blocks and state that aren’t currently needed in a cloud service that can store massive amounts of data with ease, such as Amazon S3, Google Cloud Storage or DigitalOcean Spaces. If the data is needed at some point, it’s a quick retrieval from cloud storage to disk, after which the node behaves as if the data had been there the whole time.
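The archive-and-restore flow described above can be sketched as a two-tier store. This is only an illustration under stated assumptions, not GoChain's implementation: the Store interface, TieredStore, Archive and the key names are hypothetical, and an in-memory map stands in for both the local disk and the cloud bucket so the example runs without any cloud credentials.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned when a key is in neither tier.
var ErrNotFound = errors.New("not found")

// Store is a hypothetical minimal key-value interface. In a real node the
// local tier would be the on-disk database and the remote tier a cloud bucket.
type Store interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
	Delete(key string) error
}

// MemStore is an in-memory Store used for both tiers in this sketch.
type MemStore map[string][]byte

func (m MemStore) Get(key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

func (m MemStore) Put(key string, value []byte) error {
	m[key] = value
	return nil
}

func (m MemStore) Delete(key string) error {
	delete(m, key)
	return nil
}

// TieredStore keeps recent data in Local and archived data in Remote.
type TieredStore struct {
	Local, Remote Store
}

// Archive copies a value to the remote tier first and only then removes the
// local copy, so the data is never lost if the upload fails.
func (t *TieredStore) Archive(key string) error {
	v, err := t.Local.Get(key)
	if err != nil {
		return err
	}
	if err := t.Remote.Put(key, v); err != nil {
		return err
	}
	return t.Local.Delete(key)
}

// Get reads from the local tier and falls back to the remote tier on a miss.
// A value fetched remotely is written back locally, so later reads behave as
// if it had been on disk the whole time.
func (t *TieredStore) Get(key string) ([]byte, error) {
	v, err := t.Local.Get(key)
	if err == nil {
		return v, nil
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err
	}
	v, err = t.Remote.Get(key)
	if err != nil {
		return nil, err
	}
	if err := t.Local.Put(key, v); err != nil {
		return nil, err
	}
	return v, nil
}

func main() {
	local, remote := MemStore{}, MemStore{}
	ts := &TieredStore{Local: local, Remote: remote}

	local.Put("block-42", []byte("old block data"))
	if err := ts.Archive("block-42"); err != nil {
		panic(err)
	}
	fmt.Println("entries on local disk after archive:", len(local))

	v, err := ts.Get("block-42") // transparently restored from the remote tier
	if err != nil {
		panic(err)
	}
	fmt.Printf("restored %q, entries on local disk: %d\n", v, len(local))
}
```

Callers only ever talk to TieredStore.Get, so the rest of the node does not need to know which tier a block currently lives in.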

There are similar features in other Ethereum clients, such as geth’s syncmode=fast (which downloads the entire chain but doesn’t process or validate anything until it gets to a recent block, latest-64) and syncmode=light (which gets only the current state and asks other nodes for any old blocks or state information it needs). The problem with fast is that it still takes a long time to sync and still takes a lot of storage. light is more similar to what we’re doing, but at this point in time it’s very unstable, and it still relies on some number of nodes storing the entire chain. A miner, for instance, needs to have the entire chain available on disk.

Offchain storage essentially gives us unbounded space for the blockchain to grow.

Conclusion

These three things put together solve our storage problems for the foreseeable future. As we increase throughput in the years to come (to 13,000 transactions per second and beyond), we will have to keep evolving, but we’ll clear that hurdle when we get to it.
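To get a feel for what that target means for storage, here is a back-of-the-envelope projection in Go. It assumes that data volume scales linearly with throughput and that the load is sustained all month. Neither assumption comes from our measurements, and the helper name projectGB is made up for this sketch.

```go
package main

import "fmt"

// projectGB scales a measured monthly data volume linearly to a target
// throughput. Linear scaling is an assumption, not a measured result.
func projectGB(measuredGB, measuredTPS, targetTPS float64) float64 {
	return measuredGB * targetTPS / measuredTPS
}

func main() {
	// 500 GB per month at ~1300 TPS are the load-test figures quoted above.
	perMonth := projectGB(500, 1300, 13000)
	fmt.Printf("~%.0f GB per month, ~%.0f TB per year\n", perMonth, perMonth*12/1000)
}
```

Under those assumptions, ten times the throughput means roughly 5TB of new data per month, which is why keeping the whole chain on every node does not scale.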