PROPOSAL: Merkle tree of unspent transactions (MTUT), for serverless thin clients and self-verifiable prunned blockchain.

Overview

Satoshi's original paper describes a way of prunning spent transactions in the blockchain to save storage space while it remains consistent and verifiable, but it's useless for partial blockchain downloads: while you can know if a given transaction is in the blockchain, you can't know if it has been spent in a subsequent transaction.

This proposal describes how to add a hash-tree based check in the blockchain that allows to verify if a transaction is unspent without downloading and checking all the blockchain. The idea is not new, but at the time of this writing there isn't any technical description of how this should be done. Aditionally, this solution is rather simple.

Motivations: The math

Copy and paste this piece of code in your favourite Python 2.x interpreter:

nblocks = 6*24*365*10 # 10 years of blocks blockhsize = 200 # size of header+coinbase of each block (guesstimate) txperblock = 200000 # transactions per block txsize = 1024 # average transaction size hashsize = 32 # sha256 KB, MB, GB, TB = map(float, (1<<10,1<<20,1<<30,1<<40)) # units print "Block size: %.2f MB" % ((blockhsize + txperblock*txsize)/MB) total = (nblocks * (blockhsize + txperblock*txsize)) print "Total size of the blockchain: %.2f GB (%.2f TB)" % (total/GB, total/TB) print "Total size of the block headers: %.2f MB" % ((nblocks * 80)/MB) treedepth = len(bin(nblocks))-3 + len(bin(txperblock))-3 print "Size of unspent tx verification data:", print "%.2f KB" % (((treedepth*(hashsize+1)) + txsize + blockhsize)/KB)

It calculates some interesting numbers of how would be the blockchain after 10 years, assuming a volume of transactions similar to visa or paypal (and assuming an average of $20 per transaction) during all those 10 years. Here's the result:

Block size: 195.31 MB Total size of the blockchain: 100250.34 GB (97.90 TB) Total size of the block headers: 40.10 MB Size of unspent tx verification data: 2.36 KB

The last line is how much data you need to download in order to verify any given transaction while having only the block headers, if this feature is implemented. No need to trust a third party at all. If you're really paranoid you should download the full last 6 blocks, but no more. While something like this will be obviously needed within ten years, it would be a sustantial breaktrough for Bitcoin right now, when a whole day is needed to download the entire chain.

Proposed solution

The idea is to add a hash in each block which is the merkle root of all unspent outputs in the blockchain before that block. Similarly to how a block stores the merkle tree of its transactions, it would calculate the same tree but skipping spent transactions in the hashing. Also, it wouldn't be hashing its own transactions but all of the previous blocks, and then hashing all roots into a single merkle root.

Here's a visual example:

Before:

/ HASH <- TX 0 HASH / \ HASH <- TX 1 -HASH \ / HASH <- TX 2 HASH \ HASH <- TX 3

After spending TX 2:

/ HASH <- TX 0 HASH / \ HASH <- TX 1 -HASH \ HASH <- TX 3

Unlike a typical merkle root, it skips entirely the hashing of spent transactions. This should be done for each block that changes its status, and then a merkle root of all root hashes is computed.

When you request verification data for TX 1 at height N, the prunned tree is sent to you which is at most (log2(number of tx in the block)+log2(N)) * (32+1) bytes long. 32 is the hash size and one extra byte (a bit, actually) for the position (left or right) of each hash. Compare the result with the hash stored in the block N to verify that TX 1 hasn't been spent until block N.

Furthermore, the inputs of the transaction can be skipped from the hashing, as we can safely assume those inputs are valid if they're buried enough in the chain, allowing more space savings.

Implementation details

No actual prunning of transactions is needed to compute this tree, neither should be done before the spending transactions has enough confirmations (no client does this at the moment, anyway). All the tree is computed and preserved but it's not stored in the blocks (only the final root). There should be three functions:

markAllSpent() : Calculates the entire tree, to be used only once the first time the whole blockchain is downloaded (prunned or not).

: Calculates the entire tree, to be used only once the first time the whole blockchain is downloaded (prunned or not). markSpent(N) : Having the tree up until N-1, this function re-computes the tree for spendings made in block N, and the resulting hash should be included in the block N+1.

: Having the tree up until N-1, this function re-computes the tree for spendings made in block N, and the resulting hash should be included in the block N+1. unmarkSpent(N) : Does just the opposite, so it can roll back when there are splits in the chain.

When mining, a magic number "MTUT" in the coinbase is included followed by the root hash. In a future version of the block format (if it ever exists), this data can be included in its own field instead.

Implementation schedule

No hard dates are needed at all. Nodes should accept both blocks with and without MTUT hash. Once 55% of blocks includes a valid MTUT hash in a 2-week timespan, they should reject any block with an invalid (i.e. probably malicious) MTUT hash block while accepting blocks without MTUT. At this point we can consider any MTUT hash to be trustworthy with at least 6 confirmations. It may be decided to reject blocks without MTUT hash if at any point there exist 90% of blocks with MTUT hash in a 2016 block span.

Further development

Once we have valid MTUT hashes in the chain, any regular bitcoin client can turned into a thin client without relying on trusted servers. The official bitcoin client should allow this thin functionality by default, with the option to wait for the whole blockchain download (improving anonimity as you don't have to ask for specific addresses), or with a setting to specify how much space you want to allocate to the chain (so e.g. if we have lots of nodes with a random 1% of the chain, we'll always have a full chain in the cloud).

We can diferentiate between light nodes and supernodes, so we can keep looking for peers until we are connected to 8 supernodes, which are much more likely to verify and relay transactions. Having a not-too-big number of supernodes worldwide also helps with propagation, so transactions are confirmed faster and tx radars are more reliable.

DiThi 01:53, 24 January 2012 (GMT)