Summary: Certain individuals in the bitcoin community claim that bitcoin's merkle tree does not scale and needs to be replaced by a different data structure. This post summarises some of the approaches to parallelise the merkle tree as well as benchmarking that has been done to see what performance is achievable today with simple consumer hardware.

First of all, let me introduce the readers who are not familiar with merkle trees to why this data structure is so important for bitcoin, especially for scaling. Readers who are familiar with bitcoin's design can skip this.

What is a merkle tree?

As you know, bitcoin uses a blockchain so that the network can agree on one common history of transactions.

Let's have a close look at how a block looks in detail:

We've got a version field, the hash of the previous block, a timestamp, a difficulty target and a nonce. This is only 80 bytes of data, you might be wondering where the transactions are stored? They aren't actually stored in the block (header) itself but hashed together to one single value of 32 bytes called the Merkle root or Merkle root hash , which then is written into the block (header). If you change one single transaction, you get a completely different merkle root and therefore a different blockheader and hash.

There are multiple ways of how one could reduce all transactions to a single hash, here's a trivial example:

Bitcoin uses a very clever way of hashing the transactions. It's called a Merkle tree , also known as a Hash tree . It basically works by concatenating two hashes and take the hash of this value, then we end up with half the number of hashes. We take the hash over two hashes again and get half the number of hashes. We repeat this till we end up with one single hash, which we then call the Merkle root.

Merkle trees can be visualised in the following way:

Now, what benefit do we gain from hashing the transactions into a merkle tree? A merkle tree has the interesting property of allowing compact proofs that a transaction is part of the merkle root hash, without needing to send all the transaction ids for a specific block. Assuming that a client knows the root hash, she can verify that her transaction 4 was indeed included in a block by requesting a server to issue a Merkle tree proof (MTP) :

The MTP has been highlighted in red. These are the only two values the client needs to know to verify that transaction 4 is included in a block. Now, for a small block of 4 transactions the benefit over just sending all transaction ids is minimal. Let's have a look at how efficient this is with a block with twice the number of transactions:

It turns out that we only need to know three hashes to verify that our transaction 8 was included in the block. For 16 transactions it would be four hashes, for 32 transactions five hashes, and so on. You need about to know log2(number of transactions) hashes (rounded up to the next whole number) in order to verify that your transaction was included in the block. If we think about blocks with a size of 1TB which would enable every person on this planet to do about 50 transactions a day, we only need to know 32 hashes (assuming 2,500,000,000 transactions in the block) to verify that our transaction was included in a block. Every hash has a size of 32 bytes. That is only 32 * 32 bytes = 1 kb of proof!

The outlined method is known as Simplified Payment Verification (SPV) and used by wallet software like ElectrumSV and Cashpay .

Building merkle trees

There are two reasons of why you want to build a merkle tree:

Either you are mining a new block (miner) ...

... or you just received a block and want to verify the correctness of the root hash (miner, payment processor, ...)

The crucial part is when it comes to receiving a block: A miner wants to validate a new block as fast as possible to continue mining on the new block. Let's still have a look at both parts.

Mining a new block

Mining a new block is an appending process on the merkle tree. From a stream of transaction ids you can build the merkle root hash with almost no memory consumption.

This is a single threaded process but it's fast enough in practice. I coded up a simple and naive implementation to test single threaded performance: It took about 2 seconds to hash 4 million transactions into a merkle tree on my Intel Core i7-8550U (which is a laptop CPU) and would take another 2 seconds (slightly more) to accept another 4 million transactions. That's a process that just needs to be faster than the network sending transactions, and as you can see it's really fast right now. Unfortunately, my CPU does not have the Intel SHA Extensions (which would be faster than the OpenSSL SHA-256 C implementation), but many new AMD CPUs have them. If it ever became an issue for the miner, there are multiple ways to increase the performance of this process (FPGAs would be a cheap and simple solution). However, it doesn't look like there will be a bottleneck anytime soon. And after all, these methods are part of what makes mining competitive (apart from the pure process of hashing): Some miners will keep their algorithms secret and other miners who can't keep up won't be able to include as many transactions in a block.

Receiving a new block

Receiving a new block is where things become interesting: Miners want to receive and validate blocks as fast as possible. In contrast to mining a new block, you know the number of transactions beforehand. That makes parallelising somewhat easier.

There are mainly two ideas one can come up with when it comes to parallelising the validation of the tree.

Splitting the work into subtrees

The first one is splitting the tree into subtrees where each subtree is calculated by a different CPU core and when all subtrees are calculated we can calculate the root hash:

We then just take the append code from above to calculate each tree on it's own and then combine the results in the last step. The advantages of this approach are that it does consume as little memory as possible and doesn't require any communication between the two CPU cores while they are building their trees. The disadvantage is that we can only split the work into a power of two: 2, 4, 8, 16, 32, ... CPU cores. Also, with larger blocks it's very likely that our rightmost subtree computes much faster than the rest as it contains less transactions:

Hashing 4,000,000 transactions (that's about a 1GB-2GB block) with four threads took about 580ms on my CPU. At first, I was quite satisfied with the results. But as soon as we double or quadruple the number of transactions in a block we would also spend two or four times the time validating just the merkle tree on all cores. While it's still acceptable for a payment processor to validate the tree in 2s (it doesn't impact 0-confirmation transactions), it can be a lot of time for miners.

Reducing the tree layer by layer

I fiddled around a bit with other approaches and tried a way that looks like a less performant way at first glance but actually is a lot faster when applied correctly. It works by reducing the tree layer by layer: You start with the bottom layer and hash two leaves to what becomes one leaf in the next layer:

One big advantage is that we have many small units of work per layer which means:

We can pick any number of CPU cores to do the work and distribute work between these cores

It doesn't matter whether we have a perfect tree (power of two number of leaves) or not: The work is shared equally between the cores

Additionally, the implementation is tiny and very easy to read. The disadvantage is that it takes a lot of sequential memory to build the tree, but once it's calculated the memory is available again. Memory isn't much of an issue though, it would take 80GB at maximum to build the tree for a 1TB block. Even if that should be too much you can also combine the two methods: First split the tree into four subtrees (first method) and then calculate the subtrees one after the other with the approach described in this segment (second method).

Looking at the single unit of work, we can note that it's always the same branchless (i.e. no decisions have to be made) task (except for the rightmost work unit, but there are only 32 of them at 1TB blocks):

It's always the same task: hashing, hashing, hashing, hashing, ... that's it! This reminds one of another thing that is done 24/7 on bitcoin since the genesis block: mining. We won't be using ASICs here though for multiple reasons:

They are too expensive for what we want to achieve

They don't fit our memory requirements

We wouldn't be able to use mining ASICs either because they can only be used for mining, not for arbitrary hash calculations

But there is hardware we can find in almost any desktop computer or laptop that performs well at executing the same branchless task over and over again: GPUs.

So after coding an implementation for the outlined layer by layer reduction, I tested the performance on my Intel UHD Graphics chip that's built into the Intel processor. The same block with 4,000,000 transactions now took 270ms to hash, that's half of the time! The integrated Intel chips belong to the worst GPUs on the market though. Luckily, I was able to get access to a dedicated AMD GPU. I needed to do some adjustments to make it run as the dedicated GPU can not access the RAM directly, which means all transaction hashes are first copied over to the GPU's memory before any calculations begin. However, the process of copying data is faster than one would expect. The results are incredible. The GPU calculated the merkle tree for about 33,500,000 transactions (note that this is more than 8 times more transactions than before) in 300ms! That's a merkle tree for a 13.5GB block hashed in a blink of an eye, including copying the transaction ids to the GPU. And if you need twice the performance, you just put it another GPU and let every one of them build a subtree. It's noteworthy that the AMD RX580 GPU to perform the computation costs only about $160-$200, it would be interesting to see what can be achieved with the more powerful but expensive Vega 64. Edit: Brad (@brad1121) tested the code on his Nvidia RTX2070 and was able to hash the same 13.5GB block in 180ms and a 27GB block in 348ms!

You can find the code here .

What can we learn?

These simple tests have shown that merkle trees DO scale. There certainly are some challenges when it comes to scaling bitcoin, but the merkle tree definitely isn't one of them. The so called "vulnerabilities" are well-known and easy to protect against, both implementations of mine I posted here protect against these attacks. I don't think it's worth it to introduce a completely new data structure that hasn't been used in practice for as long as the merkle tree has been - just for a little benefit. If a few developers can successfully change the bitcoin core design because they believe that their protocol design scales better (most of the times without any code to test it), then bitcoin has failed at being stable money that nobody can manipulate. You never know what's the real intention behind certain changes.

Where do we go from here?

With current block sizes, we won't gain any benefit by using the power of GPUs to build merkle trees. We will probably start seeing benefits when blocks larger than 1GB are sent through the network (bitcoin's largest block has a size of about 100MB).

There are two important improvements that can be made and I am exploring further:

Updating extraNonce in coinbase

Most nodes put a field called 'extraNonce' into the coinbase that is incremented when the 32 bit nonce has been exhausted (cycled through) for the block header. Updating the coinbase on today's software means hashing the whole tree again. This can be fixed quite easily by saving a merkle tree proof for the coinbase transaction and only rehash this part. This is implemented in the first code I shared (the second doesn't have it because saving the MTP is only interesting when it comes to building, not verifying).

Improving SPV performance

Since nodes do not store the nodes of the merkle tree, they have to recalculate the tree for every single SPV request. It would also be possible to save all the nodes so that no recalculation is necessary at all, but that also requires more storage. Neil Booth (aka @kyuupichan) did some interesting work on this topic in his ElectrumX software: Instead of saving all the nodes, he saves the midlayer (i.e. layer depth/2). This results in the recalculation only being approximately square-root as costly as the first computation. If we take our 300ms from the AMD above, that would correspond to a time of about 20ms to issue the MTP. A clever SPV server software would store all the nodes for the last few blocks on a fast SSD or even keep it in RAM and only the midlayer for older blocks.



