
socrates1024 (Andrew Miller), Full Member | Re: Storing UTXOs in a Balanced Merkle Tree (zero-trust nodes with O(1)-storage) | August 20, 2012, 08:58:56 PM | #3



I want to make an improvement to this proposal.

Originally, my UTXO tree contained only keys, of the form (txid, index). To actually validate a transaction, it would be necessary to look up the corresponding transaction data. Even though only a small subset of this transaction data is relevant (just a single txout), an O(1)-storage client would also need to recompute the hash over the entire transaction in order to be secure. This means the validation cost, per txout, goes from O(log N) to O(T log N), where T is the size of a transaction (the total number of txins and txouts).



What I want to do instead is to use the (currently-unused) 'value' field in each leaf node of the binary tree to store a hash of just the relevant validation data.



The validation data consists of the fields (isCoinbase, nHeight, amount, scriptPubKey). I would serialize it as follows:

Code:
[ 1 byte  ][4 bytes][8 bytes][  x bytes   ]
 isCoinbase  nHeight  amount   scriptPubKey
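As a sketch, the serialization and hashing of this validation data might look like the following; the little-endian layout and the use of SHA-256 here are my assumptions for illustration, not part of the proposal:

```python
import hashlib
import struct

def utxo_value_hash(is_coinbase: bool, n_height: int, amount: int,
                    script_pubkey: bytes) -> bytes:
    """Hash of just the validation-relevant data for a single txout.

    Layout (endianness assumed): [1 byte][4 bytes][8 bytes][x bytes]
                                 isCoinbase nHeight amount scriptPubKey
    """
    payload = struct.pack('<BIQ', is_coinbase, n_height, amount) + script_pubkey
    return hashlib.sha256(payload).digest()

# Example: a non-coinbase output of 50 BTC (in satoshis) at height 1000,
# with a placeholder scriptPubKey prefix
h = utxo_value_hash(False, 1000, 50 * 10**8, bytes.fromhex('76a914'))
assert len(h) == 32
```

With this, a stateless verifier only needs the 32-byte value hash in the leaf, never the full transaction.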



There is now a reason to treat branch nodes differently than leaf nodes. Leaf nodes need to contain an additional hash value, but since they don't need to contain the left/right child hashes, they're just as small. (I use a sentinel byte to disambiguate leaf vs. branch nodes.)



Code: (color, (), (("UTXO", txhash, index), utxohash), ()):
[1 byte][1 byte][4 bytes][32 bytes][4 bytes][32 bytes]
  "."    color   "UTXO"   txhash    index    utxohash
total: 1+1+4+32+4+32 = 74 bytes



I can't think of any reason to include 'values' for the ADDR lookups, but they can still benefit from omitting the child hashes.

Code: (color, (), (("ADDR", address, txhash, index), ()), ()):
[1 byte][1 byte][4 bytes][20 bytes][32 bytes][4 bytes]
  "."    color   "ADDR"    addr     txhash    index
total: 1+1+4+20+32+4 = 62 bytes
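The two leaf layouts above can be sketched in Python as follows; the '.' sentinel and little-endian index are illustrative assumptions, but the field sizes match the totals given:

```python
import struct

SENTINEL = b'.'  # disambiguates leaf nodes from branch nodes

def ser_utxo_leaf(color: int, txhash: bytes, index: int, utxohash: bytes) -> bytes:
    """[1][1][4][32][4][32] = sentinel, color, "UTXO", txhash, index, utxohash."""
    assert len(txhash) == 32 and len(utxohash) == 32
    return (SENTINEL + bytes([color]) + b'UTXO' + txhash
            + struct.pack('<I', index) + utxohash)

def ser_addr_leaf(color: int, addr: bytes, txhash: bytes, index: int) -> bytes:
    """[1][1][4][20][32][4] = sentinel, color, "ADDR", addr, txhash, index."""
    assert len(addr) == 20 and len(txhash) == 32
    return (SENTINEL + bytes([color]) + b'ADDR' + addr + txhash
            + struct.pack('<I', index))

# The serialized sizes match the totals above: 74 and 62 bytes.
assert len(ser_utxo_leaf(0, bytes(32), 0, bytes(32))) == 74
assert len(ser_addr_leaf(1, bytes(20), bytes(32), 5)) == 62
```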



With this scheme, transaction validation only costs O(log N) per txout, regardless of the total size of each transaction.

[my twitter] [research@umd]
amiller on freenode / 19G6VFcV1qZJxe3Swn28xz3F8gDKTznwEM
I study Merkle trees, credit networks, and Byzantine Consensus algorithms

etotheipi, Legendary, Core Armory Developer | Re: Storing UTXOs in a Balanced Merkle Tree (zero-trust nodes with O(1)-storage) | August 21, 2012, 04:39:16 AM | Last edit: August 21, 2012, 04:50:20 AM by etotheipi | #6

Quote from: socrates1024 on August 21, 2012, 01:01:37 AM



I made a post in which I outlined the necessary requirements for the Bitcoin data structures. I focused on two different tasks: fork validation and transaction validation.

My main motivation for this is to address the fears that "order independence" may be important for some reason. I showed that it's not required for either of the two tasks I defined. Is there a requirement missing for which order-independence is important? Or are the assumptions I made too strong? Otherwise, Merkle trees are at least as good asymptotically, and are likely faster by a constant factor.



On the other hand, if you allow stronger assumptions (such as about the number of txouts per unique txid), or weaker requirements, there are at least some possible scenarios where trie-based solutions are faster, but never by more than a constant factor.



Some related work using Merkle Tries



I've found several instances of Merkle tries. None of them mentions any benefit from order-independence. Each of them suggests that the trie is on average faster than red-black Merkle trees by a constant factor; however, this relies on the keys consisting only of random hashes. In the UTXO set, we at least need to look up validation information by (txid, idx). One suggestion has been to use a trie where the only key is (txid) and all txouts with the same txid are grouped together. In this case, the average-case validation depends on some additional quantity, T, describing the ratio of txouts to unique txids: O(T log M).




You're focusing a lot on access speed. Getting the access-speed question out of the way is important, but I think dwelling on it is unnecessary. We've established that even for billions of elements, the trie and BST will have the same absolute order of magnitude of access time. The trie has a better theoretical order of growth with respect to the number of elements (O(1)), but its absolute access time will be comparable to the BST's O(log N) for N < 10^10. I'm already satisfied with the conclusion: "both are fast enough."
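A quick back-of-the-envelope check of the "both are fast enough" claim, under the assumption of a byte-wise trie over 32-byte keys:

```python
import math

N = 10**10                            # ten billion elements, far beyond today's UTXO set
bst_depth = math.ceil(math.log2(N))   # a balanced BST lookup visits ~34 nodes
trie_depth = 32                       # a byte-wise trie over 32-byte keys: at most 32 levels

# Both lookups cost a few dozen node fetches, i.e. the same order of magnitude.
assert bst_depth == 34
assert abs(bst_depth - trie_depth) < 10
```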



So then my concern remains with regards to order-independence. Two particular concerns pop out that I'm sure you have figured out, but I want to make sure:



(1) How can you guarantee that node-deletion from the BST will result in the same underlying tree structure as you started with? Any particular node addition could trickle rebalancing/rotation operations from the bottom of the BST to the top. Is node-deletion guaranteed to undo those rotations? If not, what do you need after every operation to make sure you know how to remove the node, if it becomes necessary?

(2) If a lite-node requests all the UTXOs for a given address, how do you (the supplier) transmit that information? You can't just give it a sequential list of UTXOs, and you don't know just from looking at the address tree what order they were inserted into the sub-branch. How do you communicate this information to the lite-node so that it can verify the UTXO list?



Of course, I have to supplement those questions with the fact that tries don't even have to think about this. (1) Delete a node from the trie! If two tries have the same elements, they have the same structure. (2) Send the UTXO list in any order: if two tries (or trie-branches) have the same elements, they have the same structure!
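The order-independence property can be illustrated with a toy Merkle trie (my own minimal sketch, not anyone's proposed format): because children are hashed in sorted key order, any insertion order yields the same root.

```python
import hashlib

def insert(node: dict, key: str) -> None:
    """Toy trie: nested dicts keyed by one character; '' marks end-of-key."""
    if not key:
        node[''] = {}
        return
    insert(node.setdefault(key[0], {}), key[1:])

def root_hash(node: dict) -> bytes:
    """Merkle hash over children in sorted key order: the digest depends
    only on the *set* of keys, never on insertion history."""
    parts = [k.encode() + root_hash(v) for k, v in sorted(node.items())]
    return hashlib.sha256(b'|'.join(parts)).digest()

a, b = {}, {}
for k in ['cafe', 'beef', 'c0de']:
    insert(a, k)
for k in ['c0de', 'cafe', 'beef']:   # same keys, different insertion order
    insert(b, k)
assert root_hash(a) == root_hash(b)  # identical structure => identical root
```

A balanced BST offers no such guarantee: its shape (and hence its root hash) depends on the sequence of insertions, deletions, and rotations.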



I don't want to derail this thread too far in the trie-vs-BST direction. I just want to make sure there are good answers for these questions. I really do appreciate that someone has put in the time to make a proof-of-concept. Even if I don't agree with the particular optimization, there's a lot of value/education in actually trying to put the theory into practice, and the lessons you learn will help us out in the future -- or you will simply finish it for us and then we will gratefully leverage your accomplishment!



So, thanks for pioneering this effort!



P.S. -- You argued in a PM against the point that a trie can be parallel-updated/maintained, but I didn't understand the argument. I can design a system where the top-level branch node (branching on the first byte) is split up among X threads/HDDs/servers. Each server handles 256/X sub-branches. When a new block comes in, all additions and deletions induced by that block are distributed to the appropriate server. The main thread waits for all the servers to finish -- which happens by reporting back the updated "roots" of their sub-trees -- then the main thread combines those into the final, master root. This seems like a valid (major!) benefit.
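This sharding scheme can be sketched roughly as follows; the flat hash-of-sorted-keys below stands in for a real sub-trie root computation, and the worker count and combining rule are illustrative assumptions:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sub_root(keys) -> bytes:
    """Stand-in for recomputing one sub-trie's root after a block's updates."""
    digest = hashlib.sha256()
    for k in sorted(keys):
        digest.update(k)
    return digest.digest()

def master_root(keys: set, workers: int = 4) -> bytes:
    """Distribute keys to shards by first byte, update the shards in
    parallel, then combine the reported sub-roots into one master root."""
    shards = {}
    for k in keys:
        shards.setdefault(k[0], set()).add(k)   # top-level branch on first byte
    with ThreadPoolExecutor(workers) as ex:
        roots = dict(zip(shards, ex.map(sub_root, shards.values())))
    top = hashlib.sha256()
    for shard_id in sorted(roots):
        top.update(bytes([shard_id]) + roots[shard_id])
    return top.digest()

utxos = {b'\x00abc', b'\x00def', b'\x7fxyz', b'\xffqqq'}
# Same key set => same master root, regardless of parallelism.
assert master_root(utxos, workers=4) == master_root(set(utxos), workers=2)
```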








Founder and CEO of Armory Technologies, Inc.
Armory Bitcoin Wallet: Bringing cold storage to the average user!
Only use Armory software signed by the Armory Offline Signing Key (0x98832223)
Please donate to the Armory project by clicking here! (or donate directly via 1QBDLYTDFHHZAABYSKGKPWKLSXZWCCJQBX -- yes, it's a real address!)

socrates1024 (Andrew Miller), Full Member | Re: Storing UTXOs in a Balanced Merkle Tree (zero-trust nodes with O(1)-storage) | August 21, 2012, 05:34:24 AM | Last edit: August 21, 2012, 08:21:33 AM by socrates1024 | #7



Quote: My main motivation for this is to address the fears that "order independence" may be important for some reason. I showed that it's not required for either of the two tasks I defined. Is there a requirement missing for which order-independence is important? Or are the assumptions I made too strong?

Quote from: etotheipi on August 21, 2012, 04:39:16 AM So then my concern remains with regards to order-independence. Two particular concerns pop out that I'm sure you have figured out, but I want to make sure:



(1) How can you guarantee that node-deletion from the BST will result in the same underlying tree structure as you started with? Any particular node addition could trickle rebalancing/rotation operations from the bottom of the BST to the top. Is node-deletion guaranteed to undo those rotations? If not, what do you need after every operation to make sure you know how to remove the node, if it becomes necessary?

(2) If a lite-node requests all the UTXOs for a given address, how do you (the supplier) transmit that information? You can't just give it a sequential list of UTXOs, and you don't know just from looking at the address tree what order they were inserted into the sub-branch. How do you communicate this information to the lite-node so that it can verify the UTXO list?



1) Node-deletion is not the inverse of node-insertion. This isn't a requirement. Both operations produce new trees, typically with new root hashes. There are potentially many trees that represent the same set of coins, but only a particular tree is committed in a given block. To answer the "if not, then what" question, I have tried to clearly describe two abstract scenarios:



Transaction Validation: I assume you already know the root hash at time T, and have access to an untrusted copy of the UTXO set (e.g., stored on a shared cloud host). Now you need to securely compute a new root hash for time T+1 (with one UTXO deleted). This is done deterministically, using the red-black deletion rules. You only need to fetch O(log M) elements from the untrusted storage. This is exactly as difficult as with UltraPrune (BTree), which also fetches at least O(log M) data (but requires trusted storage).
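The untrusted-storage pattern here can be sketched as follows (my own minimal illustration, assuming SHA-256 over a canonical JSON serialization): the client addresses nodes by hash and verifies each fetched node, so the host cannot substitute data.

```python
import hashlib
import json

class UntrustedStore:
    """Content-addressed node storage, e.g. a shared cloud host
    that the validator does not trust."""
    def __init__(self):
        self._db = {}

    def put(self, node) -> str:
        blob = json.dumps(node, sort_keys=True).encode()
        h = hashlib.sha256(blob).hexdigest()
        self._db[h] = blob
        return h

    def get(self, h: str):
        blob = self._db[h]
        # Client-side check: the node must hash to the key we asked for,
        # so a malicious host can only withhold data, never forge it.
        if hashlib.sha256(blob).hexdigest() != h:
            raise ValueError('server returned a tampered node')
        return json.loads(blob)

store = UntrustedStore()
leaf = ['leaf', 'utxo-key', 'utxo-value-hash']
branch = ['branch', store.put(leaf), None]   # child referenced by its hash
root = store.put(branch)
# Walking down from a trusted root hash, each fetch is verified in turn:
assert store.get(store.get(root)[1]) == leaf
```

Each tree update then fetches the O(log M) nodes on the affected paths, verifies them, and recomputes the hashes up to a new root.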





Fork Validation: If you are validating a fork, it is because you have received the head of a chain that purports to be longer than yours. Suppose the fork point is N blocks back. Even though you have the full UTXO set for your current fork, you need to simulate a node that is bootstrapping itself from just the header N blocks ago. It takes O(N log M) operations to validate the new fork, but you only need to download O(M + N) data: M for a snapshot of the UTXO tree from N blocks ago in your chain, and N for all the transactions in the fork. You validate them in forward order. You're not vulnerable to DDoS, because you validate the work (headers) first.





2) This is an optional scenario, but I want to include it. This is about a client that wants to make a request of the form (roothash, address). The client wants to do a range query, receiving a set of O(m log M) paths, where 'm' is the number of spendable coins with that address. The client is assumed to already know the root hash of the most recent valid block.



You (the untrusted server) have a chain N blocks long, and you want to be able to serve requests where 'roothash' falls anywhere in your chain. Then you must store O(M + N log M) data in total: M for your current UTXO snapshot, and N log M for all the "deltas" from every transaction in the past. This is a "persistent data structure" because you can simulate queries against any of the N trees represented in your chain. This is "optional" because it isn't necessary for Transaction Validation or for Fork Validation. I can't prove that there are no variations of this problem for which a trie might give better performance. If you have a counterexample, then we would need to see if the trie also performs satisfactorily for the two core requirements.



Perhaps you (the server) would want to store all this data (locally) in a trie. It wouldn't need to be a Merkle trie. The sequence of UTXO trees would still be updated according to red-black balancing rules, but you could use a trie to store all the node data you might have to serve.





Quote P.S. -- You argued in a PM, against the point that a trie can be parallel-updated/maintained. But I didn't understand the argument. I can design a system where the top-level branch node (branching on first byte), split up between X threads/HDDs/servers. Each server handles 256/X sub-branches. When a new block comes in, all additions and deletions induced by that block are distributed to the appropriate server. The main thread waits for all the servers to finish -- which happens by reporting back the updated "roots" of their sub-trees -- then the main thread combines those into the final, master root. This seems like a valid benefit.



Ordinary tries can be updated in parallel, but this isn't one of the performance characteristics that carries over when you augment them with collision-resistant hashes. The computation must be serial, but the storage can easily be sharded (so it's concurrent (safe), not parallel (fast)). There are such things as parallel-update authenticated structures, but they require special hashes with homomorphic superpowers. It's not a derail; it's the most important thing we need to figure out.
