Background Information On Hashing

A hash function takes an input, applies some algorithm to it, and returns a ‘hash’ of that input. For example, a very (very) simple hash function could turn the input into binary data and sum the bits. This lets us take any piece of data, for example the decimal number ten, turn it into its binary representation (1010), and sum the bits (2). So, in this scenario, our simple hash would return a hash of ‘2’ for the decimal number ten.
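The bit-summing hash described above can be sketched in a few lines of Python. The function name bit_sum_hash is made up for this illustration, and the sketch assumes the input is a non-negative integer.

```python
def bit_sum_hash(n: int) -> int:
    """Toy hash: sum the bits of n's binary representation.

    Illustration only. This is nowhere near a secure hash function.
    """
    if n < 0:
        raise ValueError("this toy hash only accepts non-negative integers")
    # bin(10) == "0b1010"; counting the "1" characters sums the bits.
    return bin(n).count("1")


print(bin(10))           # 0b1010
print(bit_sum_hash(10))  # 2
```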

An important distinction should be made between encoding and hashing. When data is encoded, for example base58 encoded or UTF-8 encoded, we change it into an equivalent form in a reversible way. We can encode data and then decode it (such as URL encoding, which turns spaces into %20, or base58 encoding, which represents arbitrary data using only letters and numbers), whereas a hash, for the most part, cannot be reversed.
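To make the difference concrete, here is a small sketch using Python's standard library. URL encoding stands in for the reversible case, and SHA-256 is picked as the hash purely for illustration.

```python
import hashlib
from urllib.parse import quote, unquote

text = "hello world"

# Encoding is reversible: we can always get the original back.
encoded = quote(text)      # the space becomes %20
decoded = unquote(encoded)
print(encoded)             # hello%20world
print(decoded == text)     # True

# Hashing is one-way: hashlib offers no function that turns the
# digest back into the original text.
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
print(len(digest))         # 64 hex characters, i.e. 256 bits
```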

These types of hash functions have many uses. In regular programming, the most common use case is most likely key-value storage in hashmaps. In cryptography, however, the most frequent use is to validate data without passing around the data itself. That is, if ten was our private data and another user wanted to confirm that we knew it, we could send them a hash of our data (2). They can then compare our hash with the hash they computed from their own copy and confirm that we agree on the input data, without explicitly revealing it.
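As a rough sketch of that comparison, assuming both sides agree to use SHA-256 (my choice for the example), it could look like the following. The helper names fingerprint and same_data are hypothetical.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def same_data(our_hash: str, their_hash: str) -> bool:
    """Compare two hex digests without ever exchanging the data itself."""
    # compare_digest takes the same time whether or not the values match.
    return hmac.compare_digest(our_hash, their_hash)


ours = fingerprint(b"ten")
theirs = fingerprint(b"ten")
print(same_data(ours, theirs))                   # True
print(same_data(ours, fingerprint(b"eleven")))   # False
```

Only the two hex strings ever need to travel between the users.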

However, hash functions are not perfect. In our simple hash example above (hash(10) == 2), multiple inputs could give that same output. For example, if we passed the decimal number five into the hash function, we would get the same result (5 in binary is 0101, sum the bits, we get 2 again). This is known as a collision, where two inputs give the same output. Collisions are a side effect of writing an algorithm that is not a reversible encoding. By making it so you cannot determine the input from the output, and by squeezing many possible inputs into fewer possible outputs, we’ve also made it so multiple inputs may give the same output, as they no longer have a 1:1 relationship.
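A quick way to see how often the toy hash collides is to run it over every 4 bit input. This sketch defines the bit-summing hash itself under the made-up name bit_sum_hash.

```python
def bit_sum_hash(n: int) -> int:
    """Toy hash: sum the bits of n's binary representation."""
    return bin(n).count("1")


# Every 4 bit input (0..15) that hashes to 2, the same as decimal ten.
colliding = [n for n in range(16) if bit_sum_hash(n) == 2]
print(colliding)  # [3, 5, 6, 9, 10, 12]

# 16 possible inputs but only 5 possible outputs, so collisions are unavoidable.
distinct_outputs = {bit_sum_hash(n) for n in range(16)}
print(sorted(distinct_outputs))  # [0, 1, 2, 3, 4]
```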

In a hashmap, a collision means two keys map to the same location, and a naive implementation would have them overwrite each other’s data. In cryptography, the consequences are potentially worse, such as two different private keys ending up with the same address. If such a collision occurred, it would effectively mean two users are sharing the same wallet and the same funds, with full ability to transfer each other’s share from the wallet as they please.

Proper hash functions, however, are far more secure than our simple hash function above. In the above example of summing the bits, we did a basic operation on 4 bits. Modern hash functions are generally very complex algorithms that produce outputs of 32/64/128/160/256/512 bits, and each extra output bit doubles the number of possible hashes. The odds of a collision occurring therefore depend largely on the output size, and so on the level of security a program chooses to use.
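To get a feel for how output size relates to collision odds, the sketch below reads the digest sizes of a few standard algorithms from Python's hashlib and prints the rough ‘birthday bound’. This is the generic rule of thumb that, for an n bit hash, you expect a collision after about 2^(n/2) random inputs. The choice of algorithms is mine, for illustration.

```python
import hashlib

# Output size in bits for a few algorithms that hashlib ships with.
digest_sizes = {
    name: hashlib.new(name).digest_size * 8
    for name in ("sha1", "sha256", "sha512")
}

for name, bits in digest_sizes.items():
    # Rule of thumb: a generic collision search needs about 2^(bits/2) hashes.
    print(f"{name}: {bits} bit output, roughly 2^{bits // 2} hashes to find a collision")
```

This only describes generic attacks. A weakness in a specific algorithm can make collisions much cheaper than the bound suggests.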

Secure Hash Algorithms

In cryptography, larger, safer hash functions that minimize collisions while being as difficult to reverse as possible are preferred. The most popular hash functions or families of hash functions include Message Digest, Secure Hash Algorithm, RACE Integrity Primitives Evaluation Message Digest and Whirlpool.

Message Digest (MD)

The MD family of hash functions includes MD2, MD4, MD5 and MD6. MD2, MD4 and MD5 are all 128 bit hash functions, while MD6 supports variable output sizes. Of the family, MD5 is the most popular. It is occasionally used in cryptography, however its most common use is software validation. People take a hash of a downloaded program and compare that MD5 checksum to the one published by the company offering the download. A matching checksum gives you confidence that you downloaded the version they say they are listing. Although we won’t explore this more, I wanted to bring it up to stress how important checking the checksum can be. If you are downloading wallets off websites or GitHub, things you are trusting with your private keys, do yourself a favour and check the MD5 checksums. You won’t regret being safe :)
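Checking a download against a published checksum can be sketched like this. The function names and the default of SHA-256 are my own assumptions for the example. Pass "md5" as the algorithm if that is what the publisher lists and your Python build supports it.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the file's checksum with the hex string from the download page."""
    return file_checksum(path, algorithm) == published_hex.strip().lower()
```

A call would look like matches_published("wallet-installer.zip", "<checksum from the download page>"), where both the file name and the checksum are placeholders.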

Secure Hash Algorithm (SHA)

The Secure Hash Algorithm (SHA) family mainly comprises four members: SHA-0, SHA-1, SHA-2 and SHA-3. SHA-2 is itself a family of hash functions, rather than a single one. We will focus on SHA-1 and SHA-2 for discussion.

The SHA-1 hash function was very popular, however it has been deemed ‘broken’ in cryptographic terms. Researchers showed in 2005 that SHA-1 collisions can be found faster than a generic ‘Birthday Attack’ allows, and Google, together with CWI Amsterdam, demonstrated a practical collision in 2017. The SHA-2 family (to which SHA-256 belongs) has no known vulnerability of this kind. There are people who still distrust SHA-2, despite its differences from SHA-1, because it follows a similar design in certain aspects.

RACE Integrity Primitives Evaluation Message Digest (RIPEMD)

This family includes RIPEMD, RIPEMD-128, and RIPEMD-160. There are 256 bit and 320 bit versions as well.

The original RIPEMD is generally considered not secure, and RIPEMD-128 is a safer variant that was released afterwards. Although the 256 bit and 320 bit versions are less likely to run into collisions, their extra security over RIPEMD-160 has been doubted.

From my research, RIPEMD-160 appears to offer the best security in the family while keeping collisions minimal.

Whirlpool

Whirlpool is a 512 bit hashing algorithm I honestly was not aware of until I began researching more. It appears to be built on a variation of AES, with one of the hashing algorithm’s designers being a co-creator of AES. I’m ashamed to say I could not find much information comparing its security and efficiency to the other families.

Decisions in Cryptocurrencies

In Bitcoin, every time hashing must be done, Satoshi originally used one of two hashing combinations: RIPEMD160(SHA256(x)) or SHA256(SHA256(x)).
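Both combinations can be sketched with Python's hashlib. The function names double_sha256 and hash160 are my own labels for this example. Note also that RIPEMD-160 comes from the underlying OpenSSL library and may be missing in some Python builds, in which case hashlib raises a ValueError.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA256(SHA256(x)): hash the data, then hash the resulting digest."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def hash160(data: bytes) -> bytes:
    """RIPEMD160(SHA256(x)): a 256 bit hash shortened to 160 bits.

    Assumes the local OpenSSL build provides ripemd160; otherwise
    hashlib.new raises ValueError.
    """
    sha = hashlib.sha256(data).digest()
    return hashlib.new("ripemd160", sha).digest()


if __name__ == "__main__":
    sample = b"hello"  # placeholder input, not a real public key
    print(len(double_sha256(sample)) * 8)  # 256
    try:
        print(len(hash160(sample)) * 8)    # 160
    except ValueError:
        print("ripemd160 is not available in this Python build")
```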

The former is used when generating Bitcoin addresses. This could be out of fear of how RIPEMD160 would interact with the Elliptic Curve Digital Signature Algorithm (ECDSA) at the time, with the two hashes combined in order to avoid any issue one of the hashing algorithms may have. It could also, however, simply be because Satoshi desired the addresses to be 160 bits rather than 256 bits, assuming that gives sufficient security.

The latter is used in most other scenarios, such as when calculating hashes for a Merkle tree. Many researchers believe that, should an attack like the one that broke SHA-1 be found against the SHA-2 family in the future, it would not break nested SHA-2 hashes. The reason was way over my head, and I did not sufficiently understand it, however it seems like nesting the SHA-256 calls is simply for future-proofing the technology, potentially in an overkill way.

Whirlpool results in 512 bit hashes, which would be doubling SHA256’s size. Considering how precious storage is on the blockchain, I do not believe it would be worth using Whirlpool unless it gave some significant advantage I am not aware of.

MD5 and similar hashing algorithms also seem to be generally less secure than SHA256. To give sufficient security, avoiding 128 bit hashes may be wise. For now, I believe sticking to SHA-2 would be ideal.

Dev-Choice: SHA256(SHA256(x)) For Both Scenarios

Originally, I was not sold on the idea of doing the addresses via RIPEMD160(SHA256(x)). My original thought process was “It takes a secure 256 bit hash, and then generates a shorter 160 bit hash, when an equally secure 256 bit version is there, which may be more future proof”. However, the more I thought about it, the more I disagreed with myself.

Let’s look at the benefits. If we save 96 bits per public key generated, and thus save 96 bits of data per user, and a blockchain reaches the point where it has five million users, that would save 480 million bits, or about 60 megabytes of data. Now, this is not a fair direct comparison, since public keys are not stored on the blockchain in a list as simply as that. However, it does show how a 160 bit key may allow for a more scalable design than a 256 bit version. After looking deeper into it, it does appear RIPEMD160 should be secure enough and not run into collisions.
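The savings estimate is easy to check with a few lines of arithmetic. Keep in mind that the total comes out in bits, so it has to be divided by 8 to get bytes. The five million users figure is the hypothetical one from the paragraph above.

```python
BITS_SAVED_PER_KEY = 256 - 160   # 96 bits saved by using a 160 bit hash
USERS = 5_000_000                # hypothetical user count

total_bits = BITS_SAVED_PER_KEY * USERS
total_bytes = total_bits // 8
total_megabytes = total_bytes / 1_000_000

print(total_bits)        # 480000000
print(total_bytes)       # 60000000
print(total_megabytes)   # 60.0
```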

This is especially true if your blockchain runs on a system where each private key will only generate one address per user rather than multiple, which is what the Seed system will do. In that case, the potential for collisions also reduces drastically, to the point where finding a collision seems near impossible. Therefore, I believe following Bitcoin’s original design for hashing is optimal based on the options laid out above.

Notes

Compressed vs uncompressed keys were not touched on in this discussion. These were not apparent to me originally, as it appears Satoshi was not aware of them, so they were not in the original white paper. In the future, I will look into compressed keys and see if they require any differences in hashing algorithms.

Thank you for reading! If you enjoyed this write-up, an upvote and follow are always appreciated! If I missed your favourite hashing algorithm or an important piece of the puzzle, please leave a comment so I can analyze that missed option next time!