

Tips written on October 20, 2018

Hash Functions survival guide

Hash functions are one of the building blocks behind robust solutions to solve architectural problems every developer will encounter in their career, like how to securely store passwords or how to distribute work across many servers.

To be honest, while their concepts are well known, they can also easily be misused.

This article hopes to shed some light on how to safely leverage them for some standard use-cases.

A note on Terminology

Depending on your community or programming language of choice, you may know some constructs and applications under different terms.

This article will use the following nomenclature:

A hash function refers to the implementation which, given an arbitrary input, returns a deterministic output

This function implements a hash algorithm

A hash (which you may know as a digest or hash code) refers to the function’s output

Do : know how a hash function works

A hash function is a deterministic function which, given an arbitrarily large input, returns a fixed-size output. E.g, the MurmurHash3 function, in its 32-bit variant, will map any input to a 32-bit hash code :

Input                    Input size  Output
The “foo” ASCII string   3B          0x1764df7f
The entire Hamlet play   188kB       0x2b881d54
The Lenna test image     468kB       0x001fad14
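This fixed-size property is easy to check in code. MurmurHash3 is not in Python’s standard library, so this sketch uses the built-in SHA-256 from hashlib instead; the hash codes differ from the table above, but the principle is identical :

```python
import hashlib

# Inputs of very different sizes...
inputs = [b"foo", b"x" * 188_000, b"y" * 468_000]

digests = [hashlib.sha256(data).hexdigest() for data in inputs]

for data, digest in zip(inputs, digests):
    # ...all map to a fixed-size 256-bit hash code (64 hex characters)
    print(f"{len(data):>6} bytes -> {digest[:16]}... ({len(digest)} hex chars)")
```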

This implies that an infinite input space is mapped to a finite number of outputs. Therefore, there must be some inputs which return the same hash code. These are known as collisions. Most use cases leverage the pseudo-uniqueness of hash codes, so you want to avoid them.

Since a function which always returns 42 can be considered a hash function under these terms, industrial-use hash algorithms are expected to meet the following criteria :

Uniformity : Their output should be evenly distributed over the output space. E.g, still using MurmurHash3 as an example :

Input (ASCII string)  Output (32-bit unsigned int)
foo                   4138058784
fou                   2660153266
boo                   389335901

Determinism : Given an input, the hash function shall always return the same output

Each algorithm has a given length, which refers to the number of bits of the hash codes it generates. E.g, SHA-256 generates 256-bit hash codes, and MurmurHash3 can be configured to generate 32-bit or 128-bit hash codes. Generally, short hashes are faster to generate and lighter to manipulate, but carry a greater collision risk.
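The collision risk of short hashes can be demonstrated by truncating a longer hash code : by the birthday bound, a 16-bit hash is expected to collide after only a few hundred inputs. A small sketch, truncating SHA-256 for illustration :

```python
import hashlib

def truncated_hash(data: bytes, bits: int = 16) -> int:
    """Keep only the first `bits` bits of a SHA-256 hash code."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - bits)

# By the birthday bound, a 16-bit hash should collide after ~2**8 inputs.
seen = {}
collision = None
for i in range(70_000):  # pigeonhole guarantees a collision before 2**16 + 1
    h = truncated_hash(str(i).encode())
    if h in seen:
        collision = (seen[h], i)
        break
    seen[h] = i

print(f"inputs {collision[0]} and {collision[1]} share a 16-bit hash code")
```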

Do : know the difference between crypto and non-crypto hash algorithms

A subset of hash algorithms can be considered cryptographic hash algorithms, if they meet these additional criteria :

Preimage resistance : Given a hash code, it should be hard to generate a message having the same hash code Second-preimage resistance : Given a message, it should be hard to find a second message having the same hash code Collision resistance : It should be hard to find two distinct messages having the same hash code

By “hard”, we mean that brute forcing all inputs should be the fastest approach.

The SHA-2 family is a widely implemented and respected collection of cryptographic hash algorithms. When in doubt, it is not a bad choice. MD5 and SHA-1 should be avoided, as their collision resistance has been broken by advances in cryptanalysis and computation speed.

Concerning non-crypto hashes, your language of choice should already have an existing implementation (e.g : ruby has Object#hash, based on the murmur2 algorithm). If that is the case, trust it and use it. More than the implementation, you want to leverage the years of optimization work on multiple hardware architectures.

If your language does not have a native implementation of a good non-crypto hash algorithm, you may use one of the following ones :

Murmur3

FNV Hash

CityHash

xxHash
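None of the algorithms listed above ship with Python’s standard library. As a stand-in sketch, zlib.crc32 (technically a checksum rather than one of the listed hash algorithms, but fast and built-in) shows the typical shape of a non-crypto hash : bytes in, small fixed-size unsigned integer out, fully deterministic.

```python
import zlib

def fast_hash(data: bytes) -> int:
    # CRC32 returns an unsigned 32-bit integer, like 32-bit MurmurHash3 would
    return zlib.crc32(data)

assert fast_hash(b"foo") == fast_hash(b"foo")  # deterministic
print(fast_hash(b"foo"), fast_hash(b"bar"))
```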

Don’t : roll your own implementations

This article references some well known and standard hash algorithms. While these algorithms are open, you should avoid reimplementing them, and use existing implementations instead :

Standard hash algorithms are likely to have been optimized down to the hardware level

Crypto algorithms have been audited by people way smarter than you or me

If your platform does not come with native implementations, libsodium is a very well made library which contains all the functions you may need.

Don’t : use a crypto hash algorithm for non-sensitive uses

Crypto or non-crypto, every good hash function gives you a strong uniformity guarantee. Crypto hashes are however slower, and tend to generate larger codes (256 bits or more).

Using them to implement a bucketing strategy for 100 servers would be over-engineering. Just use a simple, fast, non-crypto algorithm for it.

Do : use the right encoding for your use case

Hash functions typically return binary hash codes.

Depending on your use case, some encodings will be better suited than others.

For example, the hash code of the ASCII string “foo” can be encoded as :

A uint32 4138058784, easy to modulo for bucketing

A hex code 0xf6a5c420, for machine-to-machine exchange

A binary code 0b11110110101001011100010000100000, for storage

A Base64 code IMSl9g, for lightweight HTTP transfer
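As a sketch, here is that same 32-bit hash code rendered in each of these encodings with Python’s standard library. The little-endian byte order for Base64 is an assumption made here, because it is what appears to reproduce the IMSl9g value above :

```python
import base64

hash_code = 4_138_058_784  # the 32-bit hash code of "foo" used above

as_uint = hash_code                # easy to modulo for bucketing
as_hex = f"{hash_code:#010x}"     # machine-to-machine exchange
as_bin = f"{hash_code:#034b}"     # storage
# Base64 of the 4 raw bytes; little-endian order is assumed (see above)
as_b64 = base64.b64encode(hash_code.to_bytes(4, "little")).decode()

print(as_uint)  # 4138058784
print(as_hex)   # 0xf6a5c420
print(as_b64)   # IMSl9g==
```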

Do : leverage hashes for bucketing

The most well known application of hash functions is the hash table. A hash table is a data structure mapping keys to values, enabling fast data lookup by partitioning its values into a select number of buckets. The hash code of the key can be transformed into a deterministic bucket index in constant time. You’re effectively trading a small constant hash computation cost for an O(1) lookup complexity.

The cool property of hash tables is that they ensure a good data repartition without any pre-existing knowledge of the data going into it. You can use the same approach for any deterministic stateless pseudo-random attribution : e.g balancing jobs on workers, payloads on servers, or riders on drivers.
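That stateless attribution can be sketched in a few lines, using zlib.crc32 as the fast non-crypto hash and a hypothetical worker pool :

```python
import zlib

# Hypothetical worker pool; any stable identifiers would do.
WORKERS = [f"worker-{i}" for i in range(100)]

def assign(job_id: str) -> str:
    """Deterministically map a job to a worker via hash-then-modulo."""
    return WORKERS[zlib.crc32(job_id.encode()) % len(WORKERS)]

# No coordination or shared state: every node computes the same answer.
print(assign("job-42"))
assert assign("job-42") == assign("job-42")
```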

Don’t : use a naive bucketing strategy for a variable number of buckets

A standard bucketing strategy consists of modulo-ing the data hash code with the number of buckets. As the hash function is uniform, this guarantees an eventual uniform repartition of resources between the buckets.

This strategy is not robust when faced with a varying number of buckets, especially when migration costs are high, like in distributed hash tables. With N buckets, adding or removing one means that almost all the data needs to be redistributed between the buckets.
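This redistribution cost is easy to measure : hash a set of keys into 100 buckets, add one bucket, and count how many keys move. A sketch using zlib.crc32 :

```python
import zlib

def bucket(key: str, n_buckets: int) -> int:
    return zlib.crc32(key.encode()) % n_buckets

keys = [f"key-{i}" for i in range(10_000)]

# Going from 100 buckets to 101 invalidates almost every assignment.
moved = sum(1 for k in keys if bucket(k, 100) != bucket(k, 101))
print(f"{moved / len(keys):.0%} of keys changed bucket")  # ~99%
```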

Using a division instead of a modulo is better, but still implies that all buckets need to reallocate data.

The right solution for this is using consistent hashing. While this diminishes the uniformity of the data repartition, you gain the guarantee that adding / removing a bucket means that only one bucket needs to have its data migrated.
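A minimal consistent-hash ring can be sketched in a few lines. Real implementations add virtual nodes (many ring points per bucket) to recover uniformity; that is omitted here for brevity, and the server names are hypothetical :

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal sketch: one point per bucket, no virtual nodes."""

    def __init__(self, buckets):
        self._ring = sorted((self._point(b), b) for b in buckets)
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _point(key: str) -> int:
        # Place keys and buckets on the ring via a 64-bit hash prefix
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first bucket at or after the key's position
        i = bisect.bisect(self._points, self._point(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing([f"server-{i}" for i in range(10)])
print(ring.lookup("some-payload"))
# Adding a server only reroutes the keys landing on its arc of the ring.
```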

Don’t : use a raw hash for message signing

Another common use of hash functions is for message signing. This can be done using a hash-based message authentication code, or HMAC.

While a cryptographic hash function can assert the integrity of a message (i.e the message has not been tampered with during transfer), a HMAC also asserts its authenticity (i.e the message comes from the sender). This mechanism is enabled via a shared secret key between the sender and the recipient.

At first glance, transmitting the hash of a concatenation of the message and the secret key may seem to do the same job.

However, many hash algorithms, including SHA-2, are vulnerable to length extension attacks (meaning that, knowing the hash code and message length, it is possible to generate data which, when appended to the message, will keep the hash code valid).

The HMAC algorithm provides immunity against this kind of attack. Its design makes it secure even when used with weakened hash functions (such as MD5 or SHA-1).

In short, when you need a stateless way to validate that a payload comes from a trusted sender, and has not been tampered with, send an HMAC-SHA2 code alongside it. If you want to protect yourself against replay attacks, it is as easy as including a timestamp in the payload.
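Python’s standard library covers this directly. A sketch with the hmac module, a hypothetical shared key, and a timestamp in the payload :

```python
import hashlib
import hmac

SECRET_KEY = b"a-hypothetical-shared-secret"

def sign(payload: bytes) -> str:
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(sign(payload), signature)

tag = sign(b"amount=100&to=alice&ts=1540000000")
print(verify(b"amount=100&to=alice&ts=1540000000", tag))   # True
print(verify(b"amount=999&to=mallory&ts=1540000000", tag)) # False
```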

Common use cases in the wild include API authentication (e.g AWS) and JSON web tokens.

Don’t : use SHA for password hashes

Storing passwords in clear text is evil. While security best practices and password managers are on the rise, the vast majority of users reuse their passwords on multiple services. It is your duty to ensure that, in the event your database is compromised, the impact on your users is limited.

At a minimum, the passwords you store should be

Cryptographically hashed (so while you can easily verify a match, the original plaintext one is not deducible from the hash)

Salted (in order to be robust against rainbow table attacks)

While these practices are not bad per se, a standard hash function won’t do. Hash algorithms, even cryptographic ones, are so ubiquitous in today’s tech that their implementations must be fast.

Since your adversary knows both the hash and the salt, and can compute billions of SHA-256 hash codes per second on a GPU, you have to use a hash function specifically designed to be slow.

You should use

Argon2 : the most recent one, winner of the 2015 password hashing competition

BCrypt : an older one, more vulnerable to hardware-optimized attacks, but more vetted and maybe already part of your standard library

In short, when storing passwords, generate a secure and long salt, hash the password and salt together with a slow hash function, and store only the hash and the salt.
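Argon2 and BCrypt both require third-party packages in Python. As a standard-library sketch, PBKDF2-HMAC (an older, deliberately slow construction, less robust than Argon2 but widely available) shows the salt-then-slow-hash shape :

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # tune to your hardware; higher is slower and safer

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)  # long random salt, stored alongside the hash
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def check_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, digest))  # True
print(check_password("hunter2", salt, digest))                       # False
```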

Conclusion

We hope these tips helped expand your knowledge of what hash functions can and cannot do. Hopefully you won’t make the same mistakes we did before learning these.

This article is only scratching the surface of the subject. Security is a complicated field, and there is still some debate in the community on certain points (e.g whether you should use SHA-3 or not).

If you wish to dig deeper, there are resources on the security and crypto stackexchanges.

More resources