There’s been a lot of discussion about hash collisions and birthday attacks in response to my previous post. If you have small children, you already know a birthday attack is a 140 decibel sonic weapon that spontaneously activates sometime between when cake is served and bedtime. In the course of discussing hashing algorithms, however, a birthday attack is a whole different matter.

Note to reader: this is a follow-up to my previous post, titled “A better way to store password hashes?”. If you haven’t read that, the following might be missing some context. Skip to the ‘Design Goals’ section if you don’t want a primer on secure hashing, and why even scrypt on its own isn’t going to protect your users.

Some Background

But let me take a step back first. A hash, by definition, takes an arbitrary input, or key, and maps it to a value, called a message digest or simply, a digest. The total range space of the digest is determined by the algorithm. MD5 produces 128 bit digests, SHA-1 produces 160 bit digests, SHA-2 can be 256 or 512 bit, etc. Even though a 512 bit value has an unfathomably large range of output values, by definition, there are still an infinite number of inputs that will result in the same digest. Any time you have two inputs that give the same digest, it’s called a collision.

When you save password hashes in your database, you are happily allowing an infinite number of passwords to successfully authenticate a given user, as long as their hash collides with the stored hash. This is restricted only by the format and length of the input your application server will accept as a valid ‘password’ value. Of course, this is actually nothing to worry about, because we can depend on the statistical properties of the hashing function to make finding (or stumbling upon) a collision exceedingly difficult.

Attacking the Hash

Attackers will always target the weakest link in your system. As long as the difficulty of finding a collision in the hash is orders of magnitude greater than the difficulty of simply guessing the original password itself, then the hashing function is doing its job. So, how difficult is it to find a collision? There are two concepts you’ll hear about: preimage attacks and birthday attacks.

A preimage attack is when you are given a hash value h and try to find any message m such that hash(m) = h. These attacks have a time complexity of 2^n for an n-bit secure hash. This is a very beautiful thing.

A birthday attack is slightly different; it is the work required to discover any two messages m1 and m2 where hash(m1) = hash(m2). Birthday attacks are easier because the specific hash value is not fixed; you just need any two hashes in a collection to collide, and the time complexity is reduced to 2^(n/2), which is called the birthday bound.

It’s worth noting that the entire purpose of a secure hash function is to ensure the following criteria are met [1]:

Collision resistance with a security level of 2^(n/2),

Preimage resistance with a security level of 2^n, and

Second-preimage resistance with a security level of 2^n
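To get a feel for the gap between the birthday bound and a preimage search, here is a toy sketch in Python. It is only an illustration under loud assumptions: SHA-256 is truncated to 16 bits so both searches finish quickly, and the helper names (truncated_hash, find_collision, find_preimage) are made up for this example. On a digest this small, the collision search typically needs on the order of 2^8 tries, while the preimage search typically needs on the order of 2^16.

```python
import hashlib
from itertools import count


def truncated_hash(message: bytes, bits: int = 16) -> int:
    """Toy n-bit hash: keep only the top `bits` bits of SHA-256 (assumption for the demo)."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int = 16):
    """Birthday-style search: stop as soon as ANY two distinct messages share a digest."""
    seen = {}  # digest -> first message that produced it
    for tries in count(1):
        message = b"msg-%d" % tries
        h = truncated_hash(message, bits)
        if h in seen:
            return seen[h], message, tries
        seen[h] = message


def find_preimage(target: int, bits: int = 16):
    """Preimage search: keep guessing until one message hits the FIXED target digest."""
    for tries in count(1):
        message = b"guess-%d" % tries
        if truncated_hash(message, bits) == target:
            return message, tries


if __name__ == "__main__":
    m1, m2, collision_tries = find_collision()
    target = truncated_hash(b"some stored password")
    _, preimage_tries = find_preimage(target)
    print("collision after", collision_tries, "tries:", m1, m2)
    print("preimage after", preimage_tries, "tries")
```

The exact counts depend on the inputs tried, so treat the output as a rough demonstration of the two orders of magnitude rather than a measurement.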

Various algorithms over the years have been crafted to achieve these criteria, and in turn, attackers have found theoretical and sometimes even applied weaknesses in the algorithms which reduce their security level below the theoretical maximum provided by their bit length.

Attacking the Password

On the other hand, the difficulty of simply guessing the right password outright is based on the strength of the password itself, and has little to do with the bit-length of the hashing algorithm. Unfortunately the strength of users’ passwords is depressingly low. [2] Dictionary attacks target the password, not the hash, by trying to find the actual value that was originally used to produce the hash. It comes down to two factors:

1. How weak (easy to guess) are the passwords?

2. How fast can you test each guess? In other words, how fast can you run the hashing algorithm?

We need both strong passwords and slow hashing in order to prevent dictionary attacks effectively. A weak password requires an exponentially slower hashing speed to protect the hash from being inverted. Likewise, a fast hashing speed requires an exponentially more complex password to prevent the hash from being inverted. When you hear someone saying “just use bcrypt” or “just use scrypt” what they are saying is, let’s work on #2 — let’s make it as slow as possible to test each guess. This is not a solution, it is merely a good start.

Because there’s a problem… There’s an upper limit on just how slow we can make our hashing function, based on the real-world usage requirements:

1. Your server must be responsive to both correct and incorrect login attempts. There’s some leeway in how you define responsive, but Google has a lot to say on the matter, and 64ms seems to be a common target.

2. Your server must be able to handle multiple users logging on concurrently, i.e. some scalability factor.

3. Your server probably needs a few resources left over for the real work of serving actual content to your users.

It’s the first restriction which is the most difficult to overcome. While you could always throw massive compute resources into handling more concurrent users, or dedicate servers to nothing but hashing, you’re still up against a limit for how long you can spend computing a single hash, on a single CPU core, before users leave your site. [3] So, even if you’ve maxed out your scrypt settings, an attacker can spin up as many ‘EC2 – Compute Cluster – Eight Extra Large’ instances as they want for the paltry sum of $2.40 per hour. Then they can simply check each of your hashes against the list of the 10,000 most common passwords, which are apparently used by 98.8% of all users. [4] In the case of the 32 million passwords stolen from RockYou in December 2009, the top 5,000 passwords were used by 20% of all users. [5] Simply put, if an attacker only has to make 5,000 or 10,000 attempts to succeed in 20% of cases, your hash algorithm would have to be impossibly slow to keep those users safe if someone steals your hash table.
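The arithmetic behind that claim can be sketched in a few lines of Python. The 64 ms per guess and the 5,000-entry dictionary come from the figures above; the table size of one million hashes and the assumption that an attacker's core is no faster than your server's are illustrative guesses, and the function names are invented for this example.

```python
def seconds_per_stolen_hash(guesses: int, seconds_per_guess: float) -> float:
    """Worst-case single-core time to try every dictionary entry against one salted hash."""
    return guesses * seconds_per_guess


def core_hours_for_table(num_hashes: int, guesses: int, seconds_per_guess: float) -> float:
    """Worst-case core-hours to run the whole dictionary against every stolen hash."""
    return num_hashes * seconds_per_stolen_hash(guesses, seconds_per_guess) / 3600.0


if __name__ == "__main__":
    GUESSES = 5_000            # size of the common-password list
    SECONDS_PER_GUESS = 0.064  # hash tuned to a 64 ms latency budget
    NUM_HASHES = 1_000_000     # ASSUMPTION: size of the stolen table

    per_hash = seconds_per_stolen_hash(GUESSES, SECONDS_PER_GUESS)
    total = core_hours_for_table(NUM_HASHES, GUESSES, SECONDS_PER_GUESS)
    print(f"one hash, one core: {per_hash:.0f} s")
    print(f"whole table: {total:,.0f} core-hours")
```

Under these assumptions a single account costs the attacker about 320 seconds on one core, and the work parallelizes trivially across as many rented cores as they care to pay for.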

Are We Doomed?

So where does that leave us? We’re stuck trying to increase RAM and CPU requirements for hashing, but running up against the brick wall of user sensitivity to page load times. This is all just another reason why passwords must die, and I don’t mean by replacing them with passphrases either. Unfortunately, it appears that we have to make the best of a bad situation for a bit longer. How do you design a system that doesn’t leak 20% of your users’ passwords within minutes of your database being compromised?

So now we’re back, perhaps, to where the previous post began, and I apologize profusely if you’ve found this just a re-hashing of common knowledge. There are, however, a couple adjustments I would like to make to my original proposal of decoupling hashes from your users to respond to some suggestions in the forums, and to clarify the design goals.

Design Goals

First, I didn’t do a good job in the previous post stating my design goals. scrypt already does a perfectly fine job of letting you tune the CPU and RAM cost of completing a single hash. There’s absolutely no need for any other method than scrypt if all you want to do is tune the CPU and RAM burden on your attacker. What I want to do is two-fold:

1. Impose additional costs (other than RAM and CPU) on an attacker

2. Make it easier to detect if attackers start stealing your hashes

There are two types of costs you can impose on a dictionary attack — setup, and run-time. Setup costs are one-time capital the attacker must acquire, versus run-time costs which are built into every iteration. Run-time costs should be maximized up to your users’ tolerance for latency, since the attacker will pay them 5000 times over for each weak password they try to crack. After that, if you want to increase the attack costs, you have to look at increasing setup costs.

There are two areas, other than RAM and CPU, which are typical computing cost centers: storage and bandwidth. I would add electricity to the list, but that’s basically a side-effect of RAM and CPU. Another cost center you could try to increase is human capital, otherwise known as ‘security through obscurity’. The problem with that is your attacker is always too smart, and what you thought was ‘tricky’ turns out not to be.

So this approach focuses on increasing the disk storage and bandwidth requirement of copying hashes from a target site.
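As a rough way to put numbers on that setup cost, the Python sketch below estimates how long it would take to copy a table of a given size over a given link. Both inputs in the example (a 1 PB table and a sustained 1 Gbit/s link) are assumptions picked for illustration, and transfer_days is a hypothetical helper, not part of any library.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(table_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to copy `table_bytes` over a link sustaining `link_bits_per_second`."""
    return table_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    table_bytes = 1e15  # ASSUMPTION: 1 PB of stored hashes, mostly random filler
    link_bps = 1e9      # ASSUMPTION: attacker sustains 1 Gbit/s out of your network
    print(f"{transfer_days(table_bytes, link_bps):.1f} days to copy the table")
```

Under those assumptions the copy takes roughly three months, which is a long time for that much outbound traffic to go unnoticed.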

The Original Proposal

In the prior post, I suggested storing hashes in a separate table, with no foreign key linking each hash to a specific user. Each user would still have their own, 32-byte, randomly generated salt stored with the user. You would hash a user’s password using their specific salt, and if the digest matched any value in your list, you would successfully authenticate them.

Then, however large you want your Hashes collection to be, you simply add more random data. In the case of a large budget to protect hashes, you could potentially justify keeping a petabyte of data, spread across many servers, where only a small fraction of the stored data actually corresponded to users’ passwords.
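A minimal in-memory sketch of that original proposal, in Python, might look like the following. Everything here is an assumption made for illustration: a set stands in for the Hashes table, a dict stands in for the Users table, the scrypt parameters are placeholder values you would tune yourself, and the class and method names are invented.

```python
import hashlib
import secrets

# ASSUMPTION: illustrative scrypt cost settings; tune for your own latency budget.
SCRYPT_PARAMS = dict(n=2**14, r=8, p=1, dklen=32)


def hash_password(salt: bytes, password: str) -> bytes:
    """Hash the password with the user's salt using scrypt (32-byte digest)."""
    return hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)


class DecoupledStore:
    """Users keep only a salt; digests live in a pool padded with random filler."""

    def __init__(self, decoy_count: int = 1000):
        self.user_salts = {}  # username -> 32-byte salt
        # Random 32-byte values that look just like real digests.
        self.hashes = {secrets.token_bytes(32) for _ in range(decoy_count)}

    def set_password(self, username: str, password: str) -> None:
        salt = secrets.token_bytes(32)
        self.user_salts[username] = salt
        self.hashes.add(hash_password(salt, password))

    def verify(self, username: str, password: str) -> bool:
        salt = self.user_salts.get(username)
        if salt is None:
            return False
        # Success means the digest matches ANY value in the pool.
        return hash_password(salt, password) in self.hashes


if __name__ == "__main__":
    store = DecoupledStore(decoy_count=1000)
    store.set_password("alice", "correct horse")
    print(store.verify("alice", "correct horse"), store.verify("alice", "wrong"))
```

Note that nothing in the pool says which entries are real, so an attacker who wants to test guesses offline needs the whole pool, not just one row.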

Collisions and Birthday Attacks

Whether or not you think it’s a good idea to maintain massive amounts of random data in the name of stopping hash-theft, the most common objection by far, aside from ‘just use scrypt’, was that storing so many hashes would inevitably increase the risk of a collision. Of course there is some increased risk, but does the risk actually become exploitable?

The white cells in this table are the number of hashes you need to collect to achieve the given probability of a hash collision. In other words, this tells us the probability that a user’s password, hashed with that user’s random salt, would result in a value that’s already in the collection, based on how many values are in the table. We can readily see that if you are using a 256-bit hashing algorithm, even after storing 4.8 x 10^29 hashes, you still have only a 10^-18 chance (1 in 10^18) of a collision. You would have to store more 256-bit hashes than the entire world’s data capacity (300+ exabytes) before the chance of a collision even reached 10^-18.
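Figures like these can be reproduced with the usual birthday approximation, p ≈ 1 − exp(−k² / 2N), where N is the size of the digest space and k is the number of stored hashes. The Python sketch below is an approximation under that formula, and the function names are made up for this example.

```python
import math


def collision_probability(bits: int, k: float) -> float:
    """Approximate chance that any two of k random `bits`-bit digests collide."""
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-(k * k) / (2.0 * space))


def hashes_for_collision_probability(bits: int, p: float) -> float:
    """Approximate number of random digests needed to reach collision probability p."""
    space = 2.0 ** bits
    # log1p(-p) computes ln(1 - p) accurately when p is very small.
    return math.sqrt(2.0 * space * -math.log1p(-p))


if __name__ == "__main__":
    k = hashes_for_collision_probability(256, 1e-18)
    print(f"256-bit digest, p = 1e-18: about {k:.1e} hashes")
```

For a 256-bit digest and a probability of 10^-18, this works out to roughly 4.8 x 10^29 hashes.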

OK… So now maybe you are pounding your keyboard, screaming at your monitor, and cursing my existence. You probably wish I would just go throw myself into the closest active volcano, or something, anything, to just stop inflicting my cryptographic naiveté on the world. Look, maybe this table doesn’t apply the way I think it does, and either way I certainly will never convince many of you that it does. So, I will happily relent, and eliminate the issue entirely. Please read on…

Eliminating A Potential Backdoor

So there’s actually more clear and present danger than spooky collisions to the originally proposed method. If an attacker is able to both steal a user’s salt value, and insert a specific hash of their choosing into the Hashes table, they can grant themselves access to a specific user’s account with a password of their choosing, while the user’s own password would keep working. There would be no good way to even detect this happened post facto. Houston, we have a problem.

The solution, however, is neat enough, and also should put to rest the arguments about collisions. Let me lay it out:

Users Table: { UserID, Username, …, Salt1, Hash2 }

Hashes Table: { Hash1, Salt2 }

Salt1 and Salt2 are both 32-byte random values generated with a cryptographically secure pseudo-random number generator

Hash1 and Hash2 are both 32-byte hash values, hashed with scrypt if you please:

hash(Salt1, password) = Hash1

hash(Salt2, password) = Hash2

The primary key for the Hashes table can be left as { Hash1 } as long as you’re also not worried about supernovae impacting our climate. (Otherwise, feel free to make the primary key include Salt2)

When a password is first set, do the following:

Generate a 32-byte Salt1 and Salt2 using a CS-PRNG

Calculate Hash1, and insert Hash1 and Salt2 into the Hashes table:

INSERT INTO [Hashes] ( [Hash1], [Salt2] ) VALUES ( hash(Salt1, password), Salt2 )

Calculate Hash2, and update the Salt1 and Hash2 stored in the Users table:

UPDATE [Users] SET [Salt1] = Salt1, [Hash2] = hash(Salt2, password) WHERE [UserID] = UserID



To validate a user:

Retrieve Salt1 and Hash2 from the Users table:

SELECT [Salt1], [Hash2] FROM [Users] WHERE [UserID] = UserID

Retrieve Salt2 from the Hashes table:

SELECT [Salt2] FROM [Hashes] WHERE [Hash1] = hash(Salt1, password)

Verify the password hashed with Salt2 matches Hash2:

ASSERT hash(Salt2, password) == Hash2
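The set-password and validation steps above can be sketched end to end in Python. This is a toy under stated assumptions rather than a reference implementation: two dicts stand in for the Users and Hashes tables, the scrypt parameters are placeholders, and the names kdf and TwoTableStore are invented for the example.

```python
import hashlib
import hmac
import secrets

# ASSUMPTION: illustrative scrypt cost settings; tune for your own latency budget.
SCRYPT_PARAMS = dict(n=2**14, r=8, p=1, dklen=32)


def kdf(salt: bytes, password: str) -> bytes:
    """hash(salt, password) from the steps above, using scrypt."""
    return hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)


class TwoTableStore:
    def __init__(self, decoy_count: int = 1000):
        self.users = {}  # UserID -> (Salt1, Hash2)
        # Hash1 -> Salt2, pre-filled with random filler rows.
        self.hashes = {secrets.token_bytes(32): secrets.token_bytes(32)
                       for _ in range(decoy_count)}

    def set_password(self, user_id: str, password: str) -> None:
        salt1 = secrets.token_bytes(32)
        salt2 = secrets.token_bytes(32)
        self.hashes[kdf(salt1, password)] = salt2           # INSERT INTO Hashes
        self.users[user_id] = (salt1, kdf(salt2, password))  # UPDATE Users

    def verify(self, user_id: str, password: str) -> bool:
        record = self.users.get(user_id)
        if record is None:
            return False
        salt1, hash2 = record
        salt2 = self.hashes.get(kdf(salt1, password))  # look up Hash1
        if salt2 is None:
            return False
        # Round trip back to the Users row: Hash2 must match as well.
        return hmac.compare_digest(kdf(salt2, password), hash2)


if __name__ == "__main__":
    store = TwoTableStore()
    store.set_password("alice", "correct horse")
    # An attacker who knows Salt1 plants a row for a password of their choosing.
    salt1, _ = store.users["alice"]
    store.hashes[kdf(salt1, "evil")] = secrets.token_bytes(32)
    print(store.verify("alice", "correct horse"), store.verify("alice", "evil"))
```

One trade-off worth noting in this sketch is that a successful login now runs the hash twice, so each run has to fit in half of whatever latency budget you settled on.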



So what’s changed? The Hashes table is still unrelated to the Users table, and will still be chock-full of random noise, but now we essentially make a round trip back to the Users table. An attacker can no longer simply insert a row into your Hashes table to log in as a specific user with their own password, unless they can also generate a Salt2 value which, combined with their chosen password, would produce Hash2. That’s a preimage attack, which has 2^n complexity.

So we close the backdoor, but I also promised to prevent collisions. The new Salt2 value does this effectively, too. Now a given input would have to collide with one of the hashes in the Hashes table, and also collide on the specific {Salt2, Hash2} for the specific user.

Conclusions

It’s bad enough when your site gets hacked. It’s worse when you have to tell your users that their passwords have likely been leaked and attackers are playing in their PayPal accounts. We need better ways to protect against hash theft, since it’s too easy to invert hashes of weak passwords.

This approach decouples hashes from specific users, and fuzzes your hash tables with random data. This is not a replacement for scrypt, but instead attempts to increase the cost of an attack along new axes (storage and bandwidth), which should help deter attempts to steal your hashes and make them easier to detect. An additional benefit is that attackers can no longer target specific users without first managing to steal a large portion of a database whose size you can set directly.

Up Next

This approach may be cost prohibitive for smaller businesses, but that’s also part of the point. The technique could be used in a centralized manner where a third party maintains a single public, global Hashes table where all clients’ hashes are stored together. The API provided would be similar to:

// Verify an existing user’s password
// Finds Hash1 in distributed data store, returns 32-byte Salt2
// Returns 0 bytes if Hash1 does not exist in the data store
public byte[] GetSalt2(byte[] hash1);

// Add a new user, or change an existing password
// Returns a 32-byte CS-PRNG Salt2
// Use Salt2 to hash the password and then discard Salt2
public byte[] AddHash(byte[] hash1, byte[] salt2);

Aggregating hashes across multiple clients into a single Hashes table reduces the necessary overhead of injecting random data, and therefore decreases operating costs of the service.

Please post your comments on Reddit or Hacker News

[1] – Tradeoff tables for compression functions: how to invert hash values

[2] – A Large-Scale Study of Web Password Habits

[3] – Speed Matters – Google Research Blog

[4] – Top 10,000 Most Common Passwords

[5] – Consumer Password Worst Practices