Security breaches are very common. To make matters worse, when it comes to users’ passwords, it is often the case that no reasonable precautions were taken to ensure they can’t be easily extracted from the breached data.

People tend to use the same password, or simple variations of it, on multiple sites. This makes passwords easier to remember, but it also means that when one of those passwords is exposed, an attacker can potentially get access to the other websites where it was used.

The thing is, there’s no reason for this to be so common nowadays. Even though the science around cryptography is fairly complicated, the algorithms are easy to use and are readily available.

This blog post is an attempt at explaining what good qualities a stored password should have and which algorithms can be used to get there.

So what qualities should a stored password have? First, only the user should know it. There cannot be a way of getting to the password after it is stored.

That rules out saving the password in clear text, but it also rules out encrypting it.

This might sound like common sense, but unfortunately there have been examples of fairly prominent online businesses that would just email you your password in clear text if you used their “Forgot password” functionality.

The example I’m thinking of right now is Tesco, one of the biggest supermarket chains in the UK and Ireland. A few years ago, when pressed about it, they said that they were encrypting the passwords. This is not a good idea: if there’s a security breach and the cryptographic key used to encrypt the passwords is discovered, the attackers have access to all of the users’ passwords.

So if encrypting the passwords is not a good idea, what alternative can we use?

Hashing

Maybe we could hash the passwords. But first, what is hashing?

The idea behind hashing is that you can take any input and produce an output that has a specific size and from which you cannot “extract” the original input.

For example, here’s the output of the SHA1 hashing algorithm, in hexadecimal, for the string “hello world”:

22596363b3de40b06f981fb85d82312e8c0ed511

And here’s the SHA1 hash for “hello world!”:

f951b101989b2c3b7471710b4e78fc4dbdfa0ca6

And here’s the SHA1 hash for a 700 MB video file:

1b2803f08a8f2ba251a557cd61f1a38b07427011

Two very similar inputs produce completely different hashes, and irrespective of the size of the input, the output always has the same size.
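If you want to try this yourself, here is a minimal sketch in C#. It assumes .NET 6 or later (top-level statements and SHA1.HashData), and Sha1Hex is just a helper name made up for this example. Keep in mind that the digest depends on the exact bytes being hashed, so something as small as a trailing newline in the input changes the result completely.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: hash a UTF-8 string with SHA1, return lowercase hex.
static string Sha1Hex(string input)
{
    byte[] digest = SHA1.HashData(Encoding.UTF8.GetBytes(input));
    return Convert.ToHexString(digest).ToLowerInvariant();
}

string first = Sha1Hex("hello world");
string second = Sha1Hex("hello world!");

Console.WriteLine(first);
Console.WriteLine(second);

// SHA1 always produces 160 bits, i.e. 40 hex characters, whatever the input size.
Console.WriteLine($"{first.Length} and {second.Length} hex characters");
```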

Hashing algorithms are typically used for detecting duplicate data. Imagine you are storing video files and you want to determine if a certain video file has already been stored.

If, each time you need to store a new video file, you first compute its hash, you can then compare that hash with the previously stored hashes. If one of them matches, the file is a duplicate; if not, you can save the video file together with its hash. The advantage of doing this is that it is much faster to compare two hashes than two video files.
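A rough sketch of that duplicate check could look like this. The in-memory set and the StoreIfNew name are assumptions for the example (a real system would keep the hashes in a database), and I picked SHA256 here, although any of the hashing algorithms would illustrate the point. It assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hashes of everything stored so far (hypothetical in-memory store).
var storedHashes = new HashSet<string>();

// Returns true if the content was new and its hash got recorded,
// false if the same hash was already there (a duplicate).
bool StoreIfNew(byte[] content)
{
    string hash = Convert.ToHexString(SHA256.HashData(content));
    return storedHashes.Add(hash);
}

// Tiny byte arrays standing in for large video files.
byte[] videoA = { 1, 2, 3, 4 };
byte[] videoACopy = { 1, 2, 3, 4 };
byte[] videoB = { 1, 2, 3, 5 };

Console.WriteLine(StoreIfNew(videoA));     // True
Console.WriteLine(StoreIfNew(videoACopy)); // False
Console.WriteLine(StoreIfNew(videoB));     // True
```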

Another common use case for hashing is to detect if data has been corrupted while being transmitted.

For example, if a file needs to be transmitted through a medium that is not very reliable, before the file is sent a hash of it is created. Then, the file is sent together with the hash.

On the receiving end the hash is computed again and compared with the hash that was transmitted. If they don’t match it means the file was corrupted.
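Here is a small sketch of that check, assuming .NET 6 or later. The corruption is simulated by flipping a single bit, and IsIntact is a made-up helper name.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Receiver side: recompute the hash and compare it with the transmitted one.
static bool IsIntact(byte[] data, byte[] expectedHash) =>
    Convert.ToHexString(SHA256.HashData(data)) == Convert.ToHexString(expectedHash);

// Sender side: hash the file before sending it.
byte[] file = Encoding.UTF8.GetBytes("contents of the file");
byte[] sentHash = SHA256.HashData(file);

// Simulate corruption in transit by flipping one bit in a copy of the file.
byte[] received = (byte[])file.Clone();
received[0] ^= 0x01;

Console.WriteLine(IsIntact(file, sentHash));     // True
Console.WriteLine(IsIntact(received, sentHash)); // False
```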

Some examples of hashing algorithms are MD5, SHA1, SHA256 and SHA512 just to name some common ones.

The properties of these algorithms’ outputs seem very appropriate for storing passwords. Given a password, compute its hash and store that.

When the user tries to log in, compute the hash from the password the user entered in the login form. If that hash matches the stored one, then it’s the right password.

Unfortunately, there’s a catch here. Hashing is supposed to be very fast. In its original use cases (e.g. duplicate detection, validating that data has not been changed/corrupted) it needs to be very fast.

That makes it a bad match for storing passwords: if a hash is fast to compute, many hashes can be generated in a short period of time. So if you want to figure out what password was used to generate a particular hash, you can try many candidate passwords in a short period of time to see if any produces a hash that matches the one you are looking for.

Because most passwords people pick are somewhat predictable (e.g. words that you can find in a dictionary or simple variations of them), if an attacker manages to get access to a password’s hash, most of the time finding that password becomes an exercise in computing hashes of words in a dictionary until a match is found.

A way to speed up the process of getting a password from a hash is to use data from previous breaches, namely the most commonly used passwords. By pre-computing the hashes of the most common passwords, it is possible to just do a lookup from the hash to the password. If an attacker manages to get access to a website’s database where the hashes are stored, it is very likely that this process will reveal a significant number of passwords.

This method of looking up passwords by using pre-computed password hashes is commonly called a rainbow table attack (strictly speaking, a rainbow table is a space-efficient variant of such a lookup table). You don’t even need to generate the hashes yourself; pre-computed hashes are readily available online for purchase, and they are not too hard to find.
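To make the lookup idea concrete, here is a toy sketch. It builds a plain hash-to-password dictionary (not an actual rainbow table) from a four-entry list that stands in for a real list of common passwords. The names and the list are invented for the example, and it assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static string Sha256Hex(string s) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(s)));

// A tiny stand-in for a list of commonly used passwords.
string[] commonPasswords = { "123456", "password", "qwerty", "cutecats" };

// Pre-compute hash -> password once; the table can be reused against
// any breached database that stores plain, unsalted hashes.
var lookup = new Dictionary<string, string>();
foreach (string candidate in commonPasswords)
    lookup[Sha256Hex(candidate)] = candidate;

// A hash as it might appear in a breached table of unsalted hashes.
string leakedHash = Sha256Hex("cutecats");

if (lookup.TryGetValue(leakedHash, out var recovered))
    Console.WriteLine($"Recovered password: {recovered}");
```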

HMAC

If we can’t stop people from using the same passwords all the time, is there something we can do to make hashing more secure?

What if we just appended a random string to the passwords?

We can store the random string together with the hash of the password + random string combination. When the user tries to sign in, we append the random string to the password the user entered and recompute the hash. If the hashes match, the password is correct.

For example, if the password is “cutecats”, instead of computing the hash of “cutecats”, we can generate a random string (call it randomString), compute HASH("cutecats" + randomString) and store that hash together with randomString.

It is very unlikely you’ll find “cutecats” followed by a long random string in a rainbow table.

An effective way to counteract a rainbow table attack is to generate a random string per password. With the alternative, a single random string shared by all passwords, an attacker could still pre-compute all the hashes once and use them to try to get to the passwords.

If there’s a random string per password it means that the attacker can’t reuse any previously computed hashes.

The random string appended to the password is commonly known as a salt.

Although we could just use a hashing algorithm to perform these operations, for example SHA256(password + salt), there are reasons for not doing it. Imagine that the password is “cutecats” and the salt is “1salt”. Now imagine that the password is “cutecats1” and the salt is “salt”. Both combinations concatenate to “cutecats1salt”, so they generate the same hash. This is suboptimal.

There’s another algorithm that combines two inputs and generates an output with the same properties as the previously mentioned hashing algorithms. It’s named HMAC, which stands for Hash-based Message Authentication Code. Using HMAC, the pair “cutecats” and “1salt” and the pair “cutecats1” and “salt” generate different outputs.

Before we go into what HMAC is, let’s talk about MAC (Message Authentication Code) and what it’s used for.
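Both claims are easy to check. The sketch below assumes .NET 6 or later, and it assumes the password is used as the HMAC key and the salt as the message; Sha256Hex and HmacHex are helper names invented for the example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static string Sha256Hex(string s) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(s)));

// HMAC-SHA256 with the password as the key and the salt as the message.
static string HmacHex(string password, string salt)
{
    using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(password));
    return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(salt)));
}

// Plain concatenation: both pairs collapse into the same string, "cutecats1salt".
bool sameHash = Sha256Hex("cutecats" + "1salt") == Sha256Hex("cutecats1" + "salt");

// HMAC treats the two inputs separately, so the outputs differ.
bool sameHmac = HmacHex("cutecats", "1salt") == HmacHex("cutecats1", "salt");

Console.WriteLine(sameHash); // True
Console.WriteLine(sameHmac); // False
```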

Imagine you want to communicate with someone over the internet and, although you don’t mind if someone is listening to your conversation, you want to make sure that the conversation is not tampered with (i.e. no one can send a message as one of the two people participating in the conversation). A hash of the message wouldn’t do the trick here. A hash only guarantees that the message wasn’t corrupted. If a bad actor has access to the message and can alter it, they can also compute a new hash for the altered message. To the receiver of the message it would look like everything was OK.

A Message Authentication Code addresses this issue by using a secret that is known only by the parties that take part in the communication. Using Alice and Bob as an example, if Alice wants to send a message to Bob she’ll compute MAC(secret, messageToBob) and send that together with messageToBob. Bob, when he receives the message, will repeat the process and compare the MAC he generated with the one he received from Alice. If they match, Bob can be confident that not only was the message not altered, but also that it was sent by Alice, since she’s the only other person who could’ve generated a valid MAC.

In case you are wondering how Alice and Bob can agree on a secret even when there might be people listening to their communications, have a look at Brief(ish) explanation of how https works, where you’ll find an example of the Diffie–Hellman algorithm that is used exactly for this.

That is MAC. What about HMAC (Hash-based Message Authentication Code)? It’s just an implementation of MAC that allows us to pick from several different hashing algorithms to compute the MAC. For example, there’s HMAC-SHA256, where the hashing algorithm used is SHA256, or HMAC-SHA1, where the hashing algorithm is SHA1, etc.

How can we then use HMAC to store passwords? We could, for example, pick SHA256 and compute HMAC-SHA256(password, salt) and store that together with the salt.
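The Alice and Bob exchange can be sketched with HMAC-SHA256 like this. It assumes .NET 6 or later, the secret is simply generated at random here instead of being agreed upon over the network, and ComputeMac and Verify are names made up for the example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static byte[] ComputeMac(byte[] key, string message)
{
    using var hmac = new HMACSHA256(key);
    return hmac.ComputeHash(Encoding.UTF8.GetBytes(message));
}

// Bob recomputes the MAC and compares it with the received one in constant time.
static bool Verify(byte[] key, string message, byte[] mac) =>
    CryptographicOperations.FixedTimeEquals(ComputeMac(key, message), mac);

// Secret known only to Alice and Bob (randomly generated for the example).
byte[] secret = RandomNumberGenerator.GetBytes(32);

// Alice sends the message together with its MAC.
string messageToBob = "Meet me at noon";
byte[] macFromAlice = ComputeMac(secret, messageToBob);

Console.WriteLine(Verify(secret, messageToBob, macFromAlice));          // True
Console.WriteLine(Verify(secret, "Meet me at midnight", macFromAlice)); // False
```

Someone who alters the message but doesn’t know the secret can’t produce a MAC that Bob will accept.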
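As a sketch of that scheme, assuming .NET 6 or later, a 16 byte salt (my choice for the example) and the password used as the HMAC key, with CreateRecord being a made-up name:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Returns what would be stored for a user: a fresh random salt and
// HMAC-SHA256(password, salt), with the password as key and the salt as message.
static (byte[] Salt, byte[] Hash) CreateRecord(string password)
{
    byte[] salt = RandomNumberGenerator.GetBytes(16);
    using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(password));
    return (salt, hmac.ComputeHash(salt));
}

var alice = CreateRecord("cutecats");
var bob = CreateRecord("cutecats");

// Same password, different salts, so the stored hashes differ.
Console.WriteLine(Convert.ToBase64String(alice.Hash) == Convert.ToBase64String(bob.Hash)); // False
```

Because every user gets a different salt, a table pre-computed for one salt is of no use against any other user.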

Solved? Unfortunately, no.

Although rainbow table attacks wouldn’t work here, today’s hardware can generate these hashes so quickly that it becomes feasible to just try passwords from a list of commonly used passwords, or even to do a brute force attack: compute HMAC-SHA256(testPassword, salt) where testPassword is “a”, then “aa”, then “ab”, and so on, through all combinations of letters, numbers and symbols up to a certain length.

To give you an idea, a high end graphics card that you can go buy today, for example an NVIDIA GTX 1080, can do ~4450 million SHA256 hashes per second. It is possible to have several of these cards in a single computer, so that number can be made much, much higher.
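To put that number in perspective, here is a back-of-the-envelope calculation. The search space is my own assumption for the example: every 8 character password made of lowercase letters and digits. It also pretends that each guess costs a single SHA256 hash; an HMAC needs a few of them per guess, so the real figure is higher by a small constant factor.

```csharp
using System;

// Figure quoted above: ~4450 million SHA256 hashes per second on one card.
const double hashesPerSecond = 4450e6;

// Assumed search space: 8 characters, each one of a-z or 0-9 (36 symbols).
double candidates = Math.Pow(36, 8);

double seconds = candidates / hashesPerSecond;
Console.WriteLine($"{candidates:N0} candidates, about {seconds / 60:F1} minutes on one card");
```

Under those assumptions, a single card goes through the whole space in roughly ten minutes.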

Key Derivation

So if HMAC does not work, then what?

The idea of using a salt and a password is a good one. The only reason it doesn’t work is that computing an HMAC is so fast that it is possible to guess the password in a reasonable amount of time. To do this, the attacker would only need the password’s hash and the salt (both of which would be exposed in a breach) and hardware capable of generating many hashes per second.

Maybe there’s a way to make the process more computationally expensive?

What if we do HMAC(password, salt)=hash1 and then HMAC(password, hash1)=hash2 and repeat this a large number of times, for example 100,000? That would significantly slow down the process. Trying out passwords one by one, or a brute force approach, would no longer be reasonable.

There are a few algorithms for doing just that, and they do a little more than just repeat HMAC. They are called key derivation functions.
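Just to illustrate the idea, the chain described above could be written like this. This is a naive sketch and not one of the standard algorithms, so it shouldn’t be used as is. It assumes .NET 6 or later, the password as the HMAC key, and IteratedHmac is a made-up name.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// hash1 = HMAC(password, salt), hash2 = HMAC(password, hash1), and so on.
static byte[] IteratedHmac(string password, byte[] salt, int iterations)
{
    using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(password));
    byte[] current = salt;
    for (int i = 0; i < iterations; i++)
        current = hmac.ComputeHash(current);
    return current;
}

byte[] userSalt = RandomNumberGenerator.GetBytes(16);
byte[] slowHash = IteratedHmac("cutecats", userSalt, 100_000);

Console.WriteLine(Convert.ToBase64String(slowHash));
```

Every guess now costs the attacker 100,000 HMAC computations instead of one.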

The idea behind key derivation is that given a key you generate one or more keys from that original key that have certain properties. For example, the AES-256 encryption algorithm requires a key with 256 bits.

It is not reasonable to ask a user to come up with a key of exactly 256 bits by hand. That key would be extremely hard to memorize.

That’s one of the uses of key derivation. It’s a way of going from a key, for example the password the user thinks up, to another key that has a specific size.

But apart from that, key derivation functions have the property of being, with the right configuration, very computationally expensive.

For example, PBKDF2 (Password-Based Key Derivation Function 2) can be configured with a password, a salt, a particular HMAC algorithm and the number of iterations for which the HMAC should be applied. There are also bcrypt and scrypt, which can be used in a similar fashion.

As an example, here’s what using PBKDF2 in dotnet core looks like, with a 128 bit salt, 100,000 iterations and an output key size of 128 bits (16 bytes).
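Going back to the AES-256 example, here is a sketch of turning a password into a key of exactly 256 bits. It uses Rfc2898DeriveBytes.Pbkdf2, the PBKDF2 implementation built into .NET 6 and later; the salt size and the iteration count are assumptions picked for the example.

```csharp
using System;
using System.Security.Cryptography;

// A random salt, stored next to whatever ends up encrypted with the key.
byte[] keySalt = RandomNumberGenerator.GetBytes(16);

// Stretch the password into 32 bytes (256 bits), the key size AES-256 needs.
byte[] aesKey = Rfc2898DeriveBytes.Pbkdf2(
    "cutecats", keySalt, 100_000, HashAlgorithmName.SHA256, 32);

Console.WriteLine($"{aesKey.Length * 8} bit key"); // 256 bit key
```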

    using System;
    using System.Security.Cryptography;
    using Microsoft.AspNetCore.Cryptography.KeyDerivation;

    const int saltSize = 128 / 8;        // 128 bit salt, in bytes
    const int outputKeySize = 128 / 8;   // 128 bit derived key, in bytes
    const int numberOfIterations = 100000;
    var password = "cutecats";

    using (var rng = RandomNumberGenerator.Create())
    {
        // Generate a fresh random salt for this password.
        var salt = new byte[saltSize];
        rng.GetBytes(salt);

        var derivedKey = KeyDerivation.Pbkdf2(password, salt, KeyDerivationPrf.HMACSHA256, numberOfIterations, outputKeySize);

        Console.WriteLine($"The salt in base64: {Convert.ToBase64String(salt)}");
        Console.WriteLine($"The 'hash/derived key' in base64: {Convert.ToBase64String(derivedKey)}");
    }

To run this example you need to install the Microsoft.AspNetCore.Cryptography.KeyDerivation NuGet package.

If you are wondering why I picked a 128 bit salt or a 128 bit output key, I don’t have a good answer for that, but a good rule of thumb is that the output key should be no longer than what the hashing algorithm produces. For example, we used SHA256, which produces a 256 bit (32 byte) output.
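For completeness, here is a sketch of the other half: checking a login attempt against the stored salt and derived key. To keep it free of external packages I swapped in Rfc2898DeriveBytes.Pbkdf2, the PBKDF2 implementation built into .NET 6 and later, in place of the KeyDerivation package. HashPassword and VerifyPassword are names invented for the example.

```csharp
using System;
using System.Security.Cryptography;

const int SaltSize = 128 / 8;
const int KeySize = 128 / 8;
const int Iterations = 100_000;

// At sign-up: generate a fresh salt and store (salt, key) in the user's record.
static (byte[] Salt, byte[] Key) HashPassword(string password)
{
    byte[] salt = RandomNumberGenerator.GetBytes(SaltSize);
    byte[] key = Rfc2898DeriveBytes.Pbkdf2(
        password, salt, Iterations, HashAlgorithmName.SHA256, KeySize);
    return (salt, key);
}

// At login: derive a key from the attempt with the stored salt and compare
// the two keys in constant time.
static bool VerifyPassword(string attempt, byte[] storedSalt, byte[] storedKey)
{
    byte[] attemptKey = Rfc2898DeriveBytes.Pbkdf2(
        attempt, storedSalt, Iterations, HashAlgorithmName.SHA256, storedKey.Length);
    return CryptographicOperations.FixedTimeEquals(attemptKey, storedKey);
}

var record = HashPassword("cutecats");
Console.WriteLine(VerifyPassword("cutecats", record.Salt, record.Key)); // True
Console.WriteLine(VerifyPassword("cutedogs", record.Salt, record.Key)); // False
```

The password itself is never stored, only the salt and the derived key.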

By the way, “cutecats” is a horrible password. You should use a password manager like LastPass, 1Password or similar to generate passwords for you. That’s because even if a password is stored correctly, if it’s something like “cutecats” or a word you’d find in a dictionary, then even if it takes 1 second to try each password, an attacker might find it just by guessing and having a little patience.

So that’s it. I hope I’ve helped explain why we should use passwords with a reasonable length and combinations of letters, numbers and symbols, and why they should not be stored in a way that can be easily “reversed” back to the original password.
