Have you ever typed your social security number into a form on the Internet and wondered, “should I really click submit?” Most of us place an extraordinary level of trust in websites run by people we’ve never met, using standards we don’t understand. We’ll happily type our social security numbers, passwords, credit card numbers, addresses, phone numbers, and all kinds of other sensitive information into web pages, only to send that data off into the world without a second thought. On the way to its destination, our sensitive data flies through the air as radio waves, travels through copper wire as electrical signals, and zips through fiber optic cabling as blasts of light. While performing the complex song-and-dance of the Internet, our data often passes through public channels that can easily be monitored by an enterprising hacker, or a stalwart government entity like the NSA or KGB.

When your data leaves your machine, where does it go? What happens to it along the way? And what systems have been put in place to ensure that your information is kept private as it travels, and after it arrives at its final destination? The short answer is: quite a lot. So strap in as we take you on a tour of the secret life of your username and password in order to expose the trials and tribulations of keeping a secret on the web.

Vulnerable Before We Even Click Send

Web pages are typically loaded using a protocol called Hypertext Transfer Protocol (HTTP), hence the http:// at the start of so many URLs. Development of this protocol began in 1989, in a world where the Internet was fundamentally different from what it is now. Back then, the Internet was mostly used by academic researchers at universities. The people developing the Internet were a small group of friends and colleagues: they ran into each other at conferences and cheered each other on when their names appeared on papers in journals. Furthermore, the Internet was not being used to share significant amounts of private information. As a result, the protocols developed to power the Internet assumed trust between all parties.

The specifications for HTTP and its underlying protocol, the Transmission Control Protocol (TCP), did not (and still do not) concern themselves with privacy, encryption, or any other security-related issues. Under HTTP, data is transferred across the net as is; anyone who can “tap the wire” can read the data being transmitted between two parties. What’s more, a sophisticated attacker can execute a “man-in-the-middle” attack: if the attacker can set up a computer somewhere along the path from Alice to Bob, they can receive outbound messages from Alice, keep them, and send replacement messages to Bob. In essence, a man in the middle can impersonate Alice without Bob knowing.
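To make “as is” concrete, here is a minimal Python sketch of what a login form submission looks like on the wire under plain HTTP. The host name, path, and field names are made up for illustration, and nothing is actually sent over a network; the point is only that anyone holding the raw bytes can recover the credentials with a standard parser.

```python
from urllib.parse import urlencode, parse_qs


def build_login_request(username: str, password: str) -> bytes:
    """Build the raw bytes of a plain-HTTP form POST (hypothetical host and path)."""
    body = urlencode({"user": username, "pass": password})
    head = (
        "POST /login HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return (head + body).encode("ascii")


def eavesdrop(raw_request: bytes) -> dict:
    """What anyone tapping the wire can do: split off the body and parse it."""
    _, _, body = raw_request.partition(b"\r\n\r\n")
    fields = parse_qs(body.decode("ascii"))
    return {name: values[0] for name, values in fields.items()}


wire_bytes = build_login_request("alice", "superSafe123")
stolen = eavesdrop(wire_bytes)
print(stolen)
```

No keys, no cracking, no cleverness: the password is simply there in the bytes.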

As the web became mainstream, new standards had to be invented to protect against this kind of attack (and others too). In 1994 the first version of the Secure Sockets Layer (SSL) protocol was invented, and HTTPS (HTTP over SSL) was introduced. SSL and its successor, Transport Layer Security (TLS), are protocols that, unlike TCP, were invented to protect users in an environment of explicit mistrust (paranoia is a best practice among software security professionals). Even though SSL has been around for 24 years, and provides significantly more assurances than the assurance-less TCP, it still does not enjoy 100% adoption among website operators. This lack of adoption creates the first opportunity for hackers to steal your credentials before you’ve even clicked the submit button. Consider this:

When data is loaded over HTTP, no work is done by the protocol to validate that the data you requested is actually the data you received. Say you punch http://www.facebook.com into your URL bar. Your browser resolves the IP address for that server via the Domain Name System (DNS), then sends an HTTP request to that address, expecting to receive the Facebook homepage from a Facebook server. If a man-in-the-middle attack is successfully executed, an intruder (we’ll call her Trudy) reads all of the information you send to Facebook. Instead of simply forwarding information between you and Facebook (as a non-malicious node in the Internet would do), Trudy keeps the information and subtly alters the responses from Facebook to contain one new inline script tag. That script attaches a mouseover handler to the login button and sends the contents of the username and password fields to a server Trudy controls.

Now, when your mouse hovers over the login button, Trudy’s phishing server receives your username and password as you typed them. We’re using the mouseover event because Trudy still wants the “regular” thing to happen when you actually click the button: you log in to Facebook and are left unaware that your credentials have just been compromised.

If we didn’t want to rely on users mousing over and actually clicking submit, we could also detect key-presses when the password box has focus, or use the blur event to detect when the password box loses focus. Turning off JavaScript completely can help prevent attacks like these, but it’s possible for attackers to glean some information by inserting malicious CSS sheets without any JavaScript at all.

Even if the request made when a user actually clicks “login” is an HTTPS request, so long as the landing page was served over HTTP, we are vulnerable to these attacks. Of course, if the credentials themselves were sent over HTTP, Trudy doesn’t have to be nearly as sophisticated; she simply makes a note of the username and password as they pass through her computer on the way to Facebook.

Successfully executing a woman-in-the-middle attack requires Trudy to be sophisticated. In this example it also requires Facebook to be negligent, and further requires the user to be uninformed. The real Facebook, in keeping with best practices, never serves their login page over HTTP; they always use HTTPS. You, the informed user, know never to enter sensitive information into a website that doesn’t have the little green HTTPS lock icon. How does HTTPS protect us from Trudy, and other would-be data thieves?

Certificate Authorities and the TLS Handshake

TLS provides two crucial services in the context of this example: endpoint verification, and encryption. Endpoint verification is an assurance that the data you are receiving comes exclusively from the person/server you think it is (e.g. Facebook). In the previous example, Trudy was able to sneak some small amount of data into our transaction with Facebook. With endpoint verification we can be sure that any information we receive during the transaction will have come directly from Facebook (even if Trudy is still in the middle of our connection). With encryption we can ensure that Trudy can’t read any of the data passed between the two parties.

Connections made over SSL/TLS always perform a “handshake” before any actual application data (such as a username/password) is exchanged. During this handshake the two communicating parties establish a secure (encrypted) connection, and at least one party’s identity is established. The handshake has several steps, and involves both asymmetric and symmetric encryption:

1. The two parties negotiate a “ciphersuite” to use for this connection. The ciphersuite names a symmetric encryption algorithm (AES is a common choice) and a cryptographic hash algorithm (the SHA-2 family is a common choice). Alongside it, each party sends a random value called a nonce, which ensures that each handshake is unique (this guards against replay attacks).
2. The server (Facebook in our running example) sends something called a certificate to the client. Optionally, the server may request the client’s certificate as well, but in our running Facebook example this does not happen.
3. The client verifies the server’s certificate and, provided that the certificate is valid, extracts the server’s public key from it. If the certificate is not valid, the client abandons the connection.
4. Based on the negotiated ciphersuite, the client creates a value called the pre-master secret. This value is encrypted using the server’s public key, and sent to the server.
5. The server receives and decrypts the pre-master secret. Based on the ciphersuite, the nonces, and the pre-master secret, the server generates the master secret. Independently, the client also generates the master secret. If both the client and server follow all the rules, the independently generated master secrets will match.
6. The master secret supplies the key material for the previously agreed upon symmetric encryption algorithm; now that both the client and server have generated it, they begin using the symmetric encryption algorithm to communicate.
7. Finally, both parties independently use the negotiated hash algorithm to compute the hash of all the messages exchanged so far, and send their computed hash codes to each other. If the hash codes do not match, the parties have detected an intrusion, and they abandon the connection. Otherwise, they begin sending application data (your username and password) using the negotiated symmetric encryption algorithm.
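The “independently generate the same master secret” step can be sketched in a few lines of Python. The expansion function below is modeled on the HMAC-based key derivation described in the TLS 1.2 specification, but treat it as a simplified illustration rather than an interoperable implementation: the function names are my own, the nonces and pre-master secret are random stand-ins, and the public-key encryption of the pre-master secret is skipped entirely.

```python
import hashlib
import hmac
import secrets


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """Stretch a secret into `length` bytes by chaining HMAC-SHA256 calls."""
    output = b""
    a = seed  # A(0) is the seed; each round computes A(i) = HMAC(secret, A(i-1))
    while len(output) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()
        output += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return output[:length]


def derive_master_secret(pre_master: bytes, client_nonce: bytes, server_nonce: bytes) -> bytes:
    """Mix the pre-master secret with both nonces into a 48-byte master secret."""
    seed = b"master secret" + client_nonce + server_nonce
    return p_sha256(pre_master, seed, 48)


# Values exchanged during the handshake (random stand-ins for illustration).
client_nonce = secrets.token_bytes(32)
server_nonce = secrets.token_bytes(32)
pre_master = secrets.token_bytes(48)  # in the flow above, sent encrypted to the server

# Each side runs the same computation on its own machine.
client_master = derive_master_secret(pre_master, client_nonce, server_nonce)
server_master = derive_master_secret(pre_master, client_nonce, server_nonce)
print(client_master == server_master)

# A replayed handshake meets a fresh server nonce, so the old keys no longer fit.
replayed_master = derive_master_secret(pre_master, client_nonce, secrets.token_bytes(32))
print(replayed_master == server_master)
```

Because the nonces are part of the input, recording an old handshake and replaying it later produces a different master secret, which is why the replay fails.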

(Figure: The TLS Handshake)

To understand why this flow protects us from the Trudys of the world, we have to understand a little bit about public key encryption, as well as the role of certificate authorities in creating and signing certificates.

Public key encryption is simple enough in concept (and the complex details are beyond the scope of this article). With symmetric encryption, the same key is used to encrypt and decrypt any data you wish to transmit. This is just like a lock on a house — if you get a new roommate, you give them an identical copy of the key you already have. That key locks the door, and unlocks the door. With public key encryption, two keys must be generated: the public key and its corresponding private key. Data that is encrypted with the public key can only be decrypted with the private key; data that is encrypted with the private key can only be decrypted with the public key.
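For readers who like to see the numbers move, here is a toy sketch of the two-key idea using textbook RSA with tiny primes. It is purely illustrative: real keys are built from primes hundreds of digits long, real systems add padding, and none of the variable names below come from any particular library.

```python
# Toy RSA key pair built from two tiny primes (never use sizes like this for real).
p, q = 61, 53
n = p * q                  # the modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent: (e, n) is the public key
d = pow(e, -1, phi)        # private exponent: (d, n) is the private key (Python 3.8+)


def apply_key(value: int, exponent: int, modulus: int) -> int:
    """Both encryption and decryption are the same operation with different exponents."""
    return pow(value, exponent, modulus)


message = 65  # a message must be a number smaller than n

# Encrypt with the public key; only the private key undoes it.
ciphertext = apply_key(message, e, n)
recovered = apply_key(ciphertext, d, n)

# "Encrypt" with the private key; anyone with the public key can undo it,
# which shows that the holder of the private key produced the value.
seal = apply_key(message, d, n)
checked = apply_key(seal, e, n)

print(ciphertext, recovered, checked)
```

The same function runs in both directions; what changes is which exponent you hold.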

Encrypting something with a private key is similar to the ancient and medieval use of sealing wax and wax emblems. If a letter is sealed with the emblem of the Stark direwolf, anyone in the world can open it (just apply force/use the public key). Additionally, the person who first opens it can be confident that the letter inside came from House Stark.

For the life of me, I cannot come up with a great metaphor for encrypting something using the public key. An okay metaphor is that encrypting data with the public key is like using a safe with a mail slot. Anyone can put mail into the safe, and no one can read the mail in the safe except the person whose safe it is. (If you have a metaphor you like more, leave it in the comments!)

Metaphors aside, every website operator who wants to use HTTPS must generate a public/private key pair. Public keys, as you might have guessed, are made public; whereas the private keys are carefully safeguarded. The result of this is that anyone can use Facebook’s public key to encrypt data and only Facebook can decrypt that data. This is what happens with the pre-master secret — even if Trudy intercepted all the data sent during the handshake, she couldn’t read the pre-master secret without Facebook’s private key. Without the pre-master secret, it would be impossible for Trudy to determine the master secret, and therefore impossible for her to decrypt our symmetric-key encrypted messages.

A Certificate Authority (CA) is a trusted third party; they are neither Facebook nor are they a user, but (presumably) both parties trust the CA. The CA’s role in all of this is to “sign” Facebook’s certificate with their (the CA’s) private key. The flow of information looks like this:

1. Facebook asks Verisign (a major CA) to create and sign a certificate for them. In the process, Facebook provides Verisign with Facebook’s public key.
2. Verisign creates a certificate, which includes identifying information about Facebook and Facebook’s public key.
3. Verisign signs the certificate with Verisign’s private key. Concretely, the signature is a hash of the certificate’s contents encrypted with the private key, so anyone can decrypt the signature using Verisign’s public key. If the decrypted value matches a freshly computed hash of the certificate, we know that the certificate was indeed signed with Verisign’s private key and has not been altered since.
4. When Facebook sends us their certificate, we use Verisign’s public key to check the signature. If it checks out, we can extract Facebook’s public key, confident (because we trust Verisign) that the key really belongs to Facebook. Since only the holder of the matching private key can complete the handshake, we have determined that we are indeed communicating with Facebook.
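In practice you rarely perform this check by hand; browsers and TLS libraries ship with a list of trusted CA certificates and do it for you. As a rough illustration, the Python sketch below uses the standard library’s ssl module. The helper name fetch_peer_certificate is my own, and it is defined but not called here so that the example runs without network access.

```python
import socket
import ssl

# The default context loads the system's trusted CA certificates and
# turns on both certificate verification and hostname checking.
context = ssl.create_default_context()


def fetch_peer_certificate(hostname: str, port: int = 443) -> dict:
    """Connect over TLS and return the server's verified certificate.

    The handshake raises ssl.SSLCertVerificationError if the certificate
    was not signed by a trusted CA or does not match the hostname.
    """
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```

Calling fetch_peer_certificate with a real hostname would return a dictionary describing the certificate, including who it was issued to and which CA issued it.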

So, through SSL/TLS we can establish a secure connection and verify that we are indeed connected to the person/server we think we are (provided we can trust Verisign). But as the big Equifax hack showed, just because our data was encrypted in transit doesn’t mean it’s safe in storage.

Password Storage and Hashing

Once your information has been received and the server at Facebook has decrypted it, Facebook has to validate your credentials. This means that Facebook has to compare the information their server just received with the information they have on file. If someone sends your username, and something other than your password, they need to know to refuse service to that individual. Facebook might be tempted to simply store your password as plain text in their database when you initially sign up for the service; this would clearly allow them to compare the password you send during a login attempt. Doing so, however, would be a serious vulnerability.

Imagine you work at Facebook, and in particular you manage some of the databases there. You have administrative privileges, which means you can query the database at any time and examine the data therein. Further imagine that someone you know had been saying slanderous things about you on Facebook. You might be tempted to just quickly look at their password, log in as them, and publicly apologize to yourself on behalf of this misguided slanderer. Or maybe you became disgruntled with Facebook itself, and decided to get a little bonus by illegally selling a large cache of username/password combinations on the underground market. Finally, the risk of a third party (a hacker) gaining access to the database cannot be reduced to zero either. Even if all of your employees can be trusted completely, your database might still be at risk of being exposed.

So, the problem for password storage becomes: how can I store enough information to authenticate a user, without giving someone with access to the database the ability to impersonate that user? In other words, how can we check that someone provided the correct password, without actually storing the password itself? For this, we use cryptographically secure hash functions.

A cryptographic hash function is a “one way” mapping from some variable length input (your password) to a fixed length output (called a hash code or hash value). Having the hash code of a particular value does not give you enough information to reconstruct the input value. Furthermore, any input value (ideally) must map to a unique output value — meaning no two distinct passwords could ever produce the same hash code.

Because the output length is fixed, it’s not theoretically possible to guarantee this property of a hash function. For example, if the hash code was only 8 bits long then there would only be 256 possible unique hash codes, meaning that as soon as 257 unique passwords had been hashed, at least one collision would be found regardless of the process for computing that hash code. Modern cryptographic hash algorithms produce longer hash codes; values between 256 and 512 bits in length are common. 256-bit hash codes yield 2²⁵⁶ ~= 1.16e+77 unique hash codes (a 78-digit number). 512-bit hash codes yield 2⁵¹² ~= 1.34e+154 unique hash codes (that’s a 155-digit number, so it’s a lot). So while it’s theoretically impossible to ensure there are no collisions, we can make it extraordinarily unlikely that a collision is created by accident, and computationally infeasible to create a collision intentionally.
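The counting argument is easy to check for yourself. The Python sketch below builds a deliberately tiny 8-bit “hash” by keeping only the first byte of a SHA-256 digest (my own construction, purely for illustration), feeds it 257 distinct passwords, and reports the first collision it finds. It also counts the digits in 2 to the power 256 and 2 to the power 512 to show how much room a full-size hash has.

```python
import hashlib


def tiny_hash(text: str) -> int:
    """An 8-bit 'hash': the first byte of SHA-256, so only 256 outputs exist."""
    return hashlib.sha256(text.encode("utf-8")).digest()[0]


def first_collision(candidates):
    """Return the first pair of distinct inputs that share a hash code, if any."""
    seen = {}
    for candidate in candidates:
        code = tiny_hash(candidate)
        if code in seen:
            return seen[code], candidate
        seen[code] = candidate
    return None


# 257 distinct inputs into 256 possible outputs: a collision is guaranteed.
passwords = [f"password{i}" for i in range(257)]
collision = first_collision(passwords)
print(collision)

# The number of possible outputs for full-size hashes, measured in decimal digits.
digits_256 = len(str(2 ** 256))
digits_512 = len(str(2 ** 512))
print(digits_256, digits_512)
```

In practice the collision shows up long before the 257th password, which is the birthday paradox at work.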

To avoid storing your password, Facebook instead stores a cryptographic hash of your password. If the hash function is a good one, then the only input that maps to the generated hash code is your password. Because it is impossible to use the hash code to generate the associated password (cryptographic hash functions are “one-way” functions) anyone who compromised the database of hashed passwords couldn’t use that information to masquerade as one of the users. When you want to log in, Facebook simply recomputes the hash code of your password and compares it to the hash code in the database — if they match then the passwords matched and Facebook can now once again discard the plaintext password.
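Here is a minimal Python sketch of that sign-up and login flow. The in-memory dictionary stands in for a real database, the function names are hypothetical, and SHA-256 is used only because it is readily available; the constant-time comparison via hmac.compare_digest is my addition rather than something described above.

```python
import hashlib
import hmac

# Hypothetical stand-in for the users table: username -> hash code.
user_table = {}


def hash_password(password: str) -> str:
    """Map a password of any length to a fixed-length hex hash code."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def sign_up(username: str, password: str) -> None:
    """Store only the hash code; the plaintext password is never saved."""
    user_table[username] = hash_password(password)


def log_in(username: str, password: str) -> bool:
    """Recompute the hash of the supplied password and compare with the stored one."""
    stored = user_table.get(username)
    if stored is None:
        return False
    return hmac.compare_digest(stored, hash_password(password))


sign_up("alice", "superSafe123")
print(log_in("alice", "superSafe123"))
print(log_in("alice", "wrong-guess"))
print(user_table["alice"])
```

Anyone who reads user_table sees only 64 hexadecimal characters per user, not the password that produced them.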

Unfortunately, hackers have found ways to attack this storage mechanism as well.

Rainbow Tables

Many cryptographic hash functions do ultimately become “broken”. Sometimes computational power catches up, and it becomes computationally feasible to create a collision by brute force (generate tons of random passwords and hash them, until you generate the same hash code twice); sometimes a vulnerability is discovered in the algorithm that originally went unnoticed. This happened with the MD5 algorithm: academic researchers demonstrated practical MD5 collisions in 2004, and in 2012 a “collision attack” against MD5 was discovered to be a major component of the Flame virus. Flame was the work of highly sophisticated experts, reportedly a collaboration between analysts at the CIA, NSA, and Israeli military.

Breaking a strong hash algorithm takes a great deal of computational power and expertise. More commonplace attacks against cryptographic hash functions rely on completely different tactics. Like so many things in security, a major source of vulnerability in hashed passwords is people. Would-be attackers quickly realized that, given the choice between security experts who spend years meticulously crafting algorithms specifically designed to thwart them and John “Don’t Know Sh*t About Cyber Security” Doe, John is the easier mark.

A Rainbow Table is an attack targeting databases with hashed passwords, designed around the idea that many people do at least one of these things:

- Use the same password many times for different services.
- Choose an easy to guess password.
- Use a password that someone else has used (conjecturing that most people are not particularly original when it comes to passwords).

Using these three assumptions, and combining them with other phishing attacks, attackers create large databases of known passwords with their pre-computed hash codes. To generate these tables, attackers start with lists of passwords known to be common (things like password12345, 123abc, 1q2w3e4r5t6y) as well as any passwords they acquire through phishing. For all of the known passwords in the list they generate hash codes for popular hash algorithms, and then save them in a database. Such a database is called a “rainbow table”. (Strictly speaking, a rainbow table is a space-saving refinement of this kind of lookup table, but the attack works the same way.)

Now, when the attacker finally compromises Facebook’s database, they don’t need to “decrypt” anything, nor do they need to try to manufacture a collision. The attacker just needs to determine if the hash code in the Facebook database matches any of the pre-computed hash codes by looking in the rainbow table. If a hash code matches, the attacker just looks at the associated plaintext password that was used to generate the hash code in the rainbow table. Because common passwords are common, compromised databases and rainbow tables will have significant overlap.
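The whole attack fits in a few lines, which is part of what makes it so effective. The Python sketch below uses the simple lookup-table form described above; the usernames and the “leaked” rows are invented for illustration, and SHA-256 stands in for whatever hash algorithm the compromised site happened to use.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Step 1 (done well in advance): hash every known common password once.
common_passwords = ["password12345", "123abc", "1q2w3e4r5t6y"]
lookup_table = {sha256_hex(password): password for password in common_passwords}

# Step 2 (after a breach): the attacker holds rows of username -> hash code.
# These rows are made up; a real dump would contain only the hashes.
leaked_rows = {
    "alice": sha256_hex("1q2w3e4r5t6y"),
    "bob": sha256_hex("Xk#9vTq!72mLpZ"),
}

# Step 3: no decryption, no collisions, just a dictionary lookup per row.
cracked = {
    username: lookup_table[hash_code]
    for username, hash_code in leaked_rows.items()
    if hash_code in lookup_table
}
print(cracked)
```

Alice’s common password falls instantly, while Bob’s unusual one is simply not in the table.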

In order to combat rainbow tables, most modern password databases use another tactic called “salt”.

Salted Hash: More Than Just A Breakfast Food

Rainbow tables are based on the assumption that attackers can use known passwords to pre-compute the hash values that would appear in a database of hashed passwords. They begin this computation well in advance of actually compromising a particular database; once a database of hashed passwords is compromised, it is trivially easy to map the compromised password hashes to their plaintext counterpart in the rainbow table. In order to combat rainbow tables, security people use salt. The purpose of salt is to make it impossible to create the rainbow table in advance by introducing a dash of randomness.

Salt is easy to understand through examples, so let’s start with one. Suppose my password is “superSafe123”. Using openssl via the command line we can compute the SHA-256 hash of my password (-n below suppresses the trailing newline from echo; I chose SHA-256 so that the output would fit neatly on more screen sizes, though SHA-512 is more secure):

echo -n "superSafe123" | openssl sha256
6c4cb5b0abd0fff82f5132a1597cc1424d2b20ebc49e19e2b4621081f7e754dd

When we use salt, instead of hashing the password directly, we concatenate the password with “the salt”, which is some randomly chosen value. Then we prepend the salt to the generated hash and add a delimiter (a colon or semicolon are common choices; it really does not matter as long as it’s consistent) so that we can later separate the salt from the hash. So, keeping with our example, suppose I randomly generated the value “4321” as my salt. Now hashing looks like this (real salt values should be longer, and ideally they should be globally unique):

echo -n "superSafe1234321" | openssl sha256
d9362e30ffa5940a7e60b0f26d78e7ac12a2065175114e1d4b828b8e91d8857a

The hash has changed significantly due to the addition of the salt; as we expect for a cryptographic hash function. Even small changes in the input result in dramatically altered hash codes. Before we can store the value in our database, we need to make a note of what salt value was used (I’ve used a colon as a delimiter):

4321:d9362e30ffa5940a7e60b0f26d78e7ac12a2065175114e1d4b828b8e91d8857a

The addition of salt is no problem if we just want to verify that a supplied password matches the password stored in our database. We simply query our table via the username, extract the salt by grabbing all the characters prior to the delimiter, concatenate that to the password we were provided, and hash as usual. For people trying to maintain rainbow tables, however, salt is a nightmare.

A unique salt causes even the most common passwords to create a unique hash code when hashed with the salt; pre-computation based attacks will fail unless an attacker is able to systematically guess what salt value you used for every single password. This addition of randomness guards against rainbow tables because pre-computation of all possible hash(password+salt) combinations is infeasible, whereas pre-computation of just hash(password) is completely reasonable.
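Putting the pieces together, here is a minimal Python sketch of the salt-then-hash scheme from this section: concatenate password and salt, hash, and store the result as salt:hash. The function names are hypothetical, and plain SHA-256 is used only to mirror the openssl example; production systems generally reach for a deliberately slow password-hashing function (such as the standard library’s hashlib.scrypt or hashlib.pbkdf2_hmac) rather than a single fast hash.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def hash_with_salt(password: str, salt: Optional[str] = None) -> str:
    """Return 'salt:hash', generating a fresh random salt unless one is given."""
    if salt is None:
        salt = secrets.token_hex(16)  # 32 hex characters of randomness
    digest = hashlib.sha256((password + salt).encode("utf-8")).hexdigest()
    return f"{salt}:{digest}"


def verify(password: str, stored: str) -> bool:
    """Pull the salt out of the stored record, re-hash, and compare."""
    salt, _, _ = stored.partition(":")
    return hmac.compare_digest(stored, hash_with_salt(password, salt))


# Two users with the same password end up with unrelated records.
record_a = hash_with_salt("superSafe123")
record_b = hash_with_salt("superSafe123")
print(record_a != record_b)

print(verify("superSafe123", record_a))
print(verify("superSafe124", record_a))
```

Because every record carries its own salt, a table pre-computed for one salt is useless against every other row in the database.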

What Does This Mean For Me?

Software security is a war of attrition: new hacking tactics are invented all the time and security professionals in turn develop countermeasures. The best thing you can do to protect yourself and your company against hackers is to keep your software up to date. Many attacks take advantage of well known flaws in particular versions of software. Most hackers are not discovering brand new vulnerabilities, they are discovering new locations where known-to-be-vulnerable software is already installed. Whether it’s your server framework, operating system, or web browser, always install the latest security updates.

For the standard user of the Internet: do not use easy to guess passwords, use unique passwords for every website you use, and never input sensitive data to websites you cannot trust. You can never trust a website that is not served over HTTPS (look for the little green lock in your URL bar). Start using a password manager, and consider installing the EFF’s HTTPS Everywhere browser extensions for the browser(s) you use.

For website maintainers: if your website is currently available over HTTP and not HTTPS, get an SSL/TLS certificate and take the necessary steps to protect your users by enabling HTTPS. You can get started with Let’s Encrypt, a free certificate authority run by the Internet Security Research Group. Additionally, make sure you never store passwords (or other sensitive data) in a raw, unencrypted format. Salt and hash your users’ passwords, and consider applying this tactic to any sensitive information that you might need to verify, such as social security numbers. You might also consider hiring a private firm to perform a security audit of your application in order to better understand its weak points, thereby giving yourself the opportunity to patch any discovered vulnerabilities.

This article is just the tip of the iceberg in terms of how your data might (or might not) be handled once it enters the open web. Hopefully it was enough to give you a newfound appreciation for just how much work and subtlety goes into both the acts of protecting, and stealing, your information online. For more information, consider checking out the EFF’s Security and privacy sections.