Building Searchable Encrypted Databases with PHP and SQL

Update (2019): We've implemented the design this blog post builds up towards. Read the new blog post about our implementation here.

We get asked the same question a lot (or some remix of it).

This question shows up from time to time in open source encryption libraries' bug trackers. This was one of the "weird problems" covered in my talk at B-Sides Orlando (titled Building Defensible Solutions to Weird Problems), and we've previously dedicated a small section to it in one of our white papers.

The question is, How do we securely encrypt database fields but still use these fields in search queries?

Our secure solution is rather straightforward, but the path between most teams asking that question and discovering our straightforward solution is fraught with peril: bad designs, academic research projects, misleading marketing, and poor threat modeling.

If you're in a hurry, feel free to skip ahead to the solution.

Towards Searchable Encryption

Let's start with a simple scenario (which might be particularly relevant for a lot of local government or health care applications):

You are building a new system that needs to collect social security numbers (SSNs) from its users.

Regulations and common sense both dictate that users' SSNs should be encrypted at rest.

Staff members will need to be able to look up users' accounts, given their SSN.

Let's first explore the flaws with the obvious answers to this problem.

Insecure (or otherwise ill-advised) Answers

Non-randomized Encryption

The most obvious answer to most teams (particularly teams that don't have security or cryptography experts) would be to do something like this:

<?php class InsecureExampleOne { protected $db; protected $key; public function __construct(\PDO $db, string $key = '') { $this->db = $db; $this->key = $key; } public function searchByValue(string $query): array { $stmt = $this->db->prepare('SELECT * FROM table WHERE column = ?'); $stmt->execute([ $this->insecureEncryptDoNotUse($query) ]); return $stmt->fetchAll(\PDO::FETCH_ASSOC); } protected function insecureEncryptDoNotUse(string $plaintext): string { return \bin2hex( \openssl_encrypt( $plaintext, 'aes-128-ecb', $this->key, OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING ) ); } }

In the above snippet, the same plaintext always produces the same ciphertext, when encrypted with the same key. But more concerning with ECB mode is that every 16-byte chunk is encrypted separately, which can have some extremely unfortunate consequences.

Formally, these constructions are not semantically secure: If you encrypt a large message, you will see blocks repeat in the ciphertext.

In order to be secure, encryption must be indistinguishable from random noise to anyone that does not hold the decryption key. Insecure modes include ECB mode and CBC mode with a static (or empty) IV.

You want non-deterministic encryption, which means each message uses a unique nonce or initialization vector that never repeats for a given key.

Experimental Academic Designs

There is a lot of academic research going into such topics as homomorphic, order-revealing, and order-preserving encryption techniques.

As interesting as this work is, the current designs are nowhere near secure enough to use in a production environment.

For example, order-revealing encryption leaks enough data to infer the plaintext.

Homomorphic encryption schemes are often repackaging vulnerabilities (practical chosen-ciphertext attacks) as features.

Unpadded RSA is homomorphic with respect to multiplication. If you multiply a ciphertext by an integer, the plaintext you get will be equal to the original message multiplied by the same integer. There are several possible attacks against unpadded RSA, which is why RSA in the real world uses padding (although often an insecure padding mode).

AES in Counter Mode is homomorphic with respect to XOR. This is why nonce-reuse defeats the confidentiality of your message in CTR mode (and non-NMR stream ciphers in general).



As we've covered in a previous blog post, when it comes to real-world cryptography, confidentiality without integrity is the same as no confidentiality. What happens if an attacker gains access to the database, alters ciphertexts, and studies the behavior of the application upon decryption?

There's potential for ongoing cryptography research to one day produce an innovative encryption design that doesn't undo decades of research into safe cryptography primitives and cryptographic protocol designs. However, we're not there yet, and you don't need to invest into a needlessly complicated research prototype to solve the problem.

Dishonorable Mention: Decrypt Every Row

I don't expect most engineers to arrive at this solution without a trace of sarcasm. The bad idea here is, because you need secure encryption (see below), your only recourse is to query every ciphertext in the database and then iterate through them, decrypting them one-by-one and performing your search operation in the application code.

If you go down this route, you will open your application to denial of service attacks. It will be slow for your legitimate users. This is a cynic's answer, and you can do much better than that, as we'll demonstrate below.

Secure Searchable Encryption Made Easy

Let's start by avoiding all the problems outlined in the insecure/ill-advised section in one fell swoop: All ciphertexts will be the result of an authenticated encryption scheme, preferably with large nonces (generated from a secure random number generator).

With an authenticated encryption scheme, ciphertexts are non-deterministic (same message and key, but different nonce, yields a different ciphertext) and protected by an authentication tag. Some suitable options include: XSalsa20-Poly1305, XChacha20-Poly1305, and (assuming it's not broken before CAESAR concludes) NORX64-4-1. If you're using NaCl or libsodium, you can just use crypto_secretbox here.

Consequently, our ciphertexts are indistinguishable from random noise, and protected against chosen-ciphertext attacks. That's how secure, boring encryption ought to be.

However, this presents an immediate challenge: We can't just encrypt arbitrary messages and query the database for matching ciphertexts. Fortunately, there is a clever workaround.

Important: Threat Model Your Usage of Encryption

Before you begin, make sure that encryption is actually making your data safer. It is important to emphasize that "encrypted storage" isn't the solution to securing a CRUD app that's vulnerable to SQL injection. Solving the actual problem (i.e. preventing the SQL injection) is the only way to go.

If encryption is a suitable security control to implement, this implies that the cryptographic keys used to encrypt/decrypt data are not accessible to the database software. In most cases, it makes sense to keep the application server and database server on separate hardware.

Our Threat Model

The rest of the article will be assuming the following is true:

Your database server and webserver are on separate bare-metal hardware (to side-step VM escapes). Your database server does not know the keys held by the webserver. We're protecting against live attacks on the database server, not a webserver compromise. A webserver compromise is a game over situation. Live attacks vs offline attacks: You can solve a threat model that consist of physical access to a stolen hard drive by simply using full disk encryption, but this does nothing to protect against attacks against a decrypted, online server.

The rest of the threat model here is left intentionally vague. As long as the above assumptions are true, then our solution may be applicable to your threat model.

Implementing Literal Search of Encrypted Data

Possible use-case: Storing social security numbers, but still being able to query them.

In order to store encrypted information and still use the plaintext in SELECT queries, we're going to follow a strategy we call blind indexing. The general idea is to store a keyed hash (e.g. HMAC) of the plaintext in a separate column. It is important that the blind index key be distinct from the encryption key and unknown to the database server.

For very sensitive information, instead of a simple HMAC, you will want to use a key-stretching algorithm (PBKDF2-SHA256, scrypt, Argon2) with the key acting as a static salt, to slow down attempts at enumeration. We aren't worried about offline brute-force attacks in either case, unless an attacker can obtain the key (which must not stored in the database).

So if your table schema looks like this (in PostgreSQL flavor):

CREATE TABLE humans ( humanid BIGSERIAL PRIMARY KEY, first_name TEXT, last_name TEXT, ssn TEXT, /* encrypted */ ssn_bidx TEXT /* blind index */ ); CREATE INDEX ON humans (ssn_bidx);

You would store the encrypted value in humans.ssn . A blind index of the plaintext SSN would go into humans.ssn_bidx . A naive implementation might look like this:

<?php /* This is not production-quality code. * It's optimized for readability and understanding, not security. */ function encryptSSN(string $ssn, string $key): string { $nonce = random_bytes(24); $ciphertext = sodium_crypto_secretbox($ssn, $nonce, $key); return bin2hex($nonce . $ciphertext); } function decryptSSN(string $ciphertext, string $key): string { $decoded = hex2bin($ciphertext); $nonce = mb_substr($decoded, 0, 24, '8bit'); $cipher = mb_substr($decoded, 24, null, '8bit'); return sodium_crypto_secretbox_open($cipher, $nonce, $key); } function getSSNBlindIndex(string $ssn, string $indexKey): string { return bin2hex( sodium_crypto_pwhash( 32, $ssn, $indexKey, SODIUM_CRYPTO_PWHASH_OPSLIMIT_MODERATE, SODIUM_CRYPTO_PWHASH_MEMLIMIT_MODERATE ) ); } function findHumanBySSN(PDO $db, string $ssn, string $indexKey): array { $index = getSSNBlindIndex($ssn, $indexKey); $stmt = $db->prepare('SELECT * FROM humans WHERE ssn_bidx = ?'); $stmt->execute([$index]); return $stmt->fetchAll(PDO::FETCH_ASSOC); }

A more comprehensive proof-of-concept is included in the supplemental material for my B-Sides Orlando 2017 talk. It's released under the Creative Commons CC0 license, which for most people means the same thing as "public domain".

Security Analysis and Limitations

Depending on your exact threat model, this solution leaves two questions that must be answered before it can be adopted:

Is it safe to use, or does it leak data like a sieve? What are the limitations on its usefulness? (This one is sort of answered already.)

Given our example above, assuming your encryption key and your blind index key are separate, both keys are stored in the webserver, and the database server doesn't have any way of obtaining these keys, then any attacker that only compromises the database server (but not the web server) will only be able to learn if several rows share a social security number, but not what the shared SSN is. This duplicate entry leak is necessary in order for indexing to be possible, which in turn allows fast SELECT queries from a user-provided value.

Furthermore, if an attacker is capable of both observing/changing plaintexts as a normal user of the application while observing the blind indices stored in the database, they can leverage this into a chosen-plaintext attack, where they iterate every possible value as a user and then correlate with the resultant blind index value. This is more practical in the HMAC scenario than in the e.g. Argon2 scenario. For high-entropy or low-sensitivity values (not SSNs), the physics of brute force can be on our side.

A much more practical attack for such a criminal would be to substitute values from one row to another then access the application normally, which will reveal the plaintext unless a distinct per-row key was employed (e.g. hash_hmac('sha256', $rowID, $masterKey, true) could even be an effective mitigation here, although others would be preferable). The best defense here is to use an AEAD mode (passing the primary key as additional associated data) so that the ciphertexts are tied to a particular database row. (This will not prevent attackers from deleting data, which is a much bigger challenge.)

Compared to the amount of information leaked by other solutions, most applications' threat models should find this to be an acceptable trade-off. As long as you're using authenticated encryption for encryption, and either HMAC (for blind indexing non-sensitive data) or a password hashing algorithm (for blind indexing sensitive data), it's easy to reason about the security of your application.

However, it does have one very serious limitation: It only works for exact matches. If two strings differ in a meaningless way but will always produce a different cryptographic hash, then searching for one will never yield the other. If you need to do more advanced queries, but still want to keep your decryption keys and plaintext values out of the hands of the database server, we're going to have to get creative.

It is also worth noting that, while HMAC/Argon2 can prevent attackers that do not possess the key from learning the plaintext values of what is stored in the database, it might reveal metadata (e.g. two seemingly-unrelated people share a street address) about the real world.

Implementing Fuzzier Searching for Encrypted Data

Possible use-case: Encrypting peoples' legal names, and being able to search with only partial matches.

Let's build on the previous section, where we built a blind index that allows you to query the database for exact matches.

This time, instead of adding columns to the existing table, we're going to store extra index values into a join table.

CREATE TABLE humans ( humanid BIGSERIAL PRIMARY KEY, first_name TEXT, /* encrypted */ last_name TEXT, /* encrypted */ ssn TEXT, /* encrypted */ ); CREATE TABLE humans_filters ( filterid BIGSERIAL PRIMARY KEY, humanid BIGINT REFERENCES humans (humanid), filter_label TEXT, filter_value TEXT ); /* Creates an index on the pair. If your SQL expert overrules this, feel free to omit it. */ CREATE INDEX ON humans_filters (filter_label, filter_value);

The reason for this change is to normalize our data structures. You can get by with just adding columns to the existing table, but it's likely to get messy.

The next change is that we're going to store a separate, distinct blind index per column for every different kind of query we need (each with its own key). For example:

Need a case-insensitive lookup that ignores whitespace? Store a blind index of preg_replace('/[^a-z]/', '', strtolower($value)) .

Need to query the first letter of their last name? Store a blind index of strtolower(mb_substr($lastName, 0, 1, $locale)) .

Need to match "beings with this letter, ends with that letter"? Store a blind index of strtolower($string[0] . $string[-1]) .

Need to query the first three letters of their last name and the first letter of their first name? You guessed it! Build another index based on partial data.



Every index needs to have a distinct key, and great pains should be taken to prevent blind indices of subsets of the plaintext from leaking real plaintext values to a criminal with a knack for crossword puzzles. Only create indexes for serious business needs, and log access to these parts of your application aggressively.

Trading Memory for Time

Thus far, all of the design propositions have been in favor of allowing developers to write carefully considered SELECT queries, while minimizing the number of times the decryption subroutine is invoked. Generally, that is where the train stops and most peoples' goals have been met.

However, there are situations where a mild performance hit in search queries is acceptable if it means saving a lot of disk space.

The trick here is simple: Truncate your blind indexes to e.g. 16, 32, or 64 bits, and treat them as a Bloom filter:

If the blind indices involved in the query match a given row, the data is probably a match. Your application code will need to perform the decryption for each candidate row and then only serve the actual matches.

If the blind indices involved in the query do not match a given row, then the data is definitely not a match.

It may also be worth converting these values from a string to an integer, if your database server will end up storing it more efficiently.

Conclusion

I hope I've adequately demonstrated that it is not only possible to build a system that uses secure encryption while allowing fast queries (with minimal information leakage against very privileged attackers), but that it's possible to build such a system simply, out of the components provided by modern cryptography libraries with very little glue.

If you're interested in implementing encrypted database storage into your software, we'd love to provide you and your company with our consulting services. Contact us if you're interested.