Indexing encrypted database field for searching

Personal data Security today is also an important issue. Each year, more than a dozen companies, big and small, expose user data.

Image source: percona.com

Encrypt data in the database can minimize the risk of information leakage when data is stolen, but lose the important feature of the DB is searching and organizing. Here is an example for easy to understand, I encrypt the same text AAAA 10 times with the same secret using AES.

With 10 times of encryption, we will get 10 different results. You can run back a few dozen times and see the results are still different. If you want to search, it is impossible to find a string similar to the encrypted string to search. So let’s find out how to index and search in the pile of encrypted data.

Some things you should notice in advance

Secure your app/server first

It is true that the encrypted database will protect your data. But only if someone has your DB and cannot decrypt it. If an attacker gains control of your web/API server then everything is over. So encrypted DB does not mean you can overlook security for your app. Remember not to leave the encryption key with the DB. Leave the DB and web / api app in 2 different places to minimize the risk.

Information leakage

If you want to search, you need to index the data. But the index will reveal information about the data, not really the data itself but often metadata. For example, an index that allows sorting data can reveal the order of the data. The index for searching will reveal information about the text that may be related or similar. This is an inevitable side effect, the problem is which level It is or it depends on how your data is distributed or how important.

Depending on the level of security, you can also index it somewhere else with the DB, or even save in-memory and only build when you start running the app.

Deterministic encryption

The first way that everyone might think about is how to get the same outcome of each time of encryption. Also known as deterministic encryption. This way, you can easily create an encrypted string to search in the DB, of course, only the exact results are searched, not LIKE types. Here are examples of 2 ways to do so with AES.

The common algorithms we use (DES, AES, Blowfish) are all cypher blocks. Meaning that the text to be encrypted will be divided into blocks and encrypt one by one. For example, AES has a block size of 128 bits. That means the AAAAAAAAAAAAAAAAAABBBB segment will be divided into 2 blocks AAAAAAAAAAAAAAAA and BBBB to encrypt. This is how the algorithm will get the same result with the same text as in the above example.

AES has different operating modes. The last mode is called ECB (Electronic Codebook) mode. To get different results with the same text as the first example, AES uses the result of the previous block to combine with the following block (XOR) before encrypting it and so on and so forth.

For the first block, since there is no previous block, we choose a random key called IV (Initialization Vector). Each time encrypts, we choose another IV so the result is random. This mode is called CBC (Cipher Block Chaining) mode. Like the above example, the two ways that I use is to use ECB mode and choose a fixed IV instead of changing each encrypts.

As described above, you can see that with this type of encrypting, each block always is encrypted into the same result. Here is an example:

You may realize that with ECB mode, AAAAAAAAAAAAAAAAAA always ecrypt to cXraKOjg / jsIVEuHTNcaz and BBBBBBBBBBBBBBBB always encrypt to hYSoRAWWIqJHFkMsvlpVg. CBC mode with static IV is better because the outcome of the following block depends on the previous block but the first block will remain unchanged.

If the encrypted information is the information that the user can enter (via web / mobile app, API) then anyone can easily build a codebook to decode all your encrypted data. without the need for an encryption key. Such a type of attack is called a chosen-plaintext attack and the above encryption algorithm cannot protect your data. So this is not necessarily a good way to use.

Use a blind index

When it comes to creating a value that corresponds to a piece of data that produces the same result every time, the first thing you should think of is not deterministic encryption but hash. We will create an index to search by hashing the data and saving it somewhere else (another column). This is called blind indexing.

Of course, it is impossible to use MD5 or SHA-1 then because anyone can easily create a lookup table to get the original data. In a simple case, adding HMAC is enough, so you’ll need the key to generate the right hash to search. This is an example with HMAC-SHA-256.

We still have encrypted data with random results and still have an index to search. If the data is more important, you can use harder (but slower) hash algorithms like PBKDF2, bcrypt or Argon2 for example. Do not save the encryption key with the DB, but must be located somewhere else.

It can be easily seen that this way also leak information because the same data, the index will also be identical. Especially if your data has only a few values (boolean or enum type) then this is quite serious.

Fields with small input domain

Indexing for such a boolean or enum data type is not very secure, but we still have to encrypt if the data is important anyway.

Generally, the data is too little value then you can somehow increase the number, making it look more random. The simplest way is to transform each value into a different value before hash. For example, a boolean data type has two values true and false , it can be converted true into one of the string ture , ture , ture … When searching, do the same and query all possible hash results.

Another way to create more values is to combine that field with another field with more values for random data. This is called the compound blind index. Of course, the searchability will be greatly reduced because you can only search with 1 other field now. So it is best to combine the related fields that we want to search.

For example, I have t_virus_infected column type enum with 2 values yes and no. I will combine it with the country column for example. Then I can search for people who have been infected with T-virus in a certain country. But of course, it is not always possible to combine such convenience.

Introduce some noise

In case of little value, you can follow the above method. If you have more value but still have a lot of duplicates, you can reduce the level of danger a bit by adding a bit of noise by transforming the index before saving (just like transforming data). on). Instead of saving the entire hash as an index, you only save one part, maybe the first few dozen characters, or take each part a few characters in different positions and then join together. For example, there are 2 different strings hashed into:

bb4367f35e06e4da1684f089442ecc93e80a66160894d4149f60aee1b8606870

bb4367f35e06e4da1684f089442ecc931f6d20f1bb3650293402f4926b536741

Instead of saving the entire hash as an index, you get the first 32 characters. Then both strings will have the same index, bb4367f35e06e4da1684f089442ecc93, even if they are not related to each other. Of course doing this will have a case of false positive, so you can filter from the results of the query. But if there are too many false positives, the index doesn’t make much sense. So you have to calculate to make this index suitable. Choose the shorter index string, the higher the percentage of overlapping, but too long is useless because almost no one has the same. You can refer to the appropriate calculation here.

With the above calculation, each value in the index will not be unique for each string anymore, so it is impossible from the same index value to conclude that the data is the same. This type of data structure is called the Bloom filter if you want to learn more.

Fuzzy search

With the previous way, you also find that I only searched for exact results, and LIKE type search is completely not. However, we can still somewhat search fuzzy a bit by transforming the data into another fuzzy searchable form before creating the index. When searching, transform it into the same format as the index to search.

If you want to really search LIKE queries, you can try creating a giant index with n-grams. It will be quite complicated in terms of both implementation and storage. Of course, consider also the index will reveal information.

If you want to encrypt and index the database in the above way, there are several libraries available for you to use.