A team of Microsoft researchers led by Seny Kamara claims to have recovered a substantial amount of data from health records stored in CryptDB (PDF), a database technology that uses layers of encryption to allow users to search through encrypted data without exposing its contents.

CryptDB was originally developed at MIT. It functions as an addition to a standard, unmodified SQL database and is intended to allow applications to interact with encrypted data using Structured Query Language. By using layers of encryption, CryptDB can allow certain properties of data to be revealed to applications processing the queries while keeping the data itself protected. In theory, the encryption prevents the database administrator (or anyone who attacks the database by gaining trusted access) from being able to view the contents of the database. Data from different users is encrypted with different keys.

CryptDB has been used with the open-source MySQL and PostgreSQL databases, and Google uses it to provide an encrypted version of its BigQuery cloud database. SAP and other large database vendors are looking to apply the technology to their own databases as well. And the federally funded MIT Lincoln Laboratory (PDF) has worked with CryptDB as an additional interface to the Apache Accumulo NoSQL database—the same database originally developed by the National Security Agency to store NSA's multi-level security "big data."

While CryptDB protects against a compromise of the database server application itself, with data at rest always being encrypted, it isn't designed to protect against an attack on applications used to access the data. However, it is designed to partially mitigate this kind of attack by limiting a breach to the data accessible with whichever keys are compromised. The SQL server also needs some data to "leak" in order to do its processing, so intercepting queries sent to the server could reveal some data as well, depending on the way queries are structured.

The appeal of this sort of system to anyone in the cloud software business is obvious: CryptDB could allow for greater security of data stored in shared cloud environments. That would allow applications such as electronic medical record systems and other sensitive databases to move to cloud environments without having to rely on expensive, purpose-made database systems.

The Microsoft Research team sought to burst that bubble by going after the weakest link in CryptDB: the Order Preserving Encryption (OPE) and Deterministic Encryption (DET or DTE) schemes. OPE is used to make it possible for SQL queries such as "ORDER BY" to execute. DTE allows databases to be searched for matching values, as described in the original paper by its developers, "by deterministically generating the same ciphertext for the same plaintext. This encryption layer allows the server to perform equality checks, which means it can perform selects with equality predicates, equality joins, GROUP BY, COUNT, DISTINCT, etc." These schemes are the ones most prone to data leakage in CryptDB.
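To make the leakage concrete, here is a minimal Python sketch of what the two schemes reveal to a server. It is not CryptDB's actual construction: `det_token` uses HMAC-SHA256 as a stand-in for deterministic encryption (its output can't be decrypted, so it only illustrates the equality property), and `make_toy_ope` is a hypothetical lookup table rather than a real OPE scheme. The key, column values, and seed are all made up for illustration.

```python
import hashlib
import hmac
import random


def det_token(key, value):
    """Stand-in for DTE: the same key and value always yield the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def make_toy_ope(domain_size, seed):
    """Toy OPE: a secret, strictly increasing table of ciphertext codes."""
    rng = random.Random(seed)
    # Draw distinct codes from a larger range, then sort them so that
    # plaintext i < plaintext j implies code[i] < code[j].
    return sorted(rng.sample(range(10 * domain_size), domain_size))


# Hypothetical key and column contents.
key = b"hypothetical-column-key"
diagnoses = ["flu", "asthma", "flu", "flu", "asthma"]
tokens = [det_token(key, v) for v in diagnoses]

# Equality leaks: the server can tell which rows match row 0
# without learning what the value is.
equal_rows = [i for i, t in enumerate(tokens) if t == tokens[0]]

# Order leaks: sorting the ciphertexts sorts the hidden months too.
ope_table = make_toy_ope(domain_size=12, seed=7)  # admission months 1..12
months = [3, 11, 1, 7]
ciphertexts = [ope_table[m - 1] for m in months]
row_order = sorted(range(len(months)), key=lambda i: ciphertexts[i])

print(equal_rows)  # rows holding the same diagnosis as row 0
print(row_order)   # row indices in ascending order of the hidden month
```

The server never sees "flu" or a month number, but it does see which rows are equal and how rows rank against each other, and that is exactly the information the attacks exploit.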

Kamara, Muhammad Naveed of the University of Illinois at Urbana-Champaign, and Charles Wright of Portland State University pulled out one of the oldest tricks in the cryptanalyst's book in their attack: good ol' frequency analysis. Using a data source similar to the targeted data in content, they were able to analyze the frequency of characters within the text and then match that against the frequency of data within DTE-encrypted columns of data. They also used three further attacks, two of them of their own devising, that build on the same centuries-old idea:

Lp-optimization: is a new family of attacks we introduce that decrypts DTE-encrypted columns. The family is parameterized by the lp norms [an analysis of the expected difference between values] and is based on combinatorial optimization techniques.

Sorting attack: is an attack that decrypts OPE-encrypted columns. This folklore attack is very simple but, as we show, very powerful in practice. It is applicable to columns that are "dense" in the sense that every element of the message space appears in the encrypted column. While this may seem like a relatively strong assumption, we show that it holds for many real-world datasets.

Cumulative attack: is a new attack we introduce that decrypts OPE-encrypted columns. This attack is applicable even to low-density columns and also makes use of combinatorial optimization techniques.
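The two simplest attacks, frequency analysis against a DTE column and the sorting attack against a dense OPE column, can be sketched in a few lines of Python. This is an illustrative reading of the descriptions above, not the researchers' code: the function names, the ciphertext tokens, and every data value are invented, and the auxiliary data is assumed to have the same frequency ranking as the target column.

```python
from collections import Counter


def frequency_attack(det_column, auxiliary_column):
    """Guess a ciphertext-to-plaintext map by matching frequency ranks."""
    ct_ranked = [c for c, _ in Counter(det_column).most_common()]
    pt_ranked = [p for p, _ in Counter(auxiliary_column).most_common()]
    # Most common ciphertext is guessed to be the most common plaintext, etc.
    return dict(zip(ct_ranked, pt_ranked))


def sorting_attack(ope_column, message_space):
    """Decrypt a dense OPE column by lining up the two sorted orders."""
    distinct = sorted(set(ope_column))
    ordered_space = sorted(message_space)
    if len(distinct) != len(ordered_space):
        raise ValueError("column is not dense; the sorting attack does not apply")
    return dict(zip(distinct, ordered_space))


# Frequency analysis demo: made-up severity column, hypothetical tokens.
secret_det = {"minor": "a9f", "moderate": "07c", "major": "e41"}
true_severity = ["minor"] * 5 + ["moderate"] * 3 + ["major"] * 1
det_column = [secret_det[v] for v in true_severity]
# Auxiliary data from a similar, public source (also made up).
auxiliary = ["minor"] * 50 + ["moderate"] * 30 + ["major"] * 20
det_guess = frequency_attack(det_column, auxiliary)
recovered_severity = [det_guess[c] for c in det_column]

# Sorting attack demo: admission months, every month present (dense).
secret_ope = [17, 40, 58, 91, 130, 155, 203, 260, 310, 377, 420, 499]
true_months = [(i * 5) % 12 + 1 for i in range(24)]  # covers 1..12 twice
ope_column = [secret_ope[m - 1] for m in true_months]
ope_guess = sorting_attack(ope_column, range(1, 13))
recovered_months = [ope_guess[c] for c in ope_column]

print(recovered_severity == true_severity)
print(recovered_months == true_months)
```

Neither function touches a key. The frequency attack needs only a plausible guess at the value distribution, and the sorting attack needs only the assumption that every possible value appears at least once, which is why low-cardinality fields such as months or risk categories are the easy targets.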

To test these attacks, the researchers used real patient data from US hospitals pulled from the National Inpatient Sample (NIS) database of the Healthcare Cost and Utilization Project (HCUP), encrypting some of the data using OPE and DTE. Both frequency analysis and the Lp-optimization attack were able to recover "mortality risk and patient death" attributes "for 100 percent of the patients for at least 99 percent of the 200 largest hospitals," as well as 100 percent of disease severity data for 51 percent of the 200 hospitals in the data set. Other data easily obtained included the admission month, mortality risk, and admission type for a majority of the same 200 large hospitals, along with nearly all the same data for 200 small hospitals in the sample in OPE-encrypted columns.

In a rebuttal, former CryptDB developer Raluca Ada Popa told Forbes' Thomas Fox-Brewster that the OPE and DTE encryption schemes were intended for "high entropy" values, where the order of data wouldn't reveal much, rather than for more tightly packed data like percentages of mortality in large sets of patients. "This is how the CryptDB paper says it should be used," she told Fox-Brewster. Users of CryptDB should not be affected by what Kamara's team reported "because they either use the order encryption scheme in a correct way (for the right types of data), or do not use it," Popa said. "Everyone I was in touch with that used CryptDB was careful about the use of OPE."