image obtained from http://www.uniquedatarecovery.in/encryption-device-data-recovery.php

This article is co-authored with Todd Stavish (@toddstavish) of Lab41 and Alexander Titus (@alexandertitus) of B.Next.

Access to data is a persistent pain point across many applications and sectors. This holds true for genetic data and for applications involving the detection, identification, or diagnosis of microorganisms. Determining the identity and/or origin of an unknown biological sample from genetic information requires comparing the unknown sample to known genetic sequences, but a large amount of genetic information is privately held and proprietary, and many entities won’t share their data or make it discoverable for various reasons. Additionally, the querying entity may not want to openly reveal the query sequence due to privacy or security concerns.

You can read more about the genesis of this project in our previous post.

We have begun to explore the development of a new Privacy Preserving Query method for microbial genomic studies that:

obscures both the content of the query and the result set,

performs as well as when run against plaintext data,

uses partial homomorphic encryption via Paillier cryptography, and

uses locality-sensitive hashing (i.e., a storage-efficient, probabilistic data structure) to promote:

linear scaling by hashing the query, and

resistance to data artifacts such as data loss, insertions, deletions, etc.

Many approaches could be taken to preserve the privacy of queries for microbial genomic studies, but we decided to focus on a software-based private information retrieval approach leveraging homomorphic encryption. Homomorphic encryption enables computations to be performed on encrypted data and returns encrypted results that, upon decryption, match the results of the same operations performed on plaintext data. As an alternative, we are also exploring a new-to-the-scene hardware-based approach, Intel Software Guard Extensions (SGX). At the end of this project, we will compare the features, opportunities, and challenges of these approaches, in addition to a trusted third-party approach.
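To make the homomorphic property concrete, here is a minimal, illustrative Paillier sketch in pure Python. It is not the project's code: the function names are hypothetical, the generator is fixed to g = n + 1 for simplicity, and the primes are toy-sized, so it offers no real security. It only shows that multiplying ciphertexts adds the underlying plaintexts, and that raising a ciphertext to a plaintext power scales the underlying plaintext.

```python
import math
import secrets

# Toy primes (both are Mersenne primes). Real keys use primes of 1024+ bits.
P, Q = 2**31 - 1, 2**61 - 1

def keygen(p=P, q=Q):
    """Return (public modulus n, private key (lam, mu)), assuming g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # this shortcut is valid because g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding factor
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

def add_encrypted(n, c1, c2):
    """Ciphertext of (m1 + m2), computed without the private key."""
    return c1 * c2 % (n * n)

def scale_encrypted(n, c, k):
    """Ciphertext of (k * m) for a plaintext scalar k."""
    return pow(c, k, n * n)

n, priv = keygen()
a, b = encrypt(n, 20), encrypt(n, 22)
total = decrypt(n, priv, add_encrypted(n, a, b))
tripled = decrypt(n, priv, scale_encrypted(n, a, 3))
print(total, tripled)  # prints: 42 60
```

Addition and scalar multiplication are the only operations available on ciphertexts here, which is why the scheme is called partially homomorphic.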

To date, we have developed a secure sequence matching algorithm in Python using locality-sensitive hashing and Paillier encryption and have performed tests to confirm the algorithm is functional. We have also begun developing the code for a fully homomorphic approach and are exploring opportunities for an SGX approach.

Creating the Encryption Layer

The scenario being considered for this project is a common secure two-party computation scenario. It goes as follows:

A query holder (Q) wants to share an encrypted locality-sensitive hash (LSH) of a query sequence with a database holder (DB), who will run a command to compare the query to the database and then return the encrypted intersection scores (or, ideally, Intersection over Union (IoU) scores) to the query holder. To do this:

Q must first convert the query to an LSH, then encrypt the LSH.

Q will provide DB with the encrypted LSH, a public key, a hash function, a software executable, and instructions for executing it.

DB will convert the database to LSHs using the hash function and parameter instructions provided by Q.

DB will run the software commands according to the instructions provided by Q to compare the encrypted query with the unencrypted database.

DB will return the encrypted results to Q.

Q will decrypt the results and perform any additional analysis or sorting, as needed.

To enable the implementation of the scenario described above, we are exploring homomorphic encryption (partial using Paillier encryption; full using the SEAL library developed by Microsoft Research) and SGX. Each approach has its own nuances, mostly related to computational performance and level of security. SGX is a hardware-based encryption solution and is expected to be the least computationally expensive of the options explored. However, it has memory limitations, requires extensive expertise (which is rare), has only one technology provider (Intel), and has an immature software ecosystem, all of which make development difficult at this time. Partial homomorphic encryption is the next best option in terms of computational expense, but it has mathematical limitations that affect how the results are reported. Paillier encryption, in particular, has been around for many years and is well respected. Fully homomorphic encryption offers strong security but is the most computationally expensive. The SEAL library is based on the ring variant of Learning With Errors (RLWE), which is a well-studied foundation.

To date…

Partial homomorphic. A secure sequence matching algorithm was developed in Python using LSH and Paillier encryption. We successfully completed initial performance tests using data selected from the AddGene database (our subset included 11,862 entries of ~1K basepairs each) by comparing an encrypted LSH derived from a sequence selected from the AddGene database to all of the unencrypted LSHs derived from our AddGene database subset. The query sequence was converted to an LSH using hashing functions, then encrypted using the Paillier encryption algorithm. The database entries were converted to LSHs using the same hashing functions. The encrypted query LSH was compared to each database entry’s LSH using a shared public key and set of software commands for matching. For each comparison to a database entry, an encrypted intersection score and computed magnitude were produced. The encrypted intersection scores were then returned to the query sender and decrypted using the paired private key, after which an IoU score for each database entry was calculated. This process is detailed in the figure below:

Partial Homomorphic Encryption schematic between a query holder and a database (DB) holder. unencrypted space = blue boxes; encrypted space = green boxes.
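The exchange described above can be sketched end to end in a few lines. This is a hedged illustration rather than the released code: it assumes the query LSH is encrypted bit by bit, reuses a toy Paillier scheme with insecure key sizes, and uses hypothetical function names (`q_prepare`, `db_score`, `q_finish`) and made-up 8-bit vectors in place of real LSHs.

```python
import math
import secrets

def keygen(p=2**31 - 1, q=2**61 - 1):  # toy primes, far too small for real use
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # public n, private (lam, mu); g = n + 1

def encrypt(n, m):
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

def q_prepare(n, query_bits):
    """Query holder: encrypt every bit of the query LSH."""
    return [encrypt(n, bit) for bit in query_bits]

def db_score(n, enc_query, entry_bits):
    """DB holder: sees only ciphertexts and the public modulus n.

    Multiplying the ciphertexts at positions where the entry has a 1
    yields an encryption of sum(q_i * d_i), the intersection size.
    """
    n2 = n * n
    enc_inter = encrypt(n, 0)  # encrypted zero as the running total
    for c, bit in zip(enc_query, entry_bits):
        if bit:
            enc_inter = enc_inter * c % n2
    return enc_inter, sum(entry_bits)  # encrypted intersection, plaintext magnitude

def q_finish(n, priv, query_bits, enc_inter, entry_magnitude):
    """Query holder: decrypt the intersection and compute IoU locally."""
    inter = decrypt(n, priv, enc_inter)
    union = sum(query_bits) + entry_magnitude - inter
    return inter / union if union else 0.0

# Hypothetical stand-ins for LSH vectors.
query = [1, 0, 1, 1, 0, 0, 1, 0]
database = {
    "entry_a": [1, 0, 1, 1, 0, 0, 1, 0],  # identical to the query
    "entry_b": [0, 1, 0, 0, 1, 1, 0, 1],  # no overlap
    "entry_c": [1, 0, 0, 1, 0, 1, 0, 0],  # partial overlap
}

n, priv = keygen()
enc_query = q_prepare(n, query)
scores = {
    name: q_finish(n, priv, query, *db_score(n, enc_query, bits))
    for name, bits in database.items()
}
print(scores)  # {'entry_a': 1.0, 'entry_b': 0.0, 'entry_c': 0.4}
```

In this sketch the division that turns an intersection into an IoU score happens only after decryption on the query holder's side, since Paillier ciphertexts support addition but not division.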

We have benchmarked our initial tests against the widely used Basic Local Alignment Search Tool (BLAST) algorithm hosted by the National Center for Biotechnology Information (NCBI). We set up four query sequences to compare against our AddGene database set using the secure matching algorithm and, separately, BLAST. These four queries consisted of a sequence identical to one in the database, a sequence with no match to any entry in the database, and two sequences with partial matches to entries in the database. The results (shown in the table below) illustrate that the IoU scores were consistent with the BLAST results. For the perfect match (Query 3), BLAST returned 100% alignment and the ‘Best IoU’ score returned was 1.0, indicating a perfect match. For the perfect mismatch (Query 4), BLAST returned ‘no significant similarity found’, meaning no alignment was found. The ‘Best IoU’ score of 0.09 agrees with this, since the IoU noise ceiling is expected to be around 10–15% (the exact number will be determined during the next phase of tuning and testing), and anything below the noise ceiling would be deemed insignificant. The partial matches (Queries 1 and 2) returned ‘Best IoU’ scores indicating 21% and 38% alignment, and the BLAST results were consistent, returning 50–100 matches based on best local alignment.

These IoU scores are somewhat low, but they still suggest a match of significant value since they are above the suspected noise ceiling. However, from these numbers we cannot determine whether the score comes from a perfect match of a subsequence within a database entry or from several sections/basepairs along the query sequence matching portions of a database entry. This is one downside of using IoU as a metric (more on this point later). Our initial tests show the algorithm is fully functional. The next steps are tuning, general performance testing, and evaluating the effects of database size.

Benchmark test comparing IoU scores to unencrypted BLAST results. Length = size of the query sequence in bases; Best IoU = highest IoU score returned for that query for the secure matching algorithm; size of DB searched = number of entries in the database being compared against for both the secure matching algorithm and BLAST; BLAST results = results returned from performing a BLAST computation using the same query and database entries as used for the secure matching algorithm.

Fully Homomorphic. One downside of partial homomorphic encryption is that it is limited to linear functions. Should we need to perform non-linear functions on encrypted data, such as calculating all encrypted IoU scores before returning results or returning only the maximum IoU score, we would need a different approach. So, we decided to try a fully homomorphic approach using the SEAL library. This library was much more challenging to work with because documentation is sparse and the development environment targets Windows, making it difficult to compile on other systems. Even so, we were able to compile it on Linux. Matching tests have not yet been completed using this approach, but they will be one of the next steps in this project.
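The linearity limitation can be written out explicitly. In the sketch below, q and d denote the query and database LSH bit vectors, n is the Paillier public modulus, and I(q, d) is the intersection size; this notation is ours, introduced only for illustration.

```latex
\begin{align}
\mathrm{Dec}\big(\mathrm{Enc}(a)\cdot\mathrm{Enc}(b) \bmod n^2\big) &= a + b \pmod{n}, \\
\mathrm{Dec}\big(\mathrm{Enc}(a)^{k} \bmod n^2\big) &= k\,a \pmod{n}, \\
I(q,d) &= \sum_i q_i\, d_i
  \quad \text{(linear in the encrypted } q_i \text{ for plaintext } d_i\text{)}, \\
\mathrm{IoU}(q,d) &= \frac{I(q,d)}{\lvert q\rvert + \lvert d\rvert - I(q,d)}
  \quad \text{(a ratio of encrypted quantities, hence not linear)}.
\end{align}
```

The first three lines are within reach of an additively homomorphic scheme. The division in the last line, like taking a maximum over encrypted scores, is not, which is why those steps either happen after decryption or call for a fully homomorphic scheme.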

Software Guard Extensions (SGX). SGX provides a secure enclave for computation, created by a modern Intel chip, that enables native function calls, but it must also be supported by the motherboard. The Graphene library, which enables Python code to operate within the secure enclave, was integrated into a laptop that has an SGX-capable Intel chip and an SGX-supported motherboard. The LSH/IoU Python script was placed in the secure enclave and is ready to be tested with the matching algorithm. The security provisions have not been validated, and validating them would require a significant amount of work.

In our experience, development has been difficult so far. The SGX code was buggy and ended up breaking two machines. There has been limited development around the technology, and documentation is minimal. However, the author of the code appears to be actively improving it; several updates have been made since we started our project. Our experience has been consistent with that of previous new-to-the-scene hardware, such as GPUs. This is a “hot” hardware option and most certainly has promise as a powerful encryption capability. We are currently assessing the best path forward for evaluating this approach.

All of the code we developed mentioned above can be found in our GitHub repository BNext-IQT. We have also issued a release for the code that generated the results in this post. Note: This code is actively under development and not ready for production.

Why Locality-Sensitive Hash?

Basic Local Alignment Search Tool (BLAST) is a common algorithm used by the biological community for genetic sequence comparison, but its computational expense makes it less attractive for this project. An alternative approach that has recently gained traction in the bioinformatics community is hashing. A locality-sensitive hash, in essence, compresses a sequence into a binary vector of ones and zeros using a hashing function, and forces the data into a preset vector size. This approach is attractive for an encrypted search because it is much simpler computationally: sequences are broken into small k-mers, which are then hashed. Additionally, LSHs can handle sequences of varying lengths, are robust to insertions and deletions, add little transfer and computational overhead, and provide some degree of data privacy during transit and computation. They do, however, require some information to be shared between parties, because the vector size must be set based on the length of the longest of the sequences being scored.
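One simple way to obtain a fixed-size binary vector from k-mers is to hash each k-mer to a position and set that bit, much like a Bloom filter with a single hash function. The sketch below assumes this encoding; the function names, the choice of SHA-256, and the parameters `k=8` and `size=256` are illustrative assumptions, and the hash functions actually used in the project may differ.

```python
import hashlib

def kmers(sequence, k):
    """Return the set of overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def kmer_hash_vector(sequence, k=8, size=256):
    """Map a sequence of any length to a binary vector of fixed size.

    Both parties must agree on k, size, and the hash function so that
    query and database vectors are comparable position by position.
    """
    bits = [0] * size
    for kmer in kmers(sequence.upper(), k):
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits

base = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACGGTCATGCAAGTCCGATAGCTAGGA"
variant = base[:30] + "T" + base[30:]  # same sequence with one inserted base

v_base = kmer_hash_vector(base)
v_variant = kmer_hash_vector(variant)
v_short = kmer_hash_vector("ATGGCGTACGTT")

shared = sum(1 for a, b in zip(v_base, v_variant) if a and b)
lost = sum(1 for a, b in zip(v_base, v_variant) if a and not b)
print(len(v_base), len(v_short), shared, lost)
```

A single inserted base changes only the k-mers that span the insertion point, so at most k - 1 bits of the original vector can be lost. That is the sense in which this kind of encoding tolerates insertions and deletions.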

Why IoU?

Common practice for encrypted searches is to return yes/no answers, but because we need to return an indication of similarity, we chose Intersection over Union (IoU) to compare the LSHs. IoU performs an overlap search, where the intersection is the number of positions set in both vectors and the union is the number of positions set in at least one of the two vectors; the score is the ratio of the two. Every comparison will produce an IoU score, so an IoU threshold (a.k.a. noise ceiling) will need to be identified to determine which scores are significant. In practice, a significant result would indicate that there is value in gaining access to the database, or to elements within it, for further exploration.
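As a plaintext illustration of the metric, the following sketch computes IoU for two equal-length bit vectors and applies a noise ceiling. The function names are hypothetical, and the default ceiling of 0.15 is only a placeholder taken from the upper end of the 10–15% range mentioned earlier; the real threshold is still to be determined.

```python
def iou(a, b):
    """Intersection over Union of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / union if union else 0.0

def is_significant(score, noise_ceiling=0.15):
    """Treat scores at or below the (placeholder) noise ceiling as noise."""
    return score > noise_ceiling

query = [1, 1, 0, 1, 0, 0, 1, 0]
entry = [1, 0, 0, 1, 0, 1, 1, 0]
score = iou(query, entry)  # 3 shared positions out of 5 set in either vector
print(score, is_significant(score))  # prints: 0.6 True

# The 'Best IoU' values reported in the benchmark, run through the same check.
for best_iou in (0.21, 0.38, 1.0, 0.09):
    print(best_iou, is_significant(best_iou))
```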

One consideration when using IoU as an evaluation metric is the effect of comparing entries of different lengths. LSHs and IoU offer some flexibility here, but when the query and the database entry differ greatly in length (hundreds of bases compared to hundreds of thousands of bases, such as a gene compared to a whole genome), the scores may be substantially lower than the results of a comparable BLAST computation. This could mislead the interpretation of how significant a match is. As we continue with this project, we will determine how serious this misinterpretation may be and also consider alternative evaluation metrics or a modified IoU approach. Two possible solutions are to develop versions of the algorithm for different sequence types and lengths (e.g. genes, proteins, whole genomes, etc.) or to replace IoU with IoLengthOfQuery, so as to return the proportion of the query that matched a target in the database.
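The length effect, and one possible reading of the IoLengthOfQuery idea, can be illustrated with toy vectors. In this sketch the denominator is the number of set bits in the query vector; that interpretation, the function names, and the vectors themselves are assumptions made for illustration, not the project's definition.

```python
def intersection_size(a, b):
    return sum(1 for x, y in zip(a, b) if x and y)

def iou(a, b):
    """Intersection over Union of two equal-length binary vectors."""
    inter = intersection_size(a, b)
    union = sum(a) + sum(b) - inter
    return inter / union if union else 0.0

def intersection_over_query(query_bits, entry_bits):
    """Fraction of the query's set bits that are also set in the entry."""
    query_magnitude = sum(query_bits)
    if query_magnitude == 0:
        return 0.0
    return intersection_size(query_bits, entry_bits) / query_magnitude

# A short query whose bits are fully contained in a much denser entry,
# standing in for a gene compared against a whole genome.
size = 64
entry = [1] * 40 + [0] * (size - 40)
query = [1] * 4 + [0] * (size - 4)

print(iou(query, entry))                      # 0.1
print(intersection_over_query(query, entry))  # 1.0
```

Here the query is completely covered by the entry, yet its IoU falls below the suspected noise ceiling, while the query-normalized score reports full coverage.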

What have we learned so far?

This is promising! Our early results provide a first glimpse of a possible solution to enable rapid queries between a DNA sequence of interest and a privately held database while still protecting the query and database information. Other studies applying encryption techniques to genetic data have started to emerge (Erlich and Narayanan, 2014; Ziegeldorf et al., 2017), such as a Stanford study on performing diagnostic disease queries while maintaining the privacy of the patient’s genome. While there are multiple approaches that could be taken, our study is the first that we know of that combines LSHs, IoU, and homomorphic encryption for use with microbial genomic data. Moving forward, we will continue to refine the homomorphic encryption algorithms, tune parameters to optimize performance, and evaluate the completeness and soundness of the protocol to determine how feasible and effective these approaches really are.

Stay tuned for more results!