Abstract Genomic data are becoming increasingly valuable as we develop methods to utilize the information at scale and gain a greater understanding of how genetic information relates to biological function. Advances in synthetic biology and the decreased cost of sequencing are increasing the amount of privately held genomic data. As the quantity and value of private genomic data grows, so does the incentive to acquire and protect such data, which creates a need to store and process these data securely. We present an algorithm for the Secure Interrogation of Genomic DataBases (SIG-DB). The SIG-DB algorithm enables databases of genomic sequences to be searched with an encrypted query sequence without revealing the query sequence to the Database Owner or any of the database sequences to the Querier. SIG-DB is the first application of its kind to take advantage of locality-sensitive hashing and homomorphic encryption to allow generalized sequence-to-sequence comparisons of genomic data.

Author summary Genomic data are becoming increasingly valuable as we develop methods to utilize the information at scale and gain a greater understanding of how genetic information relates to biological function. Advances in synthetic biology and the decreased cost of sequencing are increasing the amount of privately held genomic data. As the quantity and value of private genomic data grows, so does the incentive to acquire and protect such data, which creates a need to store and process these data securely. We present an algorithm for the Secure Interrogation of Genomic DataBases (SIG-DB). The SIG-DB algorithm enables databases of genomic sequences to be searched with an encrypted query sequence without revealing the query sequence to the Database Owner or any of the database sequences to the Querier. SIG-DB is the first application of its kind to take advantage of locality-sensitive hashing and homomorphic encryption to allow generalized sequence-to-sequence comparisons of genomic data.

Citation: Titus AJ, Flower A, Hagerty P, Gamble P, Lewis C, Stavish T, et al. (2018) SIG-DB: Leveraging homomorphic encryption to securely interrogate privately held genomic databases. PLoS Comput Biol 14(9): e1006454. https://doi.org/10.1371/journal.pcbi.1006454 Editor: Scott Markel, Dassault Systemes BIOVIA, UNITED STATES Received: April 24, 2018; Accepted: August 22, 2018; Published: September 4, 2018 Copyright: © 2018 Titus et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: We accessed and downloaded available bacterial genome data in FASTA format from the National Center for Biotechnology Information (NCBI) Assembly RefSeq(14) database (DB), resulting in a total data set of 75,958 samples. The data set contained 244 gigabytes of complete genome, chromosome, scaffold, and contig sequences for all available bacterial species. For algorithm testing, we used E. coli (taxid:562) genomes (n = 7,078) and S. aureus (taxid:1280) genomes (n = 50 genomes) as well as the full data set of all available species genomes (n = 75,958). All code and documentation is available on the B.Next GitHub page, located at https://github.com/BNext-IQT/GEMstone. All sequence data for E. coli (taxid:562) and S. aureus (taxid:1280) were obtained from https://www.ncbi.nlm.nih.gov/assembly. We accessed and downloaded all sequence data used in this study on October 1st, 2017. Funding: The authors received no specific funding for this work Competing interests: All authors certify there are no competing financial interests related to this work. All authors are paid employees of the non-profit firm, In-Q-Tel.

This is a PLOS Computational Biology Methods paper.

Introduction Genomic information is becoming valuable to a variety of applications, including healthcare and industry. Thus, biological data privacy and security is becoming increasingly important, especially as our understanding of how genetic information relates to biological function continues to grow. The genomic sequence of an individual not only provides information of identity, but can reveal family lineage, physical traits, and disease susceptibility. However, there is inherent tension between the Health Insurance Portability and Accountability Act (HIPAA) requirements for patient data privacy, and scientific and/or public health utility of genomic data. For example, researchers who want to study the genetics of disease may wish to compare patient genome databases in a way that allows data mining while complying with HIPAA obligations. Microbial genomic data is also increasing in quantity and value as microbes are being utilized for industrial, agricultural, and medical purposes. Advances in synthetic biology are creating valuable opportunities in healthcare and industry, where the market in 2016 was $3.9B, and is projected to grow to $11.4B by 2021[1]. In particular, synthetic biology is transforming industrial practices, where microorganisms are exploited to produce commercial products, resulting in customized microbes and unique genomic data. With that comes fierce corporate competition, and the genomic information becomes proprietary. Increasingly, the impact of the human microbiome on health is also an active area of research[2,3]. As we expand our understanding of the microbiome, our personal microbial fingerprint may become part of the data stored in electronic health records[4]. Therefore, we may arrive at a consensus that a person’s microbiome is as important to protect as a person’s genome. Additionally, genomic information has become an important component of the identification, characterization, and attribution of an unknown microbiological sample, such as an outbreak of a novel infectious agent. To execute the appropriate bioinformatic analyses, known genomic data is required to compare against the sample’s genomic data. However, the growing amount of privately held and proprietary genomic information, and the lack of secure computation techniques for sequence-to-sequence comparisons on that data, could hinder the timely genomic analysis of the novel microorganism. Data in storage can be protected by hardware- or software-based encryption, but sensitive data is vulnerable during processing. Therefore, there is a need for methods, including search and comparison methods, that permit the processing of data while preserving security. Several approaches have been published that demonstrate software-based, secure, multi-party computation for biomedical applications, including Yao’s protocol[5], oblivious transfer[6], and many others[7]. Because human genomes are highly conserved, however, much of the existing work is on methods for assessing genomic variation at the SNP[8,9] or causal variant[5] level. Homomorphic Encryption (HE)[8–10] is a class of methods with properties[11,12] that make sequence-to-sequence comparisons, such as searching polynucleotide sequence databases, amenable to encryption. HE differs from other encryption methods by allowing operations directly on encrypted data without access to a private key. The result remains encrypted and can be revealed to the owner at a later point with their private key. There are two classes of HE: fully HE and partial HE. Fully HE systems[11] allow an arbitrary set of operations (e.g., addition, subtraction, division, and multiplication) to be performed on ciphertext, but currently come at an operationally prohibitive computational expense for most applications. Partial HE systems allow a much smaller subset of operations on ciphertext, with each encryption system developed to support a specific set of operations. The Paillier homomorphic encryption system (PHE)[12] is an additive (+) HE cryptosystem. PHE algorithms can be orders of magnitude faster than fully HE cryptosystems[11,13], and thus are promising for immediate operational applications. To date, there is no published work we are aware of that leverages any version of HE for polynucleotide sequence-to-sequence comparisons. We have developed an algorithm, Secure Interrogation of Genomic DataBases (SIG-DB), which is designed to enable a two-party exchange. In this exchange, a Querier encrypts a genetic sequence of interest and passes it to a Database Owner. The Database Owner then compares the encrypted sequence to each item in the database and calculates an encrypted similarity score. After comparisons, the encrypted similarity scores are passed back to the Querier, decrypted with the Querier’s private key, and interpreted. In this exchange, the query and each element in the database are kept secret while allowing sequence-to-sequence comparisons. We demonstrate the SIG-DB algorithm’s performance on microbial genomic data obtained from the National Center for Biotechnology Information (NCBI) Assembly database[14], but see broader applications across any sequence-to-sequence comparison.

Discussion We developed the secure interrogation of genomic databases (SIG-DB) algorithm as a proof-of-concept for a suitable, technologically feasible approach to compare a genomic sequence of interest to a privately-held database and produce an indication of similarity between the sequence of interest and each database entry. This approach could enable bioinformatics practitioners or investigators to leverage privately held information in a secure way–a capability that is non-existent today for genomic sequence comparisons. This tool could expedite the timeliness of queries and gain meaningful results in critical timeframes for infectious disease event response decisions. This is important, as many bioinformatics practitioners are not security experts, and thus are less likely to develop applications in this space. The overall security of an encrypted data manipulation algorithm is fundamental to its success. The SIG-DB algorithm leverages established homomorphic encryption schemes to ensure that no outside (non-participating) party can learn the details of specific queries; only information about the query request patterns can be learned (We did not conduct a formal security analysis of homomorphic encryption, as it has been done previously[11–13]). It also ensures that no outside party can view the database records directly. Only metadata is exchanged in the form of encrypted LSH magnitudes, and thus individual data entries are never exposed outside the owners’ own security systems. From the metadata returned, the Querier will learn the number of entries in the database. When combined with the DB LSH magnitudes and the similarity scores returned, the Querier could extrapolate the diversity within the database. Additionally, if the Querier repeated SIG-DB multiple times against the same database using the same query sequence, the Querier could learn about any database modifications that may have occurred. This risk is limited, however, since the execution is controlled by the Database Owner, who retains the right to refuse repeat executions. A fully homomorphic encryption (FHE) approach, such as Microsoft SEAL, reduces the amount of information leaked by not needing to reveal the magnitudes of Database LSHs (unlike PHE), and therefore does not reveal the length of the database entries. This is because FHE[11] allows for division and multiplication, allowing the algorithm to calculate the full similarity score (while keeping it encrypted) before passing the information out of the database. However, FHE comes with an increased computational cost (an FHE implementation of SIG-DB required 2.25 times as long to execute as the PHE implementation), which currently limits its operational utility. We developed this tool with the initial intention of it being used for microbial genomic sequence comparisons, especially when a user needs to quickly identify a source of information relevant to the query sequence, such as an infectious disease outbreak or intentional release of a bioweapons agent. However, there are a number of applications that it could be used for including human genomics, healthcare, organizational collaborations, and more. We demonstrated here that the algorithm is particularly robust to varying sizes of genomes. The algorithm is also robust with regards to mutations. Any change to the hashed k-mer (e.g. SNP, INDEL) will alter the position, hence the threshold for similarity scores should be tuned to the use case. Large rearrangements (e.g. large segment translocations) will largely keep k-mers intact and therefore would not alter their hash position. The current implementation reports three scoring metrics: IoU, IoD, and IoQ. These were included to account for potential variations in sequence size and locations of mutations. Additionally, certain species may have unique genomic elements, such as repeat elements common in plant and human genomes. Given the nature of the LSH data structures, which forces identical k-mer sequences to hash into the same data location, we believe SIG-DB will be robust to those unique genomic elements. This also means that the number of repeats may not be useful as distinguishing factor between sequences in the present algorithm since each identical k-mer would map to the same hash location in the LSH. This is an inherent limitation of the hashing nature of the algorithm which would need to be further characterized before extension into additional organism classes. This would need to be a consideration when planning the operation for certain applications. SIG-DB should be tested using plant and human genomic sequences to confirm the utility for those data sources and optimize for performance. SIG-DB can be optimized for a particular application by modifying the parameters (k-mer size, LSH size, hash collision rate) based on the user’s tolerance for mutations, length of database sequences, and computational resources. The size of the k-mer should be chosen to account for the degree of variability that is captured. Our results indicate that a smaller k-mer is robust to mutations and that tolerance is reduced by increasing the size of the k-mer. The operator will need to decide how much variation should be tolerated and establish their k-mer size accordingly. The LSH size should be tuned to balance the trade-off between computational burden and hash collision. The primary consideration when tuning the size of the LSH for the query is the amount of unique information in the data, or unique k-mers to be exact. For example, an 8-mer of DNA base elements has a maximum of 65,536 unique 8-mers (4^8). This calculation is only dependent on the size of the k-mer and not the species or genomic content; therefore, all species that utilize the four bases of DNA have the same set of unique 8-mers possible. The theoretical optimal LSH has a unique entry for each unique k-mer. In practicality, however, tuning the size of the LSH to the maximum possible 8-mers for the sequences being compared could reduce the size of the LSH needed and therefore reduce the computational run-time. For example, a sequence of 20,000 bp has a maximum of 19,992 unique 8-mers (n-k). To maintain the optimal 5:1 ratio for LSH:unique k-mers, this means the LSH size would only need to be 100,000 instead of 327,680. Computational resources should be considered and optimized for the given situation. The process of sequence comparisons and decryption will be the most computationally burdensome given the operations over every database entry. This should be considered by the query holder when establishing computational infrastructure. Of note, the SIG-DB prototype was developed in Python, and thus runs with the efficiency of a Python program. Extensions of this work should include implementing the SIG-DB algorithm in C++ to allow for faster runtime. The algorithm runtime, however, can only be improved to a certain limit with programming language and runtime environment optimization, as the algorithm itself has an inherent runtime complexity that is independent of language or environment. Overall, the SIG-DB algorithm provides a method for a secure multi-party exchange of information, meaning an exchange of information between a data holder and one or more data queriers. The outcome of this protocol is a similarity score, which provides an indication of the relative similarity between a query and the entries within the interrogated database. Should the user determine there is adequate similarity, the next steps would be based on his needs and processes. We see use cases within organizations where parties are willing and open to participate in the exchange, but where security of information even within the organization is paramount, such as in the case of HIPAA data in healthcare. The potential versatility of this protocol creates an even greater opportunity for impact within the broader scientific community.

Methods Data We accessed and downloaded available bacterial genome data in FASTA format from the National Center for Biotechnology Information (NCBI) Assembly RefSeq[14] database (DB), resulting in a total dataset of 75,958 samples. The dataset contained 244 gigabytes of complete genome, chromosome, scaffold, and contig sequences for all available bacterial species. For algorithm testing, we used E. coli (n = 7,078 genomes) and S. aureus genomes (n = 50 genomes) as well as the full dataset of all available species genomes (n = 75,958). Query preparation To build the LSH with the appropriate false positive rate and minimize computational burden, the maximum length of a sequence in the database must be obtained from the Database Owner to serve as a proxy for the number of unique k-mers possible. From this implementation, the LSH is constructed to have a 5:1 ratio of available space to filled locations. For testing purposes, sequences with maximum length of 20,000 base pairs were used, thus LSHs were initialized to have 100,000 available hash locations. For sequences longer than 20,000 base pairs, the first 20,000 bases were used, and sequences shorter than 20,000 base pairs were used in their entirety. Sequence k-mers were hashed into the LSH using the Python hash function. Homomorphic encryption was implemented with the Python package phe[19], implementing the Paillier additive homomorphic encryption system (PHE)[12]. Under this system, a public key/private key pair is generated and each element of an LSH is encrypted using the public key. The Paillier cryptosystem is based on integer factorization, and as such our default key size (DEFAULT_SIZE = 3072) is chosen to give a minimum of 128 bits of security. Database searching An LSH is created for each database entry using the LSH Constructor, provided by the Querier. The Database Owner executes the comparison between each DB entry LSH and the encrypted Query LSH using the comparison executable provided by the Querier. For each comparison, an encrypted intersection score is calculated by the sum of all the encrypted Query LSH hash locations corresponding to a filled hash entry in the DB entry’s LSH. The magnitude of the DB entry LSH is calculated as the sum of hashes in the LSH. The pair {encrypted intersection score, DB entry LSH magnitude} for each entry in the database are then returned to the Querier for evaluation. Similarity metric calculations To assess the similarity between the query and the entries in the database, the returned encrypted intersection scores are decrypted using the private key. The intersection over union (IoU) score is calculated using the unencrypted intersection score, the query LSH magnitude, and the DB entry LSH magnitude: Additionally, to account for scenarios with a query sequence shorter than the database sequences, the intersection over length of query (IoQ) is calculated using the unencrypted intersection score and the query LSH magnitude: Finally, to account for scenarios with a query sequence longer than the database sequences, the intersection over DB entry LSH magnitude (IoD) is calculated using the unencrypted intersection score and the DB entry LSH magnitude: From the IoU, IoQ, and IoD, the Querier can evaluate the highest similarity and the overall similarity of database entries to the query sequence. SIG-DB algorithm The SIG-DB algorithm is a two-party exchange for sequence-to-sequence comparisons between an encrypted query and elements of a database. The steps of the algorithm are as follows: Querier: From Database Owner, determine largest sequence length in database Querier: Hash query sequence k-mers into a locality-sensitive hash (LSH) with a specified acceptable false positive rate based on largest entry sequence length Querier: Generate a public/private key pair using the Paillier cryptosystem Querier: Encrypt each element in the query LSH using the Paillier public key Querier: Pass the encrypted query LSH, the LSH constructor with hash function, and a processing script to the Database Owner Database Owner: Hash each database sequence entry into an LSH of matching dimension to the query LSH using the original hash function, provided by the Querier. Database Owner: For every database entry LSH: Calculate encrypted intersection score by the sum of all elements in the encrypted query LSH that correspond to a filled entry in the DB LSH Calculate the magnitude of the DB LSH Database Owner: return all pairs (DB LSH magnitude and encrypted intersection score) to the Querier Querier: decrypt the encrypted intersection scores using the Paillier private key Querier: For each query-database entry pair, calculate the intersection over union (IoU), intersection over query magnitude (IoQ), and intersection over DB LSH magnitude (IoD). SIG-DB algorithm testing Our algorithm testing was conducted on an NVIDIA Titan X GPU with 64 GB of memory. The initial implementation of SIG-DB uses 48 parallel processes to encrypt the LSH and generate the reported similarity metrics. Each step in the algorithm is asynchronous and may be parallelized independently. We tested the algorithm using the data downloaded from the NCBI Assembly database. We used an LSH with 500–100,000 available locations for hashing relative to standardized sequence lengths of 100–20,000 base pairs (always in 5:1 ratio, respectively). These lengths were chosen to represent a range of sequences from raw sequence read to small genome sizes. E. coli genomes range from between 4,500,000–6,000,000 bases, but for efficiency, we used a reduced fraction of the full genome as the first n base pairs of the sequence (seq[1:n]). The algorithm scales to encompass full genome size sequences, but appropriate adjustments to LSH size must be made. In silico sequence mutations We introduced mutations into wild-type query sequences randomly across queries of various lengths. These mutations were intended to represent point mutations in the sequence as well as loss of data quality due to sequencing errors. The point mutations were conducted by changing the selected base pair to either one of the remaining three base pairs, or to an ‘X’ representing a data error. These mutations were introduced at random distributions across the sequence, ranging from no mutations (0%) to full sequence mutations (100%). Correctness A dataset was created from two E. coli genomes (K-12 MG1655 and O157:H7, taken from the RefSeq database). 500 non-overlapping database entries were created by sampling at random from the full genomes for each of three lengths: 1000, 2000, and 3000 bp. From each database entry, a 1000 bp query sequence was selected at random. (Queries taken from a database entry of a given length were used only in searches against the database set from which they were generated.) In order to have sufficient coverage of the desired genetic diversity, the selected query sequences were duplicated twice and the duplicates randomly mutated, where each base pair in the two duplicates had a 5% or 10% chance (respectively) of being switched to an alternate base. In total, this produced 1,500 queries for each length-set of database entries: 500 wild-type, 500 samples with 5% mutation, and 500 samples with 10% mutation. ANI and SIG-DB similarity scores were computed for each database entry-query pair within a given length-set. For the SIG-DB score, the maximum of the IoU, IoQ, or IoD was taken to account for sequence:query length differences. A Receiver Operator Characteristic (ROC) curve was generated to determine the tradeoff between true positive rate (TPR) and false positive rate (FPR). Multiple classification thresholds were defined for the SIG-DB similarity scores such that all sequences with similarity ≥ threshold were considered “relevant”. The thresholds were set in an increasing manner, starting with Similarity = 0 representing a threshold that counted all sequences as “relevant”. Each subsequence threshold was then set as the next highest similarity score in the result set. For example, if five sequences had similarity scores of [0.2, 0.33, 0.33, 0.75, 0.87] then each of the unique scores would be set as a classification threshold, resulting in the following counts of the number of “relevant” sequences at each threshold: [5, 4, 2, 1]. For each new threshold, the TPR and FPR were calculated, then plotted as a ROC curve. Practicality To assess the practicality of using SIG-DB in operational applications, we performed a series of tests to understand the amount of time required to run the full operation on a single CPU core, the time savings that could be achieved using a parallelized approach, and the breakout of time requirements by step in the protocol (encryption time, query time, and scoring time). All bacterial sequences in the RefSeq database were used to perform these tests. The database records have a minimum length of 337 bases, a maximum length of 12,106,419 bases, and an average length of 2,756 bases. Heat maps were generated from test runs that evaluate 3 different variables: (1) number of cores used for computation (1–48 cores), (2) query length (100–20,000 bases), and (3) number of entries in the database (100, 1000, 5000, and 10000 FASTA files; or 10,145 sequences, 89,992 sequences, 425,772 sequences, and 860,984 sequences, respectively). Code/Data availability All code and documentation is available on the B.Next GitHub page, located at https://github.com/BNext-IQT/GEMstone. All sequence data for E. coli (taxid:562) and S. aureus (taxid:1280) were obtained from https://www.ncbi.nlm.nih.gov/assembly. We accessed and downloaded all sequence data used in this study on October 1st, 2017.

Supporting information S1 Table. ANI Data statistics describing the distribution of the ANI scores for each database element length dataset. https://doi.org/10.1371/journal.pcbi.1006454.s001 (DOCX) S1 Fig. Similarity scores (IoU, IoQ, and IoD) for k-mer = {8} with random sequence mutation localized to one half of the sequence at rates from 0–100%. The red circle represented the largest mutation rate that correctly returned the sequence of interest as the highest IoU score. https://doi.org/10.1371/journal.pcbi.1006454.s002 (DOCX) S2 Fig. Similarity scores for varying sizes of query and DB entries. (A) DB held constant at 20,000 bases while query varied from 5,000–20,000 bases. (B) Query held constant at 20,000 bases while DB varied from 5,000–20,000 bases. https://doi.org/10.1371/journal.pcbi.1006454.s003 (DOCX) S3 Fig. Receiver Operator Characteristic (ROC) curves for Paillier-SIG-DB. Query length = 1000 bases and ANI threshold of 0.99 for A-C; 0.95 for D. (A) DB entry length = 1000 bp with a total of 750,000 query to database entry comparisons and 1,515 of those with ANI scores above the ANI threshold (true positives) (B) DB entry length = 2000 bp with a total of 750,000 query to database entry comparisons and 1,549 of those with ANI scores above the ANI threshold (C) DB entry length = 3000 bp with a total of 750,000 query to database entry comparisons and 1,708 of those with ANI scores above the ANI threshold (D) DB entry length = 1000bp with a total of 750,000 query to database entry comparisons and 1,877 of those with ANI scores above the ANI threshold (representative of all ROC curves generated using an ANI threshold = 0.95). https://doi.org/10.1371/journal.pcbi.1006454.s004 (DOCX) S4 Fig. Receiver Operator Characteristic (ROC) curves for Paillier-SIG-DB based on each similarity score–(A) IoD, (B) IoU, and (C) IoQ. Query length = 1000 bases and ANI threshold of 0.99. The DB entry length was 1000 bp (top row) with a total of 750,000 query to database entry comparisons and 1,515 of those with ANI scores above the ANI threshold (true positives) and 3000 bp (bottom row) with DB entry length = 3000 bp with a total of 750,000 query to database entry comparisons and 1,708 of those with ANI scores above the ANI threshold. Using IoD as the similarity score when a query is one-third the size of the database entries results in an odd ROC curve representing very poor performance, as shown in (A) bottom row. One possible explanation is that the denominator is larger than the possible intersection space, and as such, we may be washing out the small differences between the intersections of positive and negative examples. In a dataset of only 0.2% positive samples, this leads to broad misclassification. This phenomenon is exactly why we included IoD, IoU, and IoQ as our scoring metrics. https://doi.org/10.1371/journal.pcbi.1006454.s005 (DOCX) S5 Fig. SIG-DB Execution time for a query against a database of 100 entries (10,145 sequences) using various numbers of cores for computation. https://doi.org/10.1371/journal.pcbi.1006454.s006 (DOCX) S6 Fig. Heat map showing time of execution for SIG-DB, comparing a single query of 'qlen' length to a database containing 1000 entries (82,992 sequences). The trends in this heat map are representative of all scenarios tested. https://doi.org/10.1371/journal.pcbi.1006454.s007 (DOCX) S7 Fig. SIG-DB execution time for 100 and 1000 database entries compared to different query lengths; both runs performed with 48 cores. Note: y-axis is in log-scale. https://doi.org/10.1371/journal.pcbi.1006454.s008 (DOCX)

Acknowledgments The authors would like to thank Lisa Porter for her helpful discussions during development of this project and her critical reading of the manuscript; and Nat Puffer and Justin Wilder for their feedback and discussion on encryption and data security.