The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries—such as “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?”—with either “yes” or “no.” Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes.

To examine the question of re-identification in a beacon, we have developed a likelihood-ratio test (LRT) that uses allele presence or absence responses from a beacon to predict whether a given individual genome is present in the beacon database. Our approach is independent of allele frequencies. The statistical properties of the LRT guarantee that it is the most powerful test for this problem. A variation of our LRT can detect relatives of the query individual in the beacon. Our results suggest that anonymous-access beacons do not protect individual privacy and are open to re-identification attacks. As a result, they can also disclose phenotype information about individuals whose genomes are present in the beacon.

The Beacon Project by the Global Alliance for Genomics & Health (GA4GH) aims to simplify data sharing through a web service (“beacon”) that provides only allele-presence information. Users can query institutional beacons for information about genomic data available at the institution. Queries are of the form “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?” and the beacon server can answer “yes” or “no.” Beacons are intended to be easily set up and to allow data sharing while protecting participant privacy. By providing only allele-presence information, beacons are safe from attacks that require allele frequencies.However, a privacy breach from a beacon would be troubling given that beacons often summarize data with a particular disease of interest. For instance, identifying that a given genome is part of the SFARI beacon, which contains genomic data from families with a child affected by autism spectrum disorder, means that the individual belongs to a family where some member has autism spectrum disorder. Thus, beacons could leak not only membership information but also phenotype information. Although genetic privacy is protected to some extent by the Genetic Information Nondiscrimination Act (GINA), the offered protections are limited, and GINA does not apply to long-term care insurance, life insurance, disability insurance, or other special cases.Therefore, all data-sharing mechanisms, including beacons, must protect participant privacy.

In the coming decade, a great deal of human genomic data, along with linked phenotypes in electronic health records, will be collected in the context of health care. A major goal of the human genomics community is to enable efficient sharing, aggregation, and analysis of these data in order to understand the genetic contributors of health and disease. Previous large-scale data-sharing approaches have had limited success because of the potential for privacy breaches and risks of participant re-identification. Homer et al.and othersshowed that subjects in a genome-wide association study could be re-identified with the use of allele frequencies, resulting in the removal of publicly available allele-frequency data.

Material and Methods

We assume a beacon composed of unrelated individuals from a single population. Given query q = {C, P, A}, the beacon answers “yes” (represented as 1) if allele A is an alternate allele at position P on chromosome C and has a non-zero frequency in the sample used for constructing the beacon, and it answers “no” (represented as 0) otherwise. We consider only bi-allelic SNPs for our analysis.

Thus, given a set of n queries Q = {q 1 , …, q n }, the beacon returns a set of responses R = {x 1 , …, x n }. For our scenario, we assume that the attacker has access to more information—the number of individuals (N) in the beacon database and the site frequency spectrum (SFS) of the population in the beacon—parameterized as a beta distribution with shape parameters (a′, b′). Thus, we assume that alternate allele frequencies f for all SNPs observed in the population are distributed as f ∼ beta(a′, b′).

1 Homer N.

Szelinger S.

Redman M.

Duggan D.

Tembe W.

Muehling J.

Pearson J.V.

Stephan D.A.

Nelson S.F.

Craig D.W. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. For our attack scenario, we assume a setting identical to that used by Homer et al.and others. In this setting, the attacker receives a VCF file listing all the SNP positions at which the query individual has an alternate allele and the genotype calls at the corresponding positions. The attacker then queries the beacon for all heterozygous positions by using the alternate allele listed in the VCF and obtains the set of responses R from the beacon. We develop a LRT that can use the responses R to decide whether the query genome is in the beacon.

If the query individual is present in the beacon, then every allele in the query genome must be present in the beacon. Thus, the beacon will return a “yes” (1) response to every query. If a query individual is not present in the beacon, then the beacon response will be “yes” (1) if some individual in the beacon has the allele and “no” (0) otherwise. By calculating the likelihood of the responses, we can differentiate query individuals in the beacon from those not in the beacon. Our approach for re-identifying individuals within a beacon is based on a LRT that uses this information. For each query genome, we calculate the likelihood of the beacon responses to n allele-presence queries under the null hypothesis that a given individual is not in the beacon and the alternative hypothesis that the given individual is in the beacon. We then calculate the test statistic as the ratio of the two likelihoods.

To make our LRT generalizable across populations, we will remove direct dependence on allele frequencies given that frequencies can vary considerably for a given allele across populations. Instead, we will allow our test to depend on the shape of the SFS, which is described by (a′, b′), the parameters of the beta distribution. Although allele frequencies for a given allele can vary considerably across populations, the SFS parameters for most populations are similar to each other ( Modeling SFSs by Beta Distributions in Appendix A ). Therefore, the results from a test that depends on the shape of the SFS but is independent of the actual allele frequencies can be generalized to many populations ( Figure S1 ).

• Null hypothesis H 0 : query genome is not in the beacon database.

• Alternative hypothesis H 1 : query genome is in the beacon database. Our LRT evaluates the likelihood of the beacon response under two possible hypotheses.

LRT In an ideal setting, we would expect x 1 = x 2 … = x n = 1 if a query genome g is in the beacon B. In practice, because of sequencing errors and differences in variant-calling pipelines, we might have some mismatches between the query copy of a genome and its copy in the beacon. We assume that this happens with probability δ. i be f i . Because the beacon is only queried at the positions where the query genome is heterozygous, f i is not distributed as beta(a′, b′) but shows an ascertainment bias. We can show that f i ∼ beta(a, b), where a = a′ + 1 and b = b′ + 1 in theory ( Let the alternate allele frequency at the SNP corresponding to query qbe f. Because the beacon is only queried at the positions where the query genome is heterozygous, fis not distributed as beta(a′, b′) but shows an ascertainment bias. We can show that f∼ beta(a, b), where a = a′ + 1 and b = b′ + 1 in theory ( Posterior Distribution of Allele Frequencies in Appendix A ). 1 , …, x n } can be written as L ( R ) = ∑ i = 1 N x i log P ( x i = 1 ) + ( 1 − x i ) log P ( x i = 0 ) . (Equation 1)

The log-likelihood of a response set R = {x, …, x} can be written as For the LRT, we need to evaluate this log-likelihood under the null hypothesis and the alternative hypothesis. The null hypothesis is that the query genome is not present in the beacon, and the alternative hypothesis is that the query genome is present in the beacon. L H 1 ( R ) = ∑ i = 1 n x i log ( 1 − δ D N − 1 ) + ( 1 − x i ) log ( δ D N − 1 ) , (Equation 2)

where D N − 1 is the probability that none of N − 1 genomes has an alternate allele at a given position (see We can show that under the alternative hypothesis, the log-likelihood can be calculated aswhere Dis the probability that none of N − 1 genomes has an alternate allele at a given position (see Likelihood under the Alternative Hypothesis in Appendix A ). L H 0 ( R ) = ∑ i = 1 n x i log ( 1 − D N ) + ( 1 − x i ) log ( D N ) (Equation 3)

(see Similarly, the log-likelihood under the null hypothesis is(see Likelihood under the Null Hypothesis in Appendix A ). Λ = L H 0 ( R ) − L H 1 ( R ) = n log ( D N δ D N − 1 ) + log ( δ D N − 1 ( 1 − D N ) D N ( 1 − δ D N − 1 ) ) ∑ i = 1 n x i = n B + C ∑ i = 1 n x i ,

where we have defined B = log ( D N / δ D N − 1 ) and C = log ( δ D N − 1 ( 1 − D N ) / D N ( 1 − δ D N − 1 ) ) (see δ < ( D N / D N − 1 ) , we have C < 0. In practice, because N ≫ 1 , D N ≈ D N − 1 , and mismatch rate δ ≪ 1 , this will always be true. The log of the likelihood-ratio statistic can then be written aswhere we have definedand(see LRT Statistic in Appendix A ). For, we have C < 0. In practice, because, and mismatch rate, this will always be true. Λ = n B + C ∑ i = 1 n x i . (Equation 4)

Therefore, the LRT statistic can be stated as The LRT stated above can be understood to be a test for a simple null hypothesis H 0 : θ = 1 − D N against a simple alternative hypothesis H 1 : θ = 1 − δD N when we are given {x 1 , …, x n } sampled as x i ∼ Bernoulli(θ). By the Neyman-Pearson lemma, the LRT is the most powerful test for a given test size α.

Binomial Test The null hypothesis is rejected if Λ < t for some threshold t. Let t α be such that P(Λ < t α | H 0 ) = α. This is equivalent to rejecting the null hypothesis if ∑ i = 1 n x i > t α ′ , where t α ′ = ( t α − n B / C ) . Because the x i are independent and identically distributed (i.i.d.) under both hypotheses, ∑ i = 1 n x i | H 0 ∼ binomial ( n , 1 − D N ) and ∑ i = 1 n x i | H 1 ∼ binomial ( n , 1 − δ D N − 1 ) . Therefore, the power of the exact test can be calculated as 1 − β = P ( ∑ i = 1 n x i > t α ′ | H 1 ) , where t α ′ is chosen such that P ( ∑ i = 1 n x i > t α ′ | H 0 ) = α . A sufficient statistic for the LRT is the number of “yes” responses from the beacon.

Relationship between the Number of Queries Required and Beacon Size In the null and alternative hypotheses, x i is a Bernoulli random variable. Therefore, by the central limit theorem, the LRT statistic has a Gaussian distribution. We can therefore use the parameters of the Gaussian distribution to obtain a relationship between the number of queries (required for achieving a desired power and false-positive rate) and the number of individuals in the beacon. Let μ 0 and σ 0 be the mean and SD, respectively, of the LRT statistic under the null hypothesis, and let μ 1 and σ 1 be the corresponding values under the alternative hypothesis. μ 0 + σ 0 z α = μ 1 + σ 1 z 1 − β , (Equation 5)

where z y is the y quantile of the standard normal distribution. For an LRT statistic with false-positive rate α, power 1 − β, and a normal distribution, we have thatwhere zis the y quantile of the standard normal distribution. n N a ′ + 1 = ( z α − z 1 − β δ ) 2 Γ ( b ′ + 1 ) 2 a ′ + 1 Γ ( a ′ + b ′ + 2 ) (Equation 6)

(see n ∝ N a ′ + 1 . For the LRT we describe, this relationship is equivalent to(see Gaussian LRT Power Approximation in Appendix A ). The right-hand side of the equation is independent of both n and N for a specified false-positive rate α and power 1 − β. Thus, we have that

LRT for Detecting Relatives The relatedness of two individuals can be parameterized with a single parameter ϕ, which is the probability that the two individuals share an allele at a single SNP. Thus, identical twins have ϕ = 1, parent-offspring and sibling pairs have ϕ = 0.5, first cousins have ϕ = 0.25, and so on. L H 1 ( R ) = ∑ i = 1 n x i log ( 1 − δ D N − 1 − ( 1 − 2 δ ) ( 1 − ϕ ) 2 D N − ( 1 − 2 δ ) ϕ ( 1 − ϕ ) D N − 1 2 ) + ( 1 − x i ) log ( δ D N − 1 + ( 1 − 2 δ ) ( 1 − ϕ ) 2 D N + ( 1 − 2 δ ) ϕ ( 1 − ϕ ) D N − 1 2 ) (Equation 7)

(see The likelihood for the null hypothesis remains the same as before. Under the alternate hypothesis (a relative of the query genome g with relatedness ϕ is present in beacon B), the log-likelihood is given by(see Likelihood under the Alternate Hypothesis in Appendix B ). We can use this form to calculate the LRT statistic for this setting. Here, the exact test uses ∑ i = 1 n x i as the sufficient statistic (as before), and the sufficient statistic is binomially distributed under both hypotheses. The distributions are given by ∑ i = 1 n x i | H 0 ∼ binomial ( n , 1 − D N ) and ∑ i = 1 n x i | H 1 ∼ binomial ( n , 1 − δ D N − 1 − ( 1 − 2 δ ) ( 1 − ϕ ) 2 D N − ( 1 − 2 δ ) ϕ ( 1 − ϕ ) D N − ( 1 / 2 ) ) . Therefore, the power of the exact test can be calculated as β = P ( ∑ i = 1 n x i > t α ′ | H 1 ) , where t α ′ is chosen such that P ( ∑ i = 1 n x i > t α ′ | H 0 ) = α .

Simulation Experiments We simulated 500,000 SNPs in a sample of 1,000 diploid individuals. Alternate allele frequencies were sampled from a multinomial distribution with probabilities obtained from the expected allele-frequency distribution for a standard neutral model under the assumption of a population size of 10,000 individuals. • 200 diploid individuals from the beacon

• 200 diploid individuals not in the beacon and whose genotypes were simulated according to the generated allele frequencies at all SNPs. We constructed a beacon by using the 1,000 simulated individuals. The query set of individuals consisted of For initial experiments, the mismatch rate between the beacon and query copies of the same genomes was set to 10−6 to simulate near-ideal data. The null distribution of the LRT statistic was obtained with the exact-test calculation for the 200 individuals not in the beacon. Power was calculated as the proportion of successfully rejected tests (out of 200) for the query genomes in the beacon.

Detecting Relatives To examine whether relatives could be identified from the beacon, we used 200 individuals from the beacon to generate query genomes with varying degrees of relatedness to the original individual.

Effect of Noise Genome sequencing is more error prone than array genotyping. Even with high-coverage data, biological replicates of the same individual could have 1%–5% SNPs unique to each replicate. On the same sequenced sample, different variant-calling pipelines can produce SNP calls at positions that might differ from each other. We model this in our simulation by allowing for a mismatch probability (δ) that for a query individual who is in the beacon and is heterozygous at the query SNP, the copy in the beacon is a homozygous reference, i.e., the beacon will (erroneously) return 0 as the response to the query. Table S2 shows the levels of mismatch modeled in our experiments.