Using Gaussian copulas to model uniqueness

We consider a dataset \({\cal{D}}\), released by an organization, and containing a sample of \(n_{\cal{D}}\) individuals extracted at random from a population of n individuals, e.g., the US population. Each row x(i) is an individual record, containing d nominal or ordinal attributes (e.g., demographic variables, survey responses) taking values in a discrete sample space \({\cal{X}}\). We consider the rows x(i) to be independent and identically distributed, drawn from the probability distribution X with \({\Bbb P}(X = {\boldsymbol{x}})\), abbreviated p(x).

Our model quantifies, for any individual x, the likelihood ξ x for this record to be unique in the complete population and therefore always successfully re-identified when matched. From ξ x , we derive the likelihood κ x for x to be correctly re-identified when matched, which we call correctness. If Doe’s record x(d) is unique in \({\cal{D}}\), he will always be correctly re-identified (\(\kappa _{{\boldsymbol{x}}^{(d)}} = 1\) and \(\xi _{{\boldsymbol{x}}^{(d)}} = 1\)). However, if two other people share the same attribute (\({\boldsymbol{x}}^{(d)}\) not unique, \(\xi _{{\boldsymbol{x}}^{(d)}} = 0\)), Doe would still have one chance out of three to have been successfully re-identified \(\left( {\kappa _{{\boldsymbol{x}}^{(d)}} = 1/3} \right)\). We model \(\xi _{\boldsymbol{x}}\) as:

$$\xi _{\boldsymbol{x}} \equiv {\Bbb P}\left({\boldsymbol{x}}{\hbox{ unique in }}({\boldsymbol{x}}^{(1)}, \ldots ,{\boldsymbol{x}}^{(n)}) \;|\; \exists i,{\boldsymbol{x}}^{(i)} = {\boldsymbol{x}}\right)$$ (1)

$$= \left(1 - p({\boldsymbol{x}})\right)^{n - 1}$$ (2)

and κ x as:

$$\kappa _{\boldsymbol{x}} \equiv {\Bbb P}\left({\boldsymbol{x}}{\hbox{ correctly matched in }} ({\boldsymbol{x}}^{(1)}, \ldots ,{\boldsymbol{x}}^{(n)}) \;|\; \exists i,{\boldsymbol{x}}^{(i)} = {\boldsymbol{x}}\right)$$ (3)

$$= \frac{1}{n}\frac{{1 - \xi _{\boldsymbol{x}}^{n/(n - 1)}}}{{1 - \xi _{\boldsymbol{x}}^{1/(n - 1)}}}$$ (4)

with proofs in “Methods”.

We model the joint distribution of X 1 , X 2 , … X d using a latent Gaussian copula43. Copulas have been used to study a wide range of dependence structures in finance44, geology45, and biomedicine46 and allow us to model the density of X by specifying separately the marginal distributions, easy to infer from limited samples, and the dependency structure. For a large sample space \({\cal{X}}\) and a small number \(n_{\cal{D}}\) of available records, Gaussian copulas provide a good approximation of the density using only d(d − 1)/2 parameters for the dependency structure and no hyperparameter.

The density of a Gaussian copula C Σ is expressed as:

$$c_\Sigma ({\boldsymbol{u}}) = \frac{1}{{\sqrt {{\mathrm{det}}\,\Sigma } }}{\mathrm{exp}}\left( { - \frac{1}{2}{\mathrm{\Phi }}^{ - 1}({\boldsymbol{u}})^T \cdot (\Sigma ^{ - 1} - {\mathrm{I}}) \cdot {\mathrm{\Phi }}^{ - 1}({\boldsymbol{u}})} \right)$$ (5)

with a covariance matrix Σ, u ∈ [0, 1]d, and Φ the cumulative distribution function (CDF) of a standard univariate normal distribution.

We estimate from \({\cal{D}}\) the marginal distributions Ψ (marginal parameters) for X 1 , …, X d and the copula distribution Σ (covariance matrix), such that p(x) is modeled by

$$q({\boldsymbol{x}}|\Sigma ,\Psi ) = {\int}_{F_1^{ - 1}(x_1 - 1|\Psi )}^{F_1^{ - 1}(x_1|\Psi )} \ldots {\int}_{F_d^{ - 1}(x_d - 1|\Psi )}^{F_d^{ - 1}(x_d|\Psi )} c_\Sigma ({\boldsymbol{u}})\,{\mathrm{d}}{\boldsymbol{u}}$$ (6)

with F j the CDF of the discrete variable X j . In practice, the copula distribution is a continuous distribution on the unit cube, and p(x) its discrete counterpart on \({\cal{X}}\) (see Supplementary Methods).

We select, using maximum likelihood estimation, the marginal distributions from categorical, logarithmic, and negative binomial count distributions (see Supplementary Methods). Sampling the complete set of covariance matrices to estimate the association structure of copulas is computationally expensive for large datasets. We rely instead on a fast two-step approximate inference method: we infer separately each pairwise correlation factor Σ ij and then project the constructed matrix Σ on the set of symmetric positive definite matrices to accurately recover the copula covariance matrix (see “Methods”).

We collect five corpora from publicly available sources: population census (USA and MERNIS) as well as surveys from the UCI Machine Learning repository (ADULT, MIDUS, HDV). From each corpus, we create populations by selecting subsets of attributes (columns) uniformly. The resulting 210 populations cover a large range of uniqueness values (0–0.96), numbers of attributes (2–47), and records (7108–9M individuals). For readability purposes, we report in the main text the numerical results for all five corpora but will show figures only for USA. Figures for MERNIS, ADULT, MIDUS, and HDV are similar and available in Supplementary Information.

Figure 1a shows that, when trained on the entire population, our model correctly estimates population uniqueness \(\Xi _X = \mathop {\sum}

olimits_{{\boldsymbol{x}} \in {\cal{X}}} p({\boldsymbol{x}})\left(1 - p({\boldsymbol{x}})\right)^{n - 1}\), i.e., the expected percentage of unique individuals in (x(1), x(2), …, x(n)). The MAE between the empirical uniqueness of our population Ξ X and the estimated uniqueness \(\widehat {\Xi _X}\) is 0.028 ± 0.026 [mean ± s.d.] for USA and 0.018 ± 0.019 on average across every corpus (see Table 1). Figure 1a and Supplementary Fig. 1 furthermore show that our model correctly estimates uniqueness across all values of uniqueness, with low within-population s.d. (Supplementary Table 3).

Fig. 1 Estimating the population uniqueness of the USA corpus. a We compare, for each population, empirical and estimated population uniqueness (boxplot with median, 25th and 75th percentiles, maximum 1.5 interquartile range (IQR) for each population, with 100 independent trials per population). For example, date of birth, location (PUMA code), marital status, and gender uniquely identify 78.7% of the 3 million people in this population (empirical uniqueness) that our model estimates to be 78.2 ± 0.5% (boxplot in black). b Absolute error when estimating USA’s population uniqueness when the disclosed dataset is randomly sampled from 10% to 0.1%. The boxplots (25, 50, and 75th percentiles, 1.5 IQR) show the distribution of mean absolute error (MAE) for population uniqueness, at one subsampling fraction across all USA populations (100 trials per population and sampling fraction). The y axis shows both p, the sampling fraction, and \(n_{\cal{S}} = p \times n\), the sample size. Our model estimates population uniqueness very well for all sampling fractions with the MAE slightly increasing when only a very small number of records are available (p = 0.1% or 3061 records) Full size image

Table 1 Mean absolute error (mean ± s.d.) when estimating population uniqueness (100 trials per population) Full size table

Figure 1b shows that our model estimates population uniqueness very well even when the dataset is heavily sampled (see Supplementary Fig. 2, for other populations). For instance, our model achieves an MAE of 0.029 ± 0.015 when the dataset only contains 1% of the USA population and an MAE of 0.041 ± 0.053 on average across every corpus. Table 1 shows that our model reaches a similarly low MAE, usually <0.050, across corpora and sampling fractions.

Likelihood of successful re-identification

Once trained, we can use our model to estimate the likelihood of his employer having correctly re-identified John Doe, our 50-year-old male from Berkeley with breast cancer. More specifically, given an individual record x, we can use the trained model to compute the likelihood \(\widehat {\xi _{\boldsymbol{x}}} = \left(1 - q({\boldsymbol{x}}\,|\,\Sigma ,\Psi )\right)^{n - 1}\) for this record x to be unique in the population. Our model takes into account information on both marginal prevalence (e.g., breast cancer prevalence) and global attribute association (e.g., gender and breast cancer). Since the cdf. of a Gaussian copula distribution has no close-form expression, we evaluate q(x|Σ, Ψ) with a numerical integration of the latent continuous joint density inside the hyper-rectangle defined by the d components (x 1 , x 2 , …, x d )47,48. We assume no prior knowledge on the order of outcomes inside marginals for nominal attributes and randomize their order.

Figure 2a shows that, when trained on 1% of the USA populations, our model predicts very well individual uniqueness, achieving a mean AUC (area under the receiver-operator characteristic curve (ROC)) of 0.89. For each population, to avoid overfitting, we train the model on a single 1% sample, then select 1000 records, independent from the training sample, to test the model. For re-identifications that the model predicts to be always correct (\(\widehat {\xi _{\boldsymbol{x}}}\, > \, 0.95\), estimated individual uniqueness >95%), the likelihood of them to be incorrect (false-discovery rate) is 5.26% (see bottom-right inset in Fig. 2a). ROC curves for the other populations are available in Supplementary Fig. 3 and have overall a mean AUC of 0.93 and mean false-discovery rate of 6.67% for \(\widehat {\xi _{\boldsymbol{x}}}\, > \, 0.95\) (see Supplementary Table 1).

Fig. 2 The model predicts correct re-identifications with high confidence. a Receiver operating characteristic (ROC) curves for USA populations (light ROC curve for each population and a solid line for the average ROC curve). Our method accurately predicts the (binary) individual uniqueness. (Inset) False-discovery rate (FDR) for individual records classified with ξ > 0.9, ξ > 0.95, and ξ > 0.99. For re-identifications that the model predicts are likely to be correct \(( {\widehat {\xi _{\boldsymbol{x}}} \,> \, 0.95})\), only 5.26% of them are incorrect (FDR). b Our model outperforms by 39% the best theoretically achievable prediction using population uniqueness across every corpus. A red point shows the Brier Score obtained by our model, when trained on a 1% sample. The solid line represents the lowest Brier Score achievable when using the exact population uniqueness while the dashed line represents the Brier Score of a random guess prediction (BS = 1/3) Full size image

Finally, Fig. 2b shows that our model outperforms even the best theoretically achievable prediction using only population uniqueness, i.e., assigning the score \(\xi _{\boldsymbol{x}}^{{\mathrm{(pop)}}} = \Xi _X\) to every individual (ground truth population uniqueness, see Supplementary Methods). We use the Brier Score (BS)49 to measure the calibration of probabilistic predictions: \({\mathrm{BS}} = \frac{1}{n}\mathop {\sum}

olimits_{i = 1}^n {\left(\xi _{{\boldsymbol{x}}^{(i)}} - \widehat {\xi _{{\boldsymbol{x}}^{(i)}}}\right)^2}\) with, in our case, \(\xi _{{\boldsymbol{x}}^{(i)}}\) the actual uniqueness of the record \({\boldsymbol{x}}^{(i)}\) (1 if \({\boldsymbol{x}}^{(i)}\) is unique and 0 if not) and \(\widehat {\xi _{{\boldsymbol{x}}^{(i)}}}\) the estimated likelihood. Our model obtains scores on average 39% lower than the best theoretically achievable prediction using only population uniqueness, emphasizing the importance of modeling individuals’ characteristics.

Appropriateness of the de-identification model

Using our model, we revisit the (successful) re-identification of Gov. Weld25. We train our model on the 5% Public Use Microdata Sample (PUMS) files using ZIP code, date of birth, and gender and validate it using the last national estimate50. We show that, as a male born on July 31, 1945 and living in Cambridge (02138), the information used by Latanya Sweeney at the time, William Weld was unique with a 58% likelihood (ξ x = 0.58 and κ x = 0.77), meaning that Latanya Sweeney’s re-identification had 77% chances of being correct. We show that, if his medical records had included number of children—5 for William Weld—, her re-identification would have had 99.8% chances of being correct! Figure 3a shows that the same combinations of attributes (ZIP code, date of birth, gender, and number of children) would also identify 79.4% of the population in Massachusetts with high confidence \(( {\widehat {\xi _{\boldsymbol{x}}} \,\,> \,\, 0.80} )\). We finally evaluate the impact of specific attributes on William Weld’s uniqueness. We either change the value of one of his baseline attributes (ZIP code, date of birth, or gender) or add one extra attribute, in both cases picking the attribute at random from its distribution (see Supplementary Methods). Figure 3c shows, for instance, that individuals with 3 cars or no car are harder to re-identify than those with 2 cars. Similarly, it shows that it would not take much to re-identify people living in Harwich Port, MA, a city of <2000 inhabitants.

Fig. 3 Average individual uniqueness increases fast with the number of collected demographic attributes. a Distribution of predicted individual uniqueness knowing ZIP code, date of birth, and gender (resp. ZIP code, date of birth, gender, and number of children) in blue (resp. orange). The dotted blue line at \(\widehat {\xi _{\boldsymbol{x}}} = 0.580\) (resp. dashed orange line at \(\widehat {\xi _{\boldsymbol{x}}} = 0.997\)) illustrates the predicted individual uniqueness of Gov. Weld knowing the same combination of attributes. (Inset) The correctness κ x is solely determined by uniqueness ξ x and population size n (here for Massachusetts). We show individual uniqueness and correctness for William Weld with three (in blue) and four (in orange) attributes. b The boxplots (25, 50, and 75th percentiles, 1.5 IQR) show the average uniqueness 〈ξ x 〉 knowing k demographic attributes, grouped by number of attributes. The individual uniqueness scores ξ x are estimated on the complete population in Massachusetts, based on the 5% Public Use Microdata Sample files. While few attributes might not be sufficient for a re-identification to be correct, collecting a few more attributes will quickly render the re-identification very likely to be successful. For instance, 15 demographic attributes would render 99.98% of people in Massachusetts unique. c Uniqueness varies with the specific value of attributes. For instance, a 33-year-old is less unique than a 58-year-old person. We here either (i) randomly replace the value of one baseline attribute (ZIP code, date of birth, or gender) or (ii) add one extra attribute, both by sampling from its marginal distribution, to the uniqueness of a 58-year-old male from Cambridge, MA. The dashed baseline shows his original uniqueness \(\widehat {\xi _{\boldsymbol{x}}} = 0.580\) and the boxplots the distribution of individual uniqueness obtained after randomly replacing or adding one attribute. A complete description of the attributes and method is available in Supplementary Methods Full size image

Modern datasets contain a large number of points per individuals. For instance, the data broker Experian sold Alteryx access to a de-identified dataset containing 248 attributes per household for 120M Americans51; Cambridge university researchers shared anonymous Facebook data for 3M users collected through the myPersonality app and containing, among other attributes, users’ age, gender, location, status updates, and results on a personality quiz52. These datasets do not necessarily share all the characteristics of the one studied here. Yet, our analysis of the re-identification of Gov. Weld by Latanya Sweeney shows that few attributes are often enough to render the likelihood of correct re-identification very high. For instance, Fig. 3b shows that the average individual uniqueness increases fast with the number of collected demographic attributes and that 15 demographic attributes would render 99.98% of people in Massachusetts unique.

Our results, first, show that few attributes are often sufficient to re-identify with high confidence individuals in heavily incomplete datasets and, second, reject the claim that sampling or releasing partial datasets, e.g., from one hospital network or a single online service, provide plausible deniability. Finally, they show that, third, even if population uniqueness is low—an argument often used to justify that data are sufficiently de-identified to be considered anonymous53—, many individuals are still at risk of being successfully re-identified by an attacker using our model.

As standards for anonymization are being redefined, incl. by national and regional data protection authorities in the EU, it is essential for them to be robust and account for new threats like the one we present in this paper. They need to take into account the individual risk of re-identification and the lack of plausible deniability—even if the dataset is incomplete—, as well as legally recognize the broad range of provable privacy-enhancing systems and security measures that would allow data to be used while effectively preserving people’s privacy54,55.