by

[This post is joint work with Princeton graduate student Changchang Liu and IBM researcher Supriyo Chakraborty. See our paper for full details. — Prateek Mittal ]

The tussle between data utility and data privacy

Information sharing is important for realizing the vision of a data-driven customization of our environment. Data that were earlier locked up in private repositories are now being increasingly shared for enabling new context-aware applications, better monitoring of population statistics, and facilitating academic research in diverse fields. However, sharing personal data gives rise to serious privacy concerns as the data can contain sensitive information that a user might want to keep private. Thus, while on one hand, it is imperative to release utility-providing information, on the other hand, the privacy of users whose data is being shared also needs to be protected.

What is differential privacy?

Among existing data privacy metrics, differential privacy (DP) has emerged has a rigorous mathematical framework for defining and preserving privacy, and has received considerable attention from the privacy community. DP is used for protecting the privacy of individual users’ data while publishing aggregate query responses over statistical databases, and guarantees that the distribution of query responses changes only slightly with the addition or deletion of a single tuple (entry or record) in the database. Thus, after observing the query response, an adversary is limited in its ability to infer sensitive data about any tuple and in turn about the user contributing the tuple. For example, if the government aims to publish the average salary of individuals in New Jersey, DP can reveal the approximate average salary of this region while protecting the privacy of any individual user’s salary. DP achieves this balance between utility and privacy by adding noise (perturbation) to the true query result, for example, via the Laplace perturbation mechanism.

Differential privacy is vulnerable to data correlation!

To provide its guarantees, DP implicitly assumes that the data tuples in the database, each from a different user, are all independent. This is a weak assumption, especially because tuple dependence occurs naturally in datasets due to social interactions between users. For example, consider a social network graph with nodes representing users, and edges representing friendship relationships. An adversary can infer the existence of a social link between two users that are not explicitly connected in the graph using the existence of edges among other users in the graph ( since two friends are likely to have many other friends in common). Similarly, private attributes in a user’s record can be inferred by exploiting the public attributes of other users sharing similar interests. An adversary can easily infer a user’s susceptibility to a contagious disease using access to the noisy query result and also the fact that the user’s immediate family members are part of the database. In an era of big-data, the tuples within a database present rich semantics and complex structures that inevitably lead to data correlation. Therefore, data correlation, i.e., the probabilistic dependence relationship among tuples in the database, which is implicitly ignored in DP, can seriously deteriorate the privacy properties of DP.

Previous work has investigated this problem and proposed several interesting privacy metrics, such as zero-knowledge privacy, pufferfish privacy, membership privacy, inferential privacy, etc. However, the privacy community is lacking effective perturbation mechanisms that can achieve these proposed privacy metrics.

In a paper that we presented at the Network and Distributed System Security Symposium (NDSS 2016), we first demonstrate a Bayesian attack on differentially private mechanisms using real-world datasets (such as Gowalla location data). Our attack exploits the correlation between location information and social information of users in the Gowalla dataset. We show that it is possible to infer a user’s sensitive location information from differentially private responses by exploiting her social relationships. When data is correlated, DP underestimates the amount of noise that must be added to achieve a desired bound on the adversary’s ability to make sensitive inferences. Therefore, correlation (or dependency) among data tuples would break the expected privacy guarantees.

How can we improve differential privacy to protect correlated data? Introducing dependent differential privacy.

Overall, our work generalizes the classic differential privacy framework to explicitly consider correlations in datasets, thus enhancing the applicability of the approach to a wide range of datasets and applications. We formalize the notion of dependent differential privacy (DDP); DDP aims to explicitly incorporate probabilistic dependency constraints among tuples in the privacy formulation. We also propose a perturbation mechanism to achieve the privacy guarantees in DDP; our approach extends the conventional Laplace perturbation mechanism by incorporating the correlation between data tuples. To do so, we introduce the concept of dependent coefficient between two tuples, which aims to capture the correlation between data tuples from the perspective of privacy. (The dependent coefficient between two tuples evaluates the ratio between the dependent indistinguishability of one tuple induced by the other tuple and the self indistinguishability of the tuple itself.) Our perturbation mechanism can be applied for arbitrary data queries and we validated its effectiveness using real-world datasets.

What are future research directions for dependent differential privacy?

First, to form a deeper understanding of our dependent differential privacy framework, it would be interesting to explore the application of standard concepts in differential privacy to our framework, such as local sensitivity and smooth sensitivity. Second, the effectiveness of our proposed perturbation mechanism depends on how well the dependence among data can be modeled and computed. One limitation of our work is that the dependence coefficient is exactly known to both the adversary and the database owner. How to accurately compute the dependence coefficient and deal with potential underestimation would be an interesting future work (note that the overestimation of dependence coefficient continues to provide rigorous DDP guarantees). Finally, in our dependent differential privacy framework, we only considered pairwise correlations between tuples in a database. An important research direction is to generalize the pairwise correlations that we considered in our paper to joint correlations in the dataset.