Data Analytics and Blockchain

Data analytics and machine learning are hugely valuable, providing insights and spurring advancements in many industries including IoT, healthcare, and financial services.

Unfortunately, the data that powers these advancements is often highly sensitive. For example, medical research requires access to sensitive patient data. In many cases this data cannot be accessed or shared due to privacy concerns. The result is data silos, in which data is never used to its full potential.

Blockchain can help solve this problem, though several challenges remain. For example, one would like to use smart contracts to allow researchers to run machine learning over sensitive data without revealing the data to the researchers. This is the premise behind many exciting new blockchain applications including data markets and decentralized hedge funds.

Unfortunately, today’s blockchain platforms cannot directly support applications that compute over sensitive data. As we discussed in a previous blog post, existing blockchains such as Ethereum store all data and state publicly, which could allow any user of the network to steal the data.

The Oasis platform provides confidentiality for smart contract execution through the use of secure enclaves and cryptographic techniques. In short, confidentiality ensures that sensitive data cannot be viewed or stolen when a smart contract runs on the data.

Confidentiality is an essential requirement for protecting privacy that is missing in today’s smart contract platforms. However, protecting the computation process alone is not enough: additional care must be taken to ensure the outputs of computations don’t leak sensitive information.

Privacy Risks of Data Analytics

The results of data analytics and machine learning often reveal more than intended, which can lead to privacy violations. For example, consider a company which releases the average salary of its employees each month.

Month       Average Salary
January     $73,568
February    $74,872

Our intuition might tell us that statistical results such as averages do not reveal information about individuals. This is incorrect. Imagine you know that the company had 58 employees in January, and that the only change in staffing at the company between January and February was the hiring of your friend Bob. Combining these facts with the published averages, you can determine Bob’s exact salary: $150,504.
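The arithmetic behind this inference can be written out in a few lines. The sketch below uses only the figures given above (the two published averages and the January headcount of 58); the variable names are ours, chosen for illustration.

```python
# Figures taken from the example above.
jan_avg, feb_avg = 73_568, 74_872
jan_headcount = 58                  # known to the observer
feb_headcount = jan_headcount + 1   # Bob is the only staffing change

# An average times a headcount recovers the total payroll for that month.
jan_total = jan_avg * jan_headcount
feb_total = feb_avg * feb_headcount

# The only difference between the two totals is Bob's salary.
bob_salary = feb_total - jan_total
print(bob_salary)  # 150504
```

Nothing in this calculation requires access to individual records: two aggregate statistics and one piece of outside knowledge are enough.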

This simple example demonstrates a problem inherent to any statistical query on sensitive data: the results of such queries can reveal sensitive information, such as Bob’s salary, even when that information is not included directly in the output. This can happen accidentally, often in non-intuitive ways, and does not require a researcher who is intentionally trying to learn private information.

Recent work has shown that machine learning models can also leak information. For example, a recent paper co-authored by Oasis team members demonstrates how private information such as credit card numbers can be extracted from a deep learning model trained on user data. More concerning, this leakage happens across many different types of models, parameters, and training strategies.

Anonymization is Not a Solution

The most common approach for protecting the privacy of individuals is to anonymize data before releasing it. This approach is based on the assumption that if the data contains no identifying information about individuals (names, addresses, etc.) it should be safe for release.

Unfortunately, individuals can often be identified in anonymized datasets using so-called re-identification attacks. For example, in 2006 Netflix released a dataset of anonymized customer movie ratings for a competition to train better recommendation algorithms. Researchers demonstrated how to link the anonymized ratings with public reviews on the Internet Movie Database, and were able to re-identify Netflix customers. Following this result, privacy concerns raised by the FTC and a related lawsuit led Netflix to cancel a planned second round of the competition.

There are many other examples of anonymized data being used to identify specific individuals, including search logs and taxi trips. In fact, one widely cited study found that 87 percent of the population in the United States can be uniquely identified by just their 5-digit ZIP code, gender, and date of birth.
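A re-identification attack of this kind amounts to a join on quasi-identifiers. The sketch below is a minimal illustration, not the method used in any of the studies mentioned above: every record, name, and field in it is made up, and the helper reidentify is hypothetical.

```python
from collections import Counter

# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "94110", "gender": "F", "dob": "1985-03-12", "diagnosis": "asthma"},
    {"zip": "94110", "gender": "M", "dob": "1979-11-02", "diagnosis": "diabetes"},
    {"zip": "94703", "gender": "F", "dob": "1990-07-25", "diagnosis": "migraine"},
]

# Hypothetical public records that still carry names.
public = [
    {"name": "Alice", "zip": "94110", "gender": "F", "dob": "1985-03-12"},
    {"name": "Bob", "zip": "94110", "gender": "M", "dob": "1979-11-02"},
    {"name": "Carol", "zip": "94703", "gender": "F", "dob": "1990-07-25"},
    {"name": "Dave", "zip": "94703", "gender": "M", "dob": "1968-01-30"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def key(record):
    """Project a record onto its quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymized, public):
    """Link anonymized rows to names wherever the quasi-identifiers are unique."""
    counts = Counter(key(p) for p in public)
    # Keep only public records whose quasi-identifier combination is unique.
    names = {key(p): p["name"] for p in public if counts[key(p)] == 1}
    return {names[key(a)]: a["diagnosis"] for a in anonymized if key(a) in names}


linked = reidentify(anonymized, public)
print(linked)  # {'Alice': 'asthma', 'Bob': 'diabetes', 'Carol': 'migraine'}
```

The attack needs no identifying fields in the released data; it only needs some combination of attributes that is rare enough to act as a fingerprint.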

These results suggest that traditional approaches for protecting privacy are insufficient. We need a fundamentally new approach that is robust against these and other attacks.

Differential Privacy: A Formal Privacy Guarantee

Differential privacy is a formal definition of privacy. Informally, it states that the result of a computation must be nearly the same whether or not any one individual is included in the analysis. In other words, an observer looking at the output cannot tell with confidence whether a given individual appears in the data, much less learn that individual’s actual information.
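For readers who want the formal statement, the standard definition from the differential privacy literature is given below. The notation is ours and is not used elsewhere in this post: M is a randomized computation, D and D' are any two datasets that differ in one individual's data, S is any set of possible outputs, and ε (epsilon) is the privacy parameter.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

A smaller ε forces the two output distributions closer together and therefore gives a stronger guarantee; ε = 0 would make the output entirely independent of any one individual.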

Differential privacy has several desirable properties. First, it makes no assumptions about what auxiliary information is available to an attacker, so it holds up against all of the attacks described above. Additionally, while the definition prevents information from being learned about individuals, it still allows much to be learned about populations in the data, which is the very goal of most data analytics and machine learning problems.

Unfortunately, differential privacy is merely a definition of what privacy means; it does not tell us how to achieve this property. A major goal of our research over the past several years has been to develop algorithms and tools to enforce differential privacy (and other privacy-preserving techniques) for real-world problems such as data analytics and machine learning.
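To make the gap between definition and mechanism concrete, here is a minimal sketch of one textbook technique, the Laplace mechanism, applied to a counting query. It is an illustration under our own assumptions and not a description of how Chorus or the Oasis libraries work: the function names, the sample data, and the choice of epsilon are all hypothetical, and a production implementation would need more care (for example, around floating-point noise generation).

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Counting query released with Laplace noise.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so noise with scale 1/epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical salary data; the true count of salaries above 70,000 is 3.
salaries = [52_000, 61_500, 73_000, 88_250, 150_504]
rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = private_count(salaries, lambda s: s > 70_000, epsilon=0.5, rng=rng)
print(noisy)
```

Each released answer is the true count plus random noise, so any single answer is only approximately correct, while averages over large populations remain useful. Lowering epsilon increases the noise and strengthens the guarantee.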

In previous work we developed Chorus, a modular framework for privacy-preserving data analytics. Chorus automatically enforces differential privacy for general-purpose data analytics via several state-of-the-art algorithms. Chorus has been released as open source and is currently deployed at Uber to provide privacy-preserving analytics for its analysts. In addition, we are conducting a pilot with data analysts at the Winton Group, using Chorus to predict market trends from location and shopping data while protecting individual privacy.

Our research has also focused on privacy-preserving machine learning, where our work has produced practical new solutions to enforce differential privacy for machine learning tasks. We will present one of these techniques, Approximate Minima Perturbation, at IEEE Security & Privacy 2019, a top security conference.

We will share more technical details of this work in a future blog post.

Privacy Primitives for Smart Contracts in Oasis

At Oasis Labs, we’re building a new platform for privacy-first cloud computing on blockchain. Our mission is to chart the next era of secure computing and enable a new wave of privacy-first applications. This requires a top-to-bottom solution: in addition to providing data confidentiality at the platform level, Oasis will provide built-in privacy primitives at the application level.

We are currently developing a set of libraries that enable developers to build privacy-first smart contracts, including contracts for data analytics and machine learning. These libraries build on our extensive experience in this space, and will include the privacy-preserving techniques described above as well as many new techniques.

The libraries will provide developers with a range of privacy-preserving building blocks such as differential privacy. Developers can use these building blocks to develop smart contracts that are privacy-preserving by design, without requiring the developer to be a security expert. This makes it easy to write applications that access sensitive data in a secure way, while providing guarantees to users that their data won’t be misused.

We look forward to sharing more announcements about this work very soon. If you are interested in building an application using these libraries, we encourage you to apply to join our private testnet.