Many organizations rely on cryptography to secure our privacy, and for a long time now, we’ve trusted these organizations to encrypt our personal data rigorously and responsibly. But when what should never happen, happens — HealthNet, Equifax, Cambridge Analytica — we’re forced to confront our reliance on these third parties.

As decentralized networks, blockchains offer a means of circumventing centralized institutions and removing the need to trust a single party. That means greater transparency, since the network sees and validates every transaction. However, this newfound transparency also makes users’ activities far more vulnerable. Without proper privacy-preserving environments, users on a blockchain’s decentralized infrastructure can suffer from the very transparency it affords.

Bitcoin Privacy Model: Pseudonymity As a Decoy

Bitcoin is the most popular application of blockchain, and many of its descendants share its privacy model. The network doesn’t store IPs or personal information. Instead, it relies on cryptographic pseudonymous addresses to protect its users’ identities. A user — let’s call him Dorian — generates a new address each time he collects a payment. Dorian gives the payer this new account number with no history. The payer can’t guess much about Dorian.

The main concern comes when Dorian spends his money. A Bitcoin transaction often has multiple inputs and outputs. We can make the following assumptions (not always true, as we will see later on):

When a transaction has many inputs, all of those inputs belong to the same owner.

When a transaction has two outputs, one goes to the payee, and the other is change returned to the sender (a new unspent transaction output).

A bad actor can crawl the transaction graph and group together addresses that are likely to belong to the same person. All it takes to expose account owners is to associate one of these pseudonyms with an identity. After that, it is like unraveling a ball of string.
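The grouping step can be sketched with a union-find structure. This is only an illustration of the multi-input assumption, not a description of any real analytics tool; the transaction format (a dict with an "inputs" list of addresses) and the function name are hypothetical.

```python
from collections import defaultdict

def cluster_addresses(transactions):
    """Group addresses that appear together as inputs of one transaction.

    `transactions` is a list of dicts with an "inputs" key listing the
    addresses spent by that transaction (a hypothetical, simplified format).
    """
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        parent[find(a)] = find(b)

    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            find(addr)  # register every address, even if it appears alone
        for other in inputs[1:]:
            union(inputs[0], other)  # co-spent inputs share one owner

    clusters = defaultdict(set)
    for addr in list(parent):
        clusters[find(addr)].add(addr)
    return list(clusters.values())

# Hypothetical sample: A and B are co-spent, then B and C are co-spent.
txs = [
    {"inputs": ["A", "B"]},
    {"inputs": ["B", "C"]},
    {"inputs": ["D"]},
]
clusters = cluster_addresses(txs)
```

Here A, B, and C end up in one cluster even though A and C never appear in the same transaction. Linking any one of them to an identity exposes all three.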

Concretely, when you disclose your identity along with a payment, the payee may infer your balance and transaction history, much like reading your bank statement. They may also follow your future transactions. Even more worrisome, the payee may not be the one spying on you. An ICO can be subpoenaed, an exchange may be hacked, or a shopping website may have curious cookies.

Machines and analytics algorithms are way better at crawling data than humans. They can scrape for data on social media, combine several data sources, and build a probabilistic model associating addresses with identities. In practice, the privacy model that our ecosystem has inherited from Bitcoin is very weak. Here’s an illustration:

Dorian happens to be passionate about blockchains. He writes a Medium article that ends with a “please tip me” section and a “please follow me on Twitter.” You guessed it: it has a BTC address in it, the same one he included in his signature on his favorite forum, which you assume would only compromise his privacy. You tip him (because you like his work) and send him a private message on Twitter to thank him for his efforts. He wants to return the favor and tells the world how generous you are. He tweets, “Thank you @xxxxxxxxxxxxxxxxx for your tip.” Thanks to Dorian, now you’re vulnerable too. A simple algorithm looks for incoming transactions to Dorian’s tip address. It filters them through a time range — say, between his article and his tweet — and flags senders with your identity. Today, sophisticated tools that use machine learning and clustering techniques can be built to de-anonymize most blockchains.
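The "simple algorithm" in this illustration could look like the sketch below. The record layout (sender, recipient, time), the addresses, and the dates are all made up for the example; a real tool would read this data from a block explorer or a full node.

```python
from datetime import datetime

def flag_senders(transactions, tip_address, start, end):
    """Return senders who paid `tip_address` between `start` and `end`."""
    return sorted({
        tx["sender"]
        for tx in transactions
        if tx["recipient"] == tip_address and start <= tx["time"] <= end
    })

# Hypothetical data: the article goes live on June 1, the thank-you tweet on June 3.
article_published = datetime(2018, 6, 1, 9, 0)
thank_you_tweet = datetime(2018, 6, 3, 18, 0)
transactions = [
    {"sender": "addr_you", "recipient": "addr_dorian_tip", "time": datetime(2018, 6, 2, 12, 0)},
    {"sender": "addr_old_fan", "recipient": "addr_dorian_tip", "time": datetime(2018, 5, 20, 8, 0)},
    {"sender": "addr_you", "recipient": "addr_shop", "time": datetime(2018, 6, 2, 13, 0)},
]

suspects = flag_senders(transactions, "addr_dorian_tip", article_published, thank_you_tweet)
```

With a single candidate left in the time window, the tweet is enough to bind that address to your Twitter handle.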

The Gossip Protocol: Understanding Network Patterns Through Honeypot Nodes

Another concern is metadata analysis of our personal communications (evidenced most infamously in Edward Snowden’s revelations on the NSA’s surveillance program, PRISM). In a study of the Bitcoin network’s anonymity, researchers collected data by creating a fake honeypot node. At the time, the node connected to more peers than blockchain.info.

The study revealed several patterns. Sometimes, a computer sent out only one transaction, likely implying an association between an IP and a BTC address. Another pattern was related to software upgrades, which seemingly triggered a surge of transactions from a single IP address. Heuristics helped map more than 1,000 BTC addresses to IP addresses, which, at that time, represented about 2% of the network. This research was conducted 5 years ago, an eternity in the crypto space. The more identifying information blockchain transactions leak, the easier it is to link them to identities.

Maintaining isolated “wallets” with various pseudonyms is a laborious task for us humans. We naturally favor the use of a single address for receiving payments. Ethereum actually follows this account-based model by design. It encourages us to reuse the same address for all our transactions, meaning its default privacy properties are worse than Bitcoin’s.

The need for privacy on blockchains is legitimate and vital for every single dApp, not only for banking and financial transactions. A voting system will need to hide whether you voted, when you voted, and for whom you voted. A supply chain setup would need to conceal business relationships between participants. An ERC-20 token transaction should hide value and participants, while guaranteeing conservation of the total supply: no tokens created or destroyed along the way.

Privacy-Preserving Schemes

Privacy standards depend on the industry and also the tradeoffs we are willing to make. For asset transfers (cryptocurrencies, ERC-20 tokens), proper privacy means keeping the transaction graph confidential. Transactions should be unlinkable and untraceable, and participants should remain anonymous. The following definitions from the CryptoNote white paper help clarify the standards that a privacy-preserving scheme should meet:

Unlinkability: for any two outgoing transactions, it is impossible to prove they were sent to the same person.

Untraceability: for each incoming transaction, all possible senders are equiprobable.

Let’s port that to the ledger metaphor. We would see lines appearing (debit, credit, balance), without knowing who is involved. One way to achieve that would be through a mixing service (more on that later). Still, the information from these “lines” can be linked with a non-blockchain data source, and the privacy shield may not hold. We need more. We need to conceal the amounts. This “line appearing” should disclose neither participants nor amounts, yet it should contain data that ensures a consistent ledger state.

The system should enforce that we neither create assets out of the blue nor double spend. An anonymous crypto scheme that does exactly that is Zerocash. From a blockchain perspective, it means that we don’t store the world state anymore. We store proofs that the world state is consistent and that state transitions are valid. It becomes the user’s responsibility to maintain their own world state.

It is important to note that a lot of work goes into concealing transaction participants and amounts: mixing services, confidential transactions, zero knowledge proofs. Unfortunately, these constructs don’t generalize well to arbitrary computation.

Anonymizing Ethereum’s Computation Layer

The ledger is but one application of blockchains. So if a blockchain is not a ledger, what is it?

Blockchain is a state machine replication protocol, producing a linearly-ordered log across many peers.
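To make that definition concrete, here is a minimal sketch of state machine replication. The transition function, the account names, and the log entries are hypothetical; the point is only that every peer replaying the same ordered log from the same starting state reaches the same result.

```python
from functools import reduce

def apply_tx(state, tx):
    """Deterministic transition: move `amount` from sender to recipient."""
    sender, recipient, amount = tx
    if state.get(sender, 0) < amount:
        return state  # invalid transition, rejected identically by every peer
    new_state = dict(state)
    new_state[sender] = new_state.get(sender, 0) - amount
    new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state

def replay(genesis, log):
    """Fold the ordered log over the genesis state."""
    return reduce(apply_tx, log, genesis)

genesis = {"alice": 10}
log = [("alice", "bob", 7), ("bob", "carol", 5)]

replica_1 = replay(genesis, log)
replica_2 = replay(genesis, log)
reordered = replay(genesis, list(reversed(log)))
```

The two replicas agree, while the reordered log produces a different state because bob's payment is rejected when it comes first. This is why the peers need to agree on a single linear order.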

Projects like Ethereum make this protocol programmable. Ethereum adds a computation layer via smart contracts and the Ethereum Virtual Machine. A smart contract can implement a ledger, a voting system, a game, and more — the possibilities are limitless and that’s part of the excitement around blockchain.

Things get more complicated, however, when trying to address privacy for this computation layer. We have two ways of looking at this: either compute in broad daylight (and ensure that the executor of a computation can’t understand what it computes), or compute off-chain and prove we did it right.

These problems have been studied for the past 30 years and involve a variety of topics:

Secure Multiparty Computation (SMPC)

Trusted Execution Environment (TEE)

Computation verification

Fully Homomorphic Encryption (FHE)

Indistinguishability Obfuscation

Zero knowledge proofs

Obfuscating smart contract execution seems ideal but may not be practical. For example, computing a simple zk-SNARK (a type of zero knowledge proof) may take more than a minute. Verifying the proof on the Ethereum mainnet can cost more than 1M gas. At today’s rates, that’s $3 worth of transaction fees.

An FHE public key offering security properties similar to those of a 2048-bit RSA key weighs about 2Gb. The ciphertext is similarly bloated, and handling it becomes a burden.

On paper, the approach that seems to hurt throughput the least would be TEE. Several projects in the space (Enigma, Corda, Sawtooth) chose to make Intel SGX a first-class citizen. Smart contract obfuscation would be achieved by trusting the hardware to hide what it is doing (even from its owner) and to provide an attestation of what program it is executing. Currently, one needs to contact Intel’s servers to verify this attestation. TEE is the approach we’re most excited about, and we are watching to see how it evolves now that vulnerabilities such as cache attacks and Spectre have come up (hardware patch, anyone?).

Current Privacy-Preserving Methods

How practical are privacy-preserving schemes? How do they fare with respect to wire encryption, at-rest encryption, and network and metadata analysis? Are they suitable for private or public networks? I observe the following categories:

Mixing services

The general idea behind mixing services is that if many peers agree to join their inputs and outputs in the same transaction, nobody can tell which input associates with which output. Various protocols implement this idea, including Blindcoin, CoinJoin, CoinShuffle, XIM, and TumbleBit.
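A toy version of the idea, loosely modeled on a CoinJoin-style transaction, is sketched below. It assumes every participant contributes the same denomination and receives it at a fresh address; the function name and addresses are hypothetical, and real protocols add signatures, fees, and protection against a dishonest coordinator.

```python
import math
import random

def build_joint_transaction(participants, denomination, rng=random):
    """Merge equal-valued payments from many peers into one transaction.

    `participants` is a list of (input_address, fresh_output_address) pairs.
    """
    inputs = [(src, denomination) for src, _ in participants]
    outputs = [(dst, denomination) for _, dst in participants]
    rng.shuffle(outputs)  # output order must not reveal who paid whom
    return {"inputs": inputs, "outputs": outputs}

participants = [("in_a", "out_x"), ("in_b", "out_y"), ("in_c", "out_z")]
tx = build_joint_transaction(participants, denomination=1, rng=random.Random(0))

# With equal amounts, every pairing of inputs to outputs is equally plausible.
plausible_mappings = math.factorial(len(participants))
```

With three participants an observer faces six equally plausible mappings. The ambiguity grows factorially with the number of peers, but it collapses if the amounts differ enough to be matched by value.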

Anonymous crypto schemes

These schemes can hide origin, destination, and transaction amounts through cryptographic tools like Zero Knowledge Proofs, Pedersen Commitments, and Ring Signatures. On-chain cryptographic approaches include Zerocash, Mimblewimble, and the Monero protocol.
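As an illustration of how amounts can stay hidden while the ledger remains consistent, here is a toy Pedersen commitment. The parameters are deliberately tiny and insecure, and the second generator is chosen openly, whereas a real scheme needs large parameters and a generator whose discrete logarithm nobody knows. Real systems also add range proofs so that negative amounts cannot be smuggled in. None of this is taken from a specific protocol's code.

```python
# Toy parameters, far too small for real use: P = 2*Q + 1 with P and Q prime.
P = 2039
Q = 1019
G = 4  # generator of the order-Q subgroup (a quadratic residue mod P)
H = 9  # second generator; in a real scheme log_G(H) must be unknown

def commit(value, blinding):
    """Pedersen commitment C = G^value * H^blinding mod P."""
    return (pow(G, value, P) * pow(H, blinding, P)) % P

def product(commitments):
    result = 1
    for c in commitments:
        result = (result * c) % P
    return result

def is_balanced(input_commitments, output_commitments):
    """Check conservation without ever seeing the committed amounts."""
    return product(input_commitments) == product(output_commitments)

# The sender picks blinding factors so that they cancel out across the transaction.
r1, r2, r3 = 123, 456, 789
r4 = (r1 + r2 - r3) % Q

inputs = [commit(30, r1), commit(12, r2)]
outputs = [commit(40, r3), commit(2, r4)]
inflated = [commit(41, r3), commit(2, r4)]  # tries to create one unit from nothing
```

A validator calling is_balanced(inputs, outputs) sees only opaque group elements, yet the check passes because 30 + 12 = 40 + 2. The inflated outputs fail the same check.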

Secure multiparty computation (SMPC)

In an SMPC, a given number of participants, p1, p2, …, pN, each have private data, respectively d1, d2, …, dN. Participants want to compute the value of a public function on that private data, F(d1, d2, …, dN), while keeping their own inputs secret. One approach that can be applied to smart contract execution is to express the computation as a boolean circuit using AND and NOT gates. The circuit would work with encrypted data and produce a result that’s readable by a single party. Techniques and projects under this umbrella include secret sharing schemes, Enigma, Hawk, and Fully Homomorphic Encryption. TrueBit aims for computation correctness via economic incentives, but it doesn’t actually do much for privacy.
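The simplest instance to sketch is additive secret sharing, used here in place of the boolean-circuit approach, with F taken to be the sum of the private inputs. The modulus and function names are illustrative choices, and the sketch assumes honest participants and private channels between them.

```python
import secrets

PRIME = 2**61 - 1  # public modulus; all arithmetic is done mod this prime

def share(secret, n):
    """Split `secret` into n random-looking shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def secure_sum(private_inputs):
    """Compute F(d1, ..., dN) = d1 + ... + dN without revealing any di."""
    n = len(private_inputs)
    # distributed[i][j] is the share participant i sends to participant j.
    distributed = [share(d, n) for d in private_inputs]
    # Each participant j adds up what it received and publishes only that total.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME
        for j in range(n)
    ]
    return sum(partial_sums) % PRIME

total = secure_sum([12, 30, 0])
```

Each participant only ever sees uniformly random shares of the others' inputs, yet the published partial sums add up to the true total. Multiplication, and therefore arbitrary circuits, requires extra interaction between participants, which is where most of the cost of SMPC comes from.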

Off-chain constructs

One way to tackle privacy is to not use the mainnet for confidential transactions. The logic is straightforward: if many parties want to transact privately, they do so in a private network and settle on the mainnet if needed. State channels, plasma, sharding, private consortium networks, or sidechains could fit this description. The computation happens on a separate network and the result ends up on the mainnet. As uPort has pointed out, off-chain constructs also reduce exposure to future attacks. A blockchain is permanent: data encrypted with today’s algorithms may be at risk in a decade!
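The settle-on-mainnet step can be pictured as netting. The sketch below is a simplification with made-up names and payments; it ignores signatures, dispute periods, and everything else a real state channel or sidechain needs. It only shows that the individual payments never have to appear on the public chain.

```python
from collections import defaultdict

def settle(off_chain_payments):
    """Collapse a private payment history into net balance changes."""
    net = defaultdict(int)
    for payer, payee, amount in off_chain_payments:
        net[payer] -= amount
        net[payee] += amount
    # Only non-zero net changes need to be published on the mainnet.
    return {party: delta for party, delta in net.items() if delta != 0}

# Hypothetical private history between two parties.
history = [("alice", "bob", 5), ("bob", "alice", 3), ("alice", "bob", 1)]
settlement = settle(history)
```

Three private payments become a single public update: alice is down 3 and bob is up 3. The sequence and the individual amounts stay off-chain.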

Toward an Anonymous Blockchain Construct

A performant, general, and completely anonymous blockchain construct is yet to be built. The ZKP conference at MIT last week seems like a big step in the right direction. PegaSys, ConsenSys’ protocol engineering group, was excited to take part, and we’ll be sharing our takeaways in the coming days. If you’re interested in collaborating, please reach out to us anytime. We are committed to working with teams and organizations across the space in order to foster collaborative innovation in the Ethereum ecosystem.