Now, for my second major post, I decided to take on something different — rather than an overview of a specific project, I’ll be writing up a conceptual overview of two ideas very near and dear to my heart — privacy and secrecy — and how they relate to terminologies becoming more and more popular: namely, zero-knowledge proofs, ZKSNARKs, trusted execution environments, and multi-party computation (if there is room at the end of the article, I may dip into more exotic systems like homomorphic encryption and indistinguishability obfuscation — but I suspect that will confuse, not edify). Before getting into the meat of the article, I will say that, at the request of one of the people promoting it, I will not be naming any specific projects for the bulk of the article. I should also note that this article will be reviewed by representatives of both the Enigma and Keep projects for technical correctness, and I will make clear where contentful changes are suggested by either party, to keep potential conflicts of interest clear.

Outline:

Part I: Definitions:

Before we get into the individual techniques for privacy and secrecy, we will first take some time to establish definitions for the terms ‘privacy’ and ‘secrecy’, as well as the distinction between data, transactional, and computational privacy/secrecy, and the difference between preservation and creation of security properties. Some may contend that everyone knows what the relevant terms mean — so why establish definitions? For me, the reason is twofold: first, everyone likely has slightly different definitions of these terms, and taking time to set out mine hopefully ensures that arguments about usage and meaning will be minimized. Second, in order to make formal claims, we need a precise idea of what properties we’re looking for. Too many cryptographic schemes have been lost to slight informalities for us to be lax.

Definition I: Privacy is that which is visible, but unreadable without certain information.

In other words, private information is what we’d typically think of when we think of encryption: we know that some message is being transmitted (in extreme cases a message that is either ‘yes’ or ‘no’, bearing only one bit of information), and we can pick up the exact ciphertext (encrypted message), but we cannot read the actual message. In the case of a signature, we know what the exact message to be signed is, but we cannot reconstruct the signature without the key. In short, think of something private as a message inside an obvious sealed envelope — you know exactly where it is, but you can’t read it. More formally, a private message is one where, even knowing its location and form, one cannot distinguish it from any other possible message. In contrast, secrecy obfuscates even the message’s location. Private data is similar.

Definition II: Secrecy is that which cannot even be detected or distinguished from other communications.

In other words, rather than being a sealed envelope, secrecy is more like a message in invisible ink — not only can it not be read, it cannot even be noticed. This is the realm classically known as steganography, rather than encryption. A secret message is often both more and less secure than a private one — it is harder to break since it often cannot be detected, but if its method of sending is discovered, it is often not otherwise protected. In short, think of something secret as a note folded up and hidden — tough to see, but not necessarily hard to read once seen. More formally, a secret message is one where transmissions containing it are outwardly indistinguishable from those not containing it. Secret data, on the other hand, is data whose location on a distributed system or in a single machine is not known.

Definitions III, IV, and V: Data refers to information contained within a system. Computations refer to operations performed on some data and their results, in particular to whether they leak information about that data. Transactions refer to financial or other exchanges.

Data privacy: data is locatable but unreadable.

Data secrecy: data is unlocatable.

Computational privacy: Computations do not leak information about a system’s data, and their results remain secret.

Computational secrecy: Computations cannot be detected/localized in transit — which machines will be performing a computation is unknown/undetectable.

Transactional privacy: the contents of a specific transaction are unreadable, e.g. amount and recipient.

Transactional secrecy: individual transactions cannot be detected as they occur, even by an observer who can see all network back-and-forth.

Definitions VI and VII: Preserving a property means that a given operation will not cause a system to lose that property, while creating a property means adding it to a system that did not previously have it.

For example, symmetric encryption creates the property of data privacy for a system, while cryptographic signatures preserve the property of data privacy for that system.

Part II: ZKPs: Privacy Preserving Mechanisms

Now that we’ve established definitions for the relevant terms, we can proceed to analyze the first of our three technologies: the zero-knowledge proof. What this cryptographic technique allows for is the ability to prove that data has a certain property, without revealing anything else about the data. In particular, a proof in zero knowledge must be simulatable by someone who knows nothing about that data, provided they can control any randomness involved and the data does have the property. Now, this is pretty dry, formal terminology, so I think it may be better to use a metaphor (credit to Boaz Barak, and the wonderful class that was CS 127, for the metaphor). Imagine you are standing in front of a large cave with two entrances. You know that there is a passage between the two entrances, and you’ve made a bet with your friend, standing outside, that you know where it is. However, you want this passage all to yourself — in particular, you don’t want your friend to be able to show it to other people, or to use your evidence to prove the existence of the passage to others. How can you prove there’s a passage without revealing that information?

The ZKP cave

The answer is randomization. You’ll go into the cave, tell your friend to flip coins secretly, and each time come out of the exit of the cave determined by the coin flip. Doing this consistently is only possible with a passage, but a bystander might just assume that you and your friend planned out this ‘randomness’ in advance (under the simulation paradigm, the simulator simply keeps the runs where you happened to enter the same side the coin flip later demanded). More generally, in the real world, a zero-knowledge (interactive) proof looks something like this (actually, this is a protocol for a generalized zero-knowledge proof of knowledge, but the difference is subtle enough to be irrelevant).
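The cave protocol is simple enough to simulate directly. Here’s a quick sketch (the function and variable names are mine, not from any library) showing why repeated rounds make cheating untenable:

```python
import random

def run_rounds(knows_passage, rounds, rng):
    """Simulate the two-entrance cave protocol.

    Each round, the prover walks into a random entrance, then the
    verifier flips a coin and calls out which entrance to exit from.
    A prover who knows the passage can always comply; one who doesn't
    can only comply if they happened to pick the requested side.
    """
    for _ in range(rounds):
        entered = rng.randint(0, 1)    # entrance the prover walked into
        requested = rng.randint(0, 1)  # verifier's coin flip
        if not knows_passage and entered != requested:
            return False               # caught: can't cross without the passage
    return True

rng = random.Random(0)
# An honest prover convinces the verifier every time.
assert run_rounds(True, 20, rng)
# A cheater survives k rounds with probability 2^-k, so 20 rounds make
# successful cheating astronomically unlikely.
assert not all(run_rounds(False, 20, random.Random(seed)) for seed in range(50))
```

Each round halves a cheater’s chance of survival, which is exactly the soundness argument behind the interactive protocol below.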

You want to prove a piece of data D has property A to verifier V. So you follow these steps:

1. Provide some encrypted/hashed/otherwise committing string S.
2. V flips a coin.
3. If V flips heads, V asks you to turn enough of the string into plaintext to verify that S is actually a commitment to D.
4. If V flips tails, V asks you to turn enough of the string into plaintext to verify that the data S is a commitment to has property A.

Note that, since S is fixed, the only way for you to succeed consistently is for D to actually have property A — in other words, passing this test repeatedly is proof that D has A. However, the proof also reveals no additional information about D — since either you’re revealing a string that can’t easily be translated back into D, or you’re just revealing D itself! (There are more nuances needed here to establish zero knowledge — those who would like more information are invited to look up the zero-knowledge protocol for Hamiltonian path, the specific construction this example is based on.)
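To make the commit/coin-flip/reveal pattern concrete, here is a runnable toy built on graph isomorphism rather than Hamiltonian path (isomorphism is a much shorter program, and the protocol shape is identical; all names here are mine). The prover knows a secret relabeling pi between two public graphs and answers either coin without ever exposing pi:

```python
import random

def relabel(edges, perm):
    """Relabel each undirected edge {u, v} as {perm[u], perm[v]}."""
    return frozenset(frozenset({perm[u], perm[v]}) for u, v in map(tuple, edges))

def prove_round(g0, g1, pi, n, rng):
    """One commit/coin-flip/reveal round; returns the verifier's verdict.

    Commitment: a fresh random relabeling h of g0. Heads: reveal the map
    g0 -> h (showing h really is built from g0). Tails: reveal the map
    g1 -> h (showing the committed graph is isomorphic to g1). Either
    reveal alone is a uniformly random permutation, leaking nothing
    about the secret pi -- but answering both coins consistently for a
    fixed commitment requires actually knowing pi.
    """
    sigma = list(range(n))
    rng.shuffle(sigma)
    h = relabel(g0, sigma)                 # the commitment
    if rng.randint(0, 1) == 0:             # heads: open commitment to g0
        return relabel(g0, sigma) == h
    inv_pi = [0] * n                       # tails: map g1 -> g0 -> h
    for i, p in enumerate(pi):
        inv_pi[p] = i
    reveal = [sigma[inv_pi[v]] for v in range(n)]
    return relabel(g1, reveal) == h

rng = random.Random(0)
n = 6
g0 = frozenset(frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)])
pi = list(range(n))
rng.shuffle(pi)                            # the prover's secret isomorphism
g1 = relabel(g0, pi)
assert all(prove_round(g0, g1, pi, n, rng) for _ in range(30))
```

The same fixed-commitment-then-random-challenge skeleton underlies the Hamiltonian path protocol the article describes; only the committed object and the “property” differ.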

So, we have this object, the zero-knowledge proof — what, in our earlier terminology, does it get us? As mentioned in the section header, ZKPs preserve data privacy while revealing specific, chosen properties of that data. In other words, if you have some piece of data, say your date of birth, you can prove you’re old enough to vote, or to drink, without revealing your actual age — and, similarly, you can prove your credit score has a certain value without revealing any of the data that went into that calculation. Additionally, ZKPs enable transactional privacy — two parties can validate that a transaction was valid and not a double-spend without revealing amounts, time of transaction, or recipient addresses. This is, of course, extremely powerful. However, it doesn’t get all the way to creating data privacy, or, indeed, to computational privacy. Do you see why? (Hint: note carefully the qualifiers in the previous paragraph.)

I’ll give you a bit to think.

…

A bit longer.

…

It’s because ZKPs can only preserve the privacy of data that you already have when running your computations. In other words, if you need to send your data to someone else, or to collate large amounts of data to make a computation, then you have a security hole. Of course, the person running the eventual computation can prove their result correct without needing to leak your data to do so, but, at that point, they have your data and can do whatever they want with it. Zero-knowledge proofs, therefore, convert the problem of generalized computational privacy into one of single-node computational privacy, in effect centralizing your security vulnerabilities. A partial solution, but not a perfect one. (ZKPs will also play a role in the next few sections — keep an eye out!)

Part IIb: SNARKs and STARKs: Why do we need a ‘verifier’ in the first place?

Now, ZKPs are all well and good for privacy preservation in a vacuum, but in the real world there are a number of issues with using them as described above. First, the protocol explained above (based, again, on Hamiltonian path) is long — it requires a translation of your desired property and situation into a boolean satisfiability problem, then a translation of that into a graph, and finally a ZKP over the now-massive graph. Second, the protocol is interactive — which means you need verifiers on the network to provide you with randomness anytime you want to produce a proof. Both of these increase the latency of ZKPs beyond what we’d want in a production-ready system. How can we solve these problems, to make a Zero-Knowledge Succinct Non-interactive ARgument of Knowledge (or ZKSNARK)?

Size: Using some pretty clever tricks with polynomial interpolation (the fact that if you know n+1 points on a degree-n polynomial you can recover it) and the fact that RSA preserves multiplication of ciphertexts, you can encode your property as a function of polynomials (via an NP-complete problem known as Quadratic Span Programs) which admit extremely short ZK proofs of roots. (The explanation of this is pretty math-heavy, but https://blog.ethereum.org/2016/12/05/zksnarks-in-a-nutshell/ does a good job of explaining it.)

Interactivity: There are a few ways to make proofs noninteractive. If you want to preserve the randomness, you can have it come from somewhere external to the prover (generally from hashes of external information and the committed proof string) rather than from the verifier — then the verifier just checks that the challenges match that randomness, and never needs to communicate with the prover. Alternatively, you can keep the proof zero-knowledge by adding small amounts of random noise to your polynomials, making all non-property information (in particular, the evidence someone would need to reconstruct your proof) indistinguishable from random. Again, apologies for this getting a little mathier, but check the above link for more info.
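The hash-the-commitment trick is commonly known as the Fiat–Shamir heuristic, and the core of it fits in a few lines. This is a minimal sketch (names are mine), deriving the verifier’s coin flips deterministically from the committed string:

```python
import hashlib

def fiat_shamir_coins(commitment: bytes, rounds: int) -> list:
    """Derive the verifier's coin flips from a hash of the prover's own
    commitment (in practice, public context is hashed in too). Because
    the prover must fix the commitment before learning the challenge,
    no live verifier is needed to supply the randomness."""
    assert rounds <= 256  # one SHA-256 digest yields 256 challenge bits
    digest = hashlib.sha256(commitment).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(rounds)]

coins = fiat_shamir_coins(b"committed proof string", 20)
assert len(coins) == 20 and set(coins) <= {0, 1}
# Anyone can recompute the same challenge, so verification is offline:
assert coins == fiat_shamir_coins(b"committed proof string", 20)
```

Since the challenge is a public function of the commitment, the proof becomes a single message anyone can check after the fact.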

In contrast to SNARKs, STARKs not only preserve noninteractivity, but use a blend of new algorithms for interactive proofs about polynomials and probabilistically-checkable-proof techniques to decrease verifier time massively and proof complexity by a large constant factor — thus solving the latency problem through size and speed, while eliminating the trusted-setup requirement entirely (all of a STARK’s randomness is public). I’ve again left out the majority — well, all — of the math here in favor of giving an intuitive idea. Check Eli Ben-Sasson et al.’s original paper (at https://eprint.iacr.org/2018/046 ) if you want the technical details.

Both of these techniques fix problems with latency and complexity of ZK proofs — however, they do not solve the security problems. Multi-node computations are still just as vulnerable to single-node malice as they were before.

Part III: HSMs and TEEs: Still seemingly privacy preserving: What’s changed?

So, with ZKPs, we’ve entered a world where multi-node data and computational privacy problems have been reduced to their single-node counterparts. At this point, however, it might seem like we’re stuck. Solving single-node privacy would require some way to know not only that the right code was run on a piece of data, but that only that code saw the data, and nothing else — otherwise, leaks are always possible. And that aim seems impossible — we’d need complete knowledge of every process being run on a device, plus knowledge of any physical accesses.

Now that I’ve said this, some of you are probably wondering — why can’t we just build a physically tamperproof device? Of course, people have tried and do try to do this sort of thing — the Hardware Security Module, or HSM, is just that. These devices, found, for instance, in some smart cards, are built to run very specific pieces of code (generally encryption/decryption operations with on-device stored keys, or similar low-effort, high-security computations) with near-perfect tamper protection, against both physical and software attacks. The most secure of these devices are built to resist even environmental attacks like varied temperatures and voltage changes meant to cause component failures. (There’s, as far as I know, no way to be secure against attacks that care less about preserving post-attack system integrity. Freezing the entire system in liquid nitrogen, for example, does a relatively good job of preserving memory states well enough to be read out afterwards, if not in a way that allows you to rebuild the system.) These sorts of systems allow for encryption and decryption in a way that keeps the keys themselves secret even on single nodes — certainly a tremendous step forward.

However, note the qualifier. These systems work for encryption and decryption specifically — general-purpose computing is too complex to run on an HSM. Instead, we have to settle for some environment that attests (makes a verifiable promise) that only certain code is being run on it, while also allowing us to trust that the larger machine it’s a part of doesn’t have access to the data it’s processing. In other words, we need a Trusted Execution Environment — or TEE, for short. The most popular and fully fleshed-out of these TEE projects that I’m aware of is Intel SGX, a specific segment of many new Intel CPUs that supports trusted processing. These TEEs serve, as mentioned, two functions: 1) to attest that the piece of code that was supposed to be run on a given piece of data was, in fact, what was run on it (done by comparison of hashes of system state, as well as a number of other smaller pieces — see https://eprint.iacr.org/2016/086.pdf for an architectural look at that system), and 2) to guarantee that no other code had access to that data, by keeping it in a separate, privileged segment of memory and following specific procedures at startup of the unit, at shutdown, and in the event of CPU faults (again, see https://eprint.iacr.org/2016/086.pdf ). Now, if both of these guarantees were perfect, we’d have our data privacy-preserving computation schema: send encrypted data and your desired computation to a TEE which has stored on it the instructions to decrypt that data with a stored key, then run the computation, then reencrypt and send back the result.
If you’re curious how a system like this can be general purpose, given that a TEE is set with a specific piece of code to run on any piece of data, think about the situation where the code in the TEE is something like ‘compile input 1, run it on input 2’ — this elevates trusted attestation over any given specific algorithm to a general trusted environment (keep this strategy, as well as the decrypt/reencrypt strategy, in mind! It’ll be useful if/when we cover homomorphic encryption later in this piece).
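The measurement-comparison half of attestation can be sketched in a few lines. This is a radical simplification (the names and the “enclave” are hypothetical; real SGX attestation also involves CPU-fused keys and signed quotes), but it shows the hash-of-loaded-code idea:

```python
import hashlib

# The code we expect the enclave to run, and its expected "measurement".
EXPECTED_CODE = "def double(x):\n    return x * 2\n"
EXPECTED_MEASUREMENT = hashlib.sha256(EXPECTED_CODE.encode()).hexdigest()

def verify_attestation(reported_measurement: str) -> bool:
    """Accept a result only if the enclave's reported measurement equals
    the hash of exactly the code we expected it to load. Any change to
    the code -- even one character -- changes the hash completely."""
    return reported_measurement == EXPECTED_MEASUREMENT

# An honest enclave reports the hash of the code it actually loaded:
assert verify_attestation(hashlib.sha256(EXPECTED_CODE.encode()).hexdigest())
# A tampered enclave's measurement won't match:
tampered = "def double(x):\n    return x * 3\n"
assert not verify_attestation(hashlib.sha256(tampered.encode()).hexdigest())
```

What makes the real thing trustworthy is that the measurement is computed and signed by the hardware itself, not self-reported by software the attacker controls.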

Now, anyone who’s familiar with my writing style doubtlessly expects this next sentence. Unfortunately, those two guarantees are not (yet) perfect. To be fair, the attestation-to-correct-code portion of SGX seems, as far as I can tell from a reasonable degree of research, to be inviolate — in other words, if your TEE says it ran program X on your data, it’s almost certainly telling the truth. However, the privacy guarantee is imperfect. The fact that the TEE runs inside a machine and does, to some extent, share memory with that machine allows for a number of cache-level and processor-level attacks, wherein the user of the outer machine sets and unsets various values in memory, tracks which ones the TEE pulls, and uses that to extract information (in theory a developer should be able to write TEE-internal code that is resistant to these sorts of attacks; in practice it is extremely difficult — see https://cdn2.hubspot.net/hubfs/1761386/security-of-intelsgx-key-protection-data-privacy-apps.pdf ). However, these attacks are often somewhat costly to run.

In terms of performance: in theory, TEEs and HSMs require no more communication or computation than normal cloud computing would — in practice, however, there may be slight slowdowns as algorithms are optimized for security rather than speed.

So, in summary: TEEs, were they perfect, would provide us with a fast, general-purpose computing schema that preserved privacy over data — however, there exist subtle flaws within TEEs that allow attackers to pull information out through side channels. In contrast, HSMs are not subject to the same flaws, but are far more limited in application.

Part IV: MPC: Privacy preserving, secrecy creating: why does having multiple parties matter?

So, now we’ve run through ZKPs and TEEs — both of which were secure and preserved privacy to various extents, but both of which fell prey to flaws stemming from the existence of cleartext data in a known location, allowing attackers to extract information. At this point, you might be forgiven for thinking that true privacy preservation is impossible — after all, if you can’t operate on data in the clear at all without it being vulnerable, what can you do?

The answer, of course, is to never assemble the data in the clear in the first place. And thus, we’re brought to the next of our cryptographic schemes for privacy-preserving computation: multi-party computation. MPC as a schema exists in two pieces: 1) the simpler piece: how one divides a secret securely, and 2) the more complex piece: how one operates on a divided secret. We shall cover each in turn.

First, let’s consider how one would divide a secret among many parties privately. There’s, of course, the naive approach of just giving each party a piece of the secret. However, that is imperfectly private — each party has some information about the secret, and the more of them come together, the more information they have. This, obviously, is untenable — we want the secret to be totally unreadable unless enough parties come together. Let us first take the simple case wherein all parties must come together to reconstruct the secret. In this case, encoding the secret as some positive or negative number, there’s a very easy algorithm: for our N parties, give out N-1 random numbers to all but one of them, and give the last person the number that, when added to the sum of those N-1, yields our secret. Since the numbers can be any positive or negative values, even with N-1 of the N values, we have no idea what the secret is until we have the Nth number.
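That N-of-N scheme is a handful of lines in practice. A minimal sketch (names mine; production schemes work modulo a prime so that shares are uniformly distributed, rather than over raw integers):

```python
import random

def split_additive(secret: int, n: int, rng: random.Random) -> list:
    """N-of-N sharing: hand out n - 1 random numbers, plus one final
    share chosen so the whole list sums back to the secret."""
    shares = [rng.randrange(-10**12, 10**12) for _ in range(n - 1)]
    shares.append(secret - sum(shares))
    return shares

rng = random.Random(42)
shares = split_additive(1234, 5, rng)
assert len(shares) == 5
assert sum(shares) == 1234
# Any 4 of the 5 shares are just random-looking numbers; only the full
# set of 5 recovers the secret.
```

Reconstruction is literally `sum(shares)` — which is also why every single party must show up.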

That gets us to N-of-N secret sharing. How do we do secret sharing where the secret can be reconstructed by fewer than N, for redundancy (i.e. so you can reconstruct secrets even if nodes drop off the grid)? The answer lies in what’s called polynomial interpolation. Remember, for example, how two points define a straight line — and how, with only one point, there are infinitely many possible lines? With more points, you can define more complex curves/polynomials — the complexity/degree of the curve determines how many points are needed to completely determine it. Therefore, our secret sharing scheme runs as follows: say we want to share a secret S among 5 people such that any 2 can recover it. We’ll encode S as a number, then generate a random line (since we want any 2 points to recover S) which intersects the y-axis at S. Then we’ll send one point on the line to each person. As long as 2 people share their points, they can get back the original line and thus the secret — but with only one point, there are infinitely many lines.
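Here is that 2-of-5 line scheme as runnable code (names mine). One departure from the prose above: the arithmetic is done in a finite field rather than over the reals, which is what real implementations do so that a single share is information-theoretically useless:

```python
import random

PRIME = 2**61 - 1  # do the arithmetic in a finite field

def share(secret: int, n: int, rng: random.Random) -> list:
    """2-of-n sharing: hand out points on a random line f(x) = secret + slope*x,
    i.e. a random line through (0, secret)."""
    slope = rng.randrange(PRIME)
    return [(x, (secret + slope * x) % PRIME) for x in range(1, n + 1)]

def recover(p1, p2) -> int:
    """Interpolate the line through two shares and read off f(0)."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) * pow(x2 - x1, -1, PRIME) % PRIME  # modular inverse (Python 3.8+)
    return (y1 - slope * x1) % PRIME

pts = share(31337, 5, random.Random(7))
assert recover(pts[0], pts[3]) == 31337  # any two shares suffice...
assert recover(pts[2], pts[4]) == 31337  # ...whichever two they are
```

Raising the threshold to k just means sampling a random degree-(k-1) polynomial instead of a line, and interpolating through k points to recover it.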

So, that gets us secret sharing. How do we operate on this shared data without reconstructing it, and ruining the privacy? Well, if you think about it, you can build any operation from repeated additions and multiplications (assuming you also allow for encoding and decoding of numbers into various data formats) — so all we need to show is that additions and multiplications are possible under this scheme. Additions and constant multiplications are easy — each person just adds/multiplies their own point. However, multiplying two shared values is harder — if each person naively multiplies their shares together, they end up with points on a polynomial of twice the degree, which requires nearly twice as many points to determine and thus to reconstruct the data! This degree reduction step, which if done naively requires people to reshare their derived points back around the network, adds a lot of communication complexity to MPC. (For those of you wondering why we can’t just do everything with repeated additions, note that using repeated additions to do multiplication still reveals the second factor, albeit obliquely.) If you hear people discussing the quadratic (n²) nature of naive MPC algorithms, this degree reduction is where it comes from. Similarly, the modern algorithms for making MPC faster all rely on cheapening this costly multiplication step. First, we have SPDZ, which relies on clever algebra over a number of agreed-upon, pre-shared constants to allow for a one-step multiplication, but induces extra communication time to send out those constants (explained here: https://bristolcrypto.blogspot.com/2016/10/what-is-spdz-part-2-circuit-evaluation.html ).
Second, we have quorum-based protocols, which rely on computing pieces of the data at a time and aggregating results into more computations to keep the cost of individual multiplications small (as they’re performed on smaller networks) — these protocols are faster, but require resharing of the secret to new parties relatively often which introduces new computational costs (see part II of my project here: https://www.boazbarak.org/cs127/Projects/mpc.pdf for more information on the reshare frequency bounds). Also note that, in order to provide attestation to code correctness in the MPC case, these schemes generally also incorporate some form of verification step wherein all the parties prove in zero-knowledge that the code they ran on their inputs produced the correct outputs — this induces additional slowdown (certain protocols also rely on after-the-fact tracing through the inputs and outputs of a random subset of parties, assuming any habitual wrongdoers will be caught). So, all in all, MPC is decidedly slower than our previous two methods.
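The asymmetry between the cheap local operations and the expensive multiplication is easy to see in code. This sketch (names mine) uses additive shares for brevity; polynomial shares hit the analogous degree-doubling problem described above:

```python
import random

P = 2**61 - 1  # all share arithmetic is mod a prime

def split(secret, n, rng):
    """Additive secret sharing mod P."""
    shares = [rng.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

rng = random.Random(1)
a = split(20, 3, rng)  # shares of 20, one per party
b = split(22, 3, rng)  # shares of 22, one per party

# Addition is local: each party adds its own two shares, no messages sent.
assert sum((x + y) % P for x, y in zip(a, b)) % P == 42
# Multiplying by a public constant is also local.
assert sum(3 * x % P for x in a) % P == 60
# But share-times-share is NOT local: the sum of products of shares bears
# no relation to the product of the secrets, which is exactly why MPC
# protocols need an interactive multiplication/degree-reduction step --
# and where the quadratic communication cost comes from.
assert sum(x * y % P for x, y in zip(a, b)) % P != 20 * 22
```

Every party touching only its own row of numbers is the whole point: the cleartext values 20 and 22 never exist anywhere during the computation.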

However, note what this schema gives us: assuming enough parties are honest, we have private computation at scale. MPC gives us the ability to guarantee the privacy of data even after it’s been computed on. Now, it should be noted that even this schema is imperfect — if you share data among N parties and the number of dishonest ones reaches the reconstruction threshold, they can collude to recover the secret. While you might think it should be possible to hide from those parties that they even hold shares of the same secret (say, by anonymizing all communication in the protocol), it turns out such covert MPC is impossible. The proof is a little involved, but in short, it is always possible for adversaries to exchange enough information to identify each other (see part I of https://www.boazbarak.org/cs127/Projects/mpc.pdf for more information on this result and some corollaries).

In summary, MPC, at the cost of speed, gives us near-perfect generalized, privacy-preserving computation. MPC systems allow us all the capabilities of typical computing, without any of that pesky need to hold data in plaintext.

Part V: Conclusions:

Reader, let me hazard a guess as to what you’re currently thinking: ‘that was a lot of information — is there some succinct takeaway I should be getting out of this besides system basics?’ Well, while system basics are important, there is something short I want to end with — two things, in fact. First, being precise about your security matters: without precision in our definitions, and without distinguishing between data and transactional privacy, it wouldn’t have been anywhere near as simple to draw the lines between ZKPs, TEEs, and MPC.

Second, and more important: there is no silver bullet. Allow me to repeat that. There is no silver bullet. No one answer, no one technique, completely solves the privacy problem. Some provide partial solutions, like ZKPs. Others provide varying levels of speed at the expense of security or vice versa, like the comparison of TEEs to MPC. Still others sacrifice applicability for speed and security, like HSMs. Instead, what is needed is a mix of various techniques in breadth, allowing people to make tradeoffs between speed and security (like SGX vs MPC), and the layering of techniques in depth, so that a message remains secure from start to finish (e.g. using ZKPs to reduce the problem of computation to a single node, then using MPC or SGX to ensure the data is private on that node). Only then can we achieve meaningful privacy.

PS/Things not covered in this article:

Differential privacy: what if computations leak information?

Homomorphic encryption: what if I don’t want adversaries to have any way to break my privacy?

Indistinguishability obfuscation: what if I want the actual algorithms behind my computations to be secret? (Also: witness encryption, deniability encryption, and a whole host of other toys.)