On Monday, MIT hosted a daylong workshop on big data and privacy, co-sponsored by the White House as part of a 90-day review of data privacy policy that President Barack Obama announced in a Jan. 17 speech on U.S. intelligence gathering.



White House Counselor John Podesta, grounded by snow in Washington, delivered his keynote address and took questions over the phone. But Secretary of Commerce Penny Pritzker was on hand, as were MIT President L. Rafael Reif and a host of computer scientists from MIT, Harvard University, and Microsoft Research, who spoke about the technical challenges of protecting privacy in big data sets.



In his brief opening remarks, Reif mentioned the promise of big data and the difficulties that managing it responsibly poses, and he offered the example of MIT’s online-learning initiative, MITx, to illustrate both. “We want to study the huge quantities of data about how MITx students interact with our digital courses,” he said. “We want to measure what really works. We want to use what we learn to improve the way we teach — and to advance the science of teaching overall.”



But, Reif said, the question of how to protect the privacy of MITx students runs into difficulty right out of the gate. “MITx student data are governed by the Family Educational Rights and Privacy Act, or FERPA,” he said. But who counts as an MITx student? “Those who register, but never view course content? Those who view about half of the course content? Those who explore the course deeply, but don’t take the final exam? Or only those who actually earn a certificate?”



Podesta emphasized the history of privacy protection in the U.S., particularly the principles that undergird the Privacy Act of 1974. Those principles, he said, had been “refined” by the Consumer Privacy Bill of Rights that the White House presented in 2012 — whose development was led by one of the workshop’s organizers, Daniel Weitzner, now a research scientist in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). In 2011 and 2012, Weitzner served as the nation’s deputy chief technology officer.



The power of data



Of fundamental concern in the era of big data, Podesta said, is the shift from “predicated analysis of data — that is, using data to find something we already know that we’re looking for — to nonpredicated, or pattern-based, searches — using data to find patterns that reveal new insights.”



That concern came into sharper relief during the question-and-answer session, when Podesta was asked whether nonpredicated analysis was at odds with the Fourth Amendment, which requires probable cause for search and seizure.



“That’s what this study is trying to accomplish,” Podesta replied, “which is to take on board the effects of new technology and the questions about whether there is something new and different with respect to the jurisprudence around the Fourth Amendment that has developed over the years and that challenges some of the decisions that have been rendered by the Supreme Court.”



The next speaker was Cynthia Dwork, a distinguished scientist at Microsoft Research and a pioneer of “differential privacy,” the most mathematically rigorous notion of data privacy. The challenge that differential privacy is intended to meet, Dwork explained, is illustrated by a 2008 study by a pair of University of Texas researchers who analyzed the supposedly anonymous customer data released in support of a competition that sought to improve Netflix’s movie-recommendation engine. The UT researchers showed that correlating the rental dates of only three Netflix movies with the dates of posts on the Internet Movie Database was, on average, enough to uniquely identify a user in the data set.



Differential privacy proposes that a computation performed on a database — determining, say, the percentage of fans of “The Godfather” who also liked “Goodfellas” — is privacy-preserving if it yields virtually the same result whether or not the database contains any one person. That definition has led to the development of techniques that can trade some computational accuracy for an arbitrarily small difference in the results of the computation — a difference designated “epsilon.”
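
To make the trade-off concrete, here is a minimal sketch of the Laplace mechanism, one standard way of achieving differential privacy for counting queries. It is an illustration, not something presented at the workshop. The function names, the record layout, and the choice of a count rather than a percentage are assumptions made for the example. Adding or removing one person changes a count by at most 1, so noise drawn with scale 1/epsilon is enough to mask any individual's presence.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Assumption: each person contributes exactly one record, so the
    count has sensitivity 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical viewing records, invented for this sketch.
    records = [
        {"godfather": True, "goodfellas": True},
        {"godfather": True, "goodfellas": False},
        {"godfather": True, "goodfellas": True},
    ]
    liked_both = lambda r: r["godfather"] and r["goodfellas"]
    # A smaller epsilon means stronger privacy and a noisier answer.
    print(private_count(records, liked_both, epsilon=0.5, rng=rng))
```

Shrinking epsilon widens the noise, which is the accuracy cost the definition asks an analyst to accept in exchange for privacy.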



The question, Dwork explained, is how small this difference needs to be. That’s something the public needs to decide for itself, she argued, although she did make a few policy recommendations: that anyone who publishes a data analysis should also publish its epsilon, and that anyone who publishes data with an epsilon of infinity, which provides no privacy guarantee at all and can leave anyone in the data set identifiable, should be fined.



The limitations of privacy



The rest of the workshop’s technical discussions were divided between two panels. The first convened five CSAIL researchers whose work touches on the analysis of large data sets. John Guttag, the Dugald C. Jackson Professor of Computer Science and Engineering, who researches algorithms for finding diagnostically useful patterns in medical data, set the tone when he said, “People think the public fears loss of privacy. I have my real doubts about some of this. I think most people actually fear death, or death of a loved one, more than they do loss of privacy.”



Guttag described a three-year research project undertaken by his graduate student Jenna Wiens, who developed an algorithm that could identify patients at risk for bacterial infection from a variety of data collected by hospitals. “This work could not have been done with de-identified data,” Guttag said. “You need to know the home ZIP code of the patient: That turns out to be an important factor [along with] the room the patient was in, who else was in that room, who was in that room before the patient.”



None of the other panelists were as outspoken as Guttag, but Manolis Kellis, an associate professor of computer science, explained that the biological pathways that lead from particular genetic variations to incidence of disease are so complex that researchers, like him, who are trying to identify them require a huge amount of data to filter out all the noise. Indeed, he argued, the correlation between the volume of available genomic data and the pace of biological discovery is sharply inflected: Up to a certain point, adding more data yields little new insight, but beyond that point, the rate of useful discovery is exponential.



The importance of trust



Pritzker opened the second half of the workshop. “The American economy has always been grounded in the free flow of data and information,” she said. “I know the power of commerce data firsthand: I used Census Bureau data to launch my first business 25 years ago. My team needed to know the right places to build senior living centers, and the Census Bureau was critical to our decision-making.”



Pritzker cited a report by the McKinsey management consultancy that examined the economic potential of open data in seven industries: education, transportation, consumer products, electricity, oil and gas, health care, and consumer finance. “McKinsey’s analysis showed that open data in these sectors could help unlock $3 trillion in additional value to the global economy,” she said. “And yet, all of this potential hinges on one thing: trust.”



The researchers on the second panel discussed some of the mechanisms that might help secure that trust. The first three speakers — Shafi Goldwasser, Nickolai Zeldovich, and Vinod Vaikuntanathan, all professors in MIT’s Department of Electrical Engineering and Computer Science — discussed cryptographic schemes that enable remote servers to perform computations on encrypted data without actually decrypting it.



Zeldovich described the simplest but, as yet, most practical version of the idea, in which data is wrapped in successively more complex layers of encryption, each of which permits a different type of computation. Depending on the type of query a server is to perform, it can peel away layers of encryption until it arrives at data it can compute on. In experiments, Zeldovich said, this system increased the transaction time of database queries by a relatively modest 30 percent.
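
The layering idea can be sketched in a few lines. The toy below is not the system Zeldovich described and is not secure enough for real use. It assumes two layers only: an inner deterministic layer, under which equal values produce equal ciphertexts so a server can test equality, and an outer randomized layer that hides even that. All function names and the SHA-256 keystream construction are assumptions made for the illustration.

```python
import hashlib
import hmac
import os


def _keystream(key, nonce, length):
    """Expand key and nonce into `length` bytes with SHA-256 in counter mode."""
    blocks = [
        hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        for counter in range((length + 31) // 32)
    ]
    return b"".join(blocks)[:length]


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


def det_encrypt(key, plaintext):
    """Inner layer: nonce derived from the plaintext, so equal inputs match."""
    nonce = hmac.new(key, plaintext, hashlib.sha256).digest()[:16]
    return nonce + _xor(plaintext, _keystream(key, nonce, len(plaintext)))


def rnd_encrypt(key, plaintext):
    """Outer layer: fresh random nonce, so equal inputs look unrelated."""
    nonce = os.urandom(16)
    return nonce + _xor(plaintext, _keystream(key, nonce, len(plaintext)))


def decrypt(key, ciphertext):
    """Remove one layer (works for either layer, given that layer's key)."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    return _xor(body, _keystream(key, nonce, len(body)))


def wrap(det_key, rnd_key, value):
    """Build the two-layer 'onion' that would be stored on the server."""
    return rnd_encrypt(rnd_key, det_encrypt(det_key, value))


def peel(rnd_key, onion):
    """Strip the outer layer, exposing ciphertexts that support equality tests."""
    return decrypt(rnd_key, onion)
```

In this sketch, a client would hand the server only the outer key when a query needs an equality comparison. The server can then match peeled ciphertexts against each other, but without the inner key it still cannot read the underlying values.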



Goldwasser described a more complex system in which multiple servers — those of various federal agencies and private hospitals in one example; those of financial institutions and a government watchdog in another — exchange encrypted information. Her group, she explained, has proven that as long as a majority of the participants are honest, such “multiparty computing” schemes can allow each server to specify just the data that it wants to release to the others, without inadvertently leaking information it hopes to protect.
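
A much simpler relative of such schemes is additive secret sharing, sketched below for computing a joint sum. It is a textbook illustration, not the protocol Goldwasser's group analyzed, and it assumes every party follows the protocol honestly. The modulus, the function names, and the single-process simulation of several parties are assumptions made for the example.

```python
import secrets

# Public modulus agreed on by all parties (an assumption of this sketch).
PRIME = 2**61 - 1


def share(secret, n_parties):
    """Split `secret` into n random-looking shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def secure_sum(private_inputs):
    """Simulate parties jointly computing the sum of their private inputs.

    distributed[i][j] is the share of party i's input that is sent to
    party j. Each party sees only one share of every other input.
    """
    n = len(private_inputs)
    distributed = [share(value, n) for value in private_inputs]
    # Each party j adds up the shares it received and publishes the subtotal.
    subtotals = [
        sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # The published subtotals reveal the overall sum and nothing else.
    return sum(subtotals) % PRIME
```

Any single share is uniformly random, so a party learns the final total without seeing another party's individual input.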



Vaikuntanathan reviewed recent progress on homomorphic encryption, in which a user would send encrypted data to a server that would, without decrypting it, process it and send back a still-encrypted result. Special-purpose homomorphic-encryption schemes have been developed, but Vaikuntanathan described his own group’s research toward the elusive goal of a practical scheme that would allow the server to execute any algorithm at all on the encrypted data.
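
As an example of a special-purpose scheme, the sketch below implements a toy version of the Paillier cryptosystem, which lets a server add two encrypted numbers without decrypting them. It supports addition only, so it is far from the general-purpose goal described above, and it is not taken from Vaikuntanathan's research. The primes are far too small for real security and, like the function names, are assumptions chosen for the illustration. It requires Python 3.8 or later for modular inverses via `pow`.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Return (public modulus n, private key). Toy-sized primes only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under the public modulus n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With generator g = n + 1, g**message mod n^2 equals 1 + message * n.
    return ((1 + message * n) % n_sq) * pow(r, n, n_sq) % n_sq


def add_encrypted(n, c1, c2):
    """Server-side step: multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n) * mu % n
```

A client could encrypt 20 and 22, send both ciphertexts to a server that calls `add_encrypted`, and then decrypt the returned ciphertext to recover their sum, with the server never seeing either number.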



In the end, the day’s discussions may not have yielded complete answers to the weighty questions that Podesta raised in opening the conference. But they did provide ample evidence of why MIT has been, as he put it, “the cradle for so many game-changing technologies.”