Can a set of equations keep U.S. census data private?

The U.S. Census Bureau is making waves among social scientists with what it calls a “sea change” in how it plans to safeguard the confidentiality of data it releases from the decennial census.

The agency announced in September 2018 that it will apply a mathematical concept called differential privacy to its release of 2020 census data after conducting experiments that suggest current approaches can’t assure confidentiality. But critics of the new policy believe the Census Bureau is moving too quickly to fix a system that isn’t broken. They also fear the changes will degrade the quality of the information used by thousands of researchers, businesses, and government agencies.

The move has implications that extend far beyond the research community. Proponents of differential privacy say a fierce, ongoing legal battle over plans to add a citizenship question to the 2020 census has only underscored the need to assure people that the government will protect their privacy.

A noisy conflict

The Census Bureau’s job is to collect, analyze, and disseminate useful information about the U.S. population. And there’s a lot of it: The agency generated some 7.8 billion statistics about the 308 million people counted in the 2010 census, for example.

At the same time, the bureau is prohibited by law from releasing any information for which “the data furnished by any particular establishment or individual … can be identified.”

Once upon a time, meeting that requirement meant simply removing the names and addresses of respondents. Over the past several decades, however, census officials have developed a bag of statistical tricks aimed at providing additional protection without undermining the quality of the data.

Such perturbations, also known as injecting noise, are meant to foil attempts to reidentify individuals by combining census data with other publicly available information, such as credit reports, voter registration rolls, and property records. But preventing reidentification has grown more challenging with the advent of ever-more-powerful computational tools capable of stripping away privacy.

Census officials now believe those ad hoc methods are no longer good enough to satisfy the law. “The problem is real, and it has moved from a concern to an issue,” says John Thompson, who stepped down as census director in June 2017, and who recently retired as head of the Council of Professional Associations on Federal Statistics in Arlington, Virginia. “In Census Bureau lingo, that means it’s no longer simply a risk, but rather something you have to deal with.”

The agency’s decision to adopt differential privacy was spurred, in part, by recent work on what is known as the “database reconstruction theorem.” The theorem shows that, given access to a sufficiently large amount of information, someone can reconstruct underlying databases and, in theory, identify individuals.
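The theorem's core idea can be shown with a toy example (the block, its statistics, and all numbers here are invented for illustration; real attacks solve vastly larger systems of constraints): given a few published summary statistics about a small census block, a program can enumerate every database that is consistent with them, narrowing down who lives there.

```python
from itertools import combinations_with_replacement, product

# Invented "published" statistics for a toy 3-person census block:
# mean age 44, median age 30, and exactly one resident is female.
PUBLISHED = {"n": 3, "mean_age": 44, "median_age": 30, "n_female": 1}

def consistent(ages, sexes):
    """Check whether a candidate set of records reproduces the published stats."""
    return (sum(ages) / len(ages) == PUBLISHED["mean_age"]
            and sorted(ages)[1] == PUBLISHED["median_age"]
            and sexes.count("F") == PUBLISHED["n_female"])

def reconstruct():
    """Enumerate every candidate database consistent with the published tables."""
    hits = []
    # combinations_with_replacement yields age triples in sorted order
    for ages in combinations_with_replacement(range(101), 3):
        if sum(ages) != PUBLISHED["n"] * PUBLISHED["mean_age"]:
            continue  # prune on the mean before checking the other statistics
        for sexes in product("MF", repeat=3):
            if consistent(list(ages), list(sexes)):
                hits.append((ages, sexes))
    return hits

solutions = reconstruct()
print(len(solutions), "databases match the published statistics")
```

In this toy case 87 candidate databases survive; each additional published table would prune the list further, which is why releasing "a sufficiently large amount of information" eventually pins down the underlying records.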

“Database reconstruction theorem is the death knell for traditional [data] publication systems from confidential sources,” says John Abowd, chief scientist and associate director for research at the Census Bureau, located in Suitland, Maryland. “It exposes a vulnerability that we were not designing our systems to address,” says Abowd, who has spearheaded the agency’s efforts to adopt differential privacy.

But some users of census data strongly disagree. Steven Ruggles, a population historian at the University of Minnesota in Minneapolis, is leading the charge against the new policy.

Ruggles says traditional methods have successfully prevented any identity disclosures and, thus, there’s no urgency to do more. If the Census Bureau is hell-bent on imposing differential privacy, he adds, officials should work with the community to iron out the kinks before applying it to the 2020 census and its smaller cousin, the American Community Survey.

“Differential privacy goes above and beyond what is necessary to keep data safe under census law and precedent,” says Ruggles, who also manages a university-based social research institute that disseminates census data. “This is not the time to impose arbitrary and burdensome new rules that will sharply restrict or eliminate access to the nation’s core data sources.”

“My central concern about differential privacy is that it’s a blunt instrument,” he adds. “If you want to provide the same level of protection against reidentification that current methods do, you’re going to have to do a lot more damage to the data than is done now.”

Ways to protect confidentiality

Protecting confidentiality has been a priority for the Census Bureau for most—but not all—of its existence. After the first U.S. census was conducted in 1790, officials posted the results so that residents could correct errors. But in 1850, the interior secretary decreed that the returns would be kept confidential. They were “not to be used in any way to the gratification of curiosity and census officials,” or “the exposure of any man’s business or pursuits,” notes an official history of the census published in 1900. In 1954 the agency’s confidentiality mandate was codified in Title 13 of the U.S. Code.

Publicly available census data come in two flavors. One type, called small-area data, provides the basic characteristics of residents—age, sex, and race/ethnicity—down to the census block level. A census block, often the size of a city block, is the smallest geographic area for which data are reported. There were some 11 million blocks in 2010, of which 6.3 million were inhabited.

The second type, microdata, comprises the full records the Census Bureau collects on individuals—including, for example, the size of the household and the relationships among its residents. When microdata are reported, they are lumped together by areas containing at least 100,000 people.

Together, these census products provide fodder for thousands of researchers. Census data are also the basis for surveys by other government agencies and the private sector that shape decisions ranging from locating new factories or shopping malls to building new roads and schools.

The Census Bureau has used a variety of methods to preserve the confidentiality of these data as it moved from print to magnetic tape to digital distribution. Officials can, for instance, mask the responses of outliers—such as the income of a billionaire. They can also be less precise, for example, by reporting ages within 5-year ranges rather than a single year. Another technique involves swapping information with a respondent possessing many similar characteristics who lives in a different block.
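The techniques described above are easy to sketch in code (these are generic illustrations of top-coding, coarsening, and swapping, not the bureau's actual rules; the cap, bin width, and record layout are invented):

```python
def top_code(income, cap=500_000):
    """Mask an outlier by reporting any value above a cap as the cap itself."""
    return min(income, cap)

def bin_age(age, width=5):
    """Coarsen an exact age into a 5-year range."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def swap_blocks(records, i, j):
    """Swap the geographic identifiers of two similar respondents,
    leaving their other attributes in place."""
    out = [dict(r) for r in records]  # copy so the originals are untouched
    out[i]["block"], out[j]["block"] = out[j]["block"], out[i]["block"]
    return out

print(top_code(2_000_000))  # a billionaire's income is reported as the cap
print(bin_age(37))          # an exact age is reported as "35-39"
```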

How much noise to inject depends on many factors. However, census officials have never disclosed details of their formula or said how often a particular method is used. They fear that such information could help someone to reverse engineer the process.

A mathematical approach

Differential privacy, first described in 2006, isn’t a substitute for swapping and other ways to perturb the data. Rather, it allows someone—in this case, the Census Bureau—to measure the likelihood that enough information will “leak” from a public data set to open the door to reconstruction.

“Any time you release a statistic, you’re leaking something,” explains Jerry Reiter, a professor of statistics at Duke University in Durham, North Carolina, who has worked on differential privacy as a consultant with the Census Bureau. “The only way to absolutely ensure confidentiality is to release no data. So the question is, how much risk is OK? Differential privacy allows you to put a boundary” on that risk.

A data release is differentially private if what it reveals about any individual is essentially the same whether or not that person's record is in the database. Differential privacy was originally designed for situations in which outsiders make a series of queries to extract information from a database. In that scenario, each query consumes a little bit of what experts call a “privacy budget.” Once that budget is exhausted, further queries are refused in order to prevent database reconstruction.
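That query scenario can be sketched as a simple bookkeeping class (a minimal illustration using basic sequential composition, under which the epsilons of successive queries add up; the class and numbers are invented):

```python
class PrivacyBudget:
    """Track a total epsilon budget across queries.

    Under sequential composition, each query's epsilon is deducted from
    the total; once the budget is spent, no further queries are answered.
    """

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        """Deduct a query's privacy cost, refusing the query if it overspends."""
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted; query refused")
        self.remaining -= epsilon
        return self.remaining

budget = PrivacyBudget(1.0)
budget.spend(0.4)  # first query is answered
budget.spend(0.4)  # second query is answered
# A third spend(0.4) would now raise: only ~0.2 of the budget remains.
```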

In the case of census data, however, the agency has already decided what information it will release, and the number of queries is unlimited. So its challenge is to calculate how much the data must be perturbed to prevent reconstruction.

Abowd says the privacy budget “can be set at wherever the agency thinks is appropriate.” A low budget increases privacy with a corresponding loss of accuracy, whereas a high budget reveals more information with less protection. The mathematical parameter is called epsilon; Reiter likens setting epsilon to “turning a knob.” And epsilon can be fine-tuned: Data deemed especially sensitive can receive more protection.
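The “knob” can be made concrete with the Laplace mechanism, the textbook way to release a differentially private count (a generic sketch, not the bureau's production algorithm): noise is drawn with scale 1/epsilon, so turning epsilon down adds more noise and more protection, while turning it up yields a more accurate but leakier answer.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon, rng):
    """Release a count under epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
# Turning the knob: epsilon = 10 stays close to the true count of 1000,
# while epsilon = 0.1 scatters the released value much more widely.
print(noisy_count(1000, epsilon=10.0, rng=rng))
print(noisy_count(1000, epsilon=0.1, rng=rng))
```

Because the noise averages out to zero, the released counts remain unbiased; what epsilon controls is how far any single released value can stray from the truth.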

The value of epsilon can be made public, along with the equations used to apply it. In contrast, Abowd says, traditional approaches to limiting disclosure are “fundamentally dishonest” from a scientific perspective because of their underlying uncertainty. “At the moment,” he says, the public doesn’t “know the global disclosure risk. … That’s because the agency doesn’t tell you everything it did to the data before releasing it.”

A simulated attack

A professor of labor economics at Cornell University, Abowd first learned that traditional procedures to limit disclosure were vulnerable—and that algorithms existed to quantify the risk—at a 2005 conference on privacy attended mainly by cryptographers and computer scientists. “We were speaking different languages, and there was no Rosetta Stone,” he says.

He took on the challenge of finding common ground. In 2008, building on a long relationship with the Census Bureau, he and a team at Cornell created the first application of differential privacy to a census product. It is a web-based tool, called OnTheMap, that shows where people work and live.

Abowd took leave from Cornell to join the Census Bureau in June 2016, and one of his first moves was to test the vulnerability of the 2010 census data to an outside attack. The goal was to see how well a census team could reconstruct individual records from the thousands of tables the agency had published—and then try to identify those individuals.

The three-step process required substantial computing power. First, the researchers reconstructed records for individuals—say, a 55-year-old Hispanic woman—by mining the aggregated census tables. Then, they tried to match the reconstructed individuals to even more detailed census block records (that still lacked names or addresses); they found “putative matches” about half the time.

Finally, they compared the putative matches to commercially available credit databases in hopes of attaching a name to a particular record. Even if they could, however, the team didn’t know whether they had actually found the right person.

Abowd won’t say what proportion of the putative matches appeared to be correct. (He says a forthcoming paper will contain the ratio, which he calls “the amount of uncertainty an attacker would have once they claim to have reidentified a person from the public data.”) Although one of Abowd’s recent papers notes that “the risk of re-identification is small,” he believes the experiment proved reidentification “can be done.” And that, he says, “is a strong motivation for moving to differential privacy.”

Too far, too fast?

Such arguments haven’t convinced Ruggles and other social scientists opposed to applying differential privacy to the 2020 census. They are circulating manuscripts that question the significance of the census reconstruction exercise and that call on the agency to delay and change its plan.

Last month they had their first public opportunity to express their opposition during a meeting at census headquarters of the Federal Economic Statistics Advisory Committee (FESAC), which advises the Census Bureau and two other major federal statistical agencies. Abowd and Ruggles went toe to toe during a panel discussion on differential privacy, and council members had a chance to quiz them.

One point of disagreement is the interpretation of federal law. Title 13 requires the agency to mask only the identity of individuals, critics argue, not their characteristics. If revealing characteristics were illegal, Ruggles writes in a recent paper, then “virtually all Census Bureau microdata and small-area products currently fail to meet that standard.”

Abowd reads the law differently. “Steve has gotten it wrong,” he says flatly. “The statute says that what is prohibited is releasing the data in an identifiable way.”

At the meeting, several members of the advisory committee peppered Abowd with questions about the significance of being able to reconstruct 50% of microdata files. That percentage is rather low, they argue. In any event, they say, reconstruction is a far cry from reidentification, which is what the law prohibits. They also wondered why anyone would go to the trouble of messing with census data when there are other, better ways to obtain scads of personal information that can be used to identify individuals.

“I’m not surprised that someone has reconstructed the fact that there are 45-year-old white men living in a particular block,” said Colm O’Muircheartaigh, a professor of public policy at the University of Chicago in Illinois and a member of FESAC. “But that kind of information is neither very interesting nor useful.”

Identifying individuals based on household data might be more valuable, he said. “But I imagine it would be much harder to reconstruct a household,” O’Muircheartaigh said. “And even if we could, reconstructing a typical American household—say, two adults and two children—would hardly be a killer identification.”

Census data also don’t age well because of high mobility rates, he added. “These are static data,” he said. “Even if you knew that such and such a person lived somewhere in 2010, how valuable would that be in 2014 or 2018?”

Some meeting attendees also accused Abowd of failing to address the practical effects of applying differential privacy. One skeptic was Kirk Wolter, chief statistician for NORC at the University of Chicago, a research institution that does survey work for many federal agencies. He argued that noisier census data would have a major ripple effect, degrading the quality of many other surveys that rely on census data to select their samples. “These surveys provide the information infrastructure for the country,” he noted. “And all of them would suffer.”

Correcting for those problems will cost money, he predicted, with organizations like NORC having to adjust samples and redesign surveys. And given the tight budgets of most survey research organizations, those costs could translate into fewer studies—and less information about the country’s residents.

Thompson agrees. “Kirk is exactly right,” he says. Applying differential privacy means “those surveys will take longer and cost more. And they may be less accurate. But you don’t have a choice.”

The citizenship elephant

Proponents of adopting differential privacy say there is also another compelling reason to move forward quickly: a controversial decision made last March by Commerce Secretary Wilbur Ross to add a citizenship question to the 2020 census.

A slew of local and state officials have joined civil rights groups in suing the federal government in a bid to block the question. They argue that adding the question will lead noncitizens and other vulnerable populations to avoid filling out the census form, leading to a significant undercount. And they are worried about privacy, too. Knowing how someone answered the citizenship question, critics say, would allow a government agency to take punitive action against noncitizens.

“Maybe a researcher wouldn’t try to do that,” says Thompson, a witness for the plaintiffs in one of the suits. “But there are a lot of people who might. And I think that [federal immigration officials] would love to have that information.”

Abowd knows the extreme sensitivity of the citizenship question. His emails last year to Ross expressing reservations about adding it to the 2020 census have been publicly revealed by the litigation. And although he tiptoed around the topic during the recent FESAC discussion, it was clear that he was worried about the damage it could wreak on the agency’s credibility.

“The entire history of traditional disclosure limitation was aimed at preventing attackers, armed with external data, from using it in combination with the variables on the [census] microdata file to attach a name and address,” Abowd said during the roundtable. “With regard to 2010, most of those databases did not have race and ethnicity on them. And none have citizenship, to just bring into the room the variable that we probably should be discussing more explicitly.”

Practical issues

Ruggles, meanwhile, has spent a lot of time thinking about the kinds of problems differential privacy might create. His Minnesota institute, for instance, disseminates data from the Census Bureau and 105 other national statistical agencies to 176,000 users. And he fears differential privacy will put a serious crimp in that flow of information.

In the most extreme scenario, he says, the Census Bureau could decide to make 2020 census data available only through its network of 29 secure Federal Statistical Research Data Centers. That would impose serious hardships on users, Ruggles says, because the centers require users to obtain a security clearance, which often involves lengthy waiting periods. Such rules could also prevent most international scholars from using the centers, he says, as well as graduate students seeking a quick turnaround for a dissertation. In addition, researchers are cleared only if their project is deemed to benefit the agency’s mission.

There are also questions of capacity and accessibility. The centers require users to do all their work onsite, so researchers would have to travel, and the centers offer fewer than 300 workstations in total.

Thompson says the Census Bureau needs to address those issues regardless of whether it adopts differential privacy. He agrees with Ruggles that it takes too long to gain access to the research centers, and he thinks the bureau needs to change its definition of what research serves its mission. “I have argued that anyone advancing the science of using data” should be eligible, he says. “We need a 21st-century Census Bureau, and that will take a lot of fixing.”

(With regard to access, Abowd says the agency is considering setting up “virtual” centers that would allow a much broader audience to work with the data. But Ruggles is skeptical that such a system would satisfy the bureau’s own definition of confidentiality.)

A need to communicate

Abowd has said, “The deployment of differential privacy within the Census Bureau marks a sea change for the way that official statistics are produced and published.” And Ruggles agrees. But he says the agency hasn’t done enough to equip researchers with the maps and tools needed to navigate the uncharted waters.

“It’s pretty clear we are going to have a new methodology,” Ruggles concedes. “But I think it could be implemented in a better or worse way. I would like them to consider the trade-offs, and not take such an absolutist stand on the risks.”

Meanwhile, NORC’s Wolter says regardless of whether his concerns are addressed, the bureau must do more outreach—and not just in peer-reviewed journals. “Census badly needs a communications strategy, by real communications specialists,” he said. “There are thousands of users [of census data] who won’t understand any of this stuff. And they need to know what is going to happen.”

Clarification, 17 January 2019, 5:00 p.m.: The first quote from John Abowd in the story has been revised to make it clear that the Census Bureau is now addressing the vulnerability of census data to reidentification.