At some point in your life you've probably been asked to take out a #2 pencil and fill in a series of numbered ovals. This method for gathering standardized data is widely used in elections, tests, and surveys, and it's generally considered to be anonymous: if you don't put your name at the top, you don't expect your answers to be traced back to you.

New research from Princeton University calls that assumption into question. A team led by computer science professor (and current Chief Technologist of the Federal Trade Commission) Ed Felten has demonstrated software techniques for re-identifying respondents using only images of their filled-in bubbles. The technique has both benign uses, such as detecting cheating on standardized tests, and malicious ones, such as undermining the secret ballot.

Co-author Will Clarkson described the group's findings in a Tuesday blog post. The researchers obtained copies of surveys completed by 92 different high school students, which they scanned with a high-resolution scanner. A labeled subset of the bubbles, 12 from each respondent, was used to train a classifier built from a combination of machine learning techniques described in the research paper.

This classifier was then given the remaining bubbles, 8 from each respondent, and asked to re-identify their authors. It was surprisingly accurate: it picked the right respondent (out of 92 candidates) on the first try more than half the time, and the correct answer appeared on its top-ten list more than 90 percent of the time.
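The paper's actual feature extraction and classifier ensemble are more elaborate, but the evaluation setup (train on 12 labeled bubbles per respondent, rank all 92 candidates for the 8 held-out bubbles, score top-1 and top-10 hit rates) can be sketched with synthetic data. Everything below besides the respondent counts and the train/test split is illustrative: the random "style" vectors stand in for real image features, and the nearest-centroid matcher stands in for the paper's classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
N_RESPONDENTS, N_TRAIN, N_TEST, N_FEATURES = 92, 12, 8, 16

# Synthetic stand-in: each respondent's bubbles scatter around a personal
# "style" vector; real features would come from the scanned bubble images.
styles = rng.normal(size=(N_RESPONDENTS, N_FEATURES))

def sample_bubbles(n):
    return styles[:, None, :] + 0.6 * rng.normal(size=(N_RESPONDENTS, n, N_FEATURES))

train = sample_bubbles(N_TRAIN)   # 12 labeled bubbles per respondent
test = sample_bubbles(N_TEST)     # 8 held-out bubbles per respondent

# Nearest-centroid matcher: profile each respondent from the training
# bubbles, then rank all 92 candidates by distance to the mean of the
# 8 test bubbles.
profiles = train.mean(axis=1)                                        # (92, 16)
queries = test.mean(axis=1)                                          # (92, 16)
dists = np.linalg.norm(queries[:, None, :] - profiles[None, :, :], axis=-1)
ranked = np.argsort(dists, axis=1)                                   # best candidate first

truth = np.arange(N_RESPONDENTS)
top1 = np.mean(ranked[:, 0] == truth)
top10 = np.mean([t in row[:10] for row, t in zip(ranked, truth)])
print(f"top-1 accuracy: {top1:.0%}, top-10: {top10:.0%}")
```

On real scans the within-person consistency is much weaker than in this toy setup, which is why the team's reported top-1 rate is closer to half than to certainty.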

The technique has a number of possible applications, both positive and negative. One negative application involves undermining the secret ballot. Some jurisdictions, such as Humboldt County, CA, offer digital images of all ballots cast in recent elections. If a third party obtained a sample of filled-in bubbles from a known Humboldt County voter, perhaps as part of an employment application, it could use the Princeton team's techniques to identify that voter's ballot.

In principle, this raises voter intimidation concerns. For example, an employer might threaten to fire employees who fail to vote for his preferred candidates. But Joe Calandrino, the study's lead author, concedes that the 51 percent accuracy rate "does leave some room for deniability" for a voter who faces such intimidation. The problem deserves further study, but Humboldt County voters shouldn't lose sleep over it.

A more positive application of the team's research is the detection of cheating. For example, a high school teacher whose students are taking a high-stakes test might be tempted to fill in some answers for his students after they have turned their tests in. The techniques described by Calandrino et al. could be used to scan a large number of documents looking for evidence that the same person filled out bubbles on multiple tests.
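One way such a scan might work, sketched here with synthetic feature vectors rather than the paper's actual method: summarize the style of the suspect bubbles on each answer sheet, then flag the pair of sheets whose bubbles look most like the same hand. The scenario below is entirely hypothetical; two sheets have their final answers filled in by a shared "teacher" style.

```python
import numpy as np

rng = np.random.default_rng(1)
N_TESTS, N_BUBBLES, N_FEATURES = 30, 20, 16

# Synthetic stand-in: each sheet's bubbles reflect one student's style,
# except sheets 3 and 7, whose last five answers share a "teacher" style.
styles = rng.normal(size=(N_TESTS, N_FEATURES))
teacher = rng.normal(size=N_FEATURES)
sheets = styles[:, None, :] + 0.5 * rng.normal(size=(N_TESTS, N_BUBBLES, N_FEATURES))
for t in (3, 7):
    sheets[t, -5:] = teacher + 0.5 * rng.normal(size=(5, N_FEATURES))

# Summarize each sheet's suspect (late) answers and compare all pairs.
suspect = sheets[:, -5:].mean(axis=1)           # mean style of the last 5 bubbles
d = np.linalg.norm(suspect[:, None] - suspect[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
i, j = np.unravel_index(d.argmin(), d.shape)
print(sorted((int(i), int(j))))                 # the most suspiciously similar pair
```

In practice a flagged pair would only be a lead, not proof; as the next paragraph notes, this kind of signal is most useful alongside other evidence.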

Here too, there are questions about whether the algorithm is powerful enough to give useful results, but Calandrino argues that it is. First, in this application the algorithm would have many more samples to work with, which should improve its accuracy. More importantly, Calandrino says he sees the work "as fitting in with other risk-based approaches, like answer analysis." By itself, the algorithm may not definitively prove someone cheated, but it offers valuable, independent evidence of wrongdoing.

We covered related work from the same Princeton team in 2009, when the group found that variations in the structure of paper allow individual sheets to be "fingerprinted" with a commodity scanner.

Disclosure: I'm a former member of Felten's research group. I reviewed an early draft of the study, but did not participate in the research.