Researchers who study stylometry—the statistical analysis of linguistic style—have long known that writing is a unique, individualistic process. The vocabulary you select, your syntax, and your grammatical decisions leave behind a signature. Automated tools can now accurately identify the author of a forum post for example, as long as they have adequate training data to work with. But newer research shows that stylometry can also apply to artificial language samples, like code. Software developers, it turns out, leave behind a fingerprint as well.

Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, are not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. Their work, some of which was funded by and conducted in collaboration with the United States Army Research Laboratory, could be useful in a plagiarism dispute, for instance, but also has privacy implications, especially for the thousands of developers who contribute open source code to the world.

How To De-Anonymize Code

Here's a simple explanation for how the researchers used machine learning to uncover who authored a piece of code. First, the algorithm they designed identifies all the features found in a selection of code samples. That's a lot of different characteristics. Think of every aspect that exists in natural language: There's the words you choose, which way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only include the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50 or so.

The researchers don't rely on low-level features, like how code was formatted. Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.

'People should be aware that it’s generally very hard to 100 percent hide your identity in these kinds of situations.' Rachel Greenstadt, Drexel University

The researchers need examples of someone's work to teach the algorithm to know when it spots another one of their code samples. If a random GitHub account pops up and publishes a code fragment, Greenstadt and Caliskan wouldn't necessarily be able to identify the person behind it, because they only have one sample to work with. (They could possibly tell that it was a developer they hadn't seen before.) Greenstadt and Caliskan, however, don't need your life's work to attribute code to you. It only takes a few short samples.

For example, in a 2017 paper, Caliskan, Greenstadt, and two other researchers demonstrated that even small snippets of code on the repository site GitHub can be enough to differentiate one coder from another with a high degree of accuracy.

Most impressively, Caliskan and a team of other researchers showed in a separate paper that it’s possible to de-anonymize a programmer using only their compiled binary code. After a developer finishes writing a section of code, a program called a compiler turns it into a series of 1s and 0s that can be read by a machine, called binary. To humans, it mostly looks like nonsense.

Caliskan and the other researchers she worked with can decompile the binary back into the C++ programming language, while preserving elements of a developer’s unique style. Imagine you wrote a paper and used Google Translate to transform it into another language. While the text might seem completely different, elements of how you write are still embedded in traits like your syntax. The same holds true for code.

“Style is preserved,” says Caliskan. “There is a very strong stylistic fingerprint that remains when things are based on learning on an individual basis.”