Last fall, AdBlock Plus creator Wladimir Palant revealed that Avast was using its popular antivirus software to collect and sell user data. While the effort was eventually shuttered, Avast CEO Ondrej Vlcek first downplayed the scandal, assuring the public the collected data had been “anonymized”—or stripped of any obvious identifiers like names or phone numbers. “We absolutely do not allow any advertisers or any third party...to get any access through Avast or any data that would allow the third party to target that specific individual,” Vlcek said. But analysis from students at Harvard University shows that anonymization isn’t the magic bullet companies like to pretend it is. Dasha Metropolitansky and Kian Attari, two students at the Harvard John A. Paulson School of Engineering and Applied Sciences, recently built a tool that combs through vast troves of consumer datasets exposed from breaches for a class paper they’ve yet to publish.

“The program takes in a list of personally identifiable information, such as a list of emails or usernames, and searches across the leaks for all the credential data it can find for each person,” Attari said in a press release. They told Motherboard their tool analyzed thousands of datasets from data scandals ranging from the 2015 hack of Experian, to the hacks and breaches that have plagued services from MyHeritage to porn websites. Despite many of these datasets containing “anonymized” data, the students say that identifying actual users wasn’t all that difficult.

“An individual leak is like a puzzle piece,” Harvard researcher Dasha Metropolitansky told Motherboard. “On its own, it isn’t particularly powerful, but when multiple leaks are brought together, they form a surprisingly clear picture of our identities. People may move on from these leaks, but hackers have long memories.”

For example, while one company might only store usernames, passwords, email addresses, and other basic account information, another company may have stored information on your browsing or location data. Independently they may not identify you, but collectively they reveal numerous intimate details even your closest friends and family may not know.

“We showed that an ‘anonymized’ dataset from one place can easily be linked to a non-anonymized dataset from somewhere else via a column that appears in both datasets,” Metropolitansky said. “So we shouldn’t assume that our personal information is safe just because a company claims to limit how much they collect and store.”

The students told Motherboard they were “astonished” by the sheer volume of total data now available online and on the dark web. Metropolitansky and Attari said that even with privacy scandals now a weekly occurrence, the public is dramatically underestimating the impact on privacy and security these leaks, hacks, and breaches have in total. Previous studies have shown that even within independent individual anonymized datasets, identifying users isn’t all that difficult.