The recent repeal of broadband privacy protection in the US has highlighted the highly personal nature of browsing histories. One natural solution is to use a VPN in order to shield details of your Internet activities from your ISP. But research by a German journalist and a data scientist, originally released last year at the 33rd Chaos Communication Congress in Hamburg, and recently presented anew at Def Con 25 in Las Vegas, has some bad news on that front.

The German duo found that huge datasets of anonymized private Internet histories were being sold by Web analysis companies and data brokers, with much of the material coming from browser extensions. Since these operate before information is sent over any VPN, they can access full details of your Internet activities, and send browser data anywhere. For VPN users, that’s disappointing. Less surprising, perhaps, is the fact that it was relatively easy to discover the identities of many users found in these supposedly anonymized datasets.

The research consisted of some social engineering by the journalist Svea Eckert, followed by data analysis by Andreas Dewes. Eckert set up a Web site and LinkedIn profile for a fake company called Meez Technology, allegedly based in Tel Aviv, which purported to offer “data-driven consulting”. Using Meez Technology as cover, Eckert contacted Web analytics companies and data brokers, asking for Internet browsing histories of German citizens, which she said Meez Technology was interested in acquiring for its data analysis.

In the end, one company gave her 14 days’ free access to a month’s worth of “clickstream data” – the complete browser histories – as a sample of what it could offer. The information included 3 billion URLs from 3 million German users, spread over 9 million different sites. Many companies said they were unable to supply URLs for German users, but were able to offer this information for people in the US and UK.

Once the researchers obtained their dataset, Dewes tried to de-anonymize the individuals it referred to. For some users, this was simple. Dewes had the complete URL, not a truncated portion, so it often showed data that was transmitted to the site in question. Sometimes that included the user’s name. For example, when someone visits their own analytics page on Twitter, the URL contains their Twitter username. Since that page is visible only to them and Twitter, that’s not usually a problem. But when Internet browsing datasets include the full URL, it is, because all the URLs linked to an otherwise anonymous user can now be associated with the person identified through one of them – in this case, via Twitter. Out of the 3 million anonymous profiles obtained by the researchers, over 100,000 individuals could be identified in this way.
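To make the mechanism concrete, here is a minimal Python sketch of that first technique. It is not Dewes’s actual code: the record layout (pairs of pseudonymous profile ID and URL), the function name, and the assumed URL shape `https://analytics.twitter.com/user/<username>/...` are illustrative assumptions.

```python
import re
from urllib.parse import urlsplit

# Assumption: the analytics page URL carries the username as /user/<username>/...
ANALYTICS_PATH = re.compile(r"^/user/([A-Za-z0-9_]{1,15})(?:/|$)")


def identify_profiles(records):
    """Map pseudonymous profile IDs to usernames leaked in URL paths.

    `records` is an iterable of (profile_id, url) pairs (hypothetical layout).
    """
    identified = {}
    for profile_id, url in records:
        parts = urlsplit(url)
        if parts.hostname != "analytics.twitter.com":
            continue
        match = ANALYTICS_PATH.match(parts.path)
        if match:
            # One leaking URL is enough to label the whole profile.
            identified[profile_id] = match.group(1)
    return identified


# Made-up sample data for illustration only.
sample = [
    ("profile-17", "https://example.org/news"),
    ("profile-17", "https://analytics.twitter.com/user/example_user/home"),
    ("profile-42", "https://example.com/shop?item=3"),
]
result = identify_profiles(sample)
print(result)
```

Once a single URL has tied “profile-17” to a username, every other URL recorded under that profile ID is tied to the same person.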

Another approach was what Dewes called “combinatorial deanonymization”, based on earlier research in this area published in 2007. This takes advantage of the fact that you only need around 10 URLs to identify someone uniquely – it turns out that people are very different in their browsing habits. So if there is additional information available that links data in those URLs to an identity, it is likely that all of the other URLs relate to that person too.
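The idea can be sketched in a few lines of Python. This is a simplified illustration rather than the researchers’ method as implemented: it assumes each anonymous profile is reduced to a set of visited URLs, and that an attacker already knows a handful of URLs a named person has visited (for example, links they shared publicly). All names and data below are hypothetical.

```python
def candidate_profiles(histories, known_urls):
    """Return the profile IDs whose history contains every known URL.

    `histories` maps a pseudonymous profile ID to the set of URLs it visited.
    """
    known = set(known_urls)
    return [pid for pid, visited in histories.items() if known <= visited]


# Made-up histories for illustration only.
histories = {
    "profile-1": {"a.example/x", "b.example/y", "c.example/z"},
    "profile-2": {"a.example/x", "d.example/q"},
    "profile-3": {"b.example/y", "c.example/z", "e.example/r"},
}

# With one known URL, two profiles remain possible.
one_url = candidate_profiles(histories, ["b.example/y"])
# Adding a second known URL leaves a single candidate.
two_urls = candidate_profiles(histories, ["b.example/y", "a.example/x"])
print(one_url, two_urls)
```

Each additional known URL shrinks the set of candidate profiles. Because browsing habits differ so much between people, the set tends to collapse to a single profile after only a few URLs.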

Thanks to the complete URL being available in the dataset, Dewes discovered another important way that sensitive information could be obtained from this “anonymized” data. When requests are sent to Google Translate, the complete text is included in the URL. This means that the researchers were able to read every word of messages sent to Google Translate by users in their dataset. In one case, it included operational details about a German police investigation, where the detective involved was translating requests for assistance to be sent to police overseas.
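As a rough illustration of why a full URL can leak message content, the Python sketch below recovers the text from a translation URL. The two URL shapes it handles, a fragment of the form `#<source>/<target>/<text>` and a `text=` query parameter, are assumptions about how such URLs may look, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def translated_text(url):
    """Return the text embedded in a translation URL, or None if absent."""
    parts = urlsplit(url)
    if parts.hostname != "translate.google.com":
        return None
    # Assumed fragment form: #<source>/<target>/<percent-encoded text>
    pieces = parts.fragment.split("/", 2)
    if len(pieces) == 3 and pieces[2]:
        return unquote(pieces[2])
    # Assumed query form: ?text=<encoded text>
    values = parse_qs(parts.query).get("text")
    return values[0] if values else None


# Made-up URLs for illustration only.
from_fragment = translated_text("https://translate.google.com/#de/en/Guten%20Morgen")
from_query = translated_text("https://translate.google.com/?sl=de&tl=en&text=Guten+Morgen")
print(from_fragment, from_query)
```

Nothing here requires access to the user’s machine or to Google: anyone holding the logged URL can decode the text.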

In a report broadcast on German television in November last year, Eckert met with German politicians whose accounts were successfully de-anonymized by Dewes. In one case, a politician is shown her detailed online activity, including accessing her income tax details, and viewing a treatment for forgetfulness and lack of concentration. That revelation is potentially awkward for a leading politician, but is nothing compared to what was found for another account included in the dataset. It shows a German judge using his computer to order new robes for his work, followed by visits to sadomasochistic porn sites. If the judge were identified, and the information fell into the wrong hands, there would be evident potential for blackmail of such a prominent public figure, perhaps leading to interference in court cases.

It is not news that datasets claimed to be anonymous are nothing of the kind. Two famous incidents – one involving AOL, the other Netflix – confirmed this a decade ago. Perhaps the most important finding of the German researchers is where these datasets with their complete URLs come from. It turns out that 95% of the Internet history data they analyzed derives from just 10 browser extensions, and that there were no fewer than 10,000 such extensions spying on their users to some degree.

One browser extension in particular was responsible for creating a database of users’ URLs on a massive scale, and selling those datasets in what was claimed to be an “anonymized” form. It is called, rather ironically, Web of Trust, and alone contributed one billion data points about users’ browsing habits. Since Dewes and Eckert published their results, the company producing the browser extension has modified its privacy policy to make it more explicit that it sells user data in this way.

Few people read everything in privacy policies, and even if they did, they might be reassured that their Internet browsing histories are only released in an anonymized form. But the German research shows that such anonymization can often be reversed to reveal not only the identity of the person to whom the URLs refer, but also extremely intimate information that might go beyond the merely embarrassing to become serious blackmail material.

The message is clear. Although VPNs are great at what they do, and indispensable for today’s online world, they provide no protection against this kind of covert data surveillance by browser extensions. The best solution is to install as few extensions and plug-ins as possible, and to check what their privacy policies say about selling your Internet browsing history.

Featured image by Svea Eckert.