We performed a study about web tracking.

Web tracking refers to the very widespread practice of websites embedding so-called web trackers on their pages. These have various purposes, such as optimizing the loading of images, or embedding additional services on a site. The best known examples are the “Facebook buttons” and “Twitter buttons” found on many sites, but the most widespread ones are by far those from Google, which in fact operates a multitude of such “tracking services”. There are however many more – what they all have in common is that they allow the company that runs them to track precisely which sites have been visited by each user.
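To make the idea concrete, here is a minimal sketch of how embedded third-party hosts could be spotted in a single page. It is not the extraction pipeline used in the study: the class `ThirdPartyCollector`, the helper `third_party_hosts`, and the choice of tags are assumptions made for illustration. Comparing full host names is a simplification that also flags a site's own subdomains.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ThirdPartyCollector(HTMLParser):
    """Collects hosts of embedded resources that differ from the page's own host.

    Hypothetical helper for illustration; not the study's actual method.
    """

    # Tags whose `src` attribute makes the browser fetch a resource automatically.
    EMBED_TAGS = {"script", "img", "iframe"}

    def __init__(self, page_url):
        super().__init__()
        self.page_host = urlparse(page_url).hostname
        self.third_party_hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.EMBED_TAGS:
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Relative URLs have no host and therefore belong to the page itself.
        host = urlparse(src).hostname
        if host and host != self.page_host:
            self.third_party_hosts.add(host)


def third_party_hosts(page_url, html):
    """Return the set of foreign hosts that the page embeds resources from."""
    collector = ThirdPartyCollector(page_url)
    collector.feed(html)
    return collector.third_party_hosts


example_html = """
<html><body>
<img src="/logo.png">
<script src="https://www.google-analytics.com/analytics.js"></script>
<iframe src="//www.facebook.com/plugins/like.php"></iframe>
</body></html>
"""

print(sorted(third_party_hosts("https://example.org/index.html", example_html)))
# ['www.facebook.com', 'www.google-analytics.com']
```

Every host in the result receives a request whenever the page is loaded, which is what allows its operator to see that the page was visited.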

Web trackers have been well known for quite some time, but until now no web-scale study had measured their extent. We therefore analysed 200 terabytes of data covering 3.5 billion web pages.

The key insights are:

90% of all websites contain trackers.

Google tracks 24.8% of all domains on the internet, globally.

When taking into account that not all websites are visited equally often, Google’s reach is even higher: we estimated that 50.7% of all visited pages on the web include trackers by Google. (Ironically, we estimated this by using PageRank, a measure initially associated with Google itself.)

The top three tracking systems deployed on the web are all operated by Google, and use the domains google-analytics.com, google.com, and googleapis.com.

The top three companies that have trackers on the web are Google, Facebook, and Twitter, in that order.

These big companies are far from the only ones that track: there is a long tail of trackers on the web – 50% of the tracking services are embedded on fewer than ten thousand domains, while tracking services in the top 1% of the distribution are integrated into more than a million domains.

Google, Twitter and Facebook are the dominant tracking companies in almost all countries in the world. The exceptions are Russia and China, where the local companies Yandex and CNZZ, respectively, take the top rank. Even in Iran, Google is the most deployed tracker.

Websites about topics that are particularly privacy-sensitive are less likely to contain trackers than other websites, but still, the majority of such sites do contain trackers. For instance, 60% of all forums and other sites about mental health, addiction, sexuality, and gender identity contain trackers, compared to 90% overall.

Many sites contain more than one tracker. In fact, multiple trackers are so common that we were able to determine clusters of trackers that are often used together by individual sites – these allow us to automatically detect different types of trackers such as advertising trackers, counters and sharing widgets. (See the picture above.)

Not all trackers have the explicit purpose of tracking people: Many types of systems perform useful services in which tracking is a side effect. Examples are caching images, optimizing load times, enhancing the usability of a site, etc. For many of these systems, webmasters may not be aware that they allow tracking.
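The difference between Google’s share of domains and its share of visited pages, mentioned in the insights above, comes from weighting each site by how often it is visited. The following toy calculation sketches that idea. The function `reach`, the site names, and the integer weights (standing in for PageRank-style scores) are invented for illustration and are not data from the study.

```python
def reach(sites, tracker):
    """Return (unweighted, weighted) reach of `tracker`.

    `sites` maps a domain to a pair (weight, set of tracker names).
    The weight stands in for a PageRank-style score; all values here
    are hypothetical.
    """
    with_tracker = [d for d, (_, trackers) in sites.items() if tracker in trackers]
    # Share of domains that embed the tracker, each domain counted once.
    unweighted = len(with_tracker) / len(sites)
    # Share of total weight carried by the domains that embed the tracker.
    total_weight = sum(weight for weight, _ in sites.values())
    weighted = sum(sites[d][0] for d in with_tracker) / total_weight
    return unweighted, weighted


toy_sites = {
    "popular.example": (6, {"tracker-a"}),
    "medium.example": (2, {"tracker-b"}),
    "small1.example": (1, set()),
    "small2.example": (1, set()),
}

print(reach(toy_sites, "tracker-a"))
# (0.25, 0.6)
```

In this toy data the tracker sits on only one of four domains, but because that domain carries most of the weight, its weighted reach is much higher than its share of domains.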

Note: The study was performed using data collected in 2012 by the Common Crawl project. Because their crawling strategy has changed since then, newer data in fact represents a smaller fraction of the whole web.

The study was performed by my colleague Sebastian Schelter from the Database Systems and Information Management Group of the Technical University of Berlin, and myself.

The article is published in the Journal of Web Science, and is available as open access:

Sebastian Schelter and Jérôme Kunegis (2018), “On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl”, The Journal of Web Science, Vol. 4, No. 4, pp. 53–66.

The full dataset is available online on Sebastian’s website, and also via the KONECT project.