This project explores who tracks the web, and how.

As tech-savvy readers know, when we visit a web page, several things happen in the background.

The server sends the page to the user's browser, which starts painting the content on screen. However, the browser may also need to fetch other resources; the most common are:

images

instructions for how to style the various elements (like the color, size, and position of text), known as CSS

code for animations or interactive applications, known as JS

fonts (how text appears)

All these resources may be provided by the same website, or they may be provided by a different one.

If those resources are provided by a different website, the browser needs to obtain them by making a request to a different actor.
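A sketch of the distinction may help. The helper below decides whether a resource is "third-party" with respect to the page that embeds it; it is a simplification that compares only the last two labels of the hostname, which breaks for suffixes like co.uk (a real implementation would use the Public Suffix List, as the tldextract library does).

```python
from urllib.parse import urlparse

def is_third_party(page_url, resource_url):
    """Rough check: does the resource live on a different site than the page?

    Simplification: we compare only the last two labels of the hostname
    (e.g. 'example.com'), which is wrong for suffixes like 'co.uk';
    a real implementation would consult the Public Suffix List.
    """
    page_host = urlparse(page_url).hostname or ""
    res_host = urlparse(resource_url).hostname or ""
    site = lambda host: ".".join(host.split(".")[-2:])
    return site(page_host) != site(res_host)

# A stylesheet on a subdomain of the same site is first-party:
print(is_third_party("https://example.com/post",
                     "https://cdn.example.com/style.css"))       # False
# A script fetched from another company is third-party:
print(is_third_party("https://example.com/post",
                     "https://www.google-analytics.com/analytics.js"))  # True
```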

All these requests may be used to track users on the web, especially if they are associated with cookies (hence the annoying banners on every website) and headers (for which there are no banners at all).

An example of this are the social buttons from Facebook, Twitter, Google, Reddit, etc. In order to show those buttons, it is necessary to make a request to the respective company and send information about the user. This makes it possible to show personalized social buttons ("Jon, Tyrion and Sansa liked this element"), but those social platforms will also know what page you have visited.
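To make the leak concrete, here is a hypothetical illustration (the URLs and header values are invented) of what the browser sends along when it fetches such a widget from a third party:

```python
# Hypothetical illustration of the request a browser makes when it fetches
# a social button embedded in a page. The URLs and values are invented.
third_party_request = {
    "url": "https://platform.example-social.com/widgets/like-button.js",
    "headers": {
        # Tells the social network exactly which page you were reading:
        "Referer": "https://some-blog.example/secret-hobby-post",
        # Identifies *you*, if you are logged in to the social network:
        "Cookie": "session_id=abc123",
    },
}
print(third_party_request["headers"]["Referer"])
```

Nothing more than a Referer header and a session cookie is needed for the platform to link your identity to the page you were reading.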

Finally, websites may also use analytics solutions that help them know who visits, which pages are visited most often, and other information. The most common analytics solution is provided by Google itself, for free: the website obtains a lot of useful data, of course, but Google obtains the same data as well.

Armed with this basic knowledge, let's explore how we can find out who is tracking the web.

Obtain the data

The simplest way to know which requests are made to which services is to simply render the web page using a browser like Firefox and track all the requests that are made.

This procedure is not as simple as it may look; luckily, thanks to help from friends, a reasonably simple solution was possible.

How difficult can it be to programmatically get a list of all the requests a browser makes in order to display a web page?

"Chrome headless and selenium may help also" — Ramiro Algozino (@ralgozino), May 14, 2019

We programmatically drive Firefox, making all the requests through a proxy.

Everything was nicely packed together in the selenium-wire project.

The result is a tiny Python script that takes a domain as input, starts Firefox, makes Firefox visit and render the homepage, tracks all the requests through the proxy, and finally stores all the requests into a SQLite file.

import sys
import sqlite3
import json
from urllib.parse import urlparse

import tldextract
from seleniumwire import webdriver  # Import from seleniumwire
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

# Create a new instance of the Firefox driver
driver = webdriver.Firefox(options=options)

original_domain = sys.argv[1]
url = 'https://{}'.format(original_domain)

# Make a request to the URL
driver.get(url)

conn = sqlite3.connect("requests.db")
c = conn.cursor()
c.execute('''
CREATE TABLE IF NOT EXISTS requests(
    original_domain TEXT NOT NULL,
    original_url TEXT NOT NULL,
    time_request INT DEFAULT (strftime('%s','now')),
    request TEXT,
    status_code INT,
    subdomain TEXT,
    domain TEXT,
    tld TEXT,
    scheme TEXT,
    netloc TEXT,
    path TEXT,
    params TEXT,
    query TEXT,
    fragment TEXT,
    request_header TEXT,
    response_header TEXT
);
''')
conn.commit()

insert_stmt = """
INSERT INTO requests(
    original_domain, original_url, request, status_code,
    subdomain, domain, tld,
    scheme, netloc, path, params, query, fragment,
    request_header, response_header
) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, json(?), json(?));
"""

# Access the tracked requests via the `requests` attribute
for request in driver.requests:
    if request.response:
        rpath = request.path
        subdomain, domain, tld = tldextract.extract(rpath)
        parsedRequest = urlparse(rpath)
        scheme, netloc, path, params, query, fragment = parsedRequest
        status_code = request.response.status_code
        data = (
            original_domain,
            url,
            rpath,
            status_code,
            subdomain, domain, tld,
            scheme, netloc, path, params, query, fragment,
            json.dumps(dict(request.headers)),
            json.dumps(dict(request.response.headers)),
        )
        c.execute(insert_stmt, data)
        conn.commit()

driver.close()
driver.quit()

At this point we have a script that, given a domain as input, gets its home page and stores all the requests necessary to render that homepage into a small database.
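With the data in SQLite, asking "who is contacted most often?" is a single GROUP BY away. The snippet below builds a tiny in-memory database with (a subset of) the same schema for illustration; against the real file you would connect to requests.db instead, and the sample rows here are invented.

```python
import sqlite3

# For illustration, a tiny in-memory database with (part of) the same
# schema the crawler fills; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE requests(
    original_domain TEXT, request TEXT, domain TEXT, tld TEXT)""")
conn.executemany(
    "INSERT INTO requests(original_domain, request, domain, tld) VALUES (?, ?, ?, ?)",
    [("example.com", "https://www.google-analytics.com/analytics.js",
      "google-analytics", "com"),
     ("example.com", "https://example.com/", "example", "com"),
     ("another.org", "https://www.google-analytics.com/collect",
      "google-analytics", "com")])

# Which domains receive the most requests across all crawled homepages?
rows = conn.execute("""
    SELECT domain || '.' || tld AS site, COUNT(*) AS n
    FROM requests
    GROUP BY site
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('google-analytics.com', 2), ('example.com', 1)]
```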

Then we used the list of the top 10 million most influential websites (actually, domains) to know which websites are the most visited.

We manipulated the list to extract just the domain column and keep the top of the list (here, the first 1,000 domains):

cat top10milliondomains.csv | awk -F "," '{ print substr($2, 2, length($2) - 2)}' | head -n 1000
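The substr() call in the awk pipeline strips the quotes around the second CSV field by hand. For reference, the same extraction in Python (assuming the file format is a quoted "rank","domain" CSV; a three-row in-memory sample stands in for the real file):

```python
import csv
import io
import itertools

# Assumed file format: quoted rank and domain columns, as in the
# top10milliondomains.csv file. A tiny in-memory sample stands in here.
sample = '"1","google.com"\n"2","facebook.com"\n"3","youtube.com"\n'

# csv.reader handles the quoting that the awk substr() call strips by hand;
# islice plays the role of `head`.
domains = [row[1] for row in itertools.islice(csv.reader(io.StringIO(sample)), 2)]
print(domains)  # ['google.com', 'facebook.com']
```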

And finally we used xargs to run the Python script in parallel.

xargs -n1 -P6 python3 tracker.py

Hence, the whole command was:

cat top10milliondomains.csv | awk -F "," '{ print substr($2, 2, length($2) - 2)}' | head -n 1000 | xargs -n1 -P6 python3 tracker.py

After some hours we had collected 186,582 requests made while rendering the homepages of 1,924 domains, roughly 97 requests per homepage on average. Those requests were made against 3,472 distinct domains.

The number of requests is definitely not huge, far from it, but in order to make them Firefox needs to render a whole web page along with its JS and CSS: definitely not a lightweight task.

A brief data analysis will soon follow; follow me on Twitter or subscribe to the mailing list to receive updates.

Repository here.
