Our Measurement Methodology

This section describes our methodology for measuring redirection-based tracking online. The following subsections describe our automated recording crawler, the dataset we deployed the crawler on, and how we used the recorded data to understand redirection-based tracking.

Web crawling. We built a Puppeteer-based crawler that interacts with websites and records which domains forward requests to other domains. The crawler follows a “random walk” strategy to traverse the web, and operates as follows:

1. Navigate to a URL. The crawler notes if the browser is automatically redirected to another domain, whether through an HTTP 3XX header, an HTML meta tag instruction, or JavaScript executing on the page. If so, the redirection is recorded and the process starts over from step 1.
2. The crawler pauses for 10 seconds, observes which iframes are created on the page, and records the chain of domains involved in each iframe’s requests.
3. After 10 seconds, the crawler collects all of the anchor tags on the page (in both parent and child frames) that have a non-empty “href” attribute.
4. The crawler then clicks on the collected links until the page is redirected, selecting first from anchors pointing to remote domains, and then from anchors pointing to the same domain. If the page changes, the process continues from step 1; otherwise the crawler exits.
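To make these steps concrete, the following is a minimal sketch of one iteration of the walk written against Puppeteer. The function names (randomWalkStep, demo) and the overall structure are illustrative assumptions, not our crawler’s actual code; in particular, the sketch only observes HTTP 3XX redirects through Puppeteer’s redirect chain, and notes in comments where meta tag and JavaScript redirects would need additional handling.

```typescript
import puppeteer, { Page } from "puppeteer";

// One iteration of the random-walk loop sketched in the steps above.
async function randomWalkStep(page: Page, url: string): Promise<string | null> {
  // Step 1: navigate and note HTTP-level (3XX) redirects via the redirect chain.
  // Meta-refresh and script-driven redirects surface as later navigations of the
  // main frame, which a fuller implementation would watch with a
  // "framenavigated" listener.
  const response = await page.goto(url, { waitUntil: "domcontentloaded" });
  const httpRedirects =
    response?.request().redirectChain().map((req) => req.url()) ?? [];
  if (httpRedirects.length > 0) {
    console.log(`redirected: ${httpRedirects.join(" -> ")} -> ${page.url()}`);
  }

  // Step 2: pause for 10 seconds so iframes (and any delayed redirects) can load.
  await new Promise<void>((resolve) => setTimeout(resolve, 10_000));

  // Step 3: collect anchors with a non-empty href from the main frame and all
  // child frames.
  const anchors: string[] = [];
  for (const frame of page.frames()) {
    try {
      const hrefs = await frame.$$eval("a[href]", (els) =>
        els.map((el) => (el as HTMLAnchorElement).href).filter((h) => h.length > 0)
      );
      anchors.push(...hrefs);
    } catch {
      // The frame may have detached while we were reading it; skip it.
    }
  }

  // Step 4: prefer anchors pointing at remote domains, then same-domain anchors.
  const currentHost = new URL(page.url()).hostname;
  const candidates = anchors.filter((href) => href.startsWith("http"));
  candidates.sort((a, b) => {
    const aSame = new URL(a).hostname === currentHost ? 1 : 0;
    const bSame = new URL(b).hostname === currentHost ? 1 : 0;
    return aSame - bSame; // remote-domain anchors sort before same-domain ones
  });

  // Return the next candidate to click, or null if the walk should exit.
  return candidates[0] ?? null;
}

// Example driver: launch a browser and take one step from a seed URL.
async function demo(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const next = await randomWalkStep(page, "https://example.com/");
  console.log("next candidate:", next);
  await browser.close();
}

demo().catch(console.error);
```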

The crawler continued the above process until it had run to completion (i.e., exited from step 4) or for 4 minutes, whichever occurred first. All crawls were conducted on AWS infrastructure and performed from known AWS IP addresses.
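The 4-minute cap can be expressed as a simple race between the walk and a timer. The sketch below assumes a walk function that loops over something like randomWalkStep above; the names are ours, not the crawler’s.

```typescript
// Minimal sketch of the per-crawl time cap: the crawl ends when the walk
// exits on its own (step 4) or when 4 minutes elapse, whichever is first.
const CRAWL_CAP_MS = 4 * 60 * 1000;

async function crawlWithCap(walk: () => Promise<void>): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const cap = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, CRAWL_CAP_MS);
  });
  try {
    await Promise.race([walk(), cap]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```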

Dataset. We treated the global Alexa 10k as fairly representative of the web in general. We are considering crawling different regional and “top” lists, but those measurements are not part of this posting. For each crawl, we randomly selected one URL from the Alexa 10k. Because of the large number of crawls (we conducted 50,980), we selected each site in the Alexa 10k multiple times. This was intentional, since our random walk strategy meant that very different sets of domains and URLs might be visited even when starting from the same seed URL.
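As a small illustration of how seeds were drawn, the sketch below assumes the Alexa 10k is loaded as an array of domains; the alexa10k name and the http scheme are assumptions.

```typescript
// Sketch of per-crawl seed selection: each crawl draws one site independently
// (with replacement), so across 50,980 crawls most sites are drawn many times.
function pickSeed(alexa10k: string[]): string {
  const site = alexa10k[Math.floor(Math.random() * alexa10k.length)];
  return `http://${site}/`; // scheme is an assumption; the list stores bare domains
}
```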

Graph analysis. We modeled our results as a graph, with nodes representing domains visited during the crawl, and edges representing frame-level requests (i.e., the initial request for each frame’s content, not subsequent requests for images, Ajax responses, or similar sub-resources). Each edge points from the node representing the domain where the request originated to the node representing the domain that served the response.
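A minimal sketch of this graph construction follows; the type and function names are illustrative assumptions rather than our analysis code’s actual structures.

```typescript
// One node per domain, one directed edge per frame-level request, pointing
// from the initiating domain to the serving domain.
interface DomainGraph {
  nodes: Set<string>;
  edges: { from: string; to: string }[];
}

function recordFrameRequest(
  graph: DomainGraph,
  initiatingDomain: string,
  servingDomain: string
): void {
  graph.nodes.add(initiatingDomain);
  graph.nodes.add(servingDomain);
  graph.edges.push({ from: initiatingDomain, to: servingDomain });
}
```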

Each edge was labeled as either a “navigation” edge (meaning the frame’s URL changed as the result of the crawler clicking on a link) or a “redirect” edge (meaning the frame’s URL changed without any user interaction). We also annotated each edge with whether the request was made by the top-level frame or an iframe, which frame made the request, and a unique identifier for each crawl session, allowing us to recreate each crawl session.
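Concretely, each edge record might look like the following sketch; the field names are our assumptions, but they mirror the annotations listed above.

```typescript
// Per-edge annotations: how the frame's URL changed, which frame issued the
// request, and which crawl session the edge belongs to.
type EdgeCause = "navigation" | "redirect"; // click-driven vs. automatic

interface EdgeAnnotation {
  cause: EdgeCause;
  fromTopLevelFrame: boolean; // true if the top-level frame made the request
  frameId: string;            // which frame made the request
  crawlSessionId: string;     // unique per crawl; used to recreate the session
}
```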