webxray [v 2.2]

About

webxray is a tool for analyzing third-party content on webpages and identifying the companies which collect user data. A command line user interface makes webxray easy to use for non-programmers, and those with advanced needs may analyze millions of pages with proper configuration. webxray is a professional tool designed for academic research, and may be used by privacy compliance officers, regulators, and those who are generally curious about hidden data flows on the web.

webxray uses a custom library of domain ownership to chart the flow of data from a given third-party domain to a corporate owner and, if applicable, to parent companies. Tracking attribution reports produced by webxray attribute tracking to individual companies and their subsidiaries. Reports of the average numbers of third parties and cookies per site, the most commonly occurring third-party domains and elements, volumes of data transferred, use of SSL encryption, and more are provided out of the box. A flexible data schema allows for the generation of custom reports as well as the authoring of extensions to add additional data sources.

The public version of webxray uses Chrome to load pages, stores data in a SQLite database, and can be used on a normal desktop computer. There is also a proprietary forensic version of webxray designed to meet the demands of academic research and litigation. If you have academic needs, please contact Tim Libert; if you have litigation needs, please contact us at the webXray company website.

Below you will find detailed instructions on how to install the software needed to run webxray in the Dependencies section. The Installation and First Run section provides guidance on getting started. Additional sections provide instructions on Using webxray to Analyze Your Own List of Pages, Viewing and Understanding Reports, Advanced Options, and Getting Help.

Dependencies

webxray depends on several pieces of software being installed on your computer in advance. If you are familiar with installing dependencies on your own, you may install what is listed below and skip to Installation and First Run. If you are not familiar with dependencies, follow the detailed instructions for Ubuntu and macOS below. Note that webxray can be run on Windows, but detailed instructions are not currently available.

The dependencies for a standard webxray install are as follows:

Python Version 3.4+: https://www.python.org

Google Chrome Version 64+: https://www.google.com/chrome/

Chromedriver: https://sites.google.com/a/chromium.org/chromedriver/

Selenium: https://pypi.python.org/pypi/selenium

OS Specific Directions

Installing on Ubuntu

Step One: Install Google Chrome (if you already have Chrome go to Step Two)

If you are using Ubuntu desktop, download Chrome here: https://www.google.com/chrome/

If you are on Ubuntu server, run the following commands:

    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo dpkg -i google-chrome-stable_current_amd64.deb

It is likely you will get errors. If so, run the following:

    sudo apt -f install

Run the following command to make sure Chrome is installed. If you get an error, try the above steps again or search the web for advice.

    google-chrome --version

Step Two: Install chromedriver

Chromedriver allows other programs to control Chrome. You must download chromedriver from Google:

In a browser go to: https://sites.google.com/a/chromium.org/chromedriver/downloads

Find the version of chromedriver which corresponds to the version of Chrome displayed in the previous step.

Download 'chromedriver_linux64.zip'

If you are on a server, you can find the correct address using the above steps on another computer and use wget to get the file to your server.

Once you have downloaded chromedriver, install it with the following commands:

    unzip chromedriver_linux64.zip
    sudo mv chromedriver /usr/bin/

Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.

    chromedriver --version

Step Three: Install pip3

While Ubuntu includes Python3 by default, it does not include the Python3 package manager pip3, so you will need to install it using this command:

    sudo apt install python3-pip

Run the following command to make sure pip3 is installed. If you get an error, try the above steps again or search the web for advice.

    pip3 --version

Step Four: Install Selenium for Python3

Selenium is the glue between Python3 and web browsers. Install it with the following command:

    sudo pip3 install selenium

You are now ready to install webxray.

macOS Specific Directions

macOS is a UNIX-based system and setting up webxray is relatively straightforward.

Step One: Install Homebrew

Homebrew is a command-line tool which helps you install and manage various other command-line tools. To install Homebrew, go to the following site and follow the instructions; note that it may take some time to download and install: https://brew.sh

By default, Homebrew sends information to Google Analytics. You can disable that with the following command using the terminal (which you should have open after installing Homebrew):

    brew analytics off

Step Two: Install Python3

Python3 is needed to run webxray. Enter the following command to install it:

    brew install python3

To make sure you have the right version of Python installed, run the following command:

    python3 --version

If you see 3.4 or above you are good to go!

Step Three: Install chromedriver

Chromedriver allows other programs to control Chrome.
Homebrew will install chromedriver for you using the following command:

    brew install chromedriver

Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.

    chromedriver --version

Step Four: Install Selenium for Python3

Selenium is the glue between Python3 and web browsers. Install it with the following command:

    sudo pip3 install selenium

You are now ready to install webxray.
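Once the steps above are done, a quick check from Python can confirm that the pieces are visible to the interpreter you will run webxray with. The following is only a minimal sketch and not part of webxray: the function name check_dependencies is made up for this example, and it does not look for Chrome itself because the name and location of the Chrome binary differ between systems.

```python
import importlib.util
import shutil
import sys


def check_dependencies():
    """Return a dict mapping each dependency label to True if it was found.

    Hypothetical helper for illustration; webxray does not ship this function.
    """
    return {
        # webxray requires Python 3.4 or newer
        "python>=3.4": sys.version_info >= (3, 4),
        # chromedriver should be on the PATH (e.g. /usr/bin after the mv step)
        "chromedriver": shutil.which("chromedriver") is not None,
        # the selenium package must be importable by this interpreter
        "selenium": importlib.util.find_spec("selenium") is not None,
    }


for label, found in check_dependencies().items():
    print("{:<14} {}".format(label, "ok" if found else "MISSING"))
```

If an entry is reported as MISSING, repeat the matching installation step for your operating system.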

Installation and First Run

The basic installation uses Chrome as the browser in 'headless' mode, meaning you will not see the browser open the pages you are analyzing. Data is stored in a SQLite database, so you do not need to install a database server; databases are created in the directory './webxray/resources/db/sqlite/'.

Once your dependencies are installed you can download webxray from GitHub, or you can clone the GitHub repository using the following command:

    git clone https://github.com/timlib/webxray.git

Now webxray is ready to go! To use it, enter the following commands:

    cd webxray
    python3 run_webxray.py

This is the interactive mode, which will guide you through scanning a list of sample websites.

Important Note: If you are running webxray as the 'root' user in Linux it may not run properly due to limitations in Chrome. If webxray stalls or crashes after the 'Building List of Pages' message, run webxray as a non-root user.
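Because the data lives in an ordinary SQLite file, you can look inside it with Python's built-in sqlite3 module after a scan. The sketch below lists the tables in a database file without assuming anything about webxray's schema; the function name list_tables is made up for this example, and the database file name you pass in is whatever appears in the directory named above.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path.

    Hypothetical helper for illustration; it only reads SQLite's own catalog
    and makes no assumptions about the tables webxray creates.
    """
    # sqlite3.connect() silently creates a missing file, so check first
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

To use it, call list_tables() with the path to one of the files in './webxray/resources/db/sqlite/' and print the result.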

Using webxray to Analyze Your Own List of Pages

The raison d'être of webxray is to allow you to analyze pages of your choosing. To do so, first place all of the page addresses you wish to scan into a text file and place this file in the "page_lists" directory. Make sure your addresses start with "http://" or "https://"; if they do not, webxray will not recognize them as valid addresses. Once you have placed your page list in the proper directory, you may run webxray and it will allow you to select your page list.
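Before running a long scan it can be worth checking your page list for addresses that are missing the required prefix. The following is a minimal sketch, assuming one address per line; the function name split_page_list is made up for this example, and the check is a simple prefix test that may not match webxray's own validation exactly.

```python
def split_page_list(lines):
    """Split lines into (valid, invalid) addresses based on their prefix.

    Hypothetical helper for illustration; assumes one address per line.
    """
    valid, invalid = [], []
    for line in lines:
        address = line.strip()
        if not address:
            continue  # ignore blank lines
        if address.startswith(("http://", "https://")):
            valid.append(address)
        else:
            invalid.append(address)
    return valid, invalid


sample = ["https://example.com", "example.org", "", "http://example.net/page"]
valid, invalid = split_page_list(sample)
print("valid:", valid)
print("invalid:", invalid)
```

To check a real file, pass an open file object instead of the sample list, since iterating over a file yields its lines.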

Viewing and Understanding Reports

Once you have completed your data collection, use the interactive mode to guide you through generating an analysis. When the analysis is complete, the results are written to the '/reports' directory, which will contain a number of CSV files:

db_summary.csv: a basic report of what is in the database and how many pages loaded

stats.csv: provides top-level stats on how many domains are contacted, cookies, javascript, etc.

aggregated_tracking_attribution.csv: details on percentages of sites tracked by different companies and their subsidiaries

3p_domain.csv: most frequently occurring third-party domains

3p_element.csv: most frequently occurring third-party elements of all types

3p_image.csv: most frequently occurring third-party images

3p_javascript.csv: most frequently occurring third-party javascript

3p_ssl_use.csv: rates at which detected third-parties encrypt requests

data_xfer_summary.csv: volume and percentage of data received from first- and third-party domains

data_xfer_aggregated.csv: volume and percentage of data received from various companies

data_xfer_by_domain.csv: volume and percentage of data received from specific third-party domains

network: pairings between page domains and third-party domains; you can import this into network visualization software

per_page_data_flow.csv: one very large file that lists the requests made for each page; it is off by default
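The reports listed above are plain CSV files, so they can be opened in a spreadsheet or read with Python's built-in csv module. The sketch below previews the first few rows of any report without assuming particular column names; the function name preview_report is made up for this example, and it assumes the file is UTF-8 encoded and that its first row is a header.

```python
import csv
from itertools import islice


def preview_report(path, max_rows=5):
    """Return (header, rows): the first row and up to max_rows following rows.

    Hypothetical helper for illustration; assumes UTF-8 and a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(islice(reader, max_rows))
    return header, rows
```

For example, calling preview_report() with the path to stats.csv in your reports directory returns the column names and the first five rows.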

Advanced Options

The following are details on how to use several advanced functions. Unlike the above, these directions assume you are capable of doing light editing of Python3 code.

Analyze a Single Page

Sometimes you just want to run a single quick scan. To do so, use the command below. Be sure to replace "http://example.com" with the address of the site you want to scan.

    python3 run_webxray.py -s http://example.com

Run Many Browsers in Parallel to Increase Speed

By default, webxray will only run a single browser at a time. Given that webxray waits 45 seconds for each page to load, scanning 1,000 pages takes at least 12.5 hours (1,000 x 45 seconds = 45,000 seconds). However, most systems can handle running many more browsers at a time, resulting in significant speed gains. To use one browser per available processor core, open run_webxray.py and change "pool_size = 1" to "pool_size = None"; this is the most straightforward way to increase speed. Since webxray spends most of its time waiting for pages to load, you may also experiment with setting pool_size above the number of available cores. It is possible to do some performance tuning to determine how many browsers you can run before you get crashes. If you wish to scan more than 100,000 pages, performance tuning is highly advised.

Change How Long the Browser Waits After Loading a Page

In order to get all the third-party elements possible, webxray waits for 45 seconds after loading a page. You can make this longer or shorter by changing the line "browser_wait = 45" in run_webxray.py. Note that 45 seconds works very well and, due to specifics of Chrome, setting it to fewer than 45 seconds may result in lost cookies.

Run Chrome in Windowed Mode

By default, Chrome is run without a window opening on your computer, which uses fewer resources, is less annoying, and is required if your computer doesn't have a monitor. If you do want to see pages loading, you can enable 'windowed' mode by opening the file './webxray/ChromeDriver.py', finding the line "self.headless = True", and changing it to "self.headless = False".
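To get a feel for how the pool_size and browser_wait settings described in this section affect total scan time, the arithmetic can be sketched as follows. This is a rough lower bound only: the function name estimated_scan_hours is made up for this example, it ignores the time pages take to load, and it assumes pages are processed in even batches, which a real pool of browsers will not do exactly. It also expects an integer pool_size rather than the "None" setting used in run_webxray.py.

```python
import math


def estimated_scan_hours(num_pages, browser_wait=45, pool_size=1):
    """Lower-bound estimate of scan time in hours.

    Hypothetical helper for illustration: pages are treated as batches of
    pool_size, and each batch costs at least browser_wait seconds.
    """
    batches = math.ceil(num_pages / pool_size)
    return batches * browser_wait / 3600


for pool in (1, 4, 8):
    hours = estimated_scan_hours(1000, pool_size=pool)
    print("{} browser(s): at least {} hours".format(pool, hours))
```

With the default settings the estimate for 1,000 pages is 12.5 hours, and each doubling of the number of browsers roughly halves it, as long as your system can keep up.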

Getting Help

If you are having problems installing the software or find bugs, please open an issue on GitHub. If you have advanced needs and require assistance, or if you are interested in commissioning custom-written reports rather than running the software yourself, please email Timothy Libert: contact@webxray.eu.