Web scraping is the technique of extracting data from websites, and the term typically refers to automated extraction. Today, I am going to show you how to crawl websites anonymously. The reason to hide your identity is that many web servers ban IP addresses after a certain number of consecutive requests. We are going to use Puppeteer for accessing web pages, cheerio for HTML parsing, and Tor to run each request from a different IP address.

While the legal aspects of web scraping vary, with many grey zones, remember to always respect the Terms of Service of each web page you scrape. Ben Bernard has written a nice article about those legal issues.

Setting Up Tor

First things first, we have to install our Tor client by using the following command.

sudo apt-get install tor

Configure Tor

Next, we are going to configure our Tor client. The default Tor configuration uses a SOCKS port to provide us with one circuit to a single exit node (i.e. one IP address). This is handy for everyday use, like browsing, but for our specific scenario we need multiple IP addresses, so that we can switch between them while scraping.

To do this, we’ll simply open additional ports to listen for SOCKS connections. This is done by adding multiple SocksPort options to the main configuration file under /etc/tor.

Open the /etc/tor/torrc file with your preferred editor and add the following lines at the end of the file.
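A minimal sketch of those lines, assuming four SOCKS ports that follow the numbering convention explained in the notes that come next. The number of ports is an assumption; add as many as you need.

```
# SOCKS listeners for the scraper (9051 is skipped on purpose)
SocksPort 9050
SocksPort 9052
SocksPort 9053
SocksPort 9054
```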

There are a couple of things to notice here:

The value of each SocksPort is a number: the port on which Tor listens for connections from SOCKS-speaking applications, like browsers.

Because each SocksPort value is a port to be opened, the port must not already be in use by another process.

The initial port is 9050. This is the default SOCKS port of the Tor client.

We skip 9051. Tor uses this port as its control port, which lets external applications connected to it control the Tor process.

As a simple convention, to open more ports, we increment each value after 9051 by one.

Restart the Tor client to apply the new changes.

sudo /etc/init.d/tor restart
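If you want to confirm that Tor picked up the new ports, a quick check is to try opening a TCP connection to each one. The sketch below uses only Node's built-in net module. The port list assumes the four SocksPort values 9050, 9052, 9053 and 9054, and the helper name isPortOpen is just an illustrative choice, so adjust both to your setup.

```javascript
const net = require('net');

// Assumed to match the SocksPort entries in /etc/tor/torrc.
const SOCKS_PORTS = [9050, 9052, 9053, 9054];

// Resolves to true if a TCP connection to host:port succeeds within timeoutMs.
function isPortOpen(port, host = '127.0.0.1', timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const finish = (open) => {
      socket.destroy(); // always release the socket
      resolve(open);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

async function main() {
  for (const port of SOCKS_PORTS) {
    const open = await isPortOpen(port);
    console.log(`SocksPort ${port}: ${open ? 'listening' : 'not reachable'}`);
  }
}

main();
```

This only tells you that something accepts connections on each port, not that it speaks SOCKS, but it is usually enough to catch a typo in torrc or a port that is already taken.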

Create a new Node project

Create a new directory for your project. I’ll call it superWebScraping.

mkdir superWebScraping

Navigate to superWebScraping and initialize an empty Node project.

cd superWebScraping && npm init -y

Install the required dependencies.

npm i --save puppeteer cheerio

Browse with Puppeteer

Puppeteer is a Node library that drives a headless Chrome or Chromium browser over the DevTools Protocol. We don’t use a request library, like tor-request, because request libraries cannot process SPA websites that load their content dynamically.

Create an index.js file and add the script below. The statements are documented inline.
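A minimal sketch of such a script, assuming Tor is listening on SOCKS port 9050 on localhost. The helper name torProxyArg and the five-second pause before closing are illustrative choices, not something Puppeteer requires.

```javascript
const TOR_HOST = '127.0.0.1';
const TOR_SOCKS_PORT = 9050; // assumed: the first SocksPort in torrc

// Builds the Chromium flag that routes all browser traffic through a Tor SOCKS port.
function torProxyArg(port) {
  return `--proxy-server=socks5://${TOR_HOST}:${port}`;
}

async function main() {
  // Required lazily so a missing dependency surfaces as a caught error below.
  const puppeteer = require('puppeteer');

  // headless: false opens a visible browser window so we can watch the navigation.
  const browser = await puppeteer.launch({
    headless: false,
    args: [torProxyArg(TOR_SOCKS_PORT)],
  });

  try {
    const page = await browser.newPage();
    // api.ipify.org responds with the public IP address it sees for the request.
    await page.goto('https://api.ipify.org');
    const ip = await page.$eval('body', (el) => el.innerText);
    console.log(`IP address seen by the server: ${ip}`);

    // Keep the window open briefly so the result is visible.
    await new Promise((resolve) => setTimeout(resolve, 5000));
  } finally {
    await browser.close();
  }
}

main().catch((err) => {
  console.error('Scrape failed:', err.message);
});
```

Passing a different port to torProxyArg, for example 9052, should send the browser through a different Tor SOCKS listener.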

Run the script with

node index.js

You should see the Chromium browser navigate to https://api.ipify.org, as in the following screenshot.