Some websites protect themselves from web scraping. However, it can still be reasonable and fair (and, based on a recent US court ruling, also legal) to extract data from them. In this article, we'll go through the most commonly used anti-scraping protections and show you how to bypass them.



There are four main categories of protection against scraping:

1) IP detection

2) IP rate limiting

3) Browser detection

4) Tracking user behavior

IP detection

Some websites deny access to their content based on the location of your IP address. They just want to show their content to users from given countries.



Another option is that some websites block access based on the IP range your address belongs to. This kind of protection is usually implemented to reduce the amount of non-human traffic. For instance, websites will deny access to IP ranges of Amazon Web Services and other commonly known ranges.
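To make the idea of range-based blocking concrete, here is a minimal sketch of how a site might test an incoming address against a list of blocked IPv4 ranges. The function names and the blocklist are hypothetical, and the ranges are documentation-only addresses (RFC 5737), not real cloud provider ranges.

```javascript
// Convert a dotted IPv4 address such as "192.0.2.10" to an unsigned 32-bit integer.
function ipToInt(ip) {
    return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// Return true when `ip` falls inside the CIDR block, e.g. "192.0.2.0/24".
function isInCidr(ip, cidr) {
    const [base, bits] = cidr.split('/');
    const prefix = Number(bits);
    // A /0 block matches everything; handle it explicitly because
    // shifting a 32-bit value by 32 is a no-op in JavaScript.
    const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
    return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// Hypothetical blocklist built from documentation-only ranges.
const BLOCKED_RANGES = ['192.0.2.0/24', '198.51.100.0/24'];

// Return true when the address belongs to any blocked range.
function isBlocked(ip) {
    return BLOCKED_RANGES.some((range) => isInCidr(ip, range));
}
```

A check like this is cheap to run on every request, which is part of why whole data center ranges tend to get blocked at once.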



This kind of protection can usually be bypassed easily by using a proxy server.



On the Apify platform, you can either use our pool of proxy servers based in the United States, ask us to order a custom dedicated pool from the countries you need, or use your own proxy servers from services like oxylabs.io or luminati.io.
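If you bring your own proxy server, one common way to route a headless Chrome through it is the standard Chromium --proxy-server command-line switch. The sketch below only builds the launch options; the helper name, host, and port are placeholders, not part of the Apify SDK.

```javascript
// Build browser launch options that route all traffic through a proxy.
// `proxyHost` and `proxyPort` are placeholders for your own proxy server.
function buildLaunchOptions(proxyHost, proxyPort) {
    return {
        headless: true,
        // --proxy-server is a standard Chromium command-line switch.
        args: [`--proxy-server=http://${proxyHost}:${proxyPort}`],
    };
}

const launchOptions = buildLaunchOptions('proxy.example.com', 8000);
console.log(launchOptions.args[0]); // --proxy-server=http://proxy.example.com:8000
```

With plain Puppeteer, an object like this would be passed to puppeteer.launch(). If the proxy requires credentials, they are typically supplied separately with page.authenticate(), since the switch itself does not carry a username or password.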



IP rate limiting

The second most common protection is to limit access based on how many requests were made from a single IP address in a certain period of time.



This kind of protection can be either manual (meaning a human checks the logs and blocks any IP address that generates large volumes of traffic) or fully automatic.



For example, for google.com, you can typically make only around 300 requests per day, and if you reach this limit, you will see a CAPTCHA instead of search results. Another example could be a website which allows ten requests per minute and throws an error for anything above this threshold.
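To illustrate the automatic variant, here is a rough sketch of a sliding-window limiter that a site could run per IP address, using the "ten requests per minute" example from above. The class name and its API are our own invention for illustration; real websites may count requests quite differently.

```javascript
// Allows at most `maxRequests` per IP address within any `windowMs` period.
class SlidingWindowLimiter {
    constructor(maxRequests, windowMs) {
        this.maxRequests = maxRequests;
        this.windowMs = windowMs;
        this.hits = new Map(); // IP address -> timestamps of recent allowed requests
    }

    // Returns true if the request is allowed, false if the IP is over the limit.
    // `now` is injectable so the behavior can be tested without waiting.
    allow(ip, now = Date.now()) {
        // Keep only the requests that are still inside the window.
        const recent = (this.hits.get(ip) || []).filter((t) => now - t < this.windowMs);
        const allowed = recent.length < this.maxRequests;
        if (allowed) recent.push(now);
        this.hits.set(ip, recent);
        return allowed;
    }
}
```

Creating the limiter with new SlidingWindowLimiter(10, 60 * 1000) models the ten-per-minute rule. Note that each IP address has its own counter, which is exactly why slowing down or spreading requests over several addresses helps.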



Protection like this can be temporary, but sometimes it can be permanent, especially if it is done manually by a human.

There are two ways to work around rate limiting. One option is to limit the max concurrency and possibly even introduce delays (after reaching concurrency 1) in execution to make the crawling process slower. The second option is to use proxy servers and rotate IP addresses after a certain number of requests.



To lower the concurrency when using our SDK, pass the maxConcurrency option to the Crawler setup. If you use scrapers from our Store, you can usually set the max concurrency in the input.

If even maxConcurrency: 1 is too fast, you can add some delays, although this is rarely needed. Here is how you can do it in Web Scraper:

async function pageFunction(context) {
    // Just wait 5 seconds on each page
    await context.waitFor(5000);

    // Do your scraping...
}

In an Apify actor, you can introduce a delay before execution with the promise-based sleep() function from the Apify SDK as follows:

const Apify = require('apify');

Apify.main(async () => {
    await Apify.utils.sleep(10 * 1000);

    // Any code below will be delayed by 10 seconds...
});
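A fixed delay works, but perfectly regular gaps between requests can look mechanical. As a variation that is not part of the SDK, the sketch below picks a random delay within a range. All function names here are hypothetical, and the 2 to 5 second range is just an example.

```javascript
// Pick a delay between minMs and maxMs (inclusive).
// `rng` is injectable so the function can be tested deterministically.
function randomDelayMs(minMs, maxMs, rng = Math.random) {
    return minMs + Math.floor(rng() * (maxMs - minMs + 1));
}

// Promise-based sleep, similar in spirit to Apify.utils.sleep().
function sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

// Wait a random 2 to 5 seconds, then run the given async task.
async function runPolitely(task) {
    await sleep(randomDelayMs(2000, 5000));
    return task();
}
```

Inside an actor, you could wrap each page visit in runPolitely() or simply pass the result of randomDelayMs() to Apify.utils.sleep().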

To use the second method and rotate proxy servers in your Apify actor or task, you can just pass the proxyConfiguration either to the input or the Crawler class setup.
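If you manage your own list of proxies instead, the rotation logic itself is simple. Here is a minimal sketch of switching to the next proxy after a fixed number of requests. The ProxyRotator class and the proxy URLs are hypothetical and not part of the Apify SDK.

```javascript
// Hands out proxy URLs round-robin, switching after `requestsPerProxy` requests.
class ProxyRotator {
    constructor(proxyUrls, requestsPerProxy) {
        if (proxyUrls.length === 0) throw new Error('At least one proxy URL is required');
        this.proxyUrls = proxyUrls;
        this.requestsPerProxy = requestsPerProxy;
        this.served = 0; // total number of requests handed out so far
    }

    // Returns the proxy URL to use for the next request.
    next() {
        const index = Math.floor(this.served / this.requestsPerProxy) % this.proxyUrls.length;
        this.served += 1;
        return this.proxyUrls[index];
    }
}

// Placeholder proxy URLs; replace with your own servers.
const rotator = new ProxyRotator(['http://proxy-a.example.com:8000', 'http://proxy-b.example.com:8000'], 2);
console.log(rotator.next()); // http://proxy-a.example.com:8000
```

Choose requestsPerProxy so that each address stays safely under the limit you observed on the target website.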

Browser detection

Another relatively pervasive form of protection is based on the web browser that you are using.

User agents

Some websites inspect the User-Agent HTTP header to block access from specific devices. You can rotate user agents to overcome this limit. Be careful, though: a lot of libraries contain outdated user agents that can actually make the result worse.
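One way to avoid outdated entries is to keep a small pool of user agents that you maintain yourself. The following is a sketch under that assumption: the strings are examples that will age and need refreshing, and the helper names are hypothetical.

```javascript
// Example user agent strings. These are illustrative and will go stale,
// so refresh them regularly from browsers you actually see in the wild.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

// Pick one user agent at random. `rng` is injectable for testing.
function pickUserAgent(rng = Math.random) {
    return USER_AGENTS[Math.floor(rng() * USER_AGENTS.length)];
}

// Build request headers with a rotated User-Agent.
function buildHeaders(rng = Math.random) {
    return {
        'User-Agent': pickUserAgent(rng),
        'Accept-Language': 'en-US,en;q=0.9',
    };
}
```

The value returned by pickUserAgent() can be used anywhere a user agent string is expected, for example with page.setUserAgent() in Puppeteer.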



The Apify SDK doesn't provide its own user agent rotation yet, as we are still figuring out the best solution. However, both Apify.launchPuppeteer() and PuppeteerCrawler have a parameter called "userAgent". Here is an example of launching Puppeteer with a random user agent using the modern-random-ua NPM package:

const Apify = require('apify');
const randomUA = require('modern-random-ua');

Apify.main(async () => {
    // Set one random modern user agent for the entire browser
    const browser = await Apify.launchPuppeteer({
        userAgent: randomUA.generate(),
    });
    const page = await browser.newPage();

    // Or you can set the user agent for a specific page
    await page.setUserAgent(randomUA.generate());

    // And work on your code here
    await page.close();
    await browser.close();
});

Blocked PhantomJS

Old Apify crawlers used PhantomJS to open web pages, but when you open a web page in PhantomJS, it adds variables to the window object that make it easy for browser detection libraries to figure out that the connection is automated and not from a real person. Usually, websites that employ protection against PhantomJS will either block these connections or, even worse, mark the used IP address as a robot and ban it. Some scraping technologies are still based on PhantomJS.



The only way to crawl websites with this kind of protection is to switch to a standard web browser like headless Chrome or Firefox. That's one of the reasons why we launched Apify actor. All our actors in the Store and our SDK use headless or headful Chrome.

Blocked headless Chrome with Puppeteer

Puppeteer is essentially a Node.js API to headless Chrome. Although it is a relatively new library, there are already anti-scraping solutions on the market that can detect its usage based on a variable it puts into the browser's window.navigator property.



As a start, we developed a solution that removes the property from the web browser and thus prevents these kinds of protections from figuring out that the browser is automated. This feature later expanded to a stealth module that encompasses many useful tricks to make the browser look more human-like.



Here is an example of how to use it with Puppeteer and headless Chrome:

const Apify = require('apify');

Apify.main(async () => {
    const browser = await Apify.launchPuppeteer({ stealth: true });
    const page = await browser.newPage();

    await page.goto('https://www.example.com');

    // Add rest of your code here...
    await page.close();
    await browser.close();
});
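To show what such a detection and its countermeasure can look like, here is a simplified sketch. The article does not name the property, so we assume the widely known navigator.webdriver flag, which automated Chrome sets to true. The function names are hypothetical, and the plain object stands in for window.navigator so the example runs in Node.js. A real stealth module covers many more signals than this.

```javascript
// What a detection script might run inside the page.
function looksAutomated(nav) {
    return nav.webdriver === true;
}

// What a simple stealth patch might do: redefine the property
// so that reading it returns undefined, as in a regular browser.
function hideWebdriver(nav) {
    Object.defineProperty(nav, 'webdriver', {
        get: () => undefined,
        configurable: true,
    });
    return nav;
}

// Stand-in for window.navigator in an automated browser.
const fakeNavigator = { webdriver: true };
console.log(looksAutomated(fakeNavigator)); // true
hideWebdriver(fakeNavigator);
console.log(looksAutomated(fakeNavigator)); // false
```

In Puppeteer, a patch like this would have to run before any of the page's own scripts, which is what page.evaluateOnNewDocument() is for.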

Browser fingerprinting

Another option sometimes used by anti-scraping solutions is to create a unique fingerprint of the web browser and connect it using a cookie with the browser's IP address. Then if the IP address changes but the cookie with the fingerprint stays the same, the website will block the request.

In this way, sites are also able to track or ban fingerprints that are commonly used by scraping solutions - for example, Chromium with the default window size running in headless mode.
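As a rough illustration of the idea, a fingerprint can be as simple as a hash over a set of browser attributes. The sketch below is an assumption about how such a scheme could work, not a description of any particular product. The attribute names are hypothetical, and 800x600 is used because it is Puppeteer's default viewport.

```javascript
const crypto = require('crypto');

// Compute a stable hash over a set of browser attributes.
function fingerprint(attrs) {
    // Sort keys so the same attributes always produce the same hash.
    const canonical = Object.keys(attrs)
        .sort()
        .map((key) => `${key}=${attrs[key]}`)
        .join('|');
    return crypto.createHash('sha256').update(canonical).digest('hex');
}

// Hypothetical attribute sets for two otherwise identical browsers.
const defaultHeadless = { width: 800, height: 600, headless: true, language: 'en-US' };
const resizedBrowser = { width: 1037, height: 801, headless: true, language: 'en-US' };

// Changing a single attribute yields a completely different fingerprint.
console.log(fingerprint(defaultHeadless) === fingerprint(resizedBrowser)); // false
```

This is why varying parameters such as the window size between runs makes it harder to link your sessions together or to match them against a known scraper profile.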



The best way to fight this type of protection is to remove cookies and change the parameters of your browser for each run, and to switch to a real Chrome browser instead of Chromium.



Here is an example of how to launch Puppeteer with Chrome instead of Chromium using Apify SDK:

const browser = await Apify.launchPuppeteer({
    useChrome: true,
});
const page = await browser.newPage();

This example shows how to remove cookies from the current page object:

// Get current cookies from the page for certain URL
const cookies = await page.cookies('https://www.example.com');

// And remove them
await page.deleteCookie(...cookies);

Note that the snippet above needs to be run before you call page.goto()!



And this is how you can randomly change the size of the Puppeteer window using the page.setViewport() function:

await page.setViewport({
    width: 1024 + Math.floor(Math.random() * 100),
    height: 768 + Math.floor(Math.random() * 100),
});

Finally, you can use Apify's base Docker image called Node.JS 8 + Chrome + Xvfb on Debian to make Puppeteer run a normal Chrome in non-headless mode using the X virtual framebuffer (Xvfb).

Tracking user behavior

The last protection option sometimes employed by advanced anti-scraping solutions is to track the behavior of the user in order to detect anything that is not done by humans, like clicking on a link without actually moving a mouse cursor there. This kind of protection is commonly implemented together with browser fingerprinting and IP rate limiting by the most advanced anti-scraping solutions.

Bypassing this protection cannot be easily done with just a simple piece of code, but we have noticed that there are some patterns to look for and if you find those, then it is possible to bypass such a protection. Here's what you need to do:

1) Check the website to see if it's saving data about your browser

You can do that by opening Chrome DevTools in your Chrome browser and going to the Network tab. Then switch to either the XHR or Img tab, as the websites sometimes hide the tracking requests as image loads. Check if there are POST requests made when you open the page or carry out some action on the page. If you find a request that has weird encoded data, then you've hit the jackpot. Here's an example of what it might look like:



If you find a request like this one, you can check the payload value on a site like base64decode.org and if it contains data about your browser, you've found the tracking request.
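You can also do the decoding locally in Node.js instead of using a website. The sketch below assumes the payload is Base64-encoded JSON, which will not always be the case. The helper names and the list of "suspicious" keys are hypothetical, so adjust them to what you actually see in the request.

```javascript
// Decode a Base64 string and parse it as JSON if possible.
// Falls back to the decoded plain text when it is not valid JSON.
function decodePayload(encoded) {
    const text = Buffer.from(encoded, 'base64').toString('utf8');
    try {
        return JSON.parse(text);
    } catch (err) {
        return text;
    }
}

// Hypothetical keys that suggest the payload describes your browser.
const BROWSER_KEYS = ['userAgent', 'screen', 'plugins', 'timezone', 'webdriver'];

// Return true when the decoded payload contains any browser-related key.
function looksLikeTracking(decoded) {
    return typeof decoded === 'object'
        && decoded !== null
        && BROWSER_KEYS.some((key) => key in decoded);
}
```

Copy the payload value from the DevTools request detail, pass it to decodePayload(), and inspect the result. If it lists things like your user agent or screen size, you have most likely found the tracking request.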

2) Block the tracking requests

The next step is to disable the tracking. To do that, go to the view of all requests and check the Initiator column for that request. It usually contains the JavaScript file that initiated the call.

You will need to block this file in order to disable the protection. Here's an example of how to do that in Puppeteer:

// Tell Puppeteer that you want to be able to block
// requests on this page
await page.setRequestInterception(true);

page.on('request', (request) => {
    const url = request.url();
    // Check if the request is for the file that we want
    // to block. If it is, abort it, otherwise let it run.
    if (url.endsWith('main.min.js')) request.abort();
    else request.continue();
});

Now try to run your Apify actor. If everything works, you've successfully bypassed the protection. If the page stops working properly, the file contained other functions bundled with the protection. In that case, you can use the code above, but block the request that sends your browser data instead of the file that creates it.
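For that fallback case, the decision of which request to block can be isolated into a small function. This is only a sketch: the /collect path is a made-up placeholder for the endpoint you found in DevTools, and the function name is hypothetical.

```javascript
// Hypothetical tracking endpoint; replace with the path you found in DevTools.
const TRACKING_PATH = '/collect';

// Return true for POST requests sent to the tracking endpoint.
function shouldBlock(method, url) {
    return method === 'POST' && new URL(url).pathname === TRACKING_PATH;
}

console.log(shouldBlock('POST', 'https://www.example.com/collect?sid=1')); // true
console.log(shouldBlock('GET', 'https://www.example.com/main.min.js')); // false
```

Inside the page.on('request', ...) handler, you would then call request.abort() when shouldBlock(request.method(), request.url()) returns true and request.continue() otherwise.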

And that's all. If you find a website that still does not work even if you follow all these steps, let us know at support@apify.com - we love new challenges :)

Happy crawling!