Feb 04, 2020 • 8 minutes • 3600 views

Introduction

We often need to prove "I'm not a robot". That's because there are robots on the internet - and there are a lot of them. These bots, or spiders as we fondly call them, get blocked all the time; it's nothing new.

Companies try their best to keep web crawlers out of their websites (unless it’s Googlebot of course).

However, scraping is not illegal. Any data that’s publicly available on the web can be accessed by anyone.

Bots have rights, too.

This article highlights 5 things you should do to avoid getting blacklisted while scraping a website.

Send HTTP headers with your request:

HTTP headers let the client and the server pass additional information with an HTTP request or response.

If a server receives a request without any headers, it has enough reason not to authorize that request, because the absence of headers indicates that the request is not coming from a browser.

Request headers include Cookie, User-Agent, etc. You can open your browser's developer tools, go to the Network tab, and see which headers your browser sends with each request.

Let us try to replicate this request using Python:

import requests

HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'DNT': '1',
    'Host': 'example.com',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}

response = requests.get(url="https://example.com", headers=HEADERS)
print(response)

<Response [200]>

In many cases, you need to use the Cookie header to authorize your request.
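For instance, if a page requires a logged-in session, you can copy the Cookie value from the Network tab and attach it to all of your requests via a session. This is only a sketch: the cookie string and URL below are made-up placeholders.

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    # Placeholder value -- copy the real Cookie string from your
    # browser's Network tab after logging in.
    'Cookie': 'sessionid=abc123; csrftoken=xyz789',
})

# Every request made through this session now carries the same headers:
# response = session.get("https://example.com/account")
```

Using a `requests.Session` also reuses the underlying connection, which is both faster and closer to how a real browser behaves.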

Proxies

Some websites flag the IP of the source request.

In simpler terms, they put your IP on a blacklist for a certain period of time to prevent you from scraping their webpages.

This is where proxy IPs come into play. You can compile a list of these proxy addresses, and then simply keep rotating them.

Even if one of these proxy IPs is blocked, you can always switch to another.

Sending a request with a proxy server:

import requests

PROXIES = {
    'http': 'http://144.91.78.58:80',
    'https': 'http://144.91.78.58:80',
}

response = requests.get(url="https://example.com", proxies=PROXIES)

The proxy used in this example might not work at the time when you run it. Fortunately, there are many freemium websites that offer SSL proxies.

You can also purchase premium proxies to ensure good connectivity, speed, uptime, etc.
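The rotation idea described above can be sketched with itertools.cycle. The proxy addresses below are placeholders; substitute live ones from your own list.

```python
import itertools
import requests

# Placeholder proxy addresses -- replace these with live proxies
# from your own list.
PROXY_LIST = [
    '144.91.78.58:80',
    '103.216.82.20:8080',
    '51.158.68.133:8811',
]
proxy_pool = itertools.cycle(PROXY_LIST)

def get_with_rotating_proxy(url):
    """Send a GET request through the next proxy in the pool."""
    proxy = next(proxy_pool)
    proxies = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
    return requests.get(url, proxies=proxies, timeout=10)

# Each call picks the next proxy, wrapping around when the list runs out:
# response = get_with_rotating_proxy("https://example.com")
```

If a proxy fails, you can simply call the function again; the next request will go out through a different address.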

Time interval

You can’t just overwhelm a server with requests. That’s an easy way to get yourself blocked.

Make your script wait a few seconds after every request; better still, make that interval random to give your script a more "human touch".

You can use Python's built-in time module:

import requests
import time

for n in range(10):
    response = requests.get(url=f"https://example.com/p/{n}")
    time.sleep(3)  # wait 3 seconds between requests

When time is not a critical factor, consider increasing this interval even further.
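To make the interval random, as suggested above, you can combine time.sleep with random.uniform. The bounds below are arbitrary choices for illustration:

```python
import random
import time

def polite_sleep(min_s=2.0, max_s=6.0):
    """Pause for a random interval so requests don't arrive on a fixed beat."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Usage inside a scraping loop:
# for n in range(10):
#     response = requests.get(f"https://example.com/p/{n}")
#     polite_sleep()
```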

Switch UA

The User-Agent header helps the server identify the browser, operating system, and device, i.e. the source of the request.

Different browsers running on different operating systems have different user-agent strings.

The value of the User-Agent header your browser sends depends on your browser and operating system; if you access the same page using a different browser or OS, the string will look different.

Thankfully, you can customize headers. Switching the user agent every few requests can reduce the chance of getting blocked, because doing so changes the fingerprint of the request as a whole.

This text file [Github Gist] contains 1000 different user agent strings, which you can use as the value of the User-Agent header like this:

import random
import requests

USER_AGENT_LIST = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.94 Chrome/37.0.2062.94 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9',
    'Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
]

HEADERS = {'User-Agent': random.choice(USER_AGENT_LIST)}
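To switch the user agent on every request rather than once per run, you can build a fresh header dict inside the loop. A minimal sketch (the two user-agent strings are just a shortened sample of the full list):

```python
import random
import requests

# Shortened sample -- in practice, load the full list from the gist.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9',
]

def random_headers():
    """Build a fresh header dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

# for url in urls:
#     response = requests.get(url, headers=random_headers())
```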

Headless scraping with Selenium

You can use Selenium (a browser automation tool) to open websites, read & parse HTML, etc. Using Selenium doesn't guarantee you will never get blocked, but many websites that immediately block non-browser requests are less likely to block these web drivers.

However, this method is a bit 'heavier', and relatively slower as it involves opening a browser. At the same time, you can execute JavaScript from selenium, and that makes it incredibly useful.

Running Selenium:

First of all, you need a web driver; that's what controls the browser.

In this tutorial, we will be using the most popular one - chromedriver, an open-source web driver built by Google to control Chrome.

You can download it from here.

Make sure you download the same version as the Chrome installed on your computer.

Now we need to install the selenium library for Python:

pip install selenium

That's all we need to set up Selenium.

Run this code to open Google in a browser (change the value of CHROMEDRIVER_PATH):

from selenium import webdriver

CHROMEDRIVER_PATH = "/path/to/chromedriver"

driver = webdriver.Chrome(CHROMEDRIVER_PATH)
driver.get("https://google.com/")

The end result: a Chrome window opens and loads Google.

Possible errors:

Mismatch of version - Check the version of Chrome you are running (go to chrome://version), and download the matching version of chromedriver.

PATH - Have you changed the value of CHROMEDRIVER_PATH to the location of the chromedriver executable you downloaded?
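The example above opens a visible browser window. For actual headless scraping, as the section title suggests, you can pass Chrome's headless flag via an Options object. This is a sketch assuming a Selenium 3.x-style setup with an explicit driver path; it is not tested against every Selenium version.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

CHROMEDRIVER_PATH = "/path/to/chromedriver"

options = Options()
options.add_argument("--headless")            # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
driver.get("https://google.com/")
print(driver.title)
driver.quit()
```

Headless mode keeps the ability to execute JavaScript while removing the overhead of rendering a window, which makes it a common choice on servers.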

Selenium lets you automate most of the parts you would need for debugging, testing, and scraping.

Conclusion

If this article had to be summarised in one sentence, it would be: make your requests look human.

However, the article doesn’t promote attacking websites in any way. Websites have their own guidelines for web crawlers which you can find at /robots.txt, and you should respect those rules. In fact, web scraping frameworks like Scrapy comply with the /robots.txt rules by default (you can disable this as well).
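Python's standard library can even check those rules for you. A small sketch using urllib.robotparser, with a made-up robots.txt for illustration:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt (made up for illustration); in practice you would
# call rp.set_url("https://example.com/robots.txt") and rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyScraperBot", "https://example.com/private/data"))  # False
print(rp.can_fetch("MyScraperBot", "https://example.com/public/page"))   # True
```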

Further reading