Getting Started with the Code

The following configuration needs to be placed in a config.py file. This global configuration will be used across the different scripts.

# Target dataset path
DATASET_PATH = "./dataset"

# Fake user agent to avoid 503 errors
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
}

# Base URL being scraped
BASE_URL = "https://burst.shopify.com"

# Advanced parameters

# Categories to scrape
CATEGORIES = ["dog", "cat"]

# Page range to search for images
PAGE_FROM = 1
PAGE_TO = 2

# Number of workers for downloading pages and images in parallel
WORKERS = 4
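The CATEGORIES, PAGE_FROM, and PAGE_TO settings can be combined with BASE_URL to enumerate every page URL to scrape. Here is a minimal sketch of that expansion; note that the ?page=N query parameter is my assumption about the site's pagination, not something confirmed by the scripts above.

```python
# Sketch: expanding the global configuration into a list of page URLs.
# The "?page=N" query parameter is an assumption about the site's pagination.
BASE_URL = "https://burst.shopify.com"
CATEGORIES = ["dog", "cat"]
PAGE_FROM = 1
PAGE_TO = 2

urls = [
    f"{BASE_URL}/{category}?page={page}"
    for category in CATEGORIES
    for page in range(PAGE_FROM, PAGE_TO + 1)
]
print(urls[0])   # https://burst.shopify.com/dog?page=1
print(len(urls)) # 4
```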

Now, let’s get our hands dirty with code. First, import the libraries needed to get started:

from bs4 import BeautifulSoup

import os

import urllib.request

from tqdm import tqdm

import ssl

BeautifulSoup is used for parsing the scraped web pages, while urllib is used to download the pages and images. tqdm simply displays progress, and ssl is used to bypass certificate verification on requests.

We also need to import config.py to use the global configuration.

from config import *

Before starting, I am defining some local configuration that will be used as below.

timeout = 60 # Request timeout

url = BASE_URL + "/dog" # URL being scraped

target_dir = os.path.join(DATASET_PATH, "dog") # Target directory for scraped data

To download the page source, I have used the urllib library, where context specifies an unverified SSL context to avoid SSL exceptions, and HEADERS is imported from the global configuration to avoid the 503 errors generated by some web servers.

# Bypass SSL verification
context = ssl._create_unverified_context()

# Request the HTML page
req = urllib.request.Request(url, headers=HEADERS)
response = urllib.request.urlopen(req, timeout=timeout, context=context)

# Read the page source as one long string
html = response.read()

The downloaded HTML source needs to be parsed so we can access the properties of its tags. BeautifulSoup will be used for this; it has well-optimized classes and methods to access HTML tags by their unique properties.

# Parse HTML source using BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

Now we have a handler called soup, the parsed version of the HTML source, on which we can call any supported BeautifulSoup method.

As we learned earlier about the unique tags of the data, soup has a select() method that helps us find specific tags in the parsed HTML content. The images we want are in <img> tags with the unique class js-track-photo-stat-view.

The piece of code below fetches all <img> tags from the page whose class is js-track-photo-stat-view.

image_grids = soup.select('.js-track-photo-stat-view')

Next, we have to extract the URL of each image so that we can download it.

image_grids includes multiple <img> entries. Each <img> tag has a property named data-srcset which includes the image's URLs at resolutions 1x, 2x, and so on.
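To see what this preprocessing operates on, here is the same split logic applied to a hypothetical data-srcset value; the URLs are invented for illustration, not taken from an actual response.

```python
# Illustrative data-srcset value; the URLs are made up for this example.
data_srcset = ("https://burst.shopifycdn.com/photos/dog.jpg?width=375 1x,"
               " https://burst.shopifycdn.com/photos/dog@2x.jpg?width=750 2x")

entries = data_srcset.split(',')   # one entry per resolution
pair = entries[-1].split(' ')      # ['', '<url>', '2x'] for the last entry
url = pair[1].replace("@2x", "@3x")
print(url)  # https://burst.shopifycdn.com/photos/dog@3x.jpg?width=750
```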

We can access properties of an <img> tag using the get() method. The code below fetches the content of the data-srcset property and preprocesses it to find the highest-resolution image URL.

image_urls = []

for image_tag in tqdm(image_grids, desc="Find Images"):
    # Fetch the data-srcset property, which holds a sequence of URLs
    image_url = image_tag.get('data-srcset')
    # Extract the highest-resolution image from the data
    image_url = image_url.split(',')
    high_resolution_pair = image_url[-1].split(' ')
    high_resolution_image_url = high_resolution_pair[1].replace("@2x", "@3x")
    # Stack all image URLs
    image_urls.append(high_resolution_image_url)

Now we have the URLs stacked in the image_urls list. All we need to do is download the images into the target directory. The script below does the rest.

# Download images into the target directory
for image_url in tqdm(image_urls, desc="Download Images"):
    # Extract the file name from the URL
    file_name = image_url.split("/")[-1]
    # Build the target path of the image
    image_path = os.path.join(target_dir, file_name)
    # Create the target directory (and parents) if it does not exist
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
    if not os.path.exists(image_path):
        # Read the image from the web
        req = urllib.request.Request(image_url, headers=HEADERS)
        response = urllib.request.urlopen(req, timeout=timeout, context=context)
        # Write it down to the file system
        with open(image_path, 'wb') as f:
            f.write(response.read())

Every image fetched will be downloaded into the target directory. The script above displays output like the following, which may vary with your computer's performance and internet speed:

Find Images: 100%|###########################################################################################################################################| 50/50 [00:00<00:00, 107933.71it/s]

Download Images: 100%|###########################################################################################################################################| 50/50 [00:42<00:00, 2.81it/s]
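Note that config.py defines WORKERS, but the loop above downloads images one at a time. A possible parallel variant using concurrent.futures is sketched below; the download_image stub stands in for the urllib download-and-save logic shown above and simply returns the file name here, so the sketch is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4  # matches the WORKERS setting in config.py

def download_image(image_url):
    # Stub: in the real script this would fetch the image with urllib
    # and write it to target_dir, as in the loop above.
    return image_url.split("/")[-1]

image_urls = [
    "https://burst.shopifycdn.com/photos/a.jpg",
    "https://burst.shopifycdn.com/photos/b.jpg",
]

# Download up to WORKERS images concurrently
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    file_names = list(pool.map(download_image, image_urls))

print(file_names)  # ['a.jpg', 'b.jpg']
```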

You can find the whole script in the GitHub repo as basic_scrapper.py.