Creating Your Crawler

I ran the command scrapy startproject olx , which creates a project named olx along with helpful information about your next steps. Go into the newly created folder, then run the command that generates your first spider, giving it a name and the domain of the site to be crawled:

Adnans-MBP:ScrapyCrawlers AdnanAhmad$ cd olx/
Adnans-MBP:olx AdnanAhmad$ scrapy genspider electronics www.olx.com.pk
Created spider 'electronics' using template 'basic' in module:
  olx.spiders.electronics

I named my first spider electronics since I am crawling the electronics section of OLX. You can name your spider anything you want.

The final project structure will be something like the example below:

Scrapy Project Structure

As you can see, there is a separate folder just for spiders, and you can add multiple spiders within a single project. Let’s open the electronics.py spider file; you will see something like this:

As you can see, ElectronicsSpider is a subclass of scrapy.Spider . The name attribute is the spider's name, as given in the generation command; you will use it to run the crawler. The allowed_domains attribute tells the crawler which domains it may visit, and start_urls lists the initial URLs to fetch. Together, these attributes define the boundaries of your crawler.

The parse method, as the name suggests, will parse the content of the page being accessed. Since I want to write a crawler that goes to multiple pages, I am going to make a few changes.

In order to make the crawler navigate to several pages, I subclassed my spider from CrawlSpider instead of scrapy.Spider . This class makes crawling many pages of a site easier. You could do something similar with the generated code, but you would have to handle the recursion to the next pages yourself.

The next step is to set your rules variable, where you define how the crawler navigates the site. The LinkExtractor takes parameters that draw the navigation boundaries. Here I am using the restrict_css parameter to point at the CSS class of the NEXT-page link. If you go to this page and inspect the element, you will find something like this:

pageNextPrev is the class that will be used to fetch the links to the next pages. The callback parameter tells Scrapy which method to use to process each fetched page. We will work on this method soon.