Next in our series of Python modules you should know is Scrapy. Do you want to be the next Google? Read on.

Home page

Use

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

You can use Scrapy to extract any kind of data from a web page, in HTML, XML, CSV and other formats. I recently used it to automate the extraction of domains and emails on the ISPA Spam Hall of Shame list, for use in a DNSBL.

Installation

pip install scrapy

Usage

Scrapy is a very extensive package, so it is not possible to describe its full usage in a single blog post. There is a tutorial on the Scrapy website, as well as extensive documentation.

For this post I will describe how I used it to extract listed domains from the ISPA Hall of Shame website.

The page is http://ispa.org.za/spam/hall-of-shame/ and, looking at the page source, you find that the domains are displayed in lists, with the bold text "Domains: " before the actual list of domains:

<ul>
  <li><strong>Domains: </strong> dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za, lifeinsuranceza.co.za, portablebreathalyzer.co.za </li>
  <li><strong>Addresses: </strong>bounce@dfemail.co.za, bounce@extremedeals.co.za, bounce@hospitalcoverza.co.za, bounce@lifeinsuranceza.co.za, bounce@portablebreathalyzer.co.za, info@dfemail.co.za, info@extremedeals.co.za, info@gmarketing.co.za, info@hospitalcoverza.co.za, info@lifeinsuranceza.co.za, sales@portablebreathalyzer.co.za </li>
</ul>

The XPath expression to extract this is:

'//li/strong[text()="Domains: "]/following-sibling::text()'

For more information about XPath see the XPath reference.
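You can sanity-check the expression outside of Scrapy using lxml, the library that Scrapy's selectors are built on. This is a minimal sketch against an abridged copy of the page markup above:

```python
from lxml import etree

# Abridged copy of the ISPA page markup shown above
html = '''<ul>
<li><strong>Domains: </strong> dfemail.co.za, extremedeals.co.za </li>
<li><strong>Addresses: </strong>bounce@dfemail.co.za </li>
</ul>'''

tree = etree.HTML(html)
# Select the text node that follows the bold "Domains: " label;
# the "Addresses: " list item is not matched by the predicate
lines = tree.xpath('//li/strong[text()="Domains: "]/following-sibling::text()')
print(lines)
```

The predicate `[text()="Domains: "]` restricts the match to the list item we want, and `following-sibling::text()` grabs the raw comma-separated text that comes after the closing `</strong>` tag.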

With the XPath expression in hand, we can now write a spider to download the web page and extract the data we want.

Create a Python file crawl-ispa-domains.py with the following contents:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# crawl-ispa-domains.py
# Copyright (C) 2012 Andrew Colin Kissa <andrew@topdog.za.net>
# vim: ai ts=4 sts=4 et sw=4
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ISPASpider(BaseSpider):
    name = "ispa-domains"
    allowed_domains = ["ispa.org.za"]
    start_urls = [
        "http://ispa.org.za/spam/hall-of-shame/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        lines = hxs.select(
            '//li/strong[text()="Domains: "]/following-sibling::text()'
        ).extract()
        for line in lines:
            domains = line.split(',')
            domains = [domain.strip() for domain in domains if domain.strip()]
            for domain in domains:
                print domain
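The cleanup inside parse() boils down to splitting the comma-separated text node and discarding the surrounding whitespace. As a standalone sketch, using a sample of the text the XPath expression extracts:

```python
# Sample text node as extracted by the XPath expression
line = " dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za "

# Split on commas, strip whitespace, and drop any empty entries
domains = [domain.strip() for domain in line.split(',') if domain.strip()]
print(domains)
```

The `if domain.strip()` guard drops empty strings, which can appear if a list ends with a trailing comma.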

You can then run the spider from the command line and it will print the list of extracted domains:

scrapy runspider --nolog crawl-ispa-domains.py

And there is more

This post only scratches the surface of what Scrapy can do; see the documentation for details on what else the package offers.