Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can navigate pages just like a browser can.

However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.

Installation

We install Scrapy inside a virtualenv. This allows us to install Scrapy without affecting other system-installed modules.

Create a working directory and initialize a virtual environment in that directory.

    mkdir working
    cd working
    virtualenv venv
    . venv/bin/activate

Install scrapy now.

pip install scrapy

Check that it is working. The output below shows the Scrapy version as 1.4.0.

    scrapy
    # prints
    Scrapy 1.4.0 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      ...
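The same check can be made from Python. The sketch below is a convenience, not part of Scrapy: the helper name installed_version is hypothetical, and it assumes Python 3.8 or later so that the standard library's importlib.metadata is available. It returns None when the package is not installed in the active environment.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent.

    `installed_version` is a hypothetical helper name used for illustration.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Inside the activated virtualenv this should print the Scrapy version.
    print(installed_version("scrapy"))
```

Running this outside the virtualenv is a quick way to confirm that Scrapy was installed only inside it.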

Writing a Spider

Scrapy works by loading a spider: a Python class that inherits from scrapy.Spider.

Let's write a simple spider class to load the top posts from Reddit.

To begin with, create a file called redditspider.py and add the following to it. This is a complete spider class, though one which does not do anything useful for us. A spider class requires, at a minimum, the following:

A name identifying the spider.

A start_urls list variable containing the URLs from which to begin crawling.

A parse() method, which can be a no-op as shown.

    import scrapy

    class redditspider(scrapy.Spider):
        name = 'reddit'
        start_urls = ['https://www.reddit.com/']

        def parse(self, response):
            pass

This class can now be executed as follows:

    scrapy runspider redditspider.py
    # prints
    ...
    2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines: []
    2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
    2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    ...

Turn Off Logging

As you can see, this spider runs and prints a bunch of log messages, which can be useful for debugging. However, since they obscure the output of our program, let's turn them off for now.

Add these lines to the beginning of the file:

    import logging
    logging.getLogger('scrapy').setLevel(logging.WARNING)

Now, when we run the spider, the informational log messages should no longer clutter the output.
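Why does one line on the 'scrapy' logger quiet messages tagged scrapy.core.engine or scrapy.middleware? Python's logging module arranges loggers in a dotted-name hierarchy, and a logger with no level of its own inherits its parent's. The following sketch uses only the standard library and does not run Scrapy; it simply inspects the levels.

```python
import logging

# Raise the threshold on the parent "scrapy" logger, as in the spider file.
logging.getLogger("scrapy").setLevel(logging.WARNING)

# Child loggers with no level of their own inherit the parent's threshold.
engine_log = logging.getLogger("scrapy.core.engine")

inherits_warning = engine_log.getEffectiveLevel() == logging.WARNING
info_enabled = engine_log.isEnabledFor(logging.INFO)
error_enabled = engine_log.isEnabledFor(logging.ERROR)

print(inherits_warning, info_enabled, error_enabled)
```

INFO and DEBUG records are dropped, while warnings and errors still get through, so real problems remain visible.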

Parsing the Response

Let's now parse the response from the scraper. This is done in the parse() method, where we use response.css() to perform CSS-style selections on the HTML and extract the required elements.

To identify the CSS selections to extract, we use Chrome’s DOM Inspector tool to pick the elements. From Reddit’s front page, we see that each post is wrapped in a <div class="thing">...</div>.

So we select every div.thing on the page and work with each of those elements in turn.

    def parse(self, response):
        for element in response.css('div.thing'):
            pass
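To make the meaning of the div.thing selector concrete, here is a standard-library stand-in. This is not how Scrapy performs selection; it is only a sketch built on html.parser, and both the ThingCounter class and the sample markup are invented for illustration. It shows that the selector matches any div whose class list includes thing, even when other classes are present.

```python
from html.parser import HTMLParser


class ThingCounter(HTMLParser):
    """Count <div> elements whose class list contains 'thing' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list; it may also be missing.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "thing" in classes:
            self.count += 1


# Invented sample markup: two posts and one unrelated div.
SAMPLE_HTML = """
<div class="thing link"><p class="title">First post</p></div>
<div class="sidebar">Not a post</div>
<div class="thing"><p class="title">Second post</p></div>
"""

counter = ThingCounter()
counter.feed(SAMPLE_HTML)
print(counter.count)
```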

We also implement the following helper methods within the spider class to extract the required text.

The following method extracts all text from an element as a list, joins the elements with a space, and strips away the leading and trailing whitespace from the result.

    def a(self, response, cssSel):
        return ' '.join(response.css(cssSel).extract()).strip()

And this method extracts text from the first element and returns it.

    def f(self, response, cssSel):
        return response.css(cssSel).extract_first()
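The behaviour of the two helpers can be tried without a live response. In the sketch below, plain Python lists stand in for what .extract() would return, and the function names join_all and first_or_none are hypothetical. The second function mirrors the fact that extract_first() returns None when nothing matches.

```python
def join_all(fragments):
    """Mirror helper a(): join every fragment with a space, then trim."""
    return " ".join(fragments).strip()


def first_or_none(fragments):
    """Mirror helper f(): return the first fragment, or None if there is none."""
    return fragments[0] if fragments else None


# Plain lists standing in for the text fragments a selector might extract.
joined = join_all(["  Yo-Yo", "Skills "])
first = first_or_none(["30079", "30080"])
missing = first_or_none([])

print(repr(joined), repr(first), repr(missing))
```

The None case matters later: any field extracted with f() may be absent, so code consuming the results should be prepared for it.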

Extracting Required Elements

Once these helper methods are in place, let's extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, this CSS selection for the required elements can be determined from any browser’s DOM Inspector.

    def parse(self, resp):
        for e in resp.css('div.thing'):
            yield {
                'title': self.a(e, 'div.entry>p.title>a.title::text'),
            }

The results are returned to the caller using Python’s yield statement. It works as follows: calling a function that contains a yield statement does not run the function body but returns a generator to the caller. The caller then iterates over this generator, receiving one result per yield, until the generator terminates.

In our case, the parse() method yields one dictionary with a single key (title) for each div.thing element, and the generator ends when the list of elements is exhausted.
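The generator mechanics can be seen in isolation with a few lines of plain Python. This sketch does not use Scrapy: the parse function here is a simplified stand-in that takes a list of strings rather than a response, and the titles are borrowed from the sample output in this article.

```python
def parse(elements):
    """Yield one dictionary per element, like the spider's parse() method."""
    for element in elements:
        yield {"title": element.strip()}


results = parse(["  They got fit together ", "Yo-Yo Skills"])

# Calling parse() ran none of its body; it only built a generator object.
first = next(results)      # runs the body up to the first yield
remaining = list(results)  # drains the generator until it terminates

print(first)
print(remaining)
```

Because items are produced one at a time, the caller can start processing the first result before the rest have been computed.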

Running the Spider and Collecting Output

Let us now run the spider again. A part of the copious output is shown (after re-instating the log statements).

    scrapy runspider redditspider.py
    # prints
    ...
    2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from {'title': u'The Plight of a Politician'}
    2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from {'title': u'Elephants foot compared to humans foot'}
    ...

It is hard to see the real output among the log messages. Let us write the scraped items to a file (posts.json) instead.

scrapy runspider redditspider.py -o posts.json

And here is a part of posts.json.

    ...
    {"title": "They got fit together"},
    {"title": "Not all heroes wear capes"},
    {"title": "This sub"},
    {"title": "So I picked this up at a flea market.."},
    ...
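Once the items are in a JSON file, they can be read back with the standard json module. The sketch below assumes the file holds a JSON array of objects, as the excerpt suggests. To stay self-contained it first writes a small stand-in file to a temporary directory; in practice you would open the real posts.json instead.

```python
import json
import os
import tempfile

# Stand-in for the file produced by `-o posts.json`; the entries are copied
# from the excerpt above, and the temporary path is an assumption for the demo.
sample = [
    {"title": "They got fit together"},
    {"title": "Not all heroes wear capes"},
]

path = os.path.join(tempfile.mkdtemp(), "posts.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)

# Read the file back and pull out just the titles.
with open(path, encoding="utf-8") as fh:
    posts = json.load(fh)

titles = [post["title"] for post in posts]
print(titles)
```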

Extract All Required Information

Let's also extract the subreddit name and the number of votes for each post. To do that, we just update the result returned from the yield statement.

    def parse(self, resp):
        for e in resp.css('div.thing'):
            yield {
                'title': self.a(e, 'div.entry>p.title>a.title::text'),
                'votes': self.f(e, 'div.score.likes::attr(title)'),
                'subreddit': self.a(e, 'div.entry>p.tagline>a.subreddit::text'),
            }

The resulting posts.json:

    ...
    {"votes": "28962", "title": "They got fit together", "subreddit": "r/pics"},
    {"votes": "6904", "title": "My puppy finally caught his Stub", "subreddit": "r/funny"},
    {"votes": "3925", "title": "Reddit, please find this woman who went missing during E3!", "subreddit": "r/NintendoSwitch"},
    {"votes": "30079", "title": "Yo-Yo Skills", "subreddit": "r/gifs"},
    {"votes": "2379", "title": "For every upvote I won't smoke for a day", "subreddit": "r/stopsmoking"},
    ...
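Note that the vote counts arrive as strings, because they are read from an HTML attribute. Sorting them as text would give the wrong order ("6904" sorts after "30079"), so they need converting first. The following sketch ranks a few of the records above by votes. The vote_count helper is hypothetical, and treating a missing or non-numeric value as 0 is an assumption made for the example.

```python
posts = [
    {"votes": "28962", "title": "They got fit together", "subreddit": "r/pics"},
    {"votes": "6904", "title": "My puppy finally caught his Stub", "subreddit": "r/funny"},
    {"votes": "30079", "title": "Yo-Yo Skills", "subreddit": "r/gifs"},
]


def vote_count(post):
    """Convert the scraped vote string to an int.

    Missing or non-numeric values are treated as 0 (an assumption for this demo).
    """
    votes = post.get("votes")
    return int(votes) if votes and votes.isdigit() else 0


# Highest-voted posts first.
ranked = sorted(posts, key=vote_count, reverse=True)
for post in ranked:
    print(vote_count(post), post["subreddit"], post["title"])
```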

Conclusion