09 Mar 2014

Python web scraping resource

If you need to extract data from a web page, then the chances are you looked for their API. Unfortunately this isn't always available and you sometimes have to fall back to web scraping.

In this article I'm going to cover a lot of the things that apply to all web scraping projects and how to overcome some common gotchas.

Please Note: This is a work in progress. I am adding more things as I come across them. Got a suggestion? Drop me an email - jake.austwick@gmail.com

Important

Always check for if you can extract information via an API first if they provide one, RSS / Atom feeds are also a great option if provided.

Prerequisites

We'll be using two external python libraries primarily.

Requests

We'll be using the requests library instead of urllib2. It's better in every way. I could go into details and explain why, but I think the requests page sums it all up in this short paragraph:

Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

lxml

lxml is a XML/HTML parser, so we'll be using it to extract the data from the responses we are returned. Some people prefer BeautifulSoup but I'm confident at writing my own xPaths and so would prefer to stick with lxml for raw speed.

Crawling / Spidering

Crawling is exactly the same as a "general" scraper, except that you find the next URLs to scrape on each page that you parse. Crawling is a technique used to create large databases of information.

Examples:

Crawling an eBay category to gather product information. ( Don't - they have an API).

- they have an API). Crawling a category in the yellow pages for the contact information of a certain profession.

etc...

Storing seen URLs

You'll want to store the URLs you've already visited, so that you don't scrape them multiple times. For a small scrape(< 50k URLs), I would suggest simply using a set for this. You can then simply check for the presence of a URL in the set before fetching it / adding it to the "To Fetch Queue".

If you're going to be crawling a large site, then the above technique of simply using a set won't hold up. The set will grow huge and will start using a ton of memory. The solution to this is a Bloom Filter. I won't go into much detail, but a bloom filter essentially lets us store which urls we have seen, then ask the question "Have I seen this URL before?" and probably (very likely) returning the correct answer. It can do this using very little memory. pybloomfiltermmap is a good Bloom Filter implementation for Python.

Important: Make sure your normalize your URL before inserting it into the set / bloom filter. Depending on the site you may want to remove all query params etc. You don't want to be storing the same URL a ton of times.

CSS Selectors

If you're coming from a Javascript background, you might be more comfortable querying the DOM with CSS selectors rather than xPath. lxml provides a way to do this, it has a css_to_xpath method. Check out this page for a short guide on how to use it:

http://lxml.de/cssselect.html

PyQuery also comes highly recommended to me and I know it's widely used, so it's worth checking out too.

Extracting The Main Content Block

Sometimes you want to perform analysis on just the main content of the page, doing a word count for example. You wouldn't want to include the menu / sidebar within the count.

You can use readability-lxml for this. It's job is to simply extract the main content block (note: It can still contain HTML). Following on our word count example, you would remove the HTML using something like clean_html from within lxml.html.clean , then proceed to split on whitepace and count the results.

You can check out another article on my blog that shows how to use Readability to extract the main content.

Concurrency

We all know you want your data now, but hitting the server with 200 concurrent requests is going to make for one angry webmaster. Don't be that guy. Do everybody a favour and stick to a reasonable amount. 5 is usually enough.

To send requests concurrently easily with requests, check out the grequests library. It allows you to make your code concurrent in pretty much a line or two.

rs = (grequests.get(u) for u in urls) responses = grequests.map(rs)

Robots.txt

Almost every guide out there will tell you to always follow robots.txt rules. I'm not going to do that. The simple fact is they are usually really restrictive and don't even reflect what the webmasters restrictions would be. Most sites just use the default robots.txt for their framework / that their SEO plugin generates. Just make sure you're reasonable.

An exception to this rule is if you are writing a general spider. Something like Google/Bing. In this case, I would follow the rules the disallowed directories and crawl delay. If they don't want something indexing, then don't index it in your search engine / application.

Avoiding Detection

Sometimes you don't want to get caught. I get that. It isn't actually that difficult to achieve. The first thing to do it randomize your user agent. Your IP address will still be the same for all the requests, but that isn't unusual. Places like universities / businesses often have a ton of computers that all route traffic via 1 IP.

The next step is to use proxies. You can pick up shared proxies at a very small monthly cost, and when you're using randomized user agents and sending requests from 100 different IP's spread across different IP ranges and even different datacenters, they're gonna have a hard time detecting you.

Recommended Proxy Provider: My Private Proxy

When I've found myself in the unfortunate place of getting my proxies banned before on certain sites, they have been more the happy to switch them out for new IP's for me.

Disclosure: The above is an affiliate link, I have an account there as I use this proxy service myself.

Extending the Response class

I highly recommend that you extend the requests.model.Response object with some convinience methods of your own. I have wrote a dedicated article on this topic:

http://jakeaustwick.me/extending-the-requests-response-class/

Hire Me

I highly reccommend you use this article as a resource to build / fix your own scraper. If you don't have time however or would like to hire somebody to scrape/crawl something for you then drop me a line. You can find my details on my contact page.

http://jakeaustwick.me/contact-me/

Common Problems

Installing lxml

You may run into some trouble when trying to install lxml, especially on a linux distro. lxml requires the following packages to be installed before:

apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev

You can then run the pip install lxml command as per usual.

The site uses AJAX, I can't scrape it!

This is simply incorrect. Often if a site uses AJAX, it can make your job even easier. Use the Network tab in Chrome Developer Tools to find the AJAX request, you'll usually be greeted by a response in json. This saves you the effort of having to parse the page, you can just hit the AJAX endpoint directly. If this isn't the case then I suggest using an actual browser for scraping. Check out the "The site is showing different content to my scraper" section for suggestions on this.

The site is showing different content to my scraper

Check the page source in your browser and make sure the information you want to scrape is actually there. It could be that the content was injected into the page via javascript and our scraper doesn't run javascript. If this is the case, your only real option is to run something like PhantomJS. You can control Phantom from Python though, just use something like selenium or splinter. Sites often serve different content to different user agents. You should check the user agent that you are sending in your browser and set your scraper to have the same. It's possible that the site is serving different content to different locations using geotargeting. If you're running your script from a server, then this could be affecting you.

The items I need to scrape are paginated

Sometimes you need to scrape items that aren't all available on one page. Luckily it's usually an easy fix. I suggest checking the URL when changing between the pages, you'll ussually find there is a page or offset parameter changing. This is easy to scrape with something like the following:

base_url = "http://some-url.com/something/?page=%s" for url in [base_url % i for i in xrange(10)]: r = requests.get(url)

I'm getting 403 errors / rate limited

The simple solution to this is don't hit the site so damn hard. Be conservative with the amount of concurrent requests you use. Do you really need the data within the hour, can you really not wait 24 hours for it?

If you really need to keep sending requests to the site, then you're going to have to invest in some proxies. You can then simply create a simple proxy manager class that makes sure you only use the same proxy once every X interval so that it doesn't get blocked.

Check out the Concurrency and Avoiding Detection sections above.

Code Examples

The following examples aren't really of any use on their own. They're simply here as a reference for you.

Simple requests example

I'm not really going to show much in this example, you can find out all you need over at the requests documentation.

import requests response = requests.get('http://jakeaustwick.me') # Response print response.status_code # Response Code print response.headers # Response Headers print response.content # Response Body Content # Request print response.request.headers # Headers you sent with the request

Request with proxy

proxy = {'http' : 'http://102.32.3.1:8080', 'https': 'http://102.32.3.1:4444'} response = requests.get('http://jakeaustwick.me', proxies=proxy)

Parsing the responses body

It's easy to parse the html returned with lxml. Once we have it parsed into a tree, we can call xPaths on it to grab the data we are interested in.

import requests from lxml import html response = requests.get('http://jakeaustwick.me') # Parse the body into a tree parsed_body = html.fromstring(response.text) # Perform xpaths on the tree print parsed_body.xpath('//title/text()') # Get page title print parsed_body.xpath('//a/@href') # Get href attribute of all links

Download all images on a page

The following script will download all the images, and save them in downloaded_images/ . Make sure that directory exists before running it.

import requests from lxml import html import sys import urlparse response = requests.get('http://imgur.com/') parsed_body = html.fromstring(response.text) # Grab links to all images images = parsed_body.xpath('//img/@src') if not images: sys.exit("Found No Images") # Convert any relative urls to absolute urls images = [urlparse.urljoin(response.url, url) for url in images] print 'Found %s images' % len(images) # Only download first 10 for url in images[0:10]: r = requests.get(url) f = open('downloaded_images/%s' % url.split('/')[-1], 'w') f.write(r.content) f.close()

Single Threaded Crawler (basic)

This crawler just shows the basic principle behind crawling. It uses a dequeue so you can pop from the left of the queue and therefore fetch URLs in the order you find them. It doesn't do anything useful with the pages, just printing out the page title.