pyburrow - low-level web crawling library (Python 3)

pyburrow is a Python 3 library for crawling websites: capturing, archiving and processing their resources. It differs from other known crawlers in being very low level: the HTTP response body is stored as raw, undecoded bytes, and the HTTP response headers and status code are stored alongside it. The resulting rich data set is well suited to data mining and analysis.

The crawling algorithm is highly configurable. Crawling begins from one or more supplied root URLs; additional URLs are identified in each response body and added to a priority queue for a breadth-first search. You can specify limits on URL count or (soon) wall-clock time, URL patterns to accept or reject, and whether or not to honor robots.txt (with or without overrides).
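The traversal described above can be sketched roughly as follows. This is an illustration only, not pyburrow's implementation: the names crawl, fetch, max_urls, accept and reject are assumptions made for the example, and links are found with a deliberately naive regular expression.

```python
import heapq
import re
from urllib.parse import urljoin

# Naive link extractor, good enough for a sketch.
LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(roots, fetch, max_urls=100, accept=None, reject=None):
    """Breadth-first crawl starting from roots.

    fetch(url) is a hypothetical callable returning the body as text.
    Returns the URLs in the order they were visited.
    """
    accept = [re.compile(p) for p in (accept or [])]
    reject = [re.compile(p) for p in (reject or [])]

    def allowed(url):
        if any(p.search(url) for p in reject):
            return False
        return not accept or any(p.search(url) for p in accept)

    queue = []    # heap of (depth, sequence, url); lowest depth pops first
    seen = set()
    counter = 0   # tie-breaker that preserves discovery order
    for root in roots:
        if root not in seen:
            seen.add(root)
            heapq.heappush(queue, (0, counter, root))
            counter += 1

    visited = []
    while queue and len(visited) < max_urls:
        depth, _, url = heapq.heappop(queue)
        body = fetch(url)
        visited.append(url)
        for href in LINK_RE.findall(body):
            link = urljoin(url, href)
            if link not in seen and allowed(link):
                seen.add(link)
                heapq.heappush(queue, (depth + 1, counter, link))
                counter += 1
    return visited
```

Keying the heap on depth is what makes the priority queue behave as a breadth-first search; a different key (for example, a per-URL score) would give a different crawl order with the same loop.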

A pyburrow session is fully resumable. An existing log set can be used as a starting point, to continue where an earlier session left off; interrupted runs can thus be easily resumed, without the need to re-read any URLs.
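As a rough illustration of how a resumable run can work, the sketch below checkpoints its state to a pickle file after every URL, so a second call picks up where the first stopped. The functions save_state, load_state and run, and the layout of the state dictionary, are hypothetical and are not pyburrow's log format.

```python
import os
import pickle
import tempfile

def save_state(path, state):
    """Write state to a temporary file, then rename it into place, so an
    interrupted save cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, 'wb') as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_state(path, roots):
    """Load an earlier checkpoint if one exists, else start from roots."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    return {'pending': list(roots), 'done': {}}

def run(path, roots, fetch, budget):
    """Fetch at most `budget` URLs, checkpointing after each one.

    fetch(url) is a hypothetical callable returning the response body.
    """
    state = load_state(path, roots)
    while state['pending'] and budget > 0:
        url = state['pending'].pop(0)
        if url not in state['done']:
            state['done'][url] = fetch(url)
            budget -= 1
        save_state(path, state)
    return state
```

Calling run() twice with the same path fetches each URL only once: the second call finds the checkpoint and skips everything already recorded under 'done'.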

pyburrow is implemented in Python 3.2. It has not been tested in older versions.

Example usage

    from pyburrow import Crawler, PickleStore
    base_url = 'http://example.com/baz'
    datastore = PickleStore('/home/bobby/crawl-example.dat', mode='rw')
    crawler = Crawler(datastore, base_url)
    crawler.crawl()

Serialization

Currently pickle is supported as a serialization format. A near-term priority is an SQLite based serialization alternative, which will much improve performance when crawling sites with many pages.
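To give a feel for what an SQLite-backed store might look like, here is a minimal sketch using the standard sqlite3 module. The class name SQLiteStore, its put/get methods and the table layout are assumptions made for illustration; they are not part of pyburrow.

```python
import json
import sqlite3

class SQLiteStore:
    """Hypothetical store keeping one row per URL."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS responses ('
            'url TEXT PRIMARY KEY, status INTEGER, headers TEXT, body BLOB)')

    def put(self, url, status, headers, body):
        """Store one response; headers is a list of [name, value] pairs
        and body is the raw bytes of the response."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                'INSERT OR REPLACE INTO responses VALUES (?, ?, ?, ?)',
                (url, status, json.dumps(headers), sqlite3.Binary(body)))

    def get(self, url):
        """Return (status, headers, body) or None if the URL is unknown."""
        row = self.conn.execute(
            'SELECT status, headers, body FROM responses WHERE url = ?',
            (url,)).fetchone()
        if row is None:
            return None
        status, headers, body = row
        return status, json.loads(headers), bytes(body)
```

Unlike a single pickle file, which is typically rewritten in full on each save, a table like this lets each new response be written as one row.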

Tests

Run tests with nose. Note that this must be a Python 3 installation of nose. In a properly configured environment, you can just type "nosetests" on the command line to run all tests.

Some of the end-to-end tests start a real test webserver on port 8000, via the testutils.TestServer class. The server shuts down automatically when the test completes, and the shutdown should be extremely robust.
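For readers curious how a throwaway test server with automatic shutdown can be built from the standard library, here is a sketch. It is not the testutils.TestServer class; LocalServer and HelloHandler are hypothetical names, and the fixed response body is made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same small HTML page."""

    def do_GET(self):
        body = b'<html><body>hello</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet

class LocalServer:
    """Context manager: serve on a background thread, shut down on exit."""

    def __init__(self, port=8000, handler=HelloHandler):
        self.httpd = HTTPServer(('127.0.0.1', port), handler)
        self.port = self.httpd.server_address[1]  # actual port if 0 was given
        self.thread = threading.Thread(target=self.httpd.serve_forever)
        self.thread.daemon = True

    def __enter__(self):
        self.thread.start()
        return self

    def __exit__(self, *exc):
        # Runs even when the test body raises, so the port is always freed.
        self.httpd.shutdown()
        self.httpd.server_close()
        self.thread.join()
```

Used as "with LocalServer() as server: ...", the server is stopped and its socket closed whether the test passes or fails. Passing port=0 asks the operating system for any free port, which avoids clashes when port 8000 is already in use.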

License

Licensed under GPL version 3. Development sponsored by Mobile Web Up.

Copyright 2011-2012 Mobile Web Up. All rights reserved.

Want to get paid to solve fascinating mobile web problems while writing Python 3 code? Visit http://mobilewebup.com/jobs/.

Future

Version 0.9 is stable and useful. Version 1.0 will be able to serialize to SQLite as an alternative to pickle. This will greatly improve performance when crawling larger sites with many URLs; the current pickle-based serialization quickly runs into disk I/O bottlenecks, even on solid-state drives. Tools for easily transforming between pickle-based and SQLite-based data sets will be included.

Other planned future additions:

In the event of redirection, e.g. a 301 or 302 response, the entire chain is logged as it is followed to a final response.
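Recording a redirect chain could look something like the sketch below. It assumes a hypothetical fetch(url) callable that returns (status, headers) without following redirects itself; the function name follow and the max_hops limit are likewise illustrative and not pyburrow's API.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def follow(url, fetch, max_hops=10):
    """Return the list of (url, status) hops, ending at the final response.

    fetch(url) is a hypothetical callable returning (status, headers),
    where headers is a dict and redirects are NOT followed automatically.
    """
    chain = []
    visited = set()
    while len(chain) < max_hops:
        status, headers = fetch(url)
        chain.append((url, status))
        visited.add(url)
        location = headers.get('Location')
        if status not in REDIRECT_CODES or not location:
            break                      # final response reached
        url = urljoin(url, location)   # Location may be relative
        if url in visited:
            break                      # redirect loop; stop here
    return chain
```

Keeping every hop, rather than only the final response, preserves the status code of each intermediate step for later analysis.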

There is optional, experimental support for detecting certain infinite crawling rabbit-holes in real-world websites, like infinite pagination (where, say, http://example.com/foo/?page=1000 will have a "Next Page" link pointing to http://example.com/foo/?page=1001, and so on up to INT_MAX, even when only the first three pages have real content).
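One simple heuristic for this kind of rabbit-hole is to cap how many URLs may differ only in a numeric query value. The sketch below shows that idea; it is an assumed approach for illustration, not necessarily how pyburrow detects pagination loops, and url_template and PaginationGuard are hypothetical names.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def url_template(url):
    """Replace purely numeric query values with the placeholder 'N', so
    that ?page=1000 and ?page=1001 map to the same template."""
    parts = urlsplit(url)
    query = [(key, 'N' if value.isdigit() else value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), ''))

class PaginationGuard:
    """Allow at most `limit` URLs per template."""

    def __init__(self, limit=50):
        self.limit = limit
        self.counts = Counter()

    def allow(self, url):
        template = url_template(url)
        self.counts[template] += 1
        return self.counts[template] <= self.limit
```

A crawler would call guard.allow(url) before queueing a URL and skip it when the answer is False. A fixed cap is blunt: it also cuts off sites that really do have many numbered pages, so the limit would need to be configurable.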

Failed requests are logged, and can be retried at a later time. You can change the configuration of resumed sessions to crawl URLs excluded from the previous run, adding to the data set.