tl;dr

Grab the gist with the complete, working source code.

I often hear the question: "so, you're a Perl guy, could you show me how to make a web crawler/spider/scraper, then?" I hope this post series will become my ultimate answer :)

First of all, I compiled a small list of features that people expect of crawlers nowadays:

- capability of concurrent, persistent connections;
- usage of CSS selectors to process HTML;
- easily modifiable source instead of a flexible OOP inheritance structure;
- FEWER DEPENDENCIES!

Well, sorry WWW::Mechanize, it's not your turn. Instead, the first example will be based on Mojo::UserAgent from the Mojolicious framework. Being event-driven and low on dependencies (none except Mojolicious itself), it is especially attractive for Perl newcomers with some jQuery literacy.
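To get a feel for the jQuery-like side of Mojolicious before diving into the crawler, here is a tiny standalone sketch of Mojo::DOM with CSS selectors (the HTML snippet is made up purely for illustration):

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical HTML snippet, just to show the selector syntax
my $dom = Mojo::DOM->new(
    '<div id="posts"><a class="title" href="/a">First</a><a class="title" href="/b">Second</a></div>'
);

# CSS selectors, jQuery-style: at() returns the first match
say $dom->at('#posts a.title')->text;

# find() returns a collection; each() iterates over all matches
say $_->{href} for $dom->find('a.title')->each;
```

The same `at()`/`find()` calls work on whole downloaded pages via `$tx->res->dom`, which is exactly what the crawler below does.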

Boilerplate

Let's call our project mojo-crawler.pl. Here's how it begins:

#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
    http://sysd.org/page/4/
    http://sysd.org/page/5/
    http://sysd.org/page/6/
);

# Limit parallel connections to 4
my $max_conn = 4;

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;

# Keep track of active connections
my $active = 0;

Note that I'm using my very own server as a guinea pig. Consider that my oath on the safety of the parallelization method I chose.

Event loop

Here we keep a constant-sized pool of active connections, populating it with URLs from our FIFO queue. The anonymous sub {} fires every time the event loop is idle (0-second timer):

Mojo::IOLoop->recurring(
    0 => sub {
        for ($active + 1 .. $max_conn) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

Now, start the event loop unless it is already started somewhere else. In this code, it won't be started anywhere else. But who knows how deep the Copy&Paste will bury it in the future?!

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

Download callback

Every completed download ends here. Even the failed ones. Thus, when the download is complete, we decrease the $active counter to free a connection slot:

sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;

    # Parse only OK HTML responses
    return
        if not $tx->res->is_status_class(200)
        or $tx->res->headers->content_type !~ m{^text/html\b}ix;

    # Request URL
    my $url = $tx->req->url;

    say $url;
    parse_html($url, $tx);

    return;
}

# Not implemented yet!
sub parse_html { return }

Fear not, parse_html() is implemented right below!

Take one

Let's make sure this code actually does what it is supposed to, while it is still low on the line count:

$ perl mojo-crawler.pl
http://sysd.org/page/2/
http://sysd.org/page/4/
http://sysd.org/page/3/
http://sysd.org/page/5/
http://sysd.org/
http://sysd.org/page/6/
$

Fine, the downloads completed in order of how long each resource took to fetch. Oh, and http://sysd.org/page/1/ simply redirects to http://sysd.org/.

Adding some depth

The most difficult part of making web crawlers isn't making them start; it's making them stop. Our complete parse_html() also takes care of feeding the URL queue with the URLs extracted from <a href="..."> links. Plus, it makes a trivial verification on:

- Protocols (only HTTP and HTTPS);
- Path depth (don't go deeper than /a/b/c);
- URL revisiting (don't download the same resource over and over);
- Cross-domain links (not allowed).
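Those checks all build on Mojo::URL accessors. Here is a quick isolated sketch of the normalization step (the relative link here is invented for illustration):

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use Mojo::URL;

# Pretend we found this link on http://sysd.org/page/2/
my $base = Mojo::URL->new('http://sysd.org/page/2/');

# Resolve it against the page it was found on
# and strip the #fragment, as the crawler does
my $link = Mojo::URL->new('/tag/perl/#comments')
    ->to_abs($base)
    ->fragment(undef);

say $link;                         # absolute URL, no fragment
say $link->protocol;               # scheme for the protocol whitelist
say $link->host;                   # host for the cross-domain check
say scalar @{$link->path->parts};  # path segment count for the depth check
```

The `path->parts` count is what the "don't go deeper than /a/b/c" rule compares against 3.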

And, to show we've been there, let's print the title of the page:

sub parse_html {
    my ($url, $tx) = @_;

    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless grep { $link->protocol eq $_ } qw(http https);

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once
        state $uniq = {};
        ++$uniq->{$url->to_string};
        next if ++$uniq->{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }
    say '';

    return;
}

Take two

This time, it will be a lot slower, as every internal link is followed and downloaded. The crawler will print the accessed URL, the title of the page, and the extracted non-visited links:

$ "time" perl mojo-crawler.pl
http://sysd.org/
sysd.org
 -> http://sysd.org/tag/benchmark/
 -> http://sysd.org/tag/command-line-interface/
 -> http://sysd.org/tag/console/
 -> http://sysd.org/tag/overhead/
 -> http://sysd.org/tag/terminal/
 -> http://sysd.org/tag/teste/
 -> http://sysd.org/tag/tty/
 -> http://sysd.org/tag/velocidade/
 -> http://sysd.org/tag/browser/
 -> http://sysd.org/tag/deprecation/
 -> http://sysd.org/tag/ie/
 -> http://sysd.org/tag/microsoft/
 -> http://sysd.org/tag/navegador/
 -> http://sysd.org/tag/webdesign/
 -> http://sysd.org/tag/webdev/
 -> http://sysd.org/tag/api/
 -> http://sysd.org/tag/hack-2/
 -> http://sysd.org/tag/integration/
 -> http://sysd.org/tag/rest/
...
27.73user 0.88system 3:48.46elapsed 12%CPU (0avgtext+0avgdata 98272maxresident)k
0inputs+8outputs (0major+6749minor)pagefaults 0swaps
$

A very important final note: although this tiny crawler operates through the recursive traversal of links, it is implemented in an iterative way. Thus, it is very light on memory consumption. In fact, the only structure that hogs the RAM is the $uniq hashref; tie it to any kind of persistent storage if that concerns you. The FIFO queue @urls could also grow a lot if the crawled site has dynamically-generated link lists (or even broken pagination). So keeping it only in memory, instead of some kind of key/value database, is a bit reckless.
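If a full tied key/value store feels like overkill, one low-tech alternative is snapshotting the visited set with the core Storable module. This is just a sketch; the file name visited.stor and the example URL are arbitrary:

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use Storable qw(retrieve store);

# Arbitrary file name; load the visited-URL set if a previous run left one
my $file = 'visited.stor';
my $uniq = -e $file ? retrieve($file) : {};

# ... crawl, marking URLs in $uniq exactly as parse_html() does ...
++$uniq->{'http://sysd.org/'};

# Snapshot on exit so the next run skips already-seen URLs
END { store $uniq, $file }
```

Unlike tie, this only persists on clean exit, so a crash loses the current run's progress; for crash-safety you would want a real tied hash (DB_File and friends) or an external key/value database, as noted above.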

Conclusion

Despite this being a toy spider, I believe it is good enough to solve 80% of web crawling/scraping problems. The remaining 20% would require much more code, tests and infrastructure (A.K.A. the Pareto principle). Please, don't reinvent the wheel; check out the CommonCrawl project first! And keep checking my Perl blog for more of that 80%-focused web crawling ;)

Continued

Web Scraping with Modern Perl (Part 2 - Speed Edition)