Lately I’ve been using the nifty Web::Scraper by the prolific Tatsuhiko Miyagawa. It exposes a compact DSL of three keywords, `process`, `process_first`, and `result`, to scrape sites based on XPath expressions or CSS selectors:



use Data::Dumper;
use Web::Scraper;
use URI;

# Find the <title> element and place its content
# in a hash reference with the key 'title'
my $scraper = scraper {
    process '//title', title => 'TEXT';
};

my $data = $scraper->scrape(
    URI->new('http://www.google.com')
);

warn Dumper $data;



This gets you:



$VAR1 = {
          'title' => 'Google'
        };



You can also have it return an arrayref for an element, pass in callbacks, or nest scrapers.
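For instance, here is a sketch of all three techniques at once, assuming a hypothetical page that lists links inside `li.item` elements: appending `[]` to a key collects every match into an arrayref, a `sub` callback receives the matched HTML::Element, and a nested `scraper` block extracts structured data from each match.

```perl
use Web::Scraper;
use URI;

my $list_scraper = scraper {
    # 'entries[]' — the trailing [] gathers all matches into an arrayref
    process 'li.item', 'entries[]' => scraper {
        # nested scraper: pull out the link URL and its text
        process 'a', url => '@href', name => 'TEXT';
    };
    # callback: the matched element is passed in (and aliased to $_)
    process_first '//title', title => sub { uc $_->as_text };
};

# my $data = $list_scraper->scrape(URI->new('http://example.com/'));
# $data->{entries} is an arrayref of { url => ..., name => ... } hashrefs
```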

I forked the code and added a hashref option, which is handy when paired with a callback that returns a hash.

scraper (a disguised constructor) and the keywords are exported into the caller’s namespace. If the potential for collision concerns you, wrap scraper instantiation in a separate module.
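One way to do that wrapping, sketched with a made-up module name: only this package imports Web::Scraper, so `process` and friends never land in the caller’s namespace.

```perl
package My::TitleScraper;
use strict;
use warnings;
use Web::Scraper;   # DSL keywords are exported here, not into callers
use URI;

# Build the scraper once, at load time
my $scraper = scraper {
    process '//title', title => 'TEXT';
};

# Class method: callers see only this, no DSL keywords
sub title_of {
    my ($class, $url) = @_;
    my $data = $scraper->scrape(URI->new($url));
    return $data->{title};
}

1;
```

A caller then just writes `My::TitleScraper->title_of('http://www.google.com')` with no risk of `process` colliding with its own subs.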

The documentation is a bit thin, so check out Miyagawa’s presentation slides as well.
