Ruby Screen-Scraper in 60 Seconds

I often find myself trying to automate content extraction from a saved HTML file or a remote server. I’ve tried a number of approaches over the years, but the dynamic duo of Hpricot and Firebug blew me away: this is by far the fastest way to get what you want without compromising flexibility. Hpricot is an extremely powerful Ruby-based HTML parser, and Firebug is arguably the best on-the-fly development add-on for Firefox. Now, I said it will take you about 60 seconds. I lied; it should take less. Let’s get right to it.

Introducing open-uri

Ruby comes with a very flexible, production-ready library that wraps all HTTP/HTTPS connections into a single method call: open. Among other things, open-uri will gracefully handle HTTP redirects, allow you to specify custom headers, and even work with FTP addresses. In other words, all the dirty work is already done, but you should still check the RDoc. I’ll let the code speak for itself:

    require 'rubygems'
    require 'open-uri'

    @url = "http://www.igvita.com"
    @response = ''

    # open-uri RDoc: http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
    open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
         "From" => "email@addr.com",
         "Referer" => "http://www.igvita.com/") { |f|
      puts "Fetched document: #{f.base_uri}"
      puts "\tContent Type: #{f.content_type}\n"
      puts "\tCharset: #{f.charset}\n"
      puts "\tContent-Encoding: #{f.content_encoding}\n"
      puts "\tLast Modified: #{f.last_modified}\n\n"

      # Save the response body
      @response = f.read
    }

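The snippet above assumes the request succeeds. As a minimal sketch of what handling a failure might look like, here is a small wrapper; the `fetch_page` name and its nil-on-failure behavior are my own assumptions for illustration, not part of open-uri. It uses `URI.open`, the explicit spelling available since Ruby 2.5 (on Ruby 3.0 and later the bare `open` no longer accepts URLs).

```ruby
require 'open-uri'
require 'timeout'

# Hypothetical helper, not part of open-uri: returns the response body as a
# String, or nil when the request fails.
def fetch_page(url, headers = {})
  # Headers are passed as a plain hash, exactly as in the example above.
  URI.open(url, headers) { |f| f.read }
rescue OpenURI::HTTPError => e
  # Raised by open-uri for HTTP error statuses such as 404 or 500.
  warn "HTTP error for #{url}: #{e.message}"
  nil
rescue SocketError, SystemCallError, Timeout::Error => e
  # DNS failures, refused connections, and timeouts.
  warn "Could not reach #{url}: #{e.class}"
  nil
end

# Usage (left commented out so the sketch makes no request on load):
# body = fetch_page("http://www.igvita.com", "User-Agent" => "Ruby/#{RUBY_VERSION}")
# puts body.length if body
```

Returning nil keeps the calling code short, at the cost of hiding why the fetch failed; re-raising may suit a longer-running scraper better.
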
Firebug kung-foo

Now that we have the document, we need to pull out the content that interests us. Usually this is the tedious part, built on regular expressions, stream parsers, and the like. Instead, we’re going to sidestep all of these issues and let Firebug do its magic. First, install the extension; then, while on this page, click in the bottom-right corner of your browser to bring it up. It should ask you if you want to enable Firebug (hint: say yes). You should now be greeted with the following screen:

For the sake of an example, assume that we want to extract three things out of this very page: some quoted text (sample below), the number of comments, and the list of my latest posts found at the bottom of this page. Here is an example of quoted text:

So which came first, the parser, which will extract this, or this quote? - Extract me!

In your Firebug window, click “Inspect” and hover your mouse over the quote. You will notice that Firebug navigates to the matching part of the DOM tree (HTML source code) as you do this. When you put your mouse over the quote, you should see the following:

Here’s the trick: right-click the selected blockquote element in your Firebug window and select Copy XPath. This gives you the exact drill-down path through the DOM tree. In our case, your clipboard should contain: /html/body/div[2]/div/div/blockquote.
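
To see what a copied path like this actually selects, here is a small sketch that runs it against a hand-written, much simplified stand-in for this page. It uses REXML, which ships with Ruby, rather than Hpricot, purely so that it runs without extra gems; the markup, the `id` attributes, and the `SAMPLE_HTML` name are all assumptions for illustration. REXML only accepts well-formed markup, so this is not a substitute for a forgiving HTML parser.

```ruby
require 'rexml/document'

# Hypothetical, simplified stand-in for the page (well-formed so REXML can parse it).
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <div id="header">Header</div>
      <div id="content">
        <div>
          <div>
            <blockquote><p>Extract me!</p></blockquote>
          </div>
        </div>
      </div>
    </body>
  </html>
HTML

sample = REXML::Document.new(SAMPLE_HTML)

# div[2] is the second <div> child of <body>: XPath positions start at 1, not 0.
second_div = REXML::XPath.first(sample, "/html/body/div[2]")
puts second_div.attributes["id"]   # prints "content"

# Each further step walks one level down the tree until it reaches the quote.
quote = REXML::XPath.first(sample, "/html/body/div[2]/div/div/blockquote")
puts quote.elements["p"].text      # prints "Extract me!"
```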

Hpricot magic

It is at this point that Hpricot comes into the picture, and you have probably guessed it already: it supports XPath. All we need to do is pass our HTML to it to build the internal tree, and then we’re ready to go:

    # Rdoc: http://code.whytheluckystiff.net/hpricot/
    require 'hpricot'

    doc = Hpricot(@response)

    # Retrieve number of comments
    # - Hover your mouse over the 'X Comments' heading at the end of this article
    # - Copy the XPath and confirm that it's the same as shown below
    puts((doc/"/html/body/div[3]/div/div/h2").inner_html)

    # Pull out first quote (<blockquote> .... </blockquote>)
    # - Note that we don't have to use the full XPath, we can simply search for all quotes
    # - Because this function can return more than one element, we will only look at 'first'
    puts((doc/"blockquote/p").first.inner_html)

    # Pull out all other posted stories and date posted
    # - This search function will return multiple elements
    # - We are going to print the date, and then print the article name beside it
    (doc/"/html/body/div[4]/div/div[2]/ul/li/a/span").each do |article|
      puts "#{article.inner_html} :: #{article.next_node.to_s}"
    end

As you can see, I provided a few other examples, but the idea is simple: open Firebug, navigate to the component you want to extract, copy the XPath, paste it right into Hpricot’s search function, and print out the results. How simple is that? I should also mention that Hpricot is not limited to XPath, nor did my examples cover all of its functionality. I strongly encourage you to check the official Hpricot page for more tips and tricks.
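
One of the code comments above notes that you don't have to use the full XPath. A rough sketch of why the shorter search can be the safer choice: an absolute path depends on every ancestor staying in place, so a small layout change can break it. The example again uses REXML as a stand-in for Hpricot so it runs without extra gems; both markup strings and the `first_text` helper are hypothetical.

```ruby
require 'rexml/document'

ABSOLUTE = "/html/body/div[2]/div/div/blockquote/p"  # full path, as copied from Firebug
ANYWHERE = "//blockquote/p"                          # any blockquote paragraph in the document

# Hypothetical helper: text of the first node matching xpath, or nil if none match.
def first_text(html, xpath)
  node = REXML::XPath.first(REXML::Document.new(html), xpath)
  node&.text
end

# Simplified stand-in for the page as it is today.
ORIGINAL = "<html><body><div/><div><div><div>" \
           "<blockquote><p>Extract me!</p></blockquote>" \
           "</div></div></div></body></html>"

# Same content after an imagined redesign adds one more <div> at the top of <body>.
REDESIGNED = "<html><body><div/><div/><div><div><div>" \
             "<blockquote><p>Extract me!</p></blockquote>" \
             "</div></div></div></body></html>"

puts first_text(ORIGINAL, ABSOLUTE).inspect    # prints "Extract me!"
puts first_text(REDESIGNED, ABSOLUTE).inspect  # prints nil: div[2] is now the wrong div
puts first_text(REDESIGNED, ANYWHERE).inspect  # prints "Extract me!"
```

The trade-off is precision: the short search matches every quote on the page, which is why the Hpricot example takes only the first result.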

Running our screen-scraper produces:

    Fetched document: http://www.igvita.com/
        Content Type: text/html
        Charset: utf-8
        Content-Encoding:
        Last Modified:

    No Comments
    So which came first, the parser, which will extract this, or this quote? - Extract me!
    04.02 :: Ruby Screen-Scraper in 60 Seconds
    31.01 :: World News With Geographic Heatmaps
    27.01 :: Correlating Netflix and IMDB Datasets
    ...

Copy, paste, done. Now you have no excuse to put off that custom RSS generator you always wanted.

Regin Gaarsmand and Harish Mallipeddi posted PHP and Python equivalents of this method. Awesome!