Easy Web Spidering in Ruby with Anemone

By Ric Roberts

Anemone is a free, multi-threaded Ruby web spider framework from Chris Kite, useful for collecting information about websites. With Anemone you can write tasks that generate interesting statistics on a site just by giving it a URL.

Its only dependency is Nokogiri (an HTML and XML parser). Other than that, you just need to install the gem to get started. Anemone's simple syntax lets you, among other things, tell it which pages to visit (based on regular expressions) and define callbacks to run against each page.
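For instance, here's a sketch of the regex-based filtering, using Anemone's skip_links_like and on_pages_like methods (the site URL and patterns below are made-up examples, not part of Anemone):

```ruby
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Never follow links into the (hypothetical) admin area,
  # or links to binary downloads
  anemone.skip_links_like %r{/admin/}, /\.(pdf|zip)$/i

  # Only run this callback on pages whose URL matches the pattern
  anemone.on_pages_like(%r{/blog/}) do |page|
    puts "Blog post: #{page.url}"
  end
end
```

Both methods accept one or more regular expressions, so you can whittle a crawl down to just the parts of a site you care about.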

This example, taken from Anemone's homepage, prints the URL of every page on a site:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

The bin folder in the project contains some more in-depth examples, including tasks for counting the number of unique pages on a site, counting the pages at a certain depth, and listing the URLs encountered. There's also a combined task which wraps up a few of these, intended to be run as a daily cron job.
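In the same spirit as those tasks, a minimal sketch of post-crawl reporting might use Anemone's after_crawl callback, which hands you the collected pages once the crawl finishes (the URL and the depth of 2 here are arbitrary examples):

```ruby
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.after_crawl do |pages|
    # Total number of pages collected during the crawl
    puts "Total pages: #{pages.size}"

    # List the URLs found at an example depth of 2 links from the root
    pages.each do |url, page|
      puts url if page.depth == 2
    end
  end
end
```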

You can install Anemone as a gem or, of course, get the source from GitHub, and there's some fairly comprehensive RDoc documentation available in the source or online.