When building crawlers, most of the effort is expended in guiding them through a website. For example, if we want to crawl all pages and individual posts on this blog, we extract links like so:

1. Visit the current webpage
2. Extract pagination links
3. Extract the link to each blog post
4. Enqueue the extracted links
5. Continue
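If we wrote this loop by hand, it would look roughly like the sketch below (`naive-crawl` and `extract-links` are illustrative names for this post, not part of pegasus):

```clojure
(defn naive-crawl
  "Breadth-first crawl starting at seed. extract-links is a function
   from a page body to the links we want to follow."
  [seed extract-links]
  (loop [queue [seed]
         seen  #{seed}]
    (when-let [url (first queue)]
      (let [body  (slurp url)                         ;; visit the current page
            links (remove seen (extract-links body))] ;; pagination + post links
        (recur (into (vec (rest queue)) links)        ;; enqueue what we found
               (into seen links))))))                 ;; and continue
```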

In this blog post, I present a new DSL that allows you to concisely describe this process.

This DSL is now part of this crawler: https://github.com/shriphani/pegasus

Enlive provides an idiomatic way to select elements, and it forms the foundation of most of the work here. Let us implement the crawler described above.
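As a point of reference, here is what selecting the post links with raw Enlive looks like (a sketch using the net.cgrand.enlive-html namespace; `post-links` is just an illustrative helper):

```clojure
(require '[net.cgrand.enlive-html :as html])

(defn post-links
  "Hrefs of the post links in each article header on this blog."
  [page-body]
  (map #(get-in % [:attrs :href])
       (html/select (html/html-resource (java.io.StringReader. page-body))
                    [:article :header :h2 :a])))
```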

Since the focus (in this blog post at least) is on extracting links, let us look at the DSL:

```clojure
(defextractors
 (extract :at-selector [:article :header :h2 :a]
          :follow :href
          :with-regex #"blog.shriphani.com")

 (extract :at-selector [:ul.pagination :a]
          :follow :href
          :with-regex #"blog.shriphani.com"))
```

And that is it! We specified an Enlive selector to pull tags, the attribute whose value to follow, and a regex to filter the extracted URLs.
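Conceptually, each extract clause boils down to something like this (a sketch of the behavior, not pegasus's actual implementation):

```clojure
(require '[net.cgrand.enlive-html :as html])

(defn run-extractor
  "Select nodes with the Enlive selector, follow the given attribute,
   and keep only the URLs matching the regex."
  [page-body {:keys [at-selector follow with-regex]}]
  (->> (html/select (html/html-resource (java.io.StringReader. page-body))
                    at-selector)
       (map #(get-in % [:attrs follow]))
       (filter #(and % (re-find with-regex %)))))
```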

With pegasus, the full crawler is expressed as:

```clojure
(defn crawl-sp-blog-custom-extractor
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor (defextractors
                       (extract :at-selector [:article :header :h2 :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com")

                       (extract :at-selector [:ul.pagination :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com"))
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))
```

Essentially, under 20 lines.
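Kicking off the crawl is then a single call; pegasus stops after the 20 documents configured above and writes its output under the `:job-dir`:

```clojure
;; Run from a REPL; crawl data ends up in /tmp/sp-blog-corpus.
(crawl-sp-blog-custom-extractor)
```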

The DSL was partially inspired by the great work done in this crawler for Node.js: https://github.com/rchipka/node-osmosis

Links:

Pegasus: https://github.com/shriphani/pegasus