Web scraping is the process of collecting data from websites. There is a lot of data publicly accessible on the Internet, and sometimes you want to do something with that data programmatically. Instead of manually copying and pasting things from websites into some sort of spreadsheet, you might as well write a script that does it for you. The process depends on how the data is embedded into the target website. In the usual case you have to perform the following:

1. Use a web client to download the contents of a webpage.
2. Convert those contents (a string) into some sort of internal representation in your programming language (usually a tree of HTML elements).
3. Find and extract the data you need.

Common Lisp has the necessary libraries for each of those steps, but no single library that puts everything together, which is why I threw together Webgunk, a mish-mash of various libraries and helper functions to make web scraping easier. It doesn’t have a stable API or anything, and I’m still adding new things to it.

Web client

Edi Weitz’s DRAKMA is the most popular HTTP client for Common Lisp, and it works fine in practice. One small problem I had with it is that its http-request function sometimes returns a string and sometimes an octet array. Since I much prefer strings, I wrote a wrapper around it using FLEXI-STREAMS:

(defun http-request (uri &rest args)
  "A wrapper around DRAKMA:HTTP-REQUEST which converts the octet
array it sometimes returns into a normal string."
  (let* ((result-mv (multiple-value-list
                     (apply #'drakma:http-request uri
                            `(,@args :cookie-jar ,*webgunk-cookie-jar*))))
         (result (car result-mv)))
    (apply #'values
           (if (and (arrayp result)
                    (equal (array-element-type result) '(unsigned-byte 8)))
               (flexi-streams:octets-to-string result)
               result)
           (cdr result-mv))))

You might notice it preserves all the secondary return values as well.
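To illustrate, here is a hypothetical usage sketch (it assumes the wrapper above has been loaded and that *webgunk-cookie-jar* is bound to a DRAKMA:COOKIE-JAR). A call looks just like a plain DRAKMA:HTTP-REQUEST and returns the same multiple values (body, status code, headers, and so on), but the body is guaranteed to be a string:

```lisp
;; Hypothetical usage sketch: bind the first two return values.
;; The body is always a string here, never an octet array.
(multiple-value-bind (body status-code)
    (http-request "http://www.google.com/")
  (when (= status-code 200)
    (format t "Got ~D characters of HTML~%" (length body))))
```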

Parsing HTML

I’m using the Closure HTML parser (not to be confused with a thousand other things called Clo[s/j/z]ure) to convert the resulting string into a Lispy representation of the DOM tree (CXML-DOM).

(defun parse-url (url &rest args)
  "Parse HTTP request response to CXML-DOM"
  (let ((response (apply #'http-request url args)))
    (chtml:parse response (cxml-dom:make-dom-builder))))

Finding your data

It is possible to use the standardized DOM API to find the required elements in the resulting tree (and it’s worth knowing it), but really, most of the time you just want to use a CSS selector to grab the elements you need. This is where the CSS-SELECTORS library comes in handy.

(let ((document (parse-url "http://www.google.com/search?q=something")))
  (css:query "h3.r a" document))

returns a list of links from a Google search.
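For comparison, here is a sketch of doing the same thing with the plain DOM API mentioned above: grab every anchor element with DOM:GET-ELEMENTS-BY-TAG-NAME (part of the CXML-DOM API) and filter by hand. The exact parent/class checks are my assumptions about the markup, and it is clearly more work than the one-line selector:

```lisp
;; DOM-API sketch: collect <a> elements whose parent is <h3 class="r">.
(let ((document (parse-url "http://www.google.com/search?q=something"))
      (links '()))
  (dom:do-node-list (link (dom:get-elements-by-tag-name document "a"))
    (let ((parent (dom:parent-node link)))
      (when (and (eql (dom:node-type parent) :element)
                 (string-equal (dom:tag-name parent) "h3")
                 (string= (dom:get-attribute parent "class") "r"))
        (push link links))))
  (nreverse links))
```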

Extracting your data

Getting the text value of an HTML element isn’t as easy as you might think. Because an element can contain other elements, you must recursively walk its children and join all the text nodes. There is also a bunch of rules regarding whitespace, which must be stripped correctly from the resulting string.

This is what the function node-text in Webgunk does:

(defun node-text (node &rest args &key test (strip-whitespace t))
  (let (values result)
    (when (or (not test) (funcall test node))
      (dom:do-node-list (node (dom:child-nodes node))
        (let ((val (case (dom:node-type node)
                     (:element (apply #'node-text node args))
                     (:text (dom:node-value node)))))
          (push val values))))
    (setf result (apply #'concatenate 'string (nreverse values)))
    (if strip-whitespace
        (strip-whitespace result)
        result)))

It calls strip-whitespace which is just a bunch of regex replacements (see full source code here).
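For reference, a minimal strip-whitespace might look like the sketch below, using CL-PPCRE. This is my own guess at the behavior (collapse runs of whitespace, then trim the ends); the actual Webgunk implementation may perform more replacements:

```lisp
;; A minimal sketch, NOT the actual Webgunk implementation:
;; collapse any run of whitespace into a single space, then
;; trim leading and trailing whitespace.
(defun strip-whitespace (string)
  (string-trim '(#\Space #\Tab #\Newline #\Return)
               (cl-ppcre:regex-replace-all "\\s+" string " ")))
```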

Another place where the data can be hidden is HTML attributes. Fortunately, dom:get-attribute pretty much solves this problem. For example, (dom:get-attribute link "href") returns the href attribute of a node.
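Putting all the pieces together, scraping the titles and URLs out of that Google search takes just a few lines (assuming the functions defined above are loaded):

```lisp
;; End-to-end sketch: download, parse, select, extract.
;; Returns an alist of (title . url) pairs for each result link.
(let ((document (parse-url "http://www.google.com/search?q=something")))
  (loop for link in (css:query "h3.r a" document)
        collect (cons (node-text link)
                      (dom:get-attribute link "href"))))
```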

That’s it for today. In the next installment I’ll probably discuss authentication and other fun stuff you can do.