This is the first in a series of posts about Lisp programming. I assume that you (1) have a Common Lisp environment installed, (2) have Quicklisp, and (3) have basic knowledge of the Lisp language. For assistance with the first requirement, look here; for assistance with the second, look here; for help with the third, read the third chapter of Practical Common Lisp and report back.

Interacting with the internet is one of the most useful things a program can do these days, and Common Lisp has some great libraries that make common internet tasks easy to get up and running. In this post, I’ll show how to get data from web pages with drakma, parse HTML with cl-html5-parser, and extract data from the parsed HTML using cl-html5-parser’s DOM functionality. Each of these features is used in my lightweight search engine, cl-breeze.

Let’s get the libraries we’ll be using:

(ql:quickload :drakma)
(ql:quickload :cl-html5-parser)
(ql:quickload :cl-ppcre)
(use-package (list :drakma :html5-parser :cl-ppcre))

Let’s GET the Wikipedia article on AVL trees.

(setf (values *wikihtml* *responsecode* *headers*)
      (http-request "http://en.wikipedia.org/wiki/AVL_tree"))

This stores (1) the actual HTML in *wikihtml*, (2) the response code (200, 404, etc.) in *responsecode*, and (3) the HTTP headers in *headers*. There are more return values we could have captured but didn’t; for full documentation of http-request, refer here.
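If you want the other return values too, multiple-value-bind works just as well. Here’s a minimal sketch; if I recall drakma’s documentation correctly, the remaining values include the final URI, the underlying stream, a must-close flag, and the reason phrase, but check the docs for the authoritative list:

(multiple-value-bind (body status headers uri stream must-close reason)
    (http-request "http://en.wikipedia.org/wiki/AVL_tree")
  (declare (ignore stream must-close))
  ;; headers arrive as an alist of (name . value) pairs
  (format t "~a ~a from ~a (~d headers)~%" status reason uri (length headers))
  body)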

Now let’s convert the HTML into a DOM object so we can work with it programmatically.

(setf *wikiDOM* (parse-html5 *wikihtml*))

Easy as pie. As a side note, I did try out two other Common Lisp HTML parsing libraries and found them wanting when dealing with various web sites. Feel free to try out Closure HTML or CL-HTML-PARSE, but my experience with them while building cl-breeze was that they didn’t deal properly with various modern HTML practices (they expected documents to adhere to more formal XML-style rules, and weren’t happy when they encountered pages that didn’t), while cl-html5-parser has handled everything I’ve run into so far.

The DOM is a standardized interface for traversing and manipulating structured documents (think XML), and cl-html5-parser includes a minimal implementation of it. You might want to read up on the DOM here or here. In SLIME, type “html5-parser:” and then press C-c TAB to see all the implemented DOM operations.
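To get a feel for these operations, here’s a tiny exploratory sketch that prints the name of each direct child of the parsed document from above (exactly what it prints depends on how the parser models doctype and comment nodes, so treat the output as informational):

;; print the name of each direct child of the parsed document
(html5-parser:element-map-children
 (lambda (child)
   (format t "child node: ~a~%" (node-name child)))
 *wikiDOM*)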

Here’s the code I use to navigate around a document:

(defun flex-dom-map (recurse-p fn node)
  "fn is applied to each visited node;
recurse-p controls whether to visit the children of node"
  (if node
      (progn
        (funcall fn node)               ;apply the function to the node
        (if (funcall recurse-p node)
            (html5-parser:element-map-children
             (lambda (n-node) (flex-dom-map recurse-p fn n-node))
             node)))))

The three parameters to flex-dom-map are (1) recurse-p, which, when it returns true for a node, gives flex-dom-map the go-ahead to look at that node’s children, (2) fn, which is called on each visited node, and (3) node, the node to start from.
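As a quick sanity check (not part of cl-breeze), here’s a sketch that counts every node flex-dom-map visits; it assumes node-type is defined for whatever nodes the parser produces, and skips descending into text nodes since they have no children:

;; count every node visited, descending into everything except text nodes
(let ((count 0))
  (flex-dom-map (lambda (node)
                  (not (equal (html5-parser:node-type node) :TEXT)))
                (lambda (node)
                  (declare (ignore node))
                  (incf count))
                *wikiDOM*)
  count)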

Let’s write a standard recurse-p (recurse-predicate) which ensures we don’t try to extract content from scripts and style blocks:

(defun standard-recurse-p (node)
  "returns true only if you aren't trying to recurse into a script, style, or noscript tag."
  (not (or (equalp (node-name node) "script")
           (equalp (node-name node) "style")
           (equalp (node-name node) "noscript"))))

Now let’s extract some information from the wikipedia article. First, I’d like its title:

(defun get-title (dom-node)
  (flex-dom-map #'standard-recurse-p
                (lambda (node)
                  (if (equalp (node-name node) "title")
                      (return-from get-title
                        (node-value (node-first-child node)))))
                dom-node))

CL-USER> (get-title *wikiDOM*)
"AVL tree - Wikipedia, the free encyclopedia"

That worked nicely. Now let’s try to extract the contents of the article. Here’s a function which extracts the contents of all text nodes it encounters and writes them to a string:

;;; For our purposes this simple remove-newlines method will suffice.
(defun remove-newlines (str)
  (remove-if (lambda (ch) (or (eql ch #\return) (eql ch #\linefeed))) str))

(defun scrapetext (top-node recurse-p)
  (remove-newlines
   (with-output-to-string (s)
     (flex-dom-map recurse-p
                   (lambda (node)
                     (if (equal (html5-parser:node-type node) :TEXT)
                         (format s " ~a " (html5-parser:node-value node))))
                   top-node))))

Looking at the source of the page (or inspecting it with the developer tools of a browser like Firefox), we notice that the article contents are contained in a div with id “mw-content-text”. Let’s use our handy DOM mapping function to select that node.

(defun get-article-text-node (root-node)
  "this function maps its way through the DOM nodes till it finds a node with name 'div'
and 'id' attribute equal to 'mw-content-text', then returns that node"
  (flex-dom-map #'standard-recurse-p
                (lambda (node)
                  (if (equalp (node-name node) "div")
                      (if (equalp (element-attribute node "id") "mw-content-text")
                          (return-from get-article-text-node node))))
                root-node))

CL-USER> (setf *wtextDOM* (get-article-text-node *wikiDOM*))
#<ELEMENT div #x302001B6709D>
CL-USER> (element-attribute *wtextDOM* "id")
"mw-content-text"

Ok so that seems to be working. But we actually want all the text content from within that node, right? Let’s call scrapetext on it with our simple recurse-p and see what happens.

CL-USER> (scrapetext *wtextdom* #'standard-recurse-p)
" AVL tree Type Tree Invented 1962 Invented by G. M. Adelson-Velskii and E. M. Landis Time complexity in big O notation Average Worst case ...

Though the article is in there, we’re also pulling in a lot of non-article text from the Wikipedia sidebar. Let’s create a better recurse-p which excludes that sidebar content.

A quick examination of the page with the Firefox developer tools reveals that the top “infobox” is contained within a table (so let’s exclude tables from being parsed), and all the other sidebar content is contained in divs whose “class” attribute includes “thumb”. Let’s write a getwikitext function which keeps flex-dom-map from peering into tables or nodes with class “thumb”.

(defmethod no-applicable-method ((method (eql #'html5-parser::%node-attributes)) &rest args)
  "this method suppresses 'no-applicable-method' errors from html5-parser::%node-attributes
and simply returns nil instead. These errors come about because we are calling
element-attribute on text nodes, which of course do not have attributes"
  nil)

(defun getwikitext (root-node)
  (let ((wikitextnode (get-article-text-node root-node)))
    (remove-excess-whitespace
     (scrapetext wikitextnode
                 (lambda (node)
                   (and (standard-recurse-p node)
                        (not (equalp (node-name node) "table"))
                        (not (scan "thumb" (element-attribute node "class")))))))))

(remove-excess-whitespace, defined in the complete code listing at the end, simply collapses runs of whitespace into single spaces.)

Pulling everything together:

CL-USER> (getwikitext (parse-html5 (http-request "http://en.wikipedia.org/wiki/AVL_tree")))
" In computer science , an AVL tree (Adelson-Velskii and Landis' tree, named after the inventors) is a self-balancing binary search tree . It was the first such data structure to be invented. [ 1 ] In an AVL tree, th ...
;;; It works!

That does indeed appear to be the article content. It could use a little more custom code to deal with section headings and to straighten out the punctuation, but we’ve made real progress here. Let’s call it a day.
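If you want to experiment with the punctuation cleanup, cl-ppcre (already loaded above) makes a rough first pass a one-liner. Here’s a hypothetical helper, not part of cl-breeze, that drops the space scrapetext leaves before common punctuation marks:

;; hypothetical helper: remove the stray space before , . ; : and )
(defun tidy-punctuation (str)
  (cl-ppcre:regex-replace-all " ([,.;:)])" str "\\1"))

;; (tidy-punctuation "a self-balancing binary search tree . It was the first such data structure to be invented.")
;; => "a self-balancing binary search tree. It was the first such data structure to be invented."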

We now have a way to (1) obtain HTML content from websites, (2) convert that content to a DOM model, and (3) navigate the DOM model to extract the information of interest, all through a compact and easy-to-use interface. I hope this article was useful to you.

Complete code used for this exercise:

(defmacro break-transparent (exp)
  "useful debugging macro I use all the time"
  `(let ((x ,exp)) (break "argument to break: ~:S" x) x))

(ql:quickload :drakma)
(ql:quickload :cl-html5-parser)
(ql:quickload :cl-ppcre)
(use-package (list :drakma :html5-parser :cl-ppcre))

(setf (values *wikihtml* *responsecode* *headers*)
      (http-request "http://en.wikipedia.org/wiki/AVL_tree"))

(setf *wikiDOM* (parse-html5 *wikihtml*))

(defun flex-dom-map (recurse-p fn node)
  "fn is applied to each visited node;
recurse-p controls whether to visit the children of node"
  (if node
      (progn
        (funcall fn node)               ;apply the function to the node
        (if (funcall recurse-p node)
            (html5-parser:element-map-children
             (lambda (n-node) (flex-dom-map recurse-p fn n-node))
             node)))))

(defun standard-recurse-p (node)
  "returns true only if you aren't trying to recurse into a script, style, or noscript tag."
  (not (or (equalp (node-name node) "script")
           (equalp (node-name node) "style")
           (equalp (node-name node) "noscript"))))

(defun get-title (dom-node)
  (flex-dom-map #'standard-recurse-p
                (lambda (node)
                  (if (equalp (node-name node) "title")
                      (return-from get-title
                        (node-value (node-first-child node)))))
                dom-node))

(defun get-article-text-node (root-node)
  "this function maps its way through the DOM nodes till it finds a node with name 'div'
and 'id' attribute equal to 'mw-content-text', then returns that node"
  (flex-dom-map #'standard-recurse-p
                (lambda (node)
                  (if (equalp (node-name node) "div")
                      (if (equalp (element-attribute node "id") "mw-content-text")
                          (return-from get-article-text-node node))))
                root-node))

(defun getwikitext (root-node)
  (let ((wikitextnode (get-article-text-node root-node)))
    (remove-excess-whitespace
     (scrapetext wikitextnode
                 (lambda (node)
                   (and (standard-recurse-p node)
                        (not (equalp (node-name node) "table"))
                        (not (scan "thumb" (element-attribute node "class")))))))))

(defun scrapetext (top-node recurse-p)
  (remove-newlines
   (with-output-to-string (s)
     (flex-dom-map recurse-p
                   (lambda (node)
                     (if (equal (html5-parser:node-type node) :TEXT)
                         (format s " ~a " (html5-parser:node-value node))))
                   top-node))))

(defmethod no-applicable-method ((method (eql #'html5-parser::%node-attributes)) &rest args)
  "this method suppresses 'no-applicable-method' errors from html5-parser::%node-attributes
and simply returns nil instead. These errors come about because we are calling
element-attribute on text nodes, which of course do not have attributes"
  nil)

(defun remove-newlines (str)
  (remove-if (lambda (ch) (or (eql ch #\return) (eql ch #\linefeed))) str))

(defun remove-excess-whitespace (str)
  (cl-ppcre:regex-replace-all "\\s+" str " "))