This hands-on walkthrough covers writing code that parses robots.txt files. I hope you find something useful along the way: at the least, it’s a nice self-contained example.

If you don’t know what robots.txt files are: they (along with robots meta information within HTML pages themselves) are part of the convention that defines where automated crawlers are and are not supposed to go. Here’s a quick overview of the file format: http://www.robotstxt.org/robotstxt.html.
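As a quick illustration, a robots.txt is just a plain-text file of User-agent sections, each followed by Disallow lines (a made-up example):

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

This says BadBot may fetch nothing at all, while all other crawlers must stay out of /cgi-bin/ and /tmp/.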

For my purposes I want to be able to extract the disallowed directories for a given user-agent tag (this is the primary purpose of robots.txt).

Let’s get started:

First, let’s load the libraries we’ll need and write some code using drakma to grab the contents of a given website’s robots.txt file.

(ql:quickload :cl-ppcre)
(ql:quickload :puri)
(load "berryutils.lisp")
(ql:quickload :alexandria)
(ql:quickload :drakma)

(defpackage :com.cvberry.robotser
  (:use :common-lisp :alexandria :com.cvberry.util :cl-ppcre)
  (:import-from :cl-user))
(in-package :com.cvberry.robotser)

ROBOTSER> (setf *engadgetrobots*
                (drakma:http-request "http://www.engadget.com/robots.txt"))
"User-agent: *
Disallow: /forward^M
Disallow: /traffic^M
Disallow: /mm_track^M
Disallow: /search^M
Disallow: /_uac/adpage.html^M
Disallow: /profile^M
Disallow: */comments^M
Disallow: /supersearch^M
Disallow: /permalink/livefyre^M
...

We now have the contents of the file in one giant string. First, let’s use cl-ppcre’s regex-replace-all function to strip out every comment (the portion of a line from a pound sign to the end of the line) in the robots.txt file. While we’re at it, let’s also normalize Windows-style line endings to a standard line ending (this Engadget robots file has Windows-style line endings, which show up strangely in SBCL and CCL under Linux).

(defun blank-out-comments (robotstext)
  (cl-ppcre:regex-replace-all "#.*" robotstext ""))

(defun normalize-line-endings (text)
  "Replaces all sorts of weird line endings with the standard CL line ending #\\Newline."
  (cl-ppcre:regex-replace-all "(\\r|\\n)+" text (string #\Newline)))

ROBOTSER> (setf *erwocomments* (normalize-line-endings (blank-out-comments *engadgetrobots*)))
"User-agent: *
Disallow: /forward
Disallow: /traffic
Disallow: /mm_track
Disallow: /search
Disallow: /_uac/adpage.html
Disallow: /profile
Disallow: */comments
Disallow: /supersearch
...

There may be leftover empty lines where block comments used to be, but since normalize-line-endings collapses runs of line endings, that’s no worry to us.

I’m primarily interested in Disallow directives for User-agent *, but I’d also like the ability to extract the Disallow information for an arbitrary user agent. Since this Engadget file only lists one user agent, let’s also grab the robots file of a site with multiple user agents for testing purposes.

ROBOTSER> (setf *sosrobots*
                (normalize-line-endings
                 (blank-out-comments
                  (drakma:http-request "http://www.sosmath.com/robots.txt"))))
"User-agent: Mediapartners-Google
Disallow:
User-agent: SiteSnagger
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: Teleport Pro
Disallow: /
User-agent: *
Disallow: /ads
Disallow: /backup2
Disallow: /banners
Disallow: /books
Disallow: /cd
Disallow: /cgi-bin
Disallow: /geometry
Disallow: /gif
Disallow: /jpg
Disallow: /plus
Disallow: /technical
Disallow: /temp
"

One helpful feature of cl-ppcre is “single-line-mode”. By default the . metacharacter matches any character except a newline, so a pattern effectively works line by line through your input; single-line-mode causes cl-ppcre to treat the whole input as one line, letting . match newlines as well. This helps us match our multiline User-agent pattern.
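To see the difference at the REPL (foo.bar here is just a toy pattern for illustration):

;; Without single-line-mode, "." refuses to match the newline between foo and bar:
ROBOTSER> (cl-ppcre:scan "foo.bar" (format nil "foo~%bar"))
NIL
;; With it, the whole two-line string matches:
ROBOTSER> (cl-ppcre:scan (cl-ppcre:create-scanner "foo.bar" :single-line-mode t)
                         (format nil "foo~%bar"))
0
7
#()
#()

With that in hand, here’s the splitting function: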

(defun splitbyuseragent (robotstext)
  (declare (optimize (debug 3)))
  (let ((uagentsearch (cl-ppcre:create-scanner "user-agent:"
                                               :case-insensitive-mode t
                                               :single-line-mode t)))
    (loop for str in (subseq (cl-ppcre:split uagentsearch robotstext) 1) ; the first elt comes BEFORE the first user-agent directive
          collect (let* ((after-split (cl-ppcre:split "\\n" str))
                         (title (string-trim " " (elt after-split 0)))
                         (others (subseq after-split 1)))
                    (cons title (list others))))))

ROBOTSER> (splitbyuseragent *sosrobots*)
(("Mediapartners-Google" ("Disallow:"))
 ("SiteSnagger" ("Disallow: /"))
 ("WebCopier" ("Disallow: /"))
 ("Teleport Pro" ("Disallow: /"))
 ("*"
  ("Disallow: /ads" "Disallow: /backup2" "Disallow: /banners" "Disallow: /books"
   "Disallow: /cd" "Disallow: /cgi-bin" "Disallow: /geometry" "Disallow: /gif"
   "Disallow: /jpg" "Disallow: /plus" "Disallow: /technical" "Disallow: /temp")))

Sweet, so we’ve split up the directives by user agent. Now I just need to select the user agents I’m interested in and extract the actual directory values from this data, and I’m done. Here I use the puri library to perform safe URL merging between the root and the relative path.
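The easy-uri-merge helper comes from my berryutils.lisp, which isn’t reproduced in this post. For the curious, here’s a minimal sketch of roughly what it does on top of puri; treat the exact implementation as an assumption:

(defun easy-uri-merge (rooturl relpath)
  ;; Hypothetical sketch -- the real easy-uri-merge lives in berryutils.lisp.
  ;; Merge the relative path RELPATH against ROOTURL and render the result
  ;; back out as a string.
  (puri:render-uri (puri:merge-uris relpath rooturl) nil))

With URL merging handled, here are the selection and extraction functions: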

(defun select-agent (agent agentdata-alist)
  (cadr (assoc agent agentdata-alist :test #'equalp)))

(defun extract-disallow-dirs (rooturl robotslinelist)
  (declare (optimize (debug 3)))
  (let* ((disallowscanner (create-scanner "Disallow:(.*)" :case-insensitive-mode t))
         (relativepaths (loop for line in robotslinelist
                              when (scan disallowscanner line)
                              collect (register-groups-bind (path) (disallowscanner line)
                                        (string-trim " " path)))))
    (mapcar (lambda (relpath) (easy-uri-merge rooturl relpath)) relativepaths)))

ROBOTSER> (extract-disallow-dirs "http://www.sosmath.com"
                                 (select-agent "*" (splitbyuseragent *sosrobots*)))
("http://www.sosmath.com/ads" "http://www.sosmath.com/backup2"
 "http://www.sosmath.com/banners" "http://www.sosmath.com/books"
 "http://www.sosmath.com/cd" "http://www.sosmath.com/cgi-bin"
 "http://www.sosmath.com/geometry" "http://www.sosmath.com/gif"
 "http://www.sosmath.com/jpg" "http://www.sosmath.com/plus"
 "http://www.sosmath.com/technical" "http://www.sosmath.com/temp")
ROBOTSER> (extract-disallow-dirs "http://www.engadget.com"
                                 (select-agent "*" (splitbyuseragent *erwocomments*)))
("http://www.engadget.com/forward" "http://www.engadget.com/traffic"
 "http://www.engadget.com/mm_track" "http://www.engadget.com/search"
 "http://www.engadget.com/_uac/adpage.html" "http://www.engadget.com/profile"
 "http://www.engadget.com/*/comments" "http://www.engadget.com/supersearch"
 "http://www.engadget.com/permalink/livefyre" "http://www.engadget.com/404/"
 "http://www.engadget.com/500/" "http://www.engadget.com/503/"
 "http://www.engadget.com/compare/*"
 ...

We’ve successfully parsed the Disallow directives from these two sites. A quick examination of the robots.txt guidelines suggests we’ve covered the general case as well. Let’s write a function that pulls the processing steps together.

(defun get-forbidden-dirs-for-agent (rooturl robotstext agent)
  (extract-disallow-dirs
   rooturl
   (select-agent agent
                 (splitbyuseragent
                  (normalize-line-endings
                   (blank-out-comments robotstext))))))

;;; Test
ROBOTSER> (get-forbidden-dirs-for-agent "http://www.engadget.com" *engadgetrobots* "*")
("http://www.engadget.com/forward" "http://www.engadget.com/traffic"
 "http://www.engadget.com/mm_track" "http://www.engadget.com/search"
 "http://www.engadget.com/_uac/adpage.html" "http://www.engadget.com/profile"
 "http://www.engadget.com/*/comments" "http://www.engadget.com/supersearch"
 "http://www.engadget.com/permalink/livefyre" "http://www.engadget.com/404/"
 "http://www.engadget.com/500/" "http://www.engadget.com/503/"
 "http://www.engadget.com/compare/*"
 ...

I’m writing a lightweight site search engine I call cl-breeze, so for the top-level wrapper function I’m interested in getting the directories excluded for “cl-breeze” first, and if that’s not specified (which will be nearly all the time, unless it’s a website set up specifically to work with cl-breeze), falling back to the excluded directories for “*”. Let’s see what that might look like:

(defun process-site-robots (rooturl)
  (let ((robotstext (drakma:http-request (easy-uri-merge rooturl "robots.txt"))))
    (if robotstext
        (let* ((breezedirs (get-forbidden-dirs-for-agent rooturl robotstext "cl-breeze"))
               (stardirs (get-forbidden-dirs-for-agent rooturl robotstext "*")))
          ;; prefer the cl-breeze-specific directives; fall back to "*"
          (or breezedirs stardirs)))))
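Note that the or expression prefers any cl-breeze-specific directives and only falls back to the wildcard “*” entry when no cl-breeze section is present, matching the priority described above.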

Let’s test on a few sites!

ROBOTSER> (process-site-robots "http://www.projecteuler.com")
("http://www.projecteuler.com/fcmedianet.js"
 "http://www.projecteuler.com/__media__/js/templates.js"
 "http://www.projecteuler.com/cmedianet" "http://www.projecteuler.com/cmdynet"
 "http://www.projecteuler.com/mediamainlog.php")
ROBOTSER> (process-site-robots "http://www.foxnews.com")
("http://www.foxnews.com/portal/*" "http://www.foxnews.com/printer_friendly_story"
 "http://www.foxnews.com/projects/livestream"
 "http://www.foxnews.com/story/0,2933,83083,00.html"
 "http://www.foxnews.com/column_archive/0,2976,71,00.html"
 "http://www.foxnews.com/other" "http://www.foxnews.com/topics/theme/")

Below is the complete code for this exercise.

(ql:quickload :cl-ppcre)
(ql:quickload :puri)
(load "berryutils.lisp") ; provides com.cvberry.util, including easy-uri-merge
(ql:quickload :alexandria)
(ql:quickload :drakma)

(defpackage :com.cvberry.robotser
  (:use :common-lisp :alexandria :com.cvberry.util :cl-ppcre)
  (:import-from :cl-user))
(in-package :com.cvberry.robotser)

(defun process-site-robots (rooturl)
  (let ((robotstext (drakma:http-request (easy-uri-merge rooturl "robots.txt"))))
    (if robotstext
        (let* ((breezedirs (get-forbidden-dirs-for-agent rooturl robotstext "cl-breeze"))
               (stardirs (get-forbidden-dirs-for-agent rooturl robotstext "*")))
          ;; prefer the cl-breeze-specific directives; fall back to "*"
          (or breezedirs stardirs)))))

(defun get-forbidden-dirs-for-agent (rooturl robotstext agent)
  (extract-disallow-dirs
   rooturl
   (select-agent agent
                 (splitbyuseragent
                  (normalize-line-endings
                   (blank-out-comments robotstext))))))

(defun select-agent (agent agentdata-alist)
  (cadr (assoc agent agentdata-alist :test #'equalp)))

(defun extract-disallow-dirs (rooturl robotslinelist)
  (declare (optimize (debug 3)))
  (let* ((disallowscanner (create-scanner "Disallow:(.*)" :case-insensitive-mode t))
         (relativepaths (loop for line in robotslinelist
                              when (scan disallowscanner line)
                              collect (register-groups-bind (path) (disallowscanner line)
                                        (string-trim " " path)))))
    (mapcar (lambda (relpath) (easy-uri-merge rooturl relpath)) relativepaths)))

(defun splitbyuseragent (robotstext)
  (declare (optimize (debug 3)))
  (let ((uagentsearch (cl-ppcre:create-scanner "user-agent:"
                                               :case-insensitive-mode t
                                               :single-line-mode t)))
    (loop for str in (subseq (cl-ppcre:split uagentsearch robotstext) 1) ; the first elt comes BEFORE the first user-agent directive
          collect (let* ((after-split (cl-ppcre:split "\\n" str))
                         (title (string-trim " " (elt after-split 0)))
                         (others (subseq after-split 1)))
                    (cons title (list others))))))

(defun blank-out-comments (robotstext)
  (cl-ppcre:regex-replace-all "#.*" robotstext ""))

(defun normalize-line-endings (text)
  "Replaces all sorts of weird line endings with the standard CL line ending #\\Newline."
  (cl-ppcre:regex-replace-all "(\\r|\\n)+" text (string #\Newline)))