Building A Concurrent Web Scraper With Haskell

Written March 10, 2012

updated: April 16, 2012

Introduction

Let's make a concurrent web scraper! We will use Haskell, because it allows easy concurrency. We will use the HXT library to do the scraping. If you want to follow the HXT bits, you should be comfortable with Arrows in Haskell. If you're not, take a moment to read up on Arrows.

If you don't care about the scraping bits, jump straight to the concurrency section.

Installation

Make sure you have the hxt , url and http packages:

cabal install hxt cabal install url cabal install http cabal install maybet

Basic Setup

First, let's write some basic functions to make life easier for ourselves:

openUrl :: String -> MaybeT IO String openUrl url = case parseURI url of Nothing -> fail "" Just u -> liftIO ( getResponseBody =<< simpleHTTP ( mkRequest GET u )) css :: ArrowXml a => String -> a XmlTree XmlTree css tag = multi ( hasName tag ) get :: String -> IO ( IOSArrow XmlTree ( NTree XNode )) get url = do contents <- runMaybeT $ openUrl url return $ readString [ withParseHTML yes , withWarnings no ] ( fromMaybe "" contents )

I say basic because they will be our building blocks, not because they are easy :P Let's see how they work.

openUrl is a function that will download a web page for us. It returns a MaybeT Monad Transformer. We can use it like

contents <- runMaybeT $ openUrl "http://example.com"

and contents will be a Just if the operation was successful, or Nothing otherwise.

css will allow us to use css selectors on the downloaded page.

get is where things get interesting. First, we download a page using

contents <- runMaybeT $ openUrl url

like we talked about. Next, we parse the page using HXT :

readString [ withParseHTML yes , withWarnings no ] contents

readString takes some options as its first parameter:

withParseHTML : Parse as HTML, which makes sure the parser doesn't break on things like the doctype.

: Parse as HTML, which makes sure the parser doesn't break on things like the doctype. withWarnings : Prints out warnings about malformed html if it's switched on. Since so much of the web is malformed html, I switched it off :P

Now we are ready to start.

Image Downloader

Let's write something that downloads all the images from a given page.

First, let's get a parsed page:

main = do page <- get "http://www.reddit.com/r/pics"

page is now an Arrow. We can run this Arrow at any time to get its value by using runX . Let's try it now:

ghci > runX page [ NTree ( XTag "/" [ NTree ( XAttr "transfer-Status" ) [ NTree ( XText "200" ) [] ], NTree ( XAttr "transfer-Message" ) [ NTree ( XText "OK" ) [] ], NTree ( XAttr "transfer-URI" ) [ NTree ( XText "string:" ) [] ], NTree ( XAttr "source" ) [ NTree ( XText " \" <!DOCTYPE html PUBLIC \\\" -//W3C//DTD XHTML 1.0 ... \" " ) [] ], NTree ( XAttr "transfer-Encoding" ) [ NTree ( XText "UNICODE" ) [] ], NTree ( XAttr "doctype-name" ) ( ... many lines skipped ... )

Wow, that looks confusing. Let's select only what we want. Get just the images:

ghci > runX $ page >>> css "img" [ NTree ( XTag "img" [ NTree ( XAttr "id" ) [ NTree ( XText "header-img" ) [] ], NTree ( XAttr "src" ) [ NTree ( XText "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png" ) [] ] ...

Aha! Much nicer. Now let's get just the src 's:

ghci > runX $ page >>> css "img" >>> getAttrValue "src" [ "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png" , "http://pixel.redditmedia.com/pixel/of_defenestration.png" ...

Done! That was easy. Now all we need to do is download these images and save them to disk.

main = do url <- parseArgs doc <- get url imgs <- runX . images $ doc sequence_ $ map download imgs

The first three lines of our main function get a list of links. The images function is very simple:

images tree = tree >>> css "img" >>> getAttrValue "src"

It gets a list of all the image sources, just like we had talked about.

The fourth lines maps the download function over this list to create a list of IO actions. Then we feed that list into sequence_ , which runs the actions one at a time and throws away the return values. We could have used sequence instead, which would have printed the return values.

Here's the download function:

download url = do content <- runMaybeT $ openUrl url case content of Nothing -> putStrLn $ "bad url: " ++ url Just _content -> do let name = tail . uriPath . fromJust . parseURI $ url B . writeFile name ( B . pack _content )

We have to write out binary data, so we use the writeFile defined in Data.ByteString.Char8 , which operates on ByteString s. This is why we need to convert our String to a ByteString first using B.pack . We are also able to do error checking thanks to our openUrl function being a MaybeT . If we didn't get any content, we just print out "bad url: [url]". Otherwise we download the image.

Concurrency

After all that work, the concurrent bit seems almost anti-climactic.

First, install the parallel-io package:

cabal install parallel - io

Import it into the script:

import Control.Concurrent.ParallelIO

ParallelIO defines a new function called parallel_ which we can use anywhere we would have used sequence_ . The IO actions will then get performed concurrently. Change the end of the script to this:

... imgs <- runX . images $ doc parallel_ $ map download imgs stopGlobalPool

stopGlobalPool needs to be called after the last use of a global parallelism combinator. It cleans up the thread pool before shutdown.

Now build the concurrent version (enabling runtime system options):

$ ghc --make grabber_par.hs -threaded -rtsopts

And run it with +RTS -N[number of threads] :

$ ./ grabber_par + RTS - N4

Results

Here's how the two versions performed on my machine:

Without parallelization:

$ time ./ grabber "http://www.reddit.com/r/pics" real 0 m10 . 341 s user 0 m0 . 203 s sys 0 m0 . 048 s

With parallelization (four threads):

$ time ./ grabber_par "http://www.reddit.com/r/pics" + RTS - N4 real 0 m3 . 490 s user 0 m0 . 477 s sys 0 m0 . 154 s

Almost a third of the time!

The ParallelIO library uses MVar s to keep things in sync. Read more about ParallelIO or MVars.

Next Steps

Next steps involve writing this as a crawler that visits links on the page up to a depth of N as well as some way to keep track of visited pages. We also want to keep track of name collisions. If you try to download two images, both named "test.jpg", the concurrent version will error out. The non-concurrent version would just overwrite one image with another, which isn't any good either. On the crawling side, we should watch out for robots.txt files and META tag directives to be polite. And ask for gzip'd data to reduce request time.

We could also parallelize more than just the download, but its a start!

Full Code Listing