This Haskell tutorial was written in early March 2011; while the code below worked then, it may not work with GitHub or the necessary Haskell libraries now. If you are interested in downloading from GitHub, I suggest looking into the GitHub API or the Haskell github library.

Patch-tag was nice enough to supply a URL which provided exactly the URLs I needed, but I couldn't expect such personalized support from GitHub. GitHub does supply an API of sorts for developers and hobbyists, but said API provides no obvious way to get what I want: 'URLs for all Haskell-related repos'. So, scraping it is: I'd write a script to munge some GitHub HTML and get the URLs I want that way.

Previously, I wrote a simple script to download the repositories of the source-repository hosting site Patch-tag.com. Patch-tag specializes in hosting Darcs repositories (usually Haskell-related). GitHub is a much larger & more popular hosting site, and though it supports not Darcs but git (as the name indicates), it is so popular that it still hosts a great deal of Haskell. I've downloaded a few repositories out of curiosity or because I was working on their contents (e.g. gitit), but there are too many to download manually. I needed a script.

(For example, part of my lobbying for Control.Monad.void was based on producing a list of dozens of source files which rewrote that particular idiom, and I have been able to usefully comment & judge based on crude statistics gathered by grepping through my hundreds of repositories.)

Along the lines of Archiving URLs, I like to keep copies of Haskell-related source-code repositories because the files & history might come in handy on occasion, and because having a large collection of repositories lets me search them for random purposes.

Parsing pages

The closest I can get to a target URL is https://github.com/languages/Haskell/created. We'll be parsing that. The first thing to do is to steal TagSoup code from my previous scrapers, so our very crudest version looks like this:

```haskell
import Text.HTML.TagSoup
import Text.HTML.Download (openURL)

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links

linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
```

Downloading pages (the lazy way)

We run it and it throws an exception!

```
*** Exception: getAddrInfo: does not exist (No address associated with hostname)
```

Oops. We got so wrapped up in parsing the HTML that we forgot to make sure downloading worked in the first place. Well, we're lazy programmers, so now, on demand, we'll investigate that problem. The exception sounds like a problem with the openURL call: 'hostname' is a networking term, not a parsing or printing term. So we try running just openURL "https://github.com/languages/Haskell/created" by itself, and get the same error. Not helpful. We try a different implementation of openURL, mentioned in the other scraping script:

```haskell
import Network.HTTP (getRequest, simpleHTTP)

openURL = simpleHTTP . getRequest
```

Calling that again, we see:

```
> openURL "https://github.com/languages/Haskell/created"
Loading package HTTP-4000.1.1 ... linking ... done.
Right HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.67
Date: Tue, 15 Mar 2011 22:54:31 GMT
Content-Type: text/html
Content-Length: 185
Connection: close
Location: https://github.com/languages/Haskell/created
```

Oh dear. It seems that the HTTP package just won't handle HTTPS; the package description doesn't mention HTTPS, nor do any of the module names seem connected to it. Best to give up on it entirely. If we google 'Haskell https', one of the first 20 hits happens to be a Stack Overflow question which sounds promising: "Haskell Network.Browser HTTPS Connection". The one answer says to simply use the Haskell binding to curl. Well, fine. I already had that installed because Darcs uses the binding for downloads, so I'll use that package. We go to the top-level module hoping for an easy download. Scrolling down, one's eye is caught by curlGetString, which, while not necessarily a promising name, does have an interesting type: URLString -> [CurlOption] -> IO (CurlCode, String).
Note especially the return value: from past experience with the HTML package, one would give a good chance that URLString is just a type synonym for a URL string, and that the String returned is just the HTML source we want. What CurlOption might be, I have no idea, but let's try simply omitting them all. So we load the module in GHCi (:module + Network.Curl) and see what curlGetString "https://github.com/languages/Haskell/created" [] does:

```
(CurlOK, "<!DOCTYPE html> <html> <head> <meta charset='utf-8'> <meta http-equiv=\"X-UA-Compatible\" content=\"chrome=1\"> <title>Recently Created Haskell Repositories - GitHub</title> <link rel=\"search\" type=\"application/opensearchdescription+xml\" href=\"/opensearch.xml\" title=\"GitHub\" /> <link rel=\"fluid-icon\" href=\"https://github.com/fluidicon.png\" title=\"GitHub\" /> ...")
```

Great! As they say, 'try the simplest possible thing that could possibly work', and this seems to. We don't really care about the exit code, since this is a hacky script; we'll throw it away and keep only the second part of the tuple with the usual snd. It's in IO, so we need to use liftM or fmap before we can apply snd. Combined with our previous TagSoup code, we get:

```haskell
import Text.HTML.TagSoup
import Network.Curl (curlGetString, URLString)

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
```
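The shape of that fmap snd trick can be seen in isolation; here is a minimal sketch (discardFirst is a name of my own invention, purely for illustration):

```haskell
-- Illustrative helper (not from the script): drop the first component
-- of an IO-wrapped pair, keeping only the payload we care about.
discardFirst :: IO (a, b) -> IO b
discardFirst = fmap snd
```

For example, discardFirst (pure (0 :: Int, "<html>")) yields "<html>", just as fmap snd applied to curlGetString's result keeps only the page source and discards the CurlCode.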

Spidering (the lazy way)

What's the output of this?

```
["logo boring","https://github.com","/plans","/explore","/features","/blog",
 "/login?return_to=https://github.com/languages/Haskell/created",
 "/languages/Haskell","/explore","explore_main","/repositories","explore_repos",
 "/languages","selected","explore_languages","/timeline","explore_timeline",
 "/search","code_search","/tips","explore_tips","/languages/Haskell",
 "/languages/Haskell/created","selected","/languages/Haskell/updated",
 "/languages","/languages/ActionScript/created","/languages/Ada/created",
 "/languages/Arc/created","/languages/ASP/created","/languages/Assembly/created",
 "/languages/Boo/created","/languages/C/created","/languages/C%23/created",
 "/languages/C++/created","/languages/Clojure/created",
 "/languages/CoffeeScript/created","/languages/ColdFusion/created",
 "/languages/Common%20Lisp/created","/languages/D/created",
 "/languages/Delphi/created","/languages/Duby/created",
 "/languages/Eiffel/created","/languages/Emacs%20Lisp/created",
 "/languages/Erlang/created","/languages/F%23/created",
 "/languages/Factor/created","/languages/FORTRAN/created",
 "/languages/Go/created","/languages/Groovy/created","/languages/HaXe/created",
 "/languages/Io/created","/languages/Java/created",
 "/languages/JavaScript/created","/languages/Lua/created",
 "/languages/Max/MSP/created","/languages/Nu/created",
 "/languages/Objective-C/created","/languages/Objective-J/created",
 "/languages/OCaml/created","/languages/ooc/created","/languages/Perl/created",
 "/languages/PHP/created","/languages/Pure%20Data/created",
 "/languages/Python/created","/languages/R/created","/languages/Racket/created",
 "/languages/Ruby/created","/languages/Scala/created",
 "/languages/Scheme/created","/languages/sclang/created",
 "/languages/Self/created","/languages/Shell/created",
 "/languages/Smalltalk/created","/languages/SuperCollider/created",
 "/languages/Tcl/created","/languages/Vala/created","/languages/Verilog/created",
 "/languages/VHDL/created","/languages/VimL/created",
 "/languages/Visual%20Basic/created","/languages/XQuery/created",
 "/brownnrl","/brownnrl/Real-World-Haskell","/joh","/joh/tribot","/bjornbm",
 "/bjornbm/publicstuff","/codemac","/codemac/yi","/poconnell93",
 "/poconnell93/chat","/jillianfu","/jillianfu/Angel","/jaspervdj",
 "/jaspervdj/sup-host","/serras","/serras/scion-ghc-7-requisites","/serras",
 "/serras/scion","/iand675","/iand675/cgen","/shangaslammi",
 "/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace","/jaspervdj",
 "/jaspervdj/wol","/tomlokhorst","/tomlokhorst/wol","/bos",
 "/bos/concurrent-barrier","/jkingry","/jkingry/projectEuler","/olshanskydr",
 "/olshanskydr/xml-enumerator","/lorenz","/lorenz/fypmaincode","/jaspervdj",
 "/jaspervdj/data-object-json","/jaspervdj","/jaspervdj/data-object-yaml",
 "/languages/Haskell/created?page=2","next","/languages/Haskell/created?page=3",
 "/languages/Haskell/created?page=4","/languages/Haskell/created?page=5",
 "/languages/Haskell/created?page=6","/languages/Haskell/created?page=7",
 "/languages/Haskell/created?page=8","/languages/Haskell/created?page=9",
 "/languages/Haskell/created?page=208","/languages/Haskell/created?page=209",
 "/languages/Haskell/created?page=2","l","next","http://www.rackspace.com",
 "logo","http://www.rackspace.com ","http://www.rackspacecloud.com",
 "https://github.com/blog","/login/multipass?to=http%3A%2F%2Fsupport.github.com",
 "https://github.com/training","http://jobs.github.com","http://shop.github.com",
 "https://github.com/contact","http://develop.github.com",
 "http://status.github.com","/site/terms","/site/privacy",
 "https://github.com/security","nofollow","?locale=de","nofollow","?locale=fr",
 "nofollow","?locale=ja","nofollow","?locale=pt-BR","nofollow","?locale=ru",
 "nofollow","?locale=zh","#","minibutton btn-forward js-all-locales","nofollow",
 "?locale=en","nofollow","?locale=af","nofollow","?locale=ca","nofollow",
 "?locale=cs","nofollow","?locale=de","nofollow","?locale=es","nofollow",
 "?locale=fr","nofollow","?locale=hr","nofollow","?locale=hu","nofollow",
 "?locale=id","nofollow","?locale=it","nofollow","?locale=ja","nofollow",
 "?locale=nl","nofollow","?locale=no","nofollow","?locale=pl","nofollow",
 "?locale=pt-BR","nofollow","?locale=ru","nofollow","?locale=sr","nofollow",
 "?locale=sv","nofollow","?locale=zh","#","js-see-all-keyboard-shortcuts"]
```

Quite a mouthful, but we can easily filter things down. "/languages/Haskell/created?page=3" is an example of a link to an index page listing Haskell repositories; presumably the current page would be "?page=1", and the highest listed seems to be "/languages/Haskell/created?page=209". The actual repositories look like "/jaspervdj/data-object-yaml". The regularity of the numbering suggests that we can avoid any actual spidering: if we can just figure out the highest page number, we can generate all the page names in between, because they follow a simple scheme. Assume we have the final number n; we already know we can get the full list of numbers with [1..n]. Then we want to prepend "languages/Haskell/created?page=" to each, but it's a type error to simply write map ("languages/Haskell/created?page="++) [1..n]: there is only one type variable in (++) :: [a] -> [a] -> [a], so we cannot append numbers to a String. To convert each number to a proper String, we insert map show, which gives us our generator:

```haskell
listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [1 ..]
```

(This will throw a warning under -Wall because GHC has to guess whether the 1 is an Int or an Integer; it can be quieted by writing (1::Int) instead.)
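Because the list is lazy, we can define it unboundedly and decide later how many pages we actually want. A self-contained sketch (repeating the listPages definition; firstTwo is a name of my own, just for demonstration):

```haskell
-- The generator from the text; `show` turns each Int into a String
-- so the (++) types line up.
listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x)
                [(1 :: Int) ..]

-- Laziness lets us pull a finite prefix out of the infinite list:
firstTwo :: [String]
firstTwo = take 2 listPages
```

firstTwo evaluates to the URLs for pages 1 and 2; the infinite tail is never forced.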
But what is the last page? We don't know the final, highest, oldest page, so we don't know how much of our infinite lazy list to take. It's easy enough to filter the list down to just the index links: filter (isPrefixOf "/languages/Haskell/created?page="). Then we call last, right? (Or something like head . reverse, if we didn't know last or didn't think to check the Hoogle hits for [a] -> a.) But if you look back at the original scraping output, you see an example of how a simple approach can go wrong: we read "/languages/Haskell/created?page=209" and then we read "/languages/Haskell/created?page=2"! 2 is less than 209, of course, and is the wrong answer. GitHub is not padding the numbers to look like "created?page=002", so our simple-minded approach doesn't work. So we need to extract the number. Easy enough: the prefix is statically known and never changes, so we can hardwire some crude parsing with drop 32. How do we turn the remaining String into an Int? Hopefully one knows about read, but even here Hoogle will save our bacon if we think to look through the list of hits for String -> Int: read turns up as hit #10 or #11. Now that we have turned our [String] into [Int], we could sort it and take the last entry, or again go to the standard library and use maximum (like read, it turns up for [Int] -> Int, if not as highly ranked as one might hope). Tweaking the syntax a little, our final result is:

```haskell
lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32)
         . filter ("/languages/Haskell/created?page=" `isPrefixOf`)
```

If we didn't want to hardwire this for Haskell, we'd write the function with an additional parameter and replace the 32 with a runtime calculation of how much to remove:

```haskell
lastPageGeneric :: String -> [String] -> Int
lastPageGeneric lang = maximum . map (read . drop (length lang))
                     . filter (lang `isPrefixOf`)
```

So let's put what we have together.
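To see concretely why maximum is needed instead of last, here is a small self-contained check against sample links in the order GitHub emits them (pagePrefix and pageNumbers are my own helper names):

```haskell
import Data.List (isPrefixOf)

-- The statically-known prefix; it happens to be exactly 32 characters long.
pagePrefix :: String
pagePrefix = "/languages/Haskell/created?page="

-- Extract just the page numbers from index links.
pageNumbers :: [String] -> [Int]
pageNumbers = map (read . drop (length pagePrefix))
            . filter (pagePrefix `isPrefixOf`)

-- Links in scrape order: page 209 is followed by a second link to page 2.
sampleLinks :: [String]
sampleLinks = map (pagePrefix ++) ["2", "208", "209", "2"]
```

On sampleLinks, last (pageNumbers sampleLinks) gives 2, the wrong answer, while maximum (pageNumbers sampleLinks) gives 209, the right one.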
The program can now download an initial index page, parse it, find the number of the last index page, generate the URLs of all index pages, and print those out (to prove that it all works):

```haskell
import Data.List (isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          print indxPgs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32)
         . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1 :: Int) ..]
```

So where were we? We had a [String] (in a variable named indxPgs) representing all the index pages. We can get the HTML source of each page just by reusing openURL (it works on the first one, so it stands to reason it'd work on all index pages), which is trivial by this point: mapM openURL indxPgs.

Filtering repositories

In the TagSoup result, we saw the addresses of the repositories listed on the first index page:

```
["/brownnrl","/brownnrl/Real-World-Haskell","/joh","/joh/tribot","/bjornbm",
 "/bjornbm/publicstuff","/codemac","/codemac/yi","/poconnell93",
 "/poconnell93/chat","/jillianfu","/jillianfu/Angel","/jaspervdj",
 "/jaspervdj/sup-host","/serras","/serras/scion-ghc-7-requisites","/serras",
 "/serras/scion","/iand675","/iand675/cgen","/shangaslammi",
 "/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace","/jaspervdj",
 "/jaspervdj/wol","/tomlokhorst","/tomlokhorst/wol","/bos",
 "/bos/concurrent-barrier","/jkingry","/jkingry/projectEuler","/olshanskydr",
 "/olshanskydr/xml-enumerator","/lorenz","/lorenz/fypmaincode","/jaspervdj",
 "/jaspervdj/data-object-json","/jaspervdj","/jaspervdj/data-object-yaml"]
```

Without looking at the rendered page in our browser, it's obvious that GitHub is linking first to whatever user owns or created the repository, and then linking to the repository itself. We don't want the users, but the repositories. Fortunately, it's equally obvious how to tell them apart: no user page has two forward-slashes in it, while every repository page has exactly two. So we want to count the forward-slashes and keep every address with exactly 2 of them. We want a function that takes a possible element and a list and counts the occurrences, i.e. something of type a -> [a] -> Int. This is easy to do with primitive recursion and an accumulator, or perhaps length combined with filter; but the base library already has functions of that shape. elemIndex annoyingly returns a Maybe Int, so we'll use elemIndices instead and call length on its output: length (elemIndices '/' x) == 2. This is not quite right.
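That predicate, as a standalone sketch (isRepoPath is my own name for it), also shows the catch: absolute URLs like "http://www.rackspace.com" happen to contain exactly two slashes as well, so counting alone is not enough:

```haskell
import Data.List (elemIndices)

-- Keep only addresses of the form "/user/repo": exactly two forward-slashes.
isRepoPath :: String -> Bool
isRepoPath x = length (elemIndices '/' x) == 2
```

isRepoPath "/jaspervdj/wol" is True and isRepoPath "/jaspervdj" is False, as desired, but isRepoPath "http://www.rackspace.com" is also True (the two slashes of "http://" count), which is why this is "not quite right" on its own.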
If we run this on the original parsed output, we get:

```
["https://github.com","/languages/Haskell","/languages/Haskell",
 "/plategreaves/unordered-containers","/vincenthz/hs-tls-extra",
 "/aculich/fix-symbols-gitit","/sphynx/euler-hs",
 "/DRMacIver/unordered-containers","/hamishmack/yesod-slides",
 "/GNUManiacs/hoppla","/DRMacIver/hs-rank-aggregation",
 "/naota/hackage-autoebuild","/magthe/hsini","/dagit/gnuplot-test",
 "/imbaczek/HBPoker","/sergeyastanin/simpleea","/cbaatz/hamu8080",
 "/aristidb/xml-enumerator","/elliottt/value-supply","/gnumaniacs-org/hoppla",
 "/emillon/tyson","/quelgar/hifive","/quelgar/haskell-websockets",
 "http://www.rackspace.com","http://www.rackspace.com ",
 "http://www.rackspacecloud.com",
 "/login/multipass?to=http%3A%2F%2Fsupport.github.com","http://jobs.github.com",
 "http://shop.github.com","http://develop.github.com","http://status.github.com",
 "/site/terms","/site/privacy"]
```

It doesn't look like we mistakenly omitted a repository, but it does look like we mistakenly included things we should not have. We need to filter out anything beginning with "http://", "https://", "/site/", "/languages/", or "/login/". We could call filter multiple times, or use a tricky foldr to accumulate only results which don't match any of the items in our list ["/languages/", "/login/", "/site/", "http://", "https://"]. But I already wrote the solution to this problem back in the original WP RSS archive-bot, where I noticed that my original giant filter call could be replaced by a much more elegant use of any:

```haskell
uniq :: [String] -> [String]
uniq = filter (\x -> not $ any (flip isInfixOf x) exceptions)

exceptions :: [String]
exceptions = [ "wikimediafoundation", "http://www.mediawiki.org/", "wikipedia"
             , "&curid=", "index.php?title=", "&action=" ]
```

In our case, we replace isInfixOf with isPrefixOf, and we define different constants in exceptions.
To put it all together into a new filtering function, we have:

```haskell
repos :: String -> [String]
repos = uniq . linkify
  where uniq :: [String] -> [String]
        uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
        exceptions :: [String]
        exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
        count :: String -> Bool
        count x = length (elemIndices '/' x) == 2
```

Our new minimalist program, which will test out repos:

```haskell
import Data.List (elemIndices, isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          print $ repos html

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

repos :: String -> [String]
repos = uniq . linkify
  where uniq :: [String] -> [String]
        uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
        exceptions :: [String]
        exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
        count :: String -> Bool
        count x = length (elemIndices '/' x) == 2
```

The output:

```
["/plategreaves/unordered-containers","/vincenthz/hs-tls-extra",
 "/aculich/fix-symbols-gitit","/sphynx/euler-hs",
 "/DRMacIver/unordered-containers","/hamishmack/yesod-slides",
 "/GNUManiacs/hoppla","/DRMacIver/hs-rank-aggregation",
 "/naota/hackage-autoebuild","/magthe/hsini","/dagit/gnuplot-test",
 "/imbaczek/HBPoker","/sergeyastanin/simpleea","/cbaatz/hamu8080",
 "/aristidb/xml-enumerator","/elliottt/value-supply","/gnumaniacs-org/hoppla",
 "/emillon/tyson","/quelgar/hifive","/quelgar/haskell-websockets"]
```

Shelling out to git

That leaves the 'shell out to git' functionality. We could try stealing the spawn (call out to /bin/sh) code from XMonad, but the point of spawn is that it forks away completely from our script, which would completely defeat our desired lack of parallelism. I ultimately wound up using a function from System.Process, readProcessWithExitCode. (Why readProcessWithExitCode and not readProcess? Because if a directory already exists, git makes readProcess throw an exception which kills the script!) This will work:

```haskell
shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u] ""
                  print y
```

In retrospect, it might have been a better idea to use runCommand or System.Cmd. Alternatively, we could use the same shelling-out functionality from the original patch-tag.com script:

```haskell
mapM_ (\x -> runProcess "darcs" ["get", "--lazy", "http://patch-tag.com" ++ x]
             Nothing Nothing Nothing Nothing Nothing) targets
```

which could be rewritten for us (sans logging) as:

```haskell
shellToGit :: String -> IO ()
shellToGit u = runProcess "git" ["clone", u] Nothing Nothing Nothing Nothing Nothing
               >> return ()
-- We could replace `return ()` with Control.Monad.void to drop the
-- `IO ProcessHandle` result.
```

Now it's easy to fill in our 2 missing lines:

```haskell
...
repourls <- mapM getRepos indxPgs
let gitURLs = map gitify $ concat repourls
mapM_ shellToGit gitURLs
```

(The concat is there because getRepos gives us a [String] for each String, and we ran it over a [String], so our result is [[String]]! But we don't care about preserving the information about which page each String came from, so we smush it down to a single list. Strictly speaking, we didn't need to print y in shellToGit, but while developing, it's a good idea to have some sort of logging, to get a sense of what the script is doing. And once you are printing at all, you can sort the list of repository URLs to download them in order by user.)
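The gitify (and getRepos) helpers are used above but never defined in the text. A plausible reconstruction of gitify, which is an assumption on my part rather than the author's code: GitHub's read-only clone URLs at the time were simply the site root plus the scraped "/user/repo" path plus ".git":

```haskell
-- Hypothetical reconstruction (not shown in the original text):
-- turn a scraped "/user/repo" path into a clonable URL.
gitify :: String -> String
gitify path = "https://github.com" ++ path ++ ".git"
```

So gitify "/jaspervdj/wol" gives "https://github.com/jaspervdj/wol.git"; presumably getRepos would be something like fmap repos . openURL.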
Unique repositories

There is one subtlety here worth noting that our script is running roughshod over. Each URL we download is unique, because usernames are unique on GitHub and each URL is formed from a "/username/reponame" pair. But each downloaded repository is not unique, because git will shuck off the username and create a directory named after just the repository: "/john/bar" and "/jack/bar" will clash, and if you download in that order, the bar repository will be John's repository and not Jack's. Git will error out the second time, but this error is ignored by the shelling code. The solution is to tell git to clone into a non-default but unique directory; for example, one could reuse the "/username/reponame" pair, and then one's target directory would be neatly populated by several hundred directories named after users, each containing a few repositories with non-unique names. With the per-user approach, our new version looks like this:

```haskell
shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, drop 19 u] ""
                  print y
```

Why the drop 19 u? Well, u is the fully-qualified URL, e.g. "https://github.com/sergeyastanin/simpleea". Obviously we don't want to execute git clone "https://github.com/sergeyastanin/simpleea" "https://github.com/sergeyastanin/simpleea" (even though that'd be perfectly valid), because it makes for ugly folders. But the prefix "https://github.com/" is exactly 19 characters, so drop 19 turns "https://github.com/sergeyastanin/simpleea" into "sergeyastanin/simpleea", giving us the right local directory name with no leading slash. Or you could just pass in the original "/username/reponame" and use drop 1 on that instead. (Either way, you need to do additional work; might as well just use drop 19.) One final note: many of the URLs end in ".git". If we dislike this, we can enhance the drop 19 with System.FilePath.dropExtension: dropExtension $ drop 19 u.
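Putting the two string manipulations together as one function (localDir is my own name for this sketch):

```haskell
import System.FilePath (dropExtension)

-- Derive a "user/repo" local directory name from a full clone URL:
-- drop the 19-character prefix "https://github.com/", then strip any
-- ".git" extension.
localDir :: String -> FilePath
localDir = dropExtension . drop 19
```

For example, localDir "https://github.com/sergeyastanin/simpleea.git" is "sergeyastanin/simpleea", exactly the target directory we want to hand to git clone.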