Time to get crawling!

Hey there! In the last post, I wrote about things you should know before you write a web crawler so you can be a responsible crawler and have good performance.

If you haven’t read Part 1, you can check it out here.

This article assumes you have a working knowledge of a few higher-level Elixir/OTP concepts like GenServers and Task Supervisors. If you don’t, you may still be able to follow what’s going on, but your mileage may vary.

Now I’m going to take you through actually building one. Remember the parts of a basic crawler are:

A queue to which you load URLs to be crawled.

A worker to crawl and parse the URLs

A storage solution to keep the state of your crawl

A storage solution to keep the results of your crawl

As for what we are going to crawl: how about links to GitHub repos found on the readme pages of some initial set of interesting repositories, to see what other GitHub repositories they use? This series is based on an early version of the crawler my co-founder and I ultimately built to get the alpha for Whize launched, so I think this makes good sense. For more context, we tested the early version of our crawler by scraping all the links from any repo that was an “Awesome List”. These are lists people have been putting together of things they think are, well, awesome. The lists reside on GitHub and contain repositories, articles and sometimes other types of content that their authors think are worth seeing. Ultimately that approach didn’t give us enough information, so we built a more robust crawler and crawled the entirety of public GitHub, but it does make for a nice example for this blog post!

Writing the initial queue

I’m going to take a top-down approach here, so we’ll start with writing the queue, then the worker. In more robust architectures for crawling there may be a tracker or worker manager responsible for coordinating the work between the queue and the workers. Along the way, we’ll cover the parts listed above and refer back to concepts I discussed in the first post. So the first thing we want to do is create a queue that our workers can refer to. Elixir by itself doesn’t have the concept of a queue, but the Erlang standard library has a handy primitive, :queue.

A basic GenServer queue
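Here is a minimal sketch of what such a queue might look like. It is not the exact code from the repo: the seeding of the queue through start_link and the size/0 helper are my own assumptions, added so the example can be run on its own.

```elixir
defmodule CrawlerQueue do
  # Minimal sketch: a GenServer whose state is an Erlang :queue of URLs.
  use GenServer

  # Starts the queue process, seeded with an initial list of URLs.
  def start_link(initial_urls \\ []) do
    GenServer.start_link(__MODULE__, initial_urls, name: __MODULE__)
  end

  # Returns how many URLs are waiting (hypothetical helper for inspection).
  def size, do: GenServer.call(__MODULE__, :size)

  @impl true
  def init(initial_urls) do
    # :queue.from_list/1 builds a FIFO queue from a plain list.
    {:ok, :queue.from_list(initial_urls)}
  end

  @impl true
  def handle_call(:size, _from, queue) do
    {:reply, :queue.len(queue), queue}
  end
end
```

Registering the process under the module name means workers can reach the queue without passing a PID around.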

Okay, so let’s break down the above. The first thing you’ll notice is that my queue is using the GenServer behaviour. This is because our queue will be managing state shared between a bunch of different processes running concurrently. If you are familiar with the Agent module you may be wondering why I chose to use a GenServer. With the Agent module there are a few messages you cannot write custom handling for, and you can’t implement handlers for your own custom messages either. GenServer can respond to both. We’ll use a custom message to schedule things like dumping the state of our crawler to a flat file, so you can later load the state back in if you need to pause your crawl or in the likely event of a crash. Let’s see what that looks like.

Writing to disk
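A rough sketch of the periodic write might look like this. The post’s version converts the state to JSON; to keep this example free of dependencies it writes one URL per line instead. The file name and the use of Process.send_after/3 (a thin wrapper over Erlang’s timer) are assumptions.

```elixir
defmodule CrawlerQueue do
  use GenServer

  # Hypothetical file name and interval (milliseconds).
  @state_file "/tmp/crawler_queue_state.txt"
  @write_interval 5_000

  def start_link(initial_urls \\ []) do
    GenServer.start_link(__MODULE__, initial_urls, name: __MODULE__)
  end

  @impl true
  def init(initial_urls) do
    # Kick off the first periodic write.
    schedule_write_to_disk(@write_interval)
    {:ok, :queue.from_list(initial_urls)}
  end

  @impl true
  def handle_info(:schedule_write_to_disk, queue) do
    # Swapped in for JSON encoding: one URL per line, plain text.
    contents = queue |> :queue.to_list() |> Enum.join("\n")
    File.write!(@state_file, contents)
    # Re-arm the timer so the next write happens one interval from now.
    schedule_write_to_disk(@write_interval)
    {:noreply, queue}
  end

  # Sends :schedule_write_to_disk to this process after `interval` ms.
  defp schedule_write_to_disk(interval) do
    Process.send_after(self(), :schedule_write_to_disk, interval)
  end
end
```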

You’ll notice that I do three things here. First, I add a new message handler: when the GenServer receives the message “:schedule_write_to_disk”, it converts the current state to JSON and writes it to a file in “/tmp”. Second, I add a private function that uses Erlang to schedule sending the “:schedule_write_to_disk” message after the interval passed into the function. “self()” refers to the PID (process ID) that will receive the message, which in this case is our server. Third, I call the schedule function in “init()”. This kicks off writing to disk, and the function is called again at the end of the handler, which writes again five seconds later, and so on. This almost completely handles the case where your crawler crashes that I wrote about in the first post. Now all we need to do is add a function to take the file and load the queue from it. Let’s do that now.

Load from disk
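Loading could be sketched along these lines. This assumes the state file holds plain text with one URL per line (the post itself uses JSON, which would need a JSON library to decode); the module and function names are hypothetical.

```elixir
defmodule CrawlerQueue.Loader do
  # Hypothetical default location of the saved queue state.
  @state_file "/tmp/crawler_queue_state.txt"

  # Rebuilds a :queue from the saved file, or returns an empty queue
  # when there is nothing to resume from.
  def load_from_disk(path \\ @state_file) do
    case File.read(path) do
      {:ok, contents} ->
        contents
        # trim: true drops the empty strings left by trailing newlines.
        |> String.split("\n", trim: true)
        |> :queue.from_list()

      {:error, _reason} ->
        # No readable state file: start a fresh crawl.
        :queue.new()
    end
  end
end
```

Calling this from init/1 instead of building the queue from a fixed list would let the crawler pick up where it left off after a restart.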

Finally, you’ll notice we have no way of adding something to or removing something from the queue. We’ll need to add those abilities so our workers can properly pull from and add to the queue.

Add and remove items
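A minimal sketch of adding and removing items, assuming hypothetical push/1 and pop/0 functions as the public API. Pushing is a cast since the caller doesn’t need an answer; popping is a call because the caller needs the URL back.

```elixir
defmodule CrawlerQueue do
  use GenServer

  def start_link(initial_urls \\ []) do
    GenServer.start_link(__MODULE__, initial_urls, name: __MODULE__)
  end

  # Adds a URL to the back of the queue (fire and forget).
  def push(url), do: GenServer.cast(__MODULE__, {:push, url})

  # Removes the URL at the front; returns {:ok, url} or :empty.
  def pop, do: GenServer.call(__MODULE__, :pop)

  @impl true
  def init(initial_urls), do: {:ok, :queue.from_list(initial_urls)}

  @impl true
  def handle_cast({:push, url}, queue) do
    # :queue.in/2 appends to the rear of the queue.
    {:noreply, :queue.in(url, queue)}
  end

  @impl true
  def handle_call(:pop, _from, queue) do
    # :queue.out/1 takes from the front and reports an empty queue explicitly.
    case :queue.out(queue) do
      {{:value, url}, rest} -> {:reply, {:ok, url}, rest}
      {:empty, empty} -> {:reply, :empty, empty}
    end
  end
end
```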

Writing The Initial Worker

Okay, so that takes care of that. Now from here let’s switch it up a bit and take a look at what our workers are going to look like. First, let’s think about what our workers need to do.

Pull a URL from the queue

Process that URL and extract the HTML

Pull whatever information we need from those pages

Limit the depth of the crawl

Pull links from those URLs

Add the new URLs back into the queue

Let’s write out a rough version. I’ll go point by point with the code so we can see how each step works and I’ll talk through any interesting decisions I make during each step.

Fetch the page, parse it, get more URLs.
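As a rough sketch, the core of the worker could look like the following. Two things here are my assumptions rather than the post’s code: queue entries are {url, depth} tuples, and the fetch and extract steps are passed in as functions (standing in for request_page/1 and get_children_urls/2) so the example runs without network access.

```elixir
defmodule CrawlerWorker do
  require Logger

  # Crawls one queue entry and returns the child entries to enqueue.
  # `fetch` and `extract` are stand-ins for request_page/1 and
  # get_children_urls/2.
  def crawl({url, depth}, depth_limit, fetch, extract) do
    if depth >= depth_limit do
      # Past the depth limit: discard the URL without requesting it.
      []
    else
      case fetch.(url) do
        {:ok, body} ->
          # Children sit one level deeper than the page they were found on.
          Enum.map(extract.(url, body), fn child -> {child, depth + 1} end)

        {:error, reason} ->
          # Lightweight handling: log the failure and drop the URL.
          Logger.error("Dropping #{url}: #{inspect(reason)}")
          []
      end
    end
  end
end
```

Returning the new entries instead of pushing them from inside the worker keeps the function easy to test; whoever calls it decides how they get back into the queue.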

Okay, so the first thing you’ll notice about the above worker is that I pass in the depth limit as an argument. There are many ways this could be done; for instance, you could set an environment variable. However, I personally like any worker config to be passed down from whatever is responsible for managing the workers. In this simplified crawler architecture that is going to be the queue. In a more robust setup you may have some kind of tracker that is responsible for assigning URLs to workers as well as tracking whether or not something has been crawled, restarting failed workers, etc.

The second thing you’ll see is that the error handling around the request_page function is somewhat lightweight. I did this intentionally because error handling in crawling can be done in many different ways. That said, if you request a page and receive some type of HTTP error, it’s more than likely that the page cannot be crawled for a reason you won’t be able to fix, so it may be better to simply drop those URLs from the queue as we do here. I usually at least log the errors and the status code received to make sure it isn’t something that can be fixed later on.

Finally, you’ll see that I call the get_children_urls function with the URL and body of the retrieved page. In this example crawler, all I’m going to do is extract more URLs from the page’s body, skipping the current URL. Obviously this wouldn’t be the case in a real project, where you would likely have more specific code to extract information besides URLs from the page. However, showing you what I do in these two functions should get you to a place where you could branch out for your own implementation needs. Let’s look at both request_page and get_children_urls to see how to retrieve a page and extract information from its HTML.

Very simple page request
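A minimal sketch of the request step, assuming HTTPoison is listed as a dependency in mix.exs. The split into a separate handle_response/1 and the shape of the error tuples are my own choices. Note that this sketch treats every non-200 status as an error, which is blunter than checking for specific classes of error responses.

```elixir
defmodule CrawlerWorker.Request do
  # Fetches a page and returns {:ok, body} or {:error, reason}.
  def request_page(url) do
    url |> HTTPoison.get() |> handle_response()
  end

  # A 200 response: hand back the page body.
  def handle_response({:ok, %{status_code: 200, body: body}}), do: {:ok, body}

  # Any other status, including redirects (not followed here):
  # bubble the code up as a soft error.
  def handle_response({:ok, %{status_code: code}}), do: {:error, {:http_status, code}}

  # Transport-level failures such as timeouts or DNS errors.
  def handle_response({:error, %{reason: reason}}), do: {:error, reason}
end
```

Matching on plain map patterns also matches the HTTPoison.Response and HTTPoison.Error structs, which keeps handle_response/1 testable without making a real request.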

request_page/1 is a simple function that takes a URL as an argument and, using the HTTPoison library, makes an HTTP request to grab the contents of the page at that location. If the request returns an error we handle it and, following the soft error handling pattern in Elixir, bubble it up. If the request goes through we check the response code, also bubbling up an error if it is in a class of known error responses we can’t do anything about. You’ll notice that we are not handling redirects in this case; I’ll leave that to you as an exercise. If the response is a 200 we return the body of the page itself. Now let’s see how we go about extracting actual information from the body of the page.

Using Floki to parse through html
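Here is a sketch of the extraction step, assuming Floki is listed as a dependency in mix.exs. The filter rules in keep?/2 (absolute github.com links only, no /sponsors paths, not the current page) are my guess at a reasonable starting point rather than the exact rules from the repo.

```elixir
defmodule CrawlerWorker.Links do
  # Returns the links on the page that are worth crawling next.
  def get_children_urls(url, body) do
    {:ok, document} = Floki.parse_document(body)

    document
    # Collect the href attribute of every anchor tag.
    |> Floki.attribute("a", "href")
    |> Enum.filter(&keep?(&1, url))
    |> Enum.uniq()
  end

  # Keeps absolute links on github.com that are neither the current
  # page nor a sponsor link. Relative links are dropped in this sketch.
  def keep?(href, current_url) do
    uri = URI.parse(href)

    uri.host == "github.com" and
      href != current_url and
      not String.starts_with?(uri.path || "", "/sponsors")
  end
end
```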

Mostly simple, right? Floki is another excellent Elixir library, this one for parsing through nodes in an HTML document. We use it here to grab any URL in the body of the readme page on GitHub and pass it through a filter step where we get rid of anything that isn’t on GitHub’s domain, plus some common things like their sponsor link which won’t lead us to any new or interesting repositories. Floki is also nice in that the entire library lends itself to Elixir’s pipeline operator and makes for some really elegant code. Obviously the filtering being done here is pretty naive and there are more sophisticated ways of detecting domains you don’t need, but in practice this worked well enough for what my co-founder and I were doing.

Okay, so those are the last pieces of the worker. The full worker should look like this now.

Putting the worker all together

Okay cool so let’s do a quick recap here of what we have so far.

We have a queue that URLs can be added to and removed from.

We have a worker that can take a URL and extract further URLs to be crawled.

What do we still need to do?

The queue needs to be able to launch workers to process URLs.

The queue needs to be able to manage workers in case of a crash

The worker needs to be able to store its results (in this case, the page contents)

Connecting Everything Together

Ok, let’s get the queue starting and managing workers. This is going to use something called a task supervisor, which is a process responsible for managing the lifecycle of a spawned task, including what to do if that task crashes. A task supervisor must be added to the supervision tree in the application.ex file in your mix project.
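For reference, adding a task supervisor to the supervision tree might look something like this. The module names CrawlerExample.Application, CrawlerExample.TaskSupervisor and CrawlerExample.Supervisor are assumptions; use whatever your mix project generated.

```elixir
defmodule CrawlerExample.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Supervisor that will own the worker tasks (hypothetical name).
      {Task.Supervisor, name: CrawlerExample.TaskSupervisor}
      # The queue GenServer would be listed here as a further child.
    ]

    Supervisor.start_link(children,
      strategy: :one_for_one,
      name: CrawlerExample.Supervisor
    )
  end
end
```

Giving the task supervisor a name lets the queue refer to it without holding on to its PID.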

Calling the worker from the queue

Okay, so first off you would add the above code into the CrawlerQueue module. I add another handle_cast function that takes a :schedule_work message, an interval in milliseconds for how often to check the queue for new URLs, and a depth_limit so our workers know how deep they should go into their crawls before discarding a URL. You’ll also notice the task supervisor I mentioned earlier and a function called async_nolink. async_nolink runs the task asynchronously but does not link its success or failure to the calling process, so an unrecoverable crash in the worker won’t also shut down the parent process, which in this case is our queue. If we wanted to see the result of an unlinked task we could add two handle_info functions to our queue which match on the success and failure messages that a task under the supervisor sends back to our CrawlerQueue process. I won’t go over that here, but with that pattern you could do things like build out retry logic for a failed worker or pass back response times from the worker to alter throttling rates.
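To make the pattern concrete, here is a sketch of a queue that launches unlinked tasks and listens for their results. Several things are assumptions: the supervisor name, the {url, depth} entries, the :crawl option that injects the worker function, and re-arming the check with :timer.apply_after/4. The two handle_info clauses are the success and failure handlers described above.

```elixir
defmodule CrawlerQueue do
  use GenServer

  # Hypothetical name; must match the Task.Supervisor in the supervision tree.
  @task_supervisor CrawlerExample.TaskSupervisor

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # `crawl` stands in for the worker function (hypothetical option).
    {:ok, %{queue: :queue.from_list(Keyword.get(opts, :urls, [])), crawl: Keyword.fetch!(opts, :crawl)}}
  end

  @impl true
  def handle_cast({:schedule_work, interval, depth_limit}, state) do
    crawl = state.crawl

    state =
      case :queue.out(state.queue) do
        {{:value, entry}, rest} ->
          # Unlinked task: a crash in the worker will not take the queue down.
          Task.Supervisor.async_nolink(@task_supervisor, fn -> crawl.(entry, depth_limit) end)
          %{state | queue: rest}

        {:empty, _queue} ->
          state
      end

    # Check for work again after `interval` milliseconds.
    message = {:schedule_work, interval, depth_limit}
    :timer.apply_after(interval, GenServer, :cast, [self(), message])
    {:noreply, state}
  end

  # Success: the task replies to its caller with {ref, result}.
  @impl true
  def handle_info({ref, new_entries}, state) when is_reference(ref) do
    # Stop monitoring and flush the :DOWN message of the finished task.
    Process.demonitor(ref, [:flush])
    {:noreply, %{state | queue: Enum.reduce(new_entries, state.queue, &:queue.in/2)}}
  end

  # Failure: the monitor delivers :DOWN with the crash reason.
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    # Retry logic could go here; this sketch simply drops the URL.
    {:noreply, state}
  end
end
```

Sending GenServer.cast(CrawlerQueue, {:schedule_work, 5_000, 3}) once would start the loop, with the interval and depth limit here being arbitrary example values.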

Finally, we need to kick off the work from the init function in our queue.

In the init function in CrawlerQueue we add one line to kick off the first check for work:

Call the work from the queue’s init function

And with that, we have one last thing to do before this simple crawler is ready to go. We need to store our page results somewhere. I’m not going to go over the different storage mechanisms in this post or even use a particularly robust solution, but for something like this you might think about using MongoDB or S3 to dump the raw data for later processing. Here I’m just going to write out the pages to /tmp as raw .html files. This might even be fine in normal cases if you know you are not going to be crawling a particularly large set of sites. Outside of simple demonstrations like this, though, you should probably opt for a more robust solution.

In the CrawlerWorker, when we have successfully opened a page, we are going to take a simple hash of the content and write out a .html file containing the contents to a folder in /tmp, using the hash as the file name.

Write results to disk
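A small sketch of that storage step. The post only says “a simple hash”, so SHA-256 is my assumption, as are the folder name and the store_page function. In a mix project you may need to add :crypto to extra_applications.

```elixir
defmodule CrawlerWorker.Storage do
  # Hypothetical output folder under /tmp.
  @output_dir "/tmp/crawler_pages"

  # Writes the page body to <dir>/<hash>.html and returns the path.
  def store_page(body, dir \\ @output_dir) do
    File.mkdir_p!(dir)

    # Hex-encoded SHA-256 of the content gives a stable file name,
    # so identical pages overwrite each other instead of piling up.
    name = :crypto.hash(:sha256, body) |> Base.encode16(case: :lower)

    path = Path.join(dir, name <> ".html")
    File.write!(path, body)
    path
  end
end
```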

And that takes care of everything you need to get a stupid simple crawler working in Elixir.

Upsides, Downsides, Tradeoffs, and Improvements

This entire crawler has a very simple architecture, and I wouldn’t use it to do massive multi-site crawling or anything of that nature, but for a small crawling project like one large website this may be all you need. Some downsides are that the workers are managed in the same place as the queue and that we only check for work every 5 seconds. This means crawling is going to move much more slowly, and you don’t have a lot of control over tuning the request rate as the crawl is going. Keeping URLs in an in-memory queue like this also poses a memory issue: as your queue grows so does the memory footprint, which in a really large crawl can cause problems if you are not consuming the URLs fast enough.

In a more mature architecture, you might do something like make the workers into long-lived processes like GenServers themselves and then have a worker pool managed by a master process that tracks changes in the queue and doles out work to the workers in a round-robin fashion where the workers have separate work queues of their own that they manage. This would let you continuously check for work and scale worker processes dynamically instead of the static number used here. You could also add something like a dead letter queue for URLs to be retried later at lower priority and custom retry logic as I mentioned earlier.

In this crawler, we also don’t check whether we have crawled a particular URL before, and though the depth limit would eventually exit a cycle it still isn’t the neatest way to handle that particular case. In small to medium crawls you can get away with keeping a set and checking set inclusion. In larger crawls you might have to use something like bloom filters, which are a probabilistic data structure, to reduce the memory footprint at the risk of getting false positives. Checking for URLs we’ve crawled before is good practice so we don’t waste time in our crawl, but skipping it can also become a bigger issue: if you hit a cycle early on in the crawl you are likely to repopulate your queue with a ton of already crawled URLs, which completely undermines the efficacy of the crawler.
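For the small to medium case, the set-inclusion check could be sketched with a MapSet. The module and function names are hypothetical; the idea is to filter new URLs against the set before they go back into the queue.

```elixir
defmodule CrawlerQueue.Seen do
  # Returns {urls_not_seen_before, updated_seen_set}.
  def reject_seen(urls, seen) do
    fresh =
      urls
      # Drop duplicates within this batch first.
      |> Enum.uniq()
      # Then drop anything we have already queued or crawled.
      |> Enum.reject(&MapSet.member?(seen, &1))

    {fresh, MapSet.union(seen, MapSet.new(fresh))}
  end
end
```

The seen set would live in the queue’s state next to the queue itself, so only the fresh URLs are ever enqueued.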

Another downside is that we are writing both our state and our file outputs to flat files on the host we are crawling from. This is fine for our small goals here, but if you were crawling a massive set of websites with a crawler architecture distributed across many different boxes, you would need something like MongoDB, S3 or some other KV store, maybe even a graph database depending on the scale of your crawl. That way you could store your crawl data more robustly and quickly, and working with it later would be a lot easier.

The good news is there are a lot of neat libraries in Elixir that would make writing a lot of the things I’m suggesting here very simple. For instance, in the actual version of the crawler we wrote for our alpha we used Rihanna, which is a Postgres-backed job queue, to manage the state of our workers and handle crashes and retries. There is also Honeydew, which is another excellent job queue library in Elixir. Honeydew’s architecture as a job queue and job runner is even set up in such a way that it is pretty close to the more mature crawler architecture I wrote about above and will get you most of the way there.

There are plenty of great libraries that wrap interfaces for MongoDB, Postgres and S3, and Elixir even has clients for cutting-edge graph databases like Dgraph. It also helps that message passing between processes in Elixir is a naturally nice pattern to work with for crawling tasks. Setting up robust error handling, or passing messages between workers when, say, you need to broadcast a change of crawling rate, comes easily, which makes writing high-throughput and robust crawlers in Elixir a breeze.

That said, for any of the downsides of our quickly put together crawler there are solutions within reach that make going from this to a more robust solution a matter of a few hours, maybe less. In fact, we are not too far off from the solution we wrote about a month ago to crawl all of the public-facing GitHub repositories in about a day for Whize.

Thank you for reading this post about how to implement a quick and dirty web crawler in Elixir. Stay tuned for the next and final post in the series, where I will compare and contrast different crawling architectures and storage tools you can use to build the best crawler you can.

You can find the repo for the full code including project setup here:

https://github.com/iantbutler01/CrawlerExample

A quick plug for Whize

Whize is a privacy-conscious search engine focused on content discovery. My co-founder and I finished the alpha about a month ago and put it out to the public about two weeks ago. Since we were showing it to other developers first we crawled what was all of public Github at the time. It acts as a proof of concept for our idea and showcases our prioritization of results that we believe are both good quality and have a measure of novelty.

You can check it out at https://alpha.whize.co

With the traffic we saw on the site when we opened it up we’ve gone ahead and already architected our significantly more powerful and faster beta crawler and are working hard on building it out. We are looking to have a much broader set of results with our open beta around mid April to early May.

If you are interested in following along and receiving updates on our progress check out http://landing.whize.co.