Step 1: Getting Pachyderm up and running

First things first, we’ll need a working Pachyderm setup. I used a single-node setup for this tutorial, which is the easiest option. If you’d prefer to jump straight to a clustered setup for higher scalability, follow Pachyderm’s guide to deploying a cluster on AWS. Everything in this tutorial should work the same; just substitute `localhost:650` with the hostname of one of your EC2 machines.

Step 2: Storing the URLs

Next, we’ll store a set of URLs in Pachyderm so we know what to scrape. Pachyderm exposes a RESTful interface to its distributed file system, pfs. We’ll store the URLs we want to scrape in there:
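As a sketch, adding a URL means creating an empty file named after it. The endpoint paths below are an assumption reconstructed from the URLs used later in this tutorial; adjust them for your deployment:

```
# Create empty files in pfs named after the URLs we want to scrape.
# (Paths are a sketch based on the URLs used later in this tutorial.)
curl -XPOST localhost:650/file/urls/news.ycombinator.com
curl -XPOST localhost:650/file/urls/en.wikipedia.org  # any site you like
```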

Notice that we’re creating empty files here. Our scraper only looks at the names of the files to figure out what to scrape.

Step 3: Create the pipeline

Next, we’ll install a pipeline that scrapes these URLs. In Pachyderm, pipelines are computations that run every time new data is committed to the file system. We define them using Pachfiles, which have a simple text-based format. Below is the scraper Pachfile; let’s walk through how it works.
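Here is a reconstruction of the Pachfile, laid out so the `image`, `input`, and `run` declarations land on the line numbers discussed below. The keyword syntax is inferred from the walkthrough, so treat it as a sketch rather than the exact original:

```
# scraper.pachfile
# Scrapes every URL listed under urls in pfs.

# Docker image to run in: stock ubuntu plus wget.
image pachyderm/scraper

# Expose the pfs url list inside the container at /in/urls.
input urls

# List the URLs and pipe them to wget; `; true` masks wget's exit code.
run ls /in/urls | wget --recursive --level 1 --page-requisites --convert-links --timestamping --directory-prefix /out --input-file - ; true
```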

Line 5 declares that we’ll be running everything inside the Docker image pachyderm/scraper. It’s just a stock Ubuntu image with wget installed, since wget doesn’t come with Ubuntu by default.

Line 8 makes the URL list we stored in pfs in Step 2 available to the pipeline during execution. The data will be mapped into the container under the path /in/urls.

Line 11 is the interesting part, where we actually implement the scraper. Run allows us to parallelize commands across the cluster; I’m running this locally, so there’s only one machine. Our command lists the URLs and pipes them to wget. We also pass several flags to wget:

- `--recursive --level 1`: scrape the pages this page links to, but only go one layer deep. Increasing the level will greatly increase the amount of data scraped.

- `--page-requisites`: tweaks the behavior of recursive scraping so that we don’t wind up with partially renderable pages.

- `--convert-links`: modifies the scraped data to make it easy to serve.

- `--timestamping`: a nifty wget feature that optimizes repeatedly scraping the same URLs. We’ll look at it in depth later on.

- `--directory-prefix /out --input-file -`: write output to /out and take the URLs to scrape from stdin. /out is where Pachyderm will look for output once the command exits.

Lastly, we append `; true` to the end of the command. This is a little hack: wget returns a nonzero exit code if any of the links on a page fail to scrape, which would otherwise cause the pipeline to fail.

Step 4: Run the pipeline

Time to actually run the pipeline. Here’s how:
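A sketch of the commands; the pipeline and commit endpoints are assumptions based on the paths used elsewhere in this tutorial:

```
# Install the Pachfile as a pipeline named "scraper"
# (endpoint is an assumption; adjust for your deployment):
curl -XPOST localhost:650/pipeline/scraper -T scraper.pachfile

# Create a named commit, which kicks off the pipeline:
curl -XPOST "localhost:650/commit?commit=commit1"
```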

As soon as you create a commit, Pachyderm will start running the installed pipeline. Notice that we named the commit `commit1`; if you don’t specify a name for the commit, Pachyderm will generate a UUID as the name.

Step 5: Inspect the results

As soon as you create that commit, Pachyderm will start scraping the URLs you specified, and the results become available live via HTTP. For example, if you point your browser at `localhost:650/pipeline/scraper/file/news.ycombinator.com/`, you’ll see the most recently scraped version of HN. You can also see a specific snapshot of the site by appending `?commit=commit1` to any request.
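The same requests work from the command line; a quick sketch using the URL shown above:

```
# Fetch the latest scrape of HN:
curl localhost:650/pipeline/scraper/file/news.ycombinator.com/

# Pin the request to the snapshot created by commit1:
curl "localhost:650/pipeline/scraper/file/news.ycombinator.com/?commit=commit1"
```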

What’s cool about the system we’ve created?

That wasn’t much code, but we’ve already got a pretty sophisticated scraping system with the following properties:

We can easily add URLs to and remove them from the scrape list through a RESTful interface. As the amount of data to be scraped increases, we can scale horizontally without writing any new code. Pachfiles are guaranteed to run the same in a cluster as they do locally, so when you outgrow a local setup you can quickly migrate the same code to a cluster.

Our scrapes have the same commit-based properties as pfs. Each commit results in a snapshot of the internet being scraped and stored. Pachyderm can store this data very space efficiently thanks to its commit-based file system.

Lastly, our system is well set up to have other functionality built on top of it.

What can we build on this?

So far we’ve built a simple system that will take a snapshot of the internet whenever we tell it to. Now we’re going to automate this process using cron to complete our Wayback clone. Add the following to your crontab to create a new commit every hour, on the hour:
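A sketch of the setup. The commit endpoint is the assumption from Step 4; note that `%` must be escaped as `\%` inside a crontab:

```shell
# Commit names are the creation time, e.g. 2015-06-18-00
# (the last field is the hour):
commit="$(date +%Y-%m-%d-%H)"
echo "$commit"

# The crontab entry, all on one line (endpoint assumed from Step 4;
# % is escaped as \% because cron treats bare % specially):
# 0 * * * * curl -XPOST "localhost:650/commit?commit=$(date +\%Y-\%m-\%d-\%H)"
```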

The commits will have names corresponding to the time they were created, such as 2015-06-18-00 (the last number is the hour). You can access a webpage as it appeared at a given time by appending ?commit=2015-06-18-00 to any Pachyderm request. This is where the `--timestamping` flag from Step 3 comes in. With this flag, wget matches each file’s modtime to the modtime returned by the server, which allows it to avoid downloading the same data twice. You’ll find that even with a large set of URLs, scrapes show up pretty quickly after commits, because many of the files are unchanged and thus get skipped.

Once our cronjob is set up, we’ve basically got a fully working Wayback Machine. But we can actually build even more on top of that because Pachyderm’s pipeline model makes it really easy to chain computations together. For example, we might want to index each new batch of scrapes as they come in. We can accomplish that by adding a line to our Pachfile like so:
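A sketch of what that added line might look like; the invocation is an assumption, since indexer.py doesn’t exist yet, and I’m assuming a later run step can see the scraper’s output:

```
# Index each new batch of scrapes.
# (Hypothetical: indexer.py is left as an exercise for the reader.)
run python indexer.py /out
```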

This will run indexer.py in a container with the output of the scraper visible to it. As with the scraped data, the indexed data can be served directly out of Pachyderm and will have the same commit-based properties as everything else in this tutorial. I haven’t actually built indexer.py; the tutorial has to end somewhere, after all. But you can build it yourself! And if you do, we’d love to hear about it.