Many government agencies publish data about their work on a regular basis, often daily or weekly. Some conveniently post it in easy-to-use formats such as CSV files. Others, however, seem to disclose it begrudgingly, each week posting a stack of PDF files on a website and removing the previous set, with no archive anywhere. At least they’re publishing at all, right?

I’ve been working on a pattern for purifying this sea of data, so that even as agencies pour in more dirty PDFs, I automatically get pure, clean CSVs. I don’t have a full, generic solution that handles this problem concisely and elegantly, but I want to share the pattern I’ve developed so far.

Data published as a PDF is a hindrance we may never be rid of. While we all know PDFs are designed for presentation, not data analysis, that doesn’t stop agencies from handing them out in response to data requests. Tabula, an open source program for liberating data, has solved part of this problem since it was released in early 2013, but it can be time-consuming to process data contained in multiple PDFs.

The process is particularly annoying for data sets published on a regular basis, like the weekly NYPD precinct-level crime complaint tables. Not only do you have to check daily for new data, but once it’s available, you then have to fire up Tabula to process it — for each of the 85 precincts. It’s duplicative grunt work.

My pattern solves this problem using tabula-extractor, the Ruby library (and command-line tool) that powers Tabula. The pattern is built to output data either to CSVs or to a MySQL database.
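To give a sense of what tabula-extractor does under the hood, here’s a minimal sketch of pulling one table out of one PDF. The file name, page number, and area coordinates are placeholders, and the method calls follow the gem’s documented Ruby API (it runs under JRuby) — double-check them against the version you install.

```ruby
require 'tabula' # the tabula-extractor gem; runs under JRuby

pdf_path = "weekly-report.pdf"          # placeholder input file
page     = 1                            # 1-indexed page holding the table
area     = [126.0, 40.0, 500.0, 560.0]  # top, left, bottom, right, in PDF points

# Extract the page, clip it to the table's bounding box, and print each row.
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_path, [page])
extractor.extract.each do |pdf_page|
  pdf_page.get_area(area).get_table.rows.each do |row|
    puts row.map(&:text).join(",")
  end
end
extractor.close!
```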

I haven’t quite figured out how to fully abstract the solution, and each new data set still requires some bespoke work. Nevertheless, the pattern is a step toward figuring out what different instances of the problem share (that is, what can eventually be generalized into a library) and boiling the differences down into that library’s inputs. I’ve open-sourced three instances of the pattern (that is, three scrapers) for the following data sets:

– Sierra Leone’s Ebola situation reports: GitHub

– The NYPD’s CompStat criminal complaints database weekly reports: GitHub

– The NYPD’s monthly reports of moving summonses: GitHub

Take a look. Each project contains a few executable scripts in the bin/ folder for parsing files from the web, from disk, or from Amazon S3. All of them use a common parsing script in the lib/ folder. That parser is where the magic, such as it is, happens. Based on configuration options, it processes the PDF, extracts the relevant data (using a page number and table dimensions, if they’re common to all the PDFs), sends that data to a CSV file or a MySQL database, and saves the PDF itself to Amazon S3 or to disk.
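As an illustration of the shape of that lib/ parser, here’s a stripped-down sketch under assumed names (none of these identifiers come from the repos): a small config hash supplies the page number, the table’s bounding box, and an output path, and each extracted row is appended to a CSV. The MySQL and S3 branches are left out.

```ruby
require 'tabula'  # tabula-extractor gem (JRuby)
require 'csv'

# Hypothetical per-data-set configuration: the pieces that differ between
# scrapers, boiled down to a few inputs.
CONFIG = {
  page: 1,                              # page holding the table in every report
  area: [126.0, 40.0, 500.0, 560.0],    # top, left, bottom, right, in PDF points
  csv_path: "complaints.csv"            # where extracted rows accumulate
}

# Parse one PDF and append its table rows to the CSV.
def parse_pdf(pdf_path, config = CONFIG)
  CSV.open(config[:csv_path], "ab") do |csv|
    extractor = Tabula::Extraction::ObjectExtractor.new(pdf_path, [config[:page]])
    extractor.extract.each do |pdf_page|
      pdf_page.get_area(config[:area]).get_table.rows.each do |row|
        csv << row.map(&:text)
      end
    end
    extractor.close!
  end
end

# A bin/ script would fetch PDFs from the web, disk, or S3 and hand each one here.
Dir.glob("pdfs/*.pdf").each { |path| parse_pdf(path) }
```

In the real scrapers, the storage step branches on configuration, writing to MySQL or pushing the original PDF to S3 instead of (or in addition to) the CSV.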

Precisely how to process each PDF and where to store the data are the parts of the pattern I’m still working out. If these scrapers are useful to you, I’d love to hear your thoughts.