We have a data management dilemma, and we hope that you — data-smart people of the world — can help us out. We need a versioning and change tracking system for around 50 million XML files, and no existing solutions seem to fit.

About The Project

The Caselaw Access Project (CAP), previously known as Free The Law, is making all U.S. case law freely accessible online. For more information, see our project page and this New York Times article.

Our Tracking Task

Like most digitization projects, we generate many page images. The binary image files rarely change and are not difficult to track. However, in addition to images, we create rich XML files containing descriptive/structural metadata and OCR. As we uncover mistakes in the OCR, encounter metadata anomalies, and gather new data through CAP-facilitated research projects, we will need to update these files. Tracking those changes is going to be a bit more difficult.

The Files

We are scanning about 37,000 volumes. Each volume contains multiple pages (obviously) and multiple cases. Usually, a case takes up a few pages, but some cases are so small that several can fit on one page, so there's no direct parent/child relationship between them. Cases never span volumes.

If you're interested in checking out a case for yourself, you can grab a sample case with all the associated files here.

How we split these things up into files:

For each volume:

One METS XML file with all volume-level metadata (~1 MB avg)

For each page side:

One lossless jp2 (~2.5 MB avg)

One 1-bit tiff (~60 KB avg)

One ALTO v3 XML file (~75 KB avg)

For each case:

One METS XML file, which includes the text of each case body, and all case-level metadata (~75 KB avg)

The Scale

Roughly 37k volumes, so about 37,000 volume XML files

Roughly 40 million page sides, so that many jp2s, tiffs, and ALTO XML files

A bit fewer than 10 million cases, so that many case METS XML files
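To put those counts in perspective, here is a back-of-the-envelope sketch that multiplies the file counts above by the average sizes listed under "The Files." It assumes decimal units (1 MB = 1,000 KB) and treats "a bit fewer than 10 million" as exactly 10 million, so the totals are rough estimates only.

```python
# Rough scale estimate from the approximate figures quoted in this post.
# Assumption: decimal units (1 MB = 1,000 KB); every input is an average.
KB = 1_000
MB = 1_000 * KB
TB = 1_000_000 * MB

# (file count, average size in bytes) per XML file type
xml_types = {
    "volume METS": (37_000, 1 * MB),
    "page ALTO": (40_000_000, 75 * KB),
    "case METS": (10_000_000, 75 * KB),  # "a bit fewer than 10 million"
}

# (file count, average size in bytes) per image file type
image_types = {
    "jp2": (40_000_000, 2_500 * KB),
    "tiff": (40_000_000, 60 * KB),
}

xml_files = sum(count for count, _ in xml_types.values())
xml_bytes = sum(count * size for count, size in xml_types.values())
image_bytes = sum(count * size for count, size in image_types.values())

print(f"XML files:   {xml_files:,}")
print(f"XML total:   {xml_bytes / TB:.2f} TB")
print(f"Image total: {image_bytes / TB:.1f} TB")
```

Under these assumptions the XML comes to roughly 50 million files and about 3.8 TB, while the images total about 102 TB. The change-tracking problem is therefore driven by the number of XML files far more than by their combined size.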

Our key requirements:

Data Set Versioning

Ideally this could be done at the corpus or series level (described below). This would be useful to researchers working with larger sets of data.
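One possible way to name a version of a whole corpus or series, sketched here purely as an illustration, is to derive a single identifier from a manifest of per-file digests. The function name `manifest_digest` and the manifest layout are hypothetical; nothing like this exists in our repository today.

```python
import hashlib
from typing import Mapping


def manifest_digest(file_digests: Mapping[str, str]) -> str:
    """Return one SHA-256 identifier for a set of files.

    `file_digests` maps a relative path to that file's hex digest.
    Paths are sorted so the result does not depend on insertion order.
    """
    h = hashlib.sha256()
    for path in sorted(file_digests):
        h.update(path.encode("utf-8"))
        h.update(b"\0")  # separator between path and digest
        h.update(file_digests[path].encode("ascii"))
        h.update(b"\n")  # separator between manifest entries
    return h.hexdigest()


# Hypothetical example: two files in one series.
series_v1 = {"vol1/case_0001.xml": "aa" * 32, "vol1/case_0002.xml": "bb" * 32}
series_v2 = dict(series_v1, **{"vol1/case_0002.xml": "cc" * 32})

print(manifest_digest(series_v1) == manifest_digest(series_v2))  # False
```

Any change to any file yields a new identifier, so a researcher could cite the identifier to say exactly which version of a series they worked with.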

Sanitizable Change Tracking

As with most change-tracking systems, when we record a change we usually want to be able to ascertain the state of the data before the change, either by recording both the old and new versions or by recording the delta between them. For some change types, however, we need the ability to delete either the delta or the old data state. Ideally, we could do this without removing the entire change history for the file.
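To make the requirement concrete, here is a minimal sketch of one way a history could stay verifiable after a delta is deleted: each log entry chains to the previous one through a hash of its header, and the header stores only a digest of the delta, not the delta itself. The `ChangeLog` and `Entry` names are hypothetical, and this is an illustration of the idea, not a proposed design.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def header_hash(file_id: str, note: str, payload_hash: str, prev_hash: str) -> str:
    # The chain covers the digest of the delta, never the delta bytes.
    header = json.dumps([file_id, note, payload_hash, prev_hash]).encode("utf-8")
    return sha256_hex(header)


@dataclass
class Entry:
    file_id: str
    note: str
    payload_hash: str         # digest of the delta; kept after redaction
    prev_hash: str            # entry_hash of the previous entry ("" for the first)
    entry_hash: str
    payload: Optional[bytes]  # the delta itself; None once redacted


class ChangeLog:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def append(self, file_id: str, note: str, payload: bytes) -> Entry:
        prev = self.entries[-1].entry_hash if self.entries else ""
        digest = sha256_hex(payload)
        entry = Entry(file_id, note, digest, prev,
                      header_hash(file_id, note, digest, prev), payload)
        self.entries.append(entry)
        return entry

    def redact(self, index: int) -> None:
        # Drop the delta but keep the entry, so later entries still chain to it.
        self.entries[index].payload = None

    def verify(self) -> bool:
        prev = ""
        for e in self.entries:
            if e.prev_hash != prev:
                return False
            if e.entry_hash != header_hash(e.file_id, e.note, e.payload_hash, e.prev_hash):
                return False
            if e.payload is not None and sha256_hex(e.payload) != e.payload_hash:
                return False
            prev = e.entry_hash
        return True


log = ChangeLog()
log.append("case_0001.xml", "fix OCR typo", b"-teh\n+the\n")
log.append("case_0001.xml", "update metadata", b"-1901\n+1910\n")
log.redact(0)
print(log.verify())  # True: the chain is intact although the first delta is gone
```

One caveat with this approach: the digest of the redacted delta stays in the log, so if the deleted content is short or guessable, someone could confirm a guess by hashing it. Whether that is acceptable depends on why the content was removed.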

File Authentication

People should be able to check whether the version of a file they have is, or ever has been, in our repository.
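As a rough illustration of what such a check could look like, the sketch below hashes a local file and looks the digest up in a set of every digest the repository has ever published. The `known_digests` set and both function names are assumptions made for the example; how that set would be stored and served is exactly the open question.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def was_ever_in_repository(path: str, known_digests: set) -> bool:
    """True if this exact file content matches any current or past version."""
    return file_digest(path) in known_digests
```

Because only digests are consulted, a check like this could keep working for versions whose content has since been removed from the change history.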

Open Data Format

Even if the change/versioning data isn't natively stored in an easily human-readable format, it must at least be exportable into a useful open format. No strictly proprietary solutions.

Access Control

We have to be able to control access to this data.

Our Wish List

FOSS (Free and Open Source Software) Based Solution

Diffing — allow downstream databases to fetch deltas between their current version and the latest

Minimal system management overhead

Ability to efficiently distribute change history with the data, ideally in a human-readable format

XML-aware change tracking, so changes can be applied to XML elements with the same identifiers and content, in different files

Automatic detection of replacement images
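To illustrate what we mean by XML-aware change tracking, here is a small sketch that compares two versions of a document by element identifier instead of by line. It assumes elements carry an `ID` attribute, and the sample markup is a simplified stand-in, not real CAP data. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_by_id(xml_text: str, id_attr: str = "ID") -> dict:
    """Map each element's ID to its tag, attributes, and own text."""
    root = ET.fromstring(xml_text)
    index = {}
    for el in root.iter():
        key = el.get(id_attr)
        if key is not None:
            index[key] = (el.tag, dict(el.attrib), (el.text or "").strip())
    return index


def diff_by_id(old_xml: str, new_xml: str, id_attr: str = "ID") -> dict:
    """Report which identified elements were added, removed, or changed."""
    old = index_by_id(old_xml, id_attr)
    new = index_by_id(new_xml, id_attr)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old = '<Page ID="p1"><String ID="s1" CONTENT="teh"/><String ID="s2" CONTENT="court"/></Page>'
new = '<Page ID="p1"><String ID="s1" CONTENT="the"/><String ID="s3" CONTENT="held"/></Page>'
print(diff_by_id(old, new))
# {'added': ['s3'], 'removed': ['s2'], 'changed': ['s1']}
```

This sketch looks only at each element's own attributes and text, so it ignores reordering and does not attribute a child's change to its parent. A real tool would need a policy for both, and for elements that have no identifier.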

What we've considered, and the disadvantages of each

Git

Dataset is much too large to store in a single repository

Non-plain-text change history

Redacting a single file requires rewriting large portions of the tree

MediaWiki

Not geared to handle XML data

Would require storing in a different format/syncing

Non-plain-text change history

Provides sanitizable change tracking but no versioning of larger data sets

BitKeeper

Non-plain-text change history

Seems to not allow easy sanitization of change history

Dat

P2P architecture doesn't give us enough access control for the first phase of the project

Something we write ourselves

Reinvents the wheel, at least in part

Probably not as efficient as more mature tools

Should the data be restructured?

Currently, the repository is fairly flat with each volume in its own directory, but no other hierarchy.

Files could be partitioned by "series." A series is a numbered sequence of volumes from a particular court, such as the Massachusetts Reporter of Decisions. There are 635 series in total, and they vary widely in size: the largest so far contains approximately 1,000 volumes, 750,000 pages, and 215,000 cases, the smallest contains only one volume, and the average contains 71 volumes.

Many data consumers will want only case files, and not per-page or per-volume files. It may make sense to store case XML files and non-case-XML files in separate repositories.
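As a sketch of what such a restructuring might look like, the helper below routes each file to one of two repositories and nests it under its series and volume. The repository names, the series slug, and the zero-padded volume directory are all hypothetical choices made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical repository names: one for case XML, one for everything else.
CASE_REPO = "cases"
OTHER_REPO = "volumes-and-pages"


def storage_path(series: str, volume: int, filename: str, is_case_xml: bool) -> PurePosixPath:
    """Build a repo/series/volume/file path for a single file."""
    repo = CASE_REPO if is_case_xml else OTHER_REPO
    return PurePosixPath(repo) / series / f"{volume:04d}" / filename


print(storage_path("mass", 12, "case_0001.xml", True))
# cases/mass/0012/case_0001.xml
print(storage_path("mass", 12, "page_0001.jp2", False))
# volumes-and-pages/mass/0012/page_0001.jp2
```

A layout along these lines would let a consumer who wants only case files fetch a single repository, or a single series within it, without touching page images.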

What We Need From You

Ideas. We want to make sure that we get this right the first time. If you have insight into solving problems like this, we'd love to hear from you.

Next Steps

Please reach out to us at lil@law.harvard.edu.