As part of eliminating TimesSelect, The New York Times has decided to make all the public domain articles from 1851–1922 available free of charge. These articles are all in the form of images scanned from the original paper. In fact, from 1851–1980, all 11 million articles are available as images in PDF format. Generating a PDF version of an article takes quite a bit of work: each article is actually composed of numerous smaller TIFF images that need to be scaled and glued together in a coherent fashion.
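To give a rough idea of the stitching, here is a minimal sketch using the iText API. The page size, tile positions, and scale factor are hypothetical (in the real system they come from article metadata), and it leans on iText's built-in TIFF reading rather than JAI:

```java
import java.io.FileOutputStream;

import com.lowagie.text.Document;
import com.lowagie.text.Image;
import com.lowagie.text.Rectangle;
import com.lowagie.text.pdf.PdfWriter;

public class ArticlePdf {
    // Stitch a set of TIFF tiles into a one-page PDF. Where each tile sits
    // on the page is driven by article metadata in the real system; here
    // the positions are simply passed in.
    public static void compose(String[] tiffPaths, float[][] positions,
                               float pageWidth, float pageHeight,
                               String outPath) throws Exception {
        Document doc = new Document(new Rectangle(pageWidth, pageHeight), 0, 0, 0, 0);
        PdfWriter.getInstance(doc, new FileOutputStream(outPath));
        doc.open();
        for (int i = 0; i < tiffPaths.length; i++) {
            Image tile = Image.getInstance(tiffPaths[i]); // iText reads TIFF directly
            tile.scalePercent(50);                        // scale the tile down to page resolution
            tile.setAbsolutePosition(positions[i][0], positions[i][1]);
            doc.add(tile); // absolutely positioned, so it lands exactly where we say
        }
        doc.close();
    }
}
```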

Previously we had generated all the PDFs dynamically. This approach had worked reasonably well, but with the strong possibility of a significant traffic increase we started to rethink things. Clearly, pre-generating all the articles and statically serving them would be a great option. Pretty quickly I thought about how we could do this (and have some fun along the way, but beware — my idea of fun is probably radically different from that of most people).

I had been using Amazon's S3 service for some time and was quite impressed. And in late 2006 I had begun playing with Amazon EC2. So the basic idea I had was this: upload the 4TB of source data into S3, write some code that would run on numerous EC2 instances to read the source data, create the PDFs, and store the results back into S3. S3 would then be used to serve the PDFs to the general public. It all sounded pretty simple, and that is how I got the folks in charge to agree to such an idea; it also didn't hurt that Amazon S3/EC2 is pretty easy on the wallet.

The code to generate the PDFs was fairly straightforward, but getting it to run in parallel across multiple machines was an open issue. Being a voracious reader of all kinds of blogs, I had come across and read the MapReduce paper from Google several years ago (it made me weep). I also knew about Hadoop, the open-source implementation of the MapReduce idea. Given all these parts, I had a rough idea of how I could make it all work.

I quickly got to work copying the 4TB of source data to S3. Next I started writing code to pull all the parts that make up an article out of S3, generate a PDF from them, and store the PDF back in S3. This was easy enough using JetS3t (an open-source Java toolkit for S3), the iText PDF library, and the Java Advanced Imaging (JAI) extension.
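To give a flavor of the S3 plumbing, here is a minimal JetS3t sketch; the credentials, bucket names, keys, and file paths below are hypothetical placeholders, not our real ones:

```java
import java.io.File;
import java.io.InputStream;

import org.jets3t.service.S3Service;
import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;
import org.jets3t.service.security.AWSCredentials;

public class S3Glue {
    public static void main(String[] args) throws Exception {
        S3Service s3 = new RestS3Service(
                new AWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        S3Bucket source = new S3Bucket("tiff-source-bucket");
        S3Bucket dest = new S3Bucket("pdf-output-bucket");

        // Pull one TIFF tile down for processing...
        S3Object tile = s3.getObject(source, "articles/0001/part-001.tif");
        InputStream in = tile.getDataInputStream();
        // ...feed the stream to the PDF-composition step, then upload the result.
        in.close();

        S3Object pdf = new S3Object(dest, new File("/tmp/article-0001.pdf"));
        s3.putObject(dest, pdf);
    }
}
```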

Once the basic code was up and running, I set about learning the intricacies of Hadoop. The Hadoop documentation is pretty sparse but helpful enough. I was able to create a Hadoop cluster on my local machine and wrap my code with the proper Hadoop semantics. After a bit more tweaking and bug fixing, I was ready to deploy Hadoop and my code on a cluster of EC2 machines. For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. Using some simple Python scripts and the boto library, I booted four EC2 instances of my custom AMI. I logged in, started Hadoop and submitted a test job to generate a couple thousand articles — and to my surprise, it just worked.
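For the curious, the Hadoop wrapper boiled down to a map-only job: the input is a list of article IDs, and each map call builds one PDF. Here is a rough sketch against the classic org.apache.hadoop.mapred API (not our exact job code; the actual PDF work is elided):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PdfJob {
    public static class PdfMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text articleId,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // ...fetch the article's TIFF tiles from S3, compose the PDF,
            // and upload it back to S3 (see the sketches above)...
            reporter.progress(); // keep the task alive during long PDF builds
            out.collect(articleId, new Text("ok"));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(PdfJob.class);
        conf.setJobName("pdf-generation");
        conf.setMapperClass(PdfMapper.class);
        conf.setNumReduceTasks(0); // map-only: the real output lives in S3
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```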

I then began some rough calculations and determined that with only four machines, it could take quite a while to generate all 11 million article PDFs. But thanks to the swell people at Amazon, I got access to a few more machines and churned through all 11 million articles in just under 24 hours using 100 EC2 instances, generating another 1.5TB of data to store in S3. (In fact, it worked so well that we ran it twice, since after we were done we noticed an error in the PDFs.)
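To put the arithmetic in perspective: 100 instances running for just under 24 hours is roughly 2,400 machine-hours, so the same workload on only four machines would have taken on the order of 600 hours, or close to a month.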

Honestly, I had a couple of moments of panic. I was using some very new and not totally proven pieces of technology on a project that was very high profile and on an inflexible deadline. But clearly it worked out, since I am still blogging from open.nytimes.com.

Now that this adventure can be called a success, I can’t imagine how we might have done it without Amazon S3/EC2. The one caveat I will offer to people who are interested in doing something like this is that it is highly addictive. We have already completed the S3/EC2 portion of another project, and I have ideas for countless more.

In the 1851–1922 articles, you can see what kind of computer was desired in 1892: “Computer Wanted.”