Here at DocumentCloud, we’re constantly turning PDF files and Office documents into embeddable document viewers. We extract text from the documents with OCR and generate images at multiple sizes for each of the thousands of pages we process every day. To crunch all of this data, we rely on High-CPU Medium instances on Amazon EC2 and our CloudCrowd parallel-processing system. Since the new Micro instances were just announced, we thought it would be wise to try them out by benchmarking some real-world work on these new servers. If they proved cost-effective, it would be beneficial for us to use them as worker machines for our document processing.

Benchmarking with Docsplit

To benchmark EC2 Micros, Smalls, and High-CPU Mediums, we used Docsplit. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…).

Configuration

For source material, we used a 51-page PDF from The Commercial Appeal’s recent story on civil rights photographer and FBI informant Ernest Withers: an FBI report that describes the events preceding the assassination of Dr. Martin Luther King Jr.

To benchmark the relative speeds of the instance types, we used Docsplit’s OCR-based text extraction, which is a single-threaded call to Tesseract, as well as Docsplit’s image extraction, which is a multi-threaded call to GraphicsMagick and Ghostscript for PDF to GIF conversion and image resizing.

Here are the commands we ran to download the PDF and extract the images at three different sizes, as well as the full text:

wget http://s3.documentcloud.org/documents/7240/informant-details-invaders-history-and-activities-part-two.pdf

time docsplit images --size 1000x,700x,60x75 --format gif --rolling informant-details-invaders-history-and-activities-part-two.pdf

time docsplit text --ocr informant-details-invaders-history-and-activities-part-two.pdf



Raw Results

High-CPU Medium

$ time docsplit images <SNIP>.pdf

real 5m24.914s
user 5m49.320s
sys 0m12.650s

$ time docsplit text <SNIP>.pdf

real 11m40.346s
user 10m53.560s
sys 0m6.370s

Small

$ time docsplit images <SNIP>.pdf

real 9m37.837s
user 3m27.129s
sys 0m10.049s

$ time docsplit text <SNIP>.pdf

real 15m0.344s
user 5m23.840s
sys 0m7.196s

Micro

$ time docsplit images <SNIP>.pdf

real 21m31.671s
user 4m38.190s
sys 0m9.040s

$ time docsplit text <SNIP>.pdf

real 51m59.664s
user 6m17.230s
sys 0m2.080s



We then used screen to run two Docsplit image extractions at the same time, since the High-CPU Medium instances are dual-core machines.

$ screen

<Screen 1>

$ time docsplit images <SNIP>.pdf

real 6m30.978s

user 5m51.920s

sys 0m11.230s

<Screen 2>

$ time docsplit images <SNIP>.pdf

real 6m26.808s

user 5m50.730s

sys 0m11.180s
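The same two concurrent runs could also be launched from a short Ruby script instead of separate screen windows. Here is a minimal sketch; the helper name run_concurrently and the example filenames in the comment are illustrative, not part of our actual setup:

```ruby
# Run each shell command in its own child process, then wait for all of
# them to finish. Returns one Process::Status per command.
def run_concurrently(commands)
  pids = commands.map { |cmd| Process.spawn(cmd) }
  pids.map { |pid| Process.wait2(pid).last }
end

# On a dual-core High-CPU Medium, two image extractions side by side, e.g.:
#   run_concurrently([
#     "docsplit images --size 1000x,700x,60x75 --format gif a.pdf",
#     "docsplit images --size 1000x,700x,60x75 --format gif b.pdf",
#   ])
```

Each extraction gets its own core, which is why the wall-clock time for two simultaneous runs (about 6.5 minutes) barely exceeds a single run.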



Results

Let’s look at the results:

Instance Type      Image extraction    Text extraction (OCR)    Base Cost per Hour
High-CPU Medium    5.4 minutes         11.7 minutes             $0.17
Small              9.6 minutes         15.0 minutes             $0.085
Micro              21.5 minutes        52.0 minutes             $0.02

Graphing these values: [chart of extraction times by instance type omitted]

Conclusion

It’s not hard to see that the cost-effectiveness of the Micro instances is about twice that of the Medium instances. However, the Medium instance is a dual-core machine, and if we run two Docsplit processes at the same time (which we are already doing), the cost-effectiveness of the High-CPU Medium instance nearly doubles, raising it to the level of a Micro instance.
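The arithmetic behind that comparison is simple. Here is a small Ruby sketch using the image-extraction timings and on-demand rates from the table above; the 6.5-minute figure comes from the two concurrent screen runs:

```ruby
# EC2 on-demand hourly rates at the time of the benchmark.
HOURLY_RATE = { "High-CPU Medium" => 0.17, "Small" => 0.085, "Micro" => 0.02 }

# Cost to extract images from one 51-page document:
# rate is $/hour, minutes is wall-clock time, docs is documents per run.
def cost_per_doc(rate, minutes, docs = 1)
  rate * (minutes / 60.0) / docs
end

printf("Medium, one run:  $%.4f\n", cost_per_doc(HOURLY_RATE["High-CPU Medium"], 5.4))  # $0.0153
printf("Micro,  one run:  $%.4f\n", cost_per_doc(HOURLY_RATE["Micro"], 21.5))           # $0.0072
# Two concurrent extractions on the dual-core Medium take ~6.5 minutes:
printf("Medium, two runs: $%.4f\n", cost_per_doc(HOURLY_RATE["High-CPU Medium"], 6.5, 2)) # $0.0092
```

With both cores busy, the Medium lands at roughly $0.009 per document, close to the Micro’s roughly $0.007.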

There is a crucial difference, however. The Micro instance, despite being cheaper, has a faster CPU: it spends only 4:38 of actual CPU time (user time) on the same image-extraction work that takes the High-CPU Medium 5:49. But because you’re sharing the resources of that Micro instance with other EC2 customers, the High-CPU Medium instance ends up processing the documents nearly four times faster in wall-clock terms: the Micro takes 21:32 to process images, whereas the High-CPU Medium finishes in 5:25.

Our Recommendation: If raw speed is important to you, the High-CPU Medium makes more financial sense than the Small or Micro instances. But if speed is not an issue, the Micro instance actually wins out on cost for single-threaded workloads: processing takes longer, but costs less overall. It all depends on your setup. With our parallel document imports, we could switch to all Micro instances and process the same number of pages per day for the same price, but each individual document would take nearly four times longer to finish. So we’re sticking with the High-CPU Medium instances.

Other Notes

Micro instances come with an optional 64-bit configuration, which is very useful if you ever work with large files, like a MongoDB database, large images or PDFs, or anything beyond 2GB in size. Additionally, Micro instances use Amazon’s EBS service for persistent storage. Because EBS is the same cost no matter the instance size, it’s very convenient if you decide to move up or down in instance size. This is comparable to many competing VPS services like Slicehost and Linode, just a different way of combining the various storage and compute components.

Also, there are many other VPS comparison blog posts which describe the differences between CPU-bound, memory-bound, and I/O-bound application performance. Eivind Uggedal compares a number of different applications on a few hosts, including Amazon. The Bit Source compares CPU performance between Amazon and Rackspace.