Preserving visual history

The morgue contains photos dating back to the late 19th century, and many of its contents have tremendous historical value; some are stored nowhere else in the world. In 2015, a broken pipe flooded the archival library, putting the entire collection at risk. Luckily, the damage was minor, but the event raised the question: How can some of the company’s most precious physical assets be stored safely?

“The morgue is a treasure trove of perishable documents that are a priceless chronicle of not just The Times’s history, but of more than a century of global events that have shaped our modern world,” said Nick Rockwell, chief technology officer, The New York Times.

It’s not only the photos’ imagery that contains valuable information. In many cases, the back of a photo records when and where it was taken. Adds Rockwell: “Staff members across the photo department and on the business side have been exploring possible avenues for digitizing the morgue’s photos for years. But as recently as last year, the idea of a digitized archive still seemed out of reach.”

To preserve this priceless history, and to give The Times the ability to enhance its reporting with even more visual storytelling and historical context, The Times is digitizing its archive, using Cloud Storage to store high-resolution scans of all of the images in the morgue.

Cloud Storage is our durable system for storing objects, and it provides customers like The Times with automatic life-cycle management, storage in geographically distinct regions, and an easy-to-use management interface and API.



Creating an asset management system

Simply storing high-resolution images is not enough to create a system that photo editors can easily use. A working asset management system must let users browse and search the photos easily. The Times built a processing pipeline that stores and processes the photos, and it will use cloud technology to recognize text, handwriting, and other details found in the images.

Here’s how it works. Once an image is ingested into Cloud Storage, The Times uses Cloud Pub/Sub to kick off the processing pipeline to accomplish several tasks. Images are resized through services running on Google Kubernetes Engine (GKE) and the image’s metadata is stored in a PostgreSQL database running on Cloud SQL, Google’s fully-managed database offering.
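The glue between these steps is a Cloud Storage notification delivered over Pub/Sub: when an object is finalized, Cloud Storage publishes a JSON payload describing it, and a subscriber decides what to do. The sketch below is illustrative, not The Times’s actual code; the JSON field names follow Cloud Storage’s Pub/Sub notification format, but the `handleMessage` handler and the bucket and object names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectNotification mirrors the fields of a Cloud Storage
// Pub/Sub notification payload that a pipeline like this needs.
type ObjectNotification struct {
	Bucket      string `json:"bucket"`
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
}

// handleMessage parses one Pub/Sub message body and returns the
// object to process, skipping anything that is not an image scan.
func handleMessage(body []byte) (*ObjectNotification, error) {
	var n ObjectNotification
	if err := json.Unmarshal(body, &n); err != nil {
		return nil, err
	}
	if n.ContentType != "image/tiff" && n.ContentType != "image/jpeg" {
		return nil, fmt.Errorf("skipping non-image object %q", n.Name)
	}
	return &n, nil
}

func main() {
	// A hypothetical notification for a newly ingested scan.
	msg := []byte(`{"bucket":"archive-scans","name":"1910/penn-station.tif","contentType":"image/tiff"}`)
	n, err := handleMessage(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("process gs://%s/%s\n", n.Bucket, n.Name)
}
```

Because the notification carries everything a worker needs to fetch the object, downstream services stay stateless: they can be scaled up or torn down without coordinating with the ingestion step.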

Cloud Pub/Sub helped The New York Times create its processing pipeline without having to build complex APIs or business process systems. It’s a fully-managed solution, so there’s no time spent maintaining the underlying infrastructure.

To resize the images and modify image metadata, The Times uses ImageMagick and ExifTool, two open-source command-line programs. They wrapped both tools with Go services and packaged them in Docker images in order to run them on GKE in a horizontally scalable manner with minimal administrative effort. Adding capacity to process more images is trivial, and The Times can stop or start its Kubernetes cluster when the service is not needed. The images are also stored in Cloud Storage multi-region buckets for availability in multiple locations.
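The pixel work is delegated to ImageMagick, but the scaling arithmetic a wrapper service performs before shelling out is easy to illustrate. This is a minimal sketch, not The Times’s code; the function name and the target box size are assumptions.

```go
package main

import "fmt"

// fitWithin scales (w, h) down to fit inside a (maxW, maxH)
// bounding box while preserving the aspect ratio. Images that
// already fit are left untouched. The resulting dimensions can
// then be handed to an ImageMagick resize invocation.
func fitWithin(w, h, maxW, maxH int) (int, int) {
	if w <= maxW && h <= maxH {
		return w, h
	}
	// Pick the tighter of the two scale factors (cross-multiply
	// to stay in integer arithmetic).
	if w*maxH > h*maxW { // width is the binding constraint
		return maxW, h * maxW / w
	}
	return w * maxH / h, maxH
}

func main() {
	// A hypothetical 6000x4000 scan reduced to fit a 1600x1600 preview box.
	w, h := fitWithin(6000, 4000, 1600, 1600)
	fmt.Println(w, h) // 1600 1066
}
```

Keeping this logic in the Go wrapper rather than in shell scripts makes it trivial to unit-test, which matters when the same code path will run over millions of scans.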

The final piece of the archive is tracking both the images and their metadata as they move through The Times’s systems, and Cloud SQL is a great fit. It gives developers a standard PostgreSQL instance as a fully managed service, removing the need to install new versions, apply security patches, or set up complex replication configurations. In short, Cloud SQL gives engineers a simple, standard SQL solution.
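A metadata table for a pipeline like this one might look like the following. The schema is purely illustrative; The Times has not published its actual table design, and every column name here is an assumption.

```sql
-- Hypothetical metadata table; all names and columns are assumptions.
CREATE TABLE images (
    id            BIGSERIAL PRIMARY KEY,
    storage_path  TEXT NOT NULL UNIQUE,   -- gs:// URI of the original scan
    content_type  TEXT NOT NULL,
    width_px      INTEGER,
    height_px     INTEGER,
    captured_at   DATE,                   -- date noted on the back, when known
    exif          JSONB,                  -- raw ExifTool output
    ingested_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Storing the raw ExifTool output in a JSONB column is one way to keep every recovered field queryable in PostgreSQL without committing to a rigid schema up front.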



Machine learning for additional insights

Storing the images is only half of the story. To make an archive like The Times’s morgue even more accessible and useful, it helps to layer on additional GCP services. For The Times, one of the bigger challenges in scanning its photo archive has been capturing data about the contents of the images. The Cloud Vision API can help fill that gap.
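The Vision API returns its findings as JSON; for text detection, the first entry in `textAnnotations` aggregates all of the recognized text, which makes notes on the backs of photos searchable. The sketch below only parses a response body with the standard library; the sample JSON is illustrative, not real API output, and `detectedText` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotateResponse models only the fields of a Vision API
// annotate response needed to extract detected text.
type annotateResponse struct {
	Responses []struct {
		TextAnnotations []struct {
			Description string `json:"description"`
		} `json:"textAnnotations"`
	} `json:"responses"`
}

// detectedText returns the full text recognized in the first
// image of the response, or "" if nothing was detected.
func detectedText(body []byte) (string, error) {
	var r annotateResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	if len(r.Responses) == 0 || len(r.Responses[0].TextAnnotations) == 0 {
		return "", nil
	}
	// The first annotation aggregates all detected text.
	return r.Responses[0].TextAnnotations[0].Description, nil
}

func main() {
	// An illustrative response body, not actual API output.
	body := []byte(`{"responses":[{"textAnnotations":[{"description":"Penn Station, 1942"}]}]}`)
	text, _ := detectedText(body)
	fmt.Println(text)
}
```

Text recovered this way could be written back to the same metadata store, so a caption typed on the back of a print decades ago becomes a searchable field.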

Let’s take a look at this photo of the old Penn Station from The Times as an example. Here, we are showing you the front and the back of the photo:

