This is the first post about the progress and achievements of our email delivery platform. We’ll start with how we rebuilt our email platform from scratch on Google Cloud Platform to improve the scalability, resilience, and sustainability of sending email at The Times.

By AUSTIN HESS and CARLOS RYMER

The New York Times sends nearly 4 billion emails per year to its customers, ranging from daily newsletters to breaking news alerts to transactional emails. The old system originally built to perform these functions was integrated with over a dozen internal services and provided a custom UI for editors to author emails. But by 2017, it was over a decade old and full of legacy code that could barely meet the demands of The Times’ growing subscriber base, frequently resulting in major user-facing errors. We decided to move all our infrastructure to the cloud by the end of March 2018, spurring us to design a more stable and scalable email platform from the ground up. The end result was a platform with a serverless architecture built on top of Google Cloud Platform.

Limitations of the old system

The old system was built before cloud services took off. At the time, that meant building many components on dedicated virtual machines within our data centers, manually updating the software running on those machines and making architectural choices based primarily on short-term considerations. As a result, the core sending infrastructure relied on layers of archaic and often undocumented scripts to select audiences and compile emails. Countless frustrating hours of debugging and coaxing were needed to address the near-weekly failures. These constraints severely limited our ability to react to problems, develop new features and meet the needs of growing email audiences.

The new platform

The new platform we built fulfills the same fundamental requirements the old system did. Most of the emails The Times sends are bulk newsletters which are delivered to an audience of anywhere from tens of thousands to millions of users who subscribe to a particular email product. Newsletters are assembled in advance by editors who use our custom interface to pull in UI elements and schedule a time for the dispatch, while the content of breaking news alerts is generated programmatically at the time of dispatch. A small but important portion of the traffic comes from transactional emails sent to individual users triggered by specific NYT events, such as the confirmation email sent when someone upgrades to a paid subscription.

The new platform consists of a suite of microservices, each responsible for a specific function. These functions include:

Keeping track of a user’s newsletter subscriptions and metadata

Quickly selecting an audience for a newsletter when sending

Providing an admin UI for assembling email content and scheduling newsletter dispatches

Compiling and sending an individualized email for each recipient

Collecting and aggregating data about the emails sent and user behavior

Providing convenient consumer APIs for other teams to send transactional emails and access business-critical data

Diagram of the services that comprise the email system

When choosing the infrastructure for building these microservices, one of the defining considerations was how our system would deal with spikes in demand. The process that compiles individualized emails can go from a steady, low-traffic state to having millions of compilation tasks queued up and waiting to be processed, all within the span of a minute. Each email generation task may need one or more network calls for each recipient to gather relevant information. After some internal discovery work, we settled on using Google App Engine (GAE) for nearly every microservice in the new platform. GAE’s standard environment allows us to build APIs that scale incredibly quickly with minimal need for manual tuning.

GAE Standard imposes constraints — for example, the code must only run in the context of an HTTP request, and there are time limits on the length of execution responding to a request — that lend themselves to design patterns that make full use of the latest Google Cloud Platform tools and enhance scalability. Instead of long-running processes doing complex operations in memory, we use GAE services to transform and store data, publish data to queues (a “Pub/Sub” in GCP), break up multi-step tasks, spread out load and facilitate retries. We chose to use Go for all of our apps, both for its performance and ease of use, as well as its prevalence at The Times.
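To make the request-scoped constraint concrete, here is a minimal sketch of a worker that does one short unit of work per HTTP request and relies on the queue for retries. It assumes messages arrive as Pub/Sub push deliveries, which the post does not confirm; the `task` type, `newWorker`, `deliver`, and `errTransient` are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pushEnvelope mirrors the JSON body that Pub/Sub posts to a push endpoint.
// encoding/json base64-decodes Data automatically because it is a []byte.
type pushEnvelope struct {
	Message struct {
		Data      []byte `json:"data"`
		MessageID string `json:"messageId"`
	} `json:"message"`
	Subscription string `json:"subscription"`
}

// task is a hypothetical payload; the real message schema is not public.
type task struct {
	UserID string `json:"user_id"`
}

// errTransient stands in for any failure worth retrying (assumption).
var errTransient = errors.New("transient failure")

// newWorker wraps one short unit of work in an HTTP handler. A 2xx status
// acknowledges the message; any other status asks the queue to redeliver it.
func newWorker(process func(task) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var env pushEnvelope
		if err := json.NewDecoder(r.Body).Decode(&env); err != nil {
			http.Error(w, "bad envelope", http.StatusBadRequest)
			return
		}
		var t task
		if err := json.Unmarshal(env.Message.Data, &t); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		if err := process(t); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// deliver simulates one push delivery and returns the handler's status code.
func deliver(h http.Handler, payload string) int {
	var env pushEnvelope
	env.Message.Data = []byte(payload)
	body, _ := json.Marshal(env)
	req := httptest.NewRequest(http.MethodPost, "/worker", bytes.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	ok := newWorker(func(task) error { return nil })
	flaky := newWorker(func(task) error { return errTransient })
	fmt.Println(deliver(ok, `{"user_id":"u1"}`))    // acknowledged
	fmt.Println(deliver(flaky, `{"user_id":"u1"}`)) // will be redelivered
}
```

Because the handler holds no state between requests, retries and load spreading are left entirely to the queue. In this sketch malformed messages are also rejected with a non-2xx status, so a production service would probably want a separate path for payloads that can never succeed.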

The following descriptions highlight some of the most interesting design problems we faced:

Storing user data for efficient individual updates and bulk selection

Our API consumers rely on highly available get/set capabilities for each user’s data, especially newsletter subscriptions. For example, on a registered user’s account page, an API call gets the user’s newsletter subscriptions by their system ID, allowing the user to view and edit their newsletter subscriptions. Given that we wanted the schema extensibility of a NoSQL database, the GAE-specific Datastore seemed like a good fit for this purpose. But Datastore is poorly suited for querying for an entire newsletter audience, which depends on several user properties. Moreover, neither standard SQL nor NoSQL options were well suited to the most advanced audience querying capabilities we hoped to support, such as filtering out users who hadn’t opened recently delivered emails.

The solution was to use a GAE API with Datastore to manage each individual user data object. Every time a user object is modified, a record of the update (e.g. a new subscription to a given newsletter product) is published to a Pub/Sub, from which it is then inserted into BigQuery, GCP’s data warehouse that runs SQL-like queries. Importantly, BigQuery inserts are cheap, and a query’s cost in time and dollars depends on the size of the data it scans rather than the complexity of its joins, whereas updating existing data or querying individual rows is very expensive. So by keeping the timestamped history of every user subscribe/unsubscribe and metadata change, a few simple group-bys allow us to quickly select the currently subscribed members of a particular newsletter audience, even when joining on email open and click data.
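The "latest event wins" grouping over an append-only history can be sketched in a few lines of Go. In production this logic runs as a BigQuery query rather than in application memory; the `SubscriptionEvent` type, its field names, and the sample data below are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// SubscriptionEvent is a hypothetical row in the append-only history:
// one record per subscribe or unsubscribe action, never updated in place.
type SubscriptionEvent struct {
	UserID     string
	Newsletter string
	Subscribed bool // true = subscribe, false = unsubscribe
	At         time.Time
}

// CurrentAudience groups the history by user and keeps only each user's most
// recent event for the given newsletter, mirroring a GROUP BY on user ID with
// a "latest timestamp wins" rule. It returns the sorted IDs of users whose
// latest event is a subscribe.
func CurrentAudience(events []SubscriptionEvent, newsletter string) []string {
	latest := make(map[string]SubscriptionEvent)
	for _, e := range events {
		if e.Newsletter != newsletter {
			continue
		}
		if prev, ok := latest[e.UserID]; !ok || e.At.After(prev.At) {
			latest[e.UserID] = e
		}
	}
	var ids []string
	for id, e := range latest {
		if e.Subscribed {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

// sampleHistory returns illustrative data: u1 subscribes and later
// unsubscribes, u2 stays subscribed, u3 follows a different newsletter.
func sampleHistory() []SubscriptionEvent {
	t0 := time.Date(2018, 1, 1, 0, 0, 0, 0, time.UTC)
	return []SubscriptionEvent{
		{"u1", "morning-briefing", true, t0},
		{"u2", "morning-briefing", true, t0.Add(time.Hour)},
		{"u1", "morning-briefing", false, t0.Add(2 * time.Hour)},
		{"u3", "cooking", true, t0.Add(3 * time.Hour)},
	}
}

func main() {
	fmt.Println(CurrentAudience(sampleHistory(), "morning-briefing")) // [u2]
}
```

Nothing in the history is ever rewritten, which is what keeps the write path cheap: an unsubscribe is just another appended row that outranks the earlier subscribe.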

Illustration by Zaiwei Zhang

Scaling email generation when delivering many unique messages

Now that we had a solution for consistently storing user data in ways optimized for both quick gets and sets, and also complex bulk queries, we had to build components to support an admin interface for editing email content templates and scheduling email dispatches, retrieving the IDs of audience members for a dispatch, and generating and sending out an individual user’s email based on the template.

The new platform’s admin consists of RESTful APIs for modifying newsletter audiences (which translate to BigQuery queries), email content (templates), and schedules, among other components, and all this data is stored in a relational database on Google Cloud SQL. A cloud-based cron job checks each minute whether a bulk email dispatch should be triggered. When it is time for a dispatch, the data comprising the task is passed to our audience selector service, which retrieves the IDs of the users in the audience from BigQuery and publishes each as an individual message (with metadata) onto an email generation Pub/Sub. Each such message (a “compile task”) kicks off the generation and sending of an individual email when it is processed in the next step.
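The fan-out performed by the audience selector might look roughly like the sketch below: one small message per recipient, so that downstream workers can scale on queue depth. The `CompileTask` fields, the `Publisher` interface, and the in-memory publisher are all hypothetical stand-ins; a real implementation would publish to a Pub/Sub topic.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CompileTask is a hypothetical message shape: just enough for a worker to
// look up the template and the user when it processes the task.
type CompileTask struct {
	DispatchID string `json:"dispatch_id"`
	TemplateID string `json:"template_id"`
	UserID     string `json:"user_id"`
}

// Publisher abstracts the queue so the fan-out logic can be tested without
// a network. In production this would wrap a Pub/Sub topic (assumption).
type Publisher interface {
	Publish(data []byte) error
}

// memoryPublisher is an in-memory stand-in that records published messages.
type memoryPublisher struct{ messages [][]byte }

func (p *memoryPublisher) Publish(data []byte) error {
	p.messages = append(p.messages, data)
	return nil
}

// FanOut publishes one compile task per audience member and returns how many
// were published. It stops at the first error so the caller can decide how
// to retry the remainder.
func FanOut(pub Publisher, dispatchID, templateID string, userIDs []string) (int, error) {
	for i, id := range userIDs {
		data, err := json.Marshal(CompileTask{
			DispatchID: dispatchID,
			TemplateID: templateID,
			UserID:     id,
		})
		if err != nil {
			return i, err
		}
		if err := pub.Publish(data); err != nil {
			return i, fmt.Errorf("publishing task for user %s: %w", id, err)
		}
	}
	return len(userIDs), nil
}

func main() {
	pub := &memoryPublisher{}
	n, err := FanOut(pub, "dispatch-42", "todays-headlines", []string{"u1", "u2", "u3"})
	fmt.Println(n, err)
	fmt.Println(string(pub.messages[0]))
}
```

Keeping each message down to a few IDs means the selector finishes quickly, and the expensive per-user work is deferred to the compile step, where it can be spread across many instances.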

The last step in the process is the email compiler, which provides a worker endpoint that accepts these compile tasks. It retrieves (and caches) a generic form of the email content based on the compile task’s template ID, and it retrieves the full user info from the user API with the user ID. This user API call is needed because many fields beyond the user ID included in the compile task may be needed to generate the individual email, like showing or hiding certain sections from the daily ‘Today’s Headlines’ newsletter.
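The template caching mentioned above can be as simple as a mutex-guarded map, since every compile task in a dispatch shares one template ID. This is only a sketch of the idea, not the compiler's actual cache; `TemplateCache` and its `fetch` loader are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// TemplateCache memoizes template bodies by ID so that an instance handling
// many compile tasks for one dispatch fetches each template only once.
type TemplateCache struct {
	mu    sync.Mutex
	fetch func(id string) (string, error) // hypothetical loader, e.g. an API call
	items map[string]string
}

// NewTemplateCache returns an empty cache backed by the given loader.
func NewTemplateCache(fetch func(string) (string, error)) *TemplateCache {
	return &TemplateCache{fetch: fetch, items: make(map[string]string)}
}

// Get returns the cached template, loading it on first use. Errors are not
// cached, so a failed fetch is attempted again on the next call.
func (c *TemplateCache) Get(id string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if body, ok := c.items[id]; ok {
		return body, nil
	}
	body, err := c.fetch(id)
	if err != nil {
		return "", err
	}
	c.items[id] = body
	return body, nil
}

func main() {
	calls := 0
	cache := NewTemplateCache(func(id string) (string, error) {
		calls++
		return "template body for " + id, nil
	})
	for i := 0; i < 3; i++ {
		cache.Get("todays-headlines")
	}
	fmt.Println("fetches:", calls) // fetches: 1
}
```

Holding the lock during the fetch keeps the sketch short but serializes concurrent loads; a real service would likely also want expiry so that edited templates are picked up.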

Then, using Go’s built-in templating language, the compiler turns the template content into individualized emails, with transformations such as adding click tracking parameters to links, adding or hiding sections from the long newsletters based on user preferences, and in the near future, adding personalized article suggestions based on NYT’s recommendation API. Finally, after generating the individualized email, the compiler dispatches it to the user’s email address using an API from our mail relay vendor.
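As a rough illustration of this step, the sketch below uses Go's `text/template` to hide a section based on a user preference and to append a tracking parameter to each link. The `User` fields, the `track` helper, the `campaign` parameter name, and the example URLs are all assumptions. `text/template` is used here for brevity; `html/template` shares the same syntax and adds contextual escaping, which matters for real HTML email.

```go
package main

import (
	"bytes"
	"fmt"
	"net/url"
	"text/template"
)

// User holds hypothetical per-recipient fields fetched from the user API.
type User struct {
	ID          string
	FirstName   string
	ShowOpinion bool
}

// track appends a click-tracking query parameter to a link. The parameter
// name "campaign" is illustrative, not the one used in production.
func track(campaign, link string) (string, error) {
	u, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("campaign", campaign)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// body is a generic template shared by every recipient of a dispatch.
const body = `Hi {{.FirstName}},
Top story: {{track "todays-headlines" "https://example.com/top-story"}}
{{if .ShowOpinion}}Opinion: {{track "todays-headlines" "https://example.com/opinion"}}
{{end}}`

// emailTmpl is parsed once and reused for every recipient.
var emailTmpl = template.Must(
	template.New("email").Funcs(template.FuncMap{"track": track}).Parse(body),
)

// Compile renders the shared template into one recipient's email.
func Compile(u User) (string, error) {
	var buf bytes.Buffer
	if err := emailTmpl.Execute(&buf, u); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := Compile(User{ID: "u1", FirstName: "Ada", ShowOpinion: true})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Parsing the template once and executing it per recipient keeps the per-email cost low, which fits a workload of millions of small, independent compile tasks.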

GAE’s quick scaling means that many instances of the email compiler can spin up very quickly to handle the potentially millions of compile tasks coming from the Pub/Sub, and this architecture makes it easy to add API calls as data sources for email content (such as the personalization API or an A/B testing framework) without re-tuning the scaling. Transactional email dispatches go through the same pathway but are triggered by a synchronous API call rather than a message from the Pub/Sub.

In addition to the components that make up the core email sending infrastructure of the new platform, we were also able to quickly and easily set up several auxiliary GAE services for tasks like collecting email delivery and open logs.

What the new platform has meant so far

The benefits of building the new email platform cannot be overstated. Whereas the old system could take hours to deliver a newsletter to a large audience, the new platform rarely takes more than 10 minutes, and is usually faster. Since launching, we have seen a dramatic reduction in weekly system failures and have set up an extensive suite of alerts and monitoring dashboards that give us significant visibility into the system’s inner workings. When problems do occur, they are easy to identify and debug due to the careful separation of concerns throughout the system, as well as the polished GCP logging tools. The new data storage setup is transparent and performant, making everything from GDPR compliance to measuring email recipient engagement far easier. When adding new features, the GAE architecture makes it easy to integrate an additional API call or data source into a service without re-tuning the scaling behavior. This platform will allow us to greatly expand both the capabilities of the editors composing emails and the features of those emails for recipients, keeping pace with the rest of the exciting technical innovation happening at The Times.

We now have confidence that we can rely on a platform that’s scalable, resilient, and sustainable. Rather than simply trying to keep things running, we’re now able to focus on building out features and tools that will help editors create more engaging newsletters to build even deeper relationships with our readers. We’ll cover what this means from a product perspective in an upcoming post, but we’re excited to see what we can build in the future given a solid foundation.