Drupal is one of the largest and longest-running open source projects on the web. Keeping Drupal.org and all of its sub-sites and services online and available to the community is no small task. We've partnered with Datadog, a cloud-scale monitoring provider, to serve as our centralized monitoring and alerting service.

The problem: Monitoring a complex and evolving infrastructure

We often talk about Drupal.org as if it were a single monolithic entity, but in fact it is a complex infrastructure, supporting many sites and services, continuously evolving to meet the needs of the Drupal project.

The infrastructure is built as a combination of VMware hosts, bare metal servers, and cloud instances—all fronted by a global CDN. These sites and services include project hosting, community forums, the Drupal Jobs board, Git hosting, and much more. Many of these systems hook into each other and share data.

The Drupal.org infrastructure includes:

8 production Drupal sites, running on PHP and MySQL (MariaDB)

Code repositories and code viewing with Git, Twisted, and CGit

The Updates.xml feeds, which provide update information to Drupal sites around the world

Automated testing for the Drupal project through DrupalCI, using a Jenkins dispatcher, DockerHub images, and dynamically scaling testbots on AWS

The Drupal Composer Façade, which makes Drupal buildable with the Composer package manager for PHP

Apache Solr servers for search

Our centralized loghost, using rsyslog

Remote backups to rsync.net

DrupalCamp static archives

Pre-production infrastructure for development and staging sites

As you can see, it's not just Drupal sites that we have to monitor; we also have automation systems, cloud instances, and custom services like the Updates system and the Composer Façade.
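For custom services like these, one common way to get metrics into Datadog is the agent's DogStatsD listener, which accepts simple plain-text datagrams over UDP on port 8125. The sketch below shows the datagram format with a made-up metric name and tags; it is illustrative, not taken from our actual services:

```python
import socket

def format_dogstatsd(metric, value, mtype="g", tags=None):
    # DogStatsD datagram format: metric.name:value|type|#tag1:v1,tag2:v2
    payload = f"{metric}:{value}|{mtype}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload

def send_metric(payload, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: the send succeeds even if no agent is listening
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("utf-8"), (host, port))
    sock.close()

# e.g. counting a request against a hypothetical updates-service metric
send_metric(format_dogstatsd("updates.requests", 1, "c", ["service:updates"]))
```

Because the protocol is this simple, even small custom daemons can report in without pulling in a client library.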

What we tried in the past

The diversity of sites and services that we manage makes monitoring our infrastructure at the service and application level complex. At one point we were using five different monitoring services to keep track of different parts of the infrastructure, each one capable of producing its own alerts, and each one piping alert information in a different format to our channels.

Using this many systems meant that we had redundant monitoring and alerts in place for some services, while monitoring for other systems had been overlooked entirely. The sheer noise of so many systems, combined with the administrative overhead of managing separate accounts, left us in a place where it was difficult to separate signal from noise. Yes, we had coverage for the most critical issues that might affect the infrastructure, but we also had significant gaps that were simply invisible until we brought in Tag1 Consulting as our infrastructure partner to audit our monitoring and alert systems.

Our prior solution was too patchwork and ad-hoc to be sustainable in the long term, and critical systems were falling through the cracks.

A new solution: centralization

This is why we've standardized on Datadog as our centralized monitoring solution. We use our Puppet 4 tree to automatically configure the Datadog agent on every host in our infrastructure. Right now, Datadog is providing monitoring and alerts for 46 hosts. Instead of generating alerts from many independent services, we're piping them all into Datadog and relying on it as the central authority of record for the current state of our infrastructure.
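With Datadog's community-maintained `datadog_agent` Puppet module, wiring a host into Datadog can be as small as a single class declaration. This is a hedged sketch, not our actual manifests; the tags are invented, and in practice the API key would come from Hiera or encrypted data rather than a literal:

```puppet
# Illustrative only: enrolls a host in Datadog with a couple of tags.
class { 'datadog_agent':
  api_key => 'REDACTED',  # in real use, look this up from Hiera/eyaml
  tags    => ['env:production', 'role:webserver'],
}
```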

Datadog provides integrations with services that we use at all levels of the stack, from our Jenkins automation, to our application performance monitoring with New Relic, to monitoring our Fastly CDN delivery and error rates. In all, we've enabled over a dozen integrations, which have allowed us to quickly gain observability across systems, applications, and services. When issues arise, Datadog is integrated with Slack, IRC, and OpsGenie to generate the appropriate alerts for our team.
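A Datadog monitor ties a metric query to those alert channels through @-mentions in its notification message. A hypothetical monitor definition might look like the following; the metric name, threshold, and channel handles are invented for illustration:

```json
{
  "type": "metric alert",
  "name": "Elevated CDN error rate",
  "query": "avg(last_5m):avg:fastly.errors{*} > 100",
  "message": "CDN error rate is elevated. @slack-infra @opsgenie"
}
```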

Building insight over time

Datadog also provides two key tools to give us insight into the state of our infrastructure over time: Dashboards and the Metrics Explorer.

In the example below, you can see how these dashboards give us an at-a-glance picture of everything from the CDN errors an end user might hit, to MySQL activity for non-cached operations, to I/O wait and disk utilization percentages.

We can build additional visualizations of just about any metric tracked by the Datadog agent or one of the integrated services. This can help us model patterns in traffic, identify potential database deadlocks, look for processes consuming excessive CPU, or anything else we might need.
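The metrics behind those graphs can also be pulled programmatically from Datadog's v1 timeseries query API. A small sketch of building such a request URL, with an illustrative query string (sending the request would additionally require `DD-API-KEY` and `DD-APPLICATION-KEY` headers):

```python
from urllib.parse import urlencode

def build_query_url(query, start_ts, end_ts, site="api.datadoghq.com"):
    # Datadog v1 timeseries query endpoint; timestamps are Unix epoch seconds
    params = urlencode({"from": start_ts, "to": end_ts, "query": query})
    return f"https://{site}/api/v1/query?{params}"

# e.g. average user CPU across all hosts over one hour
url = build_query_url("avg:system.cpu.user{*}", 1500000000, 1500003600)
```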

Moving forward, Datadog will help the Drupal.org engineering team keep the sites and services the Drupal project depends on reliable, resilient, and performant. Over time the centralization of our monitoring and alerts, in combination with the insight we receive from the metrics it provides, will help us reduce the effort of maintaining the infrastructure and increase the uptime and performance of these tools for the community.