You don’t want to annoy or lose customers with an outage. You want your software to respond to requests as quickly as possible. Any application that’s important to your cash flow needs to be monitored. Period.

For some organizations, monitoring is solely in the hands of operations, but recently the shift has been toward cross-discipline engineers who build, test, monitor, and are on call for their applications. These cross-discipline engineers are sometimes called DevOps engineers or site reliability engineers (SREs). In some cases, there are engineers who do nothing but monitor applications and pass issues down to the developers of the offending application.

Monitoring is a critical component for any software company, or at least it should be. So what do experts from the most successful software companies say about modern monitoring practices?

Top developers, reliability engineers, and monitoring tool makers including Allison McKnight, an SRE at Etsy, VividCortex CEO Baron Schwartz, and Torkel Ödegaard, creator of Grafana, shared their 10 most illuminating and useful presentations on monitoring. Nineteen more videos, arranged by category, are also included.

Metrics, metrics, everywhere

From the presentation:

“As developers we have a mental model of what our code does ... we spend so much time inside our heads, it's very easy to mistake what's inside of our heads for reality (i.e. to mistake the map for the territory) … we can't know until we measure it”

—Coda Hale, co-founder of Skyliner and creator of Dropwizard

While it’s true this 2011 talk is a little old, it’s certainly evergreen advice. Coda Hale’s “Metrics, Metrics, Everywhere” was the first talk that came to mind when I asked Baron Schwartz about his favorite presentations on monitoring. Developers have even written blog posts specifically to praise this talk's timeless lessons.

The lecture hits all the right notes, especially with its reference to John Boyd’s OODA loop, which is also a core guiding principle for Netflix. Hale explains how developers can use OODA’s “observe, orient, decide, and act” as a model for focusing on the right metrics that give engineers the ability to better observe and react to systems in production.

Crafting performance alerting tools

From the presentation:

"We have a saying that goes, 'If it moves, we'll track it.' And sometimes we'll monitor something even if it doesn't move, just in case it tries to make a run for it later."

—Allison McKnight, performance engineer, Etsy

Etsy is known for its operational chops. This talk explores how they created tools on top of popular open source projects such as Nagios to detect performance slowdowns in their back end. The key theme here is that you can improve your tools by integrating the context that you need while you're using those tools. Good tools are contagious.

Better living through statistics: Monitoring doesn't have to suck

From the presentation:

“The standard practice for alerting is still to check the measurement at the time that it is taken, and it is this ‘check script’ model of monitoring that is long due for an overhaul.”

—Jamie Wilkinson, Site Reliability Engineer, Google

This is a great video to start with if you’re new to monitoring or you need to learn more about relevant statistical methods that can help you improve your monitoring systems. It gives you a solid introduction to the basic concepts of common monitoring systems, and also shows you some simple mathematics—which you may or may not remember from high school—that can give you way more insight into what your infrastructure is actually doing.
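To make the contrast in Wilkinson's quote concrete, here is a minimal Python sketch, not taken from the talk. It compares a "check script" that looks only at the latest measurement with a check that looks at the mean over a recent window. The function names, the 500 ms threshold, and the latency samples are all made up for illustration.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds, oldest first (made-up data).
latencies_ms = [120, 130, 125, 900, 128, 122, 131, 127]

THRESHOLD_MS = 500  # illustrative alert threshold, an assumption


def point_in_time_alert(sample, threshold=THRESHOLD_MS):
    """'Check script' style: alert if this one sample is over the threshold."""
    return sample > threshold


def windowed_alert(samples, window=5, threshold=THRESHOLD_MS):
    """Alert only if the mean of the last `window` samples is over the threshold."""
    recent = samples[-window:]
    return mean(recent) > threshold


# The single spike fires the point-in-time check when it is sampled ...
spike_alerts = [point_in_time_alert(s) for s in latencies_ms]

# ... while the windowed mean at that same moment stays under the threshold.
windowed_at_spike = windowed_alert(latencies_ms[:4], window=4)
```

A mean is only one possible choice here. Whether a mean, a percentile, or a rate of change is the right statistic depends on the signal you are watching.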

What should I monitor, and how should I do it?

From the presentation:

“Graphs have no intrinsic meaning. Don’t stare at a graph and wonder what it means. That’s a backwards process.”

—Baron Schwartz, CEO, VividCortex

Even though it’s a few years old, this talk's criticisms of monitoring tools still largely hold up. Not only does it suggest more powerful ways to monitor your applications, it also explains why competition for great engineers is forcing sub-billion-dollar companies to do more monitoring with fewer people.

Baron recently presented a sequel to this talk called "What Should I Instrument, And How Should I Do It?" The slides have a lot of great links and resources.

Monitoring is dead. Long live monitoring

(Video) (Slides)

From the presentation:

“A system is observable if and only if you can determine the behavior of the system based on its outputs. The manner in which a system acts is its behavior. The outputs of a system are concrete results of its behaviors. Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.”

—Greg Poirier, Software Engineer, Stripe

From the dawn of systems administration to now, monitoring has focused on CPU, memory, disk, process aliveness, and system aliveness. Greg Poirier thinks it’s time for the industry to outgrow this focus and change the conversation around monitoring. His presentation is lively and humorous. This is another good introduction to the general principles of monitoring. It's also more recent than "Better Living Through Statistics: Monitoring Doesn't Have to Suck."

The evolution of monitoring systems at Google

(Video)

From the presentation:

“There’s a famous story of a high-ranking [Google] engineer who spent an entire day trying to debug a problem with their monitoring, and it came down to the fact that within his config, ‘a’ minus ‘b’ was actually an identifier, not an expression. Those sorts of faux pas are all over the place in DSLs, if you’re not careful.”

—Tony Rippy, software engineer, Google

Tony Rippy’s session from Monitorama PDX 2015 is a revealing and entertaining trip down memory lane for those involved in monitoring systems at Google. Did you know, for example, that Google’s first monitoring program was a hilariously simple Perl script that ran on a developer’s desktop? Did you know they spent a lot of money and effort on a monitoring tool that almost no one could use? This presentation is not Tony Rippy’s story. Rather, it’s an aggregation of stories from hundreds of Google engineers over a 16-year period.
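The "'a' minus 'b'" anecdote in the quote above is easy to reproduce in any language that allows hyphens in identifiers. The toy tokenizer below is purely hypothetical and has nothing to do with Google's actual configuration language. It only shows how the same text can lex as one identifier or as a subtraction, depending on a single character class.

```python
import re

# Two hypothetical lexer rules for a toy config language (assumptions for illustration).
HYPHEN_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*|[-+*/]")  # hyphens allowed in names
PLAIN_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[-+*/]")    # hyphens are operators


def tokenize(text, rule):
    """Return the tokens that `rule` finds in `text`, ignoring whitespace."""
    return rule.findall(text)


hyphen_tokens = tokenize("a-b", HYPHEN_IDENT)    # one identifier named "a-b"
plain_tokens = tokenize("a-b", PLAIN_IDENT)      # identifier, operator, identifier
spaced_tokens = tokenize("a - b", HYPHEN_IDENT)  # whitespace restores the subtraction
```

Under the first rule, only the spaced form is read as an expression, which is the kind of silent surprise the quote warns about.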

Monitoring at Spotify. When things go ping in the night

From the presentation:

“People didn’t care about monitoring until it started spamming them.”

—Martin Parm, Infrastructure Engineer, Spotify

Want to see how Spotify’s monitoring evolved over the course of its history? You might be surprised to hear that this talk is much more about how monitoring affected the company culture and how its DevOps techniques evolved.

Building a culture of observability at Stripe

(Video) (Slides)

From the presentation:

“You will never fully replace and make these systems perfect. They require constant care, like a garden. So we’ve been able to switch most people over to the new systems, but the old systems still exist.”

—Cory Watson, observability specialist and software engineer, Stripe

Stripe is a fairly new company that provides APIs for online payments. Like the Spotify talk, this session is more about people than tools and technology. It's a story about how Cory Watson came into the company to be a champion for creating a culture of observability and monitoring. You'll hear about the interesting, clever, and sometimes sneaky things he did to bring about this culture shift.

The art of performance monitoring

(Video)

From the presentation:

“You think to yourself, 'I had an incident, so I should make an alarm for that.' And you had another incident, and you should make a tool for that. At each step you’re making a rational choice, but you don’t realize that the cumulative effect is something that’s hard to maintain, and kind of unbearable.”

—Brian Smith, production engineer, Facebook

In yet another talk from the trenches of a famous company, you’ll learn more specific technical ideas for tools and strategies that can improve monitoring. You’ll also hear about the monitoring mistakes Brian Smith sees developers making. There’s a high density of good ideas for you to digest in this video, and it wraps up in less than 20 minutes.

How monitoring works at scale: Monitoring tools, components, & mentality at Facebook

From the presentation:

“As a developer, if I’m not getting feedback from the code that I’m pushing and the changes that I’m making, I will be burned out immediately, and probably after a year I will leave the company. But the problem isn’t in the company; it’s in the relationship between the developer and production.”

—Ran Leibman, production engineer, Facebook

This slide presentation discusses how Facebook handles the challenges of monitoring its massive-scale environment. Most apps don’t come close to the size of Facebook's, but there is a lot of good advice for any development shop in this session. Its techniques for dashboards, alerts, monitoring data management, and tooling UX might give you some ideas for your own monitoring tools.

More talks

That's the end of the curated list of talks recommended by the experts, but I've added this categorized list of 19 extra monitoring videos, so read on.

Metrics and culture

How Metrics Shape Your Culture (Video) (Slides)

By Nicole Forsgren

What you decide to monitor and measure shapes who you are as a company. Since metrics shape behavior and incentives, and behaviors and incentives are the heart of culture, metrics inherently shape your culture.

Metrics are for Chumps: Understanding and Overcoming the Roadblocks to Implementing Instrumentation (Video)

By James Fryman

Spending time instrumenting code and tooling becomes a low priority for some teams in the same way that testing and documentation become low priorities—people say it's not important enough to spend time on it. Learn how and why metrics implementations fail.

Tools and data visualization

An Admin's Guide to Data Visualization (Video) (Slides)

By Caskey L. Dickson

Learn how to turn a wall of numbers into a story. This talk covers techniques for mastering common and advanced graphs, and also explores the psychology of how people interpret visual data.

Grafana and the Future of Metrics Visualization (Video)

By Torkel Ödegaard

Grafana is a hot, fairly new open source graph and dashboard composer. This talk is not a tool pitch, though. Rather, it looks at the state of metric visualization and how you can better integrate metrics with alerting in general. If you’re interested in learning more about Grafana, there are more up-to-date talks by Ödegaard. Here are some presentations on Grafana 3.0 and Grafana 4.0.

Prometheus (Video) (Slides)

By Brian Brazil

Another hot technology, especially around microservices monitoring, is the open source monitoring system and time series database, Prometheus. It was originally developed at SoundCloud and is used by Google.

Data science and monitoring

Statistics for Engineers (Video) (Slides)

By Heinrich Hartmann

Much of the monitoring space is focused on code that collects, moves, stores, and displays metrics, but this talk is about the most important part—code that analyzes the meaning behind the metrics. This talk, and the companion article, focus on statistical methods that are relevant to operations. Some mathematical knowledge is required.

Data Science: The Solution to #monitoringsucks (Video) (Slides)

By Patrick Roelke

The #monitoringsucks hashtag kicked off a conversation about the state of monitoring tools in 2011, and Patrick Roelke still thinks today’s leading tools, such as New Relic, Loggly, and DataDog, haven’t given engineers the answers they need. For that, Roelke explores new monitoring tools that harness data science and machine learning.

Scaling monitoring

Scaling Pinterest’s Monitoring System (Video) (Slides)

By Brian Overstreet

Hear the story of how Pinterest scaled its monitoring systems early on, and how it faced many challenges. This presentation explains much about the various open source tools and techniques its engineers used.

Scalable Online Analytics for Monitoring (Video) (Slides)

By Heinrich Hartmann

This opinionated talk favors stateful online computations in monitoring architecture that enable machine learning on high-velocity data. The main components include alerting systems, event engines, stream aggregators, and time-series databases.
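As a rough illustration of what a "stateful online computation" can look like, the sketch below keeps a running mean and variance with Welford's algorithm, updating in constant memory as each value arrives. It is a generic technique, not code from Hartmann's talk, and the class name and sample values are hypothetical.

```python
class RunningStats:
    """Running mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the state."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up stream
    stats.update(value)
```

Because the state is just three numbers, a stream aggregator could keep one such object per metric without storing the raw history.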

Testing in production

Production Testing Through Monitoring (Video) (Slides)

By Leon Fayer

Traditional testing methods can’t fix everything before production. Internet-scale, data-driven problems make it impossible to predict every edge case that can cause production issues, so teams need to be able to fix problems in their production environments quickly. This talk takes a real-world look at common testing inefficiencies, and offers recommendations for top-down metric instrumentation.

Testing in Production (Video)

By Devdas Bhagat

Just as behavior-driven development (BDD) brings the business level into testing, this talk addresses how to bring the business level into monitoring. It explores strategies for business-level and functional monitoring, and also covers the tradeoffs of moving from a development, test, acceptance, production (DTAP) model to continuous deployment.

Monitoring first

Monitoring As A Service (Video) (Slides)

By James Turnbull

It’s important to think about monitoring as a service that you provide to your teams, rather than just a collection of the latest tools. This talk will show you how to build or extend your monitoring environment to be customer-focused instead of infrastructure-focused.

Monitoring at Service Provider Scale (Video) (Slides)

By Chris Jackson

Every organization has different needs for monitoring, so it’s an incredible challenge for service providers to satisfy all of these unique use cases. This short video expresses Rackspace’s idea to build open monitoring stacks on a few fundamental principles. The first principle is that organizations are not snowflakes, so their monitoring doesn't need to be highly customized.

Monitoring as a First Step to a New Service (Video) (Slides)

By Darron Froese

Most organizations bolt monitoring onto the end of the software delivery process, but that’s not a good idea, says Darron Froese. You need to monitor at the beginning of the process, when each package is still in its repository. You should plan and implement monitoring before there’s even data. This talk goes into detail about how to start using monitoring as the first step, not the last one.

Site Reliability Engineering (SRE)

The Keys to SRE (Video)

By Ben Treynor

Google was one of the first companies to spread the idea of having site reliability engineers (SREs). Ben Treynor was at the center of that shift in Google’s operations, and this talk distils the practices of SREs down to the fundamentals. There's a good companion interview available as well.

Performance Checklists for SREs (Video) (Slides)

By Brendan Gregg

A senior performance architect at Netflix reveals how the company handles outages when minutes matter. It uses the same tool that would be used in an aviation emergency: checklists.

Production Engineering, There Is No Spoon (Video) (Slides)

By Ran Leibman

Facebook has its own take on SREs, which it calls production engineers. These people work directly with software engineers, from road-mapping to being on call. Hear the inside story from one of Facebook’s production engineers.

Misconceptions & challenges

The Fallacy of Real-Time Analytics in Performance Monitoring (Video)

By Elizabeth Nichols

Real-time analytics are valuable, but only if you avoid a partial solution that merely produces noise and frustration. Check out this interesting vision for a more comprehensive real-time analytics monitoring platform.

Monitoring Challenges (Video) (Slides)

By Adrian Cockcroft

Adrian Cockcroft, former chief architect at Netflix, looks ahead to the latest architectures, such as microservices and serverless, and how they’re creating interesting monitoring challenges. He explains why monitoring vendors keep being disrupted, and concludes with simulation testing and ideas for how to test serverless architectures.

Root Cause, You're Probably Doing it Wrong (Video & Slides)

By TJ Gibson

TJ Gibson gives great advice for addressing the hardest problem in monitoring: not finding out that something is wrong, but finding out exactly what's causing it.

Those picks are a great starting point. If you have found any other useful videos in the field of monitoring, metrics, and reliability, share them in the comments below.
