At PuppetConf 2012, I had an epiphany while watching a talk by Google’s Jamie Wilkinson in which he live-hacked monitoring data in R. I can’t recommend his talk highly enough. As an analytics guy, I found that it blew my mind.

Since then, one thing has become clear to me: As we scale applications and start thinking of servers as cattle rather than pets, coping with the vast amounts of data they generate will require increasingly advanced approaches. That means over time, monitoring will require the integration of statistics and machine learning in a way that’s incredibly rare today, on both the tools and people sides of the equation.

It’s clear that the wall of dashboards, and the analysis paralysis it induces, doesn’t work. We’ve moved to an approach defined largely by on-demand alerting with tools like Nagios, Sensu, and PagerDuty. Most of the data is never viewed unless there’s a problem, in which case you investigate far more deeply than any overview or dashboard would ever show.

However, most alerting remains broken. It’s based on dumb thresholds rather than anything even the slightest bit smarter. You’re lucky if you can get something as advanced as alerting based on percentiles, let alone standard deviations or their robust alternatives (black magic!). With log analysis, it’s considered great if you can even manage basic pattern-matching to group together repetitive entries. Granted, this is a big step forward from manual analysis, but we’re still a long way from the moon.
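To make the gap between standard deviations and their robust alternatives concrete, here is a minimal Python sketch. It compares a classic z-score (mean and standard deviation) with a robust one built on the median and the median absolute deviation (MAD). The function names, the sample latencies, and the cutoff of 3 are illustrative assumptions, not taken from any of the tools named here.

```python
import statistics

# 1.4826 scales the MAD so it is comparable to a standard deviation
# when the underlying data is roughly normal.
MAD_SCALE = 1.4826


def zscore(history, value):
    """Classic z-score: distance from the mean in standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return (value - mean) / stdev


def robust_zscore(history, value):
    """Robust z-score: distance from the median in scaled-MAD units."""
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history])
    if mad == 0:
        # Degenerate case: at least half the history sits exactly on the median.
        return 0.0 if value == median else float("inf")
    return (value - median) / (MAD_SCALE * mad)


# Hypothetical request latencies in ms, including one old 5000 ms spike.
history = [100, 102, 98, 101, 99, 103, 97, 100, 5000, 101]
new_value = 400

CUTOFF = 3  # assumed alerting cutoff, in (robust) standard deviations
print("classic alert:", abs(zscore(history, new_value)) > CUTOFF)
print("robust alert: ", abs(robust_zscore(history, new_value)) > CUTOFF)
```

The single old spike inflates the mean and standard deviation so much that a 400 ms reading looks unremarkable to the classic score, while the median and MAD barely move and the robust score still flags it.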

This needs to change. As scale and complexity increase with companies moving to the cloud, to microservice architectures, and to transient containers, monitoring needs to go back to school for its Ph.D. to cope with this new generation of IT.

Exceptions are few and far between, and they often arrive as add-ons that many users haven’t realized exist: for example, Prelert (first for Splunk, now available as a standalone API engine too), or Bischeck for Nagios. Etsy open-sourced the Kale stack, which does some of this, but it wasn’t widely adopted. More recently, Numenta announced Grok, its own foray into anomaly detection, which looks quite impressive. And today, Twitter announced another R-based tool in its anomaly-detection suite. Many of you may be surprised to hear that, completely on the other end of the tech spectrum, IBM’s monitoring tools can do some of this too.

On the system-state side, we’re seeing more entrants, including Metafor, ScriptRock, and Opsmatic, that help deal with related problems like configuration drift. They take a variety of approaches at present. But it’s clear that in the long term, a great deal of intelligence will be required behind the scenes because it’s incredibly difficult to effectively visualize web-scale systems.

The tooling of the future applies techniques like adaptive thresholds that vary by day, time, and more; predictive analytics; and anomaly detection to do things like:

Avoid false-positive alerts that wake you up at 3am for no reason;

Prevent eye strain from staring at hundreds of graphs looking for a blip;

Pinpoint problems before they would hit a static threshold, like an instance gradually running out of RAM; and

Group together alerts from a variety of applications and systems into a single logical error.
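Two of the ideas in this list can be sketched in a few lines of Python: a threshold that adapts to the hour of day, and a trend projection that warns about memory exhaustion before a static threshold would fire. This is a rough illustration under stated assumptions, not how any particular product works; the function names, the sample data, and the multiplier `k` are all hypothetical.

```python
import statistics
from collections import defaultdict


def hourly_thresholds(samples, k=3.0):
    """Build one upper threshold per hour of day from (hour, value) samples.

    Each threshold is median + k * scaled MAD for that hour, so a value
    that is normal at 3pm can still be flagged at 3am.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    thresholds = {}
    for hour, values in by_hour.items():
        median = statistics.median(values)
        mad = statistics.median([abs(v - median) for v in values])
        thresholds[hour] = median + k * 1.4826 * mad
    return thresholds


def hours_until_exhaustion(used_mb, capacity_mb):
    """Fit a least-squares line to hourly memory readings and project
    how many hours after the last reading it reaches capacity.

    Returns None if there are too few readings or usage is not growing.
    """
    n = len(used_mb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(used_mb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_mb))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (capacity_mb - intercept) / slope - (n - 1)


# Hypothetical requests-per-second history: quiet at 3am, busy at 3pm.
samples = [(3, 10), (3, 12), (3, 11), (15, 100), (15, 104), (15, 102)]
thresholds = hourly_thresholds(samples)
print("50 req/s anomalous at 03:00?", 50 > thresholds[3])
print("50 req/s anomalous at 15:00?", 50 > thresholds[15])

# Hypothetical hourly memory readings (MB) on a 2000 MB instance.
print("hours left:", hours_until_exhaustion([1000, 1100, 1200, 1300, 1400], 2000))
```

A production system would need far more history, handling for weekly seasonality, and something better than a straight line for the forecast, but even this much goes beyond a single static number.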

DevOps or not, I’m running into more people and bleeding-edge vendors who are bringing a “data science” approach to IT. This is epitomized by the attendees of Jason Dixon’s Monitorama conference. Before long, that approach will be unavoidable in modern infrastructure.

Want to get started? You could do a lot worse than Coursera’s data-science specialization.

Disclosure: Prelert, Splunk, IBM, and ScriptRock are clients. Puppet Labs has been. Etsy, Metafor, Nagios Inc, Numenta, Opsmatic, Twitter, and PagerDuty are not.