The various stages of IT monitoring maturity (a framework for enterprise IT stakeholders)

Posted on March 04, 2014

Automated monitoring of any critical IT service goes through a fairly predictable cycle of refinement, regardless of the size of the organization or the complexity of the application. The problem is that monitoring - even once defined conceptually - is still a very large area, and the term itself is very generic.

I find it useful to understand the various key stages so that I can recognize where an organization is in its level of maturity, as well as to introduce some tangible milestones around where the stakeholders want to go.

In an attempt to create a common understanding I've pulled together this reference post. It includes the various types of monitoring, main coverage points, and typical implementation approaches and protocols involved. I also mention a few of the most commonly overlooked items in each stage.

This article is not meant to be the be-all and end-all. My aim is that establishing guide posts improves discussions and moves all stakeholders forward. The point isn't to create rules, but a framework upon which forward progress can be made.

This reference should be beneficial to technologists and those who work with technologists to support their businesses. It is slightly technical, but most of those details can be skimmed without a loss of benefit. Use this when planning monitoring improvements in your own organization. Don't hesitate to shoot me any suggestions or thoughts. I'd love to hear how anyone uses this in their own planning or within their organization!

The Stages

What's important isn't where you start, or even where you are today, but where you end up. Generally, it makes business sense to move through the phases - downward on this list - as time, money, energy, and focus permit.

Before we get to the first Stage, I really should mention "Stage 0." This is the stage of "we wait for people to call us and tell us something is down." I don't really consider this an implementation phase, but it's important to mention since it is where most organizations start out. Some move quickly away from this ...while others seem to stick with it despite the costs.[1]

Stage 1: Ping response checks

These are designed to answer the eternal query: "Is the server up?" Hopefully the response is "Yes, look the server is up!" but if it's not you'll be able to catch it quicker than if you waited around to notice it yourself or, worse, for a customer/user to contact you to report the outage.

The key word here is server. Basic ping response checks are the bare minimum to be able to say you're monitoring things. It's hard to consider yourself a professional shop without having this[2]. Unless, that is, you hire an OCD technician who works around the clock and doesn't mind repetitive strain injuries[3].

Covers:

Servers (Physical or Virtualized)

Network Paths/Routes

Network Endpoints

Implementation:

Ping (built into every operating system)

Any off-the-shelf basic monitoring solution (software or service)

Typical Protocols:

ICMP

Commonly Overlooked:

Hops just beyond the "next hop" (e.g. the point _just_ past your ISP's directly connected router)

Auxiliary but still important/vital servers (email, authoritative DNS servers, recursive DNS servers)

Misc critical network endpoints (e.g. WAN links, Internet connections)
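A Stage 1 sweep over hosts like the ones above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real monitoring tool: the `-c`/`-W` flags assume a Linux `ping` binary, and the injectable `probe` parameter is my own addition so the sweep logic can be exercised without a network.

```python
import subprocess

def ping(host: str, timeout_s: int = 2) -> bool:
    """Return True if the host answers a single ICMP echo.

    Shells out to the OS ping binary; the -c (count) and -W (timeout)
    flags assume the Linux implementation of ping.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    except FileNotFoundError:  # no ping binary available
        return False

def sweep(hosts, probe=ping):
    """Probe every host and return the list of hosts that appear down."""
    return [h for h in hosts if not probe(h)]
```

In practice you would run `sweep` on a schedule and alert on any non-empty result - which is essentially all that a basic off-the-shelf solution does at this stage.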

Stage 2: Service checks

This is designed to answer the imposing query: "Ah, the server is up, but is the service?" Hopefully the response here is "Yes, the web HTTP service is responding, so the web site is up!"[4]

This stage itself often goes through two sub-stages of refinement:

Front-end (e.g. user-facing services such as a web service)

Back-end (e.g. a database service and any other services not directly accessed by users)

The service checks themselves are very basic. The main focus is on making sure a connection of the appropriate type is accepted on the TCP/UDP port associated with the service being monitored. A slightly smarter check may also look for an appropriate banner message or other response indicator, since sometimes ports connect but the service behind them is non-responsive.

Service checks generally include some performance information (time to respond) as well, but it is rudimentary and you won't have any control (yet) over what it's really testing the performance of.
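A Stage 2 check - port connection, optional banner verification, and the rudimentary response-time measurement mentioned above - might be sketched like this (a minimal illustration using only the standard library):

```python
import socket
import time

def check_tcp_service(host, port, expect_banner=None, timeout=3.0):
    """Connect to a TCP service port; optionally verify its greeting banner.

    Returns (ok, response_seconds). A port that accepts the connection but
    sends the wrong banner counts as down - sometimes ports connect while
    the service behind them is non-responsive.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if expect_banner is not None:
                sock.settimeout(timeout)
                banner = sock.recv(128)
                if not banner.startswith(expect_banner):
                    return False, time.monotonic() - start
        return True, time.monotonic() - start
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False, time.monotonic() - start
```

For example, an SMTP check could pass `expect_banner=b"220"` so that a mail server that accepts connections but never greets is still flagged. Note the response time here only measures the connection/banner exchange from wherever the check runs - the "you won't have any control over what it's really testing" caveat above.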

Covers:

Any TCP or UDP based service - e.g. HTTP, HTTPS, SMTP, DNS, SQL, etc.

Implementation:

Any off-the-shelf basic monitoring solution (software or service) that does more than ping servers (i.e. needs to be able to connect to service ports to see if they answer)

Typical Protocols:

TCP

UDP

Commonly Overlooked:

Auxiliary but still important/vital services (e.g. email, authoritative DNS servers, recursive DNS servers)

Third-party or client-software used APIs

Back-end services that are dependencies for front-end services (e.g. databases)

Stage 3: Interactive checks

Eventually the question from management shifts to: "Yes, the web site is up, but can users do stuff with it? Can they log-in? Buy stuff?" It doesn't take very long[5] for this question to come up. The end result is usually some frustration and embarrassment, followed by a period of overhauling the current monitoring solution.
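One common way to structure interactive checks is as a scripted transaction: a sequence of named steps that must all succeed, so an alert can say exactly which step broke ("login" vs. "checkout"). The sketch below assumes nothing beyond the standard library; the login URL, form fields, and "Welcome" marker are hypothetical placeholders for your real flow.

```python
import urllib.parse
import urllib.request

def run_transaction(steps):
    """Run named check steps in order, stopping at the first failure.

    Each step is a (name, callable) pair where the callable returns True
    on success. Returns (passed, name_of_failed_step_or_None).
    """
    for name, check in steps:
        try:
            if not check():
                return False, name
        except Exception:  # a crashed step is a failed step
            return False, name
    return True, None

def login_check():
    # Hypothetical URL and form fields - substitute your real login flow.
    data = urllib.parse.urlencode({"user": "probe", "password": "x"}).encode()
    with urllib.request.urlopen("https://example.com/login", data, timeout=10) as r:
        return b"Welcome" in r.read()

transaction = [
    ("log in", login_check),
    # ("add item to cart", ...), ("check out", ...), etc.
]
```

Running `run_transaction(transaction)` on a schedule, with a dedicated probe account, is the essence of what "synthetic" or "interactive" monitoring products automate for you.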

Covers:

Any TCP or UDP based service - e.g. HTTP, HTTPS, SMTP, DNS, SQL, etc.

Implementation:

Any off-the-shelf monitoring solution (software or service) that goes beyond the basic checks of the earlier stages

Key Protocols:

HTTP

HTTPS

TCP

Commonly Overlooked:

VoIP

Back-end services that are dependencies of front-end services (e.g. specific databases/queries, third-party or inter-application APIs)

Auxiliary but still important/vital services (e.g. email sending, email receiving, authoritative DNS server queries, recursive DNS server queries)

Any sort of user interaction beyond the basics of logging in

Stage 4: Server performance monitoring

Eventually when problems occur the question shifts to: "Why?" Or someone wants to do some capacity or upgrade planning. In these situations it helps to have deeper visibility - e.g. CPU use, disk I/O, and memory consumption - and to have a way of looking at the real-time and trending utilization. Having data to point at is the only way to build real business cases for investments.
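A toy version of this kind of sampling, using only the Python standard library on a Unix-like host (the metric names and threshold scheme are my own; real deployments typically use SNMP, a vendor agent, or a library like psutil for the fuller list below):

```python
import os
import shutil

def sample_host_metrics(path="/"):
    """Snapshot a couple of basic host metrics.

    load_1m is the 1-minute run-queue average (Unix only);
    disk_used_pct is percent of the filesystem holding `path` in use.
    """
    load_1m, _, _ = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load_1m,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }

def over_threshold(metrics, limits):
    """Return the metric names that exceed their configured limits."""
    return [k for k, v in metrics.items() if k in limits and v > limits[k]]
```

Storing each sample with a timestamp is what turns this from a point-in-time alarm into the trending data you need for capacity planning and business cases.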

Covers:

CPU

Disk I/O

Memory

Swap space

TCP connections

...sometimes others... pretty much anything that can be pulled from any sub-system of the operating system or hardware (physical or virtualized)

Implementation:

Some basic monitoring solutions

Advanced off-the-shelf monitoring solutions

Key Protocols:

Proprietary

OS specific

In-house (scripts)

SNMP

Commonly Overlooked:

Disk I/O

Memory

Stage 5: Network performance/utilization monitoring

Sometimes services go down because of network issues. Network links may go down, but that should have already been covered in Stage 1 above. Here we are concerned about critical links[6] that may become heavily utilized unexpectedly.

If you are getting alerts, but all your servers and services appear fine, look at the network. Better yet, have your monitoring solution tell you it's the network and not your application so you can get to work fixing it sooner. :-) This is also a good point to reevaluate whether any devices that should be monitored aren't (e.g. random non-core switches), and to beef up network-related monitoring in general - namely logs and error counters for individual interfaces.
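A detail worth knowing here: SNMP interface counters such as `ifInOctets` are monotonically increasing totals, so utilization has to be derived from two samples - and the 32-bit counters wrap quickly on fast links, which naive subtraction turns into a bogus negative spike. A sketch of the arithmetic (the function name is mine; your tool does this internally):

```python
def link_utilization_pct(octets_t0, octets_t1, interval_s, speed_bps,
                         counter_bits=32):
    """Percent utilization of a link from two octet-counter samples
    (e.g. SNMP ifInOctets), tolerating one counter wrap between samples."""
    delta = octets_t1 - octets_t0
    if delta < 0:  # counter wrapped around its maximum between samples
        delta += 2 ** counter_bits
    bits_per_sec = delta * 8 / interval_s
    return 100.0 * bits_per_sec / speed_bps
```

This is also why 64-bit "high capacity" counters (`ifHCInOctets`) exist - on a gigabit link a 32-bit counter can wrap in under a minute, so your polling interval has to be short enough to catch at most one wrap.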

Covers:

Routers

Switches

Key WAN/LAN Hand-off points (other networks such as ISPs, switch trunks, and mission critical server hand-offs)

Implementation:

Basic network device/link off-the-shelf monitoring solutions

Any advanced off-the-shelf monitoring solutions

Key Protocols:

SNMP

NetFlow

SSH & Telnet

Syslog

Commonly Overlooked:

Error counters on interfaces

Logs, often containing leading indicators of potential problems ...as well as lagging clues as to root causes of already known problems

Layer 2 switches

Firewalls

Switch ports

Switch trunk links

Unmanaged switches[7]

QoS policies

Stage 6: Application performance monitoring

Sometimes the problem is in your code (or someone else's). Sometimes there isn't a problem - yet - but if you only had visibility you'd know that a particular database query that gets made regularly was accounting for 80% of the page load time for every visitor. That sort of thing. It's a big deal, and this is the holy grail of monitoring for most folks. This stage can also include things like correlating logs to events.
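At its core, most application performance monitoring is per-call timing attributed to a name - which is exactly what answers "which query accounts for 80% of page load time?". A toy version of that core idea, assuming nothing beyond the standard library (real APM agents add tracing, sampling, and aggregation on top):

```python
import time
from collections import defaultdict

TIMINGS = defaultdict(list)  # function name -> list of call durations (s)

def timed(fn):
    """Decorator: record the wall-clock duration of every call."""
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            TIMINGS[fn.__name__].append(time.monotonic() - start)
    return wrapper

def share_of_total(name):
    """Fraction of all recorded time spent in one instrumented function."""
    total = sum(sum(durations) for durations in TIMINGS.values())
    return sum(TIMINGS[name]) / total if total else 0.0
```

Decorate the functions that run your queries and render your pages, and `share_of_total` tells you where the page-load time actually goes - the "if you only had visibility" part above.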

Covers:

Your own apps

Other people's apps that you host/manage

Implementation:

Proprietary

Language/platform specific application performance monitoring connectors/plug-ins/solutions

Off-the-shelf multi-platform application performance monitoring solutions

Key Protocols:

Proprietary

In-house

Commonly Overlooked:

APIs

Stage 6.5: Resilient and Reliable Monitoring

Somewhere amid the above stages the idea of the reliability of monitoring itself will become important.

One of the first problems that used to come up - back when all monitoring was done on-site/in-house by default (because it was the only choice) - was that the monitoring couldn't really be trusted. That is, a user would call and say your very critical web server is down. You glance over at your fancy monitoring page/app and see the following:

Web server: Green (good!)

HTTP service: Green (good!)

Internet link: Green (good!)

You conclude: must be a problem with the user. Only you're wrong. Your monitoring is good, but its probing is not diverse enough. Its coverage is only sufficient to tell you how things look from wherever it is probing from. That's it. Unfortunately, you have users located in other places and on all sorts of dubious Internet connections.

Things are good from where the monitoring is being done, but not from where your users are. Your monitoring is also 100% reliant on everything being good on the network it happens to be on and at the location it is at. If it's running on a single host or cluster of servers, that's also a single point of failure.

This is when you consider expanding the monitoring platform to include geographic diversity and Internet connectivity diversity (since different providers may have problems getting to your services at any given point in time, even if your own Internet link is "up"). This is also where you start to realize how valuable and critical monitoring has become to the business. Thus, having some redundancy in the monitoring platform itself may be a good thing as well.
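Once you have multiple vantage points, the alerting logic changes too: a single failing probe location more often means a local problem at that vantage point, so you only page when a quorum of locations agrees the service is down. A sketch of that decision (the 50% threshold and location names are arbitrary examples):

```python
def service_is_down(probe_results, quorum=0.5):
    """Decide whether a service is down from geographically diverse probes.

    probe_results maps a probe location name to True (service reachable
    from there) or False. Returns True only when more than `quorum` of
    the locations report failure.
    """
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures / len(probe_results) > quorum
```

The same structure answers the "user says it's down, dashboard says it's up" scenario above: one green in-house probe no longer outvotes three failing external ones.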

Covers:

Outages within your Internet provider(s) other than on the link directly connected to you

Outages elsewhere on the Internet

Weird network issues (e.g. MTU issues) that only arise "out there"

Losing all visibility into your IT services/assets when your standalone/single point of failure monitoring solution goes offline or is inaccessible for whatever reason

Implementation:

External monitoring services

PoP to PoP

Key Customers

Partners (monitoring swaps)

Key Protocols:

VPNs

TLS/SSL

Whatever You Are Monitoring

Commonly Overlooked:

Different offices/locations

Different Internet providers

Top Internet providers typically used by customers/users

Important Router-to-Router VPN links


If you enjoyed this, you're invited to subscribe to be notified when I post similar items. I also invite you to connect with me by email or on Twitter if you have a comment, idea, or question.