Brian Brazil is a core developer of Prometheus and the founder of Robust Perception. He has over 10 years of experience managing and monitoring large-scale systems at companies such as Google. He now uses his skills to help organizations get the most out of their systems and staff through better approaches to operations, monitoring and systems design.

When discussing container monitoring, we need to talk about the word “monitoring.” There is a wide array of practices considered to be monitoring among users, developers and sysadmins in different industries. Monitoring, in an operational, container and cloud-based context, has four main use cases:

Knowing when something is wrong.

Having the information to debug a problem.

Trending and reporting.

Plumbing.

Let’s look at each of these use cases and how best to approach it.

Knowing When Something is Wrong

Alerting is one of the most critical parts of monitoring, but an important question to ask is: What problem is worth waking up an engineer in the middle of the night to look at? It’s tempting to create alerts for anything that’s a red flag or even slightly troublesome, but this can quickly lead to alert fatigue.

Let’s say you’re running a set of user-facing microservices, and you care about the latency of requests. Would central processing unit (CPU) usage on each machine be useful to alert on? Such an alert will likely tell you when you’re running out of CPU capacity on the machine. It will also have false positives when background processes take a little longer than usual, and false negatives for deadlocks or not having enough threads to use all CPUs.

The CPU is the potential cause of the problem, and high latency is the symptom you are trying to detect. In My Philosophy on Alerting, Rob Ewaschuk points out that there are many potential causes, and it’s difficult to enumerate all of them. It’s better to alert on the symptoms instead, as it results in fewer pages that are more likely to present a real problem worth waking someone up over. In a dynamic container environment where machines are merely a computing substrate, alerting on symptoms rather than causes goes from being a good idea to being essential.
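To make the distinction concrete, here is a minimal Python sketch of a symptom-based paging rule. The function names, the 500ms threshold and the choice of the 95th percentile are illustrative assumptions, not taken from any particular monitoring system.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for a whole-number pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Ceiling of pct * n / 100, computed with integer arithmetic.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_page(request_latencies_ms, threshold_ms=500, pct=95):
    """Page on the symptom users feel (slow requests), not on a cause such as CPU."""
    return percentile(request_latencies_ms, pct) > threshold_ms
```

A deadlocked service with idle CPUs would still trip a rule like this, while a busy background job that leaves request latency untouched would not.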

Having the Information to Debug a Problem

Your monitoring system now alerts you to the fact that latency is high. Now what do you do? You could log in to each of your machines, run the top command, check syslog and start tailing application logs. That’s not going to get you anywhere fast though, and it will lose effectiveness as your traffic and infrastructure grow. What your monitoring needs to provide is a way for you to approach problems methodically, giving you the tools you need to narrow down issues.

Microservices can typically be viewed as a tree, with remote procedure calls (RPCs) flowing from the top to the bottom. A problem of high latency in a service is usually caused by a delay in that service or one of its backends. Rather than trying to get inspiration from hundreds of graphs on a dashboard, you can go to the dashboard for the root service and check for signs of overload and delay in its backends. If the delay is in a backend, you repeat the process until you find the service responsible.

Figure 1: Component routing in a microservice.
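The drill-down through the service tree can be sketched in a few lines of Python. This is only an illustration: the Service class, its fields and the idea of comparing latency against a per-service baseline are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    latency_ms: float    # latency currently observed for this service
    baseline_ms: float   # latency considered normal for this service
    backends: list = field(default_factory=list)

    def is_slow(self):
        return self.latency_ms > self.baseline_ms


def find_culprit(root):
    """Start at the root service and keep descending into the worst slow backend."""
    current = root
    while True:
        slow = [b for b in current.backends if b.is_slow()]
        if not slow:
            # No backend is slow, so the delay is in the current service itself.
            return current.name
        current = max(slow, key=lambda b: b.latency_ms - b.baseline_ms)
```

Each step of the loop corresponds to opening one dashboard, so the number of dashboards you look at grows with the depth of the tree rather than with the total number of services.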

That process can be taken a step further. Just like how your microservices compose a tree, the subsystems, libraries and middleware inside a single microservice can also be expressed as a tree. The same symptom identification technique can then be applied to further narrow down the issue. To continue debugging from here, you’ll likely use a variety of tools to dig into the process internals, investigate patterns in request logs and cross-correlate requests across machines.

Trending and Reporting

Alerting and debugging tend to be on the timescale of minutes to days. Trending and reporting care about the weeks-to-years timeframe.

A well-used monitoring system collects all sorts of information, from raw hardware utilization and counts of API requests to high-level business metrics. There are the obvious use cases, such as provisioning and capacity planning to be able to meet future demand, but beyond that there’s a wide selection of ways that data can help make engineering and business decisions.

Knowing how similar requests are to each other might point to the benefit of a cache, or it might help argue for removing a cache for simplicity. Knowing how each request uses your limited resources can help determine your pricing model. Cross-service and cross-machine statistics can help you spend your time on the best potential optimizations. Your monitoring systems should empower you to make these analyses possible.

Plumbing

When you have a hammer, everything starts to look like a nail.

Plumbing is different from the other use cases, as it’s about getting data from system A to system B, rather than directly supporting responsive decision making. An example might be sending data on the number of sales made per hour to a business intelligence dashboard. Plumbing is about facilitating that pipeline, rather than what actions are taken from the end result. It’s not necessarily monitoring; however, it’s often convenient to use your monitoring system to move some data around to where it needs to go.

If building a tailored solution from scratch could take weeks, and it’s effectively free to use your monitoring system for the same thing, then why not? When evaluating a monitoring system, don’t just look at its ability to do graphing and alerting, but also how easy it is to add custom data sources and extract your captured data later.

Classes of Monitoring

Now that we’ve established some of what monitoring is about, let’s talk about the data being inserted into our monitoring systems.

At their core, most monitoring systems work with the same data: events. Events are all activities that happen between observation points. An event could be an instruction being executed, a function call being made, a request being routed, a remote procedure call (RPC) being received or a response being returned. Events have contextual information, such as what triggered them and what data they’re working with.

We’re going to look at four different ways to use events; each approach makes different tradeoffs and gives you a different view of the system. A complete container monitoring system will have aspects of each approach.

Metrics

Metrics, sometimes called time series, are concerned with events aggregated across time. They count how often each type of event happens, how long each type of event takes and how much data was processed by the event type.

Metrics largely don’t care about the context of the event. You can add context, such as breaking out latency by HTTP endpoint, but then you need to spend resources on a metric for each endpoint. In this case, the number of endpoints would need to be relatively small. This limits the ability to analyze individual occurrences of events; however, in exchange, it allows for tens of thousands of event types to be tracked inside a single service. This means that you can gain insight into how code is performing throughout your application.
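As a rough sketch of that tradeoff, the Python class below aggregates request latency per endpoint. The class and method names are made up for illustration and do not mirror any specific instrumentation library.

```python
from collections import defaultdict


class LatencyMetric:
    """Aggregates latency per endpoint; the individual events are discarded."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def observe(self, endpoint, seconds):
        # Only two running numbers are kept per endpoint, however many events occur.
        self.count[endpoint] += 1
        self.total_seconds[endpoint] += seconds

    def mean(self, endpoint):
        return self.total_seconds[endpoint] / self.count[endpoint]

    def series(self):
        """Resource cost grows with the number of distinct endpoints, not events."""
        return len(self.count)
```

A million requests to two endpoints cost two series. Breaking the same metric out by user ID instead would cost one series per user, which is why the context you add to metrics has to stay small.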

We’re going to dig a bit deeper into the constituent parts of metrics-based monitoring. If you’re only used to one or two systems, you may not be aware of the possibilities and tradeoffs that can be made.

Figure 2: The architecture of gathering, storing and visualizing metrics.

Collection

Collection is the process of converting the system state and events into metrics, which can later be gathered by the monitoring system. Collection can happen in several ways:

Completely inside one process. The Prometheus and Dropwizard instrumentation libraries are examples; they keep all state in memory of the process.

By converting data from another process into a usable format. collectd and Agentless System Crawler do this by pulling data from the proc filesystem.

By two processes working in concert: one to capture the events and the other to convert them into metrics. StatsD is an example, where each event is sent from an application over the network to StatsD.
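The third style can be sketched with the StatsD line format, where a counter event is written as name:value|c and an optional |@rate suffix marks client-side sampling. The metric name, host and port below are assumptions for the example (8125 is the conventional StatsD port), and send_event is a hypothetical helper rather than part of any client library.

```python
import socket


def statsd_counter(name, value=1, sample_rate=1.0):
    """Format one counter event in the StatsD line format."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # Tells the receiver that only a fraction of events is being sent.
        line += f"|@{sample_rate}"
    return line.encode("ascii")


def send_event(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP gives no delivery guarantee and no backpressure."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```

Because every event becomes a network packet, the work grows with traffic. An in-process library would instead increment a counter in memory and expose only the total.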

Ingestion

Ingestion takes metrics from collection and feeds them into the monitoring system. This can be a multi-stage process involving a queueing system, such as Apache Kafka, or a simple data transfer directly from collection. It’s at this point that the push versus pull debate must be mentioned. Both approaches have advantages and disadvantages. We can’t cover the extent of this debate in these pages, but the short version is that both approaches can be scaled and both can work in a containerized environment.

Storage

Once data is ingested, it’s usually stored. It may be short-term storage of only the latest results, but it could be any amount of minutes, hours or days worth of data storage.

Once stored data goes beyond what easily fits in memory on one machine, there are operational and reliability tradeoffs to be made, and again there are pros and cons based on what the organization requires from its monitoring data. Persisting data beyond the lifetime of a process on disk implies either a need for backups or a willingness to lose data on machine failure. Spreading the data among multiple machines brings with it the fundamental challenges of distributed systems. It’s not difficult to end up with a system where existing data is safe, but new data cannot be ingested and processed.

Processing and Alerting

Data isn’t of much use if you don’t do anything with it. Most metrics systems offer some way to do math on ingested data, and usually also offer a way to alert humans of anomalous conditions. This may happen as the data is ingested or as a separate asynchronous process.

The sophistication of processing varies greatly between solutions. On one end, Graphite has no native processing or alerting capability without third-party tools; however, basic aggregation and arithmetic are possible when graphing. On the other end, there are solutions like Prometheus or Sysdig with not only fully fledged processing and alerting systems, but also an additional aggregation and deduplication system for alerts.

Visualization

Alerts arriving at your pager are fine, but for debugging, reporting and analysis you want dashboards to visualize that data.

Visualization tools tend to fall into three categories. At the low end, you have built-in ways to produce ad-hoc graphs in the monitoring system itself. In the middle, you have built-in dashboards with limited or no customization. This is common with systems designed for monitoring only one class of system, and where someone else has chosen the dashboards you’re allowed to have. Finally, there are fully customizable dashboards where you can create almost anything you like.

How They Fit Together

Now that you have an idea of the components involved in a metrics monitoring system, let’s look at some concrete examples of the tradeoffs made by each.

Nagios

The Nagios server usually calls out to scripts on hosts — called checks — and records if they work according to their exit code. If a check is failing too much, it sends out an alert. Visualization is typically offered by a separate built-in dashboard. It can ingest 1KB of data, including metrics (called “perfdata”), from the script and pass it on to another monitoring system.

Figure 3: Metrics handling with Nagios.
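A check is easy to picture as code. The sketch below follows the usual Nagios plugin conventions, where exit code 0 means OK, 1 means WARNING and 2 means CRITICAL, and perfdata follows a pipe character in the output line. The latency thresholds and the check itself are assumptions made for illustration.

```python
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_latency(latency_ms, warn_ms=300, crit_ms=500):
    """Return the exit code and the single output line a check script would emit."""
    if latency_ms >= crit_ms:
        status = CRITICAL
    elif latency_ms >= warn_ms:
        status = WARNING
    else:
        status = OK
    # Everything after the pipe is perfdata: label=value;warn;crit
    perfdata = f"latency={latency_ms}ms;{warn_ms};{crit_ms}"
    return status, f"LATENCY {LABELS[status]} - {latency_ms}ms | {perfdata}"
```

A real check script would print the output line and exit with the status code, which is all the Nagios server sees of the result.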

Nagios is designed for static setups and requires a restart to load a new configuration. Its limited processing, focus on host-based monitoring, and ability to handle only small amounts of metrics data make it unsuitable for monitoring in a container environment. However, it remains useful for basic blackbox monitoring.

collectd, Graphite and Grafana

Many common monitoring stacks combine several components. A collectd, Graphite and Grafana combination is one example. collectd is the collector, pulling data from the kernel and third-party applications such as MySQL. To collect custom metrics from your own applications, you’d use the StatsD protocol, which sends User Datagram Protocol (UDP) packets to collectd for individual events. collectd sends metrics to Carbon, which uses a Whisper database for storage. Finally, both Graphite and Grafana themselves can be used for visualization.

Figure 4: An example monitoring stack composed of collectd, Graphite, Grafana.

The StatsD approach to collection is limiting in terms of scale; it’s not unusual to choose to drop some events in order to gain performance. The collectd per-machine approach is also limiting in a containerized environment. For example, if there are MySQL containers dynamically deployed, then the per-machine collectd needs its configuration updated each time.

As alerting is not included, one approach is to have a Nagios check for each individual alert you want. The storage for Graphite can also be challenging to scale, and since those checks read from it, your alerting is dependent on your storage being up.

Prometheus

Prometheus takes a different approach than our previous examples. Collection happens where possible inside the application. For third-party applications where that’s not possible, rather than having one collector per machine, there’s one exporter per application. This approach can be easier to manage, at the cost of increased resource usage. In containerized environments like Kubernetes, the exporter would be managed as a sidecar container of the main container. The Prometheus server handles ingestion, processing, alerting and storage. However, to avoid tying a distributed system into critical monitoring, the local Prometheus storage is more like a cache. A separate, non-critical distributed storage system handles longer term storage. This approach offers both monitoring reliability and durability for long-term data.

Figure 5: Metrics handling in Prometheus.

While Prometheus decides what alerts to fire, it does not send emails or pages to users. Alerts are, instead, sent to an Alertmanager, which deduplicates and aggregates alerts from multiple Prometheus servers, and sends notifications.
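The deduplication and aggregation step can be illustrated with a short Python sketch. This is not how Alertmanager is implemented; it only shows the idea, and representing an alert as a plain dictionary of labels is an assumption made for the example.

```python
from collections import defaultdict


def deduplicate(alerts):
    """Keep one copy of each alert, treating identical label sets as the same alert."""
    unique = {}
    for alert in alerts:
        unique.setdefault(frozenset(alert.items()), alert)
    return list(unique.values())


def group_by(alerts, label):
    """Aggregate alerts that share a label value so they go out as one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.get(label, "")].append(alert)
    return dict(groups)
```

If two redundant servers both fire the same alert for the frontend, deduplication collapses them into one, and grouping by alert name turns a burst of related alerts into a single page.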

Sysdig Cloud

The previous sections show how various open source solutions are architected. For comparison, this section describes the architecture of Sysdig Cloud, a commercial solution. Starting with instrumentation, Sysdig Cloud uses a per-host, kernel-level collection model. This instrumentation captures application, container, StatsD and host metrics with a single collection point. It collects event logs such as Kubernetes scaling, Docker container events and code pushes to correlate with metrics. Per-host agents can reduce the resource consumption of monitoring and require no modification to application code. They do, however, require a privileged container.

Figure 6: Taking a look at Sysdig’s commercial architecture.

The Sysdig Cloud storage backend consists of horizontally scalable clusters of Cassandra (metrics), MySQL (events), and Redis (intra-service brokering). Building on these components gives high reliability and scale to store years of data for long term trending and analysis. All data is accessible by a REST API. This entire backend can be used via Sysdig’s cloud service or deployed as software in a private cloud for greater security and isolation. This design allows you to avoid running one system for real-time monitoring and another system for long-term analysis or data retention.

In addition to handling metrics data, Sysdig Cloud also collects other types of data, including event logs from Kubernetes and Docker containers, and metadata from orchestrators such as Kubernetes. This is used to enrich the information provided from metrics, and it’s not unusual for a metrics system to have integrations beyond what’s purely required for metric processing.

Logs

Logs, sometimes called event logs, are all about the context of individual events. How many requests went to an endpoint? Which users are using or calling an endpoint?

Logs make the opposite tradeoff to metrics. They don’t do any aggregation over time. This limits them to tracking around fifty to a hundred pieces of information per event before bandwidth and storage costs tend to become an issue. Even with this limitation, logs usually allow you to find patterns in individual requests, such as if particular users are hitting expensive code paths.
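As a small illustration of the kind of question logs can answer, the sketch below scans request logs for users who hit slow requests. It assumes one JSON object per line with hypothetical user and duration_ms fields; real request logs will carry many more fields.

```python
import json
from collections import defaultdict


def slow_requests_by_user(log_lines, slow_ms):
    """Count, per user, the requests that took longer than slow_ms."""
    counts = defaultdict(int)
    for line in log_lines:
        event = json.loads(line)  # each line carries the full context of one event
        if event["duration_ms"] > slow_ms:
            counts[event["user"]] += 1
    return dict(counts)
```

A metric could tell you that some requests were slow. Only the per-event context in the logs tells you that they all came from the same user.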

It’s important to distinguish the type of logs you are working with, as they have a variety of different uses and reliability requirements:

Business and transaction logs: These are logs you must keep safe at all costs. Anything involved with billing is a good example of a business or transaction log.

Request logs: These are logs of every request that comes through your system. They’re often used in other parts of the system for optimization and other processing. It’s bad to lose some, but not the end of the world.

Application logs: These are logs from the application regarding general system state. For example, they’ll indicate when garbage collection or some other background task is completed. Typically, you’d want only a few of these log messages per minute, as the idea is that a human will directly read the logs. They’re usually only needed when debugging.

Debug logs: These are very detailed logs to be used for debugging. As these are expensive and only needed in specialized circumstances, they have lower reliability requirements than application logs.

The next time someone talks to you about logs, think about which type of logs they’re talking about in order to properly frame the conversation.

Profiling

Profiling has the advantages of both metrics and logs. It lets you see data about individual events throughout the entire application. The disadvantage is that this tends to be very expensive to do, so it can only be applied tactically.

For example, logs have told you that a user is hitting an expensive code path, and metrics have let you narrow down which subsystem is the likely culprit. Your next step is to profile that subsystem and see in which exact lines of code the CPU is being spent.

There are a variety of Linux profiling tools, including eBPF, gdb, iotop, strace, tcpdump and top. There are also commercial options, like Sysdig, which combine the functionality of several of these tools into one package. You can use some of these tools on an ongoing basis, in which case they would fall under metrics or logs.

Distributed Tracing

Let’s say you have a system with a frontend running at 310ms latency in the 95th percentile. You receive an alert saying the frontend 95th percentile latency has increased to 510ms! What do you think you’ll see in the 95th percentile latency of the culprit backend?

Figure 7: An example system’s latency mapping.

The answer is that you might see an increase of the same size as on the frontend, but you might not. There could be no change or even a decrease in latency. It all depends on the correlations of the latencies. Remember, the 95th percentile is effectively throwing 95 percent of data away, so you won’t notice changes outside of that 5 percent.
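A small worked example shows how this can happen. All numbers below are invented for illustration, and the model is an assumption: the frontend latency of each request is simply the sum of the latencies of two backends. Before the change, the slow calls of the two backends fall on different requests; afterward, they fall on the same ones.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for a whole-number pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


N = 100
# Backend 1 is slow (210ms) for requests 0-9 and fast (10ms) otherwise.
backend1 = [210 if i < 10 else 10 for i in range(N)]
# Backend 2 is slow (300ms) for 10 percent of requests in both scenarios;
# only which requests are slow differs.
backend2_before = [300 if 10 <= i < 20 else 100 for i in range(N)]
backend2_after = [300 if i < 10 else 100 for i in range(N)]


def frontend(b1, b2):
    """Hypothetical frontend that calls both backends one after the other."""
    return [x + y for x, y in zip(b1, b2)]


p95_frontend_before = percentile(frontend(backend1, backend2_before), 95)
p95_frontend_after = percentile(frontend(backend1, backend2_after), 95)
p95_backend2_before = percentile(backend2_before, 95)
p95_backend2_after = percentile(backend2_after, 95)
```

With these numbers the frontend 95th percentile moves from 310ms to 510ms, while the 95th percentile of Backend 2 stays at 300ms in both scenarios. Nothing changed in either backend’s latency distribution, only in how their slow requests line up.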

What’s going on here isn’t obvious from the latency graphs, and that’s where distributed tracing comes in. It’s a form of logging and profiling. It is particularly useful in environments such as those using containers and microservices with a lot of inter-service communication.

How it works is that each individual incoming request gets a unique identifier. As the request passes through different services on different machines, that information is logged with the identifier. Each request is then stitched back together from the logs to see exactly where time was spent for each request. Due to cost of the logging, it’s sometimes only possible to trace a subset of incoming requests.

The result is a visualization of when each backend in your tree of services was called, allowing you to see where time is spent, what order requests are made in and which RPCs are on the critical path. Applied to the example in Figure 7, you’d notice that all the fast requests only hit Backend 1, while the slow requests are hitting both backends. This would tip you off that it’s the logic about communicating with Backend 2 that you need to evaluate. While this is a simple example, imagine how long it’d take you to figure out if you had dozens of services without distributed tracing.
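The identifier-and-stitch mechanism can be sketched as follows. The record layout (trace_id, service, start_ms, end_ms) and the function names are assumptions for the example and do not follow any particular tracing system.

```python
import uuid
from collections import defaultdict


def new_trace_id():
    """Unique identifier assigned to a request when it first enters the system."""
    return uuid.uuid4().hex


def stitch(span_logs):
    """Group logged spans by trace ID and order each trace by start time."""
    traces = defaultdict(list)
    for span in span_logs:
        traces[span["trace_id"]].append(span)
    for spans in traces.values():
        spans.sort(key=lambda s: s["start_ms"])
    return dict(traces)


def time_per_service(spans):
    """Show where the time of one stitched request was spent."""
    return {s["service"]: s["end_ms"] - s["start_ms"] for s in spans}
```

Each service only has to log its own span along with the identifier it received. The picture of the whole request is assembled afterward from logs collected across machines.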

Conclusion

In this article, we’ve covered the use cases for monitoring, which should help you understand the problems that can be solved with monitoring. We learned about the four different ways for using events: metrics, logs, profiling and distributed tracing. In breaking down the metrics-based approach, we looked at how data is collected, ingested, stored, processed, alerted and visualized.

Now that you have a better feel for the types of monitoring systems and the problems they solve, you will be able to thoroughly evaluate many different solutions. There are many approaches to designing a monitoring system, and each has its own advantages and disadvantages. When looking to evaluate a monitoring solution, first assess whether it’s primarily based on metrics, logs, profiling or distributed tracing. From there, see what features it has that’ll fit into your overall monitoring strategy, in terms of alerts requiring intelligent human action, the information you need to debug, and integration with your systems.

Each solution has its pros and cons, and you’ll almost certainly need more than one tool to create a comprehensive solution for monitoring containers.

Feature image via Pixabay.