Monitoring systems help DevOps teams detect and solve performance issues faster. With Docker still on the rise, it’s important to get container monitoring right from the start.

This is no easy feat, though. Monitoring Docker containers is complex, and developing a strategy and building an appropriate monitoring system around it is far from simple.

In this post, we’re going to delve deep into what container monitoring is and why you need it. We’ll also cover how it works, what to monitor, and the best open-source container monitoring tools available today. They may not have full-blown features like the Sematext Docker monitoring integration or Datadog, but keep in mind they’re open-source products and can hold their own just fine.

What is Container Monitoring?

Container monitoring is the process of tracking the performance of applications built on microservices architectures by collecting and analyzing performance metrics. Due to their ephemeral nature, containers are more challenging to keep an eye on than traditional applications running on virtual servers or bare-metal servers. Still, monitoring is a must – a critical step in ensuring the optimal performance of containers running in different environments. Learn why you need monitoring, whether you're working with containers, microservices, or any other type of app, from our Complete guide to alerting and monitoring.

Why Should You Monitor Docker Containers?

Ongoing monitoring is essential to ensure peak app performance, containerized or otherwise. When it comes to Docker containers, however, monitoring helps you to:

Detect and solve issues early and proactively to avoid risks at the production level

Implement changes safely as the whole environment is monitored

Fine-tune applications to deliver improved performance and better user experience

Optimize resource allocation

How Does Container Monitoring Work: What Metrics to Monitor

Monitoring containers is not that different from monitoring traditional deployments: in both cases you need metrics, logs, service discovery, and health checks. However, it's more complex due to containers' dynamic and multi-layered nature. A good container monitoring solution, though, can navigate through all the layers within a stack.

First, with monitoring, you get an overview of basic performance metrics such as memory utilization and CPU usage, as well as of container-specific metrics like CPU limit and memory limit. Together, these metrics provide utilization ratios useful to decide when to scale up, out, or in.
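As a sketch of how such utilization ratios drive scaling decisions, consider the following. The threshold values here are illustrative assumptions, not from the article:

```python
def utilization(usage: float, limit: float) -> float:
    """Ratio of a resource metric to its container limit (0.0-1.0)."""
    return usage / limit

def scaling_hint(cpu_util: float, mem_util: float,
                 high: float = 0.8, low: float = 0.2) -> str:
    """Illustrative thresholds: sustained high utilization suggests
    scaling up/out, sustained low utilization suggests scaling in."""
    if cpu_util > high or mem_util > high:
        return "scale up/out"
    if cpu_util < low and mem_util < low:
        return "scale in"
    return "ok"

# e.g. a container using 1.8 of its 2.0-core CPU limit
# and 500 MiB of a 1 GiB memory limit
print(scaling_hint(utilization(1.8, 2.0), utilization(500, 1024)))  # -> scale up/out
```

The point is that raw usage alone tells you little; only the ratio against the configured limit says whether a container is close to being throttled or OOM-killed.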

Container performance monitoring also considers the container infrastructure. With Docker – as well as Kubernetes, Docker Swarm and DC/OS – you have to look at memory and CPU ratios for the cluster itself as they will tell you when to scale up, out, or in.

However, besides the infrastructure layer, there's the application layer as well. Say a container runs an HTTP server; you need to collect metrics such as latency and request count. You can't gather these metrics from the container runtime, so either the monitoring system has to pull them from the process, or the process must push them to the monitoring system.
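The pull model can be sketched as follows: the process keeps its own counters and exposes them in Prometheus' text exposition format for a scraper to fetch. This is a hand-rolled illustration; a real service would use a client library such as prometheus_client:

```python
# Application-level metrics kept by the process itself (pull model sketch).
request_count = 0
latency_sum = 0.0  # seconds

def observe_request(latency_seconds: float) -> None:
    """Record one handled HTTP request."""
    global request_count, latency_sum
    request_count += 1
    latency_sum += latency_seconds

def metrics_page() -> str:
    """Body a /metrics endpoint would return to the scraper."""
    return (
        "# TYPE http_requests_total counter\n"
        f"http_requests_total {request_count}\n"
        "# TYPE http_request_latency_seconds_sum counter\n"
        f"http_request_latency_seconds_sum {latency_sum}\n"
    )

observe_request(0.120)
observe_request(0.080)
print(metrics_page())
```

In the push variant, the same counters would instead be sent periodically to the monitoring backend by the process.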

Finally, any layer within the stack can produce errors. Events such as "container restart" or "could not connect to database" are not uncommon and are very useful signals. Container performance monitoring takes care of these too.

Learn more about these metrics and many others from Docker Container Performance Metrics to Monitor.

How to Monitor Docker: 3 Types of Container Monitoring Tools You Should Know About

Now that you know what to watch for when it comes to monitoring your containerized app, you just have to choose the monitoring tool(s) that fits your specific use case. Here are the types of open-source Docker monitoring tools you should consider using for better operational insights into container deployments.

1. Command Line Tools

The first step to getting visibility into your container infrastructure is probably using built-in tools like the docker command line and kubectl for Kubernetes. There's a whole set of commands for finding the relevant information about containers. Note that docker and kubectl are typically available only to the few people who have direct access to the orchestration tool. Nevertheless, all cloud engineers require command-line skills and, in some situations, command-line tools are indeed the only tools available.
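Output from these CLI tools can also be consumed programmatically. A minimal sketch, shelling out to the docker CLI from Python for a one-shot stats snapshot and degrading gracefully when docker isn't on the PATH:

```python
import shutil
import subprocess
from typing import Optional

def container_stats() -> Optional[str]:
    """One-shot CPU/memory snapshot via the docker CLI.

    Returns the formatted stats text, or None when the docker binary
    isn't on the PATH (e.g. on a machine without direct access).
    """
    if shutil.which("docker") is None:
        return None
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format",
         "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"],
        capture_output=True, text=True,
    )
    return result.stdout

stats = container_stats()
print(stats if stats is not None else "docker CLI not available")
```

The `--no-stream` flag makes `docker stats` print one sample and exit instead of refreshing continuously, which is what you want for scripting.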

Before we start looking at Docker log collection tools, check out these two useful Docker Cheatsheets.

2. Open Source Tools for Docker Monitoring, Logging and Tracing

Several open-source tools are available for DIY-style container monitoring and logging. Typically, container logs and metrics are stored in different data stores: the ELK Stack is the tool of choice for logs, while Prometheus is popular for metrics.

Depending on your metrics and logs data store choices you may need to use a different set of data collectors and dashboard tools. Telegraf and Prometheus are the most flexible open source data collectors we’ve evaluated. Prometheus exporters need a scraper (Prometheus Server or alternative 3rd party scraper) or a remote storage interface for Prometheus Server to store metrics in alternative data stores. Grafana is the most flexible monitoring dashboard tool with support for most data sources like Prometheus, InfluxDB, Elasticsearch, etc.
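To make the scraping setup concrete, here's a minimal Prometheus scrape configuration pulling from two common exporters. The job names and target addresses are placeholders for whatever exporters you actually run:

```yaml
# prometheus.yml – minimal scrape config (hypothetical targets)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "cadvisor"          # container-level metrics
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: "node-exporter"     # host-level metrics
    static_configs:
      - targets: ["node-exporter:9100"]
```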

Kibana and Metricbeat (for data collection) are tightly bound to Elasticsearch and are thus not usable with any other data store.

The following matrix shows which data collectors typically play with which storage engine and monitoring dashboard tool. Note there are several other variations possible.

| Data Collector for Containers | Storage / Time Series DB | User Interface |
|---|---|---|
| Log collectors | | |
| Logagent | Elasticsearch, Sematext Cloud, InfluxDB, ClickHouse DB | Kibana, Grafana, Sematext |
| Telegraf / syslog + Docker syslog driver | InfluxDB, Sematext Cloud | Grafana, Chronograf, Sematext |
| Fluentd / Filebeat | Elasticsearch | Kibana |
| Metric collectors | | |
| Sematext Agent | InfluxDB, Sematext Cloud | Chronograf, Sematext Cloud |
| Metricbeat | Elasticsearch | Kibana |
| Telegraf | InfluxDB | Grafana, Chronograf |
| Telegraf (Elasticsearch output) | Elasticsearch | Kibana |
| Prometheus exporters | Prometheus; various 3rd party and commercial integrations | Promdash, Grafana |

Compatibility of monitoring tools and time series storage engines

The Elastic Stack might seem like an excellent candidate to unify metrics and logs in one data store. As providers of Elasticsearch consulting, Elasticsearch training, and Elasticsearch support, we would love nothing more than everyone using Elasticsearch for not just logs, but also metrics. However, the truth is that Elasticsearch is not the most efficient time series database for metrics. Trust us, we've run numerous benchmarks and applied all kinds of performance tuning tricks from our rather big bag of Elasticsearch tricks, but it turns out there are better, more efficient, faster data stores for metrics than Elasticsearch. The setup and maintenance of logging and monitoring infrastructure also become complicated at larger scale.

After the initial setup of storage engines for metrics and logs, the time-consuming work starts: setting up log shippers, monitoring agents, dashboards, and alert rules. Dealing with log collection for containers can be tricky, so you'll want to consult the top 10 Docker logging gotchas and Docker log driver alternatives.

3. Visualization Tools

After the setup of data collectors, we need to visualize metrics and logs.

Docker Monitoring with Sematext

Sematext provides a more comprehensive, and easier to set up, monitoring dashboard for metrics, events, and logs. It gives you actionable insights about containers and infrastructure, which we like to call full-stack observability. With anomaly detection, alerting, and correlation between all parts of your infrastructure, clusters, and containers, you get everything you need in one place.

Docker Monitoring with Grafana

The most popular dashboard tools are Grafana and Kibana. Grafana does an excellent job as a dashboard tool for showing data from a number of data sources, including Elasticsearch, InfluxDB, and Prometheus. In general, though, Grafana is more tailored toward metrics, even though using Grafana for logs with Elasticsearch is possible too. Grafana is still very limited for ad-hoc log searches, but it has integrated alerting on logs.
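For illustration, wiring Grafana to one of these data sources can be automated with a small provisioning file. The URL below is a placeholder for your own Prometheus server:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
```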

Docker Monitoring with Kibana

Kibana, on the other hand, supports only Elasticsearch as a data source. Some dashboard views are simply impossible to implement because different monitoring and logging tools have limited options for correlating data from different data stores. Once dashboards are built and ready to share with the team, the next hot topic for Kibana users is security: authentication and role-based access control (RBAC). Grafana supports user authentication and simple roles, while Kibana (or, in general, the Elastic Stack) requires the commercial X-Pack extension to support security features like user authentication and RBAC. Depending on the requirements of your organization, one of the X-Pack Alternatives might be helpful.

Microservices Distributed Transaction Tracing

Until now we've discussed only monitoring and logging, setting aside distributed transaction tracing, the third pillar of observability. Keep in mind that as soon as we start collecting transaction traces across microservices, the amount of data explodes and thus further increases the total cost of ownership of an on-premises monitoring setup. Note that the data collection tools mentioned in this post handle only metrics and logs, not traces (for more on transaction tracing and, more specifically, OpenTracing-compatible tracers, see our Jaeger vs. Zipkin comparison). Similarly, the dashboard tools covered here don't come with data collection and visualization for transaction traces. This means that for distributed transaction tracing we need a third set of tools if we want to put together and run our own monitoring setup – welcome to the DevOps jungle!

Total Cost of Ownership

When planning an open-source monitoring setup, people often underestimate the amount of data generated by monitoring agents and log shippers. More specifically, most organizations underestimate the resources needed for processing, storage, and retrieval of metrics and logs as their volume grows. Even more importantly, organizations often underestimate the human effort and time that will have to be invested in ongoing maintenance of the monitoring infrastructure and open-source tools. When that happens, not only does the cost of infrastructure for monitoring and logging jump beyond anyone's predictions, but so does the time, and thus money, required for maintenance. A common way to deal with this is to limit data retention. This requires fewer resources, less expertise to scale the infrastructure and tools, and thus less maintenance, but it of course limits the visibility and insights one can derive from long-term data.

Infrastructure costs are only one reason why we often see limited storage for metrics, traces, and logs. For example, InfluxDB has no clustering or sharding in the open-source edition, and Prometheus supports only short retention times to avoid performance problems.

Another approach to dealing with this is reducing the granularity of metrics from 10-second accuracy to a minute or even more, sampling, and so on. As a consequence, DevOps teams have less accurate information, less time to analyze problems, and limited visibility into permanent or recurring issues, historical trend analysis, and capacity planning.
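The granularity trade-off above can be sketched in a few lines: averaging 10-second samples into 1-minute buckets cuts storage roughly 6x, but short spikes disappear inside each bucket:

```python
from statistics import mean

def downsample(samples, bucket_seconds=60):
    """Average (timestamp, value) samples into fixed-size time buckets."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts - ts % bucket_seconds, []).append(value)
    return [(ts, mean(values)) for ts, values in sorted(buckets.items())]

# Six 10-second CPU samples collapse into a single 1-minute point;
# the brief spike to 90 is averaged away.
points = [(0, 10), (10, 20), (20, 90), (30, 20), (40, 10), (50, 10)]
print(downsample(points))
```

This is exactly why downsampled data is fine for capacity planning but poor for diagnosing short-lived incidents.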

Free OpenTracing eBook. Want useful how-to instructions and copy-paste code for tracer registration? We've prepared an OpenTracing eBook that puts all key OpenTracing information at your fingertips: it introduces OpenTracing, explains what it is and does and how it works, covers Zipkin followed by Jaeger, both popular distributed tracers, and finally compares Jaeger vs. Zipkin. Download yours.

DIY Container Monitoring Pros and Cons

There are a number of open-source container observability tools for logging, monitoring, and tracing. If you and your team have the time, and if observability really needs to be your team's core competency, you'll need to invest time into finding the most promising tools, learning how to actually use them while evaluating them, and finally installing, configuring, and maintaining them. It would be wise to compare multiple solutions and check how well the various tools play together. Here's how we recommend choosing the best container monitoring tool for your use case:

Coverage of collected metrics. Some tools collect only a few metrics, some gather a ton of metrics (which you may not really need), while other tools let you configure which metrics to collect. Missing relevant metrics can be frustrating when you're working under pressure to solve a production issue, just like having too many or wrong metrics will make it harder to locate the signals that truly matter. Tools that require configuration for the collection or visualization of each metric are time-consuming to set up and maintain. Don't choose such tools. Instead, look for tools that give you good defaults and the freedom to customize which metrics to collect.

Coverage of log formats. A typical application stack consists of multiple components like databases, web servers, message queues, etc. Make sure that you can structure logs from your applications. This is key if you want to use your logs not only for troubleshooting, but also for deriving insights from them. Defining log parser patterns with regular expressions or grok is time-consuming, so a library of existing patterns is very helpful. This is a time saver, especially in the container world when you use official Docker images.

Collection of events. Any indication of why a service was restarted or crashed will help you classify problems quickly and get to the root cause faster. Any container monitoring tool should thus collect Docker events and, if you run Kubernetes, Kubernetes status events.

Correlation of metrics, logs, and traces. Whether you initially spot a problem through metrics, logs, or traces, having access to all this observability data makes troubleshooting so much faster. A single UI displaying data from various sources is thus key for interactive drill-down, fast troubleshooting, faster MTTR and, frankly, makes the DevOps job more enjoyable. See example.

Machine learning capabilities and anomaly detection for alerting on logs and metrics. Threshold-based alerts work well only for known and constant workloads. In dynamic environments, threshold-based alerts create too much noise. Make sure the solution you select has this core capability and that it doesn't take ages to learn the baseline or require too much tweaking, training, and such.

Detection and correlation of metrics with the same behavior. When metrics behave in similar patterns, one of the metrics is typically the symptom of the root cause of a performance bottleneck. A good example we have seen in practice is high CPU usage paired with container swap activity and disk IO – in such a case, CPU usage and even more so disk IO could be reduced by switching off swapping for containers. For system metrics like the above, the correlation is often known – but when you track your application-specific metrics, you might find new correlations and bottlenecks in your microservices to optimize.

Single sign-on. Correlating data stored in silos is impossible. Moreover, using multiple services often requires multiple accounts and forces you to learn not one, but multiple services, their UIs, etc. Each time you need to use both of them, there is the painful overhead of adjusting things like time ranges before you can look at the data in separate windows. This costs time and money and makes it harder to share data with the team.

Role-based access control. Lack of RBAC is going to be a show-stopper for any tool seeking adoption at the corporate level. Tools that work fine for small teams and SMBs, but lack multi-user support with roles and permissions, almost never meet the requirements of large enterprises.
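The anomaly detection criterion above can be illustrated with a simple rolling baseline. This is a toy stand-in for what commercial tools do, not any product's actual algorithm; real systems also account for seasonality and trend:

```python
from statistics import mean, stdev

def is_anomaly(history, value, sigmas=3.0):
    """Flag a value more than `sigmas` standard deviations away from the
    baseline learned from recent history (a simple 3-sigma rule)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return value != mu  # flat baseline: any change is anomalous
    return abs(value - mu) > sigmas * sd

baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # e.g. requests/sec
print(is_anomaly(baseline, 101))  # steady rate: not anomalous
print(is_anomaly(baseline, 250))  # sudden surge: anomalous
```

Even this toy version shows why such alerts beat fixed thresholds in dynamic environments: the baseline adapts to whatever the workload currently is, instead of being hand-tuned per service.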

Wrap up!

No matter how many Docker containers you're running, monitoring is key to keeping your app up and running and your users happy. For that, DevOps engineers need well-integrated monitoring, logging, and tracing solutions with advanced functionality like correlation between metrics, traces, and logs. The engineering and infrastructure costs saved by not running in-house monitoring can quickly pay off. Adjustable data retention times per monitored service help optimize costs and satisfy operational needs. The result is a better experience for your DevOps team, especially faster troubleshooting, which minimizes revenue loss once a critical bug or performance issue hits your commercial services. While developing Sematext Cloud, we had the above ideas in mind, with the goal of providing a better container monitoring solution. You can read more about it in Part 3 of this Docker Monitoring Guide, Docker Container Monitoring with Sematext.

And lastly, if you’re using Docker with Kubernetes, check out our Guides to Kubernetes Logging and Kubernetes Monitoring to set up a comprehensive monitoring strategy that covers all your bases.
