By Michael Kraus

As Prometheus gave fire to mankind, the distributed monitoring software with the same name illuminates the admin's mind in native cloud environments, offering metrics for monitored systems and applications.

Wherever container-based microservices spread, classic monitoring tools such as Nagios [1] and Icinga [2] quickly reach their limits. They are simply not designed to monitor short-lived objects such as containers. In native cloud environments, Prometheus [3], with its time series database approach, has therefore blossomed into an indispensable tool. The software is related to the Kubernetes [4] container orchestrator: Whereas Kubernetes comes from Google's Borg cluster system, Prometheus is rooted in Borgmon, the monitoring tool for Borg.

Matt Proud and Julius Volz, two former site reliability engineers (SREs) with Google, helped incubate Prometheus to get it ready for production when working for SoundCloud in 2012. Starting in 2014, other companies began taking advantage of it. In 2015, the creators published it as an open source project with an official announcement [5], although it previously also existed as open source on GitHub [6]. Today, programmers interested in doing so can develop Prometheus under the umbrella of the Cloud Native Computing Foundation (CNCF) [7], along with other prominent projects such as Containerd, rkt, Kubernetes, and gRPC.

What's Going On

Thanks to its minimalist architecture and easy installation, Prometheus, written in Go, is easy to try out. To install the software, first download Prometheus [8] and then unpack and launch:

tar xzvf prometheus-1.5.2.linux-*.tar.gz
cd prometheus-1.5.2.linux-amd64/
./prometheus

If you then call http://localhost:9090/metrics in your web browser, you will see the internal metrics for Prometheus (Figure 1). You can reach the Prometheus web interface shown in Figure 2, which is intended more for debugging purposes, at http://localhost:9090/graph . Many articles, blog posts, and conference keynotes [9]-[11] can help you plumb the depths of Prometheus. In this article, I focus on retrieving metrics.

Figure 1: If you access the metrics subpage, Prometheus provides a number of internal metrics.

Figure 2: The Prometheus web interface is quite plain.
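For a first experiment in the graph interface, you can enter a simple query. The expression below, for example, shows whether Prometheus is successfully scraping itself; the job label "prometheus" comes from the sample configuration shipped in the tarball and may differ if you have changed it:

```
up{job="prometheus"}
```

A value of 1 means the last scrape of that target succeeded; 0 means it failed.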

Collecting Metrics

Classic monitoring tools like Icinga and Nagios monitor components or applications with the help of small programs (plugins). This approach is known as "blackbox monitoring." Prometheus, however, belongs to the "whitebox monitoring" camp, wherein systems and applications voluntarily provide metrics in the Prometheus format. A growing number of applications already do this, including Docker, Kubernetes, etcd, and recently GitLab; they are known as "instrumented applications."
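To give an idea of what the Prometheus format looks like, here is a short, hypothetical excerpt as an instrumented application might expose it (the metric name, labels, and values are invented for illustration):

```
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
```

Each line carries a metric name, an optional set of key-value labels in braces, and the current sample value; the # HELP and # TYPE comments document the metric for humans and tools.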

Supported by exporters, Prometheus watches more than just instrumented systems and applications. As independent programs, exporters extract metrics from the monitored systems and convert them to a Prometheus-readable format. The most famous of these is node_exporter [12], which reads and provides operating system metrics such as memory usage and network load. Meanwhile, a number of exporters [13] exist for a wide range of protocols and services, such as Apache, MySQL, SNMP, and vSphere.

Making Nodes Transparent

To test node_exporter (like Prometheus, also written in Go), enter

tar xzvf node_exporter-0.14.0.*.tar.gz
cd node_exporter-0.14.0.linux-amd64
./node_exporter

to find a list of metrics for your system, which you can view at http://localhost:9100/metrics . You are now in a position to set up quite meaningful basic monitoring.
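To make Prometheus actually collect these node metrics, add a scrape job to its configuration file (prometheus.yml) and restart the server. The job name and target address below are examples; adjust them to your environment:

```yaml
scrape_configs:
  - job_name: 'node'               # arbitrary name for this group of targets
    static_configs:
      - targets: ['localhost:9100']  # node_exporter's default port
```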

Node Exporter is confined to machine metrics by definition, whereas information about running processes is typically found at the application level, which requires other exporters. If you pass a directory into node_exporter with the --collector.textfile.directory option, Node Exporter reads the *.prom text files stored in the directory and evaluates the metrics they contain. A cronjob passes its completion time to Prometheus like this:

echo my_batch_completion_time $(date +%s) > </path/to/directory/my_batch_job>.prom.$$
mv </path/to/directory/my_batch_job>.prom.$$ </path/to/directory/my_batch_job>.prom
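The write-then-rename sequence above matters: it prevents node_exporter from reading a half-written file. Wrapped in a small script, the pattern might look like this; the collector directory and metric name are assumptions for illustration:

```shell
#!/bin/sh
# Write a metric atomically for node_exporter's textfile collector.
DIR="${1:-/tmp/textfile_collector}"   # assumed collector directory
mkdir -p "$DIR"

# Write to a temporary file first ($$ is the shell's PID) ...
echo "my_batch_completion_time $(date +%s)" > "$DIR/my_batch_job.prom.$$"

# ... then rename it, which is atomic on the same filesystem.
mv "$DIR/my_batch_job.prom.$$" "$DIR/my_batch_job.prom"

cat "$DIR/my_batch_job.prom"
```

Called from the end of a cron job, this leaves a *.prom file that node_exporter picks up on its next scrape.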

Prometheus queries the configured monitoring targets at definable intervals. The software distinguishes between jobs and instances [14]: An instance is a single monitoring target, whereas a job is a collection of similar instances. The polling intervals are usually between 5 and 60 seconds; the default is 15 seconds. Metrics always travel over HTTP: When you open a metrics page in the browser, you see the plain-text format, whereas Prometheus itself can negotiate the more compact protocol buffer encoding when it scrapes a target. This two-pronged approach makes it easy to see exactly what metrics an application or an exporter provides.
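In the configuration, the job/instance distinction looks like this: one job whose targets each become an instance. The host names and the per-job interval below are hypothetical:

```yaml
global:
  scrape_interval: 15s          # default polling interval for all jobs

scrape_configs:
  - job_name: 'webservers'      # the job
    scrape_interval: 30s        # optional per-job override
    static_configs:
      - targets:                # each target becomes one instance
          - 'web1.example.com:9100'
          - 'web2.example.com:9100'
```

Prometheus automatically attaches job and instance labels to every sample it scrapes, so you can later filter and aggregate by them in queries.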