In the last post in our series “Prometheus and Kubernetes”, Tom talked about how we use Prometheus to monitor the applications and services we deploy to Kubernetes. In this post, I want to talk about how we use Prometheus to monitor our Kubernetes cluster itself.

This is the fourth post in our series on Prometheus and Kubernetes – see “A Perfect Match”, “Deploying”, and “Monitoring Your Applications” for the previous instalments.

TL;DR

At Weaveworks:

We use custom tools like kubediff so that we know our configuration matches what’s actually running.

With kube-api-exporter, we can tell whether our rollouts actually worked.

The Prometheus node-exporter monitors resource usage of our VMs and provides cAdvisor information about our running containers.

We want to put as little effort as possible into maintaining our Kubernetes clusters. After all, we’re using Kubernetes so we can deploy and run our own stuff more effectively—it’s a means to an end.

The same can be said of our choice of monitoring: we want the bare minimum we can get away with.

Specifically, we want monitoring that:

Ensures what we think we’ve deployed is what’s actually deployed.

Helps us understand resource usage so we know when to expand the cluster (capacity planning).

Allows us to see what Kubernetes is doing when we are investigating issues that affect users.

Also, if Kubernetes can give us any useful application-level signals “for free”, then we want to grab them with both hands.

Monitoring Deployments With kube-api-exporter

When we deploy something to Kubernetes, the first thing we want to know is whether or not the deployment worked. Kubernetes has this information but does not yet export it (although people are talking about changing that), so we need to provide our own exporter. At Weaveworks, we use a simple, custom Python daemon called kube-api-exporter, which calls the apiserver and exports information about all of our Deployments and Pods as Prometheus metrics. There is also a more complete tool called kube-state-metrics that we are quite keen to try out.
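As a rough illustration of what such an exporter does, here is a simplified Python sketch. This is not the real kube-api-exporter: the apiserver response is faked as a hard-coded list, and the label set is abbreviated.

```python
# Simplified sketch: render deployment generation data in the Prometheus
# text exposition format. A real exporter would fetch this from the
# apiserver; here the response is a hard-coded stand-in.
def render_deployment_metrics(deployments):
    lines = []
    for d in deployments:
        labels = '{name="%s",namespace="%s"}' % (d["name"], d["namespace"])
        lines.append("k8s_deployment_metadata_generation%s %d"
                     % (labels, d["metadata_generation"]))
        lines.append("k8s_deployment_status_observedGeneration%s %d"
                     % (labels, d["observed_generation"]))
    return "\n".join(lines) + "\n"

fake_api_response = [
    {"name": "kube-api-exporter", "namespace": "monitoring",
     "metadata_generation": 4, "observed_generation": 4},
]
```

Scraping a page rendered this way is all Prometheus needs to evaluate the alerts below.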

We deploy kube-api-exporter as a normal Kubernetes Deployment with a single replica:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: monitoring
  name: kube-api-exporter
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: kube-api-exporter
    spec:
      containers:
      - name: kube-api-exporter
        image: tomwilkie/kube-api-exporter
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80

With this and the corresponding Service definition in place, we can start alerting on dodgy deployments:

ALERT DeploymentGenerationMismatch
  IF k8s_deployment_status_observedGeneration{job="monitoring/kube-api-exporter"}
       != k8s_deployment_metadata_generation{job="monitoring/kube-api-exporter"}
  FOR 5m
  LABELS { severity="critical" }
  ANNOTATIONS {
    summary = "Deployment of {{$labels.exported_namespace}}/{{$labels.name}} failed",
    description = "Deployment of {{$labels.exported_namespace}}/{{$labels.name}} failed - observed generation != intended generation.",
  }

“Generation” is a number that increments with each deployment. When we trigger a deployment (normally by running kubectl apply on a YAML file), metadata_generation gets incremented. When the deployment succeeds, observedGeneration gets incremented to match. If these numbers aren’t equal for five minutes, then we have tried to deploy something and failed, so our on-call developer should get a critical alert.

ALERT DeploymentReplicasMismatch
  IF (k8s_deployment_spec_replicas{job="monitoring/kube-api-exporter"}
        != k8s_deployment_status_availableReplicas{job="monitoring/kube-api-exporter"})
     or (k8s_deployment_spec_replicas{job="monitoring/kube-api-exporter"}
        unless k8s_deployment_status_availableReplicas{job="monitoring/kube-api-exporter"})
  FOR 5m
  LABELS { severity="critical" }
  ANNOTATIONS {
    summary = "Deployment of {{$labels.exported_namespace}}/{{$labels.name}} failed",
    description = "Deployment of {{$labels.exported_namespace}}/{{$labels.name}} failed - observed replicas != intended replicas.",
  }

Similarly, if a Deployment should have six replicas and five minutes later only four are available, then something has gone seriously wrong, and our on-call developer will receive a critical alert saying so.
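The second clause of the alert uses PromQL’s unless operator: roughly, A unless B keeps the series from A that have no matching series in B, which catches the case where the availableReplicas metric is missing entirely. A small Python sketch of the semantics, modelling each side as a dict keyed by its label set (and ignoring PromQL’s label-matching options):

```python
# Sketch of PromQL's `unless`: keep left-hand series with no matching
# label set on the right-hand side.
def unless(left, right):
    return {labels: value for labels, value in left.items()
            if labels not in right}

spec_replicas = {("monitoring", "my-app"): 6}   # made-up series
available_replicas = {}  # no availableReplicas series exported at all
missing = unless(spec_replicas, available_replicas)
# `missing` still contains the spec series, so the alert expression
# produces output and the alert can fire.
```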

Deployments can also fail when the pods end up in crash loops: starting only to crash shortly afterward. Our generation and replica mismatch checks don’t catch this, so we added a dedicated alert:

ALERT PodRestartingTooMuch
  IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
  FOR 1h
  LABELS { severity="warning" }
  ANNOTATIONS {
    summary = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
    description = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
  }

If a Pod restarts more than once in a five-minute period then something is going wrong, but not wrong enough to wake anyone up; hence “warning” and not “critical”.

Ensuring Configuration Matches Reality With kubediff

The configuration for all of our Kubernetes Services, Deployments, Namespaces and so forth is in a Git repository so that we can get version control, code review, and—importantly—easy rollback to reduce mean time to recovery.

However, once we do this, we are faced with the certainty that things will happen to cause our clusters to diverge from their version-controlled configuration. Someone might merge a change to master without applying it to production, or someone might run kubectl directly against a cluster without making the corresponding change to the Git repository.

At Weave Cloud, we don’t exactly want to prevent this—sometimes it’s useful to quickly make a change to production—but we do want to minimize it, and we certainly want to know when it’s happening. That is, we want to get a non-critical alert when there has been a difference for longer than we’d like.

Our solution for this starts with kubediff. kubediff is a command-line tool that shows the differences between your running configuration and your version controlled configuration. If there is a difference between them, it returns a non-zero exit code.

kubediff by itself isn’t enough for us to get the alerts we need: Prometheus works by scraping metrics from running servers, so it can’t monitor command-line programs directly. To get around this, we wrote a simple wrapper called prom-run, which periodically runs a command and exports its exit code as a Prometheus metric.
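The idea behind prom-run can be sketched in a few lines of Python. This is a hedged sketch, not the real weaveworks tool: the periodic timer and the HTTP listener are omitted, and only the run-and-render core is shown. The metric name command_exit_code matches the one our kubediff alert queries.

```python
# Hedged sketch of the prom-run idea: run a command, capture its exit
# code, and render it in the Prometheus text exposition format.
import subprocess
import sys

def run_command(argv):
    """Run the command and return its exit code."""
    return subprocess.run(argv).returncode

def render_metric(exit_code):
    """Render the most recent exit code as a Prometheus metric."""
    return "command_exit_code %d\n" % exit_code

# Use the Python interpreter itself as a portable stand-in command.
code = run_command([sys.executable, "-c", "raise SystemExit(0)"])
```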

The final piece is a way to sync our configuration from GitHub onto the Pod in which prom-run is running. We do this using git-sync, which describes itself as ‘a perfect “sidecar” container in Kubernetes’. It periodically pulls files down from a Git repository so that an application can consume them.

volumes:
- name: repo
  emptyDir: {}
containers:
- name: git-sync
  image: tomwilkie/git-sync:f6165715ce9d
  args:
  - -repo=https://github.com/weaveworks/<config-repo>
  - -wait=60
  - -dest=/data/service-conf
  volumeMounts:
  - name: repo
    mountPath: /data
- name: prom-run
  image: weaveworks/kubediff:master-a3dcdee
  imagePullPolicy: IfNotPresent
  args:
  - -period=60s
  - -listen-addr=:80
  - /kubediff
  - /data/service-conf/k8s/dev
  volumeMounts:
  - name: repo
    mountPath: /data
  ports:
  - containerPort: 80

git-sync and prom-run run as two containers in the same Pod. Both mount the /data directory, which contains the synced Git repository with our Kubernetes configuration. prom-run is configured to run kubediff every 60 seconds.

With that deployed, Prometheus is configured to alert us when our cluster and our configuration have had differences for more than two hours:

ALERT Kubediff
  IF max(command_exit_code{job="monitoring/kubediff"}) != 0
  FOR 2h
  LABELS { severity="warning" }
  ANNOTATIONS {
    summary = "Kubediff has detected a difference in running config.",
    description = "Kubediff has detected a difference in running config.",
  }

The alert message just tells us that differences exist, without letting us know what those differences are. When we get these alerts (and we get them on dev once or twice a week), we go to the kubediff web page to see what needs fixing:

Example kubediff alert from Slack

Verifying Kubernetes Infrastructure Configuration Matches Deployment

kubediff ensures our configuration for things within Kubernetes matches what is actually running. But what about our Kubernetes infrastructure itself?

We run our Kubernetes infrastructure on AWS and configure that infrastructure using Terraform. We keep the infrastructure configuration in version control, the same way we do for our Kubernetes configuration, but instead of using kubediff, we use terradiff.

terradiff is not a stand-alone tool; rather, it’s the name we give to our setup: we use prom-run to periodically run terraform plan, comparing our tfstate (which we keep in the same Git repository as our Terraform configuration) against our AWS accounts.

Gathering kubelet Metrics

Up to this point, we are monitoring our API server and verifying that our configuration matches our deployment, both at the Kubernetes level and at the infrastructure level. The next step is to monitor the nodes themselves.

You can think of nodes in two ways: as the machines that make up your Kubernetes cluster, or as the kubelets that represent those machines. This section focuses on the kubelets.

First, we need to map the Kubernetes nodes into Prometheus:

- job_name: 'kubernetes-nodes'
  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc.cluster.local'
    in_cluster: true
    role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - target_label: __scheme__
    replacement: https

This tells Prometheus that we want information about our Kubernetes nodes exported under the kubernetes-nodes job.

The nodes are only serving HTTPS, so we relabel the scheme to get Prometheus to use HTTPS to scrape them. We also set insecure_skip_verify as we haven’t yet figured out how to verify the TLS certificates we get from the nodes. If we don’t set it, the Prometheus Targets page says “cannot validate certificate for x.x.x.x because it doesn’t contain any IP SANs”.

Once this is done, we can query Prometheus for metrics on our kubelets, including cAdvisor data. For example, this is what Prometheus’s own CPU and memory usage look like on our dev cluster:

The spike at the end of the CPU graph was caused by me looking at the graphs. The queries for these graphs are:

sum(irate(container_cpu_usage_seconds_total{job="kubernetes-nodes",io_kubernetes_pod_namespace="monitoring",io_kubernetes_pod_name=~"prometheus-.*"}[1m])) by (io_kubernetes_pod_namespace,io_kubernetes_pod_name)

for CPU and:

sum(container_memory_usage_bytes{job="kubernetes-nodes",io_kubernetes_pod_namespace="monitoring",io_kubernetes_pod_name=~"prometheus-.*"}) by (io_kubernetes_pod_namespace,io_kubernetes_pod_name)

for memory.

Gathering Node Metrics With the Prometheus node_exporter

Having cAdvisor data about the kubelet containers is a great start, but we also want information about the machines they are running on, so that we can do capacity planning or at least figure out if we’re about to run out of RAM. To get these metrics, we use the Prometheus node exporter, which exports machine-level metrics.

To run one node exporter on each node in our cluster, we set up a DaemonSet.

Prometheus doesn’t have built-in support for discovering DaemonSets, so instead we tell Prometheus to discover and scrape all of the Pods, keeping only those that have the prometheus.io.scrape annotation set to "true":

# Scrape some pods (for daemon sets)
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc.cluster.local'
    in_cluster: true
    role: pod
  # You can specify the following annotations (on pods):
  #   prometheus.io.scrape: true - scrape this pod
  #   prometheus.io.port - scrape this port
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_namespace, __meta_kubernetes_pod_label_name]
    separator: '/'
    target_label: job
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node

This configuration sets the job name for discovered Pods to <namespace/name>, where ‘name’ is the value of the Pod’s ‘name’ label rather than the actual Pod name, since Pod names have a random suffix appended when they start up.
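The relabel rule that builds the job label can be illustrated with a small Python sketch (not Prometheus code): the values of the source_labels are joined with the separator and written into the target label.

```python
# Illustration of a replace-style relabel rule: join the source label
# values with the separator and store the result in the target label.
def relabel(labels, source_labels, separator, target_label):
    labels = dict(labels)  # copy so the input is left untouched
    labels[target_label] = separator.join(labels[l] for l in source_labels)
    return labels

pod = {
    "__meta_kubernetes_pod_namespace": "monitoring",
    "__meta_kubernetes_pod_label_name": "prom-node-exporter",
}
result = relabel(
    pod,
    ["__meta_kubernetes_pod_namespace", "__meta_kubernetes_pod_label_name"],
    "/",
    "job",
)
# result["job"] is "monitoring/prom-node-exporter"
```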

With the Prometheus configuration in place, we can deploy our node_exporter DaemonSet:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  namespace: monitoring
  name: prom-node-exporter
spec:
  template:
    metadata:
      name: prom-node-exporter
      labels:
        name: prom-node-exporter
      annotations:
        prometheus.io.scrape: "true"
    spec:
      hostPID: true
      containers:
      - name: prom-node-exporter
        image: prom/node-exporter:0.12.0
        securityContext:
          privileged: true
        ports:
        - containerPort: 9100

Note that prometheus.io.scrape is set to "true" so that Prometheus will discover it. We also set hostPID to true and run the container in a privileged security context so it can get all the data it needs from the underlying VM.

Once this is deployed, we can easily get graphs like this one:

This graph shows the memory usage for each node in our dev cluster. We’re getting some pretty good utilization there!

The query used for that graph looks like:

(node_memory_MemTotal{job="monitoring/prom-node-exporter"} - node_memory_MemFree{job="monitoring/prom-node-exporter"}) / node_memory_MemTotal{job="monitoring/prom-node-exporter"}
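The arithmetic behind this query is simple fractional utilization, computed per node. As a sanity check, with made-up numbers:

```python
# Made-up numbers: a node with 8 GiB total and 2 GiB free is 75% utilized,
# which is what (MemTotal - MemFree) / MemTotal computes per node.
total_bytes = 8 * 1024**3
free_bytes = 2 * 1024**3
utilization = (total_bytes - free_bytes) / total_bytes
```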

Conclusions

Prometheus is a great way to monitor a Kubernetes cluster. At Weaveworks, we:

monitor the Kubernetes API server for basic liveness

have custom tools like kubediff so we know that our config matches what’s actually running

use kube-api-exporter to tell us whether our rollouts actually worked and to detect crash loops and pod flapping

use the Prometheus node-exporter to monitor resource usage of our VMs and to get cAdvisor information about our running containers

And there’s a lot more we could do. We’re keen to learn: How do you monitor your Kubernetes cluster? How do you wish you could monitor it?



