In a previous post I showed how to monitor data using collectd, InfluxDB and Grafana. In the meantime I wanted to add more functionality to collectd, but it was difficult to find plugins for NVIDIA GPUs and for monitoring other Docker instances. Then I found Telegraf, a tool from the same company as InfluxDB that can collect data from many sources. Three advantages made me switch from collectd to Telegraf:

Telegraf is written in Go, which makes it fast and light and reduces its footprint when collecting data.

There is an extensive list of Telegraf plugins indexed in one official GitHub repository, which makes it very practical to find and install plugins.

There is more active support on the Telegraf GitHub than on the collectd GitHub.

In this post I will show how to use Telegraf plugins to monitor GPU devices and battery status using Grafana.

Monitoring stack

My current stack still uses Docker Compose to orchestrate the services, which allows me to deploy everything with a simple docker-compose up. I also use a Makefile to drive my Docker Compose commands and inject environment variables into docker-compose.yml. I don't commit the real environment file to git; instead I commit a sample environment file that shows the variables in use.

I define the following containers in my docker compose file:

Telegraf: Collecting data

InfluxDB: Saving data

Grafana: Displaying data

Monitoring stack





Grafana parameters for queries

I collect data on CPU, RAM, uptime, connected users, network utilisation, disk utilisation and Docker stats. The Grafana interface lets you build InfluxDB queries, which helps in choosing among the available parameters. For example, to get the Docker CPU utilisation of each container, we can use the following query:

SELECT mean("usage_percent") FROM "docker_container_cpu" WHERE $timeFilter GROUP BY time($__interval), "container_name" fill(null)

Grouping by container_name separates the values for each container, and we can then use Grafana's alias pattern option to give a nice name to each line.

Using alias pattern to name group by variables
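To make the effect of the GROUP BY clause more concrete, here is a minimal Python sketch of the same idea: points are bucketed by time window and by container name, then averaged. It only emulates the query semantics on in-memory data; the function name group_mean and the sample points are hypothetical, and empty windows are simply omitted rather than filled with null.

```python
from collections import defaultdict
from statistics import mean


def group_mean(points, interval_s):
    """Average usage_percent per (container_name, time window).

    points: iterable of (unix_timestamp, container_name, usage_percent).
    Returns {container_name: [(window_start, mean_value), ...]}.
    """
    buckets = defaultdict(list)
    for ts, name, value in points:
        window_start = ts - ts % interval_s  # start of the time(...) window
        buckets[(name, window_start)].append(value)

    # One series per container, ordered by window start
    series = defaultdict(list)
    for (name, window_start), values in sorted(buckets.items()):
        series[name].append((window_start, mean(values)))
    return dict(series)


# Hypothetical sample points, not real measurements
sample = [
    (0, "grafana", 1.0),
    (5, "grafana", 3.0),
    (12, "grafana", 4.0),
    (3, "influxdb", 10.0),
]
result = group_mean(sample, interval_s=10)
```

Each key of result corresponds to one line in the Grafana panel, which is why an alias based on the container name gives readable legends.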

GPU monitoring

One of the main reasons I started using Telegraf was that I wanted to monitor a server with NVIDIA GPUs, and Telegraf offers a nice NVIDIA plugin to do so.

I use an environment variable to enable GPU monitoring. When it is set, an additional docker-compose file is included with two configuration values specific to NVIDIA GPUs:

version: '2.3'
services:
  telegraf:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ./docker/telegraf/telegraf-gpu.conf:/etc/telegraf/telegraf.conf

The first is the runtime option, which lets Docker access the GPU.

The other is the environment variable NVIDIA_VISIBLE_DEVICES, which selects the NVIDIA devices the container is allowed to use. The possible values are either all or a device number (multiple device IDs can be given, separated by commas).
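As an illustration of how such a value maps to devices, here is a small Python sketch. It is a simplification that covers only the two forms mentioned above (all and comma-separated indices); the function name parse_visible_devices and the total_gpus parameter are hypothetical and not part of any NVIDIA tool.

```python
def parse_visible_devices(value, total_gpus):
    """Return the GPU indices selected by an NVIDIA_VISIBLE_DEVICES-style value.

    Simplified on purpose: handles only "all" and comma-separated indices.
    """
    value = value.strip()
    if value == "all":
        return list(range(total_gpus))

    indices = [int(part) for part in value.split(",") if part.strip()]
    for index in indices:
        if not 0 <= index < total_gpus:
            raise ValueError(
                f"GPU index {index} out of range (0..{total_gpus - 1})"
            )
    return indices


all_devices = parse_visible_devices("all", total_gpus=4)
two_devices = parse_visible_devices("0,2", total_gpus=4)
```

So on a hypothetical machine with four GPUs, setting the variable to 0,2 would expose only the first and third device to the container.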

I use a different Telegraf configuration when monitoring GPUs because I add the Telegraf plugin that monitors them using the nvidia-smi tool. The only change is to uncomment the following lines:

[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  # bin_path = "/usr/bin/nvidia-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
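Before enabling the plugin, it can be useful to check by hand what nvidia-smi reports inside the container. The sketch below is one way to do that from Python, using the --query-gpu and --format=csv,noheader,nounits options of nvidia-smi. It is only a manual check and does not claim to reflect how the Telegraf plugin is implemented; the function names and the sample output are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi, in this order
QUERY = "index,temperature.gpu,utilization.gpu"


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    Assumes every field is numeric; a GPU reporting "[N/A]" would raise.
    """
    rows = []
    for line in text.strip().splitlines():
        index, temp, util = (field.strip() for field in line.split(","))
        rows.append(
            {
                "index": int(index),
                "temperature_c": int(temp),
                "utilization_pct": int(util),
            }
        )
    return rows


def read_gpus(bin_path="nvidia-smi"):
    """Run nvidia-smi (it must be installed) and return the parsed rows."""
    completed = subprocess.run(
        [bin_path, f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(completed.stdout)


# Hypothetical output for two GPUs, used here instead of a real call
sample_output = "0, 62, 98\n1, 78, 100\n"
gpus = parse_gpu_csv(sample_output)
```

On a machine with the NVIDIA runtime configured, calling read_gpus() should return one entry per visible device; if it fails, the Telegraf plugin will most likely fail too.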

In the image below you can see the temperature and the utilisation of each GPU. I have a batch of work distributed across all the GPUs; when it finishes, it writes its output to a database.

GPU temperature and utilisation





You can notice two things:

GPU 0 has a lower temperature than the other devices. This is because in my setup GPU 0 sits at the edge, so its fan is not blocked by the other GPUs.

Temperatures don’t go beyond 80 °C, which is OK for my GPUs.

Battery monitoring

I was also curious about my battery utilisation, because I try to optimise the charging cycles by not letting the battery go below 20% and not charging it above 90%. So I tried the Telegraf battery plugin, which fetches the battery status from the /proc folder. The following image shows the battery capacity and the battery cycle count.

Battery monitoring using Grafana





My laptop has two batteries. I try to use only one of them, the one that can be removed from the laptop and therefore replaced easily. As a result, the cycle count is lower for BAT0. I also try to do complete cycles with the batteries, as one can see in the battery capacity plot.
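For a quick check of the same two values outside Grafana, here is a minimal Python sketch. It is not the plugin used above: it assumes the Linux sysfs layout /sys/class/power_supply/BAT*/ with capacity and cycle_count files, which may differ on your machine, and the function name read_batteries is hypothetical.

```python
from pathlib import Path


def read_batteries(base="/sys/class/power_supply"):
    """Return {battery_name: {"capacity": int, "cycle_count": int}}.

    Assumes a sysfs-style layout (an assumption, adjust `base` if needed).
    A missing or unreadable file yields None for that field.
    """

    def read_int(path):
        try:
            return int(path.read_text().strip())
        except (OSError, ValueError):
            return None

    batteries = {}
    for battery_dir in sorted(Path(base).glob("BAT*")):
        batteries[battery_dir.name] = {
            "capacity": read_int(battery_dir / "capacity"),
            "cycle_count": read_int(battery_dir / "cycle_count"),
        }
    return batteries


if __name__ == "__main__":
    for name, values in read_batteries().items():
        print(name, values)
```

On a laptop with two batteries this would print one line for BAT0 and one for BAT1, which makes it easy to compare the numbers with the Grafana panels.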

Conclusion

The combination of Telegraf, InfluxDB and Grafana gives me an overview of my system's resources. Combining them with Docker lets me deploy the whole stack easily on any remote server using docker-compose. You can check out the code on GitHub here.