In a previous post, I showed how to monitor data using Collectd, InfluxDB and Grafana. In the meantime, I wanted to add more functionality to Collectd, but it was difficult to find plugins for Nvidia GPUs and for monitoring other docker instances. Then I found Telegraf, a tool from the same company behind InfluxDB that can collect data from several sources. Three advantages made me switch from Collectd to Telegraf:
- Telegraf is written in Go, which makes it fast and light and reduces the footprint of data collection.
- There is an extensive list of Telegraf plugins indexed in one official GitHub repository, which makes it very practical to find and install plugins.
- Support on the Telegraf GitHub repository is more active than on the Collectd one.
In this post I will show how to use Telegraf plugins to monitor GPU devices and battery status using Grafana.
My current stack still uses docker compose to orchestrate the services, which allows me to deploy all of them with a single command. Moreover, I use a Makefile to control my docker compose commands and to inject environment variables. I don't commit the real environment file to git; instead, I commit a fake environment file that shows samples of the variables used.
I define the following containers in my docker compose file:
- Telegraf: Collecting data
- Influxdb: Saving data
- Grafana: Displaying data
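A minimal sketch of what such a docker compose file could look like is shown below. Only the three service roles come from the list above; the image tags, ports, volume names and the path of the Telegraf configuration file are assumptions chosen for illustration and are not copied from the actual repository.

```yaml
# Hypothetical docker-compose.yml sketch; tags, ports and paths are assumptions.
version: '2.3'
services:
  influxdb:
    image: influxdb:1.8          # assumed tag; the InfluxQL queries in this post target the 1.x line
    ports:
      - "8086:8086"              # default InfluxDB HTTP API port
    volumes:
      - influxdb-data:/var/lib/influxdb
  telegraf:
    image: telegraf              # official image, tag left unpinned in this sketch
    depends_on:
      - influxdb
    volumes:
      # Assumed location of the non-GPU Telegraf configuration
      - ./docker/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      # The docker input plugin reads container stats through the docker socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
  grafana:
    image: grafana/grafana
    depends_on:
      - influxdb
    ports:
      - "3000:3000"              # default Grafana web UI port
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  influxdb-data:
  grafana-data:
```

Named volumes are used here so that the stored metrics and the Grafana dashboards survive a container restart; bind mounts would work as well.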
Grafana parameters for queries
I collect data on CPU, RAM, uptime, connected users, network utilisation, disk utilisation and docker stats. It is possible to write InfluxDB queries in the Grafana interface, which helps in choosing among the available parameters. For example, to get the docker CPU utilisation of each available container, we can use the following query:
SELECT mean("usage_percent") FROM "docker_container_cpu" WHERE $timeFilter GROUP BY time($__interval), "container_name" fill(null)
Grouping by container_name separates the values of each available container, and we can then use the Grafana alias pattern option to give a nice name to each line.
Using alias pattern to name group by variables
One of the main reasons I started using Telegraf was that I wanted to monitor a server with NVIDIA GPUs, and Telegraf offers a nice Nvidia plugin to do so.
I use an environment variable to decide whether to monitor the GPU. When it is set, an additional docker-compose file with special configuration values for the Nvidia GPU is included:
```yaml
version: '2.3'
services:
  telegraf:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ./docker/telegraf/telegraf-gpu.conf:/etc/telegraf/telegraf.conf
```
- The first setting is the runtime option, which lets docker access the GPU.
- The second is the environment variable NVIDIA_VISIBLE_DEVICES, which selects the NVIDIA devices the container is allowed to see. Possible values are either all or a device number (multiple device ids can be given, separated by commas).
- I use a different Telegraf configuration when monitoring the GPU, because I add the nvidia_smi input plugin, which monitors the GPU through the nvidia-smi tool. The only change is to uncomment the following lines:
```toml
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  # bin_path = "/usr/bin/nvidia-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
```
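Coming back to the NVIDIA_VISIBLE_DEVICES variable, the override file can also restrict Telegraf to a subset of the GPUs. The sketch below is a hypothetical variant of the GPU compose file shown earlier; the device indices 0 and 2 are purely illustrative.

```yaml
# Hypothetical variant of the GPU override file: expose only two GPUs.
version: '2.3'
services:
  telegraf:
    runtime: nvidia
    environment:
      # Comma-separated device indices instead of "all" (indices are illustrative)
      - NVIDIA_VISIBLE_DEVICES=0,2
```

With such a setting, only the listed devices should be visible inside the Telegraf container, so only those GPUs would show up in Grafana.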
In the image below you can see the temperature and the utilisation of each GPU. I run a batch of work distributed across all GPUs; when it finishes, it writes its output to a database.
GPU temperature and utilisation
You can notice two things:
- GPU 0 has a lower temperature than the other devices. This happens because, in my setup, GPU 0 sits at the edge, so its fan is not blocked by the other GPUs.
- Temperatures don't go beyond 80 °C, which is OK for my GPUs.
I was also curious about my battery utilisation, because I try to optimize the charging cycles by not letting the battery go below 20% and not charging [...]. So I tried a Telegraf battery plugin, which fetches the battery status.
The following image shows the battery capacity and the battery cycle count.
Battery monitoring using Grafana
My laptop has two batteries. I try to use only one of them: the one that can be removed from the laptop and can therefore be replaced easily. As a result, I can see that the cycle count is lower for BAT0. I also try to do complete cycles with the batteries, as one can see in the battery capacity plot.
The combination of Telegraf, InfluxDB and Grafana gives me an overview of the resources of my system. Combining them with docker lets me deploy the whole stack easily on any remote server using docker-compose. You can check out the code on GitHub here.