In a previous post, I showed how to monitor data using Collectd, InfluxDB and Grafana. In the meantime I wanted to add more functionality to Collectd, but it was difficult to find plugins for NVIDIA GPUs and for monitoring other Docker instances. Then I found Telegraf, a tool from the same company as InfluxDB that can collect data from several sources. Three advantages made me switch from Collectd to Telegraf:
In this post I will show how to use Telegraf plugins to monitor GPU devices and battery status using Grafana.
My current stack still uses docker compose to orchestrate the services, which allows me to deploy all of them with a simple docker-compose up.
Moreover, I use a Makefile to control my docker compose commands and inject environment variables into the docker-compose.yml.
I don't commit the real environment file to git; instead I commit a fake environment file that shows samples of the variables used.
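As a rough illustration of what such a wrapper does, the sketch below parses KEY=VALUE lines from an environment file and builds a docker-compose command line. It is not the Makefile from this post: the variable names in SAMPLE_ENV and the helper names are hypothetical, and the only docker-compose behaviour assumed is that each compose file is passed with its own -f flag.

```python
import os


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def compose_command(action, compose_files=("docker-compose.yml",)):
    """Build a docker-compose command; each compose file gets its own -f flag."""
    cmd = ["docker-compose"]
    for path in compose_files:
        cmd += ["-f", path]
    return cmd + action.split()


# Hypothetical sample values, in the spirit of a committed "fake" env file.
SAMPLE_ENV = """
# sample values only
INFLUXDB_DB=telegraf
GRAFANA_PORT=3000
"""

if __name__ == "__main__":
    # Variables from the file are layered on top of the current environment.
    env = {**os.environ, **parse_env_file(SAMPLE_ENV)}
    print(compose_command("up -d"))
    # subprocess.run(compose_command("up -d"), env=env, check=True)
    # would start the stack with those variables injected.
```

Passing a second file name in compose_files layers an override file on top of the base one, which is how an optional configuration can be switched on from a single variable.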
I define the following containers in my docker compose file:
Monitoring stack
I collect data from CPU, RAM, uptime, connected users, network utilization, disk utilization and docker stats. It is possible to write InfluxDB queries in the Grafana interface, which helps to choose among the available parameters. For example, to get the docker CPU utilization for each available container we can use the following query:
SELECT mean("usage_percent")
FROM "docker_container_cpu"
WHERE $timeFilter
GROUP BY time($__interval), "container_name" fill(null)
The group by container_name separates the values for each available container, and we can then use Grafana's alias pattern option to give a nice name to each line.
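The same query shape can be reused for other measurements. Here is a minimal sketch, assuming a hypothetical helper grouped_mean_query that only formats the string; $timeFilter and $__interval are left untouched for Grafana to substitute.

```python
def grouped_mean_query(measurement, field, tag):
    """Return an InfluxQL query averaging `field` per interval and per `tag` value."""
    return (
        f'SELECT mean("{field}") '
        f'FROM "{measurement}" '
        "WHERE $timeFilter "
        f'GROUP BY time($__interval), "{tag}" fill(null)'
    )


if __name__ == "__main__":
    # Reproduces the docker CPU query from the text on a single line.
    print(grouped_mean_query("docker_container_cpu", "usage_percent", "container_name"))
```

With this grouping, an alias pattern such as $tag_container_name labels each line with the container name.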
Using alias pattern to name group by variables
One of the main reasons I started using Telegraf was that I wanted to monitor a server with NVIDIA GPUs, and Telegraf offers a nice nvidia plugin to do so.
I use an environment variable to enable GPU monitoring. When it is set, an additional docker-compose file with special configuration values for the NVIDIA GPU is included.
version: '2.3'
services:
  telegraf:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ./docker/telegraf/telegraf-gpu.conf:/etc/telegraf/telegraf.conf
The NVIDIA_VISIBLE_DEVICES variable selects which NVIDIA devices the container is allowed to use. The possible values are either all or a device number (multiple device IDs can be given, separated by commas).
The Telegraf configuration relies on the nvidia_smi tool. The only change is to uncomment the following lines:
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  # bin_path = "/usr/bin/nvidia-smi"
  ## Optional: timeout for GPU polling
  # timeout = "5s"
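To get a feel for the kind of values nvidia-smi reports, the sketch below parses its CSV query output into per-GPU records. It is only an illustration and does not reproduce how the Telegraf plugin collects its data. The helper name and the sample rows are made up, and the parser assumes every field is an integer.

```python
import csv
import io

QUERY_FIELDS = ["index", "temperature.gpu", "utilization.gpu"]

# Command that would produce the CSV parsed below (requires an NVIDIA driver).
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into a list of per-GPU dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, temperature, utilization = (int(cell) for cell in row)
        gpus.append(
            {"index": index, "temperature_c": temperature, "utilization_pct": utilization}
        )
    return gpus


# Made-up sample output for two GPUs, not measurements from this post.
SAMPLE_OUTPUT = "0, 61, 98\n1, 58, 97\n"

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE_OUTPUT):
        print(gpu)
```

On a machine with a GPU, the sample text could be replaced by the output of subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True).stdout. Fields that a device does not support may be reported as non-numeric text, which this sketch does not handle.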
In the image below you can see the temperature and the utilization of each GPU. I have a batch of work distributed across all the GPUs; when it finishes, it writes its results to a database.
GPU temperature and utilisation
You can notice two things:
I was also curious about my battery utilization, because I try to optimize the charging cycles by not letting the battery go below 20% and not charging it above 90%.
So I tried the Telegraf battery plugin, which fetches the battery status from the /proc folder.
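The plugin's own data source is not shown here. As a rough sketch of the two values plotted in this section, the code below assumes the sysfs layout found on many Linux systems, where each battery directory under /sys/class/power_supply contains capacity and cycle_count files. That path is an assumption and may differ from what the plugin reads; the helper names are hypothetical.

```python
from pathlib import Path


def read_battery(name, base_dir="/sys/class/power_supply"):
    """Read capacity (%) and cycle count for one battery; missing files give None."""
    battery_dir = Path(base_dir) / name
    reading = {}
    for key in ("capacity", "cycle_count"):
        path = battery_dir / key
        reading[key] = int(path.read_text().strip()) if path.is_file() else None
    return reading


def within_target_band(capacity, low=20, high=90):
    """True while the charge stays inside the 20%-90% band described above."""
    return low <= capacity <= high


if __name__ == "__main__":
    for battery in ("BAT0", "BAT1"):
        print(battery, read_battery(battery))
```

A reading of None simply means the file was not found, for example on a machine without that battery.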
The following image shows the battery capacity and the battery cycle count.
Battery monitoring using Grafana
My laptop has two batteries. I try to use only one of them, the one that can be removed from the laptop and therefore replaced easily. As a result, the cycle count is lower for BAT0. I also try to do complete cycles with the batteries, as one can see in the battery capacity plot.
The combination of Telegraf, InfluxDB and Grafana gives me an overview of the resources of my system. Combining them with Docker allows me to deploy the whole stack easily on any remote server using docker-compose. You can check out the GitHub code here.