### Monitoring GPUs
url:https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
Actual steps:
docker run --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter
The following setup is required before Docker can start the container:
# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo systemctl daemon-reload # reload the systemd unit files
sudo systemctl restart docker # restart the Docker service
sudo pkill -SIGHUP dockerd # signal dockerd to reload its configuration
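After the restart it can be worth confirming that Docker actually registered the nvidia runtime. The following is a minimal sketch, assuming `docker info` prints the configured runtimes on a `Runtimes:` line; the helper name `has_nvidia_runtime` is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: reads `docker info` output on stdin and succeeds
# if a "Runtimes:" line mentions nvidia.
has_nvidia_runtime() {
    grep -i '^ *Runtimes:' | grep -qw 'nvidia'
}

# Only query Docker when the CLI is installed.
if command -v docker >/dev/null 2>&1; then
    if docker info 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime is registered"
    else
        echo "nvidia runtime not found; check /etc/docker/daemon.json"
    fi
fi
```

If the runtime is missing, `--runtime=nvidia` will most likely be rejected by `docker run`.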
Commands to run beforehand:
$ docker run --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter
# The GPU counts reported by dcgmi discovery and nvidia-smi should be the same.
$ docker exec nvidia-dcgm-exporter dcgmi discovery -i a -v | grep -c 'GPU ID:'
$ nvidia-smi -L | wc -l
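The two counts can also be compared automatically rather than by eye. This is a sketch only: the helper `compare_gpu_counts` is a hypothetical name, and the script assumes the exporter container is running under the name used in these notes.

```shell
#!/bin/sh
# Hypothetical helper: compares two GPU counts and reports the result.
compare_gpu_counts() {
    dcgm_count=$1
    smi_count=$2
    if [ "$dcgm_count" -eq "$smi_count" ]; then
        echo "match: $dcgm_count GPU(s)"
    else
        echo "mismatch: dcgmi=$dcgm_count nvidia-smi=$smi_count"
        return 1
    fi
}

# Run the real comparison only when both tools are available
# (assumes the nvidia-dcgm-exporter container is already running).
if command -v docker >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    dcgm=$(docker exec nvidia-dcgm-exporter dcgmi discovery -i a -v | grep -c 'GPU ID:')
    smi=$(nvidia-smi -L | wc -l)
    compare_gpu_counts "$dcgm" "$smi"
fi
```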
# See the following page for ways to view GPU data
url:https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
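# Here I mounted a volume so the data is mapped to the local host
mkdir -p /usr/local/prometheus # create a host directory to store the collected GPU metrics (note: the docker run commands below mount /run/prometheus)
### Use port 9100 of the local node_exporter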
docker tag nvidia/dcgm-exporter nvidia-dcgm-exporter
docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter -v /run/prometheus:/run/prometheus nvidia-dcgm-exporter
Or:
docker run -d --rm --cap-add=sys_admin --runtime=nvidia --name=nvidia-dcgm-exporter -v /run/prometheus:/run/prometheus nvidia-dcgm-exporter -p
docker run -d --rm --net="host" --pid="host" quay.io/prometheus/node-exporter --collector.textfile.directory="/run/prometheus"
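Before looking at Prometheus, it may help to check that the exporter is writing metrics into the shared directory that node-exporter reads. A minimal sketch, assuming the exporter writes Prometheus text-format files ending in `.prom` into `/run/prometheus`; the helper `count_dcgm_samples` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: counts metric sample lines starting with "dcgm_"
# in all *.prom files under the given directory (comment lines are skipped).
count_dcgm_samples() {
    dir=$1
    cat "$dir"/*.prom 2>/dev/null | grep -c '^dcgm_'
}

# /run/prometheus is the host directory mounted in the commands above.
if [ -d /run/prometheus ]; then
    echo "dcgm samples found: $(count_dcgm_samples /run/prometheus)"
fi
```

A count of 0 suggests the exporter container is not running or is writing somewhere else.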
Check whether Prometheus on port 9090 shows the dcgm metrics.
If it does, GPU monitoring is set up; next, look for a Grafana GPU dashboard template.
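The check against Prometheus can be done from the command line through its HTTP API, which lists all known metric names at `/api/v1/label/__name__/values`. This is a sketch under assumptions: the address `http://localhost:9090` and the helper name `list_dcgm_metrics` are illustrative, and Prometheus must already be scraping node-exporter on port 9100.

```shell
#!/bin/sh
# Hypothetical helper: extracts metric names beginning with "dcgm_"
# from the JSON returned by the label-values endpoint (read from stdin).
list_dcgm_metrics() {
    grep -o '"dcgm_[A-Za-z0-9_]*"' | tr -d '"' | sort -u
}

# Assumed address; adjust if Prometheus listens elsewhere.
PROM_URL=${PROM_URL:-http://localhost:9090}

if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 "$PROM_URL/api/v1/label/__name__/values" | list_dcgm_metrics
fi
```

Empty output means Prometheus has no dcgm metrics yet, or is not reachable at the assumed address.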
### Custom Grafana template
dcgm_board_limit_violation
dcgm_dec_utilization
dcgm_enc_utilization
dcgm_fb_free
dcgm_fb_used
dcgm_gpu_temp
dcgm_gpu_utilization
dcgm_low_util_violation
dcgm_mem_copy_utilization
dcgm_memory_clock
dcgm_pcie_replay_counter
dcgm_pcie_rx_throughput
dcgm_pcie_tx_throughput
dcgm_power_usage
dcgm_power_violation
dcgm_reliability_violation
dcgm_sm_clock
dcgm_sync_boost_violation
dcgm_thermal_violation
dcgm_total_energy_consumption
dcgm_xid_errors