My boss asked me: these 8 A100 machines burn tens of thousands in electricity every month — what is their utilization actually like? I opened nvidia-smi and had a look: well, 67% right now. The boss was not satisfied: I want trends, history, reports. Fine — time for a proper monitoring system. This post walks through the full process of building GPU monitoring with Prometheus + DCGM.
1. Overview
1.1 Background
nvidia-smi is a fine tool, but it only shows the current state. A production GPU cluster needs to answer questions like:
- What was the average GPU utilization over the past week?
- In which time windows are the GPUs idle, so offline jobs could run there?
- Is GPU memory usage creeping up over time (a memory leak)?
- Is a particular card consistently running hot and due for maintenance?
- How does the inference service's QPS relate to GPU utilization?
Answering them takes a complete monitoring pipeline: collection, storage, visualization, and alerting. That is exactly what the Prometheus + DCGM + Grafana combination provides.
1.2 Technical Highlights
DCGM (Data Center GPU Manager)
NVIDIA's official GPU management tool, considerably more capable than nvidia-smi:
- Exposes 150+ GPU metrics, far more than nvidia-smi can provide
- Built-in health checks and diagnostics
- GPU group management, useful for multi-tenant setups
- Ships a Prometheus exporter that works out of the box
- Very low overhead on the GPU driver (the polling interval is configurable)
Prometheus
The de facto standard for cloud-native monitoring:
- Pull model, well suited to dynamically changing clusters
- Powerful PromQL query language
- Built-in alerting (paired with Alertmanager)
- Rich ecosystem of exporters and dashboards
Why this stack
- Official support: NVIDIA maintains dcgm-exporter and keeps it up to date
- Full coverage: everything from GPU utilization to NVLink bandwidth
- Scales well: hundreds of GPU machines are not a problem
- Controlled cost: all open-source components, no license fees
1.3 When to Use It
This setup works well for:
- Large-model training clusters: monitor training progress, GPU utilization, communication efficiency
- Inference serving clusters: monitor QPS, latency, resource utilization
- GPU virtualization: resource monitoring for MIG, vGPU, and similar setups
- Mixed-workload clusters: co-located training + inference that needs fine-grained scheduling
It is less suitable for:
- Single-machine development environments (overkill)
- Scenarios that need near-real-time metrics (Prometheus scrapes every 15s by default)
1.4 Requirements
Hardware
- NVIDIA GPU: Kepler architecture (K80) or newer
- Volta/Ampere/Hopper recommended; they expose more metrics
Software
NVIDIA Driver: 450.80.02+ (525+ or 550+ recommended)
CUDA: 11.0+
Docker: 20.10+ (if deploying with containers)
Kubernetes: 1.25+ (if deploying on K8s)
Network
- Prometheus can reach port 9400 (dcgm-exporter's default) on every GPU node
- Grafana can reach Prometheus
- Optional: alerting needs access to Alertmanager, a mail server, or WeCom/DingTalk webhooks
Test environment
The configuration in this post was validated in the following environment:
OS: Ubuntu 22.04 LTS
GPU: 8 x NVIDIA A100-80GB
Driver: 550.90.07
DCGM: 3.3.6
Prometheus: 2.48.0
Grafana: 10.2.0
2. Step-by-Step Setup
2.1 Preparation
Install the NVIDIA driver
If the driver is not installed yet, this is the very first step:
# Ubuntu
sudo apt update
sudo apt install -y nvidia-driver-550-server
# Reboot
sudo reboot
# Verify
nvidia-smi
Install DCGM
DCGM can be installed via apt/yum or run as a container. Bare-metal installation first (the datacenter-gpu-manager package lives in NVIDIA's CUDA repository):
# Add the NVIDIA CUDA repository (it hosts the datacenter-gpu-manager package)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Install datacenter-gpu-manager
sudo apt update
sudo apt install -y datacenter-gpu-manager
# Enable and start service
sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm
# Verify
dcgmi discovery -l
Output like the following means the installation succeeded:
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA A100-SXM4-80GB |
| | PCI Bus ID: 00000000:07:00.0 |
| | Device UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
+--------+----------------------------------------------------------------------+
...
Install dcgm-exporter
dcgm-exporter exposes DCGM's metrics in Prometheus format:
# Method 1: Container (recommended)
docker run -d \
--name dcgm-exporter \
--gpus all \
--restart unless-stopped \
-p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
# Method 2: Build from source
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary
# The built binary sits under cmd/dcgm-exporter/ in recent versions; copy it into PATH
sudo cp cmd/dcgm-exporter/dcgm-exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/dcgm-exporter.service << 'EOF'
[Unit]
Description=DCGM Exporter
After=nvidia-dcgm.service
Requires=nvidia-dcgm.service
[Service]
Type=simple
ExecStart=/usr/local/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable dcgm-exporter
sudo systemctl start dcgm-exporter
Verify that the exporter is working:
curl http://localhost:9400/metrics | head -50
You should see output like this:
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxx",device="nvidia0",...} 42
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-yyy",device="nvidia1",...} 44
...
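If you prefer to sanity-check this from code rather than eyeballing curl output, here is a minimal Python sketch (an illustrative helper, not part of the stack; it assumes the requests package and the default :9400 endpoint) that scrapes the exporter once and prints per-GPU utilization and temperature:
#!/usr/bin/env python3
"""Quick sanity check of a dcgm-exporter endpoint (illustrative sketch)."""
import re
import requests

EXPORTER_URL = "http://localhost:9400/metrics"  # adjust to your node

def scrape(metric_name: str) -> dict:
    """Return {gpu_id: value} for one DCGM metric from the /metrics text."""
    text = requests.get(EXPORTER_URL, timeout=5).text
    values = {}
    for line in text.splitlines():
        if not line.startswith(metric_name + "{"):
            continue
        gpu = re.search(r'gpu="(\d+)"', line)
        value = line.rsplit(" ", 1)[-1]
        if gpu:
            values[gpu.group(1)] = float(value)
    return values

if __name__ == "__main__":
    util = scrape("DCGM_FI_DEV_GPU_UTIL")
    temp = scrape("DCGM_FI_DEV_GPU_TEMP")
    for gpu_id in sorted(util):
        print(f"GPU {gpu_id}: util={util[gpu_id]:.0f}%  temp={temp.get(gpu_id, float('nan')):.0f}C")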
2.2 Core Configuration
Prometheus configuration
Assuming you already run a Prometheus server, add a scrape config for the GPU nodes:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Static config for few nodes
  - job_name: 'dcgm-gpu'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
          - 'gpu-node-04:9400'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):9400'
        target_label: instance
        replacement: '${1}'

  # File-based service discovery for dynamic nodes
  - job_name: 'dcgm-gpu-dynamic'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/gpu_nodes.json'
        refresh_interval: 30s
# /etc/prometheus/targets/gpu_nodes.json
[
  {
    "targets": ["gpu-node-01:9400", "gpu-node-02:9400"],
    "labels": {
      "cluster": "training",
      "datacenter": "dc1"
    }
  },
  {
    "targets": ["gpu-node-03:9400", "gpu-node-04:9400"],
    "labels": {
      "cluster": "inference",
      "datacenter": "dc1"
    }
  }
]
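When nodes come and go often, it is handy to generate that target file from an inventory instead of editing it by hand. A small sketch (the inventory dict is a made-up example; the output path matches the file_sd path above, and Prometheus picks changes up automatically):
#!/usr/bin/env python3
"""Render /etc/prometheus/targets/gpu_nodes.json from a simple inventory (sketch)."""
import json

# Hypothetical inventory: cluster name -> list of GPU node hostnames
INVENTORY = {
    "training": ["gpu-node-01", "gpu-node-02"],
    "inference": ["gpu-node-03", "gpu-node-04"],
}
EXPORTER_PORT = 9400
OUTPUT = "/etc/prometheus/targets/gpu_nodes.json"

def render(inventory: dict, datacenter: str = "dc1") -> list:
    """Build the file_sd target groups Prometheus expects."""
    return [
        {
            "targets": [f"{host}:{EXPORTER_PORT}" for host in hosts],
            "labels": {"cluster": cluster, "datacenter": datacenter},
        }
        for cluster, hosts in inventory.items()
    ]

if __name__ == "__main__":
    with open(OUTPUT, "w") as f:
        json.dump(render(INVENTORY), f, indent=2)
    print(f"wrote {OUTPUT}")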
Customizing collected metrics
dcgm-exporter collects a set of common metrics by default, but not all of them. You can customize the list with a counters file:
# /etc/dcgm-exporter/custom-counters.csv
# Format: DCGM_FIELD, Prometheus metric type, help message
#
# Basic metrics
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory bandwidth utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
# Clock and energy metrics
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
# PCIe metrics
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, gauge, PCIe transmit throughput.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, gauge, PCIe receive throughput.
# NVLink metrics (important for multi-GPU training)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, Total NVLink bandwidth.
# Tensor Core metrics
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
# Error counters
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total double-bit volatile ECC errors.
# XID errors (GPU faults)
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
Start the exporter with the custom counters file:
docker run -d \
--name dcgm-exporter \
--gpus all \
-p 9400:9400 \
-v /etc/dcgm-exporter:/etc/dcgm-exporter:ro \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 \
-f /etc/dcgm-exporter/custom-counters.csv
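A quick way to confirm the custom counters actually took effect is to diff the field names in the CSV against what the exporter exposes. A small sketch (assumes the requests package; the paths and URL are the ones used above):
#!/usr/bin/env python3
"""Check which counters from custom-counters.csv are actually exposed (sketch)."""
import requests

CSV_PATH = "/etc/dcgm-exporter/custom-counters.csv"
METRICS_URL = "http://localhost:9400/metrics"

def configured_counters(path: str) -> set:
    """Field names from the counters CSV (first column), ignoring comments and blanks."""
    names = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                names.add(line.split(",")[0].strip())
    return names

def exposed_metrics(url: str) -> set:
    """Metric names currently present on the /metrics endpoint."""
    text = requests.get(url, timeout=5).text
    return {line.split("{")[0] for line in text.splitlines() if line.startswith("DCGM_")}

if __name__ == "__main__":
    missing = configured_counters(CSV_PATH) - exposed_metrics(METRICS_URL)
    if missing:
        print("Configured but not exposed (may be unsupported on this GPU/driver):")
        for name in sorted(missing):
            print(f"  {name}")
    else:
        print("All configured counters are exposed.")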
2.3 Kubernetes Deployment
If your GPU cluster is managed by Kubernetes, the recommended way to run dcgm-exporter is a DaemonSet:
# dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: pod-gpu-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
            - name: custom-counters
              mountPath: /etc/dcgm-exporter
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: custom-counters
          configMap:
            name: dcgm-exporter-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
  namespace: monitoring
data:
  default-counters.csv: |
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory bandwidth utilization (in %).
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT, gauge, PCIe transmit throughput.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT, gauge, PCIe receive throughput.
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  type: ClusterIP
  ports:
    - port: 9400
      targetPort: 9400
      name: metrics
  selector:
    app: dcgm-exporter
# Deploy
kubectl apply -f dcgm-exporter-daemonset.yaml
# Verify pods running on GPU nodes
kubectl get pods -n monitoring -l app=dcgm-exporter -o wide
# Check metrics
kubectl port-forward -n monitoring daemonset/dcgm-exporter 9400:9400
curl http://localhost:9400/metrics
ServiceMonitor configuration (for Prometheus Operator)
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
2.4 Verification
Check metric collection
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-gpu")'
# Query GPU utilization
curl -G http://prometheus:9090/api/v1/query \
--data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
# Check all available DCGM metrics
curl http://prometheus:9090/api/v1/label/__name__/values | jq '.data[] | select(startswith("DCGM"))'
Useful verification queries
# Average GPU utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)
# GPU utilization by node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)
# Total GPU memory usage in GB
sum(DCGM_FI_DEV_FB_USED) / 1024
# 95th percentile of GPU temperature across all GPUs
quantile(0.95, DCGM_FI_DEV_GPU_TEMP)
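To turn these queries into the kind of report the boss actually wants (average utilization over the past week, per node), you can hit the Prometheus range-query API directly. A sketch, assuming the requests package and the Prometheus address used elsewhere in this post:
#!/usr/bin/env python3
"""Weekly GPU utilization report per node via the Prometheus HTTP API (sketch)."""
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # adjust to your environment

def weekly_avg_util_by_node() -> dict:
    """Average DCGM_FI_DEV_GPU_UTIL per instance over the last 7 days."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={
            "query": "avg by (instance) (DCGM_FI_DEV_GPU_UTIL)",
            "start": end - 7 * 24 * 3600,
            "end": end,
            "step": "1h",
        },
        timeout=30,
    )
    resp.raise_for_status()
    report = {}
    for series in resp.json()["data"]["result"]:
        samples = [float(v) for _, v in series["values"]]
        report[series["metric"]["instance"]] = sum(samples) / len(samples)
    return report

if __name__ == "__main__":
    for node, avg_util in sorted(weekly_avg_util_by_node().items()):
        print(f"{node}: {avg_util:.1f}% average GPU utilization over the last 7 days")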
3. Example Code and Configuration
3.1 Deploying the Full Monitoring Stack
docker-compose deployment (good for small setups)
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'dcgm-gpu'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
          - 'gpu-node-04:9400'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):9400'
        target_label: instance
        replacement: '${1}'
3.2 Alerting Rules
# prometheus/rules/gpu_alerts.yml
groups:
  - name: gpu_alerts
    interval: 30s
    rules:
      # GPU utilization too low (wasting resources)
      - alert: GPUUtilizationLow
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization is low on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has been below 20% utilization for over 1 hour. Current: {{ $value | printf \"%.1f\" }}%"

      # GPU memory almost full
      - alert: GPUMemoryHigh
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage critical on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} memory usage is above 95%. Used: {{ $value | printf \"%.1f\" }}%"

      # GPU temperature too high
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C, which exceeds the safe threshold (83°C)"

      # GPU temperature warning
      - alert: GPUTemperatureWarning
        expr: DCGM_FI_DEV_GPU_TEMP > 75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature elevated on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C"

      # ECC errors detected
      - alert: GPUECCErrors
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "GPU ECC double-bit errors on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} has detected uncorrectable ECC errors. This may indicate hardware failure."

      # XID errors (GPU faults)
      - alert: GPUXIDErrors
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "GPU XID errors on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} reported an XID error. Check nvidia-smi and dmesg for details."

      # GPU power usage near limit
      - alert: GPUPowerHigh
        expr: DCGM_FI_DEV_POWER_USAGE > 380
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU power consumption high on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} power usage is {{ $value }}W, approaching the TDP limit"

      # DCGM exporter down
      - alert: DCGMExporterDown
        expr: up{job="dcgm-gpu"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DCGM Exporter is down on {{ $labels.instance }}"
          description: "Cannot scrape GPU metrics from {{ $labels.instance }}. Check if dcgm-exporter is running."

      # NVLink bandwidth degraded
      - alert: NVLinkBandwidthLow
        expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL < 200000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "NVLink bandwidth degraded on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} NVLink bandwidth is {{ $value | humanize }}B/s, which is below expected"
3.3 Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-alerts-critical'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
        send_resolved: true

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-alerts-warning'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          {{ end }}

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR-PAGERDUTY-SERVICE-KEY'
        severity: critical

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
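Before trusting the routing tree with real incidents, push a synthetic alert through it and check that it lands in the right channel. A sketch using Alertmanager's v2 API (assumes the requests package; the label values are made up for the test):
#!/usr/bin/env python3
"""Send a synthetic alert to Alertmanager to exercise routing and receivers (sketch)."""
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER_URL = "http://alertmanager:9093"

def send_test_alert() -> None:
    """POST a short-lived fake GPUTemperatureHigh alert."""
    now = datetime.now(timezone.utc)
    alert = [{
        "labels": {
            "alertname": "GPUTemperatureHigh",   # matches the rule name used above
            "severity": "critical",
            "instance": "gpu-node-01:9400",      # fake instance for the test
            "gpu": "0",
        },
        "annotations": {
            "summary": "Test alert - please ignore",
            "description": "Synthetic alert used to verify Alertmanager routing.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=alert, timeout=10)
    resp.raise_for_status()
    print("test alert accepted by Alertmanager")

if __name__ == "__main__":
    send_test_alert()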
3.4 Grafana Dashboard
Here is a complete GPU monitoring dashboard JSON that can be imported into Grafana as-is:
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
"id": 1,
"panels": [],
"title": "Overview",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 50 },
{ "color": "red", "value": 80 }
]
},
"unit": "percent"
}
},
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 1 },
"id": 2,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"pluginVersion": "10.2.0",
"targets": [
{
"expr": "avg(DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"})",
"legendFormat": "Avg GPU Util",
"refId": "A"
}
],
"title": "Average GPU Utilization",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
},
"unit": "celsius"
}
},
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 1 },
"id": 3,
"options": {
"colorMode": "value",
"graphMode": "none",
"reduceOptions": { "calcs": ["max"], "fields": "", "values": false }
},
"targets": [
{
"expr": "max(DCGM_FI_DEV_GPU_TEMP{instance=~\"$instance\"})",
"legendFormat": "Max Temp",
"refId": "A"
}
],
"title": "Max GPU Temperature",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
},
"unit": "watt"
}
},
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 1 },
"id": 4,
"options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
"targets": [
{
"expr": "sum(DCGM_FI_DEV_POWER_USAGE{instance=~\"$instance\"})",
"legendFormat": "Total Power",
"refId": "A"
}
],
"title": "Total Power Consumption",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "green", "value": null }]
},
"unit": "short"
}
},
"gridPos": { "h": 4, "w": 6, "x": 18, "y": 1 },
"id": 5,
"options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
"targets": [
{
"expr": "count(DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"})",
"legendFormat": "GPU Count",
"refId": "A"
}
],
"title": "Active GPUs",
"type": "stat"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 5 },
"id": 10,
"panels": [],
"title": "GPU Utilization",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"lineInterpolation": "smooth",
"lineWidth": 1,
"showPoints": "never"
},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
"unit": "percent"
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
"id": 11,
"options": { "legend": { "calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom" } },
"targets": [
{
"expr": "DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Utilization by GPU",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"drawStyle": "line",
"fillOpacity": 20,
"lineInterpolation": "smooth",
"lineWidth": 1
},
"max": 100,
"min": 0,
"unit": "percent"
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
"id": 12,
"targets": [
{
"expr": "(DCGM_FI_DEV_FB_USED{instance=~\"$instance\"} / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Memory Utilization",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 14 },
"id": 20,
"panels": [],
"title": "Temperature & Power",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 },
"unit": "celsius"
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 15 },
"id": 21,
"targets": [
{
"expr": "DCGM_FI_DEV_GPU_TEMP{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Temperature",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 },
"unit": "watt"
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 15 },
"id": 22,
"targets": [
{
"expr": "DCGM_FI_DEV_POWER_USAGE{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Power Usage",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 23 },
"id": 30,
"panels": [],
"title": "Memory Details",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"custom": { "drawStyle": "bars", "fillOpacity": 80, "stacking": { "mode": "normal" } },
"unit": "decmbytes"
}
},
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 24 },
"id": 31,
"targets": [
{
"expr": "DCGM_FI_DEV_FB_USED{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }} Used",
"refId": "A"
},
{
"expr": "DCGM_FI_DEV_FB_FREE{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }} Free",
"refId": "B"
}
],
"title": "GPU Memory Usage (MB)",
"type": "timeseries"
}
],
"refresh": "30s",
"schemaVersion": 38,
"style": "dark",
"tags": ["gpu", "nvidia", "dcgm"],
"templating": {
"list": [
{
"current": { "selected": false, "text": "prometheus", "value": "prometheus" },
"hide": 0,
"includeAll": false,
"multi": false,
"name": "datasource",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
},
{
"allValue": ".*",
"current": { "selected": true, "text": "All", "value": "$__all" },
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"definition": "label_values(DCGM_FI_DEV_GPU_UTIL, instance)",
"hide": 0,
"includeAll": true,
"multi": true,
"name": "instance",
"options": [],
"query": { "query": "label_values(DCGM_FI_DEV_GPU_UTIL, instance)", "refId": "A" },
"refresh": 2,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"type": "query"
}
]
},
"time": { "from": "now-1h", "to": "now" },
"timepicker": {},
"timezone": "",
"title": "GPU Cluster Monitoring",
"uid": "gpu-cluster-monitoring",
"version": 1,
"weekStart": ""
}
Save it as gpu-dashboard.json and import it in Grafana.
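Importing by hand works, but if you provision several clusters it is nicer to push the JSON through Grafana's HTTP API. A sketch (assumes the requests package, an API token with editor rights, and the file name used above; the token value is obviously a placeholder):
#!/usr/bin/env python3
"""Import gpu-dashboard.json into Grafana via its HTTP API (sketch)."""
import json
import requests

GRAFANA_URL = "http://grafana:3000"
API_TOKEN = "YOUR-GRAFANA-API-TOKEN"

def import_dashboard(path: str = "gpu-dashboard.json") -> None:
    """POST the dashboard JSON to /api/dashboards/db, overwriting any existing copy."""
    with open(path) as f:
        dashboard = json.load(f)
    dashboard["id"] = None  # let Grafana assign an internal id
    payload = {"dashboard": dashboard, "overwrite": True, "folderId": 0}
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=payload,
        timeout=15,
    )
    resp.raise_for_status()
    print("imported:", resp.json().get("url"))

if __name__ == "__main__":
    import_dashboard()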
4. Best Practices and Caveats
4.1 Performance Tuning
Tuning the scrape interval
The default 15s scrape interval is enough for most situations, but adjust it to your needs:
# High frequency (for debugging)
scrape_interval: 5s # More CPU on Prometheus, more storage
# Standard (production)
scrape_interval: 15s # Good balance
# Low frequency (for large clusters)
scrape_interval: 30s # Less accurate, but lower overhead
With more than 100 GPU nodes, consider:
- A 15s-30s scrape interval
- Enabling Prometheus remote write, with Thanos/Cortex for long-term storage
- Sharding: one Prometheus per cluster
Reducing label cardinality
DCGM metrics carry a UUID label by default; it is unique per card and inflates series cardinality. Since these labels come from the scraped metrics themselves, drop them with metric_relabel_configs:
# prometheus.yml - drop high-cardinality labels after scraping
metric_relabel_configs:
  - action: labeldrop
    regex: 'UUID'       # Drop the UUID label to reduce cardinality
  - action: labeldrop
    regex: 'modelName'  # If all GPUs are the same model
Using recording rules
Precompute frequently used queries to cut dashboard load times:
# recording_rules.yml
groups:
  - name: gpu_recording_rules
    interval: 30s
    rules:
      # Average GPU utilization by node
      - record: node:gpu_utilization:avg
        expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL)

      # Total memory used by node (in GB)
      - record: node:gpu_memory_used_gb:sum
        expr: sum by (instance) (DCGM_FI_DEV_FB_USED) / 1024

      # GPU memory utilization ratio
      - record: gpu:memory_utilization:ratio
        expr: |
          DCGM_FI_DEV_FB_USED /
          (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

      # 5-minute average GPU utilization
      - record: gpu:utilization:avg5m
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
4.2 Security Hardening
TLS encryption
In production, traffic between the exporter and Prometheus should be encrypted:
# dcgm-exporter with TLS
docker run -d \
--name dcgm-exporter \
--gpus all \
-p 9400:9400 \
-v /etc/dcgm-exporter/certs:/certs:ro \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 \
--web.config=/certs/web-config.yml
# web-config.yml
tls_server_config:
  cert_file: /certs/server.crt
  key_file: /certs/server.key
# prometheus.yml
scrape_configs:
  - job_name: 'dcgm-gpu'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      insecure_skip_verify: false
    static_configs:
      - targets: ['gpu-node-01:9400']
Authentication
Add basic auth:
# web-config.yml
basic_auth_users:
  prometheus: $2y$10$xxxxx  # bcrypt hash of the password
# prometheus.yml
scrape_configs:
  - job_name: 'dcgm-gpu'
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/dcgm_password
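The value in basic_auth_users must be a bcrypt hash, not the plaintext password. htpasswd -B works, or a couple of lines of Python (assumes the third-party bcrypt package is installed):
#!/usr/bin/env python3
"""Generate a bcrypt hash for the exporter's web-config.yml (sketch)."""
import getpass
import bcrypt  # pip install bcrypt

if __name__ == "__main__":
    password = getpass.getpass("password for user 'prometheus': ").encode()
    # cost factor 10; paste the printed hash into basic_auth_users
    print(bcrypt.hashpw(password, bcrypt.gensalt(rounds=10)).decode())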
4.3 High Availability
Prometheus HA
Run two Prometheus instances that scrape the same targets:
# prometheus-1.yml
global:
  external_labels:
    replica: prometheus-1
# prometheus-2.yml
global:
  external_labels:
    replica: prometheus-2
Pair them with Thanos or Cortex for deduplication and long-term storage.
Alertmanager HA (clustering is configured via startup flags rather than in alertmanager.yml):
# alertmanager-1 startup flags
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094
# alertmanager-2 startup flags
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094
4.4 Common Errors
Error 1: No data points
Symptom: Grafana shows "No data"
How to investigate:
# 1. Check if exporter is running
curl http://gpu-node-01:9400/metrics
# 2. Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-gpu")'
# 3. Check firewall
sudo iptables -L -n | grep 9400
Common causes:
- dcgm-exporter is not running
- A firewall is blocking port 9400
- The target address configured in Prometheus is wrong
Error 2: DCGM initialization failed
Error: Cannot connect to the DCGM socket
Fix:
# Check DCGM service
sudo systemctl status nvidia-dcgm
# Restart if needed
sudo systemctl restart nvidia-dcgm
sudo systemctl restart dcgm-exporter
# Check permissions
ls -la /var/run/nvidia-dcgm.sock
Error 3: Metric values are 0 or missing
Some metrics are not supported on every GPU model:
# Check supported fields
dcgmi dmon -e 1009,1010,1011 -c 1
# Check if profiling metrics need to be enabled
dcgmi profile --pause # Disable
dcgmi profile --resume # Enable
Error 4: Memory usage keeps growing
dcgm-exporter memory leak (older versions had this problem):
# Check memory usage
ps aux | grep dcgm-exporter
# Upgrade to latest version
docker pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
5. Troubleshooting and Monitoring
5.1 Viewing Logs
DCGM service logs
# Systemd logs
journalctl -u nvidia-dcgm -f
# Docker logs
docker logs dcgm-exporter -f --tail 100
# K8s logs
kubectl logs -f daemonset/dcgm-exporter -n monitoring
Common log analysis
# Find errors
journalctl -u nvidia-dcgm | grep -i error
# DCGM initialization issues
journalctl -u nvidia-dcgm | grep -i "failed to"
5.2 Common Problems
Problem 1: GPU utilization fluctuates wildly
# Check utilization variance
stddev_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
# Compare with request rate (if using vLLM)
rate(vllm_request_success_total[5m])
Likely causes:
- Uneven request arrival → check the load balancing
- Batch size too small → tune the inference framework parameters
- Lots of short requests → consider request batching
Problem 2: GPU memory usage keeps growing
# Memory usage trend
deriv(DCGM_FI_DEV_FB_USED[1h])
# Compare before and after restart
DCGM_FI_DEV_FB_USED offset 1d
Likely causes:
- Python objects not being released → review the code
- KV cache not being cleared → check the inference framework configuration
- CUDA context leak → upgrade the driver
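If deriv() shows a steady climb, it helps to estimate how long until the card runs out of memory. A rough sketch that fits a linear trend to the last few hours of DCGM_FI_DEV_FB_USED via the range API (assumes the requests package; the 80 GB capacity matches the A100s used in this post):
#!/usr/bin/env python3
"""Estimate hours until GPU memory is exhausted from the recent usage trend (sketch)."""
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"
CAPACITY_MB = 80 * 1024  # A100-80GB; adjust for other cards

def hours_until_full(instance: str, gpu: str, window_h: int = 6):
    """Least-squares slope of FB_USED over the window, extrapolated to capacity."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={
            "query": f'DCGM_FI_DEV_FB_USED{{instance="{instance}",gpu="{gpu}"}}',
            "start": end - window_h * 3600,
            "end": end,
            "step": "300",
        },
        timeout=30,
    )
    series = resp.json()["data"]["result"]
    if not series:
        return None
    points = [(float(t), float(v)) for t, v in series[0]["values"]]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points) or 1.0
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / denom  # MB per second
    if slope <= 0:
        return None  # not growing
    return (CAPACITY_MB - points[-1][1]) / slope / 3600

if __name__ == "__main__":
    eta = hours_until_full("gpu-node-01:9400", "0")
    print(f"estimated hours until OOM: {eta:.1f}" if eta else "memory usage is not trending up")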
Problem 3: Temperature too high
# Check fan speed
nvidia-smi -q -d FAN
# Check power limit
nvidia-smi -q -d POWER
# Reduce power limit if needed (temporary)
sudo nvidia-smi -pl 300 # Set 300W limit
Things to check:
- Room temperature → check the air conditioning
- Blocked airflow → check the server for dust
- Heatsink problems → contact hardware maintenance
5.3 Performance Monitoring
Key indicator queries
# Overall cluster health score (0-100), an illustrative weighted mix
(
  (avg(DCGM_FI_DEV_GPU_UTIL) / 100) * 0.3 +
  (1 - max(DCGM_FI_DEV_GPU_TEMP) / 90) * 0.2 +
  (1 - clamp_max(sum(rate(DCGM_FI_DEV_XID_ERRORS[1h])), 1)) * 0.3 +
  (sum(up{job="dcgm-gpu"}) / count(up{job="dcgm-gpu"})) * 0.2
) * 100
# Cost efficiency (useful work per watt)
sum(rate(vllm_request_success_total[5m])) / sum(DCGM_FI_DEV_POWER_USAGE)
Automated inspection script
#!/usr/bin/env python3
"""
GPU cluster health check script.
Run daily via cron to catch issues early.
"""
import requests
import json
from datetime import datetime

PROMETHEUS_URL = "http://prometheus:9090"

THRESHOLDS = {
    "gpu_utilization_low": 20,
    "gpu_temperature_high": 80,
    "memory_usage_high": 95,
}

def query_prometheus(query: str) -> list:
    """Execute PromQL query and return results."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query}
    )
    data = resp.json()
    if data["status"] != "success":
        raise Exception(f"Query failed: {data}")
    return data["data"]["result"]

def check_gpu_health():
    """Run all GPU health checks."""
    issues = []

    # Check GPU utilization
    results = query_prometheus("avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])")
    for r in results:
        if float(r["value"][1]) < THRESHOLDS["gpu_utilization_low"]:
            issues.append({
                "type": "low_utilization",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check GPU temperature
    results = query_prometheus("DCGM_FI_DEV_GPU_TEMP")
    for r in results:
        if float(r["value"][1]) > THRESHOLDS["gpu_temperature_high"]:
            issues.append({
                "type": "high_temperature",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check memory usage
    results = query_prometheus(
        "(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100"
    )
    for r in results:
        if float(r["value"][1]) > THRESHOLDS["memory_usage_high"]:
            issues.append({
                "type": "high_memory",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check for ECC errors
    results = query_prometheus("increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[24h]) > 0")
    for r in results:
        issues.append({
            "type": "ecc_errors",
            "instance": r["metric"].get("instance"),
            "gpu": r["metric"].get("gpu"),
            "value": float(r["value"][1])
        })

    return issues

def generate_report(issues: list) -> str:
    """Generate health check report."""
    report = f"""
GPU Cluster Health Report
Generated: {datetime.now().isoformat()}
{'='*50}
Total Issues Found: {len(issues)}
"""
    if not issues:
        report += "\nAll GPUs are healthy!\n"
    else:
        by_type = {}
        for issue in issues:
            t = issue["type"]
            if t not in by_type:
                by_type[t] = []
            by_type[t].append(issue)

        for issue_type, items in by_type.items():
            report += f"\n{issue_type.upper()} ({len(items)} issues):\n"
            for item in items:
                report += f"  - {item['instance']} GPU {item['gpu']}: {item['value']:.1f}\n"

    return report

if __name__ == "__main__":
    issues = check_gpu_health()
    report = generate_report(issues)
    print(report)

    # Send to Slack/Email if issues found
    if issues:
        # webhook_url = "https://hooks.slack.com/..."
        # requests.post(webhook_url, json={"text": report})
        pass
5.4 Backup and Restore
Prometheus data backup
# Create a snapshot (requires Prometheus to be started with --web.enable-admin-api)
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Backup snapshot directory
tar -czvf prometheus-backup-$(date +%Y%m%d).tar.gz \
/prometheus/snapshots/
# Restore
tar -xzvf prometheus-backup-20241219.tar.gz -C /prometheus/
Grafana dashboard backup
# Export all dashboards
for uid in $(curl -s http://admin:password@grafana:3000/api/search | jq -r '.[].uid'); do
curl -s "http://admin:password@grafana:3000/api/dashboards/uid/$uid" | \
jq '.dashboard' > "dashboard-$uid.json"
done
# Import dashboard
curl -X POST \
-H "Content-Type: application/json" \
-d @dashboard-xxx.json \
http://admin:password@grafana:3000/api/dashboards/import
6. Summary
6.1 Key Takeaways
The core steps for building the GPU monitoring system:
- Install DCGM: NVIDIA's official tool, providing a rich set of GPU metrics
- Deploy dcgm-exporter: exposes those metrics in Prometheus format
- Configure Prometheus: scrape and store the metrics
- Build the Grafana dashboard: visualization
- Set up alerting rules: high temperature, low utilization, ECC errors, and so on
Key configuration files:
/etc/dcgm-exporter/custom-counters.csv: the list of metrics to collect
prometheus.yml: scrape target configuration
gpu_alerts.yml: alerting rules
6.2 Where to Go Next
MIG monitoring
A100/H100 support MIG (Multi-Instance GPU), which splits one card into several isolated instances:
# Enable MIG
sudo nvidia-smi -i 0 -mig 1
# Create GPU instances
sudo nvidia-smi mig -i 0 -cgi 9,9 -C  # two 3g instances; a third 3g profile does not fit on one A100
# DCGM exporter will automatically detect MIG instances
vGPU monitoring
Virtualized environments need additional vGPU metrics:
# Install vGPU manager
# DCGM can monitor both pGPU and vGPU
dcgmi discovery -l
Correlating with business metrics
Analyze GPU utilization together with business QPS:
# GPU utilization per unit of request rate (aggregated across the cluster)
avg(DCGM_FI_DEV_GPU_UTIL)
  /
sum(rate(http_requests_total[5m]))
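PromQL division gives a ratio; if you want an actual correlation coefficient between utilization and request rate over time, pull both series through the range API and compute it offline. A sketch (assumes the requests package; http_requests_total stands in for whatever request counter your service actually exposes):
#!/usr/bin/env python3
"""Pearson correlation between GPU utilization and request rate (sketch)."""
import math
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def fetch(query: str, hours: int = 24) -> list:
    """Return the sample values of a range query (first series only)."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": query, "start": end - hours * 3600, "end": end, "step": "60"},
        timeout=30,
    )
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []

def pearson(xs: list, ys: list) -> float:
    """Correlation coefficient of two equally sampled series."""
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

if __name__ == "__main__":
    node = "gpu-node-01:9400"
    util = fetch(f'avg(DCGM_FI_DEV_GPU_UTIL{{instance="{node}"}})')
    qps = fetch('sum(rate(http_requests_total[5m]))')
    print(f"correlation(util, qps) over 24h: {pearson(util, qps):.2f}")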
6.3 References
- NVIDIA DCGM official documentation
- dcgm-exporter on GitHub
- Prometheus official documentation
- NVIDIA GPU metrics field reference
Appendix
A. Command Cheat Sheet
# DCGM basic commands
dcgmi discovery -l # List all GPUs
dcgmi diag -r 1 # Quick diagnostic
dcgmi diag -r 3 # Full diagnostic (takes longer)
dcgmi health -g 0 # Check GPU 0 health
dcgmi stats -g 0 -e # Enable stats collection
dcgmi dmon -e 1009,1010 # Monitor specific fields
# dcgm-exporter
docker run --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
curl http://localhost:9400/metrics
# Prometheus queries
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
curl -G 'http://prometheus:9090/api/v1/query_range' \
--data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-02T00:00:00Z' \
--data-urlencode 'step=1h'
B. Common DCGM Field IDs
| Field ID | Name | Type | Description |
| --- | --- | --- | --- |
| 1 | DCGM_FI_DRIVER_VERSION | String | Driver version |
| 100 | DCGM_FI_DEV_SM_CLOCK | Gauge | SM clock frequency (MHz) |
| 101 | DCGM_FI_DEV_MEM_CLOCK | Gauge | Memory clock frequency (MHz) |
| 150 | DCGM_FI_DEV_GPU_TEMP | Gauge | GPU temperature (°C) |
| 155 | DCGM_FI_DEV_POWER_USAGE | Gauge | Power draw (W) |
| 251 | DCGM_FI_DEV_FB_FREE | Gauge | Free GPU memory (MB) |
| 252 | DCGM_FI_DEV_FB_USED | Gauge | Used GPU memory (MB) |
| 250 | DCGM_FI_DEV_FB_TOTAL | Gauge | Total GPU memory (MB) |
| 203 | DCGM_FI_DEV_GPU_UTIL | Gauge | GPU utilization (%) |
| 204 | DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | Memory bandwidth utilization (%) |
| 200 | DCGM_FI_DEV_PCIE_TX_THROUGHPUT | Gauge | PCIe transmit throughput |
| 201 | DCGM_FI_DEV_PCIE_RX_THROUGHPUT | Gauge | PCIe receive throughput |
| 156 | DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | Cumulative energy consumption (mJ) |
| 230 | DCGM_FI_DEV_XID_ERRORS | Gauge | Last XID error code |
C. Glossary
| Term | Meaning |
| --- | --- |
| DCGM | Data Center GPU Manager, NVIDIA's data center GPU management tool |
| Exporter | The component in the Prometheus ecosystem that exposes metrics |
| Scrape | The act of Prometheus pulling metrics from an exporter |
| PromQL | Prometheus Query Language |
| Recording Rule | A rule that precomputes frequently used queries |
| Alerting Rule | A rule that defines when an alert fires |
| MIG | Multi-Instance GPU, the GPU partitioning feature of A100/H100 |
| ECC | Error Correcting Code for GPU memory |
| XID | NVIDIA GPU error codes |
| SM | Streaming Multiprocessor, the GPU's compute unit |
| TDP | Thermal Design Power |