一、概述
1.1 背景介绍
去年双十一,我们的大模型 API 服务突然挂了,用户疯狂投诉。跑去看监控才发现,其中一张 A100 的显存占用一路飙到 79.9GB 然后直接 OOM,但之前根本没有任何告警。那次事故让我们意识到,光有 CPU、内存、网络监控还不够,GPU 这一层如果是黑盒,就随时可能爆雷。
Prometheus、Grafana 这类通用监控栈本身并不自带 GPU 指标,需要额外的 exporter 才能接入;nvidia-smi 虽然能看实时状态,但做不了历史趋势分析和告警。而且大模型推理的故障模式很特殊,不像 Web 服务那样直接返回 500,往往是推理速度变慢、输出质量下降这类隐性问题。我们花了 3 个月时间,搭建了一套从指标采集、可视化到自动故障诊断的完整方案。
1.2 技术特点
- 多层次监控:覆盖 GPU 硬件层(温度、功耗、显存)、CUDA 运行时层(kernel 执行、流同步)、应用层(推理延迟、吞吐量)三个维度
- 主动故障检测:通过异常检测算法自动识别 GPU 性能劣化、显存泄漏、慢查询等问题,比阈值告警提前 5-10 分钟发现
- 完整故障诊断链路:从告警触发到根因定位,提供自动化的诊断脚本和修复建议
- 低开销采集:使用 DCGM(Data Center GPU Manager)而非 nvidia-smi 轮询,CPU 开销降低 80%
1.3 适用场景
- 场景一:生产环境部署了多个 LLM 服务(如 ChatGLM、LLaMA、Stable Diffusion),需要统一监控 GPU 资源使用情况
- 场景二:GPU 服务器数量较多(10 台以上),人工巡检效率低,需要自动化监控和告警
- 场景三:遇到过 GPU 相关的生产故障(如显存泄漏、推理慢、卡死),需要建立完善的排查工具链
1.4 环境要求
| 组件 | 版本要求 | 说明 |
| --- | --- | --- |
| 操作系统 | Ubuntu 22.04 / CentOS 7.9 | 需要内核支持 cgroups v2 |
| NVIDIA Driver | 525.x / 535.x | 必须支持 DCGM 3.x |
| CUDA Toolkit | 12.1+ | 推理框架依赖 |
| DCGM | 3.2.6+ | NVIDIA 官方监控组件 |
| Prometheus | 2.45+ | 时序数据库 |
| Grafana | 10.x | 可视化面板 |
| GPU 型号 | A100/A800/H100/V100 | 其他型号部分指标可能不支持 |
二、详细步骤
2.1 准备工作
◆ 2.1.1 验证 GPU 和驱动状态
# 检查 GPU 识别情况
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv
# 验证 CUDA 可用性
nvcc --version
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device count: {torch.cuda.device_count()}')"
# 检查 GPU 错误计数(很重要!经常被忽略)
nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
# 查看 GPU 拓扑结构(多卡场景)
nvidia-smi topo -m
我们遇到过一次诡异的问题,推理速度突然慢了 3 倍,最后发现是 GPU 的 ECC 纠错错误数量异常高,说明显存颗粒有问题。这个指标必须监控起来。
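如果想在 DCGM 之外再做一层程序化的兜底巡检,可以直接用 NVML 的 Python 绑定读 ECC 计数。下面是一个示意脚本(假设已安装 nvidia-ml-py 包,字段含义以 NVML 文档为准):
# ecc_check.py —— 程序化读取 ECC 错误计数的示意(假设已 pip install nvidia-ml-py)
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        try:
            # volatile 计数是驱动加载以来的累计值,重启后清零
            corrected = pynvml.nvmlDeviceGetTotalEccErrors(
                handle, pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED, pynvml.NVML_VOLATILE_ECC)
            uncorrected = pynvml.nvmlDeviceGetTotalEccErrors(
                handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
        except pynvml.NVMLError:
            print(f"GPU {i} ({name}): 未开启 ECC 或不支持,跳过")
            continue
        print(f"GPU {i} ({name}): corrected={corrected}, uncorrected={uncorrected}")
        if uncorrected > 0:
            print(f"  [CRITICAL] GPU {i} 存在不可纠正 ECC 错误,建议尽快下线排查")
finally:
    pynvml.nvmlShutdown()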
◆ 2.1.2 安装 DCGM
# 添加 NVIDIA 仓库
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
# 安装 DCGM(推荐使用官方仓库版本)
sudo apt-get install -y datacenter-gpu-manager
# 启动 DCGM 服务
sudo systemctl start nvidia-dcgm
sudo systemctl enable nvidia-dcgm
# 验证 DCGM 运行状态
dcgmi discovery -l
dcgmi health -c
# 检查 DCGM 版本和支持的指标
dcgmi --version
dcgmi dmon -l
注意:如果是容器化环境,建议用 DaemonSet 部署 DCGM,不要在每个容器里都启动,会冲突。
2.2 核心配置
◆ 2.2.1 部署 DCGM Exporter
# 在 Kubernetes 环境部署(推荐方式)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
# 创建自定义 values.yaml
cat > dcgm-values.yaml <<EOF
image:
repository: nvcr.io/nvidia/k8s/dcgm-exporter
tag: 3.2.6-3.2.0-ubuntu22.04
arguments:
- "-f"
- "/etc/dcgm-exporter/counters.csv"
- "-c"
- "1000" # 采集间隔1秒
serviceMonitor:
enabled: true
namespace: monitoring
interval: 15s
honorLabels: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
nodeSelector:
nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
EOF
# 部署
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
-n monitoring \
--create-namespace \
-f dcgm-values.yaml
# 验证 Pod 运行状态
kubectl get pods -n monitoring -l app.kubernetes.io/name=dcgm-exporter
kubectl logs -n monitoring -l app.kubernetes.io/name=dcgm-exporter --tail=50
使用 Helm 快速部署,DCGM Exporter 会在每个 GPU 节点上运行一个 DaemonSet Pod,通过主机网络暴露 metrics 端口(默认 9400)。采集间隔设置为 1 秒可以捕捉到瞬时峰值,但会增加存储开销,生产环境可以改成 5 秒。
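部署完成后,除了看 Pod 日志,也可以用一个小脚本直接核对 metrics 接口里的关键指标是否齐全。下面是一个只依赖标准库的示意(端口 9400 和指标清单按实际部署调整):
# verify_dcgm_metrics.py —— 核对 DCGM Exporter 是否暴露了关键指标(示意)
import sys
import urllib.request

EXPORTER_URL = "http://127.0.0.1:9400/metrics"  # 替换为实际的 Pod IP 或 NodePort
REQUIRED = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_XID_ERRORS",
]

body = urllib.request.urlopen(EXPORTER_URL, timeout=5).read().decode()
missing = [name for name in REQUIRED if name not in body]
if missing:
    print(f"缺少指标: {missing},请检查 counters.csv 配置")
    sys.exit(1)
print("关键指标均已正常暴露")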
◆ 2.2.2 配置 Prometheus 采集规则
# prometheus-gpu-scrape-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-gpu-config
namespace: monitoring
data:
gpu-scrape.yaml: |
scrape_configs:
- job_name: 'dcgm-exporter'
scrape_interval: 15s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- monitoring
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
action: keep
regex: dcgm-exporter
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: node
- source_labels: [__meta_kubernetes_pod_ip]
action: replace
target_label: __address__
replacement: '${1}:9400'
# 添加业务标签
- source_labels: [__meta_kubernetes_pod_label_model_name]
action: replace
target_label: model
metric_relabel_configs:
# 只保留关键指标,减少存储
- source_labels: [__name__]
regex: 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|FB_FREE|GPU_TEMP|POWER_USAGE|SM_CLOCK|MEM_CLOCK|PCIE_TX_THROUGHPUT|PCIE_RX_THROUGHPUT|XID_ERRORS|ECC_DBE_VOL_TOTAL|NVLINK_BANDWIDTH_TOTAL)'
action: keep
将上述配置合并到 Prometheus 的配置中:
# 如果使用 Prometheus Operator
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/name: dcgm-exporter
endpoints:
- port: metrics
interval: 15s
path: /metrics
EOF
◆ 2.2.3 自定义应用层指标采集
# llm_metrics_collector.py
# 嵌入到 LLM 推理服务中,采集业务指标
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import torch
import psutil
import threading
# 定义指标
inference_requests_total = Counter(
'llm_inference_requests_total',
'Total number of inference requests',
['model', 'status']
)
inference_duration_seconds = Histogram(
'llm_inference_duration_seconds',
'Inference duration in seconds',
['model', 'batch_size'],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)
tokens_generated_total = Counter(
'llm_tokens_generated_total',
'Total number of tokens generated',
['model']
)
gpu_memory_allocated_bytes = Gauge(
'llm_gpu_memory_allocated_bytes',
'Current GPU memory allocated by PyTorch',
['model', 'device']
)
active_requests = Gauge(
'llm_active_requests',
'Number of requests currently being processed',
['model']
)
class MetricsCollector:
def __init__(self, model_name, device_id=0):
self.model_name = model_name
self.device_id = device_id
# 启动后台线程定期更新 GPU 内存指标
self.running = True
self.collector_thread = threading.Thread(target=self._collect_gpu_metrics)
self.collector_thread.daemon = True
self.collector_thread.start()
def _collect_gpu_metrics(self):
"""后台线程,每5秒更新一次 GPU 内存指标"""
while self.running:
try:
if torch.cuda.is_available():
allocated = torch.cuda.memory_allocated(self.device_id)
gpu_memory_allocated_bytes.labels(
model=self.model_name,
device=f'cuda:{self.device_id}'
).set(allocated)
except Exception as e:
print(f"Error collecting GPU metrics: {e}")
time.sleep(5)
def record_inference(self, batch_size, duration, num_tokens, success=True):
"""记录一次推理请求"""
status = 'success' if success else 'error'
inference_requests_total.labels(model=self.model_name, status=status).inc()
if success:
inference_duration_seconds.labels(
model=self.model_name,
batch_size=str(batch_size)
).observe(duration)
tokens_generated_total.labels(model=self.model_name).inc(num_tokens)
def set_active_requests(self, count):
"""更新当前活跃请求数"""
active_requests.labels(model=self.model_name).set(count)
# 使用示例
if __name__ == '__main__':
# 启动 metrics 服务器
start_http_server(9100)
# 创建 collector
collector = MetricsCollector(model_name='chatglm3-6b', device_id=0)
# 模拟推理
while True:
start_time = time.time()
# ... 实际推理代码 ...
duration = time.time() - start_time
collector.record_inference(batch_size=4, duration=duration, num_tokens=128)
time.sleep(1)
将此代码集成到推理服务的主循环中,并在 Prometheus 中添加采集配置:
- job_name: 'llm-inference'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- llm-prod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: '.*llm.*'
action: keep
- source_labels: [__meta_kubernetes_pod_ip]
action: replace
target_label: __address__
replacement: '${1}:9100'
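把 MetricsCollector 接到推理入口的一种示意写法如下(run_model 是假设的推理函数,请替换成自己服务的实际调用;活跃请求计数这里用简单的全局变量,多线程场景需要加锁):
# 在推理入口外层接入 MetricsCollector 的示意(run_model 为假设的推理函数)
import time

collector = MetricsCollector(model_name='chatglm3-6b', device_id=0)
_active = 0

def handle_request(prompts):
    global _active
    _active += 1
    collector.set_active_requests(_active)
    start = time.time()
    try:
        # 假设 run_model 返回 (生成结果, 生成的 token 数)
        outputs, num_tokens = run_model(prompts)
        collector.record_inference(
            batch_size=len(prompts),
            duration=time.time() - start,
            num_tokens=num_tokens,
            success=True,
        )
        return outputs
    except Exception:
        collector.record_inference(
            batch_size=len(prompts),
            duration=time.time() - start,
            num_tokens=0,
            success=False,
        )
        raise
    finally:
        _active -= 1
        collector.set_active_requests(_active)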
2.3 可视化和告警
◆ 2.3.1 导入 Grafana Dashboard
# 下载 NVIDIA 官方 GPU Dashboard
wget https://grafana.com/api/dashboards/12239/revisions/2/download -O nvidia-dcgm-dashboard.json
# 导入到 Grafana
# 方法1: 通过 UI 导入(Dashboards -> Import -> Upload JSON)
# 方法2: 通过 API 导入
GRAFANA_URL="http://grafana.monitoring.svc:3000"
GRAFANA_TOKEN="your-api-token"
curl -X POST "${GRAFANA_URL}/api/dashboards/db" \
-H "Authorization: Bearer ${GRAFANA_TOKEN}" \
-H "Content-Type: application/json" \
-d @nvidia-dcgm-dashboard.json
官方 Dashboard 比较简单,我们自己做了增强版,增加了以下面板:
- GPU 利用率趋势(7 天)
- 显存使用率热力图
- 推理延迟 P50/P95/P99
- Token 生成速度
- GPU 错误计数器
- PCIe 带宽利用率
◆ 2.3.2 配置告警规则
# prometheus-gpu-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: gpu-alerts
namespace: monitoring
spec:
groups:
- name: gpu-hardware
interval: 30s
rules:
- alert: GPUHighTemperature
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 3m
labels:
severity: warning
component: gpu
annotations:
summary: "GPU温度过高"
description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 温度达到 {{ $value }}°C,超过安全阈值"
- alert: GPUMemoryAlmostFull
expr: |
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
for: 2m
labels:
severity: critical
component: gpu
annotations:
summary: "GPU显存即将耗尽"
description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 显存使用率 {{ $value | humanizePercentage }},可能即将OOM"
- alert: GPUXidError
expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
labels:
severity: critical
component: gpu
annotations:
summary: "GPU发生XID错误"
description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 发生硬件错误,需要立即检查"
- alert: GPUECCErrors
expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 5
for: 5m
labels:
severity: warning
component: gpu
annotations:
        summary: "GPU ECC双比特错误过多"
        description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 10分钟内发生 {{ $value }} 次不可纠正(双比特)ECC错误"
- alert: GPULowUtilization
expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
for: 1h
labels:
severity: info
component: gpu
annotations:
summary: "GPU利用率长期过低"
description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 过去1小时平均利用率仅 {{ $value }}%,资源浪费"
- name: llm-inference
interval: 30s
rules:
- alert: LLMHighInferenceLatency
expr: |
histogram_quantile(0.95,
sum(rate(llm_inference_duration_seconds_bucket[5m])) by (le, model)
) > 10
for: 5m
labels:
severity: warning
component: llm
annotations:
summary: "LLM推理延迟过高"
description: "模型 {{ $labels.model }} P95延迟达到 {{ $value }}秒"
- alert: LLMErrorRateHigh
expr: |
sum(rate(llm_inference_requests_total{status="error"}[5m])) by (model)
/ sum(rate(llm_inference_requests_total[5m])) by (model) > 0.05
for: 3m
labels:
severity: critical
component: llm
annotations:
summary: "LLM错误率过高"
description: "模型 {{ $labels.model }} 错误率 {{ $value | humanizePercentage }}"
- alert: LLMGPUMemoryLeak
expr: |
deriv(llm_gpu_memory_allocated_bytes[30m]) > 1048576 # 每秒增长1MB
for: 15m
labels:
severity: warning
component: llm
annotations:
summary: "检测到可能的显存泄漏"
description: "模型 {{ $labels.model }} 显存使用量持续增长"
部署告警规则:
kubectl apply -f prometheus-gpu-alerts.yaml
◆ 2.3.3 配置告警通知
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-config
namespace: monitoring
stringData:
alertmanager.yaml: |
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alerts@example.com'
smtp_auth_username: 'alerts@example.com'
smtp_auth_password: 'password'
route:
receiver: 'default'
group_by: ['alertname', 'cluster', 'node']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
component: gpu
receiver: 'gpu-critical'
continue: true
- match:
severity: warning
component: gpu
receiver: 'gpu-warning'
- match:
component: llm
receiver: 'llm-team'
receivers:
- name: 'default'
webhook_configs:
- url: 'http://webhook-receiver:8080/alerts'
- name: 'gpu-critical'
email_configs:
- to: 'oncall@example.com'
headers:
Subject: '[CRITICAL] GPU故障告警'
webhook_configs:
- url: 'http://oncall-system:8080/alert'
send_resolved: true
- name: 'gpu-warning'
email_configs:
- to: 'ops@example.com'
headers:
Subject: '[WARNING] GPU告警'
- name: 'llm-team'
webhook_configs:
- url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
send_resolved: true
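上面 route 里引用的 webhook-receiver 需要自己实现,下面是一个只用标准库的最小示意(仅把 Alertmanager 推送的告警打印出来,生产环境应替换为工单、IM 通知等真实逻辑):
# webhook_receiver.py —— 接收 Alertmanager webhook 推送的最小示意(标准库实现)
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length) or b'{}')
        # Alertmanager webhook 的 payload 结构:{"status": ..., "alerts": [...]}
        for alert in payload.get('alerts', []):
            labels = alert.get('labels', {})
            print(f"[{alert.get('status')}] {labels.get('alertname')} "
                  f"node={labels.get('node')} gpu={labels.get('gpu')}")
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    # 与 Alertmanager 配置中的 http://webhook-receiver:8080/alerts 对应
    HTTPServer(('0.0.0.0', 8080), AlertHandler).serve_forever()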
三、示例代码和配置
3.1 完整配置示例
◆ 3.1.1 GPU 健康检查脚本
#!/bin/bash
# 文件路径: /opt/scripts/gpu-health-check.sh
# 功能:全面检查 GPU 健康状态,供监控调用
set -e
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
OUTPUT_FILE="/var/log/gpu-health-check.log"
echo "=== GPU Health Check - ${TIMESTAMP} ===" >> ${OUTPUT_FILE}
# 检查驱动版本
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
echo "Driver Version: ${DRIVER_VERSION}" >> ${OUTPUT_FILE}
# 检查每个 GPU 的状态
GPU_COUNT=$(nvidia-smi --query-gpu=count --format=csv,noheader | head -1)
echo "Total GPUs: ${GPU_COUNT}" >> ${OUTPUT_FILE}
for gpu_id in $(seq 0 $((GPU_COUNT-1))); do
echo "--- GPU ${gpu_id} ---" >> ${OUTPUT_FILE}
# 温度
TEMP=$(nvidia-smi --id=${gpu_id} --query-gpu=temperature.gpu --format=csv,noheader)
echo " Temperature: ${TEMP}°C" >> ${OUTPUT_FILE}
if [ ${TEMP} -gt 85 ]; then
echo " [WARNING] Temperature too high!" >> ${OUTPUT_FILE}
fi
# 显存
MEM_USED=$(nvidia-smi --id=${gpu_id} --query-gpu=memory.used --format=csv,noheader,nounits)
MEM_TOTAL=$(nvidia-smi --id=${gpu_id} --query-gpu=memory.total --format=csv,noheader,nounits)
MEM_PERCENT=$((MEM_USED * 100 / MEM_TOTAL))
echo " Memory: ${MEM_USED}MB / ${MEM_TOTAL}MB (${MEM_PERCENT}%)" >> ${OUTPUT_FILE}
# 功耗
POWER=$(nvidia-smi --id=${gpu_id} --query-gpu=power.draw --format=csv,noheader)
echo " Power: ${POWER}" >> ${OUTPUT_FILE}
# 进程列表
PROCESSES=$(nvidia-smi --id=${gpu_id} --query-compute-apps=pid,process_name,used_memory --format=csv,noheader)
if [ -n "${PROCESSES}" ]; then
echo " Running Processes:" >> ${OUTPUT_FILE}
echo "${PROCESSES}" | while read line; do
echo " ${line}" >> ${OUTPUT_FILE}
done
else
echo " No running processes" >> ${OUTPUT_FILE}
fi
# ECC 错误
if nvidia-smi --id=${gpu_id} --query-gpu=ecc.mode.current --format=csv,noheader | grep -q "Enabled"; then
ECC_ERRORS=$(nvidia-smi --id=${gpu_id} --query-gpu=ecc.errors.uncorrected.volatile.total --format=csv,noheader)
echo " ECC Errors: ${ECC_ERRORS}" >> ${OUTPUT_FILE}
if [ "${ECC_ERRORS}" != "0" ]; then
echo " [CRITICAL] Uncorrected ECC errors detected!" >> ${OUTPUT_FILE}
fi
fi
# GPU 利用率
GPU_UTIL=$(nvidia-smi --id=${gpu_id} --query-gpu=utilization.gpu --format=csv,noheader,nounits)
echo " Utilization: ${GPU_UTIL}%" >> ${OUTPUT_FILE}
# 时钟频率
SM_CLOCK=$(nvidia-smi --id=${gpu_id} --query-gpu=clocks.sm --format=csv,noheader)
MEM_CLOCK=$(nvidia-smi --id=${gpu_id} --query-gpu=clocks.mem --format=csv,noheader)
echo " Clocks: SM ${SM_CLOCK}, MEM ${MEM_CLOCK}" >> ${OUTPUT_FILE}
echo "" >> ${OUTPUT_FILE}
done
# 检查 DCGM 服务状态
if systemctl is-active --quiet nvidia-dcgm; then
echo "DCGM Status: Running" >> ${OUTPUT_FILE}
else
echo "[ERROR] DCGM Status: Not Running" >> ${OUTPUT_FILE}
fi
# 输出到 stdout 供监控采集
tail -100 ${OUTPUT_FILE}
# 返回状态码
exit 0
设置定时执行:
chmod +x /opt/scripts/gpu-health-check.sh
# 注意:`crontab -` 会整体覆盖当前用户已有的定时任务,这里先导出再追加
(crontab -l 2>/dev/null; echo "*/5 * * * * /opt/scripts/gpu-health-check.sh") | crontab -
◆ 3.1.2 显存泄漏检测脚本
#!/usr/bin/env python3
# 文件名: detect_memory_leak.py
# 功能:检测 GPU 显存泄漏
import torch
import time
import argparse
import sys
from collections import deque
class GPUMemoryLeakDetector:
def __init__(self, device_id=0, window_size=20, threshold_mb=100):
self.device_id = device_id
self.device = f'cuda:{device_id}'
self.window_size = window_size
self.threshold_mb = threshold_mb * 1024 * 1024 # 转换为 bytes
self.memory_history = deque(maxlen=window_size)
def get_current_memory(self):
"""获取当前 GPU 显存使用量(bytes)"""
return torch.cuda.memory_allocated(self.device)
def check_leak(self):
"""检测是否存在显存泄漏"""
current_mem = self.get_current_memory()
self.memory_history.append(current_mem)
if len(self.memory_history) < self.window_size:
return False, 0, current_mem
# 计算线性增长趋势
# 使用最小二乘法拟合斜率
n = len(self.memory_history)
x = list(range(n))
y = list(self.memory_history)
x_mean = sum(x) / n
y_mean = sum(y) / n
numerator = sum((x[i] - x_mean) * (y[i] - y_mean) for i in range(n))
denominator = sum((x[i] - x_mean) ** 2 for i in range(n))
if denominator == 0:
slope = 0
else:
slope = numerator / denominator
# 判断是否泄漏:斜率为正且超过阈值
is_leak = slope > (self.threshold_mb / self.window_size)
return is_leak, slope, current_mem
def format_bytes(self, bytes_value):
"""格式化字节数"""
for unit in ['B', 'KB', 'MB', 'GB']:
if bytes_value < 1024:
return f"{bytes_value:.2f}{unit}"
bytes_value /= 1024
return f"{bytes_value:.2f} TB"
def main():
parser = argparse.ArgumentParser(description='GPU Memory Leak Detector')
parser.add_argument('--device', type=int, default=0, help='GPU device ID')
parser.add_argument('--window', type=int, default=20, help='Moving window size')
parser.add_argument('--threshold', type=int, default=100, help='Leak threshold in MB')
parser.add_argument('--interval', type=int, default=5, help='Check interval in seconds')
args = parser.parse_args()
if not torch.cuda.is_available():
print("ERROR: CUDA not available")
sys.exit(1)
detector = GPUMemoryLeakDetector(
device_id=args.device,
window_size=args.window,
threshold_mb=args.threshold
)
print(f"Monitoring GPU {args.device} for memory leaks...")
print(f"Window size: {args.window}, Threshold: {args.threshold}MB, Interval: {args.interval}s")
print("-" * 80)
try:
while True:
is_leak, slope, current_mem = detector.check_leak()
slope_mb_per_check = (slope / (1024 * 1024))
current_mem_str = detector.format_bytes(current_mem)
timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
if is_leak:
print(f"[{timestamp}] WARNING: Memory leak detected!")
print(f" Current memory: {current_mem_str}")
print(f" Growth rate: {slope_mb_per_check:.2f} MB/check")
                print(f"  Estimated growth: {slope_mb_per_check / args.interval:.2f} MB/s")
sys.stdout.flush()
else:
print(f"[{timestamp}] Memory: {current_mem_str}, Growth: {slope_mb_per_check:.2f} MB/check", end='\r')
time.sleep(args.interval)
except KeyboardInterrupt:
print("\nMonitoring stopped.")
if __name__ == '__main__':
main()
使用方法:
# 后台运行检测脚本
python3 detect_memory_leak.py --device 0 --threshold 50 --interval 10 > /var/log/gpu-leak-detect.log 2>&1 &
3.2 实际应用案例
◆ 案例一:定位推理服务显存泄漏
场景描述:生产环境的 ChatGLM 服务运行 2-3 小时后显存使用从 18GB 涨到 75GB,最终 OOM 重启
实现步骤:
- 启用详细的显存跟踪
# 在推理服务启动时启用
import torch
torch.cuda.memory._record_memory_history(max_entries=100000)
- 定期 dump 显存快照
import torch
import pickle
from datetime import datetime
def dump_memory_snapshot():
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
snapshot_file = f'/tmp/cuda_memory_{timestamp}.pickle'
try:
snapshot = torch.cuda.memory._snapshot()
with open(snapshot_file, 'wb') as f:
pickle.dump(snapshot, f)
print(f"Memory snapshot saved: {snapshot_file}")
return snapshot_file
except Exception as e:
print(f"Failed to dump snapshot: {e}")
return None
# 每30分钟自动 dump 一次
import threading
def periodic_snapshot():
while True:
time.sleep(1800) # 30分钟
dump_memory_snapshot()
snapshot_thread = threading.Thread(target=periodic_snapshot, daemon=True)
snapshot_thread.start()
- 分析快照找出泄漏点
# 使用 PyTorch 的内存分析工具
python3 -c "
import pickle
import torch
# 加载快照
with open('/tmp/cuda_memory_20250115_143000.pickle', 'rb') as f:
snapshot = pickle.load(f)
# 按分配大小排序
allocations = []
# 新版 PyTorch 的快照是 dict,segment 列表在 'segments' 键下;旧版直接返回 list
segments = snapshot['segments'] if isinstance(snapshot, dict) else snapshot
for seg in segments:
if 'blocks' in seg:
for block in seg['blocks']:
if block['state'] == 'active_allocated':
allocations.append({
'size': block['size'],
'frames': block.get('frames', [])
})
# 输出最大的10个分配
allocations.sort(key=lambda x: x['size'], reverse=True)
for i, alloc in enumerate(allocations[:10]):
print(f'{i+1}. Size: {alloc[\"size\"] / 1024 / 1024:.2f} MB')
if alloc['frames']:
print(f' Allocated at: {alloc[\"frames\"][0]}')
"
运行结果:发现有个 KV Cache 没有正确释放,每次推理后留下 512MB 的 tensor。修改代码后显存稳定在 20GB。
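修复的思路大致如下(示意代码,假设服务在请求间自行持有了 outputs / past_key_values 的引用;具体写法要结合自己的推理框架,并非当时的原始代码):
# 推理结束后显式释放 KV Cache 引用的示意(decode 为假设的后处理函数)
import gc
import torch

def run_inference(model, inputs):
    with torch.no_grad():
        outputs = model(**inputs, use_cache=True)
        text = decode(outputs)
    # 关键:不要把 outputs / past_key_values 挂到长生命周期对象(请求上下文、全局 dict)上
    del outputs                  # 释放对 KV Cache tensor 的引用
    gc.collect()
    torch.cuda.empty_cache()     # 把空闲缓存块还给驱动,便于观察真实占用
    return text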
◆ 案例二:GPU 性能突然下降排查
场景描述:推理速度从 120 tokens/s 突然降到 30 tokens/s,但 GPU 利用率看起来正常
实现代码:
#!/bin/bash
# gpu-performance-diagnosis.sh
echo "=== GPU Performance Diagnosis ==="
GPU_ID=0
# 1. 检查 GPU 功耗模式
echo -e "\n1. Power Mode:"
nvidia-smi -i ${GPU_ID} --query-gpu=persistence_mode,power.state --format=csv
# 2. 检查频率限制
echo -e "\n2. Clock Throttle Reasons:"
nvidia-smi -i ${GPU_ID} --query-gpu=clocks_throttle_reasons.active --format=csv
nvidia-smi -i ${GPU_ID} --query-gpu=clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.sw_thermal_slowdown --format=csv
# 3. 检查实际运行频率 vs 最大频率
echo -e "\n3. Clock Frequencies:"
CURRENT_SM=$(nvidia-smi -i ${GPU_ID} --query-gpu=clocks.sm --format=csv,noheader,nounits)
MAX_SM=$(nvidia-smi -i ${GPU_ID} --query-gpu=clocks.max.sm --format=csv,noheader,nounits)
echo "SM Clock: ${CURRENT_SM} MHz (Max: ${MAX_SM} MHz)"
CURRENT_MEM=$(nvidia-smi -i ${GPU_ID} --query-gpu=clocks.mem --format=csv,noheader,nounits)
MAX_MEM=$(nvidia-smi -i ${GPU_ID} --query-gpu=clocks.max.mem --format=csv,noheader,nounits)
echo "Memory Clock: ${CURRENT_MEM} MHz (Max: ${MAX_MEM} MHz)"
# 4. 检查温度
echo -e "\n4. Temperature:"
TEMP=$(nvidia-smi -i ${GPU_ID} --query-gpu=temperature.gpu --format=csv,noheader)
echo "GPU Temperature: ${TEMP}°C"
if [ ${TEMP} -gt 80 ]; then
echo "[WARNING] Temperature is high, may cause throttling"
fi
# 5. 检查 PCIe 速度
echo -e "\n5. PCIe Status:"
nvidia-smi -i ${GPU_ID} --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
# 6. 检查是否有其他进程占用
echo -e "\n6. Running Processes:"
nvidia-smi -i ${GPU_ID} --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# 7. 运行简单的 benchmark
echo -e "\n7. Quick Benchmark:"
python3 - <<EOF
import torch
import time
device = torch.device('cuda:${GPU_ID}')
# 矩阵乘法 benchmark
size = 4096
A = torch.randn(size, size, device=device)
B = torch.randn(size, size, device=device)
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
C = torch.matmul(A, B)
torch.cuda.synchronize()
duration = time.time() - start
tflops = (2 * size ** 3 * 10) / (duration * 1e12)
print(f"TFLOPS: {tflops:.2f}")
print(f"Expected for A100: ~19.5 TFLOPS (FP32)")
if tflops < 15:
print("[WARNING] Performance below expected!")
EOF
echo -e "\n=== Diagnosis Complete ==="
问题发现:clocks_throttle_reasons.hw_slowdown 显示为 Active,原因是机房空调故障导致 GPU 温度达到 90 度触发了硬件降频。
四、最佳实践和注意事项
4.1 最佳实践
◆ 4.1.1 性能优化
- 合理设置采集间隔:DCGM 默认 1 秒采集一次,对于稳定的推理服务可以改成 10 秒
# 修改 DCGM 配置
sudo vi /etc/nvidia-datacenter-gpu-manager/config.yml
# 修改采集间隔
global_settings:
update_interval: 10000 # 毫秒
- 只采集必要的指标:DCGM 支持 150+ 指标,但大部分用不上,自定义 counters.csv 只保留关键指标
# /etc/dcgm-exporter/counters.csv
DCGM_FI_DEV_GPU_UTIL, gauge, GPU利用率
DCGM_FI_DEV_FB_USED, gauge, 显存使用量
DCGM_FI_DEV_FB_FREE, gauge, 显存空闲量
DCGM_FI_DEV_GPU_TEMP, gauge, GPU温度
DCGM_FI_DEV_POWER_USAGE, gauge, 功耗
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, PCIe重传计数
DCGM_FI_DEV_XID_ERRORS, gauge, XID错误
- 使用 Prometheus 的降采样存储:GPU 指标数据量大,配置多级存储策略
# prometheus.yml
global:
scrape_interval: 15s
storage:
tsdb:
retention.time: 15d # 原始数据保留15天
# 使用 Thanos 或 VictoriaMetrics 做长期存储
# 1小时粒度保留90天
# 1天粒度保留1年
◆ 4.1.2 安全加固
- 限制 metrics 接口访问:DCGM Exporter 的 metrics 端口不应该暴露到公网
# NetworkPolicy 限制只有 Prometheus 可以访问
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: dcgm-exporter-netpol
namespace: monitoring
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: dcgm-exporter
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: prometheus
ports:
- protocol: TCP
port: 9400
- 敏感信息脱敏:GPU 序列号等信息不要暴露在 metrics 里
# Prometheus relabel 配置
metric_relabel_configs:
- source_labels: [gpu_uuid]
target_label: gpu_id
regex: 'GPU-([0-9a-f]{8})-.*'
replacement: '${1}' # 只保留前8位
- RBAC 权限控制:限制谁可以查看 GPU 监控数据
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: gpu-metrics-reader
namespace: monitoring
rules:
- apiGroups: [""]
resources: ["services", "endpoints", "pods"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["services/proxy"]
resourceNames: ["dcgm-exporter"]
verbs: ["get"]
◆ 4.1.3 高可用配置
- DCGM Exporter 高可用部署:虽然是 DaemonSet,但要确保关键节点有冗余
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: critical-llm-service
topologyKey: kubernetes.io/hostname
- Prometheus 联邦集群:多个 Prometheus 实例互为备份
# prometheus-1 配置
scrape_configs:
- job_name: 'federate'
scrape_interval: 30s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="dcgm-exporter"}'
static_configs:
- targets:
- 'prometheus-2:9090'
# alertmanager 配置
route:
group_by: ['alertname', 'cluster', 'node', 'gpu']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
4.2 注意事项
◆ 4.2.1 配置注意事项
⚠️ 警告:不要在生产环境直接用 nvidia-smi 轮询采集指标,会导致显著的性能下降(我们测试过,1 秒轮询一次会让推理延迟增加 5-10%)
- ❗ DCGM 与 nvidia-smi 冲突:DCGM 运行时不要频繁调用 nvidia-smi,可能导致数据不一致。如果必须用,加上 --id= 参数指定 GPU
- ❗ MIG 模式下的监控:A100 开启 MIG 后,每个 MIG 实例需要单独监控,DCGM Exporter 要配置 --mig-monitoring=true
- ❗ 容器中的 GPU 监控:容器内看到的 GPU 编号可能和宿主机不同,要通过 UUID 来匹配(见下方示意代码)
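容器内外编号对不上时,可以按 UUID 建立映射,再去对应 DCGM 指标里的 GPU。下面是一个示意(假设容器内可以访问 NVML,且已安装 nvidia-ml-py):
# 按 UUID 对齐容器内 GPU 编号与宿主机/DCGM 视角的示意
import pynvml

pynvml.nvmlInit()
uuid_to_local_index = {}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    uuid = pynvml.nvmlDeviceGetUUID(handle)
    if isinstance(uuid, bytes):      # 旧版 pynvml 返回 bytes
        uuid = uuid.decode()
    uuid_to_local_index[uuid] = i
pynvml.nvmlShutdown()

# DCGM Exporter 的指标一般带有 UUID label(形如 "GPU-xxxxxxxx-..."),
# 告警或查询时用这个映射即可把宿主机视角的 GPU 对应回容器内编号
print(uuid_to_local_index)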
◆ 4.2.2 常见错误
| 错误现象 | 原因分析 | 解决方案 |
| --- | --- | --- |
| DCGM Exporter 一直 CrashLoopBackOff | 没有挂载 /var/lib/kubelet/pod-resources 或 DCGM 服务未运行 | 检查 DaemonSet 的 volumeMounts 配置,确保 DCGM 服务启动 |
| Prometheus 抓取超时 | GPU 数量多导致 metrics 接口响应慢 | 增大 scrape_timeout 到 30 秒,或减少采集的指标数量 |
| 显存使用率监控不准 | 只统计了 PyTorch 分配的显存,没算 CUDA Context 开销 | 用 DCGM_FI_DEV_FB_USED 而不是 torch.cuda.memory_allocated |
| 告警风暴 | 一个 GPU 故障触发多个相关告警 | 配置 Alertmanager 的 inhibit_rules 抑制次级告警 |
| 历史数据查询慢 | Prometheus 单实例存储了 TB 级数据 | 使用 Thanos/VictoriaMetrics 做分布式存储 |
◆ 4.2.3 兼容性问题
- 版本兼容:DCGM 3.x 需要 Driver 525+,老版本驱动只能用 DCGM 2.x,但 2.x 不支持部分新指标
- 平台兼容:某些 GPU 指标只在特定型号上支持
- NVLink 带宽:只有 A100/A800/H100 支持
- MIG 监控:只有 A100/A800/H100 支持
- PCIe Gen5:只有 H100 支持
- 组件依赖:
- Grafana 10.x 才支持新的 Trace to Metrics 功能,老版本看不到某些面板
- Prometheus 2.40 以下版本不支持 Native Histogram,部分高级查询会失败
五、故障排查和监控
5.1 故障排查
◆ 5.1.1 日志查看
# 查看 DCGM 服务日志
sudo journalctl -u nvidia-dcgm -f --since "10 minutes ago"
# 查看 DCGM Exporter 日志
kubectl logs -n monitoring -l app.kubernetes.io/name=dcgm-exporter --tail=100 -f
# 查看 Prometheus 采集状态
kubectl exec -n monitoring prometheus-k8s-0 -- wget -qO- localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="dcgm-exporter")'
# 查看 GPU 驱动日志(内核模块)
sudo dmesg | grep -i nvidia | tail -50
◆ 5.1.2 常见问题排查
问题一:Prometheus 无法采集到 GPU 指标
# 1. 检查 DCGM Exporter 是否正常运行
kubectl get pods -n monitoring -l app.kubernetes.io/name=dcgm-exporter
# 2. 手动访问 metrics 接口
EXPORTER_IP=$(kubectl get pod -n monitoring -l app.kubernetes.io/name=dcgm-exporter -o jsonpath='{.items[0].status.podIP}')
curl http://${EXPORTER_IP}:9400/metrics | grep DCGM
# 3. 检查 ServiceMonitor 配置
kubectl get servicemonitor -n monitoring dcgm-exporter -o yaml
# 4. 查看 Prometheus targets 状态
kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090
# 浏览器访问 http://localhost:9090/targets
解决方案:
- 如果 metrics 接口能访问但 Prometheus 采集不到,检查 ServiceMonitor 的 label selector 是否匹配
- 如果 metrics 接口返回空,重启 DCGM 服务:
sudo systemctl restart nvidia-dcgm
- 如果是权限问题,确保 DCGM Exporter 以 privileged 模式运行
问题二:GPU 温度告警但实际温度正常
# 直接查看 nvidia-smi
nvidia-smi --query-gpu=temperature.gpu --format=csv
# 对比 DCGM 数据
dcgmi dmon -e 150 -c 1 # 150 是 GPU 温度的 field ID
# 检查 DCGM 缓存是否有问题
sudo systemctl restart nvidia-dcgm
sleep 5
dcgmi dmon -e 150 -c 1
解决方案:DCGM 有时会缓存旧数据,重启服务后会恢复。如果持续出现,考虑降级到稳定版本。
问题三:推理服务运行正常但 GPU 利用率显示为 0
- 症状:推理请求正常返回,但 DCGM_FI_DEV_GPU_UTIL 一直是 0
- 排查:
# 检查是否是采集延迟
watch -n 1 nvidia-smi
# 检查 CUDA Context
nvidia-smi -q -d COMPUTE
# 检查推理进程是否真的在用 GPU
sudo fuser -v /dev/nvidia0
- 解决:某些轻量级推理(如小 batch size)GPU 占用时间很短,采集间隔内看不到。可以:
- 减小 DCGM 采集间隔到 100ms
- 增大推理的 batch size
- 使用 DCGM_FI_PROF_GR_ENGINE_ACTIVE 指标替代利用率
◆ 5.1.3 调试模式
# 开启 DCGM 调试日志
sudo vi /etc/nvidia-datacenter-gpu-manager/config.yml
# 修改日志级别
logging:
level: DEBUG
file: /var/log/nv-hostengine.log
# 重启服务
sudo systemctl restart nvidia-dcgm
# 实时查看调试日志
sudo tail -f /var/log/nv-hostengine.log
# 使用 dcgmi 诊断工具
dcgmi diag -r 3 # 运行 Level 3 诊断(包含压力测试)
5.2 性能监控
◆ 5.2.1 关键指标监控
# 使用 PromQL 查询关键指标
# 1. GPU 利用率(按节点聚合)
avg(DCGM_FI_DEV_GPU_UTIL) by (node)
# 2. 显存使用率
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
# 3. GPU 温度异常检测
DCGM_FI_DEV_GPU_TEMP > 80
# 4. 推理延迟 P99
histogram_quantile(0.99, sum(rate(llm_inference_duration_seconds_bucket[5m])) by (le, model))
# 5. Token 生成速度
rate(llm_tokens_generated_total[1m])
# 6. PCIe 带宽利用率
(DCGM_FI_DEV_PCIE_TX_THROUGHPUT + DCGM_FI_DEV_PCIE_RX_THROUGHPUT) / (16 * 1000) # 假设 PCIe 4.0 x16
# 7. GPU 错误率
rate(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
◆ 5.2.2 监控指标说明
| 指标名称 | 正常范围 | 告警阈值 | 说明 |
| --- | --- | --- | --- |
| GPU 利用率 | 40%-90% | <10% 持续 1 小时或 >95% 持续 10 分钟 | 太低浪费资源,太高可能瓶颈 |
| 显存使用率 | 50%-85% | >95% | 接近满载容易 OOM |
| GPU 温度 | 50-75°C | >85°C | 高温会触发降频 |
| 功耗 | 200-350W(A100) | >400W | 异常高功耗可能是硬件故障 |
| SM 时钟频率 | 1410MHz(A100 Boost) | <1200MHz | 低于标准频率说明被限频 |
| PCIe 重传计数 | 0 | >100/分钟 | PCIe 链路质量问题 |
| ECC 错误 | 0 | >0(uncorrected) | 硬件故障信号 |
| 推理延迟 P95 | <3 秒 | >10 秒 | 用户体验临界点 |
| 显存增长率 | 0 MB/s | >10MB/分钟持续增长 | 可能有内存泄漏 |
◆ 5.2.3 自定义告警规则示例
# 高级告警规则
groups:
- name: gpu-advanced-alerts
interval: 30s
rules:
# 预测性告警:根据趋势预测1小时后显存将满
- alert: GPUMemoryWillBeFull
expr: |
predict_linear(DCGM_FI_DEV_FB_USED[30m], 3600)
> (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
for: 10m
labels:
severity: warning
annotations:
summary: "GPU显存预计1小时后耗尽"
description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 按当前趋势1小时后显存将满"
# 异常检测:利用率突然下降50%
- alert: GPUUtilizationDrop
expr: |
(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
- avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m] offset 10m)) < -50
for: 2m
labels:
severity: warning
annotations:
summary: "GPU利用率突然大幅下降"
description: "节点 {{ $labels.node }} GPU {{ $labels.gpu }} 利用率从10分钟前的水平下降超过50%"
# 性能劣化:相同 workload 延迟增加2倍
- alert: InferencePerformanceDegraded
expr: |
histogram_quantile(0.95, sum(rate(llm_inference_duration_seconds_bucket[5m])) by (le, model))
/ histogram_quantile(0.95, sum(rate(llm_inference_duration_seconds_bucket[5m] offset 1h)) by (le, model))
> 2
for: 10m
labels:
severity: warning
annotations:
summary: "推理性能显著下降"
description: "模型 {{ $labels.model }} P95延迟是1小时前的 {{ $value | humanize }}倍"
# 多 GPU 不平衡
- alert: GPULoadImbalance
expr: |
(max(DCGM_FI_DEV_GPU_UTIL) by (node) - min(DCGM_FI_DEV_GPU_UTIL) by (node)) > 40
for: 15m
labels:
severity: info
annotations:
summary: "节点GPU负载不均衡"
description: "节点 {{ $labels.node }} 上GPU利用率差异超过40%,负载分配可能不均"
5.3 备份与恢复
◆ 5.3.1 监控配置备份
#!/bin/bash
# backup-monitoring-stack.sh
BACKUP_DIR="/backup/monitoring"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p ${BACKUP_DIR}/${DATE}
# 备份 Prometheus 配置
kubectl get cm -n monitoring prometheus-k8s-rulefiles-0 -o yaml > ${BACKUP_DIR}/${DATE}/prometheus-rules.yaml
kubectl get cm -n monitoring prometheus-k8s -o yaml > ${BACKUP_DIR}/${DATE}/prometheus-config.yaml
# 备份 PrometheusRule CRD
kubectl get prometheusrule -n monitoring -o yaml > ${BACKUP_DIR}/${DATE}/prometheus-rules-crd.yaml
# 备份 Grafana dashboards
kubectl get cm -n monitoring -l grafana_dashboard=1 -o yaml > ${BACKUP_DIR}/${DATE}/grafana-dashboards.yaml
# 备份 Alertmanager 配置
kubectl get secret -n monitoring alertmanager-main -o yaml > ${BACKUP_DIR}/${DATE}/alertmanager-secret.yaml
# 备份 DCGM Exporter 配置
kubectl get cm -n monitoring dcgm-exporter-config -o yaml > ${BACKUP_DIR}/${DATE}/dcgm-config.yaml
# 打包压缩
tar -czf ${BACKUP_DIR}/monitoring-${DATE}.tar.gz -C ${BACKUP_DIR} ${DATE}
rm -rf ${BACKUP_DIR}/${DATE}
# 备份 Prometheus 数据(可选,数据量大)
# kubectl exec -n monitoring prometheus-k8s-0 -- tar czf /tmp/prometheus-data.tar.gz /prometheus
# kubectl cp monitoring/prometheus-k8s-0:/tmp/prometheus-data.tar.gz ${BACKUP_DIR}/prometheus-data-${DATE}.tar.gz
echo "Backup completed: ${BACKUP_DIR}/monitoring-${DATE}.tar.gz"
◆ 5.3.2 故障恢复流程
- Prometheus 数据损坏恢复
# 停止 Prometheus
kubectl scale statefulset -n monitoring prometheus-k8s --replicas=0
# 恢复数据(如果有备份)
kubectl cp prometheus-data-backup.tar.gz monitoring/prometheus-k8s-0:/tmp/
kubectl exec -n monitoring prometheus-k8s-0 -- tar xzf /tmp/prometheus-data-backup.tar.gz -C /
# 重启 Prometheus
kubectl scale statefulset -n monitoring prometheus-k8s --replicas=2
- DCGM 服务异常恢复
# 在 GPU 节点上执行
sudo systemctl stop nvidia-dcgm
# 清理可能的残留
sudo rm -rf /var/run/nvidia-dcgm.pid
sudo rm -rf /var/run/nvidia-dcgm.sock
# 重置 GPU 状态
sudo nvidia-smi -r # 需要权限,可能导致运行中的任务中断
# 重启 DCGM
sudo systemctl start nvidia-dcgm
dcgmi discovery -l # 验证
# 重启 DCGM Exporter
kubectl rollout restart daemonset -n monitoring dcgm-exporter
- 告警规则回滚
# 从备份恢复 PrometheusRule
tar -xzf /backup/monitoring/monitoring-20250115-020000.tar.gz -C /tmp
kubectl apply -f /tmp/20250115-020000/prometheus-rules-crd.yaml
# 验证规则加载
kubectl exec -n monitoring prometheus-k8s-0 -- wget -qO- localhost:9090/api/v1/rules | jq '.data.groups[] | .name'
六、总结
6.1 技术要点回顾
- ✅ 多层次监控架构:硬件层用 DCGM 采集 GPU 原生指标,应用层用自定义 exporter 采集业务指标,两者结合才能全面掌握服务状态
- ✅ 低开销采集方案:使用 DCGM 替代 nvidia-smi 轮询,合理设置采集间隔和指标过滤,在完整监控和性能影响间取得平衡
- ✅ 主动故障检测:不仅依赖阈值告警,还要用预测性告警(predict_linear)和异常检测(环比分析)提前发现问题
- ✅ 完整故障排查链路:从指标异常到日志分析、性能诊断、根因定位,建立标准化的 troubleshooting 流程
6.2 进阶学习方向
- 分布式追踪集成:将 GPU 监控与 OpenTelemetry 结合
- AIOps 智能运维:用机器学习做异常检测和根因分析
- 学习资源:Prometheus 的 Anomaly Detection 项目
- 实践建议:收集 3 个月历史数据训练基线模型,自动识别非正常模式的指标波动(简化示意见本节末尾)
- GPU 资源优化:基于监控数据做容量规划和成本优化
- 学习资源:FinOps Foundation GPU 成本优化最佳实践
- 实践建议:分析 GPU 利用率分布,识别资源浪费的服务,通过 time-slicing 或 MIG 提高密度
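关于“基线 + 异常检测”的思路,下面给出一个最简化的示意:从 Prometheus 拉取历史利用率,用均值加减 3 倍标准差作基线,偏离即视为异常(Prometheus 地址、查询语句和 k 值均为假设,真正的基线模型要复杂得多):
# baseline_anomaly.py —— 基于历史数据的简单基线异常检测示意(均值 ± 3σ)
import json
import statistics
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.monitoring.svc:9090"   # 假设的 Prometheus 地址
QUERY = "avg(DCGM_FI_DEV_GPU_UTIL) by (node)"

def query_range(query, start, end, step="300"):
    """调用 /api/v1/query_range,返回每个序列的 [timestamp, value] 样本列表"""
    params = urllib.parse.urlencode({"query": query, "start": start, "end": end, "step": step})
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query_range?{params}", timeout=10) as resp:
        return json.load(resp)["data"]["result"]

def is_anomaly(history, current, k=3.0):
    """最朴素的基线模型:偏离历史均值超过 k 倍标准差即视为异常"""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-6
    return abs(current - mean) > k * stdev, mean, stdev

if __name__ == "__main__":
    now = int(time.time())
    # 这里只取 7 天数据演示,正文建议积累 3 个月再做基线
    for series in query_range(QUERY, now - 7 * 86400, now):
        node = series["metric"].get("node", "unknown")
        values = [float(v) for _, v in series["values"]]
        if len(values) < 10:
            continue                     # 样本太少不做判断
        anomalous, mean, stdev = is_anomaly(values[:-1], values[-1])
        flag = "ANOMALY" if anomalous else "ok"
        print(f"{node}: current={values[-1]:.1f} baseline={mean:.1f}±{stdev:.1f} [{flag}]")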
6.3 参考资料
- NVIDIA DCGM 文档 - 官方数据中心 GPU 管理工具指南
- DCGM Exporter GitHub - Prometheus exporter 源码和配置示例
- Prometheus 最佳实践 - 指标命名和查询优化
- Grafana GPU 监控模板 - 社区贡献的 Dashboard
附录
A. 命令速查表
# GPU 状态查看
nvidia-smi # 基础信息
nvidia-smi -l 1 # 每秒刷新
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv # 格式化输出
nvidia-smi dmon -s pucvmet # 持续监控(p=功耗/温度 u=利用率 c=时钟 v=降频违例 m=显存 e=ECC/PCIe重传 t=PCIe吞吐)
# DCGM 操作
dcgmi discovery -l # 发现 GPU
dcgmi dmon -e 155,203,204,251 -c 10 # 监控特定指标 10 次
dcgmi health -c # 健康检查
dcgmi diag -r 1 # 快速诊断
dcgmi stats -e # 启用统计
dcgmi stats -v # 查看统计
# Prometheus 查询
curl http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL # API 查询
promtool query instant http://localhost:9090 'DCGM_FI_DEV_GPU_UTIL' # CLI 查询
promtool check rules /etc/prometheus/rules/*.yaml # 验证规则语法
# Kubernetes 监控
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' # 查看各节点可分配的 GPU 数量
kubectl describe node <node-name> | grep -A 8 "Allocated resources" # 查看节点 GPU 等资源的分配情况
B. 配置参数详解
DCGM 关键配置:
update_interval:数据采集间隔,范围 100ms-10s,默认 1s。推荐生产环境 5s
max_keep_samples:内存中保留的样本数,默认 600。保留时间 = interval * samples
watchdog_timeout:看门狗超时,默认 120s。DCGM 无响应时自动重启
field_groups:指标分组,可以自定义采集哪些 field
DCGM Exporter 配置:
--collectors(即 -f):指定采集指标清单文件 counters.csv 的路径,对应上文 values.yaml 里传入的 -f 参数
--dcgm-hostengine-endpoint:DCGM 服务地址,默认 localhost:5555
--no-hostname:不添加 hostname label,适合 Kubernetes 环境
Prometheus 采集配置:
scrape_interval:全局采集间隔,建议 15s(GPU 指标变化快,不宜太长)
scrape_timeout:采集超时,建议 10s(GPU 多时 metrics 接口慢)
evaluation_interval:规则评估间隔,建议 30s
C. 术语表
| 术语 | 英文 | 解释 |
| --- | --- | --- |
| DCGM | Data Center GPU Manager | NVIDIA 官方的 GPU 管理框架,提供监控、诊断、配置等功能 |
| XID 错误 | Xid Error | GPU 硬件错误代码,如 XID 79 表示 GPU 从总线上掉卡(fallen off the bus),XID 48 表示显存双比特 ECC 错误 |
| ECC | Error-Correcting Code | 显存纠错技术,能自动修复单 bit 错误,检测双 bit 错误 |
| SM | Streaming Multiprocessor | GPU 的计算核心单元,利用率指 SM 的活跃比例 |
| PCIe 重传 | PCIe Replay | PCIe 链路传输错误后的重传,高重传率说明链路质量差 |
| 降频 | Throttling | GPU 因温度、功耗等原因降低工作频率以保护硬件 |
| MIG | Multi-Instance GPU | A100 支持的硬件分区技术,将单卡切分成多个独立实例 |
| 显存泄漏 | Memory Leak | 程序分配 GPU 显存后未释放,导致可用显存持续减少 |
本文由云栈社区整理发布,更多 GPU 监控与运维实战方案,欢迎关注社区后续分享。