My boss asked me: these 8 A100 machines burn tens of thousands in electricity every month — what is their utilization actually like? I opened nvidia-smi and had a look: well, 67% right now. The boss was not satisfied: I want trends, history, reports. Fine — time for a proper monitoring system. This post walks through the full process of building GPU monitoring with Prometheus + DCGM.
1. Overview
1.1 Background
nvidia-smi is a fine tool, but it only shows the current state. A production GPU cluster needs to answer questions like:
- What was the average GPU utilization over the past week?
- In which time windows are the GPUs idle, so offline jobs could run there?
- Is GPU memory usage creeping up over time (a memory leak)?
- Is a particular card consistently running hot and due for maintenance?
- How does the inference service's QPS relate to GPU utilization?
Answering them takes a complete monitoring pipeline: collection, storage, visualization, and alerting. That is exactly what the Prometheus + DCGM + Grafana combination provides.
1.2 Technical Highlights
DCGM (Data Center GPU Manager)
NVIDIA's official GPU management tool, considerably more capable than nvidia-smi:
- Exposes 150+ GPU metrics, far more than nvidia-smi can provide
- Built-in health checks and diagnostics
- GPU group management, useful for multi-tenant setups
- Ships a Prometheus exporter that works out of the box
- Very low overhead on the GPU driver (the polling interval is configurable)
Prometheus
The de facto standard for cloud-native monitoring:
- Pull model, well suited to dynamically changing clusters
- Powerful PromQL query language
- Built-in alerting (paired with Alertmanager)
- Rich ecosystem of exporters and dashboards
Why this stack
- Official support: NVIDIA maintains dcgm-exporter and keeps it up to date
- Full coverage: everything from GPU utilization to NVLink bandwidth
- Scales well: hundreds of GPU machines are not a problem
- Controlled cost: all open-source components, no license fees
1.3 When to Use It
This setup works well for:
- Large-model training clusters: monitor training progress, GPU utilization, communication efficiency
- Inference serving clusters: monitor QPS, latency, resource utilization
- GPU virtualization: resource monitoring for MIG, vGPU, and similar setups
- Mixed-workload clusters: co-located training + inference that needs fine-grained scheduling
It is less suitable for:
- Single-machine development environments (overkill)
- Scenarios that need near-real-time metrics (Prometheus scrapes every 15s by default)
1.4 Requirements
Hardware
- NVIDIA GPU: Kepler architecture (K80) or newer
- Volta/Ampere/Hopper recommended; they expose more metrics
Software
NVIDIA Driver: 450.80.02+ (525+ or 550+ recommended)
CUDA: 11.0+
Docker: 20.10+ (if deploying with containers)
Kubernetes: 1.25+ (if deploying on K8s)
Network
- Prometheus can reach port 9400 (dcgm-exporter's default) on every GPU node
- Grafana can reach Prometheus
- Optional: alerting needs access to Alertmanager, a mail server, or WeCom/DingTalk webhooks
Test environment
The configuration in this post was validated in the following environment:
OS: Ubuntu 22.04 LTS
GPU: 8 x NVIDIA A100-80GB
Driver: 550.90.07
DCGM: 3.3.6
Prometheus: 2.48.0
Grafana: 10.2.0
2. Step-by-Step Setup
2.1 Preparation
Install the NVIDIA driver
If the driver is not installed yet, this is the very first step:
# Ubuntu
sudo apt update
sudo apt install -y nvidia-driver-550-server
# Reboot
sudo reboot
# Verify
nvidia-smi
Install DCGM
DCGM can be installed via apt/yum or run as a container. Bare-metal installation first (the datacenter-gpu-manager package lives in NVIDIA's CUDA repository):
# Add the NVIDIA CUDA repository (it hosts the datacenter-gpu-manager package)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Install datacenter-gpu-manager
sudo apt update
sudo apt install -y datacenter-gpu-manager
# Enable and start service
sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm
# Verify
dcgmi discovery -l
Output like the following means the installation succeeded:
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA A100-SXM4-80GB |
| | PCI Bus ID: 00000000:07:00.0 |
| | Device UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
+--------+----------------------------------------------------------------------+
...
Install dcgm-exporter
dcgm-exporter exposes DCGM's metrics in Prometheus format:
# Method 1: Container (recommended)
docker run -d \
--name dcgm-exporter \
--gpus all \
--restart unless-stopped \
-p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
# Method 2: Build from source
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary
# The built binary sits under cmd/dcgm-exporter/ in recent versions; copy it into PATH
sudo cp cmd/dcgm-exporter/dcgm-exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/dcgm-exporter.service << 'EOF'
[Unit]
Description=DCGM Exporter
After=nvidia-dcgm.service
Requires=nvidia-dcgm.service
[Service]
Type=simple
ExecStart=/usr/local/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable dcgm-exporter
sudo systemctl start dcgm-exporter
Verify that the exporter is working:
curl http://localhost:9400/metrics | head -50
You should see output like this:
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxx",device="nvidia0",...} 42
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-yyy",device="nvidia1",...} 44
...
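If you prefer to sanity-check this from code rather than eyeballing curl output, here is a minimal Python sketch (an illustrative helper, not part of the stack; it assumes the requests package and the default :9400 endpoint) that scrapes the exporter once and prints per-GPU utilization and temperature:
#!/usr/bin/env python3
"""Quick sanity check of a dcgm-exporter endpoint (illustrative sketch)."""
import re
import requests

EXPORTER_URL = "http://localhost:9400/metrics"  # adjust to your node

def scrape(metric_name: str) -> dict:
    """Return {gpu_id: value} for one DCGM metric from the /metrics text."""
    text = requests.get(EXPORTER_URL, timeout=5).text
    values = {}
    for line in text.splitlines():
        if not line.startswith(metric_name + "{"):
            continue
        gpu = re.search(r'gpu="(\d+)"', line)
        value = line.rsplit(" ", 1)[-1]
        if gpu:
            values[gpu.group(1)] = float(value)
    return values

if __name__ == "__main__":
    util = scrape("DCGM_FI_DEV_GPU_UTIL")
    temp = scrape("DCGM_FI_DEV_GPU_TEMP")
    for gpu_id in sorted(util):
        print(f"GPU {gpu_id}: util={util[gpu_id]:.0f}%  temp={temp.get(gpu_id, float('nan')):.0f}C")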
2.2 Core Configuration
Prometheus configuration
Assuming you already run a Prometheus server, add a scrape config for the GPU nodes:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Static config for few nodes
  - job_name: 'dcgm-gpu'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
          - 'gpu-node-04:9400'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):9400'
        target_label: instance
        replacement: '${1}'

  # File-based service discovery for dynamic nodes
  - job_name: 'dcgm-gpu-dynamic'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/gpu_nodes.json'
        refresh_interval: 30s
# /etc/prometheus/targets/gpu_nodes.json
[
  {
    "targets": ["gpu-node-01:9400", "gpu-node-02:9400"],
    "labels": {
      "cluster": "training",
      "datacenter": "dc1"
    }
  },
  {
    "targets": ["gpu-node-03:9400", "gpu-node-04:9400"],
    "labels": {
      "cluster": "inference",
      "datacenter": "dc1"
    }
  }
]
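When nodes come and go often, it is handy to generate that target file from an inventory instead of editing it by hand. A small sketch (the inventory dict is a made-up example; the output path matches the file_sd path above, and Prometheus picks changes up automatically):
#!/usr/bin/env python3
"""Render /etc/prometheus/targets/gpu_nodes.json from a simple inventory (sketch)."""
import json

# Hypothetical inventory: cluster name -> list of GPU node hostnames
INVENTORY = {
    "training": ["gpu-node-01", "gpu-node-02"],
    "inference": ["gpu-node-03", "gpu-node-04"],
}
EXPORTER_PORT = 9400
OUTPUT = "/etc/prometheus/targets/gpu_nodes.json"

def render(inventory: dict, datacenter: str = "dc1") -> list:
    """Build the file_sd target groups Prometheus expects."""
    return [
        {
            "targets": [f"{host}:{EXPORTER_PORT}" for host in hosts],
            "labels": {"cluster": cluster, "datacenter": datacenter},
        }
        for cluster, hosts in inventory.items()
    ]

if __name__ == "__main__":
    with open(OUTPUT, "w") as f:
        json.dump(render(INVENTORY), f, indent=2)
    print(f"wrote {OUTPUT}")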
Customizing collected metrics
dcgm-exporter collects a set of common metrics by default, but not all of them. You can customize the list with a counters file:
# /etc/dcgm-exporter/custom-counters.csv
# Format: DCGM_FIELD, Prometheus metric type, help message
#
# Basic metrics
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory bandwidth utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
# Clock and energy metrics
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
# PCIe metrics
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, gauge, PCIe transmit throughput.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, gauge, PCIe receive throughput.
# NVLink metrics (important for multi-GPU training)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, Total NVLink bandwidth.
# Tensor Core metrics
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
# Error counters
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total double-bit volatile ECC errors.
# XID errors (GPU faults)
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
Start the exporter with the custom counters file:
docker run -d \
--name dcgm-exporter \
--gpus all \
-p 9400:9400 \
-v /etc/dcgm-exporter:/etc/dcgm-exporter:ro \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 \
-f /etc/dcgm-exporter/custom-counters.csv
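A quick way to confirm the custom counters actually took effect is to diff the field names in the CSV against what the exporter exposes. A small sketch (assumes the requests package; the paths and URL are the ones used above):
#!/usr/bin/env python3
"""Check which counters from custom-counters.csv are actually exposed (sketch)."""
import requests

CSV_PATH = "/etc/dcgm-exporter/custom-counters.csv"
METRICS_URL = "http://localhost:9400/metrics"

def configured_counters(path: str) -> set:
    """Field names from the counters CSV (first column), ignoring comments and blanks."""
    names = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                names.add(line.split(",")[0].strip())
    return names

def exposed_metrics(url: str) -> set:
    """Metric names currently present on the /metrics endpoint."""
    text = requests.get(url, timeout=5).text
    return {line.split("{")[0] for line in text.splitlines() if line.startswith("DCGM_")}

if __name__ == "__main__":
    missing = configured_counters(CSV_PATH) - exposed_metrics(METRICS_URL)
    if missing:
        print("Configured but not exposed (may be unsupported on this GPU/driver):")
        for name in sorted(missing):
            print(f"  {name}")
    else:
        print("All configured counters are exposed.")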
2.3 Kubernetes Deployment
If your GPU cluster is managed by Kubernetes, the recommended way to run dcgm-exporter is a DaemonSet:
# dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: pod-gpu-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
            - name: custom-counters
              mountPath: /etc/dcgm-exporter
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: custom-counters
          configMap:
            name: dcgm-exporter-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
  namespace: monitoring
data:
  default-counters.csv: |
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory bandwidth utilization (in %).
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT, gauge, PCIe transmit throughput.
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT, gauge, PCIe receive throughput.
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  type: ClusterIP
  ports:
    - port: 9400
      targetPort: 9400
      name: metrics
  selector:
    app: dcgm-exporter
# Deploy
kubectl apply -f dcgm-exporter-daemonset.yaml
# Verify pods running on GPU nodes
kubectl get pods -n monitoring -l app=dcgm-exporter -o wide
# Check metrics
kubectl port-forward -n monitoring daemonset/dcgm-exporter 9400:9400
curl http://localhost:9400/metrics
ServiceMonitor configuration (for Prometheus Operator)
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
2.4 Verification
Check metric collection
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-gpu")'
# Query GPU utilization
curl -G http://prometheus:9090/api/v1/query \
--data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
# Check all available DCGM metrics
curl http://prometheus:9090/api/v1/label/__name__/values | jq '.data[] | select(startswith("DCGM"))'
Useful verification queries
# Average GPU utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)
# GPU utilization by node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)
# Total GPU memory usage in GB
sum(DCGM_FI_DEV_FB_USED) / 1024
# 95th percentile of GPU temperature across all GPUs
quantile(0.95, DCGM_FI_DEV_GPU_TEMP)
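To turn these queries into the kind of report the boss actually wants (average utilization over the past week, per node), you can hit the Prometheus range-query API directly. A sketch, assuming the requests package and the Prometheus address used elsewhere in this post:
#!/usr/bin/env python3
"""Weekly GPU utilization report per node via the Prometheus HTTP API (sketch)."""
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # adjust to your environment

def weekly_avg_util_by_node() -> dict:
    """Average DCGM_FI_DEV_GPU_UTIL per instance over the last 7 days."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={
            "query": "avg by (instance) (DCGM_FI_DEV_GPU_UTIL)",
            "start": end - 7 * 24 * 3600,
            "end": end,
            "step": "1h",
        },
        timeout=30,
    )
    resp.raise_for_status()
    report = {}
    for series in resp.json()["data"]["result"]:
        samples = [float(v) for _, v in series["values"]]
        report[series["metric"]["instance"]] = sum(samples) / len(samples)
    return report

if __name__ == "__main__":
    for node, avg_util in sorted(weekly_avg_util_by_node().items()):
        print(f"{node}: {avg_util:.1f}% average GPU utilization over the last 7 days")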
3. Example Code and Configuration
3.1 Deploying the Full Monitoring Stack
docker-compose deployment (good for small setups)
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'dcgm-gpu'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
          - 'gpu-node-04:9400'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):9400'
        target_label: instance
        replacement: '${1}'
3.2 Alerting Rules
# prometheus/rules/gpu_alerts.yml
groups:
  - name: gpu_alerts
    interval: 30s
    rules:
      # GPU utilization too low (wasting resources)
      - alert: GPUUtilizationLow
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization is low on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has been below 20% utilization for over 1 hour. Current: {{ $value | printf \"%.1f\" }}%"

      # GPU memory almost full
      - alert: GPUMemoryHigh
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage critical on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} memory usage is above 95%. Used: {{ $value | printf \"%.1f\" }}%"

      # GPU temperature too high
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C, which exceeds the safe threshold (83°C)"

      # GPU temperature warning
      - alert: GPUTemperatureWarning
        expr: DCGM_FI_DEV_GPU_TEMP > 75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature elevated on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C"

      # ECC errors detected
      - alert: GPUECCErrors
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "GPU ECC double-bit errors on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} has detected uncorrectable ECC errors. This may indicate hardware failure."

      # XID errors (GPU faults)
      - alert: GPUXIDErrors
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "GPU XID errors on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} reported an XID error. Check nvidia-smi and dmesg for details."

      # GPU power usage near limit
      - alert: GPUPowerHigh
        expr: DCGM_FI_DEV_POWER_USAGE > 380
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU power consumption high on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} power usage is {{ $value }}W, approaching the TDP limit"

      # DCGM exporter down
      - alert: DCGMExporterDown
        expr: up{job="dcgm-gpu"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DCGM Exporter is down on {{ $labels.instance }}"
          description: "Cannot scrape GPU metrics from {{ $labels.instance }}. Check if dcgm-exporter is running."

      # NVLink bandwidth degraded
      - alert: NVLinkBandwidthLow
        expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL < 200000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "NVLink bandwidth degraded on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} NVLink bandwidth is {{ $value | humanize }}B/s, which is below expected"
3.3 Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-alerts-critical'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
        send_resolved: true

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-alerts-warning'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          {{ end }}

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR-PAGERDUTY-SERVICE-KEY'
        severity: critical

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
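Before trusting the routing tree with real incidents, push a synthetic alert through it and check that it lands in the right channel. A sketch using Alertmanager's v2 API (assumes the requests package; the label values are made up for the test):
#!/usr/bin/env python3
"""Send a synthetic alert to Alertmanager to exercise routing and receivers (sketch)."""
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER_URL = "http://alertmanager:9093"

def send_test_alert() -> None:
    """POST a short-lived fake GPUTemperatureHigh alert."""
    now = datetime.now(timezone.utc)
    alert = [{
        "labels": {
            "alertname": "GPUTemperatureHigh",   # matches the rule name used above
            "severity": "critical",
            "instance": "gpu-node-01:9400",      # fake instance for the test
            "gpu": "0",
        },
        "annotations": {
            "summary": "Test alert - please ignore",
            "description": "Synthetic alert used to verify Alertmanager routing.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=alert, timeout=10)
    resp.raise_for_status()
    print("test alert accepted by Alertmanager")

if __name__ == "__main__":
    send_test_alert()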
3.4 Grafana Dashboard
Here is a complete GPU monitoring dashboard JSON that can be imported into Grafana as-is:
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
"id": 1,
"panels": [],
"title": "Overview",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 50 },
{ "color": "red", "value": 80 }
]
},
"unit": "percent"
}
},
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 1 },
"id": 2,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"pluginVersion": "10.2.0",
"targets": [
{
"expr": "avg(DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"})",
"legendFormat": "Avg GPU Util",
"refId": "A"
}
],
"title": "Average GPU Utilization",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
},
"unit": "celsius"
}
},
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 1 },
"id": 3,
"options": {
"colorMode": "value",
"graphMode": "none",
"reduceOptions": { "calcs": ["max"], "fields": "", "values": false }
},
"targets": [
{
"expr": "max(DCGM_FI_DEV_GPU_TEMP{instance=~\"$instance\"})",
"legendFormat": "Max Temp",
"refId": "A"
}
],
"title": "Max GPU Temperature",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
},
"unit": "watt"
}
},
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 1 },
"id": 4,
"options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
"targets": [
{
"expr": "sum(DCGM_FI_DEV_POWER_USAGE{instance=~\"$instance\"})",
"legendFormat": "Total Power",
"refId": "A"
}
],
"title": "Total Power Consumption",
"type": "stat"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "green", "value": null }]
},
"unit": "short"
}
},
"gridPos": { "h": 4, "w": 6, "x": 18, "y": 1 },
"id": 5,
"options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
"targets": [
{
"expr": "count(DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"})",
"legendFormat": "GPU Count",
"refId": "A"
}
],
"title": "Active GPUs",
"type": "stat"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 5 },
"id": 10,
"panels": [],
"title": "GPU Utilization",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"lineInterpolation": "smooth",
"lineWidth": 1,
"showPoints": "never"
},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
"unit": "percent"
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
"id": 11,
"options": { "legend": { "calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom" } },
"targets": [
{
"expr": "DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Utilization by GPU",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"drawStyle": "line",
"fillOpacity": 20,
"lineInterpolation": "smooth",
"lineWidth": 1
},
"max": 100,
"min": 0,
"unit": "percent"
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
"id": 12,
"targets": [
{
"expr": "(DCGM_FI_DEV_FB_USED{instance=~\"$instance\"} / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Memory Utilization",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 14 },
"id": 20,
"panels": [],
"title": "Temperature & Power",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 },
"unit": "celsius"
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 15 },
"id": 21,
"targets": [
{
"expr": "DCGM_FI_DEV_GPU_TEMP{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Temperature",
"type": "timeseries"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 },
"unit": "watt"
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 15 },
"id": 22,
"targets": [
{
"expr": "DCGM_FI_DEV_POWER_USAGE{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }}",
"refId": "A"
}
],
"title": "GPU Power Usage",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 23 },
"id": 30,
"panels": [],
"title": "Memory Details",
"type": "row"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"fieldConfig": {
"defaults": {
"custom": { "drawStyle": "bars", "fillOpacity": 80, "stacking": { "mode": "normal" } },
"unit": "decmbytes"
}
},
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 24 },
"id": 31,
"targets": [
{
"expr": "DCGM_FI_DEV_FB_USED{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }} Used",
"refId": "A"
},
{
"expr": "DCGM_FI_DEV_FB_FREE{instance=~\"$instance\"}",
"legendFormat": "{{ instance }} GPU {{ gpu }} Free",
"refId": "B"
}
],
"title": "GPU Memory Usage (MB)",
"type": "timeseries"
}
],
"refresh": "30s",
"schemaVersion": 38,
"style": "dark",
"tags": ["gpu", "nvidia", "dcgm"],
"templating": {
"list": [
{
"current": { "selected": false, "text": "prometheus", "value": "prometheus" },
"hide": 0,
"includeAll": false,
"multi": false,
"name": "datasource",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
},
{
"allValue": ".*",
"current": { "selected": true, "text": "All", "value": "$__all" },
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"definition": "label_values(DCGM_FI_DEV_GPU_UTIL, instance)",
"hide": 0,
"includeAll": true,
"multi": true,
"name": "instance",
"options": [],
"query": { "query": "label_values(DCGM_FI_DEV_GPU_UTIL, instance)", "refId": "A" },
"refresh": 2,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"type": "query"
}
]
},
"time": { "from": "now-1h", "to": "now" },
"timepicker": {},
"timezone": "",
"title": "GPU Cluster Monitoring",
"uid": "gpu-cluster-monitoring",
"version": 1,
"weekStart": ""
}
Save it as gpu-dashboard.json and import it in Grafana.
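Importing by hand works, but if you provision several clusters it is nicer to push the JSON through Grafana's HTTP API. A sketch (assumes the requests package, an API token with editor rights, and the file name used above; the token value is obviously a placeholder):
#!/usr/bin/env python3
"""Import gpu-dashboard.json into Grafana via its HTTP API (sketch)."""
import json
import requests

GRAFANA_URL = "http://grafana:3000"
API_TOKEN = "YOUR-GRAFANA-API-TOKEN"

def import_dashboard(path: str = "gpu-dashboard.json") -> None:
    """POST the dashboard JSON to /api/dashboards/db, overwriting any existing copy."""
    with open(path) as f:
        dashboard = json.load(f)
    dashboard["id"] = None  # let Grafana assign an internal id
    payload = {"dashboard": dashboard, "overwrite": True, "folderId": 0}
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=payload,
        timeout=15,
    )
    resp.raise_for_status()
    print("imported:", resp.json().get("url"))

if __name__ == "__main__":
    import_dashboard()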
4. Best Practices and Caveats
4.1 Performance Tuning
Tuning the scrape interval
The default 15s scrape interval is enough for most situations, but adjust it to your needs:
# High frequency (for debugging)
scrape_interval: 5s # More CPU on Prometheus, more storage
# Standard (production)
scrape_interval: 15s # Good balance
# Low frequency (for large clusters)
scrape_interval: 30s # Less accurate, but lower overhead
With more than 100 GPU nodes, consider:
- A 15s-30s scrape interval
- Enabling Prometheus remote write, with Thanos/Cortex for long-term storage
- Sharding: one Prometheus per cluster
Reducing label cardinality
DCGM metrics carry a UUID label by default; it is unique per card and inflates series cardinality. Since these labels come from the scraped metrics themselves, drop them with metric_relabel_configs:
# prometheus.yml - drop high-cardinality labels after scraping
metric_relabel_configs:
  - action: labeldrop
    regex: 'UUID'       # Drop the UUID label to reduce cardinality
  - action: labeldrop
    regex: 'modelName'  # If all GPUs are the same model
Using recording rules
Precompute frequently used queries to cut dashboard load times:
# recording_rules.yml
groups:
  - name: gpu_recording_rules
    interval: 30s
    rules:
      # Average GPU utilization by node
      - record: node:gpu_utilization:avg
        expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL)

      # Total memory used by node (in GB)
      - record: node:gpu_memory_used_gb:sum
        expr: sum by (instance) (DCGM_FI_DEV_FB_USED) / 1024

      # GPU memory utilization ratio
      - record: gpu:memory_utilization:ratio
        expr: |
          DCGM_FI_DEV_FB_USED /
          (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

      # 5-minute average GPU utilization
      - record: gpu:utilization:avg5m
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
4.2 Security Hardening
TLS encryption
In production, traffic between the exporter and Prometheus should be encrypted:
# dcgm-exporter with TLS
docker run -d \
--name dcgm-exporter \
--gpus all \
-p 9400:9400 \
-v /etc/dcgm-exporter/certs:/certs:ro \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 \
--web.config=/certs/web-config.yml
# web-config.yml
tls_server_config:
  cert_file: /certs/server.crt
  key_file: /certs/server.key
# prometheus.yml
scrape_configs:
  - job_name: 'dcgm-gpu'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      insecure_skip_verify: false
    static_configs:
      - targets: ['gpu-node-01:9400']
Authentication
Add basic auth:
# web-config.yml
basic_auth_users:
  prometheus: $2y$10$xxxxx  # bcrypt hash of the password
# prometheus.yml
scrape_configs:
  - job_name: 'dcgm-gpu'
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/dcgm_password
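The value in basic_auth_users must be a bcrypt hash, not the plaintext password. htpasswd -B works, or a couple of lines of Python (assumes the third-party bcrypt package is installed):
#!/usr/bin/env python3
"""Generate a bcrypt hash for the exporter's web-config.yml (sketch)."""
import getpass
import bcrypt  # pip install bcrypt

if __name__ == "__main__":
    password = getpass.getpass("password for user 'prometheus': ").encode()
    # cost factor 10; paste the printed hash into basic_auth_users
    print(bcrypt.hashpw(password, bcrypt.gensalt(rounds=10)).decode())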
4.3 High Availability
Prometheus HA
Run two Prometheus instances that scrape the same targets:
# prometheus-1.yml
global:
  external_labels:
    replica: prometheus-1
# prometheus-2.yml
global:
  external_labels:
    replica: prometheus-2
Pair them with Thanos or Cortex for deduplication and long-term storage.
Alertmanager HA (clustering is configured via startup flags rather than in alertmanager.yml):
# alertmanager-1 startup flags
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094
# alertmanager-2 startup flags
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094
4.4 Common Errors
Error 1: No data points
Symptom: Grafana shows "No data"
How to investigate:
# 1. Check if exporter is running
curl http://gpu-node-01:9400/metrics
# 2. Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-gpu")'
# 3. Check firewall
sudo iptables -L -n | grep 9400
Common causes:
- dcgm-exporter is not running
- A firewall is blocking port 9400
- The target address configured in Prometheus is wrong
Error 2: DCGM initialization failed
Error: Cannot connect to the DCGM socket
Fix:
# Check DCGM service
sudo systemctl status nvidia-dcgm
# Restart if needed
sudo systemctl restart nvidia-dcgm
sudo systemctl restart dcgm-exporter
# Check permissions
ls -la /var/run/nvidia-dcgm.sock
Error 3: Metric values are 0 or missing
Some metrics are not supported on every GPU model:
# Check supported fields
dcgmi dmon -e 1009,1010,1011 -c 1
# Check if profiling metrics need to be enabled
dcgmi profile --pause # Disable
dcgmi profile --resume # Enable
Error 4: Memory usage keeps growing
dcgm-exporter memory leak (older versions had this problem):
# Check memory usage
ps aux | grep dcgm-exporter
# Upgrade to latest version
docker pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
5. Troubleshooting and Monitoring
5.1 Viewing Logs
DCGM service logs
# Systemd logs
journalctl -u nvidia-dcgm -f
# Docker logs
docker logs dcgm-exporter -f --tail 100
# K8s logs
kubectl logs -f daemonset/dcgm-exporter -n monitoring
Common log analysis
# Find errors
journalctl -u nvidia-dcgm | grep -i error
# DCGM initialization issues
journalctl -u nvidia-dcgm | grep -i "failed to"
5.2 Common Problems
Problem 1: GPU utilization fluctuates wildly
# Check utilization variance
stddev_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
# Compare with request rate (if using vLLM)
rate(vllm_request_success_total[5m])
Likely causes:
- Uneven request arrival → check the load balancing
- Batch size too small → tune the inference framework parameters
- Lots of short requests → consider request batching
Problem 2: GPU memory usage keeps growing
# Memory usage trend
deriv(DCGM_FI_DEV_FB_USED[1h])
# Compare before and after restart
DCGM_FI_DEV_FB_USED offset 1d
Likely causes:
- Python objects not being released → review the code
- KV cache not being cleared → check the inference framework configuration
- CUDA context leak → upgrade the driver
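If deriv() shows a steady climb, it helps to estimate how long until the card runs out of memory. A rough sketch that fits a linear trend to the last few hours of DCGM_FI_DEV_FB_USED via the range API (assumes the requests package; the 80 GB capacity matches the A100s used in this post):
#!/usr/bin/env python3
"""Estimate hours until GPU memory is exhausted from the recent usage trend (sketch)."""
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"
CAPACITY_MB = 80 * 1024  # A100-80GB; adjust for other cards

def hours_until_full(instance: str, gpu: str, window_h: int = 6):
    """Least-squares slope of FB_USED over the window, extrapolated to capacity."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={
            "query": f'DCGM_FI_DEV_FB_USED{{instance="{instance}",gpu="{gpu}"}}',
            "start": end - window_h * 3600,
            "end": end,
            "step": "300",
        },
        timeout=30,
    )
    series = resp.json()["data"]["result"]
    if not series:
        return None
    points = [(float(t), float(v)) for t, v in series[0]["values"]]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points) or 1.0
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / denom  # MB per second
    if slope <= 0:
        return None  # not growing
    return (CAPACITY_MB - points[-1][1]) / slope / 3600

if __name__ == "__main__":
    eta = hours_until_full("gpu-node-01:9400", "0")
    print(f"estimated hours until OOM: {eta:.1f}" if eta else "memory usage is not trending up")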
Problem 3: Temperature too high
# Check fan speed
nvidia-smi -q -d FAN
# Check power limit
nvidia-smi -q -d POWER
# Reduce power limit if needed (temporary)
sudo nvidia-smi -pl 300 # Set 300W limit
Things to check:
- Room temperature → check the air conditioning
- Blocked airflow → check the server for dust
- Heatsink problems → contact hardware maintenance
5.3 Performance Monitoring
Key indicator queries
# Overall cluster health score (0-100), an illustrative weighted mix
(
  (avg(DCGM_FI_DEV_GPU_UTIL) / 100) * 0.3 +
  (1 - max(DCGM_FI_DEV_GPU_TEMP) / 90) * 0.2 +
  (1 - clamp_max(sum(rate(DCGM_FI_DEV_XID_ERRORS[1h])), 1)) * 0.3 +
  (sum(up{job="dcgm-gpu"}) / count(up{job="dcgm-gpu"})) * 0.2
) * 100
# Cost efficiency (useful work per watt)
sum(rate(vllm_request_success_total[5m])) / sum(DCGM_FI_DEV_POWER_USAGE)
Automated inspection script
#!/usr/bin/env python3
"""
GPU cluster health check script.
Run daily via cron to catch issues early.
"""
import requests
import json
from datetime import datetime

PROMETHEUS_URL = "http://prometheus:9090"

THRESHOLDS = {
    "gpu_utilization_low": 20,
    "gpu_temperature_high": 80,
    "memory_usage_high": 95,
}

def query_prometheus(query: str) -> list:
    """Execute PromQL query and return results."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query}
    )
    data = resp.json()
    if data["status"] != "success":
        raise Exception(f"Query failed: {data}")
    return data["data"]["result"]

def check_gpu_health():
    """Run all GPU health checks."""
    issues = []

    # Check GPU utilization
    results = query_prometheus("avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])")
    for r in results:
        if float(r["value"][1]) < THRESHOLDS["gpu_utilization_low"]:
            issues.append({
                "type": "low_utilization",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check GPU temperature
    results = query_prometheus("DCGM_FI_DEV_GPU_TEMP")
    for r in results:
        if float(r["value"][1]) > THRESHOLDS["gpu_temperature_high"]:
            issues.append({
                "type": "high_temperature",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check memory usage
    results = query_prometheus(
        "(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100"
    )
    for r in results:
        if float(r["value"][1]) > THRESHOLDS["memory_usage_high"]:
            issues.append({
                "type": "high_memory",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check for ECC errors
    results = query_prometheus("increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[24h]) > 0")
    for r in results:
        issues.append({
            "type": "ecc_errors",
            "instance": r["metric"].get("instance"),
            "gpu": r["metric"].get("gpu"),
            "value": float(r["value"][1])
        })

    return issues

def generate_report(issues: list) -> str:
    """Generate health check report."""
    report = f"""
GPU Cluster Health Report
Generated: {datetime.now().isoformat()}
{'='*50}
Total Issues Found: {len(issues)}
"""
    if not issues:
        report += "\nAll GPUs are healthy!\n"
    else:
        by_type = {}
        for issue in issues:
            t = issue["type"]
            if t not in by_type:
                by_type[t] = []
            by_type[t].append(issue)

        for issue_type, items in by_type.items():
            report += f"\n{issue_type.upper()} ({len(items)} issues):\n"
            for item in items:
                report += f"  - {item['instance']} GPU {item['gpu']}: {item['value']:.1f}\n"

    return report

if __name__ == "__main__":
    issues = check_gpu_health()
    report = generate_report(issues)
    print(report)

    # Send to Slack/Email if issues found
    if issues:
        # webhook_url = "https://hooks.slack.com/..."
        # requests.post(webhook_url, json={"text": report})
        pass
5.4 Backup and Restore
Prometheus data backup
# Create a snapshot (requires Prometheus to be started with --web.enable-admin-api)
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Backup snapshot directory
tar -czvf prometheus-backup-$(date +%Y%m%d).tar.gz \
/prometheus/snapshots/
# Restore
tar -xzvf prometheus-backup-20241219.tar.gz -C /prometheus/
Grafana dashboard backup
# Export all dashboards
for uid in $(curl -s http://admin:password@grafana:3000/api/search | jq -r '.[].uid'); do
curl -s "http://admin:password@grafana:3000/api/dashboards/uid/$uid" | \
jq '.dashboard' > "dashboard-$uid.json"
done
# Import dashboard
curl -X POST \
-H "Content-Type: application/json" \
-d @dashboard-xxx.json \
http://admin:password@grafana:3000/api/dashboards/import
6. Summary
6.1 Key Takeaways
The core steps for building the GPU monitoring system:
- Install DCGM: NVIDIA's official tool, providing a rich set of GPU metrics
- Deploy dcgm-exporter: exposes those metrics in Prometheus format
- Configure Prometheus: scrape and store the metrics
- Build the Grafana dashboard: visualization
- Set up alerting rules: high temperature, low utilization, ECC errors, and so on
Key configuration files:
/etc/dcgm-exporter/custom-counters.csv: the list of metrics to collect
prometheus.yml: scrape target configuration
gpu_alerts.yml: alerting rules
6.2 Where to Go Next
MIG monitoring
A100/H100 support MIG (Multi-Instance GPU), which splits one card into several isolated instances:
# Enable MIG
sudo nvidia-smi -i 0 -mig 1
# Create GPU instances
sudo nvidia-smi mig -i 0 -cgi 9,9 -C  # two 3g instances; a third 3g profile does not fit on one A100
# DCGM exporter will automatically detect MIG instances
vGPU monitoring
Virtualized environments need additional vGPU metrics:
# Install vGPU manager
# DCGM can monitor both pGPU and vGPU
dcgmi discovery -l
Correlating with business metrics
Analyze GPU utilization together with business QPS:
# GPU utilization per unit of request rate (aggregated across the cluster)
avg(DCGM_FI_DEV_GPU_UTIL)
  /
sum(rate(http_requests_total[5m]))
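PromQL division gives a ratio; if you want an actual correlation coefficient between utilization and request rate over time, pull both series through the range API and compute it offline. A sketch (assumes the requests package; http_requests_total stands in for whatever request counter your service actually exposes):
#!/usr/bin/env python3
"""Pearson correlation between GPU utilization and request rate (sketch)."""
import math
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def fetch(query: str, hours: int = 24) -> list:
    """Return the sample values of a range query (first series only)."""
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": query, "start": end - hours * 3600, "end": end, "step": "60"},
        timeout=30,
    )
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []

def pearson(xs: list, ys: list) -> float:
    """Correlation coefficient of two equally sampled series."""
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

if __name__ == "__main__":
    node = "gpu-node-01:9400"
    util = fetch(f'avg(DCGM_FI_DEV_GPU_UTIL{{instance="{node}"}})')
    qps = fetch('sum(rate(http_requests_total[5m]))')
    print(f"correlation(util, qps) over 24h: {pearson(util, qps):.2f}")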
6.3 References
- NVIDIA DCGM official documentation
- dcgm-exporter on GitHub
- Prometheus official documentation
- NVIDIA GPU metrics field reference
Appendix
A. Command Cheat Sheet
# DCGM basic commands
dcgmi discovery -l # List all GPUs
dcgmi diag -r 1 # Quick diagnostic
dcgmi diag -r 3 # Full diagnostic (takes longer)
dcgmi health -g 0 # Check GPU 0 health
dcgmi stats -g 0 -e # Enable stats collection
dcgmi dmon -e 1009,1010 # Monitor specific fields
# dcgm-exporter
docker run --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
curl http://localhost:9400/metrics
# Prometheus queries
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
curl -G 'http://prometheus:9090/api/v1/query_range' \
--data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-02T00:00:00Z' \
--data-urlencode 'step=1h'
B. Common DCGM Field IDs
| Field ID | Name | Type | Description |
| --- | --- | --- | --- |
| 1 | DCGM_FI_DRIVER_VERSION | String | Driver version |
| 100 | DCGM_FI_DEV_SM_CLOCK | Gauge | SM clock frequency (MHz) |
| 101 | DCGM_FI_DEV_MEM_CLOCK | Gauge | Memory clock frequency (MHz) |
| 150 | DCGM_FI_DEV_GPU_TEMP | Gauge | GPU temperature (°C) |
| 155 | DCGM_FI_DEV_POWER_USAGE | Gauge | Power draw (W) |
| 251 | DCGM_FI_DEV_FB_FREE | Gauge | Free GPU memory (MB) |
| 252 | DCGM_FI_DEV_FB_USED | Gauge | Used GPU memory (MB) |
| 250 | DCGM_FI_DEV_FB_TOTAL | Gauge | Total GPU memory (MB) |
| 203 | DCGM_FI_DEV_GPU_UTIL | Gauge | GPU utilization (%) |
| 204 | DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | Memory bandwidth utilization (%) |
| 200 | DCGM_FI_DEV_PCIE_TX_THROUGHPUT | Gauge | PCIe transmit throughput |
| 201 | DCGM_FI_DEV_PCIE_RX_THROUGHPUT | Gauge | PCIe receive throughput |
| 156 | DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | Cumulative energy consumption (mJ) |
| 230 | DCGM_FI_DEV_XID_ERRORS | Gauge | Last XID error code |
C. Glossary
| Term | Meaning |
| --- | --- |
| DCGM | Data Center GPU Manager, NVIDIA's data center GPU management tool |
| Exporter | The component in the Prometheus ecosystem that exposes metrics |
| Scrape | The act of Prometheus pulling metrics from an exporter |
| PromQL | Prometheus Query Language |
| Recording Rule | A rule that precomputes frequently used queries |
| Alerting Rule | A rule that defines when an alert fires |
| MIG | Multi-Instance GPU, the GPU partitioning feature of A100/H100 |
| ECC | Error Correcting Code for GPU memory |
| XID | NVIDIA GPU error codes |
| SM | Streaming Multiprocessor, the GPU's compute unit |
| TDP | Thermal Design Power |