My boss asked me: these 8 A100 machines cost tens of thousands in electricity every month — what is their utilization actually like? I opened nvidia-smi and took a look: well, 67% right now. The boss was not satisfied: I want trends, history, reports. Fine — time for a proper monitoring system. This article documents the complete process of building GPU monitoring with Prometheus + DCGM.

1. Overview

1.1 Background

nvidia-smi is a fine tool, but it only shows the current state. A production GPU cluster needs to answer questions like:

  • What was the average GPU utilization over the past week?
  • During which periods are the GPUs idle and available for offline jobs?
  • Is GPU memory usage slowly trending upward (a memory leak)?
  • Is a particular card consistently running hot and due for maintenance?
  • How does the inference service's QPS relate to GPU utilization?

Answering these questions takes a complete monitoring pipeline: collection, storage, visualization, and alerting. That is exactly what the Prometheus + DCGM + Grafana combination provides.

1.2 Key Features

DCGM (Data Center GPU Manager)

NVIDIA's official GPU management tool, a step up from nvidia-smi:

  • Exposes 150+ GPU metrics, far more than nvidia-smi can provide
  • Built-in health checks and diagnostics
  • Supports GPU grouping, useful for multi-tenant scenarios
  • Ships a Prometheus exporter that works out of the box
  • Minimal load on the GPU driver (the polling interval is configurable)

Prometheus

The de facto standard for cloud-native monitoring:

  • Pull model, well suited to dynamically changing clusters
  • Powerful PromQL query language
  • Built-in alerting rules, with Alertmanager for routing and notification
  • Rich ecosystem of exporters and dashboards

Why this stack

  • Official support: NVIDIA maintains dcgm-exporter and keeps it up to date
  • Full coverage: everything from GPU utilization to NVLink bandwidth
  • Scales well: handles hundreds of GPU nodes without trouble
  • Predictable cost: all open-source components, no license fees

1.3 When to Use It

This stack is a good fit for:

  • Large-model training clusters: monitoring training progress, GPU utilization, and communication efficiency
  • Inference service clusters: monitoring QPS, latency, and resource utilization
  • GPU virtualization environments: resource monitoring for MIG, vGPU, and similar setups
  • Mixed-workload clusters: co-located training and inference that need fine-grained scheduling

It is less suitable for:

  • Single-machine development environments (overkill)
  • Scenarios with extremely tight metric-latency requirements (Prometheus defaults to a 15s scrape interval)

1.4 Requirements

Hardware

  • NVIDIA GPU: Kepler architecture (K80) or newer
  • Volta/Ampere/Hopper recommended, since they expose more metrics

Software

NVIDIA Driver: 450.80.02+ (525+ or 550+ recommended)
CUDA: 11.0+
Docker: 20.10+ (for container deployment)
Kubernetes: 1.25+ (for K8s deployment)

Network

  • Prometheus can reach port 9400 (the dcgm-exporter default) on every GPU node
  • Grafana can reach Prometheus
  • Optional: alerting needs access to Alertmanager, a mail server, or WeChat Work/DingTalk webhooks

Test Environment

The configuration in this article was verified on:

OS: Ubuntu 22.04 LTS
GPU: 8 x NVIDIA A100-80GB
Driver: 550.90.07
DCGM: 3.3.6
Prometheus: 2.48.0
Grafana: 10.2.0

2. Step-by-Step Setup

2.1 Preparation

Install the NVIDIA driver

If the driver is not installed yet, this is the first step:

# Ubuntu
sudo apt update
sudo apt install -y nvidia-driver-550-server

# Reboot
sudo reboot

# Verify
nvidia-smi

Install DCGM

DCGM can be installed via apt/yum or run as a container. Bare-metal installation first:

# Add the NVIDIA CUDA repository (the datacenter-gpu-manager package is published there)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Install datacenter-gpu-manager
sudo apt update
sudo apt install -y datacenter-gpu-manager

# Enable and start service
sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm

# Verify
dcgmi discovery -l

Output like the following indicates a successful installation:

8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-80GB                                         |
|        | PCI Bus ID: 00000000:07:00.0                                        |
|        | Device UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx               |
+--------+----------------------------------------------------------------------+
...

Install dcgm-exporter

dcgm-exporter exposes DCGM's metrics in Prometheus format:

# Method 1: Container (recommended)
docker run -d \
    --name dcgm-exporter \
    --gpus all \
    --restart unless-stopped \
    -p 9400:9400 \
    nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04

# Method 2: Binary installation
# Download from https://github.com/NVIDIA/dcgm-exporter/releases
wget https://github.com/NVIDIA/dcgm-exporter/releases/download/3.3.6/dcgm-exporter-3.3.6-amd64.tar.gz
tar -xzf dcgm-exporter-3.3.6-amd64.tar.gz
sudo mv dcgm-exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/dcgm-exporter.service << 'EOF'
[Unit]
Description=DCGM Exporter
After=nvidia-dcgm.service
Requires=nvidia-dcgm.service

[Service]
Type=simple
ExecStart=/usr/local/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable dcgm-exporter
sudo systemctl start dcgm-exporter

Verify that the exporter is working:

curl http://localhost:9400/metrics | head -50

You should see output like this:

# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxx",device="nvidia0",...} 42
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-yyy",device="nvidia1",...} 44
...
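As a scripted version of the same check — a minimal sketch, assuming the exporter is reachable on localhost:9400 and that the third-party requests package is installed — the plain-text exposition format is easy to parse directly:

#!/usr/bin/env python3
"""Quick scripted sanity check of dcgm-exporter output.

Assumes the exporter listens on localhost:9400 (adjust EXPORTER_URL) and that
the 'requests' package is installed.
"""
import re
import requests

EXPORTER_URL = "http://localhost:9400/metrics"

def main() -> None:
    text = requests.get(EXPORTER_URL, timeout=5).text
    # Exposition lines look like: DCGM_FI_DEV_GPU_TEMP{gpu="0",...} 42
    pattern = re.compile(r'^DCGM_FI_DEV_GPU_TEMP\{[^}]*gpu="(\d+)"[^}]*\}\s+([\d.]+)')
    temps = {m.group(1): float(m.group(2))
             for m in (pattern.match(line) for line in text.splitlines()) if m}
    if not temps:
        raise SystemExit("No DCGM_FI_DEV_GPU_TEMP series found - check the exporter")
    for gpu, temp in sorted(temps.items()):
        print(f"GPU {gpu}: {temp:.0f} C")

if __name__ == "__main__":
    main()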

2.2 Core Configuration

Prometheus configuration

Assuming Prometheus is already running, add a scrape config for the GPU nodes:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Static config for few nodes
  - job_name: 'dcgm-gpu'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
          - 'gpu-node-04:9400'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):9400'
        target_label: instance
        replacement: '${1}'

  # File-based service discovery for dynamic nodes
  - job_name: 'dcgm-gpu-dynamic'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/gpu_nodes.json'
        refresh_interval: 30s
# /etc/prometheus/targets/gpu_nodes.json
[
  {
    "targets": ["gpu-node-01:9400", "gpu-node-02:9400"],
    "labels": {
      "cluster": "training",
      "datacenter": "dc1"
    }
  },
  {
    "targets": ["gpu-node-03:9400", "gpu-node-04:9400"],
    "labels": {
      "cluster": "inference",
      "datacenter": "dc1"
    }
  }
]
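If the node list already lives in an inventory system, the file_sd JSON above can be generated rather than hand-edited. A minimal sketch; the INVENTORY mapping and output path are placeholders for whatever source of truth you actually use:

#!/usr/bin/env python3
"""Generate a Prometheus file_sd target file for dcgm-exporter nodes.

The INVENTORY mapping is a stand-in for your real source of truth (CMDB,
Ansible inventory, cloud API, ...). Prometheus re-reads the file on each
refresh_interval, so no reload is required.
"""
import json

DCGM_PORT = 9400
OUTPUT = "/etc/prometheus/targets/gpu_nodes.json"

# cluster -> list of hostnames (example data)
INVENTORY = {
    "training": ["gpu-node-01", "gpu-node-02"],
    "inference": ["gpu-node-03", "gpu-node-04"],
}

def build_targets(inventory: dict) -> list:
    return [
        {
            "targets": [f"{host}:{DCGM_PORT}" for host in hosts],
            "labels": {"cluster": cluster, "datacenter": "dc1"},
        }
        for cluster, hosts in inventory.items()
    ]

if __name__ == "__main__":
    with open(OUTPUT, "w") as f:
        json.dump(build_targets(INVENTORY), f, indent=2)
    print(f"Wrote {OUTPUT}")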

Customizing collected metrics

By default dcgm-exporter collects the common metrics, but not all of them. The set can be customized via a counters file:

# /etc/dcgm-exporter/custom-counters.csv
# Format: DCGM_FIELD_ID, Prometheus_Metric_Name, Type
#
# Basic metrics
DCGM_FI_DEV_GPU_UTIL,       DCGM_FI_DEV_GPU_UTIL,        gauge
DCGM_FI_DEV_MEM_COPY_UTIL,  DCGM_FI_DEV_MEM_COPY_UTIL,   gauge
DCGM_FI_DEV_FB_FREE,        DCGM_FI_DEV_FB_FREE,         gauge
DCGM_FI_DEV_FB_USED,        DCGM_FI_DEV_FB_USED,         gauge
DCGM_FI_DEV_GPU_TEMP,       DCGM_FI_DEV_GPU_TEMP,        gauge
DCGM_FI_DEV_POWER_USAGE,    DCGM_FI_DEV_POWER_USAGE,     gauge

# Memory metrics
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter
DCGM_FI_DEV_MEM_CLOCK,      DCGM_FI_DEV_MEM_CLOCK,       gauge
DCGM_FI_DEV_SM_CLOCK,       DCGM_FI_DEV_SM_CLOCK,        gauge

# PCIe metrics
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_TX_THROUGHPUT, gauge
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT, gauge

# NVLink metrics (important for multi-GPU training)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge

# Tensor Core metrics
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge

# Error counts
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter

# XID errors (GPU faults)
DCGM_FI_DEV_XID_ERRORS,     DCGM_FI_DEV_XID_ERRORS,      gauge

Point the exporter at the file when starting it:

docker run -d \
    --name dcgm-exporter \
    --gpus all \
    -p 9400:9400 \
    -v /etc/dcgm-exporter:/etc/dcgm-exporter:ro \
    nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 \
    -f /etc/dcgm-exporter/custom-counters.csv

2.3 Kubernetes Deployment

If your GPU cluster is managed by Kubernetes, deploy dcgm-exporter as a DaemonSet:

# dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: pod-gpu-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
            - name: custom-counters
              mountPath: /etc/dcgm-exporter
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: custom-counters
          configMap:
            name: dcgm-exporter-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
  namespace: monitoring
data:
  default-counters.csv: |
    DCGM_FI_DEV_GPU_UTIL,       DCGM_FI_DEV_GPU_UTIL,        gauge
    DCGM_FI_DEV_MEM_COPY_UTIL,  DCGM_FI_DEV_MEM_COPY_UTIL,   gauge
    DCGM_FI_DEV_FB_FREE,        DCGM_FI_DEV_FB_FREE,         gauge
    DCGM_FI_DEV_FB_USED,        DCGM_FI_DEV_FB_USED,         gauge
    DCGM_FI_DEV_GPU_TEMP,       DCGM_FI_DEV_GPU_TEMP,        gauge
    DCGM_FI_DEV_POWER_USAGE,    DCGM_FI_DEV_POWER_USAGE,     gauge
    DCGM_FI_DEV_SM_CLOCK,       DCGM_FI_DEV_SM_CLOCK,        gauge
    DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_TX_THROUGHPUT, gauge
    DCGM_FI_DEV_PCIE_RX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT, gauge
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  type: ClusterIP
  ports:
    - port: 9400
      targetPort: 9400
      name: metrics
  selector:
    app: dcgm-exporter
# Deploy
kubectl apply -f dcgm-exporter-daemonset.yaml

# Verify pods running on GPU nodes
kubectl get pods -n monitoring -l app=dcgm-exporter -o wide

# Check metrics
kubectl port-forward -n monitoring daemonset/dcgm-exporter 9400:9400
curl http://localhost:9400/metrics

ServiceMonitor configuration (for Prometheus Operator)

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace

2.4 Startup Verification

Check metric collection

# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-gpu")'

# Query GPU utilization
curl -G http://prometheus:9090/api/v1/query \
    --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'

# Check all available DCGM metrics
curl http://prometheus:9090/api/v1/label/__name__/values | jq '.data[] | select(startswith("DCGM"))'

Common verification queries

# Average GPU utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)

# GPU utilization by node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)

# Total GPU memory usage in GB
sum(DCGM_FI_DEV_FB_USED) / 1024

# 95th percentile GPU temperature across all cards
quantile(0.95, DCGM_FI_DEV_GPU_TEMP)
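The same queries can also be turned into the weekly report the introduction asked for, via the HTTP API. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 and the requests package is installed:

#!/usr/bin/env python3
"""Weekly GPU utilization report via the Prometheus range-query API.

PROMETHEUS_URL is an assumption - point it at your own server.
"""
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def query_range(expr: str, days: int = 7, step: str = "1h") -> list:
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": expr, "start": end - days * 86400, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # Per-node hourly average utilization over the past week
    for series in query_range("avg by (instance) (DCGM_FI_DEV_GPU_UTIL)"):
        values = [float(v) for _, v in series["values"]]
        avg = sum(values) / len(values)
        idle_hours = sum(1 for v in values if v < 10)  # hours below 10% utilization
        print(f"{series['metric'].get('instance', 'unknown')}: "
              f"weekly avg {avg:.1f}%, ~{idle_hours} idle hours")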

3. Example Code and Configuration

3.1 Full Monitoring Stack Deployment

docker-compose deployment (suitable for small setups)

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'dcgm-gpu'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
          - 'gpu-node-04:9400'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):9400'
        target_label: instance
        replacement: '${1}'
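Because the compose file starts Prometheus with --web.enable-lifecycle, edits to this file (new GPU nodes, new rules) can be applied without restarting the container. A minimal reload-and-verify sketch, assuming Prometheus is reachable at http://prometheus:9090:

#!/usr/bin/env python3
"""Reload the Prometheus config via the lifecycle endpoint and confirm it loaded.

Requires --web.enable-lifecycle (set in the docker-compose command above).
"""
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def reload_config() -> None:
    # POST /-/reload re-reads prometheus.yml and all rule files
    requests.post(f"{PROMETHEUS_URL}/-/reload", timeout=10).raise_for_status()

def config_is_healthy() -> bool:
    status = requests.get(f"{PROMETHEUS_URL}/api/v1/status/config", timeout=10).json()
    return status.get("status") == "success"

if __name__ == "__main__":
    reload_config()
    print("Config reloaded, healthy:", config_is_healthy())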

3.2 Alerting Rules

# prometheus/rules/gpu_alerts.yml
groups:
  - name: gpu_alerts
    interval: 30s
    rules:
      # GPU utilization too low (wasting resources)
      - alert: GPUUtilizationLow
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization is low on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has been below 20% utilization for over 1 hour. Current: {{ $value | printf \"%.1f\" }}%"

      # GPU memory almost full
      - alert: GPUMemoryHigh
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage critical on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} memory usage is above 95%. Used: {{ $value | printf \"%.1f\" }}%"

      # GPU temperature too high
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C, which exceeds safe threshold (83°C)"

      # GPU temperature warning
      - alert: GPUTemperatureWarning
        expr: DCGM_FI_DEV_GPU_TEMP > 75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature elevated on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C"

      # ECC errors detected
      - alert: GPUECCErrors
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "GPU ECC double-bit errors on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} has detected uncorrectable ECC errors. This may indicate hardware failure."

      # XID errors (GPU faults)
      - alert: GPUXIDErrors
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "GPU XID errors on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} reported XID error {{ $value }}. Check nvidia-smi for details."

      # GPU power usage near limit
      - alert: GPUPowerHigh
        expr: DCGM_FI_DEV_POWER_USAGE > 380
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU power consumption high on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} power usage is {{ $value }}W, approaching TDP limit"

      # DCGM exporter down
      - alert: DCGMExporterDown
        expr: up{job="dcgm-gpu"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DCGM Exporter is down on {{ $labels.instance }}"
          description: "Cannot scrape GPU metrics from {{ $labels.instance }}. Check if dcgm-exporter is running."

      # NVLink bandwidth degraded
      - alert: NVLinkBandwidthLow
        expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL < 200000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "NVLink bandwidth degraded on {{ $labels.instance }}"
          description: "GPU {{ $labels.gpu }} NVLink bandwidth is {{ $value | humanize }}B/s, which is below expected"

3.3 Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-alerts-critical'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
        send_resolved: true

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#gpu-alerts-warning'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          {{ end }}

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR-PAGERDUTY-SERVICE-KEY'
        severity: critical

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
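Before waiting for a real incident, the routing above can be exercised by pushing a synthetic alert into Alertmanager's v2 API. A minimal sketch; the Alertmanager address and label values are assumptions, and the alert resolves itself once endsAt passes:

#!/usr/bin/env python3
"""Send a synthetic alert to Alertmanager to verify routing and receivers.

ALERTMANAGER_URL and the label values are assumptions - adjust to your setup.
"""
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER_URL = "http://alertmanager:9093"

def send_test_alert() -> None:
    now = datetime.now(timezone.utc)
    alert = [{
        "labels": {
            "alertname": "GPUTemperatureHigh",
            "severity": "warning",            # should land in #gpu-alerts-warning
            "instance": "gpu-node-01:9400",
            "gpu": "0",
        },
        "annotations": {
            "summary": "Synthetic test alert",
            "description": "Routing test - safe to ignore",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),  # auto-resolve
    }]
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=alert, timeout=10)
    resp.raise_for_status()
    print("Alert accepted:", resp.status_code)

if __name__ == "__main__":
    send_test_alert()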

3.4 Grafana Dashboard

Below is a complete GPU monitoring dashboard JSON that can be imported into Grafana as-is.

{
  "annotations": {
    "list": []
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 1,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
      "id": 1,
      "panels": [],
      "title": "Overview",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 50 },
              { "color": "red", "value": 80 }
            ]
          },
          "unit": "percent"
        }
      },
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 1 },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "pluginVersion": "10.2.0",
      "targets": [
        {
          "expr": "avg(DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"})",
          "legendFormat": "Avg GPU Util",
          "refId": "A"
        }
      ],
      "title": "Average GPU Utilization",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 70 },
              { "color": "red", "value": 85 }
            ]
          },
          "unit": "celsius"
        }
      },
      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 1 },
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "reduceOptions": { "calcs": ["max"], "fields": "", "values": false }
      },
      "targets": [
        {
          "expr": "max(DCGM_FI_DEV_GPU_TEMP{instance=~\"$instance\"})",
          "legendFormat": "Max Temp",
          "refId": "A"
        }
      ],
      "title": "Max GPU Temperature",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [{ "color": "blue", "value": null }]
          },
          "unit": "watt"
        }
      },
      "gridPos": { "h": 4, "w": 6, "x": 12, "y": 1 },
      "id": 4,
      "options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
      "targets": [
        {
          "expr": "sum(DCGM_FI_DEV_POWER_USAGE{instance=~\"$instance\"})",
          "legendFormat": "Total Power",
          "refId": "A"
        }
      ],
      "title": "Total Power Consumption",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [{ "color": "green", "value": null }]
          },
          "unit": "short"
        }
      },
      "gridPos": { "h": 4, "w": 6, "x": 18, "y": 1 },
      "id": 5,
      "options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
      "targets": [
        {
          "expr": "count(DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"})",
          "legendFormat": "GPU Count",
          "refId": "A"
        }
      ],
      "title": "Active GPUs",
      "type": "stat"
    },
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 5 },
      "id": 10,
      "panels": [],
      "title": "GPU Utilization",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 20,
            "gradientMode": "none",
            "lineInterpolation": "smooth",
            "lineWidth": 1,
            "showPoints": "never"
          },
          "mappings": [],
          "max": 100,
          "min": 0,
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "percent"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
      "id": 11,
      "options": { "legend": { "calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom" } },
      "targets": [
        {
          "expr": "DCGM_FI_DEV_GPU_UTIL{instance=~\"$instance\"}",
          "legendFormat": "{{ instance }} GPU {{ gpu }}",
          "refId": "A"
        }
      ],
      "title": "GPU Utilization by GPU",
      "type": "timeseries"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "drawStyle": "line",
            "fillOpacity": 20,
            "lineInterpolation": "smooth",
            "lineWidth": 1
          },
          "max": 100,
          "min": 0,
          "unit": "percent"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
      "id": 12,
      "targets": [
        {
          "expr": "(DCGM_FI_DEV_FB_USED{instance=~\"$instance\"} / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100",
          "legendFormat": "{{ instance }} GPU {{ gpu }}",
          "refId": "A"
        }
      ],
      "title": "GPU Memory Utilization",
      "type": "timeseries"
    },
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 14 },
      "id": 20,
      "panels": [],
      "title": "Temperature & Power",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 },
          "unit": "celsius"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 15 },
      "id": 21,
      "targets": [
        {
          "expr": "DCGM_FI_DEV_GPU_TEMP{instance=~\"$instance\"}",
          "legendFormat": "{{ instance }} GPU {{ gpu }}",
          "refId": "A"
        }
      ],
      "title": "GPU Temperature",
      "type": "timeseries"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 },
          "unit": "watt"
        }
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 15 },
      "id": 22,
      "targets": [
        {
          "expr": "DCGM_FI_DEV_POWER_USAGE{instance=~\"$instance\"}",
          "legendFormat": "{{ instance }} GPU {{ gpu }}",
          "refId": "A"
        }
      ],
      "title": "GPU Power Usage",
      "type": "timeseries"
    },
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 23 },
      "id": 30,
      "panels": [],
      "title": "Memory Details",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "${datasource}" },
      "fieldConfig": {
        "defaults": {
          "custom": { "drawStyle": "bars", "fillOpacity": 80, "stacking": { "mode": "normal" } },
          "unit": "decmbytes"
        }
      },
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 24 },
      "id": 31,
      "targets": [
        {
          "expr": "DCGM_FI_DEV_FB_USED{instance=~\"$instance\"}",
          "legendFormat": "{{ instance }} GPU {{ gpu }} Used",
          "refId": "A"
        },
        {
          "expr": "DCGM_FI_DEV_FB_FREE{instance=~\"$instance\"}",
          "legendFormat": "{{ instance }} GPU {{ gpu }} Free",
          "refId": "B"
        }
      ],
      "title": "GPU Memory Usage (MB)",
      "type": "timeseries"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["gpu", "nvidia", "dcgm"],
  "templating": {
    "list": [
      {
        "current": { "selected": false, "text": "prometheus", "value": "prometheus" },
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "datasource",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      },
      {
        "allValue": ".*",
        "current": { "selected": true, "text": "All", "value": "$__all" },
        "datasource": { "type": "prometheus", "uid": "${datasource}" },
        "definition": "label_values(DCGM_FI_DEV_GPU_UTIL, instance)",
        "hide": 0,
        "includeAll": true,
        "multi": true,
        "name": "instance",
        "options": [],
        "query": { "query": "label_values(DCGM_FI_DEV_GPU_UTIL, instance)", "refId": "A" },
        "refresh": 2,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      }
    ]
  },
  "time": { "from": "now-1h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "GPU Cluster Monitoring",
  "uid": "gpu-cluster-monitoring",
  "version": 1,
  "weekStart": ""
}

Save it as gpu-dashboard.json, then import it into Grafana.

4. Best Practices and Caveats

4.1 Performance Tuning

Tuning the scrape interval

The default 15s scrape interval is enough for most scenarios, but adjust it to your situation:

# High frequency (for debugging)
scrape_interval: 5s   # More CPU on Prometheus, more storage

# Standard (production)
scrape_interval: 15s  # Good balance

# Low frequency (for large clusters)
scrape_interval: 30s  # Less accurate, but lower overhead

Beyond roughly 100 GPU nodes, the recommendations are (a rough sizing sketch follows the list):

  • Use a 15s-30s scrape interval
  • Enable Prometheus remote write and use Thanos/Cortex for long-term storage
  • Consider sharding into multiple Prometheus instances, one per cluster
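A back-of-the-envelope calculation helps choose between these options. A minimal sketch; the ~1.5 bytes/sample compression figure and the per-GPU metric count are assumptions to adjust for your own setup:

#!/usr/bin/env python3
"""Back-of-the-envelope sizing for GPU metric ingestion.

Assumptions (adjust for your setup): ~1.5 bytes/sample after TSDB compression,
one dcgm-exporter per node, identical counters file on every node.
"""
NODES = 100
GPUS_PER_NODE = 8
METRICS_PER_GPU = 30          # depends on your counters CSV
SCRAPE_INTERVAL_S = 15
RETENTION_DAYS = 30
BYTES_PER_SAMPLE = 1.5        # rough post-compression figure

series = NODES * GPUS_PER_NODE * METRICS_PER_GPU
samples_per_sec = series / SCRAPE_INTERVAL_S
storage_bytes = samples_per_sec * 86400 * RETENTION_DAYS * BYTES_PER_SAMPLE

print(f"Active series:   {series:,}")
print(f"Samples/second:  {samples_per_sec:,.0f}")
print(f"30-day storage:  {storage_bytes / 1e9:.1f} GB")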

Reducing label cardinality

DCGM metrics carry a UUID label by default; every card has a distinct value, which inflates series cardinality:

# prometheus.yml - drop high cardinality labels
relabel_configs:
  - action: labeldrop
    regex: 'UUID'        # Drop UUID label to reduce cardinality
  - action: labeldrop
    regex: 'modelName'   # If all GPUs are same model
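Whether a labeldrop actually helps can be checked against Prometheus' TSDB stats endpoint, which reports the highest-cardinality metric names in the head block. A minimal sketch, assuming Prometheus at http://prometheus:9090:

#!/usr/bin/env python3
"""Print the highest-cardinality metric names from Prometheus' head block.

Useful before/after adding labeldrop rules for DCGM labels such as UUID.
PROMETHEUS_URL is an assumption - adjust to your server.
"""
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def top_series_counts(limit: int = 10) -> list:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/status/tsdb", timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["seriesCountByMetricName"][:limit]

if __name__ == "__main__":
    for entry in top_series_counts():
        print(f"{entry['name']}: {entry['value']} series")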

Using recording rules

Pre-compute frequent queries to cut dashboard load time:

# recording_rules.yml
groups:
  - name: gpu_recording_rules
    interval: 30s
    rules:
      # Average GPU utilization by node
      - record: node:gpu_utilization:avg
        expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL)

      # Total memory used by node (in GB)
      - record: node:gpu_memory_used_gb:sum
        expr: sum by (instance) (DCGM_FI_DEV_FB_USED) / 1024

      # GPU memory utilization percentage
      - record: gpu:memory_utilization:ratio
        expr: |
          DCGM_FI_DEV_FB_USED /
          (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

      # 5 minute average GPU utilization
      - record: gpu:utilization:avg5m
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])

4.2 Security Hardening

TLS encryption

In production, traffic between the exporter and Prometheus should be encrypted:

# dcgm-exporter with TLS
docker run -d \
  --name dcgm-exporter \
  --gpus all \
  -p 9400:9400 \
  -v /etc/dcgm-exporter/certs:/certs:ro \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 \
  --web-config-file=/certs/web-config.yml   # flag name in dcgm-exporter 3.x; confirm with --help for your version
# web-config.yml
tls_server_config:
  cert_file: /certs/server.crt
  key_file: /certs/server.key
# prometheus.yml
scrape_configs:
  - job_name: 'dcgm-gpu'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      insecure_skip_verify: false
    static_configs:
      - targets: ['gpu-node-01:9400']

Authentication

Add basic auth:

# web-config.yml
basic_auth_users:
  prometheus: $2y$10$xxxxx   # bcrypt hash of password
# prometheus.yml
scrape_configs:
  - job_name: 'dcgm-gpu'
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/dcgm_password
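The value in basic_auth_users is a standard bcrypt hash; htpasswd -nBC 10 prometheus produces one, or, as a minimal Python sketch (assuming the third-party bcrypt package is installed):

#!/usr/bin/env python3
"""Generate a bcrypt hash for the exporter's basic_auth_users entry.

Assumes the third-party 'bcrypt' package is installed (pip install bcrypt).
"""
import getpass
import bcrypt

if __name__ == "__main__":
    password = getpass.getpass("Password for user 'prometheus': ").encode()
    # Cost factor 10 to match the example hash above
    print(bcrypt.hashpw(password, bcrypt.gensalt(rounds=10)).decode())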

4.3 High Availability

Prometheus HA

Run two Prometheus instances scraping the same targets:

# prometheus-1.yml
global:
  external_labels:
    replica: prometheus-1

# prometheus-2.yml
global:
  external_labels:
    replica: prometheus-2

Pair them with Thanos or Cortex for deduplication and long-term storage.

Alertmanager HA

# alertmanager-1 startup flags (clustering is configured via flags, not alertmanager.yml)
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094

# alertmanager-2 startup flags
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094

4.4 Common Errors

Error 1: No data points

Symptom: Grafana shows "No data"

Troubleshooting steps:

# 1. Check if exporter is running
curl http://gpu-node-01:9400/metrics

# 2. Check Prometheus targets
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-gpu")'

# 3. Check firewall
sudo iptables -L -n | grep 9400

Common causes:

  • dcgm-exporter is not running
  • A firewall is blocking port 9400
  • The target address in the Prometheus config is wrong

Error 2: DCGM initialization failed

Error: Cannot connect to the DCGM socket

Fix:

# Check DCGM service
sudo systemctl status nvidia-dcgm

# Restart if needed
sudo systemctl restart nvidia-dcgm
sudo systemctl restart dcgm-exporter

# Check permissions
ls -la /var/run/nvidia-dcgm.sock

Error 3: Metric values are 0 or missing

Some metrics are not supported on certain GPU models:

# Check supported fields
dcgmi dmon -e 1009,1010,1011 -c 1

# Check if profiling metrics need to be enabled
dcgmi profile --pause   # Disable
dcgmi profile --resume  # Enable

Error 4: Memory usage keeps growing

dcgm-exporter memory leak (older versions had this problem):

# Check memory usage
ps aux | grep dcgm-exporter

# Upgrade to latest version
docker pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04

5. Troubleshooting and Monitoring

5.1 Viewing Logs

DCGM service logs

# Systemd logs
journalctl -u nvidia-dcgm -f

# Docker logs
docker logs dcgm-exporter -f --tail 100

# K8s logs
kubectl logs -f daemonset/dcgm-exporter -n monitoring

Common log analysis

# Find errors
journalctl -u nvidia-dcgm | grep -i error

# DCGM initialization issues
journalctl -u nvidia-dcgm | grep -i "failed to"

5.2 Common Problems

Problem 1: GPU utilization fluctuates heavily

# Check utilization variance
stddev_over_time(DCGM_FI_DEV_GPU_UTIL[5m])

# Compare with request rate (if using vLLM)
rate(vllm_request_success_total[5m])

Likely causes:

  • Requests arrive unevenly → check the load balancing
  • Batch size is too small → tune the inference framework's parameters
  • Many short requests → consider request batching

Problem 2: GPU memory usage keeps growing

# Memory usage trend
deriv(DCGM_FI_DEV_FB_USED[1h])

# Compare before and after restart
DCGM_FI_DEV_FB_USED offset 1d

Likely causes (a trend-estimation sketch follows this list):

  • Python objects not released → check the application code
  • KV cache not cleared → check the inference framework configuration
  • CUDA context leak → upgrade the driver
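To put a number on "keeps growing", fit a line to FB_USED over a longer window; a steady positive slope that survives restarts is the leak signature. A minimal sketch, assuming Prometheus at http://prometheus:9090:

#!/usr/bin/env python3
"""Estimate GPU memory growth rate (MB/hour) per GPU from Prometheus data.

A persistent positive slope over many hours usually means a leak rather than
normal allocator behaviour. PROMETHEUS_URL is an assumption.
"""
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def fb_used_series(hours: int = 24) -> list:
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": "DCGM_FI_DEV_FB_USED", "start": end - hours * 3600,
                "end": end, "step": "5m"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def slope_mb_per_hour(values: list) -> float:
    # Least-squares slope of (timestamp, used_mb), converted to MB/hour
    ts = [float(t) for t, _ in values]
    ys = [float(v) for _, v in values]
    n = len(ts)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    var = sum((t - mean_t) ** 2 for t in ts)
    return (cov / var) * 3600 if var else 0.0

if __name__ == "__main__":
    for series in fb_used_series():
        m = series["metric"]
        rate = slope_mb_per_hour(series["values"])
        print(f"{m.get('instance')} GPU {m.get('gpu')}: {rate:+.1f} MB/hour")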

Problem 3: Temperature too high

# Check fan speed
nvidia-smi -q -d FAN

# Check power limit
nvidia-smi -q -d POWER

# Reduce power limit if needed (temporary)
sudo nvidia-smi -pl 300   # Set 300W limit

Things to check:

  • Server-room temperature → check the air conditioning
  • Blocked airflow → check the servers for dust
  • Heatsink problems → contact hardware maintenance

5.3 Performance Monitoring

Key metric queries

# Overall cluster health score (0-100); weights: utilization 0.3, temperature 0.2, XID errors 0.3, exporter uptime 0.2
(
  (avg(DCGM_FI_DEV_GPU_UTIL) / 100) * 0.3 +
  (1 - max(DCGM_FI_DEV_GPU_TEMP) / 90) * 0.2 +
  clamp_min(1 - sum(rate(DCGM_FI_DEV_XID_ERRORS[1h])), 0) * 0.3 +
  (sum(up{job="dcgm-gpu"}) / count(up{job="dcgm-gpu"})) * 0.2
) * 100

# Cost efficiency (useful work per watt)
sum(rate(vllm_request_success_total[5m])) / sum(DCGM_FI_DEV_POWER_USAGE)

Automated inspection script

#!/usr/bin/env python3
"""
GPU cluster health check script.
Run daily via cron to catch issues early.
"""

import requests
import json
from datetime import datetime

PROMETHEUS_URL = "http://prometheus:9090"
THRESHOLDS = {
    "gpu_utilization_low": 20,
    "gpu_temperature_high": 80,
    "memory_usage_high": 95,
}

def query_prometheus(query: str) -> list:
    """Execute PromQL query and return results."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query}
    )
    data = resp.json()
    if data["status"] != "success":
        raise Exception(f"Query failed: {data}")
    return data["data"]["result"]

def check_gpu_health():
    """Run all GPU health checks."""
    issues = []

    # Check GPU utilization
    results = query_prometheus("avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])")
    for r in results:
        if float(r["value"][1]) < THRESHOLDS["gpu_utilization_low"]:
            issues.append({
                "type": "low_utilization",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check GPU temperature
    results = query_prometheus("DCGM_FI_DEV_GPU_TEMP")
    for r in results:
        if float(r["value"][1]) > THRESHOLDS["gpu_temperature_high"]:
            issues.append({
                "type": "high_temperature",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check memory usage
    results = query_prometheus(
        "(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100"
    )
    for r in results:
        if float(r["value"][1]) > THRESHOLDS["memory_usage_high"]:
            issues.append({
                "type": "high_memory",
                "instance": r["metric"].get("instance"),
                "gpu": r["metric"].get("gpu"),
                "value": float(r["value"][1])
            })

    # Check for ECC errors
    results = query_prometheus("increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[24h]) > 0")
    for r in results:
        issues.append({
            "type": "ecc_errors",
            "instance": r["metric"].get("instance"),
            "gpu": r["metric"].get("gpu"),
            "value": float(r["value"][1])
        })

    return issues

def generate_report(issues: list) -> str:
    """Generate health check report."""
    report = f"""
GPU Cluster Health Report
Generated: {datetime.now().isoformat()}
{'='*50}

Total Issues Found: {len(issues)}
"""
    if not issues:
        report += "\nAll GPUs are healthy!\n"
    else:
        by_type = {}
        for issue in issues:
            t = issue["type"]
            if t not in by_type:
                by_type[t] = []
            by_type[t].append(issue)

        for issue_type, items in by_type.items():
            report += f"\n{issue_type.upper()} ({len(items)} issues):\n"
            for item in items:
                report += f"  - {item['instance']} GPU {item['gpu']}: {item['value']:.1f}\n"

    return report

if __name__ == "__main__":
    issues = check_gpu_health()
    report = generate_report(issues)
    print(report)

    # Send to Slack/Email if issues found
    if issues:
        # webhook_url = "https://hooks.slack.com/..."
        # requests.post(webhook_url, json={"text": report})
        pass

5.4 Backup and Restore

Prometheus data backup

# Create snapshot (requires Prometheus to run with --web.enable-admin-api)
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot

# Backup snapshot directory
tar -czvf prometheus-backup-$(date +%Y%m%d).tar.gz \
    /prometheus/snapshots/

# Restore
tar -xzvf prometheus-backup-20241219.tar.gz -C /prometheus/

Grafana dashboard backup

# Export all dashboards
for uid in $(curl -s http://admin:password@grafana:3000/api/search | jq -r '.[].uid'); do
    curl -s "http://admin:password@grafana:3000/api/dashboards/uid/$uid" | \
        jq '.dashboard' > "dashboard-$uid.json"
done

# Import a dashboard (the API expects the JSON wrapped in a "dashboard" field)
jq -n --slurpfile d dashboard-xxx.json '{dashboard: $d[0], overwrite: true}' | \
    curl -X POST \
        -H "Content-Type: application/json" \
        -d @- \
        http://admin:password@grafana:3000/api/dashboards/db

6. Summary

6.1 Key Takeaways

The core steps for building a GPU monitoring system:

  1. Install DCGM: NVIDIA's official tool, providing a rich set of GPU metrics
  2. Deploy dcgm-exporter: expose the metrics in Prometheus format
  3. Configure Prometheus: scrape and store the metrics
  4. Build the Grafana dashboard: visualization
  5. Set up alerting rules: high temperature, low utilization, ECC errors, and so on

Key configuration files:

  • /etc/dcgm-exporter/custom-counters.csv: custom metric collection
  • prometheus.yml: scrape target configuration
  • gpu_alerts.yml: alerting rules

6.2 Where to Go Next

MIG monitoring

A100/H100 GPUs support MIG (Multi-Instance GPU), which splits one card into several isolated instances:

# Enable MIG
sudo nvidia-smi -i 0 -mig 1

# Create GPU instances
sudo nvidia-smi mig -i 0 -cgi 9,14,19 -C   # 3g + 2g + 1g instances; three 3g profiles would exceed the 7 available slices

# DCGM exporter will automatically detect MIG instances

vGPU monitoring

Virtualized environments need additional vGPU metrics:

# Install vGPU manager
# DCGM can monitor both pGPU and vGPU
dcgmi discovery -l

Correlating with business metrics

Analyze GPU utilization together with business-level QPS:

# Ratio of GPU utilization to request rate, matched per node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)
/ on(instance)
sum by (instance) (rate(http_requests_total[5m]))
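A quick way to quantify that relationship is a correlation coefficient over matched time windows. A minimal sketch, assuming both series live in the same Prometheus and that http_requests_total stands in for whatever request counter your service actually exposes:

#!/usr/bin/env python3
"""Pearson correlation between cluster GPU utilization and request rate.

Assumptions: both series come from the same Prometheus, and the request
counter is named http_requests_total (swap in your own metric).
"""
import math
import time
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def fetch(expr: str, hours: int = 6, step: str = "1m") -> list:
    end = time.time()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": expr, "start": end - hours * 3600, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []

def pearson(xs: list, ys: list) -> float:
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else float("nan")

if __name__ == "__main__":
    util = fetch("avg(DCGM_FI_DEV_GPU_UTIL)")
    qps = fetch("sum(rate(http_requests_total[5m]))")
    print(f"Pearson r(GPU util, QPS) = {pearson(util, qps):.2f}")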

6.3 References

  • NVIDIA DCGM official documentation
  • dcgm-exporter GitHub repository
  • Prometheus official documentation
  • NVIDIA DCGM field identifier reference

Appendix

A. Command Cheat Sheet

# DCGM basic commands
dcgmi discovery -l              # List all GPUs
dcgmi diag -r 1                 # Quick diagnostic
dcgmi diag -r 3                 # Full diagnostic (takes longer)
dcgmi health -g 0               # Check GPU 0 health
dcgmi stats -g 0 -e             # Enable stats collection
dcgmi dmon -e 1009,1010         # Monitor specific fields

# dcgm-exporter
docker run --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
curl http://localhost:9400/metrics

# Prometheus queries
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
curl -G 'http://prometheus:9090/api/v1/query_range' \
    --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)' \
    --data-urlencode 'start=2024-01-01T00:00:00Z' \
    --data-urlencode 'end=2024-01-02T00:00:00Z' \
    --data-urlencode 'step=1h'

B. Common DCGM Field IDs

Field ID  Name                                   Type     Description
100       DCGM_FI_DRIVER_VERSION                 String   Driver version
1001      DCGM_FI_DEV_SM_CLOCK                   Gauge    SM clock frequency (MHz)
1002      DCGM_FI_DEV_MEM_CLOCK                  Gauge    Memory clock frequency (MHz)
1004      DCGM_FI_DEV_GPU_TEMP                   Gauge    GPU temperature (°C)
1005      DCGM_FI_DEV_POWER_USAGE                Gauge    Power usage (W)
1009      DCGM_FI_DEV_FB_FREE                    Gauge    Free GPU memory (MB)
1010      DCGM_FI_DEV_FB_USED                    Gauge    Used GPU memory (MB)
1011      DCGM_FI_DEV_FB_TOTAL                   Gauge    Total GPU memory (MB)
1100      DCGM_FI_DEV_GPU_UTIL                   Gauge    GPU utilization (%)
1101      DCGM_FI_DEV_MEM_COPY_UTIL              Gauge    Memory bandwidth utilization (%)
1140      DCGM_FI_DEV_PCIE_TX_THROUGHPUT         Gauge    PCIe transmit throughput
1141      DCGM_FI_DEV_PCIE_RX_THROUGHPUT         Gauge    PCIe receive throughput
1006      DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION   Counter  Cumulative energy consumption (mJ)
310       DCGM_FI_DEV_XID_ERRORS                 Gauge    XID error code

C. Glossary

Term             Explanation
DCGM             Data Center GPU Manager, NVIDIA's data-center GPU management tool
Exporter         Prometheus-ecosystem component that exposes metrics
Scrape           The act of Prometheus pulling metrics from an exporter
PromQL           Prometheus Query Language
Recording Rule   Rule that pre-computes frequently used queries
Alerting Rule    Rule that defines when an alert fires
MIG              Multi-Instance GPU, the GPU partitioning feature of A100/H100
ECC              Error Correcting Code, memory error correction
XID              NVIDIA GPU error code
SM               Streaming Multiprocessor, the GPU compute unit
TDP              Thermal Design Power


