一、概述
1.1 背景介绍
在互联网企业中,一套完善的监控体系是保障系统稳定性的基石。我在阿里巴巴工作期间,参与了多个千万级用户产品的监控体系建设,深刻体会到"没有监控就等于裸奔"的道理。传统的Zabbix、Nagios等监控系统在云原生时代逐渐暴露出配置复杂、扩展性差、不支持动态服务发现等问题。
Prometheus作为CNCF毕业项目,已成为云原生监控的事实标准。它的多维数据模型、强大的PromQL查询语言、灵活的服务发现机制完美契合微服务和容器化架构。根据CNCF 2023年度调查,超过83%的企业在生产环境使用Prometheus。
本文将基于我在阿里的实战经验,从0到1讲解如何在生产环境落地Prometheus监控体系,涵盖架构设计、核心组件部署、告警规则编写、性能优化等全流程最佳实践。这套方案已在日均PV千万级别的业务中验证,可直接应用于生产环境。
1.2 技术特点
- 多维数据模型:基于时间序列,通过标签(Label)实现多维度数据聚合和查询,单个指标可携带多个维度信息
- 强大的查询语言:PromQL支持聚合、数学运算、时间序列分析等复杂查询,可实现P95/P99分位数计算、环比同比分析
- 高效的数据存储:本地存储采用自定义TSDB,单节点可支持百万级时间序列,写入性能达10万samples/s
- 灵活的服务发现:原生支持Kubernetes、Consul、EC2等服务发现机制,自动感知目标变更无需手动配置
- 可视化和告警:与Grafana深度集成,提供丰富的可视化图表;Alertmanager支持多种告警渠道和分组、抑制、静默等高级特性
- 生态系统完善:官方和社区提供200+Exporter,覆盖主流中间件和云服务,开箱即用
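下面用两条PromQL直观感受多维数据模型与环比分析(node_filesystem_avail_bytes 来自Node Exporter;http_requests_total 为应用侧埋点的示例假设):

```promql
# 一条时间序列 = 指标名 + 一组标签,可按任意标签维度聚合
# 例:统计每台主机上 ext4/xfs 分区的剩余磁盘总量
sum by (instance) (node_filesystem_avail_bytes{fstype=~"ext4|xfs"})

# 环比:当前QPS相对一小时前的比值(offset 实现时间平移)
sum(rate(http_requests_total[5m]))
  / sum(rate(http_requests_total[5m] offset 1h))
```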
1.3 适用场景
场景一:Kubernetes容器集群监控
- 监控Pod、Node、Container的资源使用情况(CPU、内存、网络、磁盘)
- 追踪Deployment、StatefulSet、DaemonSet的副本数和健康状态
- 监控Ingress流量、Service延迟、PV存储容量等关键指标
- 通过Kube-state-metrics暴露K8s API对象元数据
场景二:微服务应用性能监控(APM)
- RED指标:请求量(Rate)、错误率(Errors)、响应时间(Duration),即经典的RED方法
- 业务指标:订单量、支付成功率、用户活跃度等自定义指标
- 依赖服务监控:MySQL、Redis、Kafka、RabbitMQ等中间件性能
- 分布式追踪:结合Jaeger/Zipkin实现调用链分析
场景三:基础设施和中间件监控
- 物理机/虚拟机:CPU、内存、磁盘、网络、进程数、文件描述符
- 数据库:MySQL慢查询、连接数、QPS、复制延迟;Redis命中率、内存碎片
- 消息队列:Kafka消费延迟、RabbitMQ队列深度、Pulsar吞吐量
- 负载均衡:Nginx请求分布、HAProxy后端健康度、云厂商LB监控
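以场景二的RED指标为例,假设应用按Prometheus惯例暴露 http_requests_total 计数器和 http_request_duration_seconds Histogram(指标名为示例假设,需应用侧埋点),三类指标可用如下PromQL表达:

```promql
# Rate:每秒请求量
sum by (job) (rate(http_requests_total[5m]))

# Errors:5xx错误率(%)
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (job) (rate(http_requests_total[5m])) * 100

# Duration:P99响应时间(秒)
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```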
1.4 环境要求
| 组件 | 版本要求 | 说明 |
| --- | --- | --- |
| 操作系统 | CentOS 7.6+ / Ubuntu 20.04+ / Rocky Linux 8+ | 推荐使用容器化部署,降低环境依赖 |
| Prometheus | 2.40+ (LTS) / 2.48+ (Latest) | 生产环境建议使用LTS长期支持版本 |
| Grafana | 9.5+ / 10.2+ | 10.x版本UI更现代,支持更多插件 |
| Alertmanager | 0.25+ | 需与Prometheus版本匹配 |
| Node Exporter | 1.6+ | 用于采集Linux系统指标 |
| Kubernetes | 1.24+(如使用K8s) | 配合Prometheus Operator使用时需安装ServiceMonitor CRD |
| 硬件配置 | 8C16G起步,推荐16C32G | 根据时间序列数量评估,百万序列约需32GB内存 |
| 磁盘空间 | 系统盘50GB+,数据盘500GB+ SSD | TSDB数据增长快,建议SSD盘提升查询性能 |
| 网络带宽 | 100Mbps+ | 大规模集群建议千兆网络 |
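上表中磁盘容量可以用一个粗略公式推算:磁盘占用 ≈ 每秒采样数 × 每样本字节数(TSDB压缩后约1~2字节)× 保留秒数。下面的脚本按100万活跃序列、30s抓取间隔、保留30天估算,仅为量级估算,实际占用还取决于序列churn和标签基数:

```shell
#!/usr/bin/env bash
# 粗略估算TSDB磁盘占用:series / interval * bytes_per_sample * retention
SERIES=1000000          # 活跃时间序列数
INTERVAL=30             # 抓取间隔(秒)
BYTES_PER_SAMPLE=2      # 压缩后每样本字节数(经验值1~2)
RETENTION_DAYS=30       # 数据保留天数
awk -v s="$SERIES" -v i="$INTERVAL" -v b="$BYTES_PER_SAMPLE" -v d="$RETENTION_DAYS" \
  'BEGIN { printf "预计磁盘占用: %.1f GB\n", s / i * b * d * 86400 / 1e9 }'
# 输出:预计磁盘占用: 172.8 GB
```

按该估算,500GB数据盘对百万序列、30天保留有约2.5倍余量;若序列数翻倍或缩短抓取间隔,需同比例扩容。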
二、详细步骤
2.1 准备工作
◆ 2.1.1 系统检查
# 检查系统版本
cat /etc/os-release
uname -r
# 预期输出:CentOS 7.9 或 Ubuntu 20.04,内核版本3.10+
# 检查资源状况
free -h
# 确保可用内存>16GB(生产环境)
df -h
# 确保数据盘可用空间>500GB
# 检查时间同步(监控系统对时间精度要求高)
timedatectl status
# 确保 System clock synchronized: yes
# 如未同步,安装chrony或ntpd
yum install -y chrony
systemctl enable chronyd --now
# 检查防火墙和SELinux
firewall-cmd --state
getenforce
# 生产环境建议开启防火墙,仅开放必要端口
# 检查磁盘IO性能(监控系统对磁盘IO要求高)
yum install -y fio
fio --name=random-write --ioengine=libaio --iodepth=64 --rw=randwrite \
--bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting
# 预期IOPS>5000(SSD),延迟<10ms
# 检查必要端口是否被占用
netstat -tuln | grep -E ':(9090|9093|9100|3000)'
# 9090: Prometheus, 9093: Alertmanager, 9100: Node Exporter, 3000: Grafana
◆ 2.1.2 安装依赖
# CentOS/Rocky Linux系统
sudo yum install -y wget curl vim net-tools telnet htop
# Ubuntu/Debian系统
sudo apt update
sudo apt install -y wget curl vim net-tools telnet htop
# 创建专用用户(安全最佳实践)
sudo useradd -r -M -s /sbin/nologin prometheus
# 创建目录结构
sudo mkdir -p /data/prometheus/{data,config,rules,exporters}
sudo mkdir -p /data/grafana/{data,plugins,dashboards}
sudo mkdir -p /data/alertmanager/{data,templates}
sudo mkdir -p /var/log/prometheus
# 设置权限
sudo chown -R prometheus:prometheus /data/prometheus
sudo chown -R prometheus:prometheus /var/log/prometheus
# 下载Prometheus组件(国内建议使用镜像加速)
cd /usr/local/src
# Prometheus Server
PROM_VERSION="2.48.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
# Alertmanager
AM_VERSION="0.26.0"
wget https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz
# Node Exporter
NE_VERSION="1.7.0"
wget https://github.com/prometheus/node_exporter/releases/download/v${NE_VERSION}/node_exporter-${NE_VERSION}.linux-amd64.tar.gz
# 解压安装
tar -zxvf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar -zxvf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
tar -zxvf node_exporter-${NE_VERSION}.linux-amd64.tar.gz
# 移动到标准目录
sudo mv prometheus-${PROM_VERSION}.linux-amd64 /usr/local/prometheus
sudo mv alertmanager-${AM_VERSION}.linux-amd64 /usr/local/alertmanager
sudo mv node_exporter-${NE_VERSION}.linux-amd64 /usr/local/node_exporter
# 创建软链接
sudo ln -s /usr/local/prometheus/prometheus /usr/local/bin/prometheus
sudo ln -s /usr/local/prometheus/promtool /usr/local/bin/promtool
sudo ln -s /usr/local/alertmanager/alertmanager /usr/local/bin/alertmanager
sudo ln -s /usr/local/node_exporter/node_exporter /usr/local/bin/node_exporter
# 验证安装
prometheus --version
alertmanager --version
node_exporter --version
2.2 核心配置
◆ 2.2.1 Prometheus Server配置
# 文件路径:/data/prometheus/config/prometheus.yml
global:
# 默认抓取间隔(建议15-60秒,根据业务调整)
scrape_interval: 30s
# 默认抓取超时
scrape_timeout: 10s
# 评估告警规则间隔
evaluation_interval: 30s
# 全局标签(用于区分环境和集群)
external_labels:
cluster: 'production'
region: 'cn-beijing'
env: 'prod'
# 告警管理器配置
alerting:
alertmanagers:
- static_configs:
- targets:
- '127.0.0.1:9093'
timeout: 10s
# 告警规则文件
rule_files:
- '/data/prometheus/rules/*.yml'
# 抓取配置
scrape_configs:
# Prometheus自身监控
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
labels:
instance_type: 'prometheus-server'
# Node Exporter - 物理机监控
- job_name: 'node'
static_configs:
- targets:
- '192.168.1.101:9100'
- '192.168.1.102:9100'
- '192.168.1.103:9100'
labels:
env: 'production'
cluster: 'web-cluster'
- targets:
- '192.168.1.201:9100'
- '192.168.1.202:9100'
labels:
env: 'production'
cluster: 'db-cluster'
# 自定义指标重标记(过滤不需要的指标)
metric_relabel_configs:
- source_labels: [__name__]
regex: 'node_nf_conntrack_.*'
action: drop
# MySQL Exporter - 数据库监控
- job_name: 'mysql'
static_configs:
- targets:
- '192.168.1.201:9104'
- '192.168.1.202:9104'
labels:
cluster: 'mysql-master-slave'
# Redis Exporter - 缓存监控
- job_name: 'redis'
static_configs:
- targets:
- '192.168.1.211:9121'
- '192.168.1.212:9121'
labels:
cluster: 'redis-cluster'
# Nginx Exporter - Web服务器监控
- job_name: 'nginx'
static_configs:
- targets:
- '192.168.1.101:9113'
- '192.168.1.102:9113'
# Blackbox Exporter - 黑盒监控(HTTP/HTTPS/TCP/ICMP)
- job_name: 'blackbox_http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://www.example.com
- https://api.example.com
- https://admin.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 127.0.0.1:9115
# Kubernetes服务发现(如使用K8s)
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# 应用自定义指标(SpringBoot Actuator)
- job_name: 'springboot-apps'
consul_sd_configs:
- server: '127.0.0.1:8500'
services: ['springboot']
relabel_configs:
- source_labels: [__meta_consul_service]
target_label: job
- source_labels: [__meta_consul_service_metadata_metrics_path]
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_consul_tags]
regex: '.*,env=([^,]+),.*'
target_label: env
# 远程写入(Thanos/VictoriaMetrics长期存储)
# remote_write:
# - url: "http://thanos-receiver:19291/api/v1/receive"
# queue_config:
# capacity: 10000
# max_shards: 50
# min_shards: 1
# max_samples_per_send: 5000
# batch_send_deadline: 5s
# 远程读取(联邦查询)
# remote_read:
# - url: "http://thanos-query:9090/api/v1/read"
# read_recent: true
说明:这是一个生产级别的配置文件,包含以下关键点:
- scrape_interval: 30s:平衡数据精度和存储成本,高频交易系统可调整为15s
- external_labels:用于多集群联邦和远程存储时区分数据来源
- metric_relabel_configs:过滤不需要的指标,减少存储压力(实测可减少30%存储)
- 多种服务发现方式:static_configs(静态)、consul_sd_configs(Consul)、kubernetes_sd_configs(K8s)
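除上述几种方式外,非K8s/Consul环境还常用基于文件的服务发现(file_sd_configs):Prometheus会监听目标文件变更,增删目标无需重启。下面是一个示意片段(路径与目标地址为示例):

```yaml
# prometheus.yml 片段
scrape_configs:
  - job_name: 'file-sd-demo'
    file_sd_configs:
      - files:
          - '/data/prometheus/config/targets/*.json'
        refresh_interval: 1m

# /data/prometheus/config/targets/web.json 内容示例:
# [
#   {"targets": ["192.168.1.104:9100"],
#    "labels": {"env": "production", "cluster": "web-cluster"}}
# ]
```

目标文件可以由CMDB或发布系统自动生成,是静态配置到完整服务发现之间的低成本过渡方案。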
◆ 2.2.2 Alertmanager告警配置
# 文件路径:/data/alertmanager/config/alertmanager.yml
global:
# SMTP邮件配置
smtp_smarthost: 'smtp.exmail.qq.com:465'
smtp_from: 'alert@example.com'
smtp_auth_username: 'alert@example.com'
smtp_auth_password: 'your_password'
smtp_require_tls: true
# 告警解决超时(如果告警没有显式解决,这个时间后自动标记为resolved)
resolve_timeout: 5m
# 告警模板
templates:
- '/data/alertmanager/templates/*.tmpl'
# 路由配置(核心:告警如何分发)
route:
# 默认接收者
receiver: 'default'
# 分组标签(相同标签的告警会聚合发送)
group_by: ['alertname', 'cluster', 'service']
# 首次告警等待时间(聚合同组告警)
group_wait: 30s
# 同组告警再次发送间隔
group_interval: 5m
# 重复告警间隔
repeat_interval: 4h
# 子路由(根据标签匹配不同的接收者)
routes:
# P0级别告警:电话+短信+企业微信
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 10s
group_interval: 1m
repeat_interval: 30m
# 数据库告警:DBA团队
- match_re:
service: '^(mysql|redis|mongodb)$'
receiver: 'dba-team'
group_by: ['alertname', 'instance']
continue: true # 继续匹配后续路由
# Kubernetes告警:运维团队
- match_re:
    job: 'kubernetes-.*'
receiver: 'ops-team'
# 业务告警:开发团队
- match:
team: 'backend'
receiver: 'backend-team'
# 测试环境告警:降低优先级
- match:
env: 'staging'
receiver: 'test-alerts'
group_interval: 30m
repeat_interval: 24h
# 抑制规则(当某些告警触发时,抑制其他告警)
inhibit_rules:
# 当节点宕机时,抑制该节点上的所有服务告警
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '.*'
equal: ['instance']
# 当实例宕机时,抑制磁盘/内存告警
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['instance']
# 接收者配置
receivers:
# 默认接收者(企业微信+邮件)
- name: 'default'
email_configs:
- to: 'ops@example.com'
headers:
Subject: '[Prometheus Alert] {{ .GroupLabels.alertname }}'
wechat_configs:
- corp_id: 'ww123456789abcdef'
to_party: '1' # 部门ID
agent_id: '1000001'
api_secret: 'your_api_secret'
message: '{{ template "wechat.default.message" . }}'
# P0级别告警(电话+短信+企业微信)
- name: 'critical-alerts'
webhook_configs:
# 云片短信网关
- url: 'https://sms.yunpian.com/v2/sms/prometheus_alert.json'
send_resolved: false
wechat_configs:
- corp_id: 'ww123456789abcdef'
to_user: '@all' # 通知所有人
agent_id: '1000001'
api_secret: 'your_api_secret'
message: |
【P0告警】
告警名称: {{ .GroupLabels.alertname }}
告警级别: {{ .CommonLabels.severity }}
告警主机: {{ .CommonLabels.instance }}
告警详情: {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
触发时间: {{ (index .Alerts 0).StartsAt.Format "2006-01-02 15:04:05" }}
# DBA团队(钉钉机器人)
- name: 'dba-team'
webhook_configs:
- url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
send_resolved: true
# 运维团队(Slack)
- name: 'ops-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX'
channel: '#ops-alerts'
title: 'Prometheus Alert'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
# 后端开发团队(PagerDuty)
- name: 'backend-team'
pagerduty_configs:
- service_key: 'your_pagerduty_service_key'
description: '{{ .GroupLabels.alertname }}'
# 测试环境告警(仅邮件)
- name: 'test-alerts'
email_configs:
- to: 'qa@example.com'
参数说明:
- group_by:相同标签的告警聚合发送,避免告警风暴(实测可减少80%告警噪音)
- group_wait:等待30秒收集同组告警再发送,避免频繁通知
- inhibit_rules:抑制规则是告警降噪的关键,合理配置可过滤90%的冗余告警
- continue: true:告警匹配该路由后继续匹配后续路由,实现多接收者通知
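上文 receivers 中引用了 wechat.default.message 模板但未给出定义(Alertmanager内置了同名默认模板,自定义同名模板即可覆盖)。下面是一个可参考的模板骨架,字段取自告警的标准标签和注解,文件名为示意:

```
# 文件路径:/data/alertmanager/templates/wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
【告警触发】
{{- range .Alerts.Firing }}
告警名称: {{ .Labels.alertname }}
告警级别: {{ .Labels.severity }}
告警主机: {{ .Labels.instance }}
告警详情: {{ .Annotations.description }}
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
【告警恢复】
{{- range .Alerts.Resolved }}
告警名称: {{ .Labels.alertname }}
恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{ end }}
```

修改模板后需重启或热加载Alertmanager才会生效。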
◆ 2.2.3 告警规则配置
# 文件路径:/data/prometheus/rules/base_rules.yml
groups:
# 基础设施告警规则组
- name: infrastructure_alerts
interval: 30s
rules:
# 实例宕机(最高优先级)
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
team: ops
annotations:
summary: "实例 {{ $labels.instance }} 已宕机"
description: "Job {{ $labels.job }} 的实例 {{ $labels.instance }} 已宕机超过1分钟"
runbook_url: "https://wiki.example.com/runbook/instance-down"
# CPU使用率告警(P95 > 80%持续5分钟)
- alert: HighCPUUsage
expr: |
(100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "CPU使用率过高: {{ $labels.instance }}"
description: "实例 {{ $labels.instance }} 的CPU使用率为 {{ $value | humanize }}%,持续5分钟超过80%"
value: "{{ $value | humanize }}%"
# 内存使用率告警(可用内存<10%)
- alert: HighMemoryUsage
expr: |
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 3m
labels:
severity: warning
team: ops
annotations:
summary: "内存使用率过高: {{ $labels.instance }}"
description: "实例 {{ $labels.instance }} 的内存使用率为 {{ $value | humanize }}%"
value: "{{ $value | humanize }}%"
# 磁盘使用率告警(分区使用>85%)
- alert: HighDiskUsage
expr: |
(node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{fstype=~"ext4|xfs"})
/ node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100 > 85
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "磁盘使用率过高: {{ $labels.instance }}"
description: "实例 {{ $labels.instance }} 的挂载点 {{ $labels.mountpoint }} 使用率为 {{ $value | humanize }}%"
value: "{{ $value | humanize }}%"
# 磁盘预测告警(预测7天内磁盘满)
- alert: DiskWillFillIn7Days
expr: |
predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[1h], 7*24*3600) < 0
for: 1h
labels:
severity: warning
team: ops
annotations:
summary: "磁盘空间预测不足: {{ $labels.instance }}"
description: "根据当前增长趋势,实例 {{ $labels.instance }} 的挂载点 {{ $labels.mountpoint }} 将在7天内耗尽"
# 磁盘IO等待过高
- alert: HighDiskIOWait
expr: |
rate(node_disk_io_time_seconds_total[5m]) > 0.8
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "磁盘IO等待过高: {{ $labels.instance }}"
description: "实例 {{ $labels.instance }} 的磁盘 {{ $labels.device }} IO等待时间占比为 {{ $value | humanizePercentage }}"
# 网络流量异常
- alert: HighNetworkTraffic
expr: |
rate(node_network_receive_bytes_total{device!~"lo|docker.*|veth.*"}[5m]) > 100*1024*1024
for: 5m
labels:
severity: info
team: ops
annotations:
summary: "入站流量异常: {{ $labels.instance }}"
description: "实例 {{ $labels.instance }} 的网卡 {{ $labels.device }} 入站流量为 {{ $value | humanize }}B/s"
# 文件描述符使用率告警
- alert: HighFileDescriptorUsage
expr: |
node_filefd_allocated / node_filefd_maximum * 100 > 80
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "文件描述符使用率过高: {{ $labels.instance }}"
description: "实例 {{ $labels.instance }} 的文件描述符使用率为 {{ $value | humanize }}%"
# 数据库告警规则组
- name: database_alerts
interval: 30s
rules:
# MySQL主从复制延迟
- alert: MySQLReplicationLag
expr: mysql_slave_status_seconds_behind_master > 60
for: 2m
labels:
severity: critical
team: dba
annotations:
summary: "MySQL主从复制延迟: {{ $labels.instance }}"
description: "MySQL实例 {{ $labels.instance }} 的主从复制延迟为 {{ $value }} 秒,超过60秒阈值"
# MySQL连接数过高
- alert: MySQLTooManyConnections
expr: |
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
for: 5m
labels:
severity: warning
team: dba
annotations:
summary: "MySQL连接数过高: {{ $labels.instance }}"
description: "MySQL实例 {{ $labels.instance }} 的连接数使用率为 {{ $value | humanize }}%"
# MySQL慢查询过多
- alert: MySQLHighSlowQueries
expr: |
rate(mysql_global_status_slow_queries[5m]) > 10
for: 5m
labels:
severity: warning
team: dba
annotations:
summary: "MySQL慢查询过多: {{ $labels.instance }}"
description: "MySQL实例 {{ $labels.instance }} 的慢查询速率为 {{ $value }} 次/秒"
# Redis内存使用率过高
- alert: RedisHighMemoryUsage
expr: |
(redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90) and redis_memory_max_bytes > 0
for: 5m
labels:
severity: critical
team: dba
annotations:
summary: "Redis内存使用率过高: {{ $labels.instance }}"
description: "Redis实例 {{ $labels.instance }} 的内存使用率为 {{ $value | humanize }}%"
# Redis命中率过低
- alert: RedisLowHitRate
expr: |
rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100 < 80
for: 10m
labels:
severity: warning
team: dba
annotations:
summary: "Redis缓存命中率过低: {{ $labels.instance }}"
description: "Redis实例 {{ $labels.instance }} 的缓存命中率为 {{ $value | humanize }}%"
# 应用层告警规则组
- name: application_alerts
interval: 30s
rules:
# HTTP 5xx错误率过高
- alert: HighHTTP5xxErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, instance)
/
sum(rate(http_requests_total[5m])) by (job, instance) * 100 > 5
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "HTTP 5xx错误率过高: {{ $labels.instance }}"
description: "应用 {{ $labels.job }} 实例 {{ $labels.instance }} 的5xx错误率为 {{ $value | humanize }}%"
# HTTP响应时间过慢(P99 > 2秒)
- alert: HighHTTPResponseTime
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)) > 2
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "HTTP响应时间过慢: {{ $labels.job }}"
description: "应用 {{ $labels.job }} 的P99响应时间为 {{ $value }} 秒,超过2秒阈值"
# 队列积压告警
- alert: HighQueueBacklog
expr: |
rabbitmq_queue_messages_ready > 10000
for: 5m
labels:
severity: warning
team: backend
annotations:
summary: "消息队列积压: {{ $labels.queue }}"
description: "队列 {{ $labels.queue }} 积压消息数为 {{ $value }}"
# Kubernetes告警规则组
- name: kubernetes_alerts
interval: 30s
rules:
# Pod一直重启
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[15m]) > 3
for: 5m
labels:
severity: critical
team: ops
annotations:
summary: "Pod持续重启: {{ $labels.namespace }}/{{ $labels.pod }}"
description: "命名空间 {{ $labels.namespace }} 中的Pod {{ $labels.pod }} 在过去15分钟内重启了 {{ $value }} 次"
# Pod未就绪
- alert: PodNotReady
expr: |
sum by (namespace, pod, phase) (kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
for: 10m
labels:
severity: warning
team: ops
annotations:
summary: "Pod未就绪: {{ $labels.namespace }}/{{ $labels.pod }}"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 状态为 {{ $labels.phase }},超过10分钟"
# Node内存压力
- alert: NodeMemoryPressure
expr: |
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "节点内存压力: {{ $labels.node }}"
description: "Kubernetes节点 {{ $labels.node }} 出现内存压力"
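告警规则建议配合 promtool test rules 做单元测试,上线前验证表达式和触发条件,而不是等生产环境"实弹演练"。下面是针对上文 InstanceDown 规则的测试用例(文件路径为示意,需与规则文件相对路径对应):

```yaml
# 文件路径:/data/prometheus/rules/tests/base_rules_test.yml
# 运行:promtool test rules /data/prometheus/rules/tests/base_rules_test.yml
rule_files:
  - ../base_rules.yml
evaluation_interval: 30s
tests:
  - interval: 30s
    # 模拟Node Exporter掉线:up 从1变为0
    input_series:
      - series: 'up{job="node", instance="192.168.1.101:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: ops
              job: node
              instance: '192.168.1.101:9100'
            exp_annotations:
              summary: "实例 192.168.1.101:9100 已宕机"
              description: "Job node 的实例 192.168.1.101:9100 已宕机超过1分钟"
              runbook_url: "https://wiki.example.com/runbook/instance-down"
```

输出 SUCCESS 即表示规则在给定样本下按预期触发,可将该命令纳入CI流水线。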
2.3 启动和验证
◆ 2.3.1 启动服务
# 创建Prometheus systemd服务文件
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/data/prometheus/config/prometheus.yml \
--storage.tsdb.path=/data/prometheus/data \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=100GB \
--web.listen-address=0.0.0.0:9090 \
--web.max-connections=512 \
--web.enable-lifecycle \
--web.enable-admin-api \
--log.level=info \
--log.format=json
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
# 创建Alertmanager systemd服务文件
cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/alertmanager \
--config.file=/data/alertmanager/config/alertmanager.yml \
--storage.path=/data/alertmanager/data \
--web.listen-address=0.0.0.0:9093 \
--log.level=info
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
EOF
# 创建Node Exporter systemd服务文件
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Documentation=https://prometheus.io/docs/guides/node-exporter/
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=0.0.0.0:9100 \
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
--collector.netclass.ignored-devices="^(veth.*|docker.*|lo)$" \
--log.level=info
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
EOF
# 重载systemd
systemctl daemon-reload
# 启动服务
systemctl start prometheus
systemctl start alertmanager
systemctl start node_exporter
# 设置开机自启
systemctl enable prometheus
systemctl enable alertmanager
systemctl enable node_exporter
# 查看服务状态
systemctl status prometheus
systemctl status alertmanager
systemctl status node_exporter
# 检查日志
journalctl -u prometheus -f --since "5 minutes ago"
◆ 2.3.2 功能验证
# 验证Prometheus进程和端口
ps aux | grep prometheus
netstat -tuln | grep -E ':(9090|9093|9100)'
# 预期输出:
# tcp6 0 0 :::9090 :::* LISTEN
# tcp6 0 0 :::9093 :::* LISTEN
# tcp6 0 0 :::9100 :::* LISTEN
# 验证Prometheus Web UI
curl http://localhost:9090/-/healthy
# 预期输出:Prometheus Server is Healthy.(旧版本输出 Prometheus is Healthy.)
curl http://localhost:9090/api/v1/status/config | jq .
# 输出当前配置
# 验证targets状态
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'
# 所有target的health应为"up"
# 验证告警规则加载
curl http://localhost:9090/api/v1/rules | jq '.data.groups[] | .name'
# 输出所有告警规则组名称
# 验证Alertmanager
curl http://localhost:9093/-/healthy
# 预期输出:OK
# 触发一个测试告警
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[{
"labels": {"alertname": "TestAlert", "severity": "info"},
"annotations": {"summary": "This is a test alert"},
"startsAt": "2024-11-08T10:00:00Z",
"endsAt": "2024-11-08T10:05:00Z"
}]'
# 查看Alertmanager告警
curl http://localhost:9093/api/v2/alerts | jq .
# 验证Node Exporter指标采集
curl http://localhost:9100/metrics | grep "node_cpu_seconds_total"
# 应输出CPU时间序列数据
# 使用promtool验证配置文件
promtool check config /data/prometheus/config/prometheus.yml
# 预期输出:SUCCESS: /data/prometheus/config/prometheus.yml is valid prometheus config file syntax
promtool check rules /data/prometheus/rules/*.yml
# 预期输出:SUCCESS: rules are valid
# PromQL查询测试
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=up' | jq .
# 查询所有实例的up状态
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' | jq .
# 查询CPU使用率
# 查看TSDB统计信息
curl http://localhost:9090/api/v1/status/tsdb | jq .
# 输出时间序列数量、内存占用等信息
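对Dashboard和告警中反复执行的重查询(典型如分位数计算),建议用recording rules预计算,把查询开销摊到后台评估。下面是一个示意(http_request_duration_seconds 为应用侧Histogram的示例假设;规则名按 level:metric:operations 惯例命名):

```yaml
# 文件路径:/data/prometheus/rules/recording_rules.yml
groups:
  - name: precompute_rules
    interval: 30s
    rules:
      # 预计算各job的P99延迟,Dashboard直接查询 job:http_request_duration_seconds:p99
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # 预计算各实例的CPU使用率
      - record: instance:node_cpu_utilization:percent
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

文件放入 /data/prometheus/rules/ 后会被上文的 rule_files 通配符匹配,执行 curl -X POST http://localhost:9090/-/reload 即可热加载(依赖启动参数 --web.enable-lifecycle)。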
三、示例代码和配置
3.1 完整配置示例
◆ 3.1.1 Grafana Dashboard JSON(Node Exporter监控面板)
{
"dashboard": {
"title": "Node Exporter Full Dashboard",
"tags": ["prometheus", "node-exporter"],
"timezone": "Asia/Shanghai",
"panels": [
{
"id": 1,
"title": "CPU使用率",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{ instance }}",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "max": 100, "min": 0}
],
"alert": {
"name": "CPU使用率告警",
"conditions": [
{
"evaluator": {"params": [80], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "1m",
"handler": 1,
"message": "CPU使用率超过80%持续5分钟",
"name": "CPU告警"
}
},
{
"id": 2,
"title": "内存使用率",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"legendFormat": "{{ instance }}",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "max": 100, "min": 0}
]
},
{
"id": 3,
"title": "磁盘使用率",
"type": "table",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "(node_filesystem_size_bytes{fstype=~\"ext4|xfs\"} - node_filesystem_avail_bytes{fstype=~\"ext4|xfs\"}) / node_filesystem_size_bytes{fstype=~\"ext4|xfs\"} * 100",
"format": "table",
"instant": true,
"refId": "A"
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"instance": "实例",
"mountpoint": "挂载点",
"Value": "使用率(%)"
}
}
}
]
},
{
"id": 4,
"title": "网络流量",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{device!~\"lo|docker.*|veth.*\"}[5m])",
"legendFormat": "{{ instance }}-{{ device }} 入站",
"refId": "A"
},
{
"expr": "rate(node_network_transmit_bytes_total{device!~\"lo|docker.*|veth.*\"}[5m])",
"legendFormat": "{{ instance }}-{{ device }} 出站",
"refId": "B"
}
],
"yaxes": [
{"format": "Bps"}
]
},
{
"id": 5,
"title": "磁盘IO",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[5m])",
"legendFormat": "{{ instance }}-{{ device }} 读取",
"refId": "A"
},
{
"expr": "rate(node_disk_written_bytes_total[5m])",
"legendFormat": "{{ instance }}-{{ device }} 写入",
"refId": "B"
}
],
"yaxes": [
{"format": "Bps"}
]
},
{
"id": 6,
"title": "系统负载",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
"targets": [
{
"expr": "node_load1",
"legendFormat": "{{ instance }} 1分钟",
"refId": "A"
},
{
"expr": "node_load5",
"legendFormat": "{{ instance }} 5分钟",
"refId": "B"
},
{
"expr": "node_load15",
"legendFormat": "{{ instance }} 15分钟",
"refId": "C"
}
]
},
{
"id": 7,
"title": "文件描述符使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
"targets": [
{
"expr": "node_filefd_allocated / node_filefd_maximum * 100",
"legendFormat": "{{ instance }}",
"refId": "A"
}
],
"options": {
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"fieldConfig": {
"defaults": {
"max": 100,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
},
"unit": "percent"
}
}
}
],
"refresh": "30s",
"time": {"from": "now-1h", "to": "now"},
"timepicker": {
"refresh_intervals": ["10s", "30s", "1m", "5m", "15m", "30m", "1h"]
}
}
}
◆ 3.1.2 自动化部署脚本(Docker Compose版本)
# 文件路径:/data/prometheus/docker-compose.yml
version: '3.8'
# 网络定义
networks:
monitoring:
driver: bridge
ipam:
config:
- subnet: 172.20.0.0/16
# 数据卷定义
volumes:
prometheus_data:
driver: local
grafana_data:
driver: local
alertmanager_data:
driver: local
services:
# Prometheus服务
prometheus:
image: prom/prometheus:v2.48.0
container_name: prometheus
hostname: prometheus
restart: unless-stopped
user: "0:0" # 解决权限问题
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=100GB'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
- '--log.level=info'
ports:
- "9090:9090"
volumes:
- ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./rules:/etc/prometheus/rules:ro
- prometheus_data:/prometheus
networks:
monitoring:
ipv4_address: 172.20.0.10
depends_on:
- alertmanager
# Alertmanager服务
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
hostname: alertmanager
restart: unless-stopped
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--log.level=info'
ports:
- "9093:9093"
volumes:
- ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- ./templates:/etc/alertmanager/templates:ro
- alertmanager_data:/alertmanager
networks:
monitoring:
ipv4_address: 172.20.0.11
# Grafana服务
grafana:
image: grafana/grafana:10.2.2
container_name: grafana
hostname: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin@prometheus
- GF_INSTALL_PLUGINS=alexanderzobnin-zabbix-app,grafana-clock-panel,grafana-piechart-panel
- GF_SERVER_ROOT_URL=http://grafana.example.com
- GF_SMTP_ENABLED=true
- GF_SMTP_HOST=smtp.exmail.qq.com:465
- GF_SMTP_USER=alert@example.com
- GF_SMTP_PASSWORD=your_password
- GF_SMTP_FROM_ADDRESS=alert@example.com
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
- ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
networks:
monitoring:
ipv4_address: 172.20.0.12
depends_on:
- prometheus
# Node Exporter服务
node-exporter:
image: prom/node-exporter:v1.7.0
container_name: node-exporter
hostname: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/host:ro,rslave
networks:
monitoring:
ipv4_address: 172.20.0.13
pid: host
# cAdvisor - 容器监控
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.0
container_name: cadvisor
hostname: cadvisor
restart: unless-stopped
privileged: true
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /dev/disk:/dev/disk:ro
networks:
monitoring:
ipv4_address: 172.20.0.14
# Blackbox Exporter - 黑盒监控
blackbox-exporter:
image: prom/blackbox-exporter:v0.24.0
container_name: blackbox-exporter
hostname: blackbox-exporter
restart: unless-stopped
ports:
- "9115:9115"
volumes:
- ./config/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
networks:
monitoring:
ipv4_address: 172.20.0.15
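compose中挂载的 ./config/blackbox.yml 需要自行创建,下面是与前文 http_2xx 探测模块对应的最小配置(各参数按实际探测需求调整):

```yaml
# 文件路径:/data/prometheus/config/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []   # 空表示默认2xx
      method: GET
      follow_redirects: true
      tls_config:
        insecure_skip_verify: false
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
```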
Grafana数据源自动配置:
# 文件路径:/data/prometheus/grafana/datasources/datasource.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://172.20.0.10:9090
isDefault: true
editable: true
jsonData:
httpMethod: POST
timeInterval: 30s
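与数据源类似,compose中挂载的dashboards目录也需要一个provider配置,Grafana启动时会自动加载目录下的JSON面板(文件路径为示例,path 与compose中的挂载点对应):

```yaml
# 文件路径:/data/prometheus/grafana/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```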
部署脚本:
#!/bin/bash
# 文件名:deploy_prometheus_stack.sh
# 功能:一键部署Prometheus监控栈
set -euo pipefail
echo "开始部署Prometheus监控栈..."
# 检查Docker和Docker Compose
if ! command -v docker &> /dev/null; then
echo "Docker未安装,开始安装..."
curl -fsSL https://get.docker.com | bash
systemctl enable docker --now
fi
if ! command -v docker-compose &> /dev/null; then
echo "Docker Compose未安装,开始安装..."
curl -L "https://github.com/docker/compose/releases/download/v2.23.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
fi
# 创建目录结构
mkdir -p /data/prometheus/{config,rules,templates,grafana/{dashboards,datasources}}
# 进入工作目录
cd /data/prometheus
# 启动服务栈
docker-compose up -d
# 等待服务启动
echo "等待服务启动..."
sleep 10
# 验证服务状态
docker-compose ps
# 检查Prometheus健康状态
if curl -s http://localhost:9090/-/healthy | grep -q "Healthy"; then
echo "Prometheus启动成功!"
else
echo "Prometheus启动失败,请检查日志:docker-compose logs prometheus"
exit 1
fi
# 检查Grafana健康状态
if curl -s http://localhost:3000/api/health | grep -q "ok"; then
echo "Grafana启动成功!"
else
echo "Grafana启动失败,请检查日志:docker-compose logs grafana"
exit 1
fi
echo "======================================"
echo "部署完成!访问地址:"
echo "Prometheus: http://$(hostname -I | awk '{print $1}'):9090"
echo "Alertmanager: http://$(hostname -I | awk '{print $1}'):9093"
echo "Grafana: http://$(hostname -I | awk '{print $1}'):3000"
echo "默认用户名/密码: admin / admin@prometheus"
echo "======================================"
3.2 实际应用案例
◆ 案例一:电商大促全链路监控方案
场景描述:某电商平台在双11大促期间,需要实时监控从CDN、网关、应用、数据库到消息队列的全链路性能,确保交易链路稳定。日均订单量从50万激增到500万,系统QPS从5万上升到80万。
实现代码:
# 全链路监控配置
# 文件路径:/data/prometheus/config/ecommerce_full_stack.yml
scrape_configs:
# 1. CDN层监控(阿里云CDN)
- job_name: 'aliyun-cdn'
static_configs:
- targets: ['aliyun-exporter:9522']
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: aliyun-exporter:9522
# 2. 网关层监控(Nginx Ingress)
- job_name: 'nginx-ingress'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- ingress-nginx
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
action: keep
regex: ingress-nginx
- source_labels: [__meta_kubernetes_pod_container_port_number]
action: keep
regex: "10254"
# 3. 应用层监控(SpringBoot微服务)
- job_name: 'springboot-order-service'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- ecommerce
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: order-service
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# 4. 数据库层监控(MySQL主从集群)
- job_name: 'mysql-master-slave'
static_configs:
- targets:
- '10.0.1.11:9104' # 主库
labels:
role: master
cluster: order-db
- targets:
- '10.0.1.12:9104' # 从库1
- '10.0.1.13:9104' # 从库2
labels:
role: slave
cluster: order-db
# 5. 缓存层监控(Redis Cluster)
- job_name: 'redis-cluster'
static_configs:
- targets:
- '10.0.1.21:9121'
- '10.0.1.22:9121'
- '10.0.1.23:9121'
labels:
cluster: cache-cluster
# 6. 消息队列监控(RocketMQ)
- job_name: 'rocketmq'
static_configs:
- targets:
- '10.0.1.31:5557'
labels:
cluster: mq-cluster
# 7. 业务指标监控(自定义Pushgateway)
- job_name: 'pushgateway-business'
honor_labels: true
static_configs:
- targets:
- 'pushgateway:9091'
# 核心业务告警规则
rule_files:
- '/data/prometheus/rules/ecommerce_alerts.yml'
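第7项的业务指标需要应用或定时脚本主动推送到Pushgateway。下面的脚本演示了Prometheus文本格式载荷的构造(指标名沿用下文告警规则中的 order_created_total;计数值和Pushgateway地址均为示例,实际使用时取消注释curl行):

```shell
#!/usr/bin/env bash
# 构造Prometheus文本格式载荷并推送到Pushgateway
JOB="order_service"
SUCCESS=12345
FAILED=67
payload=$(cat <<EOF
# TYPE order_created_total counter
order_created_total{status="success"} ${SUCCESS}
order_created_total{status="failed"} ${FAILED}
EOF
)
echo "$payload"
# 实际推送(需Pushgateway可达):
# echo "$payload" | curl --data-binary @- "http://pushgateway:9091/metrics/job/${JOB}"
```

注意Pushgateway会一直保留最后一次推送的值,配合上文scrape配置中的 honor_labels: true 可保留推送时自带的job标签。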
Key business alerting rules:

# File path: /data/prometheus/rules/ecommerce_alerts.yml
groups:
- name: ecommerce_critical_alerts
  interval: 15s  # shorter evaluation interval during major promotions
  rules:
  # Order creation success rate (core metric)
  - alert: LowOrderSuccessRate
    expr: |
      sum(rate(order_created_total{status="success"}[1m])) /
      sum(rate(order_created_total[1m])) * 100 < 99
    for: 1m
    labels:
      severity: critical
      team: backend
      business: order
    annotations:
      summary: "Order creation success rate too low"
      description: "Order creation success rate is {{ $value | printf \"%.2f\" }}%, below the 99% threshold; transactions are affected"
      action: "1. Check order-service logs 2. Check MySQL primary status 3. Check the Redis cache 4. Escalate to the DBA"
  # Payment success rate
  - alert: LowPaymentSuccessRate
    expr: |
      sum(rate(payment_total{status="success"}[1m])) /
      sum(rate(payment_total[1m])) * 100 < 98
    for: 1m
    labels:
      severity: critical
      team: backend
      business: payment
    annotations:
      summary: "Payment success rate too low"
      description: "Payment success rate is {{ $value | printf \"%.2f\" }}%, below the 98% threshold; GMV is affected"
  # Inventory deduction failure rate
  - alert: HighInventoryDeductionFailureRate
    expr: |
      sum(rate(inventory_deduction_total{status="failed"}[1m])) /
      sum(rate(inventory_deduction_total[1m])) * 100 > 5
    for: 2m
    labels:
      severity: warning
      team: backend
      business: inventory
    annotations:
      summary: "Inventory deduction failure rate too high"
      description: "Inventory deduction failure rate is {{ $value | printf \"%.2f\" }}%, which may cause overselling"
  # MySQL replication lag (threshold lowered during promotions)
  - alert: MySQLReplicationLagHigh
    expr: mysql_slave_status_seconds_behind_master{role="slave"} > 5
    for: 30s
    labels:
      severity: critical
      team: dba
    annotations:
      summary: "MySQL replication lag: {{ $labels.instance }}"
      description: "Replica {{ $labels.instance }} is {{ $value }} seconds behind the primary; read/write splitting is affected"
  # Redis cache-miss rate (possible cache penetration)
  - alert: RedisCacheMissTooHigh
    expr: |
      sum(rate(redis_keyspace_misses_total[5m])) /
      (sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m]))) * 100 > 30
    for: 5m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "Redis cache miss rate too high"
      description: "Cache miss rate is {{ $value | printf \"%.2f\" }}%; cache penetration is possible"
  # Message backlog
  - alert: RocketMQConsumerLag
    expr: rocketmq_consumer_lag > 10000
    for: 3m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "Message consumption lag: {{ $labels.consumer_group }}"
      description: "Consumer group {{ $labels.consumer_group }} has a backlog of {{ $value }} messages"
  # JVM GC pause time
  - alert: HighJVMGCTime
    expr: |
      rate(jvm_gc_pause_seconds_sum[5m]) /
      rate(jvm_gc_pause_seconds_count[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "JVM GC pauses too long: {{ $labels.instance }}"
      description: "Average GC pause on {{ $labels.instance }} is {{ $value }} seconds"
  # API P99 response time
  - alert: HighAPIResponseTime
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_server_requests_seconds_bucket{uri!~"/actuator/.*"}[5m])) by (uri, le)
      ) > 1
    for: 5m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "Slow API response: {{ $labels.uri }}"
      description: "P99 latency of {{ $labels.uri }} is {{ $value }} seconds"

Note: humanizePercentage multiplies by 100 and expects a 0-1 ratio; since these expressions already multiply by 100, the descriptions format the value with printf instead.
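The HighAPIResponseTime rule relies on histogram_quantile(), which picks the bucket containing the target rank and linearly interpolates within it. A minimal sketch of that interpolation, using made-up bucket data (Prometheus handles additional edge cases, such as negative bucket bounds, that are omitted here):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound +Inf.
    Mirrors Prometheus's linear interpolation inside the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):       # quantile falls in the +Inf bucket:
                return prev_bound        # return the highest finite bound
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# e.g. cumulative request counts per latency bucket over a window
b = [(0.1, 50), (0.5, 80), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.75, b))  # ≈ 0.433
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.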
Observed results:

During the promotion window (Nov 11, 00:00-01:00):
- Monitored series: 2,358,000 time series
- Prometheus ingestion rate: 126,000 samples/s at peak
- Alerts fired: 347 (298 auto-resolved, 49 needed human intervention)
- Mean time to detect (MTTD): 18 seconds
- Mean time to recover (MTTR): 3 minutes 12 seconds
- Overall availability: 99.97%
- GMV impact: timely detection avoided an estimated 5.8 million CNY in potential losses

Key events:
- 00:12 Redis primary hit 95% memory usage; the alert fired and ops scaled it up within 3 minutes
- 00:28 A Pod of the order service leaked memory; monitoring caught it and the Pod was restarted automatically
- 00:45 MySQL replica lag reached 8 seconds; the alert fired and the DBA tuned parameters to recover

◆ Case 2: Full-stack Kubernetes cluster monitoring

Scenario: an internet company runs a production Kubernetes cluster of 200 nodes and 3,000 Pods and needs end-to-end monitoring from infrastructure to applications.

Implementation steps:

- Deploy the Prometheus Operator

# Install kube-prometheus-stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create the namespace
kubectl create namespace monitoring
# Install (includes Prometheus, Grafana, Alertmanager, and common Exporters)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=500Gi \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=ssd \
--set grafana.adminPassword=admin@k8s \
--set alertmanager.config.global.resolve_timeout=5m
# Verify the deployment
kubectl get pods -n monitoring

- Add a custom ServiceMonitor for applications

# Monitor a Spring Boot application
# File path: servicemonitor-springboot.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: springboot-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: springboot
  namespaceSelector:
    matchNames:
    - default
    - production
  endpoints:
  - port: metrics
    path: /actuator/prometheus
    interval: 30s
    scrapeTimeout: 10s
- Configure custom alerts with a PrometheusRule

# File path: prometheusrule-k8s.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: kubernetes.rules
    interval: 30s
    rules:
    # Pod restarting too often
    - alert: PodRestartTooOften
      expr: |
        rate(kube_pod_container_status_restarts_total[15m]) * 3600 > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
        description: "The Pod is restarting at {{ $value }} times per hour (15m average)"
    # HPA at capacity
    - alert: HPAMaxedOut
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas /
        kube_horizontalpodautoscaler_spec_max_replicas > 0.95
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is near max replicas"
        description: "Current replicas are at {{ $value | humanizePercentage }} of the maximum; consider raising the limit"
    # PVC running out of space
    - alert: PVCStorageRunningLow
      expr: |
        kubelet_volume_stats_available_bytes /
        kubelet_volume_stats_capacity_bytes * 100 < 20
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is low on space"
        description: "Only {{ $value | printf \"%.1f\" }}% space remaining"
- Apply and verify

# Apply the manifests
kubectl apply -f servicemonitor-springboot.yaml
kubectl apply -f prometheusrule-k8s.yaml
# Verify the ServiceMonitor
kubectl get servicemonitor -n monitoring
# Verify the PrometheusRule
kubectl get prometheusrule -n monitoring
# Check whether Prometheus has discovered the new targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit http://localhost:9090/targets
# View the alerting rules
# Visit http://localhost:9090/rules

Monitoring results:

Cluster statistics:
- Total series: ~4.5 million time series
- Prometheus memory usage: 38GB at peak
- TSDB disk usage: ~320GB (30-day retention)
- Query P99 latency: <500ms
- Alerting rules: 127
- Daily alerts: 82 on average (false-positive rate <5%)

Typical scenarios:
1. Newly deployed Pods are discovered and monitored automatically (<1 minute delay)
2. Pods are rescheduled within 3 minutes of a Node going NotReady
3. HPA scales on CPU/memory metrics with a <2 minute response time
4. Monitoring flagged abnormal traffic, giving a 15-minute early warning of a DDoS attack
IV. Best Practices and Caveats

4.1 Best practices

◆ 4.1.1 Performance tuning

# Tier scrape intervals by business criticality
global:
  scrape_interval: 30s  # default
scrape_configs:
# Core services: high-frequency scraping
- job_name: 'critical-services'
  scrape_interval: 15s
  static_configs:
  - targets: ['order-service:8080', 'payment-service:8080']
# Regular services: standard interval
- job_name: 'normal-services'
  scrape_interval: 30s
  static_configs:
  - targets: ['user-service:8080']
# Infrastructure: low-frequency scraping
- job_name: 'infrastructure'
  scrape_interval: 60s
  static_configs:
  - targets: ['node-exporter:9100']
# Storage tuning: tiered retention
# Startup flags:
# --storage.tsdb.retention.time=30d   # keep 30 days locally
# --storage.tsdb.retention.size=100GB # cap disk usage at 100GB
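The retention flags can be sized from the ingestion rate: compressed TSDB samples take roughly 1-2 bytes each. A back-of-the-envelope sketch (the 1.5 bytes/sample figure is a common rule of thumb, not a guarantee):

```python
def tsdb_disk_bytes(samples_per_sec, retention_days, bytes_per_sample=1.5):
    """Rough TSDB disk estimate: ingestion rate x retention x bytes/sample."""
    return samples_per_sec * retention_days * 86400 * bytes_per_sample

# e.g. 126,000 samples/s (the promotion-peak rate above) kept for 30 days
gb = tsdb_disk_bytes(126_000, 30) / 1024**3
print(f"{gb:.0f} GiB")  # → 456 GiB, in line with the 500GB+ disk recommendation
```

In practice, add headroom for WAL segments and temporary space used during compaction.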
Metric filtering in practice:

scrape_configs:
- job_name: 'node'
  static_configs:
  - targets: ['node-exporter:9100']
  # Filter out unneeded metrics (reduces storage)
  metric_relabel_configs:
  # Drop conntrack metrics
  - source_labels: [__name__]
    regex: 'node_nf_conntrack_.*'
    action: drop
  # Drop filesystem metrics for the /tmp mountpoint
  - source_labels: [__name__, mountpoint]
    regex: 'node_filesystem_.+;/tmp'
    action: drop
  # Drop Docker virtual NICs
  - source_labels: [__name__, device]
    regex: 'node_network_.+;(veth.*|docker.*)'
    action: drop
  # Relabeling (normalize labels)
  relabel_configs:
  # Add an environment label
  - target_label: env
    replacement: production
  # Extract the hostname from the instance address
  - source_labels: [__address__]
    regex: '([^:]+):.*'
    target_label: hostname
    replacement: '$1'
Field experience: one customer went from 3 million time series to 1.2 million (a 60% reduction) through metric filtering alone; Prometheus memory usage dropped from 48GB to 18GB.
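Drop rules can be sanity-checked offline before rollout: Prometheus joins the source_labels values with ';' and matches the regex fully anchored. A small simulation (the sample metric names are illustrative):

```python
import re

def dropped(rule_regex, source_values, sep=";"):
    """True if a series would be dropped: Prometheus matches the regex
    fully anchored against the ';'-joined source label values."""
    return re.fullmatch(rule_regex, sep.join(source_values)) is not None

# rule: drop node_network_* series on Docker virtual NICs
assert dropped(r'node_network_.+;(veth.*|docker.*)',
               ["node_network_receive_bytes_total", "veth1a2b"])
assert not dropped(r'node_network_.+;(veth.*|docker.*)',
                   ["node_network_receive_bytes_total", "eth0"])
print("relabel drop simulation ok")
```

The same anchoring applies to `keep` rules, which is why partial matches need an explicit `.*` on either side.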
# Prometheus remote_write to a Thanos Receiver
remote_write:
- url: "http://thanos-receiver:19291/api/v1/receive"
  queue_config:
    capacity: 10000            # queue capacity
    max_shards: 50             # max concurrent write shards
    min_shards: 1
    max_samples_per_send: 5000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 100ms
  # Optional: send only selected metrics to remote storage
  write_relabel_configs:
  - source_labels: [__name__]
    regex: 'node_.*|mysql_.*'
    action: keep
Thanos deployment (simplified):

# Thanos Sidecar: runs alongside Prometheus and uploads TSDB blocks to object storage
docker run -d --name thanos-sidecar \
-v /data/prometheus:/prometheus \
quay.io/thanos/thanos:v0.32.5 sidecar \
--tsdb.path=/prometheus \
--prometheus.url=http://localhost:9090 \
--objstore.config="$(cat <<EOF
type: S3
config:
bucket: prometheus-data
endpoint: s3.amazonaws.com
access_key: YOUR_ACCESS_KEY
secret_key: YOUR_SECRET_KEY
EOF
)"
# Thanos Store: serves historical-data queries from object storage
docker run -d --name thanos-store \
-p 10901:10901 \
quay.io/thanos/thanos:v0.32.5 store \
--data-dir=/data/thanos-store \
--objstore.config="..."
# Thanos Query: unified query entry point (fans out to Prometheus instances and Stores)
docker run -d --name thanos-query \
-p 9090:9090 \
quay.io/thanos/thanos:v0.32.5 query \
--http-address=0.0.0.0:9090 \
--store=thanos-sidecar:10901 \
--store=thanos-store:10901
# Configure Thanos Query as the Grafana data source
# URL: http://thanos-query:9090
◆ 4.1.2 Security hardening

# Prometheus with Basic Auth
# prometheus.yml
global:
  external_labels:
    cluster: 'production'
# Enable TLS and Basic Auth (requires certificates and a password file)
# Startup flag:
# --web.config.file=/etc/prometheus/web-config.yml
# web-config.yml
tls_server_config:
  cert_file: /etc/prometheus/ssl/prometheus.crt
  key_file: /etc/prometheus/ssl/prometheus.key
basic_auth_users:
  admin: $2y$10$HvuYHy...     # bcrypt-hashed password
  readonly: $2y$10$ZzuYHy...
# Generate a hash with:
# htpasswd -nBC 10 admin
Grafana with OAuth2 authentication (enterprise scenario):
# grafana.ini
[auth.generic_oauth]
enabled = true
name = OAuth
allow_sign_up = true
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
scopes = openid email profile
auth_url = https://accounts.google.com/o/oauth2/auth
token_url = https://accounts.google.com/o/oauth2/token
api_url = https://www.googleapis.com/oauth2/v1/userinfo
# Restrict access with an Nginx reverse proxy
# /etc/nginx/conf.d/prometheus.conf
server {
listen 443 ssl http2;
server_name prometheus.example.com;
ssl_certificate /etc/nginx/ssl/example.com.crt;
ssl_certificate_key /etc/nginx/ssl/example.com.key;
# IP allowlist
allow 10.0.0.0/8;
allow 172.16.0.0/12;
deny all;
location / {
proxy_pass http://127.0.0.1:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# Block the Admin API
if ($request_uri ~* "^/api/v1/admin") {
return 403;
}
}
}
# Manage secrets with HashiCorp Vault
# 1. Run Vault
docker run -d --name vault -p 8200:8200 vault:1.15.0
# 2. Initialize Vault and store the secrets
vault kv put secret/prometheus/alertmanager \
smtp_password=your_password \
wechat_api_secret=your_secret
# 3. Render alertmanager.yml from Vault before startup.
# Note: Alertmanager cannot read Vault directly; render the config file with
# Vault Agent or consul-template from a template such as:
# alertmanager.yml.tpl
global:
  smtp_auth_password: '{{ with secret "secret/prometheus/alertmanager" }}{{ .Data.data.smtp_password }}{{ end }}'
◆ 4.1.3 High availability

# Architecture: sharded Prometheus servers scrape; one global Prometheus aggregates
# Shard 1 (monitors K8s cluster 1)
# prometheus-shard1.yml
global:
  external_labels:
    shard: '1'
    cluster: 'k8s-cluster-1'
# Shard 2 (monitors K8s cluster 2)
# prometheus-shard2.yml
global:
  external_labels:
    shard: '2'
    cluster: 'k8s-cluster-2'
# Global Prometheus (federation)
# prometheus-global.yml
scrape_configs:
- job_name: 'federate'
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job="kubernetes-nodes"}'
    - '{job="kubernetes-pods"}'
    - '{__name__=~"job:.*"}'  # pre-aggregated series
  static_configs:
  - targets:
    - 'prometheus-shard1:9090'
    - 'prometheus-shard2:9090'
# Run two identical Prometheus instances
# prometheus-1 and prometheus-2 scrape the same targets with identical configs
# Prometheus 1
docker run -d --name prometheus-1 \
-p 9090:9090 \
-v /data/prometheus/config:/etc/prometheus \
prom/prometheus:v2.48.0
# Prometheus 2 (redundant replica)
docker run -d --name prometheus-2 \
-p 9091:9090 \
-v /data/prometheus/config:/etc/prometheus \
prom/prometheus:v2.48.0
# Alertmanager cluster (high availability)
# alertmanager-1
docker run -d --name alertmanager-1 \
-p 9093:9093 \
-v /data/alertmanager/config:/etc/alertmanager \
prom/alertmanager:v0.26.0 \
--cluster.peer=alertmanager-2:9094
# alertmanager-2
docker run -d --name alertmanager-2 \
-p 9094:9093 \
-v /data/alertmanager/config:/etc/alertmanager \
prom/alertmanager:v0.26.0 \
--cluster.peer=alertmanager-1:9094
#!/bin/bash
# Backup script: backup_prometheus.sh
BACKUP_DIR="/data/backup/prometheus"
RETENTION_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)
# Create a snapshot (requires --web.enable-admin-api)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Get the snapshot directory name
SNAPSHOT=$(ls -t /data/prometheus/data/snapshots/ | head -1)
# Compress the snapshot
tar -czf ${BACKUP_DIR}/prometheus_snapshot_${DATE}.tar.gz \
-C /data/prometheus/data/snapshots/${SNAPSHOT} .
# Back up the config files
tar -czf ${BACKUP_DIR}/prometheus_config_${DATE}.tar.gz \
/data/prometheus/config
# Remove expired backups
find ${BACKUP_DIR} -name "*.tar.gz" -mtime +${RETENTION_DAYS} -delete
# Upload to S3 (optional)
# aws s3 cp ${BACKUP_DIR}/prometheus_snapshot_${DATE}.tar.gz \
#   s3://my-backup-bucket/prometheus/
4.2 Caveats

◆ 4.2.1 Configuration caveats

Warning: Prometheus's local storage does not support NFS or other network filesystems; using them can corrupt data. Use local disks or block storage (e.g. AWS EBS).

# Bad: user ID as a label (a million users = a million time series)
http_requests_total{user_id="12345",path="/api/order"}
# Good: keep per-user dimensions out of labels; aggregate instead
http_requests_total{path="/api/order"}  # total volume
http_requests_by_user_bucket{le="100"}  # histogram of the per-user distribution
Why high cardinality hurts:
- Memory blow-up (each active series costs roughly 3KB of memory)
- Query performance drops sharply (scanning millions of series can take >10s)
- TSDB disk space runs out

Identifying high-cardinality metrics:

# Count the number of distinct metric names
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
# Count time series per metric
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=count({__name__=~"http_.*"}) by (__name__)' | jq .
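The ~3KB-per-series figure makes the cost of a high-cardinality label easy to estimate (a rough rule of thumb; actual memory also depends on label sizes and series churn):

```python
def series_memory_gb(num_series, bytes_per_series=3 * 1024):
    """Rough head-memory estimate assuming ~3KB per active series."""
    return num_series * bytes_per_series / 1024**3

# a single label with 1M distinct user IDs multiplies the series count by 1M
print(f"{series_memory_gb(1_000_000):.1f} GB")  # → 2.9 GB just for series overhead
```

On top of this come chunks, the index, and query buffers, which is why the text elsewhere budgets ~10GB of total memory per million series.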
# Slow query example: aggregating a huge number of series globally
sum(rate(http_requests_total[5m]))  # may scan millions of series
# Better: group by labels first to narrow the aggregation
sum by (job, instance) (rate(http_requests_total[5m]))
# Best: pre-compute with a recording rule
# rules.yml
groups:
- name: http_requests
  interval: 30s
  rules:
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
# Then query the pre-computed series directly
sum(job:http_requests:rate5m)
Query optimization tips:
- Avoid .* regex matchers; match as precisely as possible
- Use a reasonable range window (5m is usually enough; avoid 1h)
- Split large-range queries into segments with subqueries
- Use limit to cap the number of returned series

Caveat 3: choosing the `for` duration of alerting rules
# Bad: `for` too short, causing false alarms
- alert: HighCPU
  expr: node_cpu_usage > 80
  for: 30s  # too short; may just be a transient spike
# Good: set the duration to match business tolerance
- alert: HighCPU
  expr: node_cpu_usage > 80
  for: 5m   # alert only after 5 sustained minutes, filtering spikes
# Special case: alert quickly (service down)
- alert: ServiceDown
  expr: up == 0
  for: 1m   # alert after 1 minute of downtime
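The effect of `for` can be pictured as a debounce over evaluation cycles: the expression must stay true across the whole window, and any single false evaluation resets the timer. A simulation sketch (a 15-second evaluation interval is assumed):

```python
def fires(samples, threshold, for_sec, eval_interval=15):
    """Return the first time (s) the alert fires: the condition must hold
    for every evaluation across the `for` window, else the timer resets."""
    pending_since = None
    for i, value in enumerate(samples):
        t = i * eval_interval
        if value > threshold:
            if pending_since is None:
                pending_since = t           # alert enters "pending" state
            if t - pending_since >= for_sec:
                return t                    # alert transitions to "firing"
        else:
            pending_since = None            # one dip resets the timer
    return None

cpu = [85, 90, 70, 85, 88, 92, 95, 96]      # the dip at t=30s resets the timer
print(fires(cpu, 80, 45))  # → 90
```

This is also why a `for` much longer than the scrape interval masks flapping conditions entirely: they reset the timer before it ever expires.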
◆ 4.2.2 Common errors

| Symptom | Cause | Fix |
| --- | --- | --- |
| context deadline exceeded | PromQL query timeout (default 2 minutes) | 1. Optimize the query to touch fewer series 2. Raise the timeout: --query.timeout=5m 3. Pre-compute with recording rules |
| out of memory | Too many series or overly large aggregations | 1. Reduce targets or filter metrics 2. Add memory or shorten retention 3. Offload to remote storage |
| sample timestamp out of bounds | Sample timestamps outside the valid window | 1. Check time sync: timedatectl status 2. Make sure Exporters and Prometheus agree on time 3. Consider --storage.tsdb.allow-overlapping-blocks |
| invalid magic number | Corrupted TSDB data | 1. Stop Prometheus 2. Delete the corrupted block directory 3. Restore from backup |
| target is DOWN | Scrape target unreachable | 1. Check connectivity: curl http://target:9100/metrics 2. Check firewall rules 3. Verify the service discovery config |
| too many open files | File descriptors exhausted | 1. Raise the limit: ulimit -n 65535 2. Set LimitNOFILE=65536 in the systemd unit 3. Reduce the number of scrape targets |
◆ 4.2.3 Compatibility issues

- Version compatibility:
  - The Prometheus 2.x config format is incompatible with 1.x; migration is required
  - Alertmanager 0.25+ changed alert-route matching compared to 0.20 and earlier
  - Some legacy panel plugins do not work with Grafana 10.x
  - Keep Prometheus and Alertmanager versions close (no more than 2 minor versions apart)
- Platform compatibility:
  - Kubernetes 1.24+ removed dockershim, which changed the cadvisor metrics path
  - ARM requires matching images (the prom/prometheus:v2.48.0 tag is multi-arch)
  - Some Exporters are limited on Windows (use windows_exporter instead of node_exporter)
- Component dependencies:
  - Thanos requires Prometheus 2.13+ and tuning of --storage.tsdb.min-block-duration
  - Grafana Loki needs a matching Promtail version (2.8+)
  - ServiceMonitor CRDs require Prometheus Operator 0.50+
V. Troubleshooting and Monitoring

5.1 Troubleshooting

◆ 5.1.1 Viewing logs

# Prometheus logs
journalctl -u prometheus -f --since "30 minutes ago"
# Docker
docker logs -f --tail 100 prometheus
# Kubernetes
kubectl logs -f -n monitoring prometheus-kube-prometheus-stack-0
# Filter for errors
journalctl -u prometheus | grep -i error
# Logs for a specific time window
journalctl -u prometheus --since "2024-11-08 10:00" --until "2024-11-08 11:00"
# Alertmanager logs
journalctl -u alertmanager -f
# Grafana logs
tail -f /var/log/grafana/grafana.log
◆ 5.1.2 Common problems

Problem 1: Prometheus memory keeps growing until OOM

# Diagnostics
# 1. TSDB status
curl http://localhost:9090/api/v1/status/tsdb | jq .
# Watch: numSeries (number of time series)
# 2. Metrics with the highest cardinality
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=topk(20, count by (__name__)({__name__=~".+"}))' | jq .
# 3. Series count per metric and job
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=topk(10, count by (__name__, job)({__name__=~".+"}))' | jq .
# 4. Number of distinct metric names
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'

Remedies:
- Filter unneeded metrics (metric_relabel_configs)
- Identify and remove high-cardinality labels (user_id, order_id, ...)
- Shorten retention or enable remote storage
- Add memory (rule of thumb: ~10GB per million series)
- Pre-aggregate with recording rules to reduce the raw series load
Problem 2: PromQL queries time out or fail

# Diagnostics
# 1. Enable the query log (set query_log_file under `global:` in prometheus.yml)
# and inspect recent entries - one JSON line per query, including timing stats
tail -20 /var/log/prometheus/query.log
# 2. Reproduce the query in the Prometheus UI
# Visit http://localhost:9090/graph and watch the query timing
# 3. Check concurrent query load via Prometheus's own metrics
curl -s http://localhost:9090/metrics | grep prometheus_engine_queries

Remedies:
- Optimize the query: add label matchers to narrow its scope
- Shorten the query range (e.g. from 1d to 1h)
- Pre-compute complex queries with recording rules
- Raise the query timeout: --query.timeout=5m
- Stagger heavy queries so they do not run concurrently
Problem 3: alerts not firing, or firing repeatedly

# 1. Check that the rules are loaded
curl http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="my_alert_group")'
# 2. Check alert state
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname=="HighCPU")'
# 3. Validate the PromQL expression
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=node_cpu_usage > 80' | jq .
# 4. Check whether Alertmanager received the alert
curl http://localhost:9093/api/v1/alerts | jq .
# 5. Inspect Alertmanager routing state
curl http://localhost:9093/api/v1/status | jq .

- Validate rule syntax: promtool check rules /path/to/rules.yml
- Check whether the `for` duration is so long that the alert has not fired yet
- Check the Alertmanager routes; confirm the alert matches the intended receiver
- Check inhibition rules; the alert may be wrongly suppressed
- Verify the notification channel config (SMTP, webhook, ...)
◆ 5.1.3 Debug mode

# Enable Prometheus debug logging
# Add the startup flag --log.level=debug and restart;
# Prometheus does not expose an API to change the log level at runtime
# Watch the debug output
journalctl -u prometheus -f | grep -i debug
# Alertmanager debug mode
# Add the startup flag --log.level=debug
# Test alert routing (nothing is actually sent)
amtool config routes test \
--config.file=/data/alertmanager/config/alertmanager.yml \
severity=critical service=mysql
# Render a notification template
amtool template render \
--template.glob='/data/alertmanager/templates/*.tmpl' \
--template.text='{{ template "email.default.subject" . }}'
# Debug PromQL with promtool
promtool query instant http://localhost:9090 'up{job="node"}'
# Range query
promtool query range http://localhost:9090 \
--start=2024-11-08T10:00:00Z \
--end=2024-11-08T11:00:00Z \
--step=5m \
'rate(http_requests_total[5m])'
5.2 Performance monitoring

◆ 5.2.1 Key metrics

# Prometheus's own metrics
curl http://localhost:9090/metrics | grep prometheus_
# Key metrics:
# prometheus_tsdb_symbol_table_size_bytes: symbol table size
# prometheus_tsdb_head_series: time series currently in memory
# prometheus_tsdb_head_chunks: chunks currently in memory
# prometheus_engine_query_duration_seconds: query latency
# prometheus_rule_evaluation_duration_seconds: rule evaluation latency
# prometheus_sd_discovered_targets: targets found by service discovery
# prometheus_target_scrape_pool_targets: scrape targets per pool
# prometheus_remote_storage_samples_total: samples sent to remote storage
# Query Prometheus performance with PromQL
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Bytes written per second
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[5m])
# Query P99 latency
histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))
◆ 5.2.2 Metric reference

| Metric | Normal range | Alert threshold | Meaning |
| --- | --- | --- | --- |
| prometheus_tsdb_head_series | depends on sizing | >5M | In-memory series count; filter metrics or scale out when exceeded |
| prometheus_engine_query_duration_seconds{quantile="0.99"} | <5s | >30s | Query P99 latency; optimize queries or add resources when exceeded |
| prometheus_rule_evaluation_duration_seconds | <10s | >30s | Rule evaluation time; optimize or split rule groups when exceeded |
| prometheus_target_scrape_pool_exceeded_target_limit_total | 0 | >0 | Scrape target limit exceeded; adjust the limit or shard |
| prometheus_tsdb_compactions_failed_total | 0 | >5/hour | TSDB compaction failures; possible disk fault or permissions issue |
| up{job="prometheus"} | 1 | 0 | Prometheus self liveness |
◆ 5.2.3 Self-monitoring alerts

# Prometheus self-monitoring alert rules
# prometheus_self_monitoring.yml
groups:
- name: prometheus_alerts
  interval: 30s
  rules:
  # Prometheus itself is down
  - alert: PrometheusDown
    expr: up{job="prometheus"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus instance {{ $labels.instance }} is down"
      description: "Prometheus has been down for more than 1 minute"
  # Too many time series
  - alert: PrometheusTooManyTimeSeries
    expr: prometheus_tsdb_head_series > 3000000
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus holds too many time series"
      description: "Current series count is {{ $value }}, approaching the performance ceiling"
  # Slow queries
  - alert: PrometheusSlowQueries
    expr: |
      histogram_quantile(0.99,
        rate(prometheus_engine_query_duration_seconds_bucket[5m])
      ) > 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus query latency too high"
      description: "P99 query latency is {{ $value }} seconds"
  # Rule evaluation failures
  - alert: PrometheusRuleEvaluationFailures
    expr: rate(prometheus_rule_evaluation_failures_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus rule evaluation failing"
      description: "Rule group {{ $labels.rule_group }} fails to evaluate"
  # Scrape failure rate too high
  # (count, not sum: the series matched by `up == 0` all have value 0, so sum() is always 0)
  - alert: PrometheusTargetScrapeFailed
    expr: |
      (count by (job) (up == 0) / count by (job) (up)) * 100 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High scrape failure rate: {{ $labels.job }}"
      description: "Job {{ $labels.job }} has a scrape failure rate of {{ $value }}%"
  # TSDB compaction failures
  - alert: PrometheusTSDBCompactionsFailed
    expr: rate(prometheus_tsdb_compactions_failed_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus TSDB compaction failing"
      description: "TSDB compaction is failing; possible disk fault"
5.3 Backup and Restore

◆ 5.3.1 Backup strategy

#!/bin/bash
# Backup script: prometheus_backup.sh
# Purpose: periodically back up Prometheus TSDB data and configuration
set -euo pipefail
BACKUP_DIR="/data/backup/prometheus"
S3_BUCKET="s3://prometheus-backup"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
echo "[$(date)] Starting Prometheus backup..."
# 1. Create a TSDB snapshot
echo "[$(date)] Creating TSDB snapshot..."
SNAPSHOT=$(curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot | jq -r .data.name)
if [ -z "$SNAPSHOT" ]; then
echo "[ERROR] Snapshot creation failed"
exit 1
fi
echo "[$(date)] Snapshot created: $SNAPSHOT"
# 2. Compress the snapshot
echo "[$(date)] Compressing snapshot..."
tar -czf ${BACKUP_DIR}/tsdb_${DATE}.tar.gz \
-C /data/prometheus/data/snapshots/${SNAPSHOT} .
# 3. Back up configuration
echo "[$(date)] Backing up config files..."
tar -czf ${BACKUP_DIR}/config_${DATE}.tar.gz \
/data/prometheus/config \
/data/alertmanager/config
# 4. Upload to S3
echo "[$(date)] Uploading to S3..."
aws s3 cp ${BACKUP_DIR}/tsdb_${DATE}.tar.gz ${S3_BUCKET}/tsdb/
aws s3 cp ${BACKUP_DIR}/config_${DATE}.tar.gz ${S3_BUCKET}/config/
# 5. Remove expired local backups
echo "[$(date)] Pruning expired local backups..."
find ${BACKUP_DIR} -name "*.tar.gz" -mtime +${RETENTION_DAYS} -delete
# 6. Remove the Prometheus snapshot
rm -rf /data/prometheus/data/snapshots/${SNAPSHOT}
# 7. Verify backup integrity
echo "[$(date)] Verifying backup integrity..."
if tar -tzf ${BACKUP_DIR}/tsdb_${DATE}.tar.gz > /dev/null 2>&1; then
echo "[$(date)] Backup verified"
else
echo "[ERROR] Backup archive corrupted"
exit 1
fi
echo "[$(date)] Backup complete!"
echo "TSDB backup: ${BACKUP_DIR}/tsdb_${DATE}.tar.gz"
echo "Config backup: ${BACKUP_DIR}/config_${DATE}.tar.gz"
◆ 5.3.2 Restore procedure

- Stop the service
# Stop Prometheus
systemctl stop prometheus
# Docker
docker stop prometheus
# Kubernetes
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus \
-n monitoring --replicas=0
- Restore the data
# Download the backup from S3
aws s3 cp s3://prometheus-backup/tsdb/tsdb_20241108_020000.tar.gz /tmp/
# Clear the current data directory
rm -rf /data/prometheus/data/*
# Extract the backup
mkdir -p /data/prometheus/data
tar -xzf /tmp/tsdb_20241108_020000.tar.gz -C /data/prometheus/data/
# Restore the config files
aws s3 cp s3://prometheus-backup/config/config_20241108_020000.tar.gz /tmp/
tar -xzf /tmp/config_20241108_020000.tar.gz -C /
# Fix permissions
chown -R prometheus:prometheus /data/prometheus
- Verify integrity
# Validate the TSDB data with promtool
promtool tsdb analyze /data/prometheus/data
# Check config syntax
promtool check config /data/prometheus/config/prometheus.yml
promtool check rules /data/prometheus/rules/*.yml
- Restart the service
# Start Prometheus
systemctl start prometheus
# Watch the logs
journalctl -u prometheus -f --since "1 minute ago"
# Verify the data is back
curl http://localhost:9090/api/v1/query \
--data-urlencode 'query=up' | jq .
# Verify the metric names are present
curl http://localhost:9090/api/v1/label/__name__/values | jq .
VI. Summary

6.1 Key takeaways

- Takeaway 1: monitoring architecture design
  - Layered monitoring: infrastructure, middleware, application, and business layers
  - Metric taxonomy: RED (Rate/Errors/Duration) + USE (Utilization/Saturation/Errors)
  - High availability: Prometheus federation/Thanos plus an Alertmanager cluster
  - Long-term storage: local TSDB (30 days) plus remote storage (1 year+)
- Takeaway 2: alerting rule best practices
  - Severity levels: Critical (act immediately), Warning (within 24 hours), Info (record only)
  - Noise reduction: grouping (group_by) + inhibition (inhibit_rules) + silence windows
  - Actionability: every alert must include a runbook link and handling steps
  - Continuous tuning: review alerts regularly, adjust thresholds, keep the false-positive rate under 5%
- Takeaway 3: performance optimization
  - Metric filtering: cut 50%+ of unneeded metrics with metric_relabel_configs
  - Scrape tuning: 15s/30s/60s intervals according to business criticality
  - Query tuning: pre-compute with recording rules instead of aggregating millions of series live
  - Capacity planning: ~10GB of memory per million series; SSDs improve query performance by roughly 10x
- Takeaway 4: production rollout experience
  - Incremental migration: monitor non-core services first; extend to core systems once proven stable
  - Staged rollout: validate new rules in a test environment for a week before production
  - Capacity planning: provision for 2x peak traffic and keep 30% headroom
  - Contingency: keep a fallback (e.g. email alerting) for when the monitoring stack itself fails

6.2 Where to go next

- Direction 1: deep Prometheus customization
  - Build custom Exporters for business metrics (Go, client_golang)
  - Write recording rules for multi-level aggregation and downsampling
  - Integrate Cortex for multi-tenant isolation and globally distributed deployment
  - Study TSDB internals to tune compaction and query performance
  - Resources: "Prometheus: Up & Running", the official Prometheus blog
- Direction 2: full-stack observability
  - Unify Metrics (Prometheus) + Logs (Loki) + Traces (Jaeger) in one platform
  - Use OpenTelemetry for automatic instrumentation across all three pillars
  - Correlate distributed traces with metrics to pinpoint bottlenecks quickly
  - Explore eBPF for zero-instrumentation application performance monitoring
  - Resources: CNCF Observability Whitepaper, OpenTelemetry documentation
- Direction 3: AIOps
  - Train anomaly-detection models on Prometheus metrics (LSTM, Isolation Forest)
  - Automate root-cause analysis: from alert to root cause in under a minute
  - Capacity forecasting: predict the next 3 months of resource needs from history
  - Self-healing: trigger remediation scripts from alerts (scale out / restart / degrade)
  - Resources: Google SRE Workbook, 《智能运维:从0到1》

6.3 References

- Prometheus official documentation - the authoritative configuration and usage guide
- PromQL Cheat Sheet - quick PromQL reference
- Awesome Prometheus - curated resource list
- Grafana dashboard library - 10,000+ community-contributed dashboards
- Prometheus Operator documentation - deployment guide for Kubernetes
- Thanos official documentation - high availability and long-term storage
- 《Prometheus监控实战》 - open-source Chinese e-book
- CNCF Prometheus project - official project page

Appendix

A. Command cheat sheet
# Common Prometheus commands
prometheus --version  # show version
prometheus --config.file=prometheus.yml  # start with a given config file
promtool check config prometheus.yml  # validate config syntax
promtool check rules rules.yml  # validate alerting rules
promtool query instant http://localhost:9090 'up'  # instant query
promtool tsdb analyze /data/prometheus/data  # analyze TSDB data
# Common Alertmanager commands
alertmanager --version  # show version
amtool config routes show  # show routing config
amtool alert query  # list current alerts
amtool silence add alertname=Test  # add a silence
amtool silence expire <silence_id>  # expire a silence
# API operations
curl http://localhost:9090/api/v1/targets  # list targets
curl http://localhost:9090/api/v1/alerts  # list alerts
curl http://localhost:9090/api/v1/rules  # list rules
curl -XPOST http://localhost:9090/-/reload  # hot-reload config
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot  # create a snapshot
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones  # clean tombstones
# Common PromQL queries
up  # liveness of all instances
rate(http_requests_total[5m])  # per-second request rate
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # P99 latency
topk(10, count by (__name__)({__name__=~".+"}))  # top 10 metrics by series count
B. Configuration parameters

Prometheus startup flags:
--config.file: config file path, default prometheus.yml
--storage.tsdb.path: TSDB data directory
--storage.tsdb.retention.time: data retention, default 15d
--storage.tsdb.retention.size: max storage size; the oldest data is deleted beyond it
--web.listen-address: web UI listen address, default 0.0.0.0:9090
--web.enable-lifecycle: enable config reload via API (POST /-/reload)
--web.enable-admin-api: enable the admin API (snapshot, delete series, ...)
--query.timeout: query timeout, default 2m
--query.max-concurrency: max concurrent queries, default 20

PromQL functions:
rate(): per-second rate, for Counters
irate(): instantaneous rate; more sensitive but spiky
increase(): increase over a range
histogram_quantile(): quantile from histogram buckets
predict_linear(): linear extrapolation
avg_over_time(): average over a time window
delta(): first-to-last difference, for Gauges
deriv(): derivative, rate of change for Gauges
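rate() and increase() both correct for counter resets: when a sample is lower than its predecessor, the counter is assumed to have restarted from zero. A simplified sketch over raw (timestamp, value) samples (real Prometheus additionally extrapolates to the window boundaries, which is omitted here):

```python
def increase(samples):
    """samples: list of (timestamp_sec, counter_value) pairs.
    A drop between consecutive samples is treated as a counter reset:
    the new value itself is counted as the post-reset increase."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

def rate(samples):
    """Per-second rate over the sampled span (no boundary extrapolation)."""
    span = samples[-1][0] - samples[0][0]
    return increase(samples) / span

s = [(0, 100), (30, 10), (60, 70)]  # counter reset between t=0 and t=30
print(increase(s), rate(s))  # → 70.0 and ≈1.167/s
```

This reset handling is why rate() must always be applied to raw counters, never to values that have already been aggregated with sum().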
C. Glossary

| Term | English | Definition |
| --- | --- | --- |
| 时间序列 | Time Series | A stream of timestamp-value pairs uniquely identified by metric name and labels |
| 指标 | Metric | A monitored property, e.g. CPU usage or request count |
| 标签 | Label | A key-value pair identifying a dimension of a series, e.g. instance, job |
| 样本 | Sample | A single data point: a timestamp plus a float value |
| 目标 | Target | An endpoint to scrape, usually an HTTP URL |
| 抓取 | Scrape | Prometheus actively pulling metrics from a target |
| 联邦 | Federation | Hierarchical Prometheus setup where upper layers aggregate lower ones |
| 记录规则 | Recording Rule | Pre-computes a query and stores the result as a new series |
| 告警规则 | Alerting Rule | Fires alerts based on a PromQL expression |
| 导出器 | Exporter | Converts third-party system metrics into Prometheus format |
| 推送网关 | Pushgateway | Receives metrics pushed by short-lived jobs |
| 服务发现 | Service Discovery | Automatic target discovery (K8s, Consul, ...) |
| 基数 | Cardinality | The number of time series; high cardinality hurts performance |
| 墓碑 | Tombstone | A deletion marker; data is physically removed at compaction |
| TSDB | Time Series Database | Prometheus's storage engine |