
Applicable Scenarios & Prerequisites
Target teams: mid-to-large SRE/ops teams handling 50+ alerts per day that need severity tiering and on-call management
Prerequisites:
- Prometheus ≥ 2.30 (rule validation support), Alertmanager ≥ 0.24 (silence API v2 support)
- Open ports: Prometheus 9090, Alertmanager 9093
- Permissions: read/write access to the Prometheus rules directory (/etc/prometheus/rules/) and the ability to reload the config (systemctl reload prometheus)
- Network: Alertmanager must be able to reach the webhook receivers (e.g. WeCom/DingTalk/PagerDuty); a pre-flight check sketch follows this list
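A quick way to confirm these prerequisites on a host is a small pre-flight script. A minimal sketch, assuming default localhost endpoints; WEBHOOK_URL is a placeholder, not a real receiver:

```bash
#!/bin/bash
# Pre-flight check: versions, local readiness endpoints, webhook reachability.
WEBHOOK_URL="${WEBHOOK_URL:-https://receiver.example.com/webhook}"  # placeholder

# 1. Version sanity check (manual comparison against the matrix below)
prometheus --version 2>&1 | head -1
alertmanager --version 2>&1 | head -1

# 2. Readiness of the local Prometheus (9090) and Alertmanager (9093)
for port in 9090 9093; do
  curl -sf "http://localhost:${port}/-/ready" >/dev/null \
    && echo "port ${port}: ready" || echo "port ${port}: NOT ready"
done

# 3. Webhook connectivity (HEAD request; some receivers only accept POST)
curl -sfI --max-time 5 "$WEBHOOK_URL" >/dev/null \
  && echo "webhook reachable" || echo "webhook NOT reachable (may accept POST only)"
```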
Environment & Version Matrix
| Component | Version | OS Support | Minimum Resources |
| --- | --- | --- | --- |
| Prometheus | 2.30+ (2.45+ recommended) | RHEL 7/8, Ubuntu 20.04/22.04 | 2C/4G/50G SSD |
| Alertmanager | 0.24+ (0.26+ recommended) | same as above | 1C/2G/10G SSD |
| Rule validation tool | promtool (bundled) | same as above | none |
| Network | - | - | reachability to target SMTP/webhook ports |
Quick Checklist
- Validate existing alert rule syntax (promtool check)
- Define the alert severity labels (severity/team/service)
- Write the core alert rules (CPU/memory/disk/process/latency)
- Configure Alertmanager routing and receivers (label-based dispatch)
- Test alert firing and notification (induce a fault manually)
- Create planned silences (maintenance windows, test environments)
- Verify silences take effect (check Alertmanager logs and API)
- Configure inhibition rules (node-level failures suppress service-level alerts)
- Establish an alert-rule review process (Git management + CI validation)
- Set up alert history and auditing (persist to a database/log platform)
Implementation Steps
Step 1: Validate current rule syntax and performance impact
Pre-check: list the existing rule files
```bash
# RHEL/CentOS/Ubuntu
ls -lh /etc/prometheus/rules/*.yml
```
Syntax check (must pass before applying):
```bash
promtool check rules /etc/prometheus/rules/*.yml
```
Expected output:
```
Checking /etc/prometheus/rules/node.yml
SUCCESS: 8 rules found
Checking /etc/prometheus/rules/app.yml
SUCCESS: 12 rules found
```
Key parameters:
- promtool check rules: a static syntax check only; nothing is loaded into Prometheus
- On failure it prints the line number and an error message (e.g. unknown function or label mismatch)
Step 2: Define the alert severity label scheme
Create the label standard file /etc/prometheus/alert-labels.yaml (documentation only):
```yaml
# Alert severity (required)
severity:
  - critical  # P0, handle immediately, phone call
  - warning   # P1, respond within 15 minutes
  - info      # P2, handle during routine checks
# Owning team (required)
team:
  - sre
  - backend
  - frontend
  - dba
# Service identifier (required)
service:
  - api-gateway
  - user-service
  - order-service
  - mysql-cluster
```
Idempotency notes:
- Every alert rule must carry all three labels: severity/team/service (a validation sketch follows this list)
- Label values must stay consistent with the Alertmanager routing config
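To catch missing labels before anything reaches CI, a coarse file-level check is enough as a first pass. A sketch using grep; note it only checks that each file mentions each label key at least once, not that every individual rule carries them (a YAML-aware tool such as yq would be stricter):

```bash
#!/bin/bash
# Sketch: fail if any rule file lacks one of the required label keys.
rc=0
for f in /etc/prometheus/rules/*.yml; do
  for label in severity team service; do
    if ! grep -q "${label}:" "$f"; then
      echo "MISSING label '${label}' in $f"
      rc=1
    fi
  done
done
exit $rc
```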
Step 3: Write the core alert rules (node monitoring as the example)
Create /etc/prometheus/rules/node-alerts.yml:
```yaml
groups:
  - name: node-critical
    interval: 30s
    rules:
      # Sustained high CPU load
      - alert: NodeCPUHighLoad
        expr: |
          (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
        for: 3m
        labels:
          severity: warning
          team: sre
          service: infrastructure
        annotations:
          summary: "Node CPU usage sustained above 80%"
          description: 'Instance {{ $labels.instance }} CPU usage is {{ printf "%.1f" $value }}%, sustained for 3 minutes'
          runbook: "https://wiki.example.com/runbook/cpu-high"
      # Available memory below 10%
      - alert: NodeMemoryLow
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
          team: sre
          service: infrastructure
        annotations:
          summary: "Node available memory below 10%"
          description: 'Instance {{ $labels.instance }} has {{ printf "%.1f" $value }}% memory available; OOM kills are possible'
          runbook: "https://wiki.example.com/runbook/memory-low"
      # Root filesystem usage above 85%
      - alert: NodeDiskSpaceHigh
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/",fstype!~"tmpfs|fuse.*"}
            / node_filesystem_size_bytes{mountpoint="/",fstype!~"tmpfs|fuse.*"}) * 100 < 15
        for: 10m
        labels:
          severity: warning
          team: sre
          service: infrastructure
        annotations:
          summary: "Root filesystem free space below 15%"
          description: 'Instance {{ $labels.instance }} root filesystem has {{ printf "%.1f" $value }}% free'
      # Disk I/O busy time too high
      - alert: NodeDiskIOWaitHigh
        expr: |
          rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: sre
          service: infrastructure
        annotations:
          summary: "Disk I/O busy time above 80%"
          description: "Instance {{ $labels.instance }} device {{ $labels.device }} is {{ $value }}% I/O busy"
      # Critical process (e.g. kubelet) not running
      - alert: NodeProcessDown
        expr: |
          node_systemd_unit_state{name="kubelet.service",state="active"} != 1
        for: 1m
        labels:
          severity: critical
          team: sre
          service: kubernetes
        annotations:
          summary: "Critical process kubelet is not running"
          description: "kubelet.service on instance {{ $labels.instance }} is in an abnormal state"
```
Key parameters:
- for: 3m: the condition must hold continuously for this long, which filters out transient spikes
- rate(...[5m]): a 5-minute rate window smooths short-term fluctuation
- humanizePercentage formats a 0-1 ratio as a percentage (0.82 → 82%); the expressions above already yield 0-100 values, so the descriptions print the raw value via printf instead
Validate the rule syntax:
```bash
promtool check rules /etc/prometheus/rules/node-alerts.yml
```
Hot-reload the config (no restart needed):
```bash
# RHEL/CentOS/Ubuntu
systemctl reload prometheus
# or via the HTTP API (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```
Post-checks:
```bash
# Confirm the rule groups are loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# Inspect currently firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'
```
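Beyond syntax checks, firing behavior can be unit-tested with promtool test rules before a rule ships. A minimal sketch for NodeMemoryLow; the synthetic series values, instance name, and test file path are illustrative:

```bash
# Sketch: unit-test the NodeMemoryLow rule against synthetic series.
cat > /tmp/node-alerts-test.yml << 'EOF'
rule_files:
  - /etc/prometheus/rules/node-alerts.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      - series: 'node_memory_MemAvailable_bytes{instance="node-01:9100"}'
        values: '500000000+0x10'     # ~0.5 GB available, flat
      - series: 'node_memory_MemTotal_bytes{instance="node-01:9100"}'
        values: '8000000000+0x10'    # 8 GB total -> ~6% available
    alert_rule_test:
      - eval_time: 6m                # past the 5m "for" duration
        alertname: NodeMemoryLow
        exp_alerts:
          - exp_labels:
              severity: critical
              team: sre
              service: infrastructure
              instance: node-01:9100
EOF
promtool test rules /tmp/node-alerts-test.yml
```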
Step 4: Configure Alertmanager routing and receivers
Edit /etc/alertmanager/alertmanager.yml:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'YOUR_PASSWORD'

# Routing tree (label-based dispatch)
route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # initial aggregation wait for a new group
  group_interval: 5m     # interval between notifications for the same group
  repeat_interval: 4h    # re-notification interval for unresolved alerts
  routes:
    # P0 alerts: phone + WeCom
    - match:
        severity: critical
      receiver: 'oncall-phone'
      group_wait: 10s
      repeat_interval: 30m
    # DBA team alerts
    - match:
        team: dba
      receiver: 'dba-wechat'
    # Test environment alerts: email only
    - match_re:
        env: 'test|dev'
      receiver: 'test-email'
      group_interval: 1h

# Receivers
receivers:
  - name: 'default-email'
    email_configs:
      - to: 'sre-team@example.com'
        headers:
          # severity is not part of group_by, so GroupLabels never carries it;
          # CommonLabels does whenever all alerts in the group share it
          Subject: '[{{ .CommonLabels.severity }}] {{ .GroupLabels.alertname }}'
  - name: 'oncall-phone'
    webhook_configs:
      - url: 'https://api.pagerduty.com/webhook/xxx'
        send_resolved: true
    wechat_configs:
      - corp_id: 'YOUR_CORP_ID'
        to_party: '1'
        agent_id: 'YOUR_AGENT_ID'
        api_secret: 'YOUR_SECRET'
  - name: 'dba-wechat'
    wechat_configs:
      - corp_id: 'YOUR_CORP_ID'
        to_user: 'dba-team'
        agent_id: 'YOUR_AGENT_ID'
        api_secret: 'YOUR_SECRET'
  - name: 'test-email'
    email_configs:
      - to: 'dev-alerts@example.com'

# Inhibition rules (a node failure suppresses that node's container alerts)
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'NodeDown'
    target_match_re:
      severity: 'warning|info'
    equal: ['instance']
  - source_match:
      alertname: 'NodeMemoryLow'
    target_match:
      alertname: 'PodMemoryHigh'
    equal: ['instance']
```
Key parameters:
- group_by: alerts sharing these label values are merged into a single notification
- inhibit_rules: while the source alert fires, matching target alerts are suppressed (prevents alert cascades)
Validate the config syntax:
```bash
amtool check-config /etc/alertmanager/alertmanager.yml
```
Hot-reload the config:
```bash
# RHEL/CentOS/Ubuntu
systemctl reload alertmanager
# or via the HTTP API
curl -X POST http://localhost:9093/-/reload
```
Post-checks:
```bash
# List currently active alerts
amtool alert --alertmanager.url=http://localhost:9093
# Show the routing tree
amtool config routes --alertmanager.url=http://localhost:9093
```
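To confirm that a given label set would land at the intended receiver, amtool can also evaluate the routing tree offline against the config file; for example:

```bash
# Which receiver would a critical infra alert hit? (expected: oncall-phone)
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical team=sre service=infrastructure

# And a DBA warning? (expected: dba-wechat)
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  severity=warning team=dba service=mysql-cluster
```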
Step 5: Test alert firing and notification
Manually drive CPU load:
```bash
# Run on a monitored node (busy-loops for 5 minutes; start roughly one per core)
timeout 300 sh -c 'while true; do :; done' &
timeout 300 sh -c 'while true; do :; done' &
timeout 300 sh -c 'while true; do :; done' &
timeout 300 sh -c 'while true; do :; done' &
```
Verify the alert fires (after ~3 minutes):
```bash
# Check in Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname=="NodeCPUHighLoad")'
# Check in Alertmanager
amtool alert --alertmanager.url=http://localhost:9093 | grep NodeCPUHighLoad
```
Verify notification delivery:
```bash
# Tail the Alertmanager log
journalctl -u alertmanager -f | grep -E 'Notify|webhook|email'
```
Expected log line:
```
level=info msg="Notify success" receiver=oncall-phone integration=webhook[0]
```
Clean up the test processes:
```bash
pkill -f 'while true'
```
Step 6: Create planned silences (maintenance windows)
Scenario 1: silence all alerts for a specific instance during maintenance
```bash
# Silence the node-01 server for 2 hours
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="SRE-OnCall" \
  --comment="Planned maintenance: replacing a DIMM" \
  --duration=2h \
  instance=~"node-01:.*"
```
Scenario 2: silence all test-environment alerts
```bash
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Dev-Team" \
  --comment="Silencing test-env alerts during load testing" \
  --duration=4h \
  env="test"
```
Scenario 3: silence a specific alert name (e.g. health-check alerts during a deploy)
```bash
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Deploy-Pipeline" \
  --comment="Canary release: silencing health-check alerts" \
  --duration=30m \
  alertname="ServiceHealthCheckFailed" \
  service="user-service"
```
Post-checks:
```bash
# List all active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Query silences by label
amtool silence query instance=~"node-01.*"
```
Expected output:
```
ID                                    Matchers              Ends At                  Created By  Comment
8f4e2a1c-3b5d-4e8f-9c0a-1b2c3d4e5f6a  instance=~node-01:.*  2025-10-31 14:30:00 UTC  SRE-OnCall  Planned maintenance: replacing a DIMM
```
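For maintenance windows that recur on a schedule, the silence can be created automatically from cron. A minimal sketch; the matcher, window, and script path are illustrative:

```bash
#!/bin/bash
# /usr/local/bin/maintenance-silence.sh
# Creates the recurring weekly maintenance-window silence.
# Example crontab entry (Sundays 02:00):
#   0 2 * * 0 /usr/local/bin/maintenance-silence.sh >> /var/log/maintenance-silence.log 2>&1
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="cron-maintenance" \
  --comment="Recurring weekly maintenance window ($(date -u +%F))" \
  --duration=2h \
  instance=~"node-01:.*"
```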
Step 7: Manage silences in bulk via the API (automation)
Create a silence (JSON API):
```bash
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "NodeDown", "isRegex": false},
      {"name": "instance", "value": "node-02:9100", "isRegex": false}
    ],
    "startsAt": "2025-10-31T10:00:00Z",
    "endsAt": "2025-10-31T12:00:00Z",
    "createdBy": "automation-script",
    "comment": "Auto-created: scheduled maintenance window"
  }'
```
Expected output:
```
{"silenceID":"9a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d"}
```
Query silences (API):
```bash
curl -s http://localhost:9093/api/v2/silences | jq '.[] | {id: .id, comment: .comment, status: .status.state}'
```
Delete a silence (API):
```bash
SILENCE_ID="9a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d"
curl -X DELETE http://localhost:9093/api/v2/silence/$SILENCE_ID
```
Verify the silence works:
```bash
# Fire an alert that should be silenced, then confirm it shows as silenced instead of notifying
amtool alert --alertmanager.url=http://localhost:9093 --silenced
```
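A companion cleanup is useful when automation over-creates silences. A sketch that expires every active silence owned by the automation account; it assumes createdBy was set to automation-script as above:

```bash
#!/bin/bash
# Sketch: expire all active silences created by the automation account.
AM_URL="http://localhost:9093"
curl -s "$AM_URL/api/v2/silences" \
  | jq -r '.[] | select(.status.state=="active" and .createdBy=="automation-script") | .id' \
  | while read -r id; do
      echo "expiring silence $id"
      curl -s -X DELETE "$AM_URL/api/v2/silence/$id"
    done
```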
Step 8: Configure inhibition rules (prevent cascading alerts)
Edit the inhibit_rules section of /etc/alertmanager/alertmanager.yml:
```yaml
inhibit_rules:
  # Rule 1: when a node is down, suppress all Pod alerts from that node
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Pod.*'
    equal: ['instance']
  # Rule 2: when the database primary is down, suppress replica-lag alerts
  - source_match:
      alertname: 'MySQLMasterDown'
      service: 'mysql-cluster'
    target_match:
      alertname: 'MySQLReplicationLag'
    equal: ['cluster']
  # Rule 3: when the network is unreachable, suppress application-level timeout alerts
  - source_match:
      alertname: 'NetworkUnreachable'
    target_match_re:
      alertname: '.*Timeout'
    equal: ['instance']
```
Key parameters:
- source_match: the source alert that triggers the inhibition
- target_match_re: the target alerts to be suppressed (regex supported)
- equal: labels that must be identical on both sides (e.g. same instance/cluster)
Reload and verify:
```bash
systemctl reload alertmanager
# Confirm the inhibition rules are loaded (the v2 status API returns the raw config text)
curl -s http://localhost:9093/api/v2/status | jq -r '.config.original' | grep -A4 'inhibit_rules'
```
Step 9: Establish the alert-rule review process (Git + CI)
Put the rules under version control:
```bash
cd /etc/prometheus
git init
git add rules/*.yml prometheus.yml
git commit -m "Initial alert rules"
git remote add origin git@gitlab.example.com:sre/prometheus-config.git
git push -u origin main
```
Create the GitLab CI validation pipeline .gitlab-ci.yml:
```yaml
stages:
  - validate

validate-rules:
  stage: validate
  image:
    name: prom/prometheus:v2.45.0
    entrypoint: [""]   # override the prometheus entrypoint so `script` can run
  script:
    - promtool check rules rules/*.yml
    - promtool check config prometheus.yml
  only:
    - merge_requests
    - main
```
Configure a pre-commit hook (local validation):
```bash
cat > /etc/prometheus/.git/hooks/pre-commit << 'EOF'
#!/bin/bash
promtool check rules /etc/prometheus/rules/*.yml || {
  echo "Rule file syntax error; commit rejected"
  exit 1
}
EOF
chmod +x /etc/prometheus/.git/hooks/pre-commit
```
Verify the hook works:
```bash
# Deliberately introduce a syntax error
echo "invalid yaml" >> /etc/prometheus/rules/test.yml
git add rules/test.yml
git commit -m "test"  # should fail
```
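The same hook can be extended to cover the Prometheus main config and the Alertmanager config, assuming both live on the host and amtool is installed locally; a sketch:

```bash
#!/bin/bash
# Extended pre-commit sketch: validate rules, main config, and Alertmanager config.
set -e
promtool check rules /etc/prometheus/rules/*.yml
promtool check config /etc/prometheus/prometheus.yml
if command -v amtool >/dev/null; then
  amtool check-config /etc/alertmanager/alertmanager.yml
else
  echo "WARN: amtool not found, skipping Alertmanager config check"
fi
```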
Step 10: Configure alert history and auditing (persist to MySQL)
Install an Alertmanager webhook receiver (to relay alerts into the database):
```bash
# Using the open-source alertmanager-webhook-logger
docker run -d \
  --name alertmanager-logger \
  -p 9099:9099 \
  -e DATABASE_URL="mysql://alert_user:password@mysql-server:3306/alerts" \
  tomtom/alertmanager-webhook-logger:latest
```
Add the webhook receiver in Alertmanager:
```yaml
receivers:
  - name: 'audit-logger'
    webhook_configs:
      - url: 'http://localhost:9099/webhook'
        send_resolved: true

route:
  routes:
    - receiver: 'audit-logger'
      continue: true  # keep matching, so other receivers are still notified
```
Example MySQL schema:
```sql
CREATE TABLE alert_history (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  alert_name VARCHAR(128) NOT NULL,
  severity VARCHAR(32),
  instance VARCHAR(256),
  status VARCHAR(32),
  starts_at TIMESTAMP,
  ends_at TIMESTAMP,
  labels JSON,
  annotations JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_alert_name (alert_name),
  INDEX idx_starts_at (starts_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
Verify alerts are being written:
```bash
# Fire a test alert, then query
mysql -h mysql-server -u alert_user -p -e \
  "SELECT alert_name, severity, status, starts_at FROM alerts.alert_history ORDER BY starts_at DESC LIMIT 10;"
```
Monitoring & Alerting
Prometheus self-monitoring metrics
```promql
# Rule evaluation latency (should be < 1s)
prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 1
# Alertmanager notification failure rate (should be < 1%)
rate(alertmanager_notifications_failed_total[5m]) / rate(alertmanager_notifications_total[5m]) > 0.01
# Number of currently firing alerts (watch for abnormal spikes)
count(ALERTS{alertstate="firing"}) > 50
# Number of active silences (guard against over-silencing)
alertmanager_silences{state="active"} > 20
```
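These expressions are most useful wired up as "meta" alert rules that watch the monitoring stack itself. A sketch that persists two of them as a rule file; the team/service label values and thresholds mirror the scheme above but are illustrative:

```bash
# Sketch: persist the self-monitoring expressions as meta-alert rules.
cat > /etc/prometheus/rules/meta-alerts.yml << 'EOF'
groups:
  - name: meta-monitoring
    interval: 1m
    rules:
      - alert: RuleEvaluationSlow
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 1
        for: 10m
        labels: {severity: warning, team: sre, service: monitoring}
        annotations:
          summary: "Prometheus rule evaluation P99 above 1s"
      - alert: NotificationFailureRateHigh
        expr: rate(alertmanager_notifications_failed_total[5m]) / rate(alertmanager_notifications_total[5m]) > 0.01
        for: 10m
        labels: {severity: critical, team: sre, service: monitoring}
        annotations:
          summary: "Alertmanager notification failure rate above 1%"
EOF
promtool check rules /etc/prometheus/rules/meta-alerts.yml && systemctl reload prometheus
```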
Suggested Grafana panels
| Panel | PromQL Query | Threshold |
| --- | --- | --- |
| Firing alerts | count(ALERTS{alertstate="firing"}) | > 30 |
| Notification success rate | 1 - rate(alertmanager_notifications_failed_total[5m]) / rate(alertmanager_notifications_total[5m]) | < 0.99 (99%) |
| Rule evaluation latency P99 | prometheus_rule_evaluation_duration_seconds{quantile="0.99"} | > 2s |
| Alertmanager queue depth | alertmanager_notification_queue_length | > 100 |
Native Linux monitoring (without Prometheus)
```bash
# Check the last-modified time of the rule files
stat -c '%y %n' /etc/prometheus/rules/*.yml | sort
# Count alert-related log lines over the past hour
journalctl -u prometheus --since "1 hour ago" | grep -ci 'alert'
# Monitor Alertmanager process resource usage ([a]lertmanager avoids matching grep itself)
ps aux | grep [a]lertmanager | awk '{print $3, $4, $11}'  # CPU%, MEM%, CMD
```
Performance & Capacity
Rule evaluation benchmark
```bash
# Time the rule file checks
time promtool check rules /etc/prometheus/rules/*.yml
# Expectation: < 1s for 100 rules
```
Alert throughput test
Simulate a large batch of alerts (using amtool):
```bash
for i in {1..100}; do
  amtool alert add \
    --alertmanager.url=http://localhost:9093 \
    --annotation=summary="Test alert $i" \
    alertname="LoadTest" instance="test-$i" &
done
wait
# Measure Alertmanager query latency
time amtool alert query --alertmanager.url=http://localhost:9093
```
Expected figures:
- A single Alertmanager instance handles ~1,000 alerts/second
- Notification latency (group_wait + network): < 30s
Tuning parameters
Prometheus tuning (/etc/prometheus/prometheus.yml):
```yaml
global:
  evaluation_interval: 30s  # rule evaluation interval (lower = more responsive)
  scrape_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

# If rule evaluation is slow, raising Go runtime parallelism can help;
# set it as an environment variable for the service:
# export GOMAXPROCS=4
```
Alertmanager tuning (/etc/alertmanager/alertmanager.yml):
```yaml
global:
  resolve_timeout: 3m  # auto-resolve timeout (too short causes flapping resolve notices)

route:
  group_wait: 10s       # initial aggregation wait (lower = more responsive)
  group_interval: 5m    # interval between notifications for the same group
  repeat_interval: 4h   # re-notification interval (too short causes alert fatigue)
```
Kernel parameters (sysctl.conf):
```
# Raise connection backlog limits (for webhook-heavy notification bursts)
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
```
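To make these survive a reboot, drop them into /etc/sysctl.d/ and reload; for example:

```bash
# Apply the kernel parameters persistently via /etc/sysctl.d/
cat > /etc/sysctl.d/90-alertmanager.conf << 'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
EOF
sysctl --system            # reload all sysctl configuration files
sysctl net.core.somaxconn  # verify the new value took effect
```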
Security & Compliance
Access control
Prometheus basic auth (via an Nginx reverse proxy):
```nginx
server {
    listen 9090;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        # Bind Prometheus itself to localhost on another port
        # (e.g. --web.listen-address=127.0.0.1:9091) to avoid a port clash
        proxy_pass http://127.0.0.1:9091;
    }
}
```
Generate the password file:
```bash
# Ubuntu
sudo apt install apache2-utils
htpasswd -c /etc/nginx/.htpasswd admin
# RHEL/CentOS
sudo yum install httpd-tools
htpasswd -c /etc/nginx/.htpasswd admin
```
Webhook security
Alertmanager webhook authentication (signature verification is implemented on the receiver side):
```yaml
receivers:
  - name: 'secure-webhook'
    webhook_configs:
      - url: 'https://receiver.example.com/webhook'
        http_config:
          bearer_token: 'YOUR_SECRET_TOKEN'  # token-based authentication
          tls_config:
            insecure_skip_verify: false      # enforce TLS certificate verification
```
Audit logging
Alertmanager audit log configuration (via systemd):
```ini
# /etc/systemd/system/alertmanager.service
[Service]
# JSON log format makes the log easy to parse
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --log.level=info \
  --log.format=json
StandardOutput=append:/var/log/alertmanager/audit.log
```
Reload systemd:
```bash
systemctl daemon-reload
systemctl restart alertmanager
```
Audit query examples:
```bash
# Who created silences?
jq 'select(.msg=="silence created") | {author: .author, comment: .comment}' /var/log/alertmanager/audit.log
# Failed notification records
jq 'select(.msg=="Notify failed") | {receiver: .receiver, error: .error}' /var/log/alertmanager/audit.log
```
Compliance checkpoints
- [ ] Alert notifications use encrypted transport (HTTPS/TLS)
- [ ] Webhook receivers implement signature or token verification
- [ ] Silence operations record the creator and reason (--author and --comment are mandatory)
- [ ] Rule file changes are audited via Git (committer and timestamp)
- [ ] Sensitive labels (e.g. IPs/passwords) never appear in plaintext annotations
- [ ] Passwords in the Alertmanager config come from environment variables or encrypted storage
- [ ] Alert history is retained for at least 90 days (compliance requirement)
Common Failures & Troubleshooting
| Symptom | Diagnostic Command | Likely Root Cause | Quick Fix | Permanent Fix |
| --- | --- | --- | --- | --- |
| Alert rules not taking effect | promtool check rules /etc/prometheus/rules/*.yml | syntax error / file not loaded | fix the syntax and systemctl reload prometheus | CI validation + pre-commit hook |
| Alert never fires | curl http://localhost:9090/api/v1/alerts | "for" duration not met / metric missing | shorten "for" or check the exporter | tune rule thresholds / keep exporters healthy |
| Notification not sent | journalctl -u alertmanager -f | route match failure / receiver misconfigured | check labels against the route match conditions | test receiver connectivity / verify credentials |
| Too many duplicate notifications | amtool alert query | repeat_interval too short | raise repeat_interval to 4h | tier the interval by severity |
| Silence not taking effect | amtool silence query | inexact label match / time window not covering | use =~ regex matching / check the timezone | preview matchers with amtool silence query before relying on them |
| Alertmanager OOM | ps aux \| grep alertmanager | too many active alerts / stale history | raise the memory limit / expire old silences | enable aggregation / persist history externally |
| Alert storm | amtool alert \| wc -l | a rule threshold too low / monitoring anomaly | temporarily disable the rule or add a global silence | tune thresholds / add a "for" duration |
Change & Rollback Playbook
Maintenance window
Recommended time: 02:00-04:00 (business off-peak)
Pre-change conditions:
- [ ] All changes verified in the test environment
- [ ] Current config files backed up (Prometheus + Alertmanager)
- [ ] Maintenance-window silence created (covering all instances)
- [ ] On-call staff and business owners notified
Rollout strategy
Phase 1: canary (10% of instances)
```bash
# Apply the new rules on a single Prometheus instance
scp rules/new-alerts.yml prometheus-01:/etc/prometheus/rules/
ssh prometheus-01 "systemctl reload prometheus"
# Observe for 30 minutes and watch the false-positive rate
```
Phase 2: full rollout (remaining 90%)
```bash
# Roll out in bulk via Ansible
ansible prometheus -m copy -a "src=rules/ dest=/etc/prometheus/rules/"
ansible prometheus -m systemd -a "name=prometheus state=reloaded"
```
Health checks
```bash
# Prometheus rule load status
curl -s http://localhost:9090/api/v1/rules | jq '.status'
# Alertmanager cluster status
curl -s http://localhost:9093/api/v2/status | jq '.cluster.status'
# Notification success rate over the past hour (instant query via promtool)
promtool query instant http://localhost:9090 \
  '1 - rate(alertmanager_notifications_failed_total[1h]) / rate(alertmanager_notifications_total[1h])'
```
Rollback triggers and commands
Triggers:
- New rules push the false-positive rate above 20%
- Alert storm (more than 100 alerts fired within 5 minutes)
- Alertmanager notification failure rate above 10%
Rollback:
```bash
# Restore the backed-up rule files
cp /etc/prometheus/rules.backup/* /etc/prometheus/rules/
systemctl reload prometheus
# Restore the Alertmanager config
cp /etc/alertmanager/alertmanager.yml.backup /etc/alertmanager/alertmanager.yml
systemctl reload alertmanager
# Verify the rollback succeeded
promtool check rules /etc/prometheus/rules/*.yml
amtool config routes
```
Data/Config Backup & Restore Examples
Backup script:
```bash
#!/bin/bash
BACKUP_DIR="/backup/prometheus/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Back up rules and configs
cp -r /etc/prometheus/rules/ "$BACKUP_DIR/"
cp /etc/prometheus/prometheus.yml "$BACKUP_DIR/"
cp /etc/alertmanager/alertmanager.yml "$BACKUP_DIR/"
# Export the currently active silences
amtool silence query -o json > "$BACKUP_DIR/silences.json"
echo "Backup complete: $BACKUP_DIR"
```
Restore script:
```bash
#!/bin/bash
BACKUP_DIR=$1
if [ -z "$BACKUP_DIR" ]; then
  echo "Usage: $0 <backup-directory>"
  exit 1
fi
# Restore configs
cp -r "$BACKUP_DIR"/rules/* /etc/prometheus/rules/
cp "$BACKUP_DIR/prometheus.yml" /etc/prometheus/
cp "$BACKUP_DIR/alertmanager.yml" /etc/alertmanager/
# Reload services
systemctl reload prometheus
systemctl reload alertmanager
# Silences need manual review before re-import (see the sketch below)
echo "Silences saved at $BACKUP_DIR/silences.json; re-import manually"
```
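After review, the exported silences can be replayed through the v2 API. A sketch, assuming the JSON shape produced by amtool silence query -o json above; inspect the matchers before running:

```bash
#!/bin/bash
# Sketch: re-import active silences from a silences.json backup.
BACKUP_FILE=$1
AM_URL="http://localhost:9093"
jq -c '.[] | select(.status.state=="active")
        | {matchers, startsAt, endsAt, createdBy, comment}' "$BACKUP_FILE" \
| while read -r silence; do
    curl -s -X POST "$AM_URL/api/v2/silences" \
      -H "Content-Type: application/json" \
      -d "$silence"
    echo
  done
```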
Best Practices
- Enforce alert tiering: every rule must carry the severity/team/service labels, or CI rejects the merge
- Duration thresholds: set "for" to at least 3 minutes (keeps transient-blip false positives below 5%)
- Prefer notification aggregation: group_by should include at least alertname and cluster, to avoid one-message-per-alert floods
- Silence audit trail: every silence must set --author and --comment; review monthly
- Inhibition priority: node level > network level > application level, to prevent cascading alerts
- Mandatory runbooks: P0/P1 alerts must include a runbook link in annotations
- Alert-fatigue control: repeat_interval ≥ 4 hours per alert, ≥ 12 hours for P2
- Test-environment isolation: route via the env label so test alerts never interfere with production
- Periodic rule review: each quarter, prune rules that have not fired in 90 days
- Alert convergence ratio: through aggregation and inhibition, keep actual notifications ≤ 20% of raw alerts (a measurement sketch follows this list)
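The convergence ratio can be estimated from Alertmanager's own metrics. A sketch using instant queries, assuming Prometheus scrapes Alertmanager; it compares 24h notification volume against alerts received:

```bash
#!/bin/bash
# Sketch: estimate the 24h alert-convergence ratio (notifications / raw alerts received).
PROM="http://localhost:9090"
q() { curl -s "$PROM/api/v1/query" --data-urlencode "query=$1" \
        | jq -r '.data.result[0].value[1] // "0"'; }
notifications=$(q 'sum(increase(alertmanager_notifications_total[24h]))')
raw_alerts=$(q 'sum(increase(alertmanager_alerts_received_total[24h]))')
awk -v n="$notifications" -v a="$raw_alerts" \
    'BEGIN { if (a > 0) printf "convergence ratio: %.1f%%\n", 100*n/a;
             else print "no alerts received in the window" }'
```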
Appendix
Full production config sample
Prometheus rule file /etc/prometheus/rules/production.yml:
```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: sre
          service: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} unreachable"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes"
          runbook: "https://wiki.example.com/runbook/instance-down"
  - name: application
    interval: 1m
    rules:
      - alert: APIHighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
          service: api-gateway
        annotations:
          summary: "API P99 latency above 1 second"
          description: "Service {{ $labels.service }} P99 latency is {{ $value }}s"
          runbook: "https://wiki.example.com/runbook/api-latency"
      - alert: APIErrorRateHigh
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 3m
        labels:
          severity: critical
          team: backend
          service: api-gateway
        annotations:
          summary: "API 5xx error rate above 5%"
          description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
```
Ansible task snippet (bulk rule deployment)
```yaml
---
- name: Deploy Prometheus alert rules
  hosts: prometheus
  become: yes
  tasks:
    - name: Back up existing rules
      copy:
        src: /etc/prometheus/rules/
        dest: /etc/prometheus/rules.backup/
        remote_src: yes
    - name: Copy new rule files
      copy:
        src: files/prometheus-rules/
        dest: /etc/prometheus/rules/
        owner: prometheus
        group: prometheus
        mode: 0644
    - name: Validate rule syntax
      # shell (not command) so the wildcard expands
      shell: promtool check rules /etc/prometheus/rules/*.yml
      register: check_result
      failed_when: check_result.rc != 0
    - name: Reload Prometheus config
      systemd:
        name: prometheus
        state: reloaded
    - name: Wait for and verify rule load
      uri:
        url: http://localhost:9090/api/v1/rules
        return_content: yes
      register: rules_api
      failed_when: "'success' not in rules_api.content"
```
Alertmanager high availability (cluster mode)
The cluster settings are command-line flags rather than config-file options. Startup command for alertmanager-01 (peering with alertmanager-02/03):
```bash
/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-02:9094 \
  --cluster.peer=alertmanager-03:9094
```
Verify the cluster status:
```bash
curl -s http://localhost:9093/api/v2/status | jq '.cluster.peers[] | {name: .name, address: .address}'
```
Test environment: RHEL 8.8 / Ubuntu 22.04 LTS, Prometheus 2.45.0, Alertmanager 0.26.0
Test date: 2025-10-31
Maintenance cadence: rule files reviewed quarterly, Alertmanager config backed up monthly
Hopefully this detailed Prometheus alerting and silencing guide helps you build a more reliable production monitoring system. If you run into other issues during implementation, or want to share your own configuration experience, join the discussion in the Ops & Testing board of the 云栈社区 community.