Monitoring Metrics Framework
Core System Metrics (a command-line sampling sketch follows these lists)
CPU monitoring
- CPU utilization (overall and per core)
- CPU load averages (1-, 5-, and 15-minute)
- CPU context switches
- CPU interrupt counts
Memory monitoring
- Memory utilization and free memory
- Swap usage
- Memory fragmentation
- Cache and buffer usage
Disk monitoring
- Disk space utilization
- Disk I/O read/write throughput
- Disk queue length
- Filesystem inode usage
Network monitoring
- Per-interface traffic statistics
- Number of network connections
- Error packet statistics
- Network latency and packet loss
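Before any monitoring stack is in place, most of the system metrics above can be spot-checked directly from the command line. The following is a minimal sketch using standard tools (uptime, free, df, /proc files); exact field positions can vary by distribution, and the interface name is an assumption:
#!/bin/bash
# quick_metrics.sh - rough one-shot sample of core system metrics (illustrative only)
echo "Load averages: $(uptime | awk -F'load average:' '{print $2}')"
echo "Memory used %: $(free | awk '/Mem/ {printf("%.1f", $3/$2*100)}')"
echo "Fullest filesystem: $(df -P | awk 'NR>1 {print $5, $6}' | sort -rn | head -1)"
echo "Context switches since boot: $(awk '/^ctxt/ {print $2}' /proc/stat)"
echo "eth0 RX/TX bytes: $(awk '/eth0/ {print $2, $10}' /proc/net/dev)"   # interface name is an assumption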
Application-Layer Metrics (an HTTP probe sketch follows the lists below)
Process monitoring
- Liveness of critical processes
- Per-process CPU and memory usage
- Per-process file descriptor usage
- Process port listening status
Service monitoring
- Service response time
- Service availability checks
- Service error rate statistics
- Service connection pool status
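Service response time and availability can be probed with a plain HTTP check. The sketch below assumes a hypothetical /health endpoint on the target service; adjust the URL and timeout to your environment:
#!/bin/bash
# http_service_check.sh - probe response time and availability of one service (illustrative)
URL="${1:-http://localhost:8080/health}"   # hypothetical endpoint
TIMEOUT=5
response=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time "$TIMEOUT" "$URL")
code=$(echo "$response" | awk '{print $1}')
latency=$(echo "$response" | awk '{print $2}')
if [ "$code" != "200" ]; then
    echo "CRITICAL: $URL returned HTTP $code"
    exit 2
fi
echo "OK: $URL answered in ${latency}s"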
Alerting System Architecture Design
Monitoring Data Collection Layer
System-level monitoring tools
Use the Prometheus node_exporter to collect system metrics:
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
# Create a dedicated service account for the unit below (skip if it already exists)
sudo useradd --no-create-home --shell /bin/false prometheus 2>/dev/null || true
# Create the systemd service unit
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
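Once the unit is running, it is worth confirming that the exporter actually serves metrics on its default port (9100 unless overridden):
# Confirm the exporter is active and exposing metrics
systemctl status node_exporter --no-pager
curl -s http://localhost:9100/metrics | grep -m1 node_cpu_seconds_total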
Custom monitoring scripts
Create a shell script for system health checks:
#!/bin/bash
# system_health_check.sh
# Configuration file
CONFIG_FILE="/etc/monitoring/health_check.conf"
# Default thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=10
# Load overrides from the config file if present
if [ -f "$CONFIG_FILE" ]; then
source "$CONFIG_FILE"
fi
# Check CPU usage
check_cpu() {
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')   # user + system time from top's summary line
if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
echo "CRITICAL: CPU usage is ${cpu_usage}%"
return 2
elif (( $(echo "$cpu_usage > $((CPU_THRESHOLD - 10))" | bc -l) )); then
echo "WARNING: CPU usage is ${cpu_usage}%"
return 1
fi
return 0
}
# Check memory usage
check_memory() {
local memory_usage=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
echo "CRITICAL: Memory usage is ${memory_usage}%"
return 2
elif (( $(echo "$memory_usage > $((MEMORY_THRESHOLD - 10))" | bc -l) )); then
echo "WARNING: Memory usage is ${memory_usage}%"
return 1
fi
return 0
}
# Check disk usage
check_disk() {
local disk_usage=$(df -h | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{print $5}' | sed 's/%//g' | sort -n | tail -1)
if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
echo "CRITICAL: Disk usage is ${disk_usage}%"
return 2
elif [ "$disk_usage" -gt "$((DISK_THRESHOLD - 10))" ]; then
echo "WARNING: Disk usage is ${disk_usage}%"
return 1
fi
return 0
}
# Check system load
check_load() {
local load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$load_avg > $LOAD_THRESHOLD" | bc -l) )); then
echo "CRITICAL: System load is ${load_avg}"
return 2
elif (( $(echo "$load_avg > $((LOAD_THRESHOLD - 2))" | bc -l) )); then
echo "WARNING: System load is ${load_avg}"
return 1
fi
return 0
}
# Main check routine
main() {
local exit_code=0
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] Starting system health check..."
# Run each check
check_cpu
local cpu_result=$?
check_memory
local memory_result=$?
check_disk
local disk_result=$?
check_load
local load_result=$?
# Determine the overall status
if [ $cpu_result -eq 2 ] || [ $memory_result -eq 2 ] || [ $disk_result -eq 2 ] || [ $load_result -eq 2 ]; then
exit_code=2
elif [ $cpu_result -eq 1 ] || [ $memory_result -eq 1 ] || [ $disk_result -eq 1 ] || [ $load_result -eq 1 ]; then
exit_code=1
fi
echo "[$timestamp] Health check completed with exit code: $exit_code"
exit $exit_code
}
main "$@"
Alert Rule Configuration
Prometheus alert rules
Create an alert rules file:
# /etc/prometheus/rules/system_alerts.yml
groups:
- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"
  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space"
      description: "Disk space is below 10% on {{ $labels.instance }}"
  - alert: SystemLoadHigh
    expr: node_load1 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High system load"
      description: "System load is above 10 for more than 5 minutes on {{ $labels.instance }}"
  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service is down"
      description: "{{ $labels.instance }} has been down for more than 1 minute"
Automated Response Mechanisms
Response strategy categories
Preventive responses: act before a threshold is breached (e.g. scheduled cleanup)
Remedial responses: repair a fault that has already occurred (e.g. restarting a failed service)
Scaling responses: add capacity when load keeps climbing
Automation script implementation
Service auto-restart script
#!/bin/bash
# auto_restart_service.sh
SERVICE_NAME="$1"
LOG_FILE="/var/log/auto_restart.log"
MAX_RESTART_COUNT=3
RESTART_INTERVAL=60
# Check the service status
check_service_status() {
systemctl is-active --quiet "$SERVICE_NAME"
return $?
}
# Write a log entry
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
# Send notifications
send_notification() {
local message="$1"
local severity="$2"
# Email notification
echo "$message" | mail -s "Service Alert: $SERVICE_NAME" admin@company.com
# DingTalk notification (replace YOUR_TOKEN with the robot's access token)
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
-H 'Content-Type: application/json' \
-d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$message\"}}"
}
# Main restart loop
main() {
local restart_count=0
while [ $restart_count -lt $MAX_RESTART_COUNT ]; do
if check_service_status; then
log_message "Service $SERVICE_NAME is running normally"
exit 0
else
restart_count=$((restart_count + 1))
log_message "Attempting to restart $SERVICE_NAME (attempt $restart_count/$MAX_RESTART_COUNT)"
systemctl restart "$SERVICE_NAME"
sleep $RESTART_INTERVAL
if check_service_status; then
log_message "Successfully restarted $SERVICE_NAME"
send_notification "Service $SERVICE_NAME has been successfully restarted" "INFO"
exit 0
fi
fi
done
log_message "Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts"
send_notification "CRITICAL: Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts" "CRITICAL"
exit 1
}
main "$@"
Automatic disk space cleanup script
#!/bin/bash
# disk_cleanup.sh
CLEANUP_PATHS=(
"/var/log"
"/tmp"
"/var/tmp"
"/var/cache"
)
LOG_RETENTION_DAYS=7
TEMP_FILE_AGE=7
# Clean up log files
cleanup_logs() {
local log_path="$1"
find "$log_path" -name "*.log" -type f -mtime +$LOG_RETENTION_DAYS -delete
find "$log_path" -name "*.log.*" -type f -mtime +$LOG_RETENTION_DAYS -delete
}
# Clean up temporary files
cleanup_temp() {
local temp_path="$1"
find "$temp_path" -type f -mtime +$TEMP_FILE_AGE -delete
find "$temp_path" -type d -empty -delete
}
# Clean system caches
cleanup_cache() {
# Clean the package manager cache
if command -v apt-get &> /dev/null; then
apt-get clean
elif command -v yum &> /dev/null; then
yum clean all
fi
# Drop the kernel page cache (requires root; use with care on busy hosts)
sync && echo 3 > /proc/sys/vm/drop_caches
}
# Main cleanup routine
main() {
echo "Starting disk cleanup process..."
for path in "${CLEANUP_PATHS[@]}"; do
if [ -d "$path" ]; then
echo "Cleaning up $path..."
case "$path" in
"/var/log")
cleanup_logs "$path"
;;
"/tmp"|"/var/tmp")
cleanup_temp "$path"
;;
"/var/cache")
cleanup_cache
;;
esac
fi
done
echo "Disk cleanup completed"
}
main "$@"
Alert Manager Configuration
Alertmanager configuration
# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@company.com'
templates:
  - '/etc/alertmanager/templates/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@company.com'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@company.com'
        headers:
          Subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          CRITICAL Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:5001/webhook'   # custom webhook handler below; avoids Alertmanager's own port 9093
        send_resolved: true
  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@company.com'
        headers:
          Subject: 'WARNING Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          WARNING Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
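A syntax error in this file will keep Alertmanager from starting, so validating it first with amtool (shipped alongside Alertmanager) is a good habit:
# Validate the configuration, then apply it
amtool check-config /etc/alertmanager/alertmanager.yml
sudo systemctl restart alertmanager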
Third-Party Tool Integration
Webhook handler
Create a webhook handler that triggers automated responses:
#!/usr/bin/env python3
# webhook_handler.py
from flask import Flask, request, jsonify
import subprocess
import logging

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Map alert names to handler functions; the referenced scripts are expected
# to exist under /usr/local/bin (disk_cleanup.sh and auto_restart_service.sh are shown above)
AUTOMATION_MAPPING = {
    'HighCPUUsage': 'handle_high_cpu',
    'HighMemoryUsage': 'handle_high_memory',
    'DiskSpaceLow': 'handle_disk_space_low',
    'ServiceDown': 'handle_service_down'
}

def handle_high_cpu(alert_data):
    """Handle a high-CPU-usage alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high CPU usage for {instance}")
    # Run the CPU optimization script
    subprocess.run(['/usr/local/bin/cpu_optimization.sh', instance])
    return {"status": "success", "action": "cpu_optimization"}

def handle_high_memory(alert_data):
    """Handle a high-memory-usage alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high memory usage for {instance}")
    # Run the memory cleanup script
    subprocess.run(['/usr/local/bin/memory_cleanup.sh', instance])
    return {"status": "success", "action": "memory_cleanup"}

def handle_disk_space_low(alert_data):
    """Handle a low-disk-space alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling low disk space for {instance}")
    # Run the disk cleanup script
    subprocess.run(['/usr/local/bin/disk_cleanup.sh'])
    return {"status": "success", "action": "disk_cleanup"}

def handle_service_down(alert_data):
    """Handle a service-down alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    job = alert_data.get('labels', {}).get('job', '')
    logging.info(f"Handling service down for {job} on {instance}")
    # Run the service restart script
    subprocess.run(['/usr/local/bin/auto_restart_service.sh', job])
    return {"status": "success", "action": "service_restart"}

@app.route('/webhook', methods=['POST'])
def webhook():
    """Handle the Alertmanager webhook."""
    try:
        data = request.json
        alerts = data.get('alerts', [])
        responses = []
        for alert in alerts:
            alert_name = alert.get('labels', {}).get('alertname', '')
            if alert_name in AUTOMATION_MAPPING:
                handler_func = globals()[AUTOMATION_MAPPING[alert_name]]
                response = handler_func(alert)
                responses.append(response)
            else:
                logging.warning(f"No handler found for alert: {alert_name}")
        return jsonify({"responses": responses})
    except Exception as e:
        logging.error(f"Error processing webhook: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    # Listen on a port that does not clash with Alertmanager's default 9093
    app.run(host='0.0.0.0', port=5001)
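The handler can be exercised without waiting for a real alert by posting a payload in Alertmanager's webhook format. The curl call below simulates a ServiceDown notification; the instance and job labels are made up for the test:
curl -X POST http://localhost:5001/webhook \
  -H 'Content-Type: application/json' \
  -d '{"alerts": [{"labels": {"alertname": "ServiceDown", "instance": "web01:9100", "job": "nginx"}}]}'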
DingTalk Integration
#!/bin/bash
# send_dingtalk_alert.sh
WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
ALERT_TYPE="$1"
ALERT_MESSAGE="$2"
INSTANCE="$3"
# Set the message color based on alert severity
case "$ALERT_TYPE" in
"CRITICAL")
COLOR="red"
;;
"WARNING")
COLOR="yellow"
;;
"INFO")
COLOR="green"
;;
*)
COLOR="blue"
;;
esac
# Build the message payload
MESSAGE=$(cat <<EOF
{
"msgtype": "markdown",
"markdown": {
"title": "系统告警通知",
"text": "## 系统告警通知\n\n**告警级别**: <font color='$COLOR'>$ALERT_TYPE</font>\n\n**告警实例**: $INSTANCE\n\n**告警内容**: $ALERT_MESSAGE\n\n**告警时间**: $(date '+%Y-%m-%d %H:%M:%S')\n\n请及时处理相关问题。"
}
}
EOF
)
# Send the message
curl -X POST "$WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "$MESSAGE"
Monitoring Data Visualization
Grafana dashboard configuration
JSON configuration for a system monitoring dashboard:
{
"dashboard": {
"title": "Linux系统监控",
"panels": [
{
"title": "CPU使用率",
"type": "stat",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
},
{
"title": "内存使用率",
"type": "stat",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"
}
]
},
{
"title": "磁盘使用率",
"type": "stat",
"targets": [
{
"expr": "(node_filesystem_size_bytes{fstype!=\"tmpfs\"} - node_filesystem_avail_bytes{fstype!=\"tmpfs\"}) / node_filesystem_size_bytes{fstype!=\"tmpfs\"} * 100"
}
]
}
]
}
}
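A dashboard definition like this can be imported through the Grafana UI, or pushed with the HTTP API; the sketch below assumes Grafana on its default port, an API token with editor rights stored in GRAFANA_API_TOKEN, and the JSON saved as system_dashboard.json:
# Import the dashboard via the Grafana HTTP API
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @system_dashboard.json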
Performance Optimization and Tuning
Monitoring performance optimization
Data collection optimization
- Tune the scrape interval to balance accuracy against overhead
- Use data compression to reduce storage footprint
- Enforce a data retention policy
- Optimize query performance
Alerting optimization
- Set sensible thresholds to avoid false positives
- Implement alert inhibition
- Configure alert grouping rules
- Review and adjust the alerting policy regularly
System resource optimization
Memory management
#!/bin/bash
# memory_optimization.sh - memory tuning helper (run as root)
# Drop the page cache
sync && echo 1 > /proc/sys/vm/drop_caches
# Prefer reclaiming cache over swapping
echo 10 > /proc/sys/vm/swappiness
# Allow memory overcommit (this sets the overcommit policy, not reclaim behaviour; review before applying)
echo 1 > /proc/sys/vm/overcommit_memory
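Writes to /proc/sys only last until the next reboot. To persist the vm settings, the usual approach is a drop-in under /etc/sysctl.d; the values below mirror the script above and should be reviewed per workload:
# Persist the vm tuning across reboots
sudo tee /etc/sysctl.d/99-memory-tuning.conf > /dev/null <<'EOF'
vm.swappiness = 10
vm.overcommit_memory = 1
EOF
sudo sysctl --system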
Disk I/O optimization
#!/bin/bash
# disk_io_optimization.sh - disk I/O tuning helper (run as root)
# Select the I/O scheduler (on blk-mq kernels the no-op scheduler is named "none")
echo noop > /sys/block/sda/queue/scheduler 2>/dev/null || echo none > /sys/block/sda/queue/scheduler
# Remount the root filesystem without access-time updates
mount -o remount,noatime,nodiratime /
# Adjust the request queue depth
echo 32 > /sys/block/sda/queue/nr_requests
Fault Handling and Recovery
Fault classification
Hardware faults
- Automatic failover on disk failure
- Automatic recovery from network hardware faults
- Isolation of faulty memory
Software faults
- Automatic restart of crashed processes
- Service dependency checks
- Automatic restoration of configuration files
Network faults
- Automatic retry of network connections
- Automatic load balancer failover
- Handling of DNS resolution failures
Recovery strategy implementation
#!/bin/bash
# disaster_recovery.sh
BACKUP_DIR="/opt/backups"
CONFIG_BACKUP="$BACKUP_DIR/configs"
DATA_BACKUP="$BACKUP_DIR/data"
# Restore configuration files
restore_configs() {
echo "Restoring configuration files..."
# Restore system configuration from the backup tree
cp -r "$CONFIG_BACKUP"/etc/* /etc/
# Reload unit files after restoring service configuration
systemctl daemon-reload
# Restart the affected services
systemctl restart nginx
systemctl restart mysql
systemctl restart redis
}
# Restore data
restore_data() {
echo "Restoring data..."
# Restore the database (note: -p prompts interactively; use a credentials file for unattended runs)
mysql -u root -p < "$DATA_BACKUP/mysql_backup.sql"
# Restore file data
rsync -av "$DATA_BACKUP/files/" /var/www/html/
}
# Post-recovery health check
health_check() {
echo "Performing health check..."
# Check service status
systemctl status nginx
systemctl status mysql
systemctl status redis
# Check listening ports
netstat -tuln | grep :80
netstat -tuln | grep :3306
netstat -tuln | grep :6379
}
# Main recovery flow
main() {
echo "Starting disaster recovery process..."
restore_configs
restore_data
health_check
echo "Disaster recovery completed"
}
main "$@"
Best-Practice Summary
Monitoring strategy
Layered monitoring
- Infrastructure layer: hardware, operating system, network
- Application layer: services, processes, business metrics
- User-experience layer: response time, availability, error rate
Alerting strategy
- Set sensible alert thresholds
- Implement alert escalation
- Configure alert silencing and inhibition
- Review alerting effectiveness regularly
Automation principles
Incremental automation
- Start by automating simple tasks
- Gradually extend to more complex scenarios
- Preserve the ability for manual intervention
- Build in rollback mechanisms
Security considerations
- Principle of least privilege
- Audit logging of all operations
- Manual confirmation for critical actions
- Regular security reviews
Operations team collaboration
Documentation
- Detailed runbooks
- Incident-handling procedures
- System architecture documentation
- Emergency response plans
Training and skill development
- Regular technical training
- Failure drills
- Tool-usage training
- Sharing of best practices
Effective Linux system monitoring and automated response configuration is the foundation of a stable, reliable operations practice. Implementing the approach described above can significantly improve response times and the system's ability to heal itself.