找回密码
立即注册
搜索
热搜: Java Python Linux Go
发回帖 发新帖

428

积分

1

好友

48

主题
发表于 前天 14:40 | 查看: 7| 回复: 0

监控指标体系

系统核心指标

CPU监控

  • CPU使用率(整体和分核心)
  • CPU负载平均值(1分钟、5分钟、15分钟)
  • CPU上下文切换次数
  • CPU中断处理次数

内存监控

  • 内存使用率和剩余内存
  • Swap使用情况
  • 内存碎片化程度
  • 缓存和缓冲区使用情况

磁盘监控

  • 磁盘空间使用率
  • 磁盘I/O读写速率
  • 磁盘队列长度
  • 文件系统inode使用情况

网络监控

  • 网络接口流量统计
  • 网络连接数量
  • 网络错误包统计
  • 网络延迟和丢包率

应用层指标

进程监控

  • 关键进程存活状态
  • 进程CPU和内存占用
  • 进程文件描述符使用情况
  • 进程端口监听状态

服务监控

  • 服务响应时间
  • 服务可用性检查
  • 服务错误率统计
  • 服务连接池状态

告警系统架构设计

监控数据收集层

系统级监控工具

使用 Prometheus node_exporter收集系统指标:

# 安装node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# 创建systemd服务
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

自定义监控脚本

创建系统健康检查Shell脚本:

#!/bin/bash
# system_health_check.sh

# 配置文件
CONFIG_FILE="/etc/monitoring/health_check.conf"

# 默认阈值
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=10

# 加载配置
if [ -f "$CONFIG_FILE" ]; then
    source "$CONFIG_FILE"
fi

# 检查CPU使用率
check_cpu() {
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
    if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: CPU usage is ${cpu_usage}%"
        return 2
    elif (( $(echo "$cpu_usage > $((CPU_THRESHOLD - 10))" | bc -l) )); then
        echo "WARNING: CPU usage is ${cpu_usage}%"
        return 1
    fi
    return 0
}

# 检查内存使用率
check_memory() {
    local memory_usage=$(free | grep Mem | awk '{printf("%.2f"), $3/$2 * 100.0}')
    if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: Memory usage is ${memory_usage}%"
        return 2
    elif (( $(echo "$memory_usage > $((MEMORY_THRESHOLD - 10))" | bc -l) )); then
        echo "WARNING: Memory usage is ${memory_usage}%"
        return 1
    fi
    return 0
}

# 检查磁盘使用率
check_disk() {
    local disk_usage=$(df -h | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{print $5}' | sed 's/%//g' | sort -n | tail -1)
    if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
        echo "CRITICAL: Disk usage is ${disk_usage}%"
        return 2
    elif [ "$disk_usage" -gt "$((DISK_THRESHOLD - 10))" ]; then
        echo "WARNING: Disk usage is ${disk_usage}%"
        return 1
    fi
    return 0
}

# 检查系统负载
check_load() {
    local load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
    if (( $(echo "$load_avg > $LOAD_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: System load is ${load_avg}"
        return 2
    elif (( $(echo "$load_avg > $((LOAD_THRESHOLD - 2))" | bc -l) )); then
        echo "WARNING: System load is ${load_avg}"
        return 1
    fi
    return 0
}

# 主检查函数
main() {
    local exit_code=0
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')

    echo "[$timestamp] Starting system health check..."

    # 执行各项检查
    check_cpu
    local cpu_result=$?

    check_memory
    local memory_result=$?

    check_disk
    local disk_result=$?

    check_load
    local load_result=$?

    # 确定最终状态
    if [ $cpu_result -eq 2 ] || [ $memory_result -eq 2 ] || [ $disk_result -eq 2 ] || [ $load_result -eq 2 ]; then
        exit_code=2
    elif [ $cpu_result -eq 1 ] || [ $memory_result -eq 1 ] || [ $disk_result -eq 1 ] || [ $load_result -eq 1 ]; then
        exit_code=1
    fi

    echo "[$timestamp] Health check completed with exit code: $exit_code"
    exit $exit_code
}

main "$@"

告警规则配置

Prometheus告警规则

创建告警规则文件:

# /etc/prometheus/rules/system_alerts.yml
groups:
- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space"
      description: "Disk space is below 10% on {{ $labels.instance }}"

  - alert: SystemLoadHigh
    expr: node_load1 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High system load"
      description: "System load is above 10 for more than 5 minutes on {{ $labels.instance }}"

  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service is down"
      description: "{{ $labels.instance }} has been down for more than 1 minute"

自动化响应机制

响应策略分类

预防性响应

  • 资源预分配
  • 负载均衡调整
  • 缓存预热
  • 连接池扩容

修复性响应

  • 服务重启
  • 进程清理
  • 临时文件清理
  • 日志轮转

扩展性响应

  • 自动扩容
  • 资源迁移
  • 负载分流
  • 备份激活

自动化脚本实现

服务自动重启脚本

#!/bin/bash
# auto_restart_service.sh

SERVICE_NAME="$1"
LOG_FILE="/var/log/auto_restart.log"
MAX_RESTART_COUNT=3
RESTART_INTERVAL=60

# 检查服务状态
check_service_status() {
    systemctl is-active --quiet "$SERVICE_NAME"
    return $?
}

# 记录日志
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# 发送通知
send_notification() {
    local message="$1"
    local severity="$2"

    # 发送邮件通知
    echo "$message" | mail -s "Service Alert: $SERVICE_NAME" admin@company.com

    # 发送钉钉通知
    curl -X POST https://oapi.dingtalk.com/robot/send \
        -H 'Content-Type: application/json' \
        -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$message\"}}"
}

# 主要重启逻辑
main() {
    local restart_count=0

    while [ $restart_count -lt $MAX_RESTART_COUNT ]; do
        if check_service_status; then
            log_message "Service $SERVICE_NAME is running normally"
            exit 0
        else
            restart_count=$((restart_count + 1))
            log_message "Attempting to restart $SERVICE_NAME (attempt $restart_count/$MAX_RESTART_COUNT)"

            systemctl restart "$SERVICE_NAME"
            sleep $RESTART_INTERVAL

            if check_service_status; then
                log_message "Successfully restarted $SERVICE_NAME"
                send_notification "Service $SERVICE_NAME has been successfully restarted" "INFO"
                exit 0
            fi
        fi
    done

    log_message "Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts"
    send_notification "CRITICAL: Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts" "CRITICAL"
    exit 1
}

main "$@"

磁盘空间自动清理脚本

#!/bin/bash
# disk_cleanup.sh

CLEANUP_PATHS=(
    "/var/log"
    "/tmp"
    "/var/tmp"
    "/var/cache"
)

LOG_RETENTION_DAYS=7
TEMP_FILE_AGE=7

# 清理日志文件
cleanup_logs() {
    local log_path="$1"
    find "$log_path" -name "*.log" -type f -mtime +$LOG_RETENTION_DAYS -delete
    find "$log_path" -name "*.log.*" -type f -mtime +$LOG_RETENTION_DAYS -delete
}

# 清理临时文件
cleanup_temp() {
    local temp_path="$1"
    find "$temp_path" -type f -mtime +$TEMP_FILE_AGE -delete
    find "$temp_path" -type d -empty -delete
}

# 清理系统缓存
cleanup_cache() {
    # 清理包管理器缓存
    if command -v apt-get &> /dev/null; then
        apt-get clean
    elif command -v yum &> /dev/null; then
        yum clean all
    fi

    # 清理系统缓存
    sync && echo 3 > /proc/sys/vm/drop_caches
}

# 主清理函数
main() {
    echo "Starting disk cleanup process..."

    for path in "${CLEANUP_PATHS[@]}"; do
        if [ -d "$path" ]; then
            echo "Cleaning up $path..."
            case "$path" in
                "/var/log")
                    cleanup_logs "$path"
                    ;;
                "/tmp"|"/var/tmp")
                    cleanup_temp "$path"
                    ;;
                "/var/cache")
                    cleanup_cache
                    ;;
            esac
        fi
    done

    echo "Disk cleanup completed"
}

main "$@"

告警管理器配置

Alertmanager配置

# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@company.com'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@company.com'
        subject: 'Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@company.com'
        subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          CRITICAL Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:9093/webhook'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@company.com'
        subject: 'WARNING Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          WARNING Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

集成第三方工具

Webhook处理器

创建webhook处理器来触发自动化响应:

#!/usr/bin/env python3
# webhook_handler.py

from flask import Flask, request, jsonify
import subprocess
import json
import logging

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# 自动化响应映射
AUTOMATION_MAPPING = {
    'HighCPUUsage': 'handle_high_cpu',
    'HighMemoryUsage': 'handle_high_memory',
    'DiskSpaceLow': 'handle_disk_space_low',
    'ServiceDown': 'handle_service_down'
}

def handle_high_cpu(alert_data):
    """处理高CPU使用率告警"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high CPU usage for {instance}")

    # 执行CPU优化脚本
    subprocess.run(['/usr/local/bin/cpu_optimization.sh', instance])

    return {"status": "success", "action": "cpu_optimization"}

def handle_high_memory(alert_data):
    """处理高内存使用率告警"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high memory usage for {instance}")

    # 执行内存清理脚本
    subprocess.run(['/usr/local/bin/memory_cleanup.sh', instance])

    return {"status": "success", "action": "memory_cleanup"}

def handle_disk_space_low(alert_data):
    """处理磁盘空间不足告警"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling low disk space for {instance}")

    # 执行磁盘清理脚本
    subprocess.run(['/usr/local/bin/disk_cleanup.sh'])

    return {"status": "success", "action": "disk_cleanup"}

def handle_service_down(alert_data):
    """处理服务停止告警"""
    instance = alert_data.get('labels', {}).get('instance', '')
    job = alert_data.get('labels', {}).get('job', '')
    logging.info(f"Handling service down for {job} on {instance}")

    # 执行服务重启脚本
    subprocess.run(['/usr/local/bin/auto_restart_service.sh', job])

    return {"status": "success", "action": "service_restart"}

@app.route('/webhook', methods=['POST'])
def webhook():
    """处理Alertmanager webhook"""
    try:
        data = request.json
        alerts = data.get('alerts', [])

        responses = []
        for alert in alerts:
            alert_name = alert.get('labels', {}).get('alertname', '')

            if alert_name in AUTOMATION_MAPPING:
                handler_func = globals()[AUTOMATION_MAPPING[alert_name]]
                response = handler_func(alert)
                responses.append(response)
            else:
                logging.warning(f"No handler found for alert: {alert_name}")

        return jsonify({"responses": responses})

    except Exception as e:
        logging.error(f"Error processing webhook: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9093)

钉钉集成

#!/bin/bash
# send_dingtalk_alert.sh

WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
ALERT_TYPE="$1"
ALERT_MESSAGE="$2"
INSTANCE="$3"

# 根据告警类型设置颜色
case "$ALERT_TYPE" in
    "CRITICAL")
        COLOR="red"
        ;;
    "WARNING")
        COLOR="yellow"
        ;;
    "INFO")
        COLOR="green"
        ;;
    *)
        COLOR="blue"
        ;;
esac

# 构造消息
MESSAGE=$(cat <<EOF
{
    "msgtype": "markdown",
    "markdown": {
        "title": "系统告警通知",
        "text": "## 系统告警通知\n\n**告警级别**: <font color='$COLOR'>$ALERT_TYPE</font>\n\n**告警实例**: $INSTANCE\n\n**告警内容**: $ALERT_MESSAGE\n\n**告警时间**: $(date '+%Y-%m-%d %H:%M:%S')\n\n请及时处理相关问题。"
    }
}
EOF
)

# 发送消息
curl -X POST "$WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "$MESSAGE"

监控数据可视化

Grafana仪表板配置

创建系统监控仪表板的JSON配置:

{
  "dashboard": {
    "title": "Linux系统监控",
    "panels": [
      {
        "title": "CPU使用率",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        }
      },
      {
        "title": "内存使用率",
        "type": "stat",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"
          }
        ]
      },
      {
        "title": "磁盘使用率",
        "type": "stat",
        "targets": [
          {
            "expr": "(node_filesystem_size_bytes{fstype!=\"tmpfs\"} - node_filesystem_avail_bytes{fstype!=\"tmpfs\"}) / node_filesystem_size_bytes{fstype!=\"tmpfs\"} * 100"
          }
        ]
      }
    ]
  }
}

性能优化与调优

监控性能优化

数据采集优化

  • 调整采集间隔,平衡准确性和性能
  • 使用数据压缩减少存储空间
  • 实施数据保留策略
  • 优化查询语句性能

告警优化

  • 设置合理的告警阈值避免误报
  • 实施告警抑制机制
  • 配置告警聚合规则
  • 定期评估和调整告警策略

系统资源优化

内存管理

# 内存优化脚本
#!/bin/bash
# memory_optimization.sh

# 清理页面缓存
sync && echo 1 > /proc/sys/vm/drop_caches

# 调整swap使用策略
echo 10 > /proc/sys/vm/swappiness

# 优化内存回收
echo 1 > /proc/sys/vm/overcommit_memory

磁盘I/O优化

# 磁盘I/O优化脚本
#!/bin/bash
# disk_io_optimization.sh

# 调整I/O调度器
echo noop > /sys/block/sda/queue/scheduler

# 优化文件系统参数
mount -o remount,noatime,nodiratime /

# 调整磁盘队列深度
echo 32 > /sys/block/sda/queue/nr_requests

故障处理与恢复

故障分类处理

硬件故障

  • 磁盘故障自动切换
  • 网络故障自动恢复
  • 内存故障隔离处理

软件故障

  • 进程异常自动重启
  • 服务依赖关系检查
  • 配置文件自动恢复

网络故障

  • 网络连接自动重试
  • 负载均衡自动切换
  • DNS解析故障处理

恢复策略实现

#!/bin/bash
# disaster_recovery.sh

BACKUP_DIR="/opt/backups"
CONFIG_BACKUP="$BACKUP_DIR/configs"
DATA_BACKUP="$BACKUP_DIR/data"

# 配置文件恢复
restore_configs() {
    echo "Restoring configuration files..."

    # 恢复系统配置
    cp -r "$CONFIG_BACKUP"/etc/* /etc/

    # 恢复服务配置
    systemctl daemon-reload

    # 重启相关服务
    systemctl restart nginx
    systemctl restart mysql
    systemctl restart redis
}

# 数据恢复
restore_data() {
    echo "Restoring data..."

    # 恢复数据库
    mysql -u root -p < "$DATA_BACKUP/mysql_backup.sql"

    # 恢复文件数据
    rsync -av "$DATA_BACKUP/files/" /var/www/html/
}

# 系统健康检查
health_check() {
    echo "Performing health check..."

    # 检查服务状态
    systemctl status nginx
    systemctl status mysql
    systemctl status redis

    # 检查端口监听
    netstat -tuln | grep :80
    netstat -tuln | grep :3306
    netstat -tuln | grep :6379
}

# 主恢复流程
main() {
    echo "Starting disaster recovery process..."

    restore_configs
    restore_data
    health_check

    echo "Disaster recovery completed"
}

main "$@"

最佳实践总结

监控策略

分层监控

  • 基础设施层:硬件、操作系统、网络
  • 应用层:服务、进程、业务指标
  • 用户体验层:响应时间、可用性、错误率

告警策略

  • 设置合理的告警阈值
  • 实施告警升级机制
  • 配置告警静默和抑制
  • 定期评估告警效果

自动化原则

渐进式自动化

  • 从简单任务开始自动化
  • 逐步扩展到复杂场景
  • 保持人工介入能力
  • 建立回滚机制

安全性考虑

  • 权限最小化原则
  • 操作审计记录
  • 关键操作人工确认
  • 定期安全评估

运维团队协作

文档化

  • 详细的操作手册
  • 故障处理流程
  • 系统架构文档
  • 应急响应预案

培训与技能提升

  • 定期技术培训
  • 故障演练
  • 工具使用培训
  • 最佳实践分享

有效的Linux系统监控与自动化响应配置,是构建稳定、可靠运维体系的关键。通过实施上述方案,可以显著提升问题响应速度与系统自愈能力。




上一篇:Java反序列化漏洞黑盒挖掘实战:Fastjson与Shiro漏洞检测技巧
下一篇:Kubernetes安全最佳实践:生产环境容器全链路防护指南
您需要登录后才可以回帖 登录 | 立即注册

手机版|小黑屋|网站地图|云栈社区(YunPan.Plus) ( 苏ICP备2022046150号-2 )

GMT+8, 2025-12-7 02:33 , Processed in 0.074822 second(s), 39 queries , Gzip On.

Powered by Discuz! X3.5

© 2025-2025 CloudStack.

快速回复 返回顶部 返回列表