Monitoring Metrics Framework
Core System Metrics (a command-line sampling sketch follows these lists)
CPU monitoring
- CPU utilization (overall and per core)
- CPU load averages (1-, 5-, and 15-minute)
- CPU context switches
- CPU interrupt counts
Memory monitoring
- Memory utilization and free memory
- Swap usage
- Memory fragmentation
- Cache and buffer usage
Disk monitoring
- Disk space utilization
- Disk I/O read/write throughput
- Disk queue length
- Filesystem inode usage
Network monitoring
- Per-interface traffic statistics
- Number of network connections
- Error packet statistics
- Network latency and packet loss
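Before any monitoring stack is in place, most of the system metrics above can be spot-checked directly from the command line. The following is a minimal sketch using standard tools (uptime, free, df, /proc files); exact field positions can vary by distribution, and the interface name is an assumption:
#!/bin/bash
# quick_metrics.sh - rough one-shot sample of core system metrics (illustrative only)
echo "Load averages: $(uptime | awk -F'load average:' '{print $2}')"
echo "Memory used %: $(free | awk '/Mem/ {printf("%.1f", $3/$2*100)}')"
echo "Fullest filesystem: $(df -P | awk 'NR>1 {print $5, $6}' | sort -rn | head -1)"
echo "Context switches since boot: $(awk '/^ctxt/ {print $2}' /proc/stat)"
echo "eth0 RX/TX bytes: $(awk '/eth0/ {print $2, $10}' /proc/net/dev)"   # interface name is an assumption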
Application-Layer Metrics (an HTTP probe sketch follows the lists below)
Process monitoring
- Liveness of critical processes
- Per-process CPU and memory usage
- Per-process file descriptor usage
- Process port listening status
Service monitoring
- Service response time
- Service availability checks
- Service error rate statistics
- Service connection pool status
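Service response time and availability can be probed with a plain HTTP check. The sketch below assumes a hypothetical /health endpoint on the target service; adjust the URL and timeout to your environment:
#!/bin/bash
# http_service_check.sh - probe response time and availability of one service (illustrative)
URL="${1:-http://localhost:8080/health}"   # hypothetical endpoint
TIMEOUT=5
response=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time "$TIMEOUT" "$URL")
code=$(echo "$response" | awk '{print $1}')
latency=$(echo "$response" | awk '{print $2}')
if [ "$code" != "200" ]; then
    echo "CRITICAL: $URL returned HTTP $code"
    exit 2
fi
echo "OK: $URL answered in ${latency}s"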
Alerting System Architecture Design
Monitoring Data Collection Layer
System-level monitoring tools
Use the Prometheus node_exporter to collect system metrics:
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
# Create a dedicated service account for the unit below (skip if it already exists)
sudo useradd --no-create-home --shell /bin/false prometheus 2>/dev/null || true
# Create the systemd service unit
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
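Once the unit is running, it is worth confirming that the exporter actually serves metrics on its default port (9100 unless overridden):
# Confirm the exporter is active and exposing metrics
systemctl status node_exporter --no-pager
curl -s http://localhost:9100/metrics | grep -m1 node_cpu_seconds_total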
Custom monitoring scripts
Create a shell script for system health checks:
#!/bin/bash
# system_health_check.sh
# Configuration file
CONFIG_FILE="/etc/monitoring/health_check.conf"
# Default thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=10
# Load overrides from the config file if present
if [ -f "$CONFIG_FILE" ]; then
source "$CONFIG_FILE"
fi
# Check CPU usage
check_cpu() {
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')   # user + system time from top's summary line
if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
echo "CRITICAL: CPU usage is ${cpu_usage}%"
return 2
elif (( $(echo "$cpu_usage > $((CPU_THRESHOLD - 10))" | bc -l) )); then
echo "WARNING: CPU usage is ${cpu_usage}%"
return 1
fi
return 0
}
# Check memory usage
check_memory() {
local memory_usage=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
echo "CRITICAL: Memory usage is ${memory_usage}%"
return 2
elif (( $(echo "$memory_usage > $((MEMORY_THRESHOLD - 10))" | bc -l) )); then
echo "WARNING: Memory usage is ${memory_usage}%"
return 1
fi
return 0
}
# Check disk usage
check_disk() {
local disk_usage=$(df -h | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{print $5}' | sed 's/%//g' | sort -n | tail -1)
if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
echo "CRITICAL: Disk usage is ${disk_usage}%"
return 2
elif [ "$disk_usage" -gt "$((DISK_THRESHOLD - 10))" ]; then
echo "WARNING: Disk usage is ${disk_usage}%"
return 1
fi
return 0
}
# Check system load
check_load() {
local load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$load_avg > $LOAD_THRESHOLD" | bc -l) )); then
echo "CRITICAL: System load is ${load_avg}"
return 2
elif (( $(echo "$load_avg > $((LOAD_THRESHOLD - 2))" | bc -l) )); then
echo "WARNING: System load is ${load_avg}"
return 1
fi
return 0
}
# Main check routine
main() {
local exit_code=0
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] Starting system health check..."
# Run each check
check_cpu
local cpu_result=$?
check_memory
local memory_result=$?
check_disk
local disk_result=$?
check_load
local load_result=$?
# Determine the overall status
if [ $cpu_result -eq 2 ] || [ $memory_result -eq 2 ] || [ $disk_result -eq 2 ] || [ $load_result -eq 2 ]; then
exit_code=2
elif [ $cpu_result -eq 1 ] || [ $memory_result -eq 1 ] || [ $disk_result -eq 1 ] || [ $load_result -eq 1 ]; then
exit_code=1
fi
echo "[$timestamp] Health check completed with exit code: $exit_code"
exit $exit_code
}
main "$@"
Alert Rule Configuration
Prometheus alert rules
Create an alert rules file:
# /etc/prometheus/rules/system_alerts.yml
groups:
- name: system_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"
  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space"
      description: "Disk space is below 10% on {{ $labels.instance }}"
  - alert: SystemLoadHigh
    expr: node_load1 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High system load"
      description: "System load is above 10 for more than 5 minutes on {{ $labels.instance }}"
  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service is down"
      description: "{{ $labels.instance }} has been down for more than 1 minute"
Automated Response Mechanisms
Response strategy categories
Preventive responses: act before a threshold is breached (e.g. scheduled cleanup)
Remedial responses: repair a fault that has already occurred (e.g. restarting a failed service)
Scaling responses: add capacity when load keeps climbing
Automation script implementation
Service auto-restart script
#!/bin/bash
# auto_restart_service.sh
SERVICE_NAME="$1"
LOG_FILE="/var/log/auto_restart.log"
MAX_RESTART_COUNT=3
RESTART_INTERVAL=60
# Check the service status
check_service_status() {
systemctl is-active --quiet "$SERVICE_NAME"
return $?
}
# Write a log entry
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
# Send notifications
send_notification() {
local message="$1"
local severity="$2"
# Email notification
echo "$message" | mail -s "Service Alert: $SERVICE_NAME" admin@company.com
# DingTalk notification (replace YOUR_TOKEN with the robot's access token)
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
-H 'Content-Type: application/json' \
-d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$message\"}}"
}
# Main restart loop
main() {
local restart_count=0
while [ $restart_count -lt $MAX_RESTART_COUNT ]; do
if check_service_status; then
log_message "Service $SERVICE_NAME is running normally"
exit 0
else
restart_count=$((restart_count + 1))
log_message "Attempting to restart $SERVICE_NAME (attempt $restart_count/$MAX_RESTART_COUNT)"
systemctl restart "$SERVICE_NAME"
sleep $RESTART_INTERVAL
if check_service_status; then
log_message "Successfully restarted $SERVICE_NAME"
send_notification "Service $SERVICE_NAME has been successfully restarted" "INFO"
exit 0
fi
fi
done
log_message "Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts"
send_notification "CRITICAL: Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts" "CRITICAL"
exit 1
}
main "$@"
Automatic disk space cleanup script
#!/bin/bash
# disk_cleanup.sh
CLEANUP_PATHS=(
"/var/log"
"/tmp"
"/var/tmp"
"/var/cache"
)
LOG_RETENTION_DAYS=7
TEMP_FILE_AGE=7
# Clean up log files
cleanup_logs() {
local log_path="$1"
find "$log_path" -name "*.log" -type f -mtime +$LOG_RETENTION_DAYS -delete
find "$log_path" -name "*.log.*" -type f -mtime +$LOG_RETENTION_DAYS -delete
}
# Clean up temporary files
cleanup_temp() {
local temp_path="$1"
find "$temp_path" -type f -mtime +$TEMP_FILE_AGE -delete
find "$temp_path" -type d -empty -delete
}
# Clean system caches
cleanup_cache() {
# Clean the package manager cache
if command -v apt-get &> /dev/null; then
apt-get clean
elif command -v yum &> /dev/null; then
yum clean all
fi
# Drop the kernel page cache (requires root; use with care on busy hosts)
sync && echo 3 > /proc/sys/vm/drop_caches
}
# Main cleanup routine
main() {
echo "Starting disk cleanup process..."
for path in "${CLEANUP_PATHS[@]}"; do
if [ -d "$path" ]; then
echo "Cleaning up $path..."
case "$path" in
"/var/log")
cleanup_logs "$path"
;;
"/tmp"|"/var/tmp")
cleanup_temp "$path"
;;
"/var/cache")
cleanup_cache
;;
esac
fi
done
echo "Disk cleanup completed"
}
main "$@"
Alert Manager Configuration
Alertmanager configuration
# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@company.com'
templates:
  - '/etc/alertmanager/templates/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@company.com'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@company.com'
        headers:
          Subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          CRITICAL Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:5001/webhook'   # custom webhook handler below; avoids Alertmanager's own port 9093
        send_resolved: true
  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@company.com'
        headers:
          Subject: 'WARNING Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          WARNING Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
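A syntax error in this file will keep Alertmanager from starting, so validating it first with amtool (shipped alongside Alertmanager) is a good habit:
# Validate the configuration, then apply it
amtool check-config /etc/alertmanager/alertmanager.yml
sudo systemctl restart alertmanager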
Third-Party Tool Integration
Webhook handler
Create a webhook handler that triggers automated responses:
#!/usr/bin/env python3
# webhook_handler.py
from flask import Flask, request, jsonify
import subprocess
import logging

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Map alert names to handler functions; the referenced scripts are expected
# to exist under /usr/local/bin (disk_cleanup.sh and auto_restart_service.sh are shown above)
AUTOMATION_MAPPING = {
    'HighCPUUsage': 'handle_high_cpu',
    'HighMemoryUsage': 'handle_high_memory',
    'DiskSpaceLow': 'handle_disk_space_low',
    'ServiceDown': 'handle_service_down'
}

def handle_high_cpu(alert_data):
    """Handle a high-CPU-usage alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high CPU usage for {instance}")
    # Run the CPU optimization script
    subprocess.run(['/usr/local/bin/cpu_optimization.sh', instance])
    return {"status": "success", "action": "cpu_optimization"}

def handle_high_memory(alert_data):
    """Handle a high-memory-usage alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high memory usage for {instance}")
    # Run the memory cleanup script
    subprocess.run(['/usr/local/bin/memory_cleanup.sh', instance])
    return {"status": "success", "action": "memory_cleanup"}

def handle_disk_space_low(alert_data):
    """Handle a low-disk-space alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling low disk space for {instance}")
    # Run the disk cleanup script
    subprocess.run(['/usr/local/bin/disk_cleanup.sh'])
    return {"status": "success", "action": "disk_cleanup"}

def handle_service_down(alert_data):
    """Handle a service-down alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    job = alert_data.get('labels', {}).get('job', '')
    logging.info(f"Handling service down for {job} on {instance}")
    # Run the service restart script
    subprocess.run(['/usr/local/bin/auto_restart_service.sh', job])
    return {"status": "success", "action": "service_restart"}

@app.route('/webhook', methods=['POST'])
def webhook():
    """Handle the Alertmanager webhook."""
    try:
        data = request.json
        alerts = data.get('alerts', [])
        responses = []
        for alert in alerts:
            alert_name = alert.get('labels', {}).get('alertname', '')
            if alert_name in AUTOMATION_MAPPING:
                handler_func = globals()[AUTOMATION_MAPPING[alert_name]]
                response = handler_func(alert)
                responses.append(response)
            else:
                logging.warning(f"No handler found for alert: {alert_name}")
        return jsonify({"responses": responses})
    except Exception as e:
        logging.error(f"Error processing webhook: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    # Listen on a port that does not clash with Alertmanager's default 9093
    app.run(host='0.0.0.0', port=5001)
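The handler can be exercised without waiting for a real alert by posting a payload in Alertmanager's webhook format. The curl call below simulates a ServiceDown notification; the instance and job labels are made up for the test:
curl -X POST http://localhost:5001/webhook \
  -H 'Content-Type: application/json' \
  -d '{"alerts": [{"labels": {"alertname": "ServiceDown", "instance": "web01:9100", "job": "nginx"}}]}'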
DingTalk Integration
#!/bin/bash
# send_dingtalk_alert.sh
WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
ALERT_TYPE="$1"
ALERT_MESSAGE="$2"
INSTANCE="$3"
# Set the message color based on alert severity
case "$ALERT_TYPE" in
"CRITICAL")
COLOR="red"
;;
"WARNING")
COLOR="yellow"
;;
"INFO")
COLOR="green"
;;
*)
COLOR="blue"
;;
esac
# Build the message payload
MESSAGE=$(cat <<EOF
{
"msgtype": "markdown",
"markdown": {
"title": "系统告警通知",
"text": "## 系统告警通知\n\n**告警级别**: <font color='$COLOR'>$ALERT_TYPE</font>\n\n**告警实例**: $INSTANCE\n\n**告警内容**: $ALERT_MESSAGE\n\n**告警时间**: $(date '+%Y-%m-%d %H:%M:%S')\n\n请及时处理相关问题。"
}
}
EOF
)
# Send the message
curl -X POST "$WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "$MESSAGE"
Monitoring Data Visualization
Grafana dashboard configuration
JSON configuration for a system monitoring dashboard:
{
"dashboard": {
"title": "Linux系统监控",
"panels": [
{
"title": "CPU使用率",
"type": "stat",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
},
{
"title": "内存使用率",
"type": "stat",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"
}
]
},
{
"title": "磁盘使用率",
"type": "stat",
"targets": [
{
"expr": "(node_filesystem_size_bytes{fstype!=\"tmpfs\"} - node_filesystem_avail_bytes{fstype!=\"tmpfs\"}) / node_filesystem_size_bytes{fstype!=\"tmpfs\"} * 100"
}
]
}
]
}
}
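A dashboard definition like this can be imported through the Grafana UI, or pushed with the HTTP API; the sketch below assumes Grafana on its default port, an API token with editor rights stored in GRAFANA_API_TOKEN, and the JSON saved as system_dashboard.json:
# Import the dashboard via the Grafana HTTP API
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @system_dashboard.json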
Performance Optimization and Tuning
Monitoring performance optimization
Data collection optimization
- Tune the scrape interval to balance accuracy against overhead
- Use data compression to reduce storage footprint
- Enforce a data retention policy
- Optimize query performance
Alerting optimization
- Set sensible thresholds to avoid false positives
- Implement alert inhibition
- Configure alert grouping rules
- Review and adjust the alerting policy regularly
System resource optimization
Memory management
#!/bin/bash
# memory_optimization.sh - memory tuning helper (run as root)
# Drop the page cache
sync && echo 1 > /proc/sys/vm/drop_caches
# Prefer reclaiming cache over swapping
echo 10 > /proc/sys/vm/swappiness
# Allow memory overcommit (this sets the overcommit policy, not reclaim behaviour; review before applying)
echo 1 > /proc/sys/vm/overcommit_memory
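Writes to /proc/sys only last until the next reboot. To persist the vm settings, the usual approach is a drop-in under /etc/sysctl.d; the values below mirror the script above and should be reviewed per workload:
# Persist the vm tuning across reboots
sudo tee /etc/sysctl.d/99-memory-tuning.conf > /dev/null <<'EOF'
vm.swappiness = 10
vm.overcommit_memory = 1
EOF
sudo sysctl --system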
Disk I/O optimization
#!/bin/bash
# disk_io_optimization.sh - disk I/O tuning helper (run as root)
# Select the I/O scheduler (on blk-mq kernels the no-op scheduler is named "none")
echo noop > /sys/block/sda/queue/scheduler 2>/dev/null || echo none > /sys/block/sda/queue/scheduler
# Remount the root filesystem without access-time updates
mount -o remount,noatime,nodiratime /
# Adjust the request queue depth
echo 32 > /sys/block/sda/queue/nr_requests
Fault Handling and Recovery
Fault classification
Hardware faults
- Automatic failover on disk failure
- Automatic recovery from network hardware faults
- Isolation of faulty memory
Software faults
- Automatic restart of crashed processes
- Service dependency checks
- Automatic restoration of configuration files
Network faults
- Automatic retry of network connections
- Automatic load balancer failover
- Handling of DNS resolution failures
Recovery strategy implementation
#!/bin/bash
# disaster_recovery.sh
BACKUP_DIR="/opt/backups"
CONFIG_BACKUP="$BACKUP_DIR/configs"
DATA_BACKUP="$BACKUP_DIR/data"
# Restore configuration files
restore_configs() {
echo "Restoring configuration files..."
# Restore system configuration from the backup tree
cp -r "$CONFIG_BACKUP"/etc/* /etc/
# Reload unit files after restoring service configuration
systemctl daemon-reload
# Restart the affected services
systemctl restart nginx
systemctl restart mysql
systemctl restart redis
}
# Restore data
restore_data() {
echo "Restoring data..."
# Restore the database (note: -p prompts interactively; use a credentials file for unattended runs)
mysql -u root -p < "$DATA_BACKUP/mysql_backup.sql"
# Restore file data
rsync -av "$DATA_BACKUP/files/" /var/www/html/
}
# Post-recovery health check
health_check() {
echo "Performing health check..."
# Check service status
systemctl status nginx
systemctl status mysql
systemctl status redis
# Check listening ports
netstat -tuln | grep :80
netstat -tuln | grep :3306
netstat -tuln | grep :6379
}
# Main recovery flow
main() {
echo "Starting disaster recovery process..."
restore_configs
restore_data
health_check
echo "Disaster recovery completed"
}
main "$@"
Best-Practice Summary
Monitoring strategy
Layered monitoring
- Infrastructure layer: hardware, operating system, network
- Application layer: services, processes, business metrics
- User-experience layer: response time, availability, error rate
Alerting strategy
- Set sensible alert thresholds
- Implement alert escalation
- Configure alert silencing and inhibition
- Review alerting effectiveness regularly
Automation principles
Incremental automation
- Start by automating simple tasks
- Gradually extend to more complex scenarios
- Preserve the ability for manual intervention
- Build in rollback mechanisms
Security considerations
- Principle of least privilege
- Audit logging of all operations
- Manual confirmation for critical actions
- Regular security reviews
Operations team collaboration
Documentation
- Detailed runbooks
- Incident-handling procedures
- System architecture documentation
- Emergency response plans
Training and skill development
- Regular technical training
- Failure drills
- Tool-usage training
- Sharing of best practices
Effective Linux system monitoring and automated response configuration is the foundation of a stable, reliable operations practice. Implementing the approach described above can significantly improve response times and the system's ability to heal itself.