一、概述
1.1 背景介绍
凌晨3点,手机突然响起刺耳的告警铃声。睡眼惺忪地看了一眼:“MySQL主库CPU持续100%,已触发P1告警”。这种场景,相信每个DBA都经历过。
2021年双十一凌晨,某电商平台的MySQL主库CPU突然飙升到100%,订单系统完全卡死。当时团队花了40分钟才定位到问题——一个缺少索引的SQL在流量高峰时被大量执行。这40分钟的故障直接导致了数百万的损失。
从那之后,MySQL CPU 100%的排查流程就被固化下来,形成了一套标准化的应急响应机制。这套流程经过多次实战验证,能在5分钟内定位问题,10分钟内完成处置。
本文将完整呈现这套排查流程,包括紧急止血、根因定位、彻底修复三个阶段的具体操作步骤。
1.2 技术特点
MySQL CPU飙高的常见原因可以分为几大类:
SQL层面
- 慢查询大量堆积:缺索引、全表扫描、复杂关联
- 锁等待:行锁、表锁、元数据锁
- 大事务:长时间未提交的事务占用资源
系统层面
- 连接数暴涨:超过max_connections导致排队
- Buffer Pool不足:频繁的磁盘IO转化为CPU消耗
- 临时表过多:排序、分组操作创建大量临时表
硬件层面
- CPU核数不足:并发查询超过处理能力
- NUMA架构问题:内存访问不均衡
- 虚拟化overhead:云主机CPU争抢
MySQL 8.0/8.4在性能诊断方面有显著增强:
- Performance Schema增强:更细粒度的性能数据采集
- sys Schema:开箱即用的性能分析视图
- 错误日志改进:JSON格式,更易解析
- 资源组(Resource Groups):CPU资源隔离和限制
1.3 适用场景
| 场景 |
典型表现 |
紧急程度 |
| 业务高峰突发 |
CPU突然100%,响应超时 |
P1紧急 |
| 慢查询堆积 |
CPU缓慢上升,逐渐100% |
P2高 |
| 锁等待雪崩 |
CPU 100%伴随大量连接等待 |
P1紧急 |
| 定时任务冲突 |
每天固定时间CPU飙高 |
P3中 |
| 参数配置不当 |
长期CPU偏高 |
P4低 |
1.4 环境要求
| 组件 |
版本要求 |
说明 |
| MySQL Server |
8.0.35+ 或 8.4 LTS |
|
| 操作系统 |
Rocky 9 / Ubuntu 24.04 |
|
| 监控工具 |
Prometheus + Grafana |
实时监控 |
| 诊断工具 |
pt-tools 3.5+ |
Percona Toolkit |
| 压测工具 |
sysbench 1.0.20+ |
问题复现 |
二、详细步骤
2.1 准备工作
2.1.1 应急工具箱准备
在故障发生前,需要提前准备好诊断工具:
# 安装Percona Toolkit
sudo dnf install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
sudo percona-release setup pt
sudo dnf install -y percona-toolkit
# 安装系统诊断工具
sudo dnf install -y sysstat perf htop iotop
# 准备诊断脚本目录
sudo mkdir -p /opt/mysql-emergency
sudo chmod 755 /opt/mysql-emergency
# 准备紧急诊断账号
mysql -uroot -p << 'EOF'
CREATE USER 'emergency'@'localhost' IDENTIFIED BY 'EmergencyPass@2024';
GRANT SELECT, PROCESS, REPLICATION CLIENT ON *.* TO 'emergency'@'localhost';
GRANT SELECT ON performance_schema.* TO 'emergency'@'localhost';
GRANT SELECT ON sys.* TO 'emergency'@'localhost';
FLUSH PRIVILEGES;
EOF
2.1.2 一键诊断脚本
#!/bin/bash
# 文件:/opt/mysql-emergency/quick_diagnosis.sh
# 功能:MySQL CPU 100%快速诊断脚本
MYSQL_USER="emergency"
MYSQL_PASS="EmergencyPass@2024"
OUTPUT_DIR="/tmp/mysql_diag_$(date +%Y%m%d_%H%M%S)"
mkdir -p ${OUTPUT_DIR}
echo "========================================" | tee ${OUTPUT_DIR}/summary.txt
echo "MySQL Emergency Diagnosis Report" | tee -a ${OUTPUT_DIR}/summary.txt
echo "Time: $(date)" | tee -a ${OUTPUT_DIR}/summary.txt
echo "========================================" | tee -a ${OUTPUT_DIR}/summary.txt
# 1. 系统资源快照
echo -e "\n[1] System Resource Snapshot" | tee -a ${OUTPUT_DIR}/summary.txt
echo "--- CPU Usage ---" | tee -a ${OUTPUT_DIR}/summary.txt
top -bn1 | head -20 | tee -a ${OUTPUT_DIR}/summary.txt
echo -e "\n--- Memory Usage ---" | tee -a ${OUTPUT_DIR}/summary.txt
free -h | tee -a ${OUTPUT_DIR}/summary.txt
echo -e "\n--- MySQL Process ---" | tee -a ${OUTPUT_DIR}/summary.txt
ps aux | grep mysqld | grep -v grep | tee -a ${OUTPUT_DIR}/summary.txt
# 2. MySQL连接状态
echo -e "\n[2] MySQL Connection Status" | tee -a ${OUTPUT_DIR}/summary.txt
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "
SELECT
'Total Connections' AS metric, @@max_connections AS config_value,
(SELECT COUNT(*) FROM information_schema.PROCESSLIST) AS current_value
UNION ALL
SELECT
'Running Queries',
NULL,
(SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep')
UNION ALL
SELECT
'Threads Connected',
NULL,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Threads_connected')
UNION ALL
SELECT
'Threads Running',
NULL,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Threads_running');
" 2>/dev/null | tee -a ${OUTPUT_DIR}/summary.txt
# 3. 当前运行的SQL
echo -e "\n[3] Currently Running Queries (Top 20 by Time)" | tee -a ${OUTPUT_DIR}/summary.txt
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 200) AS QUERY
FROM information_schema.PROCESSLIST
WHERE COMMAND != 'Sleep' AND INFO IS NOT NULL
ORDER BY TIME DESC
LIMIT 20;
" 2>/dev/null | tee -a ${OUTPUT_DIR}/summary.txt
# 4. 锁等待情况
echo -e "\n[4] Lock Wait Status" | tee -a ${OUTPUT_DIR}/summary.txt
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_pid,
TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_pid,
b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
LIMIT 10;
" 2>/dev/null | tee -a ${OUTPUT_DIR}/summary.txt
# 5. 高消耗SQL统计
echo -e "\n[5] Top SQL by Total Time (Last Hour)" | tee -a ${OUTPUT_DIR}/summary.txt
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "
SELECT
DIGEST_TEXT,
COUNT_STAR AS exec_count,
ROUND(SUM_TIMER_WAIT/1000000000000, 2) AS total_time_sec,
ROUND(AVG_TIMER_WAIT/1000000000, 2) AS avg_time_ms,
SUM_ROWS_EXAMINED AS rows_examined,
SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE LAST_SEEN > DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10\G
" 2>/dev/null | tee -a ${OUTPUT_DIR}/summary.txt
# 6. 全表扫描SQL
echo -e "\n[6] Full Table Scan Queries" | tee -a ${OUTPUT_DIR}/summary.txt
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "
SELECT
DIGEST_TEXT,
COUNT_STAR AS exec_count,
SUM_NO_INDEX_USED AS no_index_count,
SUM_ROWS_EXAMINED AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_NO_INDEX_USED > 0
AND LAST_SEEN > DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY SUM_ROWS_EXAMINED DESC
LIMIT 10\G
" 2>/dev/null | tee -a ${OUTPUT_DIR}/summary.txt
# 7. InnoDB状态
echo -e "\n[7] InnoDB Status" | tee -a ${OUTPUT_DIR}/summary.txt
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "SHOW ENGINE INNODB STATUS\G" 2>/dev/null > ${OUTPUT_DIR}/innodb_status.txt
grep -A 20 "TRANSACTIONS" ${OUTPUT_DIR}/innodb_status.txt | tee -a ${OUTPUT_DIR}/summary.txt
echo -e "\n========================================" | tee -a ${OUTPUT_DIR}/summary.txt
echo "Diagnosis completed. Full report saved to: ${OUTPUT_DIR}" | tee -a ${OUTPUT_DIR}/summary.txt
echo "========================================" | tee -a ${OUTPUT_DIR}/summary.txt
# 打包诊断报告
cd /tmp && tar -czf mysql_diag_$(date +%Y%m%d_%H%M%S).tar.gz $(basename ${OUTPUT_DIR})
echo "Compressed report: /tmp/mysql_diag_*.tar.gz"
2.2 核心配置
2.2.1 紧急止血操作
当CPU已经100%时,首要任务是恢复服务,止血优先于定位根因。
第一步:快速识别问题SQL
-- 登录MySQL(可能需要多次尝试)
mysql -uroot -p
-- 1. 查看当前运行线程(按执行时间排序)
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 100) AS QUERY
FROM information_schema.PROCESSLIST
WHERE COMMAND != 'Sleep'
ORDER BY TIME DESC
LIMIT 30;
-- 2. 快速识别占用资源最多的SQL
-- 查看执行时间超过10秒的查询
SELECT
ID,
USER,
TIME,
STATE,
LEFT(INFO, 200) AS QUERY
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 10
ORDER BY TIME DESC;
第二步:终止问题SQL
-- 终止单个查询
KILL 12345; -- 12345是processlist中的ID
-- 批量终止某类SQL(谨慎使用)
-- 生成KILL语句
SELECT CONCAT('KILL ', ID, ';')
FROM information_schema.PROCESSLIST
WHERE USER = 'app_user'
AND COMMAND = 'Query'
AND TIME > 30
AND INFO LIKE '%problematic_table%';
-- 如果无法登录MySQL,通过shell强制终止
# 找到MySQL进程中最耗CPU的线程
ps -eLo pid,lwp,pcpu,comm | grep mysqld | sort -k3 -rn | head -10
# 使用pt-kill批量终止
pt-kill --user=root --password=xxx \
--busy-time=30 \
--match-info "SELECT.*FROM orders.*" \
--kill \
--print
第三步:限制新连接
-- 临时降低最大连接数,防止雪崩
SET GLOBAL max_connections = 100;
-- 如果是某个用户导致的问题,限制其连接
ALTER USER 'problem_user'@'%' WITH MAX_USER_CONNECTIONS 10;
-- 紧急情况下,临时禁用问题账号
ALTER USER 'problem_user'@'%' ACCOUNT LOCK;
2.2.2 根因定位流程
止血后,需要快速定位根本原因。
检查一:慢查询分析
-- 查看最近1小时执行最慢的SQL
SELECT
SCHEMA_NAME,
DIGEST_TEXT,
COUNT_STAR AS exec_count,
ROUND(SUM_TIMER_WAIT/1000000000000, 2) AS total_sec,
ROUND(AVG_TIMER_WAIT/1000000000, 2) AS avg_ms,
ROUND(MAX_TIMER_WAIT/1000000000, 2) AS max_ms,
SUM_ROWS_EXAMINED,
SUM_ROWS_SENT,
SUM_NO_INDEX_USED
FROM performance_schema.events_statements_summary_by_digest
WHERE LAST_SEEN > DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20\G
-- 分析具体SQL的执行计划
EXPLAIN FORMAT=JSON SELECT ... ; -- 填入问题SQL
检查二:锁等待分析
-- MySQL 8.0+ 使用performance_schema
SELECT
waiting.THREAD_ID AS waiting_thread,
waiting.EVENT_ID AS waiting_event,
waiting.OBJECT_SCHEMA AS db,
waiting.OBJECT_NAME AS table_name,
waiting.LOCK_TYPE AS waiting_lock_type,
waiting.LOCK_MODE AS waiting_lock_mode,
blocking.THREAD_ID AS blocking_thread,
blocking.LOCK_TYPE AS blocking_lock_type,
blocking.LOCK_MODE AS blocking_lock_mode
FROM performance_schema.data_lock_waits waits
JOIN performance_schema.data_locks waiting ON waits.REQUESTING_ENGINE_LOCK_ID = waiting.ENGINE_LOCK_ID
JOIN performance_schema.data_locks blocking ON waits.BLOCKING_ENGINE_LOCK_ID = blocking.ENGINE_LOCK_ID;
-- 查看长事务
SELECT
trx_id,
trx_state,
trx_started,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS running_sec,
trx_rows_locked,
trx_rows_modified,
trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 10;
-- 查看元数据锁(MDL)
SELECT
OBJECT_TYPE,
OBJECT_SCHEMA,
OBJECT_NAME,
LOCK_TYPE,
LOCK_DURATION,
LOCK_STATUS,
OWNER_THREAD_ID
FROM performance_schema.metadata_locks
WHERE OBJECT_SCHEMA NOT IN ('performance_schema', 'information_schema', 'mysql', 'sys');
检查三:连接数分析
-- 按用户统计连接数
SELECT
USER,
HOST,
COUNT(*) AS connections,
SUM(IF(COMMAND = 'Sleep', 1, 0)) AS sleeping,
SUM(IF(COMMAND != 'Sleep', 1, 0)) AS active
FROM information_schema.PROCESSLIST
GROUP BY USER, HOST
ORDER BY connections DESC;
-- 按数据库统计
SELECT
DB,
COUNT(*) AS connections,
SUM(IF(COMMAND = 'Sleep', 1, 0)) AS sleeping
FROM information_schema.PROCESSLIST
WHERE DB IS NOT NULL
GROUP BY DB
ORDER BY connections DESC;
-- 检查连接数变化趋势
SHOW STATUS LIKE 'Threads_%';
SHOW STATUS LIKE 'Connections';
SHOW STATUS LIKE 'Max_used_connections%';
检查四:InnoDB引擎状态
-- 完整InnoDB状态
SHOW ENGINE INNODB STATUS\G
-- 关键指标检查
SELECT
NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME IN (
'buffer_pool_reads',
'buffer_pool_read_requests',
'buffer_pool_write_requests',
'buffer_pool_pages_dirty',
'lock_deadlocks',
'lock_timeouts',
'trx_rseg_history_len'
);
-- Buffer Pool命中率
SELECT
(1 - (
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')
)) * 100 AS buffer_pool_hit_rate;
2.3 启动和验证
2.3.1 常见场景处理方案
场景一:缺少索引导致全表扫描
-- 症状:某SQL执行时间长,rows_examined远大于rows_sent
-- 定位问题SQL
SELECT
DIGEST_TEXT,
COUNT_STAR,
SUM_ROWS_EXAMINED,
SUM_ROWS_SENT,
SUM_NO_INDEX_USED
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_NO_INDEX_USED > 0
ORDER BY SUM_ROWS_EXAMINED DESC
LIMIT 5\G
-- 分析执行计划
EXPLAIN SELECT * FROM orders WHERE status = 1 AND created_at > '2024-01-01';
/*
type: ALL -- 全表扫描
key: NULL -- 未使用索引
rows: 1000000 -- 扫描100万行
*/
-- 解决方案:添加合适的索引
CREATE INDEX idx_status_created ON orders(status, created_at);
-- 验证
EXPLAIN SELECT * FROM orders WHERE status = 1 AND created_at > '2024-01-01';
/*
type: range
key: idx_status_created
rows: 10000
*/
场景二:大量锁等待
-- 症状:大量线程处于Lock wait状态
-- 找到锁源头
SELECT
b.trx_mysql_thread_id AS blocking_pid,
b.trx_query AS blocking_query,
b.trx_started,
TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_sec,
COUNT(r.trx_id) AS waiting_count
FROM information_schema.INNODB_TRX b
JOIN information_schema.INNODB_LOCK_WAITS w ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
GROUP BY b.trx_mysql_thread_id, b.trx_query, b.trx_started
ORDER BY waiting_count DESC;
-- 解决方案
-- 方案1:终止锁源头事务
KILL 12345;
-- 方案2:如果是应用bug导致事务未提交,检查应用日志
-- 方案3:优化事务粒度,减少锁持有时间
场景三:连接数暴涨
-- 症状:Too many connections错误
-- 检查连接来源
SELECT
SUBSTRING_INDEX(HOST, ':', 1) AS client_ip,
USER,
COUNT(*) AS connections
FROM information_schema.PROCESSLIST
GROUP BY client_ip, USER
ORDER BY connections DESC
LIMIT 20;
-- 解决方案
-- 方案1:临时增加连接数
SET GLOBAL max_connections = 2000;
-- 方案2:清理空闲连接
pt-kill --user=root --password=xxx \
--idle-time=300 \
--kill \
--print
-- 方案3:应用层增加连接池配置
-- 方案4:检查是否有连接泄漏
场景四:临时表/排序消耗
-- 症状:大量查询处于Creating sort index状态
-- 检查临时表使用
SHOW STATUS LIKE 'Created_tmp%';
/*
Created_tmp_disk_tables: 10000 -- 磁盘临时表过多
Created_tmp_tables: 15000
*/
-- 定位问题SQL
SELECT
DIGEST_TEXT,
SUM_CREATED_TMP_TABLES,
SUM_CREATED_TMP_DISK_TABLES,
SUM_SORT_ROWS
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_CREATED_TMP_DISK_TABLES > 0
ORDER BY SUM_CREATED_TMP_DISK_TABLES DESC
LIMIT 10\G
-- 解决方案
-- 方案1:增大临时表内存
SET GLOBAL tmp_table_size = 256*1024*1024;
SET GLOBAL max_heap_table_size = 256*1024*1024;
-- 方案2:优化SQL,添加索引避免临时表
-- 方案3:改用覆盖索引
2.3.2 系统层面排查
#!/bin/bash
# 文件:/opt/mysql-emergency/system_check.sh
# 功能:系统层面的MySQL问题排查
echo "=== CPU Usage ==="
# 查看CPU整体使用率
mpstat -P ALL 1 3
# 查看MySQL进程CPU使用
pidstat -p $(pgrep mysqld) 1 5
echo -e "\n=== MySQL Thread CPU ==="
# 找出最耗CPU的MySQL线程
ps -eLo pid,lwp,pcpu,comm | grep mysqld | sort -k3 -rn | head -20
echo -e "\n=== Memory Usage ==="
# 内存使用情况
free -h
# MySQL进程内存
ps aux | grep mysqld | grep -v grep | awk '{print "RSS:", $6/1024, "MB"}'
echo -e "\n=== IO Status ==="
# 磁盘IO情况
iostat -x 1 3
echo -e "\n=== Network Connections ==="
# MySQL连接数
ss -tn state established '( dport = :3306 or sport = :3306 )' | wc -l
# 连接来源统计
ss -tn state established '( dport = :3306 )' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
echo -e "\n=== System Load ==="
uptime
echo -e "\n=== Perf Top (5 seconds) ==="
# CPU热点分析
timeout 5 perf top -p $(pgrep mysqld) 2>/dev/null || echo "perf not available or permission denied"
2.3.3 验证恢复
-- 确认CPU恢复正常后的检查清单
-- 1. 检查线程运行状态
SELECT
'Running Threads' AS metric,
VARIABLE_VALUE AS value
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Threads_running'
UNION ALL
SELECT
'Threads Connected',
VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Threads_connected';
-- 2. 检查无长时间运行的查询
SELECT COUNT(*) AS long_running_queries
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 30;
-- 3. 检查无锁等待
SELECT COUNT(*) AS lock_waits
FROM performance_schema.data_lock_waits;
-- 4. 检查InnoDB状态正常
SELECT
NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME IN ('lock_deadlocks', 'lock_timeouts', 'trx_rseg_history_len');
-- 5. 检查慢查询趋势
SELECT
FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(LAST_SEEN)/300)*300) AS time_bucket,
COUNT(*) AS query_types,
SUM(COUNT_STAR) AS total_queries,
ROUND(SUM(SUM_TIMER_WAIT)/1000000000000, 2) AS total_time_sec
FROM performance_schema.events_statements_summary_by_digest
WHERE LAST_SEEN > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY time_bucket
ORDER BY time_bucket;
三、示例代码和配置
3.1 完整配置示例
3.1.1 预防性监控配置
# Prometheus告警规则
# /etc/prometheus/rules/mysql_cpu.yml
groups:
- name: mysql_cpu_alerts
rules:
# CPU使用率告警
- alert: MySQLHighCPU
expr: |
(
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
) * on(instance) group_left mysql_up > 80
for: 2m
labels:
severity: warning
annotations:
summary: "MySQL服务器CPU使用率高"
description: "实例{{ $labels.instance }} CPU使用率 {{ $value | printf \"%.1f\" }}%"
- alert: MySQLCriticalCPU
expr: |
(
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
) * on(instance) group_left mysql_up > 95
for: 1m
labels:
severity: critical
annotations:
summary: "MySQL服务器CPU严重过载"
description: "实例{{ $labels.instance }} CPU使用率 {{ $value | printf \"%.1f\" }}%"
# 运行线程数告警
- alert: MySQLHighThreadsRunning
expr: mysql_global_status_threads_running > 50
for: 1m
labels:
severity: warning
annotations:
summary: "MySQL运行线程数过高"
description: "实例{{ $labels.instance }}运行线程数{{ $value }}"
# 慢查询告警
- alert: MySQLSlowQueriesHigh
expr: rate(mysql_global_status_slow_queries[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL慢查询过多"
description: "实例{{ $labels.instance }}每秒{{ $value | printf \"%.1f\" }}个慢查询"
# 锁等待告警
- alert: MySQLLockWaits
expr: mysql_global_status_innodb_row_lock_waits > 10
for: 2m
labels:
severity: warning
annotations:
summary: "MySQL锁等待频繁"
description: "实例{{ $labels.instance }}行锁等待{{ $value }}次"
3.1.2 自动止血脚本
#!/bin/bash
# 文件:/opt/mysql-emergency/auto_killer.sh
# 功能:自动终止问题查询(需谨慎启用)
# 建议:仅在明确了解风险后启用
MYSQL_USER="emergency"
MYSQL_PASS="EmergencyPass@2024"
LOG_FILE="/var/log/mysql_auto_kill.log"
# 配置阈值
MAX_QUERY_TIME=300 # 最大执行时间(秒)
MAX_LOCK_WAIT_TIME=60 # 最大锁等待时间(秒)
CPU_THRESHOLD=95 # CPU触发阈值
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a ${LOG_FILE}
}
# 检查CPU使用率
check_cpu() {
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
echo ${cpu_usage%.*}
}
# 终止长时间运行的查询
kill_long_queries() {
log "Checking for long running queries..."
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -N -e "
SELECT ID FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query'
AND TIME > ${MAX_QUERY_TIME}
AND USER NOT IN ('system user', 'event_scheduler')
AND INFO NOT LIKE '%PROCESSLIST%'
" 2>/dev/null | while read pid; do
log "Killing query PID: ${pid}"
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "KILL ${pid};" 2>/dev/null
done
}
# 终止长时间锁等待
kill_lock_waits() {
log "Checking for lock waits..."
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -N -e "
SELECT r.trx_mysql_thread_id
FROM information_schema.INNODB_TRX r
WHERE r.trx_state = 'LOCK WAIT'
AND TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) > ${MAX_LOCK_WAIT_TIME}
" 2>/dev/null | while read pid; do
log "Killing lock wait PID: ${pid}"
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -e "KILL ${pid};" 2>/dev/null
done
}
# 主循环
main() {
log "Auto killer started"
while true; do
cpu=$(check_cpu)
if [ ${cpu} -gt ${CPU_THRESHOLD} ]; then
log "CPU usage ${cpu}% exceeds threshold ${CPU_THRESHOLD}%"
kill_long_queries
kill_lock_waits
fi
sleep 10
done
}
main
3.1.3 资源组限制(MySQL 8.0+)
-- 创建资源组限制某些用户的CPU使用
-- 创建低优先级资源组
CREATE RESOURCE GROUP low_priority
TYPE = USER
VCPU = 0-3 -- 只使用0-3号CPU核心
THREAD_PRIORITY = 10; -- 较低优先级
-- 创建高优先级资源组(实时查询)
CREATE RESOURCE GROUP high_priority
TYPE = USER
VCPU = 4-7
THREAD_PRIORITY = 0;
-- 查看资源组
SELECT * FROM information_schema.RESOURCE_GROUPS;
-- 将会话分配到资源组
SET RESOURCE GROUP low_priority;
-- 将特定线程分配到资源组
SET RESOURCE GROUP low_priority FOR 12345;
-- 查询时指定资源组
SELECT /*+ RESOURCE_GROUP(low_priority) */ * FROM large_table;
-- 应用场景:报表查询使用低优先级
-- 创建报表用户并默认使用低优先级资源组
CREATE USER 'report_user'@'%' IDENTIFIED BY 'ReportPass@2024';
GRANT SELECT ON mydb.* TO 'report_user'@'%';
-- 在应用层连接后设置资源组
-- SET RESOURCE GROUP low_priority;
3.2 实际应用案例
3.2.1 案例一:电商大促CPU飙高
背景:某电商平台618大促,凌晨0点流量瞬间暴增,MySQL主库CPU飙到100%。
排查过程:
# 第一步:快速诊断(耗时1分钟)
/opt/mysql-emergency/quick_diagnosis.sh
-- 发现问题:大量相同SQL在执行
SELECT
LEFT(INFO, 100) AS query,
COUNT(*) AS count
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query'
GROUP BY LEFT(INFO, 100)
ORDER BY count DESC;
/*
query | count
----------------------------------------------------------|------
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 10 | 500
*/
-- 分析执行计划
EXPLAIN SELECT * FROM orders WHERE user_id = 12345 ORDER BY created_at DESC LIMIT 10;
/*
type: ref
key: idx_user_id
Extra: Using filesort -- 问题在这里,产生了文件排序
*/
止血方案:
-- 临时限制并发
SET GLOBAL max_connections = 500;
-- 添加复合索引(在线DDL)
CREATE INDEX idx_user_created ON orders(user_id, created_at DESC) ALGORITHM=INPLACE, LOCK=NONE;
效果:索引创建完成后,CPU从100%降到30%,平均响应时间从5秒降到20毫秒。
根因分析:
- 订单表有user_id索引,但查询需要按created_at排序
- 每次查询都需要filesort,在高并发下CPU消耗剧增
- 缺少(user_id, created_at)复合索引
预防措施:
- 上线前进行压测,模拟大促流量
- 定期review慢查询日志,提前发现隐患
- 建立SQL审核机制,新SQL必须提供执行计划
3.2.2 案例二:定时任务与业务冲突
背景:每天凌晨2点,MySQL CPU都会飙高15分钟。
排查过程:
-- 检查定时任务
SELECT * FROM information_schema.EVENTS WHERE STATUS = 'ENABLED'\G
-- 发现凌晨2点有数据清理任务
/*
EVENT_NAME: daily_cleanup
EVENT_DEFINITION:
DELETE FROM logs WHERE created_at < DATE_SUB(NOW(), INTERVAL 30 DAY);
EXECUTE_AT: NULL
INTERVAL_VALUE: 1
INTERVAL_FIELD: DAY
STARTS: 2024-01-01 02:00:00
*/
-- 分析该SQL
EXPLAIN DELETE FROM logs WHERE created_at < DATE_SUB(NOW(), INTERVAL 30 DAY);
/*
type: ALL -- 全表扫描
rows: 50000000 -- 5000万行
*/
问题根因:
- logs表有5000万数据,created_at列没有索引
- 删除操作产生大量undo log
- 大量行锁阻塞了其他业务
解决方案:
-- 方案1:添加索引
CREATE INDEX idx_created_at ON logs(created_at);
-- 方案2:分批删除
DELIMITER //
CREATE PROCEDURE batch_delete_logs()
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE batch_size INT DEFAULT 10000;
REPEAT
DELETE FROM logs
WHERE created_at < DATE_SUB(NOW(), INTERVAL 30 DAY)
LIMIT batch_size;
SET done = ROW_COUNT();
SELECT SLEEP(1); -- 每批次间隔1秒,降低CPU压力
UNTIL done < batch_size END REPEAT;
END //
DELIMITER ;
-- 方案3:改用分区表,直接DROP分区
ALTER TABLE logs
PARTITION BY RANGE (TO_DAYS(created_at)) (
PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
...
);
-- 删除旧分区
ALTER TABLE logs DROP PARTITION p202312;
3.2.3 案例三:锁等待引发的雪崩
背景:某金融系统突然出现大量超时,MySQL CPU 100%。
排查过程:
-- 检查锁等待
SELECT
r.trx_mysql_thread_id AS waiting_pid,
r.trx_state,
TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_sec,
LEFT(r.trx_query, 100) AS waiting_query,
b.trx_mysql_thread_id AS blocking_pid,
LEFT(b.trx_query, 100) AS blocking_query,
b.trx_started,
TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_sec
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
/*
waiting_pid | wait_sec | waiting_query | blocking_pid | blocking_query | blocking_sec
------------|----------|--------------------|--------------:|----------------|-------------
1001 | 45 | UPDATE accounts... | 999 | NULL | 1200
1002 | 43 | UPDATE accounts... | 999 | NULL | 1200
...(200+等待)
*/
-- 发现:999号线程持有锁1200秒,但当前无执行SQL
-- 说明:事务开启后执行了SQL但未提交
-- 检查该连接信息
SELECT * FROM information_schema.PROCESSLIST WHERE ID = 999\G
/*
USER: app_user
HOST: 10.0.1.100:45678
COMMAND: Sleep
TIME: 1200
STATE: NULL
*/
问题根因:
- 应用程序开启事务后,在业务逻辑中调用了外部API
- 外部API超时,事务一直未提交
- 该事务持有的锁阻塞了其他所有相关操作
紧急止血:
-- 终止阻塞源
KILL 999;
-- 验证锁释放
SELECT COUNT(*) FROM information_schema.INNODB_LOCK_WAITS;
-- 0
根本解决:
# 应用代码优化:事务中不调用外部API
# Bad
def transfer_money(from_account, to_account, amount):
with db.transaction():
db.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_account))
db.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_account))
notify_service.send_notification(to_account) # 外部调用在事务内,危险!
db.commit()
# Good
def transfer_money(from_account, to_account, amount):
with db.transaction():
db.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_account))
db.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_account))
db.commit()
# 事务提交后再调用外部服务
notify_service.send_notification(to_account)
四、最佳实践和注意事项
4.1 最佳实践
4.1.1 性能优化配置
# /etc/my.cnf 性能相关配置
[mysqld]
# 连接管理
max_connections = 2000
max_connect_errors = 100000
thread_cache_size = 128
table_open_cache = 4000
# 查询优化
join_buffer_size = 4M
sort_buffer_size = 4M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
tmp_table_size = 256M
max_heap_table_size = 256M
# InnoDB优化
innodb_buffer_pool_size = 8G # 建议为物理内存的50-70%
innodb_buffer_pool_instances = 8
innodb_log_file_size = 2G
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_io_capacity = 4000
innodb_io_capacity_max = 8000
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_thread_concurrency = 0
# 锁等待超时
innodb_lock_wait_timeout = 50
lock_wait_timeout = 31536000
# 慢查询日志
slow_query_log = ON
slow_query_log_file = /data/mysql/logs/slow.log
long_query_time = 1
log_queries_not_using_indexes = ON
min_examined_row_limit = 1000
4.1.2 安全加固
-- 限制危险操作
-- 禁止没有WHERE条件的UPDATE/DELETE
SET GLOBAL sql_safe_updates = ON;
-- 设置最大执行时间(毫秒)
SET GLOBAL max_execution_time = 30000;
-- 限制单个连接的资源使用
-- 每小时最大查询数
ALTER USER 'app_user'@'%' WITH MAX_QUERIES_PER_HOUR 100000;
-- 最大并发连接数
ALTER USER 'report_user'@'%' WITH MAX_USER_CONNECTIONS 10;
-- 配置查询超时自动终止
-- 在应用层设置
-- SET SESSION max_execution_time = 10000;
4.1.3 高可用设计
-- 读写分离配置
-- 主库:处理写入
-- 从库:处理读取
-- 检查从库延迟
SELECT
SUBSTRING_INDEX(HOST, ':', 1) AS slave_host,
Seconds_Behind_Master
FROM performance_schema.replication_connection_status
JOIN performance_schema.replication_applier_status
USING (CHANNEL_NAME);
-- 主库CPU高时,将读流量切到从库
-- 通过ProxySQL或应用层路由实现
-- ProxySQL读写分离规则
-- INSERT/UPDATE/DELETE -> hostgroup_id=10 (主库)
-- SELECT -> hostgroup_id=20 (从库)
4.2 注意事项
4.2.1 常见错误
| 错误场景 |
错误做法 |
正确做法 |
| CPU 100%时 |
直接重启MySQL |
先止血(KILL问题SQL),再定位根因 |
| 锁等待过多 |
增大lock_wait_timeout |
找到锁源头,优化事务 |
| 慢查询多 |
盲目添加索引 |
先EXPLAIN分析,针对性添加 |
| 连接数满 |
增大max_connections |
检查连接池配置,清理空闲连接 |
| 临时表多 |
增大tmp_table_size |
优化SQL,减少GROUP BY/ORDER BY |
4.2.2 操作注意事项
- 止血操作的风险评估
- KILL会导致事务回滚,可能造成数据不一致
- 终止长事务可能产生大量undo log回滚
- 优先终止SELECT查询,UPDATE/DELETE谨慎
- 索引操作的影响
- 在线DDL(ALGORITHM=INPLACE)虽然不锁表,但仍消耗资源
- 大表加索引建议在低峰期
- 先在从库测试,确认无问题再在主库执行
- 参数修改的影响
- 部分参数修改需要重启生效(如innodb_buffer_pool_size)
- 动态参数修改立即生效,但重启后失效(需同步到配置文件)
- 修改前记录原值,便于回滚
4.2.3 常见错误诊断
| 错误码 |
说明 |
处理方案 |
| 1040 |
Too many connections |
增加max_connections或清理空闲连接 |
| 1205 |
Lock wait timeout |
检查锁等待,优化事务 |
| 1213 |
Deadlock found |
检查死锁日志,优化SQL顺序 |
| 1206 |
Total locks exceed limit |
分批处理大事务 |
| 3024 |
Query execution was interrupted |
查询被终止或超时 |
五、故障排查和监控
5.1 故障排查
5.1.1 日志分析
# MySQL错误日志分析
tail -1000 /data/mysql/logs/error.log | grep -E "(ERROR|Warning|Note)"
# 提取CPU相关信息
grep -i "cpu\|thread\|lock" /data/mysql/logs/error.log | tail -50
# 慢查询日志分析
pt-query-digest /data/mysql/logs/slow.log \
--since '2024-01-15 02:00:00' \
--until '2024-01-15 02:30:00' \
--limit 10
# 分析特定时间段的SQL模式
pt-query-digest /data/mysql/logs/slow.log \
--group-by fingerprint \
--order-by Query_time:sum \
--limit 20
5.1.2 实时诊断
-- 使用sys schema快速诊断
-- CPU占用最高的SQL
SELECT * FROM sys.statement_analysis ORDER BY avg_latency DESC LIMIT 10;
-- 等待最多的事件
SELECT * FROM sys.wait_classes_global_by_avg_latency;
-- 表IO统计
SELECT * FROM sys.schema_table_statistics ORDER BY total_latency DESC LIMIT 10;
-- 索引使用统计
SELECT * FROM sys.schema_index_statistics ORDER BY rows_selected DESC LIMIT 10;
-- 未使用的索引
SELECT * FROM sys.schema_unused_indexes;
5.1.3 性能剖析
-- 启用性能剖析
SET profiling = 1;
-- 执行问题SQL
SELECT * FROM orders WHERE status = 1 LIMIT 100;
-- 查看剖析结果
SHOW PROFILES;
SHOW PROFILE CPU, BLOCK IO FOR QUERY 1;
/*
+----------------------+----------+----------+------------+
| Status | Duration | CPU_user | CPU_system |
+----------------------+----------+----------+------------+
| starting | 0.000059 | 0.000000 | 0.000000 |
| checking permissions | 0.000006 | 0.000000 | 0.000000 |
| Opening tables | 0.000015 | 0.000000 | 0.000000 |
| init | 0.000015 | 0.000000 | 0.000000 |
| System lock | 0.000007 | 0.000000 | 0.000000 |
| optimizing | 0.000004 | 0.000000 | 0.000000 |
| statistics | 0.000012 | 0.000000 | 0.000000 |
| preparing | 0.000008 | 0.000000 | 0.000000 |
| executing | 0.000002 | 0.000000 | 0.000000 |
| Sending data | 0.125432 | 0.120000 | 0.004000 | <-- 主要耗时
| end | 0.000004 | 0.000000 | 0.000000 |
| query end | 0.000006 | 0.000000 | 0.000000 |
| closing tables | 0.000005 | 0.000000 | 0.000000 |
| freeing items | 0.000010 | 0.000000 | 0.000000 |
| cleaning up | 0.000004 | 0.000000 | 0.000000 |
+----------------------+----------+----------+------------+
*/
5.2 性能监控
5.2.1 关键监控指标
| 指标 |
说明 |
告警阈值 |
采集方法 |
| CPU使用率 |
整体CPU使用 |
>80%警告,>95%严重 |
top/mpstat |
| Threads_running |
并发执行线程数 |
>50警告,>100严重 |
SHOW STATUS |
| Threads_connected |
已连接线程数 |
>max_connections*0.8 |
SHOW STATUS |
| Slow_queries |
慢查询计数 |
增长>100/min |
SHOW STATUS |
| Innodb_row_lock_waits |
行锁等待次数 |
>10/sec |
SHOW STATUS |
| Innodb_row_lock_time_avg |
平均锁等待时间 |
>100ms |
SHOW STATUS |
| Questions |
查询总数 |
趋势监控 |
SHOW STATUS |
| Com_select/insert/update/delete |
各类SQL计数 |
趋势监控 |
SHOW STATUS |
5.2.2 Grafana监控面板
{
"dashboard": {
"title": "MySQL CPU Monitoring",
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage %"
}
]
},
{
"title": "MySQL Threads",
"type": "graph",
"targets": [
{
"expr": "mysql_global_status_threads_running",
"legendFormat": "Running"
},
{
"expr": "mysql_global_status_threads_connected",
"legendFormat": "Connected"
}
]
},
{
"title": "Queries Per Second",
"type": "graph",
"targets": [
{
"expr": "rate(mysql_global_status_questions[5m])",
"legendFormat": "QPS"
}
]
},
{
"title": "Slow Queries",
"type": "graph",
"targets": [
{
"expr": "rate(mysql_global_status_slow_queries[5m])",
"legendFormat": "Slow Queries/sec"
}
]
}
]
}
}
5.2.3 自动化告警配置
#!/bin/bash
# 文件:/opt/mysql-emergency/alert_check.sh
# 功能:CPU告警检查脚本(配合crontab使用)
MYSQL_USER="monitor"
MYSQL_PASS="MonitorPass@2024"
ALERT_URL="https://alert.example.com/api/alert"
HOSTNAME=$(hostname)
# 获取MySQL状态
get_mysql_status() {
mysql -u${MYSQL_USER} -p${MYSQL_PASS} -N -e "
SELECT
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Threads_running') AS threads_running,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Threads_connected') AS threads_connected,
(SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE COMMAND = 'Query' AND TIME > 30) AS long_queries,
(SELECT COUNT(*) FROM performance_schema.data_lock_waits) AS lock_waits
" 2>/dev/null
}
# 获取CPU使用率
get_cpu_usage() {
top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1
}
# 发送告警
send_alert() {
local level=$1
local message=$2
curl -s -X POST "${ALERT_URL}" \
-H "Content-Type: application/json" \
-d "{
\"level\": \"${level}\",
\"host\": \"${HOSTNAME}\",
\"service\": \"mysql\",
\"message\": \"${message}\",
\"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
}"
}
# 主逻辑
main() {
cpu=$(get_cpu_usage)
status=$(get_mysql_status)
threads_running=$(echo $status | awk '{print $1}')
threads_connected=$(echo $status | awk '{print $2}')
long_queries=$(echo $status | awk '{print $3}')
lock_waits=$(echo $status | awk '{print $4}')
# CPU检查
if [ ${cpu%.*} -gt 95 ]; then
send_alert "critical" "MySQL CPU ${cpu}% - Threads Running: ${threads_running}"
elif [ ${cpu%.*} -gt 80 ]; then
send_alert "warning" "MySQL CPU ${cpu}% - Threads Running: ${threads_running}"
fi
# 运行线程检查
if [ ${threads_running} -gt 100 ]; then
send_alert "critical" "MySQL Threads Running ${threads_running} exceeds 100"
fi
# 长查询检查
if [ ${long_queries} -gt 10 ]; then
send_alert "warning" "MySQL has ${long_queries} queries running over 30s"
fi
# 锁等待检查
if [ ${lock_waits} -gt 5 ]; then
send_alert "warning" "MySQL has ${lock_waits} lock waits"
fi
}
main
5.3 备份与恢复
5.3.1 故障时的数据保护
# CPU 100%时,如果需要重启MySQL,先做好数据保护
# 1. 检查当前事务状态
mysql -uroot -p -e "SELECT * FROM information_schema.INNODB_TRX\G" > /tmp/trx_before_restart.txt
# 2. 导出当前processlist
mysql -uroot -p -e "SELECT * FROM information_schema.PROCESSLIST" > /tmp/processlist_before_restart.txt
# 3. 保存InnoDB状态
mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" > /tmp/innodb_status_before_restart.txt
# 4. 如果可能,进行flush操作
mysql -uroot -p -e "FLUSH TABLES WITH READ LOCK; FLUSH LOGS;"
# 5. 记录binlog位置
mysql -uroot -p -e "SHOW MASTER STATUS" > /tmp/binlog_position_before_restart.txt
# 6. 安全关闭MySQL
sudo systemctl stop mysqld
# 7. 检查错误日志
tail -100 /data/mysql/logs/error.log > /tmp/error_log_before_restart.txt
六、总结
6.1 技术要点回顾
- 止血优先于定位根因:CPU 100%时首要任务是恢复服务,可通过KILL问题SQL、限制连接数等方式快速止血
- 诊断三板斧:
- 查PROCESSLIST找正在执行的SQL
- 查Performance Schema找高消耗SQL
- 查INNODB_TRX找锁问题
- 常见根因:
- 缺索引导致全表扫描
- 锁等待/死锁引发雪崩
- 大事务长时间未提交
- 连接数暴涨
- 预防措施:
- 上线前压测
- 定期审计慢查询
- 完善监控告警
- 准备应急脚本
6.2 进阶学习方向
- MySQL内核原理:深入理解查询执行流程
- 性能调优方法论:系统化的调优思路
- 自动化运维:构建自愈系统
- MySQL云原生:Kubernetes上的MySQL运维
6.3 参考资料
- MySQL 8.0 Performance Schema
- MySQL 8.0 sys Schema
- High Performance MySQL, 4th Edition
- Percona Toolkit Documentation
附录
A. 命令速查表
| 命令 |
说明 |
SHOW PROCESSLIST |
查看当前连接和执行的SQL |
SHOW ENGINE INNODB STATUS\G |
查看InnoDB引擎状态 |
KILL <id> |
终止指定连接 |
SHOW STATUS LIKE 'Threads%' |
查看线程状态 |
SHOW VARIABLES LIKE '%timeout%' |
查看超时配置 |
SELECT * FROM sys.statement_analysis |
分析SQL执行统计 |
SELECT * FROM performance_schema.events_statements_summary_by_digest |
SQL摘要统计 |
B. 配置参数详解
| 参数 |
默认值 |
说明 |
| max_connections |
151 |
最大连接数 |
| thread_cache_size |
-1(自动) |
线程缓存大小 |
| innodb_thread_concurrency |
0 |
InnoDB并发线程数(0为不限制) |
| max_execution_time |
0 |
最大执行时间(毫秒,0为不限制) |
| innodb_lock_wait_timeout |
50 |
锁等待超时(秒) |
| interactive_timeout |
28800 |
交互连接超时 |
| wait_timeout |
28800 |
非交互连接超时 |
C. 术语表
| 术语 |
英文 |
说明 |
| 止血 |
Stop the bleeding |
紧急恢复服务的操作 |
| 雪崩 |
Avalanche |
连锁故障导致的系统崩溃 |
| 锁等待 |
Lock wait |
事务等待获取锁的状态 |
| 全表扫描 |
Full table scan |
扫描表的所有行 |
| 执行计划 |
Execution plan |
查询优化器生成的执行方案 |
| 资源组 |
Resource group |
MySQL 8.0的CPU资源隔离机制 |
| Performance Schema |
- |
MySQL性能监控schema |
| sys Schema |
- |
MySQL系统诊断视图 |
以上完整呈现了MySQL CPU 100%紧急排查与处理的标准化流程。从实战中总结出的经验告诉我们,完善的预案、熟练的工具使用和清晰的排查思路是应对此类生产危机的关键。如果你想深入了解更多数据库运维与性能优化的实战技巧,欢迎在 云栈社区 与更多同行交流探讨。