死锁基础概念
1.1 什么是死锁?
死锁是指两个或更多事务互相等待对方释放锁资源,导致所有事务都无法继续执行的阻塞状态。
1.2 死锁四要素(必要条件)
- 互斥条件:资源只能被一个事务独占
- 持有并等待:事务持有资源并等待其他资源
- 不可剥夺:资源只能由持有者主动释放
- 循环等待:事务之间形成等待环路
1.3 经典死锁示例
-- 事务A
BEGIN;
UPDATE accounts SET balance = balance -100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- 事务B(同时执行)
BEGIN;
UPDATE accounts SET balance = balance -50 WHERE id = 2;
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- 死锁发生!
COMMIT;
时间线分析:
| 时间 |
事务A |
事务B |
| T1 |
锁住 id=1 |
锁住 id=2 |
| T2 |
等待 id=2 的锁 |
等待 id=1 的锁 |
| T3 |
⚠️ 死锁发生! |
⚠️ 死锁发生! |
死锁检测机制
2.1 数据库死锁处理策略
不同的数据库管理系统有着不同的内部机制来处理这个棘手的并发问题。
| 数据库 |
检测机制 |
处理策略 |
超时时间 |
| MySQL |
等待图检测 |
回滚代价最小的事务 |
innodb_lock_wait_timeout |
| Oracle |
定时检测 |
回滚产生最少undo的事务 |
默认无限等待 |
| PostgreSQL |
死锁检测器 |
回滚任意事务 |
deadlock_timeout |
2.2 MySQL InnoDB死锁检测
-- 查看死锁检测状态
SHOW VARIABLES LIKE 'innodb_deadlock_detect';
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
-- 死锁检测原理(等待图算法)
-- 1. 构建锁等待图
-- 2. 检测环路(深度优先搜索)
-- 3. 选择“牺牲者”(回滚代价最小)
死锁信息收集
3.1 MySQL死锁日志分析
3.1.1 启用死锁日志
# my.cnf配置
[mysqld]
innodb_print_all_deadlocks = 1 # 记录所有死锁到错误日志
log_error = /var/log/mysql/error.log
3.1.2 解析死锁日志
# 典型MySQL死锁日志
------------------------
LATEST DETECTED DEADLOCK
------------------------
2026-03-10 12:20:17 0x7f8b2c1d6700
*** (1) TRANSACTION:
TRANSACTION 123456, ACTIVE 2 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 100, OS thread handle 139863, query id 1000 localhost root updating
UPDATE accounts SET balance = balance - 100 WHERE id = 1
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 100 page no 3 n bits 72 index PRIMARY of table `test`.`accounts` trx id 123456 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 100 page no 4 n bits 72 index PRIMARY of table `test`.`accounts` trx id 123456 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 123457, ACTIVE 1 sec starting index read
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 101, OS thread handle 139864, query id 1001 localhost root updating
UPDATE accounts SET balance = balance - 50 WHERE id = 2
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 100 page no 4 n bits 72 index PRIMARY of table `test`.`accounts` trx id 123457 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 100 page no 3 n bits 72 index PRIMARY of table `test`.`accounts` trx id 123457 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (2)
3.1.3 关键信息提取
# 从日志提取关键信息脚本
grep -A 50 "LATEST DETECTED DEADLOCK" /var/log/mysql/error.log | tail -50
# 或者使用pt工具分析
pt-deadlock-logger /var/log/mysql/error.log --dest D=test,t=deadlocks
死锁排查实战
4.1 排查流程
graph TD
A[发现死锁] --> B{分析死锁日志}
B --> C[识别资源冲突]
C --> D[分析SQL执行计划]
D --> E[检查事务隔离级别]
E --> F[检查索引使用情况]
F --> G[制定解决方案]
G --> H[实施并验证]
4.2 现场排查步骤
步骤1:确认死锁发生
-- MySQL:查看锁等待
SELECT * FROM information_schema.INNODB_TRX;
SELECT * FROM information_schema.INNODB_LOCKS;
SELECT * FROM information_schema.INNODB_LOCK_WAITS;
-- 当前锁等待情况
SELECT
r.trx_id waiting_trx_id,
r.trx_mysql_thread_id waiting_thread,
r.trx_query waiting_query,
b.trx_id blocking_trx_id,
b.trx_mysql_thread_id blocking_thread,
b.trx_query blocking_query
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
步骤2:分析事务SQL
-- 查看正在执行的SQL
SELECT
t.processlist_id as thread_id,
t.processlist_user as user,
t.processlist_host as host,
t.processlist_db as db,
t.processlist_command as command,
t.processlist_state as state,
t.processlist_info as current_sql,
trx.trx_state,
trx.trx_started,
trx.trx_query
FROM performance_schema.threads t
LEFT JOIN information_schema.innodb_trx trx
ON t.processlist_id = trx.trx_mysql_thread_id
WHERE t.processlist_id IS NOT NULL
ORDER BY trx.trx_started;
步骤3:检查锁模式与隔离级别
-- 查看当前隔离级别
SELECT @@global.transaction_isolation, @@session.transaction_isolation;
-- 查看锁模式说明
/*
锁模式说明:
- 锁类型:X(排他锁),S(共享锁),IX(意向排他锁),IS(意向共享锁)
- 锁粒度:表锁、行锁、间隙锁、Next-Key锁
- 锁兼容性:S锁兼容S锁,X锁不兼容任何锁
*/
4.3 常见死锁场景分析
场景1:不同顺序更新同组记录
-- 事务A顺序:id=1 → id=2
UPDATE table SET col = val WHERE id = 1;
UPDATE table SET col = val WHERE id = 2;
-- 事务B顺序:id=2 → id=1(反向顺序导致死锁)
UPDATE table SET col = val WHERE id = 2;
UPDATE table SET col = val WHERE id = 1;
解决方案:
-- 方案1:统一更新顺序
UPDATE table SET col = val WHERE id IN (1, 2) ORDER BY id;
-- 方案2:使用应用层锁排序
-- 在代码中对ID排序后再执行更新
场景2:Gap Lock导致的死锁
-- 事务A:删除不存在记录,产生间隙锁
DELETE FROM table WHERE id = 100;
-- 事务B:插入相邻记录,等待间隙锁
INSERT INTO table (id, name) VALUES (101, 'test');
解决方案:
-- 方案1:降低隔离级别
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- 方案2:使用唯一索引避免间隙锁
ALTER TABLE table ADD UNIQUE INDEX idx_id (id);
场景3:索引缺失导致表锁升级
-- 没有索引的更新,可能导致大量行锁
UPDATE table SET status = 1 WHERE create_time < '2026-03-01';
-- 同时另一个事务更新同一表
DELETE FROM table WHERE id = 100;
解决方案:
-- 添加合适索引
CREATE INDEX idx_create_time ON table(create_time);
-- 分批更新
UPDATE table SET status = 1
WHERE create_time < '2026-03-01'
AND id BETWEEN 1 AND 1000;
解决方案与预防措施
5.1 应用层解决方案
5.1.1 统一资源访问顺序
# Python示例:统一排序避免死锁
def update_accounts(account_ids, amount):
# 对账户ID排序,确保访问顺序一致
account_ids = sorted(account_ids)
for account_id in account_ids:
execute_sql(f"""
UPDATE accounts
SET balance = balance + {amount}
WHERE id = {account_id}
""")
5.1.2 重试机制
// Java示例:死锁重试机制
public void transferWithRetry(int fromId, int toId, BigDecimal amount) {
int maxRetries = 3;
int retryCount = 0;
while (retryCount < maxRetries) {
try {
transfer(fromId, toId, amount);
break; // 成功则退出
} catch (DeadlockException e) {
retryCount++;
if (retryCount == maxRetries) {
throw new RuntimeException("转账失败,超过重试次数", e);
}
// 指数退避
Thread.sleep(100 * (long) Math.pow(2, retryCount));
}
}
}
5.1.3 悲观锁与乐观锁
-- 悲观锁:提前锁定资源
BEGIN;
SELECT * FROM accounts WHERE id IN (1, 2) FOR UPDATE;
-- 执行更新操作
UPDATE accounts SET balance = balance -100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- 乐观锁:使用版本号
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = 1 AND version = @old_version;
5.2 数据库层优化
5.2.1 索引优化
-- 添加合适索引减少锁范围
-- 原查询(可能导致全表扫描)
UPDATE orders SET status = 'shipped' WHERE customer_id = 100 AND create_date > '2026-01-01';
-- 优化:添加复合索引
CREATE INDEX idx_customer_date ON orders(customer_id, create_date);
5.2.2 事务优化
-- 避免长事务
SET SESSION autocommit = 1; -- 自动提交
-- 最小化事务范围
BEGIN;
-- 只包含必要的操作
UPDATE table1 SET col1 = val1 WHERE id = 1;
UPDATE table2 SET col2 = val2 WHERE id = 2;
COMMIT; -- 尽快提交
-- 避免在事务中执行非数据库操作
-- ❌ 错误:事务中包含网络调用、文件操作等
5.2.3 隔离级别调整
-- 根据业务需求选择合适的隔离级别
-- READ COMMITTED:减少锁冲突,但可能出现不可重复读
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- REPEATABLE READ:MySQL默认,可能产生更多锁
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
5.3 架构层预防
5.3.1 分库分表
-- 按业务拆分,减少资源竞争
-- 用户账户表按用户ID分片
CREATE TABLE accounts_0000 (id INT PRIMARY KEY, ...);
CREATE TABLE accounts_0001 (id INT PRIMARY KEY, ...);
-- 不同用户在不同分片,减少锁冲突
5.3.2 异步处理
# 使用消息队列异步处理
def transfer_async(from_id, to_id, amount):
# 将转账请求放入队列
message = {
'from_id': from_id,
'to_id': to_id,
'amount': amount,
'request_id': generate_uuid()
}
queue.send(message)
# 消费者顺序处理,避免并发冲突
def process_transfer(message):
with lock_manager.acquire_lock(message['request_id']):
execute_transfer(message)
5.3.3 缓存层优化
# 使用Redis等缓存减少数据库压力
def update_counter(user_id):
# 先更新缓存
redis.incr(f'counter:{user_id}')
# 异步批量更新数据库
if should_sync_to_db():
counters = redis.get_all_counters()
batch_update_db(counters)
监控与告警
6.1 死锁监控配置
6.1.1 MySQL监控
-- 创建死锁监控表
CREATE TABLE deadlock_monitor (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
occurred_at DATETIME DEFAULT CURRENT_TIMESTAMP,
deadlock_log LONGTEXT,
victim_trx_id VARCHAR(64),
resolved_by VARCHAR(32)
);
-- 定期检查死锁
SELECT
COUNT(*) as deadlock_count,
DATE(occurred_at) as day
FROM deadlock_monitor
WHERE occurred_at > DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY DATE(occurred_at)
ORDER BY day DESC;
6.1.2 告警规则设置
# Prometheus告警规则示例
groups:
- name: mysql_deadlock
rules:
- alert: HighDeadlockRate
expr: rate(mysql_global_status_innodb_row_lock_deadlocks[5m]) > 0.1
for: 2m
labels:
severity: warning
annotations:
summary: "MySQL死锁频率过高"
description: "过去5分钟死锁率 {{ $value }}/秒"
6.2 性能指标监控
-- 关键性能指标查询
-- 锁等待时间
SELECT
AVG(wait_time_ms) as avg_wait_time,
MAX(wait_time_ms) as max_wait_time,
COUNT(*) as lock_wait_count
FROM sys.innodb_lock_waits;
-- 死锁统计
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';
-- 事务统计
SELECT
COUNT(*) as active_transactions,
SUM(TIMESTAMPDIFF(SECOND, trx_started, NOW())) as total_age_seconds,
AVG(TIMESTAMPDIFF(SECOND, trx_started, NOW())) as avg_age_seconds
FROM information_schema.innodb_trx;
工具与脚本
7.1 死锁分析工具
7.1.1 pt-deadlock-logger
# 安装Percona Toolkit
wget https://www.percona.com/downloads/percona-toolkit/3.0.0/binary/tarball/percona-toolkit-3.0.0_x86_64.tar.gz
# 监控死锁日志
pt-deadlock-logger /var/log/mysql/error.log --dest D=monitor,t=deadlocks
# 生成死锁报告
pt-deadlock-logger --print /var/log/mysql/error.log
7.1.2 自定义分析脚本
#!/bin/bash
# analyze_deadlocks.sh
ERROR_LOG="/var/log/mysql/error.log"
OUTPUT_FILE="/tmp/deadlock_report_$(date +%Y%m%d_%H%M%S).txt"
echo "=== 死锁分析报告 ===" > $OUTPUT_FILE
echo "生成时间: $(date)" >> $OUTPUT_FILE
echo "=====================================" >> $OUTPUT_FILE
# 提取最近24小时的死锁
grep -B5 -A50 "LATEST DETECTED DEADLOCK" $ERROR_LOG | \
grep -E "(TRANSACTION|HOLDS|WAITING|ROLL BACK)" | \
sed -n '1,100p' >> $OUTPUT_FILE
echo "" >> $OUTPUT_FILE
echo "=== 统计信息 ===" >> $OUTPUT_FILE
DEADLOCK_COUNT=$(grep -c "LATEST DETECTED DEADLOCK" $ERROR_LOG)
echo "总死锁次数: $DEADLOCK_COUNT" >> $OUTPUT_FILE
# 发送报告
mail -s "MySQL死锁分析报告" admin@example.com < $OUTPUT_FILE
7.2 压力测试与复现
# 死锁复现脚本
import threading
import pymysql
import time
def transaction_a():
conn = pymysql.connect(host='localhost', user='root', password='', db='test')
try:
with conn.cursor() as cursor:
conn.begin()
cursor.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
time.sleep(0.1) # 增加死锁概率
cursor.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
conn.commit()
except pymysql.err.OperationalError as e:
print(f"事务A失败: {e}")
finally:
conn.close()
def transaction_b():
conn = pymysql.connect(host='localhost', user='root', password='', db='test')
try:
with conn.cursor() as cursor:
conn.begin()
cursor.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 2")
time.sleep(0.1) # 增加死锁概率
cursor.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 1")
conn.commit()
except pymysql.err.OperationalError as e:
print(f"事务B失败: {e}")
finally:
conn.close()
# 并发执行
thread1 = threading.Thread(target=transaction_a)
thread2 = threading.Thread(target=transaction_b)
thread1.start()
thread2.start()
thread1.join()
thread2.join()
最佳实践总结
8.1 设计原则
- 访问顺序一致性:统一资源访问顺序
- 事务最小化:减少事务范围和持有时间
- 索引优化:避免全表扫描和锁升级
- 超时设置:合理设置锁等待超时时间
- 重试机制:优雅处理死锁异常
8.2 检查清单 ✅
- 所有更新操作是否使用相同顺序?
- 事务是否尽可能短小?
- 是否添加了合适的索引?
- 是否设置了合理的超时时间?
- 应用层是否有重试机制?
- 是否监控死锁发生频率?
- 隔离级别是否适合业务需求?
8.3 应急响应流程
- 确认死锁发生(查看错误日志)
- 收集死锁信息(事务、SQL、锁信息)
- 分析根本原因(访问顺序、索引、事务设计)
- 实施临时解决方案(kill事务、调整超时)
- 制定长期解决方案(代码优化、架构调整)
- 验证解决方案有效性(压力测试)
- 更新监控告警(预防再次发生)
希望这份从理论到实战的数据库死锁指南能帮助你。如果你在实践中有更独特的见解或踩坑经验,欢迎在云栈社区的后端与架构板块分享,与更多开发者一同探讨高并发场景下的稳定性挑战。