一、概述
1.1 背景介绍
Elasticsearch作为分布式搜索引擎,在生产环境中承载着海量数据的存储与检索任务。然而,分布式系统天然面临的网络分区、节点故障等问题,可能导致集群出现脑裂(Split-Brain)现象。脑裂是指集群因网络隔离被分割成多个独立子集群,每个子集群都选举出自己的 Master 节点,导致数据不一致、写入冲突等严重问题。本文从 Elasticsearch 的选主机制入手,深入分析脑裂产生的原因、危害,并提供系统化的预防和恢复方案。
1.2 技术特点
- Zen Discovery 机制:Elasticsearch 7.0 之前的选主协议,基于单播 Ping 的节点发现与多数派投票
- Quorum 机制:通过 discovery.zen.minimum_master_nodes 参数防止脑裂
- 类 Raft 协调机制:Elasticsearch 7.0+ 引入的新集群协调层(思路与 Raft 类似),提供更强的一致性保证
- Master 选举:基于节点 ID、版本号、集群状态版本号的多轮投票机制
- 故障检测:通过心跳检测节点存活状态,触发重新选举
1.3 适用场景
- 场景一:大规模 Elasticsearch 集群(节点数 > 50),网络环境复杂
- 场景二:跨数据中心部署,存在网络延迟和不稳定性
- 场景三:金融、电商等对数据一致性要求极高的业务场景
- 场景四:7x24 小时不间断服务,需要自动故障恢复能力
1.4 环境要求
| 组件 | 版本要求 | 说明 |
| --- | --- | --- |
| Elasticsearch | 7.17+/8.x | 推荐使用最新稳定版 |
| 操作系统 | CentOS 7+/Ubuntu 20.04+ | Linux Kernel 4.0+ |
| JDK | OpenJDK 17 | ES 8.x 要求 JDK 17 |
| 内存 | 32GB+ | 堆内存建议不超过 32GB |
| 网络 | 万兆网卡 | 低延迟网络环境 |
二、详细步骤
2.1 Elasticsearch 选主机制详解
◆ 2.1.1 Master 节点的职责
Master 节点负责集群的全局管理工作,但不处理数据写入和查询请求:
# elasticsearch.yml - Master 节点配置
node.roles: [ master ]
# Master 节点职责
# 1. 管理集群元数据(索引创建/删除、映射变更)
# 2. 节点加入/离开集群的协调
# 3. 分片分配决策(Shard Allocation)
# 4. 集群状态(Cluster State)的维护和同步
集群状态(Cluster State)包含:
- 集群配置信息
- 索引元数据(mappings、settings)
- 分片路由表(Shard Routing Table)
- 节点信息
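上面列出的集群状态内容可以直接通过 REST API 查看。下面是一个基于 requests 的简单查询示意(节点地址为示例,仅演示本文后面也会用到的 _cat/master 与 _cluster/state 接口,不代表官方客户端写法):
import requests

ES = "http://10.0.1.101:9200"  # 示例地址,请替换为实际节点

# 查看当前 Master 节点
master = requests.get(f"{ES}/_cat/master?format=json", timeout=5).json()[0]
print("当前 Master:", master["node"], master["id"])

# 只取集群状态中的 master_node 和 version 字段,避免拉取完整集群状态
state = requests.get(f"{ES}/_cluster/state/master_node,version", timeout=5).json()
print("集群状态版本:", state["version"], "Master ID:", state["master_node"])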
◆ 2.1.2 Zen Discovery 选主流程(ES 7.0 之前)
第一阶段:Ping 阶段
# 节点启动后,向配置的单播主机列表发送 Ping 请求
# (7.0 之前的 Zen Discovery 使用 discovery.zen.ping.unicast.hosts,7.0+ 更名为 discovery.seed_hosts)
discovery.zen.ping.unicast.hosts: ["10.0.1.101", "10.0.1.102", "10.0.1.103"]
# Ping 响应包含:
# 1. 节点 ID
# 2. 节点角色
# 3. 集群名称
# 4. 集群状态版本号
第二阶段:选举阶段
选举规则(按优先级):
1. 集群状态版本号最新的节点
2. 节点 ID 最小的节点(字典序)
投票过程:
1. 候选 Master 节点向其他 Master-eligible 节点发送投票请求
2. 收到 (N/2 + 1) 票即当选(N 为 master-eligible 节点总数)
3. 当选后广播自己为新 Master
第三阶段:Master 确认
1. 新 Master 等待至少 minimum_master_nodes 个节点连接
2. 发布新的集群状态
3. 其他节点接受新 Master 的领导
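可以用一段简化的 Python 模拟来直观理解上述排序与法定票数规则(仅为教学示意,并非 Elasticsearch 的真实实现):
# 简化示意:按 Zen Discovery 的排序规则挑选候选 Master,并检查法定票数
def pick_master(candidates, reachable_count):
    """candidates: [(node_id, cluster_state_version), ...]"""
    total = len(candidates)
    quorum = total // 2 + 1
    if reachable_count < quorum:
        return None  # 本分区达不到法定票数,放弃选主
    # 优先选集群状态版本最新的节点;版本相同则取节点 ID 最小(字典序)者
    best = sorted(candidates, key=lambda c: (-c[1], c[0]))[0]
    return best[0]

nodes = [("node-b", 120), ("node-a", 120), ("node-c", 118)]
print(pick_master(nodes, reachable_count=3))  # node-a:版本并列最新且 ID 最小
print(pick_master(nodes, reachable_count=1))  # None:少数派分区无法选主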
◆ 2.1.3 脑裂产生的根本原因
案例一:网络分区
原始集群:3 个 Master-eligible 节点(Node1, Node2, Node3)
discovery.zen.minimum_master_nodes: 2
网络分区发生:
分区 A: Node1, Node2 (2 个节点,满足 minimum_master_nodes)
分区 B: Node3 (1 个节点,不满足条件)
结果:
- 分区 A 选举出 Master(Node1),继续对外提供服务
- 分区 B 票数达不到 minimum_master_nodes,无法选举 Master,停止接受写入
这正是配置正确时的预期行为:少数派分区主动放弃选主,等待网络恢复后重新加入,避免出现两个 Master。
案例二:配置错误导致脑裂
# 错误配置示例
# 5 个 Master-eligible 节点
discovery.zen.minimum_master_nodes: 2 # 错误!应该设置为 3
# 网络分区
分区 A: Node1, Node2(2 个节点,满足 minimum_master_nodes=2)
分区 B: Node3, Node4, Node5(3 个节点,同样满足 =2)
结果:两个分区都能选举出 Master,产生脑裂
正确公式:
minimum_master_nodes = (master_eligible_nodes / 2) + 1
示例:
3 个节点:(3/2) + 1 = 2
5 个节点:(5/2) + 1 = 3
7 个节点:(7/2) + 1 = 4
2.2 Elasticsearch 7.0+ 的改进
◆ 2.2.1 自动 Quorum 机制
# Elasticsearch 7.0+ 不再需要手动配置 minimum_master_nodes
# 系统自动计算并维护投票配置(Voting Configuration)
# 初始化集群时指定初始 Master 节点
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
# 节点加入/离开时,自动调整投票配置
# 例如:3 节点集群扩容为 5 节点集群后
# 投票配置自动扩展为 5 个节点,法定票数相应地从 2 变为 3
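如果想确认当前生效的投票配置,可以查询集群状态中的协调元数据。下面是一个基于 requests 的查询示意(地址为示例,字段名以实际版本的返回结果为准):
import requests

ES = "http://10.0.1.101:9200"  # 示例地址
resp = requests.get(
    f"{ES}/_cluster/state/metadata",
    params={"filter_path": "metadata.cluster_coordination"},
    timeout=5,
)
coordination = resp.json()["metadata"]["cluster_coordination"]
# last_committed_config 即当前生效的投票配置(master-eligible 节点 ID 列表)
print("当前任期 term:", coordination.get("term"))
print("投票配置:", coordination.get("last_committed_config"))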
◆ 2.2.2 基于类 Raft 协议的选主
类 Raft 协调算法的优势:
1. 日志复制:Master 变更通过日志复制到多数节点
2. 任期(Term):每次选举递增 Term,防止旧 Master 干扰
3. 随机超时:避免同时选举导致的分票
选举过程:
1. Follower 节点超时未收到 Leader 心跳
2. 转为 Candidate,Term +1,发起投票
3. 获得多数票后成为 Leader
4. Leader 定期发送心跳维持地位
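下面用一段极简的 Python 模拟说明“任期递增 + 每任期只投一票”的规则(纯教学示意,与 Elasticsearch 内部实现无关):
import random

class Voter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None  # 记录本任期已投给谁

    def request_vote(self, candidate_id, candidate_term):
        # 任期落后的候选人直接拒绝;同一任期内只投一票
        if candidate_term < self.current_term:
            return False
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

voters = [Voter(f"node-{i}") for i in range(3)]
# 随机化的选举超时让各节点错开发起选举,降低分票概率
timeout_ms = random.randint(150, 300)
votes = sum(v.request_vote("node-0", candidate_term=1) for v in voters)
print(f"选举超时 {timeout_ms}ms,获得 {votes}/3 票,当选:{votes >= 2}")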
◆ 2.2.3 集群引导(Cluster Bootstrapping)
# 首次启动集群时必须指定初始 Master 节点
# elasticsearch.yml
cluster.initial_master_nodes:
- node-1
- node-2
- node-3
# 注意事项
# 1. 只在首次启动时需要配置
# 2. 集群成功启动后应删除此配置
# 3. 不要在已运行的集群中修改此配置
自动化部署脚本:
#!/bin/bash
# 初始化 ES 集群
MASTER_NODES=("es-master-1" "es-master-2" "es-master-3")
# 第一次启动时添加初始配置
# 先把数组拼接成 ["a", "b", "c"] 形式;直接写 ${MASTER_NODES[@]} 会展开成一个空格分隔的字符串,生成错误的 YAML
NODES_LIST=$(printf '"%s", ' "${MASTER_NODES[@]}")
NODES_LIST="[${NODES_LIST%, }]"
for node in "${MASTER_NODES[@]}"; do
  ssh $node "echo 'cluster.initial_master_nodes: ${NODES_LIST}' >> /etc/elasticsearch/elasticsearch.yml"
  ssh $node "systemctl start elasticsearch"
done
# 等待集群启动
sleep 30
# 检查集群状态
curl -X GET "http://es-master-1:9200/_cluster/health?pretty"
# 移除初始配置(避免误操作)
for node in "${MASTER_NODES[@]}"; do
ssh $node "sed -i '/cluster.initial_master_nodes/d' /etc/elasticsearch/elasticsearch.yml"
done
echo "Cluster initialized successfully"
2.3 高可用架构设计
◆ 2.3.1 节点角色分离
# Master 节点(3 个,专用,不承载数据)
node.roles: [ master ]
# Data-Hot 节点(用于热数据,SSD)
node.roles: [ data_hot, data_content ]
node.attr.box_type: hot
# Data-Warm 节点(用于温数据,SATA)
node.roles: [ data_warm, data_content ]
node.attr.box_type: warm
# Data-Cold 节点(用于冷数据,归档存储)
node.roles: [ data_cold, data_content ]
node.attr.box_type: cold
# Coordinating 节点(协调节点,不存储数据)
node.roles: [ ]
# Ingest 节点(数据预处理)
node.roles: [ ingest ]
推荐架构:
负载均衡器
↓
Coordinating 节点(2+)
↓
Data-Hot 节点(5+) Data-Warm 节点(3+) Data-Cold 节点(2+)
Master 节点(3 个,专用)不在读写请求路径上,独立负责集群管理
◆ 2.3.2 跨机架/跨数据中心部署
# 启用分片分配感知
cluster.routing.allocation.awareness.attributes: zone,rack
# 节点配置
# 节点 1-3(机架 A / 可用区 A)
node.attr.zone: zone-a
node.attr.rack: rack-1
# 节点 4-6(机架 B / 可用区 B)
node.attr.zone: zone-b
node.attr.rack: rack-2
# 节点 7-9(机架 C / 可用区 C)
node.attr.zone: zone-c
node.attr.rack: rack-3
# 强制分配感知:同一分片的主副本必须分布在不同 zone(即不同机架/可用区)
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b,zone-c
跨数据中心注意事项:
# 1. 网络传输优化(7.x 起压缩设置推荐使用 transport.compress)
transport.compress: true
transport.ping_schedule: 30s
# 2. 选举时间调整(默认值为亚秒级,跨 DC 的高延迟链路可适当放宽)
cluster.election.duration: 60s
cluster.election.initial_timeout: 10s
# 3. 心跳超时
cluster.fault_detection.follower_check.timeout: 60s
cluster.fault_detection.leader_check.timeout: 60s
# 4. 通过强制分配感知控制副本在数据中心间的分布
# (若要完全避免集群内跨数据中心复制,应拆分为独立集群并使用 CCR,见后文案例二)
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "datacenter",
"cluster.routing.allocation.awareness.force.datacenter.values": "dc1,dc2"
}
}
◆ 2.3.3 分片分配策略
// 索引模板配置
PUT _index_template/logs_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1,
// 分片分配规则
"index.routing.allocation.require.box_type": "hot",
// 限制单个节点上该索引的分片数,避免热点集中
"index.routing.allocation.total_shards_per_node": 2,
// 延迟分配(节点离线后等待 5 分钟再重新分配分片)
"index.unassigned.node_left.delayed_timeout": "5m"
}
}
}
分片计算公式:
单个分片大小建议:20GB - 50GB
分片数量 = 数据量 / 目标分片大小
示例:
日志数据 500GB/天,保留 30 天 = 15TB
目标分片大小 30GB
分片数 = 15000GB / 30GB = 500 个主分片
按索引切分(每天一个索引):
500 / 30 = 约 17 个主分片/天
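可以把上述估算封装成一个小工具函数,便于容量规划时复用(分片目标大小等数字沿用上文假设):
import math

def plan_shards(total_gb, target_shard_gb=30, indices=1):
    """返回 (总主分片数, 每个索引的主分片数)"""
    total_primary = math.ceil(total_gb / target_shard_gb)
    per_index = math.ceil(total_primary / indices)
    return total_primary, per_index

# 日志 500GB/天,保留 30 天,按天建索引
total, per_day = plan_shards(total_gb=500 * 30, target_shard_gb=30, indices=30)
print(total, per_day)  # 500 个主分片,约 17 个/天,与上文计算一致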
在设计和实施高可用的分布式架构时,分片的合理规划是关键一环。
2.4 脑裂预防与检测
◆ 2.4.1 网络隔离测试
#!/bin/bash
# 模拟网络分区(在测试环境使用)
# 节点 1 隔离节点 3
ssh node-1 "iptables -A INPUT -s 10.0.1.103 -j DROP"
ssh node-1 "iptables -A OUTPUT -d 10.0.1.103 -j DROP"
# 观察集群行为
watch -n 1 'curl -s http://10.0.1.101:9200/_cluster/health | jq'
# 恢复网络
ssh node-1 "iptables -D INPUT -s 10.0.1.103 -j DROP"
ssh node-1 "iptables -D OUTPUT -d 10.0.1.103 -j DROP"
◆ 2.4.2 监控脑裂指标
import requests
import time

def detect_split_brain(es_nodes):
    """
    检测集群是否出现脑裂:分别询问每个节点眼中的 Master 是谁
    """
    master_nodes = []
    for node in es_nodes:
        try:
            response = requests.get(f'http://{node}:9200/_cat/master?format=json')
            if response.status_code == 200:
                master_info = response.json()[0]
                master_nodes.append({
                    'queried_node': node,
                    'master_node': master_info['node'],
                    'master_id': master_info['id']
                })
        except Exception as e:
            print(f'Error querying {node}: {e}')

    # 检查是否有多个不同的 Master
    unique_masters = set(m['master_id'] for m in master_nodes)
    if len(unique_masters) > 1:
        print('[ALERT] Split-brain detected!')
        print(f'Multiple masters found: {unique_masters}')
        return True
    return False

# 持续监控
es_nodes = ['10.0.1.101', '10.0.1.102', '10.0.1.103']
while True:
    detect_split_brain(es_nodes)
    time.sleep(10)
◆ 2.4.3 自动化健康检查
#!/bin/bash
# es_health_check.sh
ES_HOST="localhost:9200"
LOG_FILE="/var/log/es_health.log"
check_cluster_health() {
    health=$(curl -s "http://$ES_HOST/_cluster/health")
    status=$(echo $health | jq -r '.status')
    echo "[$(date)] Cluster status: $status" >> $LOG_FILE
    if [ "$status" != "green" ]; then
        echo "[$(date)] WARNING: Cluster not green!" >> $LOG_FILE
        # 检查未分配分片
        unassigned=$(curl -s "http://$ES_HOST/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED)
        echo "$unassigned" >> $LOG_FILE
        # 发送告警
        send_alert "ES cluster status: $status"
    fi
}

check_master_stability() {
    # 连续 3 次检查 Master 是否变化
    master1=$(curl -s "http://$ES_HOST/_cat/master?h=id")
    sleep 5
    master2=$(curl -s "http://$ES_HOST/_cat/master?h=id")
    sleep 5
    master3=$(curl -s "http://$ES_HOST/_cat/master?h=id")
    if [ "$master1" != "$master2" ] || [ "$master2" != "$master3" ]; then
        echo "[$(date)] WARNING: Master node unstable!" >> $LOG_FILE
        send_alert "ES master node changed multiple times"
    fi
}

send_alert() {
    local message=$1
    # 集成告警系统
    curl -X POST "http://alert.example.com/api/alert" \
        -H "Content-Type: application/json" \
        -d "{\"message\": \"$message\"}"
}

# 主循环
while true; do
    check_cluster_health
    check_master_stability
    sleep 60
done
这种自动化的监控与告警是运维/DevOps实践中保障系统稳定性的重要组成部分。
2.5 脑裂恢复方案
◆ 2.5.1 手动恢复步骤
# 1. 识别正确的 Master 分区(数据最新、节点最多)
curl -X GET "http://node-1:9200/_cluster/state/master_node,version?pretty"
# 输出示例
{
"master_node" : "node-1-id",
"version" : 12345,
"state_uuid" : "abc123"
}
# 2. 停止错误分区的所有节点
ssh node-4 "systemctl stop elasticsearch"
ssh node-5 "systemctl stop elasticsearch"
# 3. 清理错误分区节点的数据(谨慎操作)
ssh node-4 "rm -rf /var/lib/elasticsearch/nodes/*/indices/*"
# 4. 重新加入集群
ssh node-4 "systemctl start elasticsearch"
# 5. 验证集群状态
curl -X GET "http://node-1:9200/_cluster/health?pretty"
◆ 2.5.2 自动化恢复脚本
import requests
import json
import subprocess
import time

class SplitBrainRecovery:
    def __init__(self, all_nodes):
        self.all_nodes = all_nodes
        self.partitions = []

    def detect_partitions(self):
        """
        检测网络分区:按各节点认可的 Master 对节点分组
        """
        masters = {}
        for node in self.all_nodes:
            try:
                resp = requests.get(f'http://{node}:9200/_cat/master?format=json', timeout=5)
                if resp.status_code == 200:
                    master_id = resp.json()[0]['id']
                    if master_id not in masters:
                        masters[master_id] = []
                    masters[master_id].append(node)
            except requests.RequestException:
                pass
        self.partitions = list(masters.values())
        return len(self.partitions) > 1

    def choose_primary_partition(self):
        """
        选择主分区(节点数最多且集群状态版本最新)
        """
        best_partition = None
        max_score = -1
        for partition in self.partitions:
            # 计算分数:节点数权重更高,其次比较集群状态版本号
            node = partition[0]
            try:
                resp = requests.get(f'http://{node}:9200/_cluster/state/version', timeout=5)
                version = resp.json()['version']
                score = len(partition) * 1000 + version
                if score > max_score:
                    max_score = score
                    best_partition = partition
            except requests.RequestException:
                pass
        return best_partition

    def stop_secondary_partitions(self, primary_partition):
        """
        停止次要分区的节点
        """
        for partition in self.partitions:
            if partition != primary_partition:
                for node in partition:
                    print(f'Stopping node {node}')
                    subprocess.run(['ssh', node, 'systemctl stop elasticsearch'])

    def recover(self):
        """
        执行恢复
        """
        if not self.detect_partitions():
            print('No split-brain detected')
            return
        print(f'Split-brain detected! Found {len(self.partitions)} partitions')
        primary_partition = self.choose_primary_partition()
        print(f'Primary partition: {primary_partition}')
        self.stop_secondary_partitions(primary_partition)
        # 等待节点停止
        time.sleep(30)
        # 重启次要分区节点,使其重新加入主分区
        for partition in self.partitions:
            if partition != primary_partition:
                for node in partition:
                    print(f'Restarting node {node}')
                    subprocess.run(['ssh', node, 'systemctl start elasticsearch'])
        print('Recovery completed')

# 使用示例
nodes = ['10.0.1.101', '10.0.1.102', '10.0.1.103', '10.0.1.104', '10.0.1.105']
recovery = SplitBrainRecovery(nodes)
recovery.recover()
三、示例代码和配置
3.1 完整配置示例
◆ 3.1.1 生产环境配置文件
# /etc/elasticsearch/elasticsearch.yml
# 生产环境标准配置
# ======================== 集群配置 ========================
cluster.name: production-cluster
node.name: es-master-1
# ======================== 节点角色 ========================
node.roles: [ master ]
# ======================== 路径配置 ========================
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
# ======================== 网络配置 ========================
network.host: 10.0.1.101
http.port: 9200
transport.port: 9300
# ======================== 集群发现 ========================
discovery.seed_hosts:
- 10.0.1.101
- 10.0.1.102
- 10.0.1.103
# 初始 Master 节点(仅首次启动)
cluster.initial_master_nodes:
- es-master-1
- es-master-2
- es-master-3
# ======================== 选举配置 ========================
cluster.election.duration: 30s
cluster.election.initial_timeout: 10s
cluster.election.back_off_time: 100ms
cluster.election.max_timeout: 10s
# ======================== 故障检测 ========================
cluster.fault_detection.follower_check.interval: 10s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 3
cluster.fault_detection.leader_check.interval: 10s
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 3
# ======================== 分片分配 ========================
cluster.routing.allocation.node_concurrent_recoveries: 2
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%
# ======================== 内存配置 ========================
bootstrap.memory_lock: true
# ======================== JVM 配置 ========================
# 在 jvm.options 中配置
# -Xms16g
# -Xmx16g
# -XX:+UseG1GC
# ======================== 安全配置 ========================
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
◆ 3.1.2 系统优化配置
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1
fs.file-max=65535
# /etc/security/limits.conf
elasticsearch soft nofile 65535
elasticsearch hard nofile 65535
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
# 禁用 Swap
swapoff -a
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
# 禁用 THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
3.2 实际应用案例
◆ 案例一:日志分析平台
# 索引生命周期管理(ILM)策略
PUT _ilm/policy/logs_policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"allocate": {
"require": {
"box_type": "warm"
}
},
"forcemerge": {
"max_num_segments": 1
},
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"require": {
"box_type": "cold"
}
},
"freeze": {},
"set_priority": {
"priority": 0
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
# 索引模板
PUT _index_template/logs_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "logs",
"index.routing.allocation.require.box_type": "hot"
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"level": { "type": "keyword" },
"message": { "type": "text" },
"host": { "type": "keyword" },
"service": { "type": "keyword" }
}
}
}
}
◆ 案例二:跨数据中心容灾
# 主数据中心(DC1)配置
# elasticsearch.yml
node.attr.datacenter: dc1
node.attr.zone: zone-1
cluster.routing.allocation.awareness.attributes: datacenter,zone
# 跨数据中心复制(CCR)
PUT _ccr/auto_follow/dc1_to_dc2
{
"remote_cluster": "dc2_cluster",
"leader_index_patterns": ["logs-*", "metrics-*"],
"follow_index_pattern": "{{leader_index}}-replica",
"settings": {
"index.number_of_replicas": 0
},
"max_read_request_operation_count": 5120,
"max_outstanding_read_requests": 12,
"max_read_request_size": "32mb",
"max_write_request_operation_count": 5120,
"max_write_request_size": "9mb",
"max_outstanding_write_requests": 9,
"max_write_buffer_count": 2147483647,
"max_write_buffer_size": "512mb",
"max_retry_delay": "500ms",
"read_poll_timeout": "1m"
}
四、最佳实践和注意事项
4.1 最佳实践
◆ 4.1.1 Master 节点配置建议
1. Master 节点数量:3 个或 5 个(奇数)
2. 专用 Master 节点:不承载数据和查询
3. 硬件配置:资源需求不高,8GB-16GB 内存即可,关键是稳定的 CPU 与低延迟网络
4. 网络要求:低延迟、高带宽
5. 监控告警:Master 切换、选举超时
◆ 4.1.2 分片数量规划
单节点分片数建议:< 1000 个
单分片大小建议:20GB - 50GB
副本数量:1 个(即每个主分片另有 1 份副本,共 2 份数据拷贝)
计算示例:
数据量:10TB
节点数:20 个
单分片大小:30GB
主分片数 = 10TB / 30GB ≈ 340 个
每节点分片数 = 340 * 2 / 20 = 34 个(合理)
◆ 4.1.3 监控指标
# 关键监控指标
GET _cluster/health
GET _cluster/stats
GET _nodes/stats
GET _cat/master?v
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,node.role
# Prometheus Exporter
# https://github.com/prometheus-community/elasticsearch_exporter
4.2 注意事项
◆ 4.2.1 版本升级注意事项
# 1. 滚动升级步骤
# 关闭分片分配
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
# 2. 停止一个节点
systemctl stop elasticsearch
# 3. 升级软件包
yum update elasticsearch
# 4. 启动节点
systemctl start elasticsearch
# 5. 等待节点加入集群
GET _cat/nodes?v
# 6. 重新启用分片分配
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
# 7. 重复 2-6 步骤升级所有节点
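升级过程中第 5 步的“等待节点加入集群”可以脚本化。下面是一个利用 _cluster/health 阻塞参数的 Python 示意(地址与节点数为示例,若启用了安全认证还需补充鉴权):
import requests

ES = "http://10.0.1.101:9200"  # 示例地址

def wait_for_nodes(expected, timeout="120s"):
    # wait_for_nodes 支持 ">=N" 这类表达式,请求会阻塞到条件满足或超时
    r = requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_nodes": f">={expected}", "timeout": timeout},
    )
    return not r.json().get("timed_out", True)

def wait_for_status(status="green", timeout="300s"):
    r = requests.get(
        f"{ES}/_cluster/health",
        params={"wait_for_status": status, "timeout": timeout},
    )
    return not r.json().get("timed_out", True)

# 升级并重启某个节点之后:
if wait_for_nodes(expected=9) and wait_for_status("green"):
    print("节点已重新加入且集群恢复 green,可继续升级下一个节点")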
◆ 4.2.2 常见错误避免
1. 避免使用通配符删除索引(DELETE /*)
2. 避免单个索引分片数过多(> 100 个)
3. 避免 Master 节点同时承载数据
4. 避免长期运行混合版本集群(仅在滚动升级期间短暂允许相邻版本共存)
5. 避免频繁重启 Master 节点
五、故障排查和监控
5.1 故障排查
◆ 5.1.1 集群状态异常
# 查看集群健康状态
GET _cluster/health?pretty
# 查看未分配分片原因
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason&v
GET _cluster/allocation/explain
# 示例输出
{
"index": "logs-2025.01.26",
"shard": 0,
"primary": true,
"current_state": "unassigned",
"unassigned_info": {
"reason": "NODE_LEFT",
"details": "node_left[node-4]"
},
"can_allocate": "no",
"allocate_explanation": "cannot allocate because allocation is disabled"
}
# 手动分配分片
POST _cluster/reroute
{
"commands": [
{
"allocate_replica": {
"index": "logs-2025.01.26",
"shard": 0,
"node": "node-5"
}
}
]
}
◆ 5.1.2 Master 选举失败
# 查看选举日志
tail -f /var/log/elasticsearch/production-cluster.log | grep "master"
# 常见日志
[2025-01-26 10:00:00] [INFO ] master not discovered yet
[2025-01-26 10:00:30] [WARN ] timed out while waiting for initial discovery state
[2025-01-26 10:01:00] [INFO ] elected-as-master
# 排查步骤
1. 检查网络连通性:ping、telnet 9300 端口(可用下文脚本批量检测)
2. 检查防火墙规则
3. 检查 discovery.seed_hosts 配置
4. 检查 cluster.name 是否一致
5. 检查节点时间是否同步(NTP)
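针对上面第 1、4 步,可以用下面的 Python 脚本批量探测 transport 端口连通性并比对各节点返回的 cluster_name(地址为示例;若启用了 HTTP 层安全认证需补充鉴权):
import socket
import requests

nodes = ["10.0.1.101", "10.0.1.102", "10.0.1.103"]  # 示例地址

for host in nodes:
    # TCP 层探测 transport 端口
    try:
        with socket.create_connection((host, 9300), timeout=3):
            transport_ok = True
    except OSError:
        transport_ok = False
    # 通过 HTTP 根路径读取 cluster_name,检查各节点是否一致
    try:
        cluster = requests.get(f"http://{host}:9200", timeout=3).json().get("cluster_name")
    except requests.RequestException:
        cluster = None
    print(f"{host} transport(9300)={'OK' if transport_ok else 'FAIL'} cluster_name={cluster}")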
◆ 5.1.3 性能问题排查
# 查看搜索耗时统计(慢查询日志需在索引级别开启 search slowlog)
GET _nodes/stats/indices/search
# 查看线程池状态
GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected
# 查看热点线程
GET _nodes/hot_threads
# 查看磁盘使用
GET _cat/allocation?v&h=node,disk.used,disk.total,disk.percent
5.2 性能监控
◆ 5.2.1 Elasticsearch Exporter
# docker-compose.yml
version: '3'
services:
elasticsearch-exporter:
image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
command:
- '--es.uri=http://elasticsearch:9200'
- '--es.all'
- '--es.indices'
- '--es.cluster_settings'
ports:
- "9114:9114"
◆ 5.2.2 Grafana Dashboard
// 导入 Dashboard ID: 266 (Elasticsearch Overview)
// 或自定义 Dashboard
{
"panels":[
{
"title":"Cluster Health",
"targets":[
{"expr":"elasticsearch_cluster_health_status{cluster='production'}"}
]
},
{
"title":"Master Node",
"targets":[
{"expr":"elasticsearch_cluster_health_number_of_nodes{cluster='production'}"}
]
}
]
}
5.3 备份与恢复
◆ 5.3.1 快照备份
# 配置快照仓库
PUT _snapshot/backup_repo
{
"type": "fs",
"settings": {
"location": "/mount/backups/elasticsearch",
"compress": true
}
}
# 创建快照
PUT _snapshot/backup_repo/snapshot_2025_01_26
{
"indices": "logs-*,metrics-*",
"ignore_unavailable": true,
"include_global_state": false
}
# 查看快照状态
GET _snapshot/backup_repo/snapshot_2025_01_26/_status
# 恢复快照
POST _snapshot/backup_repo/snapshot_2025_01_26/_restore
{
"indices": "logs-2025.01.26",
"ignore_unavailable": true,
"include_global_state": false
}
◆ 5.3.2 自动化备份脚本
#!/bin/bash
# es_backup.sh
ES_HOST="localhost:9200"
REPO_NAME="backup_repo"
DATE=$(date +%Y%m%d_%H%M%S)
SNAPSHOT_NAME="snapshot_$DATE"
# 创建快照
curl -X PUT "http://$ES_HOST/_snapshot/$REPO_NAME/$SNAPSHOT_NAME?wait_for_completion=true" \
-H 'Content-Type: application/json' \
-d '{
"indices": "logs-*,metrics-*",
"ignore_unavailable": true,
"include_global_state": false
}'
# 删除 30 天前当天创建的快照(脚本按天执行即可实现滚动清理)
OLD_DATE=$(date -d '30 days ago' +%Y%m%d)
curl -X GET "http://$ES_HOST/_snapshot/$REPO_NAME/_all" | \
jq -r ".snapshots[] | select(.snapshot | startswith(\"snapshot_$OLD_DATE\")) | .snapshot" | \
while read snapshot; do
curl -X DELETE "http://$ES_HOST/_snapshot/$REPO_NAME/$snapshot"
done
六、总结
6.1 技术要点回顾
- 脑裂的根本原因是网络分区和配置不当,通过正确配置 minimum_master_nodes 或使用 ES 7.0+ 的自动 Quorum 机制可有效预防
- Master 节点负责集群元数据管理,选举基于集群状态版本号和节点 ID,ES 7.0+ 引入类 Raft 的协调机制提供更强一致性
- 高可用架构应采用节点角色分离、跨机架部署、合理的分片分配策略
- 监控和自动化是保证集群稳定的关键,应建立完善的告警和故障恢复机制
6.2 进阶学习方向
- Elasticsearch 内部原理:Lucene 索引结构、倒排索引、段合并
- 查询性能优化:查询 DSL、聚合优化、缓存策略
- 大规模集群运维:数千节点集群的管理和优化
- Elastic Stack 整合:Logstash、Kibana、Beats 的协同使用
6.3 参考资料
- Elasticsearch 官方文档
- Elasticsearch 权威指南
- Elastic Stack 实战手册
- Elasticsearch 源码分析
附录
A. 命令速查表
# 集群管理
GET _cluster/health
GET _cluster/state
GET _cluster/stats
GET _cluster/pending_tasks
# 节点管理
GET _cat/nodes?v
GET _cat/master?v
GET _nodes/stats
# 索引管理
GET _cat/indices?v
GET _cat/shards?v
DELETE /index_name
# 分片管理
POST _cluster/reroute
GET _cluster/allocation/explain
PUT _cluster/settings
# 快照管理
GET _snapshot
GET _snapshot/repo_name/_all
PUT _snapshot/repo_name/snapshot_name
POST _snapshot/repo_name/snapshot_name/_restore
B. 配置参数速查
| 参数 |
说明 |
推荐值 |
cluster.name |
集群名称 |
唯一标识 |
node.name |
节点名称 |
唯一标识 |
discovery.seed_hosts |
种子节点列表 |
所有 Master |
cluster.initial_master_nodes |
初始 Master 节点 |
首次启动配置 |
cluster.election.duration |
选举超时时间 |
30s-60s |
network.host |
绑定地址 |
0.0.0.0 |
http.port |
HTTP 端口 |
9200 |
transport.port |
传输端口 |
9300 |
C. 故障排查 Checklist
- 检查集群健康状态:GET _cluster/health
- 查看 Master 节点:GET _cat/master
- 检查节点状态:GET _cat/nodes
- 查看未分配分片:GET _cat/shards?h=index,shard,state,unassigned.reason
- 检查网络连通性:telnet node-ip 9300
- 查看日志文件:tail -f /var/log/elasticsearch/<cluster.name>.log
- 验证配置文件:检查 elasticsearch.yml 的语法与关键参数(配置错误会在启动日志中报告)
- 检查磁盘空间:df -h
- 检查 JVM 堆内存:GET _nodes/stats/jvm
- 测试快照恢复:POST _snapshot/repo_name/snapshot_name/_restore