When the primary node of a production database hits a hardware fault or an unexpected reboot, keeping the business running is critical. Using Kingbase Database V8R6 (specifically V008R006C009B0014) on Linux x86_64, this article walks through the complete workflow: how the cluster automatically performs a failover after the primary fails, and how the failed node rejoins the cluster once it recovers.
First, let's confirm the initial cluster environment. The repmgr cluster show command reveals a three-node cluster: node3 is the primary, while node1 and node2 are standbys.
[kingbase@knode3 bin]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | standby | running | node3 | default | 100 | 4 | 0 bytes | host=192.168.45.137 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node3 | default | 100 | 4 | 0 bytes | host=192.168.45.138 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | primary | * running | | default | 100 | 4 | | host=192.168.45.159 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
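Before breaking anything, it can also be worth confirming streaming replication from the primary's side. A minimal check through the ksql client, assuming KES exposes the PostgreSQL-style replication view under its sys_ prefix (sys_stat_replication) with the usual columns:
[kingbase@knode3 bin]$ ksql -U esrep -d esrep -p 54321 -c "select application_name, state, sync_state from sys_stat_replication;"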
1. Simulating a Primary Failure and the Automatic Failover
We simulate a failure by rebooting the primary node (node3).
[kingbase@knode3 bin]$ exit
logout
[root@knode3 ~]# reboot
Once the primary is down, the high-availability management component initiates a failover on its own. The hamgr.log on node1 (previously a standby) shows it being promoted to the new primary and taking over the virtual IP.
[2026-03-10 09:56:40] [NOTICE] new primary node (ID: 1) acquire the virtual ip 192.168.45.250/24 success
...
[DETAIL] server "node1" (ID: 1) was successfully promoted to primary
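The VIP takeover can also be verified at the OS level on node1, independently of the log; a quick check with standard iproute2 tooling (the address 192.168.45.250 comes from the log line above):
[kingbase@knode1 ~]$ ip addr | grep 192.168.45.250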
At the same time, node1 automatically ran a recovery operation against the other standby, node2, reattaching it to the new primary (node1).
[2026-03-10 09:56:58] [INFO] [thread pid:6077] ES connection to host "192.168.45.138" succeeded, ready to do auto-recovery
[2026-03-10 09:56:58] [INFO] node "node2" (ID: 2, HOST: 192.168.45.138) auto-recovery: STANDBY FOLLOW
...
[NOTICE] STANDBY FOLLOW successful
node2's own log confirms the change: its upstream has switched from node3 to node1.
[2026-03-10 09:57:01] [DETAIL] currently monitoring upstream 3; new upstream is 1
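repmgr also records these transitions in its internal event table. If you prefer a consolidated history over per-node log files, upstream repmgr offers a cluster event subcommand (assuming the repmgr bundled with KES keeps it):
[kingbase@knode1 ~]$ repmgr cluster event --limit 10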
With that, the automatic failover is complete. Checking the cluster state again: node1 is now the primary (the timeline has advanced from 4 to 5), node2 is a healthy standby, and node3 is reported as failed.
[kingbase@knode2 log]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 5 | | host=192.168.45.137 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 4 | 0 bytes | host=192.168.45.138 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | primary | - failed | ? | default | 100 | | | host=192.168.45.159 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2. Rejoining the Failed Node to the Cluster (Rejoin)
The failed node3 is now back up. How do we bring it back into the new cluster as a standby? There are two scenarios.
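A quick way to tell which scenario applies is repmgr's dry-run mode, which reports whether the local node can still attach to the rejoin target without changing anything, assuming the bundled repmgr supports the standard --dry-run flag:
[kingbase@knode3 ~]$ repmgr node rejoin -h 192.168.45.137 -d esrep -U esrep -p 54321 --dry-run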
Scenario 1: no WAL loss, use repmgr node rejoin
If no WAL has been lost between the old primary (node3) and the new primary (node1), i.e. replication can resume without a gap, the rejoin command can be used directly. Run the following on node3, pointing it at the new primary's address.
[kingbase@knode3 ~]$ repmgr node rejoin -h 192.168.45.137 -d esrep -U esrep -p 54321
The command checks WAL continuity, sets the new upstream, and starts the database service.
[INFO] local node 3 can attach to rejoin target node 1
[DETAIL] local node's recovery point: 0/E012690; rejoin target node's fork point: 0/E012738
...
[NOTICE] NODE REJOIN successful
[DETAIL] node 3 is now attached to node 1
Once the operation succeeds, the cluster state shows node3 running as a standby again, with node1 as its upstream.
[kingbase@knode3 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 5 | | host=192.168.45.137 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 5 | 0 bytes | host=192.168.45.138 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node1 | default | 100 | 4 | 0 bytes | host=192.168.45.159 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
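One caveat: if the old primary had written WAL past the fork point (a diverged timeline), a plain rejoin is refused. Upstream repmgr handles this case with the --force-rewind option, which rewinds the data directory before attaching; in KES this would rely on sys_rewind, the counterpart of PostgreSQL's pg_rewind, so treat the exact tooling as an assumption:
[kingbase@knode3 ~]$ repmgr node rejoin -h 192.168.45.137 -d esrep -U esrep -p 54321 --force-rewind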
Scenario 2: WAL loss, use repmgr standby clone
If the old primary has been down for a long time, or WAL has been lost so that rejoin can no longer catch the node up, you have to rebuild it with a clone, a full data sync from the current primary. This is the most thorough way to rebuild a node.
For the demonstration, we first reboot node1 so that node2 becomes the primary; the cluster state then looks as follows (node1 is failed).
[kingbase@knode2 log]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | - failed | ? | default | 100 | | | host=192.168.45.137 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | primary | * running | | default | 100 | 6 | | host=192.168.45.138 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node2 | default | 100 | 5 | 0 bytes | host=192.168.45.159 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
On the failed node, node1, run the clone command against the current primary's (node2) address, with the --force flag to overwrite the existing data directory.
[kingbase@knode1 ~]$ repmgr standby clone -h 192.168.45.138 -d esrep -U esrep -p 54321 --force
The command connects to the primary and pulls a full data backup with the sys_basebackup tool.
[NOTICE] starting backup (using sys_basebackup)...
[NOTICE] standby clone (using sys_basebackup) complete
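Under the hood, standby clone is essentially a wrapper around a full base backup. A roughly equivalent manual invocation, with the flag set borrowed from PostgreSQL's pg_basebackup (which exact options repmgr passes is an assumption):
[kingbase@knode1 ~]$ sys_basebackup -h 192.168.45.138 -p 54321 -U esrep -D /data/Kingbase/ES/V8/data -X stream -P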
When the clone completes, start node1's database service first, as prompted.
[kingbase@knode1 ~]$ sys_ctl -D /data/Kingbase/ES/V8/data start
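Before going further, you can verify that the instance actually came up; sys_ctl mirrors PostgreSQL's pg_ctl here:
[kingbase@knode1 ~]$ sys_ctl -D /data/Kingbase/ES/V8/data status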
After startup, the cluster view shows node1 with the status running as standby, but its Role column still says primary, which no longer matches reality.
[kingbase@knode2 log]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+----------------------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | ! running as standby | | default | 100 | 6 | | host=192.168.45.137 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
...
We therefore run the standby register command on node1 with the --force flag to update its role in the cluster metadata.
[kingbase@knode1 ~]$ repmgr standby register --force
[INFO] standby registration complete
[NOTICE] standby node "node1" (ID: 1) successfully registered
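As a final sanity check on the re-registered node, upstream repmgr provides a node check subcommand that verifies replication, the upstream connection, and the data directory (again assuming KES's bundled repmgr keeps it):
[kingbase@knode1 ~]$ repmgr node check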
After registration, the cluster state is healthy again: node2 is the primary, and node1 and node3 are both its standbys, together forming a highly available three-node cluster.
[kingbase@knode3 log]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | standby | running | node2 | default | 100 | 6 | 0 bytes | host=192.168.45.137 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | primary | * running | | default | 100 | 6 | | host=192.168.45.138 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node2 | default | 100 | 6 | 0 bytes | host=192.168.45.159 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
Summary
This hands-on walkthrough covered the complete recovery workflow of a Kingbase high-availability cluster after a primary failure:
- Automatic failover: after the primary fails, the HA manager promotes a standby to the new primary on its own and takes over the service.
- Node rejoin: if the failed node's data is intact, repmgr node rejoin quickly reattaches it to the cluster as a working standby.
- Clone rebuild: if the failed node's data is no longer usable, repmgr standby clone performs a full clone from the current primary, after which the node is re-registered as a standby.
These two rejoin paths give you flexible, reliable recovery options for different failure scenarios, and mastering them is essential to keeping database services highly available and continuous. If you want to dig deeper into database administration and practice, you are welcome to join the discussion at 云栈社区.
Reference: Kingbase KES official documentation - 灾备演练 (disaster recovery drill)