RAC heartbeat-timeout node eviction: bonding mode 4 failure and reconfig duration analysis

Posted 2 hours ago | Views: 4 | Replies: 0

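A production Oracle RAC high-availability cluster runs on a network built from two stacked switches, with bonding configured on the hosts. One day one of the switches failed suddenly, and node 2 was evicted from the cluster right away. The heartbeat link was designed to be redundant, but the symptoms show that failover did not take effect within the 30-second timeout window. The cluster recovered on its own about 7 minutes later.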

Timeline analysis

At 17:30:41 the cluster began to detect heartbeat communication problems:

2025-12-04 17:30:41.266 [OCSSD(19050)]CRS-7503: The Oracle Grid Infrastructure
process 'ocssd' observed communication issues between node 'rac2' and node
'rac1', interface list of local node 'rac2' is '1x.xx.xxx.152:13431;',
interface list of remote node 'rac1' is '1x.xx.xxx.151:57615;'.

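At 17:31:01, CSS forcibly terminated node 2 because of the heartbeat timeout; the VIP was removed from that node and failed over to node 1. Node 2 was then removed from cluster membership at 17:31:07.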

# Node 1:
2025-12-04 17:31:02.818 [EVMD(15212)]CRS-7503: The Oracle Grid Infrastructure
process 'evmd' observed communication issues between node 'rac1' and node
'rac2', interface list of local node 'rac1' is '1x.xx.xxx.151:26997;',
interface list of remote node 'rac2' is '1x.xx.xxx.152:55995;'.
2025-12-04 17:31:06.778 [OCSSD(16377)]CRS-1601: CSSD Reconfiguration complete.
Active nodes are rac1.
2025-12-04 17:31:06.784 [CRSD(19684)]CRS-5504: Node down event reported for node
'rac2'.
2025-12-04 17:31:07.982 [CRSD(19684)]CRS-2773: Server 'rac2' has been
removed from pool 'Generic'.
# Node 2:
2025-12-04 17:30:47.192 [OCSSD(19050)]CRS-1612: Network communication with node
rac1(1) has been missing for 50% of the timeout interval. If this
persists, removal of this node from cluster will occur in 14.070 seconds
2025-12-04 17:30:47.529 [OCTSSD(20812)]CRS-7503: The Oracle Grid Infrastructure
process 'octssd' observed communication issues between node 'rac2' and node
'rac1', interface list of local node 'rac2' is '10.104.191.152:16469;',
interface list of remote node 'rac1' is '10.104.191.151:35158;'.
2025-12-04 17:30:54.193 [OCSSD(19050)]CRS-1611: Network communication with node
rac1(1) has been missing for 75% of the timeout interval. If this
persists, removal of this node from cluster will occur in 7.070 seconds
2025-12-04 17:30:59.194 [OCSSD(19050)]CRS-1610: Network communication with node
rac1 (1) has been missing for 90% of the timeout interval. If this
persists, removal of this node from cluster will occur in 2.070 seconds
2025-12-04 17:31:01.766 [OCSSD(19050)]CRS-1609: This node is unable to
communicate with other nodes in the cluster and is going down to preserve cluster
integrity; details at (:CSSNM00008:) in
/app/grid/base/diag/crs/rac2/crs/trace/ocssd.trc.
2025-12-04 17:31:01.767 [OCSSD(19050)]CRS-1656: The CSS daemon is terminating due
to a fatal error; Details at (:CSSSC00012:) in
/app/grid/base/diag/crs/rac2/crs/trace/ocssd.trc
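The three CSS warnings above each state how many seconds remain before eviction, so they can be cross-checked against one another. The following is a minimal sketch, not part of the original analysis: it copies the timestamps and countdown values from the CRS-1612/1611/1610 lines and assumes the 30-second heartbeat timeout mentioned in this post to estimate when the last good heartbeat was seen.

```python
from datetime import datetime, timedelta

# (log timestamp, seconds remaining) copied from the node 2 warnings above.
WARNINGS = [
    ("2025-12-04 17:30:47.192", 14.070),  # CRS-1612, 50% of the interval
    ("2025-12-04 17:30:54.193", 7.070),   # CRS-1611, 75% of the interval
    ("2025-12-04 17:30:59.194", 2.070),   # CRS-1610, 90% of the interval
]
MISSCOUNT_SECONDS = 30  # assumption: the 30 s heartbeat timeout cited in the post


def predicted_eviction(timestamp: str, remaining: float) -> datetime:
    """Return the eviction deadline implied by one warning line."""
    logged_at = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
    return logged_at + timedelta(seconds=remaining)


deadlines = [predicted_eviction(ts, rem) for ts, rem in WARNINGS]
last_heartbeat = min(deadlines) - timedelta(seconds=MISSCOUNT_SECONDS)

for deadline in deadlines:
    print("predicted eviction:", deadline.time())
print("implied last good heartbeat:", last_heartbeat.time())
```

All three warnings point to the same deadline, just after 17:31:01, which matches the CRS-1609 termination logged at 17:31:01.766. Under the 30-second assumption, the heartbeat was already lost at about 17:30:31, roughly ten seconds before the first CRS-7503 message.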

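The database log shows that ORAAGENT started the instance on node 2 at 17:31:55. At 17:32:07 the database began its reconfig, but because private-network communication had not yet recovered, the reconfig stayed in a waiting state. Only at 17:37:41, once private-network communication was back to normal, did the reconfig complete, after about 334 seconds in total. At 17:37:42 the instance registered with the listener, and from then on the cluster and database were healthy and serving clients again.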

# Node 1:
2025-12-04T17:32:07.081760+08:00
Reconfiguration started (old inc 5, new inc 7)
List of instances (total 2) :
1 2
New instances (total 1) :
2
My inst 1
# Node 2:
2025-12-04T17:31:55.646413+08:00
Starting ORACLE instance (normal) (OS id: 3549958)
...
...
...
2025-12-04T17:32:07.723956+08:00
Using default pga_aggregate_limit of 81920 MB
2025-12-04T17:37:41.797781+08:00
Reconfiguration complete (total time 334.7 secs)
Decreasing priority of 4 RS
2025-12-04T17:37:41.798108+08:00
Starting background process LCK0
2025-12-04T17:37:41.822978+08:00
LCK0 started with pid=9, OS id=3568372
Starting background process RSMN
2025-12-04T17:37:41.854115+08:00
RSMN started with pid=79, OS id=3568376
Starting background process TMON
2025-12-04T17:37:41.875179+08:00
TMON started with pid=80, OS id=3568383
ORACLE_BASE from environment = /app/oracle
2025-12-04T17:37:42.013689+08:00
NOTE: ASMB (index:0) (3550312) connected to ASM instance +ASM2, osid: 3550318
(Flex mode; client id 0x4b0c9f3ff9f551f2)
NOTE: initiating MARK startup
Starting background process MARK
2025-12-04T17:37:42.034675+08:00
ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.104.160.154)
(PORT=1528))' SCOPE=MEMORY SID='rac2';
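As a sanity check on the 334-second figure, the alert-log timestamps above can be subtracted directly. This is a small sketch that pairs the "Reconfiguration started" timestamp from node 1 with the "Reconfiguration complete" timestamp from node 2; pairing timestamps across the two nodes is an assumption made here for illustration.

```python
from datetime import datetime

# Timestamps copied verbatim from the alert log excerpts above.
instance_started = datetime.fromisoformat("2025-12-04T17:31:55.646413+08:00")   # node 2
reconfig_started = datetime.fromisoformat("2025-12-04T17:32:07.081760+08:00")   # node 1
reconfig_complete = datetime.fromisoformat("2025-12-04T17:37:41.797781+08:00")  # node 2

reconfig_seconds = (reconfig_complete - reconfig_started).total_seconds()
startup_seconds = (reconfig_complete - instance_started).total_seconds()

print(f"reconfig start to complete: {reconfig_seconds:.1f} s")
print(f"instance start to reconfig complete: {startup_seconds:.1f} s")
```

The first value agrees with the "total time 334.7 secs" that Oracle itself reports in the log.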

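Failure and recovery timeline summary

17:31:01: The VIP is removed from node 2 and fails over to node 1; node 2 is evicted from the cluster.
17:31:03: CRSD restarts automatically; the first attempt fails with "Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer]".
17:31:23: CRSD retries the restart and starts successfully.
17:31:42: The VIP fails back to node 2.
17:31:55: ORAAGENT begins starting the database instance.
17:32:07: The database begins its reconfig.
17:32:30: The cluster detects heartbeat problems again.
17:37:41: The heartbeat returns to normal and the reconfig completes after 334 seconds.
17:37:42: Node 2 registers with the listener.
17:37:47: The database is open and startup is complete; the cluster and database are serving clients again.
17:37:49: The first connection routed to node 2 appears.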
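To see where the roughly 7 minutes went, the timeline above can be broken into phases. The sketch below uses a subset of the events listed above, with shortened labels chosen here for illustration, and computes how long each phase lasted and its share of the total time from eviction to database open.

```python
from datetime import datetime

# (time, event) pairs taken from the timeline summary; labels are shortened.
EVENTS = [
    ("17:31:01", "node 2 evicted, VIP fails over"),
    ("17:31:23", "CRSD restart succeeds"),
    ("17:31:42", "VIP back on node 2"),
    ("17:31:55", "instance startup begins"),
    ("17:32:07", "reconfig begins"),
    ("17:37:41", "reconfig completes"),
    ("17:37:47", "database open"),
]


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two HH:MM:SS times on the same day."""
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds())


# Each phase is labeled by the event that opens it.
phases = [
    (EVENTS[i][1], seconds_between(EVENTS[i][0], EVENTS[i + 1][0]))
    for i in range(len(EVENTS) - 1)
]
total = seconds_between(EVENTS[0][0], EVENTS[-1][0])

for label, secs in phases:
    print(f"{label:32s} {secs:4d} s  ({secs / total:.0%})")
print(f"{'total':32s} {total:4d} s")
```

Of the 406 seconds between eviction and database open, 334 seconds are spent waiting in reconfig, which is why the rest of this post concentrates on that phase.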

Local simulation tests

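To get a clearer picture of the reconfig duration and its real impact on the application, we reproduced the customer's scenario several times in local and test environments: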

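On both 11g and 19c, while a node is recovering and its database instance is starting, the database can occasionally accept a very small number of writes during reconfig, but from the application's point of view it is effectively unavailable. Existing connections are neither terminated with an error nor released; they resume automatically once reconfig completes.
Combining the customer's errors with the simulation results, the root cause of the application errors during reconfig is that database connections hang while reconfig is in progress. The connection pool cannot reclaim them, so the pool eventually runs out of connections and raises errors such as "unable to obtain JDBC connection".
The detailed mechanics of reconfig, and what determines how long it takes, are internal to Oracle; we cannot inspect the details and have little ability to intervene directly.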
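The pool-exhaustion behaviour described above can be illustrated with a toy model. This is a sketch only: `TinyPool` is a hypothetical class written for this post, not a real JDBC pool or driver API, and the "hung" connections are simply never returned rather than actually blocked on a database.

```python
import queue


class TinyPool:
    """Hypothetical fixed-size connection pool, standing in for a JDBC pool."""

    def __init__(self, size: int) -> None:
        self._free: queue.Queue = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float) -> str:
        """Borrow a connection, or fail if none is free within the timeout."""
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("unable to obtain connection from pool") from None

    def release(self, conn: str) -> None:
        """Return a connection to the pool."""
        self._free.put(conn)


pool = TinyPool(size=2)

# During reconfig the borrowed sessions hang: no error, no release.
hung = [pool.acquire(timeout=0.1) for _ in range(2)]

try:
    pool.acquire(timeout=0.1)
    exhausted = False
except TimeoutError:
    exhausted = True
print("pool exhausted during reconfig:", exhausted)

# Once reconfig completes, the hung calls return and the connections are released.
for conn in hung:
    pool.release(conn)
recovered = pool.acquire(timeout=0.1)
print("connection available after reconfig:", recovered)
```

The point of the model is that the failure surfaces in the pool, not in the sessions themselves: every borrowed connection is still valid, but new requests time out until the hung ones come back.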

Summary

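The customer's NICs were bonded in mode 4 (802.3ad). The logic of this mode is relatively complex, while the cluster's default heartbeat timeout is only 30 seconds. When one of the physical links flaps or fails, mode 4 may not complete the failover within 30 seconds, which directly causes the cluster heartbeat to time out and the node to be evicted. After the bond was changed to mode 1 (active-backup), the cluster returned to normal.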
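When auditing hosts for this issue, the bonding mode and per-link state can be read from the Linux bonding driver's status file, typically `/proc/net/bonding/<bond name>`. The sketch below parses that format from a string. The sample text, the bond layout, and the interface names `eth0`/`eth1` are illustrative assumptions, not output from the customer's servers.

```python
# Illustrative sample in the style of /proc/net/bonding/bond0; not real customer output.
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""


def parse_bonding(text: str) -> tuple:
    """Return (bonding mode, {slave interface: MII status}) from bonding status text."""
    mode = None
    slaves = {}
    current = None  # slave section currently being read, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and lines without a key
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            slaves[current] = value  # the bond-level MII Status is skipped
    return mode, slaves


mode, slaves = parse_bonding(SAMPLE)
print("mode:", mode)
print("links:", slaves)
```

On a real host the same function could be fed the contents of the status file for the private-interconnect bond; a mode string containing "802.3ad" corresponds to mode 4, while "active-backup" corresponds to mode 1.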

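During a database reconfig, data cannot be operated on normally, and connections hang without being released. The duration and implementation of reconfig are internal to Oracle and cannot be influenced from outside. Our repeated simulations do show, however, that the duration is closely related to the complexity of the business data held in memory.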

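Note in particular that with mode 4 bonding, a failed node is very likely to be unable to rejoin the cluster because the database reconfig times out. The main reason is that Cache Fusion is very sensitive to packet ordering during reconfig, and mode 4 does not guarantee strict packet ordering. For RAC private-network traffic we therefore strongly recommend mode 1, or another bonding policy that preserves packet order.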
