RAC Node Eviction
Node eviction (reboot) is used for I/O fencing: it ensures that writes from I/O-capable clients are cleared, avoiding potential corruption in the event of a network split, a node hang, or some other fatal event in a clustered environment.
By definition, I/O fencing (a cluster industry technique) is the isolation of a malfunctioning node from a cluster’s shared storage to protect the integrity of data.
The reboot must happen as fast as possible. Skipping the flush of local disks and the graceful shutdown of processes helps the node go down quickly. It is imperative that no I/O is flushed to the shared disks; otherwise, irrelevant information may be written to clusterware components (OCR or voting disk) or to database files.
Who evicts/reboots the node
The Oracle Clusterware (CRS) daemons are started by init when the machine boots, namely CRSD, OCSSD, EVMD, OPROCD (when vendor clusterware is absent), and OCLSOMON.
There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot:
1. ocssd.bin (runs as oracle)
2. oclsomon.bin (monitors OCSSD and runs as root)
3. oprocd.bin (I/O fencing in a non-vendor clusterware environment; runs as root)
Other non-CRS processes capable of evicting:
- OCFS2 (if used)
- Vendor clusterware (if used)
- Operating system (panic)
When node eviction happens
Read below to understand when OCSSD will trigger the node reboot.
- OCSSD’s primary job is internode health monitoring (via the NM and GM services); its other roles are not discussed in depth here.
- It is a multi-threaded application, i.e. several threads run simultaneously, each performing a specific task, e.g. threads for performing the heartbeats (network and disk) and monitoring them, sending and receiving cluster messages, etc. The ocssd.log reveals all the thread names (clss%, clsc%, etc.).
- Evictions occur when CSS detects a heartbeat problem that must be dealt with, for example lost network communication with one or more other nodes, or lost disk heartbeat information from them. CSS-initiated evictions (via poison packet or kill block) should always result in a node reboot.
- init spawns init.cssd, which in turn spawns OCSSD as a child. If OCSSD dies, is killed, or exits, the node-kill functionality of the init script will kill the node. Killing init.cssd (the respawn creates a duplicate cssd) will also result in a reboot.
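The heartbeat-based eviction decision described above can be illustrated with a minimal Python sketch. It is not Oracle's implementation: the HeartbeatState type, the eviction_reason function, and the threshold values passed in the example are all hypothetical, and the sketch only assumes that a peer becomes an eviction candidate once its network or disk heartbeat has been silent for longer than a configured limit (misscount and disktimeout, respectively).

```python
from dataclasses import dataclass


@dataclass
class HeartbeatState:
    """Last time (in seconds) each heartbeat from a peer node was seen."""
    last_network_hb: float
    last_disk_hb: float


def eviction_reason(state, now, misscount, disktimeout):
    """Return why a peer should be evicted, or None if it looks healthy.

    misscount and disktimeout are thresholds in seconds supplied by the
    caller; the values used below are illustrative only.
    """
    if now - state.last_network_hb > misscount:
        return "network heartbeat lost"
    if now - state.last_disk_hb > disktimeout:
        return "disk heartbeat lost"
    return None


# Hypothetical peer whose heartbeats were both last seen at t=100s.
peer = HeartbeatState(last_network_hb=100.0, last_disk_hb=100.0)
print(eviction_reason(peer, now=110.0, misscount=30, disktimeout=200))  # None
print(eviction_reason(peer, now=140.0, misscount=30, disktimeout=200))  # network heartbeat lost
```

Passing the current time in as an argument keeps the decision a pure function, which makes it easy to replay a timeline taken from logs.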
Read below to understand when OCLSOMON will trigger the node reboot.
By now we know that a working CSS is crucial for cluster functioning. This calls for a process that keeps track of the health of CSS: OCLSOMON.
This process monitors the CSS daemon for hangs or scheduling issues and can reboot a node if there is a perceived hang of CSS threads.
A variety of problems, such as issues with the OS scheduler, resources, hardware, drivers, the network, misconfiguration, or an Oracle code bug, may cause a process or thread to hang or crash.
Some routines in the OS kernel are non-preemptible and can cause a kernel lockup, which leads to CPU starvation or scheduling issues. On AIX, typically, memory overcommitment may lead to heavy paging activity, resulting in scheduling issues.
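The kind of hang monitoring OCLSOMON performs on CSS can be pictured as a generic watchdog. The sketch below is only an illustration of that pattern and makes no claim about how OCLSOMON is actually written: the ThreadWatchdog class, the thread names, and the 5-second threshold are all hypothetical. Each monitored thread reports progress, and any thread that has been silent for longer than the threshold is flagged as hung.

```python
class ThreadWatchdog:
    """Tracks when each monitored thread last reported progress."""

    def __init__(self, hang_threshold):
        # Seconds of silence after which a thread counts as hung
        # (hypothetical value chosen by the caller).
        self.hang_threshold = hang_threshold
        self.last_seen = {}

    def touch(self, thread_name, now):
        """Record that thread_name made progress at time now."""
        self.last_seen[thread_name] = now

    def hung_threads(self, now):
        """Return the sorted names of threads silent for too long."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.hang_threshold
        )


# Hypothetical thread names; only one of them keeps reporting.
wd = ThreadWatchdog(hang_threshold=5.0)
wd.touch("network_heartbeat", now=0.0)
wd.touch("disk_heartbeat", now=0.0)
wd.touch("network_heartbeat", now=4.0)
print(wd.hung_threads(now=6.0))  # ['disk_heartbeat']
```

In a real monitor a non-empty result would lead to the fencing action; here it is simply printed.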
Read below to understand when OPROCD will trigger the node reboot.
Unlike CSS, which is responsible for maintaining and monitoring the health of all nodes in the cluster, OPROCD monitors the health of the local node on which it runs (when vendor clusterware is absent).
The OPROCD process is locked in memory to monitor the cluster node on which it executes, to detect scheduling latencies caused by hardware and driver freezes, and to provide I/O fencing functionality (only in 10g and 11gR1).
To provide this functionality, OPROCD performs its check, stops running (sleeps for the timeout, -t 1000 ms), and if it wakes up later than the expected time by more than the margin (-m 500 ms), OPROCD reboots the local node. Think of hitting snooze on the alarm clock and oversleeping on exam day.
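The sleep-and-check logic can be sketched in Python as follows. This is a simplified illustration, not OPROCD's code: the function names woke_too_late and check_once are hypothetical, the timeout and margin are taken from the 1000 ms and 500 ms values mentioned above, and the sketch returns a boolean where the real daemon would reboot the node.

```python
import time

TIMEOUT_S = 1.0  # sleep interval, from "-t 1000 ms" in the text
MARGIN_S = 0.5   # tolerated lateness, from "-m 500 ms" in the text


def woke_too_late(expected_wake, actual_wake, margin=MARGIN_S):
    """True if the wake-up overshot the expected time by more than margin."""
    return actual_wake - expected_wake > margin


def check_once(timeout=TIMEOUT_S, margin=MARGIN_S,
               sleep=time.sleep, clock=time.monotonic):
    """Sleep for timeout seconds, then report whether we woke too late.

    sleep and clock are injectable so the logic can be exercised
    without actually waiting.
    """
    expected_wake = clock() + timeout
    sleep(timeout)
    return woke_too_late(expected_wake, clock(), margin)


# Pure checks on hypothetical timestamps (seconds):
print(woke_too_late(expected_wake=10.0, actual_wake=10.2))  # False
print(woke_too_late(expected_wake=10.0, actual_wake=10.8))  # True
```

A monotonic clock is used so that wall-clock adjustments cannot be mistaken for a scheduling delay.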
The default values for OPROCD can be overly sensitive to scheduling latencies and may cause a FALSE reboot, more so in pre-11.2 releases, because its code does not function in tandem with the CSS eviction code (e.g. NM polling threads).
A FALSE reboot is a reboot that takes place when no formal CSS eviction was in progress. CSS expiring misscount/disktimeout and rebooting the node is not considered a ‘false reboot’.
Because the reboot is so fast, the CRS log messages might not actually get flushed to disk. However, with newer CRS releases and on some platforms (not AIX), a kernel crash dump/panic/TOC is now performed on reboot so that OS support can investigate what the system looked like when the node was crashed.
IMPORTANT: OCLSOMON and OPROCD do not exist in 11gR2. The CSSD Monitor (ora.cssdmonitor) takes over the functionality of oclsomon and oprocd.
CPU or memory starvation caused by non-clusterware services may also lead to node eviction. Sometimes a hardware freeze will also cause a node eviction.