
RAC NODE EVICTION TROUBLESHOOTING

1) Which steps should you take to troubleshoot a node eviction?
2) What are the most common causes of node evictions?
3) What are the best practices to avoid node evictions?

STEPS:
1. Look at the cssd.log files on both nodes; usually you will get more information from the
second node if the first node was evicted. Also take a look at the crsd.log file.
 All Clusterware log files are stored under the $ORA_CRS_HOME/log/<hostname> directory.

2. The evicted node will usually have a core dump file generated, along with system reboot information.
3. Find out whether there was a node reboot and whether it was initiated by CRS or by something
else; check the system reboot time (for example with uptime or last reboot).
4. If you see “Polling” messages with decreasing percentage values in the cssd.log file, the
eviction is probably due to the network (missed network heartbeats). If you see “Diskpingout”
or other DISK-related messages, the eviction is due to a voting disk timeout. A minimal
log-scanning sketch follows this step.
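
A minimal sketch of the log scan described in steps 1-4, in Python. The pre-11.2 log path under
$ORA_CRS_HOME/log/<hostname>/cssd and the exact keywords are assumptions based on this note, so
adjust both for your Clusterware version:

    import os, re, sys

    # Default to the pre-11.2 ocssd.log location; pass a path to override it.
    log = sys.argv[1] if len(sys.argv) > 1 else \
        os.path.expandvars("$ORA_CRS_HOME/log/%s/cssd/ocssd.log" % os.uname()[1])

    net_hits, disk_hits = [], []
    with open(log, errors="ignore") as f:
        for line in f:
            if "Polling" in line:                     # network heartbeat warnings
                net_hits.append(line.rstrip())
            if re.search(r"Diskpingout|DISK", line):  # voting disk timeout hints
                disk_hits.append(line.rstrip())

    print("network-related lines:", len(net_hits))
    print("disk-related lines   :", len(disk_hits))
    for hit in (net_hits + disk_hits)[-10:]:          # show the most recent hits
        print(hit)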
5. After identifying whether it is a network or a disk issue, start digging deeper into that area.
6. Now collect NMON/OSWatcher/RDA reports to confirm and justify whether it was a disk issue or
a network issue.
7. If the reports show significant memory contention or paging, collect an AWR report to see
what load/SQL was running during that period.
8. If the network was the issue, check whether any NIC was down or whether link switching
happened, and verify that the private interconnect is working between both nodes (see the sketch
below).
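
A minimal sketch for step 8 on Linux: check the operational state of the private-interconnect
NIC and ping the peer node over it. The interface name (eth1) and the peer interconnect IP are
assumptions; substitute your own values:

    import subprocess

    NIC = "eth1"            # assumed private interconnect interface
    PEER = "192.168.10.2"   # assumed interconnect IP of the other node

    with open("/sys/class/net/%s/operstate" % NIC) as f:
        print("%s operstate: %s" % (NIC, f.read().strip()))

    # -I binds the ping to the interconnect NIC so routing cannot mask a failure
    rc = subprocess.call(["ping", "-c", "3", "-I", NIC, PEER])
    print("interconnect ping", "OK" if rc == 0 else "FAILED")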

9. Sometimes an eviction can also be caused by an OS problem: the system hanging (being in a
halted state) for a while, memory over-commitment, or the CPU being 100% used.

10. Check the OS/system log files to get more information (see the sketch below).
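
A minimal sketch for steps 9-10: pull the OS log lines recorded around the eviction time. The
log path (/var/log/messages) and the timestamp prefix are assumptions; adjust them for your
platform and incident window:

    import sys

    syslog = "/var/log/messages"
    # Hypothetical window: pass a syslog timestamp prefix such as "Nov  4 02:1"
    stamp = sys.argv[1] if len(sys.argv) > 1 else "Nov  4 02:1"

    with open(syslog, errors="ignore") as f:
        for line in f:
            if line.startswith(stamp):
                print(line.rstrip())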

11. What got changed recently? Ask a coworker to open a ticket with Oracle Support and upload
the logs.

12. Check the health of the Clusterware, the database instances and the ASM instances, the
uptime of all hosts, and all the logs for that particular timestamp: ASM logs, Grid logs, CRS
and ocssd.log, HAS logs, EVM logs, database instance logs, OS logs, and SAN logs. A minimal
health-check sketch follows this step.
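
A minimal health-check sketch for step 12, assuming an 11gR2-or-later stack with crsctl/srvctl
on the PATH; the database name ORCL is a placeholder:

    import subprocess

    for cmd in (
        ["crsctl", "check", "crs"],                       # local Clusterware stack
        ["crsctl", "stat", "res", "-t"],                  # all cluster resources
        ["srvctl", "status", "database", "-d", "ORCL"],   # RAC instance status
        ["uptime"],                                       # host uptime
    ):
        print("==>", " ".join(cmd))
        subprocess.call(cmd)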
13. Check the health of the interconnect if the error logs point you in that direction.

14. Check OS memory and CPU usage if the error logs point you in that direction.

15. Check the storage/SAN error logs if the other logs point you in that direction.

16. Run TFA and OSWatcher, and capture netstat/ifconfig output and related settings, based on
the error messages found during your RCA (see the sketch below).
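
A minimal sketch for step 16 that snapshots a few OS network views into a file while you work
the RCA; TFA/OSWatcher collections are driven from their own installs, so only netstat/ifconfig
are captured here, and their availability on the node is assumed:

    import subprocess, datetime

    out = "net_snapshot_%s.txt" % datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    with open(out, "w") as f:
        for cmd in (["netstat", "-s"], ["netstat", "-i"], ["ifconfig", "-a"]):
            f.write("==> %s\n" % " ".join(cmd))
            f.write(subprocess.run(cmd, capture_output=True, text=True).stdout)
            f.write("\n")
    print("wrote", out)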

17. Node eviction can also happen because iptables has been enabled. In one case, after iptables
was turned off, everything went back to normal. The general advice is to avoid enabling
firewalls between the nodes, and that appears to be true. An ACL can open the required ports on
the interconnect, as we did, but we still experienced all kinds of issues (unable to start CRS,
unable to stop CRS, and node evictions). We also had a problem with the voting disk caused by
presenting LDEVs using business copies / ShadowImage, which made RAC less than happy.
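
A minimal sketch for step 17 to flag an active firewall on the node; iptables -S needs root, and
the interconnect subnet prefix below is an assumption:

    import subprocess

    INTERCONNECT_SUBNET = "192.168.10."   # assumed private network prefix

    res = subprocess.run(["iptables", "-S"], capture_output=True, text=True)
    rules = [r for r in res.stdout.splitlines() if not r.startswith("-P")]
    print("non-policy iptables rules:", len(rules))
    for rule in rules:
        if INTERCONNECT_SUBNET in rule:
            print("rule touching the interconnect:", rule)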

18. Verify user equivalence (passwordless SSH) between the cluster nodes; a minimal check is
sketched below.
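
A minimal sketch for step 18: confirm passwordless SSH from this node to its peer (cluvfy's
user-equivalence check is the Oracle-supplied alternative). The peer hostname is a placeholder:

    import subprocess

    PEER = "racnode2"   # hypothetical peer hostname

    rc = subprocess.call(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", PEER, "hostname"]
    )
    print("user equivalence to", PEER, "OK" if rc == 0 else "FAILED")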


19. Verify that the switch is used only for the interconnect. DO NOT use the same switch for
other network traffic.
20. Verify that all nodes have 100% the same configuration; sometimes there are network or
configuration differences that are not obvious (a comparison sketch follows this step). Look for
hangs in the logs and in monitoring tools like Nagios to see whether any node ran out of RAM or
became unresponsive.
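
A minimal comparison sketch for step 20: run a few configuration probes on each node over SSH
and report differences. The node names and the probe list are assumptions (Linux commands
shown); extend the list with packages, sysctl settings, NIC parameters, and so on:

    import subprocess

    NODES = ["racnode1", "racnode2"]   # hypothetical node names
    CHECKS = ["uname -r", "cat /etc/redhat-release", "sysctl -n kernel.shmmax"]

    for check in CHECKS:
        outputs = {}
        for node in NODES:
            res = subprocess.run(["ssh", node, check], capture_output=True, text=True)
            outputs[node] = res.stdout.strip()
        tag = "SAME" if len(set(outputs.values())) == 1 else "DIFFERENT"
        print("[%s] %s -> %s" % (tag, check, outputs))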
21. A major reason for node evictions in our cluster, however, was the patch levels not being
equal across the two nodes. Nodes sometimes died completely, without any error whatsoever. It
turned out to be a bug in the installer of the 11.1.0.7.1 PSU, which patched only the local
node. After we patched both nodes to 11.1.0.7.4, these problems did not reoccur (IBM AIX,
RAC 11gR1).
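
A minimal sketch for step 21: compare the applied interim patches (opatch lsinventory) across
nodes. The node names, the ORACLE_HOME path, and the line format being parsed are assumptions;
opatch output varies between versions:

    import subprocess

    NODES = ["racnode1", "racnode2"]               # hypothetical node names
    OH = "/u01/app/oracle/product/11.1.0/db_1"     # hypothetical ORACLE_HOME

    def patches(node):
        cmd = "%s/OPatch/opatch lsinventory -oh %s" % (OH, OH)
        out = subprocess.run(["ssh", node, cmd], capture_output=True, text=True).stdout
        # Typical lines look like "Patch  1234567 : applied on ..."
        return sorted(
            line.split()[1] for line in out.splitlines()
            if line.startswith("Patch ") and ": applied on" in line
        )

    base = patches(NODES[0])
    for node in NODES[1:]:
        print(node, "matches" if patches(node) == base else "DIFFERS from", NODES[0])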
