Log Files for Troubleshooting Oracle RAC issues

The cluster has a number of log files that can be examined to gain insight into problems as they occur. A good place to start diagnosing cluster problems is $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log. All clusterware log files are stored under the $ORA_CRS_HOME/log/ directory.

1. alert<nodename>.log : Important clusterware alerts are stored in this log file. It is stored in $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log

2. crsd.log : CRS logs are stored in the $ORA_CRS_HOME/log/<hostname>/crsd/ directory. The crsd.log file is archived every 10MB as crsd.101, crsd.102 ...

3. cssd.log : CSS logs are stored in the $ORA_CRS_HOME/log/<hostname>/cssd/ directory. The cssd.log file is archived every 20MB as cssd.101, cssd.102 ...

4. evmd.log : EVM logs are stored in the $ORA_CRS_HOME/log/<hostname>/evmd/ directory.

5. OCR logs : OCR (ocrdump, ocrconfig, ocrcheck) log files are stored in the $ORA_CRS_HOME/log/<hostname>/client/ directory.

6. SRVCTL logs : srvctl logs are stored in two locations, $ORA_CRS_HOME/log/<hostname>/client/ and $ORACLE_HOME/log/<hostname>/client/.

7. RACG logs : The high availability trace files are stored in two locations, $ORA_CRS_HOME/log/<hostname>/racg/ and $ORACLE_HOME/log/<hostname>/racg/. RACG contains log files for node applications such as VIP, ONS etc.
ONS log filename = ora.<hostname>.ons.log
VIP log filename = ora.<hostname>.vip.log
Each RACG executable has a sub-directory assigned exclusively for that executable:
racgeut : $ORA_CRS_HOME/log/<hostname>/racg/racgeut/
racgevtf : $ORA_CRS_HOME/log/<hostname>/racg/racgevtf/
racgmain : $ORA_CRS_HOME/log/<hostname>/racg/racgmain/
racgeut : $ORACLE_HOME/log/<hostname>/racg/racgeut/
racgmain : $ORACLE_HOME/log/<hostname>/racg/racgmain/
racgmdb : $ORACLE_HOME/log/<hostname>/racg/racgmdb/
racgimon : $ORACLE_HOME/log/<hostname>/racg/racgimon/

As in a normal single-instance Oracle environment, a RAC environment also contains the standard RDBMS log and trace files. Their locations are set by the following parameters:
background_dump_dest contains the alert log and background process trace files.
user_dump_dest contains any trace file generated by a user process.
core_dump_dest contains core files that are generated due to a core dump in a user process.
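To see where these destinations point on a given instance, you can query the parameters from SQL*Plus; a minimal check (the testdb paths shown are only illustrative defaults and will differ per environment):

SQL> show parameter dump_dest

NAME                        TYPE        VALUE
--------------------------- ----------- ------------------------------
background_dump_dest        string      /oracle/admin/testdb/bdump
core_dump_dest              string      /oracle/admin/testdb/cdump
user_dump_dest              string      /oracle/admin/testdb/udump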

RAC - Issues & Troubleshooting

Whenever a node has issues joining the cluster back after a reboot, here is a quick checklist I would suggest:

/var/log/messages
ifconfig
ip route
/etc/hosts
/etc/sysconfig/network-scripts/ifcfg-eth*
ethtool
mii-tool
cluvfy
$ORA_CRS_HOME/log
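For example, a quick first pass on the rebooted node might look like the following sketch (the interface name eth1 and the node list are illustrative for this environment):

# tail -200 /var/log/messages
# ifconfig -a
# ip route
# cat /etc/hosts
# ethtool eth1
# mii-tool eth1
$ cluvfy comp nodecon -n all -verbose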

Let us now take a closer look at specific issues, with examples and the steps taken for their resolution. These were all tested against an Oracle 10.2.0.4 database on RHEL4 U8 x86-64.

1. srvctl not able to start the Oracle instance, but sqlplus is able to start it

a. Check the racg log for the actual error message:
% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log

b. Check if srvctl is configured to use the correct parameter file (pfile/spfile):
% srvctl config database -d {DBNAME} -a
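If the registered parameter file turns out to be wrong, it can be corrected with srvctl modify and then re-checked; a sketch (the spfile path is illustrative):

% srvctl modify database -d {DBNAME} -p /oradata/{DBNAME}/spfile{DBNAME}.ora
% srvctl config database -d {DBNAME} -a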

c. Check the ownership of $ORACLE_HOME/log. If it is owned by root, srvctl won't be able to start the instance as the oracle user:
# chown -R oracle:dba $ORACLE_HOME/log
You can also validate the parameter file by using sqlplus to see the exact error message.

2. VIP has failed over to another node but is not coming back to the original node
Fix: On the node where the VIP has failed over, bring it down manually as root.
Example: ifconfig eth0:2 down
PS: Be careful to bring down only the VIP. A small typo may bring down your public interface :)
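Once the stray VIP alias is down on the failover node, the VIP can usually be brought back on its home node with srvctl; a minimal sketch, assuming the original node is test-server1 (illustrative name):

% srvctl start nodeapps -n test-server1
% srvctl status nodeapps -n test-server1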

3. Moving OCR to a different location
PS: This can be done while CRS is up, as root.
While trying to change the OCR mirror or the OCR itself to a new location, ocrconfig complains. Find below a sample error and its fix; the fix is to touch the new file first.
Example:
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
PROT-21: Invalid parameter
# touch /crs_new/cludata/ocrfile
# chown root:dba /crs_new/cludata/ocrfile
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
Verify:
a. Validate using "ocrcheck". Device/File Name should point to the new location, with the integrity check succeeded.
b. Ensure the OCR inventory is updated correctly:
# cat /etc/oracle/ocr.loc
ocrconfig_loc and ocrmirrorconfig_loc should point to the correct locations.

4. Moving Voting Disk to a different location
PS: CRS must be down while moving the voting disk. The idea is to add new voting disks and delete the older ones. However, if CRS is up, the addition fails:
# crsctl add css votedisk /crs_new/cludata/cssfile_new
Cluster is not in a ready state for online disk addition
We need to use the force option. Ensure CRS is down before using the force option; DO NOT use the force option while CRS is up, else it may corrupt your OCR.
# crsctl add css votedisk /crs_new/cludata/cssfile_new -force
Now formatting voting disk: /crs_new/cludata/cssfile_new
successful addition of votedisk /crs_new/cludata/cssfile_new.
Verify using "crsctl query css votedisk" and then delete the old voting disks. The same caution applies while deleting: ensure CRS is down before using the force option.
Also verify the permissions of the voting disk files; they should be oracle:dba. If the voting disks were added as root, the permissions should be changed to oracle:dba.
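A sketch of the verification and cleanup that usually follows, assuming the old voting disk path is /crs/cludata/cssfile_old (illustrative) and that CRS is down on all nodes, as with the add above:

# crsctl query css votedisk
# crsctl delete css votedisk /crs/cludata/cssfile_old -force
# chown oracle:dba /crs_new/cludata/cssfile_new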

5. Manually registering a listener resource with the OCR
The listener was registered manually with the OCR, but srvctl was unable to bring up the listener. Let us first see an example of how to do this manually.
From an existing, available node, print the listener resource:
% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res
% cat /tmp/res
NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr
TYPE=application
ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=test-server2
OPTIONAL_RESOURCES=
PLACEMENT=restricted
REQUIRED_RESOURCES=ora.test-server2.vip

RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=0
STOP_TIMEOUT=0
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=

Modify the relevant parameters in the resource file to point to the correct node and listener, and rename it as resourcename.cap:
% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap
Register it with the OCR:
% crs_register ora.test-server1.LISTENER_TEST-SERVER1.lsnr -dir /tmp/
Start the listener:
% srvctl start listener -n test-server1
While trying to start the listener, srvctl may throw errors like "Unable to read from listener log file" even though the listener log file exists. This happens if the resource was registered using root: srvctl then won't be able to start it as the oracle user. So all the aforementioned operations for registering the listener manually should be done as the oracle user.

6. Services
While checking the status of a service, it says "not running". If we try to start it using srvctl, the error message is "No such service exists" or "already running". If we try to add a service with the same name, it says "already exists". This happens because the service is in an "Unknown" state in the OCR. Using crs_stat, check if any related resource for the service (resource names ending with .srv and .cs) is still lying around. srvctl remove service -f has been tried and the issue persists. Here is the fix:
# crs_stop -f {resourcename}
# crs_unregister {resourcename}
Now the service can be added and started correctly.
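Before re-adding the service, a quick check to confirm nothing is left behind (the testdb name and grep pattern are only illustrative):

% crs_stat | grep testdb | grep -E '\.srv|\.cs'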

7. CRS is not starting and there are no CRS logs in $ORA_CRS_HOME
After a host reboot, CRS was not coming up. Check /var/log/messages:
"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"
No logs were seen in /tmp/crsctl.*
Run cluvfy to identify the issue:
$ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}
In this case /tmp was not writable; /etc/fstab was incorrect and was fixed to make /tmp available.

8. CRS is not starting after the CRS binaries were restored by copying from an existing node in the cluster
Post host reboot, CRS is not starting, with the following messages in /var/log/messages:
"Id "h1" respawning too fast: disabled for 5 minutes"
The CRSD log shows "no listener".
If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, then check whether /etc/oracle/ocr.loc is the same across all nodes of the cluster.
If the CRS binaries were restored by copying from an existing node in the cluster, then you need to ensure:
a. Hostnames are modified correctly in $ORA_CRS_HOME/log
b. You may need to clean up the socket files from /var/tmp/.oracle
PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files, otherwise a reboot may be inevitable.
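A minimal sketch of the socket cleanup, and only with CRS completely down on that node (the file list will differ per system):

# ls -l /var/tmp/.oracle
# rm -f /var/tmp/.oracle/*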

9. CRS rebooting frequently, initiated by oprocd
Check /etc/oracle/oprocd/ and grep for "Rebooting". Check /var/log/messages and grep for "restart". If the timestamps match, this confirms the reboots are being initiated by the oprocd process.
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f
-t 1000 means oprocd would wake up every 1000ms
-m 500 means allow up to a 500ms margin of error
Basically, with these options, if oprocd wakes up after more than 1.5 seconds it is going to force a reboot. This is conceptually analogous to what the hangcheck timer used to do in pre-10.2.0.4 Oracle releases on Linux.
The fix is to set CSS diagwait to 13:
# crsctl set css diagwait 13 -force
# /oracle/product/crs/bin/crsctl get css diagwait
13
This actually changes the parameters oprocd runs with:
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
Note that the margin has now changed to 10000ms, i.e. 10 seconds, in place of the default 0.5 seconds.
PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.
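A sketch of the complete change, assuming a full clusterware outage is possible (run the stop and start on every node as root; the diagwait setting itself is issued from one node):

# crsctl stop crs
# crsctl set css diagwait 13 -force
# crsctl start crs
# crsctl get css diagwait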

10. Cluster hung; all SQL queries on GV$ views are hanging
The alert logs from all instances have messages like the following:
INST1: IPC Send timeout detected. Receiver ospid 1650
INST2: IPC Send timeout detected. Sender: ospid 24692 Receiver: inst 1 binc 150 ospid 1650
INST3: IPC Send timeout detected. Sender: ospid 12955 Receiver: inst 1 binc 150 ospid 1650
The ospids on all instances belong to LCK0, the lock process. In case of inter-instance lock issues, it is important to identify the instance from which the problem is initiating. As seen above, INST1 is the one that needs to be fixed. Identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.
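A minimal way to find the row cache lock waiters and the session to deal with from a surviving connection; the kill statement is only a sketch, so confirm the sid, serial# and instance before running it:

SQL> select inst_id, sid, serial#, event
     from gv$session
     where event = 'row cache lock';
SQL> alter system kill session '<sid>,<serial#>';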

11. Inconsistent OCR with invalid permissions
% srvctl add db -d testdb -o /oracle/product/10.2
PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5 : User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
crs_stat doesn't have any trace of the database, so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.
ocrdump shows:
[DATABASE.LOG.testdb] UNDEF : SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

[DATABASE.LOG.testdb.INSTANCE] UNDEF : SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
These keys are owned by root, and that is the problem. It means the resource was probably added into the OCR using root. Although it has since been removed by root, it cannot now be added by the oracle user unless we get rid of the aforementioned keys.
Shut down the entire cluster and either restore from a previous good backup of the OCR using:
ocrconfig -restore backupfilename
You can get the list of backups using: ocrconfig -showbackup
If you are not sure of the last good backup, you can also do the following:
Take an export backup of the OCR using: ocrconfig -export /tmp/export -s online
Edit /tmp/export and remove the 2 lines pointing to DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE owned by root.
Import it back: ocrconfig -import /tmp/export
After starting the cluster, verify using ocrdump; the OCR dump file should not have any trace of those leftover log keys owned by root.

Troubleshooting Oracle Clusters and Oracle RAC

Oracle Clusterware has its moments. There are times when it does not want to start for various reasons. In this section we talk about diagnosing the health of the cluster, collecting diagnostic information, and trying to correct problems with Oracle Clusterware.

Checking the Health of the Cluster

Once you have configured a new cluster, you will probably want to check the health of that cluster. Additionally, you might want to check the health of the cluster after you have added or removed a node, or if there is something about the cluster that is causing you to suspect it's suffering from some problem. Follow these steps to give your cluster a full health check:
1. Stop the cluster (crsctl stop cluster).
2. Start the cluster (crsctl start cluster), monitoring the startup messages to see if any errors occur.
3. Check the cluster logs for any error messages that might have been logged.
If these steps do not indicate a problem and you still feel there is a problem with the cluster, you can do the following:
1. Use the crsctl check has command to check that OHASD is running on the local node and that it's healthy:
[oracle@rac1 admin]$ crsctl check has
CRS-4638: Oracle High Availability Services is online
2. Use the crsctl check crs command to check the OHASD, CRSD, ocssd and EVM daemons:
[oracle@rac1 admin]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
3. Use the crsctl check cluster -all command to check all daemons on all nodes of the cluster:
[oracle@rac1 admin]$ crsctl check cluster -all
********************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
4. If errors do occur or you still feel there is a problem, check the cluster logs for error messages.

Collecting Diagnostic Information and Trouble Resolution

Because Oracle Clusterware 11g Release 2 consists of a number of different components, it follows that there are a number of different log files associated with these processes. In this section we will first document the log files associated with the various Clusterware processes. We will then discuss a method of collecting the data in these logs into a single source that you can reference when doing trouble diagnosis.

Log Files

Oracle Clusterware 11g Release 2 generates a number of different log files that can be used to troubleshoot Clusterware problems. Oracle Clusterware 11g adds a new environment variable called GRID_HOME to reference the base of the Oracle Clusterware software home. The Clusterware log files are typically stored under the GRID_HOME directory in a sub-directory called log. Under that directory is another directory with the host name, and then a directory that indicates the Clusterware component the specific logs are associated with. For example, GRID_HOME/log/myrac1/crsd stores the log files associated with CRSD for the host myrac1. The following table lists the log file directories and the contents of those directories:

Directory Path : Contents
GRID_HOME/log/<host>/alert<host>.log : Clusterware alert log
GRID_HOME/log/<host>/diskmon : Disk Monitor Daemon
GRID_HOME/log/<host>/client : OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL
GRID_HOME/log/<host>/ctssd : Cluster Time Synchronization Service
GRID_HOME/log/<host>/gipcd : Grid Interprocess Communication Daemon
GRID_HOME/log/<host>/ohasd : Oracle High Availability Services Daemon
GRID_HOME/log/<host>/crsd : Cluster Ready Services Daemon
GRID_HOME/log/<host>/gpnpd : Grid Plug and Play Daemon
GRID_HOME/log/<host>/mdnsd : Multicast Domain Name Service Daemon
GRID_HOME/log/<host>/evmd : Event Manager Daemon
GRID_HOME/log/<host>/racg/racgmain : RAC RACG
GRID_HOME/log/<host>/racg/racgeut : RAC RACG
GRID_HOME/log/<host>/racg/racgevtf : RAC RACG
GRID_HOME/log/<host>/racg : RAC RACG (only used if a pre-11.1 database is installed)
GRID_HOME/log/<host>/cssd : Cluster Synchronization Service Daemon
GRID_HOME/log/<host>/srvm : Server Manager
GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11 : HA Service Daemon Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root : HA Service Daemon CSS Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdmonitor_root : HA Service Daemon ocssdMonitor Agent
GRID_HOME/log/<host>/agent/ohasd/orarootagent_root : HA Service Daemon Oracle Root Agent

GRID_HOME/log/<host>/agent/crsd/oraagent_oracle11 : CRS Daemon Oracle Agent
GRID_HOME/log/<host>/agent/crsd/orarootagent_root : CRS Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/ora_oc4j_type_oracle11g : CRS Daemon Oracle OC4J Agent
GRID_HOME/log/<host>/gnsd : Grid Naming Service Daemon

The following diagram provides additional detail as to the location of the Oracle Clusterware log files:
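A quick way to see which of these logs a node has written to recently, and to peek at the Clusterware alert log, is sketched below (assuming GRID_HOME is exported in the shell and using hostname -s for the host directory):

$ find $GRID_HOME/log/`hostname -s` -name "*.log" -mmin -60
$ tail -50 $GRID_HOME/log/`hostname -s`/alert`hostname -s`.log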

Oracle Clusterware will rotate its logs over time. This helps to maintain control of space utilization in the GRID_HOME directory, and is known as a rollover of the log. Rollover log files will typically have the same name as the log file but with a version number attached to the end, and a fixed number of them (10 in this case) are maintained. Each log file type has its own rotation time frame. An example of the rolling of log files can be seen in this listing of the GRID_HOME/log/rac1/agent/crsd/oraagent_oracle directory. In the listing, note that there is the current oraagent_oracle log file with an extension of .log, and then there are the additional oraagent_oracle log files with extensions from l01 to l10. These latter log files are the backup log files.

[oracle@rac1 oraagent_oracle]$ pwd
/ora01/app/11.2.0/grid/log/rac1/agent/crsd/oraagent_oracle
[oracle@rac1 oraagent_oracle]$ ls -al
total 109320
drwxr-xr-t 2 oracle oinstall     4096 Jun 10 20:02 .
drwxrwxrwt 5 root   oinstall     4096 Jun  8 10:29 ..
-rw-r--r--. 1 oracle oinstall 10539847 Jun  8 21:09 oraagent_oracle.l01
-rw-r--r--. 1 oracle oinstall 10584126 Jun  9 01:50 oraagent_oracle.l02
-rw-r--r--. 1 oracle oinstall 10584344 Jun  9 05:37 oraagent_oracle.l03
-rw-r--r--. 1 oracle oinstall 10584397 Jun  9 09:26 oraagent_oracle.l04
-rw-r--r--. 1 oracle oinstall 10584515 Jun  9 13:17 oraagent_oracle.l05
-rw-r--r--. 1 oracle oinstall 10583902 Jun  9 18:29 oraagent_oracle.l06
-rw-r--r--. 1 oracle oinstall 10583397 Jun 10 00:51 oraagent_oracle.l07
-rw-r--r--. 1 oracle oinstall 10583346 Jun 10 07:13 oraagent_oracle.l08
-rw-r--r--. 1 oracle oinstall 10583355 Jun 10 13:35 oraagent_oracle.l09
-rw-r--r--. 1 oracle oinstall 10565073 Jun 10 20:02 oraagent_oracle.l10
-rw-r--r--. 1 oracle oinstall  5955542 Jun 10 23:38 oraagent_oracle.log
-rw-r--r--. 1 oracle oinstall        0 Jun  8 10:29 oraagent_oracleOUT.log
-rw-r--r--. 1 oracle oinstall        6 Jun 10 21:20 oraagent_oracle.pid

Collecting Clusterware Diagnostic Data

Oracle provides utilities that make it easier to determine the status of the cluster and to collect the Clusterware log files for problem diagnosis. In this section we will review the diagcollection.pl script, which is used to collect log file information. We will then look at the Cluster Verification Utility (CVU).

Using Diagcollection.pl

Clearly Oracle Clusterware has a number of log files. Often when troubleshooting problems you will want to review several of the log files, which can involve traversing directories and can be tedious at best. Additionally, Oracle Support might well ask that you collect up all the Clusterware log files so they can diagnose the problem that you are having. To make collection of the Clusterware log data easier, Oracle provides a program called diagcollection.pl, which is contained in $GRID_HOME/bin. This script will collect Clusterware log files and other helpful diagnostic information. The diagcollection.pl script comes with the following options:
- --collect : Collect diagnostic information. The script's --collect option is what you invoke to gather the diagnostic data.
- --clean : Cleans the directory of files created by previous runs of diagcollection.pl.
- Options to only collect specific information. These options include --crs, --core or --all; --all is the default setting.
When invoked, the script creates four files in the local directory. These four files are gzipped tarballs and are listed in the following table:

Script Name : Contents
coreData*tar.gz : Contains core files and related analysis files.
crsData*tar.gz : Contains log files from the GRID_HOME/log/<host> directory structure.
ocrData*tar.gz : Contains the results of an execution of ocrdump and ocrcheck. Current OCR backups are also listed.
osData*tar.gz : Contains /var/log/messages and other related files.
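As a usage sketch (typically run as root so that every log is readable; the paths assume GRID_HOME is set), a collection followed by a later cleanup might look like this:

# $GRID_HOME/bin/diagcollection.pl --collect
# $GRID_HOME/bin/diagcollection.pl --clean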

Oracle Cluster Verification Utility (CVU)

The CVU is used to verify that there are no configuration issues with the cluster. CVU diagnoses/verifies specific components. Components are groupings based on functionality; examples of components are space, OCR integrity, integrity of the cluster, clock synchronization and so on. You can use CVU to check one or all components of the cluster. CVU supports Oracle Clusterware versions 10gR1 onwards. In Oracle Clusterware 11g Release 2, CVU is installed as a part of Oracle Clusterware; prior to Oracle Clusterware 11g Release 2 you would need to download it from OTN. CVU is located in the GRID_HOME/bin directory and also in $ORACLE_HOME/bin. You can also run CVU from the Oracle 11g Release 2 install media; in this case, call the program runcluvfy.sh, which calls CVU. CVU can be run in various situations, including:
- During various phases of the install of the initial cluster, to confirm that key components are in place and operational (such as SSH). OUI makes calls to the CVU during the creation of the cluster to ensure that prerequisites were executed.
- After you have completed the initial creation of the cluster.
- After you add or remove a node from the cluster.
- If you suspect there is a problem with the cluster.
In some cases, when problems are detected, CVU can create fixup scripts that are designed to correct the problems that were detected.
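For example, a few component checks that often help when the cluster is suspect; treat these as an illustrative sketch (node lists and options will vary by environment):

$ cluvfy comp ocr -n all -verbose
$ cluvfy comp clocksync -n all -verbose
$ cluvfy stage -post crsinst -n all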

Checking the Oracle Cluster Registry

Node evictions or other problems can be caused by corruption in the OCR. The ocrcheck program provides a way to check the integrity of the OCR; it performs checksum operations on the blocks within the OCR to ensure they are not corrupt. Here is an example of running the ocrcheck program:
[oracle@rac1 admin]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version                  : 3
Total space (kbytes)     : 262120
Used space (kbytes)      : 2580
Available space (kbytes) : 259540
ID                       : 749518627
Device/File Name         : +DATA
Device/File integrity check succeeded
Cluster registry integrity check succeeded
Logical corruption check bypassed due to non-privileged user

Oracle Clusterware Trouble Resolution

When dealing with difficult Clusterware issues that befuddle you, there are some basic first steps to perform. The truth is that Oracle Clusterware is a very complex beast, and for the DBA who does not deal with solving Clusterware problems on a day-in, day-out basis, determining the nature and resolution of a problem can be an overwhelming challenge. These steps are:
1. Check and double-check that your RAC database backups are current. If they are not, and at least one node survives, back up your database. If you have a good backup, backing up the archived redo logs is also a very good idea. In your attempts to solve the problem you can cause additional problems and damage to the cluster; the bottom line is that you have an unstable environment, and there is a lot that can go wrong and a lot of damage that can occur (this is true with a non-clustered database too). Make sure you have protected your data should the whole thing go bottom up.
2. Using the diagcollection.pl script, collect the Clusterware logs.
3. Review the logs for error messages that might give you some insight into the problem at hand.
4. Search My Oracle Support (MOS) for the problem you are experiencing. If you find nothing on MOS, do a Google search for the problem you are experiencing.
5. Open an SR with Oracle Support. Clusterware is complex enough that it will take the support tools Oracle Support has available to diagnose the problem. After opening the SR, if you do not know the solution, it is far better to let Oracle Support work with you on one; that's what you pay them for.
Note that the number one step is the backup of any RAC databases on the cluster. Keep in mind that one possible problem your cluster could be starting to experience is issues with the storage infrastructure; consider this carefully when performing a backup on an unstable cluster. It may be that you will want to try to back up to some other storage medium (NAS, for example) that uses a different hardware path (one that does not use your HBAs, for example) if possible.

Dealing With Node Evictions

A common problem that DBAs have to face with Clusterware is node evictions, which usually lead to a reboot of the evicted node. Node evictions can be hard to diagnose. In this section we address dealing with node evictions: first we ask the question, what can cause a node eviction? We then discuss finding out what actually caused our node eviction.

What Can Cause an Eviction?

There are many possible causes of node evictions, some of which might be obvious and some which might not be. With Oracle Clusterware 11g Release 2 there are two main processes that can cause node evictions:
- Oracle Clusterware Kill Daemon (Oclskd) – Used by CSS to reboot a node when the reboot is requested by one or more other nodes.
- CSSDMONITOR – The OCSSD daemon is monitored by cssdmonitor. If a hang is indicated (say the ocssd daemon is lost) then the node will be rebooted.
Previous to Oracle Clusterware 11g Release 2, the hangcheck-timer module was configured and could also be a cause of nodes rebooting. As of Oracle Clusterware 11g Release 2 this module is no longer needed and should not be enabled.
Perhaps the biggest causes of node evictions are:
1. Interconnect issues – A common cause of node evictions is that the interconnect is not completely isolated from other network traffic. Ensure that the interconnect is completely isolated from all other network traffic. This includes the switches that the interconnect is attached to.
2. Node time coordination – We have found that even though Oracle Clusterware 11g Release 2 does not indicate that NTP is a requirement, Clusterware does seem to be more stable when it's enabled and working correctly. We recommend that you configure NTP for all nodes.
3. Configuration/certification issues – Ensure that the hardware and software you are using is certified by Oracle. This includes the specific version and even patchset number of each component. Make sure that you are installing the correct revision of those components.
4. Patches – Ensure that all patch sets are installed as required. The Oracle documentation and MOS provide a complete list of all required patch sets that must be installed for Clusterware and RAC to work correctly. Take the time to ensure that you have the correct patch sets installed and that you have followed the install directions carefully.
5. OS software components – Ensure that all OS software components have been installed as directed by Oracle. Don't decide not to install a component just because you don't think you are going to need it. If in doubt about any step of the installation, contact Oracle for support.
6. Software bugs.
The biggest piece of advice that can be given to avoid instability within a cluster is to get the setup and configuration of that cluster right the first time.

Finding What Caused the Eviction

Very often with node evictions you will need to engage Oracle Support. However, there are some initial things you can do that might help to solve some basic problems, like node mis-configurations. Some things you might want to do are:
1. Determine the time the node rebooted (using the uptime UNIX command, for example). This will help you determine where in the various logs you will want to look for additional information.
2. Check the following logfiles to begin with:
a. /var/log/messages
b. GRID_HOME/log/<host>/cssd/ocssd.log
c. GRID_HOME/log/<host>/alert<host>.log
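A quick pass over these logs for eviction and reboot markers can narrow things down before you engage support; a sketch, assuming GRID_HOME is set and the host is rac1 (both illustrative):

# grep -i "restart\|reboot" /var/log/messages | tail -20
$ grep -i "evict" $GRID_HOME/log/rac1/cssd/ocssd.log | tail -20
$ tail -50 $GRID_HOME/log/rac1/alertrac1.log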