Log Files for Troubleshooting Oracle RAC issues

The cluster has a number of log files that can be examined to gain insight into problems as they occur. A good place to start diagnosing cluster problems is $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log. All clusterware log files are stored under the $ORA_CRS_HOME/log/ directory.

1. alert<nodename>.log : Important clusterware alerts are stored in this log file. It is stored as $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log.

2. crsd.log : CRS logs are stored in the $ORA_CRS_HOME/log/<hostname>/crsd/ directory. The crsd.log file is archived every 10MB as crsd.101, crsd.102, and so on.

3. cssd.log : CSS logs are stored in the $ORA_CRS_HOME/log/<hostname>/cssd/ directory. The cssd.log file is archived every 20MB as cssd.101, cssd.102, and so on.

4. evmd.log : EVM logs are stored in the $ORA_CRS_HOME/log/<hostname>/evmd/ directory.

5. OCR logs : OCR (ocrdump, ocrconfig, ocrcheck) log files are stored in the $ORA_CRS_HOME/log/<hostname>/client/ directory.

6. SRVCTL logs : srvctl logs are stored in two locations, $ORA_CRS_HOME/log/<hostname>/client/ and $ORACLE_HOME/log/<hostname>/client/.

7. RACG logs : The high availability trace files are stored in two locations, $ORA_CRS_HOME/log/<hostname>/racg/ and $ORACLE_HOME/log/<hostname>/racg/. RACG contains log files for node applications such as the VIP, ONS, etc.
ONS log filename = ora.<hostname>.ons.log
VIP log filename = ora.<hostname>.vip.log
Each RACG executable has a subdirectory assigned exclusively to that executable:
racgeut : $ORA_CRS_HOME/log/<hostname>/racg/racgeut/
racgevtf : $ORA_CRS_HOME/log/<hostname>/racg/racgevtf/
racgmain : $ORA_CRS_HOME/log/<hostname>/racg/racgmain/
racgeut : $ORACLE_HOME/log/<hostname>/racg/racgeut/
racgmain : $ORACLE_HOME/log/<hostname>/racg/racgmain/
racgmdb : $ORACLE_HOME/log/<hostname>/racg/racgmdb/
racgimon : $ORACLE_HOME/log/<hostname>/racg/racgimon/

As in a normal Oracle single-instance environment, a RAC environment contains the standard RDBMS log files. These files are located by the following parameters:
background_dump_dest contains the alert log and background process trace files.
user_dump_dest contains any trace file generated by a user process.

core_dump_dest contains core files that are generated due to a core dump in a user process.
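If you are unsure where those destinations point on a given instance, here is a minimal sketch to confirm them and to take a first look at the clusterware alert log (it assumes the environment is set for the local instance):

% sqlplus -S / as sysdba <<EOF
show parameter background_dump_dest
show parameter user_dump_dest
show parameter core_dump_dest
exit
EOF
% tail -50 $ORA_CRS_HOME/log/`hostname -s`/alert`hostname -s`.log

The tail of the clusterware alert log is usually the fastest way to see which daemon is complaining before drilling into the per-daemon directories listed above.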

RAC - Issues & Troubleshooting

Whenever a node is having issues rejoining the cluster after a reboot, here is a quick checklist I would suggest (example commands follow the list):

/var/log/messages
ifconfig
ip route
/etc/hosts
/etc/sysconfig/network-scripts/ifcfg-eth*
ethtool
mii-tool
cluvfy
$ORA_CRS_HOME/log
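As a rough sketch of walking that checklist on the problem node (the interface name eth1 and the node name are placeholders; adjust them to your public/interconnect interfaces and host):

# tail -200 /var/log/messages
# /sbin/ifconfig -a
# ip route
# cat /etc/hosts
# cat /etc/sysconfig/network-scripts/ifcfg-eth1
# ethtool eth1
# mii-tool eth1
% $ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n <nodename>
% ls -lrt $ORA_CRS_HOME/log/`hostname -s`/

The point is simply to confirm that the interfaces, routes and name resolution look exactly as they did before the reboot, and to see which clusterware logs were written most recently, before blaming the clusterware itself.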

Let us now take a closer look at specific issues, with examples and the steps taken for their resolution. These were all tested with Oracle Database on RHEL4 U8 x86-64.

1. srvctl is not able to start an Oracle instance, but sqlplus is able to start it

a. Check the racg log for the actual error message:
% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log

b. Check whether srvctl is configured to use the correct parameter file (pfile/spfile):
% srvctl config database -d {DBNAME} -a

c. Check the ownership of $ORACLE_HOME/log. If it is owned by root, srvctl won't be able to start the instance as the oracle user. Fix:
# chown -R oracle:dba $ORACLE_HOME/log
You can also validate the parameter file by using sqlplus to see the exact error message.

2. VIP has failed over to another node but is not coming back to the original node
Fix: On the node where the VIP has failed over, bring it down manually as root.
Example: # ifconfig eth0:2 down
PS: Be careful to bring down only the VIP alias; a small typo may bring down your public interface :)

3. Moving OCR to a different location
PS: This can be done while CRS is up, as root.
While trying to change the OCR or the OCR mirror to a new location, ocrconfig complains. The fix is to touch the new file first.
Example:
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
PROT-21: Invalid parameter
# touch /crs_new/cludata/ocrfile
# chown root:dba /crs_new/cludata/ocrfile
# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile
Verify:
a. Validate using "ocrcheck". Device/File Name should point to the new location, with the integrity check succeeded.
b. Ensure the OCR inventory is updated correctly:
# cat /etc/oracle/ocr.loc
ocrconfig_loc and ocrmirrorconfig_loc should point to the correct locations.
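The same pattern works when relocating the primary OCR rather than the mirror. A sketch, using /crs_new/cludata/ocrfile_primary as a hypothetical target:

# touch /crs_new/cludata/ocrfile_primary
# chown root:dba /crs_new/cludata/ocrfile_primary
# ocrconfig -replace ocr /crs_new/cludata/ocrfile_primary
# ocrcheck

ocrcheck should again report the new Device/File Name with the integrity check succeeded, and the /etc/oracle/ocr.loc check should be repeated on every node of the cluster.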

4. Moving Voting Disk to a different location
PS: CRS must be down while moving the voting disk. DO NOT use the force option while CRS is up, else it may corrupt your OCR; ensure CRS is down before using the force option.
The idea is to add new voting disks and delete the older ones. Find below sample errors and their fix.
# crsctl add css votedisk /crs_new/cludata/cssfile_new
Cluster is not in a ready state for online disk addition
We need to use the force option:
# crsctl add css votedisk /crs_new/cludata/cssfile_new -force
Now formatting voting disk: /crs_new/cludata/cssfile_new
successful addition of votedisk /crs_new/cludata/cssfile_new.
Verify using "crsctl query css votedisk" and then delete the old voting disks. While deleting too, ensure CRS is down.
Also verify the permissions of the voting disk files. They should be oracle:dba. If the voting disks were added as root, the permissions should be changed to oracle:dba.
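A sketch of the verify-then-delete sequence; the old voting disk path below is hypothetical, and the force option is again only safe because CRS is down:

# crsctl query css votedisk
# crsctl delete css votedisk /crs_old/cludata/cssfile -force
# crsctl query css votedisk
# chown oracle:dba /crs_new/cludata/cssfile_new

The second query should list only the new voting disk(s), and the chown covers the case where the addition was performed as root.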

5. Manually registering the listener resource with OCR
A listener was registered manually with OCR, but srvctl was unable to bring up the listener. Let us first see an example of how to do this manually.
From an existing, available node, print the listener resource:
% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res
% cat /tmp/res
NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr
TYPE=application
ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=600
DESCRIPTION=CRS application for listener on node
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=test-server2
OPTIONAL_RESOURCES=
PLACEMENT=restricted

Modify the relevant parameters in the resource file to point to the correct node and listener (test-server1, LISTENER_TEST-SERVER1). Rename the file as resourcename.cap:
% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap
Register it with OCR:
% crs_register ora.test-server1.LISTENER_TEST-SERVER1.lsnr -dir /tmp/
Start the listener:
% srvctl start listener -d testdb -n test-server1
While trying to start the listener, srvctl throws errors like "Unable to read from listener log file", even though the listener log file exists. If the resource is registered using root, srvctl won't be able to start it as the oracle user. So all the aforementioned operations while registering the listener manually should be done as the oracle user.

6. Services
While checking the status of a service, it says "not running". If we try to start it using srvctl, the error message is "No such service exists" or "already running". If we try to add a service with the same name, it says "already exists". This happens because the service is in an "Unknown" state in the OCR. Using crs_stat, check if any related resource for the service (resource names ending with .srv and .cs) is still lying around. srvctl remove service -f has been tried and the issue persists. Here is the fix:
# crs_stop -f {resourcename}
# crs_unregister {resourcename}
Now the service can be added and started correctly.
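One way to spot such leftover service resources before cleaning them up (the database and service names in the output are hypothetical):

% $ORA_CRS_HOME/bin/crs_stat | grep "^NAME=" | grep -E "\.cs$|\.srv$"
NAME=ora.testdb.myservice.cs
NAME=ora.testdb.myservice.testdb1.srv

Each name found this way can then be passed to crs_stop -f and crs_unregister as shown above.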

7. CRS is not starting after a host reboot
Post host reboot, CRS was not coming up.
* Run cluvfy to identify the issue:
$ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}
In this case /tmp was not writable; /etc/fstab was incorrect and was fixed to make /tmp available again.
* If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, check whether /etc/oracle/ocr.loc is the same across all nodes of the cluster.

8. No CRS logs in $ORA_CRS_HOME
CRS is not starting, with the following messages in /var/log/messages:
"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"
"Id "h1" respawning too fast: disabled for 5 minutes"
No logs are seen in /tmp/crsctl.*, and the CRSD log shows "no listener".
If the CRS binaries were restored by copying them from an existing node in the cluster, then you need to ensure that:
a. Hostnames are modified correctly in $ORA_CRS_HOME/log.
b. Socket files are cleaned up from /var/tmp/.oracle.
PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files, otherwise a reboot may be inevitable.
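A minimal sketch of those two checks on the node with the copied CRS home:

% ls $ORA_CRS_HOME/log/
The listing should contain a directory named after this host, not after the node the binaries were copied from.
# crsctl check crs
Only once this confirms that CSS/CRS/EVM are completely down should the stale socket files under /var/tmp/.oracle be removed.
# ls -l /var/tmp/.oracle/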

9. CRS rebooting frequently by oprocd
Check /etc/oracle/oprocd/ and grep for "Rebooting". Check /var/log/messages and grep for "restart". If the timestamps match, this confirms the reboots are being initiated by the oprocd process.
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f
-t 1000 means oprocd will wake up every 1000 ms.
-m 500 means allow up to a 500 ms margin of error.
Basically, with these options, if oprocd wakes up after more than 1.5 seconds it is going to force a reboot. This is conceptually analogous to what the hangcheck timer used to do in pre-10.2.0.4 Oracle releases on Linux.
The fix is to set CSS diagwait to 13:
# crsctl set css diagwait 13 -force
# /oracle/product/crs/bin/crsctl get css diagwait
13
PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.
This actually changes the parameters oprocd runs with:
% ps -ef | grep oprocd
root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
Note that the margin has now changed to 10000 ms, i.e. 10 seconds in place of the default 0.5 seconds.
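A sketch of the timestamp comparison described above:

# grep -i "Rebooting" /etc/oracle/oprocd/*
# grep -i "restart" /var/log/messages

If the oprocd entries and the syslog entries line up on the same timestamps, oprocd is the process pulling the trigger, and the diagwait change above is the appropriate fix.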

10. Cluster hung
All SQL queries on GV$ views are hanging. The alert logs from all instances have messages like the ones below:
INST1: IPC Send timeout detected. Sender: ospid 24692 Receiver: inst 1 binc 150 ospid 1650
INST2: IPC Send timeout detected. Receiver ospid 1650
INST3: IPC Send timeout detected. Sender: ospid 12955 Receiver: inst 1 binc 150 ospid 1650
The receiver ospid on all instances belongs to the LCK0 (Lock) process of instance 1. In case of inter-instance lock issues, it is important to identify the instance from which the problem is initiating. As seen from the above, INST1 is the one that needs to be fixed. Just identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.
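To confirm which background process owns the receiver ospid reported in the alert logs (1650 is the ospid from the example above, checked on instance 1's node; the process name shown is only illustrative):

% ps -fp 1650
oracle    1650     1  0 Feb27 ?  00:03:12 ora_lck0_TESTDB1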

11. Inconsistent OCR with invalid permissions
% srvctl add db -d testdb -o /oracle/product/10.2
PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5 : User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
crs_stat doesn't have any trace of the resource, so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.
ocrdump shows:
[DATABASE.LOG.testdb]
UNDEF : SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
[DATABASE.LOG.testdb.INSTANCE]
UNDEF : SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
These log keys are owned by root, and that is the problem. It means the resource was probably added to the OCR using root; though it has since been removed by root, it now cannot be added by the oracle user unless we get rid of the aforementioned keys.
Shut down the entire cluster and either restore from a previous good backup of the OCR using:
ocrconfig -restore backupfilename
(you can get the list of backups using: ocrconfig -showbackup)
or, if you are not sure of the last good backup, you can do the following:
Take an export backup of the OCR using: ocrconfig -export /tmp/export -s online
Edit /tmp/export and remove the two lines pointing to DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE owned by root.
Import it back: ocrconfig -import /tmp/export
After starting the cluster, verify using ocrdump. The OCRDUMPFILE should not have any trace of those leftover log entries owned by root.
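A quick way to confirm the leftover keys are really gone after the import (the dump file path is arbitrary):

# ocrdump /tmp/ocr_after_fix.dmp
# grep -c "DATABASE.LOG.testdb" /tmp/ocr_after_fix.dmp
0
# ocrcheck

A count of 0 means no trace of the root-owned log keys remains, and ocrcheck should still report a clean integrity check.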

Troubleshooting Oracle Clusters and Oracle RAC

Oracle Clusterware has its moments. There are times when it does not want to start for various reasons. In this section we talk about diagnosing the health of the cluster, collecting diagnostic information, and trying to correct problems with Oracle Clusterware.

Checking the Health of the Cluster
Once you have configured a new cluster you will probably want to check the health of that cluster. Additionally, you might want to check the health of the cluster after you have added or removed a node, or if there is something about the cluster that is causing you to suspect that it is suffering from some problem. Follow these steps to give your cluster a full health check:
1. Use the crsctl check has command to check that OHASD is running on the local node and that it is healthy:
[oracle@rac1 admin]$ crsctl check has
CRS-4638: Oracle High Availability Services is online
2. Use the crsctl check crs command to check the OHASD, CRSD, ocssd and EVM daemons:
[oracle@rac1 admin]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
3. Use the crsctl check cluster -all command to check all daemons on all nodes of the cluster:
[oracle@rac1 admin]$ crsctl check cluster -all
********************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
4. Check the cluster logs for any error messages that might have been logged.
If these steps do not indicate a problem and you still feel there is a problem with the cluster, you can do the following (a sketch of these two steps appears below):
1. Stop the cluster (crsctl stop cluster).
2. Start the cluster (crsctl start cluster), monitoring the startup messages to see if any errors occur.
If errors do occur or you still feel there is a problem, check the cluster logs for error messages.
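A sketch of that stop/start cycle with the Clusterware alert log open in a second session (rac1 is the example host used above; GRID_HOME is assumed to be set and its bin directory to be in root's PATH):

[root@rac1 ~]# crsctl stop cluster -all
[root@rac1 ~]# crsctl start cluster -all
[oracle@rac1 ~]$ tail -f $GRID_HOME/log/rac1/alertrac1.log

Watching the alert log while the stack comes up usually points directly at the daemon that is failing to start.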

Collecting Diagnostic Information and Trouble Resolution
Because Oracle Clusterware 11g Release 2 consists of a number of different components, it follows that there are a number of different log files associated with these processes. In this section we will first document the log files associated with the various Clusterware processes. We will then discuss a method of collecting the data in these logs into a single source that you can reference when doing trouble diagnosis.

Log Files
Oracle Clusterware 11g Release 2 generates a number of different log files that can be used to troubleshoot Clusterware problems. Oracle Clusterware 11g adds a new environment variable called GRID_HOME to reference the base of the Oracle Clusterware software home. The Clusterware log files are typically stored under the GRID_HOME directory in a sub-directory called log. Under that directory is another directory with the host name, and then a directory that indicates the Clusterware component that the specific logs are associated with. For example, GRID_HOME/log/myrac1/crsd stores the log files associated with CRSD for the host myrac1. The following table lists the log file directories and the contents of those directories:

Directory Path : Contents
GRID_HOME/log/<host>/alert<host>.log : Clusterware alert log
GRID_HOME/log/<host>/diskmon : Disk Monitor Daemon
GRID_HOME/log/<host>/client : OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL
GRID_HOME/log/<host>/ctssd : Cluster Time Synchronization Service
GRID_HOME/log/<host>/gipcd : Grid Interprocess Communication Daemon
GRID_HOME/log/<host>/ohasd : Oracle High Availability Services Daemon
GRID_HOME/log/<host>/crsd : Cluster Ready Services Daemon
GRID_HOME/log/<host>/gpnpd : Grid Plug and Play Daemon
GRID_HOME/log/<host>/mdnsd : Multicast Domain Name Service Daemon
GRID_HOME/log/<host>/evmd : Event Manager Daemon
GRID_HOME/log/<host>/racg/racgmain : RAC RACG
GRID_HOME/log/<host>/racg/racgeut : RAC RACG
GRID_HOME/log/<host>/racg/racgevtf : RAC RACG
GRID_HOME/log/<host>/racg : RAC RACG (only used if a pre-11.1 database is installed)
GRID_HOME/log/<host>/cssd : Cluster Synchronization Service Daemon
GRID_HOME/log/<host>/srvm : Server Manager
GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11 : HA Service Daemon Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root : HA Service Daemon CSS Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdmonitor_root : HA Service Daemon ocssdMonitor Agent
GRID_HOME/log/<host>/agent/ohasd/orarootagent_root : HA Service Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/oraagent_oracle11 : CRS Daemon Oracle Agent
GRID_HOME/log/<host>/agent/crsd/orarootagent_root : CRS Daemon Oracle Root Agent
GRID_HOME/log/<host>/agent/crsd/ora_oc4j_type_oracle11g : CRS Daemon Oracle OC4J Agent
GRID_HOME/log/<host>/gnsd : Grid Naming Service Daemon

The following diagram provides additional detail as to the location of the Oracle Clusterware log files.
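A quick way to see which of these directories actually exist on a given node (assuming GRID_HOME is set; the exact set of subdirectories varies with version and configured options, so the output shown is only illustrative):

[oracle@rac1 ~]$ ls $GRID_HOME/log/`hostname -s`/
agent  alertrac1.log  client  crsd  cssd  ctssd  diskmon  evmd  gipcd  gpnpd  mdnsd  ohasd  racg  srvm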

Oracle Clusterware will rotate logs over time. This is known as a rollover of the log, and it helps to maintain control of space utilization in the GRID_HOME directory. Each log file type has its own rotation time frame. Rollover log files will typically have the same name as the log file, but with a version number attached to the end. An example of the rolling of log files can be seen in this listing of the GRID_HOME/log/rac1/agent/crsd/oraagent_oracle directory. In the listing, note that there is the current oraagent_oracle log file with an extension of .log, and then there are the additional oraagent_oracle log files with extensions from l01 to l10, of which 10 are maintained. These latter log files are the backup log files.

[oracle@rac1 oraagent_oracle]$ pwd
/ora01/app/11.2.0/grid/log/rac1/agent/crsd/oraagent_oracle
[oracle@rac1 oraagent_oracle]$ ls -al
total 109320
drwxr-xr-t  2 oracle oinstall     4096 Jun 10 20:02 .
drwxrwxrwt  5 root   oinstall     4096 Jun  8 10:29 ..
-rw-r--r--  1 oracle oinstall 10539847 Jun  8 21:09 oraagent_oracle.l01
-rw-r--r--  1 oracle oinstall 10584126 Jun  9 01:50 oraagent_oracle.l02
-rw-r--r--  1 oracle oinstall 10584344 Jun  9 05:37 oraagent_oracle.l03
-rw-r--r--  1 oracle oinstall 10584397 Jun  9 09:26 oraagent_oracle.l04
-rw-r--r--  1 oracle oinstall 10584515 Jun  9 13:17 oraagent_oracle.l05
-rw-r--r--  1 oracle oinstall 10583902 Jun  9 18:29 oraagent_oracle.l06
-rw-r--r--  1 oracle oinstall 10583397 Jun 10 00:51 oraagent_oracle.l07
-rw-r--r--  1 oracle oinstall 10583346 Jun 10 07:13 oraagent_oracle.l08
-rw-r--r--  1 oracle oinstall 10583355 Jun 10 13:35 oraagent_oracle.l09
-rw-r--r--  1 oracle oinstall 10565073 Jun 10 20:02 oraagent_oracle.l10
-rw-r--r--  1 oracle oinstall  5955542 Jun 10 23:38 oraagent_oracle.log
-rw-r--r--  1 oracle oinstall        6 Jun 10 21:20 oraagent_oracle.pid
-rw-r--r--  1 oracle oinstall        0 Jun  8 10:29 oraagent_oracleOUT.log

Collecting Clusterware Diagnostic Data
Oracle provides utilities that make it easier to determine the status of the cluster and to collect the Clusterware log files for problem diagnosis. In this section we will review the diagcollection.pl script, which is used to collect log file information. We will then look at the Cluster Verification Utility (CVU).

Using Diagcollection.pl
Clearly Oracle Clusterware has a number of log files. Often when troubleshooting problems you will want to review several of the log files, and this can involve traversing directories, which can be tedious at best. Additionally, Oracle Support might well ask that you collect up all the Clusterware log files so they can diagnose the problem that you are having.
To make collection of the Clusterware log data easier, Oracle provides a program called diagcollection.pl, which is contained in $GRID_HOME/bin. This script will collect the Clusterware log files and other helpful diagnostic information. The script has a --collect option that you invoke to collect the diagnostic information. The diagcollection.pl script comes with the following options:
* --collect – Collect diagnostic information.
* --clean – Cleans the directory of previous files created by previous runs of diagcollection.pl.
* Options to only collect specific information. These options include --crs, --core or --all. --all is the default setting.
When invoked, the script creates four files in the local directory. These four files are gzipped tarballs and are listed in the following table:

Script Name : Contains
Coredata*tar.gz : Core files and related analysis files.
crsData*tar.gz : Log files from the GRID_HOME/log/<host> directory structure.
ocrData*tar.gz : The results of an execution of ocrdump and ocrcheck. Current OCR backups are also listed.
osData*tar.gz : /var/log/messages and other related files.
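A typical invocation is run as root so that all of the logs are readable. A sketch, using /tmp/diag as an arbitrary working directory (the generated tarball names include the host name and a timestamp):

[root@rac1 ~]# mkdir /tmp/diag && cd /tmp/diag
[root@rac1 diag]# $GRID_HOME/bin/diagcollection.pl --collect --crs
[root@rac1 diag]# ls *.tar.gz
[root@rac1 diag]# $GRID_HOME/bin/diagcollection.pl --clean

The --clean run removes the tarballs from the working directory once they have been reviewed or uploaded to Oracle Support.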

Oracle Cluster Verification Utility (CVU)
The CVU is used to verify that there are no configuration issues with the cluster. In Oracle Clusterware 11g Release 2, CVU is installed as a part of Oracle Clusterware; prior to Oracle Clusterware 11g Release 2 you would need to download it from OTN. CVU is located in the GRID_HOME/bin directory and also in $ORACLE_HOME/bin. You can also run CVU from the Oracle 11g Release 2 install media; in this case, call the program runcluvfy.sh, which calls CVU. CVU supports Oracle Clusterware versions 10gR1 onwards.
CVU can be run in various situations, including during various phases of the install of the initial cluster, to confirm that key components are in place and operational (such as SSH). OUI makes calls to the CVU during the creation of the cluster to ensure that prerequisites were executed.
You can use CVU to check one or all components of the cluster. CVU diagnoses/verifies specific components. Components are groupings based on functionality; examples of components are space, integrity of the cluster, OCR integrity, clock synchronization and so on. In some cases, when problems are detected, CVU can create fixup scripts that are designed to correct the problems that were detected.
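A sketch of a few component-level checks (the node list and the choice of components here are illustrative):

[oracle@rac1 ~]$ cluvfy comp ocr -n all -verbose
[oracle@rac1 ~]$ cluvfy comp clocksync -n all -verbose
[oracle@rac1 ~]$ cluvfy stage -post crsinst -n rac1,rac2

The comp checks verify a single component (here OCR integrity and clock synchronization) across the listed nodes, while the stage check re-runs the post-installation verification as a whole.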

Checking the Oracle Cluster Registry
Node evictions or other problems can be caused by corruption in the OCR. The ocrcheck program provides a way to check the integrity of the OCR; it performs checksum operations on the blocks within the OCR to ensure they are not corrupt. Here is an example of running the ocrcheck program:
[oracle@rac1 admin]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2580
Available space (kbytes) : 259540
ID : 749518627
Device/File Name : +DATA
Device/File integrity check succeeded
Cluster registry integrity check succeeded
Logical corruption check bypassed due to non-privileged user

Oracle Clusterware Trouble Resolution
When dealing with difficult Clusterware issues that befuddle you, there are some basic first steps to perform. The truth is that Oracle Clusterware is a very complex beast. For the DBA who does not deal with solving Clusterware problems on a day-in, day-out basis, determining the nature and resolution of a problem can be an overwhelming challenge, and in your attempts to solve the problem you can cause additional problems and damage to the cluster. These steps are:
1. Check and double check that your RAC database backups are current. If they are not, and at least one node survives, back up your database. The bottom line is that you have an unstable environment, and if the cluster is starting to have issues, there is a lot that can go wrong and a lot of damage that can occur (this is true with a non-clustered database too). Make sure you have protected your data should the whole thing go bottom up. If you have a good backup, backing up the archived redo logs is also a very good idea. Note that this backup is the number one step for any RAC databases on the cluster. Keep in mind that one possible problem your cluster could be starting to experience is issues with the storage infrastructure; consider this carefully when performing a backup on an unstable cluster. It may be that you will want to try to back up to some other storage medium (NAS for example) that uses a different hardware path (for example, one that does not use your HBAs) if possible.
2. Search Metalink/My Oracle Support (MOS) for the problem you are experiencing.
3. If you find nothing on MOS, do a Google search for the problem you are experiencing.
4. Using the diagcollection.pl script, collect the Clusterware logs.
5. Open an SR with Oracle Support. After opening the SR, if you do not know the solution, it is far better to let Oracle Support work with you on a solution; that's what you pay them for.

Dealing With Node Evictions
Node evictions can be hard to diagnose. In this section we address dealing with node evictions. First we ask the question, what can cause a node eviction? We then discuss finding out what actually caused our node eviction.

What Can Cause an Eviction?
A common problem that DBAs have to face with Clusterware is node evictions, which usually lead to a reboot of the node that was evicted. With Oracle Clusterware 11g Release 2 there are two main processes that can cause node evictions:
* Oracle Clusterware Kill Daemon (Oclskd) – Used by CSS to reboot a node when the reboot is requested by one or more other nodes.
* CSSDMONITOR – The OCSSD daemon is monitored by cssdmonitor. If a hang is indicated (say the ocssd daemon is lost) then the node will be rebooted.
Previous to Oracle Clusterware 11g Release 2 the hangcheck-timer module was configured and could also be a cause of nodes rebooting. As of Oracle Clusterware 11g Release 2 this module is no longer needed and should not be enabled.
There are many possible causes of node evictions, some of which might be obvious and some of which might not be. Perhaps the biggest causes of node evictions are:
1. Interconnect issues – A common cause of node eviction issues is that the interconnect is not completely isolated from other network traffic. Ensure that the interconnect is completely isolated from all other network traffic. This includes the switches that the interconnect is attached to.
2. Node time coordination – We have found that even though Oracle Clusterware 11g Release 2 does not indicate that NTP is a requirement, Clusterware does seem to be more stable when it is enabled and working correctly. We recommend that you configure NTP for all nodes.
3. Configuration/certification issues – Ensure that the hardware and software you are using is certified by Oracle. This includes the specific version and even patchset number of each component.
4. OS software components – Ensure that all OS software components have been installed as directed by Oracle. Don't decide to not install a component just because you don't think you are going to need to use it. Make sure that you are installing the correct revision of those components.
5. Patches – Ensure that all patch sets are installed as required. The Oracle documentation and MOS provide a complete list of all required patch sets that must be installed for Clusterware and RAC to work correctly. Take the time to ensure that you have the correct patch sets installed and that you have followed the install directions carefully.
6. Software bugs.
The biggest piece of advice that can be given to avoid instability within a cluster is to get the setup and configuration of that cluster right the first time. If in doubt about any step of the installation, contact Oracle for support.

Finding What Caused the Eviction
Very often with node evictions you will need to engage Oracle Support; Clusterware is complex enough that it will take the support tools that Oracle Support has available to diagnose the problem. However, there are some initial things you can do that might help to solve some basic problems (example commands follow this section):
1. Determine the time the node rebooted, using the uptime UNIX command for example. This will help you determine where in the various logs you will want to look for additional information.
2. Check the following logfiles to begin with:
a. /var/log/messages
b. GRID_HOME/log/<host>/cssd/ocssd.log
c. GRID_HOME/log/<host>/alert<host>.log
Review the logs for error messages that might give you some insight into the problem at hand.
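A sketch of that first look on a node that was evicted and rebooted (the host name in the log path comes from `hostname -s`, and GRID_HOME is assumed to be set):

% uptime
% last reboot | head -3
# grep -i "restart" /var/log/messages | tail -20
% grep -i "evict" $GRID_HOME/log/`hostname -s`/cssd/ocssd.log | tail -20

The uptime and last reboot output pin down when the node went down; the greps then narrow the window you need to read in /var/log/messages and ocssd.log around that time.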
