RAC Troubleshooting Guide and Logs

This document provides guidance on troubleshooting Oracle Real Application Clusters (RAC). It discusses where to find log files for different RAC components, describes common reconfiguration events and their causes, explains how to disable and enable RAC, and provides tips for addressing performance issues like hung databases or sessions.

Uploaded by

binoykumar


RAC Troubleshooting: Provides an introduction to RAC troubleshooting techniques and common issues related to cluster resource logs.
Reconfiguration Process: Describes the reconfiguration process including reasons, steps, and detailed explanations of the events involved.
Oracle RAC Management: Outlines procedure for managing Oracle RAC databases including mount status, reconfiguration checks, and command usage.
Performance Issues and Troubleshooting: Discusses performance problems within Oracle systems including diagnoses and corrective steps with command examples.
Node Eviction and Membership: Details node eviction processes and factors related to cluster membership and network issues.
Debugging CRS and GSD: Covers CRS and GSD debugging including configuration tools and troubleshooting approaches for cluster management.

RAC Troubleshooting

This is the one section that will be updated frequently as my experience with RAC
grows. RAC has been around for a while, so most problems can be resolved with a simple
Google lookup, but a basic understanding of where to look for the problem is required. In
this section I will point you to where to look for problems. Every instance in the cluster has
its own alert log, which is where you would start to look. Alert logs contain startup and
shutdown information, nodes joining and leaving the cluster, etc.
Here is my complete alert log file of my two node RAC starting up.
The cluster itself has a number of log files that can be examined to gain insight into
occurring problems. The table below describes the information that you may need from the
CRS components
$ORA_CRS_HOME/crs/log
    contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init
    contains trace files for the CRS daemon during startup, a good place to start
$ORA_CRS_HOME/css/log
    contains cluster reconfigurations, missed check-ins, connects and disconnects
    Look here to find out when reboots occur
$ORA_CRS_HOME/css/init
    contains core dumps from the cluster synchronization service daemon (OCSd
$ORA_CRS_HOME/evm/log
    log files for the event volume manager and eventlogger daemon
$ORA_CRS_HOME/evm/init
    pid and lock files for EVM
$ORA_CRS_HOME/srvm/log
    log files for Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log
    log files for Oracle clusterware which contains diagnostic messages at the Ora
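A quick way to see which of these directories actually exist on a node is to walk the list from a script. The following is a minimal sketch in Python, not an Oracle tool: the helper name crs_log_dirs is hypothetical, the relative paths are simply the ones from the table above, and the directory layout may differ between Clusterware releases.

```python
import os

# Relative paths taken from the table above; layouts vary by release.
CRS_LOG_SUBDIRS = [
    "crs/log",
    "crs/init",
    "css/log",
    "css/init",
    "evm/log",
    "evm/init",
    "srvm/log",
    "log",
]


def crs_log_dirs(crs_home):
    """Return (path, exists) pairs for each log directory under crs_home."""
    result = []
    for sub in CRS_LOG_SUBDIRS:
        path = os.path.join(crs_home, *sub.split("/"))
        result.append((path, os.path.isdir(path)))
    return result


if __name__ == "__main__":
    # ORA_CRS_HOME is read from the environment; the fallback is a placeholder.
    home = os.environ.get("ORA_CRS_HOME", "/path/to/crs_home")
    for path, exists in crs_log_dirs(home):
        print(("found   " if exists else "missing ") + path)
```

Running it as the Clusterware owner on each node gives a quick checklist of where to start reading.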
As in a normal Oracle single instance environment, a RAC environment contains the
standard RDBMS log files; these files are located by the parameter
background_dump_dest. The most important of these are
$ORACLE_BASE/admin/udump
    contains any trace file generated by a user process
$ORACLE_BASE/admin/cdump
    contains core files that are generated due to a core dump in a user process
Now let's look at a two node startup and the sequence of events
First you must check that the RAC environment is using the correct interconnect; this
can be done by any of the following
logfile
    ## The location of my alert log, yours may be different
    /u01/app/oracle/admin/racdb/bdump/alert_racdb1.log
oifcfg command
    oifcfg getif
table check
    select inst_id, pub_ksxpia, picked_ksxpia, ip_ksxpia from x$ksxpia;
oradebug
    SQL> oradebug setmypid
    SQL> oradebug ipc
    Note: check the trace file which can be located by the parameter user_dump_d
system parameter
    cluster_interconnects
    Note: used to specify which address to use
When the instance starts up, the Lock Monitor's (LMON) job is to register with the Node
Monitor (NM) (see below table). Remember that when a node joins or leaves the cluster the
GRD undergoes a reconfiguration event; as seen in the logfile it is a seven step process
(see below for more details on the seven step process).
The LMON trace file also has details about reconfigurations, including the reason for
the event
reconfiguration reason    description
1    means that the NM initiated the reconfiguration event, typical when a node joins or le
2    means that an instance has died
     How does the RAC detect an instance death, every instance updates the control file w
     checkpoint (CKPT), if the heartbeat information is missing for x amount of time, the
     dead and the Instance Membership Recovery (IMR) process initiates reconfiguration
3    means communication failure of a node/s. Messages are sent across the interconnect
     an amount of time then a communication failure is assumed by default UDP is used a
     an eye on the logs if too many reconfigurations happen for reason 3.
Example of a reconfiguration, taken from the alert log:
Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
01
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows traversed, 3291 replayed
Sat Mar 20 11:35:53 2010
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete

Note: when a reconfiguration happens the GRD is frozen until the reconfiguration is
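When an alert log is long, it can help to pull out just the reconfiguration events and their incarnation numbers. The following is a minimal sketch in Python, assuming the alert log uses the same "Reconfiguration started (old inc N, new inc M)" and "Reconfiguration complete" lines as the example above. The function name find_reconfigurations is hypothetical and the message wording may differ between Oracle versions.

```python
import re

# Pattern based on the alert log excerpt above; wording may vary by version.
RECONFIG_START = re.compile(
    r"Reconfiguration started \(old inc (\d+), new inc (\d+)\)"
)


def find_reconfigurations(alert_log_text):
    """Return one dict per reconfiguration found in the alert log text."""
    events = []
    current = None
    for line in alert_log_text.splitlines():
        match = RECONFIG_START.search(line)
        if match:
            current = {
                "old_inc": int(match.group(1)),
                "new_inc": int(match.group(2)),
                "complete": False,
            }
            events.append(current)
        elif current is not None and "Reconfiguration complete" in line:
            # Mark the most recently started reconfiguration as finished.
            current["complete"] = True
            current = None
    return events


sample = """Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
Global Resource Directory frozen
Reconfiguration complete
"""

for event in find_reconfigurations(sample):
    print(event)
```

An event that never reaches "complete" is worth a closer look in the LMON trace file.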
Confirm that the database has been started in cluster mode; the log file will state the
following
cluster mode
    Sat Mar 20 11:36:02 2010
    Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
    Completed: ALTER DATABASE MOUNT
Starting with 10g the SCN is broadcast across all nodes, and the system has to wait until
all nodes have seen the commit SCN. You can change the broadcast method using the
system parameter _lgwr_async_broadcasts.
Lamport Algorithm
The Lamport algorithm generates SCNs in parallel and they are assigned to transactions on
a first come, first served basis. This is different from a single instance environment.
With the broadcast method, the SCN is broadcast after a commit operation; this method is
more CPU intensive as it has to broadcast the SCN for every commit, but the other nodes
can see the committed SCN immediately.
The initialization parameter max_commit_propagation_delay limits the maximum delay
allowed for SCN propagation, by default it is 7 seconds. When set to less than 100 the
broadcast on commit algorithm is used.
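The rule in the last sentence can be written down as a small check. The following is a minimal sketch in Python that only restates the threshold quoted above. The function name scn_propagation_scheme is hypothetical, and it rests on two assumptions: that values of 100 or more mean the Lamport scheme is in use, and that the value is in the parameter's own units (taken here to be hundredths of a second, which would make the 7 second default 700).

```python
# Threshold quoted in the text above; below this value broadcast on commit is used.
BROADCAST_THRESHOLD = 100


def scn_propagation_scheme(max_commit_propagation_delay):
    """Return the SCN propagation scheme implied by the parameter value.

    Assumption: values at or above the threshold select the Lamport scheme.
    """
    if max_commit_propagation_delay < 0:
        raise ValueError("max_commit_propagation_delay cannot be negative")
    if max_commit_propagation_delay < BROADCAST_THRESHOLD:
        return "broadcast on commit"
    return "lamport"


for value in (0, 99, 100, 700):
    print(value, "->", scn_propagation_scheme(value))
```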
Disable/Enable Oracle RAC
There are times when you may wish to disable RAC. This feature can only be used in a
Unix environment (there is no Windows option).
Disable Oracle RAC (Unix only)
Log in as the oracle user on all nodes
shut down all instances using either the normal or immediate option
change to the working directory $ORACLE_HOME/rdbms/lib
run the below make command to relink the Oracle binaries without the RAC option (should take a few mi

make -f ins_rdbms.mk rac_off

Now relink the Oracle binaries

make -f ins_rdbms.mk ioracle

Enable Oracle RAC (Unix only)
Log in as the oracle user on all nodes
shut down all instances using either the normal or immediate option
change to the working directory $ORACLE_HOME/rdbms/lib
run the below make command to relink the Oracle binaries with the RAC option (should take a few mi

make -f ins_rdbms.mk rac_on

Now relink the Oracle binaries

make -f ins_rdbms.mk ioracle

Performance Issues
Oracle can suffer a number of different performance problems, which can be categorized as
follows

• Hung Database

• Hung Session(s)

• Overall instance/database performance

• Query Performance
A hung database is basically an internal deadlock between two processes. Usually Oracle
will detect the deadlock and roll back one of the processes; however, if the situation occurs
with the internal kernel-level resources (latches or pins), it is unable to automatically
detect and resolve the deadlock, thus hanging the database. When this event occurs you
must obtain dumps from each of the instances (3 dumps per instance at regular intervals); the
trace files will be very large.
capture information
    ## Using alter session
    SQL> alter session set max_dump_file_size = unlimited;
    SQL> alter session set events 'immediate trace name systemstate level 10';

    # using oradebug
    SQL> select * from dual;
    SQL> oradebug setmypid
    SQL> oradebug unlimit
    SQL> oradebug dump systemstate 10

    # using oradebug from another instance
    SQL> select * from dual;
    SQL> oradebug setmypid
    SQL> oradebug unlimit
    SQL> oradebug -g all dump systemstate 10

    Note: the select statement above is to avoid problems on pre 8 Oracle
SQLPlus - problems connecting
    ## If you get problems connecting with SQLPLUS use the command below
    $ sqlplus -prelim
    Enter user-name: / as sysdba
A severe performance problem can be mistaken for a hang; this usually happens because
of contention problems. A systemstate dump is normally used to analyze this problem;
however, a systemstate dump takes a long time to complete and it also has a number of
limitations

• Reads the SGA in a dirty manner, so it may be inconsistent

• Usually dumps a lot of information

• does not identify interesting processes on which to perform additional dumps

• can be a very expensive operation if you have a large SGA.

To overcome these limitations a new utility command called hanganalyze was released
with 8i, which provides clusterwide information in a RAC environment in a single
shot.
sql method
    alter session set events 'immediate trace name hanganalyze level <level>';
oradebug
    SQL> oradebug hanganalyze <level>

    ## Another way using oradebug
    SQL> oradebug setmypid
    SQL> oradebug setinst all
    SQL> oradebug -g def hanganalyze <level>

    Note: you will be told where the output will be dumped to

hanganalyze levels
1-2    only hanganalyze output, no process dump at all
3      Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
4      Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP
5      Level 4 + Dump all processes involved in wait chains (NLEAF state)
10     Dump all processes (IGN state)
The hanganalyze command uses internal kernel calls to determine whether a session is
waiting for a resource and reports the relationship between blockers and waiters. A
systemstate dump is better, but if you are overwhelmed by it try hanganalyze first.
Debugging Node Eviction
A node is evicted from the cluster after it kills itself because it is not able to service the
application; this generally happens when you have communication problems. For node
eviction problems look for ORA-29740 errors in the alert log file and LMON trace files.
To understand eviction problems you need to know the basics of how node membership and
instance membership recovery (IMR) work. When a communication failure happens the
heartbeat information in the controlfile cannot be updated, thus data corruption can happen.
IMR will remove any nodes from the cluster that it deems a problem; it ensures that
the larger part of the cluster survives and kills any remaining nodes. IMR is part of the
service offered by Cluster Group Services (CGS). LMON handles many of the CGS
functionalities; this works at the cluster level and can work with 3rd party software (Sun
Cluster, Veritas Cluster). The Node Monitor (NM) provides information about nodes and
their health by registering and communicating with the Cluster Manager (CM). Node
membership is represented as a bitmap in the GRD. LMON will let other nodes know of
any changes in membership; for example, if a node joins or leaves the cluster, the bitmap
is rebuilt and communicated to all nodes.
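Looking for ORA-29740 by hand is tedious in a large alert log or LMON trace file, so a small scan can help. The following is a minimal sketch in Python. The function name find_eviction_errors is hypothetical, and the sample lines are placeholders made up for illustration, not real Oracle log output.

```python
import re

# Error code named in the text above; matched case-insensitively.
EVICTION_ERROR = re.compile(r"ORA-29740", re.IGNORECASE)


def find_eviction_errors(lines):
    """Return (line_number, text) for every line that mentions ORA-29740."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if EVICTION_ERROR.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Placeholder lines for illustration only, not real log content.
sample_lines = [
    "placeholder line before the error\n",
    "ORA-29740: placeholder message text\n",
    "placeholder line after the error\n",
]

for number, text in find_eviction_errors(sample_lines):
    print(number, text)
```

To scan a real file, pass an open file object instead of the list, for example find_eviction_errors(open(path)), where path points at your own alert log.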
Node registering (alert log)
    lmon registered with NM - instance id 1 (internal mem no 0)
One thing to remember is that all nodes must be able to read from and write to the
controlfile. CGS makes sure that members are valid; it uses a voting mechanism to check
the validity of each member. I have already discussed the voting disk in my architecture
section. As stated above, membership is held in a bitmap in the GRD. The CKPT process
updates the controlfile every 3 seconds in an operation known as a heartbeat. It writes
into a single block that is unique for each instance, thus inter-instance coordination is not
required; this block is called the checkpoint progress record. You can see the controlfile
records using the gv$controlfile_record_section view. All members attempt to obtain a
lock on the controlfile record for updating, and the instance that obtains the lock tallies the
votes from all members. The group membership must conform to the decided (voted)
membership before the GCS/GES reconfiguration is allowed to proceed. The controlfile
vote result is stored in the same block as the heartbeat, in the controlfile checkpoint
progress record.
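The idea behind the heartbeat check can be illustrated with a few lines of code. The following is a minimal sketch in Python of a generic staleness test, not Oracle's implementation. The function name stale_instances and the timeout value are hypothetical; only the 3 second heartbeat interval comes from the text above.

```python
# Heartbeat interval quoted in the text above.
HEARTBEAT_INTERVAL_SECONDS = 3


def stale_instances(last_heartbeat, now, timeout_seconds):
    """Return the sorted ids of instances whose heartbeat is older than the timeout.

    last_heartbeat maps an instance id to the time (in seconds) of its
    most recent heartbeat write.
    """
    return sorted(
        instance_id
        for instance_id, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )


# Hypothetical timeout: ten missed heartbeats in a row.
timeout = 10 * HEARTBEAT_INTERVAL_SECONDS

# Instance 1 wrote 3 seconds ago, instance 2 wrote 33 seconds ago.
heartbeats = {1: 100.0, 2: 70.0}
print(stale_instances(heartbeats, now=103.0, timeout_seconds=timeout))
```

Because every instance writes to its own block, a check like this only needs to read each instance's last write time.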
A cluster reconfiguration is performed using 7 steps
1. Name service is frozen, the CGS contains an internal database of all the
   members/instances in the cluster with all their configurations and servicing
   details.
2. Lock database (IDLM) is frozen, this prevents processes from obtaining locks on
   resources that were mastered by the departing/dead instance
3. Determination of membership and validation and IMR
4. Bitmap rebuild takes place, instance name and uniqueness verification, GCS must
   synchronize the cluster to be sure that all members get the reconfiguration event
   and that they all see the same bitmap.
5. Delete all dead instance entries and republish all names newly configured
6. Unfreeze and release name service for use
7. Hand over reconfiguration to GES/GCS
Debugging CRS and GSD
Oracle server management configuration tools include a diagnostic and tracing facility
for verbose output for SRVCTL, GSD, GSDCTL or SRVCONFIG.
To capture diagnostics, follow the steps below
use vi to edit the gsd.sh/srvctl/srvconfig file in the $ORACLE_HOME/bin directory
At the end of the file look for the below line

exec $JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME

Add the following just before the -classpath in the exec $JRE line

-DTRACING.ENABLED=true -DTRACING.LEVEL=2

the string should look like this

exec $JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath...........
In Oracle database 10g setting the below variable accomplishes the same thing, set it to
blank to remove the debugging
Enable tracing
    $ export SRVM_TRACE=true
Disable tracing
    $ export SRVM_TRACE=""
