
CRS & RAC

Troubleshooting

Krishnadev Telikicherla
Cluster & Parallel Storage Technology
Oracle Corporation

Topics:

 Defining the Issue
 Creating a Timeline
 Hang or Slowdown
 Performance Issues
 Gathering Data
 Testcases
 Rediscovery
 Engaging Oracle Support
 Examples

Defining the Issue
Layers
 What layers are involved in the issue:

• Oracle Clusterware
• CRS daemon
• CSS daemon
• HangCheckTimer (Linux) / Oprocd (non-Linux platforms)
• EVM
• OCR
• Voting disk
• General RDBMS
• Operating System
• Hardware

Defining the Issue
Cause vs. Effects
 Causes:
– Resource issues
– Oracle issues
– OS issues
 Effects:
– Hangs/Spins
– Instance Crashes and Evictions
– Node Reboots and Evictions
– Oracle Errors (ORA-600, ORA-7445, ORA-29740)

Defining the Issue
Description
 When describing the problem while creating the SR
via Metalink, use phrases that will help identify
known issues, either in bugs or in Metalink content.
 In the body of the SR, be as detailed as possible
about the environment.
 Nobody knows the system better than you.
 Talk to the sys-admin as well regarding OS/Network
related issues.

Creating a Timeline

 A timeline helps identify the times to concentrate on
when reviewing files
 A timeline can be built from reviewing the files
themselves once they are provided to support, but this
will only slow resolution time down
 Timelines should order causes and effects
and should include all participating nodes
 Include specific times, for example:
– At 3:00am PST we noticed that node2 was hanging.
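
As a minimal sketch of how observations from several nodes could be merged into a single ordered timeline: the event tuples, node names, and timestamp format below are hypothetical, and all timestamps are assumed to be in the same time zone.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; adjust to match your logs


def build_timeline(events):
    """Sort (timestamp, node, description) tuples into one ordered timeline."""
    parsed = [(datetime.strptime(ts, TIME_FORMAT), node, text)
              for ts, node, text in events]
    parsed.sort(key=lambda entry: entry[0])
    return ["%s [%s] %s" % (ts.strftime(TIME_FORMAT), node, text)
            for ts, node, text in parsed]


# Hypothetical observations noted on two nodes, in the order they were written down.
events = [
    ("2006-05-01 03:02:10", "node1", "instance on node2 evicted"),
    ("2006-05-01 03:00:00", "node2", "node2 noticed hanging"),
    ("2006-05-01 03:01:30", "node1", "interconnect timeouts reported"),
]

for line in build_timeline(events):
    print(line)
```

Keeping the node name on every line makes it easier to see which node reported a symptom first.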

Hang or slowdown

 Differentiate between a database hang and a
database slowdown
 Identify the extent of a hang

Is it a Hang or a Slowdown?

 Check:
 System states to see if there is any change
over a short period of time
 V$SESSION_WAIT where wait_time=0
 Overall machine load, including cpu,
memory, swap, I/O
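
The checks above amount to comparing two observations taken a short time apart. A rough sketch of that comparison, assuming each snapshot has already been reduced to a mapping of SID to (event, SEQ#) for sessions with wait_time=0; the snapshot data here is hypothetical:

```python
def classify_sessions(first, second):
    """Compare two snapshots keyed by SID -> (event, seq#).

    A session with the same event and seq# in both snapshots has not
    completed a wait in between, so it is a hang candidate. A session
    whose seq# advanced is still making progress, however slowly.
    """
    stuck, moving = [], []
    for sid, wait in first.items():
        if sid not in second:
            continue  # session ended (or stopped waiting) between snapshots
        if second[sid] == wait:
            stuck.append(sid)
        else:
            moving.append(sid)
    return sorted(stuck), sorted(moving)


# Hypothetical snapshots taken about a minute apart.
snapshot_1 = {101: ("enq: TX - row lock contention", 57),
              102: ("db file sequential read", 9001)}
snapshot_2 = {101: ("enq: TX - row lock contention", 57),
              102: ("db file sequential read", 9420)}

stuck, moving = classify_sessions(snapshot_1, snapshot_2)
print("hang candidates:", stuck)
print("still progressing:", moving)
```

If every waiting session appears in the first list, the symptoms point toward a hang; if most appear in the second, a slowdown is more likely.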

Is it a Hang or a Slowdown?

 Single or multiprocess hang:
– Usually characterized by a particular job
hanging or not completing
– Essentially the same as in single instance
unless it’s internode parallel query.
 Instance hang: A single instance is
unusable.
 Multi-instance or full database hang: Entire
database is hung or not responding

Performance

 Single process or statement
 Instance
 Multi-Instance

Single Process or Single
Statement
 Find the wait event
 10046 level 12
- oradebug setorapid <orapid>
- oradebug event 10046 trace name context forever, level 12
- oradebug tracefile_name

 Explain plan
 10053 if plan problems are found
 V$SESSTAT
 Truss/trace/dbx/pstack if OS-related
problems are suspected

Instance Slowdown

 Statspack / AWR
 OS performance statistics - cpu, memory,
and I/O
 Characteristics:
– Related to a particular job?
– Certain time of day?
– What’s changed?

Multi-Instance Slowdowns

 AWR from each node can be of use:
 AWR collects instance-specific data
 Examine and correlate the reports

Multi-Instance Slowdowns

 In cases of extreme slowdowns:
 System states from all nodes
 V$SESSION_WAIT
 Alert logs and any trace files
 Process states or stack traces of any specific
processes identified, where applicable

Debugging Techniques

 v$session_wait
 System states from all nodes
 10046 level 12 trace of the hung process
 ORADEBUG
 Lock layer and DLM tracing
 Get any traces:
 DLM traces
 Background processes, alert logs, and init.ora
 User traces

Debugging and Diagnostics

 Performance issues or hangs:
 Identify the resource being requested.
 Identify who holds the resource.
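
Once each waiter has been matched to the holder of the resource it is requesting, the two steps above can be repeated until the process at the head of the chain is found. A minimal sketch, assuming the waiter-to-holder pairs have already been extracted by hand from the diagnostic data; the session names are hypothetical:

```python
def find_root_blocker(waits_for, start):
    """Follow waiter -> holder links starting from `start`.

    Returns (chain, is_cycle). `chain` lists the processes visited in order;
    `is_cycle` is True when the chain loops back on itself, which suggests
    a deadlock-like condition rather than a single root blocker.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in waits_for:
        current = waits_for[current]
        if current in seen:
            return chain + [current], True
        chain.append(current)
        seen.add(current)
    return chain, False


# Hypothetical: sess_A waits on sess_B, which waits on sess_C, which waits on nothing.
waits_for = {"sess_A": "sess_B", "sess_B": "sess_C"}

chain, is_cycle = find_root_blocker(waits_for, "sess_A")
print(" -> ".join(chain), "(cycle)" if is_cycle else "(root blocker: %s)" % chain[-1])
```

The last entry of a non-cyclic chain is the process worth examining first, since everything behind it is only waiting.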

ORADEBUG and Tools

 Hang analyze:
– oradebug hanganalyze <level>
 Note: 301137.1 - OS Watcher User Guide
 Note: 135714.1 - Script to Collect RAC
Diagnostic Information (racdiag.sql)

Gathering Data
Best Practices
 Single most important step
 There is never too much data, but including lots of
useless data can increase the download time as well
as the time needed to process the data.
 Always err on the side of gathering too much data, but
be aware of the impact on the resolution time.
 Too little data increases resolution time more than too
much data.
 Always include a readme.txt file that explains the
contents of the provided files
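
One way to keep the readme.txt in step with the upload is to generate its file listing and fill in the descriptions by hand. A rough sketch; the function name, the layout of the listing, and the example file names are all hypothetical:

```python
import os


def write_readme(directory, descriptions):
    """Write readme.txt in `directory`, listing each file, its size, and a note.

    `descriptions` maps file names to a one-line explanation supplied by
    whoever gathered the data.
    """
    lines = ["Contents of this upload:", ""]
    for name in sorted(os.listdir(directory)):
        if name == "readme.txt":
            continue  # do not list the readme itself
        size = os.path.getsize(os.path.join(directory, name))
        note = descriptions.get(name, "(no description provided)")
        lines.append("%s  %d bytes  %s" % (name, size, note))
    path = os.path.join(directory, "readme.txt")
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return path
```

A call such as `write_readme("upload", {"alert_node1.log": "alert log from node1 covering the hang"})` would then flag any file that still lacks a description.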

Gathering Data
Processes
 Always get stacks from processes that seem
to be spinning, hanging or unresponsive:
– oradebug
– gdb
– pstack
 ps and top info can be very useful when
trying to determine if a process exhibits
issues such as memory leaks, spinning or
hanging

Gathering Data
RAC
 For instance evictions please review Metalink
note 219361.1
 See Metalink note 203226.1 : RAC Survival
Kit: Real Application Clusters Troubleshooting
and Information
 See Metalink note 289690.1 : Data Gathering
for Troubleshooting RAC and CRS issues

Gathering Data
Tools
 RDA – system and Oracle configuration information
 racdiag – modifiable SQL script for gathering RAC data. See
Metalink note 135714.1 “Script to Collect RAC Diagnostic
Information”
 OSW – OS Watcher gathers top, slabinfo, netstat and ps data
over programmable intervals. See Metalink note 301137.1
“OS Watcher User Guide”

Gathering Data
CRS 10.2.0.x (continued)
 CRS and other resource issues:
– ORA_CRS_HOME
 log/<hostname>/cssd/oclsmon
 log/<hostname>/cssd
 log/<hostname>/client
 log/<hostname>/crsd
 log/<hostname>/evmd
 log/<hostname>/racg
– ORACLE_HOME (rdbms)
 racg/dump
 ORACLE_BASE/<db_name>/hdump
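
As a small sketch of turning the ORA_CRS_HOME list above into concrete paths for one node: the CRS home and hostname in the example are hypothetical, and only the subdirectories named on the slide are included.

```python
import os
import posixpath

# Subdirectories named above, relative to $ORA_CRS_HOME/log/<hostname>.
CRS_LOG_SUBDIRS = ["cssd/oclsmon", "cssd", "client", "crsd", "evmd", "racg"]


def crs_log_dirs(crs_home, hostname):
    """Return the CRS log directories for one node as full paths."""
    return [posixpath.join(crs_home, "log", hostname, sub)
            for sub in CRS_LOG_SUBDIRS]


def missing_dirs(paths):
    """Return the paths that do not exist on this machine."""
    return [path for path in paths if not os.path.isdir(path)]


# Hypothetical CRS home and hostname.
for path in crs_log_dirs("/u01/app/crs", "node1"):
    print(path)
```

Running `missing_dirs` over the result on each node is a quick check that nothing was left out before packaging the logs.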

Gathering Data
Tools (continued)
 Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all
RAC-relevant files (run as root)
oracle10@stnsp010>./diagcollection.pl
Production Copyright 2004, 2005, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection
    --collect
        [--crs] For collecting crs diag information
        [--oh]  For collecting oracle home diag information
        [--ob]  For collecting oracle base diag information
        [--all] Default.For collecting all diag information
        NOTE:
        1. You can also do the following
           ./diagcollection.pl --collect --crs --oh
        2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables
           need to be set.
    --clean        cleans up the diagnosability
                   information gathered by this script
    --coreanalyze  extracts information from core files
                   and stores it in a text file

Testcases

 Not always feasible
 If provided, can greatly influence resolution time
 When providing a testcase:
– Include a readme file
– Try to strip the testcase down to the minimal elements that
are needed to reproduce the problem
 If at all possible, always try to build a testcase
 Testcases are your friends!

Rediscovery

 Expensive for a support organization
 Issue rediscovery is not always obvious
 Use Metalink to identify possible causes for
issues as well as workarounds and patch
availability
 Communicate new issues between DBAs

Engaging Oracle Support

 Try to be responsive to all TARs when they
are set to CUS status. Delays inherently
cause two problems:
1. The issue loses momentum
2. A new engineer may have to take over the issue

Examples

 10.2.0.2 HP-UX/Itanium ServiceGuard, CRS,
CFS and RAC
 Delays in reconfiguration

Examples

 10.2.0.2 Linux CRS, RAC and ASM
 ORA-600[2103] and one instance crashed

Questions?
