
RAC Assurance Team

RAC System Test Plan Outline


11gR2
Version 2.4

Purpose
Before a new computer/cluster system is deployed in production, it is important to test the system thoroughly to validate
that it will perform at a satisfactory level relative to its service level objectives. Testing is also required when
introducing major or minor changes to the system. This document provides an outline consisting of basic guidelines and
recommendations for testing a new RAC system. This test plan outline can be used as a framework for building a
system test plan specific to each company's RAC implementation and its associated service level objectives.

Scope of System Testing


This document provides an outline of basic testing guidelines, in the form of an organized test plan, used to validate core
component functionality for RAC environments. Every application exercises the underlying software and hardware
infrastructure differently, and must be tested as part of a component testing strategy. Each new system must be tested
thoroughly, in an environment that is a realistic representation of the production environment in terms of configuration,
capacity, and workload, prior to going live or after implementing significant architectural/system modifications. Without a
completed system implementation and fully functional end-user applications, only core component functionality testing is
possible, to verify cluster, RDBMS and various sub-component behaviors for the networking and I/O subsystems and
miscellaneous database administrative functions.

In addition to the specific system testing outlined in this document, additional testing needs to be defined and executed
for RMAN, backup and recovery, and Data Guard (for disaster recovery). Each component area of testing also requires
specific operational procedures to be documented and maintained to address site-specific requirements.

Testing Objectives
In addition to application functionality testing, overall system testing is normally performed for one or more of the
following reasons:
• Verify that the system has been installed and configured correctly. Check that nothing is broken. Establish a
baseline of functional behavior so that the question "has this ever worked in this environment?" can be answered
later.
• Verify that basic functionality still works in a specific environment and for a specific workload. Vendors normally
test their products very thoroughly, but it is not possible to test all possible hardware/software combinations and
unique workloads.
• Make sure that the system will achieve its objectives, in particular, availability and performance objectives. This can
be very complex and normally requires some form of simulated production environment and workload.
• Test operational procedures. This includes normal operational procedures and recovery procedures.
• Train operations staff.

Planning System Testing


Effective system testing requires careful planning. The service level objectives for the system itself and for the testing
must be clearly understood and a detailed test plan should be documented. The basis for all testing is that the current
best practices for RAC system configuration have been implemented before testing.

Testing should be performed in an environment that mirrors the production environment as much as possible. The
software configuration should be identical, but for cost reasons it might be necessary to use a scaled-down hardware
configuration. All testing should be performed while running a workload that is as close to production as possible.
When planning for system testing it is extremely important to understand how the application has been designed to
handle the failures outlined in this plan and to ensure that the expected results are met at the application level as well as
the database level. Oracle technologies that enable fault tolerance of the database at the application level include the
following:
• Fast Application Notification (FAN) – A notification mechanism that alerts applications to service-level changes of
the database.
• Fast Connection Failover (FCF) – Utilizes FAN events to enable database clients to react proactively to down events
by quickly failing over connections to surviving database instances.
• Transparent Application Failover (TAF) – Allows connections to be automatically re-established to a surviving
database instance if the instance servicing the initial connection fails. TAF can fail over in-flight select statements
(if configured), but insert, update and delete transactions will be rolled back.
• Runtime Connection Load Balancing (RCLB) – Provides intelligence about the current service level of the database
instances to application connection pools. This increases application performance by using the least loaded servers
to service application requests and allows for dynamic workload balancing in the event of the loss of service by a
database instance or an increase of service by adding a database instance.
More information on each of the above technologies can be found in the Oracle Real Application Clusters
Administration and Deployment Guide 11g Release 2. A sample client-side configuration sketch follows.
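
As one illustration of the client-side piece of this testing, the following is a minimal sketch of a TNS connect descriptor
with TAF enabled. The SCAN name (rac-scan), port and service name (oltp_srv) are hypothetical placeholders; adjust
them to the actual environment and client type:

    OLTP_SRV =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan)(PORT = 1521))
        (CONNECT_DATA =
          (SERVICE_NAME = oltp_srv)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5))
        )
      )

With a descriptor like this in place, the TAF-related expected results in the fault injection tests below (in-flight selects
continuing, active DML rolling back) can be verified from a simple test client.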

Generating a realistic application workload can be complex and expensive but it is the most important factor for effective
testing. For each individual test in the plan, a clear understanding of the following is required:
• What is the objective of the test and how does this relate to the overall system objectives?
• Exactly how will the test be performed and what are the execution steps?
• What are the success/failure criteria, and what are the expected results?
• How will the test result be measured?
• Which tools will be used?
• Which logfiles and other data will be collected?
• Which operational procedures are relevant?
• What are the expected results of the application for each of the defined tests (TAF, FCF, RCLB)?

Notes for Windows Users


Many of the Fault Injection Tests outlined in this document involve abnormal termination of various processes within the
Oracle Software stack. On Unix/Linux systems this is easily achieved by using “ps” and “kill” commands. Natively,
Windows does not provide the ability to view enough details of running processes to properly identify and kill the
processes involved in the Fault Injection Testing. To overcome this limitation a utility called Process Explorer (provided
by Microsoft) will be used to identify and kill the necessary processes. Process Explorer can be found on the Windows
Sysinternals website within Microsoft Technet (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx). In
addition to Process Explorer, a utility called orakill will be used to kill individual threads within the database. More
information on orakill can be found in My Oracle Support Note 69882.1.
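
For example, a minimal sketch of the orakill approach (assuming a Windows instance with SID ORCL1; the thread ID is
obtained with the same v$bgprocess/v$process query used in the fault injection tests below):

    SQL> select p.spid from v$bgprocess b, v$process p
         where b.paddr = p.addr and b.name = 'PMON';
    cmd> orakill ORCL1 <spid returned by the query>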

Production Simulation / System Stress Test


The best way to ensure that the system will perform well without any problems is to simulate production workload and
conditions before going live. Ideally the system should be stressed a little more than what is expected in production. In
addition to running the normal user and application workload, all normal operational procedures should also be tested at
the same time. The output from the normal monitoring procedures should be kept and compared with the real data when
going live. Normal maintenance operations such as adding users, adding disk space, reorganizing tables and indexes,
backup, archiving data, etc. must also be tested. A commercial or in-house developed workload generator is essential.
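
If no commercial workload tool is available, even a simple script-based driver can provide a repeatable baseline load. The
following is a minimal sketch, assuming a test account, a TNS alias of oltp_srv and a workload.sql script containing
representative application transactions (all placeholder names):

    #!/bin/sh
    # Launch N concurrent sqlplus sessions, each looping over a representative workload script.
    SESSIONS=20
    for i in $(seq 1 $SESSIONS); do
      (
        while true; do
          sqlplus -s appuser/apppwd@oltp_srv @workload.sql > /dev/null
        done
      ) &
    done
    wait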

Fault Injection Testing


The system configuration and operational procedures must also be tested to make sure that component failures and other
problems can be dealt with as efficiently as possible and with minimum impact on system availability. This section
provides some examples of tests that can be used as part of a system test plan. The idea is to test the system’s robustness
against various failures. Depending on the overall architecture and objectives, only some of the tests might be used
and/or additional tests might have to be constructed. Introducing multiple failures at the same time should also be
considered.

This list only covers testing for RAC-related components and procedures. Additional tests are required for other parts of
the system. These tests should be performed with a realistic workload on the system. Procedures for detecting and
recovering from these failures must also be tested.

In some worst-case scenarios it might not be possible to recover the system within an acceptable time frame and a
disaster recovery plan should specify how to switch to an alternative system or location. This should also be tested.

The result of a test should initially be measured at a business or user level to see if the result is within the service level
agreement. If a test fails it will be necessary to gather and analyze the relevant log and trace files. The analysis can result
in system tuning, changing the system architecture or possibly reporting component problems to the appropriate vendor.
Also, if the system objectives turn out to be unrealistic, they might have to be changed.

System Testing Scenarios

Test 1: Planned Node Reboot

Test Procedure:
• Start client workload.
• Identify the instance with the most client connections.
• Reboot the node where the most loaded instance is running:
  o For AIX, HPUX, Windows: "shutdown -r"
  o For Linux: "shutdown -r now"
  o For Solaris: "reboot"

Expected Results:
• The instances and other Clusterware resources that were running on that node go offline (no value for the 'SERVER' field of "crsctl stat res -t" output).
• The node VIP fails over to one of the surviving nodes and will show a state of "INTERMEDIATE" with state_details of "FAILED_OVER".
• The SCAN VIP(s) that were running on the rebooted node will fail over to surviving nodes.
• The SCAN Listener(s) running on that node will fail over to a surviving node.
• Instance recovery is performed by another instance.
• Services are moved to available instances, if the downed instance is specified as a preferred instance.
• Client connections are moved/reconnected to surviving instances (procedure and timings will depend on client types and configuration). With TAF configured, select statements should continue; active DML will be aborted.
• After the database reconfiguration, surviving instances continue processing their workload.

Measures:
• Time to detect node or instance failure.
• Time to complete instance recovery. Check the alert log of the instance performing the recovery.
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload).
• Duration of database reconfiguration.
• Time before the failed instance is restarted automatically by Clusterware and is accepting new connections.
• Successful failover of the SCAN VIP(s) and SCAN Listener(s).

Actual Results/Notes:

Test 2: Unplanned Node Failure of the OCR Master

Test Procedure:
• Start client workload.
• Identify the node that is the OCR master using the following grep command from any of the nodes:
  grep -i "OCR MASTER" $GI_HOME/log/<node_name>/crsd/crsd.l*
  NOTE: Windows users must manually review the $GI_HOME/log/<node_name>/crsd/crsd.l* logs to determine the OCR master.
• Power off the node that is the OCR master.
  NOTE: On many servers the power-off switch will perform a controlled shutdown, and it might be necessary to cut the power supply.

Expected Results:
• Same as Planned Node Reboot.

Measures:
• Same as Planned Node Reboot.

Actual Results/Notes:
Test 3: Restart Failed Node

Expected Results:
• On clusters having 3 or fewer nodes, one of the SCAN VIPs and Listeners will be relocated to the restarted node when Oracle Clusterware starts.
• The VIP will migrate back to the restarted node.
• Services that had failed over as a result of the node failure will NOT automatically be relocated.
• Failed resources (ASM, listener, instance, etc.) will be restarted by the Clusterware.

Measures:
• Time for all resources to become available again. Check with "crsctl stat res -t".

Actual Results/Notes:

Test 4: Reboot All Nodes at the Same Time

Test Procedure:
• Issue a reboot on all nodes at the same time:
  o For AIX, HPUX, Windows: "shutdown -r"
  o For Linux: "shutdown -r now"
  o For Solaris: "reboot"

Expected Results:
• All nodes, instances and resources are restarted without problems.

Measures:
• Time for all resources to become available again. Check with "crsctl stat res -t".

Actual Results/Notes:

Test 5: Unplanned Instance Failure

Test Procedure:
• Start client workload.
• Identify the single database instance with the most client connections and abnormally terminate that instance:
  o For AIX, HPUX, Linux, Solaris:
    Obtain the PID for the pmon process of the database instance:
    # ps -ef | grep pmon
    Kill the pmon process:
    # kill -9 <pmon pid>
  o For Windows:
    Obtain the thread ID of the pmon thread of the database instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread:
    cmd> orakill <SID> <Thread ID>

Expected Results:
• One of the other instances performs instance recovery.
• Services are moved to available instances, if a preferred instance failed.
• Client connections are moved/reconnected to surviving instances (procedure and timings will depend on client types and configuration).
• After a short freeze, surviving instances continue processing the workload.
• The failing instance will be restarted by Oracle Clusterware, unless this feature has been disabled.

Measures:
• Time to detect instance failure.
• Time to complete instance recovery. Check the alert log of the recovering instance.
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload).
• Duration of database freeze during failover.
• Time before the failed instance is restarted automatically by Oracle Clusterware and is accepting new connections.

Actual Results/Notes:
Test 6: Planned Instance Termination

Test Procedure:
• Issue a "shutdown abort".

Expected Results:
• One other instance performs instance recovery.
• Services are moved to available instances, if a preferred instance failed.
• Client connections are moved/reconnected to surviving instances (procedure and timings will depend on client types and configuration).
• The instance will NOT be automatically restarted by Oracle Clusterware due to the user-invoked shutdown.

Measures:
• Time to detect instance failure.
• Time to complete instance recovery. Check the alert log of the recovering instance.
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload).
• The instance will NOT be restarted by Oracle Clusterware due to the user-induced shutdown.

Actual Results/Notes:

Test 7: Restart Failed Instance

Test Procedure:
• Automatic restart by Oracle Clusterware if it is an uncontrolled failure.
• Manual restart is necessary if a "shutdown" command was issued (a srvctl sketch follows this test).
• Manual restart when the "Auto Start" option for the related instance has been disabled.

Expected Results:
• Instance rejoins the RAC cluster without any problems (review alert logs etc.).
• Client connections and workload will be load balanced across the new instance (a manual procedure might be required to redistribute the workload for long-running / permanent connections).

Measures:
• Time before services and workload are rebalanced across all instances (including any manual steps).

Actual Results/Notes:
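
Where a manual restart is required, the following is a sketch of typical commands (assuming an administrator-managed
database named orcl with instance orcl2; adjust names to the environment):

    # Restart a single instance on its node
    srvctl start instance -d orcl -i orcl2
    # Verify status and resource placement
    srvctl status database -d orcl
    crsctl stat res -t
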
Test 8: Unplanned ASM Instance Failure

Test Procedure:
• Start client workload.
• Identify a single ASM instance in the cluster and abnormally terminate it:
  o For AIX, HPUX, Linux, Solaris:
    Obtain the PID for the pmon process of the ASM instance:
    # ps -ef | grep pmon
    Kill the pmon process:
    # kill -9 <pmon pid>
  o For Windows:
    Obtain the thread ID of the pmon thread of the ASM instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread:
    cmd> orakill <SID> <Thread ID>

Expected Results:
• The *.dg, *.acfs, *.asm and *.db resources that were running on that node will go offline (crsctl stat res -t). By default these resources will be automatically restarted by Oracle Clusterware.
• One other instance performs instance recovery.
• Services are moved to available instances, if a preferred instance failed.
• Client connections are moved/reconnected to surviving instances (procedure and timings will depend on client types and configuration).
• After the database reconfiguration is complete, surviving instances continue processing the workload.
• The Clusterware alert log will show crsd going offline due to an inaccessible OCR if the OCR is stored in ASM. CRSD will automatically restart.

Measures:
• Time to detect instance failure.
• Time to complete instance recovery. Check the alert log of the recovering instance.
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload).
• Duration of database reconfiguration.
• Time before the failed resources are restarted and the database instance is accepting new connections.

Actual Results/Notes:

Test 9: Unplanned Multiple Instance Failure

Test Procedure:
• Start client workload.
• Abnormally terminate 2 different database instances from the same database at the same time:
  o For AIX, HPUX, Linux, Solaris:
    Obtain the PID for the pmon process of the database instance:
    # ps -ef | grep pmon
    Kill the pmon process:
    # kill -9 <pmon pid>
  o For Windows:
    Obtain the thread ID of the pmon thread of the database instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread:
    cmd> orakill <SID> <Thread ID>

Expected Results:
• Same as instance failure.
• Both instances should be recovered and restarted without problems.

Measures:
• Same as instance failure.

Actual Results/Notes:

Test 10: Listener Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the listener process:
  # ps -ef | grep tnslsnr
  Kill the listener process:
  # kill -9 <listener pid>
• For Windows:
  Use Process Explorer to identify the tnslistener.exe process for the database listener. This will be the tnslistener.exe registered to the "<home name>TNSListener" service (not the "<home name>TNSListenerLISTENER_SCAN<n>" service). Once the proper tnslistener.exe is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• No impact on connected database sessions.
• New connections are redirected to the listener on another node (depends on client configuration).
• The local database instance will receive new connections if shared server is used. The local database instance will NOT receive new connections if dedicated server is used.
• The listener failure is detected by the ORAAGENT and the listener is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time for the Clusterware to detect the failure and restart the listener.

Actual Results/Notes:

Test 11: SCAN Listener Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the SCAN listener process:
  # ps -ef | grep tnslsnr
  Kill the listener process:
  # kill -9 <listener pid>
• For Windows:
  Use Process Explorer to identify the tnslistener.exe process for the SCAN listener. This will be the tnslistener.exe registered to the "<home name>TNSListenerLISTENER_SCAN<n>" service (not the "<home name>TNSListener" service). Once the proper tnslistener.exe is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• No impact on connected database sessions.
• New connections are redirected to the listener on another node (depends on client configuration).
• The listener failure is detected by the CRSD ORAAGENT and the listener is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Same as Listener Failure.

Actual Results/Notes:

Test 12: Public Network Failure

Test Procedure:
• Unplug all network cables for the public network.

NOTE: Configurations using NIS must also have implemented NSCD for this test to succeed with the expected results.

NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.

Expected Results:
• Check with "crsctl stat res -t":
  o The ora.*.network and listener resources will go offline for the node.
  o SCAN VIPs and SCAN Listeners running on the node will fail over to a surviving node.
  o The VIP for the node will fail over to a surviving node.
• The database instance will remain up but will be unregistered with the remote listeners.
• Database services will fail over to one of the other available nodes.
• If TAF is configured, clients should fail over to an available instance.

Measures:
• Time to detect the network failure and relocate resources.

Actual Results/Notes:

Test 13: Public NIC Failure

Test Procedure:
• Assuming dual NICs are configured for the public interface for redundancy (e.g. bonding, teaming, etc.).
• Unplug the network cable from 1 of the NICs.

NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.

Expected Results:
• Network traffic should fail over to the other NIC without impacting any of the cluster resources (a Linux bonding status sketch follows this test).

Measures:
• Time to fail over to the other NIC card. With bonding/teaming configured this should be less than 100ms.

Actual Results/Notes:
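
On Linux with the bonding driver, one way to observe the NIC failover is to watch the active slave change when the cable
is pulled (a sketch; the bond device name bond0 is a placeholder, and other platforms/teaming solutions have their own
equivalents):

    # Active slave before and after the cable pull
    grep -i "currently active slave" /proc/net/bonding/bond0
    # Failover events in the system log
    grep -i bond0 /var/log/messages | tail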

Test 14: Interconnect Network Failure (11.2.0.1)

Note: The method in which a node is evicted has changed in 11.2.0.2 with the introduction of a new feature called Rebootless Restart. Rebootless Restart aims to achieve a node eviction without actually rebooting the node.

Test Procedure:
• Unplug all network cables for the interconnect network.

NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.

Expected Results (11.2.0.1):
• CSSD will detect the split-brain situation and perform one of the following:
  o In a two-node cluster the node with the lowest node number will survive; the other node will be rebooted.
  o In a multiple node cluster the largest sub-cluster will survive; the others will be rebooted.
• Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Measures (11.2.0.1):
• Time to detect the split-brain and start eviction.
• See measures for node failure.

Actual Results/Notes:

Test 14 (cont'd): Interconnect Network Failure (11.2.0.2 and higher)

Note: The method in which a node is evicted has changed in 11.2.0.2 with the introduction of a new feature called Rebootless Restart. Rebootless Restart aims to achieve a node eviction without actually rebooting the node.

Test Procedure:
• Unplug all network cables for the interconnect network.

NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.

Expected Results (11.2.0.2 and above):
• CSSD will detect the split-brain situation and perform one of the following:
  o In a two-node cluster the node with the lowest node number will survive.
  o In a multiple node cluster the largest sub-cluster will survive.
• On the node(s) being evicted, a graceful shutdown of Oracle Clusterware will be attempted:
  o All I/O capable client processes will be terminated and all resources will be cleaned up. If process termination and/or resource cleanup does not complete successfully, the node will be rebooted.
  o Assuming that the above has completed successfully, OHASD will attempt to restart the stack. In this case the stack will be restarted once the network connectivity of the private interconnect has been restored.
• Review the following logs:
  o $GI_HOME/log/<nodename>/alert<nodename>.log
  o $GI_HOME/log/<nodename>/cssd/ocssd.log

Measures (11.2.0.2 and above):
• Oracle Clusterware will gracefully shut down; should graceful shutdown fail (due to I/O processes not being terminated or resource cleanup), the node will be rebooted.
• Assuming that the graceful shutdown of Oracle Clusterware succeeded, OHASD will restart the stack once network connectivity for the private interconnect has been restored.

Actual Results/Notes:

Test 15: Interconnect NIC Failure (OS or 3rd Party NIC Redundancy)

Test Procedure:
• Assuming dual NICs are configured for the private interface for redundancy (e.g. bonding, teaming, etc.).
• Unplug the network cable from 1 of the NICs.

NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.

Expected Results:
• Network traffic should fail over to the other NIC without impacting any of the cluster resources.

Measures:
• Time to fail over to the other NIC card. With bonding/teaming configured this should be less than 100ms.

Actual Results/Notes:

Test 16: Interconnect NIC Failure (Oracle Redundant Interconnect, 11.2.0.2 and higher only)

Note: This test is applicable for those on 11.2.0.2 and higher using Oracle Redundant Interconnect/HAIP.

Test Procedure:
• Assuming 2 or more NICs are configured for Oracle Redundant Interconnect and HAIP.
• Unplug the network cable from 1 of the NICs (an interface verification sketch follows this test).

NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.

NOTE: At present it is REQUIRED that all interconnect interfaces be placed on separate subnets. If the interfaces are all on the same subnet and the cable is pulled from the first NIC in the routing table, a rebootless restart or node reboot will occur. See MOS Note 1210883.1 for additional details.

Expected Results:
• The HAIP running on the NIC from which the cable was pulled will fail over to one of the surviving NICs in the configuration.
• Clusterware and/or RAC communication will not be impacted.
• Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/gipcd/gipcd.log
• Upon reconnecting the cable, the HAIP that failed over will relocate back to its original interface.

Measures:
• Failover (and failback) will be seamless (no disruption in service from any node in the cluster).

Actual Results/Notes:
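
To confirm which interfaces carry the interconnect and HAIP addresses before and after the cable pull, checks such as the
following may be useful (a sketch; interface names will differ per environment):

    # Interconnect interfaces registered with the Clusterware
    $GI_HOME/bin/oifcfg getif
    # HAIP addresses in use by the instances
    SQL> select inst_id, name, ip_address from gv$cluster_interconnects;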

Test 17: Interconnect Switch Failure (Redundant Switch Configuration)

Test Procedure:
• In a redundant network switch configuration, power off one switch.

Expected Results:
• Network traffic should fail over to the other switch without any impact on interconnect traffic or instances.

Measures:
• Time to fail over to the other NIC card. With bonding/teaming/11.2 Redundant Interconnect configured this should be less than 100ms.

Actual Results/Notes:

Test 18: Node Loses Access to Disks with CSS Voting Device

Note: The method in which a node is evicted has changed in 11.2.0.2 with the introduction of a new feature called Rebootless Restart. Rebootless Restart aims to achieve a node eviction without actually rebooting the node.

Test Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from one node to the disks containing the CSS Voting Device(s).

NOTE: To perform this test it may be necessary to isolate the CSS Voting Device(s) to an isolated ASM diskgroup or CFS.

Expected Results:
For 11.2.0.1:
• CSS will detect this and evict the node with a reboot. Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log
For 11.2.0.2 and above:
• CSS will detect this and evict the node as follows:
  o All I/O capable client processes will be terminated and all resources will be cleaned up. If process termination and/or resource cleanup does not complete successfully, the node will be rebooted.
  o Assuming that the above has completed successfully, OHASD will attempt to restart the stack. In this case the stack will be restarted once the network connectivity of the private interconnect has been restored.
• Review the following logs:
  o $GI_HOME/log/<nodename>/alert<nodename>.log
  o $GI_HOME/log/<nodename>/cssd/ocssd.log

Measures:
For 11.2.0.1:
• See measures for node failure.
For 11.2.0.2 and above:
• Oracle Clusterware will gracefully shut down; should graceful shutdown fail (due to I/O processes not being terminated or resource cleanup), the node will be rebooted.
• Assuming that the graceful shutdown of Oracle Clusterware succeeded, OHASD will restart the stack once network connectivity for the private interconnect has been restored.

Actual Results/Notes:

Test 19: Node Loses Access to Disks with OCR Device(s)

Test Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from one node to the disks containing the OCR Device(s).

NOTE: To perform this test it may be necessary to isolate the OCR Device(s) to an isolated ASM diskgroup or CFS.

Expected Results:
• CRSD will detect the failure of the OCR device and abort. OHASD will attempt to restart CRSD 10 times, after which manual intervention will be required.
• The database instance, ASM instance and listeners will not be impacted.
• Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log

Measures:
• Monitor database status under load to ensure no service interruption occurs.

Actual Results/Notes:

Test 20: Node Loses Access to a Single Path of the Disk Subsystem (OCR, Voting Device, Database Files)

Test Procedure:
• Unplug one external storage cable connection (SCSI, FC or LAN cable) from the node to the disk subsystem.

Expected Results:
• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency.
• No impact to database instances.

Measures:
• Monitor database status under load to ensure no service interruption occurs.
• Path failover should be visible in the OS log files.

Actual Results/Notes:

Test 21: ASM Disk Lost

Test Procedure:
• Assuming ASM normal redundancy.
• Power off / pull out / offline (depending on configuration) one ASM disk.

Expected Results:
• No impact on database instances.
• ASM starts rebalancing (view ASM alert logs).

Measures:
• Monitor progress: select * from v$asm_operation

Actual Results/Notes:

Test 22: ASM Disk Repaired

Test Procedure:
• Power on / insert / online the ASM disk.

Expected Results:
• No impact on database instances.
• ASM starts rebalancing (view ASM alert logs).

Measures:
• Monitor progress: select * from v$asm_operation (an expanded monitoring query follows this test).

Actual Results/Notes:
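
A slightly more targeted version of the monitoring query used in Tests 21 and 22 is sketched below (columns are from the
standard v$asm_operation view):

    SQL> select group_number, operation, state, power, sofar, est_work, est_minutes
         from v$asm_operation;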

Test 23: One Multiplexed Voting Device Is Inaccessible

Test Procedure:
• Remove access to a multiplexed voting disk from all nodes. If the voting disks are in a normal redundancy disk group, remove access to one of the ASM disks.

Expected Results:
• The cluster will remain available.
• The voting disk will be automatically brought online when access is restored.
• Voting disks can be queried using "crsctl query css votedisk". Review the following logs:
  o $GI_HOME/log/<nodename>/cssd/ocssd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Measures:
• No impact on the cluster.

Actual Results/Notes:

Test 24: Lose and Recover One Copy of the OCR

Test Procedure:
1. Remove access to one copy of the OCR or force a dismount of the ASM diskgroup (asmcmd umount <dg_name> -f).
2. Replace the disk or remount the diskgroup; ocrcheck will report the OCR to be out of sync.
3. Delete the corrupt OCR copy (ocrconfig -delete +<diskgroup>) and re-add the OCR (ocrconfig -add +<diskgroup>). This avoids having to stop CRSD.

NOTE: This test assumes that the OCR is mirrored to 2 ASM diskgroups that do not contain voting disks or data, or is stored on CFS.

Expected Results:
• There will be no impact on the cluster operation. The loss of access and restoration of the missing/corrupt OCR will be reported in:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log

Measures:
• There is no impact on the cluster operation.
• The OCR can be replaced online, without a cluster outage.

Actual Results/Notes:

Test 25: Add a Node to the Cluster and Extend the Database (if Admin Managed) to That Node

Test Procedure:
• Follow the procedures in the Oracle Clusterware Administration and Deployment Guide 11g Release 2, Chapter 4, to extend the Grid Infrastructure to the new node (a command-level sketch follows this test).
• After extending the Grid Infrastructure, follow the procedures in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2, Chapter 10, to extend the RDBMS binaries and database to the new node.

Expected Results:
• The new node will successfully be added to the cluster.
• If the database is policy managed and there is free space in the server pool for the new node, the database will be extended to the new node automatically (OMF should be enabled so no user intervention is required).
• The new database instance will begin servicing connections.

Measures:
• The node is dynamically added to the cluster.
• If the database is policy managed, an instance for the database will automatically be created on the new node.

Actual Results/Notes:
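
As a command-level sketch of the documented procedure (assuming a new node named node3 with VIP node3-vip; the
authoritative steps, prerequisites and root scripts are in the guides referenced above):

    # From an existing node, as the Grid Infrastructure owner:
    $GI_HOME/oui/bin/addNode.sh -silent "CLUSTER_NEW_NODES={node3}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={node3-vip}"
    # After the GI extension and root scripts, extend the RDBMS home from an existing node:
    $ORACLE_HOME/oui/bin/addNode.sh -silent "CLUSTER_NEW_NODES={node3}"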

Test 26: Remove a Node from the Cluster

Test Procedure:
• Follow the procedures in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2, Chapter 10, to delete the node from the cluster.
• After successfully removing the RDBMS installation, follow the procedures in the Oracle Clusterware Administration and Deployment Guide 11g Release 2, Chapter 4, to remove the node from the cluster.

Expected Results:
• The connections to the database instance being removed will fail over to the remaining instances (if configured).
• The node will be successfully removed from the cluster.

Measures:
• The node will be dynamically removed from the cluster.

Actual Results/Notes:

System Testing Scenarios: Clusterware Process Failures
NOTE: This section of the system testing scenarios demonstrates failures of various Oracle Clusterware processes. These process failures are NOT within the realm of typical failures
within a RAC system. Killing these processes during normal operation is highly discouraged by Oracle Support. This section is intended to provide a better understanding of the
Clusterware processes, their functionality and the logging performed by each of these processes.

Test 1: CRSD Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CRSD process:
  # ps -ef | grep crsd
  Kill the CRSD process:
  # kill -9 <crsd pid>
• For Windows:
  Use Process Explorer to identify the crsd.exe process. Once the crsd.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• CRSD process failure is detected by the orarootagent and CRSD is restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.log

Measures:
• Time to restart the CRSD process.

Actual Results/Notes:

Test 2: EVMD Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the EVMD process:
  # ps -ef | grep evmd
  Kill the EVMD process:
  # kill -9 <evmd pid>
• For Windows:
  Use Process Explorer to identify the evmd.exe process. Once the evmd.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• EVMD process failure is detected by the OHASD oraagent and EVMD is restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/evmd/evmd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oraagent_grid/oraagent_grid.log

Measures:
• Time to restart the EVMD process.

Actual Results/Notes:

Test 3: CSSD Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CSSD process:
  # ps -ef | grep cssd
  Kill the CSSD process:
  # kill -9 <cssd pid>
• For Windows:
  Use Process Explorer to identify the ocssd.exe process. Once the ocssd.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The node will reboot.
• Cluster reconfiguration will take place.
• Windows ONLY: On the system console a blue screen will show with a stop code of 0x0000ffff, which indicates that the OraFence driver rebooted the box due to a CSSD failure.

Measures:
• Time for the eviction and cluster reconfiguration on the surviving nodes.
• Time for the node to come back online and the reconfiguration to complete to add the node as an active member of the cluster.

Actual Results/Notes:

Test 4: CRSD ORAAGENT RDBMS Process Failure

NOTE: Test valid only for multi-user installations.

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CRSD oraagent for the RDBMS software owner:
  # cat $GI_HOME/log/<nodename>/agent/crsd/oraagent_<rdbms_owner>/oraagent_<rdbms_owner>.pid
  # kill -9 <pid for RDBMS oraagent process>

Expected Results:
• The ORAAGENT process failure is detected by CRSD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<rdbms_owner>/oraagent_<rdbms_owner>.log

Measures:
• Time to restart the ORAAGENT process.

Actual Results/Notes:

Test 5: CRSD ORAAGENT Grid Infrastructure Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CRSD oraagent for the GI software owner:
  # cat $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
  # kill -9 <pid for GI oraagent process>
• For Windows:
  Use Process Explorer to identify the crsd oraagent.exe process that is a child process of crsd.exe (or obtain the PID for the crsd oraagent.exe as shown in the Unix/Linux instructions above). Once the proper oraagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The Grid Infrastructure ORAAGENT process failure is detected by CRSD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time to restart the ORAAGENT process.

Actual Results/Notes:

Test 6: CRSD ORAROOTAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CRSD orarootagent:
  # cat $GI_HOME/log/<nodename>/agent/crsd/orarootagent_root/orarootagent_root.pid
  # kill -9 <pid for orarootagent process>
• For Windows:
  Use Process Explorer to identify the crsd orarootagent.exe process that is a child process of crsd.exe (or obtain the PID for the crsd orarootagent.exe as shown in the Unix/Linux instructions above). Once the proper orarootagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The ORAROOTAGENT process failure is detected by CRSD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/orarootagent_root/orarootagent_root.log

Measures:
• Time to restart the ORAROOTAGENT process.

Actual Results/Notes:

Test 7: OHASD ORAAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the OHASD oraagent:
  # cat $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
  # kill -9 <pid for oraagent process>
• For Windows:
  Use Process Explorer to identify the ohasd oraagent.exe process that is a child process of ohasd.exe (or obtain the PID for the ohasd oraagent.exe as shown in the Unix/Linux instructions above). Once the proper oraagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The ORAAGENT process failure is detected by OHASD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.log

Measures:
• Time to restart the ORAAGENT process.

Actual Results/Notes:

Test 8: OHASD ORAROOTAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the OHASD orarootagent:
  # cat $GI_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.pid
  # kill -9 <pid for orarootagent process>
• For Windows:
  Use Process Explorer to identify the ohasd orarootagent.exe process that is a child process of ohasd.exe (or obtain the PID for the ohasd orarootagent.exe as shown in the Unix/Linux instructions above). Once the proper orarootagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The ORAROOTAGENT process failure is detected by OHASD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.log

Measures:
• Time to restart the ORAROOTAGENT process.

Actual Results/Notes:

Test 9: CSSDAGENT Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CSSDAGENT:
  # ps -ef | grep cssdagent
  # kill -9 <pid for cssdagent process>
• For Windows:
  Use Process Explorer to identify the cssdagent.exe process. Once the cssdagent.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The CSSDAGENT process failure is detected by OHASD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oracssdagent_root/oracssdagent_root.log

Measures:
• Time to restart the CSSDAGENT process.

Actual Results/Notes:

Test 10: CSSDMONITOR Process Failure

Test Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CSSDMONITOR:
  # ps -ef | grep cssdmonitor
  # kill -9 <pid for cssdmonitor process>
• For Windows:
  Use Process Explorer to identify the cssdmonitor.exe process. Once the cssdmonitor.exe process is identified, kill the process by right-clicking the executable and choosing "Kill Process".

Expected Results:
• The CSSDMONITOR process failure is detected by OHASD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log

Measures:
• Time to restart the CSSDMONITOR process.

Actual Results/Notes:

Component Functionality Testing
Normally it should not be necessary to perform additional functionality testing for each individual system component.
However, for some new components in new environments it might be useful to perform additional testing to make sure
that they are configured properly. This testing will also help system and database administrators become familiar with
new technology components.

Cluster Infrastructure
To simplify testing and problem diagnosis it is often very useful to do some basic testing on the cluster infrastructure
without Oracle software or a workload running. Normally this testing is performed after installing the hardware and
operating system, but before installing any Oracle software. If problems are encountered during the System Stress Test or
Destructive Testing, diagnosis and analysis can be facilitated by testing the cluster infrastructure separately. Typically
some of these destructive tests will be used (a basic verification sketch follows the lists below):

• Node Failure. Obviously without Oracle software or workload.
• Restart Failed Node
• Reboot all nodes at the same time
• Lost disk access
• HBA failover. Assuming multiple HBAs with failover capability.
• Disk controller failover. Assuming multiple disk controllers with failover capability.
• Public NIC Failure
• Interconnect NIC Failure
• NAS (NetApp) storage failure – In case of a complete mirror failure, measure the time needed for the storage
reconfiguration to complete. Check the same when going into maintenance mode.

If using non-Oracle cluster software:

• Interconnect Network Failure
• Lost access to cluster voting/quorum disk
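
The Cluster Verification Utility can assist with this stage of checking before any Oracle software is installed. A basic
verification sketch, run from the staged Grid Infrastructure media (node names are placeholders):

    # Verify hardware/OS setup and node connectivity across the cluster nodes
    ./runcluvfy.sh stage -post hwos -n node1,node2 -verbose
    ./runcluvfy.sh comp nodecon -n node1,node2 -verbose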

ASM Test and Validation


This test and validation plan is intended to give the customer or engineer a procedural approach to:
• Validating the installation of RAC-ASM
• Functional and operational validation of ASM

Component Testing: ASM Functional Tests

Test 1: Verify that candidate disks are available

Test Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by ASM.
• Log in to ASM via SQL*Plus and run:
  "select name, group_number, path, state, header_status, mode_status, label from v$asm_disk"

Expected Results/Measures:
• The newly added LUN will appear as a candidate disk within ASM.

Actual Results/Notes:

Test 2: Create an external redundancy ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "create diskgroup <dg name> external redundancy disk '<candidate path>';"

Expected Results/Measures:
• A successfully created diskgroup. This diskgroup should also be listed in v$asm_diskgroup.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Actual Results/Notes:

Test 3: Create a normal or high redundancy ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "create diskgroup <dg name> normal redundancy disk '<candidate1 path>', '<candidate2 path>';"

Expected Results/Measures:
• A successfully created diskgroup with normal redundancy and two failure groups. For high redundancy, three failure groups will be created.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Actual Results/Notes:

Test 4: Add a disk to an ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> add disk '<candidate1 path>';"

NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup.

Actual Results/Notes:

Test 5: Drop an ASM disk from a diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> drop disk <disk name>;"

NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The data from the removed disk will be rebalanced across the remaining disks in the diskgroup. Once the rebalance is complete the disk will have a header_status of "FORMER" (v$asm_disk) and will be a candidate to be added to another diskgroup.

Actual Results/Notes:

Test 6: Undrop an ASM disk that is currently being dropped using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> drop disk <disk name>;"
• Before the rebalance completes, run the following command via SQL*Plus:
  "alter diskgroup <dg name> undrop disks;"

NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The undrop operation will roll back the drop operation (assuming it has not completed). The disk entry will remain in v$asm_disk as a MEMBER.

Actual Results/Notes:

Test 7: Drop an ASM diskgroup using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "drop diskgroup <dg name>;"

Expected Results/Measures:
• The diskgroup will be successfully dropped.
• The diskgroup will be unregistered as a Clusterware resource (crsctl stat res -t).

Actual Results/Notes:

Test 8: Modify the rebalance power of an active operation using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> add disk '<candidate1 path>';"
• Before the rebalance completes, run the following command via SQL*Plus:
  "alter diskgroup <dg name> rebalance power <1 - 11>;" (1 is the default rebalance power)

NOTE: Progress can be monitored by querying v$asm_operation.

Expected Results/Measures:
• The rebalance power of the current operation will be increased to the specified value. This is visible in the v$asm_operation view.

Actual Results/Notes:

Test 9: Verify CSS-database communication and ASM file access

Test Procedure:
• Start all the database instances and query the v$asm_client view in the ASM instances.

Expected Results/Measures:
• Each database instance should be listed in the v$asm_client view.

Actual Results/Notes:

Test 10: Check the internal consistency of diskgroup metadata using SQL*Plus

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <name> check all;"

Expected Results/Measures:
• If there are no internal inconsistencies, the statement "Diskgroup altered" will be returned (asmcmd will return back to the asmcmd prompt). If inconsistencies are discovered, then appropriate messages are displayed describing the problem.

Actual Results/Notes:

Component Testing: ASM Functional Tests – ASMCMD

Test 1: Verify that candidate disks are available

Test Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by ASM.
• Log in to ASM via ASMCMD and run:
  "lsdsk --candidate"

Expected Results/Measures:
• The newly added LUN will appear as a candidate disk within ASM.

Actual Results/Notes:

Test 2: Create an external redundancy ASM diskgroup using ASMCMD

Test Procedure:
• Identify the candidate disks for the diskgroup by running:
  "lsdsk --candidate"
• Create an XML config file to define the diskgroup, e.g.:
  <dg name="<dg name>" redundancy="external">
    <dsk string="<disk path>" />
    <a name="compatible.asm" value="11.1"/>
    <a name="compatible.rdbms" value="11.1"/>
  </dg>
• Log in to ASM via ASMCMD and run:
  "mkdg <config file>.xml"

Expected Results/Measures:
• A successfully created diskgroup. This diskgroup can be viewed using the "lsdg" ASMCMD command.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Actual Results/Notes:

Test 3: Create a normal or high redundancy ASM diskgroup using ASMCMD

Test Procedure:
• Identify the candidate disks for the diskgroup by running:
  "lsdsk --candidate"
• Create an XML config file to define the diskgroup, e.g.:
  <dg name="<dg_name>" redundancy="normal">
    <fg name="fg1">
      <dsk string="<disk path>" />
    </fg>
    <fg name="fg2">
      <dsk string="<disk path>" />
    </fg>
    <a name="compatible.asm" value="11.1"/>
    <a name="compatible.rdbms" value="11.1"/>
  </dg>
• Log in to ASM via ASMCMD and run:
  "mkdg <config file>.xml"

Expected Results/Measures:
• A successfully created diskgroup. This diskgroup can be viewed using the "lsdg" ASMCMD command.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).

Actual Results/Notes:

Test 4: Add a disk to an ASM diskgroup using ASMCMD

Test Procedure:
• Identify the candidate disk to be added by running:
  "lsdsk --candidate"
• Create an XML config file to define the diskgroup change, e.g.:
  <chdg name="<dg name>">
    <add>
      <dsk string="<disk path>"/>
    </add>
  </chdg>
• Log in to ASM via ASMCMD and run:
  "chdg <config file>.xml"

NOTE: Progress can be monitored by running "lsop".

Expected Results/Measures:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup. Progress of the rebalance can be monitored by running the "lsop" ASMCMD command.

Actual Results/Notes:

Test 5: Drop an ASM disk from a diskgroup using ASMCMD

Test Procedure:
• Identify the ASM name for the disk to be dropped from the given diskgroup:
  "lsdsk -G <dg name> -k"
• Create an XML config file to define the diskgroup change, e.g.:
  <chdg name="<dg name>">
    <drop>
      <dsk name="<disk name>"/>
    </drop>
  </chdg>
• Log in to ASM via ASMCMD and run:
  "chdg <config file>.xml"

NOTE: Progress can be monitored by running "lsop".

Expected Results/Measures:
• The data from the removed disk will be rebalanced across the remaining disks in the diskgroup. Once the rebalance is complete the disk will be listed as a candidate (lsdsk --candidate) to be added to another diskgroup. Progress can be monitored by running "lsop".

Actual Results/Notes:

Test 6: Modify the rebalance power of an active operation using ASMCMD

Test Procedure:
• Add a disk to a diskgroup (as shown above).
• Identify the rebalance operation by running "lsop" via ASMCMD.
• Before the rebalance completes, run the following command via ASMCMD:
  "rebal --power <1-11> <dg name>"

NOTE: Progress can be monitored by running "lsop".

Expected Results/Measures:
• The rebalance power of the current operation will be increased to the specified value. This is visible with the lsop command.

Actual Results/Notes:

Test 7: Drop an ASM diskgroup using ASMCMD

Test Procedure:
• Log in to ASM via ASMCMD and run:
  "dropdg <dg name>"

Expected Results/Measures:
• The diskgroup will be successfully dropped.
• The diskgroup will be unregistered as a Clusterware resource (crsctl stat res -t).

Actual Results/Notes:

Component Testing: ASM Objects Functional Tests

Test 1: Create an ASM template

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> add template unreliable attributes (unprotected fine);"

Expected Results/Measures:
• The ASM template will be successfully created and visible within the v$asm_template view.

Actual Results/Notes:

Test 2: Apply an ASM template

Test Procedure:
• Use the template above and apply it to a new tablespace to be created on the database.
• Log in to the database via SQL*Plus and run:
  "create tablespace test datafile '+<dg name>/my_files(unreliable)' size 10M;"

Expected Results/Measures:
• The datafile is created using the attributes of the ASM template.

Actual Results/Notes:

Test 3: Drop an ASM template

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> drop template unreliable;"

Expected Results/Measures:
• The template should be removed from v$asm_template.

Actual Results/Notes:

Test 4: Create an ASM directory

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup <dg name> add directory '+<dg name>/my_files';"

Expected Results/Measures:
• You can use the asmcmd tool to check that the new directory name was created in the desired diskgroup.
• The created directory will have an entry in v$asm_directory.

Actual Results/Notes:

Test 5: Create an ASM alias

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup DATA add alias '+DATA/my_files/datafile_alias' for '+<dg name>/<db name>/DATAFILE/<file name>';"

Expected Results/Measures:
• Verify that the alias exists in v$asm_alias.

Actual Results/Notes:

Test 6: Drop an ASM alias

Test Procedure:
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup DATA drop alias '+<dg name>/my_files/datafile_alias';"

Expected Results/Measures:
• Verify that the alias does not exist in v$asm_alias.

Actual Results/Notes:

Test 7: Drop an active database file within ASM

Test Procedure:
• Identify a data file from a running database.
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup data drop file '+<dg name>/<db name>/DATAFILE/<file name>';"

Expected Results/Measures:
• This will fail with the following message:
  ERROR at line 1:
  ORA-15032: not all alterations performed
  ORA-15028: ASM file '+DATA/V102/DATAFILE/TEST.269.654602409' not dropped; currently being accessed

Actual Results/Notes:

Test 8: Drop an inactive database file within ASM

Test Procedure:
• Identify a datafile that is no longer used by a database.
• Log in to ASM via SQL*Plus and run:
  "alter diskgroup data drop file '+<dg name>/<db name>/DATAFILE/<file name>';"

Expected Results/Measures:
• Observe that the file number in v$asm_file is now removed.

Actual Results/Notes:

Component Testing: ASM ACFS Functional Tests


Test 1: Create an ASM Dynamic Volume

Test Procedure:
• Create an ASM diskgroup to house the ASM logical volume. ASMCMD or SQL*Plus may be used to achieve this task. The diskgroup compatibility attributes COMPATIBLE.ASM and COMPATIBLE.ADVM must be set to 11.2 or higher.
• Log in to ASM via ASMCMD and create the logical volume to house the ACFS filesystem:
  "volcreate -G <dg name> -s <size> <vol name>"

Expected Results/Measures:
• The volume will be created with the specified attributes. The volume can be viewed in ASMCMD by running "volinfo -a".

Actual Results/Notes:

Test 2: Create an ACFS filesystem
Procedure:
• Within ASMCMD issue the "volinfo -a" command and take note of the Volume Device path.
• As the root user create an ACFS filesystem on the ASM Dynamic Volume as follows:
  "/sbin/mkfs -t acfs <volume device path>"
Expected Results:
• The filesystem will be successfully created. The filesystem attributes can be viewed by running "/sbin/acfsutil info fs".
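For example, assuming the volume device reported above is /dev/asm/testvol-123 (the numeric suffix is generated and will differ):

    # as root
    /sbin/mkfs -t acfs /dev/asm/testvol-123
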
Test 3: Mount the ACFS filesystem
Procedure:
• As the root user execute the following to mount the ACFS filesystem:
  "/sbin/mount -t acfs <volume device path> <mount point>"

NOTE: If acfsutil was not used to register the file system, the dynamic volume must be enabled on the remote nodes before mounting (within ASMCMD run "volenable").
Expected Results:
• The filesystem will successfully be mounted and will be visible.
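Illustrative commands, assuming the /dev/asm/testvol-123 device and a mount point of /u01/app/oracle/acfsmounts/testvol (both are assumptions):

    # as the Grid Infrastructure owner, on any node where the volume is not yet enabled
    asmcmd volenable -G ACFSDG testvol
    # as root, on each node
    mkdir -p /u01/app/oracle/acfsmounts/testvol
    /sbin/mount -t acfs /dev/asm/testvol-123 /u01/app/oracle/acfsmounts/testvol
    /sbin/acfsutil info fs /u01/app/oracle/acfsmounts/testvol
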
Test 4: Add an ACFS filesystem to the ACFS mount registry
Procedure:
• Use acfsutil to register the ACFS filesystem:
  "/sbin/acfsutil registry -a <volume device path> <mount point>"
Expected Results:
• The filesystem will be registered with the ACFS registry. This can be validated by running "/sbin/acfsutil registry -l".
• The filesystem will be automounted on all nodes in the cluster on reboot.

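For example (device and mount point as assumed above):

    # as root
    /sbin/acfsutil registry -a /dev/asm/testvol-123 /u01/app/oracle/acfsmounts/testvol
    /sbin/acfsutil registry -l             # the new registry entry should be listed
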
Test 5: Create a file on the ACFS filesystem
Procedure:
• Perform the following: echo "Testing ACFS" > <mount point>/testfile
• Perform a "cat" command on the file on all nodes in the cluster.
Expected Results:
• The file will exist on all nodes with the specified contents.

Test 6: Remove an ACFS filesystem from the ACFS mount registry
Procedure:
• Use acfsutil to deregister the ACFS filesystem:
  "/sbin/acfsutil registry -d <volume device path>"
Expected Results:
• The filesystem will be removed from the ACFS registry. This can be validated by running "/sbin/acfsutil registry -l".
• The filesystem will NOT be automounted on all nodes in the cluster on reboot.

Test 7: Add an ACFS filesystem as a Clusterware resource

NOTE: This is required when using ACFS for a shared RDBMS Home. When ACFS is registered as a CRS resource it should NOT be registered in the ACFS mount registry.

Procedure:
• Execute the following command as root to add an ACFS filesystem as a Clusterware resource:
  "srvctl add filesystem -d <volume device path> -v <volume name> -g <dg name> -m <mount point> -u root"
• Start the ACFS filesystem resource:
  "srvctl start filesystem -d <volume device path>"
Expected Results:
• The filesystem will be registered as a resource within the Clusterware. This can be validated by running "crsctl stat res -t".
• The filesystem will be automounted on all nodes in the cluster on reboot.
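As an illustration, continuing the assumed names from the earlier ACFS tests:

    # as root
    srvctl add filesystem -d /dev/asm/testvol-123 -v testvol -g ACFSDG -m /u01/app/oracle/acfsmounts/testvol -u root
    srvctl start filesystem -d /dev/asm/testvol-123
    crsctl stat res -t                     # the new ACFS filesystem resource should be ONLINE
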
Test 8: Increase the size of an ACFS filesystem
Procedure:
• Add a disk to the diskgroup housing the ACFS filesystem (if necessary).
• Use acfsutil as the root user to resize the ACFS filesystem:
  "acfsutil size <size><K|M|G> <mount point>"
Expected Results:
• The dynamic volume and filesystem will be resized without an outage of the filesystem, provided enough free space exists in the diskgroup. Validate with "df -h".
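For example, to grow the filesystem by 5 GB in place (mount point as assumed above):

    # as root
    /sbin/acfsutil size +5G /u01/app/oracle/acfsmounts/testvol
    df -h /u01/app/oracle/acfsmounts/testvol
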
Test 9: Install a shared Oracle Home on an ACFS filesystem
Procedure:
• Create an ACFS filesystem that is a minimum of 6GB in size.
• Add the ACFS filesystem as a Clusterware resource.
• Install the 11gR2 RDBMS on the shared ACFS filesystem (see install guide).
Expected Results:
• The shared 11gR2 RDBMS Home will be successfully installed.
Test 10: Create a snapshot of an ACFS filesystem
Procedure:
• Use acfsutil to create a snapshot of an ACFS filesystem:
  "/sbin/acfsutil snap create <name> <ACFS mount point>"
Expected Results:
• A snapshot of the ACFS file system will be created under <ACFS mount point>/.ACFS/snaps.

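A sketch, assuming a snapshot name of presync (hypothetical) and the mount point used above:

    # as root
    /sbin/acfsutil snap create presync /u01/app/oracle/acfsmounts/testvol
    ls /u01/app/oracle/acfsmounts/testvol/.ACFS/snaps      # the presync snapshot should be listed
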
Test 11: Delete a snapshot of an ACFS filesystem
Procedure:
• Use acfsutil to delete a previously created snapshot of an ACFS filesystem:
  "/sbin/acfsutil snap delete <name> <ACFS mount point>"
Expected Results:
• The specified snapshot will be deleted and will no longer appear under <ACFS mount point>/.ACFS/snaps.

Test 12: Perform an FSCK of an ACFS filesystem
Procedure:
• Dismount the ACFS filesystem to be checked on ALL nodes:
  o If the filesystem is registered as a Clusterware resource, issue "srvctl stop filesystem -d <device path>" to dismount the filesystem on all nodes.
  o If the filesystem is only in the ACFS mount registry or is not registered with Clusterware in any way, dismount the filesystem using "umount <mount point>".
• Execute fsck on the ACFS filesystem as follows:
  "/sbin/fsck -a -v -y -t acfs <device path>"
Expected Results:
• FSCK will check the specified ACFS filesystem for errors, automatically fix any errors (-a), answer yes to any prompts (-y) and provide verbose output (-v).

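An illustrative sequence for a Clusterware-registered filesystem (device path as assumed earlier):

    # as root: dismount on all nodes, check, then remount
    srvctl stop filesystem -d /dev/asm/testvol-123
    /sbin/fsck -a -v -y -t acfs /dev/asm/testvol-123
    srvctl start filesystem -d /dev/asm/testvol-123
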
Test 13: Delete an ACFS filesystem
Procedure:
• Dismount the ACFS filesystem to be deleted on ALL nodes:
  o If the filesystem is registered as a Clusterware resource, issue "srvctl stop filesystem -d <device path>" to dismount the filesystem on all nodes.
  o If the filesystem is only in the ACFS mount registry or is not registered with CRS in any way, dismount the filesystem using "umount <mount point>".
• If the filesystem is registered with the ACFS mount registry, deregister the mount point using acfsutil as follows:
  "/sbin/acfsutil registry -d <device path>"
• Remove the filesystem from the Dynamic Volume using acfsutil:
  "/sbin/acfsutil rmfs <device path>"
Expected Results:
• The ACFS filesystem will be removed from the ASM Dynamic Volume. Attempts to mount the filesystem should now fail.
Test 14: Remove an ASM Dynamic Volume
Procedure:
• Use ASMCMD to delete an ASM Dynamic Volume:
  "voldelete -G <dg name> <vol name>"
Expected Results:
• The removed Dynamic Volume will no longer be listed in the output of "volinfo -a".
• The disk space utilized by the Dynamic Volume will be returned to the diskgroup.

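Continuing the example names used above:

    # as root: remove the ACFS filesystem from the volume
    /sbin/acfsutil rmfs /dev/asm/testvol-123
    # as the Grid Infrastructure owner: drop the dynamic volume
    asmcmd voldelete -G ACFSDG testvol
    asmcmd volinfo -a                      # testvol should no longer be listed
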
Component Testing: ASM Tools & Utilities
Test # Test Procedure Expected Results/Measures Actual Results/Notes
Test 1: Run dbverify on the database files
Procedure:
• Specify each file individually using the dbv utility:
  "dbv userid=<user>/<password> file='<ASM filename>' blocksize=<blocksize>"
Expected Results:
• The output should be similar to the following, with no errors present:

  DBVERIFY - Verification complete

  Total Pages Examined         : 640
  Total Pages Processed (Data) : 45
  Total Pages Failing (Data)   : 0
  Total Pages Processed (Index): 2
  Total Pages Failing (Index)  : 0
  Total Pages Processed (Other): 31
  Total Pages Processed (Seg)  : 0
  Total Pages Failing (Seg)    : 0
  Total Pages Empty            : 562
  Total Pages Marked Corrupt   : 0
  Total Pages Influx           : 0
  Highest block SCN            : 0 (0.0)

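For example, against a hypothetical ASM datafile with an 8K block size (<password> is a placeholder):

    dbv userid=system/<password> file='+DATA/ORCL/DATAFILE/users.259.713498765' blocksize=8192
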
Test 2: Use dbms_file_transfer to copy files from ASM to filesystem
Procedure:
• Use the dbms_file_transfer.put_file and get_file procedures to copy database files (datafiles, archived logs, etc.) into and out of ASM.

NOTE: This requires that database directory objects be pre-created and available for the source and destination directories. See the PL/SQL Packages and Types Reference for dbms_file_transfer details.
Expected Results:
• The put_file and get_file procedures will copy files successfully to/from the filesystem. This provides an alternate option for migrating to ASM, or simply for copying files out of ASM.

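A minimal sketch using the closely related copy_file procedure for a same-database copy (put_file/get_file take the same directory and file arguments plus a database link); every object, path and file name below is an assumption:

    export ORACLE_SID=orcl1
    {
      echo "CREATE OR REPLACE DIRECTORY asm_dir AS '+DATA/ORCL/DATAFILE';"
      echo "CREATE OR REPLACE DIRECTORY fs_dir AS '/u01/app/oracle/stage';"
      echo "EXEC dbms_file_transfer.copy_file('ASM_DIR','users.259.713498765','FS_DIR','users01.dbf')"
    } | sqlplus -S / as sysdba
    # Take the tablespace read-only or offline first if a consistent copy of a datafile is required
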
Component Testing: Miscellaneous Tests
Test # Test Procedure Expected Results/Measures Actual Results/Notes
Test 1: Diagnostics Procedure for Hang/Slowdown
Procedure:
• Start client workload.
• Execute automatic and manual procedures to collect database, Clusterware and operating system diagnostics (hanganalyze, racdiag.sql).
Expected Results:
• Diagnostics collection procedures complete normally.
Measures:
• Time to run diagnostics procedures. Is it acceptable to wait for this time before restarting instances or nodes in a production situation?

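One commonly used manual collection step is a cluster-wide hanganalyze dump; the level (3) and options below are typical choices rather than requirements of this plan:

    {
      echo "oradebug setmypid"
      echo "oradebug unlimit"
      echo "oradebug -g all hanganalyze 3"
    } | sqlplus / as sysdba
    # The resulting trace files are written to the instances' diagnostic trace directories
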
Appendix I: Linux Specific Tests

Test # Test Procedure Expected Results/Measures Actual Results/Notes
Test 1: Create an OCFS2 filesystem
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS2.
• Create the appropriate partition table on the disk and use "partprobe" to rescan the partition tables.
• Create the OCFS2 filesystem by running:
  "/sbin/mkfs -t ocfs2 <device path>"
• Add the filesystem to /etc/fstab on all nodes.
• Mount the filesystem on all nodes.
Expected Results:
• The OCFS2 filesystem will be created.
• The OCFS2 filesystem will be mounted on all nodes.

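Illustrative commands, assuming the shared device is /dev/sdb (partition /dev/sdb1) and the mount point is /u02/ocfs2; the _netdev option defers mounting until the network and O2CB stack are up:

    # as root, on one node (partition the disk first, e.g. with fdisk)
    partprobe /dev/sdb
    /sbin/mkfs -t ocfs2 -L ocfs2vol /dev/sdb1
    # as root, on every node
    mkdir -p /u02/ocfs2
    echo "/dev/sdb1  /u02/ocfs2  ocfs2  _netdev,defaults  0 0" >> /etc/fstab
    mount /u02/ocfs2
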
Test 2: Create a file on the OCFS2 filesystem
Procedure:
• Perform the following: echo "Testing OCFS2" > <mount point>/testfile
• Perform a "cat" command on the file on all nodes in the cluster.
Expected Results:
• The file will exist on all nodes with the specified contents.

Test 3: Verify that the OCFS2 filesystem is available after a system reboot
Procedure:
• Issue a "shutdown -r now".
Expected Results:
• The OCFS2 filesystem will automatically mount and be accessible to all nodes after a reboot.
Test 4: Enable database archive logs to OCFS2
Procedure:
• Modify the database archive log settings to utilize OCFS2.

NOTE: If using the OCFS2 filesystem for database files it must be mounted with the following options: rw,datavolume,nointr
Expected Results:
• Archivelog files are created, and available to all nodes on the specified OCFS2 filesystem.
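For example, pointing the archive destination at an OCFS2 directory (path assumed; the database is assumed to already run in ARCHIVELOG mode):

    mkdir -p /u02/ocfs2/arch        # on the OCFS2 filesystem, visible to all nodes
    echo "ALTER SYSTEM SET log_archive_dest_1='LOCATION=/u02/ocfs2/arch' SCOPE=BOTH SID='*';" | sqlplus -S / as sysdba
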
Test 5: Create an RMAN backup on an OCFS2 filesystem

NOTE: If using the OCFS2 filesystem for database files it must be mounted with the following options: rw,datavolume,nointr

Procedure:
• Back up ASM based datafiles to the OCFS2 filesystem.
• Execute baseline recovery scenarios (full, point-in-time, datafile).
Expected Results:
• RMAN backupsets are created, and available to all nodes on the specified OCFS2 filesystem.
• Recovery scenarios completed with no errors.
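An illustrative backup to the OCFS2 mount point used above:

    mkdir -p /u02/ocfs2/backup
    echo "BACKUP DATABASE FORMAT '/u02/ocfs2/backup/%U';" | rman target /
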
Test 6: Create a datapump export on an OCFS2 filesystem
Procedure:
• Using datapump, take an export of the database to an OCFS2 filesystem.
Expected Results:
• A full system export should be created without errors or warnings.
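A sketch, assuming a directory object named OCFS_DP_DIR over an OCFS2 path (all names are assumptions; supply the SYSTEM password when prompted):

    mkdir -p /u02/ocfs2/dpump
    echo "CREATE OR REPLACE DIRECTORY ocfs_dp_dir AS '/u02/ocfs2/dpump';" | sqlplus -S / as sysdba
    expdp system full=y directory=ocfs_dp_dir dumpfile=full.dmp logfile=full.log
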
Test 7: Validate OCFS2 functionality during node failures
Procedure:
• Issue a "shutdown -r now" from a single node in the cluster.
Expected Results:
• The OCFS2 filesystem should remain available to surviving nodes.

Test 8: Validate OCFS2 functionality during disk/disk subsystem path failures

NOTE: Only applicable on multipath storage environments.

Procedure:
• Unplug an external storage cable connection (SCSI, FC or LAN cable) from a node to the disk subsystem.
Expected Results:
• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency.
• No impact to the OCFS2 filesystem.
• Path failover should be visible in the OS logfiles.
Test 9: Perform an FSCK of an OCFS2 filesystem
Procedure:
• Dismount the OCFS2 filesystem to be checked on ALL nodes.
• Execute fsck on the OCFS2 filesystem as follows:
  "/sbin/fsck -v -y -t ocfs2 <device path>"
Expected Results:
• FSCK will check the specified OCFS2 filesystem for errors, automatically answer yes to any prompts (-y) and provide verbose output (-v).
Test 10: Check the OCFS2 cluster status
Procedure:
• Check the OCFS2 cluster status on all nodes by issuing "/etc/init.d/o2cb status".
Expected Results:
• The output of the command will be similar to:

  Module "configfs": Loaded
  Filesystem "configfs": Mounted
  Module "ocfs2_nodemanager": Loaded
  Module "ocfs2_dlm": Loaded
  Module "ocfs2_dlmfs": Loaded
  Filesystem "ocfs2_dlmfs": Mounted
  Checking O2CB cluster ocfs2: Online
  Checking O2CB heartbeat: Active

Appendix II: Windows Specific Tests

Test# Test Procedure Expected Results Actual Results/Notes
Test 1: Create an OCFS filesystem
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS.
• Create the appropriate partition table on the disk and validate that the disk and partition table are visible on ALL nodes (this can be achieved via diskpart).
• Assign a drive letter to the logical drive.
• Create the OCFS filesystem by running:
  cmd> %GI_HOME%\cfs\ocfsformat /m <drive_letter> /c <cluster size> /v <volume name> /f /a
Expected Results:
• The OCFS filesystem will be created.
• The OCFS filesystem will be mounted on all nodes.
Test 2: Create a file on the OCFS filesystem
Procedure:
• Use notepad to create a text file containing the text "TESTING OCFS" on an OCFS drive.
• Use notepad to validate that the file exists on all nodes.
Expected Results:
• The file will exist on all nodes with the specified contents.

Test 3: Verify that the OCFS filesystem is available after a system reboot
Procedure:
• Issue a "reboot".
Expected Results:
• The OCFS filesystem will automatically mount and be accessible to all nodes after a reboot.
Test 4: Enable database archive logs to OCFS
Procedure:
• Modify the database archive log settings to utilize OCFS.
Expected Results:
• Archivelog files are created, and available to all nodes on the specified OCFS filesystem.

Test 5: Create an RMAN backup on an OCFS filesystem
Procedure:
• Back up ASM based datafiles to the OCFS filesystem.
• Execute baseline recovery scenarios (full, point-in-time, datafile).
Expected Results:
• RMAN backupsets are created, and available to all nodes on the specified OCFS filesystem.
• Recovery scenarios completed with no errors.

Test 6: Create a datapump export on an OCFS filesystem
Procedure:
• Using datapump, take an export of the database to an OCFS filesystem.
Expected Results:
• A full system export should be created without errors or warnings.
Test 7: Validate OCFS functionality during node failures
Procedure:
• Issue a "reboot" from a single node in the cluster.
Expected Results:
• The OCFS filesystem should remain available to surviving nodes.
Test 8: Remove a drive letter and ensure that the letter is re-established for that partition
Procedure:
• Using Windows disk management, use the 'Change Drive Letter and Paths ...' option to remove a drive letter associated with an OCFS partition.
Expected Results:
• OracleClusterVolumeService should restore the drive letter assignment within a short period of time.
Test 9: Run the ocfscollect tool
Procedure:
• OCFSCollect is available as an attachment to Note: 332872.1. Download and run the tool on a cluster node.
Expected Results:
• A .zap file is produced (rename it to .zip and extract). It can be used as a baseline regarding the health of the available OCFS drives.