RAC Assurance Team RAC System Test Plan Outline 10gR1, 10gR2 and 11gR1

Version 2.1.2

Purpose
Before a new computer/cluster system is deployed in production it is important to test the system thoroughly to validate that it will perform at a satisfactory level relative to its service level objectives. Testing is also required when introducing major or minor changes to the system. This document provides an outline consisting of basic guidelines and recommendations for testing a new RAC system. This test plan outline can be used as a framework for building a system test plan specific to each company's RAC implementation and its associated service level objectives.

Scope of System Testing
This document provides an outline of basic testing guidelines, in the form of an organized test plan, used to validate core component functionality for RAC environments. Every application exercises the underlying software and hardware infrastructure differently, and must therefore be tested as part of the overall testing strategy. Each new system must be tested thoroughly, in an environment that is a realistic representation of the production environment in terms of configuration, capacity, and workload, prior to going live or after implementing significant architectural/system modifications. Without a completed system implementation and functional end-user applications available, only core component testing is possible, verifying cluster, RDBMS and sub-component behaviors for networking, the I/O subsystem and miscellaneous database administrative functions. In addition to the specific system testing outlined in this document, additional testing needs to be defined and executed for RMAN, backup and recovery, and Data Guard (for disaster recovery). Each component area of testing also requires specific operational procedures to be documented and maintained to address site-specific requirements.

Testing Objectives
In addition to application functionality testing, overall system testing is normally performed for one or more of the following reasons:
• Verify that the system has been installed and configured correctly. Check that nothing is broken. Establish a baseline of functional behavior so that we can answer the question down the road: 'has this ever worked in this environment?'
• Verify that basic functionality still works in a specific environment and for a specific workload. Vendors normally test their products very thoroughly, but it is not possible to test all possible hardware/software combinations and unique workloads.
• Make sure that the system will achieve its objectives, in particular availability and performance objectives. This can be very complex and normally requires some form of simulated production environment and workload.
• Test operational procedures. This includes normal operational procedures and recovery procedures.
• Train operations staff.

Planning System Testing
Effective system testing requires careful planning. The service level objectives for the system itself and for the testing must be clearly understood and a detailed test plan should be documented. The basis for all testing is that the current best practices for RAC system configuration have been implemented before testing.

Testing should be performed in an environment that mirrors the production environment as closely as possible. The software configuration should be identical, but for cost reasons it might be necessary to use a scaled-down hardware configuration. All testing should be performed while running a workload that is as close to production as possible. When planning for system testing it is extremely important to understand how the application has been designed to handle the failures outlined in this plan, and to ensure that the expected results are met at the application level as well as the database level. Oracle technologies that enable fault tolerance of the database at the application level include the following:
• Fast Application Notification (FAN) – Notification mechanism that alerts applications of service level changes of the database.
• Fast Connection Failover (FCF) – Utilizes FAN events to enable database clients to proactively react to down events by quickly failing over connections to surviving database instances.
• Transparent Application Failover (TAF) – Allows connections to be automatically re-established to a surviving database instance if the instance servicing the initial connection fails. TAF can fail over in-flight select statements (if configured), but insert, update and delete transactions will be rolled back.
• Runtime Connection Load Balancing (RCLB) – Provides intelligence about the current service level of the database instances to application connection pools. This increases application performance by using the least loaded servers to service requests, and allows for dynamic workload balancing when an instance loses service or an instance is added.
More information on each of the above technologies can be found in the Oracle Real Application Clusters Administration and Deployment Guide 10g Release 2 or 11g Release 1 (an illustrative TAF connect descriptor is included at the end of this section). Generating a realistic application workload can be complex and expensive, but it is the most important factor for effective testing. For each individual test in the plan, a clear understanding of the following is required:
• What is the objective of the test and how does it relate to the overall system objectives?
• Exactly how will the test be performed and what are the execution steps?
• What are the success/failure criteria, and what are the expected results?
• How will the test result be measured?
• Which tools will be used?
• Which logfiles and other data will be collected?
• Which operational procedures are relevant?
• What are the expected results of the application for each of the defined tests (TAF, FCF, RCLB)?
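The sketch below shows what a TAF-enabled connect descriptor of the kind described above can look like. It is illustrative only: the alias OLTP_TAF, the VIP host names node1-vip/node2-vip and the service name oltp_svc are placeholders, and the RETRIES/DELAY values must be chosen per site.

  OLTP_TAF =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
        (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
        (LOAD_BALANCE = yes)
      )
      (CONNECT_DATA =
        (SERVICE_NAME = oltp_svc)
        (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
      )
    )

With TYPE = SELECT, in-flight queries can resume on a surviving instance after a failure, while in-flight DML is still rolled back, as noted above.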

Notes for Windows Users
Many of the Fault Injection Tests outlined in this document involve abnormal termination of various processes within the Oracle Software stack. On Unix/Linux systems this is easily achieved by using “ps” and “kill” commands. Natively, Windows does not provide the ability to view enough details of running processes to properly identify and kill the processes involved in the Fault Injection Testing. To overcome this limitation a utility called Process Explorer (provided by Microsoft) will be used to identify and kill the necessary processes. Process Explorer can be found on the Windows Sysinternals website within Microsoft Technet (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx). In addition to Process Explorer, a utility called orakill will be used to kill individual threads within the database. More information on orakill can be found under Note: 69882.1.
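As a brief illustration of the technique described above (the SID ORCL1 and thread ID 2144 are hypothetical values; the same v$ query appears in the test procedures later in this document):

  SQL> select b.name, p.spid from v$bgprocess b, v$process p
       where b.paddr = p.addr and b.name = 'PMON';
  cmd> orakill ORCL1 2144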

Production Simulation / System Stress Test
The best way to ensure that the system will perform well without any problems is to simulate production workload and conditions before going live. Ideally the system should be stressed a little more than what is expected in production. In addition to running the normal user and application workload, all normal operational procedures should also be tested at the same time. The output from the normal monitoring procedures should be kept and compared with the real data when going live. Normal maintenance operations such as adding users, adding disk space, reorganizing tables and indexes, backup, archiving data, etc. must also be tested. A commercial or in-house developed workload generator is essential.
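Where no commercial workload generator is available, even a crude driver provides a repeatable baseline. The sketch below is a minimal example only: it assumes a TAF-enabled alias oltp_taf, credentials app_user/app_pw, and a site-provided script load.sql (all placeholders), and simply starts several concurrent SQL*Plus sessions.

  #!/bin/sh
  # Minimal workload driver sketch: 20 concurrent SQL*Plus sessions, each
  # running load.sql against a TAF-enabled service alias.
  for i in $(seq 1 20); do
    sqlplus -s app_user/app_pw@oltp_taf @load.sql &
  done
  wait

  -- load.sql (example contents): insert and commit in a loop against a
  -- test table LOAD_TEST(n number, d date), assumed to exist.
  begin
    for j in 1 .. 100000 loop
      insert into load_test values (j, sysdate);
      if mod(j, 1000) = 0 then commit; end if;
    end loop;
    commit;
  end;
  /
  exit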

Fault Injection Testing
The system configuration and operational procedures must also be tested to make sure that component failures and other problems can be dealt with as efficiently as possible and with minimum impact on system availability. This section provides some examples of tests that can be used as part of a system test plan; the idea is to test the system's robustness against various failures. Depending on the overall architecture and objectives, only some of the tests might be used and/or additional tests might have to be constructed. These tests should be performed with a realistic workload on the system, and introducing multiple failures at the same time should also be considered. Procedures for detecting and recovering from these failures must also be tested.

The result of a test should initially be measured at a business or user level to see if the result is within the service level agreement. If a test fails it will be necessary to gather and analyze the relevant log and trace files; this should also be tested. The analysis can result in system tuning, changing the system architecture, or possibly reporting component problems to the appropriate vendor. Also, if the system objectives turn out to be unrealistic, they might have to be changed. In some worst-case scenarios it might not be possible to recover the system within an acceptable time frame, and a disaster recovery plan should specify how to switch to an alternative system or location. This list only covers testing for RAC-related components and procedures; additional tests are required for other parts of the system.

System Testing Scenarios
For each test in this and the following sections, record the actual results and any notes observed during execution.

Test 1 – Planned Node Reboot
Procedure:
• Start client workload
• Identify the instance with the most client connections
• Reboot the node where the most loaded instance is running
  o For AIX, HPUX, Windows: "shutdown -r"
  o For Linux: "shutdown -r now"
  o For Solaris: "reboot"
Expected Results:
• The instances and other Clusterware resources that were running on that node go offline (no value for the 'HOST' field of crs_stat output)
• The node VIP fails over to one of the surviving nodes
• Services are moved to available instances, if the downed instance is specified as a preferred instance
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration). With TAF configured, select statements should continue; active DML will be aborted.
• After the database reconfiguration, surviving instances continue processing their workload
Measures:
• Time to detect node or instance failure
• Time to complete instance recovery. Check the alert log of the instance performing the recovery
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload)
• Duration of database reconfiguration
• Time before the failed instance is restarted automatically by the Clusterware and is accepting new connections

Test 2 – Unplanned Node Failure of the OCR Master
Procedure:
• Start client workload
• Identify the node that is the OCR master by reviewing $CRS_HOME/log/<node_name>/crsd/crsd.l*
• Power off the node that is the OCR master
NOTE: On many servers the power-off switch will perform a controlled shutdown, so it might be necessary to cut the power supply.
Expected Results:
• Same as Planned Node Reboot
Measures:
• Same as Planned Node Reboot
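One way to locate the OCR master for Test 2 is to search the crsd logs on each node. The message text is release dependent, so the search string below is an assumption to be verified against your own crsd.log:

  # Run on each node; the node whose most recent crsd log claims
  # OCR mastership is the current OCR master (message text varies by release).
  grep -i "ocr master" $CRS_HOME/log/$(hostname)/crsd/crsd.l*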

Test 3 – Restart Failed Node
Procedure:
• Restart the node that was shut down or powered off in the previous test
Expected Results:
• The VIP will migrate back to the restarted node
• Failed resources (ASM, listener, instance, etc.) will be restarted by the Clusterware. Check with "crs_stat -t"
• Services that had failed over as a result of the node failure will NOT automatically be relocated, unless this feature has been disabled
Measures:
• Time for all resources to become available again. Check with "crs_stat -t"

Test 4 – Reboot All Nodes at the Same Time
Procedure:
• Issue a reboot on all nodes at the same time
  o For AIX, HPUX, Windows: "shutdown -r"
  o For Linux: "shutdown -r now"
  o For Solaris: "reboot"
Expected Results:
• All nodes, instances and resources are restarted without problems
Measures:
• Time for all resources to become available again. Check with "crs_stat -t"

Test 5 – Unplanned Instance Failure
Procedure:
• Start client workload
• Identify the single database instance with the most client connections and abnormally terminate that instance:
  o For AIX, HPUX, Linux, Solaris: Obtain the PID for the pmon process of the database instance:
    # ps -ef | grep pmon
    Kill the pmon process:
    # kill -9 <pmon pid>
  o For Windows: Obtain the thread ID of the pmon thread of the database instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread:
    cmd> orakill <SID> <Thread ID>
Expected Results:
• One of the other instances performs instance recovery
• Services are moved to available instances, if a preferred instance failed
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration)
• After a short freeze, surviving instances continue processing the workload
• The failing instance will be restarted by Oracle Clusterware
Measures:
• Time to detect instance failure
• Time to complete instance recovery. Check the alert log of the recovering instance
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload)
• Duration of database freeze during failover
• Time before the failed instance is restarted automatically by Oracle Clusterware and is accepting new connections

Test 6 – Planned Instance Termination
Procedure:
• Issue a 'shutdown abort'
Expected Results:
• One other instance performs instance recovery
• Services are moved to available instances, if a preferred instance failed
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration)
• The instance will NOT be automatically restarted by Oracle Clusterware, due to the user-invoked shutdown
Measures:
• Time to detect instance failure
• Time to complete instance recovery. Check the alert log of the recovering instance
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload)

Test 7 – Restart Failed Instance
Procedure:
• Automatic restart by Oracle Clusterware if it was an uncontrolled failure
• Manual restart is necessary if a "shutdown" command was issued
• Manual restart is also necessary when the "Auto Start" option for the related instance has been disabled
Expected Results:
• The instance rejoins the RAC cluster without any problems (review alert logs etc.)
• Client connections and workload will be load balanced across the new instance (a manual procedure might be required to redistribute the workload for long running / permanent connections)
Measures:
• Time before services and workload are rebalanced across all instances (including any manual steps)

Test 8 – Unplanned ASM Instance Failure
Procedure:
• Start client workload
• Identify a single ASM instance in the cluster and abnormally terminate it:
  o For AIX, HPUX, Linux, Solaris: Obtain the PID for the pmon process of the ASM instance:
    # ps -ef | grep pmon
    Kill the pmon process:
    # kill -9 <pmon pid>
  o For Windows: Obtain the thread ID of the pmon thread of the ASM instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread:
    cmd> orakill <SID> <Thread ID>
Expected Results:
• The ASM resource will go offline (crs_stat -t). By default the resource will be automatically restarted by Oracle Clusterware.
• One other instance performs instance recovery
• Services are moved to available instances, if a preferred instance failed
• Client connections are moved / reconnected to surviving instances (procedure and timings depend on client types and configuration)
• After the database reconfiguration is complete, surviving instances continue processing the workload
Measures:
• Time to detect instance failure
• Time to complete instance recovery. Check the alert log of the recovering instance
• Time to restore client activity to the same level (assuming remaining nodes have sufficient capacity to run the workload)
• Duration of database reconfiguration
• Time before failed resources are restarted and the database instance is accepting new connections

Test 9 – Unplanned Multiple Instance Failure
Procedure:
• Start client workload
• Abnormally terminate 2 different database instances from the same database at the same time:
  o For AIX, HPUX, Linux, Solaris: Obtain the PID for the pmon process of each database instance:
    # ps -ef | grep pmon
    Kill the pmon process:
    # kill -9 <pmon pid>
  o For Windows: Obtain the thread ID of the pmon thread of each database instance by running:
    SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
    Run orakill to kill the thread:
    cmd> orakill <SID> <Thread ID>
Expected Results:
• Same as instance failure
• Both instances should be recovered and restarted without problems
Measures:
• Same as instance failure

Test 10 – Listener Failure
Procedure:
• For AIX, HPUX, Linux and Solaris: Obtain the PID for the listener process:
  # ps -ef | grep tnslsnr
  Kill the listener process:
  # kill -9 <listener pid>
• For Windows: Use Process Explorer to identify the tnslistener.exe process for the database listener. This will be the tnslistener.exe registered to the "<home name>TNSListener" service. Once the proper tnslistener.exe is identified, kill the process by right clicking the executable and choosing "Kill Process".
Expected Results:
• No impact on connected database sessions
• The listener failure is detected by the CRSD and the listener is automatically restarted. Review the following log:
  o $CRS_HOME/log/<nodename>/crsd/crsd.log
• New connections are redirected to the listener on another node (depends on client configuration)
• The local database instance will receive new connections if shared server is used; the local database instance will NOT receive new connections if dedicated server is used
Measures:
• Time for the Clusterware to detect the failure and restart the listener (a verification example follows Test 11 below)

Test 11 – Public Network Failure
Procedure:
• Unplug all network cables for the public network
NOTE: Configurations using NIS must also have nscd (the name service cache daemon) implemented for this test to succeed with the expected results.
Expected Results:
• The listener will go offline. Check with "crs_stat -t"
• The VIP for the node will fail over to a surviving node
• Database services will fail over to one of the other available nodes
• The database instance will remain up but will be unregistered with the remote listeners
• If TAF is configured, clients should fail over to an available instance
Measures:
• Time to detect the network failure and relocate resources
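After killing the listener in Test 10, its restart by the Clusterware can be confirmed with standard commands such as the following (the exact resource name shown by crs_stat is site specific):

  # Confirm the listener process is back and its resource is ONLINE
  ps -ef | grep tnslsnr
  crs_stat -t | grep -i lsnr
  lsnrctl status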

Test 12 – Public NIC Failure
Procedure:
• Assuming dual NICs are configured for the public interface for redundancy (e.g. bonding, teaming, etc.), unplug the network cable from one of the NICs.
NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.
Expected Results:
• Network traffic should fail over to the other NIC without impacting any of the cluster resources
Measures:
• Time to fail over to the other NIC card. With bonding/teaming configured this should be less than 100ms.

Test 13 – Interconnect Network Failure
Procedure:
• Unplug all network cables for the interconnect network
NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.
Expected Results:
• CSSD will detect the split-brain situation and perform one of the following:
  o In a two-node cluster, the node with the lowest node number will survive
  o In a multiple node cluster, the largest sub-cluster will survive
• Review the following logs:
  o $CRS_HOME/log/<nodename>/cssd/cssd.log
  o $CRS_HOME/log/<nodename>/alert<nodename>.log
Measures:
• Time to detect split-brain and start eviction
• See measures for node failure

Test 14 – Interconnect NIC Failure
Procedure:
• Assuming dual NICs are configured for the private interface for redundancy (e.g. bonding, teaming, etc.), unplug the network cable from one of the NICs.
NOTE: It is recommended NOT to use ifconfig to down the interface; this may lead to the address still being plumbed to the interface, resulting in unexpected results.
Expected Results:
• Network traffic should fail over to the other NIC without impacting any of the cluster resources
Measures:
• Time to fail over to the other NIC card. With bonding/teaming configured this should be less than 100ms.

Test 15 – Interconnect Switch Failure (Redundant Switch Configuration)
Procedure:
• In a redundant network switch configuration, power off one switch
Expected Results:
• Network traffic should fail over to the other switch without any impact on interconnect traffic or instances
Measures:
• Time to fail over to the other NIC card. With bonding/teaming configured this should be less than 100ms.

Test 16 – Node Loses Access to Disks with the CSS Voting Device
Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from one node to the disks containing the CSS Voting Device(s)
Expected Results:
• CSS will detect this and evict the node from the cluster. Review the following logs:
  o $CRS_HOME/log/<nodename>/cssd/cssd.log
  o $CRS_HOME/log/<nodename>/alert<nodename>.log
Measures:
• See measures for node failure

Test 17 – Node Loses Access to Disks with the OCR Device(s)
Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from one node to the disks containing the OCR Device(s)
Expected Results:
• CRSD will detect the failure of the OCR device and abort. Review the following logs:
  o $CRS_HOME/log/<nodename>/crsd/crsd.log
  o $CRS_HOME/log/<nodename>/alert<nodename>.log
• The database instance, ASM instance and listeners will not be impacted
Measures:
• Monitor database status under load to ensure no service interruption occurs

Test 18 – Node Loses Access to a Single Path of the Disk Subsystem (OCR, Voting Device, Database files)
Procedure:
• Unplug one external storage cable connection (SCSI, FC or LAN cable) from the node to the disk subsystem
Expected Results:
• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency
• No impact to database instances
• Path failover should be visible in the OS logfiles
Measures:
• Monitor database status under load to ensure no service interruption occurs

Test 19 – ASM Disk Lost
Procedure:
• Assuming ASM normal redundancy, power off / pull out / offline (depending on configuration) one ASM disk
Expected Results:
• No impact on database instances
• ASM starts rebalancing (view the ASM alert logs)
Measures:
• Monitor progress: select * from v$asm_operation

Test 20 – ASM Disk Repaired
Procedure:
• Power on / insert / online the ASM disk
Expected Results:
• No impact on database instances
• ASM starts rebalancing (view the ASM alert logs)
Measures:
• Monitor progress: select * from v$asm_operation

Test 21 – One Multiplexed Voting Device is Inaccessible
Procedure:
• Remove access to one of the multiplexed voting disks from all nodes
Expected Results:
• The cluster will remain available
• Voting disks can be queried using "crsctl query css votedisk"
• The voting disk will be automatically brought online when access is restored
• Review the following logs:
  o $CRS_HOME/log/<nodename>/cssd/cssd.log
  o $CRS_HOME/log/<nodename>/alert<nodename>.log
Measures:
• No impact on the cluster

Test 22 – Lose and Recover One Copy of the OCR
NOTE: This test assumes that the OCR is mirrored to 2 devices.
Procedure:
1. Remove access to one copy of the OCR; ocrcheck will report the OCR to be out of sync.
2. Replace the disk or remount the diskgroup.
3. Delete the corrupt OCR copy (ocrconfig -delete) and re-add the OCR copy (ocrconfig -add). This avoids having to stop CRSD.
Expected Results:
• There is no impact on the cluster operation
• The OCR can be replaced online, without a cluster outage
• The loss of access and the restoration of the missing/corrupt OCR will be reported in:
  o $CRS_HOME/log/<nodename>/crsd/crsd.log
  o $CRS_HOME/log/<nodename>/alert<nodename>.log
(A status-check example follows Test 23 below.)

Test 23 – Add a Node to the Cluster and Extend the Database (if admin managed) to That Node
Procedure:
• Follow the procedures in the Oracle Clusterware Administration and Deployment Guide for your release to extend the Clusterware to the new node.
• After extending the Clusterware, follow the procedures in the Oracle Real Application Clusters Administration and Deployment Guide for your release to extend the RDBMS binaries, ASM binaries and database to the new node.
Expected Results:
• There will be no impact on the cluster operation
• The node is dynamically added to the cluster
• If the database is policy managed and there is free space in the server pool for the new node, the database will be extended to the new node automatically (OMF should be enabled so no user intervention is required)
• If the database is policy managed, an instance for the database will automatically be created on the new node
• The new database instance will begin servicing connections
Measures:
• The new node is successfully added to the cluster
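When executing Tests 21 and 22 it is useful to capture the OCR and voting disk status before and after the fault is introduced. Both commands below are standard, read-only checks:

  # OCR integrity and configured OCR locations
  ocrcheck
  # Configured voting disks (as referenced in Test 21)
  crsctl query css votedisk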

Test 24 – Remove a Node from the Cluster
Procedure:
• Follow the procedures in the Oracle Real Application Clusters Administration and Deployment Guide for your release to delete the node from the cluster.
• After successfully removing the RDBMS installation, follow the procedures in the Oracle Clusterware Administration and Deployment Guide for your release to remove the node from the cluster.
Expected Results:
• The connections to the database instance being removed will fail over to the remaining instances (if configured)
• The node will be dynamically removed from the cluster
Measures:
• The node is successfully removed from the cluster

System Testing Scenarios: Clusterware Process Failures
NOTE: This section of the system testing scenarios demonstrates failures of various Oracle Clusterware processes. These process failures are NOT within the realm of typical failures within a RAC system, and killing these processes under normal operation is highly discouraged by Oracle Support. This section is to be used to provide a better understanding of the Clusterware processes, the functionality of these processes, and the logging performed by each of these processes.

Test 1 – CRSD Process Failure
Procedure:
• For AIX, HPUX, Linux and Solaris: Obtain the PID for the CRSD process:
  # ps -ef | grep crsd
  Kill the CRSD process:
  # kill -9 <crsd pid>
• For Windows: Use Process Explorer to identify the crsd.exe process; there will be two, and the crsd.exe that we will need to kill is the one with the higher memory footprint. Once the crsd.exe process is identified, kill the process by right clicking the executable and choosing "Kill Process".
Expected Results:
• The CRSD process failure is detected by init and CRSD is restarted. Review the following log:
  o $CRS_HOME/log/<nodename>/crsd/crsd.log
Measures:
• Time to restart the CRSD process

Test 2 – EVMD Process Failure
Procedure:
• For AIX, HPUX, Linux and Solaris: Obtain the PID for the EVMD process:
  # ps -ef | grep evmd
  Kill the EVMD process:
  # kill -9 <evmd pid>
• For Windows: Use Process Explorer to identify the evmd.exe process. Once the evmd.exe process is identified, kill the process by right clicking the executable and choosing "Kill Process".
Expected Results:
• The EVMD process failure is detected by init and EVMD is restarted. Review the following log:
  o $CRS_HOME/log/<nodename>/evmd/evmd.log
Measures:
• Time to restart the EVMD process

Test 3 – CSSD Process Failure
Procedure:
• For AIX, HPUX, Linux and Solaris: Obtain the PID for the CSSD process:
  # ps -ef | grep cssd
  Kill the CSSD process:
  # kill -9 <cssd pid>
• For Windows: Use Process Explorer to identify the ocssd.exe process; there will be two, and the ocssd.exe that we will need to kill is the one with the higher memory footprint. Once the ocssd.exe process is identified, kill the process by right clicking the executable and choosing "Kill Process".
Expected Results:
• The node will reboot
• Cluster reconfiguration will take place
Measures:
• Time for the eviction and cluster reconfiguration on the surviving nodes
• Time for the node to come back online and for the reconfiguration to complete to add the node as an active member of the cluster

Component Functionality Testing
Normally it should not be necessary to perform additional functionality testing for each individual system component. However, for some new components in new environments it might be useful to perform additional testing to make sure that they are configured properly. This testing will also help system and database administrators become familiar with new technology components.

Cluster Infrastructure
To simplify testing and problem diagnosis it is often very useful to do some basic testing on the cluster infrastructure without Oracle software or a workload running. Normally this testing will be performed after installing the hardware and operating system, but before installing any Oracle software. If problems are encountered during the System Stress Test or Destructive Testing, diagnosis and analysis can be facilitated by testing the cluster infrastructure separately. Obviously, without Oracle software or a workload running, such testing is limited to infrastructure-level behavior. Typically some of these destructive tests will be used:
• Node Failure
• Restart Failed Node
• Reboot all nodes at the same time
• Public NIC Failure
• Interconnect NIC Failure
• Lost disk access
• HBA failover, assuming multiple HBAs with failover capability
• Disk controller failover, assuming multiple disk controllers with failover capability
• NAS (NetApp) storage failure – in case of a complete mirror failure, measure the time needed for the storage reconfiguration to complete. Check the same when going into maintenance mode.
If using non-Oracle cluster software:
• Interconnect Network Failure
• Lost access to the cluster voting/quorum disk

ASM Test and Validation
This test and validation plan is intended to give the customer or engineer a procedural approach to:
• Validating the installation of RAC-ASM
• Functional and operational validation of ASM
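A minimal sketch of such infrastructure-level checking, run before any Oracle software is installed, is shown below; the private host names (node1-priv, node2-priv, node3-priv) are placeholders for the site's interconnect addresses:

  #!/bin/sh
  # Basic interconnect sanity check: ping every private address and report
  # any loss. Repeat while pulling a cable or powering off a switch to
  # observe failover behavior and timing.
  for host in node1-priv node2-priv node3-priv; do
    ping -c 5 $host > /dev/null 2>&1 \
      && echo "$host reachable" \
      || echo "$host NOT reachable"
  done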

Component Testing: ASM Functional Tests

Test 1 – Verify that candidate disks are available
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by ASM.
• Login to ASM via SQL*Plus and run: "select name, path, mode_status, state, group_number, header_status, label from v$asm_disk"
Expected Results/Measures:
• The newly added LUN will appear as a candidate disk within ASM.

Test 2 – Create an external redundancy ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "create diskgroup <dg name> external redundancy disk '<candidate path>';"
Expected Results/Measures:
• A successfully created diskgroup. The diskgroup should also be listed in v$asm_diskgroup.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t)

Test 3 – Create a normal or high redundancy ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "create diskgroup <dg name> normal redundancy disk '<candidate1 path>', '<candidate2 path>';"
Expected Results/Measures:
• A successfully created diskgroup with normal redundancy and two failure groups. For high redundancy, three failure groups will be created.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t)

Test 4 – Add a disk to an ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> add disk '<candidate1 path>';"
NOTE: Progress can be monitored by querying v$asm_operation.
Expected Results/Measures:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup.

Test 5 – Drop an ASM disk from a diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> drop disk <disk name>;"
NOTE: Progress can be monitored by querying v$asm_operation.
Expected Results/Measures:
• The data from the removed disk will be rebalanced across the remaining disks in the diskgroup. Once the rebalance is complete the disk will have a header_status of "FORMER" (v$asm_disk) and will be a candidate to be added to another diskgroup.

Test 6 – Undrop an ASM disk that is currently being dropped using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> drop disk <disk name>;"
• Before the rebalance completes, run the following command via SQL*Plus: "alter diskgroup <dg name> undrop disks;"
NOTE: Progress can be monitored by querying v$asm_operation.
Expected Results/Measures:
• The undrop operation will roll back the pending drop operation (assuming it has not completed). The disk entry will remain in v$asm_disk as a MEMBER.

Test 7 – Drop an ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "drop diskgroup <dg name>;"
Expected Results/Measures:
• The diskgroup will be successfully dropped. The diskgroup will be unregistered as a Clusterware resource (crsctl stat res -t)

Test 8 – Modify the rebalance power of an active operation using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> add disk '<candidate1 path>';"
• Before the rebalance completes, run the following command via SQL*Plus: "alter diskgroup <dg name> rebalance power <1 - 11>;"
Expected Results/Measures:
• The rebalance power of the current operation will be increased to the specified value. This is visible in the v$asm_operation view. 1 is the default rebalance power.

Test 9 – Verify CSS-database communication and ASM file access
Procedure:
• Start all the database instances and query the v$asm_client view in the ASM instances.
Expected Results/Measures:
• Each database instance should be listed in the v$asm_client view.

Test 10 – Check the internal consistency of diskgroup metadata using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <name> check all"
Expected Results/Measures:
• If there are no internal inconsistencies, the statement "Diskgroup altered" will be returned (asmcmd will return to the asmcmd prompt). If inconsistencies are discovered, appropriate messages are displayed describing the problem.

Component Testing: ASM Objects Functional Tests

Test 1 – Create an ASM template
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> add template unreliable attributes(unprotected fine);"
Expected Results/Measures:
• The ASM template will be successfully created and visible within the v$asm_template view.

Test 2 – Apply an ASM template
Procedure:
• Use the template above and apply it to a new tablespace to be created on the database. Login to the database via SQL*Plus and run: "create tablespace test datafile '+<dg name>/my_files(unreliable)' size 10M;"
Expected Results/Measures:
• The datafile is created using the attributes of the ASM template.

Test 3 – Drop an ASM template
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> drop template unreliable;"
Expected Results/Measures:
• The template should be removed from v$asm_template.

Test 4 – Create an ASM directory
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup <dg name> add directory '+<dg name>/my_files';"
Expected Results/Measures:
• The created directory will have an entry in v$asm_directory.
• The asmcmd tool can be used to check that the new directory was created in the desired diskgroup.

Test 5 – Create an ASM alias
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup DATA add alias '+DATA/my_files/datafile_alias' for '+<dg name>/<db name>/DATAFILE/<file name>';"
Expected Results/Measures:
• Verify that the alias exists in v$asm_alias.

Test 6 – Drop an ASM alias
Procedure:
• Login to ASM via SQL*Plus and run: "alter diskgroup DATA drop alias '+<dg name>/my_files/datafile_alias';"
Expected Results/Measures:
• Verify that the alias does not exist in v$asm_alias.

Test 7 – Drop an active database file within ASM
Procedure:
• Identify a data file from a running database.
• Login to ASM via SQL*Plus and run: "alter diskgroup data drop file '+<dg name>/<db name>/DATAFILE/<file name>';"
Expected Results/Measures:
• This will fail with a message similar to the following:
  ERROR at line 1:
  ORA-15032: not all alterations performed
  ORA-15028: ASM file '+DATA/V102/DATAFILE/TEST.269.654602409' not dropped; currently being accessed

Test 8 – Drop an inactive database file within ASM
Procedure:
• Identify a datafile that is no longer used by a database.
• Login to ASM via SQL*Plus and run: "alter diskgroup data drop file '+<dg name>/<db name>/DATAFILE/<file name>';"
Expected Results/Measures:
• Observe that the file number in v$asm_file is now removed.

Component Testing: ASM Tools & Utilities

Test 1 – Run dbverify on the database files
Procedure:
• Specify each file individually using the dbv utility:
  dbv userid=<user>/<password> file='<ASM filename>' blocksize=<blocksize>
Expected Results/Measures:
• The output should be similar to the following, with no errors present:
  DBVERIFY - Verification complete
  Total Pages Examined : 640
  Total Pages Processed (Data) : 45
  Total Pages Failing (Data) : 0
  Total Pages Processed (Index): 2
  Total Pages Failing (Index): 0
  Total Pages Processed (Other): 31
  Total Pages Processed (Seg) : 0
  Total Pages Failing (Seg) : 0
  Total Pages Empty : 562
  Total Pages Marked Corrupt : 0
  Total Pages Influx : 0
  Highest block SCN : 0 (0.0)

Test 2 – Use dbms_file_transfer to copy files from ASM to a filesystem
Procedure:
• Use the dbms_file_transfer put_file and get_file functions to copy database files (datafiles, archive logs, etc.) into and out of ASM.
NOTE: This requires that database directory objects be pre-created and available for the source and destination locations. See the PL/SQL packages documentation for dbms_file_transfer details.
Expected Results/Measures:
• The put_file and get_file functions copy files successfully to/from the filesystem. This provides an alternate option for migrating to ASM, or simply for copying files out of ASM.
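A minimal local-copy sketch for Test 2 is shown below, using dbms_file_transfer.copy_file (put_file and get_file follow the same pattern but additionally take a database link argument). The directory paths and file names are placeholders:

  SQL> create or replace directory asm_src as '+DATA/MYDB/DATAFILE';
  SQL> create or replace directory fs_dest as '/u01/app/oracle/stage';
  SQL> begin
         dbms_file_transfer.copy_file(
           source_directory_object      => 'ASM_SRC',
           source_file_name             => 'users.283.654602409',
           destination_directory_object => 'FS_DEST',
           destination_file_name        => 'users01.dbf');
       end;
       /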

Component Testing: Miscellaneous Tests

Test 1 – Diagnostics Procedure for Hang/Slowdown
Procedure:
• Start client workload
• Execute automatic and manual procedures to collect database, Clusterware and operating system diagnostics (hanganalyze, racdiag.sql)
Expected Results:
• Diagnostics collection procedures complete normally.
Measures:
• Time to run the diagnostics procedures. Is it acceptable to wait for this time before restarting instances or nodes in a production situation?

Appendix I: Linux Specific Tests

Test 1 – Create an OCFS2 filesystem
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS2.
• Create the appropriate partition table on the disk and use "partprobe" to rescan the partition tables.
• Create the OCFS2 filesystem by running: "/sbin/mkfs -t ocfs2 <device path>"
• Add the filesystem to /etc/fstab on all nodes
• Mount the filesystem on all nodes
Expected Results/Measures:
• The OCFS2 filesystem will be created.
• The OCFS2 filesystem will be mounted on all nodes.
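An illustrative /etc/fstab entry for the mount step in Test 1 is shown below. The device and mount point are examples only; datavolume and nointr are the options required later for database files (see Tests 4 and 5), and _netdev delays the mount until the network and O2CB stack are up:

  /dev/sdb1   /u02/ocfs2   ocfs2   _netdev,datavolume,nointr   0 0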

Test 2 – Create a file on the OCFS2 filesystem
Procedure:
• Perform the following: echo "Testing OCFS2" > <mount point>/testfile
• Perform a "cat" of the file on all nodes in the cluster.
Expected Results/Measures:
• The file will exist on all nodes with the specified contents.

Test 3 – Verify that the OCFS2 filesystem is available after a system reboot
Procedure:
• Issue a "shutdown -r now"
Expected Results/Measures:
• The OCFS2 filesystem will automatically mount and be accessible to all nodes after a reboot.

Test 4 – Enable database archive logs to OCFS2
NOTE: If using the OCFS2 filesystem for database files it must be mounted with the following options: rw,datavolume,nointr
Procedure:
• Modify the database archive log settings to utilize OCFS2
Expected Results/Measures:
• Archivelog files are created, and are available to all nodes on the specified OCFS2 filesystem.

Test 5 – Create an RMAN backup on an OCFS2 filesystem
NOTE: If using the OCFS2 filesystem for database files it must be mounted with the following options: rw,datavolume,nointr
Procedure:
• Back up ASM based datafiles to the OCFS2 filesystem.
• Execute baseline recovery scenarios (full, point-in-time, datafile).
Expected Results/Measures:
• RMAN backupsets are created, and are available to all nodes on the specified OCFS2 filesystem.
• Recovery scenarios complete with no errors.

Test 6 – Create a datapump export on an OCFS2 filesystem
Procedure:
• Using datapump, take an export of the database to an OCFS2 filesystem.
Expected Results/Measures:
• A full system export should be created without errors or warnings.

Test 7 – Validate OCFS2 functionality during node failures
Procedure:
• Issue a "shutdown -r now" from a single node in the cluster
Expected Results/Measures:
• The OCFS2 filesystem should remain available to the surviving nodes.
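Hedged examples of the commands behind Tests 4 through 6 are shown below; the mount point /u02/ocfs2, the directory object and the credentials are placeholders:

  SQL>  alter system set log_archive_dest_1='LOCATION=/u02/ocfs2/arch' scope=both sid='*';
  RMAN> backup database format '/u02/ocfs2/backup/%U';
  SQL>  create directory exp_dir as '/u02/ocfs2/export';
  $     expdp system/<password> full=y directory=exp_dir dumpfile=full.dmp logfile=full.log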

Test 8 – Validate OCFS2 functionality during disk/disk subsystem path failures
NOTE: Only applicable on multipath storage environments.
Procedure:
• Unplug an external storage cable connection (SCSI, FC or LAN cable) from the node to the disk subsystem.
Expected Results/Measures:
• If multi-pathing is enabled, the multipathing configuration should provide failure transparency.
• No impact to the OCFS2 filesystem.
• Path failover should be visible in the OS logfiles.

Test 9 – Perform an FSCK of an OCFS2 filesystem
Procedure:
• Dismount the OCFS2 filesystem to be checked on ALL nodes.
• Execute fsck on the OCFS2 filesystem as follows: "/sbin/fsck -v -y -t ocfs2 <device path>". This command will automatically answer yes to any prompts (-y) and provide verbose output (-v).
Expected Results/Measures:
• FSCK will check the specified OCFS2 filesystem for errors.

Test 10 – Check the OCFS2 cluster status
Procedure:
• Check the OCFS2 cluster status on all nodes by issuing "/etc/init.d/o2cb status".
Expected Results/Measures:
• The output of the command will be similar to:
  Module "configfs": Loaded
  Filesystem "configfs": Mounted
  Module "ocfs2_nodemanager": Loaded
  Module "ocfs2_dlm": Loaded
  Module "ocfs2_dlmfs": Loaded
  Filesystem "ocfs2_dlmfs": Mounted
  Checking O2CB cluster ocfs2: Online
  Checking O2CB heartbeat: Active

Appendix II: Windows Specific Tests

Test 1 – Create an OCFS filesystem
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS.
• Create the appropriate partition table on the disk and validate that the disk and partition table are visible on ALL nodes (this can be achieved via diskpart).
• Assign a drive letter to the logical drive.
• Create the OCFS filesystem by running: cmd> %CRS_HOME%\cfs\ocfsformat /m <drive_letter> /c <cluster size> /v <volume name> /f /a
Expected Results:
• The OCFS filesystem will be created.
• The OCFS filesystem will be mounted on all nodes.

Test 2 – Create a file on the OCFS filesystem
Procedure:
• Use notepad to create a text file containing the text "TESTING OCFS" on an OCFS drive.
• Use notepad to validate that the file exists on all nodes.
Expected Results:
• The file will exist on all nodes with the specified contents.

Test 3 – Verify that the OCFS filesystem is available after a system reboot
Procedure:
• Issue a "reboot"
Expected Results:
• The OCFS filesystem will automatically mount and be accessible to all nodes after a reboot.

Test 4 – Enable database archive logs to OCFS
Procedure:
• Modify the database archive log settings to utilize OCFS
Expected Results:
• Archivelog files are created, and are available to all nodes on the specified OCFS filesystem.

Test 5 – Create an RMAN backup on an OCFS filesystem
Procedure:
• Back up ASM based datafiles to the OCFS filesystem.
• Execute baseline recovery scenarios (full, point-in-time, datafile).
Expected Results:
• RMAN backupsets are created, and are available to all nodes on the specified OCFS filesystem.
• Recovery scenarios complete with no errors.

Test 6 – Create a datapump export on an OCFS filesystem
Procedure:
• Using datapump, take an export of the database to an OCFS filesystem.
Expected Results:
• A full system export should be created without errors or warnings.

Test 7 – Validate OCFS functionality during node failures
Procedure:
• Issue a "reboot" from a single node in the cluster.
Expected Results:
• The OCFS filesystem should remain available to the surviving nodes.

Test 8 – Remove a drive letter and ensure that the letter is re-established for that partition
Procedure:
• Using Windows disk management, use the 'Change Drive Letter and Paths …' option to remove a drive letter associated with an OCFS partition.
Expected Results:
• OracleClusterVolumeService should restore the drive letter assignment within a short period of time.

Test 9 – Run the ocfscollect tool
Procedure:
• Run the ocfscollect tool. OCFSCollect is available as an attachment to Note: 332872.1.
Expected Results:
• A .zap file is produced (rename it to .zip and extract). This can be used as a baseline regarding the health of the available OCFS drives.
