
Oracle Real Application Clusters (RAC)
When you have critical systems that must be online 24x7, you need a High Availability (HA) solution.
In large corporations you will sometimes hear the phrase "five nines". The phrase refers to the availability of
a system and the (approximate) downtime that is allowed. The table below shows the downtime allowed at each
level of uptime, up to five nines.
% uptime               % downtime   Downtime per year      Downtime per week
98                     2            7.3 days               3 hours 22 minutes
99                     1            3.65 days              1 hour 41 minutes
99.8                   0.2          17 hours 30 minutes    20 minutes
99.9                   0.1          8 hours 45 minutes     10 minutes
99.99                  0.01         52.5 minutes           1 minute
99.999 (five nines)    0.001        5.25 minutes           6 seconds

To achieve five nines your system is only allowed 5.25 minutes of downtime per year, or 6 seconds per week;
in some HA designs it may take 6 seconds just to fail over.
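The downtime figures above follow directly from the uptime percentage. The short Python sketch below shows the arithmetic; the function name allowed_downtime and the 365-day year are assumptions made for illustration, not part of the original notes.

```python
# Downtime budget for a given availability target.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: leap years are ignored


def allowed_downtime(uptime_percent, period_seconds):
    """Return the downtime budget, in seconds, for one period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_seconds * (100 - uptime_percent) / 100


# Print the budget for each availability level listed in the table.
for uptime in (98, 99, 99.8, 99.9, 99.99, 99.999):
    per_year = allowed_downtime(uptime, SECONDS_PER_YEAR)
    per_week = allowed_downtime(uptime, SECONDS_PER_WEEK)
    print(f"{uptime}% uptime: {per_year / 60:.2f} min/year, {per_week:.1f} s/week")
```

For five nines this gives roughly 315 seconds per year and about 6 seconds per week, which agrees with the table within rounding.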
When looking for a solution you should build redundancy into your plan; this is the first
step towards an HA solution.
You are trying to eliminate as many single points of failure (SPOFs) as you can while keeping the
costs under control.
HA comes in three or four flavors:

No failover

This option usually relies on redundancy that is already built in: failed disks and PSUs can be replaced
online, but if a major hardware component fails a system outage is unavoidable, and the system will
remain down until it is fixed.
This solution can be perfectly acceptable in some environments, but at what price to the
business? Even in today's market QA/DEV systems cost money when they are not running; I am sure
that your developers are quite happy to take a paid day off while you fix the system.

Cluster

This is the jewel of the HA world. A cluster can be configured in a variety of flavors, from minimal
downtime while services are moved to a good node, to virtually zero downtime.
However, a cluster solution comes with a heavy price tag: the hardware, configuration and
maintenance of a cluster are expensive, but if your business loses vast amounts of money when your
system is down, then it is worth it.

Cold failover

Many smaller companies use this solution. Basically you have an additional server ready to take
over from any one of a number of servers if one of them fails. I have used this technique myself: I created a
number of scripts that can turn a cold standby server into any one of a number of servers if the original
server is going to have a prolonged outage.
The problem with this solution is that there is going to be downtime, especially if it takes a long time
to get the standby server up to the same point in time as the failed server. The advantage of this
solution is that one additional server can cover a number of servers, even if it is slightly
underpowered compared to the original server, as long as it keeps the service running.

Hot failover

Many applications offer hot-standby servers. These servers run alongside the live
system, and data is applied to the hot-standby server periodically to keep it up to date, so in a
failover situation the server is almost ready to go.
The problems with this solution are cost and manageability. Also, one server is usually dedicated to
one application, so you may need many hot-standby servers.
The advantage is that downtime is kept to a minimum, but there will be some downtime,
generally the time it takes to get the hot-standby server up to date, for example applying the last
set of logs to a database.

Here is a summary table that shows the most common aspects of cold failover versus hot failover

Scalability/number of nodes
  Cold Failover: Scalability is limited to the capacity of a single node
  Hot Failover:  As nodes can be added on demand, it provides infinite scalability. High number of
                 nodes supported.

User interruption
  Cold Failover: Required to a minimal extent. The failover operation can be scripted or
                 automated to a certain extent
  Hot Failover:  Not required, failover is automatic

Transparent failover of applications
  Cold Failover: Not possible
  Hot Failover:  Transparent application failover is available, where sessions can be transferred
                 to another node without user interruption

Load Balancing
  Cold Failover: Not possible, only one server will be used
  Hot Failover:  Incoming load can be balanced between both nodes

Usage of resources
  Cold Failover: Only one server at a time, the other server will be kept idle
  Hot Failover:  Both the servers will be used

Failover time
  Cold Failover: More than minutes, as the other system must be cold started
  Hot Failover:  Less than a minute, typically a few seconds

Clustering
A cluster is a group of two or more interconnected nodes that provide a service. The cluster provides a
high level of fault tolerance: if a node becomes unavailable within the cluster, its services are
moved/restored to another working node, so the end user should never know that a fault occurred.
It is very scalable because nodes can be added or taken away (a node may need to be
patched) without interrupting the service. There are three types of clustering architecture
Shared nothing

Each node within the cluster is independent; the nodes share nothing. An example would be
web servers: you have a number of nodes within the cluster supplying the same web
service. The content is static, so there is no need to share disks, etc.

Shared disk only

Each node is attached to, or has access to, the same set of disks. These disks contain
the data that is required by the service. One node controls the application and the disk,
and in the event that the node fails, the other node takes control of both the application and
the data. This means that one node has to be on standby, sitting idle, waiting to take
over if required to do so.
A typical traditional Veritas Cluster or Sun Cluster would fit the bill here.

Shared everything

Again all nodes are attached to, or have access to, the same set of disks, but this time each
node can read/write to the disks concurrently. Normally there is a piece of software that
controls the reading and writing to the disks, ensuring data integrity. To achieve this a cluster-wide
file system is introduced, so that all nodes view the filesystem identically; the software
then coordinates the sharing and updating of files, records and databases.
Oracle RAC and IBM HACMP would be good examples of this type of cluster

Oracle RAC History
Oracle Real Application Clusters (RAC) was introduced in Oracle 9i. Before Oracle RAC there was Oracle
Parallel Server (OPS).
Oracle RAC addresses the limitations of OPS by extending Cache Fusion and dynamic lock
mastering. Oracle 10g RAC also comes with its own integrated clusterware and storage management
framework, removing all dependencies on third-party clusterware products.
RAC Architecture
Oracle Real Application Clusters allows multiple instances to access a single database; the instances
run on multiple nodes. In a standard Oracle configuration a database can only be mounted by one
instance, but in a RAC environment many instances can access a single database.

Oracle RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the
interconnect. When designing a RAC system, make sure that you get the best interconnect you can afford.
The table below describes the differences between a standard Oracle database (single instance) and a RAC
environment
Component: SGA
  Single Instance Environment: Instance has its own SGA
  RAC Environment:             Each instance has its own SGA

Component: Background processes
  Single Instance Environment: Instance has its own set of background processes
  RAC Environment:             Each instance has its own set of background processes

Component: Datafiles
  Single Instance Environment: Accessed by only one instance
  RAC Environment:             Shared by all instances (shared storage)

Component: Control Files
  Single Instance Environment: Accessed by only one instance
  RAC Environment:             Shared by all instances (shared storage)

Component: Online Redo Logfile
  Single Instance Environment: Dedicated for write/read to only one instance
  RAC Environment:             Only one instance can write but other instances can read during
                               recovery and archiving. If an instance is shutdown, log switches by
                               other instances can force the idle instance redo logs to be archived

Component: Archived Redo Logfile
  Single Instance Environment: Dedicated to the instance
  RAC Environment:             Private to the instance but other instances will need access to all
                               required archive logs during media recovery

Component: Flash Recovery Log
  Single Instance Environment: Accessed by only one instance
  RAC Environment:             Shared by all instances (shared storage)

Component: Alert Log and Trace Files
  Single Instance Environment: Dedicated to the instance
  RAC Environment:             Private to each instance, other instances never read or write to
                               those files.

Component: ORACLE_HOME
  Single Instance Environment: Multiple instances on the same server accessing different databases
                               can use the same executable files
  RAC Environment:             Same as single instance plus can be placed on shared file system
                               allowing a common ORACLE_HOME for all instances in a RAC environment.

RAC Components
The major components of an Oracle RAC system are:

Shared disk system
Oracle Clusterware

Cluster Interconnects

Oracle Kernel Components

The below diagram describes the basic architecture of the Oracle RAC environment

Disk architecture
With today's SAN and NAS disk storage systems, sharing storage is fairly easy and is required for a RAC
environment. You can use the storage setups below

  SAN (Storage Area Networks) - generally using fibre to connect to the SAN
  NAS (Network Attached Storage) - generally using a network to connect to the NAS using either NFS or iSCSI
  JBOD - direct attached storage, the old traditional way and still used by many companies as a cheap option

All of the above solutions can offer multi-pathing to reduce SPOFs within the RAC environment. There is no
reason not to configure multi-pathing, as adding additional paths to the disk is cheap: most of the expense
is paid when configuring the first path, so an additional controller card and network/fibre cables are all
that is needed.

The last thing to think about is how to set up the underlying disk structure; this is known as the RAID
level. RAID stands for Redundant Array of Inexpensive Disks. There are about 12 different RAID levels that I
know of, here are the most common ones

raid 0 (Striping)
  A number of disks are concatenated together to give the appearance of one very large disk.
  Advantages:    Improved performance
                 Can create very large volumes
  Disadvantages: Not highly available (if one disk fails, the volume fails)

raid 1 (Mirroring)
  A single disk is mirrored by another disk; if one disk fails the system is unaffected as it can use its
  mirror.
  Advantages:    Improved performance
                 Highly available (if one disk fails the mirror takes over)
  Disadvantages: Expensive (requires double the number of disks)

raid 5
  The disks are striped with parity across 3 or more disks. The parity is used in the event that one of the
  disks fails: the data on the failed disk is reconstructed by using the parity bit.
  Advantages:    Improved performance (read only)
                 Not expensive
  Disadvantages: Slow write operations (caused by having to create the parity bit)

There are many other RAID levels that can be used with a particular hardware environment, for example EMC
storage uses RAID-S and HP storage uses Auto RAID, so check with the manufacturer for the solution that will
provide you with the best performance and resilience.

Once you have your storage attached to the servers, you have three choices on how to set up the disks

  Raw Volumes - normally used for performance benefits, however they are hard to manage and back up
  Cluster Filesystem - used to hold all the Oracle datafiles, can be used by Windows and Linux, it's not
  used widely

  Automatic Storage Management (ASM) - Oracle's choice of storage management, it's a portable, dedicated
  and optimized cluster filesystem

Oracle Clusterware
Oracle Clusterware software is designed to run Oracle in a cluster mode. It can support up to 64 nodes, and
it can even be used with a vendor cluster like Sun Cluster.
The Clusterware software allows nodes to communicate with each other and forms the cluster that makes the
nodes work as a single logical server. The software is run by the Cluster Ready Services (CRS) using the
Oracle Cluster Registry (OCR), which records and maintains the cluster and node membership information, and
the voting disk, which acts as a tiebreaker during communication failures. Consistent heartbeat information
travels across the interconnect to the voting disk when the cluster is running.

The CRS has four components

  OPROCd - Process Monitor Daemon
  CRSd - CRS daemon, the failure of this daemon results in a node being rebooted to avoid data corruption
  OCSSd - Oracle Cluster Synchronization Service Daemon (updates the registry)
  EVMd - Event Manager Daemon

The OPROCd daemon provides the I/O fencing for the Oracle cluster; it uses the hangcheck timer or watchdog
timer for cluster integrity. It is locked into memory and runs as a realtime process; failure of this
daemon results in the node being rebooted. Fencing is used to protect the data: if a node were to have
problems, fencing presumes the worst and protects the data by restarting the node in question. It is
better to be safe than sorry.

The CRSd process manages resources, such as starting and stopping the services and failover of the
application resources; it also spawns separate processes to manage application resources. CRS manages the
OCR and stores the current known state of the cluster. It requires a public, private and VIP interface in
order to run.

OCSSd provides synchronization services among nodes. It provides access to the node membership and enables
basic cluster services, including cluster group services and locking. Failure of this daemon causes the
node to be rebooted to avoid split-brain situations.
The functions below are covered by the OCSSd

  CSS provides basic Group Services Support, it is a distributed group membership system that allows
  applications to coordinate activities to achieve a common result.
  Group services use vendor clusterware group services when it is available.
  Lock services provide the basic cluster-wide serialization locking functions, it uses the First In,
  First Out (FIFO) mechanism to manage locking
  Node services uses OCR to store data and updates the information during reconfiguration, it also manages
  the OCR data which is static otherwise.

The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a process
called evmlogger and generates events when things happen. The evmlogger spawns new child processes on
demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the
instance, and the daemon will be restarted.

Quick recap

CRS Process                               Functionality                               Failure of the Process            Run AS
OPROCd - Process Monitor                  provides basic cluster integrity services   Node restart                      root
EVMd - Event Management                   spawns a child process event logger and     Daemon automatically restarted,   oracle
                                          generates callouts                          no node restart
OCSSd - Cluster Synchronization Services  basic node membership, group services,      Node restart                      oracle
                                          basic locking
CRSd - Cluster Ready Services             resource monitoring, failover and node      Daemon restarted automatically,   root
                                          recovery                                    no node restart

The cluster-ready services (CRS) is a new component in 10g RAC; it is installed in a separate home
directory called ORACLE_CRS_HOME. It is a mandatory component but can be used with a third-party cluster
(Veritas, Sun Cluster). By default it manages the node membership functionality along with managing regular
RAC-related resources and services.

RAC uses a membership scheme, thus any node wanting to join the cluster has to become a member. RAC can
evict any member that it sees as a problem; its primary concern is protecting the data. You can add and
remove nodes from the cluster and the membership increases or decreases. When network problems occur,
membership becomes the deciding factor on which part stays as the cluster and which nodes get evicted; a
voting disk is used for this, which I will talk about later.

The resource management framework manages the resources of the cluster (disks, volumes), thus you can only
have one resource management framework per resource. Multiple frameworks are not supported as they can lead
to undesirable effects.

The Oracle Cluster Ready Services (CRS) uses the registry to keep the cluster configuration; it should
reside on shared storage and be accessible to all nodes within the cluster. This shared storage is known as
the Oracle Cluster Registry (OCR) and it's a major part of the cluster. It is automatically backed up (every
4 hours) by the daemons, plus you can manually back it up. The OCSSd uses the OCR extensively and writes the
changes to the registry.

The OCR keeps details of all resources and services; it stores name and value pairs of information, such as
resources that are used to manage the resource equivalents by the CRS stack. Resources within the CRS stack
are components that are managed by CRS and have the information on the good/bad state and the callout
scripts. The OCR is also used to supply bootstrap information (ports, nodes, etc.); it is a binary file.

The OCR is loaded as a cache on each node. Each node updates the cache, then only one node is allowed to
write the cache to the OCR file; this node is called the master. The Enterprise Manager also uses the OCR
cache. The OCR should be at least 100MB in size. The CRS daemon updates the OCR about the status of the nodes
in the cluster during reconfigurations and failures.

The voting disk (or quorum disk) is shared by all nodes within the cluster. Information about the cluster is
constantly being written to the disk; this is known as the heartbeat. If for any reason a node cannot access
the voting disk it is immediately evicted from the cluster. This protects the cluster from split-brains (the
Instance Membership Recovery algorithm, IMR, is used to detect and resolve split-brains) as the voting disk
decides which part is really the cluster. The voting disk manages the cluster membership and arbitrates the
cluster ownership during communication failures between nodes. Voting is often confused with quorum; they
are similar but distinct, below details what each means

Voting   A vote is usually a formal expression of opinion or will in response to a proposed decision
Quorum   is defined as the number, usually a majority of members of a body, that, when assembled is legally
         competent to transact business

The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node
or group of nodes cannot achieve a quorum, they should not start any services because they risk conflicting
with an established quorum.

The voting disk has to reside on shared storage; it is a small file (20MB) that can be accessed by all nodes
in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting
disks, allowing you to eliminate any SPOFs.

The original Virtual IP in Oracle was Transparent Application Failover (TAF). This had limitations, and it
has now been replaced with cluster VIPs. The cluster VIPs will fail over to working nodes if a node should
fail; these public IPs are configured in DNS so that users can access them. The cluster VIPs are different
from the cluster interconnect IP addresses and are only used to access the database.

The cluster interconnect is used to synchronize the resources of the RAC cluster, and is also used to
transfer some data from one instance to another. This interconnect should be private, highly available and
fast with low latency; ideally it should be on a private network of at least 1Gb. Whatever hardware you are
using, the NIC should use multi-pathing (Linux - bonding, Solaris - IPMP). You can use crossover cables in a
QA/DEV environment but this is not supported in a production environment; crossover cables also limit you to
a two-node cluster.

Oracle Kernel Components
The kernel components relate to the background processes, buffer cache and shared pool; managing the
resources without conflicts and corruptions requires special handling.

In RAC, as more than one instance is accessing the resource, the instances require better coordination at
the resource management level. Each node has its own set of buffers but is able to request and receive data
blocks currently held in another instance's cache. The management of data sharing and exchange is done by
the Global Cache Services (GCS).

All the resources in the cluster group form a central repository called the Global Resource Directory (GRD),
which is distributed. Each instance masters some set of resources and together all instances form the GRD.
The resources are equally distributed among the nodes based on their weight. The GRD is managed by two
services called Global Cache Services (GCS) and Global Enqueue Services (GES); together they form and manage
the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the
surviving nodes; a similar action is performed when a new node joins.

RAC Background Processes
Each node has its own background processes and memory structures. There are additional processes beyond the
norm to manage the shared resources; these additional processes maintain cache coherency across the nodes.
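As a rough illustration of how a global cache service can keep a block coherent between instances, the toy Python model below tracks which instance holds the exclusive lock on each block and hands the current copy over when another instance asks for it. It is only a sketch under simplifying assumptions: the class ToyGlobalCacheService and its methods are hypothetical names invented here, there is a single lock mode (exclusive), and nothing in it reflects Oracle's actual internal structures.

```python
class ToyGlobalCacheService:
    """Toy model: one exclusive lock per block, tracked centrally."""

    def __init__(self):
        self.holder = {}     # block id -> instance holding the exclusive lock
        self.current = {}    # block id -> current contents of the block
        self.transfers = []  # (block, from_instance, to_instance) hand-overs

    def request_exclusive(self, instance, block, read_from_disk):
        """Grant `instance` the exclusive lock and return the current copy."""
        owner = self.holder.get(block)
        if owner is None:
            # Nobody holds the block yet, so it is read from disk.
            self.current[block] = read_from_disk(block)
        elif owner != instance:
            # Another instance holds it: it gives up the lock and the current
            # copy (including its changes) is shipped to the requester.
            self.transfers.append((block, owner, instance))
        self.holder[block] = instance
        return self.current[block]

    def modify(self, instance, block, data):
        """Only the lock holder may change the block."""
        if self.holder.get(block) != instance:
            raise PermissionError(f"{instance} does not hold the lock on {block}")
        self.current[block] = data


# Instance A modifies a block, then instance B asks for the same block.
disk = {"block-1": "original"}
gcs = ToyGlobalCacheService()
gcs.request_exclusive("A", "block-1", disk.get)
gcs.modify("A", "block-1", "changed by A")
seen_by_b = gcs.request_exclusive("B", "block-1", disk.get)
print(seen_by_b, gcs.holder["block-1"], gcs.transfers)
```

In this model instance B receives the copy that already contains instance A's change, without going back to disk, and only one instance holds the current copy at any time.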

Cache coherency is the technique of keeping multiple copies of a buffer consistent between different Oracle
instances on different nodes. Global cache management ensures that access to a master copy of a data block
in one buffer cache is coordinated with the copy of the block in another buffer cache.

The sequence of an operation would go as below

1. When instance A needs a block of data to modify, it reads the block from disk; before reading it must
   inform the GCS (DLM). GCS keeps track of the lock status of the data block by keeping an exclusive lock
   on it on behalf of instance A
2. Now instance B wants to modify that same data block. It too must inform GCS; GCS will then request
   instance A to release the lock. Thus GCS ensures that instance B gets the latest version of the data
   block (including instance A's modifications) and then exclusively locks it on instance B's behalf.
3. At any one point in time, only one instance has the current copy of the block, thus keeping the
   integrity of the block.

GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be
read/written to by any node in the RAC. GCS is an in-memory database that contains information about
current locks on blocks and instances waiting to acquire locks. This is known as Parallel Cache Management
(PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle
processes between instances in the RAC. Each instance has a buffer cache in its SGA. To ensure that each RAC
instance obtains the block that it needs to satisfy a query or transaction, RAC uses two processes, the GCS
and GES, which maintain records of the lock status of each data file and each cached block using a GRD.

So what is a resource? It is an identifiable entity: it basically has a name or a reference, and it can be
an area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states
(exclusive or shared). Any shared resource is lockable, and if it is not shared no access conflict will
occur.

A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache
blocks are the most obvious and most heavily used global resource; transaction enqueues and database data
structures are other examples. GCS handles data buffer cache blocks and GES handles all the non-data block
resources.

All caches in the SGA are either global or local: dictionary and buffer caches are global, large and java
pool buffer caches are local. Cache fusion is used to read the data buffer cache from another instance
instead of getting the block from disk, thus cache fusion moves current copies of data blocks between
instances (hence why you need a fast private network). GCS manages the block transfers between the
instances.

Finally we get to the processes

Oracle RAC Daemons and Processes

LMSn - Lock Manager Server process (GCS)
  This is the cache fusion part and the most active process; it handles the consistent copies of blocks
  that are transferred between instances. It receives requests from LMD to perform lock requests. It rolls
  back any uncommitted transactions. There can be up to ten LMS processes running and they can be started
  dynamically if demand requires it.
  They manage lock manager service requests for GCS resources and send them to a service queue to be
  handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion
  timeouts.
  As a performance gain you can increase this process priority to make sure CPU starvation does not occur.
  You can see the statistics of this daemon by looking at the view X$KJMSDP

LMON - Lock Monitor Process (GES)
  This process manages the GES; it maintains consistency of the GCS memory structure in case of process
  death. It is also responsible for cluster reconfiguration and locks reconfiguration (node joining or
  leaving); it checks for instance deaths and listens for local messaging.
  A detailed log file is created that tracks any reconfigurations that have happened.

LMD - Lock Manager Daemon (GES)
  This manages the enqueue manager service requests for the GCS. It also handles deadlock detection and
  remote resource requests from other instances.
  You can see the statistics of this daemon by looking at the view X$KJMDDP

LCK0 - Lock Process (GES)
  Manages instance resource requests and cross-instance call operations for shared resources. It builds a
  list of invalid lock elements and validates lock elements during recovery.

DIAG - Diagnostic Daemon
  This is a lightweight process; it uses the DIAG framework to monitor the health of the cluster. It
  captures information for later diagnosis in the event of failures. It will perform any necessary recovery
  if an operational hang is detected.

RAC Administration
At the time of RAC installation, if you have a third node, make use of it: don't install it with the
original configuration, add it afterwards. Use this node to practise removing a node from the cluster and
to simulate node failures. This is the only way to learn; keep repeating certain situations until you fully
understand how RAC works.

It is recommended that the spfile (binary parameter file) is shared between all nodes within the cluster,
but it is possible for each instance to have its own spfile. The parameters can be grouped into three
categories

Unique parameters
  These parameters are unique to each instance, examples would be instance_name, thread and undo_tablespace
Identical parameters
  Parameters in this category must be the same for each instance, examples would be db_name and
  control_files
Neither unique nor identical parameters
  Parameters that are not in any of the above, examples would be db_cache_size, large_pool_size,
  local_listener and gcs_server_processes

The main unique parameters that you should know about are

  instance_name - defines the name of the Oracle instance (default is the value of the oracle_sid variable)
  instance_number - a unique number for each instance, must be greater than 0 but smaller than the
  max_instance parameter
  thread - specifies the set of redolog files to be used by the instance

  undo_tablespace - specifies the name of the undo tablespace to be used by the instance
  rollback_segments - you should use Automatic Undo Management
  cluster_interconnects - use only if Oracle has trouble picking the correct interconnects

The identical parameters that you should know about are below

  cluster_database - options are true or false, mounts the control file in either share (cluster) or
  exclusive mode, use false in the below cases
    o Converting from no archive log mode to archive log mode and vice versa
    o Enabling the flashback database feature
    o Performing a media recovery on a system table
    o Maintenance of a node
  active_instance_count - used for primary/secondary RAC environments
  cluster_database_instances - specifies the number of instances that will be accessing the database (set
  to maximum # of nodes)
  dml_locks - specifies the number of DML locks for a particular instance (only change if you get ORA-00055
  errors)
  gc_files_to_locks - specify the number of global locks to a data file, changing this disables the Cache
  Fusion.
  max_commit_propagation_delay - influences the mechanism Oracle uses to synchronize the SCN among all
  instances
  instance_groups - specify multiple parallel query execution groups and assigns the current instance to
  those groups
  parallel_instance_group - specifies the group of instances to be used for parallel query execution
  gcs_server_processes - specify the number of lock manager server (LMS) background processes used by the
  instance for Cache Fusion
  remote_listener - register the instance with listeners on remote nodes.

You can use the below query to view all of them

  select name, isinstance_modifiable from v$parameter where isinstance_modifiable = 'FALSE' order by name;

syntax for parameter file

  <instance_name>.<parameter_name>=<parameter_value>

  inst1.db_cache_size = 1000000
  *.undo_management=auto

example

  alter system set db_2k_cache_size=10m scope=spfile sid='inst1';

  Note: use the sid option to specify a particular instance
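The parameter file syntax above can be read as "an entry prefixed with an instance name applies to that instance only, an entry prefixed with * applies to every instance". The Python sketch below illustrates that lookup rule on a handful of entries. It is not how Oracle parses an spfile; the helper names parse_entries and effective_value, and the instance name inst2, are hypothetical and used only for illustration.

```python
def parse_entries(lines):
    """Parse '<sid>.<parameter>=<value>' lines into {(sid, parameter): value}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, value = line.split("=", 1)
        sid, parameter = key.strip().split(".", 1)
        entries[(sid, parameter.lower())] = value.strip()
    return entries


def effective_value(entries, sid, parameter):
    """An instance-specific entry wins; otherwise fall back to the '*' entry."""
    parameter = parameter.lower()
    if (sid, parameter) in entries:
        return entries[(sid, parameter)]
    return entries.get(("*", parameter))


sample = [
    "inst1.db_cache_size = 1000000",
    "*.undo_management=auto",
    "inst1.undo_tablespace=undo_tbs1",
    "inst2.undo_tablespace=undo_tbs2",  # inst2 is a made-up second instance
]
entries = parse_entries(sample)
for sid in ("inst1", "inst2"):
    print(sid,
          effective_value(entries, sid, "undo_tablespace"),
          effective_value(entries, sid, "undo_management"),
          effective_value(entries, sid, "db_cache_size"))
```

Here both instances pick up undo_management from the * entry, each gets its own undo_tablespace, and db_cache_size is only defined for inst1.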

temporary datafiles being used for the temporary tablespace Redologs I have already discussed redologs. etc v$tempfile . Redologs are located on the shared storage so that all instances can . temporary segments can be reclaimed from segments from other instances in that tablespace. see below for options start all instances force open mount nomount srvctl stop database -d <database> -o <option> Note: the listeners are not stopped.undo_tablespace=undo_tbs1 instance2. in a RAC environment every instance has its own set of redologs.explore current and maximum sort segment usage statistics (check columns freed_extents. Each instance creates a temporary segment in the temporary tablespace it is using.if they grow increase tablespace size) useful views gv$tempseg_usage . see below for options stop all instances immediate abort normal transactional start/stop particular srvctl [start|stop] database -d <database> -i <instance>. gv$sort_segment .explore tempry segment usage details such as name. Each instance has exclusive write access to its own redologs. If an instance is running a large sort.<instance> instance Undo Management Instances in a RAC do not share undo. you can use the -o option to specify startup/shutdown options. you can use the -o option to specify startup/shutdown options. free_requests . SQL. Using the undo_tablespace parameter each instance can point to its own undo tablespace undo tablespace instance1. this group is then used by all instances of the RAC.undo_tablespace=undo_tbs2 Temporary Tablespace In a RAC environment you should setup a temporary tablespace group. you can also use sqlplus to start and stop the instance srvctl start database -d <database> -o <option> Note: starts listeners if not already running.identify . they each have a dedicated undo tablespace.Starting and Stopping Instances The srvctl command is used to start/stop an instance. this is used for recovery. but each instance can read each others redologs.

SQL> shutdown. SQL> alter system set cluster_database=true scope=spfile sid='prod1'. the configuration is stored in the Oracle Cluster Registry (OCR) that was created when RAC was installed. I suggest that you look up the command but I will provide a few examples display the registered databases status srvctl config database srvctl status database -d <database srvctl status instance -d <database> -i <instance> srvctl status nodeapps -n <node> .have access to each others redologs. The process is a little different to the standard Oracle when changing the archive mode archive mode (RAC) SQL> alter system set cluster_database=false scope=spfile sid='prod1'. srvctl start database -p prod1 SRVCTL command We have already come across the srvctl above. there is no difference in RAC environment apart from the setting up ## Make sure that the database is running in archive log mode SQL> archive log list ## Setup the flashback SQL> alter system set cluster_database=false scope=spfile sid='prod1'. It can divided into two categories   Database configuration tasks Database instance control tasks Oracle stores database configuration in a repository. Srvctl uses CRS to communicate and perform startup and shutdown commands on other nodes. srvctl stop database -p prod1 SQL> startup mount SQL> alter database flashback on. this command is called the server control utility. SQL> alter system set DB_RECOVERY_FILE_DEST_SIZE=200M scope=spfile. it will be located on the shared storage. SQL> shutdown. srvctl start database -d prod Flashback Again I have already talked about flashback. srvctl stop database -d <database> SQL> startup mount SQL> alter database archivelog. flashback (RAC) SQL> alter system set DB_RECOVERY_FILE_DEST='/ocfs2/flashback' scope=spfile.

    srvctl status service -d <database>
    srvctl status asm -n <node>

stopping/starting
    srvctl stop database -d <database>
    srvctl stop instance -d <database> -i <instance>,<instance>
    srvctl stop service -d <database> [-s <service><service>] [-i <instance>,<instance>]
    srvctl stop nodeapps -n <node>
    srvctl stop asm -n <node>

    srvctl start database -d <database>
    srvctl start instance -d <database> -i <instance>,<instance>
    srvctl start service -d <database> -s <service><service> -i <instance>,<instance>
    srvctl start nodeapps -n <node>
    srvctl start asm -n <node>

adding/removing
    srvctl add database -d <database> -o <oracle_home>
    srvctl add instance -d <database> -i <instance> -n <node>
    srvctl add service -d <database> -s <service> -r <preferred_list>
    srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
    srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>

    srvctl remove database -d <database> -o <oracle_home>
    srvctl remove instance -d <database> -i <instance> -n <node>
    srvctl remove service -d <database> -s <service> -r <preferred_list>
    srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
    srvctl remove asm -n <node>

Services

Services are used to manage the workload in Oracle RAC, the important features of services are

    used to distribute the workload
    can be configured to provide high availability
    provide a transparent way to direct workload

The view v$services contains information about services that have been started on that instance, here is a list from a fresh RAC installation

[Listing: v$services output from a fresh RAC installation]

The attributes shown in the listing are described below

    Goal - allows you to define a service goal using service time, throughput or none

    Connect Time Load Balancing Goal - listeners and mid-tier servers contain current information about service performance
    Distributed Transaction Processing - used for distributed transactions
    AQ_HA_Notifications - information about nodes being up or down will be sent to mid-tier servers via the advanced queuing mechanism
    Preferred and Available Instances - the preferred instances for a service, the available ones are the backup instances

You can administer services using the following tools

    DBCA
    EM (Enterprise Manager)
    DBMS_SERVICE
    Server Control (srvctl)

Two services are created when the database is first installed, these services are running all the time and cannot be disabled.

    sys$background - used by an instance's background processes only
    sys$users - when users connect to the database without specifying a service they use this service

add
    srvctl add service -d D01 -s BATCH_SERVICE -r node1,node2 -a node3
    Note: the options are described below
        -d - the database
        -s - the service
        -r - the service will run on these nodes
        -a - if the nodes in the -r list are not running then run on this node

remove
    srvctl remove service -d D01 -s BATCH_SERVICE

start
    srvctl start service -d D01 -s BATCH_SERVICE

stop
    srvctl stop service -d D01 -s BATCH_SERVICE

status
    srvctl status service -d D01 -s BATCH_SERVICE

service (example)
    ## create the JOB class
    BEGIN
      DBMS_SCHEDULER.create_job_class(
        job_class_name => 'BATCH_JOB_CLASS',
        service => 'BATCH_SERVICE');
    END;
    /

    ## Grant the privileges to execute the job
    grant execute on sys.batch_job_class to vallep;

    ## create a job associated with a job class

    BEGIN
      DBMS_SCHEDULER.create_job(
        job_name => 'my_user.batch_job_test',
        job_type => 'PLSQL_BLOCK',
        job_action => SYSTIMESTAMP'
        repeat_interval => 'FREQ=DAILY;',
        job_class => 'SYS.BATCH_JOB_CLASS',
        end_date => NULL,
        enabled => TRUE,
        comments => 'Test batch job to show RAC services');
    END;
    /

    ## assign a job class to an existing job
    exec dbms_scheduler.set_attribute('MY_BATCH_JOB', 'JOB_CLASS', 'BATCH_JOB_CLASS');

Cluster Ready Services (CRS)

CRS is Oracle's clusterware software, you can use it with other third-party clusterware software, though this is not required (apart from HP Tru64). CRS is started automatically when the server starts, you should only stop this service in the following situations

    Applying a patch set to $ORA_CRS_HOME
    O/S maintenance
    Debugging CRS problems

CRS Administration

starting
    ## Starting CRS using Oracle 10g R1
    not possible

    ## Starting CRS using Oracle 10g R2
    $ORA_CRS_HOME/bin/crsctl start crs

stopping
    ## Stopping CRS using Oracle 10g R1
    srvctl stop database -d <database>
    srvctl stop asm -n <node>
    srvctl stop nodeapps -n <node>
    /etc/init.d/init.crs stop

    ## Stopping CRS using Oracle 10g R2
    $ORA_CRS_HOME/bin/crsctl stop crs

disabling/enabling
    ## stop CRS restarting after a reboot, basically permanent over reboots

    ## Oracle 10g R1
    /etc/init.d/init.crs [disable|enable]

    ## Oracle 10g R2
    $ORA_CRS_HOME/bin/crsctl [disable|enable] crs

checking
    $ORA_CRS_HOME/bin/crsctl check crs
    $ORA_CRS_HOME/bin/crsctl check evmd
    $ORA_CRS_HOME/bin/crsctl check cssd
    $ORA_CRS_HOME/bin/crsctl check crsd
    $ORA_CRS_HOME/bin/crsctl check install -wait 600

Resource Applications (CRS Utilities)

status
    $ORA_CRS_HOME/bin/crs_stat
    $ORA_CRS_HOME/bin/crs_stat -t
    $ORA_CRS_HOME/bin/crs_stat -ls
    $ORA_CRS_HOME/bin/crs_stat -p
    Note:
        -t more readable display
        -ls permission listing
        -p parameters

create profile
    $ORA_CRS_HOME/bin/crs_profile

register/unregister application
    $ORA_CRS_HOME/bin/crs_register
    $ORA_CRS_HOME/bin/crs_unregister

Start/Stop an application
    $ORA_CRS_HOME/bin/crs_start
    $ORA_CRS_HOME/bin/crs_stop

Resource permissions
    $ORA_CRS_HOME/bin/crs_getperm
    $ORA_CRS_HOME/bin/crs_setperm

Relocate a resource
    $ORA_CRS_HOME/bin/crs_relocate

Nodes

member number/name
    olsnodes -n
    Note: the olsnodes command is located in $ORA_CRS_HOME/bin

local node name
    olsnodes -l

activates logging
    olsnodes -g

Oracle Interfaces

display
    oifcfg getif

delete
    oifcfg delif -global

set
    oifcfg setif -global <interface name>/<subnet>:public
    oifcfg setif -global <interface name>/<subnet>:cluster_interconnect

Global Services Daemon Control

starting
    gsdctl start

stopping
    gsdctl stop

status
    gsdctl status

Cluster Configuration (clscfg is used during installation)

create a new configuration
    clscfg -install
    Note: the clscfg command is located in $ORA_CRS_HOME/bin

upgrade or downgrade an existing configuration
    clscfg -upgrade
    clscfg -downgrade

add or delete a node from the configuration
    clscfg -add
    clscfg -delete

create a special single-node configuration for ASM
    clscfg -local

brief listing of terminology used in the other nodes
    clscfg -concepts

used for tracing
    clscfg -trace

help
    clscfg -h

Cluster Name Check

print cluster name
    cemutlo -n
    Note: in Oracle 9i the utility was called "cemutls", the command is located in $ORA_CRS_HOME/bin

print the clusterware version
    cemutlo -w
    Note: in Oracle 9i the utility was called "cemutls"

Node Scripts

Add Node
    addnode.sh
    Note: see adding and deleting nodes

Delete Node
    deletenode.sh
    Note: see adding and deleting nodes

Oracle Cluster Registry (OCR)

As you already know the OCR is the registry that contains the following information

    Node list
    Node membership mapping
    Database instance, node and other mapping information
    Characteristics of any third-party applications controlled by CRS

The file location is specified during the installation. The file pointer indicating the OCR device location is ocr.loc, and this can be found in either of the following

    linux - /etc/oracle
    solaris - /var/opt/oracle

The file contents look something like below, this was taken from my installation

ocr.loc
    ocrconfig_loc=/u02/oradata/racdb/OCRFile
    ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
    local_only=FALSE

OCR is important to the RAC environment and any problems must be actioned immediately, the command can be found in $ORA_CRS_HOME/bin
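When scripting health checks it can be handy to read the OCR locations out of ocr.loc rather than hard-coding them. The following is a minimal sketch, assuming the file is a simple list of key=value lines as in the sample above; the function name parse_ocr_loc and the skipping of blank lines and '#' comment lines are my own assumptions, not something Oracle documents.

```python
def parse_ocr_loc(text):
    """Parse key=value lines of an ocr.loc style file into a dict.

    Assumption: one setting per line; blank lines and lines starting
    with '#' are skipped (hypothetical convenience, not an Oracle rule).
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Not a key=value line, ignore it
            continue
        settings[key.strip()] = value.strip()
    return settings


# Sample contents copied from the installation shown above
SAMPLE = """\
ocrconfig_loc=/u02/oradata/racdb/OCRFile
ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
local_only=FALSE
"""

if __name__ == "__main__":
    cfg = parse_ocr_loc(SAMPLE)
    print("OCR file:  ", cfg["ocrconfig_loc"])
    print("OCR mirror:", cfg.get("ocrmirrorconfig_loc", "(not configured)"))
    print("local only:", cfg["local_only"].upper() == "TRUE")
```

On a real node you would read /etc/oracle/ocr.loc (linux) or /var/opt/oracle/ocr.loc (solaris) and pass its contents to the function.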

OCR Utilities

log file
    $ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log

checking
    ocrcheck
    Note: will return the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check

dump contents
    ocrdump
    Note: by default it dumps the contents into a file named OCRDUMPFILE in the current directory

export/import
    ocrconfig -export <file>
    ocrconfig -import <file>

backup/restore
    # show backups
    ocrconfig -showbackup

    # to change the location of the backup, you can even specify an ASM disk
    ocrconfig -backuploc <path|+asm>

    # perform a backup, will use the location specified by -backuploc
    ocrconfig -manualbackup

    # perform a restore
    ocrconfig -restore <file>

    # delete a backup
    ocrconfig -delete <file>

    Note: there are many more options so see the ocrconfig man page

add/remove/replace
    ## add/relocate the ocrmirror file to the specified location
    ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'

    ## relocate an existing OCR file
    ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'

    ## remove the OCR or OCRMirror file
    ocrconfig -replace ocr
    ocrconfig -replace ocrmirror

Voting Disk

The voting disk, as I mentioned in the architecture, is used to resolve membership issues in the event of a partitioned cluster, the voting disk protects data integrity.

querying
    crsctl query css votedisk

adding
    crsctl add css votedisk <file>

deleting
    crsctl delete css votedisk <file>

RAC Backups and Recovery

Backups and recovery are very similar to a single instance database. This article covers only the specific issues that surround RAC backups and recovery.

Backups can be different depending on the size of the company

    small company - may use tools such as tar, cpio, rsync
    medium/large company - Veritas Netbackup, RMAN
    Enterprise company - SAN mirroring with a backup option like Netbackup or RMAN

Oracle RAC can use all the above backup technologies, but Oracle prefers you to use RMAN, Oracle's own backup solution.

Backup Basics

Oracle backups can be taken hot or cold, a backup will comprise the following

    Datafiles
    Control Files
    Archive redolog files
    Parameter files (init.ora or SPFILE)

Databases have now grown to very large sizes, well over a terabyte in some cases, thus tape backups are not used in these cases and sophisticated disk mirroring has taken their place. RMAN can be used in either a tape or disk solution, it can even work with third-party solutions such as Veritas Netbackup.

In an Oracle RAC environment it is critical to make sure that all archive redolog files are located on shared storage, this is required when trying to recover the database, as you need access to all archive redologs. RMAN can use parallelism when recovering, the node that performs the recovery must have access to all archived redologs, however, during recovery only one node applies the archived logs as in a standard single instance configuration.

Oracle RAC also supports Oracle Data Guard, thus you can have a primary database configured as a RAC and a standby database also configured as a RAC.

Instance Recovery

In a RAC environment there are two types of recovery

    Crash Recovery - means that all instances have failed, thus they all need to be recovered
    Instance Recovery - means that one or more instances have failed, the failed instances can then be recovered by the surviving instances

Redo information generated by an instance is called a thread of redo. All log files for that instance belong to this thread, an online redolog file belongs to a group and the group belongs to a thread. Details about log group file and thread association are stored in the control file. RAC databases have multiple threads of redo, each instance has one active thread, and the threads are parallel timelines that together form a stream. A stream consists of all the threads of redo information ever recorded, the streams form the timeline of changes performed to the database.
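To picture how parallel redo threads combine into one timeline, here is a small sketch that merges two per-instance threads by SCN. It only illustrates the idea of threads forming a stream; the RedoRecord layout, the SCN values and the change descriptions are invented for the example and say nothing about Oracle's internal format.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class RedoRecord(NamedTuple):
    scn: int      # system change number, the ordering key
    thread: int   # redo thread (one per instance)
    change: str   # hypothetical description of the change


def merge_threads(*threads: Iterable[RedoRecord]) -> Iterator[RedoRecord]:
    """Merge redo threads, each already ordered by SCN, into one stream."""
    return heapq.merge(*threads, key=lambda record: record.scn)


# Two instances, each writing its own thread of redo (made-up data)
THREAD_1 = [
    RedoRecord(100, 1, "update block 10"),
    RedoRecord(130, 1, "update block 11"),
]
THREAD_2 = [
    RedoRecord(110, 2, "update block 10"),
    RedoRecord(120, 2, "insert block 42"),
]

if __name__ == "__main__":
    for record in merge_threads(THREAD_1, THREAD_2):
        print(record.scn, "thread", record.thread, record.change)
```

Because each thread is already in SCN order, a lazy k-way merge is enough and no global sort is needed, which is why every thread's archived redo has to be reachable by the node doing the recovery.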

Oracle records the changes made to a database, these are called change vectors. Each vector is a description of a single change, usually to a single block. A redo record contains one or more change vectors and is located by its Redo Byte Address (RBA), which points to a specific location in the redolog file (or thread). It consists of three components

    log sequence number
    block number within the log
    byte number within the block

Checkpoints are the same in a RAC environment and a single instance environment, I have already discussed checkpoints. When a checkpoint needs to be triggered, Oracle will look for the thread checkpoint that has the lowest checkpoint SCN, all blocks in memory that contain changes made prior to this SCN across all instances must be written out to disk. I have discussed how to control recovery in my Oracle section and this applies to RAC as well.

Crash Recovery

Crash recovery is basically the same for a single instance and a RAC environment, I have a complete recovery section in my Oracle section, here is a note detailing the difference.

For a single instance the following is the recovery process

1. The on-disk block is the starting point for the recovery, Oracle will only consider the block on the disk so the recovery is simple. Crash recovery will automatically happen using the online redo logs that are current or active
2. The starting point is the last full checkpoint. The starting point is provided by the control file and compared against the same information in the data file headers, only the changes need to be applied
3. The block specified in the redolog is read into cache, if the block has the same timestamp as the redo record (SCN match) the redo is applied.

For a RAC instance the following is the recovery process

1. A foreground process in a surviving instance detects an "invalid block lock" condition when an attempt is made to read a block into the buffer cache. This is an indication that an instance has failed (died)
2. The foreground process sends a notification to the instance's system monitor (SMON), which begins to search for dead instances. SMON maintains a list of all the dead instances and invalid block locks. Once the recovery and cleanup has finished this list is updated.
3. The death of another instance is detected if the current instance is able to acquire that instance's redo thread locks, which are usually held by an open and active instance.

Oracle RAC uses a two-pass recovery, because a data block could have been modified in any of the instances (dead or alive), so it needs to obtain the latest version of the dirty block, and it uses the PI (Past Image) and the Block Written Record (BWR) to achieve this in a quick and timely fashion.

Block Written Record (BWR)

The cache aging and incremental checkpoint system write a number of blocks to disk, when the DBWR completes a data block write operation, it also adds a redo record that states the block has been written (data block address and SCN). DBWn can write block written records (BWRs) in batches, though in a lazy fashion. In RAC a BWR is written when

an instance writes a block covered by a global resource or when it is told that the past image (PI) buffer it is holding is no longer necessary.

Past Image (PI)

This is what makes RAC cache fusion work, it eliminates the write/write contention problem that existed in the OPS database. A PI is a copy of a globally dirty block and is maintained in the database buffer cache, it can be created and saved when a dirty block is shipped across to another instance after setting the resource role to global. The GCS is responsible for informing an instance that its PI is no longer needed after another instance writes a newer (current) version of the same block. PI's are discarded when GCS posts all the holding instances that a new and consistent version of that particular block is now on disk.

The first pass does not perform the actual recovery but merges and reads redo threads to create a hash table of the blocks that need recovery and that are not known to have been written back to the datafiles. The checkpoint SCN is needed as a starting point for the recovery, all modified blocks are added to the recovery set (an organized hash table). A block will not be recovered if its BWR version is greater than the latest PI in any of the buffer caches.

In the second pass SMON re-reads the merged redo stream (by SCN) from all threads needing recovery, the redolog entries are then compared against the recovery set built in the first pass and any matches are applied to the in-memory buffers as in a single pass recovery. The buffer cache is flushed and the checkpoint SCN for each thread is updated upon successful completion.

Cache Fusion Recovery

I have a detailed section on cache fusion, this section covers the recovery. Cache fusion recovery is only used in RAC environments, as additional steps are required, such as GRD reconfiguration, internode communication, etc. There are two types of recovery

    Crash Recovery - all instances have failed
    Instance Recovery - one instance has failed

In both cases the threads from failed instances need to be merged, in an instance recovery SMON will perform the recovery whereas in a crash recovery a foreground process performs the recovery.

The main features (advantages) of cache fusion recovery are

    Recovery cost is proportional to the number of failures, not the total number of nodes
    It eliminates disk reads of blocks that are present in a surviving instance's cache
    It prunes the recovery set based on the global resource lock state
    The cluster is available after an initial log scan, even before recovery reads are complete

In cache fusion the starting point for recovery of a block is its most current PI version, this could be located on any of the surviving instances and multiple PI blocks of a particular buffer can exist.

Remastering is the term that describes the operation whereby a node attempting recovery tries to own or master the resource(s) that were once mastered by another instance prior to the failure. When one instance leaves the cluster, the GRD of that instance needs to be redistributed to the surviving nodes. RAC uses an algorithm called lazy remastering to remaster only a minimal number of resources during a reconfiguration. The entire Parallel Cache Management (PCM) lock space remains invalid while the DLM and SMON complete the below steps

1. The IDLM master node discards locks that are held by dead instances, the space reclaimed by this operation is used to remaster locks that are held by the surviving instances and for which a dead instance was the master
2. SMON issues a message saying that it has acquired the necessary buffer locks to perform recovery

Let's look at an example of what happens during a remastering, let's presume the following

    Instance A masters resources 1, 3, 5 and 7
    Instance B masters resources 2, 4, 6 and 8
    Instance C masters resources 9, 10, 11 and 12

Instance B is removed from the cluster, only the resources from instance B are evenly remastered across the surviving nodes (no resources on instances A and C are affected), this reduces the amount of work the RAC has to perform. Likewise when an instance joins a cluster only a minimum amount of resources are remastered to the new instance.

[Figure: resource mastering, Before Remastering and After Remastering]

You can control the remastering process with a number of parameters

    _gcs_fast_config - enables fast reconfiguration for gcs locks (true|false)
    _lm_master_weight - controls which instance will hold or (re)master more resources than others
    _gcs_resources - controls the number of resources an instance will master at a time

You can also force a dynamic remastering (DRM) of an object using oradebug

force dynamic remastering (DRM)
    ## Obtain the OBJECT_ID from the below table
    SQL> select * from v$gcspfmaster_info;

    ## Determine who masters it
    SQL> oradebug setmypid
    SQL> oradebug lkdebug -a <OBJECT_ID>

    ## Now remaster the resource
    SQL> oradebug setmypid
    SQL> oradebug lkdebug -m pkey <OBJECT_ID>

The steps of a GRD reconfiguration are as follows

    Instance death is detected by the cluster manager
    Requests for PCM locks are frozen
    Enqueues are reconfigured and made available
    DLM recovery
    GCS (PCM lock) is remastered
    Pending writes and notifications are processed
    I Pass recovery
        o The instance recovery (IR) lock is acquired by SMON
        o The recovery set is prepared and built, memory space is allocated in the SMON PGA
        o SMON acquires locks on buffers that need recovery
    II Pass recovery
        o II pass recovery is initiated, database is partially available
        o Blocks are made available as they are recovered
        o The IR lock is released by SMON, recovery is then complete
        o The system is available

Graphically it looks like below

[Figure: GRD reconfiguration and recovery steps]

RAC Performance

In this section I will mainly discuss Oracle RAC tuning. First let's review the best practices of Oracle design regarding the application and database

    Optimize connection management, ensure that the middle tier and programs that connect to the database are efficient in connection management and do not log on or off repeatedly
    Tune the SQL using the available tools such as ADDM and SQL Tuning Advisor
    Ensure that applications use bind variables, cursor_sharing was introduced to solve this problem
    Use packages and procedures (because they are compiled) in place of anonymous PL/SQL blocks and big SQL statements
    Use locally managed tablespaces and automatic segment space management to help performance and simplify database administration
    Use automatic undo management and temporary tablespaces to simplify administration and increase performance
    Ensure you use large caching when using sequences, unless you cannot afford to lose sequence numbers during a crash
    Avoid using DDL in production, it increases invalidations of the already parsed SQL statements and they need to be recompiled
    Partition tables and indexes to reduce index leaf contention (buffer busy global cr problems)
    Optimize contention on data blocks (hot spots) by avoiding small tables with too many rows in a block

Now we can review RAC specific best practices

    Consider using application partitioning (see below)
    Consider restricting DML-intensive users to using one instance, thus reducing cache contention
    Keep read-only tablespaces away from DML-intensive tablespaces, they only require minimum resources thus optimizing Cache Fusion performance
    Avoid auditing in RAC, this causes more shared library cache locks
    Use full table scans sparingly, they cause the GCS to service lots of block requests, see view v$sysstat column "table scans (long tables)"
    If the application uses lots of logins, increase the cache value of the sys.audsess$ sequence

Partitioning Workload

Workload partitioning means that a certain type of workload is executed on a particular instance, that is, partitioning allows users who access the same set of data to log on to the same instance. This limits the amount of data that is shared between instances, thus saving resources used for messaging and Cache Fusion data block transfer.

You should consider the following when deciding to implement partitioning

    If the CPU and private interconnects are of high performance then there is no need to partition

    Partitioning does add complexity, thus if you can increase CPU and interconnect performance instead, so much the better
    Only partition if performance is being impacted
    Test both partitioning and non-partitioning to see what difference it makes, then decide if partitioning is worth it

RAC Wait Events

An event is an operation or particular function that the Oracle kernel performs on behalf of a user or an Oracle background process, events have specific names like database event. Whenever a session has to wait for something, the wait time is tracked and charged to the event that was associated with that wait. Events that are associated with all such waits are known as wait events. There are a number of wait classes

    Commit
    Scheduler
    Application
    Configuration
    User I/O
    System I/O
    Concurrency
    Network
    Administrative
    Cluster
    Idle
    Other

There are over 800 different events spread across the above list, however you will probably only deal with about 50 or so that can improve performance.

When a session requests access to a data block it sends a request to the lock master for proper authorization, the request does not know if it will receive the block via Cache Fusion or a permission to read from the disk. Two placeholder events

    global cache cr request (consistent read - cr)
    global cache curr request (current - curr)

keep track of the time a session spends in this state. There are a number of types of wait events regarding access to a data block

Wait Event: gc current block 2way
Contention type: write/write
Description: An instance requests authorization for a block to be accessed in current mode to modify a block, the instance mastering the resource receives the request. The master has the current version of the block and sends the current copy of the block to the requestor via Cache Fusion and keeps a Past Image (PI).
If you get this then do the following

    Analyze the contention, see the segments in the "current blocks received" section of AWR
    Use an application partitioning scheme
    Make sure the system has enough CPU power
    Make sure the interconnect is as fast as possible
    Ensure that socket send and receive buffers are configured correctly

Wait Event: gc current block 3way
Contention type: write/write
Description: An instance requests authorization for a block to be accessed in current mode to modify a block, the instance mastering the resource receives the request and forwards it to the current holder of the block, asking it to relinquish ownership. The holding instance sends a copy of the current version of the block to the requestor via Cache Fusion and transfers the exclusive lock to the requesting instance. It also keeps a Past Image (PI).
Use the above actions to increase the performance

Wait Event: gc current block 2way
Contention type: write/read
Description: The difference with the one above is that this sends a copy of the block, thus keeping the current copy.

Wait Event: gc current block 3way
Contention type: write/read
Description: The difference with the one above is that this sends a copy of the block, thus keeping the current copy.

Wait Event: gc current block busy
Contention type: write/write
Description: The requestor will eventually get the block via cache fusion but it is delayed due to one of the following

    The block was being used by another session on another instance
    The transfer was delayed as the holding instance could not write the corresponding redo record immediately

If you get this then do the following

    Ensure the log writer is tuned

Wait Event: gc current buffer busy
Contention type: local
Description: This is the same as above (gc current block busy), the difference is that another session on the same instance has also requested the block (hence local contention)

Wait Event: gc current block congested
Contention type: none
Description: This is caused by heavy congestion on the GCS, thus CPU resources are stretched

Enqueue Tuning

Oracle RAC uses a queuing mechanism to ensure proper use of shared resources, it is called Global Enqueue Services (GES). Enqueue wait is the time spent by a session waiting for a shared resource, here are some examples of enqueues:

    updating the control file (CF enqueue)
    updating an individual row (TX enqueue)
    exclusive lock on a table (TM enqueue)

Some enqueues are managed by the instance itself, others are used globally, GES is responsible for coordinating the global resources. The formula used to calculate the number of enqueue resources is as below

    GES Resources = DB_FILES + DML_LOCKS + ENQUEUE_RESOURCES + PROCESSES + TRANSACTIONS x (1 + (N - 1)/N)

    N = number of RAC instances

displaying enqueues stats
    SQL> column current_utilization heading current
    SQL> column max_utilization heading max_usage
    SQL> column initial_allocation heading initial
    SQL> column resource_limit format a23;
    SQL> select * from v$resource_limit;

AWR and RAC

From a RAC point of view there are a number of RAC-specific sections that you need to look at in the AWR, in the report section is an AWR of my home RAC environment, you can view the below report.

RAC AWR Section: Number of Instances
Report: instances
Description: lists the number of instances from the beginning and end of the AWR report

RAC AWR Section: Instance global cache load profile
Report: global cache
Description: information about the interinstance cache fusion data block and messaging traffic, because my AWR report is lightweight here is a more heavily used RAC example

    Global Cache Load Profile
    ~~~~~~~~~~~~~~~~~~~~~~~~~         Per Second   Per Transaction
                                      ----------   ---------------
    Global Cache blocks received:         315.37             12.82
    Global Cache blocks served:           240.30              9.67
    GCS/GES messages received:            525.16             20.81
    GCS/GES messages sent:                765.32             30.91

The first two statistics indicate the number of blocks transferred to or from this instance, thus if you are using an 8K block size

    Sent: 240 x 8,192 = 1,966,080 bytes/sec = 2.0 MB/sec
    Received: 315 x 8,192 = 2,580,480 bytes/sec = 2.6 MB/sec

To determine the amount of network traffic generated due to messaging you first need to find the average message size (this was 193 on my system)

    select sum(kjxmsize * (kjxmrcv + kjxmsnt + kjxmqsnt)) / sum((kjxmrcv + kjxmsnt + kjxmqsnt)) "avg Message size"
    from x$kjxm
    where kjxmrcv > 0 or kjxmsnt > 0 or kjxmqsnt > 0;

then calculate the amount of messaging traffic on this network

    193 x (765 + 525) = 248,970 bytes/sec = approximately 0.25 MB/sec

to calculate the total network traffic generated by cache fusion

    = 1,966,080 + 2,580,480 + 248,970 = 4,795,530 bytes/sec
    = approximately 4.8 MBytes/sec
    = 4.8 x 8 = approximately 38 Mbits/sec

The DBWR Fusion writes statistic indicates the number of times the local DBWR was forced to write a block to disk due to remote instances, this number should be low.

RAC AWR Section: Global cache efficiency percentage
Report: global cache efficiency
Description: this section shows how the instance is getting all the data blocks it needs. The best order is the following

    Local cache
    Remote cache
    Disk

The first two give the cache hit ratio for the instance, you are looking for a value less than 10%, if you are getting higher values then you may consider application partitioning.

RAC AWR Section: GCS and GES workload characteristics
Report: GCS and GES workload
Description: this section contains timing statistics for global enqueue and global cache. As a general rule you are looking for

    All timings related to CR (Consistent Read) block processing should be less than 10 msec
    All timings related to CURRENT block processing should be less than 20 msec

RAC AWR Section: Messaging statistics
Report: messaging
Description: The first section relates to sending a message and should be less than 1 second. The second section details the breakup of direct and indirect messages, direct messages are sent by an instance foreground or the user processes to remote instances, indirect messages are not urgent and are pooled and then sent.
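The interconnect traffic estimate from the global cache load profile can be reproduced with a few lines of code. This is only a sketch of the arithmetic using the rounded per-second figures quoted above (240 blocks served, 315 blocks received, 765 messages sent, 525 messages received, an average message size of 193 bytes and an 8K block size); the function name and the dictionary keys are my own.

```python
BLOCK_SIZE = 8192  # bytes, assuming an 8K database block size


def interconnect_traffic(blocks_served, blocks_received,
                         msgs_sent, msgs_received,
                         avg_msg_size, block_size=BLOCK_SIZE):
    """Estimate cache fusion traffic on the interconnect, per second."""
    sent_bytes = blocks_served * block_size
    received_bytes = blocks_received * block_size
    messaging_bytes = avg_msg_size * (msgs_sent + msgs_received)
    total_bytes = sent_bytes + received_bytes + messaging_bytes
    return {
        "sent_bytes": sent_bytes,
        "received_bytes": received_bytes,
        "messaging_bytes": messaging_bytes,
        "total_bytes": total_bytes,
        "total_mbytes": total_bytes / 1_000_000,
        "total_mbits": total_bytes * 8 / 1_000_000,
    }


if __name__ == "__main__":
    traffic = interconnect_traffic(
        blocks_served=240, blocks_received=315,
        msgs_sent=765, msgs_received=525,
        avg_msg_size=193,
    )
    for name, value in traffic.items():
        print(f"{name:16} {value:,.2f}")
```

With these inputs the messaging component comes to 193 x 1,290 = 248,970 bytes/sec, so the total is a little under 5 MBytes/sec, which is a small load for a gigabit interconnect.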

RAC AWR Section: Service statistics
Report: Service stats
Description: shows the resources used by all the services the instance supports

RAC AWR Section: Service wait class statistics
Report: Service wait
Description: summarizes waits in different categories for each service

RAC AWR Section: Top 5 CR and current block segments
Report: Top 5 CR and current block segments
Description: contains the names of the top 5 contentious segments (table or index). If a Top 5 CR table or index has a very high percentage of CR and current block transfers you need to investigate. This is pretty much like a normal single instance.

Cluster Interconnect

As I stated above the interconnect is a critical part of the RAC, you must make sure that this is on the best hardware you can buy. You can confirm that the interconnect is being used in Oracle 9i and 10g by using the command oradebug to dump information out to a trace file, in Oracle 10g R2 the cluster interconnect is also recorded in the alert.log file, you can view my information from here.

interconnect
    SQL> oradebug setmypid
    SQL> oradebug ipc
    Note: look in the user_dump_dest directory, the trace will be there

Global Resource Directory (GRD)

The RAC environment includes many resources such as multiple versions of data block buffers in buffer caches in different modes, Oracle uses locking and queuing mechanisms to coordinate lock resources, data and interinstance data requests. Resources such as data blocks and locks must be synchronized between nodes as nodes within a cluster acquire and release ownership of them. The synchronization provided by the Global Resource Directory (GRD) maintains cluster-wide concurrency of the resources and in turn ensures the integrity of the shared data. Synchronization is also required for buffer cache management as it is divided into multiple caches, and each instance is responsible for managing its own local version of the buffer cache. Copies of data are exchanged between nodes, this is sometimes referred to as the global cache but in reality each node's buffer cache is separate and copies of blocks are exchanged through a traditional distributed locking mechanism.

Global Cache Services (GCS) maintain the cache coherency across buffer cache resources and Global Enqueue Services (GES) controls the resource management across the cluster's non-buffer cache resources.

Cache Coherency

Cache coherency identifies the most up-to-date copy of a resource, also called the master copy, it uses a mechanism by which multiple copies of an object are kept consistent between Oracle instances. Parallel Cache Management (PCM) ensures that the master copy of a data block is stored in one buffer cache and consistent copies of the data block are stored in other buffer caches, the process LCKx is responsible for this task.

The lock and resource structures for instance locks reside in the GRD (also called the DLM), it's a dedicated area within the shared pool. Details about the data blocks resources and cached versions are

Global cache together with GES form the GRD. Enqueues are the same in RAC as they are in a single instance. a disk file. role of the data block (local or global) and ownership are maintained by GES. the grant queue and convert queue are associated with each and every resource that is managed by the GES. the lock is moved back to the grant queue in the previous mode  The requested mode is compatible with the most restrictive lock in the grant queue and with all the previous modes of the convert queue. A convert queue is a queue of locks that are waiting to be converted to particular mode. it has a name or reference. to manage all information about a particular resource. A resource can be owned or locked in various states (exclusive or shared). Resources and Enqueues A resource is an identifiable entity. they protect persistent objects such as tables or library cache objects. Locks are placed on a resource grant or a convert queue. this will become the resource master. An enqueue can be held in exclusive mode by one process and others can hold a non-exclusive mode depending on the type. The GCS and GES nominate one instance. that are currently granted to users. A resource has a lock value block (LVB). if the lock changes it moves between the queues. and the lock is at the head of the convert queue Convert requests are processed on a FIFO basis. even a NULL is a lock.maintained by GCS. latches do not affect the cluster only the local instance. Each resource can have a list of locks called the grant queue. Each instance knows which instance master is with which resource. A global resource is visible throughout the cluster. thus a local resource can only be used by the instance at it is local too. state of the buffer. a data block or an abstract entity. A lock leaves the convert queue under the following conditions   The process requests the lock termination (it remove the lock) The process cancels the conversion. Enqueues can use any of the following modes . 
it also deals with deadlocks and timeouts. they support multiple modes and are held longer than latches. Enqueues are basically locks that support queuing mechanisms and that can be acquired in different modes. latches and enqueues. There are two types of local locks. Global Enqueue Services (GES) GES coordinates the requests of all global enqueues. Additional details such as the location of the most current version. all resources are lockable. The Global Resource Manager (GRM) keeps the lock information valid and correct across the cluster. Each instance maintains a part of the GRD in its SGA. this is the process of changing a lock from one mode to another. Enqueues are shared structures that serialize access to database resources. enqueues can affect both the cluster and the instance. The referenced entity is usually a memory region.

Mode (Summary): Description
NULL (NULL): no access rights; a lock is held at this level to indicate that a process is interested in a resource
SS (SubShared): the resource can be read in an unprotected fashion and other processes can read and write to the resource; this lock is also known as a row share lock
SX (Shared Exclusive): the resource can be read and written to in an unprotected fashion; this is also known as a RX (row exclusive) lock
S (Shared): a process cannot write to the resource but multiple processes can read it. This is the traditional share lock.
SSX (SubShared Exclusive): only one process can hold a lock at this level, which makes sure that only one process can modify the resource at a time. Other processes can perform unprotected reads. This is also known as a SRX (shared row exclusive) table lock.
X (Exclusive): grants the holding process exclusive access to the resource; other processes cannot read or write to the resource. This is also the traditional exclusive lock.

Global Locks

Each node has information for a set of resources. Oracle uses a hashing algorithm to determine which nodes hold the directory tree information for the resource. Global locks are mainly of two types

- Locks used by the GCS for buffer cache management, these are called PCM locks
- Global locks (global enqueues) that Oracle synchronizes within a cluster to coordinate non-PCM resources, they protect the enqueue structures

An instance owns a global lock that protects a resource (i.e. a data block or data dictionary entry) when the resource enters the instance's SGA.

GES locks control access to data files (not the data blocks) and control files, and also serialize interinstance communication. They also control the library caches and the dictionary cache. Examples of this are DDL, DML enqueue table locks, transaction enqueues and DDL locks or dictionary locks. The SCN and mount lock are global locks.

Transaction and row locks are the same as in a single instance database; the only difference is that the enqueues are global enqueues. Take a look in locking for an in depth view on how Oracle locking works.

Messaging

The difference between RAC and single instance messaging is that RAC uses the high speed interconnect, whereas a single instance uses shared memory and semaphores; interrupts are used when one or more processes want to use the processor in a multiple CPU architecture. GES uses messaging for interinstance communication, which is done by messages and asynchronous traps (ASTs). Both LMON and LMD use messages to communicate with other instances, and the GRD is updated when locks are required. The messaging traffic can be viewed using the view V$GES_MISC.

A three-way lock message involves up to a maximum of three instances: the Master instance (M), the Holding instance (H) and the Requesting instance (R). The sequence is detailed below, where requesting instance R is interested in block B1 from holding instance H. The resource is mastered in master instance M.

1. Instance R gets the ownership information about a resource from the GRD, then sends a message to the master instance M requesting access to the resource. This message is sent by a direct send as it is critical.
2. Instance M receives the message and forwards it to the holding instance H. This is also sent directly; it is known as a blocking asynchronous trap (BAST).
3. Instance H sends the resource to instance R using the interconnect, and the resource is copied into instance R's memory.
4. Once the lock handle is obtained on the resource, instance R sends an acknowledgment to instance M. This message is queued as it is not critical; it is called an acquisition asynchronous trap (AAST).

Because GES relies heavily on messaging, the interconnect must be of high quality (high performance, low latency); the messages are also kept small (128 bytes) to increase performance. The Traffic Controller (TRFC) is used to control the DLM traffic between the instances in the cluster, and it uses buffering to accommodate large volumes of traffic. The TRFC keeps track of everything by using tickets (sequence numbers). There is a predefined pool of tickets, whose size depends on the network send buffer size. A ticket is obtained before sending any message, and once the message is sent the ticket is returned to the pool; LMS or LMD perform this. If there are no tickets then the message has to wait until a ticket is available. You can control the number of tickets and view them

system parameter
  _lm_tickets
  _lm_ticket_active_sendback (used for aggressive messaging)

ticket usage
  select local_nid local, remote_nid remote, tckt_avail avail, tckt_limit limit, snd_q_len send_queue, tckt_wait waiting from v$ges_traffic_controller;

dump ticket information
  SQL> oradebug setmypid
  SQL> oradebug unlimit
  SQL> oradebug lkdebug -t
  Note: the output can be viewed here

Global Cache Services (GCS)

GCS locks only protect data blocks in the global cache (they are also known as PCM locks); a lock can be acquired in share or exclusive mode. Each lock element can have its lock role set to either local (same as single instance) or global. When in the global role three lock modes are possible: shared, exclusive and null. In global role mode you can read or write to the data block only as directed by the master instance of that resource. The lock and state information is held in the SGA and is maintained by GCS; these structures are called lock elements. A lock element also holds a chain of cache buffer chains that are covered by the corresponding lock element. Lock elements can be viewed via v$lock_element, and the parameter _db_block_hash_buckets controls the number of hash buffer chain buckets.
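The ticket mechanism described above for the Traffic Controller can be pictured with a small toy model. This is only a sketch of the idea (a fixed pool, take a ticket before sending, wait when none is free); the class and method names are made up for illustration and have nothing to do with Oracle's internal implementation.

```python
from collections import deque


class TicketPool:
    """Toy model of ticket-based flow control for outgoing messages.

    Hypothetical illustration only, not Oracle's TRFC code.
    """

    def __init__(self, limit):
        self.limit = limit        # size of the predefined ticket pool
        self.available = limit    # tickets currently free
        self.waiting = deque()    # messages queued because no ticket was free
        self.sent = []            # messages that have gone out, in order

    def send(self, message):
        """Send now if a ticket is free, otherwise queue the message."""
        if self.available > 0:
            self.available -= 1   # a ticket is obtained before sending
            self.sent.append(message)
            return True
        self.waiting.append(message)
        return False

    def ticket_returned(self):
        """Called once a message has been sent and its ticket comes back."""
        if self.waiting:
            # Hand the returned ticket straight to the oldest waiting message.
            self.sent.append(self.waiting.popleft())
        else:
            self.available = min(self.limit, self.available + 1)


pool = TicketPool(limit=2)
results = [pool.send(m) for m in ("m1", "m2", "m3")]  # third message must wait
pool.ticket_returned()                                # frees a ticket for m3
```

With a pool of two tickets the third message is queued until a ticket comes back, which is the behaviour the tckt_wait and snd_q_len columns of v$ges_traffic_controller let you observe on a real system.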

GCS locks use the following modes, as stated above

Exclusive (X): used during update or any DML operation. If another instance requires a block that has an exclusive lock, it asks GES to request that the second instance disown the global lock.
Shared (S): used for select operations. Reading of data does not require an instance to disown a global lock.
Null (N): allows instances to keep a lock without any permission on the block(s). This mode is used so that locks need not be created and destroyed all the time; the lock is just converted from one mode to another.

Lock roles are used by Cache Fusion. A role can be either local or global: the resource is local if the block is dirty only in the local cache, and it is global if the block is dirty in a remote cache or in several remote caches.

A Past Image (PI) is kept by the instance when a block is shipped to another instance; the role is then changed to a global role, thus the PI represents the state of a dirty buffer. A node must keep a PI until it receives notification from the master that a write to disk has completed covering that version; the node will then log a block written record (BWR). I have already discussed PI and BWR in my backup section.

When a new current block arrives, the previous PI remains untouched in case another node requires it. If a number of PIs exist, they may or may not merge into a single PI; the master will determine this based on whether the older PIs are required. An indeterminate number of PIs can exist.

In the local role only S and X modes are permitted; when requested by the master instance the holding instance serves a copy of the block to others. If the block is globally clean, this instance's lock role remains local. If the block is modified (dirty), a PI is retained and the lock becomes global. In the global lock role the lock modes can be N, S and X; the block is global and it may even be dirty in any of the instances, and the disk version may be obsolete. Interested parties can only modify the block using X mode. An instance cannot read the block from disk as it may not be current; the holding instance can send copies to other instances when instructed by the master.

I have a complete detailed walkthrough in my cache_fusion section, which will help you to understand this better.

A lock element (LE) holds lock state information (converting, granting, etc). LEs are managed by the lock process to determine the mode of the locks. They also hold a chain of cache buffers that are covered by the LE, which allows the Oracle database to keep track of cache buffers that must be written to disk in case an LE (mode) needs to be downgraded (X > N).

LEs protect all the data blocks in the buffer cache. The list below describes the states of the data blocks which are managed by the LEs using GCS locks (x$bh.state)

0  FREE
1  EXLCUR
2  SHRCUR
3  CR
4  READING
5  MRECOVERY
6  IRECOVERY
7  WRITING
8  PI
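The numeric codes in the list above are easier to read once they are mapped back to names. The sketch below assumes you have already fetched (state, count) rows from X$BH by some means; the sample rows are invented for illustration and are not output from a real system, and the function name is hypothetical.

```python
# State codes and names as given in the list above.
BUFFER_STATES = {
    0: "FREE", 1: "EXLCUR", 2: "SHRCUR", 3: "CR", 4: "READING",
    5: "MRECOVERY", 6: "IRECOVERY", 7: "WRITING", 8: "PI",
}


def summarize_states(rows):
    """Turn (state, count) rows into a {state_name: count} dictionary."""
    summary = {}
    for state, count in rows:
        name = BUFFER_STATES.get(state, "UNKNOWN(%d)" % state)
        summary[name] = summary.get(name, 0) + count
    return summary


# Made-up sample rows, standing in for the result of a query on X$BH.
sample_rows = [(0, 120), (1, 300), (3, 45), (8, 7)]
summary = summarize_states(sample_rows)
past_images = summary.get("PI", 0)   # buffers held as past images
```

Codes that are not in the list are reported as UNKNOWN rather than dropped, so an unexpected value stands out instead of silently disappearing.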

So putting this all together you get the following. GCS manages PCM locks in the GRD, and PCM locks manage the data blocks in the global cache. Data blocks can be kept in any of the instances' buffer caches (which together form the global cache); if a block is not found there it can be read from disk by the requesting instance. The GCS monitors and maintains the list and mode of the blocks in all the instances. Each instance will master a number of resources, but a resource can only be mastered by one instance. GCS ensures cache coherency by requiring that instances acquire a lock before modifying or reading a database block. GCS locks are not row-level locks; row-level locks are used in conjunction with PCM locks. The GCS lock ensures that the block is accessed by one instance, then row-level locks manage the block at the row level. If a block is modified, all Past Images (PI) are no longer current and new copies need to be obtained.

Consistent read processing means that readers never block writers, the same as in a single instance. One parameter that can help is _db_block_max_cr_dba, which limits the number of CR copies per DBA in the buffer cache. If too many CR requests arrive for a particular buffer, the holder can disown the lock on the buffer and write the buffer to disk, so that the requestor can then read it from disk, especially if the requested block has an older SCN and needs to be reconstructed (known as CR fabrication). This is technically known as a fairness downconvert, and the parameter _fairness_threshold can be used to configure it.

The lightwork rule is invoked when CR construction involves too much work and no current block or PI block is available in the cache for block cleanouts. The query below can be used to view the number of times a downconvert occurs

downconvert
  select cr_requests, light_works, data_requests, fairness_down_converts from v$cr_block_server;
  Note: lower the _fairness_threshold if the ratio goes above 40%, set it to 0 if the instance is a query only instance.

The GRD is a central repository for locks and resources. It is distributed across all nodes (not held on a single node), but only one instance masters a resource. The process of maintaining information about resources is called lock mastering or resource mastering. I spoke about lock remastering in my backup section.

Resource affinity allows the mastering of frequently used resources on the local node; it uses dynamic resource mastering to move the location of the resource masters. Normally resource mastering only happens when an instance joins or leaves the RAC environment. As of Oracle 10g R2 mastering occurs at the object level, which helps fine-grained object remastering. There are a number of parameters that can be used to dynamically remaster an object

_gc_affinity_time: specifies the interval in minutes for remastering
_gc_affinity_limit: defines the number of times an instance accesses the resource before remastering, setting it to 0 disables remastering
_gc_affinity_minimum: defines the minimum number of times an instance accesses the resource before remastering
_lm_file_affinity: disables dynamic remastering for the objects belonging to those files
_lm_dynamic_remastering: enables or disables remastering

You should consult Oracle before changing any of the above parameters.

Cache Fusion

I mentioned Cache Fusion above in my GRD section; here I go into great detail on how it works, and I will also provide a number of walk through examples on my RAC system.

Cache Fusion uses the most efficient communications possible to limit the amount of traffic on the interconnect. You don't need this level of detail to administer a RAC environment, but it sure helps to understand how RAC works when trying to diagnose problems. RAC appears to have one large buffer cache but this is not the case; in reality the buffer caches of each node remain separate, and data blocks are shared through distributed locking and messaging operations. RAC copies data blocks across the interconnect to other instances as this is more efficient than reading from disk; yes, memory and networking together are faster than disk I/O.

Ping

The transfer of a data block from one instance's buffer cache to another instance's buffer cache is known as a ping. As mentioned already, when an instance requires a data block it sends the request to the lock master to obtain a lock in the desired mode; this process is known as a blocking asynchronous trap (BAST). When an instance receives a BAST it downgrades the lock ASAP, however it might have to write the corresponding block to disk first; this operation is known as a disk ping or hard ping. Disk pings have been reduced in the later versions of RAC, which rely more on block transfers, however there will always be a small amount of disk pinging. In the newer versions of RAC, when a BAST is received, sending the block or downgrading the lock may be deferred by tens of milliseconds. This extra time allows the holding instance to complete an active transaction and mark the block header appropriately, which eliminates any need for the receiving instance to check the status of the transaction immediately after receiving/reading a block. Checking the status of a transaction is an expensive operation that may require access (and pinging) to the related undo segment header and undo data blocks as well. The parameter _gc_defer_time can be used to define the duration by which an instance defers downgrading a lock.

Past Image Blocks (PI)

In the GRD section I mentioned Past Images (PIs); basically they are copies of data blocks in the local buffer cache of an instance. When an instance sends a block it has recently modified to another instance, it preserves a copy of that block, marking it as a PI. The PI is kept until that block is written to disk by the current owner of the block. When the block is written to disk and is known to have a global role, indicating the presence of PIs in other instances' buffer caches, GCS informs the instances holding the PIs to discard them. When a checkpoint is required the instance informs GCS of the write requirement, and GCS is responsible for finding the most current block image and informing the instance holding that image to perform a block write. GCS then informs all holders of the global resource that they can release the buffers holding the PI copies of the block, allowing the global resource to be released.

You can view the past image blocks present in the fixed table X$BH

PIs
  select state, count(state) from X$BH group by state;
  Note: rows where the state column is 8 are the past images.

Cache Fusion I

Cache Fusion I is also known as the consistent read server and was introduced in Oracle 8.1.5. It keeps a list of recent transactions that have changed a block; the original data contained in the block is preserved in the undo segment, which can be used to provide consistent read versions of the block.

In a single instance the following happens when reading a block

- When a reader reads a recently modified block, it might find an active transaction in the block
- The reader will need to read the undo segment header to decide whether the transaction has been committed or not
- If the transaction is not committed, the process creates a consistent read (CR) version of the block in the buffer cache using the data in the block and the data stored in the undo segment
- If the undo segment shows the transaction is committed, the process has to revisit the block and clean out the block (delayed block cleanout) and generate the redo for the changes

In a RAC environment, if the process reading the block is on an instance other than the one that modified the block, the reader will have to read the following blocks from disk

- the data block, to get the data and/or transaction ID and Undo Byte Address (UBA)
- the undo segment header block, to find the last undo block used for the entire transaction
- the undo data block, to get the actual record to construct a CR image

Before these blocks can be read, the instance modifying the block will have to write those blocks to disk, resulting in 6 I/O operations. In RAC the instance can construct a CR copy by hopefully using the above blocks that are still in memory, and then send the CR copy over the interconnect, thus saving 6 I/O operations.

RAC Troubleshooting

This is the one section that will be updated frequently as my experience with RAC grows. As RAC has been around for a while, most problems can be resolved with a simple google lookup, but a basic understanding of where to look for the problem is required. In this section I will point you to where to look for problems. Every instance in the cluster has its own alert log, which is where you would start to look. Alert logs contain startup and shutdown information, nodes joining and leaving the cluster, etc.

Here is my complete alert log file of my two node RAC starting up.

The cluster itself has a number of log files that can be examined to gain insight into occurring problems. The table below describes the information that you may need from the CRS components

$ORA_CRS_HOME/crs/log: contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init: contains trace files for the CRS daemon during startup, a good place to start
$ORA_CRS_HOME/css/log: contains cluster reconfigurations, missed check-ins, connects and disconnects from the client CSS listener. Look here to find out when reboots occur
$ORA_CRS_HOME/css/init: contains core dumps from the cluster synchronization service daemon (OCSSd)
$ORA_CRS_HOME/evm/log: log files for the event volume manager and eventlogger daemon
$ORA_CRS_HOME/evm/init: pid and lock files for EVM
$ORA_CRS_HOME/srvm/log: log files for the Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log: log files for Oracle clusterware which contain diagnostic messages at the Oracle cluster level
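When starting an investigation it can save time to see at a glance which of the directories in the table above actually exist on a node and whether they contain anything. The following is a minimal sketch that only uses the directory names from the table; the function name is made up, and it assumes ORA_CRS_HOME is set in the environment of the user running it.

```python
import os

# Sub-directories of the clusterware home, taken from the table above.
CRS_LOG_DIRS = [
    "crs/log", "crs/init", "css/log", "css/init",
    "evm/log", "evm/init", "srvm/log", "log",
]


def find_crs_log_dirs(crs_home):
    """Return {relative_dir: number_of_entries} for directories that exist."""
    found = {}
    for rel in CRS_LOG_DIRS:
        path = os.path.join(crs_home, *rel.split("/"))
        if os.path.isdir(path):
            found[rel] = len(os.listdir(path))
    return found


if __name__ == "__main__":
    home = os.environ.get("ORA_CRS_HOME")
    if home:
        for rel, entries in find_crs_log_dirs(home).items():
            print("%-10s %d entries" % (rel, entries))
    else:
        print("ORA_CRS_HOME is not set")
```

The script only lists and counts; it never opens or modifies a log file, so it should be safe to run on a live node.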

As in a normal Oracle single instance environment, a RAC environment contains the standard RDBMS log files; these files are located by the parameter background_dump_dest. The most important of these are

$ORACLE_BASE/admin/udump: contains any trace file generated by a user process
$ORACLE_BASE/admin/cdump: contains core files that are generated due to a core dump in a user process

Now let's look at a two node startup and the sequence of events.

First you must check that the RAC environment is using the correct interconnect; this can be done by any of the following

logfile
  ## The location of my alert log, yours may be different
  /u01/app/oracle/admin/racdb/bdump/alert_racdb1.log

ifcfg command
  oifcfg getif

table check
  select inst_id, pub_ksxpia, picked_ksxpia, ip_ksxpia from x$ksxpia;

oradebug
  SQL> oradebug setmypid
  SQL> oradebug ipc
  Note: check the trace file, which can be located by the parameter user_dump_dest

system parameter
  cluster_interconnects
  Note: used to specify which address to use

When the instance starts up, the Lock Monitor's (LMON) job is to register with the Node Monitor (NM) (see below table). Remember that when a node joins or leaves the cluster the GRD undergoes a reconfiguration event; as seen in the logfile it is a seven step process (see below for more details on the seven step process).

The LMON trace file also has details about reconfigurations, including the reason for the event

reconfiguration reason: description
1: means that the NM initiated the reconfiguration event, typical when a node joins or leaves a cluster
2: means that an instance has died. How does RAC detect an instance death? Every instance updates the control file with a heartbeat through its checkpoint process (CKPT); if the heartbeat information is missing for x amount of time, the instance is considered to be dead and the Instance Membership Recovery (IMR) process initiates reconfiguration.
3: means communication failure of a node/s. Messages are sent across the interconnect, and if a message is not received in an amount of time then a communication failure is assumed. By default UDP is used and it can be unreliable, so keep an eye on the logs if too many reconfigurations happen for reason 3.

Example of a reconfiguration, taken from the alert log.

  Sat Mar 20 11:35:53 2010
  Reconfiguration started (old inc 2, new inc 4)
  List of nodes:

   0 1
  Global Resource Directory frozen
  * allocate domain 0, invalid = TRUE
  Communication channels reestablished
  Master broadcasted resource hash value bitmaps
  Non-local Process blocks cleaned out
  Sat Mar 20 11:35:53 2010
  LMS 0: 0 GCS shadows cancelled, 0 closed
  Set master node info
  Submitted all remote-enqueue requests
  Dwn-cvts replayed, VALBLKs dubious
  All grantable enqueues granted
  Post SMON to start 1st pass IR
  Sat Mar 20 11:35:53 2010
  LMS 0: 0 GCS shadows traversed, 3291 replayed
  Sat Mar 20 11:35:53 2010
  Submitted all GCS remote-cache requests
  Post SMON to start 1st pass IR
  Fix write in gcs resources
  Reconfiguration complete

  Note: when a reconfiguration happens the GRD is frozen until the reconfiguration is completed

Confirm that the database has been started in cluster mode; the log file will state the following

cluster mode
  Sat Mar 20 11:36:02 2010
  Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
  Completed: ALTER DATABASE MOUNT

Starting with 10g the SCN is broadcast across all nodes, and the system will have to wait until all nodes have seen the commit SCN. You can change the broadcast method using the system parameter _lgwr_async_broadcasts.

Lamport Algorithm

The Lamport algorithm generates SCNs in parallel and they are assigned to transactions on a first come first served basis; this is different from a single instance environment. A broadcast method is used after a commit operation. This method is more CPU intensive as it has to broadcast the SCN for every commit, but the other nodes can see the committed SCN immediately.

The initialization parameter max_commit_propagation_delay limits the maximum delay allowed for SCN propagation; by default it is 7 seconds. When it is set to less than 100 the broadcast on commit algorithm is used.

Disable/Enable Oracle RAC

There are times when you may wish to disable RAC. This feature can only be used in a Unix environment (no windows option).

Disable Oracle RAC (Unix only)

1. Log in as Oracle in all nodes

2. Shutdown all instances using either the normal or immediate option
3. Change to the working directory $ORACLE_HOME/rdbms/lib
4. Run the below make command to relink the Oracle binaries without the RAC option (should take a few minutes)
   make -f ins_rdbms.mk rac_off
5. Now relink the Oracle binaries
   make -f ins_rdbms.mk ioracle

Enable Oracle RAC (Unix only)

1. Log in as Oracle in all nodes
2. Shutdown all instances using either the normal or immediate option
3. Change to the working directory $ORACLE_HOME/rdbms/lib
4. Run the below make command to relink the Oracle binaries with the RAC option (should take a few minutes)
   make -f ins_rdbms.mk rac_on
5. Now relink the Oracle binaries
   make -f ins_rdbms.mk ioracle

Performance Issues

Oracle can suffer a number of different performance problems, which can be categorized as follows

- Hung Database
- Hung Session(s)
- Overall instance/database performance
- Query Performance

A hung database is basically an internal deadlock between two processes. Usually Oracle will detect the deadlock and roll back one of the processes, however if the situation occurs with internal kernel-level resources (latches or pins), Oracle is unable to automatically detect and resolve the deadlock, thus hanging the database. When this event occurs you must obtain dumps from each of the instances (3 dumps per instance at regular intervals); the trace files will be very large.

capture information
  ## Using alter session
  SQL> alter session set max_dump_file_size = unlimited;
  SQL> alter session set events 'immediate trace name systemstate level 10';

  # using oradebug
  SQL> select * from dual;

  SQL> oradebug setmypid
  SQL> oradebug unlimit
  SQL> oradebug dump systemstate 10

  # using oradebug from another instance
  SQL> select * from dual;
  SQL> oradebug setmypid
  SQL> oradebug unlimit
  SQL> oradebug -g all dump systemstate 10

  Note: the select statement above is to avoid problems on pre 8 Oracle

SQLPlus - problems connecting
  ## If you get problems connecting with SQLPLUS use the command below
  $ sqlplus -prelim
  Enter user-name: / as sysdba

A severe performance problem can be mistaken for a hang; this usually happens because of contention problems. A systemstate dump is normally used to analyze this problem, however a systemstate dump takes a long time to complete and it also has a number of limitations

- Reads the SGA in a dirty manner, so it may be inconsistent
- Usually dumps a lot of information
- Does not identify interesting processes on which to perform additional dumps
- Can be a very expensive operation if you have a large SGA

To overcome these limitations a new utility command called hanganalyze was released with 8i, which provides clusterwide information in a RAC environment in a single shot.

sql method
  alter session set events 'immediate trace name hanganalyze level <level>';

oradebug
  SQL> oradebug hanganalyze <level>

  ## Another way using oradebug
  SQL> oradebug setmypid
  SQL> oradebug setinst all
  SQL> oradebug -g def hanganalyze <level>

  Note: you will be told where the output will be dumped to

hanganalyze levels
  1-2: only hanganalyze output, no process dump at all, click here for an example level 1 dump
  3: Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
  4: Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP state)
  5: Level 4 + Dump all processes involved in wait chains (NLEAF state)
  10: Dump all processes (IGN state)

The hanganalyze command uses internal kernel calls to determine whether a session is waiting for a resource and reports the relationship between blockers and waiters. A systemstate dump is better, but if you are overwhelmed try hanganalyze first.
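The blocker and waiter relationship that hanganalyze reports can be pictured as a set of wait chains. The sketch below is a toy illustration of that idea only, using invented session ids; it is not how Oracle implements hanganalyze, and the function names are hypothetical.

```python
def find_leaf_blockers(waits_for):
    """Given {waiter: blocker}, return the sessions that block others but
    are not themselves waiting (the blockers at the end of the chains)."""
    blockers = set(waits_for.values())
    return sorted(b for b in blockers if b not in waits_for)


def wait_chain(session, waits_for):
    """Follow the chain from a waiting session to its final blocker.

    Stops as soon as a session is seen twice, which indicates a cycle
    (a true deadlock) rather than a simple chain.
    """
    chain, seen = [session], {session}
    while chain[-1] in waits_for:
        nxt = waits_for[chain[-1]]
        chain.append(nxt)
        if nxt in seen:
            break
        seen.add(nxt)
    return chain


# Hypothetical session ids: 30 and 40 wait on 20, which waits on 10.
waits_for = {20: 10, 30: 20, 40: 20}
leaves = find_leaf_blockers(waits_for)
chain_for_40 = wait_chain(40, waits_for)
```

In this example session 10 is the only leaf blocker, so it is the one worth dumping first; sessions 20, 30 and 40 are all victims further up the chain.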

Debugging Node Eviction

A node is evicted from the cluster after it kills itself because it is not able to service the application; this generally happens when you have communication problems. For node eviction problems look for ORA-29740 errors in the alert log file and LMON trace files.

To understand eviction problems you need to know the basics of how node membership and instance membership recovery (IMR) work. When a communication failure happens the heartbeat information in the controlfile cannot be updated, thus data corruption can happen. IMR will remove any nodes from the cluster that it deems to be a problem; IMR will ensure that the larger part of the cluster survives and kills any remaining nodes. IMR is part of the service offered by Cluster Group Services (CGS). LMON handles many of the CGS functionalities; this works at the cluster level and can work with 3rd party software (Sun Cluster, Veritas Cluster). The Node Monitor (NM) provides information about nodes and their health by registering and communicating with the Cluster Manager (CM). Node membership is represented as a bitmap in the GRD. LMON will let other nodes know of any changes in membership, for example if a node joins or leaves the cluster the bitmap is rebuilt and communicated to all nodes.

Node registering (alert log)
  lmon registered with NM - instance id 1 (internal mem no 0)

One thing to remember is that all nodes must be able to read from and write to the controlfile. CGS makes sure that members are valid; it uses a voting mechanism to check the validity of each member. I have already discussed the voting disk in my architecture section. As stated above, membership is held in a bitmap in the GRD. The CKPT process updates the controlfile every 3 seconds in an operation known as a heartbeat. It writes into a single block that is unique for each instance, thus inter-instance coordination is not required; this block is called the checkpoint progress record. You can see the controlfile records using the gv$controlfile_record_section view. All members attempt to obtain a lock on the controlfile record for updating, and the instance that obtains the lock tallies the votes from all members. The group membership must conform to the decided (voted) membership before the GCS/GES reconfiguration is allowed to proceed. The controlfile vote result is stored in the same block as the heartbeat, in the control file checkpoint progress record.

A cluster reconfiguration is performed using 7 steps

1. Name service is frozen. The CGS contains an internal database of all the members/instances in the cluster with all their configurations and servicing details.
2. Lock database (IDLM) is frozen, this prevents processes from obtaining locks on resources that were mastered by the departing/dead instance
3. Determination of membership and validation and IMR
4. Bitmap rebuild takes place, with instance name and uniqueness verification. GCS must synchronize the cluster to be sure that all members get the reconfiguration event and that they all see the same bitmap.
5. Delete all dead instance entries and republish all names newly configured
6. Unfreeze and release name service for use
7. Hand over reconfiguration to GES/GCS

Debugging CRS and GSD

Debugging CRS and GSD

Oracle server management configuration tools include a diagnostic and tracing facility that gives verbose output for SRVCTL, GSD, GSDCTL or SRVCONFIG.

To capture diagnostics follow the steps below
1. Use vi to edit the gsd.sh/srvctl/srvconfig file in the $ORACLE_HOME/bin directory
2. At the end of the file look for the below line
    exec $JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME
3. Add the following just before the -classpath in the exec $JRE line
    -DTRACING.ENABLED=true -DTRACING.LEVEL=2
4. The string should look like this
    exec $JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath ...

In Oracle database 10g setting the below variable accomplishes the same thing; set it to blank to remove the debugging

Enable tracing
    $ export SRVM_TRACE=true
Disable tracing
    $ export SRVM_TRACE=""

Adding or Deleting a Node

One of the jobs of a DBA is adding and removing nodes from a RAC environment when capacity demands. Although you should add a node of a similar spec, it is possible to add a node of a higher or lower spec.

The first stage is to configure the operating system and make sure any necessary drivers are installed; also make sure that the node can see the shared disks available to the existing RAC.

I am going to presume we have a two-node RAC environment already set up, and we are going to add a third node.

Pre-Install Checking

You used the Cluster Verification utility when installing the RAC environment; the tool checks that the node has been properly prepared for a RAC deployment. You can run the command either from the new node or from any of the existing nodes in the cluster

pre-install check run from new node
    runcluvfy.sh stage -pre crsinst -n rac1,rac2,rac3 -r 10gR2
pre-install check run from existing node
    cluvfy stage -pre crsinst -n rac1,rac2,rac3 -r 10gR2

Make sure that you fix any highlighted problems before continuing.
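If you keep the node names in a script, the check command can be built rather than typed by hand. The sketch below only prints the command so that you can review it before running it; join_nodes and precheck_cmd are names invented for this example, and the node names are the ones used in this article.

```shell
# Join node names with commas, the form the -n option takes.
join_nodes() (
    IFS=,
    echo "$*"
)

# Print the pre-install check command for review instead of running it.
# $1 = runcluvfy.sh (new node) or cluvfy (existing node); rest = node names
precheck_cmd() {
    tool=$1
    shift
    echo "$tool stage -pre crsinst -n $(join_nodes "$@") -r 10gR2"
}

precheck_cmd cluvfy rac1 rac2 rac3
```

Printing first is a deliberate choice here: a typo in the node list is easier to spot in one line of output than in the middle of a long verification report.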

Install CRS

Cluster Ready Services (CRS) should be installed first; this allows the node to become part of the cluster. Adding the new node can be started from any of the existing nodes
1. Log into any of the existing nodes as user oracle then run the below command. The script starts the OUI GUI tool; hopefully the tool will already see the existing cluster and fill in the details for you
    $ORA_CRS_HOME/oui/bin/addnode.sh
2. In the specify cluster nodes to add to installation screen, enter the new names for the public, private and virtual hosts
3. Click next to see a summary page
4. Click install, the installer will copy the files from the existing node to the new node. Once copied you will be asked to run orainstRoot.sh and root.sh on the new node and rootaddnode.sh on the node that you are running the installation from
5. Run the scripts as user root. They do the following
    orainstRoot.sh: sets up the Oracle inventory on the new node and sets ownerships and permissions on the inventory
    root.sh: checks whether the Oracle CRS stack is already configured on the new node, creates the /etc/oracle directory, adds the relevant OCR keys to the cluster registry, adds the daemon to CRS and starts CRS on the new node
    rootaddnode.sh: configures the OCR registry to include the new nodes as part of the cluster
6. Click next to complete the installation
7. Now you need to configure Oracle Notification Services (ONS). The port can be identified by the below command
    cat $ORA_CRS_HOME/opmn/conf/ons.config
8. Now run the ONS utility by supplying the <remote_port> number obtained above
    racgons add_config rac3:<remote_port>

Installing Oracle DB Software

Once the CRS has been installed and the new node is in the cluster, it is time to install the Oracle DB software. Again you can use any of the existing nodes to install the software.
1. Log into any of the existing nodes as user oracle then run the below command. The script starts the OUI GUI tool; hopefully the tool will already see the existing cluster and fill in the details for you
    $ORACLE_HOME/oui/bin/addnode.sh
2. Click next on the welcome screen to open the specify cluster nodes to add to installation screen. You should have a list of all the existing nodes in the cluster; select the new node and click next
3. Check the summary page then click install to start the installation
4. The files will be copied to the new node. The script will ask you to run root.sh on the new node, then click OK to finish off the installation

Configuring the Listener

Now it's time to configure the listener on the new node
1. Login as user oracle, and set your DISPLAY environment variable, then start the Network Configuration Assistant
    $ORACLE_HOME/bin/netca
2. Choose cluster management
3. Choose listener
4. Choose add
5. Choose the name LISTENER

These steps will add a listener on rac3 as LISTENER_rac3

Create the Database Instance

Run the below to create the database instance on the new node
1. Login as oracle on the new node, set the environment to the database home and then run the database creation assistant (DBCA)
    $ORACLE_HOME/bin/dbca
2. In the welcome screen choose oracle real application clusters database to create the instance and click next
3. Choose instance management and click next
4. Choose add instance and click next
5. Select RACDB (or whatever name you gave your RAC environment) as the database and enter the SYSDBA user and password, click next
6. You should see a list of existing instances. Click next and on the following screen enter ORARAC3 as the instance and choose RAC3 as the node name (substitute any of the above names for your environment naming convention)
7. The database instance will now be created. Click next in the database storage screen and choose yes when asked to extend ASM

Removing a Node

Removing a node is similar to the above but in reverse order
1. Delete the instance on the node to be removed
2. Clean up ASM
3. Remove the listener from the node to be removed
4. Remove the node from the database
5. Remove the node from the clusterware

You can delete the instance by using the database creation assistant (DBCA). Invoke the program, choose the RAC database, choose instance management and then choose delete instance. Enter the sysdba user and password, then choose the instance to delete.

To clean up ASM follow the below steps
1. From node 1 run the below commands to stop and remove ASM on the node to be removed
    srvctl stop asm -n rac3
    srvctl remove asm -n rac3
2. Now run the following on the node to be removed
    cd $ORACLE_HOME/admin
    rm -rf +ASM
    cd $ORACLE_HOME/dbs
    rm -f *ASM*
3. Check that the /etc/oratab file has no ASM entries; if it has, remove them

Now remove the listener for the node to be removed
1. Login as user oracle, and set your DISPLAY environment variable, then start the Network Configuration Assistant
    $ORACLE_HOME/bin/netca
2. Choose cluster management
3. Choose listener
4. Choose Remove
5. Choose the name LISTENER

Next we remove the node from the database
1. Run the below from the node to be removed
    cd $ORACLE_HOME/oui/bin
    ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" -local
    ./runInstaller
2. Choose to deinstall products and select the dbhome
3. Run the following from node 1, listing the nodes that remain in the cluster
    cd $ORACLE_HOME/oui/bin
    ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2}"
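The /etc/oratab check from the ASM clean-up earlier in this section can be scripted as well. A minimal sketch, assuming ASM entries start with +ASM at the beginning of a line (for example +ASM3:<oracle_home>:N); the function name asm_oratab_entries is invented for this example.

```shell
# List any ASM entries left in an oratab-style file.
# $1 = path to the file (defaults to /etc/oratab)
# Returns non-zero when no ASM entry is found.
asm_oratab_entries() {
    grep '^+ASM' "${1:-/etc/oratab}"
}
```

Run it on the node being removed; if it prints anything, edit /etc/oratab and delete those lines by hand. The sketch only reports and leaves the file untouched.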

Lastly we remove the clusterware software
1. Run the following from node 1; you obtain the port number from the remoteport entry in the ons.config file in $ORA_CRS_HOME/opmn/conf
    $CRS_HOME/bin/racgons remove_config rac3:6200
2. Run the following from the node to be removed as user root
    cd $CRS_HOME/install
    ./rootdelete.sh
3. Now run the following from node 1 as user root; obtain the node number first
    $CRS_HOME/bin/olsnodes -n
    cd $CRS_HOME/install
    ./rootdeletenode.sh rac3,3
4. Now run the below from the node to be removed as user oracle
    cd $CRS_HOME/oui/bin
    ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" CRS=TRUE -local
    ./runInstaller
5. Choose to deinstall software and remove the CRS_HOME
6. Run the following from node 1 as user oracle, listing the nodes that remain in the cluster
    cd $CRS_HOME/oui/bin
    ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2}" CRS=TRUE
7. Check that the node has been removed. The first command should report "invalid node", the second should produce no output and the last should list only nodes rac1 and rac2
    srvctl status nodeapps -n rac3
    crs_stat |grep -i rac3
    olsnodes -n
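Step 1 above hard-codes port 6200. A small sketch of reading the port from ons.config instead, assuming the file holds plain key=value lines such as remoteport=6200; the function name ons_remote_port is invented for this example.

```shell
# Print the remoteport value from an ons.config-style file.
# $1 = path to ons.config (assumed to contain key=value lines)
ons_remote_port() {
    awk -F= '$1 == "remoteport" { print $2; exit }' "$1"
}
```

It could then feed the racgons call from step 1, for example:
$CRS_HOME/bin/racgons remove_config rac3:$(ons_remote_port "$ORA_CRS_HOME/opmn/conf/ons.config")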