Oracle Real Application Clusters (RAC)
The following documentation is a guide on how to install, configure and administer Oracle 10g Real Application Clusters (RAC). Some of the topics that I will be discussing have already been covered in my Oracle topic. The site was put together from reading a number of books and from real-world experience; if you are new to Oracle RAC I highly recommend that you purchase those books, as they contain far more information than this web site does, and of course the official Oracle web site contains all the documentation you will ever need. Please feel free to email me any constructive criticism you have of the site, as any additional knowledge, or corrections of mistakes that I have made, would be most welcome.

1. HA, Clustering and OPS
   High Availability and Clustering
   Clustering
   Oracle RAC History
   Oracle Parallel Server Architecture
2. RAC Architecture
   RAC Architecture Introduction
   RAC Components
   Disk architecture
   Oracle Clusterware
   Oracle Kernel Components
   RAC Background Processes
3. RAC Installation, Configuration and Storage
   RAC Installation
4. RAC Administration and Management
   RAC Parameters
   Starting and Stopping Instances
   Undo Management
   Temporary Tablespace
   Redologs
   Flashback
   SRVCTL command
   Services
   Cluster Ready Services (CRS)
   Oracle Cluster Registry (OCR)
   Voting Disk
5. RAC Backups and Recovery
   Introduction
   Backup Basics
   Instance Recovery
   Crash Recovery
   Cache Fusion Recovery
6. RAC Performance
   RAC Performance
   Partitioning Workload
   RAC Wait Events
   Enqueue Tuning
   AWR and RAC
   Cluster Interconnect
7. Global Resource Directory (GRD)
   GRD Introduction
   Cache Coherency
   Resources and Enqueues
   Global Enqueue Services (GES)
   Global Locks
   Messaging
   Global Cache Services (GCS)
8. Cache Fusion
   Introduction
   Ping
   Past Image Blocks (PI)
   Cache Fusion I
   Cache Fusion in Operation
9. RAC Troubleshooting
   Troubleshooting
   Lamport Algorithm
   Disable/Enable Oracle RAC
   Performance Issues
   Debugging Node Eviction
   Debugging CRS and GSD
10. Adding and Removing nodes
    Adding and removing nodes
    Pre-Install Checking
    Install CRS
    Installing Oracle DB Software
    Create the Database Instance
    Removing a Node
11. RAC Cheat sheet
    Cheatsheet
    Useful Views/Tables
    Useful Parameters
    Processes
    General Administration
    CRS Administration
    Voting Disk

1. HA, Clustering and OPS
High Availability and Clustering
When you have very critical systems that are required to be online 24x7, you need an HA (High Availability) solution, and you have to weigh the risk associated with downtime against the cost of that solution. HA solutions are not cheap and they are not easy to manage; they also need to be thoroughly tested, because a solution may not be exercised in the real world for months. I had a solution that ran for almost a year before a hardware failure caused a failover; that is when your testing beforehand comes into play. As I said before, HA comes with a price, and there are a number of HA technologies:
• Fault Tolerance - protects you from hardware failures, for example redundant PSUs, etc
• Disaster Recovery - protects you from operational issues such as a data center becoming unavailable
• Disaster Tolerance - the preparation needed for the above two; the most important of the three technologies

Every company should plan for unplanned outages; this costs virtually nothing, and knowing what to do in a DR situation is half the battle. In many companies people make excuses not to design a DR plan (it costs too much, we don't have the redundant hardware, etc). You cannot make these assumptions until you design a DR plan: the plan will highlight the risks and the costs that go with each risk, and then you can decide what you can and cannot afford. There is no excuse not to create a DR plan.

Sometimes in large corporations you will hear the phrase "five nines". This phrase refers to the availability of a system and the (approximate) downtime that is allowed. The table below highlights the uptime a system requires in order to achieve the five nines:

% Uptime              % Downtime   Downtime per year      Downtime per week
98                    2            7.3 days               3 hours 22 minutes
99                    1            3.65 days              1 hour 41 minutes
99.8                  0.2          17 hours 30 minutes    20 minutes
99.9                  0.1          8 hours 45 minutes     10 minutes
99.99                 0.01         52.5 minutes           1 minute
99.999 (five nines)   0.001        5.25 minutes           6 seconds
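If you want to sanity-check these figures yourself, a quick calculation from SQL*Plus will do (a minimal sketch, using a 365-day year):

## allowed downtime per year, in minutes, for a given uptime percentage
SQL> select round(365*24*60 * (1 - 0.99999), 2) as minutes_per_year from dual;

This returns 5.26, which is where the roughly 5.25 minutes allowed for five nines comes from.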

To achieve the five nines your system is only allowed 5.25 minutes of downtime per year, or 6 seconds per week; in some HA designs a single failover may take 6 seconds on its own. When looking for a solution you should try and build redundancy into your design; this is the first step to an HA solution, for example:

• Make sure computer cabinets have dual power
• Make sure servers have dual power supplies, dual network cards and redundant hard disks that can be mirrored
• Make sure you use multipathing to the data disks, which are usually on a SAN or NAS
• Make sure that the server is connected to two different network switches

You are trying to eliminate as many Single Points Of Failure (SPOFs) as you can without increasing the costs; most hardware today will have these redundancy features built in, but it's up to you to make use of them. HA comes in three/four flavors:

No failover - the system will remain down until it is fixed, although failed disks and PSUs can be replaced online as long as the built-in redundancy keeps the service running. This solution can be perfectly acceptable in some environments, but at what price to the business? I am sure your developers are quite happy to take a paid day off while you fix the system. Many smaller companies use this option, but if a major piece of hardware were to fail then a system outage is unavoidable.

Cold failover - basically you have an additional server ready to take over from a number of servers if one were to fail. The problem with this solution is that there is going to be downtime, especially if it takes a long time to get the standby server up to the same point in time as the failed server. The advantage is that one additional server can cover a number of servers, and if the original server is going to have a prolonged outage the standby can stand in even if it is slightly under-powered compared to the original. I have used this technique myself: I created a number of scripts that can turn a cold standby server into any one of a number of servers, so in a failover situation the standby is almost ready to go. Even in today's market, QA/DEV systems cost money when not running.

Hot failover - many applications offer hot-standby servers; these servers run alongside the live system, and data is applied to the hot-standby server periodically to keep it up to date, so in a failover situation the server is almost ready to go. The problem with this system is cost and manageability: one server is usually dedicated to one application, thus you may have to have many hot-standby servers.

Cluster - this is the jewel in the HA world. A cluster can be configured in a variety of flavors, from minimal downtime while services are moved to a good node, to virtually zero downtime. However, a cluster solution does come with a heavy price tag: configuring and maintaining a cluster is expensive, but if your business loses vast amounts of money when your system is down, then it's worth it.

Clustering

I have discussed clustering in my Tomcat and JBoss topics, so I will only touch on the subject lightly here. A cluster is a group of two or more interconnected nodes that provide a service. The cluster provides a high level of fault tolerance: if a node were to become unavailable within the cluster, the services are moved/restored to another working node, so the end user should never know that a fault occurred. One advantage of a cluster is that it is very scalable, because additional nodes can be added or taken away (a node may need to be patched) without interrupting the service; clusters can also be set up to use a single node at a time or to load balance between the nodes. The advantage is that downtime is kept to a minimum, but the main objective is to keep the service running; hence why you pay top dollar for this.

Here is a summary table that shows the most common aspects of cold failover versus hot failover:

Aspect                       Cold Failover                              Hot Failover
Scalability/number of nodes  Limited to the capacity of a single node   As nodes can be added on demand, it provides
                                                                        near-infinite scalability; high number of
                                                                        nodes supported
User interruption            Required up to a minimal extent; the       Not required; failover is automatic
                             failover operation can be scripted or
                             automated to a certain extent
Transparent failover         Not possible                               Transparent application failover is available,
of applications                                                         where sessions can be transferred to another
                                                                        node without user interruption
Load balancing               Not possible; only one server will be      Incoming load can be balanced between both
                             used                                       nodes
Usage of resources           Only one server at a time; the other       Both servers will be used
                             server is kept idle
Failover time                More than minutes, as the other system     Less than a minute, typically a few seconds;
                             must be cold started                       generally the time it takes to get the hot
                                                                        standby up to date, for example applying the
                                                                        last set of logs to a database

Clustering has come a long way; there are now three types of clustering architecture:

Shared nothing - each node within the cluster is independent; the nodes share nothing. An example of this may be web servers: you have a number of nodes within the cluster supplying the same web service, and because the content is static there is no need to share disks.

Shared disk only - each node will be attached to, or have access to, the same set of disks. These disks contain the data that is required by the service. One node controls the application and the disk, and in the event that this node fails, the other node takes control of both the application and the data. This means that one node will be on standby, sitting idle, waiting to take over if required to do so. A typical traditional Veritas Cluster or Sun Cluster would fit the bill here.

Shared everything - again all nodes are attached to, or have access to, the same set of disks, but this time each node can read and write to the disks concurrently. Normally there will be a piece of software that controls the reading and writing to the disks, ensuring data integrity. To achieve this a cluster-wide filesystem is introduced, so that all nodes view the filesystem identically; the software then coordinates the sharing and updating of files, records and databases. Oracle RAC and IBM HACMP would be good examples of this type of cluster.

Oracle RAC History

The first Oracle cluster database was released with Oracle 6 for the Digital VAX; this was the first cluster database on the market. With Oracle 6.2 Oracle Parallel Server (OPS) was born, which used Oracle's own DLM (Distributed Lock Manager). Oracle 7 used vendor-supplied clusterware, but this was complex to set up and manage. Oracle 8 introduced a general lock manager, and this was a step towards Oracle creating its own clusterware product. Oracle's lock manager is integrated with the Oracle code via an additional layer called OSD (Operating System Dependent); this was soon integrated within the kernel and became known as IDLM (Integrated Distributed Lock Manager) in later Oracle versions. Oracle Real Application Clusters 9i (Oracle RAC) used the same IDLM and relied on external clusterware software (Sun Cluster, Veritas Cluster, etc).

Oracle Parallel Server Architecture

An Oracle parallel database consists of two or more nodes that own Oracle instances and share a disk array. Each node has its own SGA and its own redo logs, but the data files and control files are shared between all instances. All data and control files are concurrently read and written by all instances; redo log files, on the other hand, can be read by any instance but are written only by the owning instance. Each instance has its own set of background processes.

The components of an OPS database are:

• Cluster Manager - OS vendor specific
• Distributed Lock Manager (DLM)
• Cluster Interconnect
• Shared Disk Array

The Cluster Group Services (CGS) has some OSD components (the node monitor interface) and the rest is built into the kernel. CGS has a key repository used by the DLM for communication and network-related activities. This layer provides the following:

• Internode messaging
• Group member consistency
• Cluster synchronization
• Process grouping, registration and deregistration

The DLM is an integral part of OPS and of the RAC stack. In older versions the DLM API module had to rely on external OS routines to check the status of a lock; this was done using UNIX sockets and pipes. With the new IDLM the lock data is in the SGA of each instance and requires only a serialized lookup using latches and/or enqueues, and may require global coordination, the algorithm for which was built directly into the Oracle kernel. The IDLM's job is to track every lock granted to a resource; the memory structures required by the DLM are allocated out of the shared pool.

The DLM maintains information about all locks on a given resource; it nominates one node to manage all relevant lock information for a resource, and this node is referred to as the master node. Lock mastering is distributed among all nodes. Using the IPC layer, the DLM shares the load of mastering resources, which means that a user can lock a resource on one node but actually end up communicating with processes on another node. The design of the DLM is such that it can survive node failures in all but one node of the cluster.

The Parallel Cache Management (PCM) layer coordinates and maintains the data blocks that exist within each instance's data buffer cache, so that data viewed or requested by users is never inconsistent or incoherent. The PCM ensures that only one instance in a cluster can modify a block at any given time; other instances have to wait until the lock is released. A user must acquire a lock before it can operate on any resource.

In OPS 8i Oracle introduced Cache Fusion Stage 1, which brought in a new background process called the Block Server Process (BSP). The BSP's main role was to ship consistent read (CR) versions of blocks across instances in read/write contention scenarios; this shipping is performed over a high-speed interconnect. Also, since 8i the introduction of the GV$ views has meant that a DBA can view cluster-wide database and other statistics sitting on any node/instance of the cluster.

The limitations of OPS are:

• scalability is limited to the capacity of the node
• you cannot easily add additional nodes to OPS
• OPS requires third-party clustering software, adding to the expense and complexity
• OPS requires RAW partitions, and these can be difficult to manage

Oracle RAC addresses these limitations by extending Cache Fusion. Cache Fusion Stage 2, in Oracle 9i and 10g, addresses some of the issues with Stage 1: both types of blocks (CR and CUR) can be transferred using the interconnect, and dynamic lock mastering was introduced. Oracle 10g RAC also comes with its own integrated clusterware and storage management framework, removing all dependencies on third-party clusterware products.

The latest Oracle RAC offers:

• Availability - can be configured to have no SPOF, even when running on low-spec'ed hardware
• Scalability - multiple servers in a cluster manage a single database transparently (scale out), which basically means adding an additional server is easy
• Reliability - improved code and monitoring; RAC has become very reliable
• Affordability - you can use low-cost hardware, although RAC itself is not going to come cheap

RAC Architecture .• Transparency .RAC looks and feels like a standard Oracle database to an application 2.

RAC Architecture Introduction

Oracle Real Application Clusters allows multiple instances to access a single database; the instances run on multiple nodes. In a standard Oracle configuration a database can only be mounted by one instance, but in a RAC environment many instances can access a single database. Oracle's RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the interconnect; make sure when designing a RAC system that you get the best interconnect you can afford.

The table below describes the differences between a standard Oracle database (single instance) and a RAC environment:

Component              Single Instance Environment       RAC Environment
SGA                    Instance has its own SGA          Each instance has its own SGA
Background processes   Instance has its own set          Each instance has its own set
Datafiles              Accessed by only one instance     Shared by all instances (shared storage)
Control files          Accessed by only one instance     Shared by all instances (shared storage)

Online redo logfile    Dedicated for read/write by       Only one instance can write, but the other
                       only one instance                 instances can read during recovery and
                                                         archiving; if an instance is shut down, log
                                                         switches by other instances can force the idle
                                                         instance's redo logs to be archived
Archived redo logfile  Dedicated to the instance         Private to the instance, but the other
                                                         instances will need access to all required
                                                         archive logs during media recovery
Flash recovery area    Accessed by only one instance     Shared by all instances (shared storage)
Alert log and          Dedicated to the instance         Private to each instance; other instances
trace files                                              never read or write to those files
ORACLE_HOME            Multiple instances on the same    Same as single instance, plus it can be placed
                       server accessing different        on a shared file system, allowing a common
                       databases can use the same        ORACLE_HOME for all instances in the RAC
                       executable files                  environment

RAC Components

The major components of an Oracle RAC system are:

• Shared disk system
• Oracle Clusterware
• Cluster interconnects
• Oracle kernel components

The diagram below describes the basic architecture of the Oracle RAC environment.
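Once your RAC database is running you can see the one-database/many-instances relationship for yourself; a minimal check from any node (the instance names will be whatever you chose at install time):

## each row is an instance of the same database
SQL> select inst_id, instance_name, host_name, status from gv$instance;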

Below are the kinds of processes you will find running on a freshly installed RAC.
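If you want to check them on your own nodes, something like the following will do (a rough sketch; the exact process names vary slightly by platform and version):

## clusterware daemons
ps -ef | grep -E 'crsd|ocssd|evmd|oprocd' | grep -v grep

## RAC-specific database background processes (LMON, LMD, LMS, LCK)
ps -ef | grep -E 'ora_lm|ora_lck' | grep -v grep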

Disk architecture

With today's SAN and NAS disk storage systems, sharing storage is fairly easy, and shared storage is required for a RAC environment. You can use the below storage setups:

• SAN (Storage Area Network) - generally uses fibre to connect to the SAN
• NAS (Network Attached Storage) - generally uses a network to connect to the NAS, using either NFS or ISCSI
• JBOD - direct attached storage; the old traditional way, still used by many companies as a cheap option

All of the above solutions can offer multi-pathing to reduce SPOFs within the RAC environment; there is no reason not to configure multi-pathing, as the cost is cheap when adding additional paths to the disk. Most of the expense is paid out when configuring the first path, so an additional controller card and network/fibre cables are all that is needed.

The last thing to think about is how to set up the underlying disk structure; this is known as a raid level, and there are about 12 different raid levels that I know of. Here are the most common ones (raid stands for Redundant Array of Inexpensive Disks):

raid 0 (Striping) - a number of disks are concatenated together to give the appearance of one very large disk.
  Advantages: improved performance; can create very large volumes
  Disadvantages: not highly available (if one disk fails, the volume fails)

raid 1 (Mirroring) - a single disk is mirrored by another disk; if one disk fails the system is unaffected as it can use its mirror.
  Advantages: improved performance; highly available (if one disk fails the mirror takes over)
  Disadvantages: expensive (requires double the number of disks)

raid 5 - the disks are striped with parity across 3 or more disks; the parity is used in the event that one of the disks fails, the data on the failed disk being reconstructed by using the parity bit.
  Advantages: improved performance (read only); not expensive
  Disadvantages: slow write operations (caused by having to create the parity bit)

There are many other raid levels that can be used with a particular hardware environment: for example EMC storage uses RAID-S and HP storage uses Auto RAID, so check with the manufacturer for the best solution that will provide you with the best performance and resilience.

Once you have your storage attached to the servers, you have three choices on how to set up the disks:

• Raw Volumes - normally used for performance benefits, however they are hard to manage and back up
• Cluster FileSystem - used to hold all of the Oracle datafiles; can be used by Windows and Linux; it's a portable, dedicated and optimized cluster filesystem, but it's not widely used
• Automatic Storage Management (ASM) - Oracle's choice for storage management; I will only be discussing ASM, which I already have a topic on called Automatic Storage Management
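To give a feel for the ASM option above, creating a disk group for the database files might look like the below (a sketch only; the disk paths and names are placeholders, see my ASM topic for the details):

## run against the ASM instance
SQL> create diskgroup DATA normal redundancy
       failgroup fg1 disk '/dev/raw/raw1'
       failgroup fg2 disk '/dev/raw/raw2';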

Oracle Clusterware

Oracle Clusterware software is designed to run Oracle in cluster mode; it can support up to 64 nodes and can even be used with a vendor cluster like Sun Cluster. The software is run by the Cluster Ready Services (CRS), using the Oracle Cluster Registry (OCR), which records and maintains the cluster and node membership information, and the voting disk, which acts as a tiebreaker during communication failures; consistent heartbeat information travels across the interconnect to the voting disk when the cluster is running. The Clusterware software allows the nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server.

The CRS has four components:

• OPROCd - Process Monitor Daemon
• CRSd - CRS daemon; the failure of this daemon results in a node being rebooted to avoid data corruption
• OCSSd - Oracle Cluster Synchronization Service Daemon (updates the registry)
• EVMd - Event Volume Manager Daemon

The OPROCd daemon provides the I/O fencing for the Oracle cluster; it is locked into memory and runs as a real-time process, and it uses the hangcheck timer or watchdog timer for cluster integrity. Fencing is used to protect the data: if a node were to have problems, fencing presumes the worst and protects the data by restarting the node in question; it's better to be safe than sorry. Failure of this daemon causes the node to be rebooted to avoid split-brain situations.

The CRSd process manages resources, such as starting and stopping services and the failover of application resources; it also spawns separate processes to manage application resources. CRS manages the OCR and stores the current known state of the cluster; it requires a public, a private and a VIP interface in order to run. It also manages the OCR data, which is otherwise static.

OCSSd provides synchronization services among nodes; it provides access to the node membership and enables basic cluster services, including cluster group services and locking. Failure of this daemon results in the node being rebooted. The below functions are covered by the OCSSd:

• CSS provides basic Group Services support; this is a distributed group membership system that allows applications to coordinate activities to achieve a common result
• Group services use vendor clusterware group services when they are available
• Lock services provide the basic cluster-wide serialization locking functions; they use a First In, First Out (FIFO) mechanism to manage locking
• Node services use the OCR to store data and update the information during reconfiguration

The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a process called evmlogger and generates events when things happen; the evmlogger spawns new child processes on demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the instance, and the daemon will be restarted.

Quick recap:

CRS Process                                Functionality                             Failure of the Process              Run As
OPROCd - Process Monitor                   provides basic cluster integrity          node restart                        root
                                           services
EVMd - Event Management                    spawns a child process (the event         daemon automatically restarted,     oracle
                                           logger) and generates callouts            no node restart
OCSSd - Cluster Synchronization Services   basic node membership, group              node restart                        oracle
                                           services, basic locking
CRSd - Cluster Ready Services              resource monitoring, failover and         daemon restarted automatically,     root
                                           node recovery                             no node restart

The resource management framework manages the resources of the cluster (disks, volumes, etc), so you can only have one resource management framework per resource; multiple frameworks are not supported, as they can lead to undesirable effects. Resources within the CRS stack are components that are managed by CRS, with information on their good/bad state and their callout scripts. CRS is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component, but it can be used with a third-party cluster (Veritas, Sun Cluster); by default it manages the node membership functionality along with managing regular RAC-related resources and services.

The Cluster Ready Services (CRS) use a registry to keep the cluster configuration; this registry on shared storage is known as the Oracle Cluster Registry (OCR), and it is a major part of the cluster. It should reside on shared storage accessible to all nodes within the cluster and should be at least 100MB in size; it is a binary file. The OCR is loaded as a cache on each node; each node will update the cache, but only one node is allowed to write the cache back to the OCR file, and that node is called the master. The Enterprise Manager also uses the OCR cache. The OCR is additionally used to supply bootstrap information such as ports, nodes, etc. The OCSSd uses the OCR extensively and writes the changes to the registry, and the CRS daemon updates the OCR about the status of the nodes in the cluster during reconfigurations and failures. The OCR stores name and value pairs of information, such as resources, that are used to manage the resource equivalents by the CRS stack; it keeps details of all resources and services. It is automatically backed up (every 4 hours) by the daemons, plus you can manually back it up.

RAC uses a membership scheme: any node wanting to join the cluster has to become a member. You can add and remove nodes from the cluster, and the membership increases or decreases accordingly. RAC can evict any member that it sees as a problem; its primary concern is protecting the data. When network problems occur, membership becomes the deciding factor on which part stays as the cluster and which nodes get evicted; to arbitrate this, a voting disk is used, which I will talk about below.

The voting disk (or quorum disk) is shared by all nodes within the cluster; information about the cluster is constantly being written to the disk, and this is known as the heartbeat. If for any reason a node cannot access the voting disk, it is immediately evicted from the cluster. This protects the cluster from split-brains (the Instance Membership Recovery algorithm, IMR, is used to detect and resolve split-brains), as the voting disk decides which part really is the cluster. The voting disk manages the cluster membership and arbitrates cluster ownership during communication failures between nodes. Voting is often confused with quorum; they are similar but distinct, and below details what each means:

Voting - a vote is usually a formal expression of opinion or will in response to a proposed decision
Quorum - defined as the number, usually a majority of the members of a body, that, when assembled, is legally competent to transact business

The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node or group of nodes cannot achieve a quorum, they should not start any services, because they risk conflicting with the established quorum.

The voting disk has to reside on shared storage; it is a small file (around 20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting disks, allowing you to eliminate any SPOFs.

The original Virtual IP in Oracle was Transparent Application Failover (TAF); this had limitations and has now been replaced with cluster VIPs. The cluster VIPs are different from the cluster interconnect IP addresses and are only used to access the database; these public IPs are configured in DNS so that users can access them. The cluster VIPs will fail over to working nodes if a node should fail.

The cluster interconnect is used to synchronize the resources of the RAC cluster, and is also used to transfer some data from one instance to another. The interconnect should be private, highly available and fast with low latency; ideally it should be on a minimum private 1GB network. You can use crossover cables in a QA/DEV environment, but they are not supported in a production environment; crossover cables also limit you to a two-node cluster. Whatever hardware you are using, the NICs should use multipathing (Linux - bonding, Solaris - IPMP).

Oracle Kernel Components

The kernel components relate to the background processes, the buffer cache and the shared pool; managing these resources without conflicts and corruption requires special handling. In RAC more than one instance is accessing the same resources, so the instances require better coordination at the resource management level. Each node has its own set of buffers but is able to request and receive data blocks currently held in another instance's cache. All the resources in the cluster group form a central repository called the Global Resource Directory (GRD), which is distributed: each instance masters some set of resources, and together all instances form the GRD. The resources are equally distributed among the nodes based on their weight. The GRD is managed by two services called Global Cache Services (GCS) and Global Enqueue Services (GES); together they form and manage the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes; similarly, when a new node joins, the GRD is redistributed to include it. The management of data sharing and exchange between instances is done by the GCS.
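Since so much rides on the interconnect, it is worth confirming which network is actually being used for it; a small sketch (the second query uses the 10g view that reports what each instance picked up):

## what the clusterware has been told (the command lives in $ORA_CRS_HOME/bin)
oifcfg getif

## what each instance is actually using
SQL> select inst_id, name, ip_address, is_public from gv$cluster_interconnects;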

RAC Background Processes

Each node has its own background processes and memory structures; in RAC there are additional processes beyond the norm to manage the shared resources, and these additional processes maintain cache coherency across the nodes. Cache coherency is the technique of keeping multiple copies of a buffer consistent between different Oracle instances on different nodes. Global cache management ensures that access to the master copy of a data block in one buffer cache is coordinated with the copy of the block in another buffer cache; this is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC, to ensure that each RAC instance obtains the block that it needs to satisfy a query or transaction.

So what is a resource? It is an identifiable entity: it basically has a name or a reference, and it can be an area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states (exclusive or shared); any shared resource is lockable, and if it is not shared no access conflict will occur. A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily used global resource; transaction enqueues and database data structures are other examples. All caches in the SGA are either global or local: the dictionary and buffer caches are global, while the large and java pool caches are local.

RAC uses two processes, the GCS and the GES, which maintain records of the lock status of each data file and each cached block using the GRD. GCS handles data buffer cache blocks and GES handles all the non-data-block resources. GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be read or written by any node in the RAC; it is an in-memory database that contains information about the current locks on blocks and about instances waiting to acquire locks. Each instance has a buffer cache in its SGA; cache fusion is used to read a block from the buffer cache of another instance instead of getting it from disk, thus keeping the integrity of the block. At any one point in time, only one instance has the current copy of a block.

The sequence of an operation would go as below:

1. When instance A needs a block of data to modify, it reads the block from disk, but before reading it must inform the GCS (DLM). GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A.
2. Now instance B wants to modify that same data block; it too must inform GCS. GCS will request instance A to release the lock; thus GCS ensures that instance B gets the latest version of the data block (including instance A's modifications) and then exclusively locks it on instance B's behalf.
3. GCS manages these block transfers between the instances.
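You can watch this block shipping from the statistics views; for example (the statistic names below are the 10g ones):

## blocks received via cache fusion rather than read from disk
SQL> select inst_id, name, value
       from gv$sysstat
      where name in ('gc cr blocks received', 'gc current blocks received');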

Finally we get to the processes themselves.

Oracle RAC Daemons and Processes:

LMSn - Lock Manager Server process (GCS) - this is the cache fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests, and it rolls back any uncommitted transactions. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it. They manage the lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion timeouts. As a performance gain you can increase this process's priority to make sure CPU starvation does not occur. You can see the statistics of this daemon by looking at the view X$KJMSDP.

LMON - Lock Monitor Process (GES) - this process manages the GES; it maintains consistency of GCS memory structures in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (a node joining or leaving); it checks for instance deaths and listens for local messaging. A detailed log file is created that tracks any reconfigurations that have happened.

LMD - Lock Manager Daemon (GES) - this manages the enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests from other instances. You can see the statistics of this daemon by looking at the view X$KJMDDP.

LCK0 - Lock Process (GES) - manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.

DIAG - Diagnostic Daemon - this is a lightweight process; it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures, and it will perform any necessary recovery if an operational hang is detected.
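To confirm which of these are running on an instance, v$bgprocess lists the background processes; for example:

## paddr <> '00' filters out processes that have not been started
SQL> select name, description
       from v$bgprocess
      where paddr <> '00'
        and name in ('LMON','LMD0','LMS0','LCK0','DIAG');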

3. RAC Installation, Configuration and Storage

RAC Installation

I am not going to show you a step-by-step guide on how to install Oracle RAC; there are many documents on the internet that explain it better than I could. However, I will point you to the one I am fond of, and it works very well if you want to build a cheap Oracle RAC environment to play around with: to configure an Oracle RAC environment, follow the instructions in the document Build your own Oracle RAC cluster on Oracle Enterprise Linux and ISCSI (there is also a newer version out using 11g). As I said, the document is excellent; the instructions are simple and I have had no problems setting up, installing and configuring it.

I did try to set up a RAC environment on VMWare on my laptop (I do have an old laptop) but it did not work very well, hence why I took the route above. I used the hardware below and it cost me a little over £400 from EBay, a lot cheaper than an Oracle course.

Hardware and description:

3 x Compaq Evo D510 PCs (instance nodes 1, 2 and 3)
  specs: CPU - 2.4GHz (P4), RAM - 2GB, HD - 40GB
  Note: picked these up for £50 each; I had to buy additional memory to max them out. The third node I use to add, remove and break, to see what happens to the cluster; definitely worth getting a third node.

1 x Compaq Evo D510 PC (Openfiler server)
  specs: CPU - 2.4GHz (P4), RAM - 2GB, HD - 40GB plus 250GB (bought an additional disk for the ISCSI storage; more than enough for me)

2 x Netgear GS608 8-port Gigabit switches (one for the private RAC network, one for the ISCSI data network)
  Note: I could have connected it all to one switch and saved a bit of money.

Miscellaneous:
  1GB network cards - support jumbo frames (may or may not be required any more) and TOE (TCP offload engine)
  network cables - cat5e
  KVM switch - a cheap one

Make sure you give yourself a couple of days to set up, install and configure the RAC; take your time and make notes. Keep repeating certain situations until you fully understand how RAC works; this is the only way to learn. Make use of that third node: don't install it with the original configuration, add it afterwards, and use it to remove a node from the cluster and also to simulate node failures. I have now set up and reinstalled so many times that I can do it in a day. Good Luck!!!!!

4. RAC Administration and Management

RAC Parameters

I am only going to talk about RAC administration here; if you need general Oracle administration then see my Oracle section. It is recommended that the spfile (binary parameter file) is shared between all nodes within the cluster, but it is possible for each instance to have its own spfile. Parameters can be set per instance in the parameter file using the syntax <instance_name>.<parameter_name>=<parameter_value> (for example inst1.undo_management=auto), or from SQL:

alter system set db_2k_cache_size=10m scope=spfile sid='inst1';

Note: use the sid option to specify a particular instance.

The parameters can be grouped into three categories:

Unique parameters - these parameters are unique to each instance; examples would be instance_name, thread and undo_tablespace
Identical parameters - parameters in this category must be the same for each instance; examples would be db_name and control_file
Neither unique nor identical parameters - parameters that do not fall into either of the above categories; examples would be db_cache_size, large_pool_size, local_listener and gcs_server_processes

The main unique parameters that you should know about are:

• instance_name - defines the name of the Oracle instance (default is the value of the oracle_sid variable)
• instance_number - a unique number for each instance; must be greater than 0 but smaller than the maximum number of instances
• thread - specifies the set of redolog files to be used by the instance
• undo_tablespace - specifies the name of the undo tablespace to be used by the instance
• rollback_segments - you should use Automatic Undo Management instead
• cluster_interconnects - use only if Oracle has trouble picking the correct interconnect

The identical parameters that you should know about are below; you can use this query to view all of them:

select name, isinstance_modifiable
from v$parameter
where isinstance_modifiable = 'false'
order by name;

• cluster_database - options are true or false; mounts the control file in either shared (cluster) or exclusive mode; set it to false in the below cases:
    - converting from noarchivelog mode to archivelog mode and vice versa
    - enabling the flashback database feature
    - performing media recovery on a system tablespace
    - maintenance of a node
• active_instance_count - used for primary/secondary RAC environments
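Pulling the unique parameters together, the relevant spfile entries for a simple two-node cluster might look like the below (a sketch; the database, instance and tablespace names are my own examples):

*.db_name='RACDB'
*.cluster_database=true
*.cluster_database_instances=2
RACDB1.instance_number=1
RACDB2.instance_number=2
RACDB1.thread=1
RACDB2.thread=2
RACDB1.undo_tablespace='UNDOTBS1'
RACDB2.undo_tablespace='UNDOTBS2'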

Other RAC-related parameters you should know about:

• cluster_database_instances - specifies the number of instances that will be accessing the database (set it to the maximum number of nodes)
• max_commit_propagation_delay - influences the mechanism Oracle uses to synchronize the SCN among all instances
• instance_groups - specifies multiple parallel query execution groups and assigns the current instance to those groups
• parallel_instance_group - specifies the group of instances to be used for parallel query execution
• gcs_server_processes - specifies the number of lock manager server (LMS) background processes used by the instance for Cache Fusion
• remote_listener - registers the instance with listeners on remote nodes
• dml_locks - specifies the number of DML locks for a particular instance (only change it if you get ORA-00055 errors)
• gc_files_to_locks - specifies the number of global locks for a data file; changing this disables Cache Fusion

Starting and Stopping Instances

The srvctl command is used to start and stop instances; you can also use sqlplus.

start all instances:
srvctl start database -d <database> -o <option>
Note: starts the listeners if not already running; the -o option specifies the startup option: force, open, mount or nomount

stop all instances:
srvctl stop database -d <database> -o <option>
Note: the listeners are not stopped; the -o option specifies the shutdown option: normal, transactional, immediate or abort

start/stop particular instances:
srvctl [start|stop] database -d <database> -i <instance>,<instance>
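For example, to bounce a whole (hypothetical) database called PROD, or bring one of its instances up in mount mode for maintenance:

srvctl stop database -d PROD -o immediate
srvctl start database -d PROD

## start a single instance in mount mode
srvctl start instance -d PROD -i PROD1 -o mount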

Undo Management

To recap on undo management you can see my undo section. Instances in a RAC do not share undo; they each have a dedicated undo tablespace. Using the undo_tablespace parameter, each instance can point to its own undo tablespace:

instance1.undo_tablespace=undo_tbs1
instance2.undo_tablespace=undo_tbs2

With today's Oracle you should be using automatic undo management (AUM); again, I have a detailed discussion on AUM in my undo section.

Temporary Tablespace

I have already discussed temporary tablespaces. In a RAC environment you should set up a temporary tablespace group; this group is then used by all instances of the RAC. Each instance creates a temporary segment in the temporary tablespace it is using, and if an instance is running a large sort, temporary segments can be reclaimed from segments of other instances in that tablespace; a minimal example of creating such a group follows the views below.

useful views:

gv$sort_segment - explore current and maximum sort segment usage statistics (check the columns freed_extents and free_requests; if they grow, increase the tablespace size)
gv$tempseg_usage - explore temporary segment usage details such as name, SQL, etc
v$tempfile - temporary datafiles being used for the temporary tablespace
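Creating the temporary tablespace group mentioned above is just ordinary DDL; a minimal sketch (the names, sizes and the ASM disk group are placeholders):

SQL> create temporary tablespace temp1 tempfile '+DATA' size 500m tablespace group temp_grp;
SQL> create temporary tablespace temp2 tempfile '+DATA' size 500m tablespace group temp_grp;

## make the group the database default
SQL> alter database default temporary tablespace temp_grp;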

It can divided into two categories • • Database configuration tasks Database instance control tasks .instance can read each others redologs. SQL> shutdown. srvctl stop database -d <database> SQL> startup mount archive mode SQL> alter database archivelog. SQL> shutdown. srvctl stop database -p prod1 SQL> startup mount SQL> alter database flashback on. SQL> alter system set DB_RECOVERY_FILE_DEST_SIZE=200M scope=spfile. srvctl start database -p prod1 flashback (RAC) SRVCTL command We have already come across the srvctl above. there is no difference in RAC environment apart from the setting up ## Make sure that the database is running in archive log mode SQL> archive log list ## Setup the flashback SQL> alter system set cluster_database=false scope=spfile sid='prod1'. this is used for recovery. srvctl start database -d prod Flashback Again I have already talked about flashback. The process is a little different to the standard Oracle when changing the archive mode SQL> alter system set cluster_database=false scope=spfile sid='prod1'. SQL> alter system set DB_RECOVERY_FILE_DEST='/ocfs2/flashback' scope=spfile. this command is called the server control utility. Redologs are located on the shared storage so that all instances can have access to each others redologs. (RAC) SQL> alter system set cluster_database=true scope=spfile sid='prod1'.

Oracle stores the database configuration in a repository; the configuration is stored in the Oracle Cluster Registry (OCR) that was created when RAC was installed, and it is located on the shared storage. srvctl uses CRS to communicate with, and perform startup and shutdown commands on, other nodes. I suggest that you look the command up in full, but I will provide a few examples:

display the registered databases:
srvctl config database

status:
srvctl status database -d <database>
srvctl status instance -d <database> -i <instance>
srvctl status nodeapps -n <node>
srvctl status service -d <database>
srvctl status asm -n <node>

stopping/starting:
srvctl stop database -d <database>
srvctl stop instance -d <database> -i <instance>,<instance>
srvctl stop service -d <database> [-s <service>,<service>] [-i <instance>,<instance>]
srvctl stop nodeapps -n <node>
srvctl stop asm -n <node>
srvctl start database -d <database>
srvctl start instance -d <database> -i <instance>,<instance>
srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
srvctl start nodeapps -n <node>
srvctl start asm -n <node>

adding/removing:
srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>
srvctl remove database -d <database> -o <oracle_home>
srvctl remove instance -d <database> -i <instance> -n <node>
srvctl remove service -d <database> -s <service> -r <preferred_list>
srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
srvctl remove asm -n <node>

Services

Services are used to manage the workload in Oracle RAC; the important features of services are:

• they are used to distribute the workload
• they can be configured to provide high availability
• they provide a transparent way to direct workload

The view v$services contains information about the services that have been started on an instance; here is a list from a fresh RAC installation. The attributes of a service are described below:

• Goal - allows you to define a service goal using service time, throughput or none
• Connect Time Load Balancing Goal - listeners and mid-tier servers contain current information about service performance
• Distributed Transaction Processing - used for distributed transactions
• AQ_HA_Notifications - information about nodes being up or down will be sent to mid-tier servers via the advanced queuing mechanism
• Preferred and Available Instances - the preferred instances for a service; the available ones are the backup instances

You can administer services using the following tools:

• DBCA
• EM (Enterprise Manager)
• DBMS_SERVICE
• Server Control (srvctl)

Two services are created when the database is first installed; these services are running all the time and cannot be disabled:

• sys$background - used by an instance's background processes only
• sys$users - when users connect to the database without specifying a service, they use this service
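On the client side a service is just a connect descriptor; a tnsnames.ora entry using the cluster VIPs with basic TAF might look like the below (the host names are placeholders for your own VIPs):

BATCH_SERVICE =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVICE_NAME = BATCH_SERVICE)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 180)(DELAY = 5))
    )
  )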

service (example)

add:
srvctl add service -d D01 -s BATCH_SERVICE -r node1,node2 -a node3
Note: the options are described below
  -d : the database
  -s : the service
  -r : the service will run on these (preferred) nodes
  -a : if the nodes in the -r list are not running, then run on this (available) node

remove:
srvctl remove service -d D01 -s BATCH_SERVICE

start:
srvctl start service -d D01 -s BATCH_SERVICE

stop:
srvctl stop service -d D01 -s BATCH_SERVICE

status:
srvctl status service -d D01 -s BATCH_SERVICE

## create the job class
BEGIN
  DBMS_SCHEDULER.create_job_class(
    job_class_name => 'BATCH_JOB_CLASS',
    service        => 'BATCH_SERVICE');
END;
/

## Grant the privileges to execute the job class
grant execute on sys.BATCH_JOB_CLASS to vallep;

## create a job associated with the job class
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'my_user.batch_job_test',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN NULL; END;',   -- placeholder PL/SQL block, substitute your own
    job_class       => 'SYS.BATCH_JOB_CLASS',
    repeat_interval => 'FREQ=DAILY',
    end_date        => NULL,
    enabled         => TRUE,
    comments        => 'Test batch job to show RAC services');
END;
/

## assign a job class to an existing job
exec dbms_scheduler.set_attribute('MY_BATCH_JOB', 'JOB_CLASS', 'BATCH_JOB_CLASS');

Cluster Ready Services (CRS)

CRS is Oracle's clusterware software; you can use it with other third-party clusterware software, though this is not required (apart from on HP Tru64). CRS is started automatically when the server starts; you should only stop this service in the following situations:

• applying a patch set to $ORA_CRS_HOME
• O/S maintenance
• debugging CRS problems

CRS Administration

starting:
## Starting CRS using Oracle 10g R1
not possible
## Starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs

stopping:
## Stopping CRS using Oracle 10g R1
srvctl stop database -d <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop
## Stopping CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl stop crs

disabling/enabling (stops CRS restarting after a reboot, i.e. basically permanent over reboots):
## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]
## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs

checking:
$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600

Resource Applications (CRS Utilities)

status:
$ORA_CRS_HOME/bin/crs_stat
$ORA_CRS_HOME/bin/crs_stat -t

Resource Applications (CRS Utilities)

status
$ORA_CRS_HOME/bin/crs_stat
$ORA_CRS_HOME/bin/crs_stat -t
$ORA_CRS_HOME/bin/crs_stat -ls
$ORA_CRS_HOME/bin/crs_stat -p

Note: -t more readable display, -ls permission listing, -p parameters

create profile                   $ORA_CRS_HOME/bin/crs_profile
register/unregister application  $ORA_CRS_HOME/bin/crs_register, $ORA_CRS_HOME/bin/crs_unregister
start/stop an application        $ORA_CRS_HOME/bin/crs_start, $ORA_CRS_HOME/bin/crs_stop
resource permissions             $ORA_CRS_HOME/bin/crs_getparam, $ORA_CRS_HOME/bin/crs_setparam
relocate a resource              $ORA_CRS_HOME/bin/crs_relocate

Nodes

member number/name  olsnodes -n
local node name     olsnodes -l
activates logging   olsnodes -g

Note: the olsnodes command is located in $ORA_CRS_HOME/bin

Oracle Interfaces

display  oifcfg getif
delete   oifcfg delif -global
set      oifcfg setif -global <interface name>/<subnet>:public
         oifcfg setif -global <interface name>/<subnet>:cluster_interconnect

Global Services Daemon Control

starting  gsdctl start
stopping  gsdctl stop
status    gsdctl status

Cluster Configuration (clscfg is used during installation)

create a new configuration                      clscfg -install
upgrade or downgrade an existing configuration  clscfg -upgrade, clscfg -downgrade
add or delete a node from the configuration     clscfg -add, clscfg -delete

create a special single-node configuration for ASM  clscfg -local
brief listing of the terminology used               clscfg -concepts
used for tracing                                    clscfg -trace
help                                                clscfg -h

Note: the clscfg command is located in $ORA_CRS_HOME/bin

Cluster Name Check

print cluster name             cemutlo -n
print the clusterware version  cemutlo -w

Note: in Oracle 9i the utility was called "cemutls"; the command is located in $ORA_CRS_HOME/bin

Node Scripts

Add Node     addnode.sh
Delete Node  deletenode.sh

Note: see adding and deleting nodes

Oracle Cluster Registry (OCR)
As you already know, the OCR is the registry that contains information about:

• Node list
• Node membership mapping
• Database instance, node and other mapping information
• Characteristics of any third-party applications controlled by CRS

The file location is specified during the installation; the file pointer indicating the OCR device location is the ocr.loc file, which can be found in either of the following:

• linux - /etc/oracle
• solaris - /var/opt/oracle

The file contents look something like below; this was taken from my installation:

ocrconfig_loc=/u02/oradata/racdb/OCRFile
ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
local_only=FALSE

The OCR is important to the RAC environment and any problems must be actioned immediately; the commands below can be found in $ORA_CRS_HOME/bin.

OCR Utilities

log file        $ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log

checking        ocrcheck
                Note: will return the OCR version, total space allocated, space used, free space, the location of each device and the result of the integrity check

dump contents   ocrdump
                Note: by default it dumps the contents into a file named OCRDUMPFILE in the current directory

export/import   ocrconfig -export <file>
                ocrconfig -import <file>

backup/restore  # show backups
                ocrconfig -showbackup
                # to change the location of the backup, you can even specify an ASM disk
                ocrconfig -backuploc <path|+asm>
                # perform a manual backup; this will use the location specified by the -backuploc option
                ocrconfig -manualbackup
                # perform a restore
                ocrconfig -restore <file>
                # delete a backup
                ocrconfig -delete <file>
                Note: there are many more options, so see the ocrconfig man page

add/remove/replace
                ## add/relocate the ocrmirror file to the specified location
                ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'
                ## relocate an existing OCR file
                ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'
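Before any -replace or -delete operation it is worth taking a manual export first; a minimal sketch using the commands above (the destination path is an assumption):

## export the OCR contents, then verify the registry is healthy
ocrconfig -export /u02/oradata/racdb/ocr_export_$(date +%Y%m%d).dmp
ocrcheck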

## remove the OCR or OCRMirror file
ocrconfig -replace ocr
ocrconfig -replace ocrmirror

Voting Disk
The voting disk, as I mentioned in the architecture section, is used to resolve membership issues in the event of a partitioned cluster; the voting disk protects data integrity.

querying  crsctl query css votedisk
adding    crsctl add css votedisk <file>
deleting  crsctl delete css votedisk <file>
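In 10g the voting disk can also be backed up with a plain dd copy while the clusterware is down; a hedged sketch - the device path and block size are assumptions, not from the original:

## locate the voting disk, then copy it block for block
crsctl query css votedisk
dd if=/u02/oradata/racdb/CSSFile of=/backup/votedisk.bak bs=4k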

5. RAC Backups and Recovery

Introduction
Backups and recovery in RAC are very similar to a single instance database; I have already written an article on standard Oracle backups and recovery, so this article covers only the specific issues that surround RAC backups and recovery. In an Oracle RAC environment it is critical to make sure that all archive redolog files are located on shared storage; this is required when trying to recover the database, as you need access to all the archive redologs. Oracle RAC also supports Oracle Data Guard, thus you can have a primary database configured as a RAC and a standby database also configured as a RAC.

Backup Basics
Oracle backups can be taken hot or cold; a backup will comprise the following:

• Datafiles
• Control Files
• Archive redolog files
• Parameter files (init.ora or SPFILE)

Backups can differ depending on the size of the company:

• small company - may use tools such as tar, cpio, rsync
• medium/large company - RMAN, Veritas Netbackup
• Enterprise company - SAN mirroring with a backup option like Netbackup or RMAN

Databases have now grown to very large sizes, well over a terabyte in some cases, thus tape backups are not used in those cases; sophisticated disk mirroring has taken their place. Oracle RAC can use all the above backup technologies, but Oracle prefers you to use RMAN, Oracle's own backup solution. RMAN can be used in either a tape or disk solution, and it can even work with third-party solutions such as Veritas Netbackup. RMAN can use parallelism when recovering; the node that performs the recovery must have access to all the archived redologs, however, during recovery only one node applies the archived logs, as in a standard single instance configuration.
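As a minimal RMAN sketch (it assumes the archive redologs are on shared storage as described above; adjust parallelism and destinations to taste):

RMAN> configure device type disk parallelism 2;
RMAN> backup database plus archivelog delete input;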

Instance Recovery
In a RAC environment there are two types of recovery:

• Crash Recovery - all instances have failed, thus they all need to be recovered
• Instance Recovery - one or more instances have failed; those instances can then be recovered by the surviving instances

Redo information generated by an instance is called a thread of redo. All log files for that instance belong to this thread; an online redolog file belongs to a group and the group belongs to a thread. Details about the log group file and thread associations are stored in the control file. RAC databases have multiple threads of redo and each instance has one active thread; the threads are parallel timelines and together form a stream. A stream consists of all the threads of redo information ever recorded; the streams form the timeline of changes performed to the database.

Oracle records the changes made to a database as change vectors; each vector is a description of a single change, usually to a single block. A redo record contains one or more change vectors and is located by its Redo Byte Address (RBA), which points to a specific location in the redolog file (or thread). It consists of three components:

• log sequence number
• block number within the log
• byte number within the block

Checkpoints are the same in a RAC environment as in a single instance environment; I have discussed how to control recovery in my Oracle section and this applies to RAC as well. When a checkpoint needs to be triggered, Oracle will look for the thread checkpoint that has the lowest checkpoint SCN; all blocks in memory that contain changes made prior to this SCN, across all instances, must be written out to disk.

Crash Recovery
Crash recovery is basically the same for a single instance and a RAC environment; here is a note detailing the differences. For a single instance the following is the recovery process:

1. Crash recovery will automatically happen using the online redo logs that are current or active.
2. The starting point is the last full checkpoint; the starting point is provided by the control file and compared against the same information in the datafile headers - only the changes need to be applied.
3. The block specified in the redolog is read into the cache; if the block has the same timestamp as the redo record (SCN match) the redo is applied. The on-disk block is the starting point for the recovery; Oracle will only consider the block on disk, so the recovery is simple.
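To see the thread and group layout described above, a quick check against the standard views (a sketch; both views exist in a default installation):

SQL> select thread#, group#, sequence#, status from v$log order by thread#, group#;
SQL> select thread#, status, enabled, instance from v$thread;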

For a RAC instance the following is the recovery process:

1. A foreground process in a surviving instance detects an "invalid block lock" condition when an attempt is made to read a block into the buffer cache. This is an indication that an instance has failed (died).
2. The foreground process sends a notification to the instance's system monitor (SMON), which begins to search for dead instances. SMON maintains a list of all the dead instances and invalid block locks; once the recovery and cleanup have finished this list is updated.
3. The death of another instance is detected if the current instance is able to acquire that instance's redo thread locks, which are usually held by an open and active instance.

Oracle RAC uses a two-pass recovery, because a data block could have been modified in any of the instances (dead or alive), so it needs to obtain the latest version of the dirty block; it uses the Past Image (PI) and the Block Written Record (BWR) to achieve this in a quick and timely fashion.

Block Written Record (BWR)
The cache aging and incremental checkpoint system writes a number of blocks to disk; when the DBWR completes a data block write operation, it also adds a redo record that states the block has been written (data block address and SCN). DBWn can write block written records (BWRs) in batches, though in a lazy fashion. In RAC a BWR is written when an instance writes a block covered by a global resource, or when it is told that the past image (PI) buffer it is holding is no longer necessary. A block will not be recovered if its BWR version is greater than the latest PI in any of the buffer caches.

Past Image (PI)
A PI is a copy of a globally dirty block and is maintained in the database buffer cache; it can be created and saved when a dirty block is shipped across to another instance after setting the resource role to global, thus the PI represents the state of a dirty buffer. The GCS is responsible for informing an instance that its PI is no longer needed after another instance writes a newer (current) version of the same block; PIs are discarded when GCS posts all the holding instances that a new and consistent version of that particular block is now on disk. This is what makes RAC cache fusion work; it eliminates the write/write contention problem that existed in the OPS database. I go into more detail about PIs in my cache fusion section.

Two-pass recovery
The first pass does not perform the actual recovery; it merges and reads the redo threads to create a hash table of the blocks that need recovery and that are not known to have been written back to the datafiles. The checkpoint SCN is needed as the starting point for the recovery; all modified blocks are added to the recovery set (an organized hash table). In the second pass SMON rereads the merged redo stream (by SCN) from all threads needing recovery; the redolog entries are then compared against the recovery set built in the first pass, and any matches are applied to the in-memory buffers as in a single-pass recovery. The buffer cache is flushed and the checkpoint SCN for each thread is updated upon successful completion.

Cache Fusion Recovery
I have a detailed section on cache fusion; this section covers only the recovery side. Cache fusion recovery is only used in RAC environments, as additional steps are required, such as GRD reconfiguration, internode communication, etc. There are two types of recovery:

• Crash Recovery - all instances have failed
• Instance Recovery - one instance has failed

In both cases the threads from the failed instances need to be merged; in an instance recovery SMON performs the recovery, whereas in a crash recovery a foreground process performs the recovery. The main features (advantages) of cache fusion recovery are:

• Recovery cost is proportional to the number of failures, not the total number of nodes
• It eliminates disk reads of blocks that are present in a surviving instance's cache
• It prunes the recovery set based on the global resource lock state
• The cluster is available after an initial log scan, even before recovery reads are complete

In cache fusion the starting point for recovery of a block is its most current PI version; this could be located on any of the surviving instances, and multiple PI blocks of a particular buffer can exist.

Remastering is the term that describes the operation whereby a node attempting recovery tries to own or master the resource(s) that were once mastered by another instance prior to the failure. When one instance leaves the cluster, the GRD of that instance needs to be redistributed to the surviving nodes; RAC uses an algorithm called lazy remastering to remaster only a minimal number of resources during a reconfiguration. The entire Parallel Cache Management (PCM) lock space remains invalid while the DLM and SMON complete the following steps:

1. The IDLM master node discards locks that are held by dead instances; the space reclaimed by this operation is used to remaster locks that are held by the surviving instances for which a dead instance was the master.
2. SMON issues a message saying that it has acquired the necessary buffer locks to perform recovery.

Likewise, when an instance joins a cluster only a minimal amount of resources are remastered to the new instance; this reduces the amount of work the RAC has to perform.

Let's look at an example of what happens during remastering; let's presume the following:

• Instance A masters resources 1, 3, 5 and 7
• Instance B masters resources 2, 4, 6 and 8
• Instance C masters resources 9, 10, 11 and 12

Instance B is removed from the cluster; only the resources from instance B are evenly remastered across the surviving nodes (no resources on instances A and C are affected).

(diagrams: Before Remastering / After Remastering)

You can control the remastering process with a number of parameters:

_gcs_fast_config  enables fast reconfiguration for gcs locks (true|false)

_lm_master_weight  controls which instance will hold or (re)master more resources than the others
_gcs_resources     controls the number of resources an instance will master at a time

You can also force a dynamic remastering (DRM) of an object using oradebug:

## Obtain the OBJECT_ID from the below table
SQL> select * from v$gcspfmaster_info;

## Determine who masters it
SQL> oradebug setmypid
SQL> oradebug lkdebug -a <OBJECT_ID>

## Now remaster the resource
SQL> oradebug setmypid
SQL> oradebug lkdebug -m pkey <OBJECT_ID>

The steps of a GRD reconfiguration are as follows:

• Instance death is detected by the cluster manager
• Requests for PCM locks are frozen
• Enqueues are reconfigured and made available
• DLM recovery
• GCS (PCM lock) is remastered
• Pending writes and notifications are processed
• I pass recovery
  o The instance recovery (IR) lock is acquired by SMON
  o The recovery set is prepared and built; memory space is allocated in the SMON PGA
  o SMON acquires locks on the buffers that need recovery
• II pass recovery
  o II pass recovery is initiated; the database is partially available
  o Blocks are made available as they are recovered
  o The IR lock is released by SMON; recovery is then complete
  o The system is available

Graphically it looks like below:

(diagram: the GRD reconfiguration sequence)

6. RAC Performance

RAC Performance
I have already discussed basic Oracle tuning; in this section I will mainly discuss Oracle RAC tuning. First let's review the best practices of an Oracle design regarding the application and database:

• Optimize connection management; ensure that the middle tier and the programs that connect to the database are efficient in connection management and do not log on or off repeatedly
• Tune the SQL using the available tools such as ADDM and the SQL Tuning Advisor
• Ensure that applications use bind variables; cursor_sharing was introduced to solve this problem
• Use packages and procedures (because they are compiled) in place of anonymous PL/SQL blocks and big SQL statements
• Use locally managed tablespaces and automatic segment space management to help performance and simplify database administration
• Use automatic undo management and temporary tablespaces to simplify administration and increase performance
• Ensure you use large caching when using sequences, unless you cannot afford to lose sequence values during a crash (see the sketch below)
• Avoid using DDL in production; it increases invalidations of the already parsed SQL statements and they need to be recompiled
• Partition tables and indexes to reduce index leaf contention (buffer busy global cr problems)
• Optimize contention on data blocks (hot spots) by avoiding small tables with too many rows in a block

Now we can review RAC-specific best practices:

• Consider using application partitioning (see below)
• Consider restricting DML-intensive users to using one instance, thus reducing cache contention
• Keep read-only tablespaces away from DML-intensive tablespaces; they only require minimum resources, thus optimizing Cache Fusion performance
• Avoid auditing in RAC; this causes more shared library cache locks
• Use full table scans sparingly; they cause the GCS to service lots of block requests, see table v$sysstat column "table scans (long tables)"
• If the application uses lots of logins, increase the cache value of the sys.audsess$ sequence
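As a sketch of the sequence advice above (the sequence name is hypothetical); a large cache plus NOORDER avoids cross-instance coordination on every nextval:

SQL> create sequence order_id_seq cache 1000 noorder;
SQL> alter sequence my_user.app_seq cache 1000;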

Partitioning Workload
Workload partitioning means that a certain type of workload is executed on a certain instance; that is, partitioning allows the users who access the same set of data to log on to the same instance. This limits the amount of data that is shared between instances, thus saving the resources used for messaging and Cache Fusion data block transfer. You should consider the following when deciding whether to implement partitioning:
• If the CPUs and private interconnect are of high performance then there is no need to partition
• Partitioning does add complexity, thus if you can increase CPU and interconnect performance instead, so much the better
• Only partition if performance is being impacted
• Test both partitioning and non-partitioning to see what difference it makes, then decide if partitioning is worth it

RAC Wait Events
An event is an operation or particular function that the Oracle kernel performs on behalf of a user or an Oracle background process; events have specific names. Whenever a session has to wait for something, the wait time is tracked and charged to the event associated with that wait; events that are associated with all such waits are known as wait events. There are a number of wait classes:

• Commit
• Scheduler
• Application
• Configuration
• User I/O
• System I/O
• Concurrency
• Network
• Administrative
• Cluster
• Idle
• Other

There are over 800 different events spread across the above classes, however you will probably only deal with about 50 or so that can improve performance. When a session requests access to a data block it sends a request to the lock master for proper authorization; the request does not know whether it will receive the block via Cache Fusion or permission to read it from disk. Two placeholder events:

• global cache cr request (consistent read - cr)
• global cache curr request (current - curr)

These placeholder events keep track of the time a session spends in this state. There are a number of types of wait events regarding access to a data block:

gc current block 2-way (write/write contention)
An instance requests authorization for a block to be accessed in current mode in order to modify it; the instance mastering the resource receives the request. The master has the current version of the block and sends the current copy to the requestor via Cache Fusion, keeping a Past Image (PI). If you get this then do the following:

• Analyze the contention; check the segments in the "current blocks received" section of the AWR
• Use an application partitioning scheme
• Make sure the system has enough CPU power
• Make sure the interconnect is as fast as possible
• Ensure that the socket send and receive buffers are configured correctly

gc current block 3-way (write/write contention)
An instance requests authorization for a block to be accessed in current mode in order to modify it; the instance mastering the resource receives the request and forwards it to the current holder of the block, asking it to relinquish ownership. The holding instance sends a copy of the current version of the block to the requestor via Cache Fusion and transfers the exclusive lock to the requesting instance; it also keeps a Past Image (PI). Use the same actions as above to increase performance.

gc current block 2-way (write/read contention)
The difference from the write/write case above is that the holder sends a copy of the block, thus keeping the current copy.

gc current block 3-way (write/read contention)
The difference from the write/write case above is that the holder sends a copy of the block, thus keeping the current copy.

gc current block busy (write/write contention)
The requestor will eventually get the block via Cache Fusion but it is delayed, due to one of the following:

• The block was being used by another session on another instance
• The block transfer was delayed because the holding instance could not write the corresponding redo record immediately

If you get this then ensure that the log writer is tuned.

gc current buffer busy (local contention)
This is the same as above (gc current block busy); the difference is that another session on the same instance has also requested the block (hence local contention).

gc current block congested (no contention)
This is caused by heavy congestion on the GCS, meaning CPU resources are stretched.
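To see which of these events are actually costing you time, a quick cluster-wide check (a sketch; in 10g the global cache events are prefixed "gc"):

SQL> select inst_id, event, total_waits, time_waited
     from gv$system_event
     where event like 'gc%'
     order by time_waited desc;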

Enqueue Tuning
Oracle RAC uses a queuing mechanism to ensure proper use of shared resources; it is called Global Enqueue Services (GES). Enqueue wait is the time spent by a session waiting for a shared resource; here are some examples of enqueues:

• updating the control file (CF enqueue)
• updating an individual row (TX enqueue)
• exclusive lock on a table (TM enqueue)

Some enqueues are managed by the instance itself, others are used globally; GES is responsible for coordinating the global resources. The formula used to calculate the number of enqueue resources is as below:

GES Resources = DB_FILES + DML_LOCKS + ENQUEUE_RESOURCES + PROCESSES + TRANSACTIONS x (1 + (N - 1)/N)

N = number of RAC instances

## displaying enqueue stats
SQL> column current_utilization heading current
SQL> column max_utilization heading max_usage
SQL> column initial_allocation heading initial
SQL> column resource_limit format a23;
SQL> select * from v$resource_limit;
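The inputs to the formula above come straight from the parameter file; a quick way to gather them (a sketch; all five are standard 10g parameters):

SQL> select name, value from v$parameter
     where name in ('db_files','dml_locks','enqueue_resources','processes','transactions');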

AWR and RAC
I have already discussed AWR in a single instance environment, so for a quick refresh take a look and come back here to see how you can use it in a RAC environment. From a RAC point of view there are a number of RAC-specific sections that you need to look at in the AWR; in the report section is an AWR report of my home RAC environment, you can view the whole report here.

Section                               Description
Number of Instances                   lists the number of instances from the beginning and end of the AWR report
Instance global cache load profile    global information about the interinstance cache fusion data block and messaging traffic
Global cache efficiency percentage    shows how the instance is getting all the data blocks it needs; the best order is local cache, remote cache, then disk. The first two give the cache hit ratio for the instance; you are looking for a value of less than 10% coming from disk, and if you are getting higher values then you may consider application partitioning
GCS and GES workload characteristics  contains timing statistics for global enqueue and global cache; as a general rule, all timings related to CR (Consistent Read) block processing should be less than 10 msec and all timings related to CURRENT block processing should be less than 20 msec
Messaging statistics                  the first section relates to sending a message and should be less than 1 second; the second section details the breakup of direct and indirect messages - direct messages are sent by an instance foreground or user process to remote instances, indirect messages are not urgent and are pooled and sent
Service statistics                    shows the resources used by all the services the instance supports
Service wait class statistics         summarizes the waits in different categories for each service
Top 5 CR and current blocks           contains the names of the top 5 contentious segments (table or index); if a table or index has a very high percentage of CR and current block transfers you need to investigate. This is pretty much like a normal single instance

Because my own AWR report has a lightweight profile, here is an example from a more heavily used RAC:

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~
                                  Per Second    Per Transaction
Global Cache blocks received:         315.37              12.82
Global Cache blocks served:           240.30               9.67
GCS/GES messages received:            525.16              20.81
GCS/GES messages sent:                765.32              30.91

The first two statistics indicate the number of blocks transferred to or from this instance, thus if you are using an 8K block size:

Sent:     240 x 8,192 = 1,966,080 bytes/sec = 2.0 MB/sec
Received: 315 x 8,192 = 2,580,480 bytes/sec = 2.6 MB/sec

The DBWR Fusion writes statistic indicates the number of times the local DBWR was forced to write a block to disk due to remote instances; this number should be low. To determine the amount of network traffic generated due to messaging you first need to find the average message size (this was 193 on my system):

select sum(kjxmsize * (kjxmrcv + kjxmsnt + kjxmqsnt)) / sum((kjxmrcv + kjxmsnt + kjxmqsnt)) "avg Message size"
from x$kjxm
where kjxmrcv > 0 or kjxmsnt > 0 or kjxmqsnt > 0;

then calculate the amount of messaging traffic on the network:

193 x (765 + 525) = 248,970 bytes/sec, roughly 0.25 MB/sec

To calculate the total network traffic generated by cache fusion:

= 2.0 + 2.6 + 0.25 = approximately 5 MBytes/sec
= 5 x 8 = 40 Mbits/sec

7. Cluster Interconnect

As I stated above, the interconnect is a critical part of the RAC; you must make sure that it is on the best hardware you can buy. You can confirm that the interconnect is being used in Oracle 9i and 10g by using the oradebug command to dump information out to a trace file; in Oracle 10g R2 the cluster interconnect is also recorded in the alert.log file.

SQL> oradebug setmypid
SQL> oradebug ipc

Note: look in the user_dump_dest directory, the trace will be there; you can view my information from here.
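Complementing the AWR snapshot deltas above, the cumulative block-transfer counters since instance startup are visible in v$sysstat; a sketch, assuming the 10g statistic names:

SQL> select name, value from v$sysstat
     where name in ('gc cr blocks received','gc current blocks received',
                    'gc cr blocks served','gc current blocks served');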

Global Resource Directory (GRD) . This is blocks pretty much like a normal single instance. you must make sure that this is on the best hardware you can buy. You can confirm that the interconnect is being used in Oracle 9i and 10g by using the command oradebug to dump information out to a trace file. SQL> oradebug setmypid SQL> oradebug ipc Note: look in the user_dump_dest directory. you can view my information from here.log file.segements current and Current block transfers you need to investigate. the trace will be there interconnect 7. Cluster Interconnect As I stated above the interconnect it a critical part of the RAC. in Oracle 10g R2 the cluster interconnect is also contained in the alert.

GRD Introduction
The RAC environment includes many resources, such as multiple versions of data block buffers in the buffer caches, held in different modes. Oracle uses locking and queuing mechanisms to coordinate lock resources, data and interinstance data requests. Resources such as data blocks and locks must be synchronized between nodes as the nodes within a cluster acquire and release ownership of them. The synchronization provided by the Global Resource Directory (GRD) maintains cluster-wide concurrency of the resources and in turn ensures the integrity of the shared data. Synchronization is also required for buffer cache management, as the buffer cache is divided into multiple caches and each instance is responsible for managing its own local version. Copies of data are exchanged between nodes; this is sometimes referred to as the global cache, but in reality each node's buffer cache is separate and copies of blocks are exchanged through a traditional distributed locking mechanism.

Global Cache Services (GCS) maintain cache coherency across the buffer cache resources, and Global Enqueue Services (GES) control resource management across the cluster's non-buffer cache resources; the global cache together with GES forms the GRD. The lock and resource structures for instance locks reside in the GRD (also called the DLM), a dedicated area within the shared pool. Each instance maintains a part of the GRD in its SGA; the GCS and GES nominate one instance to manage all information about a particular resource - this becomes the resource master. Each instance knows which instance masters which resource. Details about the data block resources and cached versions are maintained by GCS; additional details, such as the location of the most current version, the state of the buffer, the role of the data block (local or global) and ownership, are maintained by GES.

Cache Coherency
Cache coherency identifies the most up-to-date copy of a resource, also called the master copy; it uses a mechanism by which multiple copies of an object are kept consistent between Oracle instances. Parallel Cache Management (PCM) ensures that the master copy of a data block is stored in one buffer cache and that consistent copies of the data block are stored in the other buffer caches; the process LCKx is responsible for this task.

Resources and Enqueues
A resource is an identifiable entity - it has a name or reference. The referenced entity is usually a memory region, a disk file, a data block or an abstract entity. A resource can be owned or locked in various states (exclusive or shared); all resources are lockable. A global resource is visible throughout the cluster, while a local resource can only be used by the instance it is local to.

Each resource can have a list of locks, called the grant queue, that are currently granted to users; a convert queue is a queue of locks that are waiting to be converted to a particular mode - conversion is the process of changing a lock from one mode to another. Locks are placed on a resource's grant or convert queue; if a lock changes it moves between the queues. The grant queue and convert queue are associated with each and every resource that is managed by the GES. A resource also has a lock value block (LVB). The Global Resource Manager (GRM) keeps the lock information valid and correct across the cluster.

A lock leaves the convert queue under the following conditions:

• The process requests the lock termination (it removes the lock)
• The process cancels the conversion; the lock is moved back to the grant queue in the previous mode
• The requested mode is compatible with the most restrictive lock in the grant queue and with all the previous modes of the convert queue, and the lock is at the head of the convert queue

Convert requests are processed on a FIFO basis.

Enqueues are basically locks that support queuing mechanisms and that can be acquired in different modes; an enqueue can be held in exclusive mode by one process while others hold a non-exclusive mode, depending on the type. Enqueues are the same in RAC as they are in a single instance.

Global Enqueue Services (GES)
GES coordinates the requests for all global enqueues; it also deals with deadlocks and timeouts. Enqueues are shared structures that serialize access to database resources; they support multiple modes, are held longer than latches, and protect persistent objects such as tables or library cache objects. There are two types of local locks, latches and enqueues; latches do not affect the cluster, only the local instance, whereas enqueues can affect both the cluster and the instance.

GES locks control access to the data files (not the data blocks) and control files, and also serialize interinstance communication; they also control the library caches and the dictionary cache. Examples of these are DDL, DML enqueue table locks, transaction enqueues and DDL (dictionary) locks. The SCN and mount lock are also global locks. Transaction and row locks are the same as in a single instance database - the only difference is that the enqueues are global enqueues; take a look at locking for an in-depth view of how Oracle locking works.

Global Locks
Each node has information for a set of resources; Oracle uses a hashing algorithm to determine which nodes hold the directory tree information for a resource. Global locks are mainly of two types:

• locks used by the GCS for buffer cache management, called PCM locks
• global locks (global enqueues) that Oracle synchronizes within a cluster to coordinate non-PCM resources

An instance owns a global lock that protects a resource (i.e. a data block or data dictionary entry) when the resource enters the instance's SGA. Even a NULL is a lock. Enqueues can use any of the following modes:

Mode  Summary              Description
NULL  Null                 no access rights; a lock is held at this level to indicate that a process is interested in a resource
SS    SubShared            the resource can be read in an unprotected fashion; other processes can read and write to the resource. This lock is also known as a row share lock
SX    Shared Exclusive     the resource can be read and written to in an unprotected fashion; this is also known as a RX (row exclusive) lock
S     Shared               a process cannot write to the resource, but multiple processes can read it. This is the traditional share lock
SSX   SubShared Exclusive  only one process can hold a lock at this level; this makes sure that only one process at a time can modify the resource. Other processes can perform unprotected reads. This is also known as a SRX (shared row exclusive) table lock
X     Exclusive            grants the holding process exclusive access to the resource; other processes cannot read or write to it. This is the traditional exclusive lock
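To see which enqueues sessions are holding or waiting for across the whole cluster, a hedged example using gv$lock (the cluster-wide version of v$lock; the type list here is just the three examples from above):

SQL> select inst_id, sid, type, id1, id2, lmode, request
     from gv$lock
     where type in ('TX','TM','CF');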

Messaging
GES uses messaging for interinstance communication; the difference between RAC and single instance messaging is that RAC uses the high speed interconnect whereas a single instance uses shared memory and semaphores (interrupts are used when one or more processes want to use the processor in a multiple CPU architecture). The GRD is updated when locks are required; this is done by messages and asynchronous traps (ASTs). Both LMON and LMD use messages to communicate with other instances. The messaging traffic can be viewed using the view V$GES_MISC.

A three-way lock message involves up to a maximum of three instances: the Master instance (M), the Holding instance (H) and the Requesting instance (R). The sequence is detailed below, where requesting instance R is interested in block B1 held by holding instance H; the resource is mastered by master instance M.

1. Instance R gets the ownership information about the resource from the GRD; instance R then sends a message to the master instance M requesting access to the resource. This message is sent by a direct send as it is critical.
2. Instance M receives the message and forwards it to the holding instance H; this is known as a blocking asynchronous trap (BAST). This is also sent directly.
3. Instance H sends the resource to instance R via the interconnect, and the resource is copied into instance R's memory; this is called an acquisition asynchronous trap (AAST).
4. Once the lock handle is obtained on the resource, instance R sends an acknowledgment to instance M. This message is queued as it is not critical.

Because GES relies heavily on messaging, the interconnect must be of high quality (high performance, low latency); the messages are also kept small (128 bytes) to increase performance. The Traffic Controller (TRFC) is used to control the DLM traffic between the instances in the cluster; it uses buffering to accommodate large volumes of traffic. The TRFC keeps track of everything by using tickets (sequence numbers); there is a predefined pool of tickets, dependent on the network send buffer size. A ticket is obtained before sending any message, and once sent the ticket is returned to the pool; LMS or LMD perform this. If there are no tickets then the message has to wait until a ticket is available.

You can control the number of tickets and view their usage:

system parameter  _lm_tickets
                  _lm_ticket_active_sendback (used for aggressive messaging)

ticket usage      select local_nid local, remote_nid remote, tckt_avail avail, tckt_limit limit,
                         tckt_wait waiting, snd_q_len send_queue
                  from v$ges_traffic_controller;

dump ticket information
                  SQL> oradebug setmypid
                  SQL> oradebug unlimit
                  SQL> oradebug lkdebug -t
                  Note: the output can be viewed here

Global Cache Services (GCS)
GCS locks only protect data blocks in the global cache (also known as PCM locks); a lock can be acquired in share or exclusive mode. Each lock element can have the lock role set to

. it is global if the block is dirty in a remote cache or in several remote caches. In the local role only S and X modes are permitted. it just converts from one lock to another. reading of data does not require a instance to Shared (S) disown a global lock. This Null (N) mode is used so that locks need not be created and destroyed all the time. the holding instance can send copies to other instances when instructed by the master. thus the PI represents the state of a dirty buffer. S and X. If there are a number of PI's that exist. shared. It also holds a chain of cache buffer chains that are covered by the corresponding lock elements. which will help you better to understand. When in global role three lock modes are possible. GCS locks uses the following modes as stated above used during update or any DML operation. I have a complete detailed walkthough in my cache_fusion section. the node will then log a block written record (BWR). When a new current block arrives. Interested parties can only modify the block using X mode. it can be either local or global. These can be view via v$lock_element. allows instances to keep a lock without any permission on the block(s). A Past Image (PI) is kept by the instance when a block is shipped to another instance. if another instance requires the Exclusive block that has a exclusive lock it asks GES to request that he second instance (X) disown the global lock used for select operations. the block is global and it may even by dirty in any of the instances and the disk version may be obsolete. A node must keep a PI until it receives notification from the master that a write to disk has completed covering that version. If the block is modified (dirty). when requested by the master instance the holding instance serves a copy of the block to others. the role is then changed to a global role. In global role mode you can read or write to the data block only as directed by the master instance of that resource. the resource is local if the block is dirty only in the local cache. the previous PI remains untouched in case another node requires it. exclusive and null. these are called lock elements. a PI is retained and the lock becomes global. If the block is globally clean this instance lock role remains local. they may or may not merge into a single PI. the master will determine this based on if the older PI's are required. a indeterminate number of PI's can exist. The lock and state information is held in the SGA and is maintained by GCS. Lock roles are used by Cache Fusion. an instance cannot read from the disk as it may not be current. I have already discussed PI and BWR in my backup section. the parameter _db_block_hash_buckets controls the number of hash buffer chain buckets. In the global lock role lock modes can be N.either local (same as single instance) or global.

A lock element (LE) holds lock state information (converting, granting, etc). LEs are managed by the lock processes to determine the mode of the locks; they also hold a chain of the cache buffers that are covered by the LE, allowing the Oracle database to keep track of the cache buffers that must be written to disk in case an LE (mode) needs to be downgraded (X > N).

LEs protect all the data blocks in the buffer cache; the list below describes the classes of data block which are managed by the LEs using GCS locks (x$bh.class):

0  FREE
1  EXLCUR
2  SHRCUR
3  CR
4  READING
5  MRECOVERY
6  IRECOVERY
7  WRITING
8  PI

So, putting this all together: the GCS monitors and maintains the list and mode of the blocks in all the instances, and it ensures cache coherency by requiring that instances acquire a lock before modifying or reading a database block. GCS locks are not row-level locks; row-level locks are used in conjunction with PCM locks. A GCS lock ensures that the block is accessed by one instance, and row-level locks then manage concurrency at the row level. Data blocks can be kept in any of the instances' buffer caches (which together act as the global cache); if a block is not found in any cache it can be read from disk by the requesting instance. GCS manages the PCM locks in the GRD; PCM locks manage the data blocks in the global cache. If a block is modified, all Past Images (PIs) are no longer current and new copies need to be obtained.

Consistent read processing means that readers never block writers, the same as in a single instance. Reconstructing a block with an older SCN for a reader is known as CR fabrication, and if too many CR requests arrive for a particular buffer the holder can disown the lock on the buffer and write the buffer to disk; the requestor can then read it from disk. This is technically known as a fairness downconvert, and the parameter _fairness_threshold can be used to configure it. The lightwork rule is invoked when CR construction involves too much work and no current block or PI block is available in the cache for block cleanouts. One parameter that can help is _db_block_max_cr_dba, which limits the number of CR copies per data block address in the buffer cache. The below can be used to view the number of times a downconvert occurs:

select cr_requests, data_requests, light_works, fairness_down_converts
from v$cr_block_server;

Note: lower the _fairness_threshold if the downconvert ratio goes above 40%; set it to 0 if the instance is a query-only instance.

The process of maintaining information about resources is called lock mastering or resource mastering; I spoke about lock remastering in my backup section. The GRD is a central repository for locks and resources; it is distributed across all nodes, not held on a single node. Each instance masters a number of resources, but a resource can only be mastered by one instance. Normally resource mastering only happens when an instance joins or leaves the RAC environment; as of Oracle 10g R2 mastering occurs at the object level, which enables fine-grained object remastering. Resource affinity allows the mastering of frequently used resources on the local node; it uses dynamic resource mastering to move the location of the resource masters. There are a number of parameters that can be used to dynamically remaster an object:

_gc_affinity_time   specifies the interval in minutes for remastering
_gc_affinity_limit  defines the number of times an instance must access the resource

setting to 0 disable remastering defines the minimum number of times a instance access the _gc_affinity_minimum resource before remastering disables dynamic remastering for the objects belonging to _lm_file_affinity those files _lm_dynamic_remastering enable or disable remastering You should consult Oracle before changing any of the above parameters. .before remastering.

8. Cache Fusion

Introduction
I mentioned Cache Fusion above in my GRD section; here I go into great detail on how it works. You don't need this level of detail to administer a RAC environment, but it certainly helps to understand how RAC works when trying to diagnose problems. RAC appears to have one large buffer cache, but this is not the case; in reality the buffer caches of each node remain separate. Data blocks are shared through distributed locking and messaging operations; RAC copies data blocks across the interconnect to other instances, as this is more efficient than reading from disk - yes, memory and networking together are faster than disk I/O. Cache Fusion uses the most efficient communication possible to limit the amount of traffic on the interconnect. I will also provide a number of walkthrough examples from my RAC system.

Ping
The transfer of a data block from one instance's buffer cache to another instance's buffer cache is known as a ping. As mentioned already, when an instance requires a data block it sends a request to the lock master to obtain a lock in the desired mode; the notification sent to the holder is known as a blocking asynchronous trap (BAST). When an instance receives a BAST it downgrades the lock as soon as possible; however, it might first have to write the corresponding block to disk - this operation is known as a disk ping or hard ping. Disk pings have been reduced in the later versions of RAC, which rely more on block transfers, although there will always be a small amount of disk pinging. In the newer versions of RAC, when a BAST is received, sending the block or downgrading the lock may be deferred by tens of milliseconds; this extra time allows the holding instance to complete an active transaction and mark the block header appropriately,

marking it as a PI; this eliminates any need for the receiving instance to check the status of the transaction immediately after receiving/reading the block. Checking the status of a transaction is an expensive operation that may require access (and pinging) to the related undo segment header and undo data blocks as well. The parameter _gc_defer_time can be used to define the duration by which an instance defers downgrading a lock.

Past Image Blocks (PI)
In the GRD section I mentioned Past Images (PIs); basically they are copies of data blocks held in the local buffer cache of an instance. When an instance sends a block it has recently modified to another instance, it preserves a copy of that block, marking it as a PI. The PI is kept until that block is written to disk by the current owner of the block. When the block is written to disk and is known to have a global role, indicating the presence of PIs in other instances' buffer caches, GCS informs the instances holding the PIs to discard them. When a checkpoint is required, the instance informs GCS of the write requirement; GCS is responsible for finding the most current block image and informing the instance holding that image to perform the block write. GCS then informs all holders of the global resource that they can release the buffers holding the PI copies of the block, allowing the global resource to be released. You can view the past image blocks present in the fixed table X$BH:

## the rows with state 8 are the past images
select state, count(state) from x$bh group by state;

Cache Fusion I
Cache Fusion I is also known as the consistent read server and was introduced in Oracle 8.1.5. It keeps a list of recent transactions that have changed a block; the original data contained in the block is preserved in the undo segment, which can be used to provide consistent read versions of the block.

In a single instance the following happens when reading a block:

• When a reader reads a recently modified block, it might find an active transaction in the block
• The reader will need to read the undo segment header to decide whether the transaction has been committed or not
• If the transaction is not committed, the process creates a consistent read (CR) version of the block in the buffer cache, using the data in the block and the data stored in the undo segment
• If the undo segment shows the transaction is committed, the process has to revisit the block, clean out the block (delayed block cleanout) and generate the redo for the changes

In a RAC environment, if the process reading the block is on an instance other than the one that modified the block, the reader has to read the following blocks from disk:

• the data block, to get the data and/or the transaction ID and Undo Byte Address (UBA)
• the undo segment header block, to find the last undo block used for the entire transaction
• the undo data block, to get the actual record needed to construct a CR image

Before these blocks can be read, the instance modifying the block would have to write them to disk, resulting in 6 I/O operations. In RAC the holding instance can instead construct a CR copy, hopefully using the above blocks while they are still in memory, and then ship the CR copy across the interconnect, avoiding the 6 I/O operations. From Oracle 8 a new background process called the Block Server Process performs the CR fabrication in the holder's cache and ships the CR version of the block across the interconnect; the sequence is detailed below:

1. An instance sends a message to the lock manager requesting a shared lock on the block.
2. Following are the possibilities in the global cache:
   o If there is no current user for the block, the lock manager grants the shared lock to the requesting instance
   o If another instance has an exclusive lock on the block, the lock manager asks the owning instance to build a CR copy and ship it to the requesting instance
3. Based on the result, either of the following can happen:
   o If the lock is granted, the requesting instance reads the block from disk
   o The owning instance creates a CR version of the buffer in its own buffer cache and ships it to the requesting instance over the interconnect
4. The owning instance also informs the lock manager and the requesting instance that it has shipped the block.
5. The requesting instance has the lock granted; the lock manager updates the IDLM with the new holders of that resource.

While making a CR copy, the holding instance may refuse to do so if:

• it does not find any of the blocks needed in its buffer cache; it will not perform a disk read to make a CR copy for another instance
• it is repeatedly asked to send a CR copy of the same block; after sending CR copies four times it will voluntarily relinquish the lock, write the block to disk and let the other instances get the block from disk. The number of copies it will serve before doing so is governed by the parameter _fairness_threshold

Cache Fusion II
Read/write contention was addressed in Cache Fusion I; Cache Fusion II addresses write/write contention:

1. An instance sends a message to the lock manager requesting an exclusive lock on the block.
2. Following are the possibilities in the global cache:
   o If there is no current user for the block, the lock manager grants the exclusive lock to the requesting instance
   o If another instance has an exclusive lock on the block, the lock manager asks the owning instance to release the lock
3. Based on the result, either of the following can happen:
   o If the lock is granted, the requesting instance reads the block from disk
   o The owning instance sends the current block to the requesting instance via the interconnect; to guarantee recovery in the event of instance death, the owning instance writes all the redo records generated for the block to the online redolog file. It keeps a past image of the block and informs the master instance that it has sent the current block to the requesting instance
4. The lock manager updates the resource directory (GRD) with the current holder of the block.

Cache Fusion in Operation
A quick recap of GCS: a GCS resource can be local or global. If it is local it can be acted upon without consulting the other instances; if it is global it cannot be acted upon without consulting or informing the remote instances. GCS is used as the messaging agent to coordinate manipulation of a global resource. By default all resources are in NULL mode (remember, NULL mode is used to convert from one mode to another, shared or exclusive).

The table below denotes the different states of a resource:

Mode/Role      Local  Global
Null (N)       NL     NG
Shared (S)     SL     SG
Exclusive (X)  XL     XG

SL  it can serve a copy of the block to other instances and it can read the block from disk; since the block is not modified there is no need to write it to disk
XL  it has sole ownership and interest in the resource and has exclusive rights to modify the block; all changes to the block are in the local buffer cache and it can write the block to disk. If another instance wants the block it has to come via the GCS
NG  used to protect a consistent read block; the block is kept in the buffer cache with the NG role and serves only as a CR copy of the block
SG  the block is present in one or more instances; an instance can read the block from disk and serve it to other instances. If an instance wants it in X mode, the current instance sends the block to the requesting instance and downgrades its role to NL
XG  the block can have one or more PIs; the instance with the XG role has the latest copy of the block and is the most likely candidate to write the block to disk. GCS can ask the instance to write the block and to serve it to other instances; PIs are discarded when instructed by GCS

Below are a number of common scenarios to help you understand the mechanism:

• reading from disk
• reading from cache
• getting the block from cache for update
• performing an update on a block
• performing an update on the same block
• reading a block that was globally dirty
• performing a rollback on a previously updated block
• reading the block after commit

We will assume the following:

• A four-instance RAC environment (instances A, B, C and D)
• Instance D is the master of the lock resource for the data block BL
• We will only use one block, which initially resides on disk at SCN 987654
• We will use a three-letter code for the lock states:
  o the first letter indicates the lock mode - N = Null, S = Shared, X = Exclusive
  o the second letter indicates the lock role - G = Global, L = Local
  o the third digit indicates the PIs - 0 = no PIs, 1 = a PI of the block

For example, a code of SL0 means a shared lock with a local role and no past images (PIs).

Reading a block from disk
Instance C wants to read the block; it will request a lock in share mode from the master instance:

1. Instance C requests the block by sending a shared lock request to master D
2. The block has never been read into the buffer cache of any instance and it is not locked; master D grants the lock to instance C. The lock granted is SL0 (see above to work out the three-letter code)
3. Instance C reads the block from the shared disk into its buffer cache
4. Instance C now has the block in shared mode

Reading a block from the cache
Carrying on from the above example, instance B wants to read the same block, which is cached in instance C's buffer:

1. Instance B sends a shared lock request to master instance D
2. The lock master knows that the block may be available at instance C and sends a ping message to instance C
3. Instance C sends the block to instance B via the interconnect; along with the block, instance C indicates that instance B should take the current lock mode and role from instance C; instance C keeps a copy of the block
4. Instance B sends a message to instance D that it has assumed the SL lock for the block. This message is not critical for the lock manager, so it is sent asynchronously

Getting a (cached) clean block for update

Carrying on from the above example, instance A wants to modify the same block, which is already cached in instances B and C (block SCN 987654):

1. Instance A sends an exclusive lock request to master D
2. The lock master knows that the block may be available at instance B in SCUR mode and at instance C in CR mode, so it sends a ping message to the shared lock holders. The most recent access was at instance B, so instance D sends a BAST message to instance B
3. Instance B sends the block to instance A via the interconnect and closes its shared lock. The block may still be in its buffer as a CR copy, but all locks are released
4. Instance A modifies the block in its buffer cache; the changes are not committed and the block has not been written to disk, so the SCN remains at 987654
5. Instance A now has the exclusive lock on the block (XL0) and sends an assume message to instance D

Getting a (cached) modified block for update and commit
Carrying on from the above example, instance C now wants to modify the block. If it tried to modify the same row it would have to wait until instance A either commits or rolls back; in this case, however, instance C wants to modify a different row in the same block:

1. Instance C sends an exclusive lock request to master D
2. The lock master knows that instance A holds an exclusive lock on the block and hence sends a ping message to instance A
3. Instance A sends the dirty buffer to instance C via the interconnect and downgrades its lock from XCUR to NULL. Before shipping the block, instance A has to create a PI image and flush any pending redo for the block change; the block mode on instance A is now NG1
4. Instance C modifies the block and issues a commit; the SCN is now 987660. Instance C sends a message to instance D indicating that it has the block in exclusive mode. The block role G indicates that the block is in global mode, and if instance C needs to write the block to disk it must coordinate this with the other instances that hold past images (PIs) of the block

Commit the previously modified block and select the data
Carrying on from the above example, instance A now issues a commit to release the row-level locks held by the transaction and to flush the redo information to the redologs:
1. The lock status remains the same as in the previous state; the change vectors for the commit are written to the redologs
2. Commit operations do not require any synchronous modifications to the block

Write the dirty buffers to disk due to a checkpoint
Carrying on from the above example, instance B writes the dirty blocks from its buffer cache due to a checkpoint (this is where it gets interesting and very clever):
1. Instance B sends a write request to master D with the necessary SCN
2. The master knows that the most recent copy of the block may be available at instance C and hence sends a message to instance C asking it to write
3. Instance C initiates a disk write and writes a BWR (block written record) into the redolog file
4. Instance C writes the modified block to the disk
5. Instance C gets the notification that the write is complete and notifies the master that the write is completed
6. On receipt of the notification, instance D tells all PI holders to discard their PIs; all instances that have previously modified this block will also have to write a BWR
7. The write request by instance B has now been satisfied and instance B can proceed with its checkpoint as usual

Master instance crashes
Carrying on from the above example:
1. The master instance D crashes
2. The Global Resource Directory is momentarily frozen and the resources held by master instance D are equally distributed among the surviving nodes; this is also known as remastering (see remastering for more details)

Select the rows from instance A
Carrying on from the above example, instance A now queries the rows from that table to get the most recent data:
1. Instance A sends a shared lock request to the new master, instance C
2. Master C knows that the most recent copy of the block may be in instance C itself and asks the holder to ship the CR block to instance A
3. Instance C ships the CR block to instance A via the interconnect

The above sequence of events can be seen in the table below:

Example  Operation                   Node   Buffer status: A      B      C      D
1        read block from disk        C                     -      -      SCUR   -
2        read the block from cache   B                     -      SCUR   CR     -
3        update the block            A                     XCUR   CR     CR     -
4        update the same block       C                     PI     CR     XCUR   -
5        commit the changes          A                     PI     CR     XCUR   -
6        select the rows             A                     PI,CR  CR     XCUR   -
7        trigger checkpoint          B                     CR     CR     XCUR   -
8        instance crash              D                     CR     CR     XCUR   -
9        select the rows             A                     CR     CR     XCUR   -

9. RAC Troubleshooting

Troubleshooting
This is the one section that will be updated frequently as my experience with RAC grows. As RAC has been around for a while, most problems can be resolved with a simple Google lookup, but a basic understanding of where to look for the problem is still required; in this section I will point you to where to look.

The cluster itself has a number of log files that can be examined to gain insight into occurring problems; the table below describes the information that you may need of the CRS components:

$ORA_CRS_HOME/crs/log  : contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init : contains trace files for the CRS daemon during startup; a good place to start
$ORA_CRS_HOME/css/log  : contains cluster reconfigurations, missed check-ins, and connects and disconnects from the client CSS listener; look here to determine when reboots occur
$ORA_CRS_HOME/css/init : contains core dumps from the cluster synchronization service daemon (OCSSd)
$ORA_CRS_HOME/evm/log  : log files for the event volume manager and eventlogger daemon
$ORA_CRS_HOME/evm/init : pid and lock files for EVM
$ORA_CRS_HOME/srvm/log : log files for the Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log      : log files for Oracle clusterware, which contain diagnostic messages at the Oracle cluster level

As in a normal single-instance environment, a RAC environment contains the standard RDBMS log files; these files are located by the parameter background_dump_dest. The most important of these are:

$ORACLE_BASE/admin/udump : contains any trace files generated by a user process
$ORACLE_BASE/admin/cdump : contains core files that are generated due to a core dump in a user process

Alert logs contain startup and shutdown information, nodes joining and leaving the cluster, etc.; every instance in the cluster has its own alert log, which is where you would start to look.

Now let's look at a two-node startup and the sequence of events. First you must check that the RAC environment is using the correct interconnect; this can be done by any of the following methods:

logfile:
## The location of my alert log, yours may be different
/u01/app/oracle/admin/racdb/bdump/alert_racdb1.log

oifcfg command:
oifcfg getif

table check:
select inst_id, picked_ksxpia, pub_ksxpia, ip_ksxpia from x$ksxpia;

oradebug:
SQL> oradebug setmypid
SQL> oradebug ipc
Note: check the trace file, which can be located by the parameter user_dump_dest

system parameter:
cluster_interconnects
Note: used to specify which address to use

When the instance starts up, the Lock Monitor's (LMON) job is to register with the Node Monitor (NM). Remember, when a node joins or leaves the cluster the GRD undergoes a reconfiguration event; as seen in the logfile it is a seven-step process (see below for more details on the seven-step process).

The LMON trace file also has details about reconfigurations, and it states the reason for the event:

reason 1 : the NM initiated the reconfiguration event, typical when a node joins or leaves a cluster
reason 2 : an instance has died. How does RAC detect an instance death? Every instance updates the controlfile with a heartbeat through its checkpoint process (CKPT); if the heartbeat information is missing for a set amount of time (by default 7 seconds), the instance is considered to be dead and the Instance Membership Recovery (IMR) process initiates reconfiguration
reason 3 : a communication failure of one or more nodes. Messages are sent across the interconnect, and if a message is not received within a set amount of time then a communication failure is assumed; by default UDP is used, which can be unreliable, so keep an eye on the logs if too many reconfigurations happen for reason 3

Here is an example of a reconfiguration, taken from the alert log of my two-node RAC starting up:

Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes: 0 1
 Global Resource Directory frozen
 * allocate domain 0, invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Mar 20 11:35:53 2010
 LMS 0: 0 GCS shadows cancelled, 0 closed
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Sat Mar 20 11:35:53 2010
 LMS 0: 0 GCS shadows traversed, 3291 replayed
Sat Mar 20 11:35:53 2010
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete

Note: when a reconfiguration happens the GRD is frozen until the reconfiguration is completed.

Confirm that the database has been started in cluster mode; the log file will state the following:

Sat Mar 20 11:36:02 2010
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT

Starting with 10g the SCN is broadcast across all nodes; a broadcast method is used after a commit operation. This is different from a single-instance environment: the system has to wait until all nodes have seen the commit SCN, and the method is more CPU intensive as the SCN has to be broadcast for every commit, but the other nodes can see the committed SCN immediately. You can change the broadcast method using the system parameter _lgwr_async_broadcasts.

Lamport Algorithm: the Lamport algorithm generates SCNs in parallel, and they are assigned to transactions on a first come, first served basis.

The initialization parameter max_commit_propagation_delay limits the maximum delay allowed for SCN propagation; when it is set to less than 100, the broadcast-on-commit algorithm is used.
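If you need to check or change the SCN scheme, the parameter above can be queried and set; a minimal sketch (the value 0 simply forces broadcast on commit, as does any value below 100; max_commit_propagation_delay is a static parameter, so all instances must be restarted):

SQL> show parameter max_commit_propagation_delay
SQL> alter system set max_commit_propagation_delay = 0 scope = spfile sid = '*';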

Disable/Enable Oracle RAC

There are times when you may wish to disable RAC; this feature can only be used in a Unix environment (there is no Windows option).

Disable Oracle RAC (Unix only)
1. Log in as oracle on all nodes
2. Shut down all instances using either the normal or immediate option
3. Change to the working directory $ORACLE_HOME/rdbms/lib
4. Run the below make command to relink the Oracle binaries without the RAC option (this should take a few minutes):
make -f ins_rdbms.mk rac_off
5. Now relink the Oracle binaries:
make -f ins_rdbms.mk ioracle

Enable Oracle RAC (Unix only)
1. Log in as oracle on all nodes
2. Shut down all instances using either the normal or immediate option
3. Change to the working directory $ORACLE_HOME/rdbms/lib
4. Run the below make command to relink the Oracle binaries with the RAC option (this should take a few minutes):
make -f ins_rdbms.mk rac_on
5. Now relink the Oracle binaries:
make -f ins_rdbms.mk ioracle

Performance Issues

Oracle can suffer a number of different performance problems, which can be categorized as follows:
• Hung database
• Hung session(s)
• Overall instance/database performance
• Query performance

A hung database is basically an internal deadlock between two processes; this usually happens because of contention problems. Usually Oracle will detect the deadlock and roll back one of the processes; however, if the situation occurs with internal kernel-level resources (latches or pins), Oracle is unable to automatically detect and resolve the deadlock, and the database hangs. When this event occurs you must obtain dumps from each of the instances (3 dumps per instance, at regular intervals); a systemstate dump is normally used to analyze this problem. Be warned: the trace files will be very large.

## Using alter session
SQL> alter session set max_dump_file_size = unlimited;
SQL> alter session set events 'immediate trace name systemstate level 10';

## Using oradebug
SQL> select * from dual;
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 10

## Using oradebug from another instance
SQL> select * from dual;
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all dump systemstate 10

Note: the select statement above is to avoid problems on pre-8 Oracle

## If you get problems connecting with SQL*Plus, use the below
$ sqlplus -prelim
Enter user-name: / as sysdba

A severe performance problem can be mistaken for a hang. A systemstate dump takes a long time to complete and it also has a number of limitations:
• It reads the SGA in a dirty manner, so it may be inconsistent
• It usually dumps a lot of information
• It does not identify interesting processes on which to perform additional dumps
• It can be a very expensive operation if you have a large SGA

To overcome these limitations, a new utility command called hanganalyze was released with 8i; it provides clusterwide information in a RAC environment in a single shot:

## sql method
SQL> alter session set events 'immediate trace hanganalyze level <level>';

## using oradebug
SQL> oradebug hanganalyze <level>

## Another way using oradebug
SQL> oradebug setmypid
SQL> oradebug setinst all
SQL> oradebug -g def hanganalyze <level>

Note: you will be told where the output will be dumped to

hanganalyze levels:
1-2 : only hanganalyze output, no process dump at all
3   : Level 2 + dump only processes thought to be in a hang (IN_HANG state)
4   : Level 3 + dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP states)
5   : Level 4 + dump all processes involved in wait chains (NLEAF state)
10  : dump all processes (IGN state)

The hanganalyze command uses internal kernel calls to determine whether a session is waiting for a resource and reports the relationship between blockers and waiters; a systemstate dump is more thorough, but if you are overwhelmed try hanganalyze first.

Debugging Node Eviction

A node is evicted from the cluster after it kills itself because it is not able to service the applications; this generally happens when you have communication problems. For node eviction problems look for ora-29740 errors in the alert log file and the LMON trace files.

To understand eviction problems you need to know the basics of node membership and how instance membership recovery (IMR) works. IMR is part of the service offered by Cluster Group Services (CGS). When a node has a problem, IMR will remove it from the cluster, as otherwise data corruption can happen; IMR ensures that the larger part of the cluster survives and kills any remaining nodes.

The Node Monitor (NM) provides information about nodes and their health by registering and communicating with the Cluster Manager (CM); this works at the cluster level and can work with 3rd-party software (Sun Cluster, Veritas Cluster). LMON handles many of the CGS functionalities and will let other nodes know of any changes in membership; for example, if a node joins or leaves the cluster, the bitmap is rebuilt and communicated to all nodes (node membership is represented as a bitmap in the GRD).

Node registering (alert log):
lmon registered with NM - instance id 1 (internal mem no 0)
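A quick, hedged way of hunting for eviction errors, reusing the bdump path shown elsewhere in this section (adjust the path and SID for your own environment):

$ grep -il "ora-29740" /u01/app/oracle/admin/racdb/bdump/*lmon*.trc
$ grep -i  "ora-29740" /u01/app/oracle/admin/racdb/bdump/alert_racdb1.log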

One thing to remember is that all nodes must be able to read from and write to the controlfile. The CKPT process updates the controlfile every 3 seconds in an operation known as the heartbeat; it writes into a single block that is unique for each instance, called the checkpoint progress record, thus intra-instance coordination is not required. You can see the controlfile records using the gv$controlfile_record_section view.

I have already discussed the voting disk in my architecture section; as stated above, membership is held as a bitmap in the GRD. The CGS contains an internal database of all the members/instances in the cluster, with all their configuration and servicing details. CGS makes sure that members are valid, and it uses a voting mechanism to check the validity of each member: all members attempt to obtain a lock on the controlfile record for updating; the instance that obtains the lock tallies the votes from all members; the vote result is stored in the same block as the heartbeat in the controlfile checkpoint progress record; and the group membership must conform to the decided (voted) membership before the GCS/GES reconfiguration is allowed to proceed.

A cluster reconfiguration is performed in 7 steps:
1. The name service is frozen; instance names and uniqueness are verified
2. The lock database (IDLM) is frozen; this prevents processes from obtaining locks on resources that were mastered by the departing/dead instance
3. Determination of membership, validation and IMR
4. The bitmap rebuild takes place; GCS must synchronize the cluster to be sure that all members get the reconfiguration event and that they all see the same bitmap
5. All dead instance entries are deleted and all newly configured names are republished
6. The name service is unfrozen and released for use
7. Reconfiguration is handed over to GES/GCS

Debugging CRS and GSD

Oracle server management configuration tools include a diagnostic and tracing facility that provides verbose output for SRVCTL, GSD, GSDCTL and SRVCONFIG. To capture diagnostics, follow the below:
1. Use vi to edit the gsd.sh/srvctl/srvconfig file in the $ORACLE_HOME/bin directory
2. At the end of the file, look for the below line:
exec $JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME
3. Add the following just before the -classpath in the exec $JRE line:
-DTRACING.ENABLED=true -DTRACING.LEVEL=2
4. The line should now look like this:
exec $JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath ...

In Oracle Database 10g, setting the below variable accomplishes the same thing (set it to blank to remove the debugging):
Enable tracing:  $ export SRVM_TRACE=true
Disable tracing: $ export SRVM_TRACE=""
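The checkpoint progress record described above can be inspected through the same controlfile view; a minimal sketch (the record type string is the standard one):

SQL> select type, record_size, records_total, records_used
     from v$controlfile_record_section
     where type = 'CHECKPOINT PROGRESS';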

10. Adding and Removing Nodes

Adding and removing nodes
One of the jobs of a DBA is adding and removing nodes from a RAC environment when capacity demands; although you should add a node of a similar spec, it is possible to add a node of a higher or lower spec.

I am going to presume we have a two-node RAC environment already set up and that we are going to add a third node. The first stage is to configure the operating system and make sure any necessary drivers are installed; also make sure that the new node can see the shared disks available to the existing RAC.

Pre-Install Checking
You used the Cluster Verification utility when installing the RAC environment; the tool checks that the node has been properly prepared for a RAC deployment. You can run the command either from the new node or from any of the existing nodes in the cluster:

pre-install check run from the new node:
runcluvfy.sh stage -pre crsinst -n rac1,rac2,rac3 -r 10gR2

pre-install check run from an existing node:
cluvfy stage -pre crsinst -n rac1,rac2,rac3 -r 10gR2

Make sure that you fix any highlighted problems before continuing.

Install CRS
Cluster Ready Services (CRS) should be installed first; this allows the node to become part of the cluster. Adding the new node can be started from any of the existing nodes:
1. Log into any of the existing nodes as user oracle, then run the below command; the script starts the OUI GUI tool, which hopefully will already see the existing cluster and fill in the details for you:
$ORA_CRS_HOME/oui/bin/addnode.sh
2. In the "specify cluster nodes to add to installation" screen, enter the new names for the public, private and virtual hosts
3. Click next to see a summary page

4. Click install; the installer will copy the files from the existing node to the new node
5. Once the files are copied you will be asked to run orainstRoot.sh and rootaddnode.sh (as user root) in the node that you are running the installation from, and root.sh in the new node:
orainstRoot.sh - sets the Oracle inventory in the new node and sets ownerships and permissions on the inventory
rootaddnode.sh - configures the OCR registry to include the new node as part of the cluster
root.sh - checks whether the Oracle CRS stack is already configured in the new node, creates the /etc/oracle directory, adds the relevant OCR keys to the cluster registry, adds the daemons to CRS and starts CRS in the new node
6. Click next to complete the installation
7. Now you need to configure Oracle Notification Services (ONS); the port can be identified by the below command:
cat $ORA_CRS_HOME/opmn/conf/ons.config
8. Now run the ONS utility by supplying the <remote_port> number obtained above:
racgons add_config rac3:<remote_port>

Installing the Oracle DB Software
Once CRS has been installed and the new node is in the cluster, it is time to install the Oracle DB software. Again you can use any of the existing nodes to install the software:
1. Log into any of the existing nodes as user oracle, then run the below command; the script starts the OUI GUI tool, which hopefully will already see the existing cluster and fill in the details for you:
$ORACLE_HOME/oui/bin/addnode.sh
2. Click next on the welcome screen to open the "specify cluster nodes to add to installation" screen; you should see a list of all the existing nodes in the cluster. Select the new node and click next
3. Check the summary page, then click install to start the installation
4. The files will be copied to the new node; once copied, you will be asked to run root.sh in the new node as user root. Run the script, then click OK to finish off the installation
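Once the installs are complete, it can be worth re-running the verification utility in post-install mode to confirm the clusterware state across all three nodes; a hedged sketch using the example node names:

cluvfy stage -post crsinst -n rac1,rac2,rac3 -verbose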

Configuring the Listener
Now it is time to configure the listener on the new node:
1. Log in as oracle on the new node, set your DISPLAY environment variable, then start the Network Configuration Assistant:
$ORACLE_HOME/bin/netca
2. Choose cluster management
3. Choose listener
4. Choose add
5. Choose the name LISTENER

These steps will add a listener on rac3 as LISTENER_rac3.

Create the Database Instance
Run the below to create the database instance on the new node:
1. Log in as user oracle, set the environment to the database home, then run the database creation assistant (DBCA):
$ORACLE_HOME/bin/dbca
2. In the welcome screen, choose Oracle Real Application Clusters database to create the instance and click next
3. Choose instance management and click next
4. Choose add instance and click next
5. Select RACDB (or whatever name you gave your RAC environment) as the database, enter the SYSDBA user and password, then click next
6. You should see a list of existing instances; click next, and on the following screen enter ORARAC3 as the instance name and choose RAC3 as the node name (substitute any of the above names for your environment's naming convention)
7. Click next in the database storage screen
8. The database instance will now be created; choose yes when asked to extend ASM

Removing a Node
Removing a node is similar to the above, but in reverse order:
1. Delete the instance on the node to be removed
2. Clean up ASM
3. Remove the listener from the node to be removed
4. Remove the node from the database
5. Remove the node from the clusterware

Delete the instance
1. You can delete the instance by using the database creation assistant (DBCA): invoke the program, choose the RAC database, choose instance management and then choose delete instance, enter the SYSDBA user and password, then choose the instance to delete

Clean up ASM
1. From node 1, run the below commands to stop and remove ASM on the node to be removed:
srvctl stop asm -n rac3
srvctl remove asm -n rac3
2. Now run the following on the node to be removed:
cd $ORACLE_HOME/admin
rm -rf +ASM
cd $ORACLE_HOME/dbs
rm -f *ASM*
3. Check that the /etc/oratab file has no ASM entries; if it does, remove them

Remove the listener
1. Log in as user oracle, set your DISPLAY environment variable, then start the Network Configuration Assistant:
$ORACLE_HOME/bin/netca
2. Choose cluster management
3. Choose listener
4. Choose remove
5. Choose the name LISTENER

Remove the node from the database
1. Run the below script from the node to be removed:
cd $ORACLE_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" -local
2. Run ./runInstaller, choose to deinstall products and select the dbhome
3. Run the following from node 1:
cd $ORACLE_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2}"

sh rac3.rac3}" CRS=TRUE 7.3 4./rootdelete. Run the following from node as user oracle cd $CRS_HOME/oui/bin ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1. obtain the node number first $CRS_HOME/bin/olsnodes -n cd $CRS_HOME/install . Now run the following from node 1 as user root./rootdeletenode. Choose to deinstall software and remove the CRS_HOME 6.sh 3. Now run the below from the node to be removed as user oracle cd $CRS_HOME/oui/bin . you obtain the port number from remoteport section in the ons.rac2./runInstaller 5. Run the following from node 1. the second you should not see any output and the last command you should only see nodes rac1 and rac2 srvctl status nodeapps -n rac3 crs_stat |grep -i rac3 olsnodes -n .1. the first should report "invalid node". Check that the node has been removed. Run the following from the node to be removed as user root cd $CRS_HOME/install .config file in $ORA_CRS_HOME/opmn/conf $CRS_HOME/bin/racgons remove_config rac3:6200 2./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" CRS=TRUE -local .

11. RAC Cheat Sheet

Cheatsheet
This is a quick and dirty cheatsheet on Oracle RAC 10g; as my experience with RAC grows I will update this section. Below is a beginner's guide to the commands and information that you will require to administer Oracle RAC.

Acronyms

GCS (Global Cache Services)    : the in-memory database containing current locks and awaiting locks; also known as PCM; see GRD for more details
GES (Global Enqueue Services)  : coordinates the requests of all global enqueues; uses the GCS; also known as non-PCM
GRD (Global Resource Directory): all resources available to the cluster, formed and managed by GCS and GES
GRM (Global Resource Manager)  : helps to coordinate and communicate the lock requests between Oracle processes
GSD (Global Services Daemon)   : runs on each node, with one GSD process per node. The GSD coordinates with the cluster manager to receive requests from clients such as the DBCA, EM and the SRVCTL utility to execute administrative job tasks such as instance startup or shutdown. The GSD is not an Oracle instance background process and is therefore not started with the Oracle instance
PCM (Parallel Cache Management): formerly known as the (integrated) Distributed Lock Manager (IDLM); another name for GCS
Resource                       : an identifiable entity; it basically has a name or a reference. It can be an area in memory, a disk file or an abstract entity
Resource (Global)              : a resource that can be accessed by all the nodes within the cluster; examples would be a data buffer cache block, a transaction enqueue and the database data structures
LVB (Lock Value Block)         : contains a small amount of data regarding the lock
TRFC (Traffic Controller)      : controls the DLM traffic between instances (messaging tickets)

Files and Directories

$ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log : OCR command log file
$ORA_CRS_HOME/crs/log                  : contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init                 : contains trace files for the CRS daemon during startup; a good place to start
$ORA_CRS_HOME/css/log                  : contains cluster reconfigurations, missed check-ins, and connects and disconnects from the client CSS listener; look here to determine when reboots occur
$ORA_CRS_HOME/css/init                 : contains core dumps from the cluster synchronization service daemon (OCSSd)
$ORA_CRS_HOME/evm/log                  : log files for the event volume manager and eventlogger daemon
$ORA_CRS_HOME/evm/init                 : pid and lock files for EVM
$ORA_CRS_HOME/srvm/log                 : log files for the Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log                      : log files for Oracle clusterware, which contain diagnostic messages at the Oracle cluster level
$ORA_CRS_HOME/cdata/<cluster_name>     : OCR backups (default location)

Useful Views/Tables

GCS and Cache Fusion diagnostics:
v$cache          : contains information about every cached block in the buffer cache
v$cache_transfer : contains information from the block headers in the SGA that have been pinged at least once

v$instance_cache_transfer : contains information about the transfer of cache blocks through the interconnect
v$cr_block_server         : contains statistics about CR block transfers across the instances
v$current_block_server    : contains statistics about current block transfers across the instances
v$gc_element              : contains one-to-one information for each global cache resource used by the buffer cache

GES diagnostics:
v$lock                    : contains information about locks held within a database and outstanding requests for locks and latches
v$ges_blocking_enqueue    : contains information about locks that are being blocked or are blocking others, and locks that are known to the lock manager
v$enqueue_statistics      : contains details about enqueue statistics in the instance
v$resource_limit          : displays enqueue statistics
v$locked_object           : contains information about DML locks acquired by different transactions in the database, with the mode held
v$ges_statistics          : contains miscellaneous statistics for GES
v$ges_enqueue             : contains information about all locks known to the lock manager
v$ges_convert_local       : contains information about all local GES operations
v$ges_convert_remote      : contains information about all remote GES operations
v$ges_resource            : contains information about all resources known to the lock manager
v$ges_misc                : contains information about messaging traffic
v$ges_traffic_controller  : contains information about the message ticket usage

Dynamic Resource Remastering:
v$hvmaster_info           : contains information about current and previous master instances of GES resources, in relation to the hash value ID of the resource
v$gcshvmaster_info        : the same as above, but for GCS resources
v$gcspfmaster_info        : contains information about current and previous masters of GCS resources belonging to files mapped to a particular master, including the number of times the resource has remastered

Cluster Interconnect:
v$cluster_interconnects    : contains information about the interconnects that are being used for cluster communication
v$configured_interconnects : same as above, but also contains interconnects that RAC is aware of that are not being used

Miscellaneous:
v$service : services running on an instance
x$kjmsdp  : displays LMS daemon statistics
x$kjmddp  : displays LMD daemon statistics
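As a quick sanity check that cache fusion traffic is using the private network, the interconnect view above can be queried directly (the columns shown are the standard ones):

SQL> select name, ip_address, is_public, source from v$cluster_interconnects;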

Useful Parameters
cluster_interconnects      : specifies a specific IP address to use for the interconnect
_gcs_fast_config           : enables fast reconfiguration for GCS locks (true|false)
_lm_master_weight          : controls which instance will hold or (re)master more resources than others
_gcs_resources             : controls the number of resources an instance will master at a time
_lm_tickets                : controls the number of message tickets
_lm_ticket_active_sendback : controls the number of message tickets (aggressive messaging)
_db_block_max_cr_dba       : limits the number of CR copies per DBA in the buffer cache (see GRD)
_fairness_threshold        : used when too many CR requests arrive for a particular buffer and the block becomes disowned (see GRD)
_gc_affinity_time          : specifies the interval in minutes for remastering
_gc_affinity_limit         : defines the number of times an instance must access a resource before remastering
_gc_affinity_minimum       : defines the minimum number of times an instance must access a resource before remastering
_lm_file_affinity          : disables dynamic remastering for the objects belonging to those files
_lm_dynamic_remastering    : enables or disables remastering
_gc_defer_time             : defines the time by which an instance defers downgrading a lock (see Cache Fusion)
_lgwr_async_broadcast      : changes the SCN broadcast method (see troubleshooting)
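A hedged example of the first parameter above, pinning the interconnect address for a single instance (the address and SID are illustrative; the parameter is static, so the instance must be restarted):

SQL> alter system set cluster_interconnects = '192.168.2.101' scope = spfile sid = 'racdb1';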

Processes

Oracle RAC Daemons and Processes

OPROCd (Process Monitor) - provides basic cluster integrity services

EVMd (Event Management) - spawns a child process event logger and generates callouts

OCSSd (Cluster Synchronization Services) - basic node membership, group services, basic locking

CRSd (Cluster Ready Services) - resource monitoring, failover and node recovery

LMSn (Lock Manager Server process - GCS) - this is the cache fusion part; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests, and it rolls back any uncommitted transactions. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it. They manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion timeouts.

LMON (Lock Monitor Process - GES) - this process manages the GES and maintains consistency of GCS memory in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (a node joining or leaving); it checks for instance deaths and listens for local messaging. A detailed log file is created that tracks any reconfigurations that have happened.

LMD (Lock Manager Daemon - GES) - this manages the enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests from other instances.

LCK0 (Lock Process - GES) - manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.

DIAG (Diagnostic Daemon) - this is a lightweight process that uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures, and it will perform any necessary recovery if an operational hang is detected.
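A quick, hedged check that these processes are actually running on a node; the instance name racdb1 is the example SID used earlier in this guide, and the background process names follow the standard ora_<process>_<sid> pattern:

$ ps -ef | egrep "ora_(lmon|lmd0|lms[0-9]|lck0|diag)_racdb1"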

General Administration
Managing the Cluster

starting:
/etc/init.d/init.crs start
crsctl start crs

stopping:
/etc/init.d/init.crs stop
crsctl stop crs

enable/disable at boot time:
/etc/init.d/init.crs enable
/etc/init.d/init.crs disable
crsctl enable crs
crsctl disable crs

Managing the database configuration with SRVCTL

start all instances:
srvctl start database -d <database> -o <option>
Note: starts the listeners if they are not already running; the -o option specifies the startup options (force, open, mount, nomount)

stop all instances:
srvctl stop database -d <database> -o <option>
Note: the listeners are not stopped; the -o option specifies the shutdown options (immediate, abort, normal, transactional)

stopping/starting a particular instance or resource:
srvctl [start|stop] instance -d <database> -i <instance>,<instance>
srvctl [start|stop] service -d <database> -s <service> -i <instance>,<instance>
srvctl [start|stop] nodeapps -n <node>
srvctl [start|stop] asm -n <node>

display the registered databases:
srvctl config database

status:
srvctl status database -d <database>
srvctl status instance -d <database> -i <instance>,<instance>
srvctl status service -d <database> -s <service>
srvctl status nodeapps -n <node>
srvctl status asm -n <node>

adding/removing:
srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>
srvctl remove database -d <database> -o <oracle_home>
srvctl remove instance -d <database> -i <instance> -n <node>
srvctl remove service -d <database> -s <service> -r <preferred_list>
srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl remove asm -n <node>
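A few hedged usage examples, plugging in the example names used earlier in this guide (database RACDB, instance ORARAC3, node rac3); substitute your own names:

$ srvctl status database -d RACDB
$ srvctl stop instance -d RACDB -i ORARAC3 -o immediate
$ srvctl start instance -d RACDB -i ORARAC3
$ srvctl status nodeapps -n rac3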

OCR utilities

log file:
$ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log

checking:
ocrcheck
Note: will return the OCR version, the total space allocated, the free space, the space used, the location of each device and the result of the integrity check

dump contents:
ocrdump -backupfile <file>
Note: by default it dumps the contents into a file named OCRDUMP in the current directory

export/import:
ocrconfig -export <file>
ocrconfig -import <file>

backup/restore:
# show backups
ocrconfig -showbackup
# to change the location of the backup; you can even specify an ASM disk
ocrconfig -backuploc <path|+asm>
# perform a backup; it will use the location specified by -backuploc
ocrconfig -manualbackup
# perform a restore
ocrconfig -restore <file>
# delete a backup
ocrconfig -delete <file>
Note: there are many more options, so see the ocrconfig man page

add/remove/replace:
## add/relocate the ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'
## relocate an existing OCR file
ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'
## remove the OCR or OCRMirror file
ocrconfig -replace ocr
ocrconfig -replace ocrmirror

CRS Administration

starting:
## Starting CRS using Oracle 10g R1
not possible
## Starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs

stopping:
## Stopping CRS using Oracle 10g R1
srvctl stop database -d <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop
## Stopping CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl stop crs

disabling/enabling (used to stop CRS restarting after a reboot):
## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]
## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs

checking:
$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600

Resource Applications (CRS Utilities)

status                            : $ORA_CRS_HOME/bin/crs_stat
create profile                    : $ORA_CRS_HOME/bin/crs_profile
register/unregister an application: $ORA_CRS_HOME/bin/crs_register, $ORA_CRS_HOME/bin/crs_unregister
start/stop an application         : $ORA_CRS_HOME/bin/crs_start, $ORA_CRS_HOME/bin/crs_stop
resource permissions              : $ORA_CRS_HOME/bin/crs_getparam, $ORA_CRS_HOME/bin/crs_setparam
relocate a resource               : $ORA_CRS_HOME/bin/crs_relocate

Nodes

member number/name : olsnodes -n
local node name    : olsnodes -l
activates logging  : olsnodes -g

Oracle Interfaces

display : oifcfg getif
delete  : oifcfg delif -global
set     : oifcfg setif -global <interface name>/<subnet>:public
          oifcfg setif -global <interface name>/<subnet>:cluster_interconnect

Global Services Daemon Control

starting : gsdctl start
stopping : gsdctl stop
status   : gsdctl stat

Cluster Configuration (clscfg is used during installation)

create a new configuration                      : clscfg -install
upgrade or downgrade an existing configuration  : clscfg -upgrade, clscfg -downgrade
add or delete a node from the configuration     : clscfg -add, clscfg -delete
create a special single-node configuration for ASM : clscfg -local
brief listing of terminology used in the other modes : clscfg -concepts
used for tracing                                : clscfg -trace
help                                            : clscfg -h

Cluster Name Check

print cluster name            : cemutlo -n
print the clusterware version : cemutlo -w
Note: in Oracle 9i the utility was called "cemutls"

Node Scripts

add node    : addnode.sh (see adding and deleting nodes)
delete node : deletenode.sh (see adding and deleting nodes)
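Tying back to the Oracle Interfaces commands above, a hedged example of registering a private interconnect network (the interface name and subnet are made up for illustration):

$ oifcfg setif -global eth1/192.168.2.0:cluster_interconnect
$ oifcfg getif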

Enqueues

displaying statistics:
SQL> column current_utilization heading current
SQL> column max_utilization heading max_usage
SQL> column initial_allocation heading initial
SQL> column resource_limit format a23
SQL> select * from v$resource_limit;

force dynamic remastering (DRM):
## Obtain the OBJECT_ID from the below table
SQL> select * from v$gcspfmaster_info;
## Determine who masters it
SQL> oradebug setmypid
SQL> oradebug lkdebug -a <OBJECT_ID>
## Now remaster the resource
SQL> oradebug setmypid
SQL> oradebug lkdebug -m pkey <OBJECT_ID>

GRD, SRVCTL, GSD and SRVCONFIG Tracing

Enable tracing  : $ export SRVM_TRACE=true
Disable tracing : $ export SRVM_TRACE=""

Messaging (tickets)

dump ticket information:
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug lkdebug -t

ticket usage:
select local_nid local, remote_nid remote, tckt_avail avail, tckt_limit limit, tckt_wait waiting, snd_q_len send_queue from v$ges_traffic_controller;

Lightwork Rule and Fairness Threshold

select cr_requests, light_works, data_requests, fairness_down_converts from v$cr_block_server;

Note: lower the _fairness_threshold if the ratio goes above 40%; set it to 0 if the instance is a query-only instance.

Voting Disk

adding   : crsctl add css votedisk <file>
deleting : crsctl delete css votedisk <file>
querying : crsctl query css votedisk

Books

Oracle RAC Books

Oracle 10g Real Application Clusters Handbook - it has enough detail to give you what you need to manage RAC; however, I did have to consult the web in order to obtain more detailed information and to clarify certain points.

Oracle 10g High Availability with RAC, Flashback and Data Guard - this book had a small section on RAC and helped clarify some of the points in the above book.
