Oracle Real Application Clusters (RAC)
RAC Internals, Cache Fusion and Performance Tuning

A “BrainSurface” Presentation www.brainsurface.com

Disclaimer
The views/content in this document are those of the author and do not necessarily reflect those of Oracle Corporation and/or its affiliates/subsidiaries. The material in this document is for informational purposes only and is published with no guarantee or warranty, express or implied.

Oracle RAC Internals

Agenda
• Node & Clusterware stack startup sequence
• Heartbeat mechanism
• Voting disk functionality
• Split-brain resolution
• Node reboot causes

Oracle RAC Internals: Node Startup Sequence

Clusterware startup order discussed in the coming slides

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Clusterware Stack Startup Sequence: Pre-11gR2
Entries in the /etc/inittab
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

Added during the root.sh execution

Startup flow: Node boots up -> OS startup -> inittab -> Clusterware stack

• init.evmd (evmd.bin): publishes events as they are detected; responsible for executing callouts.
• init.cssd (ocssd.bin, with oclsmon.bin and oprocd.bin): provides cluster group membership; monitors nodes in the cluster via the heartbeat mechanism; uses the voting disk.
• init.crsd (crsd.bin): manages and monitors CRS resources; updates the OCR when srvctl is used.

Oracle RAC Internals: Clusterware Stack Startup Sequence: 11gR2
Entries in the /etc/inittab
h1:3:respawn:/sbin/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

Added during the root.sh execution

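Startup flow: Node boots up -> OS startup -> inittab -> init.ohasd -> Oracle High Availability Services Daemon

Agents started by the Oracle High Availability Services Daemon:
• oraagent.bin: MDNSD, GIPCD, GPNPD, EVMD, ASM
• orarootagent.bin: CSSD Monitor, CRSD, CTSSD, Diskmon, ACFS Drivers
• cssdagent: OCSSD

Second-level agents and the resources they manage:
• oraagent: ONS, ASM Instance, DB Instance, Listener, SCAN Listener
• orarootagent: GSD, VIP, SCAN VIP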

Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2

Oracle High Availability Services Daemon

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Clusterware and Heartbeat Mechanism
Clusterware and heartbeat mechanism
Two types of heartbeat:
1. Network heartbeat
• Performed once per second.
• A node is evicted from the cluster when it fails to send a network heartbeat within the MissCount time frame (maximum time in seconds).
• clssnmPollingThread (ocssd.log):
CSSD]2009-01-27 11:15:37.409 [18] >TRACE: clssnmPollingThread: Eviction started for node usogp06 (6), flags 0x0001, state 3,wt4c 0
2. Disk (voting disk) heartbeat
• Each node of the cluster writes a disk heartbeat to the voting disk every second.
• Each node reads the kill block every second and commits suicide if required.
• A node is evicted from the cluster if no heartbeat is updated within the I/O timeout (MissCount/Disktimeout).
• clssnmDiskPMT (ocssd.log):
CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)
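As a minimal sketch of how the clssnmDiskPMT warnings in ocssd.log could be scanned, the Python snippet below extracts the latency and voting disk path from the sample line above. The regular expression is inferred from that single sample line only, so treat the pattern (and the helper name parse_disk_latency) as assumptions rather than a documented log format.

```python
import re

# Pattern inferred from the single sample line on the slide (an assumption,
# not a documented ocssd.log format).
DISK_PMT = re.compile(
    r"clssnmDiskPMT: long disk latency >?\((\d+) ms\) to voting disk \((.+)\)"
)

def parse_disk_latency(line):
    """Return (latency_seconds, voting_disk), or None if the line does not match."""
    match = DISK_PMT.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1000.0, match.group(2)

sample = (
    "CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: "
    "long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)"
)
latency, disk = parse_disk_latency(sample)
print(f"voting disk {disk}: latency {latency:.1f} s")
```

Comparing the reported latency with the configured disk timeout gives a quick indication of how close a node came to eviction.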

Oracle RAC Internals: Clusterware and Heartbeat Mechanism
CSS parameters and their default values in 11gR2:
crsctl get css <parameter>
crsctl set css <parameter> <value>
• clusterguid
• disktimeout: 200 seconds
• misscount: 30 seconds (a longer misscount applies when a vendor cluster is configured)
• reboottime: 3 seconds
• priority: 4 (UNIX), 3 (Windows)
• logfilesize: 50 MB
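The interplay of the two heartbeats and their timeouts can be sketched in a few lines of Python. This is a simplified illustration, not how ocssd is implemented: it assumes the network heartbeat is judged against misscount and the disk heartbeat against disktimeout, using the 11gR2 defaults listed above. The names CssTimeouts and eviction_reason are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CssTimeouts:
    misscount: float = 30.0     # seconds, network heartbeat limit (slide default)
    disktimeout: float = 200.0  # seconds, voting disk I/O limit (slide default)

def eviction_reason(now, last_network_hb, last_disk_hb, timeouts=CssTimeouts()):
    """Return why a node would be evicted, or None if both heartbeats are fresh.

    All arguments are timestamps in seconds. Simplifying assumption: each
    heartbeat is checked against exactly one timeout.
    """
    if now - last_network_hb > timeouts.misscount:
        return "network heartbeat missed"
    if now - last_disk_hb > timeouts.disktimeout:
        return "disk heartbeat missed"
    return None

# A node whose last network heartbeat is 31 seconds old exceeds misscount.
print(eviction_reason(now=100.0, last_network_hb=69.0, last_disk_hb=99.0))
```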

Oracle RAC Internals: Voting Disk Functionality
• Used by the Cluster Synchronization Services (CSS).
• It records and manages the node membership information.
• At any time, each node of a cluster must be able to access more than half of the voting disks.
• Recommended to have 2n+1 (odd number) voting disk files.

Diagram: Node1, Node2 and Node3 each run CSS, exchange a network heartbeat every second, and write a disk heartbeat to the three voting disks once per second. All 3 nodes can see each other: ALL IS WELL!
Figure/Diagram from Oracle Documentation
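The "more than half" rule and the 2n+1 recommendation can be checked with a little arithmetic. The Python sketch below only illustrates the rule as stated on the slide; the function names are hypothetical.

```python
def has_voting_disk_quorum(accessible, total):
    """True if a node can access more than half of the voting disks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 2 * accessible > total

def tolerated_disk_failures(total):
    """Number of voting disks that can be lost while a strict majority remains."""
    return (total - 1) // 2

for total in (1, 3, 4, 5):
    print(total, "voting disks tolerate", tolerated_disk_failures(total), "failure(s)")
```

Under this rule, a fourth voting disk tolerates no more failures than three do, which is why an odd number (2n+1) is recommended.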

Oracle RAC Internals: Split-Brain Syndrome
Split-brain

Diagram: Node1, Node2 and Node3 each run CSS and share a voting disk. Nodes 1 and 2 can see each other, but neither can see Node 3, so they decide to evict Node 3. Node 3 cannot see Nodes 1 and 2 and is told through the voting disk to kill itself.

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Split-Brain Resolution
What is Split-Brain?
The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a distributed system, typically a high availability cluster, lose connectivity with one another but then continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or using the said resources.
Quote/Abstract from MOS document
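The three-node scenario shown earlier (Nodes 1 and 2 evict Node 3) can be sketched as a choice between sub-clusters. In the Python sketch below the larger sub-cluster survives. The tie-break (the sub-cluster containing the lowest node number wins) is an assumption added for illustration and is not stated in these slides. The function names are hypothetical.

```python
def surviving_subcluster(subclusters):
    """Pick the surviving sub-cluster from a list of sets of node numbers.

    Larger sub-cluster wins. Tie-break by lowest node number is an
    illustrative assumption, not taken from the slides.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))

def nodes_to_evict(subclusters):
    """Return every node that is not part of the surviving sub-cluster."""
    survivors = surviving_subcluster(subclusters)
    return set().union(*subclusters) - survivors

# Nodes 1 and 2 see each other; Node 3 is isolated.
print(nodes_to_evict([{1, 2}, {3}]))
```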

Oracle RAC Internals: Node Reboot Causes
When does a node reboot?
• Network failure: interconnect
• Slow interconnect (latency): must fail 30 consecutive times! Check the private interconnect configuration.
• Voting disk I/O: cannot read or write. Refer to ocssd.log.
• CPU-bound: CPU is too busy to maintain the heartbeat. Configure OSWatcher to verify resource consumption.
• Files moved, deleted, changed, or some other human error
• Configuration error: wrong network for the private interconnect
• ocssd process died
• Some Oracle Clusterware bug

Oracle RAC Internals: Grid Infrastructure: Log Files Hierarchy

Figure/Diagram from Oracle Documentation

What is Cache Fusion? Synopsis & Overview
Cache Fusion is the driving technology behind Oracle RAC that enables applications to scale out on multiple servers/instances. Cache Fusion/Synchronization enables concurrent/simultaneous transaction processing between all instances using the Private Cluster Interconnect. DB blocks are synchronized, NOT mirrored = faster performance.

What is Cache Fusion? Synopsis & Overview
Since the advent of Oracle RAC 9i in 2001, Cache Fusion has provided the following features: Nodes can be added/removed in HOT MODE with zero database downtime, providing elasticity and scalability. Database files residing on a shared-disk cluster file system provide a uniform, fast and read-consistent image to the end user. Applications typically scale out of the box with zero/minimal tuning.

Cache Fusion – Synopsis & Overview
Cache Fusion is very fast because disk writes are eliminated when other instances request blocks for updates. Cache Fusion is a mechanism within Oracle RAC that employs a Shared Cache Architecture, fusing the in-memory data buffer caches of all nodes into a single logical read-consistent buffer cache available to all instances. DB blocks are transferred in memory from instance-to-instance cache over the Cluster Interconnect when requested, after proper locking procedures are implemented.

Cache Fusion – Synopsis & Overview
Global Cache Service (GCS) is used for FAST instance-to-instance block buffer transfer and establishes/implements Cache Coherency = never more than 3 hops.
Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is used for block buffer locking.
Global Resource Directory (GRD) is used for keeping track of block buffer Location/Mode/Role information.
The Private Cluster Interconnect is used for block transfers among instances to enable Cache Fusion.

Cache Fusion Architecture Overview

Figure/Diagram from Oracle Documentation

Cache Fusion Architecture – Global Resource Directory (GRD)
GCS & GES maintain the Global Resource Directory (GRD). Internal Repository stored by all instances of the RAC Cluster. Global Resource Directory (GRD) is used for keeping track of Data Structures, Block Buffer Location, Mode, Role, Inventory etc.
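To make the bookkeeping role of the directory and the "never more than 3 hops" property concrete, here is a toy Python model. It is a sketch under loud assumptions, not Oracle's implementation: the class ToyResourceDirectory, the modulo rule for choosing a block's master instance, and the message labels are all invented for illustration.

```python
class ToyResourceDirectory:
    """Toy model: tracks which instance holds each block and in which mode."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.holder = {}  # block id -> (holding instance, mode)

    def master_of(self, block_id):
        # Assumption: master chosen by simple modulo; real mastering differs.
        return self.instances[block_id % len(self.instances)]

    def request(self, block_id, requester, mode):
        """Return the list of (sender, receiver, message) hops for a request."""
        master = self.master_of(block_id)
        hops = []
        if requester != master:
            hops.append((requester, master, "request"))
        entry = self.holder.get(block_id)
        if entry is None:
            # No instance caches the block: master grants, requester reads disk.
            if requester != master:
                hops.append((master, requester, "grant: read from disk"))
        elif entry[0] != requester:
            current = entry[0]
            if current != master:
                hops.append((master, current, "forward"))
            hops.append((current, requester, "block transfer"))
        self.holder[block_id] = (requester, mode)
        return hops

grd = ToyResourceDirectory([1, 2, 3])
grd.request(7, requester=1, mode="exclusive")        # first access, from disk
for hop in grd.request(7, requester=3, mode="exclusive"):
    print(hop)                                        # requester -> master -> holder -> requester
```

In this model the worst case is requester to master, master to current holder, and holder back to requester, which is three messages regardless of how many instances exist.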

Cache Fusion Architecture – Global Cache Service (GCS)
The backbone of Cache Fusion: responsible for cache coherence.
Responsible for maintaining the different block modes and for transferring data buffers among the instances.
Implemented by the Global Cache Service processes (LMSn).
Lock Manager Server (LMS): processes that are responsible for remote messaging.
LMSn, n = 0 to 9: up to 10 LMS processes; the number can be set with the init parameter GCS_SERVER_PROCESSES.

Cache Fusion Architecture – Global Enqueue Service (GES)
Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is responsible for the locking mechanisms used in Cache Fusion.
LMON process: responsible for cluster monitoring and management of global resources; also known as Cluster Group Services.
LMD0 process, responsible for:
• Management of resource requests from RAC instances.
• Distributed deadlock detection.
• Processing of enqueued requests.
• Access control to global enqueues.

Cache Fusion – Measuring Efficiency
Global Cache Services (GCS) Waits = Cross-Instance Block transfer Waits = Measure of Data Block Transfer Efficiency.
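One common way to turn this into a number is the average time to receive a block from another instance. The Python sketch below assumes the two inputs come from the V$SYSSTAT statistics "gc cr block receive time" (reported in centiseconds) and "gc cr blocks received". Both the statistic names and the sample values are assumptions for illustration and are not taken from the slides.

```python
def avg_block_receive_ms(receive_time_cs, blocks_received):
    """Average receive time per block in milliseconds.

    receive_time_cs: cumulative receive time in centiseconds (assumed unit).
    blocks_received: cumulative number of blocks received.
    """
    if blocks_received == 0:
        return 0.0
    return 10.0 * receive_time_cs / blocks_received

# Hypothetical sample values, not measurements.
print(avg_block_receive_ms(receive_time_cs=500, blocks_received=1000), "ms per block")
```

Because the statistics are cumulative, differences between two snapshots give a more meaningful figure than the raw totals.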

Cache Fusion – Dynamic Performance Views
Some useful Dynamic Performance Views for monitoring Cache Fusion:
• v$gc_element
• v$cache
• v$instance_cache_transfer
• v$cr_block_server
• v$cache_transfer
• v$ges_blocking_enqueue
• gv$file_cache_transfer
• gv$temp_cache_transfer
• gv$cache_transfer
• gv$class_cache_transfer

RAC Performance Tuning: Starting Out

Niemiec (2004 – 9i RAC)
– App Tuning
– Database Tuning
– OS Tuning
– THEN... RAC Tuning

Nanda (2009)
– CPU and I/O (not Interconnect) are necessary for RAC Performance

Lawson (2010)
– “The Essence Of Performance Tuning Is The Same”

• These quotes are from presentations in the RAC SIG library.

RAC Performance Tuning: Approaches

Top-Down
– Application Responsiveness
– Grid Control Performance Tab
– Statspack/AWR Reports
• Goal: Minimize Response Time or Maximize Throughput

Bottom-Up
– Storage
  • Spindles, Controllers, Paths
– OS
  • I/O times, queues
  • Network latency
  • Memory
  • CPU (each core)
• Goal: Balance & Maximize Utilization

RAC Performance Tuning: Application & Schema Design

Look Out For:
– Indexes
– Sequences
– “Hot” rows or small tables
– MSSM
– “gc” Wait Events
– High Interconnect Utilization

RAC Performance Tuning: Application & Schema Design

Main Principle: parallelize (avoid serialization on any data)
If it doesn't scale on SMP then it won't scale on RAC

• Decrease rows/block
• Reverse Key or Hash Indexes (No Range Scans)
• Sequences: NOORDER + CACHE
• ASSM (or Freelist Groups)
• Data & Index Partitioning
• App Partitioning

Same principles of good app design for non-RAC!!
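Why reverse key indexes reduce contention on sequence-generated keys can be shown with a toy Python example. It assumes 4-byte big-endian keys and uses the leading byte as a stand-in for "which part of the index a key lands in". Real index leaf blocks are organized differently, so this only illustrates the byte-reversal idea.

```python
def reverse_key(value, width=4):
    """Return the key bytes in reversed order, as a reverse key index would store them."""
    return value.to_bytes(width, "big")[::-1]

def leading_bytes(keys):
    """Distinct leading bytes: a rough proxy for how widely the keys are spread."""
    return {key[0] for key in keys}

ids = range(1000, 1008)  # eight consecutive sequence values
plain_keys = [i.to_bytes(4, "big") for i in ids]
reversed_keys = [reverse_key(i) for i in ids]

print("plain:   ", len(leading_bytes(plain_keys)), "distinct leading byte(s)")
print("reversed:", len(leading_bytes(reversed_keys)), "distinct leading byte(s)")
```

Consecutive values share their high-order bytes, so they cluster together in a normal index; after reversal the fast-changing low-order byte comes first and the inserts spread out, at the cost of range scans on the key.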

RAC Performance Tuning: Tune the Entire System as a Whole

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Tune the Entire System as a Whole

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Real Life Case Study

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Configuration Checklist
• Hardware
  – All nodes have similar performance characteristics
• Interconnect (the RAC Achilles’ heel)
  – Network segment truly private
  – Bond NICs to improve throughput
  – All nodes set NICs to Jumbo Frames
  – Switches / VLANs set to Jumbo Frames
  – Consider 10Gbit Ethernet for Interconnect
• Storage
  – Multipath
  – Verify settings for read & write caching match application nature
  – If using iSCSI, treat as similar to interconnect network (see above)
• Software
  – All nodes have the exact same OS patches
  – All nodes have the exact same Oracle patches
  – Oracle both recommends and pushes for using ASM on RAC
  – Do NOT rely on non-RAC enabled scripts or tools for handling RAC

RAC Performance Tuning: Block Size is Important
• DBCA’s default block size is 8K
• Many DBAs’ experience is that a bigger block size is better
• So most databases these days have block sizes >= 8K
• But bigger is not always better
• Block size and number of nodes should be considered (next 2 slides)
• No matter how fast or good Cache Fusion is, don’t stress it if unnecessary
• Example: OLTP application using 8K block size and having 8 nodes
  – Larger block size = more rows per block
  – More rows per block = more likelihood of block contention
  – More nodes (>=4) = more likelihood of block contention
  – More block contention means more Cache Fusion work
• Remember, the interconnect is most often RAC’s Achilles’ heel …
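The block size argument can be given rough numbers with a birthday-style estimate. The Python sketch below assumes a hypothetical hot table of 10,000 rows of 200 bytes each, ignores block overhead and free space settings, and assumes each of 8 sessions (one per node) touches one uniformly random block. All of these figures are illustrative assumptions, not measurements.

```python
import math

def blocks_needed(rows, row_size, block_size):
    """Blocks required to hold the rows, ignoring block overhead (assumption)."""
    rows_per_block = block_size // row_size
    return math.ceil(rows / rows_per_block)

def collision_probability(num_blocks, sessions):
    """Probability that at least two sessions pick the same block at random."""
    p_all_distinct = 1.0
    for i in range(sessions):
        p_all_distinct *= max(0.0, 1.0 - i / num_blocks)
    return 1.0 - p_all_distinct

for block_size in (8192, 16384):
    blocks = blocks_needed(rows=10_000, row_size=200, block_size=block_size)
    p = collision_probability(blocks, sessions=8)
    print(f"{block_size // 1024}K blocks: {blocks} blocks, collision probability {p:.3f}")
```

Doubling the block size roughly halves the number of blocks holding the hot rows, so the chance that two nodes want the same block at the same time rises accordingly.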

RAC Performance Tuning: Block Contention

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Block Contention

Figure/Diagram from Bert Scalzo

Summary
To summarize, Oracle RAC is proven, robust and stable and is used by corporations, organizations & governments across the globe to achieve High Availability, Elasticity & Scalability by providing a lower-cost and higher ROI alternative to Mainframe-like SMP (Symmetric Multi-Processing) models of computing. Learn more about Oracle RAC at Oracle's RAC homepage.
http://www.oracle.com/technology/products/database/clustering/index.html
