What Are Oracle Real Application Clusters?

• Multiple instances accessing the same database
• One instance per node
• Physical or logical access to each database file
• Software-controlled data access


Shared cache

Instances spread across nodes Database files

RAC Architecture
public network


VIP1 Service Listener instance 1 ASM

VIPn Service Listener instance n ASM

Node n

Oracle Clusterware interconnect Oracle Clusterware

Operating System

Operating System

shared storage

Managed by ASM RAW Devices

Redo / Archive logs all instances Database / Control files OCR and Voting Disks

Global Resources Coordination
[Figure: Node1 through Noden, each running one instance (Instance1 through Instancen) with its cache, a GRD master, GES and GCS, and the background processes LMON, LMD0, LMSx, LCK0 and DIAG. Global resources are coordinated across the cluster over the interconnect.]
• Global Resource Directory (GRD)
• Global Cache Services (GCS)
• Global Enqueue Services (GES)

RAC Software
[Figure: Node1 through Noden form the cluster. Each node runs an instance (cache; LMON, LMD0, LMSx, LCK0, DIAG) that shares global resources with the other instances through the cluster interface. On each node, Oracle Clusterware (CRSD & RACGIMON, EVMD, OCSSD & OPROCD) manages the applications: ASM, DB, Services, OCR, VIP, ONS, EMD, Listener.]
Global management: SRVCTL, DBCA, OEM

RAC Software Storage
[Figure, option 1: each node (Node1 through Noden) keeps CRS_HOME, ORACLE_HOME and ASM_HOME on local storage; voting files and OCR files are on shared storage.]
• Permits rolling patch upgrades
• Software not a single point of failure
[Figure, option 2: each node keeps only CRS_HOME on local storage; ORACLE_HOME and ASM_HOME are on shared storage together with the voting files and OCR files.]

RAC Database Storage
Local storage (Node1 through Noden, one instance each):
• Archived log files
Shared storage:
• Data files
• Temp files
• Control files
• Flash recovery area files
• Change tracking file
• SPFILE
• TDE Wallet
• Undo tablespace files for instance1 through instancen
• Online redo log files for instance1 through instancen

Automatic Storage Management
• Eliminates need for conventional file system and volume manager
• Capacity on demand
  - Add/drop disks online
• Automatic I/O load balancing
  - Stripes data across disks to balance load
  - Best I/O throughput
• Automatic mirroring
• Easy

Automatic Storage Management
• Simplifies and automates database storage management
  - A fraction of the time is needed to manage database files
• Increases storage utilization
  - Eliminates over-provisioning and maximizes storage resource utilization
• Predictably delivers on Service Level Agreements
  - Never gets out of tune, delivering higher performance than RAW and file systems over time
  - Uncompromised availability, enabling reliable deployment on low-cost storage

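Clusters and Scalability
[Figure: SMP model versus RAC model. SMP model: CPUs, each with its own cache, share one memory; the caches are kept consistent by cache coherency. RAC model: instances, each with its own SGA and background processes (BGP), share storage; the SGAs are kept consistent by Cache Fusion.]
BGP: Background process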

Real Application Clusters Benefits
• Highest Availability
• On-demand flexible scalability
• Lower computing costs
• World record performance

Levels of Scalability
• Hardware: Disk input/output (I/O)
• Internode communication: High bandwidth and low latency
• Operating system: Number of CPUs
• Database management system: Synchronization
• Application: Design

Scaleup and Speedup
[Figure: Original system: the hardware completes 100% of the task in a given time. Cluster system scaleup: with added hardware, up to 200% or up to 300% of the task is completed in the same time. Cluster system speedup: with added hardware, 100% of the task is completed in half the time (Time/2).]
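The two measures in the figure can be written as simple ratios. The following is a minimal sketch, assuming the usual definitions (speedup compares elapsed time for the same task, scaleup compares the volume of work done in the same time). The function names are illustrative and do not come from the slides.

```python
def speedup(time_original: float, time_cluster: float) -> float:
    """Same task, more hardware: how many times faster it completes."""
    return time_original / time_cluster


def scaleup(volume_original: float, volume_cluster: float) -> float:
    """Same elapsed time, more hardware: how much more work is completed."""
    return volume_cluster / volume_original


# Values read from the figure: 100% of the task in Time/2,
# and up to 300% of the task in the original time.
print(speedup(time_original=1.0, time_cluster=0.5))      # 2.0
print(scaleup(volume_original=100, volume_cluster=300))  # 3.0
```

The figure's "up to" wording is a reminder that these are upper bounds; measured ratios are normally lower.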

Speedup/Scaleup and Workloads

Workload                   Speedup    Scaleup
OLTP and Internet          No         Yes
DSS with parallel query    Yes        Yes
Batch (mixed)              Possible   Yes

Definition of a Data Warehouse
“An enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The data warehouse stores atomic and summary data.”

Data Warehouse - Characteristics
• What is Data Warehousing today?
• Not a simple batch query and analytical engine anymore
  - Large user population with diverse query and analytical needs
  - 1000's of users accessing data both internally and externally
• Large size: 10 TB and upwards of 100 TB
• Not a simple schema with few tables
  - Multiple applications sharing a common copy of enterprise data
• Strict performance and operational SLAs
• Adaptable to growing business needs
  - Constantly evolving with more business units and functionality
  - Constant requirement to scale users, data

Data Warehouse - Characteristics
• Large, complex database operations
  - Complex SQL and calculations
• Updated through a controlled process
  - Extract, Transform, Load (ETL)
• Heterogeneous workload
  - ETL processing
  - Scheduled reporting
  - Ad hoc queries
  - Aggregations etc.
• Peak usage of different workload patterns at different times
• Systems have to be sized appropriately

Data Warehouse - Requirements
• High availability and reliability
• Deliver real-time data for real-time queries
  - Get more in-time, accurate data
  - Stay informed
  - Have the ability to make decisions & take action
  - Have a lag-time of hours/minutes
• High performance and throughput
• Capability to scale quickly as the business is growing
• Flexibility to meet diverse, shifting demands

RAC and Data Warehouse Physical Considerations

Configure for a Balanced System
“The weakest link” defines the performance
Balance these components:
• CPU
• HBA (Host Bus Adapter)
• NICs and Interconnect Protocol
• Switch speed
• Controllers
• Disks
[Figure: nodes joined by interconnects, each with two HBAs (HBA1, HBA2), attached through FC-Switch1 and FC-Switch2 to Disk Array 1 through Disk Array 8.]

Grid Component* Dependencies
Components: Node, CPU, Host Bus Adapter (HBA), Switch, Controller, Disk, Interconnect
Rule of thumb:
• HBA = 200MB/s per CPU
• Number of HBAs per node = number of CPUs per node
• Number of controllers = number of HBAs
• Maximal number of switches = number of HBAs + number of controllers
• Minimum number of disks = number of controllers x 4
• Interconnect: number of nodes <= 8: GigE, otherwise InfiniBand
* 2Gbit based

I/O Design
• Optimal Storage Design
• Support workloads that perform sequential I/O
  - Expressed as bandwidth: MB/sec
  - Large multi-block I/Os
  - Table/Index scans
• Support workloads that perform random I/O
  - Expressed as I/O operations: IOPS
  - Single-block I/O requests
• Estimation should include requirements for both normal and backup I/Os

I/O Design
• Estimate aggregated throughput and IOPS (e.g., 2GB/sec, or 30,000 IOPS)
• Calculate the total bandwidth requirement per node (e.g., 2GB/sec for 16 nodes = 128MB/node/sec, or 30,000/16 = 1875 IOPS/node)
• Choose the appropriate storage class and build the configuration (e.g., 120 IOPS per spindle, 16-way striped = 1920 IOPS per LUN, 16 LUNs)
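The sizing arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using only the slide's example figures; the helper names are hypothetical, and an even spread of load across nodes and spindles is assumed.

```python
def per_node(total: float, nodes: int) -> float:
    """Split an aggregate requirement evenly across cluster nodes."""
    return total / nodes


def lun_iops(iops_per_spindle: int, stripe_width: int) -> int:
    """IOPS one LUN can sustain when striped across stripe_width spindles."""
    return iops_per_spindle * stripe_width


nodes = 16
bandwidth_mb = per_node(2 * 1024, nodes)  # 2 GB/sec aggregate, in MB/sec
iops_node = per_node(30_000, nodes)       # 30,000 IOPS aggregate
iops_lun = lun_iops(120, 16)              # 120 IOPS per spindle, 16-way striped
total_iops = iops_lun * 16                # 16 LUNs

print(bandwidth_mb, iops_node, iops_lun, total_iops)  # 128.0 1875.0 1920 30720
```

With these figures, 16 LUNs provide 30,720 IOPS, just above the 30,000 IOPS target, so the example configuration leaves very little headroom.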

I/O Design
• DW Specific Best practices
• Plan 50-60% utilization per HBA
• Target 30-50 MB per CPU core
• Use ASM
  - Managing ultra large databases is fairly simple
  - Eliminates contention by evenly spreading I/O
  - Expanding storage needs are addressed easily
  - Re-balancing ensures I/O performance is constant
• Create optimal size LUNs
  - Small LUNs for multi-terabyte DBs are sub-optimal
• Pay attention to initial storage layout while increasing cluster nodes exponentially
• Offset partition table to stripe-width of the Storage Array

Interconnect Design
• In a DW environment, the primary users of the interconnect are:
• Inter-node Parallel Query
  - Typical message size
  - PARALLEL_EXECUTION_MESSAGE_SIZE, default 2k
• Global Cache Fusion
  - Two types of message
  - Short 256 byte message
  - Block transfer: DB_BLOCK_SIZE

Interconnect Design
• Interconnect Bandwidth Estimation
• Messages received (M)
  - 256 * (GES messages + GCS messages)
• Blocks received (B)
  - (db_block_size * (cr blocks received + current blocks received)) / mtu size
• PQ messages received (P)
  - (parallel_execution_message_size * no. of PQ remote messages received) / mtu size
• Total bandwidth required
  - (Messages received + Blocks received + PQ messages received) / max network transmit capacity
  - (M+B+P)/85000
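The estimate above can be sketched as a small function. It reproduces the slide's formula term by term and does not try to reconcile the units of M, B and P. The function and parameter names, the 8192-byte block size and the 1500-byte MTU are assumptions for illustration; 2048 is the slide's default PARALLEL_EXECUTION_MESSAGE_SIZE, and 85000 is the slide's transmit capacity constant.

```python
def interconnect_load(ges_msgs, gcs_msgs, cr_blocks, cur_blocks, pq_remote_msgs,
                      db_block_size=8192, pq_msg_size=2048, mtu=1500,
                      capacity=85_000):
    """Apply the slide's estimate; all counters are per-second rates."""
    m = 256 * (ges_msgs + gcs_msgs)                     # short GES/GCS messages
    b = db_block_size * (cr_blocks + cur_blocks) / mtu  # block transfers
    p = pq_msg_size * pq_remote_msgs / mtu              # PQ messages
    return {"M": m, "B": b, "P": p, "fraction": (m + b + p) / capacity}


# Hypothetical rates, chosen only to exercise the formula.
estimate = interconnect_load(ges_msgs=100, gcs_msgs=200,
                             cr_blocks=3, cur_blocks=0,
                             pq_remote_msgs=1500)
print(estimate)
```

The inputs would normally come from the per-second columns of an AWR report; the rates used here are made up.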

Interconnect Design – Cache Traffic
• Example from AWR:

Global Cache Load Profile            Per Sec   Per Trans
-------------------------------     --------   ---------
Global Cache blocks received:           2.70        2.23
Global Cache blocks served:             2.84        2.36
GCS/GES messages received:            164.07      136.03
GCS/GES messages sent:                136.96      113.56
DBWR Fusion writes:                     0.22        0.18
Estd Interconnect traffic (KB):       103.08

• This DW system primarily uses PQ
• Global cache traffic is minimal
• Mostly dictionary blocks

Interconnect Design – IPQ traffic
• Example from AWR:

Statistic                        Total   per Sec   per Trans
---------------------------   --------   -------   ---------
PX local messages recv'd           104       0.1         0.1
PX local messages sent             104       0.1         0.1
PX remote messages recv'd       200271     200.2       151.1
PX remote messages sent         213267     213.2       156.1

• This system receives 200 messages per second
• PQ message size is 8182
• Usage is 1.5 MB/sec
• For this workload GigE should be optimal
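The usage figure follows from multiplying the message rate by the message size. This is a minimal sketch with hypothetical function names. The nominal GigE capacity of 125 MB/sec and the 50% planning ceiling are assumptions used only to illustrate the comparison.

```python
def pq_traffic_mb_per_sec(msgs_per_sec: float, msg_size_bytes: int) -> float:
    """Interconnect traffic from PQ messages, in MB/sec (1 MB = 1024 * 1024 bytes)."""
    return msgs_per_sec * msg_size_bytes / (1024 * 1024)


def fits(traffic_mb: float, link_mb: float, planned_utilization: float) -> bool:
    """True if the traffic stays within the planned share of the link."""
    return traffic_mb <= link_mb * planned_utilization


# Rate and message size taken from the AWR example on the slide.
traffic = pq_traffic_mb_per_sec(msgs_per_sec=200.2, msg_size_bytes=8182)

# Assumed: nominal GigE capacity of 125 MB/sec and a 50% planning ceiling.
print(round(traffic, 2), fits(traffic, link_mb=125.0, planned_utilization=0.5))
```

At roughly 1.5 MB/sec, the PQ traffic uses only a small part of a single GigE link, which agrees with the slide's conclusion.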

Interconnect Design
• DW Specific Best practices
• Plan 50-70% utilization of Network Bandwidth
• GigE performs very well
  - IPQ usage is less
  - Multiplexed GigE is the choice for many customers
• For high IPQ usage
  - Infiniband, if available on your platform
  - RDS in Linux offers good performance over IB

Temporary Tablespace Design
• Large sorts in a Data Warehouse use temp space
• For performance reasons, temp space allocation is managed through the SGA
• Unless requested, space allocated in one instance is not returned to the common pool
• Space reclamation is done under the SS and CI enqueues
• This could cause a slowdown if space is reclaimed constantly
• A few queries with excessive temp space requirements can cause an imbalance of usage among instances

Temporary Tablespace Design
• DW Specific Best practices
• Make sure enough temp space is allocated, combining all instances' usage
• Allocate a separate temp tablespace for users who perform large sorts
• For each temp tablespace, create as many temp files as the number of instances
  - This would eliminate 'buffer busy' waits associated with the temp file header
• If an imbalance is found, use the following command to release excessive allocation
  - "alter session set events 'immediate trace name drop_segments level <TS number + 1>';"
  - Metalink Note: 465840.1 for more details

RAC and Data Warehouse Database Technologies

Automatic Workload Management: Services
• Application workloads can be defined as Services
  - Individually managed and controlled
  - Assigned to instances during normal startup
  - On instance failure, automatic re-assignment
  - Service performance individually tracked
  - Finer grained control with Resource Manager
  - Integrated with other Oracle tools / facilities (e.g., Scheduler, Streams)
• Managed by Oracle Clusterware

Many Services, One Database
[Figure: one database running on Node-1 through Node-6, offering the services Queries, Aggregations, ETL1, Backup and ETL2.]

How to define a service
• 1. SRVCTL
srvctl add service -d ORA -s APP1 -r INSTANCE1,INSTANCE2
srvctl add service -d ORA -s APP2 -r INSTANCE3,INSTANCE4
(-d: db, -s: service, -r: preferred instances)
• 2. Using OEM Grid Control
• 3. DBMS_SERVICE (for single instance)
DBMS_SERVICE.CREATE_SERVICE

Partitioning
• Powerful functionality for partitioning objects into smaller pieces
• Beneficial for any environment with large volumes of data
• Business decision, not hardware based (top-down design approach, NOT bottom-up)

Partitioning Strategies
• Range Partitioning
• Hash Partitioning
• List Partitioning
• Composite Partitioning
  - Composite Range-Range Partitioning
  - Composite Range-Hash Partitioning
  - Composite Range-List Partitioning
  - Composite List-Range Partitioning
  - Composite List-Hash Partitioning
  - Composite List-List Partitioning

Query Performance: Partition Pruning
Only the relevant partitions are accessed

select sum(sales_amount)
from sales
where sales_date between
  to_date('01-MAR-2005','DD-MON-YYYY')
  and to_date('31-MAY-2005','DD-MON-YYYY')

[Figure: the Sales table partitioned by month: 05-Jan, 05-Feb, 05-Mar, 05-Apr, 05-May, 05-Jun.]
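The pruning decision for the query above can be illustrated with a short sketch: a partition is read only if its date range overlaps the predicate's range. This is a simplified model of the idea, not Oracle's implementation. The partition list mirrors the figure, and the `prune` function is hypothetical.

```python
from datetime import date

# Monthly range partitions: (name, first day inclusive, first day of next month exclusive)
PARTITIONS = [
    ("05-Jan", date(2005, 1, 1), date(2005, 2, 1)),
    ("05-Feb", date(2005, 2, 1), date(2005, 3, 1)),
    ("05-Mar", date(2005, 3, 1), date(2005, 4, 1)),
    ("05-Apr", date(2005, 4, 1), date(2005, 5, 1)),
    ("05-May", date(2005, 5, 1), date(2005, 6, 1)),
    ("05-Jun", date(2005, 6, 1), date(2005, 7, 1)),
]


def prune(lo: date, hi: date) -> list[str]:
    """Return the partitions whose range overlaps the closed interval [lo, hi]."""
    return [name for name, start, end in PARTITIONS if start <= hi and lo < end]


# The BETWEEN predicate from the query: 01-MAR-2005 to 31-MAY-2005.
print(prune(date(2005, 3, 1), date(2005, 5, 31)))  # ['05-Mar', '05-Apr', '05-May']
```

Three of the six partitions are scanned; the other three are skipped without being read.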

Partition-wise Joins
• Partition-wise join may provide significant performance improvements
• Partition-wise join supported for range, hash and composite partitioning
• Optimizer chooses partition-wise joins whenever possible
• Degree of parallelism not correlated to number of partitions

Full Partition-wise Joins
When joining two tables that are partitioned on the join key, Oracle may choose to join on a per-partition basis
[Figure: Lineitem and Orders, partition 05-Apr. Sub-partitions Sub-1, Sub-2 and Sub-3 of both tables are joined pairwise on Node 1, Node 2 and Node 3 respectively.]
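The per-partition join can be sketched as follows. This is a simplified illustration rather than Oracle's implementation: both tables are hash-partitioned on the join key with the same function, so sub-partition i of one table only needs to be joined with sub-partition i of the other. The column name `order_id` and all function names are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, key, n):
    """Split rows into n sub-partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def join_pair(orders_part, lineitem_part):
    """Join one Orders sub-partition with its matching Lineitem sub-partition."""
    by_key = defaultdict(list)
    for o in orders_part:
        by_key[o["order_id"]].append(o)
    return [(o, l) for l in lineitem_part for o in by_key.get(l["order_id"], [])]


def full_partition_wise_join(orders, lineitem, n=3):
    """Join sub-partition i with sub-partition i only; the pairs are independent."""
    o_parts = hash_partition(orders, "order_id", n)
    l_parts = hash_partition(lineitem, "order_id", n)
    return [pair for i in range(n) for pair in join_pair(o_parts[i], l_parts[i])]


orders = [{"order_id": i} for i in range(6)]
lineitem = [{"order_id": i % 6, "line": i} for i in range(12)]
print(len(full_partition_wise_join(orders, lineitem)))  # 12
```

Because each pair is independent, the three pairs could be processed on three different nodes without exchanging rows, which is the point of the figure.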

Partial Partition-wise Joins
Partial partition-wise join: if Lineitem is partitioned by the join key, then Orders can be re-distributed to enable a partition-wise join
[Figure: Lineitem sub-partitions Sub-1, Sub-2 and Sub-3 reside on Node 1, Node 2 and Node 3; Orders is re-distributed into matching sub-partitions Sub-1, Sub-2 and Sub-3 on the same nodes.]

What is Parallelism
• Breaking a single task into multiple smaller, distinct units
• Instead of one process doing all the work, multiple processes work concurrently on smaller units
• Independent of the number of nodes

How Parallel Execution Works
• With serial execution, only one process is used
• With parallel execution:
  - One parallel execution coordinator process
  - Many parallel execution servers
  - Table may be dynamically partitioned
[Figure: SELECT COUNT(*) FROM sales executed against SALES by a single serial process, and by a coordinator with several parallel execution servers.]

Parallel Operations
SELECT cust_last_name, cust_first_name FROM customers ORDER BY cust_last_name;
[Figure: DOP=3. Producers: three execution servers scan the table on disk, which is dynamically partitioned into granules (intra-parallelism). Consumers: three execution servers sort the ranges A-K, L-S and T-Z (intra-parallelism). Rows flow from producers to consumers (inter-parallelism), and the coordinator dispatches the results.]

How Parallel Execution Servers Communicate
• Rows Distribution:
  - PARTITION
  - HASH
  - RANGE
  - ROUND-ROBIN
  - BROADCAST
  - QC(ORDER)
  - QC(RANDOM)
[Figure: DOP=3. Parallel Execution Server Set 1 distributes rows to Parallel Execution Server Set 2, which returns rows to the QC.]
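Four of the distribution methods listed above can be illustrated with a short sketch. It models only the routing rule that sends each row from a producer to one of the consumers (to all consumers for BROADCAST); it is not Oracle's implementation. The function names are hypothetical, and the range boundaries A-K, L-S, T-Z are assumed for the example.

```python
import bisect


def hash_dist(rows, dop, key=lambda r: r):
    """HASH: equal keys always land on the same consumer."""
    out = [[] for _ in range(dop)]
    for r in rows:
        out[hash(key(r)) % dop].append(r)
    return out


def range_dist(rows, upper_bounds, key=lambda r: r):
    """RANGE: consumer i receives keys up to upper_bounds[i]; the last takes the rest."""
    out = [[] for _ in range(len(upper_bounds) + 1)]
    for r in rows:
        out[bisect.bisect_left(upper_bounds, key(r))].append(r)
    return out


def round_robin_dist(rows, dop):
    """ROUND-ROBIN: rows are dealt out in turn, regardless of content."""
    out = [[] for _ in range(dop)]
    for i, r in enumerate(rows):
        out[i % dop].append(r)
    return out


def broadcast_dist(rows, dop):
    """BROADCAST: every consumer receives every row."""
    return [list(rows) for _ in range(dop)]


names = ["Adams", "King", "Lee", "Smith", "Turner", "Zane"]
# Ranges A-K, L-S, T-Z, keyed on the first letter, with DOP=3.
print(range_dist(names, ["K", "S"], key=lambda n: n[0].upper()))
```

Range distribution suits sorts, because each consumer's output is already ordered relative to the others. Hash distribution suits joins, because matching keys from both inputs meet on the same consumer.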

Degree of Parallelism (DOP)
• Number of parallel execution servers used by one parallel operation
• Applies only to intra-operation parallelism
• If inter-operation parallelism is used, then the number of parallel execution servers can be twice the DOP
• No more than two sets of parallel execution servers can be used for one parallelized statement
• When using partition granules, use a relatively high number of partitions

Parallel Execution with RAC
• Execution slaves have node affinity with the execution coordinator, but will expand if needed.
[Figure: Node 1 through Node 4 attached to shared disks, showing the execution coordinator and the parallel execution servers.]

Adaptive Parallelism
• Adaptive Multiuser feature adjusts the DOP based on user load
• Enabled by default: PARALLEL_ADAPTIVE_MULTI_USER=TRUE
[Figure: Node 1 and Node 2. Initially no workload. 1st user logs on and issues a query -> parallel 8. 2nd user logs on and issues a query -> parallel 4. 3rd and 4th users log on and issue queries -> parallel 4.]

Inter-node Parallel Query – Oracle10g
• Parallel execution slaves allocated on instances without regard for services
• Benefits of services greatly reduced when using parallel execution
• Workaround: instance groups
  - instance_groups=ig1,ig2,ig3 (not dynamic)
  - parallel_instance_group=ig2 (dynamic)

Inter-node Parallel Query – Oracle11g
• Parallel execution slaves only allocated on instances offering the service that the user session is connected to
• All services have equivalent, dynamic instance groups
• Services can be created
  - For different IPQ user groups
• Preferred and Available characteristics of services can be exploited
• IPQ SLAs can be guaranteed through service failover

Overview: Parallel Join Execution
• EMP and DEPT joined on deptno
• Repartition EMP and DEPT on deptno
• Join each partition
[Figure: two DFOs each scan a table and send rows by hash; a third DFO receives both inputs, performs the hash join and sends the result to the QC.]

Parallel Hash-Join with 8 Slaves
[Figure: Node 1 and Node 2 exchanging rows over the interconnect.]
Interconnect Can Become a Bottleneck

Pre-filtering can reduce communication
[Figure: the hash-join DFO receives both inputs and creates a shared Bloom filter (Filter Create, Set); a scanning DFO tests its rows against the filter (Filter Use, Test) before sending them.]

11gR1: Extended to Serial Execution

Serial Plan

Hash Join

Filter Create


Group By
Local Bloom filter

Scan Dept


Filter Use

Scan Emp

Parallel Execution on RAC
• Need to Merge bloom filter over a cluster
• Potentially costly operation

• Prior to merging, each node contains a private, incomplete bloom filter
• Merging is done in parallel
• Each producer splits the bloom filter into pieces
• Each piece is sent to a single consumer on each other node
• Each consumer merges the received pieces into its local bloom filter

• After Merging the bloom filter is complete and can be used for filtering
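The merge steps above can be sketched with a toy Bloom filter. This is an illustration of the idea rather than Oracle's implementation: each node builds a private filter from the keys it has seen, the filter is split into pieces, and each piece is OR-ed into the corresponding slice of every other node's filter. The filter size, the hash scheme and all function names are assumptions.

```python
import hashlib

M = 64  # bits per filter (illustrative size)
K = 2   # bit positions set per key


def positions(key):
    """K bit positions for a key, derived from a SHA-256 digest."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]


def build(keys):
    """Private filter built from the keys one node has seen."""
    bits = [0] * M
    for key in keys:
        for p in positions(key):
            bits[p] = 1
    return bits


def might_contain(bits, key):
    """False means definitely absent; True means possibly present."""
    return all(bits[p] for p in positions(key))


def merge(filters, pieces=4):
    """OR every node's filter into every other node's filter, one piece at a time."""
    size = M // pieces
    merged = [list(f) for f in filters]
    for src, f in enumerate(filters):
        for j in range(pieces):  # piece j is handled by consumer j on each other node
            lo, hi = j * size, (j + 1) * size
            for dst in range(len(filters)):
                if dst != src:
                    for i in range(lo, hi):
                        merged[dst][i] |= f[i]
    return merged


node_filters = [build([10, 20]), build([30, 40])]
complete = merge(node_filters)
print(all(might_contain(complete[0], k) for k in (10, 20, 30, 40)))  # True
```

After the merge, every node holds the same filter, identical to one built from the union of all keys, so rows can be filtered locally before they cross the interconnect.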

Two Approaches to Parallelism and Partitioning
Shared Everything
• Parallel degree independent of the number of nodes
• Data partitioning independent of the number of nodes

Shared Nothing
• Static parallel degree dependent on number of nodes
• Static data partitioning dependent on number of nodes

Data A-Z

Hash 1

Hash 2

Hash 3

Hash 4

Oracle Advanced Compression
• Oracle 9i compresses data only during bulk load, useful for DW and ILM
• Oracle 11g compresses w/ inserts, updates
• Trade some CPU for disk & I/O efficiency
• Compress large application tables
  - Transaction processing, data warehousing
• Compress all data types: structured, unstructured
• Savings cascade to all db copies: test, dev, standby, mirrors, archiving, backup, etc.

Let’s Talk About RAC & Data Warehouse

The Key Question
How should I design and configure my Oracle Data Warehouse?
Answer: It depends…

Few Large Nodes or Many Small Nodes?

Manageability
• Many nodes are more difficult to manage:
  - Increased maintenance
  - Performance problems are harder to diagnose
  - Statistics gathering is more challenging
• However, computing power lost during planned and unplanned outages has less impact:
  - 16 x 2 grid: 6% less power
  - 4 x 8 grid: 25% less power
• Many nodes are more flexible to distribute different workloads
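The percentages for the two grid shapes follow from dividing the number of unavailable nodes by the total number of nodes. This is a minimal sketch that assumes identical nodes; the function name is hypothetical.

```python
def capacity_lost_pct(nodes: int, nodes_down: int = 1) -> float:
    """Share of total computing power lost when nodes_down identical nodes are out."""
    return 100.0 * nodes_down / nodes


print(capacity_lost_pct(16))  # 6.25  (16 x 2 grid, rounded to 6% on the slide)
print(capacity_lost_pct(4))   # 25.0  (4 x 8 grid)
```

The same ratio gives the capacity gained by adding one node to a grid of that size.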

Scalability: Scale-Out
• Easy scale out
  - Simply add nodes with no reconfiguration of database, but
  - Keep a balanced system
  - Watch out for number of slots in switch
• We recommend adding only nodes with similar performance characteristics: CPUs, HBAs, NICs etc.
• Scale-out increment is one node
  - 16 x 2 grid: 6% increase in computing power
  - 4 x 8 grid: 25% increase in computing power

How can I run different workload types?
• Managing and partitioning the workload using Services
  - Services provide a single system image for managing workload
  - Services span one or more instances of the database. An instance can support multiple services. The number of instances offering the service is managed by the DBA independent of the application
  - How many services do I need to define?
  - How many instances will offer a service?
  - Are there services that should run on one instance for performance reasons (contention on resources, for example)?
• Managing the workload using Resource Manager
  - Using Oracle Database Resource Manager facilitates meeting SLAs and provides effective control of system resources, focused primarily on running an Oracle database instance

What is the optimal partitioning strategy?
• Partitions are the foundation for achieving effective performance in a large/very large data warehouse, and other features depend on partitioning to achieve their benefit
• Important criteria considered when choosing partitioning:
  - Performance (primary motivation)
  - Ease of administration/management
  - Data purge
  - Data archiving
  - Data movement
  - Data lifecycle management
  - Efficiency of backup

What is the optimal partitioning strategy?
• Grouping data by value for pruning (range, list)
• Balancing data distribution (hash partitioning)
• Dividing data across parallel processing to balance workload (partition-wise join)
• Combining different partition mechanisms (composite partitioning)

Which degree of parallelism?
• Different scenarios can be used for parallel query:
  - Standard use of parallel query for large data sets. In this scenario, the degree of parallelism can be defined to utilize all of the available resources across the cluster.
  - Use of restricted parallel query. This scenario restricts the processing to specific nodes in the cluster. Thus nodes can be logically grouped for specific types of operations. This can be done by using Services and/or Parallel_instance_Group.

Which degree of parallelism?
• The use of parallel operations within the RAC environment provides the flexibility to utilize all the server hardware that is part of the cluster architecture. Utilizing instance groups, database administrators can further control the allocation of these resources based on application requirements or service level agreements.
• The downside of parallel operations is the exhaustion of server resources:
  - If CPU utilization is relatively high on one node, using more instances for parallel query may help.
  - If I/O bottlenecks currently exist, use of parallel operations may exacerbate them.

Summary – DW on RAC Best Practices
• Design to support Business Needs
• Implement and test
• Partition the Data
• Partition the Workload
• Configure Parallel Query
• Measure and Monitor