RAC Internals

Julian Dyke Independent Consultant
CERN Geneva - November 2008

© 2008 Julian Dyke

juliandyke.com

About me...
20 years Oracle experience as DBA, developer and consultant
Independent Consultant specializing in
  Kernel
  Performance Tuning
  RAC and High Availability
Chair of UKOUG RAC & HA SIG
Regular presenter at conferences, seminars and user group meetings in UK, Europe and USA
Member of Oak Table Network
Website http://www.juliandyke.com specializing in Oracle internals

About the book...
Pro Oracle Database 10g RAC on Linux
Co-authored with Steve Shaw of Intel Corporation
Published by Apress
Available August 2006
ISBN: 1-59059-524-6
New edition planned for 2009 (Oracle 11gR2)



Agenda
Interconnect
RAC Background Processes
Global Cache Services

RAC 4-node cluster
[Diagram: Node 1 to Node 4, each running one instance (Instance 1 to Instance 4), are connected to the Public Network, the Private Network (Interconnect) and the Storage Network, which leads to Shared Storage.]

Interconnect Overview
Instances communicate with each other over the interconnect (network)
Information transferred between instances includes
  data blocks
  locks
  SCNs
Typically 1Gb Ethernet
  UDP protocol
  Often teamed in pairs to avoid SPOFs
Can also use Infiniband
  Fewer levels in stack
Other proprietary protocols are available

Interconnect TCP/IP Five Layer Model
All messages travel down through layers, across physical layer then up again
[Diagram: two five-layer stacks side by side (5 Application, 4 Transport, 3 Network, 2 Data Link, 1 Physical), joined at the physical layer.]

Interconnect TCP/IP Five Layer Model
TCP/IP has a four or five layer model
Five-layer model shown below

Layer            TCP/IP Suite
5 Application    DHCP, DNS, FTP, HTTP, NFS, NTP, RPC, SMTP, SNMP, SSH, TELNET, SOAP
4 Transport      TCP, UDP
3 Network        IP (IPv4, IPv6), ICMP, ARP, RARP
2 Data Link      Ethernet, 802.11, Wi-Fi, Token Ring, FDDI, PPP
1 Physical       10BASE-T, 100BASE-T, 1000BASE-T, Optical Fibre, Twisted Pair

Four-layer model combines data link and physical layers

Interconnect TCP/IP Transport Layer
Transport Layer
  Connection-oriented (TCP)
  Connectionless (UDP)
[Diagram: Clusterware uses TCP and RAC uses UDP; both run over IP, Ethernet and the Physical Layer.]

Interconnect Encapsulation
[Diagram: the data is wrapped in turn by a UDP header, an IP header, and an Ethernet header and trailer.]

Component          Size
Ethernet Header    14 bytes
IP Header          20 bytes
UDP Header         8 bytes
Data               variable
Ethernet Trailer   4 bytes

MTU Size covers the IP header, UDP header and data
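The header sizes on the encapsulation slide can be turned into a quick calculation of how much data fits in one packet. The following is a minimal sketch, not anything taken from Oracle: the function names are hypothetical, and the MTU values of 1500 (standard Ethernet) and 9000 (jumbo frames) are assumptions chosen for illustration.

```python
# Header sizes in bytes, as labelled on the encapsulation slide.
ETHERNET_HEADER = 14
IP_HEADER = 20
UDP_HEADER = 8
ETHERNET_TRAILER = 4


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in a single unfragmented IP packet."""
    return mtu - IP_HEADER - UDP_HEADER


def frame_size(payload: int) -> int:
    """Bytes in the Ethernet frame carrying one UDP payload (header and trailer included)."""
    return ETHERNET_HEADER + IP_HEADER + UDP_HEADER + payload + ETHERNET_TRAILER


if __name__ == "__main__":
    # 1500 and 9000 are assumed MTU values (standard and jumbo frames).
    for mtu in (1500, 9000):
        payload = max_udp_payload(mtu)
        print(f"MTU {mtu}: payload {payload} bytes, frame {frame_size(payload)} bytes")
```

With the assumed 1500-byte MTU a database block larger than the payload limit cannot travel in one packet, which is consistent with the multi-part block transfers shown later in the deck.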

Oracle Clusterware Node Heartbeat Messages
Sent to each node in cluster every second in both directions
Checks nodes are still members of cluster
Sent by ocssd.bin using TCP well-known port 49895
Outgoing message is 134 bytes (80 byte payload)
Incoming message is 66 bytes (12 byte payload)
[Diagram: outgoing and incoming heartbeat messages between Node 1, Node 2, Node 3 and Node 4.]

Oracle Clusterware Node Status Messages
Number of packets exchanged by a node is determined by number of nodes in cluster
Number of packets per node per hour is
  (#nodes - 1) * 4 messages * 3600 seconds

Number of nodes   Packets per hour
2                 14,400
3                 28,800
4                 43,200
5                 57,600
6                 72,000
7                 86,400
8                 100,800
16                216,000
32                446,400
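The packets-per-hour formula above is simple enough to check directly. This is a small sketch of that arithmetic; the function name and parameter names are hypothetical, and the defaults (4 messages, 3600 seconds) come from the formula on the slide.

```python
def packets_per_hour(nodes: int, messages: int = 4, seconds: int = 3600) -> int:
    """Packets exchanged by one node per hour: (#nodes - 1) * messages * seconds."""
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes")
    return (nodes - 1) * messages * seconds


if __name__ == "__main__":
    # Same node counts as the table on the slide.
    for n in (2, 3, 4, 5, 6, 7, 8, 16, 32):
        print(f"{n:>2} nodes: {packets_per_hour(n):,} packets per hour")
```

The growth is linear per node, so the total across the whole cluster grows roughly with the square of the node count.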

Global Services Overview
Resource
  Object to which access must be controlled at instance level
Enqueue
  Memory structure that serializes access to a resource
Global Resources
  Object to which access must be controlled at cluster level
Global Enqueue
  Locks and enqueues which need to be consistent between all instances

Global Services Overview
Global Resource Directory (GRD)
  Records current state and owner of each resource
  Contains convert and write queues
  Distributed across all instances in cluster
  Maintained by GCS and GES
Global Cache Services (GCS)
  Implements cache coherency for database
  Coordinates access to database blocks for instances
Global Enqueue Services (GES)
  Controls access to other resources (locks) including library cache and dictionary cache
  Performs deadlock detection

RAC Background Processes Overview
[Diagram: Node 1 runs Instance 1 and Node 2 runs Instance 2. Each instance has a Buffer Cache and a Shared Pool, the processes PMON, SMON, DBWR, LGWR, CKPT and ARCn, and the RAC background processes DIAG, LMON, LCK0, LMD0 and LMSn. The instances share the Datafiles and Controlfiles; Redo Logs are shown for each instance.]

RAC Background Processes LMSn
LMSn Global Cache Service Process
Manage requests for data access across cluster
Up to 20 in Oracle 10.1
  LMS0-LMS9
  LMSa-LMSj
Up to 36 in Oracle 10.2
  LMS0-LMS9
  LMSa-LMSz
In Oracle 10.1 and above, number of GCS server processes can be configured using gcs_server_processes parameter
  Default value is 1 (single CPU system)
Can also be configured using _lm_lms parameter

RAC Background Processes LMSn
In Oracle 10.2 and above LMS processes run in real-time mode
Remaining processes run in time-share mode
Check using:

[oracle@server3 ~]$ ps -eo pid,user,opri,cmd | grep ora_lm
8596 oracle 75 ora_lmon_TEST1
8598 oracle 75 ora_lmd0_TEST1
8601 oracle 58 ora_lms0_TEST1

58 is real time, 75 or 76 is time share
You can also check process scheduling policies using chrt

[oracle@server3 ~]$ chrt -p 8601     # lms0 - Real Time
pid 8601's current scheduling policy: SCHED_RR
pid 8601's current scheduling priority: 1
[oracle@server3 ~]$ chrt -p 8596     # lmon - Time Share
pid 8596's current scheduling policy: SCHED_OTHER
pid 8596's current scheduling priority: 0

RAC Background Processes LCK0
LCK0 Instance Enqueue Process
Part of KCL (Kernel Cache Library)
Manages
  instance resource requests
  cross-instance call operations
Assists LMS processes
Formerly known as lock process
One LCK0 process per instance
In 9.0.1 and below, number of lock processes may be configurable using _gc_lck_procs parameter

RAC Background Processes LMD0
LMD0 Global Enqueue Service Daemon
Manages requests for global enqueues
Updates status of enqueues when granted to / revoked from an instance
Responsible for deadlock detection
One LMD0 process per instance
In 8.1.7 and below number of lock daemons may be configurable using _lm_dlmd_processes parameter

RAC Background Processes LMON
LMON Global Enqueue Service Monitor
One LMON process per instance
Monitors cluster to maintain global enqueues and resources
Manages
  instance and process expirations
  recovery processing for cluster enqueues

RAC Background Processes DIAG
DIAG - Diagnosability Process
Collects diagnostic data in the event of a failure
Creates subdirectories in BACKGROUND_DUMP_DEST directory
In Oracle 9.0.1 and above can be disabled using _diag_daemon parameter
  Do not try this on a production system

Global Cache Services Introduction
Global Cache Services exist to implement Cache Fusion
Cache Fusion allows blocks to be updated by multiple instances
Only one instance can have the updatable (current) version of a block
GCS must ensure that only one instance can update a block at any time
Many instances can have read-only (consistent read) versions of a block
Instances can have multiple copies of same block at different SCNs

Global Cache Services 2-way Consistent Read
[Diagram: Instance 1 to Instance 4, one of which is the Resource Master, holding block 1318; the requesting instance's lock changes from N to S.]
1 Request shared resource
2 Request granted
3 Read request
4 Block returned
Instance 2 requests current read on block 1318

Global Cache Services 3-way Current Read
[Diagram: Instance 1 to Instance 4, one of which is the Resource Master; block values 1318 and 1320; lock states change between N, S and X.]
1 Request exclusive resource
2 Transfer block to Instance 1 for exclusive access
3 Block and resource status
4 Resource status
Instance 1 requests exclusive read on block 1318

Global Cache Services 3-way Current Read (Dirty Block)
[Diagram: Instance 1 to Instance 4, one of which is the Resource Master; block values 1318 and 1320; lock states change between N, S and X.]
1 Request block in exclusive mode
2 Transfer block to Instance 4 in exclusive mode
3 Block and resource status
4 Resource status
Instance 4 requests exclusive read on block 1323
Note that Instance 1 will create a past image (PI) of the dirty block

Global Cache Services 3-way Current (Without Downgrade)
[Diagram: Instance 1 to Instance 4, one of which is the Resource Master; block values 1318 and 1320; lock states N and X.]
1 Request block in shared mode
2 Transfer block to Instance 2 in shared mode
3 Block and resource status
4 Resource status
Instance 2 requests current read on block 1323
In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions

Global Cache Services 3-way Current (With Downgrade)
[Diagram: Instance 1 to Instance 4, one of which is the Resource Master; block values 1318 and 1320; lock states N, X and S.]
1 Request block in shared mode
2 Transfer block to Instance 2 in shared mode
3 Block and resource status
4 Resource status
Instance 2 requests current read on block 1323
In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions

Global Cache Services Wait Events
Wait events show reads where messages have been exchanged with other instances
Can include:
  gc cr grant 2-way
  gc cr block 2-way
  gc cr block 3-way
  gc cr multi block request
  gc current grant 2-way
  gc current block 2-way
  gc current block 3-way
  gc current multi block request

Global Cache Services Cache Fusion Example
[Diagram: four-node cluster RAC1 to RAC4 with a Resource Master, showing block 1318 and the row values held in each instance.]
RAC3: UPDATE t1 SET c2 = 42 WHERE c1 = 1;
RAC4: UPDATE t1 SET c2 = 50 WHERE c1 = 2;

Global Cache Services Cache Fusion Example
RAC4 executes UPDATE t1 SET c2 = 42 WHERE c1 = 2;
No statistics so dynamic sampling required
No indexes so full table scan required
Steps are:

Phase              Transfer   Read type         Block
Dynamic Sampling   3-way      Consistent Read   Table block 15
                   2-way      Consistent Read   Undo block 89
                   2-way      Consistent Read   Undo block 239
Consistent Read    3-way      Consistent Read   Table block 15
                   2-way      Consistent Read   Undo block 89
Current Read       3-way      Current Read      Table block 15

Global Cache Services Cache Fusion Example
Dynamic Sampling - 10046/8

PARSING IN CURSOR #4 len=433 dep=1 uid=55 oct=3 lid=55 hv=574971495 ad='2b8da360'
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB)
  opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */
  NVL(SUM(C1),:"SYS_B_0"), NVL(SUM(C2),:"SYS_B_1")
FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("T7") FULL("T7") NO_PARALLEL_INDEX("T7") */
  :"SYS_B_2" AS C1,
  CASE WHEN "T7"."C1"=:"SYS_B_3" THEN :"SYS_B_4" ELSE :"SYS_B_5" END AS C2
  FROM "T7" "T7") SAMPLESUB
END OF STMT
PARSE #4:c=0,e=423,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1
EXEC #4:c=1999,e=2540,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1
WAIT #4: nam='gc cr block 3-way' ela= 836 p1=8 p2=15 p3=1 obj#=51836
WAIT #4: nam='gc cr block 2-way' ela= 442 p1=6 p2=89 p3=67 obj#=51836
WAIT #4: nam='gc cr block 2-way' ela= 453 p1=6 p2=239 p3=68 obj#=51836
FETCH #4:c=0,e=10615,p=0,cr=10,cu=0,mis=0,r=1,dep=1,og=1
STAT #4 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=10 pr=0 pw=0 time=3903 us)'
STAT #4 id=2 cnt=32 pid=1 pos=1 obj=51836 op='TABLE ACCESS FULL T7 (cr=10 pr=0 pw=0 time=2650 us)'

Global Cache Services Cache Fusion Example
UPDATE statement - 10046/8

PARSING IN CURSOR #1 len=34 dep=0 uid=55 oct=6 lid=55 tim=1168417842291309 hv=3829255502 ad='2b8d04dc'
UPDATE t7 SET c2 = 20 WHERE c1 = 5
END OF STMT
PARSE #1:c=10998,e=61121,p=0,cr=11,cu=0,mis=1,r=0,dep=0,og=1
WAIT #1: nam='gc cr block 3-way' ela= 702 p1=8 p2=15 p3=1 obj#=51836
WAIT #1: nam='gc cr block 2-way' ela= 447 p1=6 p2=89 p3=67 obj#=0
WAIT #1: nam='gc current block 3-way' ela= 650 p1=8 p2=15 p3=33554433 obj#=51836
EXEC #1:c=0,e=2931,p=0,cr=10,cu=1,mis=0,r=1,dep=0,og=1
WAIT #1: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj#=51836
WAIT #1: nam='SQL*Net message from client' ela= 7807082 driver id=1650815232 #bytes=1 p3=0 obj#=51836
STAT #1 id=1 cnt=0 pid=0 pos=1 obj=0 op='UPDATE T7 (cr=10 pr=0 pw=0 time=2875 us)'
STAT #1 id=2 cnt=1 pid=1 pos=1 obj=51836 op='TABLE ACCESS FULL T7 (cr=10 pr=0 pw=0 time=1665 us)'

Global Cache Services gc cr block 3-way wait event

Source          Destination     Description                        Bytes
RAC4 - Server   RAC2 - LMS1     Request file 8 block 15              456
RAC2 - LMS1     RAC4 - Server   OK                                   212
RAC2 - LMS1     RAC3 - LMS1     Send file 8 block 15 to RAC4         480
RAC3 - LMS1     RAC2 - LMS1     OK                                   212
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 1        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 2        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 3        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 4        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 5        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 6         868

Global Cache Services gc cr block 3-way wait event
[Diagram: messages 1 to 10 of the gc cr block 3-way exchange between RAC1 to RAC4 and the Resource Master for block 1318, while RAC4 executes UPDATE t1 SET c2 = 50 WHERE c1 = 2;]

Global Cache Services gc cr block 2-way wait event
2-way Consistent Read

Source          Destination     Description                        Bytes
RAC4 - Server   RAC3 - LMS1     Request file 6 block 69              400
RAC3 - LMS1     RAC4 - Server   OK                                   212
RAC3 - LMS1     RAC4 - Server   Block file 6 block 69 part 1        1500
RAC3 - LMS1     RAC4 - Server   Block file 6 block 69 part 2        1500
RAC3 - LMS1     RAC4 - Server   Block file 6 block 69 part 3        1500
RAC3 - LMS1     RAC4 - Server   Block file 6 block 69 part 4        1500
RAC3 - LMS1     RAC4 - Server   Block file 6 block 69 part 5        1500
RAC3 - LMS1     RAC4 - Server   Block file 6 block 69 part 6         868

Global Cache Services gc cr block 2-way wait event
[Diagram: messages 1 to 8 of the gc cr block 2-way exchange between RAC1 to RAC4 and the Resource Master for block 1318, while RAC4 executes UPDATE t1 SET c2 = 50 WHERE c1 = 2;]

Global Cache Services gc current block 3-way wait event
3-way Current Read

Source          Destination     Description                        Bytes
RAC4 - Server   RAC2 - LMS1     Request file 8 block 15              456
RAC2 - LMS1     RAC4 - Server   OK                                   212
RAC2 - LMS1     RAC3 - LMS1     Send file 8 block 15 to RAC4         480
RAC3 - LMS1     RAC2 - LMS1     OK                                   212
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 1        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 2        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 3        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 4        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 5        1500
RAC3 - LMS1     RAC4 - Server   Block file 8 block 15 part 6         868
RAC4 - LMS1     RAC2 - LMS1     Received file 8 block 15             244
RAC2 - LMS1     RAC4 - LMS1     OK                                   212

Global Cache Services gc current block 3-way wait event
[Diagram: messages 1 to 12 of the gc current block 3-way exchange between RAC1 to RAC4 and the Resource Master for block 1318.]
RAC3: UPDATE t1 SET c2 = 42 WHERE c1 = 1;
RAC4: UPDATE t1 SET c2 = 50 WHERE c1 = 2;
RAC3 saves past image of the dirty block until RAC4 writes the block to disk

Global Cache Services Past Images
When an instance passes a dirty block to another instance it
  Flushes redo buffer to redo log
  Retains past image (PI) of block in buffer cache
PI is retained until another instance writes block to disk
Used to reduce recovery times
Recorded in V$BH.STATUS as PI
  Based on X$BH.STATE (value 8 in Oracle 10.2)

Global Cache Services Past Images
[Animated diagram: Instance 1 and Instance 2, each with a Buffer Cache, and Redo Log 1 and Redo Log 2. Block 42 of table t1 is changed by a series of statements from UPDATE t1 SET c1 = 7124; COMMIT; to UPDATE t1 SET c1 = 7129; with undo/redo written to the redo logs, the block transferred between instances by GCS, a Past Image retained, the block written to disk by DBWR, and recovery after an instance crashes.]

Global Cache Services gc cr grant 2-way wait event
2-way Consistent Read

Source          Destination     Description                        Bytes
RAC4 - Server   RAC3 - LMS1     Request file 6 block 69              400
RAC3 - LMS1     RAC4 - Server   OK                                   212
RAC3 - LMS1     RAC4 - Server   Grant read file 6 block 69           276
RAC4 - Server   RAC3 - LMS1     OK                                   212

Global Cache Services gc cr grant 2-way wait event
[Diagram: steps 1 to 6 of the gc cr grant 2-way exchange between RAC1 to RAC4 and the Resource Master for block 1318, while RAC4 executes SELECT c2 FROM t1 WHERE c1 = 1;]

Global Cache Services gc cr multi block request wait event

Source          Destination     Description                        Bytes
RAC4 - Server   RAC3 - LMS1     Request file 8 blocks 69-73         1872
RAC3 - LMS1     RAC4 - Server   OK                                   212
RAC3 - LMS1     RAC4 - Server   Grant file 8 blocks 69-73 to RAC4    772
RAC4 - Server   RAC3 - LMS1     OK                                   212

Global Cache Services gc cr multi block request wait event
[Diagram: steps 1 to 6 of the gc cr multi block request exchange between RAC1 to RAC4 and the Resource Master for a range of blocks starting at 1318, while RAC4 executes SELECT c2 FROM t1 WHERE c1 = 1;]

Global Cache Services gc cr multi block request wait event
The following 10046/8 trace is for a gc cr multi block request

WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 block#=248 class#=1 obj#=51866 tim=1169728375495574
WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244 blocks=5 obj#=51866 tim=1169728375506092

This trace can be misleading because:
  the gc cr multi block request specifies the LAST block in the range
  the gc cr multi block request does not specify how many blocks should be read
  the gc cr multi block request does not specify how many blocks have been returned from another instance
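When comparing the block numbers in WAIT lines like the two above, it can help to pull the fields out programmatically. The following is a minimal sketch that assumes the WAIT line layout shown on the slide (10.2-style `name=value` parameters); the regular expression and the function name `parse_wait` are illustrative and not part of any Oracle tool.

```python
import re

# Matches the fixed prefix of a 10046 WAIT line; the remainder holds name=value pairs.
WAIT_RE = re.compile(
    r"WAIT #(?P<cursor>\d+): nam='(?P<name>[^']+)' ela= *(?P<ela>\d+) (?P<rest>.*)"
)


def parse_wait(line: str) -> dict:
    """Split one WAIT line into cursor number, event name, elapsed time and parameters."""
    match = WAIT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a WAIT line: {line!r}")
    # Parameters look like file#=4 block#=248; keep only the integer-valued ones.
    params = {key: int(value)
              for key, value in re.findall(r"(\S+)=\s*(\d+)", match.group("rest"))}
    return {
        "cursor": int(match.group("cursor")),
        "name": match.group("name"),
        "ela": int(match.group("ela")),
        "params": params,
    }


if __name__ == "__main__":
    gc_wait = parse_wait(
        "WAIT #2: nam='gc cr multi block request' ela= 722 "
        "file#=4 block#=248 class#=1 obj#=51866 tim=1169728375495574"
    )
    disk_wait = parse_wait(
        "WAIT #2: nam='db file scattered read' ela= 10437 "
        "file#=4 block#=244 blocks=5 obj#=51866 tim=1169728375506092"
    )
    first = disk_wait["params"]["block#"]
    last = first + disk_wait["params"]["blocks"] - 1
    # The gc wait reports the last block of the range read by the scattered read.
    print(gc_wait["params"]["block#"], first, last)
```

In this example the scattered read covers blocks 244 to 248, and the gc cr multi block request reports block 248, the last block in that range.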

Global Cache Services UDP Messages
There are two types of message exchanged within RAC
These are PROBABLY defined as follows
Synchronous
  These messages require an acknowledgement for each packet
  In some cases the acknowledgement packet can be larger than the original request
  e.g. SCN synchronization
Asynchronous
  These messages do not require an individual acknowledgement for each packet
  e.g. block transfers between instances

Global Cache Services Lock Modes
Lock modes can be:
  Null
    Another instance can hold an exclusive or shared lock
  Shared
    Another instance can hold a shared lock but not an exclusive lock
  Exclusive
    No other instances can hold shared or exclusive locks
Locks can also be:
  Local
    No other instance has held an exclusive lock
  Global
    Another instance has held an exclusive lock in the past

Global Cache Services Fairness Threshold
Intended to prevent unnecessary lock downgrades when other instances only require read-only copies
For write to read transfers
  Writing instance retains X lock
  Reading instance retains null lock
If _fairness_threshold reached then
  Writing instance downgrades X lock to S lock
  Reading instance receives S lock
_fairness_threshold default value is 4
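The fairness threshold behaviour described above can be pictured as a simple counter. This is only a toy model under stated assumptions: the class and function names are hypothetical, and the rule that the counter is per block, increments once per write-to-read transfer and triggers when it reaches the threshold is an assumption made for illustration, not a description of Oracle's internal logic.

```python
from dataclasses import dataclass


@dataclass
class BlockLockState:
    """Toy model of one block held by a writing instance and requested by a reader."""
    writer_mode: str = "X"   # lock mode held by the writing instance
    reader_mode: str = "N"   # lock mode held by the reading instance
    transfers: int = 0       # write-to-read transfers since the last downgrade


def serve_read(state: BlockLockState, fairness_threshold: int = 4) -> BlockLockState:
    """Serve one read-only copy; downgrade X to S once the threshold is reached (assumed rule)."""
    if state.writer_mode == "X":
        state.transfers += 1
        if state.transfers >= fairness_threshold:
            # Writing instance downgrades to S and the reading instance receives S.
            state.writer_mode = "S"
            state.reader_mode = "S"
            state.transfers = 0
    return state


if __name__ == "__main__":
    state = BlockLockState()
    for request in range(1, 6):
        serve_read(state)
        print(request, state.writer_mode, state.reader_mode)
```

Until the threshold is reached the writer keeps its X lock and can continue updating without any lock conversion, which is the saving the parameter is intended to provide.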


Global Cache Services Lock Elements
Lock elements are externalized in the V$LOCK_ELEMENT dynamic performance view
  Based on X$LE
Additional information is available in the X$LE view
Past image buffers do not have a lock element
In OPS one lock element could manage a contiguous range of blocks
  Still can in RAC using GC_FILES_PER_LOCK parameter
  Disables Cache Fusion


Global Cache Services Lock Elements
Contain embedded GCS Client structures (KJBL)
[Diagram: four Buffer Headers linked to three Lock Elements, each Lock Element containing a GCS Client.]

Global Cache Services Memory Structures
[Diagram: Block Headers (BH) linked to Lock Elements (LE), each containing a GCS Client (KJBL); GCS Shadows (KJBL) linked to GCS Resources (KJBR).]
GCS Shadow describes blocks held by other instances, but mastered locally

Global Cache Services Memory Structures
GCS Resources (KJBR)
  Stored in segmented array
  Number of GCS resource structures determined by _gcs_resources parameter
  Externalized in X$KJBR
  Number of free GCS resource structures in X$KJBRFX
GCS Enqueues (Clients / Shadows) (KJBL)
  GCS clients embedded in lock elements
  GCS shadows stored in segmented array
  Number of GCS shadow structures determined by _gcs_shadow_locks parameter
  Externalized in X$KJBL
  Number of free GCS shadow structures in X$KJBLFX

Global Cache Services Dumps
To dump the contents of the global cache use:

ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME GC_ELEMENTS LEVEL 1';

GLOBAL CACHE ELEMENT DUMP (address: 0x21fecd18):
  id1: 0x3591 id2: 0x10000 obj: 181 block: (1/13713)
  lock: SL rls: 0x0000 acq: 0x0000 latch: 0
  flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
  bscn: 0x0.18a9c bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
  GCS CLIENT 0x21fecd60,1 sq[(nil),(nil)] resp[(nil),0x3591.10000] pkey 181
    grant 1 cvt 0 mdrole 0x21 st 0x20 GRANTQ rl LOCAL
    master 1 owner 0 sid 0 remote[(nil),0] hist 0x7c
    history 0x3c.0x1.0x0.0x0.0x0.
    cflag 0x0 sender 2 flags 0x0 replay# 0
    disk: 0x0000.00000000 write request: 0x0000.00000000
    pi scn: 0x0000.00000000
    msgseq 0x1 updseq 0x0 reqids[1,0,0] infop 0x0
  pkey 181
    hv 107 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 6, dom 0]
    kjga st 0x4, step 0.0.0, cinc 8, rmno 10, flags 0x0
    lb 0, hb 0, myb 178, drmb 178, apifrz 0

Global Cache Services Dumps Continued

GLOBAL CACHE ELEMENT DUMP (address: 0x237f4358):
  id1: 0x6a39 id2: 0x10000 obj: 74 block: (1/27193)
  lock: SL rls: 0x0000 acq: 0x0000 latch: 0
  flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
  bscn: 0x0.26992 bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
  GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
    grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
    master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
    ...
  GCS RESOURCE 0x2ee64e74 hashq [0x2ee61894,0x2ff57390] name[0x6a39.10000] pkey 74
    grant 0x2eff3858 cvt (nil) send (nil),0 write (nil),0@65535
    flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
    ...
  GCS SHADOW 0x2eff3858,1 sq[0x237f43a0,0x2ee64e8c] resp[0x2ee64e74,0x6a39.10000] pkey 74
    grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
    master 0 owner 1 sid 0 remote[0x23fea160,1] hist 0x65f
    ...
  GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
    grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
    master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
    ...

Global Cache Services Block Mastering
Each block is mastered on one instance
Block DBA is reported by X$KJBR.KJBRNAME
Names have the format:
  [<block_number>][<file_number>][BL]
For example
  [0x137][0x40000][BL] is file# 4, block# 311
Ordering by X$KJBR.KJBRNAME is difficult because the resource names do not collate when sorted e.g.:
  [0x12E][0x40000][BL]
  [0x12F][0x40000][BL]
  [0x13][0x40000][BL]
  [0x130][0x40000][BL]
  [0x131][0x40000][BL]
  etc...

Global Cache Services Block Mastering
Some useful functions

-- File number: second hex value in the resource name, divided by 65536
CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2)
RETURN INTEGER IS
  pos1 INTEGER := INSTR (p_resource_name,'x',1,2);
  pos2 INTEGER := INSTR (p_resource_name,']',1,2);
  s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
  RETURN TO_NUMBER (s,'XXXXXXXX') / 65536;
END;
/

-- Block number: first hex value in the resource name
CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2)
RETURN INTEGER IS
  pos1 INTEGER := INSTR (p_resource_name,'x',1,1);
  pos2 INTEGER := INSTR (p_resource_name,']',1,1);
  s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
  RETURN TO_NUMBER (s,'XXXXXXXX');
END;
/
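The same decoding can be done outside the database, for example when post-processing a spooled list of X$KJBR.KJBRNAME values. The sketch below is a hypothetical Python equivalent of the two functions; it assumes the [<block_number>][<file_number>][BL] format described earlier, with the file number stored multiplied by 65536, and the function name is an illustration only.

```python
from typing import List, Tuple


def parse_resource_name(name: str) -> Tuple[int, int]:
    """Return (file_number, block_number) for a name such as [0x137][0x40000][BL]."""
    # "[0x137][0x40000][BL]" -> ["0x137", "0x40000", "BL"]
    parts = name.strip().strip("[]").split("][")
    if len(parts) != 3 or parts[2] != "BL":
        raise ValueError(f"unexpected resource name: {name!r}")
    block_number = int(parts[0], 16)
    file_number = int(parts[1], 16) // 65536
    return file_number, block_number


def sort_resource_names(names: List[str]) -> List[str]:
    """Sort names numerically by file and block instead of by character collation."""
    return sorted(names, key=parse_resource_name)


if __name__ == "__main__":
    names = [
        "[0x12E][0x40000][BL]",
        "[0x12F][0x40000][BL]",
        "[0x13][0x40000][BL]",
        "[0x130][0x40000][BL]",
        "[0x131][0x40000][BL]",
    ]
    for name in sort_resource_names(names):
        print(name, parse_resource_name(name))
```

Sorting on the decoded pair puts [0x13] (block 19) ahead of [0x12E] (block 302), which a plain string sort does not.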

Global Cache Services Block Mastering
In Oracle 10.2 block mastering is determined by _lm_contiguous_res_count
Specifies number of contiguous blocks that will hash to the same HV bucket
Defaults to 128
For example

Instance 0         Instance 1
Start    End       Start    End
0x000    0x07F     0x080    0x0FF
0x100    0x17F     0x180    0x1FF
0x200    0x27F     0x280    0x2FF
0x300    0x37F     0x380    0x3FF
0x400    0x47F     0x480    0x4FF
0x500    0x57F     0x580    0x5FF
etc                etc
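The two-instance ranges above are consistent with grouping blocks into runs of 128 and alternating the master between instances. The sketch below reproduces only that two-instance example; the alternation rule and the function name are assumptions for illustration. The real mapping goes through a hash function, and the four-node table later in this deck shows that masters are not assigned in simple rotation there.

```python
def master_for_block(block_number: int, instances: int = 2, contiguous: int = 128) -> int:
    """Assumed rule: each run of `contiguous` blocks goes to the next instance in turn."""
    return (block_number // contiguous) % instances


if __name__ == "__main__":
    # Print the start and end of the first few ranges with their assumed master.
    for start in range(0x000, 0x300, 0x080):
        end = start + 127
        print(f"0x{start:03X} - 0x{end:03X}: instance {master_for_block(start)}")
```

A practical consequence of any such grouping is that a small segment occupying fewer than 128 contiguous blocks may be mastered entirely on one instance.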

Global Cache Services Block Mastering
In Oracle 10.1 and below block mastering is determined by a hash function
Algorithm applied to groups of 1289 contiguous blocks
In two node cluster
  Instance 0 has 645 blocks
  Instance 1 has 644 blocks
  etc
In three node cluster
  Instance 0 has 430 blocks
  Instance 2 has 215 blocks
  Instance 1 has 430 blocks
  Instance 2 has 214 blocks
  etc
Beware of small hot tables and indexes....

Global Cache Services Block Mastering
The following table shows that masters are still assigned to ranges of 128 contiguous blocks in a four-node cluster

Start Block   End Block   Master
0             127         1
128           255         2
256           383         2
384           511         3
512           639         3
640           767         3
768           895         1
896           1023        0
1024          1279        2
1280          1407        1

Global Cache Services Dynamic Remastering
In Oracle 9.2
  documentation describes dynamic remastering
  not implemented in code
In Oracle 10.1
  works at data file level
  very high threshold so difficult to test
  does occur on some customer sites
In Oracle 10.2
  works at segment level
  thresholds are relatively low

Global Cache Services Dynamic Remastering Example

SELECT data_object_id FROM dba_objects
WHERE owner = 'US01' AND object_name = 'T1';

OBJECT_ID
---------
    52084

To remaster object at current instance use:
  ORADEBUG LKDEBUG -m pkey 52084
All blocks now mastered by the current instance
To redistribute masters to all available instances use:
  ORADEBUG LKDEBUG -m dpkey 52084
Blocks mastered by both (all) instances again

Global Cache Services Dynamic Remastering
Object remastering is recorded in V$GCSPFMASTER_INFO
Instances are internally numbered 0, 1 etc
Initially contains no rows
After remastering object 52084 to instance 0

SELECT object_id, current_master, previous_master
FROM v$gcspfmaster_info;

Object ID   Current Master   Previous Master
52084       0                32767

After remastering object 52084 to instance 1

Object ID   Current Master   Previous Master
52084       1                0

Global Cache Services Dynamic Remastering
In Oracle 10.2 and above, information about Dynamic Remastering operations is also reported in the following fixed views
  X$KJDRMREQ         Dynamic Remastering Requests
  X$KJDRMAFNSTATS    File Remastering Statistics
  X$KJDRMHVSTATS     Hash Value Statistics

Global Cache Services Dynamic Remastering
In Oracle 11.1 and above, Dynamic Remastering statistics are reported in V$DYNAMIC_REMASTER_STATS

Column Name               Data Type
REMASTER_OPS              NUMBER
REMASTER_TIME             NUMBER
REMASTERED_OBJECTS        NUMBER
QUIESCE_TIME              NUMBER
FREEZE_TIME               NUMBER
CLEANUP_TIME              NUMBER
REPLAY_TIME               NUMBER
FIXWRITE_TIME             NUMBER
SYNC_TIME                 NUMBER
RESOURCES_CLEANED         NUMBER
REPLAYED_LOCKS_SENT       NUMBER
REPLAYED_LOCKS_RECEIVED   NUMBER
CURRENT_OBJECTS           NUMBER

Global Cache Services Dynamic Remastering
Dynamic remastering is coordinated by the LMD0 background process
The LMD0 background process trace file includes limited details of dynamic remastering operations
Excessive dynamic remastering can cause instance freezes
  Observed in both Oracle 10.1 and 10.2
Oracle Support occasionally recommends that dynamic remastering is disabled using the following parameters:
  _gc_affinity_time = 0
  _gc_undo_affinity = FALSE

Global Cache Services System Change Number
In RAC clusters SCN must be maintained across all nodes in cluster
SCN propagation scheme differs according to version
In Oracle 10.1 and below defaults to Lamport algorithm
  Lamport in alert.log
  SCN piggy-backed on GCS/GES messages
  Recorded in redo log
  Default delay of 7 seconds
In Oracle 10.2 and above defaults to Broadcast on Commit algorithm
  SCN negotiated immediately
  Apparently no delay

Global Cache Services System Change Number
System Change Number algorithm is determined by the MAX_COMMIT_PROPAGATION_DELAY parameter
In Oracle 10.1 and below
  Initialization parameter specified in centiseconds
  Default value is 700 centiseconds (7 seconds)
  Specifies maximum time taken for a COMMIT on one node to be reflected on other nodes in the cluster
For some applications performing rapid updates and queries of the same data from different instances, value must be set to 0 (Broadcast on commit)
Examples include:
  E-Business suite
  SAP

Global Cache Services System Change Number
In Oracle 10.2 and above
  Default value of MAX_COMMIT_PROPAGATION_DELAY parameter is 0
  SCN broadcast on commit method is used
  SCN updates are synchronized immediately
SCN is synchronized after current read before block updated
  This ensures correct SCN is written to block

Global Cache Services Broadcast on Commit
Ethernet broadcast is not used
SCN is synchronized by updating instance
  Sends UDP SCN synchronization message to each remote instance
  Remote instances respond with their current SCN
  Another round of messages may be required if remote SCNs are more recent than local SCN
Synchronization occurs every time an instance needs a new SCN
Synchronization is always performed by the updating instance
Number of messages = 4 x (number of instances - 1)
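The message count formula above can be checked by enumerating the messages for a given cluster. This is an illustrative sketch only: the function names are hypothetical, and the ordering (one send and acknowledgement per remote instance, then one reply and acknowledgement per remote instance) is an assumption modelled on the four-node listing in this deck rather than a statement of how Oracle schedules the messages.

```python
from typing import List, Tuple

Message = Tuple[str, str, str]  # (source, destination, description)


def scn_sync_message_count(instances: int) -> int:
    """Number of messages = 4 x (number of instances - 1)."""
    return 4 * (instances - 1)


def scn_sync_messages(local: str, remotes: List[str]) -> List[Message]:
    """Enumerate the messages for one SCN synchronization (assumed ordering)."""
    messages: List[Message] = []
    # Updating instance sends its SCN to each remote instance, which acknowledges.
    for remote in remotes:
        messages.append((local, remote, "Send current SCN"))
        messages.append((remote, local, "OK"))
    # Each remote instance responds with its current SCN, which is acknowledged.
    for remote in remotes:
        messages.append((remote, local, "Send current SCN"))
        messages.append((local, remote, "OK"))
    return messages


if __name__ == "__main__":
    for source, destination, description in scn_sync_messages("RAC4", ["RAC1", "RAC2", "RAC3"]):
        print(f"{source} -> {destination}: {description}")
    print(scn_sync_message_count(4), "messages in a 4-node cluster")
```

Because every synchronization involves every remote instance, the cost of obtaining a new SCN grows linearly with the size of the cluster.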

Global Cache Services Broadcast on Commit
In a 4-node cluster 12 messages are exchanged

Source      Destination   Description        Bytes
RAC4-LMS0   RAC1-LMS0     Send current SCN     192
RAC1-LMS0   RAC4-LMS0     OK                   212
RAC4-LMS0   RAC2-LMS0     Send current SCN     192
RAC2-LMS0   RAC4-LMS0     OK                   212
RAC4-LMS0   RAC3-LMS0     Send current SCN     192
RAC3-LMS0   RAC4-LMS0     OK                   212
RAC1-LMS0   RAC4-LMS0     Send current SCN     192
RAC4-LMS0   RAC1-LMS0     OK                   212
RAC2-LMS0   RAC4-LMS0     Send current SCN     192
RAC4-LMS0   RAC2-LMS0     OK                   212
RAC3-LMS0   RAC4-LMS0     Send current SCN     192
RAC4-LMS0   RAC3-LMS0     OK                   212

Global Cache Service Read Consistency
When a read consistent version of a block is requested it may be necessary to apply undo to a more recent version of that block
Undo can be applied by LMSn background process in
  Remote instance
  Local instance
If undo applied by remote instance, any outstanding redo must first be flushed from redo buffer of remote instance to redo log
Can have significant performance impact on consistent reads
  Particularly on extended clusters

Global Cache Service Read Consistency
Statistics on inter-instance consistent reads are reported in V$CR_BLOCK_SERVER
Reports statistics for blocks served by local instances to remote instances including
  Number of consistent reads served
  Number of current reads served
  Number of data blocks served
  Number of undo blocks served
  Number of undo headers served
  Number of fairness down converts
  Number of log flushes
  Number of times light works rule invoked

Global Cache Service Read Consistency
In theory, once a block has been written to disk, the LMS process will not attempt to read it again when responding to a consistent read request
Light Works Rule
  Prevents LMS processes from going to disk when responding to CR requests for data, undo or undo segment blocks
  Can prevent LMS process from completing its response to a CR request

Global Cache Service Read Consistency
Uncommitted changes MUST be flushed to the redo log before the LMS process can ship a consistent block to another instance
Reading process must wait until redo log changes have been written to redo log by LMS process
Bad for standard RAC databases
  Reads must wait for redo log writes
Worse for extended / stretch RAC clusters
  Increased latency of cross site disk communications

Global Cache Service Read Consistency
For each block on which a consistent read is performed, a redo log flush must first be performed
Number of redo log flushes is recorded in the FLUSHES column of V$CR_BLOCK_SERVER
Redo log flush time is recorded in the gc cr block flush time statistic
For the LMS process, the redo log flush
  will increase time taken to serve consistent block
  will increase time taken to perform consistent read
If LMS processes become very busy, consistent reads will experience high wait times
  e.g. for a full table scan gc cr multi block request
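
One way to judge how often consistent reads pay for a redo log flush is to compare two samples of V$CR_BLOCK_SERVER. The sketch below is an illustration, not a monitoring tool: it assumes the samples have already been fetched into dictionaries keyed by the column names CR_REQUESTS, CURRENT_REQUESTS and FLUSHES, the helper name is hypothetical, and the figures are invented.

```python
def flushes_per_request(before: dict, after: dict) -> float:
    """Redo log flushes per block request served between two samples."""
    served = sum(after[col] - before[col] for col in ("CR_REQUESTS", "CURRENT_REQUESTS"))
    flushes = after["FLUSHES"] - before["FLUSHES"]
    # Avoid dividing by zero when nothing was served in the interval.
    return flushes / served if served else 0.0


# Invented sample values, for illustration only.
sample_1 = {"CR_REQUESTS": 1000, "CURRENT_REQUESTS": 200, "FLUSHES": 50}
sample_2 = {"CR_REQUESTS": 1500, "CURRENT_REQUESTS": 300, "FLUSHES": 110}

ratio = flushes_per_request(sample_1, sample_2)
print(f"{ratio:.2%} of served requests needed a redo log flush")
```

A rising ratio suggests that more of the blocks being shipped carry changes whose redo has not yet been written, which is the case where readers wait on redo log writes.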

Global Cache Services Read Consistency
Committed transaction on RAC2 - All blocks still in buffer cache
[Diagram: Buffer Cache and Redo Buffer on RAC1 and RAC2, and the Redo Log, with numbered steps 1 to 3]

Global Cache Services Read Consistency
Committed transaction on RAC2 - Some blocks written to disk
[Diagram: Buffer Cache and Redo Buffer on RAC1 and RAC2, and the Redo Log, with numbered steps 1 to 4]

Global Cache Services Read Consistency
Uncommitted transaction on RAC2 - All blocks still in buffer cache
[Diagram: Buffer Cache and Redo Buffer on RAC1 and RAC2, and the Redo Log, with numbered steps 1 to 6]

Global Cache Services Read Consistency
Uncommitted transaction on RAC2 - Some blocks written to disk
[Diagram: Buffer Cache and Redo Buffer on RAC1 and RAC2, and the Redo Log, with numbered steps 1 to 8]

Global Cache Services Jumbo Frames
By default Maximum Transmission Unit (MTU) is 1500
MTU includes
  IP header
  UDP header
  Data
Requires six packets to transmit one 8192 byte block
On some adapters MTU can be increased to around 9000
  e.g. Intel PRO/1000
At command line
  ifconfig eth1 mtu 9000 up
or in /etc/sysconfig/ifcfg-eth<x>
  MTU=9000

Global Cache Services Jumbo Frames
Example - cost of sending an 8192 byte block

MTU=1500 (default)
Frame#   Ethernet Header   IP Header   UDP Header   Data   Ethernet Trailer   Total
1        14                20          8            1472   4                  1518
2        14                20          8            1472   4                  1518
3        14                20          8            1472   4                  1518
4        14                20          8            1472   4                  1518
5        14                20          8            1472   4                  1518
6        14                20          8            840    4                  886
Total    84                120         48           8200   24                 8476

MTU=9000
Frame#   Ethernet Header   IP Header   UDP Header   Data   Ethernet Trailer   Total
1        14                20          8            8200   4                  8246
Total    14                20          8            8200   4                  8246
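
The totals in both tables can be reproduced with a few lines of arithmetic. This is a sketch that follows the accounting used in the example, where every frame is charged an Ethernet header, IP header, UDP header and Ethernet trailer; with real IP fragmentation only the first fragment carries the UDP header, so actual figures differ slightly. The function name is hypothetical, and the 8200-byte data size is taken from the tables.

```python
import math

# Per-frame overheads in bytes, as listed in the example tables.
ETH_HEADER, IP_HEADER, UDP_HEADER, ETH_TRAILER = 14, 20, 8, 4


def frame_cost(data_bytes: int, mtu: int) -> tuple[int, int]:
    """Return (number of frames, total bytes on the wire) for one message."""
    # The MTU covers the IP header, UDP header and data, not the Ethernet framing.
    data_per_frame = mtu - IP_HEADER - UDP_HEADER
    frames = math.ceil(data_bytes / data_per_frame)
    overhead = frames * (ETH_HEADER + IP_HEADER + UDP_HEADER + ETH_TRAILER)
    return frames, data_bytes + overhead


for mtu in (1500, 9000):
    frames, total = frame_cost(8200, mtu)
    print(f"MTU={mtu}: {frames} frame(s), {total} bytes")
```

With these inputs the sketch reports 6 frames and 8476 bytes for MTU=1500, and 1 frame and 8246 bytes for MTU=9000, in line with the tables. The byte saving is small; the main gain is sending one frame instead of six.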

Global Cache Services Jumbo Frames
Not all network adapter drivers support jumbo frames
  Particularly cheap ones
All network adapters in private interconnect must have same MTU size
Switch must also be configured to support jumbo frames
Lots of bugs and compatibility issues e.g.
  Bug 4447620: RAC UDP MTU size restricted to 1500 or 9000
  affects 10.1.0.5, 10.2.0.1
  fixed in 10.2.0.2 and above

Thank you for listening
Any questions?
info@juliandyke.com
