
Contents

Preface vii

1 Data Accesses 1

1.1 Physical Read . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 DB File Read Access Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 Oracle and UNIX References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.1.3 Test Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.1.4 Physical Read Stats in Oracle Views . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.1.5 Plsql Test Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.1.6 Dtrace Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.2 Logical Read - Consistent Get . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.2.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.2.2 Buffer Read Access Path Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.2.3 Test Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.2.4 latch: cache buffers chains Demo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.3 Logical Read - Current Get . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.3.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.3.2 Dtrace Output Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.3.3 Current Read Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

1.3.4 Sql Trace, Dtrace and Oracle Performance Views . . . . . . . . . . . . . . . . . . . 35

1.3.5 Dtrace Script Double Counted Statistics . . . . . . . . . . . . . . . . . . . . . . . . 36

1.3.6 dtracelio.d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


2 Redo and Undo 37

2.1 Undo Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.1.1 Undo Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.1.2 Undo Linked Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.1.3 Cleanout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2.1.4 Undo Complexity Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

2.2 Redo Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.2.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.2.2 Asynchronous Commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

2.2.3 Synchronous Commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.2.4 Piggybacked Commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

2.2.5 Distributed Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

2.2.6 Distributed Transaction Commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

2.2.7 Distributed Transaction with autonomous transaction . . . . . . . . . . . . . . . . 71

2.2.8 Distributed Transaction: distributed lock timeout . . . . . . . . . . . . . . . . 72

2.2.9 Redo/Undo Explosion from Thick Declared Table Insert . . . . . . . . . . . . . . . 73

2.3 Getting Oracle Transaction Commit SCN . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

2.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

2.3.2 Run Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

2.3.3 Comparing with Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

2.3.4 Commit SCN Exposed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

3 Locks, Latches and Mutexes 79

3.1 Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3.1.1 TM Contention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.1.2 Enqueue Trace Event 10704 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

3.1.3 Two Other TSDP Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.2 Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

3.2.1 latch: row cache objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86



3.2.2 CBC Latch Hash Collision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.2.3 Latch Pseudo Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.3 Mutexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3.3.1 Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

3.3.2 Mutex Contention and Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 102

3.3.3 Hot Library Cache Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4 Parsing and Compiling 109

4.1 Sql Parse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.1.1 Parse Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.1.2 Parse Identifying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.1.3 Cursor Details in Cursordump . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.2 Plsql Validation Self-Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

4.3 Sql library cache lock (cycle) Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4.3.1 Test Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.3.2 Library Cache Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.3.3 Single Session Cycle Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

4.3.4 Type Dropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5 Memory Usage and Allocation 121

5.1 SGA Memory Usage and Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.1.1 Subpool Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.1.2 KKSSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.1.3 db block hash buckets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.1.4 SQLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.1.5 KGLH0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.1.6 Free Memory and Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

5.1.7 Session Private Cursor Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5.1.8 Cursor Versions and Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.1.9 SGA Auto Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145



5.2 PGA Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.2.1 ORA-04030 incident file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.2.2 View of dbms session.get package memory utilization . . . . . . . . . . . . . . . . 147

5.2.3 dbms session.get package memory utilization limitations . . . . . . . . . . . . . . . 148

5.2.4 Populate Process Memory Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.2.5 PGA Memory Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

5.2.6 Plsql Collection Memory Usage and Performance . . . . . . . . . . . . . . . . . . . 152

5.3 Oracle LOB Memory Usage and Leak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

5.3.1 Temporary LOBs: cache lobs, nocache lobs, abstract lobs . . . . . . . . . . . . . . 153

5.3.2 LOB Memory Leak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6 CPU and Performance Modelling 159

6.1 Performance of Oracle Collection Operators . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.1.1 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.1.2 SET Operator Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.2 Row Cache Performance and CPU Modelling . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.2.1 Plsql Object Types Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.2.2 Plsql Dynamic Call and 10222 Trace . . . . . . . . . . . . . . . . . . . . . . . . . . 162

6.2.3 Test and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

6.2.4 M/D/1 Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

6.2.5 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

6.2.6 Model Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

6.2.7 Model Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

6.3 IBM AIX POWER CPU Usage and Throughput . . . . . . . . . . . . . . . . . . . . . . . 174

6.3.1 POWER7 and POWER8 Execution Units . . . . . . . . . . . . . . . . . . . . . . . 174

6.3.2 CPU Usage and Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

6.3.3 POWER PURR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

6.3.4 vpm throughput mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

6.3.5 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183


Preface

An expert is a person who has found out by his own painful experience all the mistakes that one can
make in a very narrow field.
Niels Bohr

The best way to learn is to do; the best time to learn is while solving problems. All the contents of this book are extracted from field experience in Oracle troubleshooting and performance tuning, and investigated through further studies and research with reproducible test cases.

The book consists of 6 chapters: data access, redo-undo, locking, parsing-compiling, memory and CPU. It covers the main parts of the Oracle core architecture. Each section focuses on one particular subject, from essential fundamentals to mathematical models. Each subject is presented as practice, the perspiration of a real problem, and studied in depth, the inspiration of a research domain.

The book collects various troubleshooting cases encountered in real-world Oracle applications. We try to reconstruct them with reproducible test code, and then to understand them through repeated experiments. All tests are done in Oracle versions 11g, 12c, or 18c.

It is our belief that every issue faced has to be understood via a reproducible test case, and every solution applied has to be justified against a reproducible test case, simply because facts can never be wrong. Troubleshooting is often a process of post-mortem analysis; without reconstructing a test case, it is hard to provide a well-proven solution. Only in this way can we gain a profound understanding of the system's internal mechanisms, and thereby become equipped to solve daunting tasks. For applications, this means lowering the chance of regressions and increasing productivity.

Trouble is produced by code and should also be shot by code. This book therefore lets code speak louder, and grants it generous space. By nature, code is the only instruction language computers listen to. Moreover, it is the best documentation, free of deformation.

The test code in the book can also serve as Oracle performance exercises. Alongside the extensive Oracle technical documentation and dozens of popular Oracle learning books, an exercise book is a complement for learning and applying Oracle. Once I had finished the first draft of this book, I realized that I myself was the first beneficiary of those big exercises, continuously learning from repeated testing.

My descriptions and understanding could be inaccurate or inadequate, but all the output comes from Oracle or the OS. Readers are encouraged to make their own tests. I am sure that discrepancies will be discovered, and the Oracle community will thereby be enriched.

Acknowledgements

First I want to thank the Oracle community and my colleagues for sharing valuable information. My first and last resort is always googling with the keyword "Oracle". Most importantly, I want to thank my family for all their support.

About the Author

He obtained his Docteur ès Sciences from the Swiss Federal Institute of Technology (EPFL) in 1994.
He has been using Oracle since version 7.3.2.

September 26, 2019

Chapter 1

Data Accesses

To use the data stored in a database, applications first have to have it at their disposal. That is the task of the data accesses provided by Oracle. Based on data locality, they are differentiated into physical reads and logical reads. Logical reads are further divided into consistent gets and current gets. In this first chapter, we discuss the fundamentals of data accesses.
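
As a quick orientation, the per-session counters behind this taxonomy can be inspected in v$mystat. Below a sketch; the three statistic names are standard in recent versions, and 'consistent gets' plus 'db block gets' add up to 'session logical reads'.

select sn.name, ms.value
from v$mystat ms, v$statname sn
where ms.statistic# = sn.statistic#
and sn.name in ('physical reads', 'consistent gets', 'db block gets');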

1.1 Physical Read

To access any data, Oracle first has to move it from disk (persistent mass storage) to memory (volatile main storage); that is physical read (disk read, db file read, cold read).

Oracle provides 3 basic approaches to db file read:

(a). db file sequential read

(b). db file scattered read

(c). db file parallel read

In this section, we will look into the different access paths and investigate their execution with tools like:

(1). Sql Trace

(2). Dtrace

(3). Oracle views: v$filestat and v$iostat_file

Note: All tests are done in Oracle 12.1.0.2 on Solaris.

In the following test, we create a table and one index on it; each row occupies about one DB block (db_block_size = 8192). The full test code is appended at the end of this section.

create table test_tab tablespace test_ts as
select level x, rpad(’ABC’, 3500, ’X’) y, rpad(’ABC’, 3500, ’X’) z from dual connect by level <= 1e4;

create index test_tab#i1 on test_tab(x) tablespace test_ts;

select round(bytes/1024/1024) mb, blocks from dba_segments where segment_name = ’TEST_TAB’;


--80 10240

1.1.1 DB File Read Access Path

We will run 4 variants of access path tests, and measure their performance with Sql Trace (event 10046) and Dtrace (see the appended Dtrace script).
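
Sql Trace can be toggled at session level with the syntax below (the same syntax used by the later tests; level 12 includes bind values and wait events):

alter session set events '10046 trace name context forever, level 12';
-- run the test here
alter session set events '10046 trace name context off';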

1.1.1.1 Test-1 Single Read

As the first test, we will select 333 adjacent rows by rowid. Here is the Sql Trace output:

SQL > exec db_file_read_test(’single’, 1, 333);

-- adjacent rowid, single block read, ’db file sequential read’


SELECT /*+ single_read */ Y FROM TEST_TAB T WHERE ROWID = :B1

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 333 0.01 0.01 0 0 0 0
Fetch 333 0.01 0.01 641 333 0 333
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 667 0.02 0.02 641 333 0 333

Row Source Operation


---------------------------------------------------
TABLE ACCESS BY USER ROWID TEST_TAB (cr=1 pr=8 pw=0 time=160 us cost=1 size=3513 card=1)

Event waited on Times Max. Wait Total Waited


---------------------------------------- Waited ---------- ------------
db file scattered read 44 0.00 0.00
db file sequential read 289 0.00 0.00

To read 333 rows, we perform 44 scattered reads and 289 sequential reads, 333 reads in total. However, 641 blocks are read into memory because each scattered read fetches multiple blocks.

Dtrace output reveals more details about lower OS layer calls:

PROBEFUNC FD RETURN_SIZE COUNT


lseek 260 0 44
readv 260 65536 44
pread 260 8192 289

PROBEFUNC FD MAX_READ_Blocks
pread 260 1
readv 260 8

TOTAL_SIZE = 5251072 , TOTAL_READ_Blocks = 641 , TOTAL_READ_CNT = 333

readv 260
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 43
32768 |@ 1
65536 | 0

pread 260
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 226
8192 |@@@@@@@@ 60
16384 | 2
32768 | 0
65536 | 1
131072 | 0

The 44 scattered reads are fulfilled by 44 readv calls on file descriptor FD 260 (each preceded by an lseek, to be discussed later), each of which fetches 65536 bytes (8 DB blocks).

The 289 sequential reads are done by 289 pread calls, each of which fetches 8192 bytes (1 DB block).

In total, we read 44*8 + 289*1 = 641 DB blocks in 333 OS read calls.

Now we look at the Dtrace quantize output (a frequency distribution diagram), in which the value column always increases by powers of two, in nanoseconds. Each line shows the count of elements greater than or equal to that value but less than the next line's value. It is similar to an Oracle wait event histogram (for instance, v$event_histogram).
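
The Oracle counterpart can be queried as follows, a sketch for the two wait events of interest; note that v$event_histogram buckets are in milliseconds rather than nanoseconds:

select event, wait_time_milli, wait_count
from v$event_histogram
where event in ('db file sequential read', 'db file scattered read')
order by event, wait_time_milli;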

The total elapsed time can be estimated by weighting each bucket with its midpoint, i.e. 1.5 times the bucket's lower bound (since the next bucket starts at twice that value):

readv: (16384*43 + 32768*1)*1.5 = 1105920


pread: (4096*226 + 8192*60 + 16384*2 + 65536*1)*1.5 = 2273280
total: 1105920 + 2273280 = 3379200

The total elapsed time of about 3 milliseconds (3379200 ns) in Dtrace is much less than the 20 milliseconds (0.02 seconds) in the xplan, since Dtrace only collects the time of OS IO activities; the other 17 ms could be consumed on the DB side. For example, in the above xplan, the Execute phase took 10 milliseconds (0.01 seconds), whereas the two wait events, db file scattered read and db file sequential read, show Total Waited equal to 0.00 (the minimum time unit in xplan is the centisecond, which seems inherited from Oracle's old hundredths-of-a-second counting).

We can also compare the elapsed time per block read for readv (8 blocks per read request) and pread (1 block per read request), and thereby evaluate the performance difference between single block read and multiblock read. The result below shows that readv is about 2.5 times (7866/3142) faster than pread per block read:

readv: (16384*43 + 32768*1)*1.5/8/44 = 3142


pread: (4096*226 + 8192*60 + 16384*2 + 65536*1)*1.5/289 = 7866

In the next 3 tests, we will follow the same pattern of discussion.

1.1.1.2 Test-2 Scattered Read

In the second test, we also select 333 rows by rowid. Instead of adjacent rows, we read every tenth row (see the appended test code). Here is the Sql Trace output:

SQL > exec db_file_read_test(’scattered’, 1, 333);

-- jumped rowid, scattered read, ’db file scattered read’


SELECT /*+ scattered_read */ Y FROM TEST_TAB T WHERE ROWID = :B1

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 333 0.00 0.00 0 0 0 0
Fetch 333 0.02 0.02 2664 333 0 333
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 667 0.02 0.02 2664 333 0 333

Row Source Operation


---------------------------------------------------
TABLE ACCESS BY USER ROWID TEST_TAB (cr=1 pr=8 pw=0 time=156 us cost=1 size=3513 card=1)

Elapsed times include waiting on following events:


Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file scattered read 333 0.00 0.00

Oracle chooses db file scattered read to fetch all 333 rows with 2664 disk block reads. But the xplan looks identical to the single read case, so the xplan alone is not able to reveal the difference.

But Dtrace output shows the difference:

------------------------------ dtrace ------------------------------


PROBEFUNC FD RETURN_SIZE COUNT
lseek 260 0 91
readv 260 65536 333

PROBEFUNC FD MAX_READ_Blocks
readv 260 8

TOTAL_SIZE = 21823488 , TOTAL_READ_Blocks = 2664 , TOTAL_READ_CNT = 333

readv 260
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 290
32768 |@@@@@ 43
65536 | 0

Each readv request returns 8 DB blocks; the 333 readv accumulate to exactly 333 x 8 = 2664 blocks. On the other hand, only 91 lseek calls probably indicate that most of the blocks are located next to each other.

1.1.1.3 Test-3 Parallel Read

In the next test, we read 333 rows by an index range scan. Sql Trace shows the third type of db file read: db file parallel read. In the output, we also include part of the raw trace file.

SQL > exec db_file_read_test(’parallel’, 1, 333);

SELECT /*+ index(t test_tab#i1) parallel_read */ MAX(Y) FROM TEST_TAB T WHERE X BETWEEN 1 AND :B1

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 0.00 0.00 344 335 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 0.00 0.00 344 335 0 1

Row Source Operation
---------------------------------------------------
SORT AGGREGATE (cr=335 pr=344 pw=0 time=3760 us)
FILTER (cr=335 pr=344 pw=0 time=1698 us)
TABLE ACCESS BY INDEX ROWID BATCHED TEST_TAB (cr=335 pr=344 pw=0 time=1361 us cost=168 size=1167165 card=333)
INDEX RANGE SCAN TEST_TAB#I1 (cr=2 pr=8 pw=0 time=279 us cost=1 size=0 card=333)(object id 2260477)

Event waited on Times Max. Wait Total Waited


---------------------------------------- Waited ---------- ------------
db file scattered read 4 0.00 0.00
db file parallel read 2 0.00 0.00

-- Raw Trace File --


’db file scattered read’ ela= 49 file#=917 block#=10368 blocks=8 obj#=2260477 (Index TEST_TAB#I1)
’db file scattered read’ ela= 27 file#=917 block#=128 blocks=8 obj#=2260476 (Table TEST_TAB)
’db file scattered read’ ela= 21 file#=917 block#=136 blocks=8 obj#=2260476
’db file parallel read’ ela= 422 files=1 blocks=127 requests=127 obj#=2260476
’db file parallel read’ ela= 334 files=1 blocks=127 requests=127 obj#=2260476
’db file scattered read’ ela= 264 file#=917 block#=409 blocks=66 obj#=2260476

Looking at the raw trace file: the first 3 lines are db file scattered read with blocks=8 (one of which reads index TEST_TAB#I1); then come 2 lines of db file parallel read, each with blocks=127 and requests=127; the last line is one db file scattered read with blocks=66. In total, we read 3*8 + 2*127 + 66 = 344 disk blocks in 258 read requests.

The Dtrace output shows more details of the OS calls:

PROBEFUNC FD RETURN_SIZE COUNT


pread 260 540672 1
lseek 260 0 2
readv 260 65536 3
pread 260 8192 254

PROBEFUNC FD MAX_READ_Blocks
readv 260 8
pread 260 66

TOTAL_SIZE = 2818048 , TOTAL_READ_Blocks = 344 , TOTAL_READ_CNT = 258

readv 260
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
32768 |@@@@@@@@@@@@@ 1
65536 | 0

pread 260
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 196
8192 |@@@@@@@@ 52
16384 | 2
32768 | 0
65536 |@ 4
131072 | 1
262144 | 0

Crosschecking the Sql raw trace with Dtrace, we can see:

3 blocks=8 db file scattered read are implemented by 3 readv with RETURN_SIZE=65536 each.
2 blocks=127 db file parallel read are satisfied by 254 pread with RETURN_SIZE=8192 each.
1 blocks=66 db file scattered read is done by 1 pread with RETURN_SIZE=540672 (=66*8192).

In total, we read 344 DB blocks in 258 (=3+254+1) OS read calls.

The last db file scattered read with blocks=66 also shows that one pread can read 66 blocks, much more than the db_file_multiblock_read_count=32 configured in this database. Since 66 is not divisible by 32, it is probably an OS disk read optimization (disk read merging) for Oracle "batched" reads, visible in the xplan as "table access by index rowid batched". Such a pread is triggered after low level OS optimization, which is probably why db_file_multiblock_read_count=32 has no effect there. "Batched" reads are controlled by the Oracle 12c hidden parameter _optimizer_batch_table_access_by_rowid (enable table access by ROWID IO batching), or by the 11g _nlj_batching_enabled (enable batching of the RHS IO in NLJ).

For example, ”Batched” can be disabled by:

SELECT /*+ index(t test_tab#i1) opt_param(’_optimizer_batch_table_access_by_rowid’, ’false’) parallel_read */ MAX(Y)
FROM TEST_TAB T WHERE X BETWEEN 1 AND :B1;
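
Besides the statement-level opt_param hint, the same behaviour can be toggled for the whole session; a sketch for test systems only, since it changes a hidden parameter:

alter session set "_optimizer_batch_table_access_by_rowid" = false;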

In the xplan, db file parallel read is shown with Times Waited equal to 2, but the real OS activity is 254 pread requests. We will discuss this later under AIO read.

By the way, we have 3 readv but only 2 lseek, so 2 of the readv probably share one lseek.

From the above Sql Trace and Dtrace output, we can see that pread can fulfill both db file parallel read and db file scattered read. Back in Test-1 Single Read, db file sequential read was also performed by pread, so we can say that pread is universal for all 3 types of db file read.

1.1.1.4 Test-4 Full Read

As the last test, we read 333 rows by a full table scan. Here is the Sql Trace output, including its raw trace lines:

SQL > exec db_file_read_test(’full’, 1, 333);

SELECT /*+ full_read */ MAX(Y) FROM TEST_TAB T WHERE ROWNUM <= :B1

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 0.00 0.00 342 342 3 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 0.00 0.00 342 342 3 1

Row Source Operation


---------------------------------------------------
SORT AGGREGATE (cr=342 pr=342 pw=0 time=3788 us)
COUNT STOPKEY (cr=342 pr=342 pw=0 time=1371 us)
TABLE ACCESS FULL TEST_TAB (cr=342 pr=342 pw=0 time=925 us cost=99 size=1169334 card=334)

Event waited on Times Max. Wait Total Waited


---------------------------------------- Waited ---------- ------------
db file sequential read 2 0.00 0.00
db file scattered read 23 0.00 0.00

-- Raw Trace File --


’db file sequential read’ ela= 19 file#=917 block#=130 blocks=1 obj#=2260476 tim=647299975335
’db file sequential read’ ela= 14 file#=3 block#=768 blocks=1 obj#=0 tim=647299975399
-- UNDO file#=3 /oratestdb/oradata/testdb/undo01.dbf
’db file scattered read’ ela= 22 file#=917 block#=131 blocks=5 obj#=2260476 tim=647299975501
’db file scattered read’ ela= 25 file#=917 block#=136 blocks=8 obj#=2260476 tim=647299975609
’db file scattered read’ ela= 25 file#=917 block#=145 blocks=7 obj#=2260476 tim=647299975713
’db file scattered read’ ela= 23 file#=917 block#=152 blocks=8 obj#=2260476 tim=647299975806
’db file scattered read’ ela= 25 file#=917 block#=161 blocks=7 obj#=2260476 tim=647299975901
’db file scattered read’ ela= 24 file#=917 block#=168 blocks=8 obj#=2260476 tim=647299975994
’db file scattered read’ ela= 23 file#=917 block#=177 blocks=7 obj#=2260476 tim=647299976088
’db file scattered read’ ela= 23 file#=917 block#=184 blocks=8 obj#=2260476 tim=647299976178
’db file scattered read’ ela= 23 file#=917 block#=193 blocks=7 obj#=2260476 tim=647299976270
’db file scattered read’ ela= 22 file#=917 block#=200 blocks=8 obj#=2260476 tim=647299976364
’db file scattered read’ ela= 23 file#=917 block#=209 blocks=7 obj#=2260476 tim=647299976465
’db file scattered read’ ela= 22 file#=917 block#=216 blocks=8 obj#=2260476 tim=647299976554
’db file scattered read’ ela= 22 file#=917 block#=225 blocks=7 obj#=2260476 tim=647299976646
’db file scattered read’ ela= 29 file#=917 block#=232 blocks=8 obj#=2260476 tim=647299976759
’db file scattered read’ ela= 24 file#=917 block#=241 blocks=7 obj#=2260476 tim=647299976866
’db file scattered read’ ela= 23 file#=917 block#=248 blocks=8 obj#=2260476 tim=647299976956
’db file scattered read’ ela= 128 file#=917 block#=258 blocks=32 obj#=2260476 tim=647299977200
’db file scattered read’ ela= 95 file#=917 block#=290 blocks=32 obj#=2260476 tim=647299977511
’db file scattered read’ ela= 97 file#=917 block#=322 blocks=32 obj#=2260476 tim=647299977822
’db file scattered read’ ela= 87 file#=917 block#=354 blocks=30 obj#=2260476 tim=647299978113
’db file scattered read’ ela= 96 file#=917 block#=386 blocks=32 obj#=2260476 tim=647299978407
’db file scattered read’ ela= 108 file#=917 block#=418 blocks=32 obj#=2260476 tim=647299978719
’db file scattered read’ ela= 94 file#=917 block#=450 blocks=32 obj#=2260476 tim=647299979021

We made 2 db file sequential read with blocks=1 and 23 db file scattered read with blocks varying from 5 to 32; summed up together, that is 342 disk block reads. In this case, the maximum blocks=32 is probably dictated by db_file_multiblock_read_count (see the Oracle documentation quoted later).
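
The current setting can be verified directly; a sketch (isdefault shows whether the parameter is explicitly set or self-tuned):

select value, isdefault from v$parameter where name = 'db_file_multiblock_read_count';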

Note that the second db file sequential read reads an undo data block (file#=3 block#=768 obj#=0). It is visible in the raw trace, and counted in the xplan statistics.

Look again at the Dtrace output:

PROBEFUNC FD RETURN_SIZE COUNT


pread 260 8192 1
pread 260 245760 1
readv 260 40960 1
pread 260 262144 6
readv 260 57344 7
lseek 260 0 8
readv 260 65536 8

PROBEFUNC FD MAX_READ_Blocks
readv 260 8
pread 260 32

TOTAL_SIZE = 2793472 , TOTAL_READ_Blocks = 341 , TOTAL_READ_CNT = 24

readv 260
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16
32768 | 0

pread 260
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@@@@@@@@@@@ 4
65536 |@@@@@@@@@@@@@@@ 3
131072 |@@@@@ 1
262144 | 0

Crosschecking the Sql raw trace with Dtrace, we can see:

1 blocks=1 db file sequential read is implemented by 1 pread with RETURN_SIZE=8192.
1 blocks=5 db file scattered read is implemented by 1 readv with RETURN_SIZE=40960.
7 blocks=7 db file scattered read are implemented by 7 readv with RETURN_SIZE=57344 each.
8 blocks=8 db file scattered read are implemented by 8 readv with RETURN_SIZE=65536 each.
1 blocks=30 db file scattered read is implemented by 1 pread with RETURN_SIZE=245760.
6 blocks=32 db file scattered read are implemented by 6 pread with RETURN_SIZE=262144 each.

In total, we read 341 DB blocks in 24 OS read calls.

Sql Trace showed 25 (=23+2) reads fetching 342 blocks, while Dtrace showed 24 reads fetching 341 blocks; the one extra read in Sql Trace is the undo read (file#=3). (Note: in the Dtrace script, we only trace FD=260, that is file#=917; undo file#=3 is not traced.)

If we display the segment extent allocations with the two queries below:

SQL > select segment_type, segment_subtype, header_block, blocks, extents, initial_extent, next_extent
from dba_segments v where segment_name = ’TEST_TAB’;

SEGMENT_TYPE SEGMENT_SU HEADER_BLOCK BLOCKS EXTENTS INITIAL_EXTENT NEXT_EXTENT


------------- ---------- ------------ ------ ------- -------------- -----------
TABLE ASSM 130 10240 81 65536 1048576

SQL > select blocks, count(*) cnt, min(extent_id), min(block_id)


from dba_extents where segment_name = ’TEST_TAB’
group by blocks order by min(extent_id);

BLOCKS CNT MIN(EXTENT_ID) MIN(BLOCK_ID)


------ --- -------------- -------------
8 16 0 128
128 63 16 256
1024 2 79 8320

and then crosscheck with the above raw trace file again, we can see that the first db file sequential read (block#=130) by pread reads the segment header block (HEADER_BLOCK: 130); the next 16 db file scattered read with blocks between 5 and 8 by readv read all 16 initial extents (8 blocks per extent); the remaining 7 db file scattered read with blocks between 30 and 32 by pread read the incremental extents (128 blocks per extent). The size of an incremental extent is 128 blocks, but each scattered read can read at most 32 blocks (db_file_multiblock_read_count=32).

1.1.2 Oracle and UNIX References

After the above 4 test variants, it is worth aligning our understanding with the Oracle and UNIX documentation.

1.1.2.1 Oracle References

1.1.2.1.1 db file sequential read (P1 = file#, P2 = block#, P3 = blocks)

file#: This is the file# of the file that Oracle is trying to read from. From Oracle8 onwards it is the
ABSOLUTE file number (AFN).

block#: This is the starting block number in the file from where Oracle starts reading the blocks.
Typically only one block is being read.

blocks: This parameter specifies the number of blocks that Oracle is trying to read from the file# starting at block#. This is usually 1, but if P3 > 1 then this is a multiblock read. Multiblock db file sequential read may be seen in earlier Oracle versions when reading from sort (temporary) segments.

1.1.2.1.2 db file scattered read (P1 = file#, P2 = block#, P3 = blocks)

file#: This is the file# of the file that Oracle is trying to read from. In Oracle8 onwards it is the
absolute file number (AFN).

block#: This is the starting block number in the file from where Oracle starts reading the blocks.

blocks: This parameter specifies the number of blocks that Oracle is trying to read from the file#
starting at block#.

The upper limit is DB_FILE_MULTIBLOCK_READ_COUNT, which is self-tuned from Oracle 10.2 onwards.

1.1.2.1.3 db file parallel read (P1 = files, P2 = blocks, P3 = requests)

files: This indicates the number of files from which the session is reading.

blocks: This indicates the total number of blocks to be read.

requests: This indicates the total number of I/O requests, which will be the same as blocks.

This happens during recovery. It can also happen during buffer prefetching, as an optimization (rather than performing multiple single-block reads). Also see: C.3.34 db file parallel read [21].

1.1.2.1.4 WAITEVENT: ”db file sequential read” Reference Note (Doc ID 34559.1)
This signifies a wait for an I/O read request to complete. This call differs from ”db file scattered read”
in that a sequential read reads data into contiguous memory (whilst a scattered read reads multiple blocks
and scatters them into different buffers in the SGA).

1.1.2.1.5 WAITEVENT: ”db file scattered read” Reference Note (Doc ID 34558.1)
This wait happens when a session is waiting for a multiblock IO to complete. This typically occurs during FULL TABLE SCANs or INDEX FAST FULL SCANs. Oracle reads up to DB_FILE_MULTIBLOCK_READ_COUNT consecutive blocks at a time and scatters them into buffers in the buffer cache.

1.1.2.2 UNIX References

Here are the OS subroutines used in the above tests, with their descriptions:

ssize_t read(int fildes, void *buf, size_t nbyte);


ssize_t pread(int fildes, void * buf, size_t nbyte, off_t offset);
ssize_t readv(int fildes, struct iovec * iov, int iovcnt);
off_t lseek(int fildes, off_t offset, int whence)

1.1.2.2.1 read()
attempts to read nbyte bytes from the file associated with the open file descriptor, fildes, into the buffer
pointed to by buf.

1.1.2.2.2 pread()
performs the same action as read(), except that it reads from a given position in the file without
changing the file pointer. The first three arguments to pread() are the same as read() with the addition
of a fourth argument offset for the desired position inside the file. pread() will read up to the maximum
offset value that can be represented in an off_t for regular files.

1.1.2.2.3 readv()
is equivalent to read(), but places the input data into the iovcnt buffers specified by the members of
the iov array: iov 0 , iov 1 , ..., iov [ iovcnt -1]. The iovcnt argument is valid if greater than
0 and less than or equal to IOV MAX. IOV MAX: 1024 On Linux, 16 On Solaris, 16 On AIX and HP-UX.

1.1.2.2.4 lseek()
sets the file pointer associated with the open file descriptor specified by fildes.

1.1.3 Test Discussions

After making the tests and reading the documentation, we can step further and study Oracle db file reads and the underlying OS subroutines.

1.1.3.1 OS Calls

Both pread and readv read contiguous file space. pread places the input data into one single contiguous buffer (memory space), whereas readv distributes it into multiple buffers.

It looks like the difference between pread and readv is a difference of memory allocation; their disk operations are the same.

pread specifies the file reading position by its fourth parameter, offset; whereas readv requires a preceding lseek (with parameter offset) to set the file reading position.

Contiguous file space means logically contiguous in a file, but not necessarily physically contiguous on disk.

1.1.3.2 Oracle Calls

pread can fulfill all 3 kinds of db file read:

db file sequential read (Test-1 Single Read)


db file scattered read (Test-3 Parallel Read, see next discussion)
db file parallel read (Test-3 Parallel Read, see next discussion)

whereas readv is for

db file scattered read (Test-2 Scattered Read)

1.1.3.3 Disk Read and Logical Read

In first three tests: Single Read, Scattered Read and Parallel Read, Sql Trace shows disk is bigger than
query (641 > 333, 2664 > 333, 344 > 335), there seems some wastage since more disk blocks are read
than consumed.

But the number of disk read requests, which is showed by dtrace TOTAL READ CNT, is no more than Sql
Trace query (333 vs. 333, 333 vs. 333, 258 vs. 335). From performance point of view, number of disk read
requests is one more determinant runtime factor than number of read blocks. This is visible in Dtrace
quantize (frequency distribution diagram) output, where the value field denotes elapsed nanoseconds.

For example, the previous Test-1 Single Read test showed that the elapsed time per block read for readv and pread is:

readv: (16384*43 + 32768*1)/8/44 = 2094


pread: (4096*226 + 8192*60 + 16384*2 + 65536*1)/289 = 5244

By the way, in v$bh, the unused blocks fetched by Scattered Read are marked as class#=14, and can be
listed by:

select * from v$bh, user_objects


where objd = data_object_id and object_name = ’TEST_TAB’
and class#=14;

1.1.3.4 Disk Asynch IO and DB File Parallel Read

db file parallel read specifies the number of files (first parameter) and the number of blocks to read (the second and third parameters are equal). It is similar to db file sequential read; the difference is that the former reads multiple blocks (probably asynchronously), while the latter reads one single block.

In fact, we can observe the 254 aio requests by repeating our previous Parallel Read test with a new
Dtrace script to track pread only:

SQL > exec db_file_read_test(’parallel’, 1, 333);

sudo dtrace -n ’
syscall::pread:entry / pid == $1 && arg0 == $2 / {self->pread_fd = arg0;}
syscall::pread:return/ pid == $1 && self->pread_fd == $2 /
{@STACK_CNT[probefunc, self->pread_fd, arg1, ustack(5, 0)]=count(); self->pread_fd = 0;}
’ 11352 260

pread 260 540672


libc.so.1‘_pread+0xa
oracle‘skgfqio+0x284
oracle‘ksfd_skgfqio+0x195
oracle‘ksfd_skgfrvio+0xcb4
oracle‘ksfd_vio+0x9a3
1

pread 260 8192


libc.so.1‘_pread+0xa
libaio.so.1‘_aio_do_request+0x18e
libc.so.1‘_thr_setup+0x5b
libc.so.1‘_lwp_start
254

The trace line _aio_do_request (fourth line from bottom) in the Dtrace ustack confirms the AIO calls of the 254 pread (last line). Going back to the previous Parallel Read test and looking at its Sql raw trace file and Dtrace output, we can see:

-. the first 3 db file scattered read with blocks=8 correspond to 3 readv (not shown here, since the above Dtrace script only probes pread), each reading 65536 bytes (8 blocks).

-. the next 2 db file parallel read with blocks=127 correspond to 254 pread, each reading 8192 bytes (1 block).

-. the last db file scattered read with blocks=66 corresponds to 1 pread, which reads 540672 bytes (66 blocks).

In other words, there are 3 readv, which read 3*65536 bytes = 24 blocks. These 3 readv can match neither the 254 (=127+127) blocks of db file parallel read nor the 66 blocks of db file scattered read. Therefore, the 254-block db file parallel read and the 66-block db file scattered read in this test are accomplished by pread.

db file parallel read has 2 plural parameters, P1 (files) and P3 (requests) (P2=P3), each of which denotes one dimension of the parallel operation. P1 (files) signifies multiple files being read in parallel; P3 (requests) stands for multiple parallel disk read requests. This is similar to log file parallel write and control file parallel write, in which the first parameter (files) represents the number of log files (in one redo group) and the number of control files respectively.

In the above example, since files=1, the db file parallel read implies multiple requests. The number of requests and blocks is visible in the raw trace file. The elapsed time is measured over the whole set of AIO requests, from the first request sent to the last response received. Because the requests are performed asynchronously, the event would probably be better named "db file async read".

If we set DISK_ASYNCH_IO=false (and restart the DB), no more AIO calls (_aio_do_request) are visible in the Dtrace ustack, as shown in the following output; but Sql Trace is not able to reveal this setting change, and still shows the same output as above.

pread 268 540672


libc.so.1‘_pread+0xa
oracle‘skgfqio+0x284
oracle‘ksfd_skgfqio+0x195
oracle‘ksfd_skgfrvio+0xcb4
oracle‘ksfd_vio+0x9a3
1
pread 268 8192
libc.so.1‘_pread+0xa
oracle‘skgfqio+0x284
oracle‘ksfd_skgfqio+0x203
oracle‘ksfdgo+0x188
oracle‘ksfd_sbio+0xdd1
254
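
The asynchronous IO related settings can be checked and changed as sketched below (disk_asynch_io is a static parameter, hence scope=spfile and a restart):

select name, value from v$parameter where name in ('disk_asynch_io', 'filesystemio_options');

alter system set disk_asynch_io = false scope=spfile;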

1.1.4 Physical Read Stats in Oracle Views

Sql Trace and Dtrace are low level tools which have to be triggered on purpose. In daily operations, only the Oracle dynamic performance views are available. We repeat the above 4 tests and compare the stats views with the previous test results, so that we can see whether they provide information as reliable as Sql Trace and as precise as Dtrace for daily usage.

1.1.4.1 v$filestat vs. v$iostat file Views

Oracle provides two views to record file disk I/O statistics:

(1). v$filestat and its cumulative dba_hist_filestatxs, in centiseconds, since Oracle 8 (or 7).
(2). v$iostat_file and dba_hist_iostat_*, in milliseconds, since Oracle 11.

v$iostat_file looks like an improved version of v$filestat with higher precision (milliseconds vs. centiseconds). In an AWR report, both v$filestat and v$iostat_file data appear in different places. In fact, if we collect one AWR report and trace it with Sql Trace 10046, we can find many occurrences of singleblkrds from dba_hist_filestatxs (centiseconds converted to milliseconds in AWR), and small_read_reqs from dba_hist_iostat_filetype. If we try to match the names between the views and AWR: in the AWR sections "Tablespace IO Stats" and "File IO Stats", the columns prefixed by "1-bk Rd" are probably from v$filestat; in the section "IOStat by Filetype summary", the last 2 columns are named "Small Read" and "Large Read", which are probably from v$iostat_file. Therefore, potential stats inconsistency can appear even within the same AWR report. (See the dba_hist_filestatxs query in Blog [41].)

Both views have their derivatives; for example, v$file_histogram seems derived from v$filestat, because both exist before Oracle 11 and use "single" as a column prefix. For temp files, the counterpart of v$filestat is v$tempstat, and in v$iostat_file, filetype_name is marked as 'Temp File'.
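
The filetype granularity of v$iostat_file can be listed with a query like this sketch:

select filetype_name, sum(small_read_reqs) small_reqs, sum(large_read_reqs) large_reqs
from v$iostat_file
group by filetype_name
order by filetype_name;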

1.1.4.2 Stats Views Test

Run the code block below (see the appended test code):

alter session set timed_statistics = true;


alter session set statistics_level=all;

truncate table read_stats;


exec db_file_read_test(’single’, 1, 333);
exec db_file_read_test(’scattered’, 1, 333);
exec db_file_read_test(’parallel’, 1, 333);
exec db_file_read_test(’full’, 1, 333);

and then collect statistics from both views (AIX and Linux results are added for comparison; only the first 7 stats fields are shown in Table 1.1 due to page limits; for the full output and discussion, see Blog [52]):

select test_name, --ts,


phyrds, phyblkrd, singleblkrds, singleblkrdtim,
ceil(small_read_megabytes*1024*1024/8192) small_read_blks, -- converted to Block for comparison
small_read_reqs, small_read_servicetime, small_sync_read_reqs, small_sync_read_latency,
ceil(large_read_megabytes*1024*1024/8192) large_read_blks, -- converted to Block for comparison
large_read_reqs, large_read_servicetime
from read_stats_delta_v where phyrds > 0 order by test_name desc, ts;

Legend:

phyrds: Number of physical reads done

phyblkrd: Number of physical blocks read

singleblkrds: Number of single block reads

OS Test name phyrds phyblkrd singleblkrds small_read_blks small_read_reqs large_read_blks large_read_reqs
Solaris Single 333 641 289 640 333 0 0
Solaris Scattered 333 2664 0 2560 333 0 0
Solaris Parallel 258 344 254 384 257 0 1
Solaris Full 24 341 1 0 17 256 7
AIX Single 333 333 333 384 333 0 0
AIX Scattered 333 333 333 384 333 0 0
AIX Parallel 260 335 259 256 259 0 1
AIX Full 24 341 1 128 17 256 7
Linux Single 333 641 289 640 333 0 0
Linux Scattered 333 2664 0 2688 333 0 0
Linux Parallel 258 344 254 256 257 0 1
Linux Full 24 341 1 128 17 256 7

Table 1.1: Physical Read Statistics

singleblkrdtim: Cumulative single block read time (in hundredths of a second)

small_read_blks: Number of small blocks read (derived from small_read_megabytes)

small_read_reqs: Number of small block read requests

large_read_blks: Number of large blocks read (derived from large_read_megabytes)

large_read_reqs: Number of large block read requests

1.1.4.3 DB File Read Stats

First, we recap all the previous Sql Trace and Dtrace test results in Table 1.2 so that we can compare them with the Oracle dynamic performance views. The first 3 stats columns are from Sql Trace, the fourth from Dtrace, and the last is common to both Sql Trace and Dtrace.

OS test name sequential read scattered read parallel read TOTAL_READ_CNT TOTAL_READ_Blocks
Solaris Single 289 44 333 641
Solaris Scattered 333 333 2664
Solaris Parallel 4 254 258 344
Solaris Full 1 23 24 341

Table 1.2: Sql Trace and Dtrace Statistics

The first 3 stats columns in Table 1.1 are from v$filestat, prefixed with "single" in the view (but renamed "1-bk Rds" in AWR). The last 4 columns are from v$iostat_file, preceded by "small" or "large". Here are some observations:

1.1.4.3.1 Read Requests

v$filestat.phyrds matches the Dtrace TOTAL_READ_CNT, and equals v$iostat_file.(small_read_reqs + large_read_reqs).

v$filestat.phyblkrd matches the Dtrace TOTAL_READ_Blocks, and is only approximated by v$iostat_file.(small_read_blks + large_read_blks); hence v$iostat_file is not exact.

v$filestat.singleblkrds matches the number of blocks read by "db file sequential read" + "db file parallel read". For parallel read, the Oracle view records every block as a single block read and shows 254 in the above Parallel test, whereas the Sql Trace xplan shows only 2 waits.

1.1.4.3.2 Read Blocks per Read Request

Only the Scattered test shows a much higher small_read_blks in Table 1.1: 2560 blocks by 333 small_read_reqs, approximately 8 (2560/333) blocks per request. In Sql Trace, these reads are marked as db file scattered read and accomplished by readv.
1.1.4.3.3 Timed Statistics

db file parallel read in the Parallel test seems to run asynchronously, and it is hard to collect precise timed statistics for each single request.

1.1.4.3.4 Cold Read

In all 4 tests, the xplan stats showed that "disk" is bigger than or equal to "query". This is due to disk cold read, since we flush the buffer cache before each test. In normal operations, the buffer cache is already warmed up, but cold reads can become a performance problem when the buffer cache is undersized. For example, if the KEEP pool is configured much smaller than needed, heavy physical reads can be observed when accessing tables/indexes in that pool.
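
The configured buffer pools and the segments assigned to the KEEP pool can be checked with the sketch below before judging whether the pool is adequately sized:

select name, block_size, current_size from v$buffer_pool;

select owner, segment_name from dba_segments where buffer_pool = 'KEEP';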

1.1.5 Plsql Test Code

Note: small fonts used to make code fit into page.

drop tablespace test_ts including contents and datafiles;


create tablespace test_ts datafile ’/oratestdb/oradata/testdb/test_ts.dbf’ size 200m online;

drop table test_tab;

-- DB_BLOCK_SIZE = 8192, each row occupies one BLOCK


create table test_tab tablespace test_ts as
select level x, rpad(’ABC’, 3500, ’X’) y, rpad(’ABC’, 3500, ’X’) z from dual connect by level <= 1e4;

select round(bytes/1024/1024) mb, blocks from dba_segments where segment_name = ’TEST_TAB’;


--80 10240

create index test_tab#i1 on test_tab(x) tablespace test_ts;

exec dbms_stats.gather_table_stats(null, ’TEST_TAB’, cascade=>true);

drop tablespace test_ts_aux including contents and datafiles;


create tablespace test_ts_aux datafile ’/oras5d00003/oradata/s5d00003/test_ts_aux.dbf’ size 200m online;

drop table test_rid_tab;


create table test_rid_tab tablespace test_ts_aux as select x, rowid rid from test_tab;
create index test_rid_tab#i1 on test_rid_tab(x) tablespace test_ts_aux;

exec dbms_stats.gather_table_stats(null, ’TEST_RID_TAB’, cascade=>true);

create or replace view read_stats_v as


select to_char(localtimestamp, ’yyyy-mm-dd hh24:mi:ss’) ts
,phyrds, phyblkrd, singleblkrds, 10*singleblkrdtim singleblkrdtim
,small_read_megabytes, small_read_reqs, small_read_servicetime, small_sync_read_reqs, small_sync_read_latency
,large_read_megabytes, large_read_reqs, large_read_servicetime
,f.name
from v$filestat v8, v$iostat_file v11, v$datafile f

where v8.file# = f.file#
and v11.file_no = f.file#
and f.name like ’%test_ts.dbf’;

drop table read_stats;

create table read_stats as select ’setall_seq_readxx’ test_name, v.* from read_stats_v v where 1=2;

create or replace view read_stats_delta_v as


select test_name, ts
,phyrds - lag(phyrds) over(partition by test_name order by ts) phyrds
,phyblkrd - lag(phyblkrd) over(partition by test_name order by ts) phyblkrd
,singleblkrds - lag(singleblkrds) over(partition by test_name order by ts) singleblkrds
,singleblkrdtim - lag(singleblkrdtim) over(partition by test_name order by ts) singleblkrdtim
,small_read_megabytes - lag(small_read_megabytes) over(partition by test_name order by ts) small_read_megabytes
,small_read_reqs - lag(small_read_reqs) over(partition by test_name order by ts) small_read_reqs
,small_read_servicetime - lag(small_read_servicetime) over(partition by test_name order by ts) small_read_servicetime
,small_sync_read_reqs - lag(small_sync_read_reqs) over(partition by test_name order by ts) small_sync_read_reqs
,small_sync_read_latency - lag(small_sync_read_latency) over(partition by test_name order by ts) small_sync_read_latency
,large_read_megabytes - lag(large_read_megabytes) over(partition by test_name order by ts) large_read_megabytes
,large_read_reqs - lag(large_read_reqs) over(partition by test_name order by ts) large_read_reqs
,large_read_servicetime - lag(large_read_servicetime) over(partition by test_name order by ts) large_read_servicetime
from read_stats s;

create or replace procedure db_file_read_test (p_test_name varchar2, p_loops number, p_rows number) as
l_max_y varchar2(3500);
type tab_rowid is table of rowid index by pls_integer;
l_rowid_cache tab_rowid;
begin
case
when p_test_name = ’single’ then
select rowid bulk collect into l_rowid_cache from test_tab where x between 1 and p_rows;
when p_test_name = ’scattered’ then
select rowid bulk collect into l_rowid_cache from test_tab where mod(x, 10) = 0 and rownum <= p_rows;
else null;
end case;
dbms_output.put_line(’Number of Rows to read = ’|| l_rowid_cache.count);

insert into read_stats select p_test_name test_name, v.* from read_stats_v v;


for i in 1..p_loops loop
execute immediate ’alter system flush buffer_cache’;
dbms_lock.sleep(1);
case
when p_test_name = ’single’ then
for r in 1..l_rowid_cache.count loop
-- adjacent rowid, single block read, ’db file sequential read’
select /*+ single_read */ y into l_max_y from test_tab t where rowid = l_rowid_cache(r);
end loop;
when p_test_name = ’scattered’ then
for r in 1..l_rowid_cache.count loop
-- jumped rowid, scattered read, ’db file scattered read’
select /*+ scattered_read */ y into l_max_y from test_tab t where rowid = l_rowid_cache(r);
end loop;
when p_test_name = ’parallel’ then
-- table access by index rowid batched ’db file parallel read’
select /*+ index(t test_tab#i1) parallel_read */ max(y) into l_max_y from test_tab t where x between 1 and p_rows;
when p_test_name = ’full’ then
-- table access by FULL
select /*+ full_read */ max(y) into l_max_y from test_tab t where rownum <= p_rows;
end case;
dbms_lock.sleep(1);
insert into read_stats select p_test_name test_name, v.* from read_stats_v v;
end loop;
commit;
end;
/

1.1.6 Dtrace Script

pfiles 11352
260: /oratestdb/oradata/testdb/test_ts.dbf

--Dtrace Script

#!/usr/sbin/dtrace -Zs

/*
* read_dtrace.d pid fd
* chmod u+x read_dtrace.d
* sudo ./read_dtrace.d 11352 260
*/

BEGIN / $1 > 0 && $2 > 0 /


{TOTAL_SIZE = 0; TOTAL_READ_CNT = 0; }
syscall::pread:entry / pid == $1 && arg0 == $2 /
{self->pread_fd = arg0; self->pread_t = timestamp;}
syscall::pread:return/ pid == $1 && self->pread_fd == $2 /
{@CNT[probefunc, self->pread_fd, arg1] = count();
@MAXB[probefunc, self->pread_fd] = max(arg1/8192);
@ETIME[probefunc, self->pread_fd] = quantize(timestamp- self->pread_t);
TOTAL_SIZE = TOTAL_SIZE + arg1; TOTAL_READ_CNT = TOTAL_READ_CNT + 1;
self->pread_fd = 0;}
syscall::readv:entry / pid == $1 && arg0 == $2 /
{self->readv_fd = arg0; self->readv_t = timestamp; }
syscall::readv:return/ pid == $1 && self->readv_fd == $2 /
{@CNT[probefunc, self->readv_fd, arg1] = count();
@MAXB[probefunc, self->readv_fd] = max(arg1/8192);
@ETIME[probefunc, self->readv_fd] = quantize(timestamp- self->readv_t);
TOTAL_SIZE = TOTAL_SIZE + arg1; TOTAL_READ_CNT = TOTAL_READ_CNT + 1;
self->readv_fd = 0;}
syscall::kaio:entry / pid == $1 && arg1 == $2 /
{self->kaio = arg1;}
syscall::kaio:return / pid == $1 && self->kaio == $2 /
{@CNT[probefunc, self->kaio, arg1] = count(); self->kaio = 0;}
syscall::lseek:entry / pid == $1 && arg0 == $2/
{@CNT[probefunc, arg0, 0] = count(); }
END / $1 > 0 && $2 > 0 /
{printf("\n%11s %6s %12s %9s \n", "PROBEFUNC", "FD", "RETURN_SIZE", "COUNT");
printa(" %-10s %6d %12d %9@d\n", @CNT);
printf("\n%11s %6s %16s \n", "PROBEFUNC", "FD", "MAX_READ_Blocks");
printa(" %-10s %6d %16@d\n", @MAXB);
printf("\nTOTAL_SIZE = %-10d, TOTAL_READ_Blocks = %-6d, TOTAL_READ_CNT = %-6d\n",
TOTAL_SIZE, TOTAL_SIZE/8192, TOTAL_READ_CNT);
printa("\n%-10s %6d %16@d\n", @ETIME);}

1.2 Logical Read - Consistent Get

Once data has been moved from disk to memory by physical reads, it is accessed by logical reads. An Oracle logical read (buffer get, memory read, warm read) fetches data from the buffer cache (memory) in two different modes: consistent mode get (consistent get) and current mode get (db block get). Blocks in consistent mode are the memory versions as of the point in time the query started, whereas blocks in current mode are the versions of right now. Each block can have multiple version clones in consistent mode, but at most one single version in current mode.
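
The coexistence of one current version and several consistent versions of the same block can be observed in v$bh; a sketch to run while another session is updating and querying TEST_TAB (status 'xcur' marks the single current buffer, 'cr' its consistent read clones):

select file#, block#, status, count(*) cnt
from v$bh
where objd = (select data_object_id from dba_objects where object_name = 'TEST_TAB')
group by file#, block#, status
order by file#, block#;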

This section discusses consistent gets; the next section will talk about db block gets. First, we test consistent gets along 4 different access paths and measure the block gets in terms of Oracle event 10200 (consistent read buffer status), then demonstrate 'latch: cache buffers chains' in row-by-row slow processing.

Note: All tests are done in Oracle 12.1.0.2 on AIX, Solaris, Linux with 6 physical processors.

1.2.1 Test Setup

First we create a test table of 100 rows, with 5 rows per block, 20 blocks in total. We use the table option minimize records_per_block to control the number of rows in each block. One optimal value should be no more than _db_block_max_cr_dba (maximum allowed number of CR buffers per dba), which is 6 by default (pctfree is not able to control the exact number of rows). The 6 buffers can be 5 consistent read (CR) buffers and one current buffer. Each CR buffer is an earlier version of the current block (the Oracle statistic "switch current to new buffer" counts the number of times the CURRENT block moved to a different buffer, leaving a CR block in the original buffer). By the way, this technique is sometimes intentionally adopted to reduce buffer busy waits on hot blocks caused by multiple concurrent sessions, so that there exist at most 6 sessions accessing the same block simultaneously.
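
As SYS, the default of this hidden parameter can be verified against the x$ fixed tables; a sketch, since x$ structures are undocumented and version dependent:

select i.ksppinm name, v.ksppstvl value
from x$ksppi i, x$ksppcv v
where i.indx = v.indx
and i.ksppinm = '_db_block_max_cr_dba';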

drop table test_tab;


create table test_tab
INITRANS 26 -- prevent Segments ITL Waits and Deadlocks
as select level id, rpad(’ABC’, 10, ’X’) val from dual connect by level <= 5;

alter table test_tab minimize records_per_block;

truncate table test_tab;


insert into test_tab select level id, rpad(’ABC’, 10, ’X’) val from dual connect by level <= 100;
commit;

drop index test_tab#u1;


create unique index test_tab#u1 on k.test_tab (id)
pctfree 90
initrans 26 -- prevent Segments ITL Waits and Deadlocks
;

exec dbms_stats.gather_table_stats(null, ’TEST_TAB’, cascade => TRUE);

The allocation of 5 rows per data block can be verified by the table query below:

select block, to_char(block, ’xxxxxxxx’) block_hex, count (*) cnt


from (select rowid rid,
dbms_rowid.rowid_object (rowid) object,
dbms_rowid.rowid_relative_fno (rowid) fil,
dbms_rowid.rowid_block_number (rowid) block,
dbms_rowid.rowid_row_number (rowid) ro,
t.*
from test_tab t
order by fil, block, ro)
group by block
order by block, cnt desc;

BLOCK BLOCK_HEX CNT


-------- --------- ------
3070331 2ed97b 5
3070332 2ed97c 5
3070333 2ed97d 5
3070334 2ed97e 5
3070335 2ed97f 5
3072640 2ee280 5
3072641 2ee281 5
3072642 2ee282 5
3072643 2ee283 5
3072644 2ee284 5
3072645 2ee285 5
3072646 2ee286 5
3072647 2ee287 5
3072649 2ee289 5
3072650 2ee28a 5
3072651 2ee28b 5
3072652 2ee28c 5
3072653 2ee28d 5
3072654 2ee28e 5
3072655 2ee28f 5
20 rows selected.

The 5 rows per index leaf block can also be displayed by the index query below:

select object_name, object_id from dba_objects where object_name = 'TEST_TAB#U1';

OBJECT_NAME OBJECT_ID

----------- ----------
TEST_TAB#U1 2260907

select block#, to_char(block#, 'xxxxxxxx') block#_hex


,count(*) rows_per_block
from (
select dbms_rowid.rowid_block_number(sys_op_lbid (2260907, 'L', t.rowid)) block#
,t.*
from test_tab t
where id is not null
)
group by block#
order by block#;

BLOCK# BLOCK#_HEX ROWS_PER_BLOCK


-------- ---------- --------------
3072660 2ee294 5
3072661 2ee295 5
3072662 2ee296 5
3072663 2ee297 5
3072664 2ee298 5
3072665 2ee299 5
3072666 2ee29a 5
3072667 2ee29b 5
3072668 2ee29c 5
3072669 2ee29d 5
3072670 2ee29e 5
3072671 2ee29f 5
3072673 2ee2a1 5
3072674 2ee2a2 5
3072675 2ee2a3 5
3072676 2ee2a4 5
3072677 2ee2a5 5
3072678 2ee2a6 5
3072679 2ee2a7 5
3072680 2ee2a8 5
20 rows selected

1.2.2 Buffer Read Access Path Tests

We will use Trace Event 10200 to track all consistent reads as well as their sequence, and then compare
the number of consistent block reads with different access paths.

At first, list table and index meta information for later reference:

column object_name format a12


column object_id_hex format a20
column data_object_id_hex format a20

select object_name, object_id, data_object_id,


to_char(object_id, 'xxxxxxxx') object_id_hex, to_char(data_object_id, 'xxxxxxxx') data_object_id_hex
from dba_objects where object_name in ('TEST_TAB', 'TEST_TAB#U1');

OBJECT_NAME OBJECT_ID DATA_OBJECT_ID OBJECT_ID_HEX DATA_OBJECT_ID_HEX


----------- ---------- --------------- -------------- -------------------
TEST_TAB 2260905 2260906 227fa9 227faa
TEST_TAB#U1 2260907 2260907 227fab 227fab

1.2.2.1 Test-1 User Rowid Without Using Index

To make use of user rowid, we need to collect the ROWIDs of the first 7 adjacent rows:

select /*+ user_rowid_no_index */ t.*, rowid rid from test_tab t


where id in (1, 2, 3, 4, 5, 6, 7) order by t.id;

ID VAL RID
--- ---------- ------------------
1 ABCXXXXXXX AAIn+qAAAAALtl7AAA
2 ABCXXXXXXX AAIn+qAAAAALtl7AAB
3 ABCXXXXXXX AAIn+qAAAAALtl7AAC
4 ABCXXXXXXX AAIn+qAAAAALtl7AAD
5 ABCXXXXXXX AAIn+qAAAAALtl7AAE
6 ABCXXXXXXX AAIn+qAAAAALtl8AAA
7 ABCXXXXXXX AAIn+qAAAAALtl8AAB

then turn on 10046 and 10200 traces, run a query to fetch rows by the collected 7 ROWIDs:

alter session set tracefile_identifier = 'trace_10200_10046_test_1';

alter session set events '10046 trace name context forever, level 12';
-- alter session set events '10051 trace name context forever, level 1';
alter session set events '10200 trace name context forever, level 10';

select /*+ user_rowid_no_index */ * from test_tab t


where id in (1, 2, 3, 4, 5, 6, 7)
and rowid in
('AAIn+qAAAAALtl7AAA'
,'AAIn+qAAAAALtl7AAB'
,'AAIn+qAAAAALtl7AAC'
,'AAIn+qAAAAALtl7AAD'
,'AAIn+qAAAAALtl7AAE'
,'AAIn+qAAAAALtl8AAA'
,'AAIn+qAAAAALtl8AAB');

alter session set events '10200 trace name context off';


-- alter session set events '10051 trace name context off';
alter session set events '10046 trace name context off';

Since all 7 rows are located in 2 DB blocks, the following 10200 trace reveals 3 gets of 2 different
table blocks in consistent mode. To facilitate cross-reference, we add block read stats to the xplan line
Row Source Operation, and Table/Index mappings to the 10200 trace with the prefix <--.

------------------ 10046 Trace ------------------

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 0.00 0 3 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.00 0.00 0 3 0 7

Rows Row Source Operation (0 Branch Block, 0 Leaf Blocks, 3 Table Blocks)
---- ---------------------------------------------------
7 INLIST ITERATOR (cr=3 pr=0 pw=0 time=130 us)
7 TABLE ACCESS BY USER ROWID TEST_TAB (cr=3 pr=0 pw=0 time=261 us cost=1 size=14 card=1)

------------------ 10200 Trace (irrelevant lines removed) ------------------

ktrget2(): started for block


<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ed97c> objd: 0x00227faa <-- Table Block 2

Legend:
0x07cf: tablespace number
0x002ed97b: data block number
0x00227faa: data object number

We have 3 table block gets since Oracle detected that all 7 given ROWIDs are located in 2 blocks (0x002ed97b,
0x002ed97c), and the first block is fetched twice. Both blocks are from table objd: 0x00227faa (data object id).

Now comes the question why the first data block 0x002ed97b is read twice. If we compare the two
ktrget2 calls on Block 1 (0x002ed97b) in the 10200 raw trace below, the only difference is that the first is
marked with flg: 0x00000661, but the second with flg: 0x00000660 (the third one, on Table Block 2, is also
flg: 0x00000660):

ktrget2(): started for block <0x07cf : 0x002ed97b> objd: 0x00227fa ... flg: 0x00000661
ktrget2(): started for block <0x07cf : 0x002ed97b> objd: 0x00227fa ... flg: 0x00000660

Now we run the above query again, and trace it with a one-line Dtrace script as follows:

$ > dtrace -n 'pid$target::kcbgtcr:entry {ustack(12, 0);}' -p 830


CPU ID FUNCTION:NAME
0 80731 kcbgtcr:entry
oracle`kcbgtcr
oracle`ktrget2+0x292
oracle`kdsgrp+0x15b6
a.out`qetlbr+0x11c
oracle`qertbFetchByRowID+0x47c
a.out`qerstFetch+0x4ca
a.out`qerilFetch+0x25c
a.out`qerstFetch+0x4ca
a.out`opifch2+0x188b
a.out`kpoal8+0x132b -- V8 Bundled Exec call
oracle`opiodr+0x433
oracle`ttcpip+0x593

0 80731 kcbgtcr:entry
oracle`kcbgtcr
oracle`ktrget2+0x292
oracle`kdsgrp+0x15b6
a.out`qetlbr+0x11c
oracle`qertbFetchByRowID+0x47c
a.out`qerstFetch+0x4ca
a.out`qerilFetch+0x25c
a.out`qerstFetch+0x4ca
a.out`opifch2+0x188b
a.out`opifch+0x36
oracle`opiodr+0x433
oracle`ttcpip+0x593

0 80731 kcbgtcr:entry
oracle`kcbgtcr
oracle`ktrget2+0x292
oracle`kdsgrp+0x15b6
a.out`qetlbr+0x11c
oracle`qertbFetchByRowID+0x47c
a.out`qerstFetch+0x4ca
a.out`qerilFetch+0x25c
a.out`qerstFetch+0x4ca
a.out`opifch2+0x188b
a.out`opifch+0x36
oracle`opiodr+0x433
oracle`ttcpip+0x593

The only difference between the first ustack and the other two is that the first one calls kpoal8, whereas
the other two call opifch.

kpoal8 stands for Oracle V8 bundled execution (see Blog [30]). It is the server function which processes
all the OPI requests sent to the server by the client in one bundled SQL*Net payload, to reduce the number
of network roundtrips.

In fact, if we turn on Trace event 10051: trace OPI calls:

alter session set events '10051 trace name context forever, level 1';

The output shows that the first ktrget2 is preceded by OPI CALL: type=94 (name=V8 Bundled Exec),
whereas the other two appear after OPI CALL: type= 5 (name=FETCH).

OPI CALL: type=94 argc=31 cursor= 0 name=V8 Bundled Exec


=====================
PARSING IN CURSOR #18446604434610992800 sqlid='f4q8t82byvspq'
select /*+ user_rowid_no_index */ * from test_tab t
where id in (1, 2, 3, 4, 5, 6, 7)
and rowid in
('AAJJPwAAAAAKCuDAAA'
,'AAJJPwAAAAAKCuDAAB'
,'AAJJPwAAAAAKCuDAAC'
,'AAJJPwAAAAAKCuDAAD'
,'AAJJPwAAAAAKCuDAAE'
,'AAJJPwAAAAAKCuEAAA'
,'AAJJPwAAAAAKCuEAAB')
END OF STMT

ktrget2(): started for block <0x07cf : 0x00282b83> objd: 0x002493f0


... flg: 0x00000661)

OPI CALL: type= 5 argc= 2 cursor= 3 name=FETCH


=====================
ktrget2(): started for block <0x07cf : 0x00282b83> objd: 0x002493f0
... flg: 0x00000660)

ktrget2(): started for block <0x07cf : 0x00282b84> objd: 0x002493f0


... flg: 0x00000660)

Now one can challenge this architecture decision. V8 Bundled Exec is designed to reduce the number of
network roundtrips, but it introduces an additional kcbgtcr call, which is also added to the number
of block gets. Presumably that is a trade-off between network roundtrips and in-memory kcbgtcr calls.

1.2.2.2 Test-2 Index Range Scan With Index Rowid Batched

Next we use an index range scan to select the same 7 rows. We still make 3 consistent gets of 2 different
table blocks in consistent mode. But additionally we fetch 1 branch block and make 3 leaf block gets (2 of
them on the same leaf block). In total, that makes 7 consistent gets.

select /*+ index_range_scan index(t test_tab#u1) */ * from test_tab t where id between 1 and 7;

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 0.00 0 7 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.00 0.00 0 7 0 7

Rows Row Source Operation (1 Branch Block, 3 Leaf Blocks, 3 Table Blocks)
---- ---------------------------------------------------
7 TABLE ACCESS BY INDEX ROWID BATCHED TEST_TAB (cr=7 pr=0 pw=0 time=210 us cost=3 size=98 card=7)
7 INDEX RANGE SCAN TEST_TAB#U1 (cr=4 pr=0 pw=0 time=552 us cost=2 size=0 card=7)(object id 2260907)

10200 Trace shows the reading details on index branch/leaf (objd: 0x00227fab) and table (objd:
0x00227faa):

ktrgtc2(): started for block


<0x07cf : 0x002ee293> objd: 0x00227fab <-- Branch Block
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee295> objd: 0x00227fab <-- Leaf Block 2
<0x07cf : 0x002ed97c> objd: 0x00227faa <-- Table Block 2

The fetch sequence shows that we first visit the branch block to drill down to the leaf blocks, from which
Oracle picks a set of qualifying adjacent ROWIDs to get the rows from the table blocks.

1.2.2.3 Test-3 Index Unique Scan

Now we select the same 7 rows with index unique scan, looped by inlist iterator.

select /*+ index_unique_scan */ * from test_tab t where id in (1, 2, 3, 4, 5, 6, 7);

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 0.00 0 13 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.00 0.00 0 13 0 7

Rows Row Source Operation (2 Branch Block, 8 Leaf Blocks, 3 Table Blocks)
---- ---------------------------------------------------
7 INLIST ITERATOR (cr=13 pr=0 pw=0 time=218 us)
7 TABLE ACCESS BY INDEX ROWID TEST_TAB (cr=13 pr=0 pw=0 time=812 us cost=3 size=98 card=7)
7 INDEX UNIQUE SCAN TEST_TAB#U1 (cr=10 pr=0 pw=0 time=585 us cost=2 size=0 card=7)(object id 2260907)

It needs 2 branch block gets, 8 leaf block gets and 3 table block gets, in total 13 consistent gets. The 10200
trace shows each read:

ktrgtc2(): started for block


<0x07cf : 0x002ee293> objd: 0x00227fab <-- Branch Block
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee293> objd: 0x00227fab <-- Branch Block
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ee295> objd: 0x00227fab <-- Leaf Block 2
<0x07cf : 0x002ed97c> objd: 0x00227faa <-- Table Block 2
<0x07cf : 0x002ee295> objd: 0x00227fab <-- Leaf Block 2

1.2.2.4 Test-4 Index Rowid Get

In this last test, we construct a pipelined function to produce a bunch of id numbers.

create type number_tab as table of number;


/

create or replace function id_pipelined(p_cnt number, p_sleep_seconds number := 0)


return number_tab pipelined as
begin
for i in 1..p_cnt loop
if p_sleep_seconds > 0 then
dbms_lock.sleep(p_sleep_seconds); -- intentional delay, especially before producing the first row
end if;
pipe row(i);
end loop;
return;
exception
when no_data_needed then
raise;
when others then
dbms_output.put_line('others Handler');
raise;
end;
/

and then we use these numbers to force index rowid gets at a given frequency.

select /*+ leading(c) cardinality(c 10) indx(t test_tab#u1) index_rowid_pipelined */ *


from table(cast (id_pipelined(7, 1.3) as number_tab)) c
,test_tab t
where c.column_value = t.id;

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 9.10 0 17 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.00 9.10 0 17 0 7

Misses in library cache during parse: 0


Optimizer mode: ALL_ROWS
Parsing user id: 49

Rows Row Source Operation (2 Branch Block, 8 Leaf Blocks, 7 Table Blocks)
---- ---------------------------------------------------
7 NESTED LOOPS (cr=17 pr=0 pw=0 time=9103769 us cost=35 size=160 card=10)
7 NESTED LOOPS (cr=10 pr=0 pw=0 time=9103227 us cost=35 size=160 card=10)
7 COLLECTION ITERATOR PICKLER FETCH ID_PIPELINED (cr=0 pr=0 pw=0 time=9102140 us cost=30 size=20 card=10)
7 INDEX UNIQUE SCAN TEST_TAB#U1 (cr=10 pr=0 pw=0 time=1058 us cost=1 size=0 card=1)(object id 2260907)
7 TABLE ACCESS BY INDEX ROWID TEST_TAB (cr=7 pr=0 pw=0 time=531 us cost=1 size=14 card=1)

Elapsed times include waiting on following events:


Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2 0.00 0.00
PL/SQL lock timer 7 1.30 9.10
SQL*Net message from client 2 0.08 0.08

The query requires 2 branch block gets, 8 leaf block gets and 7 table block gets, in total 17 consistent
gets. The 10200 trace contains the concrete read calls:

ktrgtc2(): started for block


<0x07cf : 0x002ee293> objd: 0x00227fab <-- Branch Block
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee293> objd: 0x00227fab <-- Branch Block
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ed97b> objd: 0x00227faa <-- Table Block 1
<0x07cf : 0x002ee294> objd: 0x00227fab <-- Leaf Block 1
<0x07cf : 0x002ee295> objd: 0x00227fab <-- Leaf Block 2
<0x07cf : 0x002ed97c> objd: 0x00227faa <-- Table Block 2
<0x07cf : 0x002ee295> objd: 0x00227fab <-- Leaf Block 2
<0x07cf : 0x002ed97c> objd: 0x00227faa <-- Table Block 2

Compared with the previous Test-3 (index unique scan), we have 4 more table block gets. The nested loops
in the xplan show that we get one rowid and immediately fetch the row it points to from the table block.
The 10200 trace also confirms this zigzag traversal path.

Looking at the Row Source Operation of the first test, it fetches rows by table access by user rowid, whereas
in this test the Row Source Operation shows table access by index rowid. The difference is that the former
uses ROWIDs specified by the user, while the latter gets the ROWIDs from the preceding outer (driving)
operation: the index unique scan.

1.2.3 Test Discussions

1.2.3.1 Access Paths

Recapping the stats of the above 4 tests, we can see the differences in the number of block gets among the
4 operations, although they all return the same 7 rows:

USER ROWID:         3 Blocks (0 Branch Block,  0 Leaf Blocks, 3 Table Blocks)
INDEX RANGE SCAN:   7 Blocks (1 Branch Block,  3 Leaf Blocks, 3 Table Blocks)
INDEX UNIQUE SCAN: 13 Blocks (2 Branch Blocks, 8 Leaf Blocks, 3 Table Blocks)
INDEX ROWID GET:   17 Blocks (2 Branch Blocks, 8 Leaf Blocks, 7 Table Blocks)

The fourth test is the most expensive one because a pipelined function is row-by-row processing. The
consumer starts processing only after the producer has produced a row (the principle of "pipelining"). In
the above example, each time the consumer has to wait for p_sleep_seconds until receiving a row from the
producer. In real applications, during those p_sleep_seconds any modifications can occur, and in order to
maintain read consistency from the query start SCN, Oracle has to use undo to clone buffers, hence it is
prone to 'latch: cache buffers chains' contention (see the Demo in later Section 1.2.4).

In the aspect of disk reads, row-by-row access is fulfilled by db file sequential read instead of the more
efficient multiblock db file scattered read or db file parallel read, as discussed in the previous Section: 1.1.

The first test with user rowid is the cheapest fetching approach. In practice, it can be implemented
by first caching the ROWIDs of frequently fetched rows and subsequently making the direct table fetch,
thereby bypassing index access, as sketched below.

1.2.3.2 latch: cache buffers chains on Index Blocks

An index is made of a subset of the table columns. Usually it is more condensed than the table, i.e., each
branch/leaf block can contain more rows than a table block. In other words, the same index block is touched
more frequently than a table block, hence more prone to latch: cache buffers chains (CBC) contention when
multiple sessions access it concurrently (in Chapter Locks, Latches and Mutexes - Section Latches 3.2, we
will take a further look at CBC latches).

1.2.4 latch: cache buffers chains Demo

One frequent issue with consistent gets is latch: cache buffers chains. As an example, we can use our
previous pipelined function to build a test case and demonstrate this Wait Event.

create or replace procedure index_rowid_pipelined_get(p_cnt number, p_sleep_seconds number := 0) as


begin
for c in (
select /*+ leading(c) cardinality(c 10) indx(t test_tab#u1) index_rowid_pipelined */ *
from table(cast (id_pipelined(p_cnt, p_sleep_seconds) as number_tab)) c, test_tab t
where c.column_value = t.id)
loop
null;
end loop;
end;
/

create or replace procedure index_rowid_pipelined_get_jobs(p_job_cnt number, p_cnt number, p_sleep_seconds number := 0) as


l_job_id pls_integer;
begin
for i in 1.. p_job_cnt loop
dbms_job.submit(l_job_id, 'begin while true loop index_rowid_pipelined_get('||p_cnt||', '||p_sleep_seconds||'); end loop; end;');
end loop;
commit;
end;
/

Launch 6 Jobs on a UNIX machine with 6 physical Processors:

exec index_rowid_pipelined_get_jobs(6, 90, 0);

If we monitor v$bh, all blocks in v$bh have status xcur. Now run an update statement, but no
commit/rollback:

update test_tab set id = -id where abs(id) between 3 and 12; -- Without commit/rollback

Immediately we can observe wait event: latch: cache buffers chains, and buffer busy waits:

select * from v$session_wait where event in ('latch: cache buffers chains') order by event, sid;

select * from v$session_wait_history where event in ('latch: cache buffers chains') order by event, sid;

For latch: cache buffers chains, we observed the following pattern (the counters can be watched with the
query sketched below):

(1). if v$session_wait.state = 'WAITED SHORT TIME' (less than one centisecond), then
v$latch.spin_gets increases
(2). if v$session_wait.state = 'WAITED KNOWN TIME' (more than one centisecond), then
v$latch.sleeps increases
(3). if there are no entries in v$session_wait, but v$bh.status='cr' blocks are read, then
v$latch.immediate_gets increases

This behaviour still needs to be confirmed.
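A minimal query to watch these counters while the Jobs are running (v$latch aggregates all child latches; per-child figures are available in v$latch_children):

select name, gets, misses, spin_gets, sleeps, immediate_gets
from v$latch
where name = 'cache buffers chains';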

By querying v$bh, we can see consistent read copies (cloned blocks) marked as cr:

select d.object_name, data_object_id, b.status, b.*


from v$bh b, dba_objects d
where b.objd = d.data_object_id
and object_name in ('TEST_TAB', 'TEST_TAB#U1')
--and b.status not in ('free')
and b.status in ('xcur', 'cr')
order by b.status, object_name, block#;

By the way, in this test, Event 10201 (consistent read undo application) can be used to trace CR reads,
which are generated by applying undo blocks.
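For example, wrapped around the query of interest (level 10 is chosen here by analogy with Event 10200 above):

alter session set events '10201 trace name context forever, level 10';
-- run the query performing consistent gets ...
alter session set events '10201 trace name context off';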

To stop the previously launched test Jobs, we can call the procedure clean_jobs below.

-------------- clean_jobs --------------

create or replace procedure clean_jobs as


begin
for c in (select job from dba_jobs) loop
begin
dbms_job.remove (c.job);
exception when others then null;
end;
commit;
end loop;

for c in (select d.job, d.sid, (select serial# from v$session where sid = d.sid) ser
from dba_jobs_running d) loop
begin
execute immediate
'alter system kill session '''|| c.sid|| ',' || c.ser|| ''' immediate';
dbms_job.remove (c.job);
exception when others then null;
end;
commit;
end loop;

-- select * from dba_jobs;


-- select * from dba_jobs_running;
end;
/

1.3 Logical Read - Current Get

Following the discussion of the first mode of logical reads (consistent get), it is time to talk about the second
one: current get. We will use Sql Trace Event 10046 to display the statistics, and the Dtrace script dtracelio.d
published in Blog: Dynamic tracing of Oracle logical I/O [2] to show data/index block read details and
their sequence. Ideally Oracle would provide one event to trace current gets, analogous to Event 10200 (consistent
read buffer status) used for consistent gets in the previous section. To conclude this section, we compare the
stats reported in Oracle dynamic performance views with the output of Sql Trace and Dtrace.

All tests are done in Oracle 12.1.0.2 on Solaris with 6 physical processors.

1.3.1 Test Setup

First create one table and 2 indexes on it, each placed in a different tablespace.

drop tablespace test_tts including contents and datafiles;


drop tablespace test_its1 including contents and datafiles;
drop tablespace test_its2 including contents and datafiles;

create tablespace test_tts datafile '/oratestdb/oradata/testdb/test_tts.dbf' size 200m online;


create tablespace test_its1 datafile '/oratestdb/oradata/testdb/test_its1.dbf' size 200m online;
create tablespace test_its2 datafile '/oratestdb/oradata/testdb/test_its2.dbf' size 200m online;

drop table test_tab;

create table test_tab tablespace test_tts


as select trunc(level/1000) type_id, mod(level, 100) id1, trunc(mod(level, 1000)/100) id2, -1234 dummy
from dual connect by level <= 4099;

create unique index test_tab#u#1 on test_tab (type_id, id1, id2, dummy) tablespace test_its1;
create unique index test_tab#u#2 on test_tab (type_id, id2, id1) tablespace test_its2;

exec dbms_stats.gather_table_stats(null, 'TEST_TAB', cascade=>true);

Then collect their meta info which will appear in Dtrace output.

select name, ts# from v$tablespace where name like '%TEST%' or ts# in (2, 1999);
--UNDO 2
--USER1 1999
--TEST_TTS 2231
--TEST_ITS1 2232
--TEST_ITS2 2233

select rfile#, name, v.* from v$datafile v where name like '%test%';
--917 /oratestdb/oradata/testdb/test_tts.dbf
--931 /oratestdb/oradata/testdb/test_its1.dbf
--101 /oratestdb/oradata/testdb/test_its2.dbf

select object_name, object_id from dba_objects where object_name like 'TEST_TAB%' or object_id in (1577349);
--RMTAB$ 1577349
--TEST_TAB 2286270
--TEST_TAB#U#1 2286271
--TEST_TAB#U#2 2286272

select segment_name, blocks from dba_segments v where segment_name like 'TEST_TAB%';


--TEST_TAB 16
--TEST_TAB#U#1 24
--TEST_TAB#U#2 16

select 'TEST_TAB'
,count(distinct dbms_rowid.rowid_relative_fno (rowid)||'-'||dbms_rowid.rowid_block_number (rowid)) used_blocks
from TEST_TAB;
--TEST_TAB 11

select 'TEST_TAB#U#1'
,count(distinct dbms_rowid.rowid_block_number(sys_op_lbid (2286271, 'L', t.rowid))) used_blocks
from TEST_TAB t;
--TEST_TAB#U#1 14

select 'TEST_TAB#U#2'
,count(distinct dbms_rowid.rowid_block_number(sys_op_lbid (2286272, 'L', t.rowid))) used_blocks
from TEST_TAB t;
-- TEST_TAB#U#2 11

-- list different functions used to perform different types of logical I/O


select indx, kcbwhdes from x$kcbwh where indx in (798, 1060, 1062, 1066, 1153, 1199);
--798 ktewh25: kteinicnt
--1060 kdiwh15: kdifxs
--1062 kdiwh17: kdifind
--1066 kdiwh22: kdifind
--1153 kdiwh169: skipscan
--1199 kddwh01: kdddel

1.3.2 Dtrace Output Descriptions

To read the terms used in the later Dtrace Output, we first repeat the descriptions of Oracle internal
functions from Blog: Dynamic tracing of Oracle logical I/O: part 2. Dtrace LIO v2 is released [3].

1.3.2.0.1 Consistent Gets

kcbgtcr: Kernel Cache Buffer Get Consistent Read. This is general entry point for consistent read.
kcbldrget: Kernel Cache Buffer Load Direct-Read Get. The function performing direct-path read.

1.3.2.0.2 Current Gets (db block gets)

kcbgcur: Kernel Cache Buffer Get Current Read.


kcbget: Kernel Cache Buffer Get Buffer, analogous to kcbgcur, called for index branch and leaf
blocks.

kcbnew: Kernel Cache Buffer New Buffer.
kcblnb: Kernel Cache Buffer Load New Buffer.

The Dtrace output consists of two parts: the first part lists kernel function calls line by line, the second part
contains a summary of call statistics.

Each line in the first part has a few columns describing the details of the function call, for example,

524191: kcbgcur(0xAED0EB0,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) obj: 2286270 dobj: 2286270] where: 1199 mode_held: 2

The column descriptions [2] are:

524191 timestamp (in nanoseconds) at function entry (analogous to truss option -d, not included in the original script).
kcbgcur(0xAED0EB0,2,1199,0) a function call with 3 non-null arguments (the fourth is 0).
the first argument is a pointer to the structure which describes a block.
the second argument is unknown, but presumably this is lock_mode (from MOS bug 7109078).
the third argument (the least significant bits) is "where" (x$kcbwh), that is, the module from which the function is called.
tsn: 2231 a tablespace number, ts# from v$tablespace for TEST_TTS.
rdba: 0xe5400083 a relative dba (data block address).
(917/131) file 917 (test_tts.dbf) block 131.
obj: 2286270 dictionary object number, object_id from dba_objects for TEST_TAB.
dobj: 2286270 data object number, data_object_id from dba_objects for TEST_TAB.
where: 1199 location from which the function (kcbgcur in this case) was executed. 1199 is INDX from x$kcbwh for "kddwh01: kdddel".

To ease Dtrace reading in the later output, we replace obj and dobj with the object name (obj is persistent
once the object is created, dobj is renewed after each truncate; both are identical in our test), and compress
"where: 1199" into "w1199" appended with the second part of the corresponding kcbwhdes from x$kcbwh.
For example, the above line will be displayed as:

524191: kcbgcur(0xAED0EB0,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2

1.3.3 Current Read Tests

We will perform 3 tests with different access paths, and trace them by Sql Trace and Dtrace. For each
test, we first show the test output, and then give a discussion.

1.3.3.1 Test-1 Delete Using 2 Different Indexes

In the first test, we use both indexes, test_tab#u#1 and test_tab#u#2, to delete 41 rows.

Here are the delete statement and the Sql Trace output.

--******************************* SQL Trace *******************************--


delete /*+ test_1 use_concat index_ss(@DEL$1 T@DEL$1 test_tab#u#2)
index_ss(@DEL$1_2 T@DEL$1_2 test_tab#u#1) */
test_tab t where 99 in (id2, id1)

41 rows deleted.

-- rollback; -- make test repeatable

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 12 211 41
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.00 0.00 0 12 211 41

Rows Row Source Operation


----- ---------------------------------------------------
0 DELETE TEST_TAB (cr=12 pr=0 pw=0 time=1569 us)
41 CONCATENATION (cr=12 pr=0 pw=0 time=356 us)
0 INDEX SKIP SCAN TEST_TAB#U#2 (cr=6 pr=0 pw=0 time=82 us cost=3 size=14 card=1)(object id 2286272)
41 INDEX SKIP SCAN TEST_TAB#U#1 (cr=6 pr=0 pw=0 time=228 us cost=8 size=574 card=41)(object id 2286271)

It uses an index skip scan on both test_tab#u#2 and test_tab#u#1. The Row Source Operation only shows
the query stats, but not the current stats. Both test_tab#u#2 (object id 2286272) and test_tab#u#1 (object id
2286271) are marked with cr=6, together 12 consistent gets. The current mode gets, however, appear in the
xplan statistics under the column current.
Here Dtrace output of function calls:

(Note that Dtrace Script statistics are double counted, which will be discussed in later Section 1.3.5)

$ > dtracelio.d 23585

Dynamic tracing of Oracle logical I/O v2.1 by Alexander Anokhin (http://alexanderanokhin.wordpress.com)

108075: kcbgtcr(0xAE17D58,0,1153,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1153 skipscan exam: 0
124990: kcbgtcr(0xAE17D58,0,1153,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1153 skipscan exam: 0
--Repeated lines removed--

424351: kcbgtcr(0xA0EF830,0,1153,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1153 skipscan exam: 0
428303: kcbgtcr(0xA0EF830,0,1153,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1153 skipscan exam: 0
--Repeated lines removed--

511748: kcbgcur(0xAED0EB0,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
524191: kcbgcur(0xAED0EB0,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
658937: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
663567: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
695708: kcbget(0xFDF4660,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
707668: kcbget(0xFDF4660,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
745953: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
750657: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
759713: kcbget(0xFDF4660,2,1066,0) [tsn: 2233 rdba: 0x19400084 (101/132) TEST_TAB#U#2] w1066 kdifind mode_held: 2
764240: kcbget(0xFDF4660,2,1066,0) [tsn: 2233 rdba: 0x19400084 (101/132) TEST_TAB#U#2] w1066 kdifind mode_held: 2

784175: kcbgcur(0xAED0EB0,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
789056: kcbgcur(0xAED0EB0,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
806713: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
811388: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
818717: kcbget(0xFDF4660,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
823620: kcbget(0xFDF4660,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
833734: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
838017: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
843481: kcbget(0xFDF4660,2,1066,0) [tsn: 2233 rdba: 0x19400084 (101/132) TEST_TAB#U#2] w1066 kdifind mode_held: 2
847875: kcbget(0xFDF4660,2,1066,0) [tsn: 2233 rdba: 0x19400084 (101/132) TEST_TAB#U#2] w1066 kdifind mode_held: 2
--Repeated lines removed (40 Repeats)--

The Dtrace output of call statistics:

===================== Logical I/O Summary (grouped by object/function) ==============


function stat object_id data_object_id mode_held where bufs calls
--------- ------- ----------- ---------------- ----------- ------- -------- ---------
kcbgcur cu 0 -1 2 53 2 2
kcbgcur cu 0 -1 2 88 2 2
kcbgcur cu 0 -1 2 86 4 4
kcbnew cu 0 -1 47 4 4
kcbgtcr cr 2286271 2286271 1153 12 12
kcbgtcr cr 2286272 2286272 1153 12 12
kcbgcur cu 2286270 2286270 2 1199 82 82
kcbgcur cu 2286271 2286271 1 1062 82 82
kcbgcur cu 2286272 2286272 1 1062 82 82
kcbget cu 2286271 2286271 2 1066 82 82
kcbget cu 2286272 2286272 2 1066 82 82
=====================================================================================

================================= Logical I/O Summary (grouped by object) ==========================================


object_id data_object_id lio cr cr (e) cr (d) cu cu (d) ispnd (Y) ispnd (N) pin rls
---------- --------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
0 -1 12 0 0 0 12 0 0 0 4
2286270 2286270 82 0 0 0 82 0 0 0 82
2286271 2286271 176 12 0 0 164 0 0 0 12
2286272 2286272 176 12 0 0 164 0 0 0 12
---------- --------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
total 446 24 0 0 422 0 0 0 110
====================================================================================================================

Legend
lio : logical gets (cr + cu)
cr : consistent gets
cr (e) : consistent gets - examination
cr (d) : consistent gets direct
cu : db block gets
cu (d) : db block gets direct
ispnd (Y): buffer is pinned count
ispnd (N): buffer is not pinned count
pin rls : pin releases

If we look only at the kernel function calls of current gets, Dtrace shows that each current get (kcbgcur) of a
table block of test_tab (obj: 2286270) is followed by two current gets of an index block of test_tab#u#1 (obj:
2286271), one kcbgcur and one kcbget, then by two similar current gets of an index block of test_tab#u#2
(obj: 2286272). This indicates that we first delete the rows in the table block, then the corresponding entries
in the index blocks.

In total, 41 deleted rows result in:

41 (test_tab) + 2*41 (test_tab#u#1) + 2*41 (test_tab#u#2) = 205 Current Gets

plus 4 current gets (kcbgcur) and 2 current gets (kcbnew) of undo segments (object id: 0), in total,
205 + 4 + 2 = 211.

If we read from the first line of the Dtrace output, we can see that it starts with consistent gets (kcbgtcr) of
test_tab#u#2 and test_tab#u#1 at location w1153 skipscan, which match the xplan Row Source Operations
index skip scan test_tab#u#2 and index skip scan test_tab#u#1. That is exactly the initial select phase
of DML statement execution, since a delete statement first has to find the rows to be deleted; then comes
the second phase of locking the rows, and the third phase of deleting them.

1.3.3.2 Test-2 Delete Using One Index

In the second test, we delete the same number of rows by using only one index: test_tab#u#2.

Here is the Sql Trace output.

--******************************* SQL Trace *******************************--


delete /*+ test_2 index(t test_tab#U#2) */
test_tab t where 99 in (id2, id1)

41 rows deleted.

-- rollback; -- make test repeatable

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 12 75 41
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.00 0.00 0 12 75 41

Rows Row Source Operation


---- ---------------------------------------------------
0 DELETE TEST_TAB (cr=12 pr=0 pw=0 time=1025 us)
41 INDEX FULL SCAN TEST_TAB#U#2 (cr=12 pr=0 pw=0 time=405 us cost=6 size=574 card=41)(object id 2286272)

The xplan uses an index full scan on test_tab#u#2 (object id 2286272).

Here is the first part of the Dtrace output:

--******************************* dtrace *******************************--

$ > dtracelio.d 23585

Dynamic tracing of Oracle logical I/O v2.1 by Alexander Anokhin ( http://alexanderanokhin.wordpress.com )

233462: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
245054: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
365187: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
369681: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
404103: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
408598: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
429972: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
434528: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400083 (917/131) TEST_TAB] w1199 kdddel mode_held: 2
453201: kcbgtcr(0xAED1730,0,1060,0) [tsn: 2233 rdba: 0x19400085 (101/133) TEST_TAB#U#2] w1060 kdifxs exam: 0
458011: kcbgtcr(0xAED1730,0,1060,0) [tsn: 2233 rdba: 0x19400085 (101/133) TEST_TAB#U#2] w1060 kdifxs exam: 0
479314: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
484221: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
506878: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
511463: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
532201: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
536500: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
556552: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
560690: kcbgcur(0xAED9270,2,1199,0) [tsn: 2231 rdba: 0xe5400084 (917/132) TEST_TAB] w1199 kdddel mode_held: 2
577493: kcbgtcr(0xAED1730,0,1060,0) [tsn: 2233 rdba: 0x19400086 (101/134) TEST_TAB#U#2] w1060 kdifxs exam: 0
581798: kcbgtcr(0xAED1730,0,1060,0) [tsn: 2233 rdba: 0x19400086 (101/134) TEST_TAB#U#2] w1060 kdifxs exam: 0
--Repeated lines removed--

1786824: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
1791226: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
1830860: kcbget(0xFDF3200,2,1066,0) [tsn: 2233 rdba: 0x19400084 (101/132) TEST_TAB#U#2] w1066 kdifind mode_held: 2
1841407: kcbget(0xFDF3200,2,1066,0) [tsn: 2233 rdba: 0x19400084 (101/132) TEST_TAB#U#2] w1066 kdifind mode_held: 2
1885701: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
1890635: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
1904651: kcbget(0xFDF3200,2,1066,0) [tsn: 2233 rdba: 0x19400085 (101/133) TEST_TAB#U#2] w1066 kdifind mode_held: 2
1909925: kcbget(0xFDF3200,2,1066,0) [tsn: 2233 rdba: 0x19400085 (101/133) TEST_TAB#U#2] w1066 kdifind mode_held: 2
--Repeated lines removed--

2202041: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
2206643: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
2220344: kcbget(0xFDF3200,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
2224948: kcbget(0xFDF3200,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
2242122: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
2246527: kcbgcur(0xFDF3200,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
2255942: kcbget(0xFDF3200,2,1066,0) [tsn: 2232 rdba: 0xe8c0008a (931/138) TEST_TAB#U#1] w1066 kdifind mode_held: 2
2260392: kcbget(0xFDF3200,2,1066,0) [tsn: 2232 rdba: 0xe8c0008a (931/138) TEST_TAB#U#1] w1066 kdifind mode_held: 2
--Repeated lines removed--

Here is the second part of the Dtrace output:

===================== Logical I/O Summary (grouped by object/function) ==============

function stat object_id data_object_id mode_held where bufs calls
--------- ------- ----------- ---------------- ----------- ------- -------- ---------
kcbgcur cu 0 -1 2 53 2 2
kcbgcur cu 0 -1 2 86 2 2
kcbgcur cu 0 -1 2 88 2 2
kcbgtcr cr 2286272 2286272 1298 2 2
kcbgtcr cr 2286272 2286272 1299 2 2
kcbnew cu 0 -1 47 2 2
kcbgcur cu 2286271 2286271 1 1062 8 8
kcbget cu 2286271 2286271 2 1066 8 8
kcbgtcr cr 2286272 2286272 1060 20 20
kcbgcur cu 2286272 2286272 1 1062 22 22
kcbget cu 2286272 2286272 2 1066 22 22
kcbgcur cu 2286270 2286270 2 1199 82 82
=====================================================================================

================================= Logical I/O Summary (grouped by object) ==========================================


object_id data_object_id lio cr cr (e) cr (d) cu cu (d) ispnd (Y) ispnd (N) pin rls
---------- --------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
0 -1 8 0 0 0 8 0 0 0 2
2286270 2286270 82 0 0 0 82 0 0 0 82
2286271 2286271 16 0 0 0 16 0 0 0 0
2286272 2286272 68 24 2 0 44 0 0 2 18
---------- --------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
total 174 24 2 0 150 0 0 2 102
====================================================================================================================

Legend
lio : logical gets (cr + cu)
cr : consistent gets
cr (e) : consistent gets - examination
cr (d) : consistent gets direct
cu : db block gets
cu (d) : db block gets direct
ispnd (Y): buffer is pinned count
ispnd (N): buffer is not pinned count
pin rls : pin releases

In this test, there are at first 41 current gets (kcbgcur) of test_tab (obj: 2286270) for the 41 deleted rows,
then 22 current gets (kcbgcur and kcbget) of test_tab#u#2 (obj: 2286272), and 8 current gets (kcbgcur and
kcbget) of test_tab#u#1 (obj: 2286271):

41 (test_tab) + 22 (test_tab#u#2) + 8 (test_tab#u#1) = 71 Current Gets

plus 3 current gets (kcbgcur) and 1 current get (kcbnew) of undo segments (object id: 0), all together,
71 + 3 + 1 = 75.

1.3.3.3 Test-3 Delete Using Same Index Twice

In the third test, we delete the same number of rows by using a single index, test_tab#u#2, twice.

Here is the Sql Trace output.

--******************************* SQL Trace *******************************--


delete /*+ test_3 use_concat index_ss(@DEL$1_1 T@DEL$1 test_tab#u#2)
index_ffs(@DEL$1_2 T@DEL$1_2 test_tab#u#2) */
test_tab t where 99 in (id2, id1)

41 rows deleted.

-- rollback; -- make test repeatable

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 23 211 41
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.00 0.00 0 23 211 41

Rows Row Source Operation


----- ---------------------------------------------------
0 DELETE TEST_TAB (cr=23 pr=0 pw=0 time=1180 us)
41 CONCATENATION (cr=23 pr=0 pw=0 time=398 us)
0 INDEX SKIP SCAN TEST_TAB#U#2 (cr=6 pr=0 pw=0 time=72 us cost=3 size=14 card=1)(object id 2286272)
41 INDEX FAST FULL SCAN TEST_TAB#U#2 (cr=17 pr=0 pw=0 time=242 us cost=5 size=574 card=41)(object id 2286272)

The xplan shows that it uses index test_tab#u#2 twice: the first access is an index skip scan and the second
an index fast full scan.

Here is the first part of the Dtrace output:

--******************************* dtrace *******************************--

131208: kcbgtcr(0xFDE5230,0,798,0) [tsn: 0 rdba: 0x5133b8 (1/1127352) RMTAB$] w798 kteinicnt exam: 0
181969: kcbgtcr(0xFDE4E40,0,798,0) [tsn: 0 rdba: 0x5133b8 (1/1127352) RMTAB$] w798 kteinicnt exam: 0
--Repeated lines removed --

2537995: kcbgcur(0xA0FB028,2,1199,0) [tsn: 2231 rdba: 0xe5400085 (917/133) TEST_TAB] w1199 kdddel mode_held: 2
2543063: kcbgcur(0xA0FB028,2,1199,0) [tsn: 2231 rdba: 0xe5400085 (917/133) TEST_TAB] w1199 kdddel mode_held: 2
2557738: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
2562180: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2232 rdba: 0xe8c00083 (931/131) TEST_TAB#U#1] w1062 kdifind mode_held: 1
2569550: kcbget(0xFDF4660,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
2575512: kcbget(0xFDF4660,2,1066,0) [tsn: 2232 rdba: 0xe8c00087 (931/135) TEST_TAB#U#1] w1066 kdifind mode_held: 2
2584580: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
2589073: kcbgcur(0xFDF4660,1,1062,0) [tsn: 2233 rdba: 0x19400083 (101/131) TEST_TAB#U#2] w1062 kdifind mode_held: 1
2594662: kcbget(0xFDF4660,2,1066,0) [tsn: 2233 rdba: 0x19400086 (101/134) TEST_TAB#U#2] w1066 kdifind mode_held: 2
2599288: kcbget(0xFDF4660,2,1066,0) [tsn: 2233 rdba: 0x19400086 (101/134) TEST_TAB#U#2] w1066 kdifind mode_held: 2
--Repeated lines removed (40 Repeats)--

Here is the second part of the Dtrace output:

===================== Logical I/O Summary (grouped by object/function) ==============


function stat object_id data_object_id mode_held where bufs calls
--------- ------- ----------- ---------------- ----------- ------- -------- ---------
kcbgcur cu 0 -1 2 53 2 2
kcbgcur cu 0 -1 2 88 2 2
kcbgtcr cr 1577349 1577349 799 2 2
kcbgtcr cr 2286272 2286272 798 2 2
kcbgcur cu 0 -1 2 86 4 4
kcbgtcr cr 1577349 1577349 798 4 4
kcbgtcr cr 2286272 2286272 799 4 4
kcbgtcr cr 2286272 2286272 800 4 4
kcbnew cu 0 -1 47 4 4
kcbgtcr cr 2286272 2286272 1153 12 12
kcbgtcr cr 2286272 2286272 1118 24 24
kcbgcur cu 2286270 2286270 2 1199 82 82
kcbgcur cu 2286271 2286271 1 1062 82 82
kcbgcur cu 2286272 2286272 1 1062 82 82
kcbget cu 2286271 2286271 2 1066 82 82
kcbget cu 2286272 2286272 2 1066 82 82
=====================================================================================

================================= Logical I/O Summary (grouped by object) ==========================================


object_id data_object_id lio cr cr (e) cr (d) cu cu (d) ispnd (Y) ispnd (N) pin rls
---------- --------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
1577349 1577349 6 6 0 0 0 0 0 0 6
0 -1 12 0 0 0 12 0 0 0 4
2286270 2286270 82 0 0 0 82 0 0 0 82
2286271 2286271 164 0 0 0 164 0 0 0 0
2286272 2286272 210 46 0 0 164 0 0 0 46
---------- --------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
total 474 52 0 0 422 0 0 0 138
====================================================================================================================

This test makes 211 current mode gets, the same as Test-1. Although only one single index, test_tab#u#2,
is shown in the xplan, Dtrace shows that it follows the same access pattern as Test-1 and reads both
test_tab#u#1 and test_tab#u#2 in current mode; Sql Trace, however, is not able to expose the test_tab#u#1 gets.

This reveals a noticeable deficiency of Sql Trace, which is not able to list the complete data access information
in this case. If we depended only on the xplan, which suggests test_tab#u#1 is not used, we might conclude
that it could be dropped.
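A related trap: classic index usage monitoring records only optimizer usage, not the DML maintenance seen above, so it would presumably report test_tab#u#1 as unused here as well. A minimal sketch (v$object_usage lists only indexes of the current schema):

alter index test_tab#u#1 monitoring usage;
-- ... run the Test-3 delete and rollback ...
select index_name, monitoring, used from v$object_usage where index_name = 'TEST_TAB#U#1';
alter index test_tab#u#1 nomonitoring usage;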

Test-3 Sql Trace published 23 consistent mode gets (query=23), but Dtrace shows 26 (=52/2). The 3
(=26-23) extra gets are reads of RMTAB$ (object id = 1577349), which are not disclosed in Sql Trace (see
MOS: Bug 20853821 - High number of executions for queries on rmtab$ (Doc ID 20853821.8)). Probably
they stem from recursive calls, which can be made visible in Sqlplus autotrace. Again, this shows that Sql
Trace is not able to list all logical reads. This often occurs with Sql hard parsing or cursor age-out. In
such cases, reading the raw Sql Trace file provides more details.

1.3.4 Sql Trace, Dtrace and Oracle Performance Views

We have seen that in each test, the stats shown by the Sql Trace output and the Dtrace output are identical
for both modes of logical read: query and current (the Test-3 Dtrace output contains extra gets from recursive
calls on RMTAB$ (object id = 1577349), as already discussed before).

To recap the execution statistics of the previous 3 tests, we can summarize them in Table 1.3 below. Although
each test deletes the same 41 rows, their block gets vary with the different access paths.

Test               Test-1   Test-2   Test-3
Consistent Gets        12       12       23
Current Gets          211       75      211

Table 1.3: Execution Statistics

The table shows that the majority of the logical reads are current gets, since our tests are all about DML
statements.

With Sql Trace and Dtrace, we can obtain the exact stats of logical reads and kernel function calls. Now
we should look at the stats reported in the daily used dynamic performance views. We will run the above
3 tests 10000 times each, one after another, and watch their statistics in view v$sql.

alter system flush shared_pool;

declare
p_cnt number := 10000;
begin
for i in 1..p_cnt loop
delete /*+ test_1 use_concat index_ss(@DEL$1 T@DEL$1 test_tab#u#2) index_ss(@DEL$1_2 T@DEL$1_2 test_tab#u#1) */
test_tab t where 99 in (id2, id1);
rollback;
end loop;

for i in 1..p_cnt loop


delete /*+ test_2 index(t test_tab#U#2) */ test_tab t where 99 in (id2, id1);
rollback;
end loop;

for i in 1..p_cnt loop


delete /*+ test_3 use_concat index_ss(@DEL$1_1 T@DEL$1 test_tab#u#2) index_ffs(@DEL$1_2 T@DEL$1_2 test_tab#u#2) */
test_tab t where 99 in (id2, id1);
rollback;
end loop;
end;
/

select substr(sql_text, 1, 18) sql_text, buffer_gets, round(buffer_gets/10000) buffer_gets_per_exec


,rows_processed, physical_read_bytes, physical_write_bytes, elapsed_time, cpu_time, sql_id --, v.*
from v$sql v where lower(sql_text) like 'delete%test%' order by v.sql_text;

SQL_TEXT DELETE /*+ test_1 DELETE /*+ test_2 DELETE /*+ test_3
---------------------- ----------------- ------------------- -------------------
BUFFER_GETS 2,210,823 870,403 2,320,261
BUFFER_GETS_PER_EXEC 221 87 232
ROWS_PROCESSED 410,000 410,000 410,000
PHYSICAL_READ_BYTES 1,146,880 786,432 401,408
PHYSICAL_WRITE_BYTES 210,026,496 210,026,496 105,013,248
ELAPSED_TIME 7,534,494 7,802,519 9,445,604
CPU_TIME 7,400,730 7,755,552 9,320,234
SQL_ID g2267fmk9u39r 5xqavr92xbvw7 66dw2ur1uakhn

(vertically displayed to fit in the page)

Compared with the previous Sql Trace and Dtrace outputs, BUFFER_GETS_PER_EXEC computed from v$sql
gives similar results. Test-2 had the least BUFFER_GETS, whereas Test-1 and Test-3 made almost 3
times more.

By the way, instead of querying v$sql, it is recommended to use v$sqlstats, which, as of Oracle 10, does
not require the library cache latch [9]. According to the Oracle documentation, it is faster, more scalable, and
has a greater data retention, but it contains only a subset of the columns that appear in v$sql and v$sqlarea.
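For example, the figures above can equally be pulled from v$sqlstats, using the sql_ids reported:

select sql_id, executions, buffer_gets,
       round(buffer_gets / nullif(executions, 0)) buffer_gets_per_exec
from v$sqlstats
where sql_id in ('g2267fmk9u39r', '5xqavr92xbvw7', '66dw2ur1uakhn');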

1.3.5 Dtrace Script Double Counted Statistics

Look at the Sql Trace output and Dtrace output for Test-1 and Test-2:

Test-1: Sql Trace query = 12 current = 211
        Dtrace    cr    = 24 cu      = 422

Test-2: Sql Trace query = 12 current = 75
        Dtrace    cr    = 24 cu      = 150

The Dtrace output shows even numbers for cr and cu, but in all 3 tests we purposely deleted and selected an
odd number of rows (41). It looks like Dtrace double counts in the line output as well as in the statistics. This
can also be verified with Sqlplus autotrace (as sketched below), v$sql.buffer_gets, and
sys.dbms_sqltune.report_sql_monitor. Blog [51] has a more detailed discussion.
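For instance, a quick cross-check in Sqlplus (statistics only; the 'db block gets' and 'consistent gets' counters reported there should match Sql Trace, i.e., half of the Dtrace figures):

set autotrace traceonly statistics
delete /*+ test_2 index(t test_tab#U#2) */ test_tab t where 99 in (id2, id1);
rollback;
set autotrace off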

1.3.6 dtracelio.d

See Blog: Dynamic tracing of Oracle logical I/O [2].

Chapter 2

Redo and Undo

Once data is moved from disk to memory (SGA), it has to be accessed by multiple concurrent sessions in a
coordinated way under the ACID regime. These are the tasks of Oracle Undo and Redo.

In this Chapter, we first discuss undo, then redo, and finally demonstrate how to find the exact commit SCN
with the learned knowledge.

2.1 Undo Practices

The book Oracle Core: Essential Internals for DBAs and Developers [15] declares the change vector (the
heart of redo and undo) to be the most important feature of Oracle, and gives a detailed description of the undo
and redo mechanisms (Chapters 1 and 2). While redo data (the after image) is written to be forgotten, undo
data (the before image) stays active over the lifetime of the instance; each transaction is even uniquely identified
by its undo slot fields. Therefore it is worth doing a couple of undo exercises first. To demonstrate
the paramount role of the undo mechanism, in the later section 2.1.4.7 we will show how to crash an Oracle
instance via SMON parallel transaction recovery. At the end of this section, we also have a short discussion
of Oracle LOB undo.

Note: All tests are done in Oracle 12c1 with undo_management=auto and db_block_size=8192.

2.1.1 Undo Organization

At first, we follow the description of the book Oracle Core, perform a small test, and dump a few data
blocks, undo header blocks and undo blocks. We will make two sets of dumps. The first set (Dump1) is
to run the update as well as commit, then dump, so it is a dump after commit; the second (Dump2) is to do
the update, then dump, so it is a dump without commit.

Then we go through all the dumps and try to understand them with reference to the above book. Only with
concrete examples can we gain a profound comprehension of undo complexity.

Here is our code to create a small table, insert two wide rows (one data block per row), collect meta info,
then make some updates to fill the Itl with committed entries.

alter system set transactions_per_rollback_segment=1 scope=spfile;
alter system set "_rollback_segment_count"=1 scope=spfile;
--alter system reset "_rollback_segment_count" scope=spfile;
--Re-Start DB

drop table test_tab;


create table test_tab (id number, seq number, n1 varchar2(3000), n2 varchar2(3000));

select object_name, object_id from dba_objects where object_name = 'TEST_TAB';

OBJECT_NAME OBJECT_ID
TEST_TAB 2392300

-- each row occupies one Block


insert into test_tab values (1, 1, rpad('a', 3000, 'a'), rpad('x', 3000, 'x'));
insert into test_tab values (2, 2, rpad('a', 3000, 'a'), rpad('x', 3000, 'x'));
commit;

-- update so that the Itl (minimum 2 in each data block) gets filled


update test_tab set seq = 3 where id = 1;
update test_tab set seq = 4 where id = 2;
commit;

update test_tab set seq = 5 where id = 1;


update test_tab set seq = 6 where id = 2;
commit;

2.1.1.1 Dump1: update and commit

To start a transaction, we update two rows, run queries to collect transaction and undo segment info for
the later dump, then commit the transaction.

==================== Dump1 ====================

update test_tab set seq = 1 where id = 1;


update test_tab set seq = 2 where id = 2;

-- get transaction info


select s.sid, t.* from v$transaction t, v$session s where t.ses_addr=s.saddr;

SID ADDR XIDUSN XIDSLOT XIDSQN UBAFIL UBABLK UBASQN UBAREC STATUS
366 0000000184F06E18 4 24 271659 3 958 -15725 10 ACTIVE

select dbms_transaction.local_transaction_id from dual;


4.24.271659 (hex 0x0004.018.0004252b)

-- get undo segment header block info


select * from dba_rollback_segs where segment_id = 4; -- XIDUSN: 4

SEGMENT_NAME OWNER TABLESPACE_NAME SEGMENT_ID FILE_ID BLOCK_ID STATUS


_SYSSMU4_1733607116$ PUBLIC Undo 4 3 952 ONLINE

-- get ACTIVE undo extents info


select * from dba_undo_extents where segment_name = '_SYSSMU4_1733607116$' and status = 'ACTIVE';

SEGMENT_NAME TABLESPACE_NAME EXTENT_ID FILE_ID BLOCK_ID BYTES BLOCKS STATUS


_SYSSMU4_1733607116$ UNDO 0 3 952 65536 8 ACTIVE

commit;

With the above collected info, we can perform the following 3 dumps:

(1). dump1_datablock_row2 for the table data block of row 2
(2). dump1_undoheader for the undo header
(3). dump1_undoblock_row2 for the undo block of row 2's update

By the way, we can see that the segment name (_SYSSMU4_1733607116$) contains the segment id number 4
(XIDUSN 4). We can also monitor the undo space distribution by querying dba_undo_extents for active,
expired and unexpired extents, as sketched in the query below.
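For example, a minimal space summary per extent status, using the segment name from above:

select status, count(*) extents, round(sum(bytes)/1024/1024, 1) mb
from dba_undo_extents
where segment_name = '_SYSSMU4_1733607116$'
group by status;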

Here is the code to generate the dumps:

alter system checkpoint;


alter system flush buffer_cache; -- write dirty buffers to disk

-- dump Data Block of row 2

select t.id, rowid rd,


dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'TEST_TAB') afn,
dbms_rowid.rowid_block_number(t.rowid) block
from test_tab t where id = 2;

ID RD AFN BLOCK
2 AAJIDsAAAAAKCuFAAA 1548 2632581

alter session set tracefile_identifier = 'dump1_datablock_row2';


alter system dump datafile 1548 block 2632581;

-- dump Undo segment Header Block

alter session set tracefile_identifier = 'dump1_undoheader';


alter system dump datafile 3 block 952;

-- dump Undo Block of row 2, '00c003be' is


-- Uba: 0x00c003be.c293.0a of Itl 0x02 in 'dump1_datablock_row2'.

select dbms_utility.data_block_address_file (to_number('00c003be', 'XXXXXXXXX')) file_id


,dbms_utility.data_block_address_block(to_number('00c003be', 'XXXXXXXXX')) block_no
from dual;

FILE_ID BLOCK_NO
3 958

alter session set tracefile_identifier = 'dump1_undoblock_row2';


alter system dump datafile 3 block 958;

In the above dump script, to make dump1_undoblock_row2, we open the data block dump file dump1_datablock_row2,
find Itl 0x02 by Xid: 0x0004.018.0004252b (returned by dbms_transaction.local_transaction_id),
then pick the block address (0x00c003be) from the Uba. Here is the Itl line (Itl Flag --U- indicates Upper
bound commit from fast commit):

2.1.1.2 Dump2: update without commit

Following a similar procedure to Dump1, we update the same two rows again, however leaving the transaction
open (no commit). The first 3 dumps are the same as in Dump1, and the 4th is an extra one:

(1). dump2_datablock_row2 for table data block of row 2

(2). dump2_undoheader for undo header

(3). dump2_undoblock_row2 for undo block of row 2 update

(4). dump2_undoblock_row1 for undo block of row 1 update

==================== Dump2 ====================

update test_tab set seq = 3, n1 = null, n2 = null where id = 1;


update test_tab set seq = 4, n1 = null, n2 = null where id = 2;

-- get transaction info


select s.sid, t.* from v$transaction t, v$session s where t.ses_addr=s.saddr;

SID ADDR XIDUSN XIDSLOT XIDSQN UBAFIL UBABLK UBASQN UBAREC STATUS
366 0000000184F06E18 4 2 271642 3 11840 -15724 1 ACTIVE

-- get undo segment header block info


select * from dba_rollback_segs where segment_id = 4;

SEGMENT_NAME OWNER TABLESPACE_NAME SEGMENT_ID FILE_ID BLOCK_ID STATUS


_SYSSMU4_1733607116$ PUBLIC Undo 4 3 952 ONLINE

alter system checkpoint;


alter system flush buffer_cache;

-- dump Data Block of row 2

select t.id, rowid rd,


dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'TEST_TAB') afn,
dbms_rowid.rowid_block_number(t.rowid) block
from test_tab t where id = 2;

ID RD AFN BLOCK
2 AAJIDsAAAAAKCuFAAA 1548 2632581

alter session set tracefile_identifier = 'dump2_datablock_row2';


alter system dump datafile 1548 block 2632581;

-- dump Undo segment Header Block

alter session set tracefile_identifier = 'dump2_undoheader';


alter system dump datafile 3 block 952;

-- dump Undo Block of row 2, '00c02e40' is
-- Uba: 0x00c02e40.c294.01 of Itl 0x01 in 'dump2_datablock_row2'.

select dbms_utility.data_block_address_file (to_number('00c02e40', 'XXXXXXXXX')) file_id
      ,dbms_utility.data_block_address_block(to_number('00c02e40', 'XXXXXXXXX')) block_no
from dual;

FILE_ID BLOCK_NO
3 11840

alter session set tracefile_identifier = 'dump2_undoblock_row2';


alter system dump datafile 3 block 11840;

-- follow Trx Linked List to find the undo block of row 1 from above 'dump2_undoblock_row2'.
-- in 'dump2_undoblock_row2', it is marked as:
--   rdba: 0x00c003bf
-- dump it for row 1

select dbms_utility.data_block_address_file (to_number('00c003bf', 'XXXXXXXXX')) file_id
      ,dbms_utility.data_block_address_block(to_number('00c003bf', 'XXXXXXXXX')) block_no
from dual;

FILE_ID BLOCK_NO
3 959

alter session set tracefile_identifier = 'dump2_undoblock_row1';


alter system dump datafile 3 block 959;

Note that in order to make dump2_undoblock_row1, we need to open dump2_undoblock_row2 to get the rdba
(relative data block address), because we have to traverse the Trx Linked List to find the previous undo block
(to be discussed later in section 2.1.2.2).

Here are our 7 dump files from Dump1 and Dump2 (cut to a minimum, showing relevant lines only). We will
use them to show the 3 undo linked lists.

2.1.1.3 dump1_datablock_row2

==================== dump1_datablock_row2 ====================

BH (0xcaf856d8) file#: 1548 rdba: 0x00282b85 (1024/2632581) class: 1 --class: 1 is data block

seg/obj: 0x2480ec csc: 0x893.ea9f91e5 itc: 2 flg: E typ: 1 - DATA --csc: last Delayed Block Cleanout SCN

Itl Xid Uba Flag Lck Scn/Fsc


0x01 0x0003.01a.0004ee11 0x00c00204.cf42.05 C--- 0 scn 0x0893.ea9f91e4 <-ITL Uba Linked List (3)->
0x02 0x0004.018.0004252b 0x00c003be.c293.0a --U- 1 fsc 0x0000.ea9f91fc --partially filled, not yet with full scn

tab 0, row 0, @0x819


tl: 6015 fb: --H-FL-- lb: 0x2 cc: 4
col 0: [ 2] c1 03
col 1: [ 2] c1 03
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
...

2.1.1.4 dump1_undoheader

==================== dump1_undoheader ====================

BH (0xd4f66558) file#: 3 rdba: 0x00c003b8 (3/952) class: 23 --class: 23 is undo segment header block
set: 13 pool: 3 bsz: 8192 bsi: 0 sflg: 0 pwc: 0,25
dbwrid: 0 obj: -1 objn: 0 tsn: [0/2] afn: 3 hint: f

TRN CTL:: seq: 0xc293 chd: 0x0008 ctl: 0x0018 inc: 0x00000000 nfb: 0x0002
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x00c003be.c293.09 scn: 0x0893.ea9f9091

TRN TBL::

index state cflags wrap# uel scn dba nub cmt


----------------------------------------------------------------------------
0x00 9 0x00 0x4251c 0x0009 0x0893.ea9f90c1 0x00000000 0x00000000 1546605175
0x01 9 0x00 0x4252c 0x000f 0x0893.ea9f911f 0x00000000 0x00000000 1546605248
0x02 9 0x00 0x42519 0x001a 0x0893.ea9f909e 0x00c003bc 0x00000002 1546605119
0x03 9 0x00 0x4252b 0x0010 0x0893.ea9f91e0 0x00c003be 0x00000001 1546605656

0x17 9 0x00 0x4251c 0x0002 0x0893.ea9f909b 0x00c003bb 0x00000001 1546605119


0x18 9 0x00 0x4252b 0xffff 0x0893.ea9f91fc 0x00c003be 0x00000001 1546605715
0x19 9 0x00 0x42518 0x0006 0x0893.ea9f9116 0x00c003bc 0x00000001 1546605248

0x21 9 0x00 0x42519 0x0004 0x0893.ea9f91bf 0x00c003be 0x00000001 1546605635

2.1.1.5 dump1_undoblock_row2

==================== dump1_undoblock_row2 ====================

BH (0xe1ff2558) file#: 3 rdba: 0x00c003be (3/958) class: 24 --class: 24 is undo data block
set: 15 pool: 3 bsz: 8192 bsi: 0 sflg: 0 pwc: 0,25
dbwrid: 0 obj: -1 objn: 0 tsn: [0/2] afn: 3 hint: f

Undo BLK:
xid: 0x0004.018.0004252b seq: 0xc293 cnt: 0xa irb: 0xa

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x06 0x1c44 0x07 0x1ba0 0x08 0x1b30 0x09 0x1a74 0x0a 0x19ec

*-----------------------------
* Rec #0xa slt: 0x18 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x09
Undo type: Regular undo Last buffer split: No

rdba: 0x00000000
*-----------------------------
KDO undo record:
op: 0x04 ver: 0x01
op: L itl: xid: 0x0004.010.0004252b uba: 0x00c003be.c293.08
flg: C--- lkc: 0 scn: 0x0893.ea9f91e2
Array Update of 1 rows:
ncol: 4 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x00282b85 hdba: 0x00282b82
<-bdba: data block dba, hdba: segment header block dba->
itli: 2 ispac: 0 maxfr: 4858
vect = 3
col 1: [ 2] c1 07

2.1.1.6 dump2_datablock_row2

==================== dump2_datablock_row2 ====================

BH (0xabfa54d8) file#: 1548 rdba: 0x00282b85 (1024/2632581) class: 1

seg/obj: 0x2480ec csc: 0x893.ea9f923a itc: 2 flg: E typ: 1 - DATA

Itl Xid Uba Flag Lck Scn/Fsc


0x01 0x0004.002.0004251a 0x00c02e40.c294.01 ---- 1 fsc 0x1776.00000000 <-ITL Uba Linked List(1)->
0x02 0x0004.018.0004252b 0x00c003be.c293.0a C--- 0 scn 0x0893.ea9f91fc

tab 0, row 0, @0x810


tl: 9 fb: --H-FL-- lb: 0x1 cc: 2
col 0: [ 2] c1 03
col 1: [ 2] c1 05

2.1.1.7 dump2_undoheader

==================== dump2_undoheader ====================

BH (0xc1f8f1d8) file#: 3 rdba: 0x00c003b8 (3/952) class: 23

TRN CTL:: seq: 0xc294 chd: 0x001a ctl: 0x0017 inc: 0x00000000 nfb: 0x0002
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x00c003bf.c293.01 scn: 0x0893.ea9f909e <-SLOT Linked List(1)->

TRN TBL::

index state cflags wrap# uel scn dba nub cmt


---------------------------------------------------------------------------
0x00 9 0x00 0x4251c 0x0009 0x0893.ea9f90c1 0x00000000 0x00000000 1546605175
0x01 9 0x00 0x4252c 0x000f 0x0893.ea9f911f 0x00000000 0x00000000 1546605248
0x02 10 0x80 0x4251a 0x0000 0x0893.ea9f91fc 0x00c02e40 0x00000002 0 <-TRX Linked List(1)->
0x03 9 0x00 0x4252b 0x0010 0x0893.ea9f91e0 0x00c003be 0x00000001 1546605656

0x17 9 0x00 0x4251d 0xffff 0x0893.ea9f9235 0x00c003be 0x00000001 1546605849


0x18 9 0x00 0x4252b 0x0008 0x0893.ea9f91fc 0x00c003be 0x00000001 1546605715
0x19 9 0x00 0x42518 0x0006 0x0893.ea9f9116 0x00c003bc 0x00000001 1546605248

0x21 9 0x00 0x42519 0x0004 0x0893.ea9f91bf 0x00c003be 0x00000001 1546605635

2.1.1.8 dump2_undoblock_row2

==================== dump2_undoblock_row2 ====================

BH (0xe1f923d8) file#: 3 rdba: 0x00c02e40 (3/11840) class: 24

Undo BLK:
xid: 0x0004.002.0004251a seq: 0xc294 cnt: 0x1 irb: 0x1 <-ITL Uba Linked List (2)->

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset

---------------------------------------------------------------------------
0x01 0x0804

*-----------------------------
* Rec #0x1 slt: 0x02 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Last buffer split: No
rdba: 0x00c003bf <-TRX Linked List(2)->
*-----------------------------
KDO undo record:
op: L itl: xid: 0x0003.01a.0004ee11 uba: 0x00c00204.cf42.05
flg: C--- lkc: 0 scn: 0x0893.ea9f91e4
KDO Op code: URP row dependencies Disabled

itli: 1 ispac: 0 maxfr: 4858

ncol: 4 nnew: 3 size: 6006


col 1: [ 2] c1 03 -- column 1 is number 2
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 -- column 2 is 3000 a
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 -- column 3 is 3000 x
...

2.1.1.9 dump2_undoblock_row1

==================== dump2_undoblock_row1 ====================

BH (0xbefcc858) file#: 3 rdba: 0x00c003bf (3/959) class: 24 <-SLOT Linked List(2)->

Undo BLK:
xid: 0x0004.002.0004251a seq: 0xc293 cnt: 0x1 irb: 0x1

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x01 0x07d0

*-----------------------------
* Rec #0x1 slt: 0x02 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Begin trans Last buffer split: No
rdba: 0x00000000Ext idx: 0 <-TRX Linked List(3)->
*-----------------------------
uba: 0x00c003be.c293.14 ctl max scn: 0x0893.ea9f909b prv tx scn: 0x0893.ea9f909e <-SLOT Linked List(3)->
txn start scn: scn: 0x0893.ea9f91fc logon user: 49
prev brb: 12583868 prev bcl: 0
KDO undo record:

op: L itl: xid:


0x0003.01a.0004ee11 uba: 0x00c00204.cf42.04
flg: C--- lkc: 0 scn: 0x0893.ea9f91e4
KDO Op code: URP row dependencies Disabled

itli: 1 ispac: 0 maxfr: 4858 tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0

ncol: 4 nnew: 3 size: 6006


col 1: [ 2] c1 02 -- column 1 is number 1
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 -- column 2 is 3000 a
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 -- column 3 is 3000 x
...

2.1.2 Undo Linked Lists

As described in Book Oracle Core, Oracle builds 3 undo linked lists:

(1). Data Block ITL Uba Linked List: one linked list per modified data block in each transaction.

(2). Undo TRN TBL TRX (Rec, rci) Linked List: one linked list per transaction.

(3). Undo TRN CTL SLOT Linked List: one linked list per undo segment.

each of which is applied in a different circumstance. Their anchoring units are the data block, the
transaction, and the undo segment respectively, each successively wider in scope and extent.

To facilitate tracking the different links and memorizing the data elements, we draw a diagram
illustrating those 3 linked lists in Figure 2.1, which will be referred to in the later discussion.

This section contains many cross references to the above dumps. Printing them all out and keeping them
at hand will help with the later reading.

Figure 2.1: Undo Link Lists

2.1.2.1 Data Block ITL Uba Linked List

It is used to reconstruct (clone) a consistent read (CR) copy of a block during a fetch operation. There exists
one linked list for every modified data block in each transaction: if one transaction modifies
10 blocks, there will be 10 such lists for this same transaction.

The head of this linked list is the (most recent) ITL Uba in the data block. The Uba points to the undo block,
which contains the previously replaced ITL; the Uba of that ITL again recursively points to its previous
undo block, down to the last undo block of a previously committed transaction (if not committed, it is not allowed
to be re-used) whose commit SCN is smaller than the required SCN (e.g. the query start SCN).

Take a data block dump, for example dump2_datablock_row2, in which we have updated the row with id 2.
If another session selects this un-committed row, it has to use the undo to reconstruct a CR copy. The
first ITL in the dump belongs to an un-committed transaction (Flag ----, Lck 1); it is marked
by <-ITL Uba Linked List(1)-> for the sake of visibility.

==================== dump2_datablock_row2 ====================

Itl Xid Uba Flag Lck Scn/Fsc


0x01 0x0004.002.0004251a 0x00c02e40.c294.01 ---- 1 fsc 0x1776.00000000 <-ITL Uba Linked List(1)->
0x02 0x0004.018.0004252b 0x00c003be.c293.0a C--- 0 scn 0x0893.ea9f91fc

Its Uba points to its undo block 0x00c02e40 (rdba 3/11840), Rec #0x1, seq: 0xc294.

If we look at rdba: 0x00c02e40 (3/11840) in dump2_undoblock_row2, under the line Undo BLK, we can see the
same Xid and Uba marked as <-ITL Uba Linked List(2)->. At the bottom of the dump, it shows the before
image (undo data), for example ncol: 4 nnew: 3 size: 6006, in which 6006 is the row size increase if
this undo record is applied, and 3 is the number of changed columns. If we apply this undo record, we
restore the original content.

==================== dump2_undoblock_row2 ====================

BH (0xe1f923d8) file#: 3 rdba: 0x00c02e40 (3/11840) class: 24

Undo BLK:
xid: 0x0004.002.0004251a seq: 0xc294 cnt: 0x1 irb: 0x1 <-ITL Uba Linked List (2)->

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x01 0x0804

*-----------------------------
* Rec #0x1 slt: 0x02 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Last buffer split: No
rdba: 0x00c003bf <-TRX Linked List(2)->
*-----------------------------
KDO undo record:
op: L itl: xid: 0x0003.01a.0004ee11 uba: 0x00c00204.cf42.05
flg: C--- lkc: 0 scn: 0x0893.ea9f91e4
KDO Op code: URP row dependencies Disabled

itli: 1 ispac: 0 maxfr: 4858

ncol: 4 nnew: 3 size: 6006


col 1: [ 2] c1 03 -- column 1 is number 2
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 -- column 2 is 3000 a
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 -- column 3 is 3000 x
...

Since ITL 0x01 is reused, the previous ITL entry (Xid, Uba, Scn/Fsc) has to be saved too. In the above
dump, that is:

op: L itl: xid: 0x0003.01a.0004ee11 uba: 0x00c00204.cf42.05


flg: C--- lkc: 0 scn: 0x0893.ea9f91e4

In fact, we can find the above ITL in dump1_datablock_row2, where it is marked as <-ITL Uba Linked
List(3)->:

==================== dump1_datablock_row2 ====================


BH (0xcaf856d8) file#: 1548 rdba: 0x00282b85 (1024/2632581) class: 1

Itl Xid Uba Flag Lck Scn/Fsc


0x01 0x0003.01a.0004ee11 0x00c00204.cf42.05 C--- 0 scn 0x0893.ea9f91e4 <-ITL Uba Linked List (3)->
0x02 0x0004.018.0004252b 0x00c003be.c293.0a --U- 1 fsc 0x0000.ea9f91fc --partially filled, not yet with full scn

It is the data block state of the last committed transaction before this transaction. Oracle can determine
this by simply checking whether the two XIDs differ. In this case, they are 0x0004.002.0004251a and
0x0003.01a.0004ee11, which are not equal. The second one is a committed transaction: its ITL Flag is
noted with C---, and Lck is marked as 0. Since it is the ITL of a committed transaction, it is the end
of the Data Block ITL Uba Linked List for transaction XID 0x0004.002.0004251a (marked by <-ITL Uba
Linked List(1)->).

If the above commit SCN 0x0893.ea9f91e4 (marked by <-ITL Uba Linked List(3)->) is smaller than
our query start SCN, the CR copy is reconstructed. Otherwise, the above restoring process is repeated. This
scenario happens, for example, when the selected row has been updated and committed several times.

In summary, the above ITL Uba Linked List is made of the following 2 nodes:

ITL Uba Linked List(1) in data block (dump2_datablock_row2)


---> ITL Uba Linked List(2) in undo block (dump2_undoblock_row2)

The first one is the list head, and the second is the tail. The list terminates because the second one points to
<-ITL Uba Linked List(3)->, which is a committed ITL. The second one contains all the undo info needed to
reconstruct CR blocks.

An ITL in a data block can be considered a special type of row created by Oracle. It is subject to handling
and contention similar to normal table rows, for example "enq: TX - allocate ITL entry", and
eventually ITL deadlock, which can be monitored per segment as sketched below.
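
As a hedged illustration, per-segment ITL contention can be checked in v$segment_statistics (statistic name 'ITL waits'):

select owner, object_name, object_type, value itl_waits
from v$segment_statistics
where statistic_name = 'ITL waits' and value > 0
order by value desc;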

By the way, for index root and branch blocks there exists only one single ITL, named "service ITL",
reserved for recursive index operations like splitting and coalescing, so that no two sessions can perform
them at the same time. For a leaf block, it is the first ITL that is reserved as "service ITL". However, in Oracle
12c, INITRANS is still documented as "The default value for an index is 2", which is not the case for
index root and branch blocks.

2.1.2.2 Undo TRN TBL TRX (Rec, rci) Linked List

It is used for transaction rollback. One linked list per transaction.

Each undo segment has one special header block, which consists of two main sections: TRN CTL (Transaction
Control) and TRN TBL (Transaction Table). TRN TBL contains a list of slots; each transaction
is allocated one slot when it starts. In fact, a transaction is also identified by this slot, i.e., each transaction is
assigned to one slot, which serves as the head of this linked list.

Looking at our transaction in Dump2, we first updated the row with id 1, then id 2, so Oracle wrote two undo
records. In dump2_undoheader (or v$transaction), we can see the dba of the last (most recent) active (state
10) TRX slot 0x02 with column dba 0x00c02e40 marked as <-TRX Linked List(1)->:

==================== dump2_undoheader ====================

BH (0xc1f8f1d8) file#: 3 rdba: 0x00c003b8 (3/952) class: 23

TRN CTL:: seq: 0xc294 chd: 0x001a ctl: 0x0017 inc: 0x00000000 nfb: 0x0002
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x00c003bf.c293.01 scn: 0x0893.ea9f909e <-SLOT Linked List(1)->

TRN TBL::

index state cflags wrap# uel scn dba nub cmt


---------------------------------------------------------------------------
0x00 9 0x00 0x4251c 0x0009 0x0893.ea9f90c1 0x00000000 0x00000000 1546605175
0x01 9 0x00 0x4252c 0x000f 0x0893.ea9f911f 0x00000000 0x00000000 1546605248
0x02 10 0x80 0x4251a 0x0000 0x0893.ea9f91fc 0x00c02e40 0x00000002 0 <-TRX Linked List(1)->
0x03 9 0x00 0x4252b 0x0010 0x0893.ea9f91e0 0x00c003be 0x00000001 1546605656

It points to the most recent undo record, that is, the undo record for the update of the row with id 2.

Pick dba 0x00c02e40 of slot 0x02 and dump it into dump2_undoblock_row2:

==================== dump2_undoblock_row2 ====================

BH (0xe1f923d8) file#: 3 rdba: 0x00c02e40 (3/11840) class: 24

Undo BLK:
xid: 0x0004.002.0004251a seq: 0xc294 cnt: 0x1 irb: 0x1 <-ITL Uba Linked List (2)->

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x01 0x0804

*-----------------------------
* Rec #0x1 slt: 0x02 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Last buffer split: No
rdba: 0x00c003bf <-TRX Linked List(2)->
*-----------------------------
KDO undo record:
op: L itl: xid: 0x0003.01a.0004ee11 uba: 0x00c00204.cf42.05
flg: C--- lkc: 0 scn: 0x0893.ea9f91e4
KDO Op code: URP row dependencies Disabled

itli: 1 ispac: 0 maxfr: 4858

ncol: 4 nnew: 3 size: 6006


col 1: [ 2] c1 03 -- column 1 is number 2
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 -- column 2 is 3000 a
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 -- column 3 is 3000 x
... <-TRX Linked List(2)->

The rdba: 0x00c003bf in the line marked <-TRX Linked List(2)-> indicates the previous undo record in
another undo data block (FILE_ID 3, BLOCK_NO 959); that is exactly the undo block generated by the update
of the row with id 1.

We have dumped this undo block in dump2_undoblock_row1.

==================== dump2_undoblock_row1 ====================

BH (0xbefcc858) file#: 3 rdba: 0x00c003bf (3/959) class: 24 <-SLOT Linked List(2)->

Undo BLK:
xid: 0x0004.002.0004251a seq: 0xc293 cnt: 0x1 irb: 0x1

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------

0x01 0x07d0

*-----------------------------
* Rec #0x1 slt: 0x02 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Begin trans Last buffer split: No
rdba: 0x00000000Ext idx: 0 <-TRX Linked List(3)->
*-----------------------------
uba: 0x00c003be.c293.14 ctl max scn: 0x0893.ea9f909b prv tx scn: 0x0893.ea9f909e <-SLOT Linked List(3)->
txn start scn: scn: 0x0893.ea9f91fc logon user: 49
prev brb: 12583868 prev bcl: 0
KDO undo record:

op: L itl: xid:


0x0003.01a.0004ee11 uba: 0x00c00204.cf42.04
flg: C--- lkc: 0 scn: 0x0893.ea9f91e4
KDO Op code: URP row dependencies Disabled

itli: 1 ispac: 0 maxfr: 4858 tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0

ncol: 4 nnew: 3 size: 6006


col 1: [ 2] c1 02 -- column 1 is number 1
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 -- column 2 is 3000 a
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 -- column 3 is 3000 x
...

The rdba: 0x00000000 in the line marked <-TRX Linked List(3)-> and rci 0x00 3 lines above signify
the end of the linked list. (If rdba is 0x0 but rci is not 0x0, the rci points to the previous undo record in
the same undo block; e.g. dump1_undoblock_row2 contains rci 0x09 with rdba: 0x00000000.)

So we have a Trx Linked List as follows:

Trx Linked List(1) in undo header block (dump2_undoheader - TRN TBL::0x02)


---> Trx Linked List(2) in undo block (dump2_undoblock_row2)
---> Trx Linked List(3) in undo block (dump2_undoblock_row1)

To roll back the transaction, Oracle has to traverse this list starting from the active TRN TBL slot down to its
first undo record, so that the entire un-committed transaction is undone.

In fact, the rollback is an automatic update process that applies the transaction's undo records from the most
recent to the oldest one, and finally commits, so that all updates become persistent. In contrast to the TRX Linked
List, the ITL Uba Linked List for read consistency is used only to construct a temporary clone, which can be
thrown away once used.

Since rollback updates segment data, it can be performed either by the owner session of the current transaction
(rollback), or by the system super user (recovery), and it generates undo and redo like normal transactions;
its progress can be watched as sketched below. The other two Linked Lists, however, are constructed later by
different sessions. In section 2.1.4.7, we will see how an instance crashes when the super user has trouble
using this linked list in parallel recovery.
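
For example, a rough way to watch a rollback progressing is to poll v$transaction (a minimal sketch; used_ublk and used_urec should decrease while the TRX Linked List is traversed):

select xidusn, xidslot, xidsqn, status, used_ublk, used_urec
from v$transaction;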

2.1.2.3 Undo TRN CTL SLOT Linked List

It can be used to find an upper bound commit SCN. There exists one such linked list per undo segment.
Theoretically, it could be of infinite length, but in practice it is limited by the undo tablespace or by undo retention.

When Oracle DBWR writes modified data blocks to disk, not all of them have the commit scn in the data
block ITL, because commit is issued after the modifications and the Buffer Cache has a limited capacity. To
optimize performance, during commit Oracle only performs the minimal necessary work to fill scn and
cmt (commit time) of its TRN TBL slot (constant O(1) complexity whatever the transaction size). The
later cleanout will be performed by the consumer when it accesses the modified rows. If the modified
data block has already been written to disk, the cleanout is postponed, hence delayed block cleanout. If it is
still kept in the Buffer Cache (limited to 10% of the Buffer Cache according to Book Oracle Core [15, p. 46]),
it will experience commit cleanout and delayed logging block cleanout (to be discussed in later section
2.1.3). Therefore, deferred cleanout is a postponed process executed by a fetch operation from another
later session to clean up the dirty fields.

Pick one data block: if one ITL entry is not stamped with a scn, we can use the first two fields of its XID
to find the undo segment number (undo segment header block) and the slot number in TRN TBL (see the
decode sketch after this paragraph). As we have seen in the TRN TBL of the undo header (e.g. dump1_undoheader),
for each committed transaction (state 9 in TRN TBL), its slot records both commit scn and cmt. If wrap# in the
TRN TBL slot matches the third field of the data block ITL XID, we can simply pick the scn from that TRN TBL
slot and fill the missing scn in that data block ITL. If wrap# in the TRN TBL slot is bigger than the third field
of the data block ITL XID, it means that this slot in TRN TBL has been reused (overwritten) by later new
transactions. In this case, we have to restore the content of the transaction table in the undo segment header
block. That is the task of our third undo linked list: the Undo TRN CTL SLOT Linked List, discussed in this section.
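
For example, the three fields of the ITL Xid 0x0004.018.0004252b from Dump1 decode to the usn, slot and wrap# below (matching the 4.24.271659 returned earlier by dbms_transaction.local_transaction_id); v$rollname then maps the usn to the undo segment name. A minimal sketch:

select to_number('0004', 'XXXX') usn
      ,to_number('018', 'XXX') slot
      ,to_number('0004252b', 'XXXXXXXX') wrap#
from dual;

USN SLOT   WRAP#
  4   24  271659

select name from v$rollname where usn = 4;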

Each node of this slot linked list is the first undo record of a previous transaction which has already used
(inactive) or is currently using (active) this undo segment (remember that the XID contains the TRN
TBL slot and wrap#). As we know, when each transaction starts, it is allocated a slot (just like one
table row) from TRN TBL in one selected undo segment header block. The undo segment number, slot
(index) number, and an increasing wrap# (like a sequence) are used as a unique name to identify this
transaction. Due to the limited number of slots (34, from 0x00 to 0x21) in each undo segment header block,
old slots are overwritten. In Oracle, any block modification has to be saved in an undo record, hence the
old slot content is also saved in an undo record of this transaction. Because this occurs at the very beginning of
the transaction, it is recorded in the first undo record of the transaction. Since each overwritten TRN TBL
slot contains the commit scn of the transaction it represented, Oracle can restore it to find that commit
scn.

When a new transaction starts, the address of the first undo record of the new transaction is also saved in
the uba field under the TRN CTL section of the undo segment header block. Since this first undo record contains
the previously overwritten slot info, it becomes the head of this Linked List. There exists only one such
linked list per undo segment. If Oracle wants to find the required commit SCN, it can start from this
head and follow the linked list, traversing the still available undo records, or throw ORA-01555 (no
more available undo records).
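
As a side note, ORA-01555 occurrences can be monitored in v$undostat; ssolderrcnt counts "snapshot too old" errors per interval (a minimal sketch):

select begin_time, end_time, maxquerylen, ssolderrcnt, nospaceerrcnt
from v$undostat
order by begin_time desc;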

In our example of Dump1 and Dump2, we started one transaction for each dump; therefore we can
construct a linked list whose first two undo records stem from those 2 transactions.

Look at dump2_undoheader:

==================== dump2_undoheader ====================

BH (0xc1f8f1d8) file#: 3 rdba: 0x00c003b8 (3/952) class: 23

TRN CTL:: seq: 0xc294 chd: 0x001a ctl: 0x0017 inc: 0x00000000 nfb: 0x0002
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x00c003bf.c293.01 scn: 0x0893.ea9f909e <-SLOT Linked List(1)->

TRN TBL::

index state cflags wrap# uel scn dba nub cmt


---------------------------------------------------------------------------
0x00 9 0x00 0x4251c 0x0009 0x0893.ea9f90c1 0x00000000 0x00000000 1546605175
0x01 9 0x00 0x4252c 0x000f 0x0893.ea9f911f 0x00000000 0x00000000 1546605248

0x02 10 0x80 0x4251a 0x0000 0x0893.ea9f91fc 0x00c02e40 0x00000002 0 <-TRX Linked List(1)->
0x03 9 0x00 0x4252b 0x0010 0x0893.ea9f91e0 0x00c003be 0x00000001 1546605656

0x17 9 0x00 0x4251d 0xffff 0x0893.ea9f9235 0x00c003be 0x00000001 1546605849


0x18 9 0x00 0x4252b 0x0008 0x0893.ea9f91fc 0x00c003be 0x00000001 1546605715
0x19 9 0x00 0x42518 0x0006 0x0893.ea9f9116 0x00c003bc 0x00000001 1546605248

0x21 9 0x00 0x42519 0x0004 0x0893.ea9f91bf 0x00c003be 0x00000001 1546605635

The line marked as <-SLOT Linked List(1)-> in section TRN CTL shows:

uba: 0x00c003bf.c293.01 scn: 0x0893.ea9f909e <-SLOT Linked List(1)->

That is the head of the SLOT Linked List, from which we can start to restore the previous content of the
Transaction Table (TRN TBL).

The above uba: 0x00c003bf.c293.01 points to the first undo record of the most recent TRN TBL slot in
this undo segment. In the above example, it is slot 2 (index 0x02 with state 10).

By the way, scn: 0x0893.ea9f909e in that line is the commit SCN of the previous slot, which has
been overwritten by the most recent transaction for reuse. Therefore it is the earliest SCN that the current
transaction table knows about, before which (inclusive) all transactions using this undo segment (registered
in a slot of this Transaction Table) have been committed. It can be used as a quick way to get a
conservative commit SCN for all committed slots in this Transaction Table.

Dump this undo block into dump2_undoblock_row1:

==================== dump2_undoblock_row1 ====================

BH (0xbefcc858) file#: 3 rdba: 0x00c003bf (3/959) class: 24 <-SLOT Linked List(2)->

Undo BLK:
xid: 0x0004.002.0004251a seq: 0xc293 cnt: 0x1 irb: 0x1

Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset
---------------------------------------------------------------------------
0x01 0x07d0

*-----------------------------
* Rec #0x1 slt: 0x02 objn: 2392300(0x002480ec)
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Begin trans Last buffer split: No
rdba: 0x00000000Ext idx: 0 <-TRX Linked List(3)->
*-----------------------------
uba: 0x00c003be.c293.14 ctl max scn: 0x0893.ea9f909b prv tx scn: 0x0893.ea9f909e <-SLOT Linked List(3)->
txn start scn: scn: 0x0893.ea9f91fc logon user: 49
prev brb: 12583868 prev bcl: 0
KDO undo record:

op: L itl: xid:


0x0003.01a.0004ee11 uba: 0x00c00204.cf42.04
flg: C--- lkc: 0 scn: 0x0893.ea9f91e4
KDO Op code: URP row dependencies Disabled

itli: 1 ispac: 0 maxfr: 4858 tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0

ncol: 4 nnew: 3 size: 6006


col 1: [ 2] c1 02
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
...

In the above dump, rdba: 0x00c003bf (class: 24) in the line marked as <-SLOT Linked List(2)-> is the
same as the TRN CTL uba, and the line col 1: [ 2] c1 02 (Oracle raw number of 1) is the undo record
for the update of the row with id 1. Since this is the first update, it becomes the first undo record of the
transaction.

The above undo record Rec #0x1 in dump2_undoblock_row1 is pointed to by TRN CTL uba: 0x00c003bf.c293.01.
Since it is an undo record, its content should be some previous information stored in the same location.
In fact, as the line marked <-SLOT Linked List(3)-> shows, it saved 3 fragments of previous Transaction
Table info about TRN CTL and TRN TBL, which can be used to reconstruct their previous content.

Here are short descriptions of them:

uba: 0x00c003be.c293.14 ctl max scn: 0x0893.ea9f909b


-- previous TRN CTL --

prv tx scn: 0x0893.ea9f909e prev brb: 12583868 (dba: 0x00c003bc)


-- previous TRN TBL slot 2 --

txn start scn: scn: 0x0893.ea9f91fc logon user: 49


-- current Transaction Start SCN --

The first fragment shows that the previous TRN CTL is saved in undo record uba: 0x00c003be.c293.14.
Therefore we can pick 00c003be (FILE_ID: 3, BLOCK_NO: 958) and dump it by:

alter session set tracefile_identifier = 'dump3_undoblock_trnctl';


alter system dump datafile 3 block 958;

Here is the new dump:

==================== dump3_undoblock_trnctl ====================

Start dump data blocks tsn: 2 file#:3 minblk 958 maxblk 958

UNDO BLK:
xid: 0x0004.017.0004251d seq: 0xc293 cnt: 0x19 irb: 0x19 icl: 0x0 flg: 0x0000

*-----------------------------
* Rec #0x14 slt: 0x17 objn: 5819(0x000016bb) objd: 5819
* Layer: 11 (Row) opc: 1 rci 0x00
Undo type: Regular undo Begin trans Last buffer split: No
rdba: 0x00000000Ext idx: 0
*-----------------------------
uba: 0x00c003be.c293.0e ctl max scn: 0x0893.ea9f9097 prv tx scn: 0x0893.ea9f909b <- SLOT Linked List(4) ->
txn start scn: scn: 0x0893.ea9f9234 logon user: 0
prev brb: 12583867 prev bcl: 0
KDO undo record:
KTB Redo
op: L itl: xid: 0x0002.016.0004f6af uba: 0x00c017c3.d04c.22
flg: C--- lkc: 0 scn: 0x0893.ea9f88db
KDO Op code: URP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x00452585 hdba: 0x00402d98
itli: 4 ispac: 0 maxfr: 4863
tabn: 0 slot: 3(0x3) flag: 0x2c lock: 0 ckix: 191
ncol: 58 nnew: 7 size: 11
col 42: [ 2] c1 04
col 43: [13] 78 77 01 04 0d 2d 07 33 1a ac 10 15 3c

We can see that this Rec #0x14 is another first undo record of a transaction, since it contains one special
line marked as <-SLOT Linked List(4)->:

uba: 0x00c003be.c293.0e ctl max scn: 0x0893.ea9f9097 prv tx scn: 0x0893.ea9f909b <-SLOT Linked List(4)->

This special line content signifies that it is the first undo record of a transaction, as repeatedly
explained in Book Oracle Core [15].

Up to now, we have the following SLOT Linked List:

SLOT Linked List(1) in undo header block (dump2_undoheader - TRN CTL uba)
---> SLOT Linked List(2) in undo block (dump2_undoblock_row1)
---> SLOT Linked List(4) in undo block (dump3_undoblock_trnctl)

If we pick uba: 0x00c003be.c293.0e (marked as <-SLOT Linked List(4)->) and continue the undo block
dump, there will be an undo record Rec #0x0e in undo block file#: 3 minblk 958, which contains the
first undo record of one more previous transaction. In this way, we can walk through a longer SLOT Linked
List until we find a satisfying SCN or the undo is exhausted (ORA-01555).

As we have seen, the whole above process tries to restore the Transaction Table to the state at which the
transaction started (get a slot by overwriting an old slot, and create the first undo record). In our example,
the Dump2 transaction used slot 0x02 (state 10) as shown below, so the previous content of this slot had to
be saved in the first undo record of Dump2.

==================== dump2_undoheader ====================

BH (0xc1f8f1d8) file#: 3 rdba: 0x00c003b8 (3/952) class: 23

TRN CTL:: seq: 0xc294 chd: 0x001a ctl: 0x0017 inc: 0x00000000 nfb: 0x0002
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x00c003bf.c293.01 scn: 0x0893.ea9f909e <-SLOT Linked List(1)->

TRN TBL::

index state cflags wrap# uel scn dba nub cmt


---------------------------------------------------------------------------
0x02 10 0x80 0x4251a 0x0000 0x0893.ea9f91fc 0x00c02e40 0x00000002 0 <-TRX Linked List(1)->

Once we have restored the content of slot 0x02 of the previously committed transaction, we get its commit scn
0x0893.ea9f909e and commit time cmt 1546605119, as shown in the following dump1_undoheader.

==================== dump1_undoheader ====================

BH (0xd4f66558) file#: 3 rdba: 0x00c003b8 (3/952) class: 23

TRN CTL:: seq: 0xc293 chd: 0x0008 ctl: 0x0018 inc: 0x00000000 nfb: 0x0002
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x00c003be.c293.09 scn: 0x0893.ea9f9091

TRN TBL::

index state cflags wrap# uel scn dba nub cmt


----------------------------------------------------------------------------
0x02 9 0x00 0x42519 0x001a 0x0893.ea9f909e 0x00c003bc 0x00000002 1546605119

This commit SCN is used to stamp all Delayed Block Cleanouts on ITLs of the transaction identified
by XID 0x0004.002.00042519.

Comparing slot 0x02 (state 9) in dump1_undoheader with slot 0x02 (state 10) in dump2_undoheader,
the wrap# changed from 0x42519 to 0x4251a; the increase of 1 (0x4251a - 0x42519) indicates that one came
directly after the other in slot 2 of this undo segment.

If we look at the two undo header dumps above, the BHs are different: BH (0xc1f8f1d8) and BH (0xd4f66558)
for the same rdba: 0x00c003b8 (3/952). So they are different versions of the undo header block in the
Buffer Cache.

This linked list is expensive to build up because each undo segment header block is the central point for all
transactions (maximum 34 active) whose TRN TBL slots are allocated or were previously allocated in
that block. If all queries went through such long processing, it could incur certain contention on the undo segment
header block, which can be observed as sketched below. It is interesting to see how Oracle constructs such a linked
list. It is also possible that Oracle only constructs the first part of it to determine whether the transaction is
committed, and if yes, uses it for Delayed Block Cleanout (to be discussed in the next section 2.1.3).
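
For instance, contention on undo header and undo data blocks shows up in v$waitstat (a minimal sketch):

select class, count, time
from v$waitstat
where class in ('undo header', 'undo block');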

2.1.3 Cleanout

After a transaction commit, its modified data blocks are left with a few dirty fields because of Oracle's fast
commit strategy. There are 3 types of cleanouts:

(1). commit cleanout

(2). delayed block cleanout

(3). delayed logging block cleanout

all described in Blog: Clean it up [14].

At the moment of a transaction commit, a modified data block is either still in the Buffer Cache, or already
written to disk. If still in the Buffer Cache, it gets a commit cleanout, which updates the ITL Flag and stamps
the commit scn. If no longer in the Buffer Cache, it will experience a delayed block cleanout, which also updates
the ITL Flag and stamps the commit scn, and additionally sets the ITL Lck and the lb of all rows to 0.

A block that has received a commit cleanout will later experience a delayed logging block cleanout
to set the ITL Lck and the lb of all rows to 0 (there is a parameter delayed_logging_block_cleanouts in
v$obsolete_parameter, documented in Oracle8i Reference Release 8.1.5 [23]). So commit cleanout plus
delayed logging block cleanout is equal to delayed block cleanout. One difference of commit cleanout
from the other two is that it is performed by the transaction owner session itself at commit time, while the
other two are done by other sessions on some deferred occasion.
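
As a hedged illustration, the frequency of the different cleanout variants can be followed in v$sysstat:

select name, value
from v$sysstat
where name like '%cleanout%'
order by name;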

For example, in Dump1, we execute an update and commit; ITL 0x2 is used:

update test_tab set seq = 2 where id = 2;


commit;

In Dump2, we execute only the update, with no commit; ITL 0x1 is used:

update test_tab set seq = 4, n1 = null, n2 = null where id = 2;

If we look at the 3 fields of ITL entry 0x02 (Flag, Lck, Scn/Fsc) and the row lb, they changed from
Dump1 dump1_datablock_row2 to Dump2 dump2_datablock_row2 (copied again below): we can observe
this delayed logging block cleanout. Flag changed from --U- to C---, and Scn/Fsc changed from
0x0000.ea9f91fc to 0x0893.ea9f91fc, hence fully filled.

==================== dump1_datablock_row2 ====================

Itl Xid Uba Flag Lck Scn/Fsc


0x01 0x0003.01a.0004ee11 0x00c00204.cf42.05 C--- 0 scn 0x0893.ea9f91e4
0x02 0x0004.018.0004252b 0x00c003be.c293.0a --U- 1 fsc 0x0000.ea9f91fc

tab 0, row 0, @0x819


tl: 6015 fb: --H-FL-- lb: 0x2 cc: 4
col 0: [ 2] c1 03
col 1: [ 2] c1 03
col 2: [3000]
61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61
...
col 3: [3000]
78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78
...

==================== dump2_datablock_row2 ====================

Itl Xid Uba Flag Lck Scn/Fsc


0x01 0x0004.002.0004251a 0x00c02e40.c294.01 ---- 1 fsc 0x1776.00000000
0x02 0x0004.018.0004252b 0x00c003be.c293.0a C--- 0 scn 0x0893.ea9f91fc

tab 0, row 0, @0x810


tl: 9 fb: --H-FL-- lb: 0x1 cc: 2
col 0: [ 2] c1 03
col 1: [ 2] c1 05

In fact, this cleanout is triggered by a fetch operation in the update statement of Dump2. Even though
Dump2 is using ITL 0x01, ITL 0x02 used in Dump1 has been cleaned out. tl (total length) shrank
from 6015 to 9 (6015-9=6006 less), and col 1 (table column seq) was modified from c1 03 (number 2) to c1
05 (number 4).

Blogs [16, 32] have detailed documentation about the 4 bits of the ITL Flag:

unsigned short flags; // KTBFTAC, KTBFUPB, KTBFIBI, KTBFCOM

#define KTBFTAC 0x1000 /* this xac is active as of ktbbhcsc */


#define KTBFUPB 0x2000 /* commit time is upper bound */
#define KTBFIBI 0x4000 /* rollback of this uba gives a BI of the itl */
#define KTBFCOM 0x8000 /* transaction is committed */

In the ITL Flag, the 4 bits are positioned as:

KTBFCOM KTBFIBI KTBFUPB KTBFTAC
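
Combining these bits reproduces the Flag strings seen in our dumps; a minimal sketch of the mapping (our own reading, not official documentation):

----  =  0x0000                       (no bit set: active, uncommitted)
--U-  =  KTBFUPB            = 0x2000  (commit time is upper bound)
C---  =  KTBFCOM            = 0x8000  (committed, exact scn)
C-U-  =  KTBFCOM | KTBFUPB  = 0xa000  (committed, upper bound scn)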

We have observed the following 5 patterns of ITL changes between two query operations for the same
ITL slot.

-----------------Before-------------- => ------------After-------------

Flag Lck Scn/Fsc => Flag Lck Scn/Fsc


(1). ---- 1 fsc 0x0000.00000000 => --U- 1 fsc 0x0000.67c128ff
(2). ---- 1 fsc 0x0000.00000000 => C--- 0 scn 0x0894.67b38ac4
(3). ---- 1 fsc 0x0000.00000000 => C-U- 0 scn 0x0894.689fdb6c
(4). --U- 2 fsc 0x0000.67b2d3ed => C--- 0 scn 0x0894.67b2d3ed
(5). C-U- 0 scn 0x0894.67bbee01 => C-U- 0 scn 0x0894.67bbd3eb

Cases (1) to (4) are cleanouts. Here is our understanding:

(1). from "----" to "--U-": commit cleanout
(2). from "----" to "C---": when exact slot found in undo segment header TRN TBL, get exact
commit scn.
(3). from "----" to "C-U-": when no more slot existed (expired) in undo segment header TRN
TBL, only get a upper bound scn.
(4). from "--U-" to "C---": for block of commit cleanout, when exact slot found in undo segment
header TRN TBL, get exact commit scn (delayed logging block cleanout).

In Cases (2) to (4), Lck is set to 0, and the lb flag of the ITL-locked rows is also changed to 0x0. However, in
Case (5), the ITL change is not a block cleanout because Lck is already 0.

So we have the following 3 cases of cleanouts:

(a). commit cleanout : Case (1)


(b). delayed block cleanout : Case (2) and (3)
(c). delayed logging block cleanout : Case (4)

We will give more discussion of Cases (3) and (5), and finally attempt to depict the algorithm in pseudo
code.

2.1.3.1 ITL Change Case (3)

We can demonstrate Case (3) with the code below.

----====================== Step 1: prepare test ======================----

-- speed up undo extents EXPIRED


alter system set undo_retention=120;

drop table big_table;


-- create big_table, each block has 2 rows
create table big_table as select level id, systimestamp ts, rpad('A', 3000, '1') txt
from dual connect by level <= 1e6;

drop table small_table;


create table small_table as select level id from dual connect by level <= 1;

set numformat 999,999,999,999,999

-- get first block info for block dump


select id, ts, ora_rowscn row_scn, scn_to_timestamp(ora_rowscn) row_scn_ts
,dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'BIG_TABLE') afn
,dbms_rowid.rowid_block_number(t.rowid) block
from big_table t where id=1;

ID TS ROW_SCN ROW_SCN_TS AFN BLOCK


-- ------------------------ ----------------- ----------------- ---- ------
1 18.03.19 09:39:30.592618 9,433,503,419,740 18.03.19 09:39:28 1548 519427

----================= Step 2: update first row in each block =================----

update big_table set txt = rpad('A', 3000, '2') where mod(id, 2) = 1;

-- find its ACTIVE undo segment_name


select * from dba_undo_extents where status = 'ACTIVE' order by status;
_SYSSMU7_1140884659$

commit;

----================= Step 3: dump one block =================----

alter session set tracefile_identifier = 'big_table_row1_dump1';


alter system dump datafile 1548 block 519427;

--Block already written to disk before commit, Lck not cleared, ITL looks like:

-- Itl Xid Uba Flag Lck Scn/Fsc


-- 0x02 0x0007.005.00045d7f 0x00c023e5.c69c.4a ---- 1 fsc 0x0000.00000000

----================= Step 4: make undo segment header TRN CTL reused =================----

-- UNDO extent are UNEXPIRED


select * from dba_undo_extents where segment_name = '_SYSSMU7_1140884659$' order by status;

-- Start many small Transactions to make undo segment header TRN CTL reused

begin
for i in 1 .. 50000 loop
update small_table set id = id + 1;
commit;
end loop;
end;
/

----=========== Step 5: Wait for its UNDO extent EXPIRED (not so necessary) ===========----

select * from dba_undo_extents where segment_name = '_SYSSMU7_1140884659$' order by status;

----================= Step 6: update second row in each block =================----

-- update 2nd row in each block, so as to set ITL Flag of 1st row as "C-U-"
update big_table set txt = rpad('A', 3000, '2') where mod(id, 2) = 0;
commit;

select id, ts, ora_rowscn row_scn, scn_to_timestamp(ora_rowscn) row_scn_ts


,dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'BIG_TABLE') afn
,dbms_rowid.rowid_block_number(t.rowid) block
from big_table t where id=1;

ID TS ROW_SCN ROW_SCN_TS AFN BLOCK


-- ------------------------ ----------------- ----------------- ---- ------
1 18.03.19 09:39:30.592618 9,433,503,491,343 18.03.19 09:55:49 1548 519427

alter session set tracefile_identifier = 'big_table_row1_dump2';


alter system dump datafile 1548 block 519427;

--Block already written to disk before commit, Lck cleared, ITL looks like:

-- Itl Xid Uba Flag Lck Scn/Fsc


-- 0x02 0x0007.005.00045d7f 0x00c023e5.c69c.4a C-U- 0 scn 0x0894.689fdb6c

----================= Step 7: Look ITL 0x02 Change in Step3 and Step6 =================----

Itl Xid Uba Flag Lck Scn/Fsc


big_table_row1_dump1: 0x02 0x0007.005.00045d7f 0x00c023e5.c69c.4a ---- 1 fsc 0x0000.00000000
big_table_row1_dump2: 0x02 0x0007.005.00045d7f 0x00c023e5.c69c.4a C-U- 0 scn 0x0894.689fdb6c

We first create a thick big_table, where each data block contains 2 rows. We update the first row in each block,
commit the transaction, and dump the first data block. Since it is a big table, the first row has already been
written to disk when the commit is issued. We then start many small transactions updating a small table so that
the TRN CTL in all undo segment headers is re-used (Note 1: script 1 in section Redo Practice - Synchronous
Commit 2.2.3 can be used for the test without the Plsql commit time optimization. Note 2: Step 5 is not strictly
necessary). Then we update the second row in each block, commit the transaction, and dump the first data block
again. Opening our two dumps of the first data block and looking at the ITL for the first row: Flag changed from
---- to C-U-, Lck was cleaned, and the scn was filled. C-U- means that the transaction has already committed,
but we can only find an upper bound commit scn after searching all available undo records. Moreover, by cleaning
Lck, it signals that a later session should not try to find any earlier commit scn, because no rows in the block
remain locked that would require cleanout.

If, immediately before Step 2, we open another session and set its transaction mode to read only (or
isolation level serializable), then at the end of the test, when we select the first row from big_table again,
the result might be ORA-01555.

After the test, execute the small transactions of Step 4 again, then run the query of Step 6 as in the following
code block. We even observed ora_rowscn increasing each time the code block was run. In
the following example, ora_rowscn increased from 9,433,504,111,164 to 9,433,504,178,791.

select id, ts, ora_rowscn row_scn, scn_to_timestamp(ora_rowscn) row_scn_ts


,dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'BIG_TABLE') afn
,dbms_rowid.rowid_block_number(t.rowid) block
from big_table t where id=1;

ID TS ROW_SCN ROW_SCN_TS AFN BLOCK


-- ------------------------ ----------------- ----------------- ---- ------
1 18.03.19 09:39:30.592618 9,433,504,111,164 18.03.19 11:58:46 1548 519427

begin
for i in 1 .. 50000 loop
update small_table set id = id + 1;
commit;
end loop;
end;
/

select id, ts, ora_rowscn row_scn, scn_to_timestamp(ora_rowscn) row_scn_ts


,dbms_rowid.rowid_to_absolute_fno(t.rowid, 'K', 'BIG_TABLE') afn
,dbms_rowid.rowid_block_number(t.rowid) block
from big_table t where id=1;

ID TS ROW_SCN ROW_SCN_TS AFN BLOCK


-- ------------------------ ----------------- ----------------- ---- ------
1 18.03.19 09:39:30.592618 9,433,504,178,791 18.03.19 11:59:07 1548 519427

2.1.3.2 ITL Change Case (5)

In this case, the Flag is not changed, but the scn moved backward (67bbee01 > 67bbd3eb). This is observed when
a read only session uses ora_rowscn to get the commit scn (see Blog: How can ORA_ROWSCN change
between queries when no update? [5]). The scn was moved from 67bbee01 to the smaller value 67bbd3eb by the
read only session, because a read only session adjusts the scn according to its start scn (the "read only"
transaction start scn). After the read only session started, the test in the Blog re-used all Transaction Table
slots in the undo segment header within 5000 transactions. When the read only session ran the query again, the
ITL scn was shifted to a smaller value. In this case, Oracle probably uses the Data Block ITL Uba Linked List to
construct the CR copy since it has to read rows from the CR block.
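
A minimal sketch to reproduce this observation (assuming the big_table/small_table setup from the previous section):

-- session 1: establish a read only transaction
set transaction read only;

-- session 2: re-use all TRN TBL slots, e.g. with the small_table update loop from Step 4

-- session 1: repeated queries can now show ora_rowscn moving backward
select id, ora_rowscn from big_table where id = 1;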

2.1.3.3 ITL Change Pseudo Code

To summarize the long cleanout discussion above (cleanout triggered by a fetch operation), we can write a pseudo
algorithm to highlight the principal steps and components (commit cleanout is not included).

start query s at scn

fetch row r in data block d, r is locked by ITL i with XID(usn, slot, wrap#)
when (   i.Flag is '----'
     and i.scn is empty)
  or (   i.Flag is '--U-'
     and i.scn is not empty
     and s.scn < i.scn)
    with i.usn, get undo segment header block h
    in h.TRN TBL, use i.slot, find line l (index, wrap#, scn) with l.index = i.slot
    if i.wrap# = l.wrap#             <--"exact found"-->
      set i.Flag as 'C---'
    else                             <--"i.wrap# < l.wrap#"-->
      set i.Flag as 'C---' (or 'C-U-')
    set i.Lck as 0
    use l.scn to replace i.scn
    cleanup lb for all rows locked by i

(Note that i.wrap# > l.wrap# is not possible, because each transaction starts by first allocating one l,
and i is a previous or the current entry at slot l.)

The above pseudo code is not complete; it is only a first attempt at a draft. Remember that
our main job is to read the desired rows; cleanout is only a sideline activity. For example, if a row to be
read has been updated and committed several times, the code path will run into the check i.wrap#
= l.wrap#, perform cleanout, and a new CR data block has to be cloned, traversing the Data Block ITL Uba
Linked List, until the satisfying CR block according to the query start scn is found, or ORA-01555 is thrown.

Having gone through the above long undo exercise, we can now start to analyse and diagnose undo issues with
the following concrete examples.

2.1.4 Undo Complexity Examples

Undo is an activity that brings us back into history. In all studies of history, there are always ongoing new
discoveries and debatable claims; Oracle undo is probably no exception. It is too complex to
understand without concrete examples (and dumps). In this section, we look at a few undo examples.

2.1.4.1 Undo Documentation Reading

In the Section: Deeper into Buffer Cloning (Page 242-244) of Oracle Performance Firefighting (4th
Printing) ([33]), there is some text:

Figure 6-40 is the basis for this example. Suppose our query begins at SCN time 12330 ...

We begin with the first ITL that is associated with active transaction 7.3.8. Our server process needs to
retrieve any undo that has occurred after our query started at SCN time 12330.

Our server process must now access undo block 2,90 ... The SCN is 12320, which is before our query
started at time 12330. Therefore, we do not apply the undo. If we did apply the undo, our CR buffer
would represent a version of block 7,678 at time 12320, which is too early!

Probably, even though 12320 is before the query start scn 12330, we also have to apply this undo since it
belongs to the same open transaction (TRX# 7.3.8). In fact, we have to reverse all blocks of this
un-committed transaction, because each transaction is bound by the ACID atomicity constraint. (We have
already seen that undo is tracked by the ITL Uba Linked List. Moreover, it is not clear from where to
get the undo SCN.)

If the above description were true, the elapsed time per execution of the select statement in the later Undo
Duration discussion of section 2.1.4.5 would arrive at an equilibrium fixpoint.

On Page 242 [33], there is some more text:

If you recall, when a server process locates a desired buffer and discovers a required row has changed since
its query began, it must create a back-in-time image of the buffer.

By the same argument as above, even if a required row was changed before the query began, it should also be
reversed if not yet committed by the other session (if committed before the query began, it could experience cleanout).

2.1.4.2 Undo Bugs

Blog ORA-08177: can't serialize access for this transaction [46] reveals a bug in undo handling during
SQL parsing.

It seems that fixing such a bug requires a lot of design change in the code, for example distinguishing
user DML and sys DML on the underlying objects (e.g. undo$ or user$ tables).

If we are told that fixing such a bug is not feasible for the time being and that we should therefore be cautious
when using such features, it is quite understandable given the undo complexity.

As new releases evolve, undo becomes even more sophisticated. In 18c (12c2), it can happen that
one transaction is allocated more than one ACTIVE undo segment slot (v$transaction.status
'ACTIVE'), and the entire DB is blocked.

2.1.4.3 Undo Performance

As already discussed, Oracle commit is optimized to have O(1) complexity. However, rollback can take a
long time if the undo size is large (more than 10 GB). The query below can be used to monitor the undo size:

select f.file_name, count(distinct u.segment_name) undo_seg_cnt, round(sum(u.bytes)/1024/1024) undo_mb


from dba_undo_extents u, dba_data_files f
where u.file_id = f.file_id
and u.status = 'ACTIVE'
-- and not exists (select t.xidusn from v$transaction t where u.segment_name like '_SYSSMU'||t.xidusn||'_%$')
group by f.file_name;

If the rollback session is killed for some reason, e.g. time pressure, Oracle SMON will immediately start
parallel recovery, which takes even longer and consumes more resources (even worse, the instance can be crashed
by smon, demonstrated in later section 2.1.4.7). In such a case, v$fast_start_transactions can be used to
estimate the recovery duration.
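
For example, a minimal sketch over v$fast_start_transactions estimates the recovery progress from the done/total undo block counts:

select usn, state, undoblocksdone, undoblockstotal
      ,round(undoblocksdone / nullif(undoblockstotal, 0) * 100, 1) pct_done
from v$fast_start_transactions;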

2.1.4.4 Undo Size

We can make an undo test by updating one block many times, and querying an un-modified block in a
different session. The purpose is to try to make the undo length (space) grow towards infinity.

Open one session and update the first row 1000 times, without commit:

begin
for i in 1..1000 loop
update test_tab set n1 = rpad('a', 3000, i) where id = 1;
end loop;
end;
/

Open a second session and read the second row:

select value from v$sysstat where name = 'data blocks consistent reads - undo records applied';
select count(*) from test_tab t where id = 2;
select value from v$sysstat where name = 'data blocks consistent reads - undo records applied';

The output:

25256
1
26256

shows that there are 1000 (26256 - 25256) "data blocks consistent reads - undo records applied".

Even though the second row is not modified and sits in a different block, the query still needs to apply all the
undo records to make the CR read, because of the full table scan (reading all data blocks). We hit this performance
issue when we assume that we are not selecting the modified rows. Even if we add an index to the table
and then query the rows, we can still experience this problem, since the index is subject to the same undo
mechanism.

(More statistics can be listed with runstats from Expert Oracle Database Architecture (second edition) [12].)
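
To watch this from the reading session itself, the session-level counter can be sampled before and after the query (a minimal sketch):

select n.name, s.value
from v$mystat s, v$statname n
where s.statistic# = n.statistic#
and n.name = 'data blocks consistent reads - undo records applied';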

2.1.4.5 Undo Duration

The next undo test continuously inserts rows without commit, while querying from a few other sessions.
The purpose is to try to make the undo grow in time without bound.

At first, set up the test:

create sequence test_seq;


create table test_tab2 (id number, seq_nr number, cnt number);
create index test_tab2_ind on test_tab2 (id, seq_nr, cnt);
create type type_c100 as table of varchar2(100);
/

create or replace procedure insert_no_commit(p_cnt number) as


begin
insert into test_tab2 select 99, test_seq.nextval, level from dual connect by level <= p_cnt;
dbms_lock.sleep(0.1);
end;
/

create or replace procedure test_select as


l_tab type_c100;
begin
select rowidtochar(rowid) bulk collect into l_tab
from test_tab2 where id = 99;
dbms_lock.sleep(0.1);

end;
/

create or replace procedure insert_no_commit_loop(p_job_cnt number)


as
l_job_id pls_integer;
begin
for i in 1.. p_job_cnt loop
dbms_job.submit(l_job_id, 'begin while true loop insert_no_commit(4); end loop; end;');
end loop;
commit;
end;
/

create or replace procedure test_select_loop(p_job_cnt number)


as
l_job_id pls_integer;
begin
for i in 1.. p_job_cnt loop
dbms_job.submit(l_job_id, 'begin while true loop test_select; end loop; end;');
end loop;
commit;
end;
/

Launch the test by:

exec insert_no_commit_loop(1);
exec test_select_loop(24);

From time to time, run the following query; we can see that the elapsed time per execution of the select statement
gradually increases. At the beginning it is a few milliseconds; after a few hours, it can reach a couple of minutes.

select executions, disk_reads


,rows_processed, round(rows_processed/executions, 2) rows_per_exec
,buffer_gets, round(buffer_gets/executions, 2) buffer_per_exec
,round(elapsed_time/1e3) elapsed_time, round(elapsed_time/1e3/executions, 2) elapsed_per_exec
,v.*
from v$sql v where lower(sql_text) like '%test_tab2%' and v.executions > 0
order by v.executions desc;

The problem is that the number of undo records is continuously increasing. When we run the query,
Oracle has to build the previously discussed Data Block ITL Uba Linked List (see section 2.1.2.1) for read
consistency (CR copy). With a growing number of undo records in the Linked List, the query gets slower as it
traverses a longer path.

We have seen similar code in real applications. At the beginning it looks fast, but as time goes on, it
gets slower and slower.

2.1.4.6 Temporary Table (GTT): Undo / Redo

In this section, we make some rule-based reasoning about GTT undo and redo handling. As we
learned from Book Oracle Core [15], Oracle fulfils a DML in the following 5 steps (see Book Oracle Core
[15, p. 10] and Blog [35]):

(1). create Undo Change Vector

(2). create Redo Change Vector

(3). combine both Change Vectors into Redo Buffer, and then write to redo log
(4). write Undo record into Undo file
(5). write modified Data into Data file

In case of GTT, applying the rule:

Oracle redo log never records temporary data (no redo for GTT)

Step (2) is pruned away from the above 5 steps, so a DML on a GTT consists of 4 steps:

(1). create Undo Change Vector


(3). put Undo Change Vectors into Redo Buffer, and then write to redo log
(4). write Undo record into Undo file
(5). write modified Data into DATA (TEMP) file

When setting the 12c parameter temp_undo_enabled=TRUE, Step (3) is removed as well; the sequence is further shortened to:

(1). create Undo Change Vector


(4). write Undo record into TEMP file (Remember Temp has NEVER redo, v$tempundostat)
(5). write modified Data into Data (Temp) file

In the end, the GTT undo footprint in the redo log is reduced to a minimum.
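
As a minimal sketch to verify this reasoning (the GTT test_gtt below is hypothetical; note that temp_undo_enabled must be set before the session's first temporary-segment operation): compare the session statistics ”redo size” and ”undo change vector size” around a GTT insert.

create global temporary table test_gtt (id number, txt varchar2(100)) on commit preserve rows;

alter session set temp_undo_enabled = true;

select vn.name, vs.value
from v$mystat vs, v$statname vn
where vs.statistic# = vn.statistic#
and vn.name in ('redo size', 'undo change vector size');

insert into test_gtt select level, rpad('x', 100, 'x') from dual connect by level <= 10000;

-- re-run the statistics query: with temp undo enabled, both deltas should stay close to zero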

2.1.4.7 Instance Crash during SMON Parallel Transaction Recovery

In this section, we will demonstrate instance crash error:

ORA-00474: SMON process terminated with error during parallel transaction recovery

Note: tested with fast_start_parallel_rollback either LOW (default) or HIGH, and fast_start_mttr_target
either 300 or 1800 in Oracle 12c.

At first, we setup our test:

drop table big_table;

create table big_table as select level x, rpad(’ABC’, 100, ’X’) y from dual connect by level < 1e6;

create index big_table#i1 on big_table (y, x);

select count(*) from big_table;


-- 999,999

select segment_name, round(bytes/1024/1024) mb from dba_segments where segment_name = ’BIG_TABLE’;

-- BIG_TABLE 122

create or replace procedure smon_crash(p_nr number) as


l_val varchar2 (256 byte);
begin
for i in 1..1e8 loop
update big_table set y = rpad(’abc’, 100, i) where mod(x, 4) = p_nr;
end loop;
end;
/

create or replace procedure smon_crash_jobs(p_job_cnt number) as


l_job_id pls_integer;
begin
for i in 1.. p_job_cnt loop
dbms_job.submit(l_job_id, ’begin while true loop smon_crash(’|| (i-1) ||’); end loop; end;’);
end loop;
commit;
end;
/

create or replace procedure clean_jobs as


begin
for c in (select job from dba_jobs) loop
begin
dbms_job.remove (c.job);
exception when others then null;
end;
commit;
end loop;

for c in (select d.job, d.sid, (select serial# from v$session where sid = d.sid) ser
from dba_jobs_running d) loop
begin
execute immediate
’alter system kill session ’’’|| c.sid|| ’,’ || c.ser|| ’’’ immediate’;
dbms_job.remove (c.job);
exception when others then null;
end;
commit;
end loop;

dbms_lock.sleep(2);

-- select * from dba_jobs;


-- select * from dba_jobs_running;
end;
/

Then launch 8 update Jobs:

Sql > exec smon_crash_jobs(8);

After 5 minutes, we kill 6 of the 8 launched Jobs with the UNIX command kill -9.

Immediately, we can see that SMON starts parallel transaction recovery by launching a few parallel processes P0xx. Pick one parallel process, for example P001 (orapid 33), and suspend it:

oradebug setorapid 33
oradebug suspend

We also stop the rest of the launched Jobs (otherwise too much redo is generated) with clean_jobs (see the script in section 1.2.4):

SQL > exec clean_jobs;

After about half an hour (30 to 40 minutes as tested), the instance crashed, and the alert log shows:

SMON started with pid=16, OS id=16729

ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
USER (ospid: 16729): terminating the instance due to error 474

Probably SMON (the parallel coordinator) crashes the instance after a certain time limit if it cannot receive a signal from its managed parallel slaves.

As discussed in section 2.1.2.2, the TRX (Rec, rci) Linked List is used in transaction recovery. When SMON has trouble constructing this list in time, it stops the instance. Hence undo is a critical mechanism for the stable running of an Oracle database.

By the way, if we kill one parallel process (instead of suspending it), no instance crash happens, and the alert log shows:

SMON: slave died unexpectedly, downgrading to serial recovery

One workaround we have seen is to set fast_start_parallel_rollback=FALSE to switch off parallel recovery.
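
The workaround and the recovery progress can be checked as follows (a small sketch; v$fast_start_transactions reports the undo blocks already rolled back):

alter system set fast_start_parallel_rollback = false;

select state, undoblocksdone, undoblockstotal
from v$fast_start_transactions;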

2.1.4.8 LOB Undo and ORA-22924: snapshot too old

To complete the undo discussion, we can look at the particularities of Oracle LOB undo. Each LOB is stored in two segments: lobindex and lobsegment. lobindex is managed under the normal undo paradigm, whereas lobsegment is handled differently. Hence, a LOB can throw the normal ORA-01555 for lobindex, or two errors (ORA-01555 and ORA-22924) for lobsegment:

ORA-01555: snapshot too old: rollback segment number with name "" too small
ORA-22924: snapshot too old

According to the Oracle documentation, we can specify either pctversion or retention for BasicFiles LOBs, but not both (only retention can be specified for SecureFiles); retention cannot be set explicitly, as it is determined by the undo_retention parameter. The two parameters cover two dimensions: pctversion caps space usage, retention limits lifetime.
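
For illustration, the two parameters appear in the LOB storage clause roughly as follows (a sketch with hypothetical tables lob_tab1 and lob_tab2):

create table lob_tab1 (id number, doc clob)
lob (doc) store as basicfile (pctversion 20);

create table lob_tab2 (id number, doc clob)
lob (doc) store as basicfile (retention);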

To conform with the ANSI standard (ACID regime), for a permanent LOB, lobsegment consistency is maintained by versions (different copies). Therefore, lobsegment does not generate rollback information (redo/undo); only lobindex generates undo/redo, since it is implemented in the normal Oracle undo/redo mechanism.

For a temporary LOB, however, CR, undo and versions are not supported; temporary LOBs are stored in the temporary tablespace and are session private.

Blog [59] contains further discussions, tests, and fixes for the LOB-specific ORA-22924.

2.2 Redo Practice

Although undo is more complicated than redo, as explained in the book Oracle Core [15], redo performance is more visible in applications. For example, while the end user is waiting on ”log file sync”, the DBA is googling for a panacea for ”log file parallel write”.

The Oracle architecture lets redo traverse all layers of the computer system: from the application, through Oracle and the OS (scheduler, VM, FS), adapters, down to disk (LUNs, RAID) and network. Any point along the way can induce a bottleneck, which makes redo performance tracking considerably broad. In this section, we discuss redo mainly from the application point of view.

First we look at transaction redo behaviour in one single database, then in distributed transactions running on two databases, which nowadays are widely used in (Java) connections with external systems (for example, messaging, streaming). At the end, we will try to identify their differences.

All tests are done in Oracle 11.2.0.4.0 (see appended Test Code).

2.2.1 Test Setup

At first, we set up two DBs, one named dblocal, the other dbremote. Run Test Code-1 in both DBs; additionally, run Test Code-2 only in dblocal (to repeat the test, first adapt the database name, user id and password).

2.2.1.1 Test Code-1

Create a test table of 10,000 rows with row length 105, one AWR snapshot creation procedure, and two commit procedures.

drop table test_redo;

create table test_redo (


id int primary key using index (create index ind_p on test_redo (id)),
name varchar2(300));

insert into test_redo select level, rpad(’abc’, 100, ’x’) y from dual connect by level <= 10000;

exec dbms_stats.gather_table_stats(null, ’TEST_REDO’, cascade => true);

select table_name, num_rows, avg_row_len from dba_tables where table_name = ’TEST_REDO’;


TABLE_NAME NUM_ROWS AVG_ROW_LEN
----------- --------- -----------
TEST_REDO 10000 105

create or replace procedure create_awr as


begin
sys.dbms_workload_repository.create_snapshot(’ALL’);
end;
/

create or replace procedure db_commit as


begin
commit;
end;
/

create or replace procedure db_commit_autotrx(i number) as


pragma autonomous_transaction;

begin
update test_redo set name = i||’_autonomous_transaction_at_’||localtimestamp where id = (10000 - (i - 1));
commit;
end;
/

2.2.1.2 Test Code-2

Create a dblink, one update procedure, and two remote commit procedures.

drop database link dblinkremote;

create database link dblinkremote connect to k identified by k using ’dbremote’;

create or replace procedure update_test_tab(p_cnt number, p_job number) as


begin
for i in 1.. p_cnt loop
update test_redo set name = rpad(’abc’, 100, i) where id = (p_job -1) * 1000 + mod(i, 1001);

---- get similar AWR Redo figures for select for update
--for c in (select name from test_redo where id = (p_job -1) * 1000 + mod(i, 1001) for update)
--loop null; end loop;

commit;
end loop;
end;
/

create or replace procedure update_test_tab_loop(p_cnt number, p_job_cnt number)


as
l_job_id pls_integer;
begin
for i in 1.. p_job_cnt loop
dbms_job.submit(l_job_id, ’update_test_tab(’||p_cnt||’, ’|| i||’);’);
end loop;
commit;
end;
/

create or replace procedure dblinkremote_commit as


begin
db_commit@dblinkremote;
end;
/

create or replace procedure dblinkremote_commit_autotrx(i number) as


begin
db_commit_autotrx@dblinkremote(i);
end;
/

2.2.2 Asynchronous Commit

The default PL/SQL commit behaviour for non-distributed transactions is batch nowait if the commit_logging and commit_wait database initialization parameters are not set (see Database PL/SQL Language Reference - COMMIT Statement [20]).

Run the code block below to update 1000 rows in one single session:

exec create_awr;
exec update_test_tab(1000, 1);
exec create_awr;

The collected AWR is shown in Table 2.1. For 1000 updates and 1000 user commits, it requires 971 ”redo writes”, which triggered almost the same number of ”log file parallel write” waits (972), very close to the number of user commits. However, there are only 2 ”redo synch writes”, hence 2 ”log file sync” waits. That is the effect of asynchronous commit.

Statistic                  Total / Waits    per Second    per Trans
redo size                     11,369,340    943,435.40    11,245.64
user commits                       1,011         83.89            1
redo synch writes                      2          0.17            0
redo writes                          971         80.57         0.96
log file sync                          2
log file parallel write              972

Table 2.1: Asynchronous Commit

Note that "select for update" statement shows the similar redo behaviour as "update" even though
there is no real update (see out-commented lines in procedure update test tab in Test Code-2). It
means that redo space is forecasted, and pre-allocated accordingly before DML executions.

This redo optimization is not only for Plsql, but also effective for Oracle server-side internal driver JVM.

When we make redo benchmark test with Plsql, it should not simply look the number of commits inside
PL/SQL block because all commits are compressed and postponed till quitting PL/SQL block.
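
A minimal sketch to observe this batching, reusing table test_redo from Test Code-1: check the session statistic ”redo synch writes” around a PL/SQL loop that commits 1000 times; the delta should stay near 1, not 1000.

select vs.value
from v$mystat vs, v$statname vn
where vs.statistic# = vn.statistic#
and vn.name = 'redo synch writes';

begin
for i in 1 .. 1000 loop
update test_redo set name = 'commit_batch_test' where id = i;
commit;
end loop;
end;
/

-- re-run the statistics query: the delta is about 1 (one synchronous flush when the block exits)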

2.2.3 Synchronous Commit

To get rid of the redo optimization of the asynchronous commit above, we can update 1000 rows with synchronous commits using the following script:

---------------- script_1 ----------------

#!/bin/ksh

sqlplus -s testu/testp |&


print -p "exec create_awr;"
print -p "set feedback off"

i=0
while (( i < $1 ))
do
(( i+=1 ))
echo $i
print -p "update test_redo set name = rpad(’abc’, 100, $i) where id = mod($i, 1001);"
print -p "commit;"
done

print -p "exec create_awr;"


print -p "exit 0"

Run script_1 1000; the collected AWR redo stats are shown in Table 2.2:

Statistic                  Total / Waits    per Second    per Tran
redo size                     11,202,868  3,002,644.87    11,080.9
user commits                       1,011        270.97           1
redo synch writes                  1,001        268.29        0.99
redo writes                        1,036        277.67        1.02
log file sync                      1,001
log file parallel write            1,034

Table 2.2: Synchronous Commit

In the case of 1000 synchronous commits, all 5 statistics except ”redo size” are almost identical (between 1,001 and 1,036); each user commit leads to one respective event.

”redo synch writes” represents the number of times the redo is forced to be flushed to disk immediately, usually for a transaction commit.

A client-side JDBC connection shows similar behaviour, but there it can be eliminated by switching off the default auto-commit with connection.setAutoCommit(false).

2.2.4 Piggybacked Commit

Piggybacked commit reduces the number of commit redo writes by grouping redo records from several sessions together (see Oracle MOS WAITEVENT: ”log file sync” Reference Note (Doc ID 34592.1)). It is an optimization performed in the background by Oracle LGWR.

Again launch 10 jobs (sessions), each of which updates 1000 rows as follows:

exec create_awr;
exec update_test_tab_loop(1000, 10);
exec dbms_lock.sleep(120);
-- wait for job finished
exec create_awr;

The respective AWR is Table 2.3.


Statistic                  Total / Waits    per Second    per Tran
redo size                     18,138,376    136,033.06    1,808.05
user commits                      10,032         75.24           1
redo synch writes                     12          0.09           0
redo writes                          440           3.3        0.04
log file sync                         12
log file parallel write              439

Table 2.3: Piggybacked Commit

Compared to Table 2.1, although we make 10,000 updates and 10,032 user commits across 10 parallel sessions, ”redo size” is not 10 times that of Table 2.1 (18,138,376 vs. 11,369,340), and ”redo writes” (also ”log file parallel write”) is even lower than in Table 2.1 (440 vs. 971). That is the role piggybacked commit plays.

To reveal the internal mechanism, truss the LGWR process and look at the output:

listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFDB4D0, 0x0000000000000002, ...) = 0x0000000000000000


aio_nwait64(0x0000000000001000, 0x0000000000000002, 0x0FFFFFFFFFFEB4D0, 0x800000000000D032, ...) = 0x0000000000000002
thread_post_many(7, 0x0FFFFFFFFFFF3488, 0x0FFFFFFFFFFF3480) = 0
listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFDB4D0, 0x0000000000000002, ...) = 0x0000000000000000
aio_nwait64(0x0000000000001000, 0x0000000000000002, 0x0FFFFFFFFFFEB4D0, 0x800000000000D032, ...) = 0x0000000000000002
thread_post_many(4, 0x0FFFFFFFFFFF3488, 0x0FFFFFFFFFFF3480) = 0
listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFDB4D0, 0x0000000000000002, ...) = 0x0000000000000000
aio_nwait64(0x0000000000001000, 0x0000000000000002, 0x0FFFFFFFFFFEB4D0, 0x800000000000D032, ...) = 0x0000000000000002
thread_post_many(6, 0x0FFFFFFFFFFF3488, 0x0FFFFFFFFFFF3480) = 0

Each redo write is accomplished by 3 consecutive operations: listio64, aio_nwait64, thread_post_many. It means that the redo of multiple threads (Oracle sessions) is collected and written by one single listio64 call; LGWR then waits for the I/O to finish, and posts all the involved sessions. Note that the first parameter of thread_post_many is nthreads, i.e. the number of threads.
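
To quantify these calls, truss can also summarize the system call counts of the LGWR process (a sketch; 1234 stands for the actual LGWR ospid):

truss -c -p 1234

# after Ctrl-C, the -c summary lists the counts of listio64, aio_nwait64, thread_post_many, etc.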

2.2.5 Distributed Transactions

A distributed transaction skips many redo optimizations, and hence puts more burden on LGWR.

We will look at Oracle-controlled distributed transactions using a database link.

Run the following script_2 1000 to update 1000 rows in both the local and the remote DB, then collect the respective AWRs on both DBs.

---------------- script_2 ----------------

#!/bin/ksh

sqlplus -s testu/testp |&


print -p "exec create_awr;"
print -p "exec create_awr@dblinkremote;"
print -p "set feedback off"

i=0
while (( i < $1 ))
do
(( i+=1 ))
echo $i
print -p "update test_redo set name = rpad(’abc’, 100, $i) where id = mod($i, 1001);"
print -p "update test_redo@dblinkremote set name = rpad(’abc’, 100, $i) where id = mod($i, 1001);"
print -p "commit;"
done

print -p "exec create_awr;"


print -p "exec create_awr@dblinkremote;"
print -p "exit 0"

Table 2.4 is AWR from dblocal (Commit Point Site).

Statistic                       Total / Waits    per Second    per Tran
redo size                          12,135,636    394,513.70    12,003.6
user commits                            1,011         32.87           1
redo synch writes                       3,003         97.62        2.97
redo writes                             3,038         98.76           3
log file sync                           3,002
log file parallel write                 3,116
transaction branch allocation           4,012

Table 2.4: Distributed Transaction - dblocal

Table 2.5 is AWR from dbremote.


Statistic                       Total / Waits    per Second    per Tran
redo size                          11,315,104    371,681.63    11,191.9
user commits                            1,011         33.21           1
redo synch writes                       2,003          65.8        1.98
redo writes                             2,035         66.85        2.01
log file sync                           2,002
log file parallel write                 2,036
transaction branch allocation           9,013

Table 2.5: Distributed Transaction - dbremote

Compared with Table 2.2 (synchronous commits in non-distributed transactions), the 1000-row update in a distributed transaction demands 3 times as many redo events in the commit point site (dblocal), and 2 times as many in the other node (dbremote). ”transaction branch allocation” is a statistic specific to distributed transactions; it is 4 times the user commits in dblocal, and 9 times in dbremote.

Each ”redo synch write” is translated into one UNIX system call, pwrite, which can be watched with truss or DTrace on Solaris. Therefore there are 3,003 pwrites in dblocal, and 2,003 pwrites in dbremote.
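
For example, a one-line DTrace sketch on Solaris counting LGWR's pwrite calls during the test (1234 stands for the LGWR ospid):

dtrace -n 'syscall::pwrite*:entry /pid == $target/ { @n[probefunc] = count(); }' -p 1234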

As we know, a redo record (redo entry) is made up of a group of change vectors. In the AWR report, we can also check ”redo entries”.

By dumping the redo log for the ”commit” command in dblocal and dbremote with:

ALTER SYSTEM DUMP LOGFILE ’<full_path_logfile_name>’;

CHANGE vectors in dblocal are:

CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8a SEQ:1 OP:5.2
CLS:118 AFN:3 DBA:0x00c00550 OBJ:4294967295 SCN:0x0853.9628ab8a SEQ:1 OP:5.1
CLS:118 AFN:3 DBA:0x00c00550 OBJ:4294967295 SCN:0x0853.9628ab8d SEQ:1 OP:5.1
CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8d SEQ:1 OP:5.12
CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8d SEQ:2 OP:5.12
CLS:117 AFN:3 DBA:0x00c00290 OBJ:4294967295 SCN:0x0853.9628ab8f SEQ:1 OP:5.4

CHANGE vectors in dbremote are:

CLS:123 AFN:3 DBA:0x00c002b0 OBJ:4294967295 SCN:0x0853.9628ac1d SEQ:1 OP:5.2
CLS:124 AFN:3 DBA:0x00c009ba OBJ:4294967295 SCN:0x0853.9628ac5c SEQ:1 OP:5.1
CLS:124 AFN:3 DBA:0x00c009ba OBJ:4294967295 SCN:0x0853.9628ac6a SEQ:1 OP:5.1
CLS:123 AFN:3 DBA:0x00c002b0 OBJ:4294967295 SCN:0x0853.9628ac6a SEQ:1 OP:5.4

we can see that ”commit” in dblocal includes two additional OP:5.12 change vectors on the undo header (CLS:117), apart from the similar OP:5.2 (change for updating the transaction table in the undo segment header), OP:5.1 (change for the update of the undo block), and OP:5.4 (change for the commit). Probably OP:5.12 is specific to two-phase (2PC) commit.

As documented for the Oracle two-phase commit mechanism, the commit point site goes through 3 phases: Prepare/Commit/Forget, whereas the other resource nodes perform only the first 2 phases.

Each committed transaction has to be stamped with a distinct SCN to uniquely identify the changes made by the SQL statements within that transaction. During the prepare phase, the database determines the highest SCN at all nodes involved in the transaction. The transaction then commits with the high SCN at the commit point site. The commit SCN is then sent to all prepared nodes with the commit decision (see Database Administrator’s Guide - Distributed Transactions Concepts [19]).

The database alert log of dblocal contains the following text, showing the SCN synchronization of the remote nodes with the commit point site when the distributed transaction is started for the first time:

Fri Jul 03 13:48:00 2015


Advanced SCN by 29394 minutes worth to 0x0824.738bd480, by distributed transaction logon, remote DB: DBREMOTE.COM.
Client info: DB logon user TESTU, machine dbremote, program oracle@dbremote (TNS V1-V3), and OS user oracle
Fri Jul 03 13:50:04 2015

The following two queries can be used to monitor distributed transactions.

select ’local’ db, v.* from v$global_transaction v union all


select ’remote’ db, v.* from v$global_transaction@dblinkremote v;

select ’local’ db, v.* from v$lock v where type in (’TM’, ’TX’, ’DX’) union all
select ’remote’ db, v.* from v$lock@dblinkremote v where type in (’TM’, ’TX’, ’DX’);

For XA-based distributed transactions (X/Open DTP), similar behaviour can be observed. Additionally, AWR - Enqueue Activity shows statistics for ”DX-Distributed Transaction”.

2.2.6 Distributed Transaction Commit

We can commit the whole distributed transaction in local db:

update test_redo set name = ’update_local_1’ where id = 1;


update test_redo@dblinkremote set name = ’update_remote_1’ where id = 1;
commit;

or commit in remote db:

update test_redo set name = ’update_local_1’ where id = 1;


update test_redo@dblinkremote set name = ’update_remote_1’ where id = 1;
exec dblinkremote_commit;

2.2.7 Distributed Transaction with autonomous transaction

We can make our test more sophisticated by incorporating an autonomous transaction in the remote DB to update 1000 rows.

Run the following script_3 1000, and then collect AWRs on both DBs.

---------------- script_3 ----------------

#!/bin/ksh

sqlplus -s testu/testp |&


print -p "exec create_awr;"
print -p "exec create_awr@dblinkremote;"
print -p "set feedback off"

i=0
while (( i < $1 ))
do
(( i+=1 ))
echo $i
print -p "update test_redo set name = rpad(’abc’, 100, $i) where id = mod($i, 1001);"
print -p "update test_redo@dblinkremote set name = rpad(’abc’, 100, $i) where id = mod($i, 1001);"
print -p "exec dblinkremote_commit_autotrx(mod($i, 1001));"
print -p "commit;"
done

print -p "exec create_awr;"


print -p "exec create_awr@dblinkremote;"
print -p "exit 0"

Table 2.6 is AWR in dblocal (Commit Point Site).

Table 2.7 is AWR in dbremote.

Comparing the dblocal AWR in Table 2.6 with Table 2.4, both are almost identical. For dbremote, however, the numbers in Table 2.7 are higher than in Table 2.5 due to the autonomous transaction. That is the impact when piggybacked commit is disabled.

Statistic                       Total / Waits    per Second    per Tran
redo size                          12,010,796    483,565.34    11,880.1
user commits                            1,011          40.7           1
redo synch writes                       3,004        120.94        2.97
redo writes                             3,041        122.43        3.01
log file sync                           3,004
log file parallel write                 3,042
transaction branch allocation           6,010

Table 2.6: Autonomous Distributed Transaction - dblocal

Statistic                       Total / Waits    per Second    per Tran
redo size                          12,156,052    499,673.30    6,047.79
user commits                            2,010         82.62           1
redo synch writes                       3,002         123.4        1.49
redo writes                             3,035        124.75        1.51
log file sync                           3,004
log file parallel write                 3,034
transaction branch allocation          10,019

Table 2.7: Autonomous Distributed Transaction - dbremote

2.2.8 Distributed Transaction: distributed lock timeout

In the remote DB, set distributed_lock_timeout:

alter system set distributed_lock_timeout=27 scope=spfile;

then restart the remote DB:

startup force;

In the remote DB, execute one update on table test_redo:

update test_redo set name = ’update_remote_0’ where id = 1;

and then in the local DB, execute two updates, one on the local test_redo, the other on the remote test_redo@dblinkremote:

update test_redo set name = ’update_local_1’ where id = 1;


update test_redo@dblinkremote set name = ’update_remote_1’ where id = 1;

after 27 seconds (wait event "enq: TX - contention"), it returns:

ORA-02049: timeout: distributed transaction waiting for lock


ORA-02063: preceding line from DBLINKREMOTE

ORA-02049 is thrown when the segment modified by the DML is blocked for longer than distributed_lock_timeout. By the way, the same error is also observed when the segment is stored in an Oracle bigfile tablespace and pre-allocation takes longer than distributed_lock_timeout (see Blog: Oracle bigfile tablespace pre-allocation and session blocking [43]).

One thing worth mentioning is that the wait event is the special "enq: TX - contention", not the general "enq: TX - row lock contention".

2.2.9 Redo/Undo Explosion from Thick Declared Table Insert

Now we can look a real application problem caused by Redo/Undo explosion when inserting rows into a
thick declared table.

Originally there is a thin table (thin tab) consisting of two columns: one number and one 40 length
varchar2. Later the application requires to add two new 1000 char length columns to store some seldom
occurred message, so a new table thick tab is created by including two new columns.

We have performed tests on 10gR2, 11gR2 and 12cR1 on a NOARCHIVELOG-mode database with
settings:

undo_management=AUTO,
db_block_size=8192,
nls_characterset=AL32UTF8.

In the following test, we first insert 10000 rows into thin_tab, then we insert the same content into thick_tab. The two new columns in thick_tab are not referenced at all.

Table test_stats is used to store the test statistics for each step.

---- Test Setup ----


drop table test_stats;
create table test_stats (step varchar2(10), name varchar2(30), value number);

drop table thin_tab;


create table thin_tab(num number, txt varchar2(40 char));

drop table thick_tab;


create table thick_tab(num number, txt varchar2(40 char), n1 varchar2(1000 char), n2 varchar2(1000 char));

---- Step_1 init ----


insert into test_stats
select ’step_1’ step, vn.name, vs.value
from v$sesstat vs
, v$statname vn
where vs.sid = userenv(’sid’)
and vs.statistic# = vn.statistic#
and vn.name in (’redo size’, ’undo change vector size’);

---- Step_2 insert into thin_tab ----


insert into thin_tab(num, txt)
select level, ’abc’ from dual connect by level <= 10000;

insert into test_stats


select ’step_2’ step, vn.name, vs.value
from v$sesstat vs
, v$statname vn
where vs.sid = userenv(’sid’)
and vs.statistic# = vn.statistic#
and vn.name in (’redo size’, ’undo change vector size’);

---- Step_3 insert into thick_tab ----


insert into thick_tab(num, txt)
select level, ’abc’ from dual connect by level <= 10000;

insert into test_stats


select ’step_3’ step, vn.name, vs.value
from v$sesstat vs
, v$statname vn
where vs.sid = userenv(’sid’)
and vs.statistic# = vn.statistic#
and vn.name in (’redo size’, ’undo change vector size’);

commit;

Then we query the test result:

select step, name, value,
(value - lag(value) over (partition by name order by step)) diff
from test_stats;

STEP    NAME                      VALUE       DIFF
------- ------------------------- ----------- ----------
step_1  redo size                  11,592,572
step_2  redo size                  11,793,484     200,912
step_3  redo size                  14,335,284   2,541,800
step_1  undo change vector size     3,011,440
step_2  undo change vector size     3,037,820      26,380
step_3  undo change vector size     3,720,120     682,300

select segment_name, blocks, bytes


from dba_segments where segment_name in (’THIN_TAB’, ’THICK_TAB’);

SEGMENT_NAME  BLOCKS    BYTES
------------- ------ --------
THICK_TAB         24  196,608
THIN_TAB          24  196,608

Step_2 is the thin table insert; step_3 is the thick table insert. The output demonstrates that the thick_tab insert generated 12 times the redo (2,541,800/200,912) and 25 times the undo (682,300/26,380) of thin_tab, even though nothing is inserted into the two new columns of thick_tab and both data segments have a similar size.

By dumping the redo logfile, it turns out that Oracle uses row array allocation for thin_tab, but single row allocation for thick_tab. Probably that is an Oracle internal optimization related to redo/undo size forecasting, calculated according to the DDL information.

When using direct-path insert (insert /*+ append */), there is no big difference between the two inserts, and the redo and undo are much smaller (redo size = 10K, undo change vector size = 2K).
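
A sketch of the direct-path variant of step_3 (note that after a direct-path insert, the table cannot be queried in the same session until the transaction is committed, otherwise ORA-12838 is raised):

insert /*+ append */ into thick_tab(num, txt)
select level, 'abc' from dual connect by level <= 10000;

commit;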

2.3 Getting Oracle Transaction Commit SCN

After the long discussions of undo and redo in the last two sections, we can now apply our accumulated knowledge to answer a real-world question:

How to find the exact commit SCN of a transaction?

We will demonstrate one implementation and then compare it with other methods. The solution is not only of (Oracle) academic interest, but also of practical demand, for example, for data replication (see Blog: Clean it up [13]).

Note: all tests are done in Oracle 12.1.0.2.0.

2.3.1 Implementation

At first, setup test:

drop table test_tab;


drop table commitscn_gtt;

create table test_tab(id number, sid number);

create global temporary table commitscn_gtt(dummy)


on commit preserve rows
as select -1 from dual;

create or replace procedure push_commitscn as


begin
delete from commitscn_gtt;
insert into commitscn_gtt (dummy) values (-1);
end;
/

create or replace function get_commitscn return number as


l_commitscn number;
begin
select ora_rowscn into l_commitscn from commitscn_gtt where rownum=1;
return l_commitscn;
end;
/

2.3.2 Run Test

Start one transaction with an insert, call push_commitscn, then commit.

insert into test_tab (id, sid) values (1, sys.dbms_support.mysid);

-- run immediately before commit


exec push_commitscn;

commit;

Now we can call get_commitscn to get the exact commit SCN:

set numformat 999,999,999,999,999

select get_commitscn from dual;

GET_COMMITSCN
--------------------
9,183,757,165,446

2.3.3 Comparing with Other Approaches

Here are 5 different methods of getting the commit SCN. We will analyse their differences one by one after the following tests.

drop table test_tab2;


create table test_tab2(id number, sid number, scn number) rowdependencies;
insert into test_tab2 (id, sid, scn) values (1, sys.dbms_support.mysid, userenv(’commitscn’));
exec push_commitscn;
commit;

set numformat 999,999,999,999,999


select scn from test_tab2 where id = 1;
SCN
--------------------
9,183,757,165,468

select ora_rowscn from test_tab2 where id = 1;


ORA_ROWSCN
--------------------
9,183,757,165,469

select dbms_flashback.get_system_change_number from dual;


GET_SYSTEM_CHANGE_NUMBER
------------------------
9,183,757,165,471

select current_scn from v$database;


CURRENT_SCN
--------------------
9,183,757,165,472

select get_commitscn from dual;


GET_COMMITSCN
--------------------
9,183,757,165,469

Let’s look at their differences.

2.3.3.1 Method 1. userenv(’commitscn’)

select scn from test_tab2 where id = 1;

(1). userenv(’commitscn’) seems to be the actual commit SCN minus 1, i.e. immediately before the commit. The decrement of 1 is probably because each commit creates one new SCN for the commit record.

(2). It is undocumented.

(3). In 12c, userenv is deprecated.

2.3.3.2 Method 2. ora rowscn

select ora_rowscn from test_tab2 where id = 1;

The Oracle Database SQL Language Reference - ORA_ROWSCN Pseudocolumn ([22]) says:

Whether at the block level or at the row level, the ORA_ROWSCN should not be considered to be an exact SCN. If a block is queried twice, then it is possible for the value of ORA_ROWSCN to change between the queries even though rows have not been updated in the time between the queries. The only guarantee is that the value of ORA_ROWSCN in both queries is greater than the commit SCN of the transaction that last modified that row.

(Note: greater than seems to be an Oracle documentation error. It should be ”not less than”, because of the possibility of getting the exact commit SCN.)

So ora_rowscn is not deterministic, because the returned SCN also depends on the query start SCN due to the previously discussed delayed block cleanout in section Undo - Cleanout 2.1.3. Hence it can decrease between two queries even when there are no user updates on the selected table (see Blog: How can ORA_ROWSCN change between queries when no update? [5]), or even increase, as demonstrated in section Undo - Cleanout 2.1.3.

For each row, ora_rowscn returns a conservative upper bound SCN of the most recent change to the row. For example, each ”Delayed Block Cleanout” triggered by a query can modify the data block SCN according to its query start SCN, which means the returned SCN depends on the query start SCN. If several queries perform ”Delayed Block Cleanout” on the same data block, they probably put different upper-bound commit SCNs into the data block at different moments.

More critically, this method is buggy, as mentioned in Oracle MOS Doc ID 2210391.1:
INSERT to a Table With ROWDEPENDENCIES Failed With ORA-00600 [kdtgsph-row]
Indeed, we have hit this bug in real applications.

2.3.3.3 Method 3 & 4. System functions

select dbms_flashback.get_system_change_number from dual;

select current_scn from v$database;

Both return an upper bound of the current system change number (SCN), not the precise commit SCN.

2.3.3.4 Method 5. Exact commit SCN

select get_commitscn from dual;

Once a commit is executed on the GTT, this single row (single block) is stamped with that commit SCN. Since the one-row GTT is session local, it can never be selected by other sessions and therefore never incurs any ”Delayed Block Cleanout” by other queries. When the single row is inserted and committed, the ITL flag is set to C--- and the SCN is filled in. We have seen this approach successfully implemented in real applications.

2.3.4 Commit SCN Exposed

When using the Result Cache feature introduced in Oracle Database 11g Release 1, we observed that commit is enhanced to perform the result cache invalidation before it returns, and made the following interpretation (see Blog: PL/SQL Function Result Cache Invalidation (I) [48]):

The above stack trace shows that when a transaction user session calls the commit command, commit takes a detour to visit the result cache along its code path in order to perform the invalidation before publishing the news to the world.

The following test demonstrates that exactly the same commit SCN is recorded in v$result_cache_objects.scn when the result cache is invalidated.

At first, create the test code with a PL/SQL result cache function as follows.

drop table rc_tab;


create table rc_tab (id number, val number);
insert into rc_tab select level, level*10 from dual connect by level <= 3;
commit;

create or replace function get_val (p_id number) return number result_cache as


l_val number;
begin
select val into l_val from rc_tab where id = p_id;
return l_val ;
end;
/

create or replace procedure run_test as

l_val number;
begin
for i in 1 .. 3 loop
l_val := get_val(i);
end loop;
end;
/

Invoke the function to build the result cache entries, and show them with a query:

exec dbms_result_cache.flush;
exec run_test;
column name format a13
select id, type, status, name, namespace, creation_timestamp, scn
from v$result_cache_objects ro
order by scn desc, type, id;

ID TYPE STATUS NAME NAMESPACE CREATION_TIMESTAMP SCN


-- ---------- --------- ------------- ---------- -------------------- --------------------
0 Dependency Published K.GET_VAL 2017*JUL*12 07:06:43 9,198,309,753,651
2 Dependency Published K.RC_TAB 2017*JUL*12 07:06:43 9,198,309,753,651
1 Result Published "K"."GET_VAL" PLSQL 2017*JUL*12 07:06:43 9,198,309,753,651
3 Result Published "K"."GET_VAL" PLSQL 2017*JUL*12 07:06:43 9,198,309,753,651
4 Result Published "K"."GET_VAL" PLSQL 2017*JUL*12 07:06:43 9,198,309,753,651

We can see two dependencies (one table and one function), and all result cache rows are stamped with the timestamp and SCN of their creation time.

Now, after we update the dependency table rc_tab and commit, all 3 result rows get invalidated. Get the commit SCN with our implementation, and run the same result cache query again.

update rc_tab set val = -2 where id = 2;


exec push_commitscn;
commit;
select get_commitscn from dual;

GET_COMMITSCN
------------------
9,198,309,753,654

select id, type, status, name, namespace, creation_timestamp, scn


from v$result_cache_objects ro
order by scn desc, type, id;

ID TYPE STATUS NAME NAMESPACE CREATION_TIMESTAMP SCN


-- ---------- --------- ------------- ---------- -------------------- --------------------
2 Dependency Published K.RC_TAB 2017*JUL*12 07:06:43 9,198,309,753,654
0 Dependency Published K.GET_VAL 2017*JUL*12 07:06:43 9,198,309,753,651
1 Result Invalid "K"."GET_VAL" PLSQL 2017*JUL*12 07:06:43 9,198,309,753,651
3 Result Invalid "K"."GET_VAL" PLSQL 2017*JUL*12 07:06:43 9,198,309,753,651
4 Result Invalid "K"."GET_VAL" PLSQL 2017*JUL*12 07:06:43 9,198,309,753,651

The first output line shows that the dependency table rc_tab has received a new SCN, 9,198,309,753,654, which is exactly the SCN returned by our get_commitscn, because at that instant the result cache was invalidated by its dependency. Hence we can reasonably conclude that Oracle has to stamp the result cache invalidation with that commit SCN, no more, no less.

The Oracle 12c Database Reference [25] says about v$result_cache_objects.scn:

it is Build SCN for TYPE Result or Invalidation SCN for TYPE Dependency

By the way, there are also a few Oracle tables/views containing commitscn or commit_scn columns, which are worth further checking.
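
A quick way to enumerate such columns is a dictionary query (a small sketch):

select owner, table_name, column_name
from dba_tab_columns
where column_name like '%COMMITSCN%'
or column_name like '%COMMIT\_SCN%' escape '\';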

Chapter 3

Locks, Latches and Mutexes

Data and programs (e.g. PL/SQL) provided in the database are shared among multiple concurrent sessions. They have to be accessed in a coordinated manner, which is accomplished by different Oracle protection mechanisms: locks, latches and mutexes. Which mechanism is used depends on the nature of the protected target, for example, persistent vs. temporary; central vs. distributed; continuous vs. interruptive (single phase vs. multiple phases); and granularity (size).

Only locks can be directly manipulated by users, with the LOCK TABLE statement, whereas latches and mutexes are completely managed by Oracle.

One locking method is not strictly bound to one target. For example, Oracle is gradually migrating latches to mutexes: in Oracle 11.2, latch: library cache was replaced by the library cache mutex; in Oracle 12.2, latch: row cache objects was replaced by the row cache mutex.

On one side, shared resources have to be protected by locking; on the other side, any locking inherently serializes data accesses and consequently slows down the system. Hence a profound understanding and efficient locking are critical for data integrity and database performance.

3.1 Locks

Locks are mainly used to protect ”large” objects (e.g. SQ on sequences), less volatile data (e.g. TM on tables), more persistent items (e.g. TX on table rows), and non-interruptive operations (e.g. Result Cache: Enqueue).

A lock can be centralized, like TM for each table in the whole DB for broad coverage, or de-centralized, like TX for each ITL in every data block for each transaction in a local area.

All available lock types can be enumerated by querying v$lock_type. Currently active locks can be listed by querying v$lock, and lock enqueue statistics are recorded in v$enqueue_statistics, which also gives a short description (req_description) of each enqueue request.
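
For example, to look at the TM and TX lock types and their enqueue statistics (a small illustration):

select type, name, description from v$lock_type where type in ('TM', 'TX');

select eq_type, req_reason, total_req#, total_wait#
from v$enqueue_statistics
where eq_type in ('TM', 'TX');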

Locking is a well-known mechanism, and detailed documentation exists.

In this section, we will look at the TM lock used by the TSDP (Transparent Sensitive Data Protection) feature newly introduced in 12c, where there is no transaction at all, but a commit is still required.

We start by reading Oracle Document:

COMMIT: Use the COMMIT statement to end your current transaction and make permanent all
changes performed in the transaction. A transaction is a sequence of SQL statements that Oracle
Database treats as a single unit. This statement also erases all savepoints in the transaction and
releases transaction locks.

ROLLBACK: Use the ROLLBACK statement to undo work done in the current transaction or
to manually undo the work done by an in-doubt distributed transaction.

It sounds as if commit/rollback had to be issued in the context of a transaction.

Note: All tests are done in Oracle 12.1.0.2.

3.1.1 TM Contention

Now we can demonstrate one case in which COMMIT/ROLLBACK has to be executed even though no transaction exists.

The test code also tries to reproduce and study the MOS bug:

Bug 26965236: DELETE FROM TSDP SENSITIVE DATA$ CAUSING ENQ: TM - CONTENTION
WAITS

In Oracle 12c, we noticed a hidden behaviour of the newly introduced TSDP, which can block the DB with serious TM contention.

When executing:

alter table ttx drop column colx;

Oracle implicitly triggers an internal delete statement:

-- sql_id: 1vr7aynaagb05
delete from tsdp_sensitive_data$ where obj# = :1 and col_argument# = :2

even though we do not use TSDP explicitly.

Let’s look at this new behaviour and the TM locks. First create a table with 3 columns:

------------------ Test SetUp@T0 ------------------


drop table tt1;

create table tt1 (x number, c1 number, c2 number);

Then open two Sqlplus sessions. In the first session (sid 190), execute a 0-row delete (which has at least the same effect as the TSDP-triggered delete), and in the second session (sid 290), drop one column.

------------------ Session_1@T1 ------------------
SQL(sid 190) > delete from sys.tsdp_sensitive_data$ where 1=2;
0 rows deleted.

------------------ Session_2@T2 ------------------


SQL(sid 290) > alter table tt1 drop column c2;

-- same can be reproduced by:


-- alter table k.tt1 set unused (c1);

Now session 290 is blocked. Open a third session (sid 390), display all TM locks:

------------------ Monitor Session_3@T3 ------------------


SQL(sid 390) > select o.object_name, o.subobject_name sub_name, k.*
from v$lock k, dba_objects o
where k.type in (’TM’) and k.ID1 = o.object_id;

OBJECT_NAME SUB_NAME ADDR KADDR SID TYPE ID1 ID2 LMODE REQUEST CTIME BLOCK
----------------------------- -------- ------------ ------------ --- ---- ------- --- ----- ------- ----- -----
TSDP_SENSITIVE_DATA$ 7F6DB924D9D0 7F6DB924DA38 190 TM 1576498 0 3 0 14 0
TSDP_SUBPOL$ 7F6DB924D9D0 7F6DB924DA38 190 TM 1578689 0 3 0 14 0
TSDP_PROTECTION$ 7F6DB924D9D0 7F6DB924DA38 190 TM 1578695 0 3 0 14 1
TT1 7F6DB924D9D0 7F6DB924DA38 290 TM 2321087 0 6 0 9 0
WRI$_OPTSTAT_HISTHEAD_HISTORY 7F6DB924D9D0 7F6DB924DA38 290 TM 601844 0 3 0 9 0
WRI$_OPTSTAT_HISTGRM_HISTORY 7F6DB924D9D0 7F6DB924DA38 290 TM 601856 0 3 0 9 0
COM$ 7F6DB924D9D0 7F6DB924DA38 290 TM 136 0 3 0 9 0
COL_USAGE$ 7F6DB924D9D0 7F6DB924DA38 290 TM 456 0 3 0 9 0
OBJAUTH$ 7F6DB924D9D0 7F6DB924DA38 290 TM 61 0 3 0 9 0
TSDP_SENSITIVE_DATA$ 7F6DB924D9D0 7F6DB924DA38 290 TM 1576498 0 3 0 9 0
TSDP_SUBPOL$ 7F6DB924D9D0 7F6DB924DA38 290 TM 1578689 0 3 0 9 0
TSDP_PROTECTION$ 7F6DB924D9D0 7F6DB924DA38 290 TM 1578695 0 0 5 9 0

The last line shows that session 2 requests lock mode 5 (SSX/SRX) on tsdp_protection$, and is blocked by session 1 holding an ’enq: TM - contention’ lock in mode 3 (RX), even though there are no active transactions in v$transaction. But surprisingly, v$session.taddr (the address of the transaction state object) is filled with a real value, because v$transaction is defined on x$ktcxb, whereas v$session.taddr is taken from x$ksuse.ksusetrn.

SQL(sid 390) > select * from v$transaction;


0 rows selected.

select taddr, event, p1||’(’||to_char(p1, ’XXXXXXXX’)||’)’ p1, p2||’(’||o.object_name||’)’ p2, p3


from v$session s, dba_objects o
where s.p2 = o.object_id and sid = 290;

TADDR EVENT P1 P2 P3
---------------- -------------------- --------------------- ------------------------- --
0000000165FA2FC0 enq: TM - contention 1414332421( 544D0005) 1578695(TSDP_PROTECTION$) 0

select indx sid, ksuseser serial#, ksusetrn taddr from x$ksuse where indx = 290;

SID SERIAL# TADDR


--- ------- ----------------
545 50722 0000000165FA2FC0

The blocking chain looks like:

SQL(sid 390) > select chain_signature, sid, blocker_sid, wait_event_text, p1, p1_text, p2, p2_text, p3, p3_text
from v$wait_chains;

CHAIN_SIGNATURE SID BLOCKER_SID WAIT_EVENT_TEXT P1 P1_TEXT P2 P2_TEXT P3 P3_TEXT


--------------- --- ----------- --------------------------- ---------- --------- ------- -------- -- ---------------
’TM-contention’ 290 190 enq: TM - contention 1414332421 name|mode 1578695 object # 0 table/partition
’TM-contention’ 190 SQL*Net message from client 1413697536 driver id 1 #bytes 0

(CHAIN_SIGNATURE replaced to fit in page, original text is ’SQL*Net message from client’<=’enq: TM - contention’)

In fact, it is caused by a foreign key constraint, tsdp_protection$fksd, but strangely the referenced column tsdp_protection$.sensitive# does not have any index created by Oracle.

constraint tsdp_protection$fksd
foreign key (sensitive#)
references sys.tsdp_sensitive_data$ (sensitive#)
on delete cascade
enable validate

It can also be displayed by query:

SQL(sid 390) > select * from dba_constraints where constraint_name = ’TSDP_PROTECTION$FKSD’;

OWNER CONSTRAINT_NAME CONSTRAINT_TYPE TABLE_NAME R_CONSTRAINT_NAME DELETE_RULE STATUS


----- -------------------- --------------- ---------------- ---------------------- ----------- ------
SYS TSDP_PROTECTION$FKSD R TSDP_PROTECTION$ TSDP_SENSITIVE_DATA$PK CASCADE ENABLED

-- Suspending LGWR will not block commit in Session_1


-- SQL> oradebug setorapid 13
-- Oracle pid: 13, Unix process pid: 18482, image: oracle@testdb (LGWR)
-- SQL> oradebug suspend
-- SQL> oradebug resume

To unlock session 2, we issue a commit in session 1:

------------------ T4: Session_1, Release TM Locks to deblock ------------------


SQL(sid 190) > commit;

For an application user, one quick workaround is to disable (or drop) the foreign key constraint if TSDP is not used (in our test DB, tsdp_protection$ is indeed empty), or to create an index if TSDP is used:

alter table sys.tsdp_protection$ disable constraint tsdp_protection$fksd;

create index sys.tsdp_protection$sensitiv on sys.tsdp_protection$(sensitive#);

For the product provider, now that it is accepted as a bug, one can try to decouple (or decrease) the interference of the newly introduced 12c TSDP with conventional DDL (for example, the drop column above) when TSDP is not used.

If we suspend LGWR in session 3 (see the commented-out code above) and run the whole test again, the commit in session 1 is not blocked. This indicates that no commit record is written to the redo log for this commit execution; in other words, not every commit generates a commit record.
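
A minimal sketch of this observation: check the session statistic ”redo size” around a bare commit issued outside of any transaction.

select vs.value
from v$mystat vs, v$statname vn
where vs.statistic# = vn.statistic#
and vn.name = 'redo size';

commit;

-- re-run the statistics query: with no open transaction, the 'redo size' delta should stay at zero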

3.1.2 Enqueue Trace Event 10704

Oracle provides trace event 10704 for enqueue activity, documented as below:

10704, 00000, "Print out information about what enqueues are being obtained"
// *Cause: When enabled, prints out arguments to calls to ksqcmi and
// ksqlrl and the return values.
// *Action: Level indicates details:
// Level: 1-4: print out basic info for ksqlrl, ksqcmi
// 5-9: also print out stuff in callbacks: ksqlac, ksqlop
// 10+: also print out time for each line

If we repeat the above test and trace the blocked session 2 with event 10704, we can list all triggered TM locks (see Blog: Investigating Oracle lock issues with event 10704 [27]). Following their occurrence sequence, we can track the details of the locking activities.

------------------ Session_2 ------------------

SQL(sid 290) > alter session set events=’10046 trace name context forever, level 1 :
10704 trace name context forever, level 3’ tracefile_identifier=’test_10704_1’;

SQL(sid 290) > alter table tt1 drop column c1;

-- blocked till Session_1 commit

SQL(sid 290) > alter session set events=’10046 trace name context off : 10704 trace name context off ’;

Here is the output (only the relevant text). At the bottom, some comments are added about object_id, object_name and the kernel subroutines.

PARSING IN CURSOR #18446604434595894768 len=42 tim=2741317512980 sqlid=’3hqb4gkqka8qw’


LOCK TABLE "TT1" IN EXCLUSIVE MODE WAIT 5
END OF STMT
PARSE #18446604434595894768:c=20325,e=28725,p=2,tim=2741317512979
ksqgtl *** TM-00236ABF-00000000-00000000-00000000 mode=6 flags=0x400 timeout=5 ***
...
ksqgtl *** TM-001816C7-00000000-00000000-00000000 mode=5 flags=0x400 timeout=21474836 ***
...
ksqcmi: TM-001816C7-00000000-00000000-00000000 mode=5 timeout=21474836

*** 2019-03-21 08:49:33.043


ksqcmi: returns 0
ksqgtl: RETURNS 0
...
ksqcnv: TM-001816C7-00000000-00000000-00000000 mode=3 timeout=21474836
ksqcmi: TM-001816C7-00000000-00000000-00000000 mode=3 timeout=21474836
ksqrcl: TM-001816C7-00000000-00000000-00000000
...
ksqrcl: TM-00236ABF-00000000-00000000-00000000

-- Legend
00236ABF (2321087) = TT1
001816C7 (1578695) = SYS.TSDP_PROTECTION$

ksqgtl: enqueue lock get


ksqcmi: maybe related to enqueue commit
ksqcnv: enqueue lock convert
ksqrcl: enqueue lock release

The trace file shows that session 2 first gets a TM lock in mode 6 (EXCLUSIVE) on tt1 (TM-00236ABF) with timeout=5 (probably picked from ddl_lock_timeout, which is configured as 5 in the test DB), and then tries to get a TM lock on sys.tsdp_protection$ (TM-001816C7) in mode 5 (SRX) with timeout 21474836, but ksqcmi: TM-001816C7 is blocked by session 1.

Once session 1 commits, session 2 gets sys.tsdp_protection$ (TM-001816C7) in mode 5, but it is immediately converted to mode 3 (RX).

The trace above also shows that the lock table command (line 2) took ddl_lock_timeout = 5, but the other subroutines use a very high (default) timeout (21474836 seconds is 248 days). Maybe we can try to wait 249 days to see what will happen.

If we also print out the call stack of the blocked session 2 while it hangs, it looks like:

testdb $ pstack 1234

ffff80ffbdbb36eb semsys (4, 23, ffff80ffbffe8c38, 1, ffff80ffbffe8c40)
000000000578acd8 sskgpwwait () + f8
000000000578a965 skgpwwait () + c5
0000000005944ffc ksliwat () + 8dc
0000000005944350 kslwaitctx () + 90 --wait context
000000000681f8af ksqcmi () + 123f --enqueue commit
000000000593ae62 ksqgtlctx () + ea2 --enqueue get
0000000005c6fac6 ktaiam () + 656
00000000058b7246 ktagetg0 () + 4a6
00000000058b6343 ktagetp_internal () + 63
0000000005c6f20f ktagdw () + 3bf
0000000005b43192 ktaadm () + 222
0000000005b3c163 kksfbc () + 1a23
000000000570571e opiexe () + 9de
0000000005c73275 kpoal8 () + a45
00000000056f87e3 opiodr () + 433
0000000005de3542 kpoodrc () + 22
000000000571a07d rpiswu2 () + 2fd
0000000005de3187 kpoodr () + 287
0000000005de272b upirtrc () + e0b
0000000005de18e0 kpurcsc () + 70
0000000005ddc2da kpuexec () + 1b0a
0000000005dda7bb OCIStmtExecute () + 2b
000000000d08cddc kzdpatbdc () + 2ec
0000000009bb6862 atbdcolsub () + aa2
0000000009bb9e82 atbdrpcol () + 1952
0000000005fa8975 atbdrv () + 6ce5 --alter table driver
0000000005709178 opiexe () + 4438

From bottom to top, we can see that atbdrv (alter table driver) calls ksqgtlctx (enqueue get), which triggers the call of ksqcmi (enqueue commit) and then waits at kslwaitctx (wait context).

If ksqcmi is about enqueue commit, the Oracle documentation on commit/rollback (cited at the beginning of this section) could probably substitute ”enqueue” for ”transaction” (obviously transaction is the more popular term, although enqueue is technically more precise).

3.1.3 Two Other TSDP Cases

So far we have looked at one case of TM contention when dropping a column. Further tests revealed more complicated cases, which trigger different tsdp_sensitive_data$ delete statements, such as:

-- sql_id: 52xfvkwy3u9xx
delete from tsdp_sensitive_data$ where obj# = :1

They are therefore not handled in the same way as the previous case, since the triggered delete statement looks different. This shows that the newly introduced 12c TSDP is widely implicated in different Oracle components; therefore, the related locks should be thoroughly checked in applications.

Here are two short test cases that generate the above delete statement.

3.1.3.1 Non Empty Partition Move Online

drop table pt1;

create table pt1 (prt int, id int)


partition by list (prt) (
partition p1 values (1),
partition p2 values (2)

) tablespace ts_org;

insert into pt1(prt, id) values(2, 22);

commit;

select * from pt1 partition (p1); -- no rows selected


select * from pt1 partition (p2); -- one row selected

---- there are data and online option, sql_id: 52xfvkwy3u9xx executed
alter table pt1 move partition p2 online tablespace ts_new;

---- there are no data, but online option, sql_id: 52xfvkwy3u9xx not executed
--alter table pt1 move partition p1 online tablespace ts_new;

---- there is no online option, sql_id: 52xfvkwy3u9xx not executed


--alter table pt1 move partition p2 tablespace ts_new;

3.1.3.2 IOT Partition Move

drop table test_iot cascade constraints;

create table test_iot


(prt number(9) not null,
id number(38) not null,
constraint test_iot#p1 primary key (prt, id) enable validate)
organization index tablespace ts_org
partition by list (prt)
(partition p1 values (1) tablespace ts_org,
partition p2 values (2) tablespace ts_org);

---- sql_id: 52xfvkwy3u9xx is executed


alter table test_iot move partition p1 tablespace ts_new;

3.2 Latches

The second locking mechanism is Oracle latches. They are all centralized locks, predefined by Oracle.
To optimize performance, Oracle gets them only when entering into the critical code path, and releases
them immediately when no more needed. For example, one row cache (DC cache) get incurs 3 kslgetl
latch gets in 3 separate phases, situated in 3 different locations (WHERE column) [57]. One result cache
row insert triggers 4 ksl get shared latch latch gets (4 RC latch gets are one S Mode(8), and three X
Mode(16)) [49]. Both Blogs are studying latch activities by constructing latch State Transition Diagram.
Hence latch is interruptive (or intermediate), whereas lock is continuous in its life cycle.

In Oracle, latches and their managed targets are pre-defined. To reduce contentions, application can
either avoid touching latches, for example, nls characterset query (to be discussed in section 3.2.1), or
bypass expensive one, for example, shared latch in X mode when result cache invalidation/modification,
or decrease the occurrences of latch contention, for example, CBC latch collision (to be discussed in
section 3.2.2).

At first, let us get an overview of all latches in the entire database. The following query lists all of them, together with the number of respective children and the ratio to the total:

select l.name, count(*) cnt, round(100*ratio_to_report(count(*)) over (), 2) ratio


from v$latch l, v$latch_children k
where l.latch# = k.latch#(+)
group by l.name order by cnt desc, l.name;

NAME CNT RATIO

--------------------------------- ------ ------
cache buffers chains 32768 75.38
simulator hash latch 2048 4.71
...
In memory undo latch 118 0.27
...
row cache objects 56 0.13
...
shared pool 7 0.02
...
Result Cache: RC Latch 1 0
...
kokc descriptor allocation latch 1 0

770 rows selected.

Even though the cache buffers chains (CBC) latch accounts for the majority of all latches (> 75%) according to the above output, ”latch: cache buffers chains” is still often found among the top latch wait events.

Note: all tests are done in Oracle 12.1.0.2.0.

3.2.1 latch: row cache objects

In this example, we look at a latch contention seen in real applications when querying nls_database_parameters. In fact, it is a ”latch: row cache objects” contention on the dictionary cache entry dc_props (see Blog: nls_database_parameters, dc_props, latch: row cache objects [50] for more discussion of such latches).

At first, find the v$latch_children.child# used by v$rowcache.cache#; then run a query on nls_database_parameters and collect statistics before and after the test. The result is a mapping between row cache gets and latch gets.

select r.kqrstcid cache#, r.kqrsttxt parameter, c.child# --, r.*, c.*


from x$kqrst r, v$latch_children c
where r.kqrstcln = c.child#
and r.kqrsttxt = ’dc_props’
and c.name = ’row cache objects’;

CACHE# PARAMETER CHILD#


------- --------- -------
15 dc_props 18

select r.cache#, r.type, r.parameter, r.gets, r.count, c.name, c.child#, c.gets latch_gets
from v$rowcache r, v$latch_children c
where r.parameter = ’dc_props’ and c.name = ’row cache objects’ and c.child# = 18;

CACHE# TYPE PARAMETER GETS COUNT NAME CHILD# LATCH_GETS


------- ------ --------- -------- ------ ----------------- ------- -----------
15 PARENT dc_props 140920 60 row cache objects 18 422820

select value from nls_database_parameters where parameter = ’NLS_CHARACTERSET’;


VALUE
--------
AL32UTF8

select r.cache#, r.type, r.parameter, r.gets, r.count, c.name, c.child#, c.gets latch_gets
from v$rowcache r, v$latch_children c
where r.parameter = ’dc_props’ and c.name = ’row cache objects’ and c.child# = 18;

CACHE# TYPE PARAMETER GETS COUNT NAME CHILD# LATCH_GETS


------- ------ --------- -------- ------ ----------------- ------- -----------
15 PARENT dc_props 140980 60 row cache objects 18 423000

We can see that dc_props gets increased by 60 (140980 - 140920), and latch gets increased by 180 (423000 - 422820). So each row cache get requires 3 ”latch: row cache objects” gets; that is why we have 180 latch gets (the book Oracle Core [15, p. 167] wrote: there was a common pattern indicating three latch gets for each dictionary cache get).

A query on v$rowcache_parent for dc_props shows:

select cache_name, existent, count(*) cnt from v$rowcache_parent


where cache_name = ’dc_props’
group by cache_name, existent;

CACHE_NAME EXISTENT CNT


----------- --------- ----
dc_props Y 37
dc_props N 23

There are 60 dc_props rows, of which 37 are existent objects and 23 are non-existent. The total number of dc_props cache entries is v$rowcache.count = 60, corresponding to the 60 parent objects in v$rowcache_parent. (In Blog: Oracle ROWCACHE Views and Contents [54], section dc_props, there is a query listing all 60 rows and their contents.)

As we know, nls_database_parameters is defined by:

create or replace force view sys.nls_database_parameters (parameter, value) as


select name, substr (value$, 1, 64)
from x$props
where name like ’NLS%’;

and the query on nls_database_parameters is a full table scan on X$PROPS:

select value from nls_database_parameters where parameter = ’NLS_CHARACTERSET’;

----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 28 | 0 (0)| 00:00:01 |
|* 1 | FIXED TABLE FULL| X$PROPS | 1 | 28 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------

The 60 rows in dc_props generate 60 gets in v$rowcache, although the underlying x$props contains only 37 existent objects. It seems that non-existent objects require similar handling.

3.2.1.1 Latch Contention Test

Since one single-row nls_characterset select from nls_database_parameters incurs 180 latch gets, we can set up a test to demonstrate ’latch: row cache objects’ contention. In the test, we launch 2 job sessions looping on the select from nls_database_parameters, and then run a query to watch the ”latch: row cache objects” contention.

create or replace procedure nls_select(p_cnt number) as
  l_val varchar2(256 byte);
begin
  for i in 1..p_cnt loop
    select value into l_val from nls_database_parameters where parameter = 'NLS_CHARACTERSET';
  end loop;
end;
/

create or replace procedure nls_select_jobs(p_job_cnt number, p_cnt number) as
  l_job_id pls_integer;
begin
  for i in 1.. p_job_cnt loop
    dbms_job.submit(l_job_id, 'begin while true loop nls_select('||p_cnt||'); end loop; end;');
  end loop;
  commit;
end;
/

Start the test:

SQL > exec nls_select_jobs(2, 1e9);

column p1text format a10
column p2text format a10
column p3text format a10
column event format a25

SQL > select sid, event, p1text, p1, p1raw, p2text, p2, p2raw, p3text, p3, p3raw
from v$session where program like '%(J0%';

SID EVENT P1TEXT P1 P1RAW P2TEXT P2 P2RAW P3TEXT P3 P3RAW
----- ------------------------- -------- ---------- ---------------- ------- ---- ---------------- ------- --- -----
6 latch: row cache objects address 6376016176 000000017C0A4930 number 411 000000000000019B tries 0 00
287 latch: row cache objects address 6376016176 000000017C0A4930 number 411 000000000000019B tries 0 00

-- After the test, stop all jobs by clean_jobs:
-- SQL > exec clean_jobs;

(Note: procedure clean_jobs is script 1.2.4)

Execute the query below to display the top latch misses and their locations; they are kqrpre and kqreqd.
The sleep_count and wtr_slp_count pair in each row describes the Holder-Waiter logic: in total, the sum of
sleep_count is close to the sum of wtr_slp_count.

column PARENT_NAME format a30
column WHERE format a20

SQL > select * from (select * from v$latch_misses order by sleep_count desc)
where rownum <= 5;

PARENT_NAME WHERE NWFAIL_COUNT SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT LOCATION
---------------------------- ---------------- ------------ ----------- ------------- -------------- ----------------
row cache objects kqrpre: find obj 0 64009 94027 0 kqrpre: find obj
row cache objects kqreqd: reget 0 61416 30578 0 kqreqd: reget
row cache objects kqreqd 0 35943 36768 0 kqreqd
space background task latch ktsj_grab_task 0 4 4 0 ktsj_grab_task
call allocation ksuxds 0 3 3 0 ksuxds

(Note: the WHERE column is obsolete; it is always equal to the value in LOCATION)
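To verify the Holder-Waiter pairing, a simple aggregate (a sketch) can compare the two totals for this latch; the two sums should be close to each other, since every waiter sleep is attributed to some holder location:

select sum(sleep_count) total_sleep_count, sum(wtr_slp_count) total_wtr_slp_count
from v$latch_misses
where parent_name = 'row cache objects';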

It seems that Oracle realized the performance impact of "latch: row cache objects". Starting from
12cR2, "latch: row cache objects" (12cR1) is replaced by "row cache mutex". However, tests showed that
the mutex substitution has not brought an obvious improvement in this case (see Blog: row
cache mutex in Oracle 12.2.0.1.0 [53]). In general, though, performance should improve with the new mutex
implementation.

In fact, to avoid dc_props latch contention, we can run the equivalent query:

select value from v$nls_parameters where parameter = 'NLS_CHARACTERSET';

It gives the same result since the character sets (nls_characterset, nls_nchar_characterset) are unique in
each database; they are fixed when creating (or migrating) the database and will never change (see
Oracle MOS: The Priority of NLS Parameters Explained (Where To Define NLS Parameters) (Doc ID
241047.1) for a detailed description of NLS parameters).
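For example, the looping procedure above could be rewritten against v$nls_parameters (a sketch; nls_select_v is a hypothetical name):

create or replace procedure nls_select_v(p_cnt number) as
  l_val varchar2(256 byte);
begin
  for i in 1..p_cnt loop
    -- v$nls_parameters is a fixed view and does not go through dc_props
    select value into l_val from v$nls_parameters where parameter = 'NLS_CHARACTERSET';
  end loop;
end;
/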

3.2.1.2 Deep into latch: row cache objects

To continue our demo, we will manually create 'latch: row cache objects' blocking, and then print out
the code path. At first, open two Sqlplus sessions and one Unix window. In the Unix window, start a Dtrace
script which stops the first session (sid 123, spid 3456) before it releases the held 'latch: row cache objects',
so that it becomes a latch holder; then we print out the call stack.

$ > dtrace -w -n \
'pid$target::kqreqd:entry /execname == "oracle"/ {self->rin = 1;}
pid$target::kslgetl:return /execname == "oracle" && self->rin > 0 /
{@[pid, ustack(5, 0)] = count(); stop(); exit(0);}
' -p 3456

-- Blocker SID = 123 --

oracle`kslgetl+0x185
oracle`kkdlpExecSql+0x20c
oracle`kkdlpftld+0x147
oracle`qerfxFetch+0x125f
oracle`opifch2+0x188b

// one can also use:
// dtrace -w -n 'pid$target::kqrLockAndPinPo:entry {@[pid, ustack(5, 0)] = count(); stop(); exit(0);}' -p 3456

$ > pstack 3456

0000000001ec38f8 kqreqd () + d8
00000000028ee50c kkdlpExecSql () + 20c
000000000757a7f7 kkdlpftld () + 147
000000000204467f qerfxFetch () + 125f
0000000001ee765b opifch2 () + 188b

SQL (123) > exec nls_select(1);

In the second session (sid 789, spid 5678), run the same statement; it is blocked. In the Unix window, print
out its call stack:

SQL (789) > exec nls_select(1);

$ > pstack 5678

-- Waiter SID = 789 --

fffffd7ffc9d3e3b semsys (2, 1000001f, fffffd7fffdf0af8, 1, 124a)
0000000001ab90f5 sskgpwwait () + 1e5
0000000001ab8c95 skgpwwait () + c5
0000000001aaae69 kslges () + 5b9
0000000001ebd12e kqrpre1 () + 72e
00000000028ee49b kkdlpExecSql () + 19b
000000000757a7f7 kkdlpftld () + 147
000000000204467f qerfxFetch () + 125f
0000000001ee765b opifch2 () + 188b

Open a third Sqlplus session, run a query to display latch blocking details:

select sid, event, p1text, p1, p1raw, p2text, p2, p2raw, p3text, p3, p3raw, blocking_session, final_blocking_session
from v$session s where sid in (123, 789);

SID EVENT P1TEXT P1 P1RAW P2TEXT P2 P2RAW P3TEXT P3 P3RAW bs fbs
--- ---------------------------- ---------- ---------- --------- ------ ---- ------ ------ -- ----- --- ---
123 SQL*Net message from client driver id 1413697536 054435000 #bytes 1 000001 0 00
789 latch: row cache objects address 6376016176 17C0A4930 number 411 00019B tries 0 00 123 123

(Note: shortcut bs for BLOCKING_SESSION, shortcut fbs for FINAL_BLOCKING_SESSION)

3.2.2 CBC Latch Hash Collision

Oracle has two hidden parameters for buffer cache configuration, based on which we can calculate the
number of buckets protected by one CBC latch:

_db_block_hash_latches
Number of database block hash latches, default 32768
_db_block_hash_buckets
Number of database block hash buckets, default 1048576

buckets per latch, default 1048576/32768=32
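The two values can be read directly from the instance, as in this sketch querying x$ksppi/x$ksppcv (hidden parameters; SYS access required):

select n.ksppinm parameter, v.ksppstvl value
from x$ksppi n, x$ksppcv v
where n.indx = v.indx
and n.ksppinm in ('_db_block_hash_latches', '_db_block_hash_buckets');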

Since the default number of CBC latches is 32768 (even though that amounts to 75% of all latches), any DB
with more than 32768 buffers must have two blocks protected by the same latch. With
db_block_size = 8192, every DB with a buffer cache bigger than 256 MB will undergo CBC latch hash
collisions (nowadays it is hard to find a DB with less than 256 MB buffer cache), so CBC latch collision
is a ubiquitous phenomenon. If we additionally consider undo blocks, CR blocks and Oracle metadata (for
example, obj$, user$), there are even more collisions in the buffer cache.

In the following test, we will show CBC latch hash collisions between two unrelated objects. This may
help us understand why CBC latch contention occasionally occurs when we make updates on unrelated
objects.

For our test, we create a parent table and a child table; each row in the parent table groups 1000 rows in the
child table. Then we select them with the most popular join method: nested loops. At the same time,
we update another unrelated noise table.

We will perform 3 test cases. On the parent/child tables we execute only queries. On the noise table we first
test a query, then DML in one session, and finally DML in multiple sessions.

Case-1 . multi query sessions on parent/child tables and one query session on noise table
Case-2 . multi query sessions on parent/child tables and one DML session on noise table
Case-3 . multi query sessions on parent/child tables and multi DML sessions on noise table

3.2.2.1 Test Setup

---==================== Test Setup =====================---

drop table cbc_tab_parent purge;
create table cbc_tab_parent
as select trunc((level-1)/1e3) grp_id, mod((level-1), 1e3) id, level seq from dual connect by level <= 1e6;
create unique index cbc_tab_parent#ids on cbc_tab_parent(grp_id, id);

drop table cbc_tab_child purge;
create table cbc_tab_child
as select (level-1) id, rpad('CBC_TAB_CHILD', 100, 'X') txt, level seq from dual connect by level <= 1e5;
create unique index cbc_tab_child#id on cbc_tab_child(id) reverse; -- reverse index

drop table cbc_noise purge;
create table cbc_noise
as select level x, rpad('CBC_NOISE', 100, 'X') y, level seq from dual connect by level < 1e6;
create index cbc_noise#i#1 on cbc_noise(x, y);

exec dbms_stats.gather_table_stats(null, 'CBC_TAB_PARENT', cascade=> true);
exec dbms_stats.gather_table_stats(null, 'CBC_TAB_CHILD', cascade=> true);
exec dbms_stats.gather_table_stats(null, 'CBC_NOISE', cascade=> true);

select object_name, data_object_id, blocks, round(bytes/1024/1024) mb
from dba_objects o, dba_segments s
where o.object_name = s.segment_name
and o.object_name in ('CBC_TAB_PARENT', 'CBC_TAB_PARENT#IDS'
                     ,'CBC_TAB_CHILD', 'CBC_TAB_CHILD#ID'
                     ,'CBC_NOISE', 'CBC_NOISE#I#1');

-- OBJECT_NAME DATA_OBJECT_ID BLOCKS MB
-- ------------------ -------------- ------ ---
-- CBC_TAB_PARENT 2418154 2560 20
-- CBC_TAB_PARENT#IDS 2418155 2560 20
-- CBC_TAB_CHILD 2418156 1664 13
-- CBC_TAB_CHILD#ID 2418157 256 2
-- CBC_NOISE 2418158 16384 128
-- CBC_NOISE#I#1 2418159 17408 136

create or replace procedure cbc_select(p_cnt number, p_job_no number, p_loop number := 10) as
  l_val varchar2 (1000 byte);
  l_stmt varchar2 (1000 byte);
  type num_tab is table of number(10);
  l_num_tab num_tab := new num_tab();
  l_x number := 1e6;
begin
  -- select "c.id + p_job_no" to make stmt different to avoid "cursor: pin S" before CBC Latch
  dbms_session.set_identifier('cbc_select_'||p_job_no);
  l_stmt := q'[
    with sq as (select level n from dual connect by level <= ]' || p_loop || q'[)
    select /*+ leading(sq p c) use_nl(p c) index(p cbc_tab_parent#ids) index(c cbc_tab_child#id) */
           c.id +]' || p_job_no || q'[
    from sq, cbc_tab_parent p, cbc_tab_child c
    where p.grp_id = 3
    and p.id = c.id]';

  for i in 1..p_cnt loop
    execute immediate l_stmt bulk collect into l_num_tab;
  end loop;
end;
/

-- exec cbc_select(1, 1);

create or replace procedure cbc_select_jobs(p_job_cnt number, p_cnt number) as
  l_job_id pls_integer;
begin
  for i in 1.. p_job_cnt loop
    dbms_job.submit(l_job_id, 'begin while true loop cbc_select('||p_cnt||','||i||'); end loop; end;');
  end loop;
  commit;
end;
/

-- exec cbc_select_jobs(12, 1e3);

create or replace procedure cbs_noise_select_job as
  l_stmt varchar2 (1000 byte);
  l_job_id pls_integer;
begin
  l_stmt := q'[
    declare
      l_cnt number;
    begin
      dbms_session.set_identifier('cbs_noise_select');
      select /*+ leading(a) use_nl(a b c) index(b cbc_noise#i#1) index(c cbc_noise#i#1) */
             count(c.seq) into l_cnt
      from (select level n from dual connect by level <= 1e6) a, cbc_noise b, cbc_noise c
      where b.x = c.x;
    end;]';
  dbms_job.submit(l_job_id, l_stmt);
  commit;
end;
/

-- exec cbs_noise_select_job;

create or replace procedure cbc_noise_update(p_loop_cnt number, p_x number, p_sleep number) as
begin
  dbms_session.set_identifier('cbc_noise_update_x_'||p_x);
  for i in 1..p_loop_cnt loop
    update cbc_noise set y = 'make_noise'||i where x = p_x;
    if p_sleep > 0 then
      dbms_lock.sleep(p_sleep);
    end if;
    commit;
  end loop;
end;
/

-- exec cbc_noise_update (1, 1, 0.00);

create or replace procedure cbc_noise_update_job(p_loop_cnt number, p_x number, p_sleep number) as
  l_job_id pls_integer;
begin
  dbms_job.submit(l_job_id,
    'begin
       while true loop cbc_noise_update('||p_loop_cnt||','||p_x||','||p_sleep||'); end loop;
     end;');
  commit;
end;
/

3.2.2.2 CBC Collision Case-1

At first, to trigger CBC collisions, we start 2 query sessions on cbc_tab_child and cbc_tab_parent, and
one query session on cbc_noise (for procedure clean_jobs, see script 1.2.4):

---==================== Produce CBC Hash Collision =====================---

alter system flush buffer_cache;
exec clean_jobs;
exec cbc_select_jobs(2, 1e3);
exec cbs_noise_select_job;

Then run the query below to find collision latches between cbc_tab_child and cbc_noise data blocks (table
and index).

select * from (
with sq as (select object_name, data_object_id
            from dba_objects where object_name like 'CBC_%')
    ,bh as (select hladdr, obj, file#, dbablk, sum(tch) tch
            from x$bh group by hladdr, obj, file#, dbablk)
select hladdr cbc_latch_addr
      ,sum(tch) tch
      ,listagg(tch || '-' || obj || '(' || object_name || ')/' || file# || '/' ||dbablk, ';')
         within group (order by tch desc) tch_list -- "tch-obj(name)/file/blk_list"
      ,count(*) blk_cnt
from bh, sq
where bh.obj = sq.data_object_id
and tch > 0
-- and (hladdr like '%18BB9FB40' or hladdr like '%18BC7FD00')
group by hladdr
order by tch desc)
where 1=1
and (tch_list like '%CBC_TAB_CHILD%');
-- and (tch_list like '%CBC_TAB_CHILD%CBC_NOISE%' or tch_list like '%CBC_NOISE%CBC_TAB_CHILD%');

-- 122 rows selected.

There are 122 rows returned:

(1). 26 have 3 data blocks under the same CBC latch.
(2). 62 have 2 data blocks under the same CBC latch.
(3). 34 have only one single data block.

So there are 88 (= 26 + 62) latches with CBC latch collisions. (There are 122 rows because 122
CBC_TAB_CHILD#ID index blocks are involved (1 branch and 121 leaves), to be discussed in section 3.2.2.5.)

From the 88 collision latches, we pick the two below to continue our investigation:

CBC_LATCH_ADDR TCH TCH_LIST BLK_CNT
----------------- --- ------------------------------------------ -------
000000018BB9FB40 107 59-2418157(CBC_TAB_CHILD#ID)/1548/2983101;
48-2418159(CBC_NOISE#I#1)/1548/2786683 2

000000018BC7FD00 155 59-2418157(CBC_TAB_CHILD#ID)/1548/2983098;
48-2418159(CBC_NOISE#I#1)/1548/2768969;
48-2418159(CBC_NOISE#I#1)/1548/2786680 3

Note: each row has 4 columns: (CBC child_latch, total tch count, tch_list, protected blocks)
each tch_list is a list of tuples separated by ";"
each tuple is in format: touch_count - data_object_id(object_name)/file#/block#

The first collision involves two data blocks from two different indexes on two different tables:

CBC_TAB_CHILD#ID dbablk 2983101
CBC_NOISE#I#1 dbablk 2786683

but they are hashed to the same CBC latch 000000018BB9FB40.

The second collision involves three data blocks from two different indexes on two different tables:

CBC_TAB_CHILD#ID dbablk 2983098
CBC_NOISE#I#1 dbablk 2768969
CBC_NOISE#I#1 dbablk 2786680

and all three are hashed to the same CBC latch 000000018BC7FD00.

At this moment, if we run the queries below,

select * from v$latch where name like 'cache buffers chains';
select * from v$latch_children where addr = '000000018BB9FB40';
select * from v$latch_misses where parent_name = 'cache buffers chains' order by sleep_count desc;

we can see that both latch misses and spin_gets are increasing, but latch sleeps stays the same. That
means all latch misses are resolved by spin gets. spin_gets is counted on CPU, sleeps is counted as wait. If there
are no latch sleeps, the latches are not visible among the AWR top wait events.
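A small sampling sketch (assuming the same child latch address as above, and serveroutput enabled) makes the deltas visible directly:

declare
  l_g1 number; l_m1 number; l_sp1 number; l_sl1 number;
  l_g2 number; l_m2 number; l_sp2 number; l_sl2 number;
begin
  select gets, misses, spin_gets, sleeps into l_g1, l_m1, l_sp1, l_sl1
  from v$latch_children where addr = '000000018BB9FB40';
  dbms_lock.sleep(10);
  select gets, misses, spin_gets, sleeps into l_g2, l_m2, l_sp2, l_sl2
  from v$latch_children where addr = '000000018BB9FB40';
  dbms_output.put_line('gets delta:      ' || (l_g2  - l_g1));
  dbms_output.put_line('misses delta:    ' || (l_m2  - l_m1));
  dbms_output.put_line('spin_gets delta: ' || (l_sp2 - l_sp1));
  dbms_output.put_line('sleeps delta:    ' || (l_sl2 - l_sl1)); -- expected 0 in Case-1
end;
/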

This is the first case of CBC latch collisions, which are caused by multi-query sessions.

3.2.2.3 CBC Collision Case-2

Now we will trigger the second case of CBC latch collision, in which collision occurs between query
sessions and one DML session.

Look at the first collision latch 000000018BB9FB40:

CBC_LATCH_ADDR TCH TCH_LIST BLK_CNT
----------------- --- ------------------------------------------ -------
000000018BB9FB40 107 59-2418157(CBC_TAB_CHILD#ID)/1548/2983101;
48-2418159(CBC_NOISE#I#1)/1548/2786683 2

Note: the second line is about index CBC_NOISE#I#1 block 2786683, with touch_count = 48.
block 2786683 contains exactly 61 rows (see the later query).
on average, each CBC_NOISE#I#1 block contains 1,000,000/17408 = 57.45 rows.

It contains index block CBC_NOISE#I#1 dbablk 2786683. We run a query to list all table rows
pointed to by this index block.

select object_id from dba_objects where object_name = 'CBC_NOISE#I#1';
-- 2418159

select block#, x from (
  select x, y
        ,dbms_rowid.rowid_block_number(sys_op_lbid (2418159, 'L', t.rowid)) block#
  from cbc_noise t)
where block# in (2786683);

BLOCK# X
-------- ------
2786683 599084
2786683 599085
...
2786683 599144

61 rows selected.

The output shows that there are 61 rows indexed by the above index block; at the moment of sampling,
the touch count is 48.

At first, we start 12 query sessions on cbc_tab_child and watch the CBC latch activity:

---==================== Test Run =====================---

alter system flush buffer_cache;
exec clean_jobs;
exec cbc_select_jobs(12, 1e3);

select * from v$latch where name like 'cache buffers chains';
select * from v$latch_children where addr = '000000018BB9FB40';
select * from v$latch_misses where parent_name = 'cache buffers chains' order by sleep_count desc;

As in Case-1, there are slight latch misses, but they are all resolved by spin gets, and latch sleeps
are not increasing.

Now pick one row from the 61 rows found in CBC_NOISE#I#1 dbablk 2786683, for example x=599084, and
start one DML session to update this single row of cbc_noise.

--599084 is one row from CBC_NOISE#I#1 dbablk 2786683

exec cbc_noise_update_job(1e8, 599084, 0.00); -- index block collision, latch.sleep increasing

Immediately we can observe latch misses and sleeps increasing by:

select * from v$latch_children where addr = '000000018BB9FB40';

and the latch miss locations, which noticeably appear on kcbgtcr: fast path exam:

select * from v$latch_misses where parent_name = 'cache buffers chains' order by sleep_count desc;

After the test, all started job sessions can be stopped by calling clean_jobs (see script 1.2.4).

3.2.2.4 CBC Collision Case-3

Now we look the second CBC collision latch 000000018BC7FD00:

CBC_LATCH_ADDR TCH TCH_LIST BLK_CNT
----------------- --- ------------------------------------------ -------
000000018BC7FD00 155 59-2418157(CBC_TAB_CHILD#ID)/1548/2983098;
48-2418159(CBC_NOISE#I#1)/1548/2768969;
48-2418159(CBC_NOISE#I#1)/1548/2786680 3

which involves 3 index blocks, two of which are different CBC_NOISE#I#1 blocks.

CBC_TAB_CHILD#ID dbablk 2983098
CBC_NOISE#I#1 dbablk 2768969
CBC_NOISE#I#1 dbablk 2786680

As in Case-2, first list the rows in CBC_NOISE#I#1 dbablk 2768969 and 2786680:

select block#, x from (
  select x, y
        ,dbms_rowid.rowid_block_number(sys_op_lbid (2418159, 'L', t.rowid)) block#
  from cbc_noise t)
where block# in (2768969, 2786680);

BLOCK# X
-------- ------
2768969 49840
2768969 49841
2768969 49842
...

2786680 598901
2786680 598902
2786680 598903

122 rows selected.

We start a similar test, but with two DML sessions updating two index blocks of CBC_NOISE#I#1
protected by the same CBC latch. This is our third case of CBC collisions, caused by multiple
query sessions and multiple DML sessions.

---==================== Test Run =====================---

alter system flush buffer_cache;
exec clean_jobs;
exec cbc_select_jobs(12, 1e3);

--49840 is one row from CBC_NOISE#I#1 dbablk 2768969
exec cbc_noise_update_job(1e8, 49840, 0.00);

--598901 is one row from CBC_NOISE#I#1 dbablk 2786680
exec cbc_noise_update_job(1e8, 598901, 0.00);

select * from v$latch_children where addr = '000000018BC7FD00';

Compared to Case-2 (latch_children addr=000000018BB9FB40), Case-3 (latch_children addr=000000018BC7FD00)
shows doubled misses, sleeps, spin_gets and wait time.

After the test, all started job sessions can be stopped by calling clean_jobs (see script 1.2.4).

In all the above tests, we only investigated data blocks of "static" objects. If the application makes use of
global temporary tables (GTT) for frequent DML, the CBC collision is hard to track because GTT
data blocks are session private and even more dynamically allocated. More complicated still, undo blocks are
dynamically allocated in the buffer cache when executing DML, hence their BH addresses and mapped hash
buckets are also dynamically managed by CBC latches. When one undo block is accessed for CR cloning
by multiple sessions, the CBC latches are requested concurrently by those sessions. Consequently,
latch: cache buffers chains contention becomes more spontaneous and harder to reproduce.

In Case-2 and Case-3, we only tested DML updates with cbc_noise_update_job on cbc_noise, so that
the contended cbc_noise#i#1 index blocks do not change; hence the contended CBC child latches
remain the same while the test is running. That makes the test reproducible and the contention observation points
fixed. However, in real applications, DML insert, delete and merge will make the CBC collisions a moving
target.

3.2.2.5 CBC Latch Hash Collision, Reverse Index and Nested Join

In our test code, index cbc_tab_child#id is created as reverse on purpose. If we run cbc_select(1,
1) with Sql trace turned on, the trace file shows that to get 10,000 rows, we make 10,045 logical reads, of which
10,002 are on cbc_tab_child#id.

Sql > exec cbc_select(1, 1);

with sq as (select level n from dual connect by level <= 10)
select /*+ leading(sq p c) use_nl(p c) index(p cbc_tab_parent#ids) index(c cbc_tab_child#id) */
       c.id +1
from sq, cbc_tab_parent p, cbc_tab_child c
where p.grp_id = 3
and p.id = c.id

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 0.01 0.01 0 10045 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 0.01 0.01 0 10045 0 10000

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 49 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
10000 NESTED LOOPS (cr=10045 pr=0 pw=0 time=14880 us cost=6 size=13000 card=1000)
10000 NESTED LOOPS (cr=43 pr=0 pw=0 time=2182 us cost=5 size=8000 card=1000)
10 VIEW (cr=0 pr=0 pw=0 time=45 us cost=2 size=0 card=1)
10 CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=41 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=1 us cost=2 size=0 card=1)
10000 INDEX RANGE SCAN CBC_TAB_PARENT#IDS (cr=43 pr=0 pw=0 time=1153 us cost=3 size=8000 card=1000)(object id 2418155)
10000 INDEX UNIQUE SCAN CBC_TAB_CHILD#ID (cr=10002 pr=0 pw=0 time=10104 us cost=1 size=5 card=1)(object id 2418157)

Looking at the xplan, the starting outer table (connect by level <= 10) is used to regulate the number of
nested loop iterations; it is fixed as 10 in the code.

The driving table cbc_tab_parent returns 1000 id values (0 to 999) from index cbc_tab_parent#ids for
grp_id = 3:

select min(id), max(id), count(distinct id) from cbc_tab_parent where grp_id = 3;

MIN(ID) MAX(ID) COUNT(DISTINCTID)
------- ------- -----------------
0 999 1000

Because of the "index range scan", the 1000 fetched id values are sequentially ordered from 0 to 999. Ten
loops over cbc_tab_parent return 10,000 id values (10 chunks, each with 1000 well sorted id values). When
they are used to probe the inner table by a nested loops join with index unique scan on cbc_tab_child#id, fetch-
ing the 10,000 id values requires 10,000 consistent gets, one block get of cbc_tab_child#id per value,
just because the index is reversed. For a reverse index, adjacent id values are located in non-adjacent
index blocks. Even though the selected 10,000 id values have only 1000 distinct values, the execution
statistics show cr=10002 logical reads. Perhaps we can introduce a concept of runtime Index Clustering
Factor to describe this behaviour.
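The dictionary clustering_factor captures only table-block adjacency along the index order; the effect observed here is its index-leaf analogue. Still, a quick look (a sketch) shows how the reverse index scatters consecutive keys, since for a reverse index the clustering_factor is typically close to num_rows:

select index_name, blevel, leaf_blocks, clustering_factor, num_rows
from user_indexes
where index_name = 'CBC_TAB_CHILD#ID';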

If we flush the buffer cache and re-run the test with Sql trace, the output looks like:

Sql > alter system flush buffer_cache;

Sql > exec cbc_select(1, 1);

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 0.02 0.02 128 10045 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 0.02 0.02 128 10045 0 10000

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 49 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
10000 NESTED LOOPS (cr=10045 pr=128 pw=0 time=18359 us cost=6 size=13000 card=1000)
10000 NESTED LOOPS (cr=43 pr=6 pw=0 time=2524 us cost=5 size=8000 card=1000)
10 VIEW (cr=0 pr=0 pw=0 time=45 us cost=2 size=0 card=1)
10 CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=42 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=1 us cost=2 size=0 card=1)
10000 INDEX RANGE SCAN CBC_TAB_PARENT#IDS (cr=43 pr=6 pw=0 time=1422 us cost=3 size=8000 card=1000)(object id 2418155)
10000 INDEX UNIQUE SCAN CBC_TAB_CHILD#ID (cr=10002 pr=122 pw=0 time=13290 us cost=1 size=5 card=1)(object id 2418157)

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file sequential read 128 0.00 0.00

The last rowsource, cr=10002 pr=122, shows 122 physical reads and 10,002 consistent gets on cbc_tab_child#id.
Therefore only 122 distinct index blocks (1 branch and 121 leaves, to be discussed later) are fetched. On
average, each leaf block is accessed about 83 (= 10,000/121) times in CR mode.

The 6 other physical reads are from cbc_tab_parent#ids: 1 from its root, 1 from a branch, 4 from leaf blocks
(its blevel is 2; a block dump shows kdxcolev 2 for the root, kdxcolev 1 for branches).

In total, there are 128 (122+6) physical reads, all fulfilled by "db file sequential read".

With the following queries, we can see that the 1000 id values (0 to 999) in cbc_tab_child are distributed over 121
cbc_tab_child#id leaf blocks, all under one branch block (i.e. the root block, because index blevel = 1).

select object_id from dba_objects where object_name = 'CBC_TAB_CHILD#ID';

OBJECT_ID
---------
2418157
select count(distinct block#), count(*) from (
  select id
        ,dbms_rowid.rowid_block_number(sys_op_lbid (2418157, 'L', t.rowid)) block#
  from cbc_tab_child t)
where id between 0 and 999;

COUNT(DISTINCTBLOCK#) COUNT(*)
--------------------- --------
121 1000

-- index root branch block. index blevel is 1. HEX value referred in later 10200 trace.
select header_block+1 root_branch_block, to_char(header_block+1, 'xxxxxxxx') root_branch_block_hex
from dba_segments v
where v.segment_name = 'CBC_TAB_CHILD#ID';

ROOT_BRANCH_BLOCK ROOT_BRANCH_BLOCK_HEX
----------------- ---------------------
2800907 2abd0b

If we want to dig further into the single rowsource line on cbc_tab_child#id from the above xplan
(copied here again):

Rows Row Source Operation
------- ---------------------------------------------------
10000 INDEX UNIQUE SCAN CBC_TAB_CHILD#ID (cr=10002 pr=122 pw=0 time=13290 us cost=1 size=5 card=1)(object id 2418157)

we can use the consistent read trace event 10200 discussed in section 1.2, or Dtrace, to get block reading details.
For example, perform a 10200 trace:

Sql > alter session set tracefile_identifier = "trace_10200_1";
Sql > alter session set events '10200 trace name context forever, level 10';
Sql > exec cbc_select(1, 1);
Sql > alter session set events '10200 trace name context off';

It shows that the index cbc_tab_child#id root branch block is accessed only 2 times, as follows:

ktrgtc2(): started for block <0x07cf : 0x002abd0b> objd: 0x0024e5ed
ktrget2(): started for block <0x07cf : 0x002abd0b> objd: 0x0024e5ed

Legend:
0x07cf: decimal 1999, v$tablespace.ts#
0x002abd0b: decimal 2800907, index CBC_TAB_CHILD#ID root branch block#
0x0024e5ed: decimal 2418157, CBC_TAB_CHILD#ID data_object_id

All the remaining 10,000 fetches in index cbc_tab_child#id are from the 121 leaf blocks: in total 10,002 consistent
gets on index cbc_tab_child#id, as reported in the Sql trace file.

If we rebuild the index as noreverse and make the same Sql trace, it shows 185 logical reads, instead
of the 10,045 in the case of the reverse index.

Sql > alter index cbc_tab_child#id rebuild noreverse;

Sql > exec cbc_select(1, 1);

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 0.01 0.01 0 185 0 10000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.01 0.01 0 185 0 10000

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 49 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
10000 NESTED LOOPS (cr=185 pr=0 pw=0 time=8223 us cost=6 size=13000 card=1000)
10000 NESTED LOOPS (cr=43 pr=0 pw=0 time=2078 us cost=5 size=8000 card=1000)
10 VIEW (cr=0 pr=0 pw=0 time=44 us cost=2 size=0 card=1)
10 CONNECT BY WITHOUT FILTERING (cr=0 pr=0 pw=0 time=43 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=1 us cost=2 size=0 card=1)
10000 INDEX RANGE SCAN CBC_TAB_PARENT#IDS (cr=43 pr=0 pw=0 time=1027 us cost=3 size=8000 card=1000)(object id 2418155)
10000 INDEX UNIQUE SCAN CBC_TAB_CHILD#ID (cr=142 pr=0 pw=0 time=3613 us cost=1 size=5 card=1)(object id 2418157)

The xplan looks the same as with the reverse index, but the logical reads drop from 10,045 down to 185, a
factor of more than 50.

Repeating the same 3 test cases with the noreverse index, AWR and ASH reports showed that
logical reads are reduced 25 times, while executions (throughput) are doubled. The Top Row Source
changes to nested loops with the noreverse index, from index - unique scan with the reverse
index.

3.2.3 Latch Pseudo Code

In most popular Oracle books, we can find latch pseudo code that tries to describe the Oracle
latch algorithm. They all share some assumptions about latch usage counting:

-. misses is a subset of gets
-. spin gets is a subset of misses
-. sleeps is a subset of spin gets

In the normal case, the above assumptions match running system statistics. But under contention, for
example in the following two AWR "Latch Sleep Breakdown" sections picked from a heavily loaded 12c system,
the pseudo code can hardly explain the figures. Yet that is exactly the occasion which requires problem
solving.
Latch Name Get Requests Misses Sleeps Spin Gets
cache buffers chains 15,134,617,472 176,939,147 1,040,507 314,665,205
row cache objects 101,225,008 4,075,626 429,314 3,673,739

Table 3.1: Recurrent Spin gets

Latch Name Get Requests Misses Sleeps Spin Gets
row cache objects 55,852,812 3,505,131 320,033 3,224,234
cache buffers chains 1,722,527,053 2,564,938 2,819,003 1,026,077

Table 3.2: Recurrent Sleeps

In Table 3.1, cache buffers chains shows:

spin_gets(314,665,205) > misses(176,939,147)

which indicates the existence of recurrent spin gets.

In Table 3.2, cache buffers chains shows:

sleeps(2,819,003) > misses(2,564,938) > spin_gets(1,026,077)

which signifies the existence of recurrent sleeps, and of recurrent sleeps following spin gets.

Perhaps we can introduce a recurrent misses metric, approximately formulated as:

sleeps + spin_gets - misses
  = recurrent_misses
  = recurrent_sleeps + recurrent_spin_gets
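A sketch to flag latches showing recurrent misses directly from v$latch:

select name, gets, misses, spin_gets, sleeps,
       sleeps + spin_gets - misses recurrent_misses
from v$latch
where sleeps + spin_gets > misses
order by recurrent_misses desc;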

As we know, an Oracle session's response time is made up of service time and queue time. spin_gets is counted
as service time since it is on CPU, whereas sleeps is categorized as queue time since it is waiting.
Generally, latch spinning burns CPU, while latch sleeping yields CPU. Therefore, when investigating latch
contention, it is necessary to distinguish between spin_gets and sleeps. As observed, spin gets
are usually caused by frequent concurrent access, whereas sleeps are triggered by invalidations or DML
modifications.

In the case of heavy sleeps, processes are waiting, performance is degraded and CPU load drops. If
the workload is simply increased (for example, more parallel batch sessions) because of the low CPU usage, the
performance gets even worse. In such a case, investigating the root cause of the heavy latch contention should
be the first priority.

In Oracle, there are two sorts of latches: one has children (see v$latch_children), for example "cache
buffers chains"; the other has no children (an instance-wide single latch), for example "Result Cache: RC Latch"
(it seems that v$latch.name is not case sensitive). Therefore, when monitoring
sleeps and spin_gets, the number of children should be taken into account, because a single latch can
serialize the whole system.
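A sketch to see how many children each latch has (latch names absent from v$latch_children are instance-wide single latches):

select name, count(*) children
from v$latch_children
group by name
order by children desc;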

3.3 Mutexes

The last and most recent locking mechanism is the Oracle mutex. In the previous section, we saw that the latch is
an instance-wide centralized locking mechanism. In this section, we will see that the mutex is a distributed
locking mechanism, directly attached to the shared memory structure it protects. That is why
there exists v$latch (v$latch_children) for all latches, but no such central view for mutexes. A mutex
is exposed in v$db_object_cache.hash_value and lives together with its protected object.
In contrast to the pre-defined, limited number of latches, a mutex is dynamically created/erased when
requested/released, accompanying the life of its locking target.

Blog: Reducing "library cache: mutex X" concurrency with dbms_shared_pool.markhot [7] lists the top 3
differences between mutexes and latches:

(1). A mutex can protect a single structure, latches often protect many structures.

(2). A mutex get is about 30-35 instructions in the algorithm, compared to 150-200 instructions
for a latch get.

(3). A mutex is around 16 bytes in size, compared to 112-200 bytes for a latch.

It looks like a mutex is about 5 times slimmer and should be proportionally faster than a latch. More
discussion can be found in Blog: LATCHES, LOCKS, PINS AND MUTEXES [8].

In this section, we will demonstrate "library cache: mutex X" on an application context, where this heavy
wait event is observed when the application context is frequently changed. The application uses Oracle
Virtual Private Database (VPD) to control data access, with a driving application context determining
which policy group is in effect for which use case.

Note: All tests are done in Oracle 12.1.0.2 on AIX, Solaris, Linux with 6 physical processors.

3.3.1 Test

At first, set up the application context test.

create or replace context test_ctx using test_ctx_pkg;

create or replace package test_ctx_pkg is
  procedure set_val (val number);
end;
/

create or replace package body test_ctx_pkg is
  procedure set_val (val number) as
  begin
    dbms_session.set_context('test_ctx', 'attr', val);
  end;
end;
/

create or replace procedure ctx_set(p_cnt number, val number) as
begin
  for i in 1..p_cnt loop
    test_ctx_pkg.set_val(val); -- 'library cache: mutex X' on TEST_CTX
  end loop;
end;
/

create or replace procedure ctx_set_jobs(p_job_cnt number) as
  l_job_id pls_integer;
begin
  for i in 1.. p_job_cnt loop
    dbms_job.submit(l_job_id, 'begin while true loop ctx_set(100000, '||i||'); end loop; end;');
  end loop;
  commit;
end;
/

-- clean_jobs is same as in last section.

Then launch 4 parallel jobs:

exec ctx_set_jobs(4);

Watch Job sessions:

select sid, program, event, p1text, p1, p2text, p2, p3text, p3
from v$session where program like '%(J0%';

SID PROGRAM EVENT P1TEXT P1 P2TEXT P2 P3TEXT P3
----- -------------------- ------------------------- ------ ----------- ------ -------------- ------ -----------------
38 oracle@testdb (J003) library cache: mutex X idn 1317011825 value 3968549781504 where 9041305591414788
890 oracle@testdb (J000) library cache: mutex X idn 1317011825 value 163208757248 where 9041305591414874
924 oracle@testdb (J001) library cache: mutex X idn 1317011825 value 4556960301056 where 9041305591414874
1061 oracle@testdb (J002) library cache: mutex X idn 1317011825 value 3968549781504 where 9041305591414879

Pick idn (P1): 1317011825, and query v$db_object_cache:

select name, namespace, type, hash_value, locks, pins, locked_total, pinned_total
from v$db_object_cache where hash_value in (1317011825);

NAME NAMESPACE TYPE HASH_VALUE LOCKS PINS LOCKED_TOTAL PINNED_TOTAL
---------- --------------- --------------- ----------- ----------- ----------- ------------ ------------
TEST_CTX APP CONTEXT APP CONTEXT 1317011825 4 0 4 257802287

It shows that "library cache: mutex X" is on the application context test_ctx, and pinned_total increases
with each access.

Although test_ctx is a local context and its values are stored in the User Global Area (UGA), its definition
is globally protected by "library cache: mutex X".

select namespace, package, type from dba_context where namespace = 'TEST_CTX';

NAMESPACE PACKAGE TYPE
---------- ------------ ----------------
TEST_CTX TEST_CTX_PKG ACCESSED LOCALLY

3.3.2 Mutex Contention and Performance

Continuing with the above test, we can run queries to observe the mutex contention locations and their impact
on the application (sleeps and wait time).

column owner format a6
column name format a10
column property format a10
column namespace format a12
column type format a12

SQL > select owner, name, property, hash_value, locks, pins, locked_total, pinned_total
            ,executions, sharable_mem, namespace, type
      from v$db_object_cache v
      where (name in ('TEST_CTX') or hash_value in (1317011825) or property like '%HOT%');

OWNER NAME PROPERTY HASH_VALUE LOCKS PINS LOCKED_TOTAL PINNED_TOTAL EXECUTIONS SHARABLE_MEM NAMESPACE TYPE
------ -------- ---------- ---------- ----- ---- ------------ ------------ ---------- ------------ ------------ -----------
SYS TEST_CTX 1317011825 4 0 4 167495977 167495970 4096 APP CONTEXT APP CONTEXT

SQL > select * from v$mutex_sleep order by sleeps desc, location;

MUTEX_TYPE LOCATION SLEEPS WAIT_TIME
----------------- --------------------------- ---------- ----------
Library Cache kglpndl1 95 831410 59481666
Library Cache kglpin1 4 192654 146431660
Library Cache kglpnal1 90 106937 33325575

Then display mutex requesting/blocking details for each involved session:

SQL > select * from v$mutex_sleep_history order by sleep_timestamp desc, location;

MUTEX_IDENTIFIER SLEEP_TI MUTEX_TYPE GETS SLEEPS REQ_SES BLOCKING_SES LOCATION MUTEX_VALUE P1 P1RAW
---------------- -------- ------------- --------- ------ ------- ------------ ------------------------- ----------- -- ---------
1317011825 14:15:40 Library Cache 675726540 449377 7 0 kglpin1 4 00 0 176D1E860
1317011825 14:15:40 Library Cache 675726542 442641 368 0 kglpndl1 95 00 0 176D1E860
1317011825 14:15:40 Library Cache 675711683 444299 901 7 kglpndl1 95 700000000 0 176D1E860
1317011825 14:15:40 Library Cache 675709618 438207 187 0 kglpin1 4 00 0 176D1E860
1317011825 14:09:06 Library Cache 2806872 1 900 0 kglGetHandleReference 123 00 0 176D1E860

Pick the spid of one Oracle session, for example 10684, and get its call stack:

$ > pstack 10684

10684: ora_j000_testdb
fffffd7ffc9d3e3b semsys (4, e000013, fffffd7fffdf5658, 1, fffffd7fffdf5660)
0000000001ab9008 sskgpwwait () + f8
0000000001ab8c95 skgpwwait () + c5
0000000001c710d5 ksliwat () + 8f5
0000000001c70410 kslwaitctx () + 90
0000000001e6ffb0 kgxWait () + 520
000000000dd1ae6f kgxExclusive () + 1cf
00000000021cc025 kglGetMutex () + b5
000000000212400e kglpin () + 2fe
00000000026aa159 kglpnp () + 269
00000000026a71ab kgiina () + 1db
000000000dd118b9 kgintu_named_toplevel_unit () + 39
0000000007ac16a6 kzctxBInfoGet () + 746
0000000007ac38ed kzctxChkTyp () + fd
0000000007ac43f0 kzctxesc () + 510
0000000002781d9d pevm_icd_call_common () + 29d
0000000002781930 pfrinstr_ICAL () + 90
0000000001a435ca pfrrun_no_tool () + 12a
0000000001a411e0 pfrrun () + 4c0
0000000001a3fb48 plsql_run () + 288

Where semsys(4, ...) is specified in Unix syscall.h as:

semtimedop(int semid, struct sembuf *sops, size_t nsops, const struct timespec *timeout)

The above call stack shows that kgxExclusive is triggered by kglpin via kglGetMutex.

Run a small dtrace script to get performance statistics:

$ > sudo dtrace -n \
'BEGIN {self->start_wts = walltimestamp; self->start_ts = timestamp;}
pid$target::kglpndl:entry /execname == "oracle"/ { self->rc = 1; }
pid$target::kgxExclusive:entry /execname == "oracle" && self->rc == 1/ { self->ts = timestamp; }
pid$target::kgxExclusive:return /self->ts > 0/ {
  @lquant["ns"] = lquantize(timestamp - self->ts, 0, 10000, 1000);
  @avgs["AVG_ns"] = avg(timestamp - self->ts);
  @mins["MIN_ns"] = min(timestamp - self->ts);
  @maxs["MAX_ns"] = max(timestamp - self->ts);
  @sums["SUM_ms"] = sum((timestamp - self->ts)/1000000);
  @counts[ustack(10, 0)] = count();
  self->rc = 0; self->ts = 0;}
END { printf("Start: %Y, End: %Y, Elapsed_ms: %d\n", self->start_wts,
  walltimestamp, (timestamp - self->start_ts)/1000000);}
' -p 10684

dtrace: description 'BEGIN ' matched 8 probes

Start: 2017 Oct 24 14:30:02, End: 2017 Oct 24 14:31:08, Elapsed_ms: 66183

ns
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3352394

1000 |@@@@@@@@ 803168
2000 | 11598
3000 | 1484
4000 | 890
5000 | 626
6000 | 460
7000 | 315
8000 | 265
9000 | 147
>= 10000 | 2227

AVG_ns 1999
MIN_ns 777
MAX_ns 20411473
SUM_ms 4214

oracle`kgxExclusive+0x105
oracle`kglpndl+0x1fe
oracle`kglUnPin+0x101
a.out`kzctxChkTyp+0x14e
a.out`kzctxesc+0x510
a.out`pevm_icd_call_common+0x29d
a.out`pfrinstr_ICAL+0x90
oracle`pfrrun_no_tool+0x12a
oracle`pfrrun+0x4c0
oracle`plsql_run+0x288
4173574

It shows that the average mutex time is 1999 ns, the max time is about 20 ms (20,411,473 ns), and the total number of
executions is 4,173,574 over an elapsed time of 66,183 ms.

Solaris prstat -mL output shows that the process spends about 30% of its time sleeping
(SLP).
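For reference, the per-thread microstate sampling used here is (a sketch; 10684 is the spid from above, 5 is the sampling interval in seconds):

$ > prstat -mL -p 10684 5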

3.3.3 Hot Library Cache Objects

As described in Blog: Divide and conquer the "true" mutex contention [18], "library cache: mutex X" can be
alleviated by creating multiple copies of hot objects, configured by two hidden parameters:

_kgl_hot_object_copies: controls the maximum number of copies
_kgl_debug: marks hot library cache objects as candidates for cloning

The blog also describes the following Oracle subroutines (see the output of the previous v$mutex_sleep_history
query):

KGLPIN: KGL PIN heaps and load data pieces of an object
KGLPNDL: KGL PiN DeLete
KGLPNAL1: KGL PiN ALlOcate

KGLHBH1 63, KGLHDGN2 106: Invalid Password, Application Context (eg: SYS_CONTEXT)

Now we can try to configure those two hidden parameters:

SQL > alter system set "_kgl_hot_object_copies"= 255 scope=spfile;
--alter system reset "_kgl_hot_object_copies" scope=spfile;

SQL > alter system set "_kgl_debug"=
        "name='TEST_CTX' schema='SYS' namespace=21 debug=33554432",
        "name='PLITBLM' schema='PUBLIC' namespace=1 debug=33554432"
      scope=spfile;

--alter system reset "_kgl_debug" scope=spfile;

In the above configuration, the library cache object namespace/type id and name mapping can be found with the
following queries:

SQL > select distinct namespace, object_type from dba_objects v order by 1;

SQL > select distinct namespace, type# from sys.obj$ order by 1;

SQL > select distinct kglhdnsp NAMESPACE_id, kglhdnsd NAMESPACE_name from x$kglob
--where kglhdnsd in ('APP CONTEXT')
order by kglhdnsp;

SQL > select distinct kglobtyp TYPE_id, kglobtyd TYPE_name from x$kglob
--where kglobtyd in ('APP CONTEXT')
order by kglobtyp;

The public synonym (namespace=1) PLITBLM is added into _kgl_debug to show that multiple library cache objects
can be specified. PLITBLM is the package for Plsql Index TaBLe Management, i.e. Plsql Collections (Associative
Arrays, Nested Tables, Varrays). All its implementations go through the C interface.

Re-run the same test and monitor it with the same queries (for clean_jobs, see script 1.2.4):

-- Stop all Jobs
SQL > exec clean_jobs;

-- Restart DB to activate hot library cache objects
SQL > startup force

SQL > select owner, name, property, hash_value, locks, pins, locked_total, pinned_total
            ,executions, sharable_mem, namespace, type
      from v$db_object_cache v
      where (name in ('TEST_CTX') or hash_value in (1317011825) or property like '%HOT%');

OWNER NAME PROPERTY HASH_VALUE LOCKS PINS LOCKED_TOTAL PINNED_TOTAL EXECUTIONS SHARABLE_MEM NAMESPACE TYPE
------ ---------- ---------- ---------- ----- ---- ------------ ------------ ---------- ------------ ------------ ------
SYS TEST_CTX HOT 1317011825 0 0 1 0 0 0 APP CONTEXT CURSOR

SQL > exec ctx_set_jobs(4);

SQL > select sid, program, event, p1text, p1, p2text, p2, p3text, p3
from v$session where program like '%(J%';

SID PROGRAM EVENT P1TEXT P1 P2TEXT P2 P3TEXT P3
---- ------------------------ ---------- ------ --- ------ --- ------ ---
5 oracle@s5d00003 (J001) null event 0 0 0
186 oracle@s5d00003 (J004) null event 0 0 0
369 oracle@s5d00003 (J005) null event 0 0 0
902 oracle@s5d00003 (J000) null event 0 0 0

SQL > select owner, name, property, hash_value, locks, pins, locked_total, pinned_total
            ,executions, sharable_mem, namespace, type
      from v$db_object_cache v
      where (name in ('TEST_CTX') or hash_value in (1317011825) or property like '%HOT%');

OWNER NAME PROPERTY HASH_VALUE LOCKS PINS LOCKED_TOTAL PINNED_TOTAL EXECUTIONS SHARABLE_MEM NAMESPACE TYPE
----- ---------- ---------- ---------- ----- ---- ------------ ------------ ---------- ------------ ------------ -----------
SYS TEST_CTX HOT 1317011825 0 0 1 0 0 0 APP CONTEXT CURSOR
SYS TEST_CTX HOTCOPY6 1487681198 1 0 2 151394920 151394917 4096 APP CONTEXT APP CONTEXT
SYS TEST_CTX HOTCOPY138 3082567164 1 0 2 151821083 151821080 4096 APP CONTEXT APP CONTEXT
SYS TEST_CTX HOTCOPY187 3192676979 1 0 2 151252013 151252010 4096 APP CONTEXT APP CONTEXT
SYS TEST_CTX HOTCOPY115 4198626891 1 0 2 150529629 150529626 4096 APP CONTEXT APP CONTEXT

SQL > select * from v$mutex_sleep order by sleeps desc, location;

MUTEX_TYPE LOCATION SLEEPS WAIT_TIME
----------- ------------------- ------ ---------
Cursor Pin kkslce [KKSCHLPIN2] 2 20118

SQL > select * from v$mutex_sleep_history order by sleep_timestamp desc, location;

MUTEX_IDENTIFIER SLEEP_TI MUTEX_TYPE GETS SLEEPS REQ_SES BLOCKING_SES LOCATION MUTEX_VALUE P1 P1RAW
---------------- -------- ---------- ---- ------ ------- ------------ ------------------- ----------- -- -----
2816823972 15:09:13 Cursor Pin 1 1 183 364 kkslce [KKSCHLPIN2] 16C00000000 2 00
2214650983 15:04:50 Cursor Pin 1 1 5 902 kkslce [KKSCHLPIN2] 38600000000 2 00

Invoke the same dtrace script to display running statistics:

$ > sudo dtrace -n \
'BEGIN {self->start_wts = walltimestamp; self->start_ts = timestamp;}
pid$target::kglpndl:entry /execname == "oracle"/ { self->rc = 1; }
pid$target::kgxExclusive:entry /execname == "oracle" && self->rc == 1/ { self->ts = timestamp; }
pid$target::kgxExclusive:return /self->ts > 0/ {
  @lquant["ns"] = lquantize(timestamp - self->ts, 0, 10000, 1000);
  @avgs["AVG_ns"] = avg(timestamp - self->ts);
  @mins["MIN_ns"] = min(timestamp - self->ts);
  @maxs["MAX_ns"] = max(timestamp - self->ts);
  @sums["SUM_ms"] = sum((timestamp - self->ts)/1000000);
  @counts[ustack(10, 0)] = count();
  self->rc = 0; self->ts = 0;}
END { printf("Start: %Y, End: %Y, Elapsed_ms: %d\n", self->start_wts
  ,walltimestamp, (timestamp - self->start_ts)/1000000);}
' -p 11751

Start: 2017 Oct 24 15:21:02, End: 2017 Oct 24 15:22:40, Elapsed_ms: 97999
ns
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8050589
1000 |@@ 330902
2000 | 1106
3000 | 1606
4000 | 1352
5000 | 630
6000 | 322
7000 | 201
8000 | 133
9000 | 94
>= 10000 | 481

AVG_ns 897
MIN_ns 813
MAX_ns 315083
SUM_ms 0

oracle`kgxExclusive+0x105
oracle`kglpndl+0x1fe
oracle`kglUnPin+0x101
a.out`kzctxChkTyp+0x14e
a.out`kzctxesc+0x510
a.out`pevm_icd_call_common+0x29d
a.out`pfrinstr_ICAL+0x90
oracle`pfrrun_no_tool+0x12a
oracle`pfrrun+0x4c0
oracle`plsql_run+0x288
8387416

Compared to the mutex test without hot objects in section 3.3.2, the average mutex time
is 897 ns, the max time is about 0.3 ms (315,083 ns), and the total number of executions is 8,387,416 over an elapsed
time of 97,999 ms.

Looking further at the v$db_object_cache query output for the first test and the second test with multiple copies of hot
objects: the first has 167,495,970 executions, while the second has 4 hot copies, each with a similar number
of executions (between 150,529,626 and 151,821,080). So the second test has almost 4 times the executions of the
first one. However, the first test shows higher sleeps and wait time (in microseconds) in
the v$mutex_sleep query; for example, kglpin1 has 192,654 sleeps and an accumulated wait time of
146 seconds. In the second test, they are no longer observable. This is because a mutex is created
and bound to its protected object, and is only responsible for that object. If each session has its own
object copy, that copy is dedicated (privatized) to one single session, and hence there is no more
such mutex contention.

Running Solaris prstat -mL again now shows that the process spends almost 100% of its time in user mode
(USR), whereas in the first test about 30% was spent sleeping (SLP).

As an alternative test, we also tried the Oracle documented API dbms_shared_pool; it seems that
namespace 'APP CONTEXT' is not yet supported.

-- Stop all Jobs
SQL > exec clean_jobs;

SQL > alter system reset "_kgl_debug" scope=spfile;

-- Restart DB
SQL > startup force

SQL > exec sys.dbms_shared_pool.markhot('SYS', 'TEST_CTX', 21);
--exec sys.dbms_shared_pool.unmarkhot('SYS', 'TEST_CTX', 21);

ORA-26680: object type not supported
ORA-06512: at "SYS.DBMS_SHARED_POOL", line 133

-- Using the 16 byte (32 hexadecimal character) V$DB_OBJECT_CACHE.FULL_HASH_VALUE
SQL > exec sys.dbms_shared_pool.markhot(hash=>'3581f5a97dfac7485a3330954e800171', NAMESPACE=>21);
--exec sys.dbms_shared_pool.unmarkhot(hash=>'3581f5a97dfac7485a3330954e800171', NAMESPACE=>21);

ORA-26680: object type not supported
ORA-06512: at "SYS.DBMS_SHARED_POOL", line 138

Comparing _kgl_debug and markhot: _kgl_debug is persistent across DB restarts, but not always
stable after a restart; several sessions can still contend for the same library cache objects instead of
creating/using hot object copies. markhot, on the other hand, seems stable after a DB restart, but is not
persistent across restarts. Moreover, markhot does not support all namespaces of library cache objects,
for example the 'APP CONTEXT' above.

As an example, in the following test we marked a synonym as hot with dbms_shared_pool.markhot (or
_kgl_debug), and it hit a core dump with an ORA-00600 error:

drop table tt1;

create table tt1 as select 1 x from dual;

create or replace public synonym tt1 for tt1;

select * from tt1;

select * from "PUBLIC".tt1;

select owner, property, name, namespace, type, full_hash_value
from v$db_object_cache v
where name = 'TT1' and type = 'SYNONYM' or property like '%HOT%';

-- OWNER PROPERTY NAME NAMESPACE TYPE FULL_HASH_VALUE
-- ------ ---------- ---------- ---------------- ------------ --------------------------------
-- PUBLIC TT1 TABLE/PROCEDURE SYNONYM 52e39b4b6a80a55af7cffca07abd5ddf

-- namespace 1 for SYNONYM
exec sys.dbms_shared_pool.markhot(hash=>'52e39b4b6a80a55af7cffca07abd5ddf', namespace=>1);
-- exec sys.dbms_shared_pool.unmarkhot(hash=>'52e39b4b6a80a55af7cffca07abd5ddf', namespace=>1);

select owner, property, name, namespace, type, full_hash_value
from v$db_object_cache v
where name = 'TT1' and type = 'SYNONYM';

-- OWNER PROPERTY NAME NAMESPACE TYPE FULL_HASH_VALUE
-- ------ ---------- ---------- ---------------- ------------ --------------------------------
-- PUBLIC HOT TT1 TABLE/PROCEDURE SYNONYM 52e39b4b6a80a55af7cffca07abd5ddf

select * from tt1;

alter system flush shared_pool;

select * from "PUBLIC".tt1;

-- select * from "PUBLIC".tt1
-- *
-- ERROR at line 1:
-- ORA-00600: internal error code, arguments: [kgltti-no-dep1], [], [], [], [], [], [], [], [], [], [], []

select * from tt1;

The callstack looks like:

kgltti()+1358 -> dbgeEndDDEInvocation() <--- ERROR SIGNALED: yes COMPONENT: LIBCACHE
kqlCompileSynonym()+3840 -> kgltti()
kqllod_new()+3768 -> kqlCompileSynonym()
kqlCallback()+79 -> kqllod_new()
kqllod()+710 -> kqlCallback()
kglobld()+1058 -> kqllod()
kglobpn()+1232 -> kglobld()
kglpim()+489 -> kglobpn()
kglpin()+1785 -> kglpim()
kglgob()+493 -> kglpin()
kgiind()+1529 -> kglgob()
pfri8_inst_spec()+126 -> kgiind()
pfri1_inst_spec()+69 -> pfri8_inst_spec()
pfrrun()+1506 -> pfri1_inst_spec()
plsql_run()+648 -> pfrrun()

This error is documented in Oracle MOS:

ORA-00600 [kgltti-no-dep1] When Synonym Marked Hot (Doc ID 2153847.1)

By the way, after such an ORA-00600, the session is not disconnected, and further queries can be executed.

Chapter 4

Parsing and Compiling

Compared with traditional programming languages, Sql parse and Plsql compile (interpret) are more dy-
namic and adaptive to the running environment. They not only transform human readable source
code into machine executable target code (the PL/SQL Virtual Machine (PVM) running bytecode (a.k.a.
MCode) when plsql_code_type = INTERPRETED), but also search for optimized execution plans to per-
form Dr Codd's relational algebra. Consequently, this results in complex Sql cursor management and
library cache Plsql dependency maintenance.

In this chapter, we will look at Sql hard and soft parse, and at two demanding cases of Plsql and Sql validation.

4.1 Sql Parse

4.1.1 Parse Differences

Popular Oracle books have devoted large sections to Sql parse:

-. Effective Oracle by Design [11, p. 287-302]

-. Oracle Core: Essential Internals for DBAs and Developers [15, p. 173-178]

-. Oracle Performance Firefighting [33, p. 261]

-. Troubleshooting Oracle Performance (2nd Edition) [4, p. 436-438]

In summary, one cursor can undergo 4 different life stages:

(1). hard parse: not yet existed in SGA.

(2). soft parse: globally available in SGA’s shared pool (library cache).

(3). softer parse: locally available in the PGA's session cursor cache (controlled by session_cached_cursors).

(4). no parse: loaded into PL/SQL executing cache (specially optimized for PL/SQL).

109
If we classify them according to locality (analogous to NUMA memory), they are:

(1). hard parse: first-time creating(or re-creating)

(2). soft parse: global shared pool cache

(3). softer parse: local session UGA cache

(4). no parse: running thread heap

If we model them in 4 dimensional Qualitative Physics with 4 Oracle metrics, each stage can be represented
as:

(1). hard parse: parse count hard(+), parse count total(+), lock count(+), pin count(+)

(2). soft parse: parse count hard(0), parse count total(+), lock count(+), pin count(+)

(3). softer parse: parse count hard(0), parse count total(+), lock count(0), pin count(+)

(4). no parse: parse count hard(0), parse count total(0), lock count(0), pin count(0)

In the above list, lock count is v$db_object_cache.locked_total, pin count is v$db_object_cache.pinned_total.

+ denotes increase, 0 denotes no change (or little change).

So 4 metrics are needed to determine a cursor's parse stage (see the queries in section 4.1.2).

4.1.2 Parse Identifying

In order to determine the different parse stages, we can use the queries below to acquire the 4 metrics (Note:
'parse count (total)' is the same as 'parse calls'):

select n.name, s.value
from v$statname n, v$sesstat s
where n.statistic# = s.statistic#
and n.name in ('execute count', 'parse count (hard)', 'parse count (total)', 'session cursor cache hits')
and s.sid in (:sid);

select executions, parse_calls, locked_total, pinned_total, loads, invalidations, child_number, last_load_time, v.*
from v$sql v
where sql_id in (':SQL_ID') or lower(sql_text) like '%:sql_text%'
order by v.last_load_time desc;

select timestamp, locked_total, pinned_total, loads, invalidations, v.*
from v$db_object_cache v
where lower(name) like '%:name%'
order by v.timestamp desc;

select cursor_type, v.*
from v$open_cursor v
where v.cursor_type in ('SESSION CURSOR CACHED', 'PL/SQL CURSOR CACHED')
and v.sid in (:sid) and sql_id in (':sql_id')
--and lower(sql_text) like '%:sql_text%'
order by v.sid, v.cursor_type;

The metric changes are reflected in columns’ value variations as follows:

1. hard parse:
v$sesstat ’parse count (hard)’ value
v$sql.invalidations and loads
v$db_object_cache.invalidations and loads
2. soft parse:
v$sql.locked_total
v$db_object_cache.locked_total
3. softer parse:
v$sesstat ’session cursor cache hits’ value
v$sql.pinned_total
v$db_object_cache.pinned_total
v$open_cursor ’SESSION CURSOR CACHED’
4. no parse:
v$open_cursor ’PL/SQL CURSOR CACHED’
v$sql.parse_calls (not increasing)
v$db_object_cache.parse_calls (not increasing)

Only in the case of "no parse" are 'parse count (total)' and parse_calls not increased.
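A minimal sketch to observe this from within a session (assuming serveroutput is enabled): a static SQL statement inside a PL/SQL loop is parsed on the first execution only, so the 'parse count (total)' delta stays at a handful (from the first parse of each statement in the block) rather than growing with the loop count.

declare
  l_cnt    number;
  l_before number;
  l_after  number;
begin
  select s.value into l_before
  from v$mystat s, v$statname n
  where s.statistic# = n.statistic# and n.name = 'parse count (total)';

  for i in 1..100 loop
    select count(*) into l_cnt from dual; -- static SQL, kept in the PL/SQL cursor cache
  end loop;

  select s.value into l_after
  from v$mystat s, v$statname n
  where s.statistic# = n.statistic# and n.name = 'parse count (total)';

  dbms_output.put_line('parse count (total) delta: ' || (l_after - l_before));
end;
/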

4.1.3 Cursor Details in Cursordump

If we set the following cursor parameters and make a cursordump:

alter session set open_cursors=400;
alter session set session_cached_cursors=600;
alter session set tracefile_identifier = "cursordump_4";
alter session set events 'immediate trace name cursordump level 4';

The trace file shows certain aspects of cursor implementation (irrelevant details are removed):

----- Session Cursor Dump -----
Current cursor: 4, pgadep=0

Open cursors(pls, sys, hwm, max): 4(0, 2, 64, 400)
NULL=3 SYNTAX=0 PARSE=0 BOUND=1 FETCH=0 ROW=0
Cached frame pages(total, free):
4k(50, 50), 8k(1, 1), 16k(1, 1), 32k(0, 0)
----- Session Open Cursors -----
----------------------------------------
Cursor#1(0x1108e1b38) state=NULL curiob=0x110906698
......
Cursor#6(0x1108e1e08) state=NULL curiob=0x110e38c70
......
Cursor#5(0x1108e1d78) state=NULL curiob=0x11090a528
......
Cursor#4(0x1108e1ce8) state=BOUND curiob=0x110908068
......

----- Session Cached Cursor Dump -----


----- Generic Session Cached Cursor Dump -----
-----------------------------------------------------------
hash table=1108e4100 cnt=539 LRU=1108d5168 cnt=536 hit=2263 max=600 NumberOfTypes=6
type#0 name=DICTION count=0
type#1 name=BUNDLE count=13
type#2 name=SESSION count=38
type#3 name=PL/SQL count=485
type#4 name=CONSTRA count=0
type#5 name=REPLICA count=0
Bucket#001 seg=1108e4128 nit=8 nal=8 ips=8 sz=56 flg=3 ucnt=0

Bucket#008 seg=1108e4278 nit=8 nal=8 ips=8 sz=56 flg=3 ucnt=1
0 cob=110caaa28 idx=8 flg=0 typ=3 cur=110e27f90 lru=1 fl=15
......
Bucket#109 seg=1108e5568 nit=8 nal=8 ips=8 sz=56 flg=3 ucnt=2
0 cob=110907140 idx=6d flg=0 typ=2 cur=11092ac40 lru=1 fl=1
1 cob=110907178 idx=1006d flg=0 typ=2 cur=110c796f0 lru=1 fl=1
......
Bucket#256 seg=1108e70f8 nit=8 nal=8 ips=8 sz=56 flg=3 ucnt=0

In the above dump, the maximum number of open cursors is shown as 400 (in the line "Open cursors(pls, sys, hwm, max)"), corresponding to open_cursors=400, and the cursor states are summarized in the line:

NULL=3 SYNTAX=0 PARSE=0 BOUND=1 FETCH=0 ROW=0

The maximum number of session cached cursors is 600 (max=600 in the line starting with hash table=1108e4100 ...),
derived from session_cached_cursors=600 (we will give it a further look in section 5.1.7 from a memory point
of view).

Internally, the session cached cursors form a hash table consisting of 256 buckets (Bucket#001 - Bucket#256),
classified into 6 types (type#0 - type#5), and located in the PGA. Since the session cached cursor hash table is
limited to 256 buckets, setting session_cached_cursors bigger than 256 incurs hash collisions; for example,
Bucket#109 above contains two lines:

0 cob=110907140 idx=6d flg=0 typ=2 cur=11092ac40 lru=1 fl=1
1 cob=110907178 idx=1006d flg=0 typ=2 cur=110c796f0 lru=1 fl=1

Both idx values are mapped to the same Bucket#:

select mod(to_number('6d', 'xxxxx'), 256),
       mod(to_number('1006d', 'xxxxx'), 256)
from dual;

MOD(TO_NUMBER(’6D’,’XXXXX’),256) MOD(TO_NUMBER(’1006D’,’XXXXX’),256)
-------------------------------- -----------------------------------
109 109

Open cursors and session cached cursors in each session are exposed by v$open_cursor. It tracks cursors
that each user session currently has opened and parsed, or cached. Internally they are the KGL locks
(x$kgllk) imposed by the session.

Joining v$open_cursor with v$libcache_locks, we can display information about the locks (lock held,
refcount, mode held, mode requested):

select *
from v$open_cursor c, v$libcache_locks l
where c.saddr = l.holding_user_session
and c.address = l.object_handle;

When we run a query on v$open_cursor for one given session and sql_id, and we notice that child_address
changes in each run, there is probably cursor mutex contention, like "cursor: mutex X" or "cursor:
mutex S".

select child_address, v.* from v$open_cursor v
where sid = :sid
and sql_id = ':sql_id';

The problem is related to hard parsing caused by differences in cursor sharing criteria, for example, a
language mismatch caused by different NLS settings, or a user bind peek mismatch when bind variable
peeking and adaptive plans are disabled (_optim_peek_user_binds = false, _optimizer_adaptive_plans =
false) in Oracle 18c. It results in high version counts for the problem Sql statements. We can observe
mutex sleeps and wait time increasing in locations kkscsAddChildNode and kkscsPruneChild with the
query below (see Blog [36] NLS test):

select * from v$mutex_sleep
where location like 'kkscsAddChildNode%' or location like 'kkscsPruneChild%';

4.2 Plsql Validation Self-Deadlock

Plsql object dependencies are built and maintained dynamically at run-time. Before each execution,
an object has to be validated. If a deadlock is detected during this validation, an exception is thrown.

Here is a short test case extracted from one real application. It throws the error: ORA-04027: self-deadlock
during automatic validation.

drop table test_tab;
drop package test_pkg;
drop procedure test_proc;

create table test_tab (a number(2));
insert into test_tab values (12);
commit;

create or replace package test_pkg as
  function fun return number;
  procedure prc;
end test_pkg;
/

create or replace procedure test_proc
as
  procedure prt(p_name varchar2) as
  begin
    for c in
      (select p_name ||object_name||' (' || object_type ||') '||status s
       from dba_objects
       where object_name in ('TEST_PKG', 'TEST_PROC'))
    loop
      dbms_output.put_line(c.s);
    end loop;
  end;
begin
  prt('Before Alter: ');
  execute immediate 'alter table test_tab modify (a number(2))';
  prt('After Alter: ');
  update test_tab set a=test_pkg.fun;
end test_proc;
/

create or replace package body test_pkg as
  function fun return number as
  begin
    return 10;
  end;

  procedure prc is
  begin
    test_proc;
  end;
end test_pkg;
/

Now run it and look at the output:

SQL> exec test_proc;

Before Alter: TEST_PKG (PACKAGE) VALID
Before Alter: TEST_PKG (PACKAGE BODY) VALID
Before Alter: TEST_PROC (PROCEDURE) VALID

After Alter: TEST_PKG (PACKAGE) VALID
After Alter: TEST_PKG (PACKAGE BODY) INVALID
After Alter: TEST_PROC (PROCEDURE) INVALID

ORA-04027: self-deadlock during automatic validation for object K.TEST_PROC
ORA-06512: at "K.TEST_PROC", line 17

There are two invalid objects: test_pkg (package body) and test_proc (procedure). Both can be validated
by:

alter package test_pkg compile body;

but whenever you call test_proc, they are invalidated again.

The dependency graph is: procedure prc in test_pkg (package body) depends on test_proc, and
test_proc depends on test_pkg (package) and test_tab (table).
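The dependency graph can be cross-checked with a query on dba_dependencies (a sketch; the owner filter is omitted for brevity):

select name, type, referenced_name, referenced_type
from dba_dependencies
where name in ('TEST_PKG', 'TEST_PROC')
order by name, type, referenced_name;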

When test_proc is called, it is pinned. After the alter table test_tab DDL, test_proc becomes invalid
because of its dependency on test_tab, which in turn invalidates test_pkg (package body) through the
dependency chain (the pinned version of test_proc is still valid since it is the currently executing unit in
the call stack).

When test_proc runs to the update statement, it sees test_pkg (package body) invalid. Therefore,
preparing to validate test_pkg (package body), it requests an X-lock on test_pkg (package body),
which in turn triggers an X-lock request on test_proc (via the dependency).

Since test_proc is already pinned (share lock) by itself at the beginning, it is not possible to grant an
X-lock to itself. So a self-deadlock is generated during the validation of test_proc.

The code was tested on 10g, 11g and 12c.

In 11gR2, toggling the hidden parameters below has no influence on the above deadlock behaviour:

_disable_fast_validate (TRUE, FALSE)
_ignore_fg_deps (TABLES, PLSQL, ALL, NONE)

4.3 Sql library cache lock (cycle) Deadlock

When upgrading from 11gR2 to 12c (12.1.0.2.0), we hit a single-session "library cache lock (cycle)" ORA-
04020 deadlock. A few queries and the dump file can help us gain a certain understanding of library
cache activities.

4.3.1 Test Code

First, run the following test code to create the lc_pin# package and package body.

------------------------- Test Code -------------------------
-- This test is with dba_tables.
-- It is also reproducible with dba_segments, dba_objects, dba_indexes.

drop package lc_pin#;

create or replace package lc_pin# as
  type t_dba_row_tab is table of sys.dba_tables%rowtype;
  type t_vc is record (name varchar2(30));
  type t_vc_tab is table of t_vc;

  function foo return t_vc_tab pipelined;
  function koo return t_dba_row_tab pipelined;
  function soo return t_dba_row_tab pipelined;
end lc_pin#;
/
/

create or replace package body lc_pin# as
  function foo return t_vc_tab pipelined is
    l_result t_vc;
  begin
    l_result.name := 'lc_test';
    pipe row(l_result);
    return;
  end foo;

  function koo return t_dba_row_tab pipelined is
  begin
    for c in (select * from dba_tables where rownum = 1) loop
      pipe row(c);
    end loop;
  end koo;

  function soo return t_dba_row_tab pipelined is
  begin
    for c in (
      with sq as (select * from table(foo))   -- Line 20
      select nt.*
      from sq
          ,(select * from table(koo)) nt
      -- following re-write works
      -- select nt.* from (select * from table(foo)) sq, (select * from table(koo)) nt
    ) loop
      pipe row(c);                            -- Line 27
    end loop;
  end soo;
end lc_pin#;
/
/

4.3.2 Library Cache Deadlock

Run a query to list the newly created private and SYS sources:

select owner, object_name, object_type from dba_objects
where last_ddl_time > sysdate -10/1440
order by object_name;

OWNER OBJECT_NAME                OBJECT_TYPE
----- -------------------------- ------------
Test  LC_PIN#                    PACKAGE BODY
Test  LC_PIN#                    PACKAGE
Test  SYS_PLSQL_6174CDA6_21_1    TYPE
Test  SYS_PLSQL_6174CDA6_31_1    TYPE
Test  SYS_PLSQL_6174CDA6_9_1     TYPE
Test  SYS_PLSQL_6174CDA6_DUMMY_1 TYPE
SYS   SYS_PLSQL_750F00_462_1     TYPE
SYS   SYS_PLSQL_750F00_DUMMY_1   TYPE

Then look at the source lines with the query:

select * from dba_source
where name like 'SYS_PLSQL_6174CDA6%' or name like 'SYS_PLSQL_750F00%'
order by name, line;

It shows the mapping between the newly generated types and the types defined in lc_pin#:

SYS_PLSQL_6174CDA6_21_1    for t_vc
SYS_PLSQL_6174CDA6_31_1    for t_vc_tab (table of "SYS_PLSQL_6174CDA6_21_1")
SYS_PLSQL_6174CDA6_9_1     for t_dba_row_tab (table of "SYS_PLSQL_750F00_462_1")
SYS_PLSQL_6174CDA6_DUMMY_1 for index table of SYS_PLSQL_6174CDA6_31_1

SYS_PLSQL_750F00_462_1     for sys.dba_tables%rowtype
SYS_PLSQL_750F00_DUMMY_1   for index table of SYS_PLSQL_6174CDA6_9_1

Now we drop the generated SYS type (type dropping is discussed later in section 4.3.4):

SQL > drop type SYS.SYS_PLSQL_750F00_462_1 force;

SYS_PLSQL_750F00_462_1 is no longer registered in dba_objects, but is still retained in sys.obj$. It can be
displayed by:

select * from sys.obj$ where mtime > sysdate -10/1440 order by mtime;

In sys.obj$, however, it is altered from type# 13 (TYPE) to a type# 10 object (also called a non-existent
object in Oracle).

Since SYS_PLSQL_6174CDA6_9_1 is declared as a table of SYS_PLSQL_750F00_462_1 (a dependency), it be-
comes invalid. Trying to recompile it, we get an error:

SQL > alter type test.sys_plsql_6174cda6_9_1 compile;
Warning: Type altered with compilation errors.

SQL > show error
Errors for TYPE TEST.SYS_PLSQL_6174CDA6_9_1:
LINE/COL ERROR
-------- -----------------------------------------------------------------
0/0      PL/SQL: Compilation unit analysis terminated
1/46     PLS-00201: identifier 'SYS.SYS_PLSQL_750F00_462_1' must be declared

If we compile the lc_pin# package body, we get an ORA-04020 deadlock.

SQL > alter package lc_pin# compile body;
Warning: Package Body altered with compilation errors.

SQL > show error
Errors for PACKAGE BODY LC_PIN#:
LINE/COL ERROR
-------- -----------------------------------------------------------------
20/8     PL/SQL: ORA-04020: deadlock detected while trying to lock object
         TEST.SYS_PLSQL_6174CDA6_31_1
20/8     PL/SQL: SQL Statement ignored
27/8     PL/SQL: Statement ignored
27/17    PLS-00364: loop index variable 'C' use is invalid

where Line 20 (see the Test Code attached above) is

with sq as (select * from table(foo))

Now the SYS_PLSQL_6174CDA6_9_1 type and lc_pin# (package body) are invalid, but lc_pin# (package
spec) is still valid as before.

A quick workaround is to recompile the package spec even though it is valid:

alter package lc_pin# compile;

which recompiles SYS_PLSQL_6174CDA6_9_1 (TYPE) and lc_pin# (package body), but not lc_pin#
(package spec).

After the recompilation, all objects are valid, and you can run the query:

select * from table(lc_pin#.soo);

And object dependencies currently loaded in the shared pool can be shown by:

select (select to_name from v$object_dependency where to_hash = d.from_hash and rownum=1) from_name
,(select sql_text from v$sql where hash_value = d.from_hash) sql_text
,d.*
from v$object_dependency d
where to_name like ’SYS_PLSQL_6174CDA6%’ or to_name like ’SYS_PLSQL_750F00%’ or to_name = ’LC_PIN#’
order by to_name;

4.3.3 Single Session Cycle Dependency

The problem is caused by the "with" factoring clause in function soo of lc_pin# (package body) at
Line 20.

When Oracle parses the "with" factoring clause, it acquires a "library cache pin" in share mode (S) on
the dependent objects, in this case t_vc_tab; then it proceeds to the main clause, in which it realizes
that the dependent object t_dba_row_tab (SYS_PLSQL_6174CDA6_9_1) is invalid. In order to resolve this
invalid object, Oracle attempts to recompile the package spec, which requests exclusive mode (X) locks on
the related objects.

Since the already held mode (S) on t_vc_tab is not compatible with the requested mode (X), the Oracle
session throws the error ORA-04020 and generates a dump. The trace file shows:

A deadlock among DDL and parse locks is detected.
ORA-04020: deadlock detected while trying to lock object TEST.SYS_PLSQL_6174CDA6_31_1
--------------------------------------------------------
object    waiting   waiting     mode blocking  blocking  mode
handle    session   lock             session   lock
--------- --------- ----------- ---- --------- --------- ----
15ab8f290 18fbfb3c0 15f2189a8   X    18fbfb3c0 165dbbe28 S

------------- WAITING LOCK -------------
SO: 0x15f2189a8, type: 96, owner: 0x180658498
LibraryObjectLock: Address=15f2189a8 Handle=15ab8f290 RequestMode=X
CanBeBrokenCount=9
User=18fbfb3c0 Session=18fbff560 ReferenceCount=0
Flags=[0000] SavepointNum=2043e
LibraryHandle: Address=15ab8f290

------------- BLOCKING LOCK ------------
SO: 0x165dbbe28, type: 96, owner: 0x15f102fe0
LibraryObjectLock: Address=165dbbe28 Handle=15ab8f290 Mode=S
CallPin=155fbeed8 CanBeBrokenCount=9
User=18fbfb3c0 Session=18fbfb3c0 ReferenceCount=1
Flags=CNB/PNC/[0003] SavepointNum=203a9
LibraryHandle: Address=15ab8f290

--------------------------------------------------------
This lock request was aborted.

If we quickly select on v$wait_chains by:

select chain_signature, to_char(p1, 'xxxxxxxxxxxxxxxxxxxx') p1, p1_text,
       to_char(p2, 'xxxxxxxxxxxxxxxxxxxxxxxx') p2, p2_text,
       to_char(p3, 'xxxxxxxxxxxxxxxxx') p3, p3_text,
       in_wait_secs, time_remaining_secs
from v$wait_chains;

We got:

IN TIME
_WAIT _REMAINING
chain_signature P1 P1_TEXT P2 P2_TEXT P3 P3_TEXT _SECS _SECS
---------------------------- --------- ------------- --------- ----------- ------------- ------------------ ------ -----------
’library cache lock’ (cycle) 15ab8f290 handle addres 15f2189a8 lock addres 585a300010003 100*mode+namespace 1 898

Although time_remaining_secs shows 898 seconds (about 15 minutes) in Oracle 12c, the above row
disappeared after 9 seconds, probably because the session had already generated the dump.

In 11gR2, however, the session spins on the wait event "library cache pin", and after 15 minutes it throws
the error ORA-04021: timeout occurred while waiting to lock object. The above 898 seconds in Oracle 12c
is probably a residue of the 11gR2 15-minute timeout.

A further query:

select (select kglnaobj||'('||kglobtyd||')'
        from x$kglob v
        where kglhdadr = object_handle and rownum=1) kglobj_name
      ,v.*
from v$libcache_locks v
where v.holding_user_session =
      (select saddr from v$session
       where event ='library cache lock' and rownum = 1)
and object_handle in (select object_handle from v$libcache_locks where mode_requested !=0)
order by kglobj_name, holding_user_session, type, mode_held, mode_requested;

shows that there exist two rows on SYS_PLSQL_6174CDA6_31_1 (TYPE) with value LOCK in column TYPE. If we
look at the first row, which has MODE_REQUESTED: 3 (exclusive mode), holding_user_session (18FBFB3C0)
and holding_session (18FBFF560) are different.

                             HOLDING       HOLDING   OBJECT                          MODE  MODE       SAVEPOINT
KGLOBJ_NAME             TYPE ADDR      _USER_SESSION _SESSION  _HANDLE   LOCK_HELD REFCOUNT _HELD _REQUESTED _NUMBER
----------------------- ---- --------- ------------- --------- --------- --------- -------- ----- ---------- ---------
SYS_PLSQL_6174CDA6_31_1 LOCK 15F2189A8 18FBFB3C0     18FBFF560 15AB8F290 0         0        0     3          132158
SYS_PLSQL_6174CDA6_31_1 LOCK 165DBBE28 18FBFB3C0     18FBFB3C0 15AB8F290 155FBEED8 1        2     0          132009

From the query result, we can see that holding_user_session already holds a lock of mode 2 (share mode),
but at the same time designates a different recursive session to request a lock of mode 3 (exclusive mode).
The column savepoint_number shows the sequence of the lock get (132009) and request (132158), so the first
is the "get", the second is the "request" (132009 < 132158).

Oracle throws such a cycle deadlock since both the get and the request originate from the same holding_user_session.

Cross-checking with the above dump file, under the line "WAITING LOCK", we can see:

------------- WAITING LOCK -------------
SO: 0x15f2189a8, type: 96, owner: 0x180658498
LibraryObjectLock: Address=15f2189a8 Handle=15ab8f290 RequestMode=X
CanBeBrokenCount=9
User=18fbfb3c0 Session=18fbff560 ReferenceCount=0
Flags=[0000] SavepointNum=2043e
LibraryHandle: Address=15ab8f290

where User=18fbfb3c0 (holding_user_session) is different from Session=18fbff560 (holding_session).

However, under the line "BLOCKING LOCK", both are the same (18fbfb3c0):

------------- BLOCKING LOCK ------------
SO: 0x165dbbe28, type: 96, owner: 0x15f102fe0
LibraryObjectLock: Address=165dbbe28 Handle=15ab8f290 Mode=S
CallPin=155fbeed8 CanBeBrokenCount=9
User=18fbfb3c0 Session=18fbfb3c0 ReferenceCount=1
Flags=CNB/PNC/[0003] SavepointNum=203a9
LibraryHandle: Address=15ab8f290

The respective SavepointNum values are hex 2043e (decimal 132158) and 203a9 (decimal 132009).

In Oracle, holding_user_session is the session listed in v$session, whereas holding_session is
the recursive session when they are not the same. Normally a recursive session is spawned when the
holding_user_session requires "SYS" user privileges to perform certain tasks.

By the way, recursive sessions are not exported in v$session because of the filter predicate on the underlying
x$ksuse:

bitand("s"."ksuseflg",1)<>0

So only rows with an odd ksuseflg value are included in v$session.

Looking at the definition of gv$session, column TYPE is derived from ksuseflg values as:

DECODE (BITAND (s.ksuseflg, 19),
        17, 'BACKGROUND',
        1, 'USER',
        2, 'RECURSIVE',
        '?'),

which shows that the ksuseflg of a 'RECURSIVE' session is an even number (see Blog: Recursive Sessions
and ORA-00018: maximum number of sessions exceeded [29]).
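To list the recursive sessions that v$session filters out, we can query the underlying x$ksuse directly (a sketch following the gv$session decode above; it requires SYS):

select indx sid,
       decode(bitand(ksuseflg, 19), 17, 'BACKGROUND', 1, 'USER', 2, 'RECURSIVE', '?') session_type
from x$ksuse
where bitand(ksuseflg, 19) = 2;   -- even ksuseflg: recursive sessions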

4.3.4 Type Dropping

In the above discussion, we dropped the type manually to force the invalidation with the statement:

drop type SYS.SYS_PLSQL_750F00_462_1 force;

Actually, it seems that Oracle 12c has introduced certain automatic CLEANUP jobs to perform such
dropping. They can be listed by the query:

select job_name, comments
from dba_scheduler_jobs
where job_name like 'CLEANUP%';

JOB_NAME COMMENTS
------------------------- ------------------------------------
CLEANUP_NON_EXIST_OBJ Cleanup Non Existent Objects in obj$
CLEANUP_ONLINE_IND_BUILD Cleanup Online Index Build
CLEANUP_ONLINE_PMO Cleanup after Failed PMO
CLEANUP_TAB_IOT_PMO Cleanup Tables after IOT PMO
CLEANUP_TRANSIENT_PKG Cleanup Transient Packages
CLEANUP_TRANSIENT_TYPE Cleanup Transient Types

Looking at job CLEANUP_NON_EXIST_OBJ, the COMMENTS column says:

Cleanup Non Existent Objects in obj$.

and the JOB_ACTION column is filled with the code block:

declare
myinterval number;
begin
myinterval := dbms_pdb.cleanup_task (1);
if myinterval <> 0
then
next_date := systimestamp + numtodsinterval (myinterval, ’second’);
end if;
end;

If we run the above block, the NON-EXISTENT object of our above test:

SYS.SYS_PLSQL_750F00_462_1

is indeed removed.

In fact, those auto jobs seem very active: within each past 20 minutes, their LAST_DDL_TIME values are updated.

select object_name, object_type, last_ddl_time
from dba_objects v
where last_ddl_time > sysdate - 20/1440
order by v.last_ddl_time, v.object_name;

OBJECT_NAME               OBJECT_TYPE  LAST_DDL_TIME
------------------------- ------------ --------------------
CLEANUP_NON_EXIST_OBJ JOB 2019-JAN-16 12:34:30
CLEANUP_TRANSIENT_TYPE JOB 2019-JAN-16 12:35:24
CLEANUP_ONLINE_IND_BUILD JOB 2019-JAN-16 12:43:59
CLEANUP_TAB_IOT_PMO JOB 2019-JAN-16 12:44:09
CLEANUP_TRANSIENT_PKG JOB 2019-JAN-16 12:44:29
CLEANUP_ONLINE_PMO JOB 2019-JAN-16 12:44:39
FILE_SIZE_UPD JOB 2019-JAN-16 12:49:39

Chapter 5

Memory Usage and Allocation

In this chapter, we will look at memory allocation and usage in the SGA, the PGA, and Oracle LOBs. Overuse
of memory often generates Oracle exceptions, typically ORA-04030 and ORA-04031, and occasionally
causes DB (or even UNIX system) crashes.

5.1 SGA Memory Usage and Allocation

The Oracle SGA is composed of two main parts: the first is about data, a fixed area called the buffer cache(s);
the second is about executables and meta info, a dynamic pool called the shared pool.

The shared pool (introduced in Oracle 7 [9]) is again made of two main parts: one is Sql and Plsql executables,
for example, Sqlarea (Heap6); the other is the library cache (basic elements, as the name "library" implies), for
example, execution environment (Heap0 or KGLH0), Plsql DIANA (Heap2), MPCODE (Heap4), tables
(KGLS Heap), Row Cache (DC Cache, KQR), dependencies and relationships. The library elements are
linked together to build up executables. In certain contexts, all of them together are also referred to as the
"Library Cache", for example, v$librarycache.

While the buffer cache stores well-formatted data (specified by DDL) and is divided into a predefined chunk
size by db_block_size, the shared pool stores variously sized components and is structured into multiple layers:
subpools, heaps, subheaps, buckets, extents, chunks, which makes the shared pool more complex to
manage. When memory is under pressure, ORA-04031 is signalled.

Technically, the best moment to study a problem is when the acute point is reached. In the case of Oracle, it is
the time when the problem occurs, and that is exactly the occasion worth investigating (or being paid to do it).
So in this section, we will start by analysing an ORA-04031 trace dump, then make experiments with
various heap dumps to understand shared pool memory operations.

In a performance-sluggish Prod DB, the alert log is full of messages like:

ORA-04031: unable to allocate 256 bytes of shared memory ("shared pool","unknown object","KKSSP^9876","kgllk").

one of which even shows that the smallest memory chunk of 32 bytes is no longer obtainable:

ORA-04031: unable to allocate 32 bytes of shared memory

The Prod DB is configured with a 20GB shared pool and a 384GB buffer cache, running as a dedicated server with
6000 concurrent login sessions on Oracle 11.2.0.3.0.

To analyse the problem, we will go through the Prod DB dumps and at the same time make experiments
on a Test DB (shared_pool_size = 1408MB), compare dumps from both DBs, and try to reproduce the
Prod DB issues, so that we can dig into the details of shared pool memory management. All descriptions
are from observations and experiments; they can only be considered guesses and have not been fully
confirmed, but all the dumps are from Oracle and can be used for further investigations.

5.1.1 Subpool Memory

First, we pick one ORA-04031 trace dump from the Prod DB:

=================================
Begin 4031 Diagnostic Information
=================================
Memory Utilization of Subpool 1
================================
Allocation Name Size
___________________________ ____________
"free memory " 306562330
"SQLA " 1110240
"KGLH0 " 554532520
"KKSSP " 617693540
"db_block_hash_buckets " 529039360
==============================
Memory Utilization of Subpool 2
================================
Allocation Name Size
___________________________ ____________
"free memory " 358029240
"SQLA " 588943660
"KGLH0 " 485312100
"KKSSP " 563340900
"db_block_hash_buckets " 535429120
...
==============================
Memory Utilization of Subpool 7
================================
Allocation Name Size
___________________________ ____________
"free memory " 304535580
"SQLA " 272452360
"KGLH0 " 332455850
"KKSSP " 577237150
"db_block_hash_buckets " 535429120

It shows the memory utilization of all 7 used subpools. Based on it, we can establish a few memory summary
overviews of the subpool sizes and the number of components stored in each subpool.

Name             Subpool 1 Subpool 2 Subpool 3 Subpool 4 Subpool 5 Subpool 6 Subpool 7 Sum
Subpool Size     2'684'354 3'355'443 3'355'443 2'684'354 3'355'443 2'684'354 3'355'443 21'474'836
Components Count 295       303       308       306       306       308       316       959

Table 5.1: Subpool Size (in KB) and Count

Table 5.1 shows that subpool sizes can differ by more than 20% (from 2'684'354 KB to 3'355'443 KB,
all sizes in KB), and the number of components per subpool varies from 295 to 316. In total, there are 959
components, but each subpool contains at most 316 of them. So all components are distributed
across different subpools, none of which contains more than 1/3 of all components.

If we run the query below:

select count(name), count(distinct name) from v$sgastat where pool = ’shared pool’;

COUNT(NAME) COUNT(DISTINCTNAME)
----------- -------------------
881 881

It returns only 881 distinct component names, so not all areas are registered in v$sgastat, for example,
"kokcd" and "post agent".

Table 5.2 shows the top 5 memory-consuming components in each subpool (we will go into them one by one
later):

Name                  Subpool 1 Subpool 2 Subpool 3 Subpool 4 Subpool 5 Subpool 6 Subpool 7 Sum
KKSSP                 617'693   563'340   590'296   590'798   577'532   643'504   577'237   4'160'404
db_block_hash_buckets 529'039   535'429   529'039   539'525   535'429   529'044   535'429   3'732'935
KGLH0                 554'532   485'312   464'006   353'634   450'528   346'045   332'455   2'986'514
SQLA                  1'110     588'943   565'317   185'664   574'008   155'498   272'452   2'342'994
free memory           306'562   358'029   353'146   306'659   342'194   325'386   304'535   2'296'513

Table 5.2: Top 5 Memory Components (in KB)

Table 5.3 shows the top 10 disparity components, allocated in only 1 or 2 subpools (column CNT <= 2), with
large size differences (all sizes in bytes):

Name Subpool 1 Subpool 2 Subpool 3 Subpool 4 Subpool 5 Subpool 6 Subpool 7 Sum CNT
FileOpenBlock 24 510’025’424 510’025’448 2
enqueue 39’257’104 24 39’257’128 2
KSK VT POOL 24 19’130’696 19’130’720 2
ktlbk state objects 17’971’200 17’971’200 1
Wait History Segment 15’733’120 15’733’120 1
Global Context 11’017’024 11’017’024 1
call 24 10’057’864 10’057’888 2
keswx:plan en 10’011’896 10’011’896 1
KEWS sesstat values 9’802’944 9’802’944 1
FileIdentificatonBlock 24 7’711’368 7’711’392 2

Table 5.3: Top 10 Disparity Components (in byte)

Table 5.3 shows that there are component areas allocated into one single subpool only (CNT = 1). For
example, "Global Context" is allocated only in Subpool 4. With such an observation in mind, we can
probably explain the "Global Context" contention experienced in the past: since it is all
in one single subpool, it is protected by one single latch.

In total, we found 110 components allocated into two subpools; often one of the two gets merely
24 bytes and the other gets the rest. For example, in Table 5.3 we can see that "enqueue" is allocated
39,257,104 bytes in Subpool 1, but only 24 bytes in Subpool 7 (24 bytes is probably a special marker).
So we can assume that all Oracle enqueues (locks) are managed in Subpool 1.

Table 5.4 shows the top 10 unbalanced components according to allocation size divergence over all 7 subpools
(average difference > 30%, all sizes in bytes):

Name Subpool 1 Subpool 2 Subpool 3 Subpool 4 Subpool 5 Subpool 6 Subpool 7 Sum CNT
SQLA 1'110'240 588'943'660 565'317'412 185'664'243 574'008'432 155'498'158 272'452'360 2'342'994'419 7
FileOpenBlock 24 510'025'424 510'025'448 2
SQLP 761'808 66'155'040 61'548'864 15'324'936 57'623'952 10'234'544 21'440'800 233'089'944 7
KGLS 737'792 9'169'960 8'730'128 4'925'272 8'779'216 4'559'760 5'985'728 42'887'856 7
KQR M PO 30'720 799'232 512 255'896 30'568'128 138'680 8'590'256 40'383'424 7
enqueue 39'257'104 24 39'257'128 2
PLDIA 0 7'246'864 8'072'928 4'285'000 7'129'088 3'520'016 5'066'408 35'320'304 7
write state object 4'662'208 4'662'184 48 9'324'368 24 4'662'208 4'662'208 27'973'248 7
KQR L PO 924'032 21'262'248 117'768 22'304'048 3
PRTDS 241'720 4'668'976 4'023'352 2'747'344 5'001'776 2'203'384 1'311'744 20'198'296 7

Table 5.4: Top 10 Unbalanced Components (in byte)

There exist certain components which are extremely unbalanced among the subpools. For example, SQLA
in the first row of Table 5.4:

Subpool_1 has 1'110 KB for SQLA.
Subpool_2 has 588'943 KB for SQLA.

This could give certain hints on the error below:

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","SELECT MAX(XX) FROM...","SQLA","tmp")

(it would be more helpful if the subpool number were included in the above error message)

Now we could wonder whether this shared pool architecture is, by design, hard to balance. For static area
allocations, like db_block_hash_buckets, it should be acceptable. However, dynamic components with
frequent memory fragmentation can put tremendous pressure on memory management.

Oracle also describes ORA-04031 as a cumulative outcome of a certain period of ineffective memory usage.
Quite often it is thrown by victim sessions, which are not necessarily the cause of the error.
Consequently, the error is hard to track, predict and reproduce.

From what we learned from Lisp (CLOS) and Java, which use automatic memory management
(garbage collection), we can also understand that shared pool memory management sets a remarkable
challenge for this technique, and hence there could be a long way to go before it is perfect.

In Oracle 11.2, each subpool is further subdivided into 4 durations: "instance", "session", "cursor", and
"execution" (compare Lisp garbage collector generations), which classify allocated memory according to the
duration of time it is expected to be needed. For example, in the dump file, we can see the line:

HEAP DUMP heap name="sga heap(1, 0)"

where 1 denotes subpool number, 0 denotes ”instance” duration.

In Oracle 12.1, a change was made: only two durations (probably "instance" and "cursor") are implemented
in each subpool, in order to reduce the imbalance. That reflects the iteration of shared pool
improvements (the number of durations increased from 0 to 4, and then decreased from 4 to 2; for Oracle
12.1, 2 seems the best fixed point).

In Oracle, if redo is claimed to be the most critical mechanism, the shared pool should probably be declared
the most sophisticated one.

By the way, to inspect the number of actually used subpools, run the following query (instead of checking
_kghdsidx_count):

select * from v$latch_children where name = ’shared pool’;

It always returns 7 rows (maximum 7 subpools), but the active ones are those with higher gets or misses.
So if we filter out the latches with very few (or always constant) gets and misses, the remaining ones are
the actually allocated subpools. (By the way, the existence of 7 shared pool latches again shows that latches
are predefined, statically allocated, and probably hard-coded with their maximum values in the factory.)

As shown in Table 5.2, among the top 5 memory consuming components, only db_block_hash_buckets is
(almost) evenly distributed. In the following discussions, we will go through all 5 components one by
one.

5.1.2 KKSSP

Table 5.2 shows that KKSSP is the top memory consumer in the Prod DB. The total KKSSP consumption
amounts to 4GB, on average 700KB per session (6000 sessions at the time of the problem). The trace dump
contains messages like:

ORA-04031: unable to allocate 256 bytes of shared memory ("shared pool","unknown object","KKSSP^9876","kgllk").

To better understand KKSSP allocation, we make a shared pool heapdump in the Test DB:

alter session set max_dump_file_size = unlimited;
alter session set tracefile_identifier = 'shared_pool_1';
alter system set events 'immediate trace name heapdump level 536870914';

-- heapdump level 2050 dumps SGA with contents
-- alter session set events 'immediate trace name heapdump level 2050';

in which all KKSSP are listed like:

Chunk 70000009fad14d0 sz= 2136 freeable "KKSSP^2566 " ds=7000000A5DA1798

In the comment "KKSSP^2566", 2566 is the session id (sid) of the login session. So KKSSP is allocated per
session and is a session-specific area in the shared pool. This contradicts the common belief that the shared pool
is shared (at least in the majority) across all sessions, since this top allocation is already dedicated to each
particular session.
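A rough total of KKSSP over all sessions can be obtained from x$ksmsp (a sketch; note that scanning x$ksmsp is expensive and can itself stress the shared pool, so use it with care on a busy system):

select count(distinct ksmchcom) kkssp_heaps, round(sum(ksmchsiz)/1024/1024) mb
from x$ksmsp
where ksmchcom like 'KKSSP^%';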

Pick the ds (descriptor) address marked above, and dig further with a KKSSP address dump:

ORADEBUG DUMP HEAPDUMP_ADDR 2 0x7000000A5DA1798

then aggregate by Heapdump Analyzer (see Blog: Oracle memory troubleshooting, Part 1: Heapdump
Analyzer [28]):

Total_size #Chunks Chunk_size, From_heap, Chunk_type, Alloc_reason
---------- ------- ------------ ----------------- --------------- -----------------
188160 735 256 , KKSSP^2566, freeable, kgllk
181504 709 256 , KKSSP^2566, freeable, kglpn
56320 220 256 , KKSSP^2566, freeable, KQR ENQ
28896 516 56 , KKSSP^2566, freeable, kglseshtSegs
12312 1 12312 , KKSSP^2566, freeable, kglseshtTable

The above table shows that the top 3 memory consumers are kgllk, kglpn and KQR ENQ, each allocated
in a chunk size of 256 bytes. More than half of the memory is allocated to kgllk and kglpn, since
the application is coded in Plsql packages and types, which require kgllk and kglpn during each call to
keep them stateful. The last allocation, kglseshtTable, is one single chunk but with a large contiguous
allocation of 12312 bytes, probably the "session param values" memory allocated at the start of the session [9].

To inspect the objects touched by kgllk and kglpn, another way to list them is to run a query like:

select s.sid, username, logon_time
      ,(select kglnaobj||'('||kglobtyd||')' from x$kglob where kglhdadr = v.object_handle and rownum=1) kobj_name
      ,v.*
from v$libcache_locks v, v$session s
where holding_session = s.saddr
and s.sid = 2566;

The query below can be used to debug "library cache pin" and "library cache lock" waits:

select * from x$kglob where kglhdadr in (select p1raw from v$session where sid = :blocked_session);

Instead of a heapdump, a direct way to get the KKSSP memory consumption for one given session is a query
like:

select count(*), sum(ksmchsiz) from x$ksmsp where ksmchcom=’KKSSP^2566’;

The following query can also give the address for the KKSSP address heapdump:

select ksmchpar from x$ksmsp where ksmchcom=’KKSSP^2566’ and ksmchcls = ’recr’;

The output contains the above address 07000000A5DA1798; then we can also make the same dump by (replacing
the first 0 with 0X):

oradebug dump heapdump_addr 2 0X7000000A5DA1798

By the way, the 3rd line in the above aggregated output shows that "KQR ENQ" has been moved into KKSSP
in Oracle 11.2.0.3.0. That helps us find the lost "KQR ENQ" mentioned in the book Oracle Core [15, p. 169]:

... when I ran the same query against an instance of 11.2.0.2 there was no entry for KQR ENQ ...

MOS: ORA-4031 Or Excessive Memory Consumption On KKSSP Due To Parse Failures (Doc ID
2369127.1) wrote:

KKSSP is just a type of internal memory allocation related to child cursors.

We also noticed that high usage of kgllk and kglpn went along with heavy contention on the kokc latch (kokc
descriptor allocation latch), which is responsible for pinning, unpinning and freeing objects (Oracle object
types). kokc is a single latch without children, and thus a single point of contention.
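Contention on this latch can be watched with a query like the sketch below (assuming the latch name appears in v$latch as "kokc descriptor allocation latch"):

select name, gets, misses, sleeps, wait_time
from v$latch
where name = 'kokc descriptor allocation latch';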

5.1.3 db_block_hash_buckets

db_block_hash_buckets is for database block hash buckets. It is allocated in the shared pool. It takes about
1% of the buffer pool for db_block_size = 8192, i.e. about 70 bytes for each database block hash bucket (chain).

The Prod DB is configured with:

db_cache_size = 320G
db_keep_cache_size = 4G
db_recycle_cache_size = 60G

Altogether this is about 384G for the whole buffer pool. Table 5.2 showed that db_block_hash_buckets in the
shared pool is 3.7GB (3'732'935 KB), which is close to 1%.

Oracle has a hidden parameter of the same name, whose default value has evolved across releases:

Name: _db_block_hash_buckets
Description: Number of database block hash buckets
Default value:
262144 Oracle 10.2
131072 Oracle 11.2.0.1 & Oracle 11.2.0.2 (halved)
524288 Oracle 11.2.0.3 (quadrupled)
1048576 Oracle 11.2.0.4 & 12.1.0.2 & 12.2.0.2 & 18c (doubled)

Each DB block is hashed to a bucket, which hooks a chain of DB blocks (to be precise, a chain of
buffer headers, each of which points to the data block it represents) and is protected by one "cache buffers
chains" latch (see section 3.2.2 in Chapter Locks, Latches and Mutexes).

5.1.4 SQLA

The top 5 memory allocations in Table 5.2 show that SQLA in Subpool 1 is desperately low:

Subpool_1 has 1'110 KB for SQLA.
Subpool_2 has 588'943 KB for SQLA.

So if a statement requires more than 1’110 KB in Subpool 1, it will not be satisfied.

It is not clear why SQLA in Subpool 1 is extremely low. One possible guess is that KGLH0 in Subpool 1
is too high, and there is a certain cap on the total Sql memory usage in each subpool (see section
5.1.7 later).

We can list Sql memory consumption by sql_id and alloc_class with:

select /*+ leading(c) */ -- without leading(c) hint, no row returns
       sql_id, alloc_class, sum(chunk_size)/1024 sum_kb, count(*) chunk_cnt
from v$sql_shared_memory
--where sql_id = ':sql_id'
group by sql_id, alloc_class
order by sum_kb desc;

As we observed during the incident, the low SQLA caused frequent cursor age-outs, and consequently reload-
ing/hard parsing, and session dumps with messages like:

ORA-04031: unable to allocate 32 bytes of shared memory
("shared pool","SELECT MAX(XX) FROM...","SQLA","tmp")

ORA-04031: unable to allocate 48 bytes of shared memory
("shared pool","select yy from tt whe...","TCHK^3fefd486","qcsqlpath: qcsAddSqlPath")

5.1.4.1 heapdump of shared pool

Now we start to do some experiments on the Test DB. First, make a top-level shared pool heapdump:

SQL > oradebug dump heapdump 2

which shows some lines about SQLA, each allocated in a chunk size of 4096 bytes:

Chunk 700000088ff8000 sz= 4096 freeable "SQLA^8b7ceb5a " ds=7000000a88fafc8
Chunk 700000088ff9000 sz= 4096 freeable "SQLA^8b7ceb5a " ds=7000000a88fafc8

where 8b7ceb5a is the hash value of the sql_id, which can be obtained by dbms_utility.sqlid_to_sqlhash.
Computing the remainder mod(0x8b7ceb5a, 131072) gives the hash bucket number in the library cache.

Supposing all 7 subpools are used, we can guess that the subpool number for a Sql is determined by:

mod(mod(to_number(’8b7ceb5a’, ’xxxxxxxxx’), 131072), 7) + 1

or directly from the sql_id:

mod(mod(dbms_utility.sqlid_to_sqlhash(:sql_id), 131072), 7) + 1

If the above formula is indeed used internally by Oracle, all sql_ids that are mapped to Subpool 1 have
a higher chance of hitting ORA-04031 in our Prod DB (this is only a guess). Later in this section,
we will verify it again with the hash values listed in ORA-04031 error messages.
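As a worked example under this guess, the hash value 8b7ceb5a seen above would map to Subpool 2:

select mod(mod(to_number('8b7ceb5a', 'xxxxxxxxx'), 131072), 7) + 1 subpool
from dual;

   SUBPOOL
----------
         2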

5.1.4.2 heapdump_addr dump of SQLA

Pick the SQLA ds value from the above dump and make a low-level addr dump:

SQL > oradebug dump heapdump_addr 1 0X7000000a88fafc8

*** 2013-03-12 10:38:24.072
Processing Oradebug command 'dump heapdump_addr 1 0X7000000a88fafc8'
******************************************************
HEAP DUMP heap name="SQLA^8b7ceb5a" desc=7000000a88fafc8
extent sz=0xfe8 alt=32767 het=368 rec=0 flg=2 opc=2
parent=700000000000198 owner=7000000a88fae88 nex=0 xsz=0xfe8 heap=0
fl2=0x67, nex=0, dsxvers=1, dsxflg=0x0
dsx first ext=0x8c5e2f90
EXTENT 0 addr=7000000902996a0
Chunk 7000000902996b0 sz= 4056 freeable "TCHK^8b7ceb5a " ds=70000008c5e3a98
...
EXTENT 85 addr=7000000902ee6a8
Chunk 7000000902ee6b8 sz= 4056 freeable "TCHK^8b7ceb5a " ds=70000008c5e3a98

The above dump shows that each TCHK chunk takes 4056 bytes, i.e. 40 bytes of overhead, since the above SQLA ds is
allocated in a chunk size of 4096.

5.1.4.3 heapdump_addr dump of TCHK (Typecheck heap)

The above dump shows that SQLA consists of TCHK. Again pick the ds value from the TCHK above and make a
further dump to drill down into the memory allocations:

SQL > oradebug dump heapdump_addr 1 0X70000008c5e3a98

*** 2013-03-12 10:46:42.079
Processing Oradebug command 'dump heapdump_addr 1 0X70000008c5e3a98'
******************************************************
HEAP DUMP heap name="TCHK^8b7ceb5a" desc=70000008c5e3a98
extent sz=0xfc0 alt=32767 het=32767 rec=0 flg=2 opc=2
parent=7000000a88fafc8 owner=7000000a88fae88 nex=0 xsz=0xfc0 heap=0
fl2=0x67, nex=0, dsxvers=1, dsxflg=0x0
dsx first ext=0x8c5a7b30
EXTENT 0 addr=7000000902996c8
Chunk 7000000902996d8 sz= 608 free " "
Chunk 700000090299938 sz= 40 freeable "chedef : qcuatc"
Chunk 700000094527010 sz= 112 freeable "optdef: qcopCre"
Chunk 7000000a16d2678 sz= 152 freeable "opndef: qcopCre"
Chunk 700000092c9e160 sz= 288 freeable "kkojnp - infode"
Chunk 7000000902bea50 sz= 40 freeable "chedef : qcuatc"
Chunk 7000000902ebf58 sz= 184 freeable "kggec.c.kggfa "
...
Chunk 70000003ec98078 sz= 576 recreate "177.kggfa " latch=0

The above TCHK dump lists the concrete memory consumers at the atomic level (the smallest unit for each con-
sumer). The comment on each line can give certain hints about its content; for example, "recreate" chunks are
memory allocations for objects that can be rebuilt [9].

We can also use the following query to track memory consumption:

select /*+ leading(c) */ *
from v$sql_shared_memory
where subheap_desc not like '00';

The output shows SQLA as heap_desc and TCHK as subheap_desc (TCHK is a subheap of the SQLA heap), for
example:

v$sql_shared_memory.heap_desc    points to ds=7000000a88fafc8 in "SQLA^8b7ceb5a "
v$sql_shared_memory.subheap_desc points to ds=70000008c5e3a98 in "TCHK^8b7ceb5a "

Therefore, we can also pick the values of heap_desc and subheap_desc from the above query to make the SQLA
and TCHK dumps.

In the following example, if we look at the size reported in v$sql.typecheck_mem for one sql_id, it is
close to the sum of the sizes reported in v$sql_shared_memory.chunk_size for that sql_id with function type
TCHK.

select sql_id, typecheck_mem, type_chk_heap, sql_text
from v$sql
where typecheck_mem > 0;

sql_id typecheck_mem
-------------- -------------
4512qfum52bj7 197168

select /*+ leading(c) */ sum(chunk_size)
from v$sql_shared_memory
where subheap_desc not like '00'
and sql_id = '4512qfum52bj7'
and function like 'TCHK%';

sum(chunk_size)
---------------
199152

With the following query, we can watch chunk size by function:

select /*+ leading(c) */ sql_id, function, sum(chunk_size) func_chunk_size, sql_text
from v$sql_shared_memory
where sql_id = '4512qfum52bj7'
group by sql_id, function, sql_text
order by sql_id, func_chunk_size desc, function, sql_text;

SQL_ID        FUNCTION       FUNC_CHUNK_SIZE
------------- -------------- ---------------
4512qfum52bj7 TCHK^a6512e27  199152
4512qfum52bj7 qcopCre        20616
4512qfum52bj7 qbcqtcHTHeap   10224
4512qfum52bj7 qcdlgc         8200
4512qfum52bj7 qcuatc         8136
...

As tested, the majority (>90%) of memory in SQLA is consumed by TCHK in this Test DB.

5.1.5 KGLH0

Similar to SQLA, in the shared pool heapdump of the Test DB, we look at lines containing KGLH0, for example:

SQL > oradebug dump heapdump 2

Chunk 7000000a1efce68 sz= 4096 freeable "KGLH0^8b7ceb5a " ds=7000000a1d9c450

Pick the KGLH0 ds and make an address dump:

SQL > oradebug dump heapdump_addr 1 0X7000000a1d9c450

Total_size #Chunks Chunk_size, From_heap, Chunk_type, Alloc_reason
---------- ------- ------------ ----------------- ----------------- -----------------
3296 1 3296 , KGLH0^8b7ceb5a, perm, perm
1960 1 1960 , KGLH0^8b7ceb5a, freeable, policy chain
1760 1 1760 , KGLH0^8b7ceb5a, perm, perm
1392 2 696 , KGLH0^8b7ceb5a, freeable, policy chain
1384 1 1384 , KGLH0^8b7ceb5a, perm, perm
1152 8 144 , KGLH0^8b7ceb5a, freeable, context chain
880 1 880 , KGLH0^8b7ceb5a, freeable, policy chain
760 5 152 , KGLH0^8b7ceb5a, freeable, kgltbtab
720 1 720 , KGLH0^8b7ceb5a, freeable, policy chain
712 1 712 , KGLH0^8b7ceb5a, freeable, policy chain
656 1 656 , KGLH0^8b7ceb5a, freeable, policy chain
608 1 608 , KGLH0^8b7ceb5a, freeable, policy chain
416 1 416 , KGLH0^8b7ceb5a, free,
376 1 376 , KGLH0^8b7ceb5a, free,
80 1 80 , KGLH0^8b7ceb5a, perm, perm
48 1 48 , KGLH0^8b7ceb5a, free,
(sum=16200 Bytes)

All output items seem to be about VPD. They can be found by the query:

--sql_id is ’2z32kkb821u9g’;
select * from v$vpd_policy v where sql_id = :sql_id;

The query returns 8 rows, which match 7 policy chains and 1 context chain in the above heapdump_addr
output. In section 5.1.7 later, we will discuss that KGLH0 is used to store the statement execution environment.
A VPD policy is one kind of such environment; it determines the VPD predicates that amend the statement
according to the login application context, as discussed in section 3.3.

A library cache dump ("library cache level 16", see section 5.1.7 later) reveals that each child cursor
is associated with one KGLH0 and one SQLA. KGLH0 stores environment information, whereas SQLA
stores the parsing tree and xplan. When memory is reclaimed under memory pressure, KGLH0 is kept, whereas
SQLA is deallocated. A later re-use of the removed child cursor will result in a hard re-parse based on
the kept KGLH0 info.
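One way to spot such age-outs on a live system is to watch v$sql.loads grow without corresponding invalidations (a sketch; :sql_id is a placeholder):

select sql_id, child_number, loads, invalidations, parse_calls, executions
from v$sql
where sql_id = :sql_id
order by child_number;

-- loads > invalidations + 1 suggests the child cursor was reloaded
-- (hard re-parsed) after its SQLA heap had been evicted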

As an example, sporadically we get a Sql Trace (10046) like:

SQL ID: 67kamvx1dz051

SELECT * FROM XXX WHERE ID = :B1

call     count  cpu      elapsed    disk       query      current    rows
-------  ------ -------- ---------- ---------- ---------- ---------- ----------
Parse    0      0.00     0.00       0          0          0          0
Execute  233    0.01     0.85       0          0          0          0
Fetch    233    0.00     0.01       0          1494       0          78
-------  ------ -------- ---------- ---------- ---------- ---------- ----------
total    466    0.02     0.86       0          1494       0          78

Misses in library cache during parse: 0
Misses in library cache during execute: 2
Parsing user id: 49

Elapsed times include waiting on following events:
Event waited on                          Times Waited Max. Wait  Total Waited
---------------------------------------- ------------ ---------- ------------
latch: shared pool                       12           0.00       0.03
latch: row cache objects                 3            0.29       0.79

The output line "Misses in library cache during execute: 2" indicates such hard parsing during
execute. Moreover, the wait events "latch: shared pool" and "latch: row cache objects" also pro-
vide evidence of such hard parsing. While executing the statement (233 times), the required child
cursor could always be found in KGLH0, hence no parse calls during parse; but there are 2 "Misses in
library cache during execute", which indicates that the xplan had been evicted (2 times) and had to be newly
created during execute (Note: if the next line were "Parsing user id: SYS", it would be for recursive statements).

For this select statement, the "Execute" line takes most of the elapsed time (0.85 of 0.86), while the "Parse" line
shows 0.00. This again indicates that "Misses in library cache during execute" occurred.

In the above output, "Parse 0" and "Misses in library cache during parse: 0" mean no parse call.
This is a proof of existence: the parse call count is 0, but hard parsing is not 0. Therefore, statistics on
parse calls and hard parsing do not include each other (see section 4.3 in Chapter Parsing and Compiling).

The above shared pool heapdump shows that both SQLA and KGLH0 are allocated in a chunk size of 4096
bytes. So their memory allocations start from:

Bucket 240 size=4096

in the Free List (memory in the Free List is partitioned into 255 buckets, from 0 to 254, to be discussed in
section 5.1.6 later).

Usually a big chunk size generates less fragmentation, but memory utilization is less efficient and results
in more overhead. If an ORA-04031 says:

unable to allocate 32 bytes of shared memory ("shared pool","SELECT MAX(XX) FROM...","SQLA","tmp")

It could mean that although only 32 bytes of memory were required, the request was still converted into a
4096-byte request for the SQLA allocation.

5.1.6 Free Memory and Fragmentation

Look at the free memory summary in Table 5.5 (copied from Table 5.2) below:

Name        Subpool 1 Subpool 2 Subpool 3 Subpool 4 Subpool 5 Subpool 6 Subpool 7 Sum
free memory 306'562   358'029   353'146   306'659   342'194   325'386   304'535   2'296'513

Table 5.5: SGA Free Memory (in KB)

Although total free memory is 2’296’513 KB and each subpool has at least 304’535 KB free, we are still
facing:

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","SELECT MAX(XX) FROM...","SQLA","tmp")

One intuitive question is:

Why do I get ORA-04031 even though there is plenty of free memory (> 10%)?

Often an unsubstantiated reply is: memory fragmentation (or memory leak). In this section, we will try to
dispel such a fashionable pretext.

5.1.6.1 Free Lists

In the Test DB, make an SGA summary heapdump:

alter session set events ’immediate trace name heapdump level 2’;

and then look at the FREE LISTS section:

FREE LISTS:
Bucket Size Increase
------ ------ --------
0 32
1 40 8
2 48 8
...
179 1464 8
180 1480 16
...
189 1624 16
190 1672 48
...
239 4024 48
240 4096 72
241 4104 8
242 4120 16
243 8216 4096
244 8752 536
245 8760 8

246 8768 8
247 8776 8
248 9384 608
249 9392 8
250 12368 2976
251 12376 8
252 16408 4032
253 32792 16384
254 65560 32768

It shows that the FREE LISTS are organized into 255 buckets with different chunk sizes, starting with the minimum
size=32 bytes in Bucket 0 up to size=65560 (64K) in Bucket 254. From Bucket 0 to 239, the increase is 8 to
48 bytes, followed by some irregular increases.

Since the minimum memory chunk size in the shared pool is 32 bytes, when the error says:

unable to allocate 32 bytes

there is indeed no more free memory. But the Prod DB trace dump shows that each subpool has at least
304'535 KB of free memory.

So where is the mystery behind this contradictory information?

To understand it, we will first look at what free memory means from different viewpoints, then expose the
memory allocation of certain popular components, and finally discuss the impacts on applications.
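One way to quantify free-memory fragmentation is to aggregate the free chunks by size class (a sketch on x$ksmsp; as noted earlier, scanning x$ksmsp can be expensive on a busy system):

select ksmchsiz chunk_size, count(*) chunks, round(sum(ksmchsiz)/1024) total_kb
from x$ksmsp
where ksmchcls = 'free'
group by ksmchsiz
order by ksmchsiz;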

5.1.6.2 Free Memory: x$ksmss (v$sgastat) vs. x$ksmsp

Take the same Test DB (shared_pool_size = 1408M); it has two subpools. One direct way to get shared
pool memory statistics is to run the 3 queries below:

select name, round(bytes/1024/1024) mb from v$sgastat
where pool='shared pool' and name in ('free memory', 'KKSSP', 'KGLH0', 'SQLA')
order by name desc;

NAME MB
------------ ---
free memory 298
SQLA 117
KKSSP 4
KGLH0 115

select ksmssnam name, ksmdsidx, round(ksmsslen/1024/1024) mb
from x$ksmss
where ksmssnam in ('free memory', 'KKSSP', 'KGLH0', 'SQLA')
order by name desc;

NAME KSMDSIDX MB
------------ --------- ---
free memory 0 208 -- RESERVED EXTENTS
free memory 1 42 -- subpool 1
free memory 2 48 -- subpool 2
SQLA 1 117
KKSSP 1 4
KGLH0 1 115

with sq as
 (select substr(ksmchcom, 1, decode((instr(ksmchcom, '^') - 1), -1,
                length(ksmchcom), (instr(ksmchcom, '^') - 1))) name
        ,v.*
  from x$ksmsp v)
select name, round(sum(ksmchsiz)/1024/1024) mb
from sq
where name in ('free memory', 'KKSSP','KGLH0', 'SQLA')
group by name
order by name desc;

NAME MB
------------ ---
free memory 81
SQLA 117
KKSSP 4
KGLH0 116

The first query on v$sgastat reports 298 MB free memory.

The second query, on x$ksmss, lists free memory per subpool, where ksmdsidx 0 denotes the "RESERVED
EXTENTS" (see the next section "SGA Summary Heapdump", line reserved granule count
13 (granule size 16777216)). "RESERVED EXTENTS" has 208 MB free memory, Subpool 1 has 42
MB, Subpool 2 has 48 MB, altogether 298 MB, which matches the free memory reported in v$sgastat
since v$sgastat is defined on x$ksmss.

The third query, on x$ksmsp, reports 81 MB free memory. Comparing x$ksmss with x$ksmsp, the values
of the components KKSSP, KGLH0 and SQLA are very similar (1 MB difference on KGLH0), but free memory in
x$ksmss is 90 MB (excluding "RESERVED EXTENTS"), whereas in x$ksmsp it is 81 MB: there is
a discrepancy of 9 MB. Now we can try to figure out what causes this 9 MB difference between the two views.

5.1.6.3 SGA Summary Heapdump vs. Component Heapdump

Make an SGA summary heapdump on the Test DB; it shows:

-- heapdump dump level 2 for SGA summary
-- alter session set events 'immediate trace name heapdump level 2';

--------------------- <SGA Summary Heapdump:free space> ---------------------

HEAP DUMP heap name="sga heap" desc=700000000000198
reserved granule count 13 (granule size 16777216)
RESERVED EXTENTS

HEAP DUMP heap name="sga heap(1,0)" desc=700000000052a48
FREE LISTS
Total free space = 9970088
RESERVED FREE LISTS
Total reserved free space = 29473232

HEAP DUMP heap name="sga heap(2,0)" desc=70000000005c310
FREE LISTS
Total free space = 12790056
RESERVED FREE LISTS
Total reserved free space = 31900768

--------------------- <SGA Summary Heapdump:KGLH0> ---------------------

Chunk 7000000a1cd3578 sz= 4096 recreate "KGLH0^d020e92f " latch=0
Chunk 7000000a1d9c4d8 sz= 4096 recreate "KGLH0^d020e92f " latch=0
Chunk 7000000a1efce68 sz= 4096 freeable "KGLH0^d020e92f " ds=7000000a1d9c450
Chunk 7000000a7b8f880 sz= 4096 freeable "KGLH0^d020e92f " ds=7000000a1d9c450
Chunk 7000000a7f07588 sz= 4096 freeable "KGLH0^d020e92f " ds=7000000a1d9c450

--------------------- <SGA Summary Heapdump:FREE LISTS> ---------------------
-- Bucket 50 to 254 are not listed

HEAP DUMP heap name="sga heap(1,0)" desc=700000000052a48
FREE LISTS:
Bucket 0 size= 32
Bucket 1 size= 40
Bucket 2 size= 48
...
Bucket 42 size=368
Bucket 43 size=376
Bucket 44 size=384
Bucket 45 size=392
Bucket 46 size=400
Bucket 47 size=408
Bucket 48 size=416
Bucket 49 size=424
HEAP DUMP heap name="sga heap(2,0)" desc=70000000005c310
FREE LISTS:
Bucket 42 size=368
Bucket 43 size=376
Bucket 44 size=384
Bucket 45 size=392
Bucket 46 size=400
Bucket 47 size=408
Chunk 7000000a3e8c3d8 sz= 408 free " "
Bucket 48 size=416
Bucket 49 size=424

First, look at Section <SGA Summary Heapdump:free space>, summing up the free memory:

name                 HEAP_DUMP                                    MB
-------------------- -------------------------------------------- -----
<reserved granule count 13 (granule size 16777216)>
RESERVED EXTENTS     round(13*16777216/1024/1024) = 208

<Total free space = 9970088>
<Total reserved free space = 29473232>
sga heap(1,0)        round((9970088+29473232)/1024/1024) = 38

<Total free space = 12790056>
<Total reserved free space = 31900768>
sga heap(2,0)        round((12790056+31900768)/1024/1024) = 43

The "RESERVED EXTENTS" of 208 MB matches the previous output of the x$ksmss query.

sga heap(1,0) has 38 MB free space and sga heap(2,0) has 43 MB, together 81 MB, the same as
reported in x$ksmsp. (In the above heapdump, summing all chunks commented with "R-free" and "free"
also gives the same result.)

However, if we look at the Section <SGA Summary Heapdump:FREE LISTS>, there is only one free
chunk, in Bucket 47 with size=408, for Buckets 42 to 49 (other buckets are not listed in the above dump).

So now the question is why x$ksmss (or its derived v$sgastat) reports 42 MB free memory in Subpool 1
and 48 MB in Subpool 2, which are 4 MB (42-38) respectively 5 MB (48-43) more than the free
memory reported in Section <SGA Summary Heapdump:free space> (or x$ksmsp) for both subpools.

Let's try to dig further.

Look at the above section <SGA Summary Heapdump:KGLH0> (copied here again), which contains all extracted
lines with one specific comment, say, KGLH0^d020e92f.

--------------------- <SGA Summary Heapdump:KGLH0> ---------------------

Chunk 7000000a1cd3578 sz= 4096 recreate "KGLH0^d020e92f " latch=0
Chunk 7000000a1d9c4d8 sz= 4096 recreate "KGLH0^d020e92f " latch=0
Chunk 7000000a1efce68 sz= 4096 freeable "KGLH0^d020e92f " ds=7000000a1d9c450
Chunk 7000000a7b8f880 sz= 4096 freeable "KGLH0^d020e92f " ds=7000000a1d9c450
Chunk 7000000a7f07588 sz= 4096 freeable "KGLH0^d020e92f " ds=7000000a1d9c450

The first 2 chunks are marked with chunk type "recreate", the other 3 chunks with "freeable", but there are no
chunks of type "free" (see MOS: Troubleshooting and Diagnosing ORA-4031 Error [Video] (Doc ID
396940.1) about chunk types).

Pick the ds for KGLH0^d020e92f and make a KGLH0 component addr dump in the Test DB:

SQL > oradebug dump heapdump_addr 1 0X7000000a1d9c450

--------------------- <KGLH0 addr dump> ---------------------
Processing Oradebug command 'dump heapdump_addr 1 0X7000000A1D9C450'
HEAP DUMP heap name="KGLH0^d020e92f" desc=7000000a1d9c450
EXTENT 0 addr=7000000a1efce80
Chunk 7000000a1efce90 sz= 1384 perm "perm " alo=600
Chunk 7000000a1efd3f8 sz= 152 freeable "kgltbtab "
Chunk 7000000a1efd490 sz= 696 freeable "policy chain "
Chunk 7000000a1efd748 sz= 144 freeable "context chain "
Chunk 7000000a1efd7d8 sz= 656 freeable "policy chain "
Chunk 7000000a1efda68 sz= 144 freeable "context chain "
Chunk 7000000a1efdaf8 sz= 880 freeable "policy chain "
EXTENT 1 addr=7000000a7f075a0
Chunk 7000000a7f075b0 sz= 376 free " "
Chunk 7000000a7f07728 sz= 144 freeable "context chain "
Chunk 7000000a7f077b8 sz= 712 freeable "policy chain "
Chunk 7000000a7f07a80 sz= 144 freeable "context chain "
Chunk 7000000a7f07b10 sz= 720 freeable "policy chain "
Chunk 7000000a7f07de0 sz= 1960 freeable "policy chain "
EXTENT 2 addr=7000000a7b8f898
Chunk 7000000a7b8f8a8 sz= 1760 perm "perm " alo=1760
Chunk 7000000a7b8ff88 sz= 416 free " "
Chunk 7000000a7b90128 sz= 144 freeable "context chain "
Chunk 7000000a7b901b8 sz= 144 freeable "context chain "
Chunk 7000000a7b90248 sz= 696 freeable "policy chain "
Chunk 7000000a7b90500 sz= 608 freeable "policy chain "
Chunk 7000000a7b90760 sz= 144 freeable "context chain "
Chunk 7000000a7b907f0 sz= 144 freeable "context chain "
EXTENT 3 addr=7000000a1cd35a8
Chunk 7000000a1cd35b8 sz= 80 perm "perm " alo=80
Chunk 7000000a1cd3608 sz= 3296 perm "perm " alo=3296
Chunk 7000000a1cd42e8 sz= 48 free " "
Chunk 7000000a1cd4318 sz= 152 freeable "kgltbtab "
Chunk 7000000a1cd43b0 sz= 152 freeable "kgltbtab "
Chunk 7000000a1cd4448 sz= 152 freeable "kgltbtab "
Chunk 7000000a1cd44e0 sz= 152 freeable "kgltbtab "
Total heap size = 16200
FREE LISTS:
Bucket 0 size=0
Chunk 7000000a7b8ff88 sz= 416 free " "
Chunk 7000000a7f075b0 sz= 376 free " "
Chunk 7000000a1cd42e8 sz= 48 free " "
Chunk 7000000a1cd35d8 sz= 0 kghdsx
Total free space = 840
UNPINNED RECREATABLE CHUNKS (lru first):
PERMANENT CHUNKS:
Chunk 7000000a1efce90 sz= 1384 perm "perm " alo=600
Chunk 7000000a7b8f8a8 sz= 1760 perm "perm " alo=1760
Chunk 7000000a1cd3608 sz= 3296 perm "perm " alo=3296
Chunk 7000000a1cd35b8 sz= 80 perm "perm " alo=80
Permanent space = 6520

then aggregate by Heapdump Analyzer (see Blog: Oracle memory troubleshooting, Part 1: Heapdump
Analyzer [28]):

--------------------- <KGLH0 addr dump:Analyzer> ---------------------

Total_size #Chunks Chunk_size, From_heap, Chunk_type, Alloc_reason


---------- ------- ------------ ----------------- ----------------- -----------------
3296 1 3296 , KGLH0^d020e92f, perm, perm
1960 1 1960 , KGLH0^d020e92f, freeable, policy chain
1760 1 1760 , KGLH0^d020e92f, perm, perm

1392 2 696 , KGLH0^d020e92f, freeable, policy chain
1384 1 1384 , KGLH0^d020e92f, perm, perm
1152 8 144 , KGLH0^d020e92f, freeable, context chain
880 1 880 , KGLH0^d020e92f, freeable, policy chain
760 5 152 , KGLH0^d020e92f, freeable, kgltbtab
720 1 720 , KGLH0^d020e92f, freeable, policy chain
712 1 712 , KGLH0^d020e92f, freeable, policy chain
656 1 656 , KGLH0^d020e92f, freeable, policy chain
608 1 608 , KGLH0^d020e92f, freeable, policy chain
416 1 416 , KGLH0^d020e92f, free,
376 1 376 , KGLH0^d020e92f, free,
80 1 80 , KGLH0^d020e92f, perm, perm
48 1 48 , KGLH0^d020e92f, free,

The above output shows three chunk types: perm, freeable, and free; there are 4 lines with chunk type perm, 9 lines with freeable, and the remaining 3 lines with free.

Summing all numbers in the first column (Total_size), we get a total used memory of 16200, of which 840 (416+376+48) is free space, 6520 (3296+1760+1384+80) is permanent space, and the remaining 8840 is freeable.

However, in the previous <SGA Summary Heapdump:KGLH0>, there are 5 allocated chunks (2 recreate, 3 freeable), each of 4096 bytes, 5*4096 = 20480 bytes altogether; but the effectively used memory is 16200, so there is an overhead of 20480 - 16200 = 4280 bytes.

Look at <SGA Summary Heapdump:FREE LISTS> (copied here again):

--------------------- <SGA Summary Heapdump:FREE LISTS> ---------------------


-- Bucket 50 to 254 are not listed

HEAP DUMP heap name="sga heap(1,0)" desc=700000000052a48


FREE LISTS:
Bucket 0 size= 32
Bucket 1 size= 40
Bucket 2 size= 48
...
Bucket 42 size=368
Bucket 43 size=376
Bucket 44 size=384
Bucket 45 size=392
Bucket 46 size=400
Bucket 47 size=408
Bucket 48 size=416
Bucket 49 size=424
HEAP DUMP heap name="sga heap(2,0)" desc=70000000005c310
FREE LISTS:
Bucket 42 size=368
Bucket 43 size=376
Bucket 44 size=384
Bucket 45 size=392
Bucket 46 size=400
Bucket 47 size=408
Chunk 7000000a3e8c3d8 sz= 408 free " "
Bucket 48 size=416
Bucket 49 size=424

there are no free chunks in Bucket 43 (size=376), Bucket 48 (size=416) or Bucket 2 (size=48), and only one free chunk in "Bucket 47 size=408". So it can only report 408 bytes of free memory for Buckets 42 to 49.
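Incidentally, the bucket sizes listed above follow a simple linear rule (at least for the buckets shown here), so a chunk size can be mapped directly to its bucket:

bucket size = 32 + 8 * bucket#,  e.g. Bucket 2: 32 + 8*2 = 48,  Bucket 47: 32 + 8*47 = 408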

However in component <KGLH0 addr dump> FREE LISTS (copied again below), it shows 3 free Chunks
with size 416, 376, and 48 respectively.

--------------------- <KGLH0 addr dump> ---------------------

FREE LISTS:

Bucket 0 size=0
Chunk 7000000a7b8ff88 sz= 416 free " " -- Bucket 48
Chunk 7000000a7f075b0 sz= 376 free " " -- Bucket 43
Chunk 7000000a1cd42e8 sz= 48 free " " -- Bucket 2
Chunk 7000000a1cd35d8 sz= 0 kghdsx
Total free space = 840

Above <KGLH0 addr dump> FREE LISTS reports Total free space = 840 in three different chunk
sizes.

So there are two FREE LISTS, each reporting free memory from a different point of view. The <SGA Summary Heapdump:FREE LISTS> reports 408 bytes for Buckets 42 to 49. The local component <KGLH0 addr dump> FREE LISTS reports 788 (= 376 + 416) bytes for two Buckets (43 and 48) out of 42 to 49. So <SGA Summary Heapdump:FREE LISTS> reports 380 (= 788 - 408) bytes less than <KGLH0 addr dump> for Buckets 42 to 49.

If the FREE LISTS of KGLH0^d020e92f in <KGLH0 addr dump> are exposed in x$ksmss (respectively its derived v$sgastat), more free memory is reported. However, if the <SGA Summary Heapdump:FREE LISTS> are exposed in x$ksmsp, less free memory is reported. This is probably why v$sgastat displays more free memory than <SGA summary heapdump> (x$ksmsp).

Since free chunks from the <KGLH0 addr dump> FREE LISTS are not listed in any <SGA Summary Heapdump> FREE LISTS, they are not eligible to serve any memory request until their enclosing parent chunks are returned to the LRU LIST (UNPINNED RECREATABLE CHUNKS).

In v$sgastat 'shared pool', the Bytes column for "free memory" is the allocable free memory (top RESERVED EXTENTS, plus FREE LISTS and RESERVED FREE LISTS in each subpool) plus the above un-allocable free memory (overhead) inside already allocated chunks. The Bytes of each other v$sgastat component is the effectively occupied memory (not including overhead). So the total memory still matches the configured shared pool size. In other words, the free memory reported in v$sgastat is derived as total memory minus effectively used memory.

x$ksmsp, in contrast, reports the really allocable free memory, similar to <SGA Summary Heapdump:free space>. Once a chunk is allocated, it is no longer counted, even if it still holds a portion of free memory. Additionally, x$ksmsp reports more details, for example, the KGLH0 of each cursor (by the way, there are many rows in x$ksmsp whose ksmchcom column contains values like 'permanent memor'), but it lists fewer rows than v$sgastat.

As we can see, such chunk-intra free memory leads to two different counting approaches, and hence the classic confusion:

Why do I get ORA-04031 even though there is plenty of free memory (> 10%) ?

Since KGLH0 and SQLA are allocated in a big chunk size (Bucket) of 4096 bytes, there can be heavy un-allocable free memory inside the allocated chunks (more discussion later with session_cached_cursors in section 5.1.7).

In summary, free memory in v$sgastat probably displays more than the really allocable free memory, since it includes the un-allocable free memory inside allocated chunks (chunk-intra free memory).

With the following query (SGA Summary Heapdump "RESERVED EXTENTS" excluded), we can get a rough comparison of both views.

with sq as
 (select substr(ksmchcom, 1, decode((instr(ksmchcom, '^') - 1), -1,
         length(ksmchcom), (instr(ksmchcom, '^') - 1))) name
        ,v.*
    from sys.x_ksmsp v)
,ksmsp as
 (select name ksmsp_name
        ,round(sum(ksmchsiz)/1024/1024) ksmsp_mb
        ,count(ksmchsiz) cnt
        ,round(avg(ksmchsiz)) avg
        ,min(ksmchsiz) min
        ,max(ksmchsiz) max
    from sq group by name)
,ksmss as
 (select ksmssnam ksmss_name
        ,round(sum(ksmsslen)/1024/1024) ksmss_mb
    from sys.x_ksmss
   where (ksmssnam, ksmdsidx) not in (('free memory', 0))
   group by ksmssnam)
select ksmss_name
      ,ksmss_mb
      ,nvl(ksmss_mb, 0) - nvl(ksmsp.ksmsp_mb, 0) delta_mb
      ,ksmsp.*
  from ksmss full outer join ksmsp
    on lower(ksmss.ksmss_name) = lower(ksmsp.ksmsp_name)
 where ksmss.ksmss_name in ('KKSSP', 'db_block_hash_buckets', 'KGLH0', 'SQLA', 'free memory')
 order by abs(delta_mb) desc nulls last;

Here is an example output:

KSMSS_NAME KSMSS_MB DELTA_MB KSMSP_NAME KSMSP_MB CNT AVG MIN MAX


---------------------- --------- --------- ------------ --------- ------ ------ ----- -------
db_block_hash_buckets 22 22
free memory 90 15 free memory 75 3938 20034 48 2096960
SQLA 118 -1 SQLA 119 30336 4104 4096 33960
KGLH0 116 -1 KGLH0 117 29871 4111 4096 52560
KKSSP 3 0 KKSSP 3 860 4244 568 12352

In the above output, DELTA_MB is the difference between x$ksmss and x$ksmsp. Look at the "free memory" line: DELTA_MB is 15 MB, which signifies that 15 MB of free memory sits inside already allocated chunks and is no longer eligible for any allocation.

Picking this DELTA_MB, we can estimate the memory allocation efficiency by:

DELTA_MB / shared_pool_size
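For example, with the 15 MB DELTA_MB above and the shared pool size taken from v$sgainfo (a rough sketch; under ASMM the v$parameter value can be 0, hence v$sgainfo is used here):

select round(15 / (bytes/1024/1024) * 100, 2) pct_chunk_intra_free
  from v$sgainfo
 where name = 'Shared Pool Size';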

So far, we have walked through all 5 top memory consuming components in Table 5.2, and partially explored them with dumps and queries. In the next discussions, we will look at the session-local cursor cache, and at the parameters which impact cursor versions and size.

5.1.7 Session Private Cursor Cache

In addition to the instance-wide shared pool, each session has its own private cursor cache for currently opened and parsed, or cached, cursors. It is divided into different sub-caches for different cursor types. All currently executing cursors are marked as "opened". For performance improvement, it provides a fast and short path to the parent shared pool. It is exposed in v$open_cursor, and controlled by session_cached_cursors and open_cursors. All cached cursors are hashed to 256 Buckets (see section 4.1.3).

In the shared pool, each cursor is allocated in two distinct heaps: KGLH0 (Heap0) and SQLA (Heap6, sqlarea), in multiples of 4K chunks. Repeated parse calls (more than 3 times) of the same Sql (including recursive Sql) or Plsql statement by any session connected to the DB make it a candidate for addition to the session cursor cache. When a cursor is added into the session cursor cache, its Heap0 is pinned in the shared pool, but not its Heap6 ([9]). Therefore, a cursor in this private cache partially pins its dependent cursor in the shared pool. Setting session_cached_cursors high increases pressure on the shared pool.

In normal operation, the SQLA of a child cursor xplan should be kept in the library cache together with its KGLH0, otherwise there is heavy hard parsing (due to invalidation/reload).

In fact, the Prod DB which threw ORA-04031 has session_cached_cursors = 600 (Oracle default: 50). And Table 5.2 shows that KGLH0 is the third top memory consumer in all subpools, and that SQLA in Subpool 1 is dramatically low compared to the other subpools.

With 6000 concurrently connected sessions and session_cached_cursors = 600, there could be 3,600,000 pinned KGLH0 in the extreme case (this purely mathematical number will not appear in practice, since the majority of cursors are identical in the shared pool).

Nowadays OO programs (e.g. Java) access the DB by generated getter and setter methods for each class field. Handed to the Oracle shared pool, these are a lot of small sql statements, but each is allocated with a minimum unit of 4096 bytes for KGLH0 and SQLA. When the shared pool is under memory pressure, KGLH0 is kept (controlled by session_cached_cursors) while SQLA is evicted. So the shared pool ends up occupied by a majority of KGLH0. If ORA-04031 is marked as "SQLA", it is probably caused by re-loading the SQLA of an existing KGLH0 statement, since memory for KGLH0 is requested first and has to be satisfied before re-loading SQLA.

In fact, we observed that in the beginning KGLH0 and SQLA are almost balanced; over time, KGLH0 increases while SQLA decreases. If there are continuous demands for SQLA on a particular subpool (for example, Subpool 1), ORA-04031 is often thrown for that particular subpool (see the previous sql_id to subpool mapping). That is probably one reason why, in the Prod DB, Subpool 1 (see Table 5.2) has extremely unbalanced KGLH0 and SQLA, and caused frequent ORA-04031.

By the way, there is also high memory allocation for PLMCD (Plsql bytecode, a.k.a. MCode), since the applications run heavy Plsql with plsql_code_type=INTERPRETED.

To verify our observation, we extract all 173 ORA-04031 errors (texts are shortened) from alert.log; for example, the first 10 errors are listed below. Although at most 400 bytes are requested per line, 4096 bytes have to be satisfied.

ORA-04031: 48 bytes ("shared pool","select id from cod...", "TCHK^1fefd466","qcsqlpath: qcsAddSqlPath")


ORA-04031: 32 bytes ("shared pool","unknown object", "KGLH0^b2ecac91","kglHeapInitialize:temp")
ORA-04031: 400 bytes ("shared pool","select i.obj#,...", "SQLA^bc5573b6","opixpop:kctdef")
ORA-04031: 400 bytes ("shared pool","SELECT OB.ID ,...", "SQLA^40121e6","opixpop:kctdef")
ORA-04031: 32 bytes ("shared pool","unknown object", "KGLH0^700d67c","kglHeapInitialize:temp")
ORA-04031: 56 bytes ("shared pool","INSERT INTO XX_I(...", "SQLA^a782ebb","idndef*[]: qkexrPackName")
ORA-04031: 120 bytes ("shared pool","UPDATE DBMS_ALERT_INFO ...", "SQLA^d2e09759","qeeOpt: qeesCreateOpt")
ORA-04031: 120 bytes ("shared pool","select audit$,propert ...", "SQLA^833d368b","opn: qkexrInitOpn")
ORA-04031: 48 bytes ("shared pool","SELECT /*+ all_rows */ ...", "SQLA^66826579","idndef : qcuAllocIdn")
ORA-04031: 48 bytes ("shared pool","select id from code_...", "TCHK^1fefd466","qcsqlpath: qcsAddSqlPath")

Pick the hash value in each line and compute the corresponding subpool number; all of them return 1, that is, Subpool 1.

select mod(mod(to_number('1fefd466', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('b2ecac91', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('bc5573b6', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('40121e6',  'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('700d67c',  'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('a782ebb',  'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('d2e09759', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('833d368b', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('66826579', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('1fefd466', 'xxxxxxxxx'), 131072), 7) + 1
from dual;

Actually, among the 173 ORA-04031 errors (over a timespan of 4 minutes), 162 are hashed to Subpool 1.
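The mapping formula can also be wrapped into a small helper function, so that each hash value does not need its own ad-hoc select; a minimal sketch, assuming (as in the queries above) 7 subpools and the mod-131072 bucketing:

create or replace function subpool_of(p_hash varchar2) return number is
begin
  -- same formula as in the ad-hoc query above
  return mod(mod(to_number(p_hash, 'xxxxxxxxx'), 131072), 7) + 1;
end;
/

select subpool_of('1fefd466') from dual;   -- returns 1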

Look at one session trace dump from the Prod DB; the section <Session Wait History> lists the last 10 wait events as follows (some text removed; note that the history is displayed in reverse chronological order):

Session Wait History:


0: waited for ’latch: shared pool’
address=0x70000000010ac28, number=0x133, tries=0x0
1: waited for ’latch: shared pool’
address=0x70000000010ac28, number=0x133, tries=0x0
2: waited for ’cursor: pin S wait on X’
idn=0xfbbdfa8b, value=0x1fef00000000, where=0x800000000
3: waited for ’cursor: pin S wait on X’
idn=0x2ded54a6, value=0x1fef00000000, where=0x300000000
4: waited for ’cursor: pin S wait on X’
idn=0xa4927d51, value=0x1fef00000000, where=0x300000000
5: waited for ’latch: shared pool’
address=0x70000000010ac28, number=0x133, tries=0x0
6: waited for ’latch: shared pool’
address=0x70000000010ac28, number=0x133, tries=0x0
7: waited for ’cursor: pin S wait on X’
idn=0xb9a7f11c, value=0x1fef00000000, where=0x500000000
8: waited for ’cursor: pin S wait on X’
idn=0xb9a7f11c, value=0x1fef00000000, where=0x500000000
9: waited for ’cursor: pin S wait on X’
idn=0xb9a7f11c, value=0x1fef00000000, where=0x500000000

There are 4 events marked as 'latch: shared pool'; the remaining 6 events waited for 'cursor: pin S wait on X', but with only 4 different idn values (entries 7, 8 and 9 have the same idn value 0xb9a7f11c) (see Blog: cursor: pin S wait on X [37]).

Pick all 4 idn values under the lines waited for 'cursor: pin S wait on X' and compute their subpool numbers; all are hashed to Subpool 1:

select mod(mod(to_number('fbbdfa8b', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('2ded54a6', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('a4927d51', 'xxxxxxxxx'), 131072), 7) + 1,
       mod(mod(to_number('b9a7f11c', 'xxxxxxxxx'), 131072), 7) + 1
from dual;

By the way, in wait event 'cursor: pin S wait on X', P3RAW ("where", see v$event_name) is of data type RAW(8). Its top 4 bytes contain x$mutex_sleep.location_id; for example, the above 0x500000000 points to x$mutex_sleep.location_id = 0x5, that is, Location: kkslce [KKSCHLPIN2] (visible in the AWR Section: Mutex Sleep Summary for "cursor: pin S wait on X").

In the above discussion, we mainly looked at session_cached_cursors for the Sql cursor cache. If we look at view v$open_cursor, the column cursor_type lists a few different caches, for example, generic session cursor cache for session cursors, dictionary lookup cursor cache for dictionary lookup cursors, and PL/SQL cursor cache for PL/SQL cursors. All of them are controlled by session_cached_cursors and open_cursors. The currently executing cursors are marked as open cursors and have a sql_exec_id (see v$sql_monitor), for example, Open PL/SQL cursors for OPEN PL/SQL (currently executing PL/SQL).
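To see how a session's private cache is populated across these types, the cache entries can simply be grouped by cursor_type; a quick sketch for the current session (assuming the cursor_type column is available, as in recent releases):

select cursor_type, count(*) cnt
  from v$open_cursor
 where sid = sys.dbms_support.mysid
 group by cursor_type
 order by cnt desc;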

Specially, the Plsql cursor cache is managed independently of the Sql session cursor cache discussed above. The Plsql cache is not a closed cursor cache; rather, the cursors are cached in an open state, as explained in [9] (Oracle 10 White Paper). That probably means a Sql cursor is partially pinned (only KGLH0), whereas Plsql is entirely pinned. To make the matter more complex, there are still other cursor related Oracle parameters, for example, cursor_space_for_time (deprecated as of Release 10.2), serial_reuse, and the Plsql serially_reusable Pragma.

5.1.8 Cursor Versions and Size

With Oracle 11.2.0.3.0, a new hidden parameter was introduced to control the number of child cursors per parent. The default value evolved from the initial 100 in Oracle 11.2.0.3.0 to 8192 in 18c.

Name: _cursor_obsolete_threshold
Description: Number of cursors per parent before obsoletion
Default value:
100 Oracle 11.2.0.3.0
1024 Oracle 11.2.0.4.0 & 12c1
8192 Oracle 12c2 & 18c

This parameter is an influential factor in KGLH0 and SQLA memory consumption, which can be monitored by:

select sql_id, sharable_mem, persistent_mem, runtime_mem, typecheck_mem, sql_text


from v$sqlarea
order by sharable_mem desc;

In one application, it was observed that, after the default value of _cursor_obsolete_threshold was increased by an Oracle upgrade, all connected sessions were blocked after dozens of hours (sometimes even a couple of days) by one session on wait event "library cache: mutex X" or "library cache lock".

In certain extreme tests on Oracle 11.2.0.3.0, we saw more than 60,000 child cursors for one statement (see Blog: One Mutex Collision Test [36]):

-- sql_id: ’754r1k9db5u80’
select id into l_id from testt where name = :B1;

select count(*) from v$sql where sql_id =’754r1k9db5u80’;


>>> 66’510

It seems that Oracle does not strictly enforce this threshold, which can eventually cause a shared pool explosion through the sheer number of child cursor versions.
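To spot such offending parents early, the children per parent can simply be counted; a sketch (the threshold 100 is only an example value):

select sql_id, count(*) child_count
  from v$sql
 group by sql_id
having count(*) > 100
 order by child_count desc;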

From the application side, a quick mitigation could be to modify sql texts so that they are evenly mapped to all subpools, or to reduce the number of different Sql statements and their numbers of children (see Blog: cursor: pin S wait on X [37]).

From the Oracle side, the package dbms_shared_pool can be used to manipulate shared pool objects. It provides methods (keep/unkeep, markhot/unmarkhot, purge, sizes, aborted_request_threshold) to manually administer offending Sql or Plsql objects. For instance, if a big object that is no longer needed, has leaked shared memory, or is no longer referenced is found to remain in the shared pool, we can first use the sizes procedure to check whether it is over a specified size, then invoke the purge procedure to clean it out of the shared pool.

For example, running the sizes procedure or the equivalent query below, sys.dbms_stats (package body) is found occupying 1056 KB; the purge procedure can then be invoked to reset its shared memory (note that after the reset, both package spec and body still exist in v$db_object_cache, but with sharable_mem being 0).

SQL > exec sys.dbms_shared_pool.sizes(1000);


SIZE(K) KEPT NAME
------- ------ -----------------------------------------
1056 SYS.DBMS_STATS (PACKAGE BODY)

with threshold_size as (select 1000 kb from dual)
select to_char(sharable_mem/1024, '999999') sz,
       decode(kept_versions, 0, '', rpad('YES('||to_char(kept_versions)||')', 6)) keeped,
       rawtohex(address)||','||to_char(hash_value) name,
       substr(sql_text,1,354) extra,
       1 iscursor
  from v$sqlarea, threshold_size
 where sharable_mem > threshold_size.kb * 1024
union
select to_char(sharable_mem/1024, '999999') sz,
       decode(kept, 'YES', 'YES   ', '      ') keeped,
       owner||'.'||name||lpad(' ', 29 - (length(owner) + length(name)))||'('||type||')' name,
       null extra,
       0 iscursor
  from v$db_object_cache v, threshold_size
 where sharable_mem > threshold_size.kb * 1024
 order by 1 desc;

Referring to the Oracle documentation of procedure dbms_shared_pool.purge, it gives certain hints about heap 0 and heap 6 in the Library cache in the specification of its heaps parameter:

DBMS_SHARED_POOL.PURGE (
name VARCHAR2,
flag CHAR DEFAULT ’P’,
heaps NUMBER DEFAULT 1);

Heaps to be purged. For example, if heap 0 and heap 6 are to be purged:

1<<0 | 1<<6 => hex 0x41 => decimal 65, so specify heaps =>65.

Default value is 1, that is, heap 0 which means the whole object would be purged

It seems that the existence of heap 6 (SQLA) depends on heap 0 (KGLH0) (see the next discussion on the library_cache dump). And the default value 1 cleans at least both heap 0 and heap 6, which means the effective value of the heaps parameter is never less than 65.
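For example, to purge a single cursor including heap 0 and heap 6 (a sketch following the quoted documentation; <address> and <hash_value> are placeholders to be taken from v$sqlarea, and flag 'C' denotes a cursor):

exec sys.dbms_shared_pool.purge('<address>,<hash_value>', 'C', 65);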

Continuing with our experiment in the Test DB, a library_cache dump ("level 16") confirms that there is always one KGLH0 for each cursor (parent and child), but occasionally no SQLA, hence more KGLH0 subheaps than SQLA subheaps.

The library_cache dump also shows certain information about memory usage, for example, for KGLH0 and SQLA below:

--------------------- <library_cache> ---------------------


-- alter session set max_dump_file_size = unlimited;
-- alter session set tracefile_identifier = ’ksun_library_cache_1’;
-- alter session set events ’immediate trace name library_cache level 16’;

Block: #=’0’ name=KGLH0^580dee70 pins=0 Change=NONE


Heap=70000521f4f8300 Pointer=70000521ad6f918 Extent=70000521ad6f7f8 Flags=I/-/P/A/-/-
FreedLocation=0 Alloc=20.015625 Size=23.859375 LoadTime=28198292040

Block: #=’6’ name=SQLA^580dee70 pins=0 Change=NONE
Heap=7000000a0117ec8 Pointer=7000000a49c0040 Extent=7000000a49bf3e8 Flags=I/-/-/A/-/E
FreedLocation=0 Alloc=23.171875 Size=23.742188 LoadTime=0

-- Block: #=’6’ name=Permanent space^580dee70 pins=0 Change=NONE

To find out what "Alloc" and "Size" denote, make an addr heapdump, for example of Heap=7000000a0117ec8 of the above SQLA^580dee70; the output looks like:

SQL > oradebug setmypid


SQL > oradebug dump heapdump_addr 1 0X7000000a0117ec8

Total heap size = 24312 (Bytes)


Total free space = 504
Permanent space = 80

and then try to correlate them with SQLA in the <library cache> dump; we get the following equations:

FreedLocation=0 Alloc=23.171875 Size=23.742188 LoadTime=0 -- from above <library_cache>

Total heap size = 24312 = SQLA Size = 23.742188*1024


Total free space + Permanent space = 504 + 80 = 584 = 23.742188*1024 - 23.171875*1024 = SQLA Size - SQLA Alloc

So "Size" counts total allocated memory in byte for SQLA, but "Alloc" does not include "Total free
space" and "Permanent space". Since SQLA is allocated in chunk size of 4096 bytes, the ”Total free
space” of 504 byte in heapdump addr is an overhead and is not allocable till its bound SQLA is freed. As
discussed before, since this 504 byte is exposed in v$sgastat as free space, more free space is reported.
Here again, two different points of views on memory usage, summing all SQLA ”Size” gives more, whereas
that of SQLA ”Alloc” gives less.

In the <library cache> dump, "Size" denotes the total allocated memory, whereas "Alloc" represents the really used memory (not including "Total free space" and "Permanent space"). This naming convention seems unintuitive: in a common-sense reading, "Alloc" should be the total, whereas "Size" should be the really used part.

When applying the same calculation to KGLH0, it seems that only "Size" matches, but not "Alloc". This remains to be investigated further.

If we convert the hash value 0x580dee70 from the above dump (copied here again)

Block: #=’6’ name=SQLA^580dee70 pins=0 Change=NONE

to decimal 1477308016 and then use it to check sharable_mem in the two queries below, v$db_object_cache.sharable_mem matches "Size" in the <library cache> dump almost exactly, but v$sql.sharable_mem is smaller. Probably the shared memory used by the (child) cursor is only the (main) part of the library object.
select hash_value, sharable_mem, t.*


from v$db_object_cache t
where hash_value = 1477308016;

select sql_id, sharable_mem, persistent_mem, runtime_mem, typecheck_mem, v.*


from v$sql v -- or v$sqlarea for all child cursors
where hash_value = 1477308016
order by v.sharable_mem desc;

To verify whether all "Size" values for "KGLH0" and "SQLA" are almost a multiple of 4 KB (i.e. both components are allocated in chunks of 4096 bytes), run the query below on the Test DB:

select name, ksmchsiz_4k, count(*) cnt
  from (select substr(ksmchcom, 1, decode((instr(ksmchcom, '^') - 1), -1,
               length(ksmchcom), (instr(ksmchcom, '^') - 1))) name,
               ksmchsiz/4096 ksmchsiz_4k
          from sys.x_ksmsp v)
 where name in ('KGLH0', 'SQLA')
 group by name, ksmchsiz_4k
 order by cnt desc;

NAME KSMCHSIZ_4K CNT


------ ----------- ----------
SQLA 1 18199
KGLH0 1 10686
KGLH0 1.015625 310
SQLA 1.00976563 96
SQLA 1.015625 36
SQLA 1.02929688 24
KGLH0 1.33203125 20
SQLA 1.02539063 18
SQLA 4.04296875 4

The output for SQLA and KGLH0 shows that the majority of KSMCHSIZ_4K values are exactly 1, so both are probably allocated in a chunk size of 4K.

Sometimes, when shared pool memory is under pressure (ORA-04031), we observed certain extreme offending cursors with a noticeable KGLH0 or SQLA memory Size; for example, the following KGLH0^ac7e9a16 consumes about 131,343,512 bytes (Size=128265.148438 KB).

DataBlocks:
Block: #=’0’ name=KGLH0^ac7e9a16 pins=0 Change=NONE
Heap=700014eb0a10b30 Pointer=7000150fedd92e0 Extent=7000150fedd9170 Flags=I/-/P/A/-/-/-
FreedLocation=0 Alloc=128010.093750 Size=128265.148438 LoadTime=16455656254

To handle such excessive shared pool memory usage, Oracle also introduced two hidden parameters to control soft warning and hard error thresholds respectively:

_kgl_large_heap_warning_threshold
maximum heap size before KGL writes warnings to the alert log
default 52428800 (50MB) since 10.2.0.2
write Heap size <heap size K> exceeds notification threshold (51200K) into alert log

_kgl_large_heap_assert_threshold
maximum heap size before KGL raises an internal error
default 524288000 (500MB) since 12.1.0.2
raise ORA-00600: internal error code, arguments: [KGL-heap-size-exceeded], [0x7FF91F844240], [6], [532279608]
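Both thresholds can be adjusted, for example, to get an earlier warning in the alert log (the value below is only an example):

alter system set "_kgl_large_heap_warning_threshold" = 8388608 scope = spfile;  -- warn already at 8 MB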

5.1.9 SGA Auto Resizing

Oracle 11.2.0.1 introduced a new hidden parameter:

Name: _memory_imm_mode_without_autosga
Description: Allow immediate mode without sga/memory target
Default value: True Oracle 11.2.0.1

It takes effect when using Automatic Shared Memory Management (ASMM) or Automatic Memory Management (AMM), and allows memory to be moved automatically among components (buffer cache, shared pool) in the SGA, for example, buffer cache: SHRINK and shared pool: GROW, since both components are configured with v$sgainfo.resizeable = 'Yes'. Memory allocations are thus regulated according to requests, and the occurrences of ORA-04031 are subsequently reduced. The amount of memory moved by each resize is given by v$sgainfo.bytes where name = 'Granule Size'. The resizing activities are recorded in v$memory_resize_ops (dba_hist_memory_resize_ops). The side effect of this dynamic resizing is an occasionally heavy wait event library cache lock on hot objects, and eventually the event SGA: allocation forcing component growth. Set the parameter to false to disable this feature, with the consequence that ORA-4031 errors could be raised.
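The recent resizing activities can be checked, for example, by:

select component, oper_type, initial_size, final_size, status, start_time
  from v$memory_resize_ops
 order by start_time desc;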

5.2 PGA Memory

While the SGA is managed by Oracle and applications have little control over it, the PGA is mainly application relevant, and ORA-04030 is directly caused by user programs. Hence we should have a tool to locate and measure application PGA usage, and subsequently to reduce it.

We first build a utility to watch the PGA usage of our own session based on the Oracle-provided:

dbms_session.get_package_memory_utilization,

and discuss its limitations. Then we build a second one of more general usage. Finally we look at the high PGA usage generated by Oracle collections.

5.2.1 ORA-04030 incident file

ORA-04030 is thrown when a PGA memory allocation is over a certain limit: by default 16 GB in Oracle 11.2.0.3, and 32 GB as of 11.2.0.4.

Quite often the generated ORA-04030 incident file contains following text:

Dump of Real-Free Memory Allocator Heap [0x1108c1090]


mag=0xfefe0001 flg=0x5000003 fds=0x0 blksz=65536
blkdstbl=0x1108c10a0, iniblk=252928 maxblk=262144 numsegs=255
In-use num=252073 siz=3639541760, Freeable num=0 siz=0, Free num=0 siz=0

The 16GB limit seems derived from:

262144(maxblk) * 65536(blksz) = 16GB

In the above example, the real allocated memory is:

252073 (In-use num) * 65536(blksz) = 16’519’856’128 Bytes

but only siz=3,639,541,760 is reported. This is probably due to a 32-bit integer overflow (maximum 4 GB, as observed in Blog [34]).

Adding the 3*4 GB of overflow, the effectively allocated memory should be:

3*(4*1024*1024*1024) + 3639541760 = 16,524,443,648 (16GB)

16 GB is an upper limit. Sometimes a session throws ORA-04030 with only 11 GB of memory; in such a case, memory is probably capped at the UNIX layer or by some virtual address space limitation.

As in the library cache dump discussed in the previous SGA section 5.1.8, the siz above denotes the total allocated memory (like Size in the library dump), and In-use the really used memory (like Alloc).

Oracle also provides a special event, and the new parameter pga_aggregate_limit (12c), to limit PGA usage. For example, in pfile/spfile, the event below:

event = 10261 trace name context forever, level 3145728

enforces a 3.2 GB limit on the PGA size, and throws an ORA-600 [723] error instead of ORA-4030.
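The level of event 10261 is apparently interpreted in KB, which is consistent with the 3.2 GB figure above:

3145728 KB * 1024 = 3,221,225,472 bytes ~ 3.2 GB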

5.2.2 View of dbms session.get package memory utilization

Oracle 11g Release 2 extends dbms session package by introducing a new procedure:

dbms_session.get_package_memory_utilization,

to expose memory usage of instantiated packages so that the memory consumption of program units is
revealed and the analysis of ORA-04030 becomes straightforward.

The output parameters are 5 PL/SQL associative arrays, but nowadays Oracle applications are used to v$-like dynamic performance views. Here is a Plsql implementation that turns them into a convenient view via a pipelined function.

create or replace package sess_mem_usage as


type t_rec is record (
owner varchar2(4000)
,unit varchar2(4000)
,type varchar2(40)
,used number
,free number
);
type t_rec_tab is table of t_rec;

function get return t_rec_tab pipelined;


end sess_mem_usage;
/

create or replace package body sess_mem_usage as


function map_type2name(p_type integer)
return varchar2
as
l_v varchar2(20);
begin
l_v := case p_type when 7 then '(procedure)'
                   when 8 then '(function)'
                   when 9 then '(package)'
                   when 11 then '(package body)'
                   when 12 then '(trigger)'
                   when 13 then '(type)'
                   when 14 then '(type body)'
                   else ''
       end;
return rpad(to_char(p_type), 3) || l_v;
end map_type2name;

-- since Oracle 11.2.0.4.0
function get return t_rec_tab pipelined
is
l_desired_info dbms_session.integer_array;
l_owner_array dbms_session.lname_array;
l_unit_array dbms_session.lname_array;
l_type_array dbms_session.integer_array;
l_amounts dbms_session.big_integer_matrix;
l_rec t_rec;
begin
l_desired_info(1) := dbms_session.used_memory;
l_desired_info(2) := dbms_session.free_memory;
dbms_session.get_package_memory_utilization(l_desired_info, l_owner_array, l_unit_array, l_type_array, l_amounts);
for i in 1 .. l_owner_array.count loop
l_rec.owner := l_owner_array(i);
l_rec.unit := l_unit_array (i);
l_rec.type := map_type2name(l_type_array(i));
l_rec.used := l_amounts(1)(i);
l_rec.free := l_amounts(2)(i);
pipe row(l_rec);
end loop;
return;
end get;
end sess_mem_usage;
/

create or replace force view v$ora_sess_mem_usage as


select * from table(sess_mem_usage.get);

then we can access them by:

select * from v$ora_sess_mem_usage order by used desc;

With the newly created view, "order by" and "where" clauses can be conveniently applied, instead of calling get_package_memory_utilization and interpreting the result manually.

5.2.3 dbms session.get package memory utilization limitations

Oracle Documentation (text extracted from dbms session package) wrote:

These procedures describe static package memory usage. The output collections describe memory usage
in each instantiated package.

Probably "static package memory usage" stands for the memory usage of package-declared variables.

The memory usage of variables declared in a Plsql package spec or body is exposed, but not the memory usage of locally declared variables within functions, procedures, or anonymous blocks. That means only the memory usage of stateful variables (declared in package spec or body) is tracked.

We can demonstrate this limitation with following code:

create or replace procedure proc_mem_test(p_cnt number) as


type t_rec is record (id number, text varchar2(1000));
type t_rec_tab is table of t_rec index by pls_integer;
local_rec_tab t_rec_tab;
begin
select level id, rpad('ABC', 1000, 'X') text
bulk collect into local_rec_tab
from dual connect by level <= p_cnt;

-- list to procedure PGA memory

for c in (select sum(used) used, sum(free) free from v$ora_sess_mem_usage)
loop
dbms_output.put_line('procedure PGA used by view = '||c.used);
end loop;

for c in (
select sum(used) used from v$process_memory
where pid = (select pid from v$process
where addr = (select paddr from v$session where sid=sys.dbms_support.mysid)))
loop
dbms_output.put_line('procedure PGA used by process = '||c.used);
end loop;
end;
/

create or replace package pkg_mem_test is


type t_rec is record (id number, text varchar2(1000));
type t_rec_tab is table of t_rec index by pls_integer;
spec_rec_tab t_rec_tab;

procedure run(p_cnt number);


end;
/

create or replace package body pkg_mem_test is


body_rec_tab t_rec_tab;

procedure run(p_cnt number) as


local_rec_tab t_rec_tab;
begin
select level id, rpad('ABC', 1000, 'X') text
bulk collect into spec_rec_tab
from dual connect by level <= p_cnt;

select level id, rpad('ABC', 1000, 'X') text


bulk collect into body_rec_tab
from dual connect by level <= p_cnt;

select level id, rpad('ABC', 1000, 'X') text


bulk collect into local_rec_tab
from dual connect by level <= p_cnt;

proc_mem_test(p_cnt);

-- list to package PGA memory


for c in (select sum(used) used, sum(free) free from v$ora_sess_mem_usage)
loop
dbms_output.put_line('package PGA used by view = '||c.used);
end loop;

for c in (
select sum(used) used from v$process_memory
where pid = (select pid from v$process
where addr = (select paddr from v$session where sid=sys.dbms_support.mysid)))
loop
dbms_output.put_line('package PGA used by process = '||c.used);
end loop;
end;
end;
/

Run test by:

exec pkg_mem_test.run(1000*1000);

The output looks like:

procedure PGA used by view = 2471030440


procedure PGA used by process = 5079798504

package PGA used by view = 2471031752


package PGA used by process = 3811967240

The output compares the memory usage reported by the newly created view and by v$process_memory. It shows that the PGA memory usage reported by the view is less than that of v$process_memory, since the view only reports PGA memory used by variables declared in a package spec or body, which is about 2.5 GB in the above test. But the real PGA allocation also includes the local variable local_rec_tab defined in the standalone procedure proc_mem_test, and the local variable local_rec_tab declared in package procedure pkg_mem_test.run; together they occupy another 2.5 GB. So the real total PGA is about 5 GB. Once procedure proc_mem_test exits, its local allocation is released, and the real total PGA drops to 3.8 GB.

The second restriction is that it is bound to the caller's own session. Moreover, it is not able to disclose heap allocation details (see the next discussion). So we need to build a second tool to overcome these limitations.

5.2.4 Populate Process Memory Detail

v$process_memory_detail (since Oracle 10.2) lists PGA memory usage by category, heap name, and component name for each Oracle process, summarized in v$process_memory by category. They expose memory usage in the dimension of heap components, which can serve to relate memory usage to program code.

Here is the code to build the second view, based on v$process_memory_detail sampling.

drop table process_memory_detail_v;

create table process_memory_detail_v as


select 123 run, rpad('A', 40, 'X') run_name, timestamp'1998-02-17 11:22:00' timestamp
,234 session_id, 345 session_serial#, v.*
from v$process_memory_detail v where 1=2;

-- sampling for p_dur seconds in p_interval seconds interval.


create or replace procedure pga_sampling(p_sid number, p_dur number := 120, p_interval number := 1) as
l_start_time number := dbms_utility.get_time;
l_sample_time number;
l_sleep_time number;
l_pid number;
l_sid number;
l_serial# number;
l_run number := 0;
l_pga_status varchar2(10) := 'NOT';
l_run_name varchar2(40) := 'PGA Sampling('||p_sid||', '||p_dur||', '||p_interval||')';
begin
select pid, s.sid, s.serial#
into l_pid, l_sid, l_serial#
from v$process p, v$session s
where p.addr = s.paddr and s.sid = p_sid and rownum = 1;

l_sample_time := dbms_utility.get_time;
while ((l_sample_time - l_start_time)/100 < p_dur) loop
execute immediate q'[alter session set events 'immediate trace name PGA_DETAIL_GET level ]'||l_pid||q'[']';
-- wait until status = COMPLETE (meanwhile it can be SCANNING or ENABLED), or until elapsed time exceeds duration
while (true) loop
select status into l_pga_status from v$process_memory_detail_prog where pid = l_pid;
exit when l_pga_status = 'COMPLETE' or ((dbms_utility.get_time - l_start_time)/100 > p_dur);
dbms_lock.sleep(0.1);
end loop;
delete from process_memory_detail_v
where pid = l_pid and session_id = l_sid and session_serial# = l_serial# and run = l_run;
insert into process_memory_detail_v select l_run, l_run_name, systimestamp, l_sid, l_serial#, v.*
from v$process_memory_detail v where pid = l_pid;
-- left commented out to keep the last PGA_DETAIL
-- execute immediate q'[alter session set events 'immediate trace name PGA_DETAIL_CANCEL level ]'||l_pid||q'[']';
commit;
l_run := l_run + 1;
l_sleep_time := p_interval - (dbms_utility.get_time - l_sample_time)/100;
if l_sleep_time > 0 then
dbms_lock.sleep(l_sleep_time);

end if;
l_sample_time := dbms_utility.get_time;
end loop;
end;
/

For example, open two Sqlplus sessions; in the first session (sid 234), start the sampling procedure to collect the PGA memory usage of the second session:

SQL (234) > exec pga_sampling(p_sid => 789);

then repeat the above allocation test in the second session (sid 789):

SQL (789) > exec pkg_mem_test.run(1000*1000);

Then we can query the collected data and display the memory usage per timestamp:

select v.*, mb - lag(mb) over(order by run) mb_delta


from(
select run, timestamp, session_id, session_serial#, pid
,round(sum(bytes)/1024/1024) mb
,sum(allocation_count) allocation_count
from process_memory_detail_v
group by run, timestamp, session_id, session_serial#, pid) v
order by run;

RUN TIMESTAMP SESSION_ID MB ALLOCATION_COUNT MB_DELTA


--- --------- ---------- -- ---------------- --------
0 11:19:04 789 6 3,397
1 11:19:05 789 20 4,729 14
2 11:19:06 789 715 68,645 695
3 11:19:07 789 1,278 85,803 563

5.2.5 PGA Memory Internals

Looking at the above query output, pick one RUN with peak memory usage, for example RUN 4, and run the query below:

select run, category, name, heap_name, depth, path


,round(sum(bytes/1024)) kb, sum(allocation_count) alloc_count
,heap_descriptor, parent_heap_descriptor, cycle
from (
select v.*, (level-1) depth
,sys_connect_by_path(’(’||category||’ , ’||name||’ , ’||heap_name||’)’, ’ -> ’) path
,connect_by_iscycle as cycle
from process_memory_detail_v v
where lower(name) like ’%recursive addr reg file%’
start with parent_heap_descriptor = ’00’ and run = 4
connect by nocycle prior heap_descriptor = parent_heap_descriptor and prior run = run
)
--where lower(name) like ’%recursive addr reg file%’
group by run, category, name, heap_name, heap_descriptor, parent_heap_descriptor, depth, path, cycle
--having sum(bytes/1024) > 1024
order by run, category, name, heap_name, depth, kb;

Note that in the output rows, the values in all columns except PATH are identical. To fit the page, we display the differing PATH values separately:

RUN CATEGORY NAME                HEAP_NAME     DEPTH PATH  KB ALLOC_COUNT HEAP_DESCRIPTOR  PARENT_HEAP_DESCRIPTOR CYCLE
--- -------- ------------------- ------------- ----- ---- --- ----------- ---------------- ---------------------- -----
  4 PL/SQL   recursive addr reg  koh-kghu sess     2      713          48 00007F99CEAD4028 00007F99D4A098B8           0

PATH
----------------------------------------------------
-> (Other, kghdsx, top uga heap)
-> (Other, kxsFrame16kPage, session heap)
-> (PL/SQL, recursive addr reg file,koh-kghu sess

-> (Other, kghdsx, top uga heap)


-> (Other, kqlpWrntoStr:string, session heap)
-> (PL/SQL, recursive addr reg file, koh-kghu
...

-> (Other, free memory, top uga heap)


-> (Other, kxsFrame16kPage, session heap)
-> (PL/SQL, recursive addr reg file, koh-kghu

-> (Other, free memory, top uga heap)


-> (Other, kqlpWrntoStr:string, session heap)
-> (PL/SQL, recursive addr reg file, koh-

In the above query, we connect heap_descriptor with parent_heap_descriptor to draw a graph of the PGA heap tree structure.

We can observe that only the "Other" category has DEPTH 0 (root) nodes; all other categories are subtrees of "Other". (One exception is category = 'PL/SQL' with name = 'miscellaneous', where both heap_descriptor and parent_heap_descriptor equal '00', causing a cycle.)

Once we noticed certain high PGA memory consumption. Running the above query, it turned out that the main contribution was due to "recursive addr reg file". Further searching Oracle MOS, it is documented as something related to Plsql anonymous blocks (Oracle MOS Bug 9478199: Memory corruption / ORA-600 from Plsql anonymous blocks).

As previously discussed, the Oracle-provided dbms_session.get_package_memory_utilization is hard to inject into existing code, let alone into Oracle background processes. However, populating v$process_memory_detail opens a tiny door to peer into Oracle internals, even for background processes, for example, PMON, DBWx, CJQ0, MMON.
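For example, to sample PMON with the procedure from section 5.2.4, first find its session (a sketch; it assumes v$process.pname is available, as in recent releases), then start the sampling from another session:

select s.sid
  from v$session s, v$process p
 where s.paddr = p.addr and p.pname = 'PMON';

-- then, with the sid found above (placeholder):
exec pga_sampling(p_sid => :pmon_sid, p_dur => 60);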

5.2.6 Plsql Collection Memory Usage and Performance

Plsql collections are used to store sets of elements; they are more prone to ORA-04030 when storing a large number of elements in multidimensional collections (collections of collections).

For example, after running above PGA memory allocation test:

exec pkg_mem_test.run(1000*1000);

we make pga detail dump by:

alter session set events 'immediate trace name pga_detail_dump level 27';

-- 27 is Oracle process number (pid). Output only shows 3 top categories

It reveals corresponding details by categories:

2252216168 bytes,137939 chunks: "pl/sql vc2 " PL/SQL
286497440 bytes, 17549 chunks: "pmucalm coll " PL/SQL
32656 bytes, 2 chunks: "pmuccst: adt/record " PL/SQL

In the above dump,

(1). ”pl/sql vc2” are all involved varchar2 strings.

(2). ”pmucalm coll” looks like all the allocated collections, which represent the branch nodes.

(3). ”pmuccst: adt/record” (ADT: Abstract Data Type) stores all the Plsql records, i.e. leaf nodes.

In Oracle applications, Plsql collections are often the cause of ORA-04030 when storing a large number of elements. Determining the categories helps pinpoint the main memory consumer, for instance:

-. If it is ”pmuccst: adt/record”, then the cause is the number of elements.

-. If it is ”pmucalm coll”, then the cause is the number of collections.

In Plsql, multidimensional collections are modelled by creating a collection whose elements are again collections. We observed that their memory usage and performance depend on the total number of branch nodes, which is determined by the data characteristics and the ordering of the subscript indices (see the sketch after this list).

(a). A one-dimensional collection uses far less memory than a multidimensional collection. Therefore, when possible, concatenate multiple indices into one, for example, convert the 2-dimensional array(i_2)(i_1) into array(i_2_i_1).

(b). For a two-dimensional array(i_1)(i_2) with i_1 in [1..10] and i_2 in [1..100,000] (1,000,000 elements in total), storing it as array(i_1)(i_2) uses far less memory than array(i_2)(i_1), since array(i_1)(i_2) has fewer branch nodes. Each branch node takes about 5 KB to 13 KB. Simply exchanging the subscripts can make a factor 10 difference in memory usage. (Imagine such an organization where 90% vs. 10% of the employees are appointed as managers.)

(c). The performance is proportional to the memory usage.
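A minimal sketch of point (b) (the variable names are illustrative): both arrays below store the same 1,000,000 numbers, only the subscript order differs, so the first creates 10 inner collections (branch nodes) and the second 100,000.

declare
  type t_inner is table of number index by pls_integer;
  type t_outer is table of t_inner index by pls_integer;
  arr_small t_outer;  -- array(i_1)(i_2): outer subscript i_1 => 10 branch nodes
  arr_big   t_outer;  -- array(i_2)(i_1): outer subscript i_2 => 100,000 branch nodes
begin
  for i_1 in 1 .. 10 loop
    for i_2 in 1 .. 100000 loop
      arr_small(i_1)(i_2) := 1;
      arr_big(i_2)(i_1)   := 1;
    end loop;
  end loop;
  dbms_output.put_line('branch nodes: '||arr_small.count||' vs. '||arr_big.count);
end;
/

The PGA usage of the two variants can then be compared with the pga_sampling tool from section 5.2.4.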

Further discussion can be found in Blog [47].

5.3 Oracle LOB Memory Usage and Leak

Oracle LOBs are conceived to store large objects, hence have more impact on memory consumption. In this section, we watch LOB memory usage and memory leaks.

5.3.1 Temporary LOBs: cache lobs, nocache lobs, abstract lobs

First we use the following test code to demonstrate the different types of temporary LOBs and their space usage (tested in Oracle 12cR1 and 12cR2).

------------------------------ Test Code -------------------------------

create or replace package lob_cache_test_pkg as


g_CACHE_LOBS clob;
g_NOCACHE_LOBS clob;
g_ABSTRACT_LOBS clob;
end;
/

create or replace procedure lob_cache_test_CACHE_LOBS (p_cnt number) as


l_txt varchar2(10) := '0123456789';
begin
for i in 1..p_cnt loop
dbms_lob.createtemporary(lob_loc => lob_cache_test_pkg.g_CACHE_LOBS, cache => true, dur => dbms_lob.call);
dbms_lob.writeappend(lob_loc => lob_cache_test_pkg.g_CACHE_LOBS, amount => 10, buffer => l_txt);
end loop;
end;
/

create or replace procedure lob_cache_test_NOCACHE_LOBS (p_cnt number) as


l_txt varchar2(10) := '0123456789';
begin
for i in 1..p_cnt loop
dbms_lob.createtemporary(lob_loc => lob_cache_test_pkg.g_NOCACHE_LOBS, cache => false, dur => dbms_lob.call);
dbms_lob.writeappend(lob_loc => lob_cache_test_pkg.g_NOCACHE_LOBS, amount => 10, buffer => l_txt);
end loop;
end;
/

-- 12cR1 did not report the space (BLOCKS) usage of ABSTRACT_LOBS in v$tempseg_usage.
-- But 12cR2 reports it, and adds it into CACHE_LOBS.

create or replace procedure lob_cache_test_ABSTRACT_LOBS (p_cnt number) as


l_txt varchar2(10) := '0123456789';
begin
for i in 1..p_cnt loop
dbms_lob.createtemporary(lob_loc => lob_cache_test_pkg.g_ABSTRACT_LOBS, cache => true, dur => dbms_lob.call);
lob_cache_test_pkg.g_ABSTRACT_LOBS := l_txt;
end loop;
end;
/

Run test code below:

--------------------------- Test on 12cR2 ---------------------------

exec dbms_session.reset_package;

select l.*, t.blocks


from v$session s, v$temporary_lobs l, v$tempseg_usage t
where s.sid = l.sid and s.saddr = t.session_addr(+);

exec lob_cache_test_CACHE_LOBS(1122);

select l.*, t.blocks


from v$session s, v$temporary_lobs l, v$tempseg_usage t
where s.sid = l.sid and s.saddr = t.session_addr(+);

exec lob_cache_test_NOCACHE_LOBS(1133);

select l.*, t.blocks


from v$session s, v$temporary_lobs l, v$tempseg_usage t
where s.sid = l.sid and s.saddr = t.session_addr(+);

exec lob_cache_test_ABSTRACT_LOBS(1144);

select l.*, t.blocks


from v$session s, v$temporary_lobs l, v$tempseg_usage t
where s.sid = l.sid and s.saddr = t.session_addr(+);

Here the output by above test steps:

----------------- Test Result on 12cR2 ----------------

SID CACHE_LOBS NOCACHE_LOBS ABSTRACT_LOBS BLOCKS
---- ---------- ------------ ------------- ----------
738 0 0 0 0

SQL > exec lob_cache_test_CACHE_LOBS(1122);

SID CACHE_LOBS NOCACHE_LOBS ABSTRACT_LOBS BLOCKS


---- ---------- ------------ ------------- ----------
738 1122 0 0 1280

SQL > exec lob_cache_test_NOCACHE_LOBS(1133);

SID CACHE_LOBS NOCACHE_LOBS ABSTRACT_LOBS BLOCKS


---- ---------- ------------ ------------- ----------
738 1122 1133 0 2304

SQL > exec lob_cache_test_ABSTRACT_LOBS(1144)

SID CACHE_LOBS NOCACHE_LOBS ABSTRACT_LOBS BLOCKS


---- ---------- ------------ ------------- ----------
738 2266 1133 1144 3584

We can see that each LOB is allocated at least one data block, and that in v$temporary_lobs, abstract_lobs is counted twice: once in itself, and once added into cache_lobs.

If we create a large number (more than 1,000,000) of temporary LOBs and call dbms_session.reset_package to free them, it takes a long time (hours). The call stack looks like:

kdlt freetemp -> kdl destroy -> kdlclose -> memcmp.

However, when using alter system kill session 'sid,serial#', the memory is released immediately.

5.3.2 LOB Memory Leak

5.3.2.1 Un-Released PGA Memory

Setup a test by:

------------------------------ Test Code -------------------------------


create or replace function create_clob(p_clob_len number) return clob as
l_clob clob; -- BLOB has similar behaviour
begin
l_clob := lpad('a', p_clob_len, 'b');
return l_clob;
end;
/

create or replace type t_clob as object(c clob);


/

create or replace type t_clob_tab as table of clob;


/

create or replace procedure print_lob_and_mem as


l_ret varchar2(400);
l_sid number := sys.dbms_support.mysid;
l_mb number := 1024*1024;
begin
select 'PGA_MEM(MB): '||'Used='||round(p.pga_used_mem/l_mb)||' --- TEMP_LOBS: '||
       'CACHE_LOBS='||cache_lobs||', NOCACHE_LOBS='||nocache_lobs||', ABSTRACT_LOBS='||abstract_lobs
into l_ret
from v$process p, v$session s, v$temporary_lobs l
where p.addr=s.paddr and s.sid = l.sid and s.sid = l_sid;

dbms_output.put_line(l_ret);
end;
/

create or replace procedure test_run(p_cnt number, p_clob_len number) as


l_stmt_var_c1 varchar2(100);
l_clob clob;
l_clob_t t_clob := t_clob(null);
l_clob_tab t_clob_tab := t_clob_tab();
begin
l_stmt_var_c1 := 'begin select create_clob('||p_clob_len||') into :c1 from dual; end;';

for i in 1..p_cnt loop


execute immediate l_stmt_var_c1 using out l_clob;
end loop;

print_lob_and_mem;
end;
/

In procedure test_run, we dynamically call function create_clob (execute immediate) in a loop; each run allocates a certain amount of memory (a LOB).

Run the test to allocate 1024 LOBs, each of 16 KB, about 16 MB in total.

SQL > exec print_lob_and_mem;


SQL > exec test_run(1024, 1024*16);

The Output is:

------------------------------ Test Output -------------------------------

PGA_MEM(MB): Used= 7 --- TEMP_LOBS: CACHE_LOBS=0, NOCACHE_LOBS=0, ABSTRACT_LOBS=0


PGA_MEM(MB): Used=16 --- TEMP_LOBS: CACHE_LOBS=924, NOCACHE_LOBS=0, ABSTRACT_LOBS=1024

After the call returns, the PGA memory still remains. The second line, with ABSTRACT_LOBS increasing to 1024, indicates that the leak is located in abstract_lobs. In the above test, we loop over function create_clob, which is a stateless Plsql function; no stateful Plsql package variables are involved (so there is no package state to be kept after each call).

5.3.2.2 LOB Memory Leak Test

Open a new Sql session and run the two code blocks below:

set serveroutput on
prompt -------- 1st Block --------

begin
test_run(1, 2);
test_run(100, 2);
test_run(10000, 2);
end;
/

prompt -------- 2nd Block --------

begin
test_run(1, 2);
test_run(100, 2);
test_run(10000, 2);
end;
/

Here the output:

-------- 1st Block --------

PGA_MEM(MB): Used= 7 --- TEMP_LOBS: CACHE_LOBS=0, NOCACHE_LOBS=0, ABSTRACT_LOBS=1


PGA_MEM(MB): Used= 8 --- TEMP_LOBS: CACHE_LOBS=1, NOCACHE_LOBS=0, ABSTRACT_LOBS=101
PGA_MEM(MB): Used=60 --- TEMP_LOBS: CACHE_LOBS=10001, NOCACHE_LOBS=0, ABSTRACT_LOBS=10101

-------- 2nd Block --------

PGA_MEM(MB): Used=60 --- TEMP_LOBS: CACHE_LOBS=1, NOCACHE_LOBS=0, ABSTRACT_LOBS=1


PGA_MEM(MB): Used=60 --- TEMP_LOBS: CACHE_LOBS=101, NOCACHE_LOBS=0, ABSTRACT_LOBS=101
PGA_MEM(MB): Used=59 --- TEMP_LOBS: CACHE_LOBS=10101, NOCACHE_LOBS=0, ABSTRACT_LOBS=10101

In the first line of the 2nd Block, PGA Used=60 shows that the PGA memory is not released (a leak) after the 1st Block terminated; however, the ABSTRACT_LOBS count is reset. That means that once the call terminates, v$temporary_lobs.abstract_lobs is reset to 0 (no abstract_lobs are exposed any more in this view), yet their allocated PGA memory is still kept.

5.3.2.3 Test till ORA-04030

In order to make sure the reported PGA memory reflects real space usage (not merely a counting error), we can try to allocate more than 32 GB (the 12c default) of PGA; if this throws ORA-04030, it is certainly a real memory leak.

We open 3 Plsql sessions: the first (sid: 555) uses pga_sampling, presented in section 5.2.4, to collect PGA details; the second (sid: 666) runs a query to watch the sampling result; the third (sid: 777) performs our test.

At first, start PGA sampling:

SQL(555) > exec pga_sampling(777, 3600);

Watch the sampling result (only partial results are shown). Below is one output after half an hour (close to the appearance of ORA-04030).

SQL(666) > select * from process_memory_detail_v order by timestamp desc, bytes desc;

CATEGORY NAME HEAP_NAME BYTES ALLOCATION_COUNT


------- --------------- --------------- -------------- ----------------
Other permanent memory kokltcr: creat 31,680,073,912 8,579,890
Other free memory kokltcr: creat 648,951,096 6,239,915
Other free memory session heap 616,868,032 445,136
Other kokltcr: create clob koh dur heap d 299,515,712 1,559,980

Allocate 32 GB PGA:

SQL(777) > exec test_run(1024*1024*2, 1024*16);

BEGIN test_run(1024*1024*2, 1024*16); END;

*
ERROR at line 1:
ORA-04030: out of process memory when trying to allocate 4040 bytes (kokltcr: creat,kghsseg: kolaslCreateCtx)
ORA-06512: at "S.CREATE_CLOB", line 4

The incident file looks like:

ORA-04030: out of process memory when trying to allocate 169040 bytes (pga heap,kgh stack)
ORA-04030: out of process memory when trying to allocate 4040 bytes (kokltcr: creat,kghsseg: kolaslCreateCtx)

========= Dump for incident 22642 (ORA 4030) ========


----- Beginning of Customized Incident Dump(s) -----
=======================================
TOP 10 MEMORY USES FOR THIS PROCESS
---------------------------------------

*** 2017-03-17 22:06:16.101


95% 30 GB, 8600973 chunks: "permanent memory "
kokltcr: creat ds=fffffd77ec09d628 dsprt=fffffd7ffbebb900
2% 620 MB, 6255235 chunks: "free memory "
kokltcr: creat ds=fffffd77ec09d628 dsprt=fffffd7ffbebb900
2% 590 MB, 446319 chunks: "free memory "
session heap ds=fffffd7ffc02d728 dsprt=fffffd7ffc358350
1% 286 MB, 1563814 chunks: "kokltcr: create clob "
koh dur heap d ds=fffffd7ffbebb900 dsprt=fffffd7ffc02d728
0% 62 MB, 781909 chunks: "kolraloc-1 "
kolr heap ds i ds=fffffd7ffc048488 dsprt=fffffd7ffc02d728
0% 61 MB, 3850 chunks: "kolrde_alloc "
koh-kghu sessi ds=fffffd7ffc05edd8 dsprt=fffffd7ffc02d728
0% 48 MB, 781907 chunks: "kolrarfc:lobloc_kolrhte "
kolr heap ds i ds=fffffd7ffc048488 dsprt=fffffd7ffc02d728
0% 27 MB, 195483 chunks: "free memory "
koh dur heap d ds=fffffd7ffbebb900 dsprt=fffffd7ffc02d728
0% 828 KB, 17329 chunks: "free memory "
kolr heap ds i ds=fffffd7ffc048488 dsprt=fffffd7ffc02d728
0% 505 KB, 34 chunks: "permanent memory "
pga heap ds=fffffd7ffc345640 dsprt=0

We can see that "30 GB, 8600973 chunks" are allocated as "permanent memory", which probably explains why the memory is not reclaimable (memory type "permanent"), consequently leading to ORA-04030.

The Oracle9i documentation, Temporary LOB Performance Guidelines in Oracle9i Application Developer's Guide - Large Objects (LOBs) [24], has a note:

Temporary LOBs created using a session locator are not cleaned up automatically at the end of function or
procedure calls. The temporary LOB should be explicitly freed by calling DBMS LOB.FREETEMPORARY().

But in the Oracle 12.2 documentation on Temporary LOB Performance Guidelines, such a note can no longer be found.
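Following the 9i note, the leak in test_run above can be avoided by explicitly freeing the locator after each use; a sketch of the modified loop:

for i in 1..p_cnt loop
  execute immediate l_stmt_var_c1 using out l_clob;
  dbms_lob.freetemporary(l_clob);  -- explicit free, per the 9i guideline
end loop;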

Chapter 6

CPU and Performance Modelling

CPU is about performance in the common sense, or algorithm complexity in computer science. In this chapter, we first take an Oracle collection to expose its internal implementation of a classical "sort" algorithm. Then we try to build a mathematical model of one latch algorithm, evaluated and compared with tests. Finally we turn to the AIX system to look at its advanced CPU accounting model, which can help us plan resource usage and forecast system scalability.

6.1 Performance of Oracle Collection Operators

Oracle collection is used in applications to store large amount of data, hence more prone to performance
problem (see section 5.2.6 about collection memory allocation). It provides a series of operators like: SET,
EQUAL, COLLECT, MULTISET. Applications using collections often hit performance degradation since
small data was tested in developing phase, and big data is faced in production.

In order to explore the internal implementation, we will create a user-defined object type (UDT), inject a counter into its map member function, and then run tests with collections of different sizes. Finally we collect statistics to investigate the performance.

Oracle documentation wrote about SET :

SET Converts a nested table into a set by eliminating duplicates. The function returns a nested
table whose elements are distinct from one another.
(Note: SET function requires map method, order method does not work).

In this section, we pick SET for our discussion, but other operators can be investigated in the same way (see Blogs [44], [45]).

6.1.1 Test Setup

In the following code, we implement a map member function in an Oracle Object type (similar to Java's Comparator/Comparable interfaces) to record the number of calls in a stateful helper package.

create or replace package helper as
  cmp_cnt number := 0;
end helper;
/

drop type test_obj_tab force;

drop type test_obj force;

create or replace type test_obj as object (
  num number,
  map member function comp return integer);
/

create or replace type body test_obj as
  map member function comp return integer is
  begin
    helper.cmp_cnt := helper.cmp_cnt + 1;
    return num;
  end;
end;
/

create or replace type test_obj_tab as table of test_obj;
/

create or replace procedure set_test (p_size number) as
  l_test_obj_tab test_obj_tab := test_obj_tab();
  l_start_time   number;
  l_elapsed      number;
begin
  select cast(collect(test_obj(level)) as test_obj_tab)
    into l_test_obj_tab
    from dual connect by level <= p_size;

  helper.cmp_cnt := 0;
  l_start_time   := dbms_utility.get_time;
  l_test_obj_tab := set(l_test_obj_tab);
  l_elapsed      := dbms_utility.get_time - l_start_time;

  dbms_output.put_line('SET_Size='||p_size||', Compare_Counter='||helper.cmp_cnt||', Elapsed='||l_elapsed);
end;
/

6.1.2 SET Operator Complexity

Run a few tests with different set sizes and look at their output:

SQL> exec set_test(10);
SET_Size=10, Compare_Counter=90, Elapsed=0

SQL> exec set_test(100);
SET_Size=100, Compare_Counter=9,900, Elapsed=0

SQL> exec set_test(1000);
SET_Size=1,000, Compare_Counter=999,000, Elapsed=49

SQL> exec set_test(10000);
SET_Size=10,000, Compare_Counter=99,990,000, Elapsed=4,897

The above test result shows:

Compare_Counter = SET_Size * (SET_Size - 1)

which means a complexity of O(n^2).

Looking at the last two tests, the set size, compare operations and elapsed time can be approximately related as:

10,000^2 / 1,000^2 = 99,990,000 / 999,000 ≈ 4,897 / 49 ≈ (10,000 / 1,000)^2 = 100

If the set size is increased to 1 million, it cannot finish within 5 days.
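(Extrapolating the measured O(n^2) cost: Elapsed ≈ 4,897 * (1,000,000/10,000)^2 = 4,897 * 10^4 hundredths of a second — dbms_utility.get_time ticks — i.e. about 489,700 seconds, or roughly 5.7 days.)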

It seems that Oracle internally implemented an O(n^2) sort algorithm. Maybe Heapsort [65] is more suitable for large collections: the more the data, the more the regularity, hence Heapsort can benefit from the characteristic of large data, which is often partially sorted in real applications.

6.2 Row Cache Performance and CPU Modelling

In this section, we will study the performance of Oracle Row Cache (Dictionary Cache, or DC) Gets and make various tests. Based on the test results, we first try to compute the pure CPU performance as an M/D/1 Queue, and then try to build a model to understand the performance behaviour. Both attempts are approximate, not precise, and can be inadequate. They are put here only to explore the feasibility of Oracle latch modelling.

Nowadays Oracle is widely used together with object-oriented languages like Java or C#, in which applications communicate with the DB persistence layer in Oracle Object Types. It is also relevant to Oracle programs using Object Types, such as the dbms_aq.dequeue parameter payload defined as an ADT (Blog [38]), and the XML Pull Parser in Java (Blog [40]).

First we create Plsql test functions with Oracle Object Types as parameters, and then trace the function calls with the 10222 trace event for Row Cache Gets (and "latch: row cache objects"). All tests are extracted from real applications, in which performance is heavily affected by "latch: row cache objects" contentions.

Note 1: tests are done in Oracle 12cR1 (12.1.0.2).

Note 2: the 12cR1 "latch: row cache objects" is replaced by "row cache mutex" in 12cR2 [53]. For further discussion, see Blog [56].

6.2.1 Plsql Object Types Function

In the following test code, we define a Plsql function foo. Its return value and its 3 parameters are all object types. The 3 parameters are declared as IN, OUT, IN OUT respectively, so that we can cover all 3 possible Plsql parameter modes.

--=========== object types ===========--

create or replace type t_obj_ret force as object (id number, name varchar2(30));
/

create or replace type t_obj_in force as object (id number, name varchar2(30));
/

create or replace type t_obj_out force as object (id number, name varchar2(30));
/

create or replace type t_obj_inout force as object (id number, name varchar2(30));
/

--=========== function foo with parameters (IN, OUT, IN OUT)===========--

create or replace function foo (
  p_in    in     t_obj_in := null
 ,p_out   out    t_obj_out
 ,p_inout in out t_obj_inout) return t_obj_ret
as
  l_ret t_obj_ret;
begin
  -- l_ret.id returns 1122+112=1234
  if p_in is null then
    l_ret := t_obj_ret(9876, 'p_in is NULL');
  else
    l_ret := t_obj_ret(p_in.id + 112, p_in.name);
  end if;
  p_out      := t_obj_out(p_inout.id, p_inout.name);
  p_inout.id := l_ret.id;
  return l_ret;
end;
/

--=========== Plsql Dynamic Call ===========--

create or replace procedure foo_proc (p_cnt number) as
  l_stmt  varchar2(100);
  l_ret   t_obj_ret;
  l_in    t_obj_in;
  l_out   t_obj_out;
  l_inout t_obj_inout;
begin
  l_in    := t_obj_in(1122, 'T_OBJ_IN');
  l_inout := t_obj_inout(1144, 'T_OBJ_INOUT');

  l_stmt := q'[begin :1 := foo(:2, :3, :4); end;]';

  for i in 1..p_cnt loop
    execute immediate l_stmt
      using out l_ret, in l_in, out l_out, in out l_inout;
  end loop;
  dbms_output.put_line('l_ret.id=' || l_ret.id);
end;
/

6.2.2 Plsql Dynamic Call and 10222 Trace

Run the test by calling function foo 100 times with execute immediate in foo_proc, and at the same time trace it with 10222.

alter session set tracefile_identifier = 'row_cache_10222_trc_1';
alter session set events '10222 trace name context forever, level 4294967295';
exec foo_proc(100);
alter session set events '10222 trace name context off';

Pick one object, for example T_OBJ_OUT, and look at the generated 10222 trace file below. We can see that it is accessed 3 times in 3 different lines starting with kqrReadFromDB. The 3 lines are marked by 3 different cid numbers: first cid=17 (dc_global_oids), second cid=11 (dc_object_ids) and third cid=7 (dc_users), in the sequence 17, 11, 7. The same CIDs are also exposed in v$rowcache.cache#, so we can also query this view to monitor rowcache usage statistics (see Blog [55]).
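For instance, a quick sketch of such a monitoring query, using only documented v$rowcache columns:

select cache#, parameter, gets, getmisses
  from v$rowcache
 where cache# in (17, 11, 7)
 order by cache#;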

In the 10222 trace file, all lines like <-- xx --> are comments we added; they are used in the later modelling discussion.

--================ Start T_OBJ_OUT Get ================--

kqrfrpo : freed to fixed free list po=175d2f708 time=1459184662

<--cid = 17, Start time=1459184662-->
kqrpad: new po 17727bac0 from kqrpre1.1

kqrReadFromDB : kqrpre1.1 po=17727bac0 flg=8000 cid=17 eq=1753067e0 idx=0 dsflg=0


kqrpre1 : done po=17727bac0 cid=17 flg=2 hash=3c4bb9d0 0 eq=1753067e0 SQL=begin :1 := foo(:2, :3, :4); end; time=1459184804

<--cid = 17, End time=1459184917, Elapsed = 255 (1459184917-1459184662)-->


kqrfrpo : freed to fixed free list po=17727bac0 time=1459184917

<--cid = 11, Start time=1459184917, including cid = 7-->

kqrpad: new po 175d2f708 from kqrpre1.1

kqrReadFromDB : kqrpre1.3 po=175d2f708 flg=8000 cid=11 eq=1753067e0 idx=0 dsflg=0


kqrpre1 : done po=175d2f708 cid=11 flg=2 hash=94b841cf a9461655 eq=1753067e0
obobn=2360170 obname=T_OBJ_OUT obtyp=13 obsta=1 obflg=0 SQL=begin :1 := foo(:2, :3, :4); end; time=1459185132

<--cid = 7, Start time=1459185132-->

kqrpad: new po 170d94488 from kqrpre1.1

kqrReadFromDB : kqrpre1.3 po=170d94488 flg=8000 cid=7 eq=176e410b8 idx=0 dsflg=0


kqrpre1 : done po=170d94488 cid=7 flg=2 hash=de7751cd 395edb55 eq=176e410b8
SQL=begin :1 := foo(:2, :3, :4); end; time=1459185415

<--cid = 7, End time=1459185542, Elapsed = 410 (1459185542-1459185132)-->


kqrfrpo : freed to heap po=170d94488 time=1459185542

kqrmupin : kqrpspr2 Unpin po 175d2f708 cid=11 flg=2 hash=a9461655 time=1459185556

<--cid = 11, End time=1459185651, Elapsed = 734 (1459185651-1459184917), Pure = 324 (734-410)-->
kqrfrpo : freed to fixed free list po=175d2f708 time=1459185651

--================ End T_OBJ_OUT Get ================--

In each foo call, the 3 sequential Row Cache Gets for each object type are:

dc_global_oids  CID = 17 GET
dc_object_ids   CID = 11 GET
dc_users        CID = 7  GET

T_OBJ_INOUT is special: it requires 2 times the above Gets. So in total, we have 5 times the above 3 Row Cache Gets per foo call, which means 5*3 = 15 Row Cache Gets for every foo call.

As shown in the book Oracle Core [15, p. 167], each Row Cache Object Get triggers 3 consecutive "latch: row cache objects" Gets, which have the following respective "Where" latch locations (visible in AWR reports):

kqrpre: find obj
kqreqd
kqreqd: reget

So in total 15*3 = 45 "latch: row cache objects" Gets for each foo call. The above test invoked foo 100 times, that is 100*15 = 1500 Row Cache Gets, or 1500*3 = 4500 "latch: row cache objects" Gets.

Surprisingly, if we call foo from Java, besides T_OBJ_INOUT, the other two objects T_OBJ_RET and T_OBJ_OUT also require 2 times the Row Cache Gets. Only T_OBJ_IN still requires 1 Row Cache Get for each cid in every foo call, the same as in Plsql. So in total 7*3 = 21 Row Cache Gets in Java, instead of 15 in Plsql. For 100 foo calls from Java, it requires 100*21 = 2100 Row Cache Gets, or 2100*3 = 6300 "latch: row cache objects" Gets. So a Java call requires about 40% more latch gets than Plsql. It means that in real applications, Java performance saturates (or suffers) earlier than Plsql with the same number of calls. To demonstrate the difference, Blog [39] and [56] made comparative tests of Java vs. Plsql. In fact, such performance degradation was experienced in a production system when migrating Plsql implementations to Java.

6.2.3 Test and Analysis

We perform the test by calling foo_proc with 9 different parallel degrees (varying from 1 to 48), each of which runs for a duration of 10 minutes (see Test Code in Blog [56]). Table 6.1 shows the test result.
SESSIONS EXECUTIONS CPU_TIME_S CONC_TIME_S EL_PER_EXEC CPU_PER_EXEC CONC_PER_EXEC TP_PER_SESSION
-------- ---------- ---------- ----------- ----------- ------------ ------------- --------------
Solaris
1 8,096,318 413 0 51 51 0 8,096,318
3 21,850,342 1,212 45 57 55 2 7,283,447
6 32,609,592 1,923 520 75 59 16 5,434,932
12 42,217,169 2,555 1,693 125 61 40 3,518,097
18 42,890,535 2,615 3,548 193 61 83 2,382,808
24 42,955,245 2,631 5,667 265 61 132 1,789,802
36 43,117,305 2,633 9,372 406 61 217 1,197,703
42 42,523,671 2,616 11,551 480 62 272 1,012,468
48 40,546,142 2,506 13,915 561 62 343 844,711

Linux
1 13,992,932 413 0 30 30 0 13,992,932
3 32,715,487 1,216 44 39 37 1 10,905,162
6 52,665,062 2,222 316 48 42 6 8,777,510
12 49,743,420 2,146 2,321 98 43 47 4,145,285
18 50,448,264 2,249 4,525 149 45 90 2,802,681
24 50,836,921 2,340 6,895 201 46 136 2,118,205
36 51,864,133 2,458 12,002 307 47 231 1,440,670
42 48,914,411 2,495 14,942 390 51 305 1,164,629
48 48,549,375 2,535 17,989 460 52 371 1,011,445

AIX
1 6,535,357 251 0 65 38 0 6,535,357
3 17,355,458 714 75 74 41 4 5,785,153
6 21,327,093 1,069 664 117 50 31 3,554,516
12 24,797,761 1,353 1,814 198 55 73 2,066,480
18 25,037,784 1,740 2,992 297 69 119 1,390,988
24 20,691,357 2,597 2,537 542 126 123 862,140
36 19,048,665 2,945 7,872 963 155 413 529,130
42 19,199,984 2,872 11,683 1,132 150 608 457,142
48 19,214,011 2,815 15,315 1,306 147 797 400,292

Table 6.1: Row Cache Performance Test

It includes the following stats:

(1). the number of sessions (SESSIONS)

(2). total number of executions (EXECUTIONS)

(3). cpu time in seconds (CPU_TIME_S)

(4). concurrency wait time in seconds (CONC_TIME_S)

(5). elapsed microseconds per execution (EL_PER_EXEC or US_PER_EXEC)

(6). cpu microseconds per execution (CPU_PER_EXEC)

(7). concurrency time per execution (CONC_PER_EXEC)

(8). throughput per session (TP_PER_SESSION)

The data for executions, elapsed time, cpu time, concurrency wait time are selected from v$sqlarea.

All tests are done in Oracle 12.1.0.2.0 on Solaris, Linux, AIX (SMT 4, LCPU=24) with 6 physical
processors. Linux and AIX data are added here for comparison.

The above result shows:

(1). Total throughput (EXECUTIONS) climbs almost linearly from 1 to 3 SESSIONS; from 6 SESSIONS on it tends to be flat or descending, probably braked by the contentions on the 3 CIDs (17, 11, 7).

(2). Max throughput is achieved with around 9 parallel sessions, probably because of the 3 CIDs (17, 11, 7), each of which has 3 latch locations.

(3). The performance is saturated with more than 12 parallel sessions. The response time per execution (US_PER_EXEC) increases, probably due to the latch contentions (sessions spend more time on latch Gets instead of real work).

(4). Elapsed time (EL_PER_EXEC) is made of two parts: CPU time and concurrency wait time. CPU (CPU_PER_EXEC) is relatively stable (particularly on Solaris), but concurrency (CONC_PER_EXEC) increases almost linearly with SESSIONS.

(5). On AIX, CPU_PER_EXEC is not the same as EL_PER_EXEC for SESSIONS=1 (even though there is no concurrency, i.e. CONC_PER_EXEC = 0) because of AIX PURR CPU accounting and vpm_throughput_mode (see the AIX CPU discussion in section 6.3).

If we draw a graph of Sessions and Executions for Solaris as shown in Figure 6.1, it starts linearly until SESSIONS=9, and then reaches its peak value at SESSIONS=36.

Drawing another graph of CPU time vs. Concurrency time against SESSIONS for Solaris (Figure 6.2), it shows that CPU is almost constant (between 51 and 62), but concurrency time increases (almost) linearly with SESSIONS, which alone makes the elapsed time increase linearly.

During the test, we also collect the data for "latch: row cache objects" stats (Dictionary Cache stats from v$rowcache are collected as well). Each Dictionary Cache Get requires 3 latch Gets (see the book Oracle Core [15, p. 167], Blog [56], also discussed in section 3.2.1).
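A sketch of such a collection query (the exact test script is in Blog [56]); v$latch_children exposes the per-child statistics, with wait_time in microseconds:

select child#, gets, misses, sleeps, spin_gets,
       round(wait_time / 1e6) wait_time_s
  from v$latch_children
 where name = 'row cache objects'
 order by child#;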

Table 6.2 contains the data for parallel sessions 12 and 42, which shows:

(1). The numbers of gets (effective work) for 12 and 42 are similar, but the wait time differs by a factor of 10. So system throughput stays the same when increasing parallel sessions from 12 to 42.

(2). misses, sleeps, spin_gets, wait_time increase in the order of dc_users, dc_object_grants, dc_global_oids. So dc_global_oids has the most delay.

SESSIONS CHILD# RC_PARAMETER GETS MISSES SLEEPS SPIN_GETS WAIT_TIME_S
-------- ------ ---------------- ----------- ---------- ------ ---------- -----------
12 8 dc_users 633,984,132 19,649,121 52,958 19,596,500 262
12 9 dc_object_grants 633,984,845 54,784,381 58,712 54,726,007 292
12 10 dc_global_oids 633,982,830 52,451,846 68,813 52,383,487 346

42 8 dc_users 638,662,605 22,653,084 56,234 22,597,401 2,274
42 9 dc_object_grants 638,674,026 56,604,969 65,236 56,540,334 3,006
42 10 dc_global_oids 638,657,482 64,387,264 76,460 64,311,564 3,905

Table 6.2: Row Cache Latch Gets

Figure 6.1: Sessions and Executions

Figure 6.2: CPU vs Concurrency

6.2.4 M/D/1 Queue

Looking at CPU_PER_EXEC in Table 6.1 for Solaris, it is almost constant for different numbers of SESSIONS. If we consider the entire 3 CIDs (17, 11, 7) as a single server, with deterministic service time, and a sufficiently large buffer size (at least 48), we can try to roughly model the CPU time of Row Cache Gets as an M/D/1 Queue [61]:

arrival rate: λ = 1 / interarrival time

service rate: µ = 1 / service time

utilization: ρ = λ / µ

The average number of entities in the system, L (SESSIONS) is given by:

L = ρ + ρ^2 / (2*(1-ρ))

The average waiting time in the system, ω is given by:

ω = 1/µ + ρ / (2*µ*(1-ρ))

Since the test is performed with different numbers of SESSIONS, this means that, given L, the utilization ρ is a function of L:


ρ = (1 + L) - sqrt(1 + L^2)

interarrival time can be expressed as a function of the collected service time (CPU_PER_EXEC):

interarrival time = 1/λ = 1/(µ*ρ) = service time / ρ

The calculated result for Solaris is shown in Table 6.3, where:

CPU_PER_EXEC is the service time
SYS_WAITING_TIME is ω
AVG_WAITING_TIME_PER_SESSION is ω/SESSIONS

We can see that the M/D/1 server utilization is more than 97% when SESSIONS reaches 18, and the interarrival time is limited by CPU_PER_EXEC = 62. Increasing SESSIONS will not serve more requests for Row Cache Gets.

----------------------- Solaris ----------------------

SESSIONS UTILIZATION% CPU_PER_EXEC INTERARRIVAL_TIME SYS_WAITING_TIME AVG_WAITING_TIME_PER_SESSION
-------- ------------ ------------ ----------------- ---------------- ----------------------------
1 58.58 51 87.08 63.28 63.28
3 83.77 55 66.21 172.08 57.36
6 91.72 59 64.29 357.19 59.53
12 95.84 61 63.15 728.45 60.70
18 97.22 61 62.71 1097.60 60.98
24 97.92 61 62.55 1473.37 61.39
36 98.61 61 61.93 2197.63 61.05
42 98.81 62 62.26 2585.83 61.57
48 98.96 62 62.46 2972.46 61.93

-- calculation query
select * from (
select level sessions,
round((1 + level) - sqrt(1 + level*level), 4) utilization
from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);

Table 6.3: M/D/1 Queue CPU Utilization


The above estimation is an attempt to use the M/D/1 Queue to compute pure CPU usage for Solaris, since its CPU_PER_EXEC (Table 6.1) is almost constant. For Linux, and particularly AIX, it is hard to apply, since their CPU_PER_EXEC (Table 6.1) increases with SESSIONS (for AIX due to SMT).
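To illustrate, the utilization query above can be extended with the remaining M/D/1 columns. A sketch, with the service time fixed at 61 us (around the Solaris CPU_PER_EXEC of Table 6.1) instead of the measured per-row values, so the results deviate slightly from Table 6.3:

select sessions,
       round(100 * rho, 2) utilization_pct,
       round(61 / rho, 2) interarrival_time,
       round(61 * (1 + rho / (2 * (1 - rho))), 2) sys_waiting_time,
       round(61 * (1 + rho / (2 * (1 - rho))) / sessions, 2) avg_waiting_per_session
  from (select level sessions,
               (1 + level) - sqrt(1 + level * level) rho
          from dual connect by level < 100)
 where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);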

6.2.5 Modeling

The above 10222 trace file shows that each Oracle Object access is accomplished by 3 consecutive Row Cache Gets (cid=17 (dc_global_oids), cid=11 (dc_object_ids), cid=7 (dc_users)). Now let's try to model those 3 Row Cache Object Gets, each of which includes 3 latch Gets occurring in 3 different latch locations ("Where").

In the following discussion, we model:

Oracle Session as Job Generator
Row Cache Object Get as Task
Task Processing Server as Machine

Suppose we have one Workshop W, and n Job Generators:

G = {G1, G2, ... ,Gn}

Every Generator holds one single Job at each instant (as soon as one terminates, a new one is produced).

Each Job is made of 3 Tasks (Sub-job, corresponding to Row Cache Object Gets of cid: 17, 11, 7):

J_i = {Si_1, Si_2, Si_3}

Each Task consists of 3 Work Units (corresponding to latch Gets in 3 "Where" locations):

Si_1 = {Si_1_u1, Si_1_u2, Si_1_u3}
Si_2 = {Si_2_u1, Si_2_u2, Si_2_u3}
Si_3 = {Si_3_u1, Si_3_u2, Si_3_u3}

They are subject to the constraints:

(1). All 3 Tasks in each Job have to be processed sequentially.
(2). All 3 Work Units in each Task have to be processed sequentially.
(3). The Tasks and Work Units among different Jobs can run in parallel.

The Workshop is equipped with an assembly line, which consists of 3 Machines (processing cid: 17,
11, 7 respectively):

W = {M1, M2, M3}

Each Machine possesses 3 Processing Units (corresponding to latch "Where" locations: kqrpre:find
obj, kqreqd, kqreqd:reget):

M1 = {p1_1, p1_2, p1_3}
M2 = {p2_1, p2_2, p2_3}
M3 = {p3_1, p3_2, p3_3}

The 3 Machines are dedicated respectively to the 3 Tasks as follows:

M1 exclusively processes Si_1
M2 exclusively processes Si_2
M3 exclusively processes Si_3

M1, M2, M3 are running in parallel (inter-parallel); but each Machine’s 3 processing Units have to be
running in serial (no intra-parallel).

The service times of the 3 Machines for each assigned Task are:

t1 for M1 to process Si_1
t2 for M2 to process Si_2
t3 for M3 to process Si_3

So the minimum processing time of one single Job by the 3 Machines is (t1 + t2 + t3).

Let's look at the processing of the first n Jobs.

Assume t1 < t2 < t3 (the data in this section's case matches this presumption). When the i-th Job is being processed by M2, there are (n-i) Jobs not yet processed. Since M2 is (t2 - t1) slower than M1, processing the i-th Job adds an aggregate delay of (n - i)*(t2 - t1) for the remaining (n-i) Jobs. Within this amount of delay, M1 can process more Jobs; equally, the following number of Jobs:

(n-i)*(t2-t1)/t1

are waiting between M1 and M2 when the i-th Job is being processed by M2.

Similarly, there are

(n-i)*(t3-t2)/t2

Jobs waiting between M2 and M3 when the i-th Job is being processed.

So M3 processing the i-th Job creates a delay of

(t3-t2)

for

(n-i)*(t3-t2)/t2

Jobs, which means the accumulated delay caused by the i-th Job (J_i) is:

(n-i) * (t3-t2)^2 / t2,   for 1 <= i <= n

The total accumulated waiting time before M3 for processing all n Jobs is:

(t3-t2) * (n*(n-1)/2) * (t3-t2)/t2
= n*(n-1) * (t3-t2)^2 / (2 * t2)

In the case t1 < t2 < t3, all Jobs are waiting before M3. If we only consider the fastest machine M1 and the slowest machine M3, i.e. min(t1, t2, t3) = t1, max(t1, t2, t3) = t3, where M1 is the first machine, M3 is the last one, and both run in parallel, the accumulated waiting time inside the whole Workshop can be approximately expressed as:

(t3-t1) * (n*(n-1)/2) * (t3-t1)/t1
= n*(n-1) * (t3-t1)^2 / (2 * t1)

Average waiting time per Job is:

(n-1) * (t3-t1)^2 / (2 * t1)

Putting it all together, the average response time can be expressed as:

avg_response_time (rt) = waiting_time (wt) + service_time (st)
                       = (n-1) * (t3-t1)^2 / (2 * t1) + t3

throughput = 1 / rt <= 1 / t3

Therefore, we can summarize as follows:

(1). the total response time in the entire Workshop is quadratic in n.

(2). the average response time per Job is linear in n, which is similar to the test result.

(3). the maximum throughput of the Workshop is 1 / avg_response_time.

Take the Solaris 10222 trace file at the beginning of this section:

t1 = 255 us for cid = 17
t2 = 324 us for cid = 11
t3 = 410 us for cid = 7 (covered by cid = 11)

sum = t1 + t2 + t3 = 989 (255+324+410)

The times collected in the 10222 trace are much higher than the Table 6.1 Solaris EL_PER_EXEC for one single session (SESSIONS=1), probably because of the overhead of 10222 tracing. So we first convert them to numbers without tracing (numbers commented after "->").

EL_PER_EXEC = 51 -- Table 6.1 Solaris SESSIONS=1

t1 + t2 + t3 = 989 (255+324+410) = sum -> 51

t1 = 255 -> 51 * 255/989 = 13
t2 = 324 -> 51 * 324/989 = 17
t3 = 410 -> 51 * 410/989 = 21

Now we can try to make the calculation for 12 SESSIONS (n=12). The accumulated waiting time inside the entire Workshop is:

n*(n-1) * (t3-t1)^2 / (2 * t1)
= 12*11*(21-13)*(21-13)/(2*13) = 324

Average waiting time for each Job:

(n-1) * (t3-t1)^2 / (2 * t1)
= 11*(21-13)*(21-13)/(2*13) = 27

Average response time for each Job:

27 + 21 = 48
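It also agrees with the model value MODEL_EL_PER_EXEC = 48 for SESSIONS=12 computed later in Table 6.4.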

6.2.6 Model Limitations

In the above discussion, Job is modelled by 2 layers:

Task
Work Unit

Workshop is also modelled by 2 processing layers correspondingly:

Machine
Processing Units

So far, we have only considered the first layer. The two second-level layers below have not yet been taken into account:

Work Units
Processing Units

Another deficiency is that there are n Job Generators, and each Generator can produce the next Job once the previous Job has terminated (boundary condition), so there are in total n Jobs in the system at any time; but the above model only considers processing the first n Jobs once.

Back to the Row Cache, the scope of the model is still too far from reflecting the real system, because we only consider inter-parallelism, but no intra-parallelism. This is not precise, because intra-parallelism is related to the 2nd and 3rd latch Get "Where" locations, whereas inter-parallelism is only related to the 1st latch Get location.

Besides that, the real system is influenced by latch timeouts, spinning, process preemption, multithreading, and hardware particularities (e.g. Solaris LWP, AIX SMT).

As evidenced by the model, there exists no deadlock in such a system since all operations in each Job
(Oracle session) are performed sequentially.

6.2.7 Model Justification

In order to verify the presented model, all 3 parameters t1, t2, t3 have to be acquired.

First we run foo_proc (see Blog [58]) with 10222 trace in an idle system. Then we extract the lines of all 3 Cache Gets (cid: 17, 11, 7) for one object instance Get, for example T_OBJ_OUT (see the 10222 trace file at the beginning of section 6.2.2). It is only illustrative; in practice, more representative data should be captured.

Looking at that example output from Solaris (irrelevant lines removed), the values for the 3 Cache Gets are captured ("time" in microseconds, enclosed in comment lines like <-- xx -->), then converted into values (all prefixed by "->") without 10222 tracing, according to the value for SESSIONS=1 (no concurrency) in Table 6.1 (AIX and Linux are added for comparison).

----------------------- Solaris ----------------------

EL_PER_EXEC = 51 -- Table 6.1 Solaris SESSIONS=1

t1 = 255 us -> 13 for cid = 17
t2 = 324 us -> 17 for cid = 11
t3 = 410 us -> 21 for cid = 7 (covered by cid = 11)

total Elapsed = 989 us (255+324+410) -> 51

----------------------- Linux ----------------------

EL_PER_EXEC = 30 -- Table 6.1 Linux SESSIONS=1

t1 = 136 us -> 8 for cid = 17
t2 = 164 us -> 9 for cid = 11
t3 = 229 us -> 13 for cid = 7 (covered by cid = 11)

total Elapsed = 529 us (136+164+229) -> 30

----------------------- AIX ----------------------

CPU_PER_EXEC = 38 -- Table 6.1 AIX SESSIONS=1 (the converted values below match
                  -- the CPU time 38 rather than the elapsed time 65, due to
                  -- AIX PURR accounting; see section 6.3)

t1 = 143 us -> 9 for cid = 17
t2 = 186 us -> 12 for cid = 11
t3 = 252 us -> 16 for cid = 7 (covered by cid = 11)

total Elapsed = 581 us (143+186+252) -> 38

Using the above average response time formula:

(n-1) * (t3-t1)^2 / (2 * t1) + t3

we substitute all variables and run the following queries (for SESSIONS varying from 1 to 48) to get the average model response time in microseconds (us) per execution: MODEL_EL_PER_EXEC.

Both test data and model data are shown in Table 6.4.

The following queries are used to calculate model data.

-------- Solaris --------

select * from (
select level sessions,
round((level-1)*(21-13)*(21-13)/(2*13)) + 21 model_el_per_exec
from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);

-------- Linux --------

select * from (
select level sessions,
round((level-1)*(13-8)*(13-8)/(2*8)) + 13 model_el_per_exec
from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);

-------- AIX --------

select * from (
select level sessions,
round((level-1)*(16-9)*(16-9)/(2*9)) + 16 model_el_per_exec
from dual connect by level < 100
) where sessions in (1, 3, 6, 12, 18, 24, 36, 42, 48);

All tests are done in Oracle 12.1.0.2.0 on Solaris, Linux, AIX (SMT 4, LCPU=24) with 6 physical
processors. Linux and AIX are added for comparison.

Now it is open to examine the model by comparing the empirical observations with the model-predicted values, and to inspect its capability of extrapolation.

Parallel | Test | Test | Model
SESSIONS | EXECUTIONS | US_PER_EXEC | US_PER_EXEC
---------| ---------- | ----------- | -----------
Solaris  | | |
1 | 8,096,318 | 51 | 21
3 | 21,850,342 | 57 | 26
6 | 32,609,592 | 75 | 33
12 | 42,217,169 | 125 | 48
18 | 42,890,535 | 193 | 63
24 | 42,955,245 | 265 | 78
36 | 43,117,305 | 406 | 107
42 | 42,523,671 | 480 | 122
48 | 40,546,142 | 561 | 137
---------| ---------- | ----------- | -----------
Linux | | |
1 | 13,992,932 | 30 | 13
3 | 32,715,487 | 39 | 16
6 | 52,665,062 | 48 | 21
12 | 49,743,420 | 98 | 30
18 | 50,448,264 | 149 | 40
24 | 50,836,921 | 201 | 49
36 | 51,864,133 | 307 | 68
42 | 48,914,411 | 390 | 77
48 | 48,549,375 | 460 | 86
---------| ---------- | ----------- | -----------
AIX | | |
1 | 6,535,357 | 65 | 16
3 | 17,355,458 | 74 | 21
6 | 21,327,093 | 117 | 30
12 | 24,797,761 | 198 | 46
18 | 25,037,784 | 297 | 62
24 | 20,691,357 | 542 | 79
36 | 19,048,665 | 963 | 111
42 | 19,199,984 | 1,132 | 128
48 | 19,214,011 | 1,306 | 144

Table 6.4: Test and Model Elapsed Time With Number of Parallel Sessions

6.3 IBM AIX POWER CPU Usage and Throughput

AIX POWER introduced an advanced, scalable CPU accounting model, which often causes confusion. When people see:

elapsed time > cpu time + wait time

in Sql Trace, the difference is attributed to unaccounted time. When the AWR sections "SQL ordered by Elapsed Time" or "SQL ordered by CPU Time" never report %CPU over 65%, it is regarded as wrong, or the system is thought to still have free capacity to explore.

In this section, we discuss our thinking on AIX CPU accounting, and thereafter build a model to formulate the CPU usage of the POWER SMT architecture. At the same time, we perform tests and collect real system statistics to verify whether our model represents the real behaviour (see Blog [42] for more details).

6.3.1 POWER7 and POWER8 Execution Units

The POWER7 core (see POWER7 [62]) is made of 12 execution units (16 units in POWER8):

2 fixed-point units
2 load/store units
4 double-precision floating-point units
1 (2*) vector unit supporting VSX
1 decimal floating-point unit
1 branch unit
1 condition register unit
2* load pipelines (no results to store)
1* cryptographic pipeline (AES, Galois Counter Mode, SHA-2)

Note: All units of POWER8 different from POWER7 are marked by "*" (see POWER8 [63]).

6.3.2 CPU Usage and Throughput

In AIX, the utilization of processors (cores) is related to Simultaneous Multi-threading (SMT). With SMT=4, each core provides 4 Hardware Thread Contexts (HTC, logical CPUs) and can simultaneously execute 4 software threads (processes, tasks). Due to the hardware implementation, it is, for example, not possible to run more than 2 FP operations on the same core at the same time (cycle). Therefore, with SMT=4, the number of instructions executed by a single HTC slows down, but the overall throughput per core goes up. IBM claims a 60% boost of throughput.

Let h represent the number of active HTCs on one core; that means when 4 processes run on a core (h=4), it delivers 1.6 times the throughput of a single process per core ([6], [10]). If h=2, the boost is 40%, i.e. 1.4 times the throughput.

Mathematically, with h=4, one could think that 25% core usage provides 40% CPU power. With 40% CPU, the response time is 2.5 (= 1/0.4) times longer than on a full CPU, rather than 4 times longer.

Now comes the puzzle: how much CPU usage should we report for each HTC and each process in the above example? 25% or 40%?

Academically, measuring and modelling SMT CPU usage is an ongoing research subject ([17]). POWER advanced with a new model of CPU usage. The primary and ingenious intent of POWER is to build a linear relationship between CPU utilization and real throughput (e.g., transactions per second) [6], so that CPU utilization can serve as a direct mapping to application performance in measurements. Therefore, by looking at the percentage of CPU usage, we can deduce the rate of business transactions. Compared to other CPU utilization models, where the CPU percentage is non-linear to throughput, the AIX CPU model is innovative and representative, hence meaningful.

For example, configure SMT=4: a maximum of 4 threads run per core, each thread shares 25% of one whole core and provides 40% throughput compared to h=1. To build up the linear relation of throughput to CPU usage, the CPU usage for h from 1 to 4 can be computed as:

CPU%(h=1) = (1.0/0.4) * 25% = 62.50%
CPU%(h=2) = (0.7/0.4) * 25% = 43.75%
CPU%(h=3) = (0.5/0.4) * 25% = 31.25%
CPU%(h=4) = (0.4/0.4) * 25% = 25.00%
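In general, CPU%(h) = (Throughput/h at h) / (Throughput/h at h=4) * 25%, with the Throughput/h values listed in Table 6.6 below.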

Grouping them together, we get a discrete function f(h) as shown in Table 6.5:

h CPU%
1 62.50
2 43.75
3 31.25
4 25.00

Table 6.5: smt (HTC) and CPU Usage - Power7

Note that for h=3, the boost of 50%, or 1.5 times throughput, stems from empirical system tests, and can be inaccurate.

Expressing throughput as a linear function of CPU usage, it looks like:

t = f(h) * u

where t stands for throughput, h for the active HTCs on one core (with values 1, 2, 3, 4), and u for CPU usage.

Putting it all together, we can draw Table 6.6, which shows that the maximum CPU usage of an HTC (logical CPU), and hence of an OS software thread (process or task), is 62.5%. On POWER7 with SMT=4, it would be astonishing to observe a process CPU usage of more than 65%, or an HTC's CPU usage of more than 65% (AIX command "mpstat -s" output).

h CPU% Throughput/h Throughput/core
1 62.50 1.0 1.0
2 43.75 0.7 1.4
3 31.25 0.5 1.5
4 25.00 0.4 1.6

Table 6.6: CPU Usage and Throughput Model - Power7

Picking performance test data out of Blog [40] Table-1 (tested on POWER7, 4 cores, SMT=4, Oracle 11.2.0.3.0), and verifying it against the above linear relations (TP is an abbreviation of Throughput), we get Table 6.7:
JOB_CNT  h  h_sum  C2_RUN_CNT  TP_Test/h     TP_Theory/h  TP_Based_CPU%  TP_Ratio_to_Min
      1  1      1         119  119 (119/1)        115.00          64.67    2.59 (119/46)
      8  2      8         580   73 (580/8)         80.50          39.40    1.58 (73/46)
     12  3     12         654   55 (654/12)        57.50          29.89    1.20 (55/46)
     16  4     16         730   46 (730/16)        46.00          25.00    1.00 (46/46)

Table 6.7: CPU Usage and Throughput Test Result

where the column TP_Theory/h is linearly interpolated based on the CPU% of Table 6.6, calculated using the start point TP_Test/h = 46 for h=4 as follows:
TP_Theory = 46*(0.25/0.25) = 46.00 for h=4
TP_Theory = 46*(0.3125/0.25) = 57.50 for h=3
TP_Theory = 46*(0.4375/0.25) = 80.50 for h=2
TP_Theory = 46*(0.6250/0.25) = 115.00 for h=1

and TP_Based_CPU% is computed as:

TP_Based_CPU% = (46/46)*25% = 25.00% for h=4
TP_Based_CPU% = (55/46)*25% = 29.89% for h=3
TP_Based_CPU% = (73/46)*25% = 39.40% for h=2
TP_Based_CPU% = (119/46)*25% = 64.67% for h=1

Table 6.7 shows that TP_Theory is close to TP_Test with less than 10% error ((TP_Test - TP_Theory) / TP_Theory). Therefore, the theoretical AIX CPU usage, computed according to the model, can be applied as a calibrated, scalable metric.

Usually, applications with a lot of transactions are benchmarked in terms of throughput. The AIX CPU
model, which maps throughput linearly to CPU usage, provides a practical way to assess application
performance.

In traditional modelling, CPU usage represents the throughput, and its complement (1 - usage) stands for the remaining available capacity. One process running in one core with a CPU usage of 62.5% on the first HTC would mean that there is still 37.5% available capacity on the other 3 HTCs, each of which can share a portion of 12.5%.

Applying such a model to assess CPU utilization for charging back computing resources, and its complement for capacity planning prediction, is obviously no longer reasonable and accurate, since the remaining 37.5% does not represent the same proportional capacity.

In practice, the new AIX model of SMT CPU accounting is not widely acknowledged, and often causes confusion. For example, Oracle Note 1963791.1:

Unaccounted Time in 10046 File on AIX SMT4 Platform when Comparing Elapsed and CPU Time (Doc
ID 1963791.1) [60]

where session trace showed:

call     count    cpu    elapsed
------- ------  ------- ----------
Parse 1 0.00 0.00
Execute 1 0.00 0.00
Fetch 2187 86.86 142.64
------- ------ -------- ----------
total 2189 86.86 142.64

Event waited on                Times Waited  Max. Wait  Total Waited
-----------------------------  ------------  ---------  ------------
SQL*Net message to client 2187 0.00 0.00
SQL*Net message from client 2187 0.08 7.06
latch: cache buffers chains 6 0.00 0.00
latch free 1 0.00 0.00

and the difference:

elapsed_time - (cpu_time + waited_time) =
142.64 - (86.86 + 7.06) =
48.72 seconds

is interpreted as ”Unaccounted Time”.

In fact, applying the AIX CPU model, we get:

86.86/142.64 = 60.90%

which indicates that this single Oracle session alone occupies almost one full core, because 60.90/62.50 = 97.44% is close to 100% usage for h=1 (hopefully that Oracle Note can confirm it). In fact, the Sql trace in the above Oracle Note does not contain any disk waits; all time is truly consumed by CPU intensive Buffer Cache consistent reads. With the default plsql array fetch size of 100, the above query returns about 2187*100 = 218,700 rows with pure Buffer Gets (although it is not shown there).

Blog [31] also reported a similar observation on AIX POWER7, trying to explain the unaccounted time in the same fashion.

Probably people working on other UNIX systems (Solaris, HP-UX, Linux) are used to the intuitive interpretation of CPU time and elapsed time, but with the advance of multi-threaded processors like AIX POWER, some re-thinking would help dispel the confusion, so that CPU resources can be efficiently allocated and accurately assessed.

We can make a small test to reproduce the case reported in the above Oracle MOS Note. At first create a test table, then run a Plsql block with Sql Trace (trace the second run to ensure no disk reads, the same as in the above Oracle Note) on AIX and Solaris, each of which has only this single test session; AIX is configured with vpm_throughput_mode = 0 (to be discussed later in section 6.3.4). Here is the output from AIX and Solaris (Linux is similar to Solaris).

drop table cpu_testt purge;

create table cpu_testt as select level x, rpad('y'||level, 5, 'X') y from dual connect by level <= 1000000;

begin
  for rec in (select * from cpu_testt) loop
    null;
  end loop;
end;
/

SQL ID: 0fd1yu6p0uwjz Plan Hash: 583555168


SELECT * FROM CPU_TESTT

************************************ AIX ***************************************

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 10001 0.35 0.59 0 12216 0 1000000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10003 0.35 0.59 0 12216 0 1000000

*********************************** Solaris ************************************

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 10001 0.51 0.51 0 12216 0 1000000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 10003 0.51 0.51 0 12216 0 1000000

We can see the different CPU usage accounting in the two systems:

AIX 0.35/0.59 = 59%

Solaris 0.51/0.51 = 100%
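The AIX ratio of 59% is again close to the model's CPU%(h=1) = 62.50% in Table 6.6, while Solaris accounts the full elapsed time as CPU time.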

6.3.2.1 POWER8 and POWER9

Different from POWER7, POWER8 increased the maximum SMT from 4 to 8. For POWER8, the model has to be slightly adjusted due to its additional execution units, as listed at the beginning of this section.

For example, when setting SMT=4, Table 6.8 shows the figures for h = 1 and 4:
h CPU% Throughput/h Throughput/core
1 60.00 1.00 1.00
4 25.00 0.42 1.67

Table 6.8: CPU Usage and Throughput Model - Power8 SMT=4

When SMT=8, Table 6.9 shows the figures for h = 1 and 8:

h CPU% Throughput/h Throughput/core
1 56.00 1.00 1.00
8 12.50 0.22 1.79

Table 6.9: CPU Usage and Throughput Model - Power8 SMT=8

Compared to POWER7, the POWER8 maximum Throughput/core increased by about 5% for SMT=4, and 10% for SMT=8:

(0.42-0.4)/0.4 = 5% for SMT=4

(0.22*2-0.4)/0.4 = 10% for SMT=8

Further tests are needed in order to get more precise modelling.

Similar modelling can be applied to POWER9. Note that the POWER9 core comes in two variants: SMT4 and SMT8 (see POWER9 [64]). According to IBM (see [1]), to provide the best out-of-the-box performance experience, the POWER9 default SMT setting for AIX 7.2 TL 3 has been changed to SMT8 (the POWER8 default remains SMT4).

6.3.2.2 AIX Internal Code

The AIX struct procinfo used by the getprocs subroutine (/usr/include/procinfo.h) contains a comment on pi_cpu:

struct procinfo
{
    /* scheduler information */
    unsigned long pi_pri;  /* priority, 0 high, 31 low */
    unsigned long pi_nice; /* nice for priority, 0 to 39 */
    unsigned long pi_cpu;  /* processor usage, 0 to 80 */
    ...
}

Looking at the comment "processor usage, 0 to 80": probably, in the minds of the AIX developers, processor usage is not allowed to go over 80.

6.3.3 POWER PURR

To record the overall LPAR processor utilization, AIX is equipped with a dedicated register, called PURR. AIX has detailed and broad documentation about it. Here are some excerpts [6] [26]:
POWER5 includes a per-thread Processor Utilization Resource Register (PURR), which increments at
the timebase frequency multiplied by the fraction of cycles on which the thread can dispatch instructions.

Beginning with the IBM POWER5(TM) processor architecture, a new register, PURR, is introduced to assist in computing the utilization. PURR stands for Processor Utilization of Resources Register, and it is available per Hardware Thread Context.

The PURR counts in proportion to the real time clock (timebase).

The SPURR stands for Scaled Processor Utilization of Resources Register.

The SPURR is similar to PURR except that it increments proportionally to the processor core frequency.

The AIX lparstat, sar & mpstat utilities are modified to report the PURR-SPURR ratio via a new column,
named ”nsp”.

The above documentation also demonstrates the enhanced commands:

time (timex)

sar -P ALL

mpstat -s

lparstat -E

pprof -r PURR

AIX provides a proprietary command pprof with flag: -r PURR to report CPU usage in PURR time instead
of TimeBase.

To understand the PURR time report, we have run a few CPU intensive tests on an AIX POWER7 system with 4 cores, SMT=4 (lcpu=16), vpm_throughput_mode = 0, and measured the CPU usage with this command (see Blog [42] for more details). Here are two tests and the respective command outputs.

6.3.3.1 One Single Session

In the first, we run one single CPU intensive job and track its PURR time for 100 seconds by:

exec xpp_test.run_job(p_case => 2, p_job_cnt => 1, p_dur_seconds => 120);

collect CPU usage by:

pprof 100 -r PURR

and display the report by:

head -n 50 pprof.cpu

The output shows (irrelevant lines removed):

Pprof CPU Report

E = Exec’d F = Forked
X = Exited A = Alive (when traced started or stopped)
C = Thread Created

* = Purr based values

Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j000_testdb 42598406 7864540 AA 21299317 0 62.930 0.037 99.805 99.768

Legend:
Pname: Process Name
PID: Process ID
PPID: Parent Process ID
BE: Process State Beginning and End
TID: Thread ID
PTID: Parent Thread ID
ACC_time: Actual CPU Time
STT_time: Start Time
STP_time: Stop Time
STP_STT: The difference between the Stop time and the Start time

It shows that PURR CPU usage is about 62.930/99.768 = 63%, which is the number of CPU% in our
model for h=1 (see Table 6.6).

If tracking with TimeBase by:

pprof 100

The output (head -n 50 pprof.cpu) looks like:

Pname PID PPID BE TID PTID ACC_time STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ======== ======== ======== ========
ora_j000_testdb 1835064 0 AA 2687059 0 99.899 0.016 99.916 99.900

which reports a CPU usage of about 99.899/99.900 = 100%, i.e., the accounting as habitually understood.

6.3.3.2 Multiple Concurrent Sessions

In the second test, we start 8 CPU intensive Oracle sessions on the 4-core AIX; that means each core runs 2 jobs, 2 threads are active per core, i.e. h=2:

exec xpp_test.run_job(p_case => 2, p_job_cnt => 8, p_dur_seconds => 120);

and look at the PURR report for one Oracle process:

Pname PID PPID BE TID PTID ACC_time* STT_time STP_time STP-STT
===== ===== ===== === ===== ===== ========= ======== ======== ========
ora_j007_testdb 17760298 7864540 AA 57475195 0 42.910 0.340 99.210 98.870

It reports that the PURR CPU usage is about 42.910/98.870 = 43%, which is the CPU% in our model for h=2 (see Table 6.6 above). In Oracle AWR, we can see similar values in the column "%CPU" of all "SQL Statistics" sections.

6.3.4 vpm_throughput_mode

If one is confused by the Oracle AWR maximum CPU% being limited to under 43% even though the AIX system is quite idle, it is probably related to the AIX scheduler configuration.

The AIX scheduler has a dedicated tunable parameter, vpm_throughput_mode, which regulates the desired level of SMT exploitation for scaled throughput mode. A value of 0 gives the default behaviour (raw throughput mode). A value of 2 or 4 selects the scaled throughput mode and the desired level of SMT exploitation. It controls the number of threads used by one core before using the next core, and is documented as follows:

schedo -p -o vpm_throughput_mode = <value>

0: Legacy Raw mode (default)
1: Enhanced Raw mode with a higher threshold than legacy
2: Scaled mode, use primary and secondary SMT threads
4: Scaled mode, use all four SMT threads

6.3.4.1 Raw Mode (0, 1)

It provides the highest per-thread throughput and best response times, at the expense of activating more physical cores. For example, Legacy Raw mode (default) dispatches workload to all primary threads before using any secondary threads.

Secondary threads are activated when the load of all primary threads is over a certain utilization, probably 50%, and new workload (a process) comes to be dispatched for running.

The 3rd and 4th threads are activated when the load of the secondary threads is over a certain utilization, probably 20%, and new workload (a process) comes to be dispatched for running.

6.3.4.2 Scaled Mode (2, 4)

It intends the highest per-core throughput at the expense of per-thread response times and per-thread throughput. For example, Scaled mode 2 dispatches workload to both the primary and secondary threads of one core before using those of the next core. Scaled mode 4 dispatches workload to all 4 threads of one core before using those of the next core.

In Scaled mode 2, the 1st and 2nd threads of each core are bound together, thus both have a similar workload (CPU usage). The 3rd and 4th threads are activated when the load of the 1st and 2nd threads is over a certain utilization, probably 30%, and new workload (a process) comes to be dispatched for running.

Note that this tuning intention is per active core, not all cores in the LPAR. In fact, it is aimed at activating fewer cores. It would be a setting conceived for test systems with a few LPARs.

Referring to Table 6.6, vpm_throughput_mode = 2 corresponds to h = 2: two threads running per core, Throughput/HTC = 0.7, CPU% = 43.75. In real applications with Scaled mode 2, we also observed that CPU% is constrained under 43% even if the runqueue is shorter than the number of cores. That means that even though the workload is low, CPU% cannot score up to its maximum of 62.50, and applications cannot benefit from the maximum Throughput/HTC. For performance critical applications, Scaled mode is questionable. On the contrary, Raw mode automatically tunes the CPU% based on the workload. That is probably why vpm_throughput_mode is set to 0 by default.

We can see there is no vpm_throughput_mode=3. Probably it is related to the particularity of the non-existent smt=3 mode mentioned in Blog [10].

There is also a naming confusion. According to IBM, POWER7 runs in "Legacy Raw mode" by default, and POWER6 behaves like "Scaled throughput mode". Literally, "Legacy" means it was used in some previous model or release, but here POWER6 uses something like "Scaled mode", and a later model (POWER7) introduced a "Legacy" mode 0 (it could hint at certain technical decisions during POWER development).

6.3.5 Observations

Since 2010, we have been monitoring hundreds of diversely configured AIX systems (POWER7 and POWER8), verifying the above model against thousands of Oracle reports (AWR and Sql Traces), and checking the output of various AIX performance commands (particularly PURR related commands); no offending exceptions have been discovered yet.

Extensive tests also showed that the above model is a close approximation to the output of the AIX command pprof -r PURR (Blog [42] contains more test cases and discussions).

Bibliography

[1] IBM United States Software Announcement 218-381. Ibm aix 7.2 delivers enhancements for workload
scalability, high availability, security, and i/o features. https://www-01.ibm.com/common/ssi/cgi-
bin/ssialias?infotype=AN&subtype=CA&htmlfid=897/ENUS218-381&appname=USN, 2018-08-07.

[2] Alexander Anokhin. Dynamic tracing of oracle logical i/o. https://alexanderanokhin.com/2011/11/13/dynamic-tracing-of-oracle-logical-io/, 2011-11-13.

[3] Alexander Anokhin. Dynamic tracing of oracle logical i/o: part 2. dtrace lio v2 is released.
https://alexanderanokhin.com/2012/03/19/dtrace-lio-new-features/, 2012-03-19.

[4] Christian Antognini. Troubleshooting Oracle Performance (2nd Edition). Apress, 2014.

[5] Stew Ashton. How can ora rowscn change between queries when no update?
https://community.oracle.com/thread/4060768?start=0&tstart=0.

[6] Saravanan Devendran. Understanding cpu utilization on aix. https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20Systems/page/Unde, 2015-09-10.

[7] Julian Dontcheff. Reducing ”library cache: mutex x” concurrency with dbms shared pool.markhot.
https://juliandontcheff.wordpress.com/2013/02/12/reducing-library-cache-mutex-x-concurrency-
with-dbms shared pool-markhot/, 2013-02-12.

[8] Anju Garg. Latches, locks, pins and mutexes. http://oracleinaction.com/latche-lock-pin-mutex/, 2013-01-05.

[9] Russell Green. Understanding shared pool memory structures (oracle white paper).
https://www.oracle.com/technetwork/database/manageability/ps-s003-274003-106-1-fin-v2-
128827.pdf, 2005-09.

[10] Nigel Griffiths. Local, near & far memory part 3 - scheduling processes to smt
& virtual processors. https://www.ibm.com/developerworks/community/blogs/aixpert/entry/
local near far memory part 3 scheduling processes to smt virtual processors130?lang=en, 2011-09-
05.

[11] Thomas Kyte. Effective Oracle by Design. ORACLE Press, 2003.

[12] Thomas Kyte. Expert Oracle Database Architecture. Apress, 2010.

[13] Jonathan Lewis. The commit scn - an undocumented feature. http://www.jlcomp.demon.co.uk/commit.html, 1999-05.

[14] Jonathan Lewis. Clean it up. https://jonathanlewis.wordpress.com/2009/06/16/clean-it-up/, 2009-06-16.

[15] Jonathan Lewis. Oracle Core: Essential Internals for DBAs and Developers. Apress, 2011.
[16] David Litchfield. The oracle data block. http://www.davidlitchfield.com/OracleForensicsDataBlock.pdf,
2010-10-27.
[17] Carlos Luque. Cpu accounting in multi-threaded processors. Department of Computer Architecture, Universitat Politècnica de Catalunya, 2014.
[18] Andrey S. Nikolaev. Divide and conquer the true mutex contention.
https://andreynikolaev.wordpress.com/2011/05/01/divide-and-conquer-the-true-mutex-
contention/, 2011-05-01.
[19] Oracle. Database administrator’s guide - distributed transactions concepts.
https://docs.oracle.com/cd/E11882 01/server.112/e25494/ds txns.htm#ADMIN031.
[20] Oracle. Database pl/sql language reference - commit statement.
https://docs.oracle.com/cd/E11882 01/appdev.112/e25519/static.htm#LNPLS592.
[21] Oracle. Database reference 12c release 1 (12.1.0.2). https://docs.oracle.com/database/121/REFRN/GUID-
DE96A76F-9FA4-4656-907B-62D55C027000.htm#REFRN00530.
[22] Oracle. Database sql language reference - ora rowscn pseudocolumn.
https://docs.oracle.com/database/121/SQLRF/pseudocolumns007.htm#SQLRF50953.
[23] Oracle. Oracle8i reference release 8.1.5 (a67790-01) initialization parameters.
https://docs.oracle.com/cd/F49540 01/DOC/server.815/a67790/ch1.htm.
[24] Oracle. Oracle9i application developer’s guide - large objects (lobs) re-
lease 1 (9.0.1) part number a88879-01 - temporary lob performance guidelines.
https://docs.oracle.com/cd/A91202 01/901 doc/appdev.901/a88879/adl09be4.htm.
[25] Oracle. V$result cache objects. https://docs.oracle.com/database/121/REFRN/GUID-
2DA2EDEA-8B1D-42E6-A293-663B3124AAFD.htm#REFRN30438.
[26] P. Mackerras, T. S. Mathews, and R. C. Swanberg. Operating system exploitation of the power5 system. IBM J. Res. Dev., 49(4/5):533-539, 2005.
[27] Franck Pachot. Investigating oracle lock issues with event 10704. https://blog.dbi-
services.com/investigating-oracle-lock-issues-with-event-10704/, 2014-03-14.
[28] Tanel Poder. Oracle memory troubleshooting, part 1: Heapdump analyzer.
https://blog.tanelpoder.com/2009/01/02/oracle-memory-troubleshooting-part-1-heapdump-
analyzer, 2009-01-02.
[29] Tanel Poder. Recursive sessions and ora-00018: maximum number of sessions ex-
ceeded. http://tech.e2sn.com/oracle/oracle-internals-and-architecture/recursive-sessions-and-ora-
00018-maximum-number-of-sessions-exceeded, 2010-01-22.
[30] Tanel Poder. V8 bundled exec call and oracle program interface (opi) calls.
https://blog.tanelpoder.com/2011/08/23/v8-bundled-exec-call-and-oracle-program-interface-opi-
calls/, 2011-08-23.
[31] Marcin Przepiorowski. Oracle on aix - where’s my cpu time ?
http://oracleprof.blogspot.com/2013/02/oracle-on-aix-wheres-my-cpu-time.html, 2013-02-21.
[32] qqmengxue. The oracle data block. http://blog.itpub.net/10130206/viewspace-1042721/, 2010-12-
07.
[33] Craig Shallahamer. Oracle Performance Firefighting. Orapub, 2009.

[34] Kun Sun. dbms session package memory utilization. http://ksun-
oracle.blogspot.com/2011/04/dbmssession-packagememoryutilization.html, 2011-04-27.

[35] Kun Sun. Update restart and new active undo extent. http://ksun-
oracle.blogspot.com/2011/05/update-restart-and-new-undo-extent.html, 2011-05-22.

[36] Kun Sun. One mutex collision test. http://ksun-oracle.blogspot.com/2012/07/one-mutex-collision-test.html, 2012-07-30.

[37] Kun Sun. cursor: pin s wait on x. http://ksun-oracle.blogspot.com/2013/04/cursor-pin-s-wait-on-x_12.html, 2013-04-12.

[38] Kun Sun. dbms aq.dequeue - latch: row cache objects on aix. http://ksun-
oracle.blogspot.com/2014/03/dbmsaqdequeue-latch-row-cache-objects.html, 2014-03-27.

[39] Kun Sun. java stored procedure calls and latch: row cache objects. http://ksun-
oracle.blogspot.com/2014/05/java-stored-procedure-calls-and-latch.html, 2014-05-07.

[40] Kun Sun. java stored procedure calls and latch: row cache objects, and performance. http://ksun-
oracle.blogspot.com/2014/05/java-stored-procedure-calls-and-latch 7.html, 2014-05-07.

[41] Kun Sun. Oracle 11.2.0.4.0 awr ”tablespace io stats” column names shifted. http://ksun-
oracle.blogspot.com/2015/04/oracle-112040-awr-tablespace-io-stats.html, 2015-04-20.

[42] Kun Sun. Ibm aix power7 cpu usage and throughput. http://ksun-oracle.blogspot.com/2015/04/ibm-
aix-power7-cpu-usage-and-throughput.html, 2015-04-29.

[43] Kun Sun. Oracle bigfile tablespace pre-allocation and session blocking. http://ksun-
oracle.blogspot.com/2015/12/oracle-bigfile-tablespace-pre.html, 2015-12-07.

[44] Kun Sun. Performance of oracle object collection comparisons - part1. http://ksun-
oracle.blogspot.com/2016/01/performance-of-oracle-object-collection 13.html, 2016-01-13.

[45] Kun Sun. Performance of oracle object collection comparisons - part2. http://ksun-
oracle.blogspot.com/2016/01/performance-of-oracle-object-collection 9.html, 2016-01-13.

[46] Kun Sun. Sql parsing in serializable transaction throws ora-08177: can’t serialize access for this
transaction. http://ksun-oracle.blogspot.com/2016/06/sql-parsing-in-serializable-transaction.html,
2016-06-13.

[47] Kun Sun. Pl/sql multidimensional collection memory usage and performance. http://ksun-
oracle.blogspot.com/2016/09/plsql-multidimensional-collection.html, 2016-09-12.

[48] Kun Sun. Pl/sql function result cache invalidation (i). http://ksun-
oracle.blogspot.com/2017/03/plsql-function-result-cache-invalidation.html, 2017-03-20.

[49] Kun Sun. Pl/sql function result cache invalidation (i). http://ksun-
oracle.blogspot.com/2017/03/plsql-function-result-cache-invalidation.html, 2017-03-20.

[50] Kun Sun. nls database parameters, dc props, latch: row cache objects. http://ksun-
oracle.blogspot.com/2017/07/nlsdatabaseparameters-dcprops-latch-row.html, 2017-07-21.

[51] Kun Sun. Oracle logical read: Current gets access path and cost. http://ksun-
oracle.blogspot.com/2018/01/oracle-logical-read-current-gets-access.html, 2018-01-24.

[52] Kun Sun. Oracle physical read access path and cost. http://ksun-
oracle.blogspot.com/2018/01/oracle-physical-read-access-path-and.html, 2018-01-24.

[53] Kun Sun. row cache mutex in oracle 12.2.0.1.0. http://ksun-oracle.blogspot.com/2018/07/row-cache-
mutex-in-oracle-122010 28.html, 2018-07-28.
[54] Kun Sun. Oracle rowcache views and contents. http://ksun-oracle.blogspot.com/2018/10/oracle-
rowcache-views.html, 2018-10-18.

[55] Kun Sun. Oracle rowcache views and contents. http://ksun-oracle.blogspot.com/2018/10/oracle-rowcache-views.html, 2018-10-18.
[56] Kun Sun. Latch: row cache objects contentions and scalability (v). https://ksun-
oracle.blogspot.com/2018/11/latch-row-cache-objects-contentions-and.html, 2018-11-07.

[57] Kun Sun. Oracle row cache objects event: 10222, dtrace scripts (i). https://ksun-
oracle.blogspot.com/2018/11/oracle-row-cache-objects-event-10222.html, 2018-11-07.
[58] Kun Sun. Row cache objects, row cache latch on object type: Plsql vs java call (part-1)
(ii). https://ksun-oracle.blogspot.com/2018/11/row-cache-objects-row-cache-latch-on 7.html, 2018-
11-07.

[59] Kun Sun. Lob ora-22924: snapshot too old and fix. http://ksun-oracle.blogspot.com/2019/04/lob-
ora-22924-snapshot-too-old-and-fix.html, 2019-04-17.
[60] My Oracle Support. Unaccounted time in 10046 file on aix smt4 platform when comparing elapsed
and cpu time (doc id 1963791.1). https://support.oracle.com/.
Bug 13354348 : UNACCOUNTED GAP BETWEEN ELAPSED TO CPU TIME ON 11.2 IN AIX,
Bug 16044824 - UNACCOUNTED GAP BETWEEN ELAPSED AND CPU TIME FOR DB 11.2
ON PLATFORM AIX POWER7,
Bug 18599013 : NEED TO CALCULATE THE UNACCOUNTED TIME FOR A TRACE FILE,
Bug 7410881 : HOW CPUCOLLECTED ON AIX VIA EM,
Bug 15925194 : AIX COMPUTING METRICS INCORRECTLY.

[61] Wikipedia. M/d/1 queue. https://en.wikipedia.org/wiki/M/D/1_queue.

[62] Wikipedia. Power7. https://en.wikipedia.org/wiki/POWER7.
[63] Wikipedia. Power8. https://en.wikipedia.org/wiki/POWER8.
[64] Wikipedia. Power9. https://en.wikipedia.org/wiki/POWER9.

[65] J. W. J. Williams. Heapsort. https://en.wikipedia.org/wiki/Heapsort.

Index

_cursor_obsolete_threshold, 142
_db_block_max_cr_dba, 17
_kgl_large_heap_assert_threshold, 145
_kgl_large_heap_warning_threshold, 145
_memory_imm_mode_without_autosga, 145
10222 trace event, 161
10704 Enqueue Trace Event, 82
abstract lobs, 153
AIX POWER CPU Usage and Throughput, 174
AIX POWER PURR, 180
AIX Simultaneous Multi-threading (SMT), 175
AIX vpm_throughput_mode, 182
Asynchronous Commit, 66
cache lobs, 153
Cleanout, 53
cleanup_non_exist_obj, 120
cold read, 1
Collection Operator Performance, 159
commit cleanout, 49, 53
Commit SCN, 74
Commit SCN Result Cache, 77
consistent get, 17, 27
current read, 17, 27
Cursordump, 111
Data Block ITL Uba Linked List, 44
db file parallel read, 9
db file read, 1
db file scattered read, 9
db file sequential read, 8
db_block_hash_buckets, 126
db_file_multiblock_read_count, 6
dba_undo_extents, 39
delayed block cleanout, 53
delayed logging block cleanout, 49, 53
Disk Asynch IO, 11
disk read, 1
Distributed Transaction Commit, 71
Distributed Transactions Redo, 69
distributed_lock_timeout, 72
dtracelio.d, 27
fast_start_parallel_rollback, 62
Free Memory and Fragmentation in SGA, 132
hard parse, 109
Heapsort, 161
Hot Library Cache Objects, 104
ITL, 45
KGLH0, 130
KKSSP, 125
Latch gets, 99
Latch misses, 99
Latch Pseudo Code, 99
Latch sleeps, 99
Latch spin gets, 99
latch: cache buffers chains, 25
latch: cache buffers chains (CBC), 90
latch: cache buffers chains - Reverse Index, 96
latch: row cache objects, 86, 161
latches, 85
library cache lock (cycle) deadlock, 114
LOB Memory Leak, 155
LOB Memory Usage and Leak, 153
Locks, 79
logical read, 17
lseek(), 10
M/D/1 Queue - Row Cache, 166
Mutexes, 100
no parse, 109
nocache lobs, 153
non-existent object, 116
ORA-01555, 49
ORA-04020 deadlock, 114
ORA-04030, 146
ORA-04030 Lob, 157
ORA-04031, 122
ORA-08177, 59
ora_rowscn, 57, 76
parse count (total), 110
parse_calls, 110
PGA Memory, 146
physical read, 1
Piggybacked Commit, 68
Plsql Collection Memory Usage and Performance, 152
Plsql self-deadlock, 113
pread(), 10
read(), 9
readv(), 10
records_per_block, 17
recursive session, 119
recursive session and v$session, 119
Redo, 65
Redo/Undo Thick Declared Table Insert, 73
Row Cache Modeling, 168
row cache mutex 12cR2, 161
Row Cache Performance, 161
session_cached_cursors: KGLH0 and SQLA, 139
SET Collection Operator, 159
SGA Memory Usage and Allocation, 121
SMON ORA-00474, 62
soft parse, 109
softer parse, 109
SQLA, 127
switch current to new buffer, 18
Synchronous Commit, 67
TCHK, 128
TCHK (Typecheck heap), 129
temp_undo_enabled, 62
Temporary Table (GTT): Undo / Redo, 61
TRN CTL (Transaction Control), 46
TRN TBL (Transaction Table), 46
undo, 37
Undo Linked Lists, 43
Undo TRN CTL SLOT Linked List, 48
Undo TRN TBL TRX (Rec, rci) Linked List, 46
v$fast_start_transactions, 59
v$filestat, 13
v$iostat_file, 13
v$open_cursor, 139
v$process_memory_detail, 150
