
Chinar Aliyev

chinaraliyev@gmail.com

As is known, the Query Optimizer tries to select the best plan for a query: it generates possible plans, estimates the cost of each of them, and selects the cheapest one as the optimal plan. Estimating the cost of a plan is a complex process, but the cost is directly proportional to the number of I/Os, and there is a functional dependence between the number of rows retrieved from the database and the number of I/Os. So the cost of a plan depends on the estimated number of rows retrieved in each step of the plan – the cardinality of the operation. Therefore the optimizer should accurately estimate the cardinality of each step in the execution plan. In this paper we are going to analyze how the Oracle optimizer calculates join selectivity and cardinality in different situations: how does the CBO calculate join selectivity when histograms are available (including the new histogram types in 12c)? What factors does the estimation error depend on? In general there are two main join cardinality estimation methods: Histogram Based and Sampling Based.

Thanks to Jonathan Lewis for writing the “Cost Based Oracle Fundamentals” book. This book actually helped me to understand the optimizer's internals and to open the “Black Box”. In 2007 Alberto Dell'Era did excellent work investigating join size estimation with histograms. However, some questions remain, such as the introduction of a “special cardinality” concept. In this paper we are going to review this matter as well.

For simplicity we are going to use a single-column join and columns containing no NULL values. Assume we have two tables t1 and t2 with corresponding join columns j1 and j2; the remaining columns are filter1 and filter2. Our queries are:

(Q0)

SELECT COUNT (*)

FROM t1, t2

WHERE t1.j1 = t2.j2

AND t1.filter1 ='value1'

AND t2.filter2 ='value2'

(Q1)

SELECT COUNT (*)

FROM t1, t2

WHERE t1.j1 = t2.j2;

(Q2)

SELECT COUNT (*)

FROM t1, t2;

Histogram Based Estimation

As you know, the query Q2 is a Cartesian product, which means the join cardinality Card_cartesian of the join product is:

Card_cartesian = num_rows(t1) * num_rows(t2)

Here num_rows(t_i) is the number of rows of the corresponding table. When we add the join condition to the query (Q1), we actually get some fraction of the Cartesian product. To identify this fraction, Join Selectivity has been introduced. Therefore we can write:

Card_Q1 <= Card_cartesian

Card_Q1 = Jsel * Card_cartesian = Jsel * num_rows(t1) * num_rows(t2)    (1)

Definition: Join selectivity is the ratio of the “pure” (natural) cardinality to the cardinality of the Cartesian product. I call Card_Q1 the “pure” cardinality because it does not involve any filter conditions.

Here Jsel is the join selectivity. This is our main formula. Note that when the optimizer estimates JC (join cardinality) it first calculates Jsel. Therefore we can use the same Jsel to write the corresponding formula for query Q0:

Card_Q0 = Jsel * Card(t1) * Card(t2)    (2)

Here Card(t_i) is the final cardinality after applying the filter predicate to the corresponding table. In other words, Jsel is the same in both formulas (1) and (2), because Jsel does not depend on the filter columns unless the filter conditions include the join columns. According to formula (1):

Jsel = Card_Q1 / (num_rows(t1) * num_rows(t2))    (3)

or

Card_Q0 = Card_Q1 * Card(t1) * Card(t2) / (num_rows(t1) * num_rows(t2))    (4)

Based on this, we have to find out the estimation mechanism for the expected cardinality Card_Q1. Now consider that there is no histogram of any type on the join columns j_i of the tables t_i. In this case the optimizer assumes uniform distribution, and in such situations, as you already know, Jcard and Jsel are calculated as:

Jsel = 1 / max(num_dist(j1), num_dist(j2))
Jcard = Jsel * Card(t1) * Card(t2)    (5)

The question now is: where does formula (5) come from? How do we understand it?

According to (3), in order to calculate Jsel we first have to estimate the “pure” expected cardinality Card_Q1, and it depends only on the join columns. For table t1, based on uniform distribution, the number of rows per distinct value of column j1 will be num_rows(t1)/num_dist(j1), and for table t2 it will be num_rows(t2)/num_dist(j2). Also, there will be min(num_dist(j1), num_dist(j2)) common distinct values. Therefore the expected “pure” cardinality is

Card_Q1 = min(num_dist(j1), num_dist(j2)) * (num_rows(t1)/num_dist(j1)) * (num_rows(t2)/num_dist(j2))    (6)

Jsel = Card_Q1 / (num_rows(t1) * num_rows(t2)) = min(num_dist(j1), num_dist(j2)) / (num_dist(j1) * num_dist(j2)) = 1 / max(num_dist(j1), num_dist(j2))

As can be seen, we have arrived at formula (5). Without a histogram the optimizer is not aware of the data distribution: the data dictionary contains no “(distinct value, frequency)” pairs describing the column distribution. Because of this, in the uniform distribution case the optimizer actually assumes and calculates an “average frequency” as num_rows(t1)/num_dist(j1). Based on this “average frequency” the optimizer calculates the “pure” expected cardinality and then the join selectivity. If a table column has a histogram, the optimizer will (depending on its type) calculate the join selectivity based on the histogram. In that case the “(distinct value, frequency)” pairs are not formed from an “average frequency” but from the information given by the histogram.
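To make formulas (5) and (6) concrete, here is a small Python sketch (not Oracle code, just an illustration of the arithmetic; the function name and example numbers are mine) that computes the expected join cardinality and selectivity from the basic column statistics:

def join_card_no_histogram(num_rows_t1, num_dist_j1, num_rows_t2, num_dist_j2):
    # "average frequency": rows per distinct value of each join column
    rows_per_value_t1 = num_rows_t1 / num_dist_j1
    rows_per_value_t2 = num_rows_t2 / num_dist_j2
    # the optimizer assumes min(NDV1, NDV2) common distinct values
    common_values = min(num_dist_j1, num_dist_j2)
    card_q1 = common_values * rows_per_value_t1 * rows_per_value_t2   # formula (6)
    j_sel = card_q1 / (num_rows_t1 * num_rows_t2)                     # = 1/max(NDV1, NDV2)
    return card_q1, j_sel

# hypothetical example: two 1000-row tables with 10 and 20 distinct join values
print(join_card_no_histogram(1000, 10, 1000, 20))   # (50000.0, 0.05), i.e. 1/max(10, 20)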

Case 1. Join columns with frequency histograms

In this case both join columns have a frequency histogram and our query (freq_freq.sql) is:

SELECT COUNT (*)

FROM t1, t2

WHERE t1.j1 = t2.j2 AND t1.f1 = 13;

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 2272 | 2260 |

|* 3 | TABLE ACCESS FULL| T1 | 1 | 40 | 40 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 1000 | 1000 |

---------------------------------------------------------------

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

3 - filter("T1"."F1"=13)

The estimate is good enough in this situation, but it is not exact. Why? And how did the optimizer calculate the join cardinality as 2272? If we enable SQL trace for the query, we see that Oracle queries only the histgrm$ dictionary table. The information about the columns and tables is as follows.

SELECT table_name, num_rows FROM user_tables WHERE table_name IN ('T1','T2');

table_name   num_rows
T1           1000
T2           1000

(Freq_values1)

SELECT endpoint_value column_value,
       endpoint_number - NVL (prev_endpoint, 0) frequency,
       endpoint_number ep
  FROM (SELECT endpoint_number,
               NVL (LAG (endpoint_number, 1) OVER (ORDER BY endpoint_number), 0) prev_endpoint,
               endpoint_value
          FROM user_tab_histograms
         WHERE table_name = 'T1' AND column_name = 'J1')
ORDER BY endpoint_number;

              t1.j1                          t2.j2
value   frequency   ep          value   frequency   ep

0 40 40 0 100 100

1 40 80 2 40 140

2 80 160 3 120 260

3 100 260 4 20 280

4 160 420 5 40 320

5 60 480 6 100 420

6 260 740 8 40 460

7 80 820 9 20 480

8 120 940 10 20 500

9 60 1000 11 60 560

12 20 580

13 20 600

14 80 680

15 80 760

16 20 780

17 80 860

18 80 940

19 60 1000

Frequency histograms express the column distribution exactly, so the “(column value, frequency)” pairs give us every opportunity to estimate the cardinality of any kind of operation. Now we have to estimate the pure cardinality Card_Q1; then we can find Jsel according to formula (3). First we have to find the common data for the join columns. These data lie between max(min_value(j1), min_value(j2)) and min(max_value(j1), max_value(j2)). It means we are not interested in the rows of column j2 whose value is greater than 9, the maximum of j1. We also have to take only the equal values, so we get the following table:

     tab t1, col j1            tab t2, col j2
value     frequency          value     frequency

0 40 0 100

2 80 2 40

3 100 3 120

4 160 4 20

5 60 5 40

6 260 6 100

8 120 8 40

9 60 9 20

Card_Q1 = 40*100 + 80*40 + 100*120 + 160*20 + 60*40 + 260*100 + 120*40 + 60*20 = 56800, and the join selectivity is

Jsel = Card_Q1 / (num_rows(t1) * num_rows(t2)) = 56800 / (1000 * 1000) = 0.0568

Card_Q0 = Jsel * Card(t1) * Card(t2) = 0.0568 * 40 * 1000 = 2272
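The same calculation can be written as a short Python sketch (an illustration only, using the (value, frequency) pairs listed above; variable names are mine). It reproduces the 2272 estimate:

# Frequency histogram on both join columns: walk the common values and
# sum freq(t1.j1 = v) * freq(t2.j2 = v), as in formula (7).
j1 = {0: 40, 1: 40, 2: 80, 3: 100, 4: 160, 5: 60, 6: 260, 7: 80, 8: 120, 9: 60}
j2 = {0: 100, 2: 40, 3: 120, 4: 20, 5: 40, 6: 100, 8: 40, 9: 20, 10: 20, 11: 60,
      12: 20, 13: 20, 14: 80, 15: 80, 16: 20, 17: 80, 18: 80, 19: 60}

num_rows_t1 = sum(j1.values())   # 1000
num_rows_t2 = sum(j2.values())   # 1000

lo = max(min(j1), min(j2))       # max of the two column minimums
hi = min(max(j1), max(j2))       # min of the two column maximums
card_q1 = sum(j1[v] * j2[v] for v in j1 if v in j2 and lo <= v <= hi)   # 56800

j_sel = card_q1 / (num_rows_t1 * num_rows_t2)          # 0.0568
card_q0 = j_sel * 40 * 1000                            # filtered T1 card * T2 card
print(card_q1, j_sel, card_q0)                         # 56800 0.0568 2272.0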

Also, if we enable the 10053 event, then in the trace file we see the following lines regarding the join selectivity:

Join Card: 2272.000000 = outer (40.000000) * inner (1000.000000) * sel

(0.056800)

Join Card - Rounded: 2272 Computed: 2272.000000

As we see, this is the same number as in the execution plan above. Another question: why did we not get the exact cardinality, 2260? Although join selectivity by definition does not depend on filter columns and conditions, filtering actually influences this process: the optimizer does not consider the join column's value range, min/max values, spread or number of distinct values after applying the filter (line 3 of the execution plan). This is not easy to resolve; at the very least it would require additional estimation algorithms, and the efficiency of the whole estimation process could suffer. If we remove the filter condition from the query above, we get an exact estimate.

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 56800 | 56800 |

| 3 | TABLE ACCESS FULL| T1 | 1 | 1000 | 1000 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 1000 | 1000 |

---------------------------------------------------------------

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

It means the optimizer calculates an “average” join selectivity. I think it is not an issue in general. As a result we get the following formula for the join selectivity:

Jsel = ( SUM over common values v of freq(t1.j1 = v) * freq(t2.j2 = v) ) / ( num_rows(t1) * num_rows(t2) )    (7)

Here the common values v run from max(min(j1), min(j2)) to min(max(j1), max(j2)).

Case 2. Join columns with height-balanced and frequency histograms

Now assume one of the join columns has a height-balanced (HB) histogram and the other has a frequency (FQ) histogram (Height_Balanced_Frequency.sql). We are going to investigate the cardinality estimation of the two queries here:

select count(*)

from t1, t2 --- (Case2 q1)

where t1.j1 = t2.j2;

select count(*)

from t1, t2 --- (Case2 q2)

where t1.j1 = t2.j2 and t1.f1=11;

For the column j1 a height-balanced (HB) histogram is available, and for the column j2 a frequency (FQ) histogram. The corresponding information from the user_tab_histograms dictionary view is shown in Table 3.

        tab t1, col j1                       tab t2, col j2
column value   frequency   ep        column value   frequency   ep

1 0 0 1 2 2

9 1 1 7 2 4

16 1 2 48 3 7

24 1 3 64 4 11

32 1 4

40 1 5

48 2 7

56 1 8

64 2 10

72 2 12

80 3 15

The frequency column for t1.j1 in Table 3 does not express the real frequency of the column values; it is actually the “frequency of the bucket”. First we have to identify the common values, so we have to ignore HB histogram buckets with an endpoint number greater than 10. We have exact “(value, frequency)” pairs for the t2.j2 column, therefore our base source must be the values of the t2.j2 column. But for t1.j1 we do not have exact frequencies. An HB histogram contains buckets which hold approximately the same number of rows. We can also find the number of distinct values per bucket. Then for every value of the frequency histogram we can identify the appropriate bucket of the HB histogram. Within an HB bucket we can also assume uniform distribution, and then we can estimate the size of this disjoint subset – {value of FQ, bucket of HB}.

Although this approach gave me some approximation and estimation of the join cardinality, it did not give me the exact number(s) which the Oracle optimizer calculates and reports in the 10053 trace file. We have to find out what information we need to improve this approach.

Alberto Dell'Era first investigated joins based on histograms in 2007 (Join Over Histograms). His approach was based on grouping values into three major categories:

- “populars matching populars”

- “populars not matching populars”

- “not popular subtables”

and estimating each of them; the sum of the cardinalities of the groups gives the join cardinality. But my point of view on the matter is quite different:

- We have to identify “(distinct value, frequency)” pairs to approximate the “pure” cardinality Card_Q1.

- Our main data here is the t2.j2 column's data, because it gives us exact frequencies.

- We have to walk the t2.j2 column's (histogram) values and identify the second part of each “(distinct value, frequency)” pair based on the height-balanced histogram.

- Card_Q1 = SUM over t2.j2 values v of Freq_t1(j1 = v) * Freq_t2(j2 = v)

- Then we can calculate the join selectivity and cardinality.

So we have to identify (value, frequency) pairs based on the HB histogram; then it is easy to calculate the “pure” cardinality, which means we can easily and more accurately estimate the join cardinality. But when forming the (value, frequency) pairs based on the HB histogram, we should not treat a single value located within a bucket as uniform, because the HB histogram actually gives us an “average” density – NewDensity (the density term was originally introduced to avoid estimation errors in the non-uniform distribution case and has been improved with the new density mechanism) for unpopular values, and a special approach for popular values. So let's identify the “(value, frequency)” pairs based on the HB histogram.

tab_name   num_rows (user_tables)      col_name   num_distinct
T1         130                         T1.J1      30
T2         11                          T2.J2      4

Number of popular buckets – num_pop_buckets = 9 (as sum(frequency) from Table 3 where frequency > 1)

Popular value count – pop_value_cnt = 4 (as count(frequency) from Table 3 where frequency > 1)

NewDensity = num_unpop_buckets / (unpop_ndv * num_buckets) = (num_buckets - num_pop_buckets) / ((NDV - pop_value_cnt) * num_buckets) = (15 - 9) / ((30 - 4) * 15) = 0.015384615 ≈ 0.015385    (8)

For a popular value, the fraction of rows it represents is ep_frequency / num_buckets = ep_frequency / 15.

column value   popular   frequency   calculated as

1 N 2.00005 130*0.015385 - (num_rows*density)

7 N 2.00005 130*0.015385 - (num_rows*density)

48 Y 17.33333333 130*2/15 - (num_rows*frequency/num_buckets)

64 Y 17.33333333 130*2/15 - (num_rows*frequency/num_buckets)

We have got all the “(value, frequency)” pairs, so according to formula (7) we can calculate the join selectivity.

     tab t1, col j1               tab t2, col j2
column value   frequency     column value   frequency     freq*freq

1 2.00005 1 2 4.0001

7 2.00005 7 2 4.0001

48 17.33333333 48 3 52

64 17.33333333 64 4 69.33333

Sum 129.3335

And finally Jsel = 129.3335 / (num_rows(t1) * num_rows(t2)) = 129.3335 / (130 * 11) = 0.090443.

So our “pure” cardinality is Card_Q1 ≈ 129. The execution plan of the query is as follows:

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 129 | 104 |

| 3 | TABLE ACCESS FULL| T2 | 1 | 11 | 11 |

| 4 | TABLE ACCESS FULL| T1 | 1 | 130 | 130 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

Join Card: 129.333333 = outer (130.000000) * inner (11.000000) * sel (0.090443)

Join Card - Rounded: 129 Computed: 129.333333

It means we were able to figure out the exact estimation mechanism in this case. The execution plan of the second query (Case2 q2) is as follows:

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 5 | 7 |

|* 3 | TABLE ACCESS FULL| T1 | 1 | 5 | 5 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 11 | 11 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

3 - filter("T1"."F1"=11)

Card_Q0 = Jsel * Card(t1) * Card(t2) = 0.090443 * Card(t1) * Card(t2)

Also, from the optimizer trace file we see the following:

Join Card: 5.173333 = outer (11.000000) * inner (5.200000) * sel (0.090443)

Join Card - Rounded: 5 Computed: 5.173333

It actually confirms our approach. The execution plan shows the cardinality of the single table t1 as 5, which is correct because it must be rounded up, but during the join estimation process the optimizer uses the original values rather than the rounded ones.
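A minimal Python sketch of the mechanism described above (my reconstruction, not Oracle's code): frequencies for t1.j1 are derived from the height-balanced histogram, using NewDensity for unpopular values and num_rows * bucket_frequency / num_buckets for popular ones, and then multiplied by the exact frequencies of t2.j2:

# Height-balanced (t1.j1) joined to frequency (t2.j2) histogram; reproduces
# the ~129.33 "pure" cardinality and Jsel ~ 0.090443 of the example above.
hb_bucket_freq = {1: 0, 9: 1, 16: 1, 24: 1, 32: 1, 40: 1,
                  48: 2, 56: 1, 64: 2, 72: 2, 80: 3}   # t1.j1 value -> bucket frequency
num_buckets = 15
num_rows_t1, ndv_j1 = 130, 30
fq = {1: 2, 7: 2, 48: 3, 64: 4}                        # t2.j2 value -> exact frequency
num_rows_t2 = sum(fq.values())                         # 11

pop_values = {v for v, f in hb_bucket_freq.items() if f > 1}
num_pop_buckets = sum(f for f in hb_bucket_freq.values() if f > 1)   # 9
new_density = (num_buckets - num_pop_buckets) / ((ndv_j1 - len(pop_values)) * num_buckets)  # formula (8)

def t1_freq(value):
    if value in pop_values:                                   # popular value
        return num_rows_t1 * hb_bucket_freq[value] / num_buckets
    return num_rows_t1 * new_density                          # unpopular value

card_q1 = sum(t1_freq(v) * f for v, f in fq.items())          # ~129.33
j_sel = card_q1 / (num_rows_t1 * num_rows_t2)                 # ~0.090443
print(round(card_q1, 4), round(j_sel, 6))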

Reviewing Alberto Dell'Era's complete formula (join_histogram_complete.sql)

We can list column information from dictionary as below:

     tab t1, col value            tab t2, col value
column value   frequency      column value   frequency

20 1 10 1

40 1 30 2

50 1 50 1

60 1 60 4

70 2 70 2

80 2

90 1

99 1

So we have to find the common values. As you can see, min(t1.value) = 20, so we must ignore t2.value = 10; also max(t2.value) = 70, which means we have to ignore the t1 column values greater than 70. In addition, we do not have the value 40 in t2.value, therefore we have to delete it as well. Because of this we get the following table:

     tab t1, col value            tab t2, col value
column value   frequency      column value   frequency

20 1 30 2

50 1 50 1

60 1 60 4

70 2 70 2

num_rows(t1) = 12; num_buckets(t1.value) = 6; num_distinct(t1.value) = 8, so

NewDensity = num_unpop_buckets / (unpop_ndv * num_buckets) = (6 - 2) / ((8 - 1) * 6) = 0.095238095

and the corresponding column value frequencies based on the HB histogram will be:

t1.value freq calculated

30 1.142857143 num_rows*newdensity

50 1.142857143 num_rows*newdensity

60 1.142857143 num_rows*newdensity

70 4 num_rows*freq/num_buckets

       t1.value                     t2.value
column value   frequency      column value   frequency      freq*freq

30 1.142857143 30 2 2.285714286

50 1.142857143 50 1 1.142857143

60 1.142857143 60 4 4.571428571

70 4 70 2 8

sum 16

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 16 | 13 |

| 3 | TABLE ACCESS FULL| T1 | 1 | 12 | 12 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 14 | 14 |

---------------------------------------------------------------

---------------------------------------------------

2 - access("T1"."VALUE"="T2"."VALUE")

Alberto also introduced “Contribution 4: special cardinality” there, but it seems it is not necessary.

Reviewing Alberto Dell'Era's essential case (join_histogram_essentials.sql)

This is quite an interesting case: firstly, because in Oracle 12c the optimizer calculates the join cardinality as 31 and not as 30, and secondly, because in this case the old and new densities are the same. Let's interpret the case. The corresponding information from user_tab_histograms:

    tab t1, col value                 tab t2, col value
column value   frequency   ep      column value   frequency   ep

10 2 2 10 2 2

20 1 3 20 1 3

30 2 5 50 3 6

40 1 6 60 1 7

50 1 7 70 4 11

60 1 8

70 2 10

And num_rows(t1) = 20, num_rows(t2) = 11, num_dist(t1.value) = 11, num_dist(t2.value) = 5, Density(t1.value) = (10-6)/((11-3)*10) = 0.05. The mechanism described above does not give us exactly the number the optimizer estimates, because in this case, to estimate the frequency of unpopular values, Oracle does not use the density: it uses the number of distinct values per bucket and the number of rows per distinct value instead. To prove this we can use join_histogram_essentials1.sql. In this case the t1 table is the same as in join_histogram_essentials.sql, and the column t2.value has only one value, 20, with frequency one.

t1.value freq EP t2.value freq EP

10 2 2 20 1 1

20 1 3

30 2 5

40 1 6

50 1 7

60 1 8

70 2 10

In this case Oracle computes the join cardinality as 2, rounded up from 1.818182. We can see it in the trace file:

Join Card: 1.818182 = outer (20.000000) * inner (1.000000) * sel (0.090909)

Join Card - Rounded: 2 Computed: 1.818182

num_rows_bucket (the number of rows per bucket) is 20/10 = 2, and the number of distinct values per bucket is 11/10 = 1.1. So every bucket has 1.1 distinct values, and within a bucket every distinct value has 2/1.1 = 1.818182 rows. And this is our cardinality. But what happens if we increase the frequency of the t2 value? In join_histogram_essentials2.sql the pair (t2.value, frequency) = (20, 5) and the t1 table is the same as in the previous case.

10 2 2 20 5 5

20 1 3

30 2 5

40 1 6

50 1 7

60 1 8

70 2 10

Join Card: 5.000000 = outer (20.000000) * inner (5.000000) * sel (0.050000)

Join Card - Rounded: 5 Computed: 5.000000

Tests show that in such cases the cardinality of the join is computed as the frequency of the t2 value. It means the frequency assigned to a non-popular t1.value will be:

Frequency(non-popular t1.value) = num_rows_bucket / num_dist_bucket,   if the frequency of t2.value = 1
Frequency(non-popular t1.value) = 1,                                   if the frequency of t2.value > 1

or, equivalently,

Cardinality = max(frequency of t2.value, number of rows per distinct value within the bucket)

The question is why? In such cases I think the optimizer tries to minimize estimation errors. So:

tab t1, col val
column value   frequency   calculated as

10 4 - (num_rows*frequency/num_buckets)

20 1.818181818 - (num_rows_bucket/num_dist_buckets)

50 1 -frequency of t2.value

60 1.818181818 - (num_rows_bucket/num_dist_buckets)

70 4 - (num_rows*frequency/num_buckets)

Therefore

    tab t1, col value                 tab t2, col value
column value   frequency      column value   frequency      freq*freq

10 4 10 2 8

20 1.818181818 20 1 1.818181818

50 1 50 3 3

60 1.818181818 60 1 1.818181818

70 4 70 4 16

sum 30.63636364

We get 30.64 ≈ 31 as the expected cardinality. Let's look at the trace file and execution plan:

Join Card: 31.000000 = outer (11.000000) * inner (20.000000) * sel (0.140909)

Join Card - Rounded: 31 Computed: 31.000000

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 31 | 29 |

| 3 | TABLE ACCESS FULL| T2 | 1 | 11 | 11 |

| 4 | TABLE ACCESS FULL| T1 | 1 | 20 | 20 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."VALUE"="T2"."VALUE")

Case 3. Join columns with hybrid and frequency histograms

In this case we are going to analyze how the optimizer calculates join selectivity when hybrid and frequency histograms are available on the join columns (hybrid_freq.sql). Note that the query is the same as (Case2 q1). The corresponding information from the dictionary view:

SELECT endpoint_value COLUMN_VALUE,

endpoint_number - NVL (prev_endpoint, 0) frequency,

ENDPOINT_REPEAT_COUNT,

endpoint_number

FROM (SELECT endpoint_number,

ENDPOINT_REPEAT_COUNT,

NVL (LAG (endpoint_number, 1) OVER (ORDER BY

endpoint_number),0)

prev_endpoint,

endpoint_value

FROM user_tab_histograms

WHERE table_name = 'T3' AND column_name = 'J3')

ORDER BY endpoint_number

t1.j1: column value   frequency   endpoint_rep_cnt        t2.j2: column value   frequency

0 6 6 0 3

2 9 7 1 6

4 8 5 2 6

6 8 5 3 8

7 7 7 4 11

9 10 5 5 3

10 6 6 6 3

11 3 3 7 9

12 7 7 8 6

13 4 4 9 5

14 5 5

15 5 5

16 5 5

17 7 7

19 10 5

As can be seen, the common column values are between 0 and 9, so we are not interested in buckets which contain column values greater than or equal to 10. A hybrid histogram gives us more information for estimating single-table and join selectivity than a height-balanced histogram. In particular, the endpoint repeat count column is used by the optimizer to estimate the endpoint values exactly. But how does the optimizer use this information to estimate the join? The principle of building the “(value, frequency)” pairs based on a hybrid histogram is the same as for a height-balanced histogram: it depends on the popularity of the value. If a value is popular then its frequency will be equal to the corresponding endpoint repeat count, otherwise it will be calculated based on the density. If we enable dbms_stats trace when gathering the hybrid histogram, we get the following:

DBMS_STATS:

SELECT SUBSTRB (DUMP (val, 16, 0, 64), 1, 240) ep,

freq,

cdn,

ndv,

(SUM (pop) OVER ()) popcnt,

(SUM (pop * freq) OVER ()) popfreq,

SUBSTRB (DUMP (MAX (val) OVER (), 16, 0, 64), 1, 240) maxval,

SUBSTRB (DUMP (MIN (val) OVER (), 16, 0, 64), 1, 240) minval

FROM (SELECT val,

freq,

(SUM (freq) OVER ()) cdn,

(COUNT ( * ) OVER ()) ndv,

(CASE

WHEN freq > ( (SUM (freq) OVER ()) / 15) THEN 1

ELSE 0

END)

pop

FROM (SELECT /*+ no_parallel(t) no_parallel_index(t) dbms_stats

cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring

xmlindex_sel_idx_tbl no_substrb_pad */

"ID"

val,

COUNT ("ID") freq

FROM "SYS"."T1" t

WHERE "ID" IS NOT NULL

GROUP BY "ID"))

ORDER BY val

DBMS_STATS: > cdn 100, popFreq 28, popCnt 4, bktSize 6.6, bktSzFrc .6

DBMS_STATS: Evaluating hybrid histogram: cht.count 15, mnb 15, ssize 100, min_ssize

2500, appr_ndv TRUE,

ndv 20, selNdv 0, selFreq 0, pct 100, avg_bktsize 7, csr.hreq TRUE, normalize TRUE

The average bucket size is 7. Oracle considers a value as popular when the corresponding endpoint repeat count is greater than or equal to the average bucket size. Also, in our case the density is (cdn - popFreq)/((NDV - popCnt)*cdn) = (100 - 28)/((20 - 4)*100) = 0.045. If we enable the 10053 trace event you can clearly see the column and table statistics. Therefore the “(value, frequency)” pairs will be:

t1.j1 popular frequency calculated

0 N 4.5 density*num_rows

1 N 4.5 density*num_rows

2 Y 7 endpoint_repeat_count

3 N 4.5 density*num_rows

4 N 4.5 density*num_rows

5 N 4.5 density*num_rows

6 N 4.5 density*num_rows

7 Y 7 endpoint_repeat_count

8 N 4.5 density*num_rows

9 N 4.5 density*num_rows

And then the final cardinality:

t1.j1 t2.j2

value frequency value frequency freq*freq

0 4.5 0 3 13.5

1 4.5 1 6 27

2 7 2 6 42

3 4.5 3 8 36

4 4.5 4 11 49.5

5 4.5 5 3 13.5

6 4.5 6 3 13.5

7 7 7 9 63

8 4.5 8 6 27

9 4.5 9 5 22.5

sum 307.5

Join sel 0.05125

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 308 | 293 |

| 3 | TABLE ACCESS FULL| T2 | 1 | 60 | 60 |

| 4 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

Join Card: 307.500000 = outer (60.000000) * inner (100.000000) * sel (0.051250)

Join Card - Rounded: 308 Computed: 307.500000
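A Python sketch of this mechanism (my reconstruction under the assumptions above, not Oracle's code): values whose endpoint repeat count is at least the average bucket size are treated as popular and get their repeat count as frequency; the rest get num_rows * density, where density = (cdn - popFreq)/((NDV - popCnt)*cdn):

# Hybrid (t1.j1) joined to frequency (t2.j2) histogram; reproduces the
# 307.5 "pure" cardinality / 0.05125 selectivity of the example above.
hybrid = {0: (6, 6), 2: (9, 7), 4: (8, 5), 6: (8, 5), 7: (7, 7), 9: (10, 5),
          10: (6, 6), 11: (3, 3), 12: (7, 7), 13: (4, 4), 14: (5, 5),
          15: (5, 5), 16: (5, 5), 17: (7, 7), 19: (10, 5)}   # value -> (bucket freq, repeat count)
num_rows_t1, ndv_j1, num_buckets = 100, 20, 15
fq_j2 = {0: 3, 1: 6, 2: 6, 3: 8, 4: 11, 5: 3, 6: 3, 7: 9, 8: 6, 9: 5}
num_rows_t2 = sum(fq_j2.values())                       # 60

cdn = sum(f for f, _ in hybrid.values())                # 100 (sampled rows)
avg_bucket_size = round(cdn / num_buckets)              # 7
pop = {v: rc for v, (f, rc) in hybrid.items() if rc >= avg_bucket_size}
pop_freq = sum(pop.values())                            # 28 over 4 popular values
density = (cdn - pop_freq) / ((ndv_j1 - len(pop)) * cdn)   # 0.045

def t1_freq(value):
    if value in pop:
        return pop[value]                               # endpoint repeat count
    return num_rows_t1 * density                        # 4.5 for unpopular values

card_q1 = sum(t1_freq(v) * f for v, f in fq_j2.items())
j_sel = card_q1 / (num_rows_t1 * num_rows_t2)
print(card_q1, j_sel)    # 307.5 0.05125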

Case 4. Join columns with top frequency histograms

In this case the join columns have top-frequency histograms (TopFrequency_hist.sql). We are going to use the same query as above, (Case2 q1). The corresponding column information is:

Table Stats::

Table: T2 Alias: T2

#Rows: 201 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00

SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000

Column (#1): J2(NUMBER)

AvgLen: 4 NDV: 21 Nulls: 0 Density: 0.004975 Min: 1.000000 Max: 200.000000

Histogram: Top-Freq #Bkts: 192 UncompBkts: 192 EndPtVals: 12 ActualVal: yes

***********************

Table Stats::

Table: T1 Alias: T1

#Rows: 65 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00

SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000

Column (#1): J1(NUMBER)

AvgLen: 3 NDV: 14 Nulls: 0 Density: 0.015385 Min: 4.000000 Max: 100.000000

Histogram: Top-Freq #Bkts: 56 UncompBkts: 56 EndPtVals: 5 ActualVal: yes

t1.j1 value   freq        t2.j2 value   freq
4 10 1 14

5 16 2 18

6 17 3 18

8 12 4 17

100 1 5 15

6 19

7 19

8 22

9 17

10 18

11 13

200 2

By definition of the top-frequency histogram, we can say that there are two types of buckets. Oracle placed the high-frequency values into their own (appropriate) buckets, and the rest of the values of the table Oracle effectively “placed” into another “bucket”. So we actually have “high frequency” and “low frequency” values. For the “high frequency” values we have exact frequencies, but for the “low frequency” values we can use the uniform distribution assumption. Firstly we have to build the high-frequency pairs based on the common values. Here max(min(t1.j1), min(t2.j2)) = 4 and min(max(t1.j1), max(t2.j2)) = 100. In principle we have to gather the common values which lie between 4 and 100. After identifying the common values, for popular values we use the exact frequency and for non-popular values the new density. Therefore we can create the following table:

common value j2.freq j1.freq freq*freq

4 17 10 170

5 15 16 240

6 19 17 323

7 19 1.000025 19.000475

8 22 12 264

9 17 1.000025 17.000425

10 18 1.000025 18.00045

11 13 1.000025 13.000325

100 1 1.000025 1.000025

sum 1065.0017

For t1 we have 14 - 5 = 9 unpopular distinct values; for the t2 table we have 201 - 192 = 9 unpopular rows and 21 - 12 = 9 unpopular distinct values. The frequency for each unpopular row of t2.j2 is num_rows(t2)*density(t2) = 201*0.004975 = 0.999975, and for t1.j1 it is num_rows(t1)*density(t1) = 65*0.015385 = 1.000025. So the cardinality of each individual unpopular pair is CardIndvPair = unpopular_freq(t1.j1) * unpopular_freq(t2.j2) = 0.999975 * 1.000025 ≈ 1. Test cases show that Oracle considers all low-frequency (unpopular) values during the join when top-frequency histograms are available, which means the cardinality for the “low frequency” values will be

Card(low frequency values) = max(unpopular_rows(t1.j1), unpopular_rows(t2.j2)) * CardIndvPair = 9

Therefore the final cardinality of our join will be Card(high freq values) + Card(low freq values) = 1065 + 9 = 1074. Let's see the execution plan:

-----------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|

-----------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 6 | 4 (0)|

| 1 | SORT AGGREGATE | | 1 | 6 | |

|* 2 | HASH JOIN | | 1074 | 6444 | 4 (0)|

| 3 | TABLE ACCESS FULL| T1 | 65 | 195 | 2 (0)|

| 4 | TABLE ACCESS FULL| T2 | 201 | 603 | 2 (0)|

-----------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

Join Card: 1074.000000 = outer (201.000000) * inner (65.000000) * sel (0.082204)

Join Card - Rounded: 1074 Computed: 1074.000000
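The whole calculation can be sketched in Python (my reconstruction of the behaviour observed above; the handling of the low-frequency part in particular is based on the test cases, not on documentation):

# Top-frequency histograms on both join columns; reproduces the 1074 estimate.
tf_j1 = {4: 10, 5: 16, 6: 17, 8: 12, 100: 1}            # value -> exact frequency
tf_j2 = {1: 14, 2: 18, 3: 18, 4: 17, 5: 15, 6: 19, 7: 19, 8: 22,
         9: 17, 10: 18, 11: 13, 200: 2}
num_rows_t1, ndv_j1, density_j1 = 65, 14, 0.015385
num_rows_t2, ndv_j2, density_j2 = 201, 21, 0.004975

lo = max(min(tf_j1), min(tf_j2))      # 4
hi = min(max(tf_j1), max(tf_j2))      # 100

def freq(tf, value, num_rows, density):
    # exact frequency if the value is in the top-frequency buckets,
    # otherwise num_rows * density (the unpopular "average" frequency)
    return tf.get(value, num_rows * density)

# high-frequency part: walk the common value range covered by the histograms
common = {v for v in (set(tf_j1) | set(tf_j2)) if lo <= v <= hi}
card_high = sum(freq(tf_j1, v, num_rows_t1, density_j1) *
                freq(tf_j2, v, num_rows_t2, density_j2) for v in common)   # ~1065

# low-frequency part: rows not captured by the top-frequency buckets
unpop_rows_t1 = num_rows_t1 - sum(tf_j1.values())      # 65 - 56 = 9
unpop_rows_t2 = num_rows_t2 - sum(tf_j2.values())      # 201 - 192 = 9
card_indv_pair = (num_rows_t1 * density_j1) * (num_rows_t2 * density_j2)   # ~1
card_low = max(unpop_rows_t1, unpop_rows_t2) * card_indv_pair

print(round(card_high + card_low))    # 1074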

Case 5. Join columns with Top frequency and frequency histograms

Now consider that we have tables whose join columns have top-frequency and frequency histograms (TopFrequency_Frequency.sql). The column distributions from the dictionary are as below (Freq_values1):

t1.j1 freq t2.j2 freq

1 3 0 4

2 3 1 7

3 5 2 2

4 5 4 3

5 5

6 4

7 6

8 4

9 5

25 1

In this case there is a frequency histogram on the column t2.j2 and we have the exact common values {1, 2, 4}. But test cases show that the optimizer also considers all the values from the top-frequency histogram which lie between max(min(t1.j1), min(t2.j2)) and min(max(t1.j1), max(t2.j2)). It is quite an interesting case, because we have a frequency histogram here, it should be our main source, and this case should have been handled similarly to case 3.

Table Stats::

Table: T2 Alias: T2

#Rows: 16 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00

SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:

0.000000

Column (#1): J2(NUMBER)

AvgLen: 3 NDV: 4 Nulls: 0 Density: 0.062500 Min: 0.000000 Max: 4.000000

Histogram: Freq #Bkts: 4 UncompBkts: 16 EndPtVals: 4 ActualVal: yes

***********************

Table Stats::

Table: T1 Alias: T1

#Rows: 42 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00

SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:

0.000000

Column (#1): J1(NUMBER)

AvgLen: 3 NDV: 11 Nulls: 0 Density: 0.023810 Min: 1.000000 Max: 25.000000

Histogram: Top-Freq #Bkts: 41 UncompBkts: 41 EndPtVals: 10 ActualVal: yes

Considered values   j1.freq   j2.freq   freq*freq

1 3 7 21

2 3 2 6

3 5 1 5

4 5 3 15

sum 47

Here, for the value 3, j2.freq is calculated as num_rows(t2)*density = 16*0.0625 = 1. And in the 10053 trace file:

Join Card: 47.000000 = outer (16.000000) * inner (42.000000) * sel (0.069940)

Join Card - Rounded: 47 Computed: 47.000000

Now consider the same t1 joined to a table t3 whose column j3 has a frequency histogram:

t1.j1   freq        t3.j3   freq

1 3 0 4

2 3 1 7

3 5 2 2

4 5 4 3

5 5 10 2

6 4

7 6

8 4

9 5

25 1

Table Stats::

Table: T3 Alias: T3

#Rows: 18 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00 SPC:

0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000

Column (#1): J3(NUMBER)

AvgLen: 3 NDV: 5 Nulls: 0 Density: 0.055556 Min: 0.000000 Max: 10.000000

Histogram: Freq #Bkts: 5 UncompBkts: 18 EndPtVals: 5 ActualVal: yes

***********************

Table Stats::

Table: T1 Alias: T1

#Rows: 42 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00 SPC:

0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000

Column (#1): J1(NUMBER)

AvgLen: 3 NDV: 11 Nulls: 0 Density: 0.023810 Min: 1.000000 Max: 25.000000

Histogram: Top-Freq #Bkts: 41 UncompBkts: 41 EndPtVals: 10 ActualVal: yes

Considered value   j1.freq   calculated as   j3.freq   calculated as   freq*freq

1 3 freq 7 freq 21

2 3 freq 2 freq 6

3 5 freq 1.000008 num_rows*density 5.00004

4 5 freq 3 freq 15

5 5 freq 1.000008 num_rows*density 5.00004

6 4 freq 1.000008 num_rows*density 4.000032

7 6 freq 1.000008 num_rows*density 6.000048

8 4 freq 1.000008 num_rows*density 4.000032

9 5 freq 1.000008 num_rows*density 5.00004

10 1.00002 num_rows*density 2 freq 2.00004

sum 73.000272

Join Card: 73.000000 = outer (18.000000) * inner (42.000000) * sel (0.096561)

Join Card - Rounded: 73 Computed: 73.000000

But if we compare the estimated cardinality with the actual values, we see:

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 73 | 42 |

| 3 | TABLE ACCESS FULL| T3 | 1 | 18 | 18 |

| 4 | TABLE ACCESS FULL| T1 | 1 | 42 | 42 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T3"."J3")

As we see, there is a significant difference, 73 vs 42; the estimation error is quite big. That is why we said before that this is quite an interesting case: the optimizer should consider only the values from the frequency histogram, and these values should be the main source of the estimation process, similar to case 3. If we consider and walk only the values of the frequency histogram as the common values, then we get the following table:

common val j1.freq calculated j3.freq calculated freq*freq

1 3 freq 7 freq 21

2 3 freq 2 freq 6

4 5 freq 3 freq 15

10 1.00002 num_rows*density 2 freq 2.00004

sum 44.00004

You can clearly see that such an estimate is very close to the actual number of rows.
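In code form, the approach these test cases suggest (and which I argue the optimizer should follow here) is simply to walk the frequency-histogram values and take the top-frequency value's exact frequency when it is present, otherwise num_rows * density. A sketch of that idea:

# Top-frequency (t1.j1) joined to frequency (t3.j3) histogram: walking only the
# frequency-histogram values, as argued above (reproduces the ~44.0 estimate,
# which is much closer to the actual 42 rows than the optimizer's 73).
tf_j1 = {1: 3, 2: 3, 3: 5, 4: 5, 5: 5, 6: 4, 7: 6, 8: 4, 9: 5, 25: 1}
num_rows_t1, density_j1 = 42, 0.023810
fq_j3 = {0: 4, 1: 7, 2: 2, 4: 3, 10: 2}
num_rows_t3 = sum(fq_j3.values())          # 18

lo = max(min(tf_j1), min(fq_j3))           # 1
hi = min(max(tf_j1), max(fq_j3))           # 10

card_q1 = sum(tf_j1.get(v, num_rows_t1 * density_j1) * f
              for v, f in fq_j3.items() if lo <= v <= hi)
print(round(card_q1, 5))                    # ~44.0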

Case 6. Join columns with hybrid and top frequency histograms

It is quite hard to interpret the case when one of the join columns has a top-frequency histogram (Hybrid_topfreq.sql). For example, here there is a hybrid histogram on t1.j1 and a top-frequency histogram on t2.j2. The column information from the dictionary:

t1.j1 value   freq   ep_rep_cnt        t2.j2 value   freq
1 3 3 1 5

3 10 6 2 3

4 6 6 3 4

6 5 2 4 5

9 8 5 5 4

10 1 1 6 3

11 2 2 7 1

13 5 3 26 1

30 1

SELECT SUBSTRB (DUMP (val, 16, 0, 64), 1, 240) ep,

freq,

cdn,

ndv,

(SUM (pop) OVER ()) popcnt,

(SUM (pop * freq) OVER ()) popfreq,

SUBSTRB (DUMP (MAX (val) OVER (), 16, 0, 64), 1, 240) maxval,

SUBSTRB (DUMP (MIN (val) OVER (), 16, 0, 64), 1, 240) minval

FROM (SELECT val,

freq,

(SUM (freq) OVER ()) cdn,

(COUNT ( * ) OVER ()) ndv,

(CASE WHEN freq > ( (SUM (freq) OVER ()) / 8) THEN 1 ELSE 0 END)

pop

FROM (SELECT /*+ no_parallel(t) no_parallel_index(t) dbms_stats

cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring

xmlindex_sel_idx_tbl no_substrb_pad */

"J1"

val,

COUNT ("J1") freq

FROM "T"."T1" t

WHERE "J1" IS NOT NULL

GROUP BY "J1"))

ORDER BY val

DBMS_STATS: > cdn 40, popFreq 12, popCnt 2, bktSize 5, bktSzFrc 0

DBMS_STATS: Evaluating hybrid histogram: cht.count 8, mnb 8, ssize 40, min_ssize 2500,

appr_ndv TRUE, ndv 13, selNdv 0, selFreq 0, pct 100, avg_bktsize 5, csr.hreq TRUE,

normalize TRUE

The high-frequency common values are located between 1 and 7. Also, we have two popular values for the t1.j1 column: {3, 4}.

Table Stats::

Table: T2 Alias: T2

#Rows: 30 SSZ: 0 LGR: 0 #Blks: 5 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00 SPC

0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0

IMCQuotient: 0.000000

Column (#1): J2(NUMBER)

AvgLen: 3 NDV: 12 Nulls: 0 Density: 0.033333 Min:

1.000000 Max: 30.000000

Histogram: Top-Freq #Bkts: 27 UncompBkts: 27

EndPtVals: 9 ActualVal: yes

***********************

Table Stats::

Table: T1 Alias: T1

#Rows: 40 SSZ: 0 LGR: 0 #Blks: 5 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00 SPC

0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0

IMCQuotient: 0.000000

Column (#1): J1(NUMBER)

AvgLen: 3 NDV: 13 Nulls: 0 Density: 0.063636 Min:

1.000000 Max: 13.000000

Histogram: Hybrid #Bkts: 8 UncompBkts: 40

EndPtVals: 8 ActualVal: yes

value   t1.j1 freq   t2.j2 freq   freq*freq
1 2.54544 5 12.7272

2 2.54544 3 7.63632

3 6 4 24

4 6 5 30

5 2.54544 4 10.18176

6 2.54544 3 7.63632

7 2.54544 1 2.54544

sum 94.72704

top_freq_count = 12 - 9 = 3 unpopular NDV. I have done several test cases and I think the cardinality of the join in this case consists of two parts: high-frequency values and low-frequency (unpopular) values. In different cases the estimation of the cardinality for the low-frequency values came out differently for me. In the current case I think it is based on the uniform distribution. It means that for t1.j1 the “average frequency” is num_rows(t1)/NDV(j1) = 40/13 = 3.076923. Also, we have 3 unpopular (low-frequency) rows and 3 unpopular NDV. For each “low frequency” value we have a frequency of num_rows(t1)*density(j1) = 2.54544 ≈ 3, and we have 3 low-frequency (unpopular) rows, therefore the unpopular cardinality is 3*3 = 9 and the final cardinality will be

Card(popular rows) + Card(unpopular rows) = 94.72704 + 9 = 103.72704.

Lines from 10053 trace file

Join Card: 103.727273 = outer (30.000000) * inner (40.000000) * sel (0.086439)

Join Card - Rounded: 104 Computed: 103.727273

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 104 | 101 |

| 3 | TABLE ACCESS FULL| T2 | 1 | 30 | 30 |

| 4 | TABLE ACCESS FULL| T1 | 1 | 40 | 40 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

The above test case was quite simple, because the popular values of the hybrid histogram are also located within the range of the high-frequency values of the top-frequency histogram: the popular values {3, 4} of the hybrid histogram actually fall inside the 1-7 range of the top-frequency histogram. Let's see another example.

CREATE TABLE t1(j1 NUMBER);

INSERT INTO t1 VALUES(6);

INSERT INTO t1 VALUES(2);

INSERT INTO t1 VALUES(7);

INSERT INTO t1 VALUES(8);

INSERT INTO t1 VALUES(7);

INSERT INTO t1 VALUES(1);

INSERT INTO t1 VALUES(3);

INSERT INTO t1 VALUES(6);

INSERT INTO t1 VALUES(4);

INSERT INTO t1 VALUES(7);

INSERT INTO t1 VALUES(2);

INSERT INTO t1 VALUES(3);

INSERT INTO t1 VALUES(7);

INSERT INTO t1 VALUES(9);

INSERT INTO t1 VALUES(5);

INSERT INTO t1 VALUES(6);

INSERT INTO t1 VALUES(17);

INSERT INTO t1 VALUES(18);

INSERT INTO t1 VALUES(19);

INSERT INTO t1 VALUES(20);

COMMIT;

/* execute dbms_stats.set_global_prefs('trace',to_char(512+128+2048+32768+4+8+16)); */

execute dbms_stats.gather_table_stats(null,'t1',method_opt=>'for all columns size 8');

/*exec dbms_stats.set_global_prefs('TRACE', null);*/

CREATE TABLE t2(j2 number);

INSERT INTO t2 VALUES(1);

INSERT INTO t2 VALUES(1);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(3);

INSERT INTO t2 VALUES(3);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(1);

INSERT INTO t2 VALUES(3);

INSERT INTO t2 VALUES(4);

INSERT INTO t2 VALUES(2);

INSERT INTO t2 VALUES(3);

INSERT INTO t2 VALUES(2);

INSERT INTO t2 VALUES(17);

INSERT INTO t2 VALUES(18);

INSERT INTO t2 VALUES(19);

INSERT INTO t2 VALUES(20);

COMMIT;

execute dbms_stats.gather_table_stats(null,'t2',method_opt=>'for all columns size 4');

ALTER SESSION SET EVENTS '10053 trace name context forever';

EXPLAIN PLAN

FOR

SELECT COUNT ( * )

FROM t1, t2

WHERE t1.j1 = t2.j2;

SELECT *

FROM table (DBMS_XPLAN.display);

t1.j1 freq ep_rep ep_num t2.j2 freq ep_num

1 1 1 1 1 3 3

2 2 2 3 3 4 7

4 3 1 6 4 7 14

6 4 3 10 20 1 15

7 4 4 14

17 3 1 17

18 1 1 18

20 2 1 20

DBMS_STATS: > cdn 20, popFreq 7, popCnt 2, bktSize 2.4, bktSzFrc .4

DBMS_STATS: Evaluating hybrid histogram: cht.count 8, mnb 8, ssize 20,

min_ssize 2500, appr_ndv TRUE, ndv 13, selNdv 0, selFreq 0, pct 100, avg_bktsize

3, csr.hreq TRUE, normalize TRUE

DBMS_STATS: Histogram gathering flags: 527

DBMS_STATS: Accepting histogram

DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE

So our average bucket size is 3 and we have 2 popular values: {6, 7}. These values are not part of the high-frequency values in the top-frequency histogram. The table and column statistics from the optimizer trace file:

Table Stats::

Table: T2 Alias: T2

#Rows: 20 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00

SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:

0.000000

Column (#1): J2(NUMBER)

AvgLen: 3 NDV: 8 Nulls: 0 Density: 0.062500 Min: 1.000000 Max: 20.000000

Histogram: Top-Freq #Bkts: 15 UncompBkts: 15 EndPtVals: 4 ActualVal: yes

***********************

Table Stats::

Table: T1 Alias: T1

#Rows: 20 SSZ: 0 LGR: 0 #Blks: 1 AvgRowLen: 3.00 NEB: 0 ChainCnt: 0.00

SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient:

0.000000

Column (#1): J1(NUMBER)

AvgLen: 3 NDV: 13 Nulls: 0 Density: 0.059091 Min: 1.000000 Max: 20.000000

Histogram: Hybrid #Bkts: 8 UncompBkts: 20 EndPtVals: 8 ActualVal: yes

---Join Cardinality

SPD: Return code in qosdDSDirSetup: NOCTX, estType = JOIN

Join Card: 31.477273 = outer (20.000000) * inner (20.000000) * sel (0.078693)

Join Card - Rounded: 31 Computed: 31.477273

value   t2.j2 freq   t1.j1 freq   freq*freq
1 3 1.18182 3.54546

3 4 1.18182 4.72728

4 7 1.18182 8.27274

20 1 1.18182 1.18182

sum 17.7273

So our cardinality for the high-frequency values is 17.7273. We also have num_rows(t2) - popular_rows(t2) = 20 - 15 = 5 unpopular rows. But as you see, Oracle computed the final cardinality as 31. In my opinion the popular rows of the hybrid histogram play a role here. Test cases show that in such situations the optimizer also tries to take advantage of the popular values. In our case the values 6 and 7 are popular and the popular frequency is 7 (the sum of the popular frequencies). If we try to find the frequencies of these values based on the top-frequency histogram, then we have to use the density, so the cardinality for the popular values will be

popular frequency * num_rows(t1) * density(j2) = 7 * 20 * 0.0625 = 8.75

Moreover, for every “low frequency” value we have a frequency of 1.18182 ≈ 1, and we have 5 “low frequency” values (the unpopular rows of the j2 column), therefore the cardinality for the “low frequency” part can be considered as 5. Eventually we can figure out the final cardinality:

CARD = CARD(high frequency values) + CARD(low frequency values) + CARD(unpopular rows) = 17.7273 + 8.75 + 5 = 31.4773
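A short Python sketch of this decomposition (a hypothesis drawn from this test case, not a documented formula; the variable names are mine):

# Hybrid (t1.j1) joined to top-frequency (t2.j2) histogram; reproduces ~31.48.
tf_j2 = {1: 3, 3: 4, 4: 7, 20: 1}             # t2.j2 top-frequency values
num_rows_t1, density_j1 = 20, 0.059091        # t1.j1 hybrid histogram stats
num_rows_t2, density_j2 = 20, 0.0625

# every common t2 value is unpopular in t1's hybrid histogram,
# so its t1-side frequency is num_rows(t1) * density(j1) = 1.18182
card_high = sum(f * num_rows_t1 * density_j1 for f in tf_j2.values())   # 17.7273
# popular hybrid frequency (values 6 and 7, sum 7) priced via the t2 density
card_pop = 7 * num_rows_t1 * density_j2                                 # 8.75
# 20 - 15 t2 rows not covered by the top-frequency buckets, frequency ~1 each
card_unpop = 5

print(round(card_high + card_pop + card_unpop, 4))   # ~31.48 (optimizer: 31.477273)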

And the execution plan:

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 31 | 26 |

| 3 | TABLE ACCESS FULL| T1 | 1 | 20 | 20 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 20 | 20 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."J1"="T2"."J2")

So this is the expected cardinality. But in general there can be estimation or approximation errors related to rounding.

Sampling Based Estimation

As we know, in Oracle Database 12c a new dynamic sampling feature has been introduced. Dynamic sampling level 11 is designed for operations like single table access, group by and join, for which Oracle automatically defines the sample size and tries to estimate the cardinality of the operation. Let's look at the following example and try to understand the sampling mechanism in join size estimation.

CREATE TABLE t1

AS SELECT * FROM dba_users;

CREATE TABLE t2

AS SELECT * FROM dba_objects;

EXECUTE dbms_stats.gather_table_stats(user,'t2',method_opt=>'for all columns size 1');

SELECT COUNT (*)
FROM t1, t2
WHERE t1.username = t2.owner;

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 92019 | 54942 |

| 3 | TABLE ACCESS FULL| T1 | 1 | 42 | 42 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 92019 | 92019 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."USERNAME"="T2"."OWNER")

---------------------------------------------------------------

| Id | Operation | Name | Starts | E-Rows | A-Rows |

---------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | | 1 |

| 1 | SORT AGGREGATE | | 1 | 1 | 1 |

|* 2 | HASH JOIN | | 1 | 58728 | 54942 |

| 3 | TABLE ACCESS FULL| T1 | 1 | 42 | 42 |

| 4 | TABLE ACCESS FULL| T2 | 1 | 92019 | 92019 |

---------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("T1"."USERNAME"="T2"."OWNER")

Note

-----

- dynamic statistics used: dynamic sampling (level=AUTO)

As we can see, without a histogram there is a significant difference between the actual and estimated rows, but when automatic (adaptive) sampling is enabled the estimate is good enough. The question is: how did the optimizer actually get the cardinality 58728? How did it calculate it? To explain it we can use the 10046 and 10053 trace events. In the SQL trace file we see the following lines:

SQL ID: 1bgh7fk6kqxg7

Plan Hash: 3696410285

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring

optimizer_features_enable(default) no_parallel result_cache(snapshot=3600)

*/ SUM(C1)

FROM

(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2#0") */ 1 AS C1 FROM

"T2" SAMPLE BLOCK(51.8135, 8) SEED(1) "T2#0", "T1" "T1#1" WHERE

("T1#1"."USERNAME"="T2#0"."OWNER")) innerQuery

------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse 1 0.00 0.00 0 0 0 0

Execute 1 0.00 0.00 0 0 0 0

Fetch 1 0.06 0.05 0 879 0 1

------- ------ -------- ---------- ---------- ---------- ---------- ----------

total 3 0.06 0.05 0 879 0 1

Optimizer mode: CHOOSE

Parsing user id: SYS (recursive depth: 1)

------- ---------------------------------------------------

1 SORT AGGREGATE (cr=879 pr=0 pw=0 time=51540 us)

30429 HASH JOIN (cr=879 pr=0 pw=0 time=58582 us cost=220 size=1287306 card=47678)

42 TABLE ACCESS FULL T1 (cr=3 pr=0 pw=0 time=203 us cost=2 size=378 card=42)

51770 TABLE ACCESS SAMPLE T2 (cr=876 pr=0 pw=0 time=35978 us cost=218 size=858204

card=47678)

During parsing Oracle executed this SQL statement and the result was used to estimate the size of the join. The SQL statement used sampling (in an undocumented format) and actually read about 50 percent of the T2 table blocks. Sampling was not applied to the T1 table because its size is quite small compared to the second table, and 100% sampling of the T1 table does not consume a lot of time during parsing. It means Oracle first identifies the appropriate sample size based on the table size and then executes the specific SQL statement. So we got 30429 rows from a 51.8135 percent sample, therefore our estimated cardinality is 30429/51.8135*100 = 58727.94 ≈ 58728. Now let's check the optimizer trace file:

SPD: Return code in qosdDSDirSetup: NOCTX, estType = JOIN

Join Card: 92019.000000 = outer (42.000000) * inner (92019.000000) * sel (0.023810)

>> Join Card adjusted from 92019.000000 to 58727.970000 due to adaptive dynamic sampling,

prelen=2

Adjusted Join Cards: adjRatio=0.638216 cardHjSmj=58727.970000 cardHjSmjNPF=58727.970000

cardNlj=58727.970000 cardNSQ=58727.970000 cardNSQ_na=92019.000000

Join Card - Rounded: 58728 Computed: 58727.970000
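The scale-up arithmetic itself is trivial; a tiny Python illustration (sample values taken from the trace above):

# Adaptive dynamic sampling: scale the row count returned by the sampled join
# back up by the block-sampling percentage used in the recursive DS_SVC query.
sample_rows = 30429          # rows returned by the sampling query
sample_pct = 51.8135         # SAMPLE BLOCK percentage applied to T2
estimated_join_card = sample_rows / sample_pct * 100
print(round(estimated_join_card, 2))   # 58727.94 -> rounded to 58728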

Let's see what happens if we increase the sizes of both tables using repeated

insert into t select * from t

statements:

table name   blocks   num rows   size (MB)
T1           3186     172032     25
T2           6158     368076     49

In this case Oracle completely ignores adaptive sampling and uses the uniform distribution to estimate the join size.

Table Stats::

Table: T2 Alias: T2

#Rows: 368076 SSZ: 0 LGR: 0 #Blks: 6158 AvgRowLen: 115.00 NEB: 0 ChainCnt:

0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000

Column (#1): OWNER(VARCHAR2)

AvgLen: 6 NDV: 31 Nulls: 0 Density: 0.032258

***********************

Table Stats::

Table: T1 Alias: T1

#Rows: 172032 SSZ: 0 LGR: 0 #Blks: 3186 AvgRowLen: 127.00 NEB: 0 ChainCnt:

0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1

#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000

Column (#1): USERNAME(VARCHAR2)

AvgLen: 9 NDV: 42 Nulls: 0 Density: 0.023810


SQL ID: 0ck072zj5gf73

Plan Hash: 3774486692

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring

optimizer_features_enable(default) no_parallel result_cache(snapshot=3600)

*/ SUM(C1)

FROM

(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2#0") */ 1 AS C1 FROM

"T2" SAMPLE BLOCK(12.9912, 8) SEED(1) "T2#0", "T1" "T1#1" WHERE

("T1#1"."USERNAME"="T2#0"."OWNER")) innerQuery

call count cpu elapsed disk query current rows

------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse 1 0.00 0.00 0 2 0 0

Execute 1 0.00 0.00 0 0 0 0

Fetch 1 1.70 1.91 0 885 0 0

------- ------ -------- ---------- ---------- ---------- ---------- ----------

total 3 1.70 1.91 0 887 0 0

Optimizer mode: CHOOSE

Parsing user id: SYS (recursive depth: 1)

------- ---------------------------------------------------

0 SORT AGGREGATE (cr=0 pr=0 pw=0 time=36 us)

4049738 HASH JOIN (cr=885 pr=0 pw=0 time=2696440 us cost=1835 size=5288231772

card=195860436)

44649 TABLE ACCESS SAMPLE T2 (cr=761 pr=0 pw=0 time=28434 us cost=218 size=860706

card=47817)

6468 TABLE ACCESS FULL T1 (cr=124 pr=0 pw=0 time=28902 us cost=866 size=1548288

card=172032)

It is obvious that Oracle stopped execution of this SQL during parsing; we can see it from the rows column of the execution statistics and also from the row source statistics. Oracle did not complete the HASH JOIN operation in this SQL; we can confirm that with the result of the above SQL and the row source statistics. The sizes of the tables are not actually big, so why did the optimizer ignore the sample result and decide to continue with the previous approach? In my opinion the reason is that, although the sample size is not small, in our case the sampling SQL took quite a long time during parsing (about 1.9 seconds of elapsed time), therefore Oracle stopped it. I have added one filter predicate to the query:

SELECT COUNT (*)

FROM t1, t2

WHERE t1.username = t2.owner AND t2.object_type = 'TABLE';

SQL ID: 8pu5v8h0ghy1z

Plan Hash: 3252009800

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring

optimizer_features_enable(default) no_parallel result_cache(snapshot=3600)

*/ SUM(C1)

FROM

(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2") */ 1 AS C1 FROM "T2"

SAMPLE BLOCK(12.9912, 8) SEED(1) "T2" WHERE ("T2"."OBJECT_TYPE"='TABLE'))

innerQuery

------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse 1 0.00 0.00 0 2 0 0

Execute 1 0.00 0.00 0 0 0 0

Fetch 1 0.01 0.01 0 761 0 1

------- ------ -------- ---------- ---------- ---------- ---------- ----------

total 3 0.01 0.01 0 763 0 1

Optimizer mode: CHOOSE

Parsing user id: SYS (recursive depth: 1)

------- ---------------------------------------------------

1 SORT AGGREGATE (cr=761 pr=0 pw=0 time=14864 us)

756 TABLE ACCESS SAMPLE T2 (cr=761 pr=0 pw=0 time=5969 us cost=219 size=21378

card=1018)

********************************************************************************

Plan Hash: 3525519047

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
  optimizer_features_enable(default) no_parallel result_cache(snapshot=3600)
  OPT_ESTIMATE(@"innerQuery", TABLE, "T2#0", ROWS=5819.31) */ SUM(C1)
FROM
  (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T1#1") */ 1 AS C1 FROM
   "T1" SAMPLE BLOCK(25.1099, 8) SEED(1) "T1#1", "T2" "T2#0" WHERE
   ("T2#0"."OBJECT_TYPE"='TABLE') AND ("T1#1"."USERNAME"="T2#0"."OWNER"))
  innerQuery

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.01       0.00          0          2          0          0
Execute      1      0.00       0.00          0          0          0          0
Fetch        1      0.74       1.02          0       6283          0          0
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total        3      0.76       1.02          0       6285          0          0

Optimizer mode: CHOOSE
Parsing user id: SYS (recursive depth: 1)

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  SORT AGGREGATE (cr=0 pr=0 pw=0 time=32 us)
1412128  HASH JOIN (cr=6283 pr=0 pw=0 time=1243755 us cost=1908 size=215466084 card=5985169)
   9880  TABLE ACCESS FULL T2 (cr=6167 pr=0 pw=0 time=20665 us cost=1674 size=87285 card=5819)
   6035  TABLE ACCESS SAMPLE T1 (cr=116 pr=0 pw=0 time=6069 us cost=218 size=907137 card=43197)

This means that Oracle first tried to estimate the size of the T2 table, because T2 has a filter predicate and the optimizer expects ADS to be particularly efficient there. If we had added a predicate such as t2.owner = 'HR', then the optimizer would also have tried to estimate the cardinality of the T1 table. However, the principle of estimating a subset of the join and then estimating the whole join was effectively abandoned in this case; only the single-table cardinality of T2 was estimated by sampling. We can easily see this in the trace file:

BASE STATISTICAL INFORMATION
***********************
Table Stats::
Table: T2 Alias: T2
#Rows: 368144 SSZ: 0 LGR: 0 #Blks: 6158 AvgRowLen: 115.00 NEB: 0 ChainCnt: 0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000
Column (#1): OWNER(VARCHAR2)
AvgLen: 6 NDV: 31 Nulls: 0 Density: 0.032258
***********************
Table Stats::
Table: T1 Alias: T1
#Rows: 172032 SSZ: 0 LGR: 0 #Blks: 3186 AvgRowLen: 127.00 NEB: 0 ChainCnt: 0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
#IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000
Column (#1): USERNAME(VARCHAR2)
AvgLen: 9 NDV: 42 Nulls: 0 Density: 0.023810
Single Table Cardinality Estimation for T1[T1]
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
** Performing dynamic sampling initial checks. **
** Not using old style dynamic sampling since ADS is enabled.
Table: T1 Alias: T1
Card: Original: 172032.000000 Rounded: 172032 Computed: 172032.000000 Non Adjusted: 172032.000000
Scan IO Cost (Disk) = 865.000000
Scan CPU Cost (Disk) = 48493707.840000
Total Scan IO Cost = 865.000000 (scan (Disk))
                   = 865.000000
Total Scan CPU Cost = 48493707.840000 (scan (Disk))
                    = 48493707.840000
Access Path: TableScan
  Cost: 866.262422 Resp: 866.262422 Degree: 0
  Cost_io: 865.000000 Cost_cpu: 48493708
  Resp_io: 865.000000 Resp_cpu: 48493708
Best:: AccessPath: TableScan
  Cost: 866.262422 Degree: 1 Resp: 866.262422 Card: 172032.000000 Bytes: 0.000000
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for T2[T2]
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
** Performing dynamic sampling initial checks. **
** Not using old style dynamic sampling since ADS is enabled.
Column (#6): OBJECT_TYPE(VARCHAR2)
AvgLen: 9 NDV: 47 Nulls: 0 Density: 0.021277
Table: T2 Alias: T2
Card: Original: 368144.000000 >> Single Tab Card adjusted from 7832.851064 to 5819.310000 due to adaptive dynamic sampling
Rounded: 5819 Computed: 5819.310000 Non Adjusted: 7832.851064
resc: 5031394.059186 resc_io: 5022741.000000 resc_cpu: 332391893348
resp: 5031394.059186 resp_io: 5022741.000000 resc_cpu: 332391893348
SPD: Return code in qosdDSDirSetup: NOCTX, estType = JOIN
Join Card: 23835893.760000 = outer (5819.310000) * inner (172032.000000) * sel (0.023810)
Join Card - Rounded: 23835894 Computed: 23835893.760000
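As a worked check (using only the numbers in this trace, plus the standard no-histogram join selectivity of 1/max(NDV) over the join columns), the figures tie together as follows:

T2 single-table card without ADS = 368144 * Density(OBJECT_TYPE) = 368144 * 1/47 = 7832.851064
T2 single-table card with ADS    = 5819.31  (the ROWS value fed back through the OPT_ESTIMATE hint)
Join selectivity                 = 1 / max(NDV(T1.USERNAME), NDV(T2.OWNER)) = 1 / max(42, 31) = 1/42 = 0.023810 (rounded)
Join cardinality                 = 5819.31 * 172032 * 1/42 = 23835893.76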

table name      blocks      row nums    size (MB)
T1              101950       5505024          800
T2              196807      11780608         1600

With tables of this size, Oracle completely ignored ADS and used the dictionary statistics to estimate the table cardinalities and the join cardinality.
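If there is any doubt about whether adaptive dynamic sampling ran at all, one hedged way to check is the result cache: the /* DS_SVC */ statements are hinted with result_cache(snapshot=3600), so when they do run their results normally appear in V$RESULT_CACHE_OBJECTS. The following is a generic sketch, not output from the test above:

-- Sketch: look for cached adaptive dynamic sampling results.
-- V$RESULT_CACHE_OBJECTS and its ID, TYPE, STATUS, NAME columns are standard;
-- the LIKE pattern simply matches the /* DS_SVC */ comment used by the ADS queries.
SELECT id, type, status, name
FROM   v$result_cache_objects
WHERE  name LIKE '%DS_SVC%';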

Summary

This paper has explained the mechanism the Oracle optimizer uses to calculate join selectivity and cardinality. We learned that the optimizer first calculates join selectivity based on the "pure" cardinality. To estimate the "pure" cardinality, the optimizer identifies (distinct value, frequency) pairs for each column, based on the column distribution, and the column distribution is described by the histogram. A frequency histogram gives us the complete data distribution of the column, and a top frequency histogram gives us enough information about the high-frequency values, while for the less significant values we can fall back on the assumption of a uniform distribution. Moreover, if hybrid histograms exist for the join columns in the dictionary, the optimizer can use the endpoint repeat counts to derive frequencies. In addition, the optimizer has the option of estimating join cardinality via sampling, although that process is constrained by a time restriction and by the size of the tables; as a result, the optimizer can completely ignore adaptive dynamic sampling.
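Whichever path the optimizer ends up taking, the quality of the resulting estimate is easy to verify by comparing estimated and actual row counts. The following is a generic sketch using the standard GATHER_PLAN_STATISTICS hint and DBMS_XPLAN.DISPLAY_CURSOR; it is not tied to a particular test from this paper.

-- Sketch: run the join once with run-time statistics collection enabled ...
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*)
FROM   t1, t2
WHERE  t1.username = t2.owner;

-- ... then compare E-Rows (optimizer estimate) with A-Rows (actual rows)
-- for the last execution of the cursor.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));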

References

• Lewis, Jonathan. Cost-Based Oracle Fundamentals. Apress, 2006.
• Dell'Era, Alberto. Join Over Histograms. 2007.
  http://www.adellera.it/investigations/join_over_histograms/JoinOverHistograms.pdf
• Aliyev, Chinar. Automatic Sampling in Oracle 12c. 2014.
  https://www.toadworld.com/platforms/oracle/w/wiki/11036.automaticadaptive-dynamic-sampling-in-oracle-12c-part-2
  https://www.toadworld.com/platforms/oracle/w/wiki/11052.automaticadaptive-dynamic-sampling-in-oracle-12c-part-3
