4. File organization
Maurizio Lenzerini
Dipartimento di Informatica e Sistemistica “Antonio Ruberti”
Università di Roma “La Sapienza”
http://www.dis.uniroma1.it/~lenzerin/index.html/?q=node/53
[Figure: DBMS architecture — the SQL engine sits on top of the disk manager, which accesses the data on disk.]
[Figure: conceptual schema relating the main notions — a relation is stored in a DB file, a DB file is formed by records, and each record is contained in a page.]
[Figure: two slotted-page layouts for fixed-length records — packed: slots 1 … N filled contiguously, followed by free space; unpacked: slots 1 … M with a bitmap recording which slots are occupied.]
• Packed organization: moving a record changes its rid, which is a problem
when records are referred to by other pages
• Unpacked organization: identifying a record requires scanning the bit array
to check whether each slot is free (0) or occupied (1)
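A minimal Python sketch of the unpacked organization (class name and slot count are invented for illustration): a bitmap records which slots are occupied, so deleting a record never moves another one, and rids held by other pages stay valid.

```python
# Python sketch of the unpacked organization (class and slot count invented):
# a bitmap records which slots are occupied, so deleting a record never moves
# another one, and rids held by other pages stay valid.
class UnpackedPage:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # fixed-length record slots
        self.bitmap = [0] * num_slots     # 1 = occupied, 0 = free

    def insert(self, record):
        # finding a free slot requires scanning the bit array
        for i, bit in enumerate(self.bitmap):
            if bit == 0:
                self.slots[i] = record
                self.bitmap[i] = 1
                return i                  # the slot number is part of the rid
        return None                       # page full

    def delete(self, slot):
        self.bitmap[slot] = 0             # clear the bit; no record is moved
        self.slots[slot] = None
```

In the packed organization, by contrast, the hole left by a deletion is filled by moving the last record, which is exactly what changes that record's rid.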
Maurizio Lenzerini Data Management File organization - 7
Page with variable length records
[Figure: page i with variable-length records — records with rid = (i,1), (i,2), …, (i,N); a slot directory at the end of the page stores, for each of the N slots, the offset of the corresponding record (e.g., 20, 16, 24), plus the number of slots and a pointer to the start of the free space.]
[Figure: two formats for a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4 — (1) fields separated by special symbols, preceded by the number of fields; (2) an array of offsets pointing to the start of each field.]
Alternative 2 allows direct access to the fields, and
efficient storage of nulls (two consecutive pointers that
are equal correspond to a null)
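A small Python sketch of Alternative 2 (an assumed encoding, using strings for simplicity): the record carries an array of offsets, fields are accessed directly, and a null is represented by two consecutive equal offsets.

```python
# Sketch of Alternative 2 (assumed encoding, strings for simplicity): the
# record is a body plus an array of offsets, one per field boundary.
def encode(fields):
    # fields: list of strings; None represents a null value
    body, offsets = "", [0]
    for f in fields:
        if f is not None:
            body += f
        offsets.append(len(body))         # equal consecutive offsets <=> null
    return offsets, body

def field(record, i):
    offsets, body = record
    start, end = offsets[i], offsets[i + 1]
    if start == end:
        return None                       # two equal pointers: a null field
    return body[start:end]                # direct access: no scanning needed
```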
4. Access file manager
[Figure: taxonomy of file organizations — simple (heap file, sorted file), indexed (tree-index, hash-index), and hashed.]
[Figure: a directory whose entries point to the data pages (Page 1 … Page N).]
• The directory is a list of pages, where each entry contains a pointer to a data
page
• Each entry in the directory refers to a page, and tells whether the page has
free space (for example, through a counter)
• Searching for a page with free space is more efficient, because the number of
page accesses is linear in the size of the directory, rather than in the size of
the relation (as in the case of the list-based representation)
File with sorted pages
[Figure: hashed file — the hash function h is applied to the search key k, and h(k) mod N gives the bucket address in the range 0 … N−1.]
Equality selection:
B(D + RC)
→ the data record can be absent, or many data records can
satisfy the condition
Range selection:
B(D + RC)
N.B. We have ignored the cost of locating and managing the pages
with free space
Hashed file (row of the cost table): Scan = 1.25 BD; Equality selection = D (1 + number of relevant records); Range selection = 1.25 BD; Insertion = 2D; Deletion = Cost of search + D
In the above table, we have only considered the cost of I/O operations
3. Clustered/unclustered
4. Dense/sparse
5. Primary/secondary
Note that if we use more than one index on the same data file, at most one
of them will use technique 1.
[Figure: unclustered vs clustered index — in a clustered index the order of the data records in the file is the same as (or close to) the order of the index entries; in an unclustered index the two orders are unrelated.]
• Let us call duplicates two data entries with the same value of
the search key.
→ A primary index cannot contain duplicates
→ Typically, a secondary index contains duplicates
→ A secondary index is called unique if its search key
contains a (non-primary) key. Note that a unique
secondary index does not contain duplicates.
[Figure: indexes on a file of records (Azzurri, 20, 4000), (Bianchi, 30, 3000), (Neri, 40, 6003), (Verdi, 22, 6003) ordered by name — an index with search key <age, sal> (entries 20,4000 · 22,6003 · 30,3000 · 40,6003), one with search key <age> (entries 20 · 22 · 30 · 40), one with search key <sal, age> (entries 3000,30 · 4000,20 · 6003,22 · 6003,40), and one with search key <sal> (entries 3000 · 4000 · 6003 · 6003).]
[Figure: taxonomy — among index-based organizations, hash indexes (static hashing coincides with the hashed file; dynamic variants: extendible hashing and linear hashing) and tree indexes.]
Tree index: basic ideas
• Find all students with avg > 27
– If students are ordered based on avg, we can search through
binary search
– However, the cost of binary search may become high for large
files
• Simple idea: create an auxiliary structure (index) on the ordered file
[Figure: index file with entries k1, k2, …, kN, one per page of the sorted data file.]
The binary search can now be carried out on a smaller file (index
entries are smaller than data records, and there are fewer of them --
one index entry per page in the data file)
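A sketch of the resulting search, in Python (hypothetical names: `index_keys` is an in-memory list holding, for each data page, the smallest search-key value on that page, so it is sorted like the data file).

```python
import bisect

# Binary search over the index file instead of the data file. index_keys is
# a hypothetical in-memory list holding, for each data page, the smallest
# search-key value on that page (so it is sorted, like the data file).
def find_page(index_keys, k):
    pos = bisect.bisect_right(index_keys, k) - 1
    return max(pos, 0)                    # number of the page that may hold k
```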
Tree index: general characteristics
• If we iterate the process behind the simple idea described
above (until the auxiliary structure produced fits in one page),
we obtain a tree index
• In a tree index, the data entries are organized according to a
tree structure, based on the value of the search key
• Searching means looking for the correct page (the page with
the desired data entry), with the help of the tree, which is a
hierarchical data structure where
– every node coincides with a page
– the pages with the data entries are the leaves of the tree
– any search starts from the root and ends on a leaf
(therefore it is important to minimize the height of the tree)
– the links between nodes correspond to pointers between
pages
[Figure: ISAM structure — each non-leaf page contains entries P0, K1, P1, K2, …, Km, Pm; the leaves are the primary pages, with overflow pages chained to them.]
The leaves contain the data entries, and they can be scanned
sequentially. The structure is static: no update on the tree!
Comments on ISAM
• An ISAM is a balanced tree -- i.e., the path from the root to a leaf
has the same length for all leaves
• The height of a balanced tree is the length of the path from root
to leaf
• In ISAM, every non-leaf node has the same number of
children; this number is called the fan-out of the tree.
• If every node has F children, a tree of height h has F^h leaf pages.
• In practice, F is at least 100, so that a tree of height 4 contains
100 million leaf pages
• We will see that this implies that we can find the page we want
using 4 I/Os (or 3, if the root is in the buffer). This has to be
contrasted with the fact that a binary search of the same file
would take log2 100,000,000 (> 25) I/Os.
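The arithmetic behind these figures, spelled out in Python:

```python
import math

# With fan-out F = 100 and height h = 4, the tree reaches F**h leaf pages;
# a lookup follows one root-to-leaf path, while a binary search over the
# same number of pages needs about log2 of that many I/Os.
F, h = 100, 4
leaf_pages = F ** h                       # 100**4 = 100 million leaf pages
tree_ios = h                              # 4 I/Os (3 if the root is buffered)
binary_ios = math.ceil(math.log2(leaf_pages))   # > 25 I/Os
```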
[Figure: ISAM example — root with key 40; second-level nodes with keys 20, 33 and 51, 63; leaves 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*.]
[Figure: the same tree after inserting 42* — the index entries (root and second level) are unchanged; 42* is placed in an overflow page chained to the leaf 40* 46*.]
[Figure: the same tree after some deletions — the leaves become 10* 15* | 20* 27* | 33* 37* | 40* 46* | 55* | 63*, while the index entries are still unchanged.]
[Figure: search in a tree index on age — root with keys 12 and 78 (age < 12, 12 ≤ age < 78, 78 ≤ age); second-level nodes with keys 3, 9 | 19, 56 | 86, 94; the branch 19 ≤ age < 56 leads to a node with keys 33, 44; the leaves contain the records (Verdi, 22, 6003), (Neri, 40, 6003), (Rossi, 44, 3000), (Viola, 26, 5000), (Bianchi, 50, 5004).]
[Figure: B+ tree with root keys 13, 17, 24, 30 and leaves 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*. Inserting 8* splits the first leaf into L1 = 2* 3* and L2 = 5* 7* 8*, and the key 5 is copied up, giving the index entries 5, 13, 24, 30.]
Note: every node (except for the root) has a number of data entries
greater than or equal to the rank d (here d=2) of the tree
[Figure: the resulting B+ tree — root with key 17; internal nodes with keys 5, 13 and 24, 30; leaves 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*.]
Typically, the tree increases in breadth. The only case where the tree increases in
depth is when we need to insert into a full root
Scan: 1.5B(D+RC)
Selection based on equality: D logF(1.5B) + C log2R
– D logF(1.5B): search for the first page containing a data record of interest
– C log2R: binary search within that page for the first data record of interest
Typically, the data records of interest appear in one page
N.B. In practice, the root is in the buffer, so that we can avoid one page access
Both for the scan and for the selection operator, it might sometimes be convenient
to avoid using the index, sort the file, and operate directly on the sorted file.
Static Hashing (Hashed file)
• Fixed number of primary pages N (i.e., N = number of buckets)
– sequentially allocated
– never de-allocated
– with overflow pages, if needed
• h(k) mod N = address of the bucket containing record with search key k
• h should distribute the values uniformly into the range 0..N-1
[Figure: static hashed file — h applied to the key gives the bucket address h(key) mod N; primary bucket pages 0 … N−1, with overflow pages chained to them.]
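A minimal Python sketch of a static hashed file (the number of buckets, the page capacity, and the use of Python's built-in `hash` are illustrative choices, not the slides' parameters):

```python
# Sketch of a static hashed file: the number N of primary buckets is fixed
# at creation time; a full bucket grows a chain of overflow pages.
# N, the page capacity, and Python's built-in hash are illustrative choices.
N, CAPACITY = 4, 2
buckets = [[] for _ in range(N)]          # each bucket: a list of pages

def insert(key):
    b = hash(key) % N                     # bucket address: h(k) mod N
    pages = buckets[b]
    if not pages or len(pages[-1]) == CAPACITY:
        pages.append([])                  # allocate a new (overflow) page
    pages[-1].append(key)

def search(key):
    b = hash(key) % N                     # only one bucket must be examined
    return any(key in page for page in buckets[b])
```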
Insertion: 2D + C + H + 2D + C = H + 4D + 2C
Cost of inserting the data record in the heap (2D + C)
Cost of inserting the data entry in the index (H + 2D + C)
Extendible Hashing
• When a bucket (primary page) becomes full, why not
reorganize the file by doubling the number of
buckets?
– Reading and writing all the pages is too costly
– Idea: use a directory of pointers to buckets, double
only such directory, and split only the bucket that is
becoming full
– The directory is much smaller than the file, and
doubling the directory is much less costly than
doubling all buckets
– Obviously, when the directory is doubled, the hash
function has to be adapted
[Figure: extendible hashing — directory with global depth 2 and entries 00, 01, 10, 11; Bucket B (local depth 2): 1* 5* 21* 13*; Bucket C (local depth 2): 10*; Bucket D (local depth 2): 15* 7* 19*; the directory points to the data pages.]
[Figure: splitting Bucket A after inserting 20* — on the left, the directory with global depth 2; on the right, the directory doubled to global depth 3 with entries 000 … 111. Bucket B: 1* 5* 21* 13*, Bucket C: 10*, Bucket D: 15* 7* 19* (all with local depth 2); Bucket A2: 4* 12* 20*, the split image of Bucket A, with local depth 3.]
[Figure: example with key 6 = 110 — placement of the data entry 6* in directories of global depth 1, 2, and 3, contrasting the use of the least significant vs the most significant bits of the hash value.]
NEXT: the next bucket to split
The family of hash functions, one per level:
– d0 = 5: h0(v) = h(v) mod 32
– d1 = 6: h1(v) = h(v) mod 64
– d2 = 7: h2(v) = h(v) mod 128
– …
[Figure: linear hashing — addresses computed with h0 (00 … 11) and h1 (000 … 100); Bucket B: 9* 25* 5*; Bucket C: 10* 14* 18* 30*; Bucket D: 31* 7* 11* 35*, with overflow entry 43*; Bucket A2: 44* 36*, the split image of Bucket A.]
• Idea of linear hashing: by allocating the primary buckets sequentially, bucket i can
be located by simply computing an offset (i pages from the first bucket), so the use
of a directory can be avoided
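A sketch of the resulting addressing rule in Python (the function name and the initial number of buckets N0 = 32, chosen to match d0 = 5 above, are assumptions):

```python
# Addressing rule for linear hashing: at level i, h_i(v) = h(v) mod (2**i * N0).
# A bucket before NEXT has already been split in the current round, so it must
# be addressed with the next-level function. N0 = 32 (matching d0 = 5) and the
# function name are assumptions.
def bucket_address(h_value, level, next_to_split, n0=32):
    b = h_value % (2 ** level * n0)            # h_level
    if b < next_to_split:                      # already split in this round
        b = h_value % (2 ** (level + 1) * n0)  # use h_{level+1} instead
    return b   # buckets are allocated sequentially: b is an offset, no directory
```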
• Heap file
– Efficient in terms of space occupancy
– Efficient for scan and insertion
– Inefficient for search and deletion
• Sorted file
– Efficient in terms of space occupancy
– Inefficient for insertion and deletion
– More efficient search with respect to heap file
We assume that the SQL engine translates the query into the
relational algebra, and therefore we are interested in studying
how relational algebra expressions with just one operator are
evaluated.
The most selective access path is the one that allows accessing the
minimum number of pages
Example: for a search key <a,b,c>, <a> and <a,b> are prefixes, while
<a,c> and <b,c> are not
– A tree index with search key <r,b,s> conforms to the condition (r > ‘J’ and
b = 7 and s = 5) and to (r = ‘H’ and s = 9), but not to (b = 8 and s = 10)
– Any index with search key <b,s> conforms to the condition (d > 1000 and
b = 5 and s = 7): after finding the entries for the primary terms b = 5 and s
= 7, we select only those satisfying the further condition d > 1000
– If we have an index with search key <b,s> and a tree index with search
key d, then both conform to the condition (d > 1000 and b = 5 and s = 7).
No matter which one is used, we will select only the tuples retrieved with
the help of the index that satisfy the further condition
select *
from R
where <att op value>
If A1,…,An form a prefix of the search key, then, during the
index-only scan we can efficiently:
- eliminate those attributes that are not among the target
attributes in the projection
- eliminate duplicates
– Nested loop
– Block nested loop
– Index nested loop
– Sort merge join
Nested loop (naive version)
We choose one of the two relations (say R) and we
consider it as the “outer relation”, while the other relation
(i.e., S) is the “inner relation”.
Obviously, in the worst case (i.e., when all tuples of R join with all tuples of
S) the algorithm costs M × N. In practice, the cost is much lower. For
example, when the equi-join attribute is a key for the inner relation,
every page of S is analyzed only once, and the cost is M + N, which is
optimal. On average, even if a page of S may be requested multiple times,
it is likely to still be in the buffer, so estimating the cost as
M + N is reasonable.
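The naive algorithm itself is just a pair of nested scans; a Python sketch (relation and attribute names are hypothetical, with tuples modeled as dictionaries):

```python
# Naive nested loop equi-join: R is the outer relation, S the inner one;
# S is rescanned for every tuple of R. Relation and attribute names are
# illustrative; tuples are modeled as dictionaries.
def nested_loop_join(r_tuples, s_tuples, r_attr, s_attr):
    result = []
    for r in r_tuples:                    # outer relation R
        for s in s_tuples:                # inner relation S, rescanned each time
            if r[r_attr] == s[s_attr]:
                result.append((r, s))
    return result
```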
If the buffer has 10 free pages for the merge sort, the cost of sorting R is
(M log10 M)=3000, the cost of sorting S is (N log10 N)=1500, and the total
cost of the sort merge join is 3000 + 1500 + 1000 + 500 = 6000,
corresponding to about one minute.
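The arithmetic, spelled out in Python (M = 1000 and N = 500 pages are an assumption here: they are the values that reproduce the figures above):

```python
import math

# Reproducing the cost computation: with 10 buffer pages the number of
# merge-sort passes is about log10 of the file size, so sorting costs
# M log10 M and N log10 N page I/Os, and the merge phase reads both files once.
M, N = 1000, 500                          # assumed page counts
sort_r = M * math.ceil(math.log10(M))     # 1000 * 3 = 3000
sort_s = N * math.ceil(math.log10(N))     # 500 * 3 = 1500
merge = M + N                             # 1000 + 500
total = sort_r + sort_s + merge           # 6000 page I/Os
```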