Review
• File Organization
• Files of records
• Page Formats
• Record Formats
• Indexing
Heap File Using a Page Directory
✔ insertions
✔ deletions
Variable-Length Records
• Slotted Page
• pack records contiguously
• keep pointers (slots) to them
• rec-id = <page-id, slot#>
• mark the start of free space
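The slotted-page bullets above can be sketched as a minimal in-memory model (illustrative only; the class and field names are assumptions, not from the slides):

```python
# A minimal sketch of a slotted page for variable-length records.
# Records are packed contiguously; a slot directory maps slot# -> (offset, length),
# so rec-ids <page-id, slot#> stay stable even if records move or are deleted.

class SlottedPage:
    def __init__(self, page_id):
        self.page_id = page_id
        self.data = bytearray()   # records packed contiguously
        self.slots = []           # slot# -> (offset, length), or None if deleted
        # "start of free space" is simply len(self.data)

    def insert(self, record: bytes):
        offset = len(self.data)            # start of free space
        self.data += record                # pack the record at the end
        self.slots.append((offset, len(record)))
        slot_no = len(self.slots) - 1
        return (self.page_id, slot_no)     # rec-id = <page-id, slot#>

    def read(self, rec_id):
        page_id, slot_no = rec_id
        assert page_id == self.page_id
        entry = self.slots[slot_no]
        if entry is None:
            return None                    # record was deleted
        offset, length = entry
        return bytes(self.data[offset:offset + length])

    def delete(self, rec_id):
        _, slot_no = rec_id
        self.slots[slot_no] = None         # rec-id stays valid; space reclaimed later
```

Because readers go through the slot directory, the page is free to compact its records without changing any rec-id.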
Record Formats: Fixed Length
• Information about field types same for all records in a file; stored in
system catalogs.
• Finding the i’th field is done via arithmetic (its offset is the sum of the preceding field lengths).
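A tiny sketch of that arithmetic (the field lengths here stand in for what the system catalogs would supply; the schema is an assumed example):

```python
# Locating the i'th field of a fixed-length record by arithmetic alone.
# The per-field lengths would come from the system catalogs; this schema
# is an assumed example (e.g., a 4-byte int, an 8-byte string, a 2-byte int).

FIELD_LENGTHS = [4, 8, 2]

def field_bounds(i, lengths=FIELD_LENGTHS):
    offset = sum(lengths[:i])          # start = sum of preceding field lengths
    return offset, offset + lengths[i]

def read_field(record: bytes, i):
    lo, hi = field_bounds(i)
    return record[lo:hi]
```

No per-record metadata is needed: every record in the file shares the same layout.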
Variable-Length Records
• Two alternative formats (# fields is fixed): delimiters between fields, or an array of field offsets; the offset array is more popular!
B+ Tree
• A hierarchy of nodes with intervals
• Balanced (more or less): good performance guarantee
• Disk-based: one node per block; large fan-out
• Next-leaf pointers: facilitate range queries
We can also say this B+ tree has order 2, where order is defined as Ceiling((f-1)/2) and f is the fan-out.
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
Performance Analysis
• How many I/Os are required for each operation?
• h, the height of the tree (more or less)
• Plus one or two to manipulate actual records
• Plus O(h) for reorganization (rare if fan-out is large)
• Minus one if we cache the root in memory
• How big is h?
• Roughly log_f N, where N is the number of records and f is the fan-out
• B+-tree properties guarantee that fan-out is at least f/2 for all non-root nodes
• Fan-out is typically large (in the hundreds): many keys and pointers fit into one block. With f = 100, three levels already cover 100^3 = 1,000,000 records
• Usually 3 levels; a 4-level B+-tree is enough for “typical” tables
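The height estimate above can be checked with a few lines (an illustrative sketch; fan-out 100 is the slide's example figure, and integer arithmetic avoids floating-point log rounding):

```python
# Back-of-the-envelope B+ tree height: the smallest h with fanout**h >= N.
# This is the integer version of h ~= log_f N from the slide.

def btree_height(n_records, fanout):
    h = 0
    capacity = 1          # how many records h levels can index
    while capacity < n_records:
        capacity *= fanout
        h += 1
    return h
```

With a fan-out of 100, three levels index a million records, and a fourth level takes that to a hundred million, which is why 3 to 4 levels suffice for typical tables.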
Indexing: Clustered Index Example
Indexing: Non-Clustered Index Example
Dense vs. Sparse Index
• Dense: at least one data entry per key value
• Sparse: one entry per data page in the file
• Every sparse index is clustered!
• Sparse indexes are smaller; however, some useful optimizations are based on dense indexes.
• Sparse <-> clustered <-> Alt#1 (full record as data entry)
• Dense <-> non-clustered
Tree vs. Hash-based Techniques
• Hash-based index.
• File = a collection of buckets. Bucket = primary page plus 0 or more overflow
pages.
• Hash function h: h(r.search_key) = bucket in which record r belongs.
• Tree-based index
• Hierarchical structure (Tree) directs searches
• Leaves contain data entries sorted by search key value
• B+ tree: all root->leaf paths have equal length (height)
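The hash-based index just described can be sketched in a few lines (buckets modeled as Python lists; the bucket count and names are assumptions, and overflow pages are folded into the list):

```python
# A minimal sketch of a hash-based index: h(search_key) -> bucket.
# Each bucket stands in for a primary page plus any overflow pages.

N_BUCKETS = 4  # assumed, purely illustrative

def h(search_key):
    return hash(search_key) % N_BUCKETS

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(N_BUCKETS)]

    def insert(self, search_key, rid):
        # the record's rid goes into the bucket h maps its key to
        self.buckets[h(search_key)].append((search_key, rid))

    def lookup(self, search_key):
        # an exact-match lookup touches only one bucket
        return [rid for k, rid in self.buckets[h(search_key)] if k == search_key]
```

Note there is no order among or within buckets, which is why a hash index gives no help with range queries.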
Tree vs. Hash-based Techniques
• Hash-based index
• Best for
• exact match (good average-case cost)
• Tree-based index
• Best for
• exact match (good worst-case cost)
• range queries
• insertion/deletion (good worst-case cost)
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
The Storage Hierarchy
• Main memory (RAM) for currently
used data.
• Disk for the main database
(secondary storage).
• Tapes for archiving older versions of
the data (tertiary storage).
Motivation for the Initial RDBMS Architecture: Disk is Relatively VERY Slow
• READ: disk-> main memory (RAM)
• WRITE: main memory (RAM) -> disk
• Both are high-cost operations, relative to in-memory operations, so
must be planned carefully
Rules of Thumb
• Memory access is much faster than disk I/O (~1000x)
• “Sequential” I/O is faster than “random” I/O (~10x)
• seek time: moving the arm to position the disk head on the track
• rotational delay: waiting for the block to rotate under the head
• (these two dominate)
• transfer time: actually moving data to/from the disk surface
• SSD?
• sequential and random I/O costs are similar
• reading is much faster than writing
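The sequential-vs-random rule of thumb falls out of the cost components above; here is a back-of-the-envelope sketch with assumed, purely illustrative timings:

```python
# Random I/O pays seek + rotational delay for every block; sequential I/O
# pays them once and then only transfer time per block.
# All numbers below are assumed, illustrative figures, not measurements.

SEEK_MS = 4.0        # average seek time
ROTATE_MS = 2.0      # average rotational delay (half a rotation)
TRANSFER_MS = 0.1    # transfer time per block

def random_io_ms(n_blocks):
    return n_blocks * (SEEK_MS + ROTATE_MS + TRANSFER_MS)

def sequential_io_ms(n_blocks):
    return SEEK_MS + ROTATE_MS + n_blocks * TRANSFER_MS
```

With these figures, reading 100 blocks randomly costs well over ten times what reading them sequentially does, matching the ~10x rule of thumb's order of magnitude.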
Disk Arrays: RAID (Redundant Array of Inexpensive Disks)
Can we leverage the OS for DB storage management?
• OS virtual memory
• OS file system
• Unfortunately, OS often gets in the way of DBMS
• DBMS needs to do things “its own way”
• Control over buffer replacement policy
• LRU is not always best (sometimes worst!)
• Control over flushing data to disk
• Write-ahead logging (WAL) protocol requires flushing log entries to disk
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
Organize Disk Space into Pages
• A table is stored as one or more files; a file contains one or more pages
• Higher levels call upon this layer to:
• allocate/de-allocate a page
• read/write a page
• Best if requested pages are stored sequentially on disk! Higher levels don’t need to know if/how this is done, nor how free space is managed.
Buffer Management
(diagram: a buffer pool of frames, each frame either pinned or unpinned)
• Data must be in RAM for the DBMS to operate on it!
• The Buffer Mgr hides the fact that not all data is in RAM
When a Page is Requested ...
• Look up the page in the buffer pool information table
• If the requested page is in the pool: pin it and return its address
• If the requested page is not in the pool and the pool is not full:
• Read the requested page into a free frame
• Pin the page and return its address
• If the requested page is not in the pool and the pool is full:
• Choose an (un-pinned) frame for replacement
• If the frame is “dirty”, write it to disk
• Read the requested page into the chosen frame
• Pin the page and return its address
• Either way, the buffer pool information table is updated with the new entry
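The request logic above can be sketched as follows (an in-memory model; the frame-table fields and the simple evict-first-unpinned policy are assumptions for illustration, not the slides' design):

```python
# A sketch of a buffer manager's page-request path: hit -> pin; miss with a
# free frame -> read in and pin; miss with a full pool -> evict an un-pinned
# frame (writing it back first if dirty), then read in and pin.

class BufferManager:
    def __init__(self, pool_size, disk):
        self.pool_size = pool_size
        self.disk = disk          # models disk as {page_id: page data}
        self.frames = {}          # page_id -> {"data", "pin_count", "dirty"}

    def request_page(self, page_id):
        if page_id in self.frames:                # hit: page already in pool
            frame = self.frames[page_id]
        else:
            if len(self.frames) >= self.pool_size:
                self._evict()                     # pool full: make room
            frame = {"data": self.disk[page_id], "pin_count": 0, "dirty": False}
            self.frames[page_id] = frame
        frame["pin_count"] += 1                   # pin and return the page
        return frame["data"]

    def unpin_page(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame["pin_count"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def _evict(self):
        for pid, frame in self.frames.items():
            if frame["pin_count"] == 0:           # only un-pinned frames qualify
                if frame["dirty"]:
                    self.disk[pid] = frame["data"]  # write back before reuse
                del self.frames[pid]
                return
        raise RuntimeError("all frames pinned")
```

A real buffer manager would pick the victim with a policy such as LRU or clock; the first-unpinned choice here just keeps the sketch short.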
Operator Algorithms
• Selection
• Join
Selection Options
• No Index, Unsorted Data => File Scan (Linear Search)
• No Index, Sorted Data => File Scan (Binary Search)
• B+ Tree Index/Hashing Index => Use index to find qualifying data entries, then retrieve corresponding data records
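The two no-index options can be sketched as follows (files modeled as Python lists of `(key, record)` pairs; an illustrative sketch, not an implementation):

```python
# Equality selection without an index: linear scan on unsorted data,
# binary search (plus a forward scan for duplicates) on sorted data.

import bisect

def select_eq_unsorted(file_, key):
    """No index, unsorted data: file scan (linear search)."""
    return [rec for rec in file_ if rec[0] == key]

def select_eq_sorted(sorted_file, key):
    """No index, data sorted on key: binary search to the first match,
    then scan forward to collect duplicates."""
    keys = [k for k, _ in sorted_file]       # the sort-key column
    i = bisect.bisect_left(keys, key)        # first position with key >= target
    out = []
    while i < len(sorted_file) and sorted_file[i][0] == key:
        out.append(sorted_file[i])
        i += 1
    return out
```

The sorted version touches O(log N) positions to find the start of the run instead of scanning the whole file.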
Join Algorithms
• Nested Loop Join
• Grace Hash Join
• Sort Merge Join
Nested Loop Join
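A minimal sketch of the tuple-at-a-time nested loop join (tables modeled as Python lists of dicts; the join column name is an assumed example):

```python
# Nested loop join: for each tuple r in the outer table R,
# scan the entire inner table S and emit every match.
# Cost is |R| * |S| comparisons -- one full scan of S per R-tuple.

def nested_loop_join(R, S, col="k"):
    out = []
    for r in R:                  # outer loop over R
        for s in S:              # inner loop: full scan of S
            if r[col] == s[col]:
                out.append((r, s))
    return out
```

Simple and always applicable, but the repeated scans of S are exactly what the block-based variant below is designed to avoid.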
Block Nested Loop Join
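A sketch of the block-based refinement (illustrative; "block size" here counts tuples rather than pages to keep the model simple):

```python
# Block nested loop join: read the outer table R one block at a time and
# scan the inner table S once per *block* rather than once per *tuple*,
# cutting the number of passes over S by the block size.

def block_nested_loop_join(R, S, block_size, col="k"):
    out = []
    for i in range(0, len(R), block_size):
        block = R[i:i + block_size]     # one block of the outer table in memory
        for s in S:                     # a single scan of S for this whole block
            for r in block:
                if r[col] == s[col]:
                    out.append((r, s))
    return out
```

With B buffer pages available for the outer block, S is scanned roughly |R|/B times instead of |R| times, which is the whole point of the variant.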
Grace Hash Join
• Two-Phase Hash join:
• Partition Phase: Hash both tables on the join attribute into partitions.
• Probing Phase: Compare tuples in the corresponding partitions of each table.
• Named after the GRACE database machine.
Grace Hash Join
• Hash R into buckets (0, 1, ..., ‘max’)
• Hash S into buckets (same hash function)
• Join each pair of matching buckets
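The two phases can be sketched as follows (partitions modeled as in-memory lists rather than on-disk files; partition count and names are assumptions for illustration):

```python
# Grace hash join: partition both tables with the same hash function,
# then build & probe within each pair of matching partitions. Tuples in
# different partitions can never join, so each pair is handled independently.

N_PARTITIONS = 4  # assumed, purely illustrative

def grace_hash_join(R, S, col="k"):
    # Partition phase: hash R and S on the join attribute (same function)
    r_parts = [[] for _ in range(N_PARTITIONS)]
    s_parts = [[] for _ in range(N_PARTITIONS)]
    for r in R:
        r_parts[hash(r[col]) % N_PARTITIONS].append(r)
    for s in S:
        s_parts[hash(s[col]) % N_PARTITIONS].append(s)

    # Probing phase: for each partition pair, build an in-memory hash
    # table on the R side and probe it with the S side
    out = []
    for rp, sp in zip(r_parts, s_parts):
        table = {}
        for r in rp:
            table.setdefault(r[col], []).append(r)   # build
        for s in sp:                                  # probe
            for r in table.get(s[col], []):
                out.append((r, s))
    return out
```

In the real algorithm each partition is a file written through one output buffer per partition, sized so that each R-partition later fits in memory for the build.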
Grace Hash Join: Partition
(diagram: tuples of R, then of S, are read in and hashed through per-partition output buffers, 1 buffer per partition, into Partition 1, Partition 2, ... on disk)
Grace Hash Join: Build & Probe
(diagram: for each partition, build an in-memory hash table on R’s part of the partition, then probe it with the matching partition of S)
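Sort Merge Join is listed among the join algorithms above but not expanded on these slides; a minimal sketch under assumed details (tables as lists of dicts, join column "k"):

```python
# Sort merge join: sort both inputs on the join key, then advance two
# cursors in step, emitting the cross product of each pair of matching
# key groups.

def sort_merge_join(R, S, col="k"):
    R = sorted(R, key=lambda t: t[col])
    S = sorted(S, key=lambda t: t[col])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][col] < S[j][col]:
            i += 1                      # R is behind: advance R's cursor
        elif R[i][col] > S[j][col]:
            j += 1                      # S is behind: advance S's cursor
        else:
            key = R[i][col]             # matching key: join the two groups
            i2 = i
            while i2 < len(R) and R[i2][col] == key:
                j2 = j
                while j2 < len(S) and S[j2][col] == key:
                    out.append((R[i2], S[j2]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2               # skip past both groups
    return out
```

Its appeal in a DBMS is that the merge pass reads each sorted input sequentially, and the output comes out ordered on the join key.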