
CSE 412 Database Management

Lecture 15 Database Storage


Jia Zou
Arizona State University

Review
• File Organization
• Files of records
• Page Formats
• Record Formats
• Indexing
Heap File Using a Page Directory

• The entry for a page can include the number of free bytes on the page.
• The directory is a collection of pages; linked list implementation is just one alternative.
• Much smaller than a linked list of all Heap File pages!
Another Solution for Fixed-Length Records
• Slots+Bitmaps

✔ insertions
✔ deletions
Variable Length Records
• Slotted Page

• pack them
• keep ptrs to them
• rec-id = <page-id, slot#>
• mark start of free space
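
A minimal sketch of the slotted-page idea above (illustrative only, not a real byte-level page layout): records are packed in the page, the slot directory keeps pointers to them, and a record id is <page-id, slot#>.

class SlottedPage:
    def __init__(self, page_id):
        self.page_id = page_id
        self.records = []     # packed variable-length records
        self.slots = []       # slot# -> index into records, or None once deleted

    def insert(self, record):
        self.records.append(record)
        self.slots.append(len(self.records) - 1)
        return (self.page_id, len(self.slots) - 1)   # rec-id = <page-id, slot#>

    def delete(self, rid):
        _, slot = rid
        self.slots[slot] = None   # rec-ids handed out for other records stay valid

    def lookup(self, rid):
        _, slot = rid
        idx = self.slots[slot]
        return None if idx is None else self.records[idx]
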
Record Formats: Fixed Length
• Information about field types is the same for all records in a file; it is stored in the system catalogs.
• Finding the i’th field is done via arithmetic (see the sketch below).
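
For example, a minimal sketch (with an assumed schema of an INT, a CHAR(20), and a FLOAT) of locating the i’th field purely by offset arithmetic:

field_lengths = [4, 20, 8]    # assumed field widths taken from the system catalog
offsets = [sum(field_lengths[:i]) for i in range(len(field_lengths))]

def get_field(record_bytes, i):
    # Field i starts at a precomputed offset; no scanning of earlier fields is needed.
    return record_bytes[offsets[i] : offsets[i] + field_lengths[i]]
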
Variable Length Records
• Two alternative formats (# of fields is fixed): delimit fields with special symbols, or store an array of field offsets in the record header
• The offset-array format is more popular!
B+ Tree
• A hierarchy of nodes with intervals
• Balanced (more or less): good performance guarantee
• Disk-based: one node per block; large fan-out
• Next-leaf pointers: facilitate range queries
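
A minimal sketch of how these properties are used during search (illustrative only; the node attributes is_leaf, keys, children, entries, and next_leaf are assumptions, not from the slides):

import bisect

def search(node, key):
    # Internal nodes hold key intervals that direct the search toward one leaf.
    while not node.is_leaf:
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node    # leaf that may contain the key

def range_scan(root, lo, hi):
    # Next-leaf pointers make a range query a walk along the leaf level.
    leaf = search(root, lo)
    while leaf is not None:
        for k, entry in zip(leaf.keys, leaf.entries):
            if lo <= k <= hi:
                yield entry
            elif k > hi:
                return
        leaf = leaf.next_leaf
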
We can also say this B+ tree is of order 2, which is defined as Ceiling((f-1)/2), where f is the fan-out.
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
Performance Analysis
• How many I/Os are required for each operation?
• h, the height of the tree (more or less)
• Plus one or two to manipulate actual records
• Plus O(h) for reorganization (rare if the fan-out is large)
• Minus one if we cache the root in memory
• How big is h?
• Roughly log_fanout(N), where N is the number of records
• B+-tree properties guarantee that the fan-out is at least f/2 for all non-root nodes
• Fan-out is typically large (in the hundreds) -- many keys and pointers can fit into one block
• Usually 3 levels are enough (e.g., 100^3 = 1,000,000 records); a 4-level B+-tree is enough for “typical” tables
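
A back-of-the-envelope sketch of how small h stays (the record counts and fan-out below are assumed values for illustration):

def btree_height(num_records, fanout):
    # Count how many times the fan-out must be multiplied to cover all records:
    # roughly ceil(log_fanout(N)).
    height, reach = 0, 1
    while reach < num_records:
        reach *= fanout
        height += 1
    return height

print(btree_height(1_000_000, 100))       # 3, since 100^3 = 1,000,000
print(btree_height(1_000_000_000, 100))   # 5, even a billion records stays shallow
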
Indexing: Clustered Indexing Example
Indexing: non-clustered
Dense vs. Sparse Index
• Dense: at least one data entry per key value
• Sparse: an entry per data page in file
• Every sparse index is clustered!
• Sparse indexes are smaller; however, some useful optimizations are based on dense indexes.
• Sparse <-> Clustering <-> Alt#1 (full record)
• Dense <-> non-clustering
Tree vs. Hash-based Techniques
• Hash-based index
• File = a collection of buckets. Bucket = primary page plus 0 or more overflow pages.
• Hash function h: h(r.search_key) = bucket in which record r belongs.
• Tree-based index
• Hierarchical structure (tree) directs searches
• Leaves contain data entries sorted by search key value
• B+ tree: all root->leaf paths have equal length (height)
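
A minimal sketch of the hash-based idea (illustrative only; a real index stores buckets as pages with overflow chains rather than Python lists):

class HashIndex:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]   # bucket = list of (key, rid)

    def _bucket(self, key):
        return hash(key) % len(self.buckets)              # h(search_key) -> bucket number

    def insert(self, key, rid):
        self.buckets[self._bucket(key)].append((key, rid))

    def lookup(self, key):
        # Exact-match search touches only the one bucket the key hashes to.
        return [rid for (k, rid) in self.buckets[self._bucket(key)] if k == key]
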
Tree vs. Hash-based Techniques
• Hash-based index
• Best for
• exact match (average)
• Tree-based index
• Best for
• exact match (worst case)
• Range queries
• Insertion/deletion (worst case)
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
The Storage Hierarchy
• Main memory (RAM) for currently used data.
• Disk for the main database (secondary storage).
• Tapes for archiving older versions of the data (tertiary storage).
Motivation of Initial RDBMS Architecture: Disk is relatively VERY slow
• READ: disk -> main memory (RAM)
• WRITE: main memory (RAM) -> disk
• Both are high-cost operations, relative to in-memory operations, so must be planned carefully
Rules of Thumb
• Memory access is much faster than disk I/O (~1000x)
• “Sequential” I/O is faster than “random” I/O (~10x)
• seek time: moving arms to position disk head on track (dominating)
• rotational delay: waiting for block to rotate under head (dominating)
• transfer time: actually moving data to/from disk surface
• SSD?
• Similar sequential and random performance
• Reading is much faster than writing
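
A rough back-of-the-envelope sketch with assumed disk parameters (the numbers below are illustrative, not from the slides); the exact sequential-vs-random ratio depends on how much data is transferred per positioning:

seek_ms       = 4.0     # assumed average seek time
rotation_ms   = 2.0     # assumed average rotational delay
transfer_MBps = 200.0   # assumed sustained transfer rate
page_KB       = 8.0

transfer_ms = page_KB / 1024 / transfer_MBps * 1000
random_ms   = seek_ms + rotation_ms + transfer_ms       # pay seek + rotation for every page

print(f"random 8 KB read     : ~{random_ms:.2f} ms (positioning cost dominates)")
print(f"sequential 8 KB read : ~{transfer_ms:.2f} ms once the head is positioned")
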
Disk Arrays: RAID (Redundant Array of Inexpensive Disks)
[Figure: RAID disk array and its mean time to failure.]
Why not store it all in memory?
• Costs too much.
• disk: ~$0.1/GB; memory: ~$5-10/GB
• High-end Databases today in the 10-100 TB range
• Approx 60% of the cost of a production system is in the disks
• Main memory is volatile.
• Note: some specialized systems do store entire database in main
memory.
Can we leverage OS for DB storage management?
• OS virtual memory
• OS file system
Can we leverage OS for DB storage management?
• Unfortunately, the OS often gets in the way of the DBMS
• DBMS needs to do things “its own way”
• Control over buffer replacement policy
• LRU not always best (sometimes worst!)
• Control over flushing data to disk
• Write-ahead logging (WAL) protocol requires flushing log entries to disk
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
Organize Disk Space into Pages
• A table is stored as one or more files; a file contains one or more pages
• Higher levels call upon this layer to:
• allocate/de-allocate a page
• read/write a page
• Best if requested pages are stored sequentially on disk! Higher levels don’t need to know if/how this is done, nor how free space is managed.
Buffer Management
[Figure: buffer pool of frames in main memory holding pages read from disk; each frame is either pinned or unpinned.]
• Data must be in RAM for DBMS to operate on it!
• Buffer Mgr hides the fact that not all data is in RAM
When a Page is Requested ...
• First check the buffer pool information table (one entry per resident page: frame, pin count, dirty bit)
• If the requested page is not in the pool and the pool is not full:
• Read the requested page into a free frame
• Pin the page and return its address
• If the requested page is not in the pool and the pool is full:
• Choose an (un-pinned) frame for replacement
• If the frame is “dirty”, write it to disk
• Read the requested page into the chosen frame
• Pin the page and return its address
• The buffer pool information table is updated accordingly
• Unpin the page when you finish using it
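
A minimal sketch of the request/pin/unpin logic above (the class and method names are assumptions for illustration; disk is any object with read_page/write_page):

class BufferManager:
    def __init__(self, num_frames, disk):
        self.disk = disk
        self.frames = {}              # page_id -> {"data", "pin_count", "dirty"}
        self.num_frames = num_frames

    def request_page(self, page_id):
        if page_id not in self.frames:                    # page not in pool
            if len(self.frames) >= self.num_frames:       # pool full: evict a victim
                victim = self.choose_victim()             # must be an un-pinned page
                if self.frames[victim]["dirty"]:
                    self.disk.write_page(victim, self.frames[victim]["data"])
                del self.frames[victim]
            data = self.disk.read_page(page_id)           # read page into chosen frame
            self.frames[page_id] = {"data": data, "pin_count": 0, "dirty": False}
        self.frames[page_id]["pin_count"] += 1            # pin and return
        return self.frames[page_id]["data"]

    def unpin_page(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame["pin_count"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def choose_victim(self):
        # Replacement policy goes here (LRU, MRU, Clock, ...); any un-pinned page qualifies.
        for pid, f in self.frames.items():
            if f["pin_count"] == 0:
                return pid
        raise RuntimeError("all frames are pinned")
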


Buffer Replacement Policy
• Frame is chosen for replacement by a replacement policy:
• Least-recently-used (LRU), MRU, Clock, etc.
• Policy -> big impact on # of I/Os; depends on the access pattern
LRU Replacement Policy
• Least Recently Used (LRU)
• for each page in buffer pool, keep track of time last unpinned
• replace the frame which has the oldest (earliest) time
• very common policy: intuitive and simple
• Problems?
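
A minimal sketch of an LRU victim chooser that could plug into the buffer-manager sketch above (names are assumptions for illustration): it records the time each page was last unpinned and evicts the un-pinned page with the oldest such time.

import itertools

class LRUPolicy:
    def __init__(self):
        self.clock = itertools.count()        # logical timestamps
        self.last_unpinned = {}               # page_id -> time of last unpin

    def record_unpin(self, page_id):
        self.last_unpinned[page_id] = next(self.clock)

    def choose_victim(self, frames):
        # frames: page_id -> {"pin_count": ...}; only un-pinned pages are candidates.
        candidates = [pid for pid, f in frames.items() if f["pin_count"] == 0]
        # Oldest unpin time first; MRU would take max() instead of min().
        return min(candidates, key=lambda pid: self.last_unpinned.get(pid, -1))
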
LRU Replacement Policy
• Problem: Sequential Flooding
• LRU + repeated sequential scans.
• # buffer frames < # pages in file means each page request causes an I/O. MRU much better in this situation (but not in all situations, of course).
Sequential Flooding – Illustration
[Figure: step-by-step illustration of how LRU and MRU behave during repeated sequential scans.]
Advanced Paging Algorithms
• Greedy-dual
• Locality Set
• Clock
Summary
• Buffer manager brings pages into RAM.
• Very important for performance
• Page stays in RAM until released by requestor.
• Written to disk when frame chosen for replacement (which is sometime after
requestor releases the page).
• Choice of frame to replace based on replacement policy.
Conclusions
• Memory hierarchy
• Disks: >1000x slower than memory, thus
• pack info in blocks
• try to fetch nearby blocks (sequentially)
• Buffer management: very important
• LRU, MRU, etc.
• Record organization: Slotted page
Today’s Agenda
• Indexing
• Memory hierarchy
• Buffer management
• Relational operators
Operator Algorithms
• Selection
• Join

What is the best way to implement the selection?

A: It depends on
• availability of the indexes
Selection Options
• No Index, Unsorted Data => File Scan (Linear Search)
• No Index, Sorted Data => File Scan (Binary Search)
• B+ Tree Index/Hashing Index => Use index to find qualifying data entries, then retrieve corresponding data records
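
A minimal sketch of the three access paths (illustrative only; records are (key, value) pairs and the index is assumed to map a key to its matching records):

import bisect

def linear_search(records, key):                 # no index, unsorted data
    return [r for (k, r) in records if k == key]

def binary_search(sorted_records, key):          # no index, data sorted on the key
    keys = [k for (k, _) in sorted_records]
    i = bisect.bisect_left(keys, key)
    out = []
    while i < len(sorted_records) and sorted_records[i][0] == key:
        out.append(sorted_records[i][1])
        i += 1
    return out

def index_lookup(index, key):                    # B+ tree / hash index on the key
    return index.get(key, [])                    # find data entries, then fetch records
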

Operator Algorithms
• Selection
• Join

Join Algorithms
• Nested Loop Join
• Grace Hash Join
• Sort Merge Join

Nested Loop Join
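
A minimal sketch of the tuple-at-a-time nested loop join (illustrative only; tuples are assumed to be Python dicts keyed by attribute name): for each tuple of the outer table, scan the entire inner table.

def nested_loop_join(R, S, key):
    for r in R:                 # outer relation
        for s in S:             # inner relation, rescanned for every outer tuple
            if r[key] == s[key]:
                yield (r, s)
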
Block Nested Loop Join
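
A minimal sketch of the block variant (illustrative only): the outer relation is read a block at a time, so the inner relation is scanned once per block instead of once per tuple.

def block_nested_loop_join(R, S, key, block_size):
    for start in range(0, len(R), block_size):
        block = R[start:start + block_size]     # fill the buffer with a block of R
        for s in S:                             # one scan of S per block of R
            for r in block:
                if r[key] == s[key]:
                    yield (r, s)
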
Join Algorithms
• Nested Loop Join
• Grace Hash Join
• Sort Merge Join

Grace Hash Join
• Two-Phase Hash join:
• Partition Phase: Hash both tables on the join attribute into partitions.
• Probing Phase: Compare tuples in corresponding partitions of the two tables.
• Named after the GRACE database machine.
Grace Hash Join
• Hash R into (0, 1, ..., ‘max’) buckets
• Hash S into buckets (same hash function)
Grace Hash Join
• Join each pair of matching buckets:
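
A minimal in-memory sketch of the two phases (illustrative only; a real implementation writes each partition to disk during the partition phase and uses a different hash function for the in-memory table):

def grace_hash_join(R, S, key, num_partitions):
    # Partition phase: hash both tables on the join attribute with the same function.
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for r in R:
        r_parts[hash(r[key]) % num_partitions].append(r)
    for s in S:
        s_parts[hash(s[key]) % num_partitions].append(s)

    # Probing phase: join each pair of matching partitions.
    for rp, sp in zip(r_parts, s_parts):
        table = {}                                # build: hash table on R's partition
        for r in rp:
            table.setdefault(r[key], []).append(r)
        for s in sp:                              # probe with S's partition
            for r in table.get(s[key], []):
                yield (r, s)
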
Grace Hash Join: Partition
[Figure: pages of R and S are read through a 1-page input buffer and hashed into per-partition output buffers (Partition 1, Partition 2).]
Grace Hash Join: Build & Probe
[Figure: for each partition, an in-memory hash table is built on one input using a new hash function, then the matching partition of the other input is read one page at a time and probed against it.]
Grace Hash Join
• Can be used when neither table fits in memory
• Usually used when the smaller table fits in memory
Join Algorithms
• Nested Loop Join
• Grace Hash Join
• Sort Merge Join
