Professional Documents
Culture Documents
Torsten Grust
Chapter 2
Storage
Disks, Buffer Manager, Files. . . Magnetic Disks
Access Time
Sequential vs. Random
I/O Parallelism
Summer 2014 RAID Levels 1, 0, and 5
Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
SQL Commands
Magnetic Disks
Access Time
Executor Parser Sequential vs. Random
Access
I/O Parallelism
Operator Evaluator Optimizer
RAID Levels 1, 0, and 5
Alternative Storage
Techniques
Transaction Files and Access Methods Solid-State Disks
Network-Based Storage
Manager
Recovery Managing Space
Buffer Manager
Manager Free Space Management
Databases vs.
DBMS Operating Systems
Recap
2
Storage
The Memory Hierarchy
Torsten Grust
capacity latency
CPU
(with registers) bytes < 1 ns
Magnetic Disks
Access Time
caches kilo-/megabytes < 10 ns Sequential vs. Random
Access
I/O Parallelism
main memory gigabytes 70–100 ns RAID Levels 1, 0, and 5
Alternative Storage
hard disks terabytes 3–10 ms Techniques
Solid-State Disks
Network-Based Storage
tape library petabytes varies Managing Space
Free Space Management
Buffer Manager
Recap
3
Storage
Magnetic Disks
Torsten Grust
rotation
arm
track
Magnetic Disks
Access Time
Sequential vs. Random
Access
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
block sector Techniques
Managing Space
Free Space Management
Photo: http://www.metallurgy.utah.edu/
A stepper motor positions an array of Buffer Manager
Recap
4
Storage
Access Time
Torsten Grust
2 Disk controller waits for desired block to rotate under Managing Space
disk head (rotational delay tr ) Free Space Management
Buffer Manager
3 Read/write data (transfer time ttr ) Pinning and Unpinning
Replacement Policies
Databases vs.
⇒ access time: t = ts + tr + ttr Operating Systems
Recap
5
Storage
Example: Seagate Cheetah 15K.7
Torsten Grust
(600 GB, server-class drive)
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
Recap
6
Storage
Example: Seagate Cheetah 15K.7
Torsten Grust
(600 GB, server-class drive)
1 1 Managing Space
average rotational delay: 2 · 15 000 min−1 tr = 2.00 ms Free Space Management
8 kB Buffer Manager
transfer time for 8 KB: 163 MB/s ttr = 0.05 ms Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
access time for an 8 kB data block t = 5.45 ms
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts
Recap
6
Storage
Sequential vs. Random Access
Torsten Grust
• random access:
Magnetic Disks
trnd = 1 000 · 5.45 ms = 5.45 s Access Time
I/O Parallelism
tseq = ts + tr + 1 000 · ttr + 16 · ts,track-to-track RAID Levels 1, 0, and 5
Buffer Manager
Pinning and Unpinning
Databases vs.
⇒ Avoid random I/O whenever possible Operating Systems
Recap
7
Storage
Performance Tricks
Torsten Grust
I/O Parallelism
sequential scans RAID Levels 1, 0, and 5
Alternative Storage
Techniques
that requires the smallest arm movement (SPTF: shortest Free Space Management
Buffer Manager
positioning time first, elevator algorithms) Pinning and Unpinning
Replacement Policies
Databases vs.
zoning Operating Systems
divide outer tracks into more sectors than inner tracks Free Space Management
Inside a Page
Alternative Page Layouts
Recap
8
Storage
Evolution of Hard Disk Technology
Torsten Grust
Alternative Storage
Techniques
Solid-State Disks
Managing Space
• Random access cost hurts even more as time progresses Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Recap
9
Storage
Ways to Improve I/O Performance
Torsten Grust
But:
• Throughput can be increased rather easily by exploiting Magnetic Disks
Access Time
• Idea: Use multiple disks and access them in parallel, try to I/O Parallelism
RAID Levels 1, 0, and 5
hide latency Alternative Storage
Techniques
Solid-State Disks
TPC-C: An industry benchmark for OLTP Network-Based Storage
Managing Space
A recent #1 system (IBM DB2 9.5 on AIX) uses Free Space Management
Buffer Manager
• 10,992 disk drives (73.4 GB each, 15,000 rpm) (!) Pinning and Unpinning
Replacement Policies
plus 8 146.8 GB internal SCSI drives, Databases vs.
• connected with 68 4 Gbit fibre channel adapters, Operating Systems
Recap
10
Storage
Disk Mirroring
Torsten Grust
Magnetic Disks
1 2 3 4 5 Access Time
1 2 3 4 5 I/O Parallelism
RAID Levels 1, 0, and 5
6 7 8 9 ···
Alternative Storage
Techniques
1 2 3 4 5
Solid-State Disks
6 7 8 9 ··· Network-Based Storage
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Databases vs.
• Improved failure tolerance—can survive one disk failure Operating Systems
Recap
11
Storage
Disk Striping
Torsten Grust
Alternative Storage
Techniques
1 4 7 ··· 2 5 8 ··· 3 6 9 ··· Solid-State Disks
Network-Based Storage
Managing Space
Free Space Management
Buffer Manager
• High failure risk (here: 3 times risk of single disk failure)! Databases vs.
Operating Systems
Recap
12
Storage
Disk Striping with Parity
Torsten Grust
1 2 3 4 5 Magnetic Disks
6 7 8 9 ··· Access Time
Sequential vs. Random
Access
I/O Parallelism
RAID Levels 1, 0, and 5
Managing Space
Free Space Management
• Fault tolerance: any one disk may fail without data loss
Pinning and Unpinning
Replacement Policies
Recap
13
Storage
Solid-State Disks
Torsten Grust
write
Access
time
significantly slower than on I/O Parallelism
read
write
RAID Levels 1, 0, and 5
traditional magnetic drives:
read
Alternative Storage
Techniques
1 (Blocks of ) Pages have to be flash mag. disk
Solid-State Disks
Network-Based Storage
erased before they can be
Managing Space
updated Free Space Management
Recap
14
Storage
SSDs: Page-Level Writes, Block-Level Deletes
Torsten Grust
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Illustration taken from arstechnica.com (The SSD Revolution)
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
Recap
15
Storage
Example: Seagate Pulsar.2
Torsten Grust
(800 GB, server-class solid state drive)
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
Recap
16
Storage
Example: Seagate Pulsar.2
Torsten Grust
(800 GB, server-class solid state drive)
Managing Space
no rotational delay: tr = 0.00 ms Free Space Management
128 kB
transfer time for 8 KB: 370 MB/s ttr = 0.30 ms Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
access time for an 8 kB data block t = 0.30 ms
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts
Recap
16
Storage
Sequential vs. Random Access with SSDs
Torsten Grust
Buffer Manager
⇒ Sequential I/O still beats random I/O Pinning and Unpinning
Replacement Policies
(but random I/O is more feasible again) Databases vs.
Recap
17
Storage
Network-Based Storage
Torsten Grust
Magnetic Disks
Storage media/interface Transfer rate Access Time
Sequential vs. Random
Hard disk 100–200 MB/s Access
Managing Space
Free Space Management
For comparison (RAM):
Buffer Manager
PC2–5300 DDR2–SDRAM 10.6 GB/s Pinning and Unpinning
Databases vs.
Operating Systems
Recap
18
Storage
Storage Area Network (SAN)
Torsten Grust
I/O Parallelism
RAID Levels 1, 0, and 5
• SAN storage devices typically abstract from RAID or physical Alternative Storage
Techniques
disks and present logical drives to the DBMS Solid-State Disks
Network-Based Storage
Buffer Manager
Recap
19
Storage
Grid or Cloud Storage
Torsten Grust
Spare CPU cycles and disk space are sold as a service: Access
I/O Parallelism
Alternative Storage
Use Linux-based compute cluster by the hour (∼ 10 ¢/h). Techniques
Solid-State Disks
Amazon’s Simple Storage System (S3) Network-Based Storage
Recap
20
Storage
Managing Space
Torsten Grust
SQL Commands
Magnetic Disks
Executor Parser Access Time
Sequential vs. Random
Access
Operator Evaluator Optimizer
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
Transaction Files and Access Methods Techniques
Solid-State Disks
Manager
Recovery Network-Based Storage
Buffer Manager
Manager Managing Space
Free Space Management
Lock Manager
Disk Space Manager Buffer Manager
Pinning and Unpinning
Replacement Policies
DBMS Databases vs.
Operating Systems
Recap
21
Storage
Managing Space
Torsten Grust
Alternative Storage
of storage to the remaining system components Techniques
Solid-State Disks
Managing Space
Free Space Management
page number 7→ physical location , Buffer Manager
Pinning and Unpinning
• tape number and offset for data stored in a tape library Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts
Recap
22
Storage
Empty Pages
Torsten Grust
Alternative Storage
Managing Space
To exploit sequential access, it is useful to allocate contiguous Free Space Management
Recap
23
Storage
Empty Pages
Torsten Grust
Alternative Storage
Managing Space
To exploit sequential access, it is useful to allocate contiguous Free Space Management
Recap
23
Storage
Buffer Manager
Torsten Grust
SQL Commands
Magnetic Disks
Executor Parser Access Time
Sequential vs. Random
Access
Operator Evaluator Optimizer
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
Transaction Files and Access Methods Techniques
Solid-State Disks
Manager
Recovery Network-Based Storage
Buffer Manager
Manager Managing Space
Free Space Management
Lock Manager
Disk Space Manager Buffer Manager
Pinning and Unpinning
Replacement Policies
DBMS Databases vs.
Operating Systems
Recap
24
Storage
Buffer Manager
Torsten Grust
3
• Manages a designated main
Access
I/O Parallelism
memory area, the buffer RAID Levels 1, 0, and 5
Buffer Manager
into memory frames Pinning and Unpinning
1 2 3 4 5 6 Replacement Policies
Recap
25
Storage
Interface to the Buffer Manager
Torsten Grust
it into memory if necessary and mark page as clean (¬dirty). Sequential vs. Random
Access
eviction. Must set dirty = true if page was modified. Network-Based Storage
Managing Space
Free Space Management
Databases vs.
Operating Systems
Recap
26
Storage
Interface to the Buffer Manager
Torsten Grust
it into memory if necessary and mark page as clean (¬dirty). Sequential vs. Random
Access
eviction. Must set dirty = true if page was modified. Network-Based Storage
Managing Space
Free Space Management
Only modified pages need to be written back to disk upon Databases vs.
Operating Systems
eviction. Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts
Recap
26
Storage
Proper pin ()/unpin () Nesting
Torsten Grust
..
Alternative Storage
. Techniques
Network-Based Storage
Recap
27
Storage
Implementation of pin ()
Torsten Grust
Function pin(pageno)
1 if buffer pool already contains pageno then Magnetic Disks
8 read page pageno from disk into frame v ; Free Space Management
Buffer Manager
9 pinCount (pageno) ← 1 ; Pinning and Unpinning
Recap
28
Storage
Implementation of unpin ()
Torsten Grust
I/O Parallelism
Why don’t we write pages back to disk during unpin ()? RAID Levels 1, 0, and 5
Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
Recap
29
Storage
Implementation of unpin ()
Torsten Grust
I/O Parallelism
Why don’t we write pages back to disk during unpin ()? RAID Levels 1, 0, and 5
Alternative Storage
Techniques
– higher I/O cost (every page write implies a write to disk) Free Space Management
Buffer Manager
– bad response time for writing transactions Pinning and Unpinning
Replacement Policies
This discussion is also known as force (or write-through) vs. Databases vs.
write-back. Actual database systems typically implement Operating Systems
Recap
29
Storage
Concurrent Writes?
Torsten Grust
1 The same page p is requested by more than one transaction Sequential vs. Random
Access
Conflicts of this kind are resolved by the system’s concurrency Network-Based Storage
Managing Space
control, a layer on top of the buffer manager (see “Introduction Free Space Management
Recap
30
Storage
Replacement Policies
Torsten Grust
Like LRU, but considers the k latest unpin () calls, not just Access
I/O Parallelism
the latest RAID Levels 1, 0, and 5
Alternative Storage
Most Recently Used (MRU) Techniques
Evict the page that has been unpinned most recently Solid-State Disks
Network-Based Storage
Databases vs.
Operating Systems
Recap
31
Storage
Example Policy: Clock (“Second Chance”)
Torsten Grust
I/O Parallelism
1 Number the N frames in the buffer pool 0, . . . , N − 1; RAID Levels 1, 0, and 5
Buffer Manager
3 In pin(p): if we need to find a victim, consider page Pinning and Unpinning
Recap
32
Storage
Heuristic Policies Can Fail
Torsten Grust
I/O Parallelism
A number of transactions want to scan the same sequence of RAID Levels 1, 0, and 5
Buffer Manager
2 Now grow R by one page. Pinning and Unpinning
How about the number of I/O operations in this case? Replacement Policies
Databases vs.
Operating Systems
Recap
33
Storage
More Challenges for LRU
Torsten Grust
pages Ri , the B+-tree occupies one root node and 100 index Access
I/O Parallelism
leaf nodes Ik . RAID Levels 1, 0, and 5
Managing Space
I1 , R1 , I2 , R2 , I3 , R3 , . . . Free Space Management
Buffer Manager
Databases vs.
Operating Systems
Recap
34
Storage
More Challenges for LRU
Torsten Grust
pages Ri , the B+-tree occupies one root node and 100 index Access
I/O Parallelism
leaf nodes Ik . RAID Levels 1, 0, and 5
Managing Space
I1 , R1 , I2 , R2 , I3 , R3 , . . . Free Space Management
Buffer Manager
Databases vs.
Operating Systems
With LRU, 50 % of the buffered pages will be pages of R. However, Files and Records
the probability of re-accessing page Ri only is 1/10,000. Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts
Recap
34
Storage
Buffer Management in Practice
Torsten Grust
Prefetching
Buffer managers try to anticipate page requests to overlap
CPU and I/O operations:
• Speculative prefetching: Assume sequential scan and Magnetic Disks
Access Time
Alternative Storage
Techniques
Solid-State Disks
Page fixing/hating Network-Based Storage
E.g., maintain separate pools for indexes and tables. Inside a Page
Alternative Page Layouts
Recap
35
Storage
Databases vs. Operating Systems
Torsten Grust
Wait! Didn’t we just re-invent the operating system?
Magnetic Disks
Yes, Access Time
look like file management and virtual memory in OSs. I/O Parallelism
RAID Levels 1, 0, and 5
• concurrency control often calls for a prescribed order of Free Space Management
Buffer Manager
write operations, Pinning and Unpinning
Databases vs.
database (e.g., file size limitation, platform independence). Operating Systems
Recap
36
Storage
Databases vs. Operating Systems
Torsten Grust
Alternative Storage
Techniques
• Therefore, database systems try to Solid-State Disks
Network-Based Storage
turn off certain OS features or Managing Space
services. disk Free Space Management
Buffer Manager
Pinning and Unpinning
Databases vs.
Operating Systems
Recap
37
Storage
Files and Records
Torsten Grust
SQL Commands
Magnetic Disks
Executor Parser Access Time
Sequential vs. Random
Access
Operator Evaluator Optimizer
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
Transaction Files and Access Methods Techniques
Solid-State Disks
Manager
Recovery Network-Based Storage
Buffer Manager
Manager Managing Space
Free Space Management
Lock Manager
Disk Space Manager Buffer Manager
Pinning and Unpinning
Replacement Policies
DBMS Databases vs.
Operating Systems
Recap
38
Storage
Database Files
Torsten Grust
Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Managing Space
Free Space Management
file 0
Buffer Manager
Pinning and Unpinning
free Replacement Policies
Databases vs.
page 0 page 1 page 2 page 3 Operating Systems
file 1 Files and Records
Heap Files
Torsten Grust
The most important type of files in a database is the heap file. It
stores records in no particular order (in line with, e.g., the SQL
semantics): penyimpanan gk pake urutan jelas
Typical heap file interface
Magnetic Disks
Alternative Storage
rid = insertRecord(f ,r) Techniques
• initiate a sequential scan over the whole heap file: Databases vs.
Operating Systems
openScan(f ) Files and Records
Heap Files
Free Space Management
N.B. Record ids (rid) are used like record addresses (or pointers). Inside a Page
Alternative Page Layouts
The heap file structure maps a given rid to the page containing
Recap
the record. 40
Storage
Heap Files
Torsten Grust
Buffer Manager
Pinning and Unpinning
Databases vs.
– most pages will end up in free page list Operating Systems
Recap
41
Storage
Heap Files
Torsten Grust
Operation insertRecord(f ,r) for linked list of pages
1 Try to find a page p in the free list with free space > | r |;
should this fail, ask the disk space manager to allocate a new
page p Magnetic Disks
Access Time
2 Write record r to page p Sequential vs. Random
Access
3 Since, generally, | r || p |, p will belong to the list of pages I/O Parallelism
with free space RAID Levels 1, 0, and 5
Alternative Storage
4 A unique rid for r is generated and returned to the caller Techniques
Solid-State Disks
Network-Based Storage
Buffer Manager
Given that rids are used like record addresses: what would be a Pinning and Unpinning
Replacement Policies
feasible rid generation method? Databases vs.
Operating Systems
Recap
42
Storage
Heap Files
Torsten Grust
Operation insertRecord(f ,r) for linked list of pages
1 Try to find a page p in the free list with free space > | r |;
should this fail, ask the disk space manager to allocate a new
page p Magnetic Disks
Access Time
2 Write record r to page p Sequential vs. Random
Access
3 Since, generally, | r || p |, p will belong to the list of pages I/O Parallelism
with free space RAID Levels 1, 0, and 5
Alternative Storage
4 A unique rid for r is generated and returned to the caller Techniques
Solid-State Disks
Network-Based Storage
Buffer Manager
Given that rids are used like record addresses: what would be a Pinning and Unpinning
Replacement Policies
feasible rid generation method? Databases vs.
Generate a composite rid consisting of the address of page p and Operating Systems
Recap
42
Storage
Heap Files
Torsten Grust
Directory of pages:
Magnetic Disks
data Access Time
• Use as space map with information about free space on Managing Space
Free Space Management
Recap
43
Storage
Free Space Management
Torsten Grust
Databases vs.
Maintain cursor and continue searching where search Operating Systems
Recap
44
Storage
Free Space Witnesses
Torsten Grust
• Classify pages into buckets, e.g., “75 %–100 % full”, Sequential vs. Random
Access
“50 %–75 % full”, “25 %–50 % full”, and “0 %–25 % full”. I/O Parallelism
RAID Levels 1, 0, and 5
• For each bucket, remember some witness pages. Alternative Storage
Databases vs.
Operating Systems
Recap
45
Storage
Inside a Page — Fixed-Length Records
Torsten Grust
Alternative Storage
Managing Space
• Record position (within page): Free Space Management
Databases vs.
• record id should not Operating Systems
Recap
46
Storage
Inside a Page — Fixed-Length Records
Torsten Grust
Alternative Storage
Managing Space
• Record position (within page): Free Space Management
Databases vs.
• record id should not Operating Systems
Recap
46
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Managing Space
Free Space Management
Databases vs.
Operating Systems
Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust
Managing Space
Free Space Management
Databases vs.
Operating Systems
Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust
Managing Space
page is compacted. Free Space Management
Databases vs.
Operating Systems
Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust
Managing Space
page is compacted. Free Space Management
Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust
Managing Space
page is compacted. Free Space Management
Recap
47
Storage
IBM DB2 Data Pages
Torsten Grust
record size: 4,005 bytes). Records do not span pages. RAID Levels 1, 0, and 5
Alternative Storage
• Maximum table size: 512 GB (with 32 K pages). Maximum Techniques
Solid-State Disks
number of columns: 1,012 (4 K page: 500), maximum Network-Based Storage
• Free space management: first-fit order. Free space map Databases vs.
Operating Systems
distributed on every 500th page in FSCR (free space control Files and Records
records). Records updated in-place if possible, otherwises Heap Files
Free Space Management
uses forward records. Inside a Page
Alternative Page Layouts
Recap
48
Storage
IBM DB2 Data Pages
Torsten Grust
Magnetic Disks
Access Time
Sequential vs. Random
Access
I/O Parallelism
RAID Levels 1, 0, and 5
Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
Replacement Policies
Databases vs.
Operating Systems
Recap
49
Storage
Alternative Page Layouts
Torsten Grust
a1 b1 c1 a4 b4 c4
Magnetic Disks
a1 b1 c1 d1 c1 d1 a2 c4 d4 Access Time
a3 b3 c3 d3 c3 d3 I/O Parallelism
a4 b4 c4 d4 RAID Levels 1, 0, and 5
page 0 page 1
Alternative Storage
Techniques
Solid-State Disks
Managing Space
Free Space Management
Buffer Manager
Pinning and Unpinning
a1 a2 a3 b1 b2 b3 Replacement Policies
a1 b1 c1 d1 a3 a4 b3 b4
Databases vs.
a2 b2 c2 d2 Operating Systems
Recap
50
Storage
Alternative Page Layouts
Torsten Grust
% Ailamaki et al. Weaving Relations for Cache minipage 0 Files and Records
Heap Files
Performance. VLDB 2001. Free Space Management
page 0
Inside a Page
Alternative Page Layouts
Recap
1 Recently, the terms row-store and column-store have become popular, too. 51
Storage
Recap
Torsten Grust
Magnetic Disks
Random access orders of magnitude slower than Magnetic Disks
sequential. Access Time
Sequential vs. Random
Access
Disk Space Manager I/O Parallelism
Abstracts from hardware details and maps RAID Levels 1, 0, and 5
Recap
52