You are on page 1of 64

Storage

Torsten Grust

Chapter 2
Storage
Disks, Buffer Manager, Files. . . Magnetic Disks
Access Time
Sequential vs. Random

Architecture and Implementation of Database Systems Access

I/O Parallelism
Summer 2014 RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Torsten Grust Inside a Page
Alternative Page Layouts
Wilhelm-Schickard-Institut für Informatik
Recap
Universität Tübingen 1
Storage
Database Architecture
Torsten Grust

Web Forms Applications SQL Interface

SQL Commands
Magnetic Disks
Access Time
Executor Parser Sequential vs. Random
Access

I/O Parallelism
Operator Evaluator Optimizer
RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Transaction Files and Access Methods Solid-State Disks
Network-Based Storage
Manager
Recovery Managing Space
Buffer Manager
Manager Free Space Management

Lock Manager Buffer Manager


Disk Space Manager Pinning and Unpinning
Replacement Policies

Databases vs.
DBMS Operating Systems

Files and Records


data files, indices, . . . Database Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
2
Storage
The Memory Hierarchy
Torsten Grust

capacity latency
CPU
(with registers) bytes < 1 ns
Magnetic Disks
Access Time
caches kilo-/megabytes < 10 ns Sequential vs. Random
Access

I/O Parallelism
main memory gigabytes 70–100 ns RAID Levels 1, 0, and 5

Alternative Storage
hard disks terabytes 3–10 ms Techniques
Solid-State Disks
Network-Based Storage
tape library petabytes varies Managing Space
Free Space Management

Buffer Manager

• Fast—but expensive and small—memory close to CPU Pinning and Unpinning


Replacement Policies

• Larger, slower memory at the periphery Databases vs.


Operating Systems
• DBMSs try to hide latency by using the fast memory as a Files and Records

cache. Heap Files


Free Space Management
Inside a Page
Alternative Page Layouts

Recap
3
Storage
Magnetic Disks
Torsten Grust

rotation
arm
track

Magnetic Disks
Access Time
Sequential vs. Random
Access

I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
block sector Techniques

platter heads Solid-State Disks


Network-Based Storage

Managing Space
Free Space Management

Photo: http://www.metallurgy.utah.edu/
A stepper motor positions an array of Buffer Manager

disk heads on the requested track Pinning and Unpinning


Replacement Policies

• Platters (disks) steadily rotate Databases vs.


Operating Systems
• Disks are managed in blocks: the system Files and Records

reads/writes data one block at a time Heap Files


Free Space Management
Inside a Page
Alternative Page Layouts

Recap
4
Storage
Access Time
Torsten Grust

Data blocks can only be read and written if disk heads


and platters are positioned accordingly.
Magnetic Disks

• This design has implications on the access time to


Access Time
Sequential vs. Random
Access
read/write a given block: I/O Parallelism
RAID Levels 1, 0, and 5
Definition (Access Time) Alternative Storage
Techniques

1 Move disk arms to desired track (seek time ts ) Solid-State Disks


Network-Based Storage

2 Disk controller waits for desired block to rotate under Managing Space
disk head (rotational delay tr ) Free Space Management

Buffer Manager
3 Read/write data (transfer time ttr ) Pinning and Unpinning
Replacement Policies

Databases vs.
⇒ access time: t = ts + tr + ttr Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
5
Storage
Example: Seagate Cheetah 15K.7
Torsten Grust
(600 GB, server-class drive)

• Seagate Cheetah 15K.7 performance characteristics:


• 4 disks, 8 heads, avg. 512 kB/track, 600 GB capacity
• rotational speed: 15 000 rpm (revolutions per minute) Magnetic Disks
Access Time
• average seek time: 3.4 ms Sequential vs. Random
Access
• transfer rate ≈ 163 MB/s I/O Parallelism
RAID Levels 1, 0, and 5

 What is the access time to read an 8 KB data block? Alternative Storage


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
6
Storage
Example: Seagate Cheetah 15K.7
Torsten Grust
(600 GB, server-class drive)

• Seagate Cheetah 15K.7 performance characteristics:


• 4 disks, 8 heads, avg. 512 kB/track, 600 GB capacity
• rotational speed: 15 000 rpm (revolutions per minute) Magnetic Disks
Access Time
• average seek time: 3.4 ms Sequential vs. Random
Access
• transfer rate ≈ 163 MB/s I/O Parallelism
RAID Levels 1, 0, and 5

 What is the access time to read an 8 KB data block? Alternative Storage


Techniques
Solid-State Disks

average seek time ts = 3.40 ms Network-Based Storage

1 1 Managing Space
average rotational delay: 2 · 15 000 min−1 tr = 2.00 ms Free Space Management

8 kB Buffer Manager
transfer time for 8 KB: 163 MB/s ttr = 0.05 ms Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems
access time for an 8 kB data block t = 5.45 ms
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
6
Storage
Sequential vs. Random Access
Torsten Grust

Example (Read 1 000 blocks of size 8 kB)

• random access:
Magnetic Disks
trnd = 1 000 · 5.45 ms = 5.45 s Access Time

• sequential read of adjacent blocks:


Sequential vs. Random
Access

I/O Parallelism
tseq = ts + tr + 1 000 · ttr + 16 · ts,track-to-track RAID Levels 1, 0, and 5

= 3.40 ms + 2.00 ms + 50 ms + 3.2 ms ≈ 58.6 ms Alternative Storage


Techniques
The Seagate Cheetah 15K.7 stores an average of 512 kB per Solid-State Disks
Network-Based Storage
track, with a 0.2 ms track-to-track seek time; our 8 kB blocks
Managing Space
are spread across 16 tracks. Free Space Management

Buffer Manager
Pinning and Unpinning

⇒ Sequential I/O is much faster than random I/O Replacement Policies

Databases vs.


⇒ Avoid random I/O whenever possible Operating Systems

Files and Records


58.6 ms
⇒ As soon as we need at least 5,450 ms = 1.07 % of a file,
Heap Files
Free Space Management
we better read the entire file sequentially Inside a Page
Alternative Page Layouts

Recap
7
Storage
Performance Tricks
Torsten Grust

• Disk manufacturers play a number of tricks to improve


performance:
track skewing Magnetic Disks

Align sector 0 of each track to avoid Access Time


Sequential vs. Random

rotational delay during longer Access

I/O Parallelism
sequential scans RAID Levels 1, 0, and 5

Alternative Storage
Techniques

request scheduling Solid-State Disks


Network-Based Storage

If multiple requests have to be served, choose the one Managing Space

that requires the smallest arm movement (SPTF: shortest Free Space Management

Buffer Manager
positioning time first, elevator algorithms) Pinning and Unpinning
Replacement Policies

Databases vs.
zoning Operating Systems

Files and Records


Outer tracks are longer than the inner ones. Therefore, Heap Files

divide outer tracks into more sectors than inner tracks Free Space Management
Inside a Page
Alternative Page Layouts

Recap
8
Storage
Evolution of Hard Disk Technology
Torsten Grust

Disk seek and rotational latencies have only marginally improved


over the last years (≈ 10 % per year)
Magnetic Disks

But: Access Time


Sequential vs. Random
Access
• Throughput (i.e., transfer rates) improve by ≈ 50 % per year I/O Parallelism

• Hard disk capacity grows by ≈ 50 % every year


RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks

Therefore: Network-Based Storage

Managing Space
• Random access cost hurts even more as time progresses Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Example (5 Years Ago: Seagate Barracuda 7200.7) Databases vs.


Operating Systems
Read 1K blocks of 8 kB sequentially/randomly: 397 ms / 12 800 ms Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
9
Storage
Ways to Improve I/O Performance
Torsten Grust

The latency penalty is hard to avoid

But:
• Throughput can be increased rather easily by exploiting Magnetic Disks
Access Time

parallelism Sequential vs. Random


Access

• Idea: Use multiple disks and access them in parallel, try to I/O Parallelism
RAID Levels 1, 0, and 5
hide latency Alternative Storage
Techniques
Solid-State Disks
TPC-C: An industry benchmark for OLTP Network-Based Storage

Managing Space
A recent #1 system (IBM DB2 9.5 on AIX) uses Free Space Management

Buffer Manager
• 10,992 disk drives (73.4 GB each, 15,000 rpm) (!) Pinning and Unpinning
Replacement Policies
plus 8 146.8 GB internal SCSI drives, Databases vs.
• connected with 68 4 Gbit fibre channel adapters, Operating Systems

Files and Records


• yielding 6 mio transactions per minute Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
10
Storage
Disk Mirroring
Torsten Grust

• Replicate data onto multiple disks:

Magnetic Disks
1 2 3 4 5 Access Time

6 7 8 9 ··· Sequential vs. Random


Access

1 2 3 4 5 I/O Parallelism
RAID Levels 1, 0, and 5
6 7 8 9 ···
Alternative Storage
Techniques
1 2 3 4 5
Solid-State Disks
6 7 8 9 ··· Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning

• Achieves I/O parallelism only for reads Replacement Policies

Databases vs.
• Improved failure tolerance—can survive one disk failure Operating Systems

• This is also known as RAID 1 (mirroring without parity)


Files and Records
Heap Files

(RAID: Redundant Array of Inexpensive Disks) Free Space Management


Inside a Page
Alternative Page Layouts

Recap
11
Storage
Disk Striping
Torsten Grust

• Distribute data over disks:


Magnetic Disks
Access Time
1 2 3 4 5 Sequential vs. Random
Access
6 7 8 9 ···
I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Techniques
1 4 7 ··· 2 5 8 ··· 3 6 9 ··· Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

Buffer Manager

• Full I/O parallelism for read and write operations


Pinning and Unpinning
Replacement Policies

• High failure risk (here: 3 times risk of single disk failure)! Databases vs.
Operating Systems

• Also known as RAID 0 (striping without parity) Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
12
Storage
Disk Striping with Parity
Torsten Grust

• Distribute data and parity information over > 3 disks:

1 2 3 4 5 Magnetic Disks
6 7 8 9 ··· Access Time
Sequential vs. Random
Access

I/O Parallelism
RAID Levels 1, 0, and 5

1 3 5/6 7 · · · 2 3/4 5 8 · · · 1/2 4 6 7/8 · · · Alternative Storage


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

• High I/O parallelism Buffer Manager

• Fault tolerance: any one disk may fail without data loss
Pinning and Unpinning
Replacement Policies

(with dual parity/RAID 6: two disks may fail) Databases vs.


Operating Systems
• Distribute parity (e.g., XOR) information over disks, separating Files and Records

data and associated parity Heap Files


Free Space Management

• Also known as RAID 5 (striping with distributed parity) Inside a Page


Alternative Page Layouts

Recap
13
Storage
Solid-State Disks
Torsten Grust

Solid state disks (SSDs) have emerged as an alternative to


conventional hard disks

• SSDs provide very low-latency


Magnetic Disks
random read access (< 0.01 ms) Access Time

• Random writes, however, are


Sequential vs. Random

write
Access

time
significantly slower than on I/O Parallelism

read
write
RAID Levels 1, 0, and 5
traditional magnetic drives:

read
Alternative Storage
Techniques
1 (Blocks of ) Pages have to be flash mag. disk
Solid-State Disks
Network-Based Storage
erased before they can be
Managing Space
updated Free Space Management

2 Once pages have been Buffer Manager


Pinning and Unpinning
erased, sequentially writing Replacement Policies

them is almost as fast as Databases vs.


Operating Systems
reading
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
14
Storage
SSDs: Page-Level Writes, Block-Level Deletes
Torsten Grust

• Typical page size: 128 kB


• SSDs erase blocks of pages: block ≈ 64 pages (8 MB)

Example (Perform block-level delete to accomodate new data pages)


Magnetic Disks
Access Time
Sequential vs. Random
Access

I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage
Illustration taken from arstechnica.com (The SSD Revolution)

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
15
Storage
Example: Seagate Pulsar.2
Torsten Grust
(800 GB, server-class solid state drive)

• Seagate Pulsar.2 performance characteristics:


• NAND flash memory, 800 GB capacity
• standard 2.500 enclosure, no moving/rotating parts Magnetic Disks
Access Time
• data read/written in pages of 128 kB size Sequential vs. Random
Access
• transfer rate ≈ 370 MB/s I/O Parallelism
RAID Levels 1, 0, and 5

 What is the access time to read an 8 KB data block? Alternative Storage


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
16
Storage
Example: Seagate Pulsar.2
Torsten Grust
(800 GB, server-class solid state drive)

• Seagate Pulsar.2 performance characteristics:


• NAND flash memory, 800 GB capacity
• standard 2.500 enclosure, no moving/rotating parts Magnetic Disks
Access Time
• data read/written in pages of 128 kB size Sequential vs. Random
Access
• transfer rate ≈ 370 MB/s I/O Parallelism
RAID Levels 1, 0, and 5

 What is the access time to read an 8 KB data block? Alternative Storage


Techniques
Solid-State Disks

no seek time ts = 0.00 ms Network-Based Storage

Managing Space
no rotational delay: tr = 0.00 ms Free Space Management

128 kB
transfer time for 8 KB: 370 MB/s ttr = 0.30 ms Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems
access time for an 8 kB data block t = 0.30 ms
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
16
Storage
Sequential vs. Random Access with SSDs
Torsten Grust

Example (Read 1 000 blocks of size 8 kB)

• random access: Magnetic Disks


Access Time
trnd = 1 000 · 0.30 ms = 0.3 s Sequential vs. Random
Access

• sequential read of adjacent pages: I/O Parallelism


 · 8 kB
 RAID Levels 1, 0, and 5
tseq = 1 000
128 kB · ttr ≈ 18.9 ms Alternative Storage
Techniques

The Seagate Pulsar.2 (sequentially) reads data in 128 kB Solid-State Disks


Network-Based Storage

chunks. Managing Space


Free Space Management

Buffer Manager
⇒ Sequential I/O still beats random I/O Pinning and Unpinning
Replacement Policies
(but random I/O is more feasible again) Databases vs.

• Adapting database technology to these characteristics is a


Operating Systems

Files and Records


current research topic Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
17
Storage
Network-Based Storage
Torsten Grust

Today the network is not a bottleneck any more:

Magnetic Disks
Storage media/interface Transfer rate Access Time
Sequential vs. Random
Hard disk 100–200 MB/s Access

Serial ATA 375 MB/s I/O Parallelism


RAID Levels 1, 0, and 5
Ultra-640 SCSI 640 MB/s Alternative Storage
10-Gbit Ethernet 1,250 MB/s Techniques
Solid-State Disks
Infiniband QDR 12,000 MB/s Network-Based Storage

Managing Space
Free Space Management
For comparison (RAM):
Buffer Manager
PC2–5300 DDR2–SDRAM 10.6 GB/s Pinning and Unpinning

PC3–12800 DDR3–SDRAM 25.6 GB/s Replacement Policies

Databases vs.
Operating Systems

Files and Records


⇒ Why not use the network for database storage? Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
18
Storage
Storage Area Network (SAN)
Torsten Grust

• Block-based network access to storage:


• SAN emulate interface of block-structured disks Magnetic Disks
(“read block #42 of disk 10”) Access Time

• This is unlike network file systems (e.g., NFS, CIFS)


Sequential vs. Random
Access

I/O Parallelism
RAID Levels 1, 0, and 5

• SAN storage devices typically abstract from RAID or physical Alternative Storage
Techniques
disks and present logical drives to the DBMS Solid-State Disks
Network-Based Storage

• Hardware acceleration and simplified maintainability Managing Space


Free Space Management

Buffer Manager

• Typical setup: local area network with multiple participating


Pinning and Unpinning
Replacement Policies

servers and storage resources Databases vs.


Operating Systems

• Failure tolerance and increased flexibility Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
19
Storage
Grid or Cloud Storage
Torsten Grust

Internet-scale enterprises employ clusters with 1000s commodity


PCs (e.g., Amazon, Google, Yahoo!):
• system cost ↔ reliability and performance,
Magnetic Disks
• use massive replication for data storage Access Time
Sequential vs. Random

Spare CPU cycles and disk space are sold as a service: Access

I/O Parallelism

Amazon’s Elastic Computing Cloud (EC2) RAID Levels 1, 0, and 5

Alternative Storage
Use Linux-based compute cluster by the hour (∼ 10 ¢/h). Techniques
Solid-State Disks
Amazon’s Simple Storage System (S3) Network-Based Storage

“Infinite” store for objects between 1 B and 5 TB in size, Managing Space


Free Space Management
organized in a map data structure (key 7→ object) Buffer Manager
• Latency: 100 ms to 1 s (not impacted by load) Pinning and Unpinning
Replacement Policies
• pricing ≈ disk drives (but addl. cost for access) Databases vs.
Operating Systems

⇒ Building a database on S3? Files and Records


Heap Files
(% Brantner et al., SIGMOD 2008) Free Space Management
Inside a Page
Alternative Page Layouts

Recap
20
Storage
Managing Space
Torsten Grust

Web Forms Applications SQL Interface

SQL Commands
Magnetic Disks
Executor Parser Access Time
Sequential vs. Random
Access
Operator Evaluator Optimizer
I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Transaction Files and Access Methods Techniques
Solid-State Disks
Manager
Recovery Network-Based Storage
Buffer Manager
Manager Managing Space
Free Space Management
Lock Manager
Disk Space Manager Buffer Manager
Pinning and Unpinning
Replacement Policies
DBMS Databases vs.
Operating Systems

data files, indices, . . . Database Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
21
Storage
Managing Space
Torsten Grust

Definition (Disk Space Manager)

• Abstracts from the gory details of the underlying storage


(disk space manager talks to I/O controller and initiates I/O) Magnetic Disks
Access Time

• DBMS issues allocate/deallocate and read/write Sequential vs. Random


Access

commands I/O Parallelism

• Provides the concept of a page (typically 4–64 KB) as a unit


RAID Levels 1, 0, and 5

Alternative Storage
of storage to the remaining system components Techniques
Solid-State Disks

• Maintains a locality-preserving mapping Network-Based Storage

Managing Space
Free Space Management
page number 7→ physical location , Buffer Manager
Pinning and Unpinning

where a physical location could be, e.g., Replacement Policies

• an OS file name and an offset within that file, Databases vs.


Operating Systems

• head, sector, and track of a hard drive, or Files and Records

• tape number and offset for data stored in a tape library Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
22
Storage
Empty Pages
Torsten Grust

The disk space manager also keeps track of used/free blocks


(deallocation and subsequent allocation may create holes):
1 Maintain a linked list of free pages
• When a page is no longer needed, add it to the list Magnetic Disks
Access Time
Sequential vs. Random
2 Maintain a bitmap reserving one bit for each page Access

• Toggle bit n when page n is (de-)allocated I/O Parallelism


RAID Levels 1, 0, and 5

Alternative Storage

 Allocation of contiguous pages


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
To exploit sequential access, it is useful to allocate contiguous Free Space Management

sequences of pages. Buffer Manager


Pinning and Unpinning
Which of the techniques (1 or 2) would you choose to support Replacement Policies

this? Databases vs.


Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
23
Storage
Empty Pages
Torsten Grust

The disk space manager also keeps track of used/free blocks


(deallocation and subsequent allocation may create holes):
1 Maintain a linked list of free pages
• When a page is no longer needed, add it to the list Magnetic Disks
Access Time
Sequential vs. Random
2 Maintain a bitmap reserving one bit for each page Access

• Toggle bit n when page n is (de-)allocated I/O Parallelism


RAID Levels 1, 0, and 5

Alternative Storage

 Allocation of contiguous pages


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
To exploit sequential access, it is useful to allocate contiguous Free Space Management

sequences of pages. Buffer Manager


Pinning and Unpinning
Which of the techniques (1 or 2) would you choose to support Replacement Policies

this? Databases vs.


Operating Systems

Files and Records


This is a lot easier to do with a free page bitmap (option 2). Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
23
Storage
Buffer Manager
Torsten Grust

Web Forms Applications SQL Interface

SQL Commands
Magnetic Disks
Executor Parser Access Time
Sequential vs. Random
Access
Operator Evaluator Optimizer
I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Transaction Files and Access Methods Techniques
Solid-State Disks
Manager
Recovery Network-Based Storage
Buffer Manager
Manager Managing Space
Free Space Management
Lock Manager
Disk Space Manager Buffer Manager
Pinning and Unpinning
Replacement Policies
DBMS Databases vs.
Operating Systems

data files, indices, . . . Database Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
24
Storage
Buffer Manager
Torsten Grust

page requests Definition (Buffer Manager)


main
memory • Mediates between external Magnetic Disks

storage and main memory, Access Time

6 1 4 Sequential vs. Random

3
• Manages a designated main
Access

I/O Parallelism
memory area, the buffer RAID Levels 1, 0, and 5

7 pool, for this task Alternative Storage


Techniques
Solid-State Disks
Network-Based Storage
disk page free frame Disk pages are brought into Managing Space

memory as needed and loaded Free Space Management

Buffer Manager
into memory frames Pinning and Unpinning
1 2 3 4 5 6 Replacement Policies

7 8 9 10 11 · · · A replacement policy decides Databases vs.


Operating Systems

which page to evict when the Files and Records


disk buffer is full Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
25
Storage
Interface to the Buffer Manager
Torsten Grust

Higher-level code requests (pins) pages from the buffer manager


and releases (unpins) pages after use.
pin (pageno)
Magnetic Disks
Request page number pageno from the buffer manager, load Access Time

it into memory if necessary and mark page as clean (¬dirty). Sequential vs. Random
Access

Returns a reference to the frame containing pageno. I/O Parallelism


RAID Levels 1, 0, and 5

unpin (pageno, dirty) Alternative Storage


Techniques
Release page number pageno, making it a candidate for Solid-State Disks

eviction. Must set dirty = true if page was modified. Network-Based Storage

Managing Space
Free Space Management

 Why do we need the dirty bit? Buffer Manager


Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
26
Storage
Interface to the Buffer Manager
Torsten Grust

Higher-level code requests (pins) pages from the buffer manager


and releases (unpins) pages after use.
pin (pageno)
Magnetic Disks
Request page number pageno from the buffer manager, load Access Time

it into memory if necessary and mark page as clean (¬dirty). Sequential vs. Random
Access

Returns a reference to the frame containing pageno. I/O Parallelism


RAID Levels 1, 0, and 5

unpin (pageno, dirty) Alternative Storage


Techniques
Release page number pageno, making it a candidate for Solid-State Disks

eviction. Must set dirty = true if page was modified. Network-Based Storage

Managing Space
Free Space Management

 Why do we need the dirty bit? Buffer Manager


Pinning and Unpinning
Replacement Policies

Only modified pages need to be written back to disk upon Databases vs.
Operating Systems
eviction. Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
26
Storage
Proper pin ()/unpin () Nesting
Torsten Grust

• Any database transaction is required to properly “bracket”


any page operation using pin () and unpin () calls: Magnetic Disks
Access Time

A read-only transaction Sequential vs. Random


Access

a ← pin (p); I/O Parallelism


 RAID Levels 1, 0, and 5

..
 Alternative Storage
. Techniques

read data on page at memory address a ; Solid-State Disks



Network-Based Storage

.. Managing Space


. Free Space Management

unpin (p, false); Buffer Manager


Pinning and Unpinning
Replacement Policies

• Proper bracketing enables the systems to keep a count of Databases vs.


Operating Systems
active users (e.g., transactions) of a page
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
27
Storage
Implementation of pin ()
Torsten Grust

Function pin(pageno)
1 if buffer pool already contains pageno then Magnetic Disks

2 pinCount (pageno) ← pinCount (pageno) + 1 ; Access Time


Sequential vs. Random
Access
3 return address of frame holding pageno ;
I/O Parallelism
4 else RAID Levels 1, 0, and 5

5 select a victim frame v using the replacement policy ; Alternative Storage


Techniques
6 if dirty (page in v) then Solid-State Disks
Network-Based Storage
7 write page in v to disk ;
Managing Space

8 read page pageno from disk into frame v ; Free Space Management

Buffer Manager
9 pinCount (pageno) ← 1 ; Pinning and Unpinning

10 dirty (pageno) ← false ; Replacement Policies

11 return address of frame v ; Databases vs.


Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
28
Storage
Implementation of unpin ()
Torsten Grust

Function unpin(pageno, dirty)


1 pinCount (pageno) ← pinCount (pageno) − 1 ;
Magnetic Disks
2 dirty (pageno) ← dirty (pageno) ∨ dirty ; Access Time
Sequential vs. Random
Access

I/O Parallelism

 Why don’t we write pages back to disk during unpin ()? RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
29
Storage
Implementation of unpin ()
Torsten Grust

Function unpin(pageno, dirty)


1 pinCount (pageno) ← pinCount (pageno) − 1 ;
Magnetic Disks
2 dirty (pageno) ← dirty (pageno) ∨ dirty ; Access Time
Sequential vs. Random
Access

I/O Parallelism

 Why don’t we write pages back to disk during unpin ()? RAID Levels 1, 0, and 5

Alternative Storage
Techniques

Well, we could . . . Solid-State Disks


Network-Based Storage

+ recovery from failure would be a lot simpler Managing Space

– higher I/O cost (every page write implies a write to disk) Free Space Management

Buffer Manager
– bad response time for writing transactions Pinning and Unpinning
Replacement Policies
This discussion is also known as force (or write-through) vs. Databases vs.
write-back. Actual database systems typically implement Operating Systems

write-back. Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
29
Storage
Concurrent Writes?
Torsten Grust

Conflicting changes to a block


Assume the following: Magnetic Disks
Access Time

1 The same page p is requested by more than one transaction Sequential vs. Random
Access

(i.e., pinCount (p) > 1), and I/O Parallelism


RAID Levels 1, 0, and 5
2 . . . those transactions perform conflicting writes on p? Alternative Storage
Techniques
Solid-State Disks

Conflicts of this kind are resolved by the system’s concurrency Network-Based Storage

Managing Space
control, a layer on top of the buffer manager (see “Introduction Free Space Management

to Database Systems”, transaction management, locks). Buffer Manager


Pinning and Unpinning
Replacement Policies
The buffer manager may assume that everything is in order Databases vs.
whenever it receives an unpin (p, true) call. Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
30
Storage
Replacement Policies
Torsten Grust

The effectiveness of the buffer manager’s caching functionality


can depend on the replacement policy it uses, e.g.,
Least Recently Used (LRU)
Evict the page whose latest unpin () is longest ago
Magnetic Disks

LRU-k Access Time


Sequential vs. Random

Like LRU, but considers the k latest unpin () calls, not just Access

I/O Parallelism
the latest RAID Levels 1, 0, and 5

Alternative Storage
Most Recently Used (MRU) Techniques

Evict the page that has been unpinned most recently Solid-State Disks
Network-Based Storage

Random Managing Space


Free Space Management
Pick a victim randomly Buffer Manager
Pinning and Unpinning

 Rationales behind each of these policies?


Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
31
Storage
Example Policy: Clock (“Second Chance”)
Torsten Grust

• Simulate an LRU policy with less overhead (no LRU queue


reorganization on every frame reference):
Magnetic Disks
Access Time
Clock (“Second Chance”) Sequential vs. Random
Access

I/O Parallelism
1 Number the N frames in the buffer pool 0, . . . , N − 1; RAID Levels 1, 0, and 5

initialize current ← 0, maintain a bit array Alternative Storage


Techniques
referenced[0, . . . , N − 1] initialized to all 0 Solid-State Disks
Network-Based Storage
2 In pin(p): load p into buffer pool (if needed), assign
Managing Space
referenced[frame-of(p)] ← 1 Free Space Management

Buffer Manager
3 In pin(p): if we need to find a victim, consider page Pinning and Unpinning

current; if referenced[current] = 0, current is the Replacement Policies

victim; otherwise, referenced[current] ← 0, Databases vs.


Operating Systems
current ← current + 1 mod N, repeat 3 Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
32
Storage
Heuristic Policies Can Fail
Torsten Grust

The mentioned policies, including LRU, are heuristics 


only and may fail miserably in certain scenarios. Magnetic Disks
Access Time
Sequential vs. Random
Example (A Challenge for LRU) Access

I/O Parallelism
A number of transactions want to scan the same sequence of RAID Levels 1, 0, and 5

pages (consider a repeated SELECT * FROM R). Alternative Storage


Techniques
Assume a buffer pool capacity of 10 pages. Solid-State Disks
Network-Based Storage
1 Let the size of relation R be 10 or less pages. Managing Space
How many I/O (actual disk page reads) do you expect? Free Space Management

Buffer Manager
2 Now grow R by one page. Pinning and Unpinning

How about the number of I/O operations in this case? Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
33
Storage
More Challenges for LRU
Torsten Grust

1 Transaction T1 repeatedly accesses a fixed set of pages;


transaction T2 performs a sequential scan of the database
pages. Magnetic Disks

2 Assume a B+-tree-indexed relation R. R occupies 10,000 data Access Time


Sequential vs. Random

pages Ri , the B+-tree occupies one root node and 100 index Access

I/O Parallelism
leaf nodes Ik . RAID Levels 1, 0, and 5

Transactions perform repeated random index key lookups on Alternative Storage

R ⇒ page access pattern (ignores B+-tree root node):


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
I1 , R1 , I2 , R2 , I3 , R3 , . . . Free Space Management

Buffer Manager

 How will LRU perform in this case?


Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
34
Storage
More Challenges for LRU
Torsten Grust

1 Transaction T1 repeatedly accesses a fixed set of pages;


transaction T2 performs a sequential scan of the database
pages. Magnetic Disks

2 Assume a B+-tree-indexed relation R. R occupies 10,000 data Access Time


Sequential vs. Random

pages Ri , the B+-tree occupies one root node and 100 index Access

I/O Parallelism
leaf nodes Ik . RAID Levels 1, 0, and 5

Transactions perform repeated random index key lookups on Alternative Storage

R ⇒ page access pattern (ignores B+-tree root node):


Techniques
Solid-State Disks
Network-Based Storage

Managing Space
I1 , R1 , I2 , R2 , I3 , R3 , . . . Free Space Management

Buffer Manager

 How will LRU perform in this case?


Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems
With LRU, 50 % of the buffered pages will be pages of R. However, Files and Records
the probability of re-accessing page Ri only is 1/10,000. Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
34
Storage
Buffer Management in Practice
Torsten Grust

Prefetching
Buffer managers try to anticipate page requests to overlap
CPU and I/O operations:
• Speculative prefetching: Assume sequential scan and Magnetic Disks
Access Time

automatically read ahead. Sequential vs. Random


Access
• Prefetch lists: Some database algorithms can instruct I/O Parallelism
the buffer manager with a list of pages to prefetch. RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Page fixing/hating Network-Based Storage

Higher-level code may request to fix a page if it may be Managing Space


Free Space Management
useful in the near future (e.g., nested-loop join). Buffer Manager
Pinning and Unpinning
Likewise, an operator that hates a page will not access it any Replacement Policies

time soon (e.g., table pages in a sequential scan). Databases vs.


Operating Systems

Files and Records

Partitioned buffer pools Heap Files


Free Space Management

E.g., maintain separate pools for indexes and tables. Inside a Page
Alternative Page Layouts

Recap
35
Storage
Databases vs. Operating Systems
Torsten Grust


Wait! Didn’t we just re-invent the operating system?
Magnetic Disks
Yes, Access Time

• disk space management and buffer management very much


Sequential vs. Random
Access

look like file management and virtual memory in OSs. I/O Parallelism
RAID Levels 1, 0, and 5

But, Alternative Storage


Techniques
• a DBMS may be much more aware of the access patterns of Solid-State Disks
Network-Based Storage
certain operators (prefetching, page fixing/hating), Managing Space

• concurrency control often calls for a prescribed order of Free Space Management

Buffer Manager
write operations, Pinning and Unpinning

• technical reasons may make OS tools unsuitable for a


Replacement Policies

Databases vs.
database (e.g., file size limitation, platform independence). Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
36
Storage
Databases vs. Operating Systems
Torsten Grust

In fact, databases and operating systems sometimes interfere:


• Operating system and buffer manager
effectively buffer the same data twice.
DBMS
buffer Magnetic Disks

• Things get really bad if parts of the Access Time


Sequential vs. Random

DBMS buffer get swapped out to disk Access

by OS VM manager. OS I/O Parallelism


buffer RAID Levels 1, 0, and 5

Alternative Storage
Techniques
• Therefore, database systems try to Solid-State Disks
Network-Based Storage
turn off certain OS features or Managing Space
services. disk Free Space Management

Buffer Manager
Pinning and Unpinning

⇒ Raw disk access instead of OS files. Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
37
Storage
Files and Records
Torsten Grust

Web Forms Applications SQL Interface

SQL Commands
Magnetic Disks
Executor Parser Access Time
Sequential vs. Random
Access
Operator Evaluator Optimizer
I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Transaction Files and Access Methods Techniques
Solid-State Disks
Manager
Recovery Network-Based Storage
Buffer Manager
Manager Managing Space
Free Space Management
Lock Manager
Disk Space Manager Buffer Manager
Pinning and Unpinning
Replacement Policies
DBMS Databases vs.
Operating Systems

data files, indices, . . . Database Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
38
Storage
Database Files
Torsten Grust

• So far we have talked about pages. Their management is


oblivious with respect to their actual content.
• On the conceptual level, a DBMS primarily manages tables
of tuples and indexes.
• Such tables are implemented as files of records:
Magnetic Disks
Access Time

• A file consists of one or more pages, Sequential vs. Random


Access

• each page contains one or more records, I/O Parallelism

• each record corresponds to one tuple: RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management
file 0
Buffer Manager
Pinning and Unpinning
free Replacement Policies

Databases vs.
page 0 page 1 page 2 page 3 Operating Systems
file 1 Files and Records
Heap Files

free Free Space Management


Inside a Page
Alternative Page Layouts
page 4 page 5 page 6 page 7
Recap
39
Database Heap Files tumpukan file Storage

Torsten Grust
The most important type of files in a database is the heap file. It
stores records in no particular order (in line with, e.g., the SQL
semantics): penyimpanan gk pake urutan jelas
Typical heap file interface
Magnetic Disks

• create/destroy heap file f named n:


Access Time
Sequential vs. Random
Access
f = createFile(n), deleteFile(n) I/O Parallelism

• insert record r and return its rid: RAID Levels 1, 0, and 5

Alternative Storage
rid = insertRecord(f ,r) Techniques

• delete a record with a given rid:


Solid-State Disks
Network-Based Storage

deleteRecord(f ,rid) Managing Space


Free Space Management

• get a record with a given rid: Buffer Manager

r = getRecord(f ,rid) Pinning and Unpinning


Replacement Policies

• initiate a sequential scan over the whole heap file: Databases vs.
Operating Systems
openScan(f ) Files and Records
Heap Files
Free Space Management
N.B. Record ids (rid) are used like record addresses (or pointers). Inside a Page
Alternative Page Layouts
The heap file structure maps a given rid to the page containing
Recap
the record. 40
Storage
Heap Files
Torsten Grust

(Doubly) Linked list of pages:


Header page allocated when createFile(n) is
Magnetic Disks
called—initially both page lists are empty: Access Time
Sequential vs. Random
Access

data data data pages w/ I/O Parallelism


··· RAID Levels 1, 0, and 5
page page page free space Alternative Storage
header Techniques
page Solid-State Disks
Network-Based Storage
data data data
page page
···
page full pages Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning

+ easy to implement Replacement Policies

Databases vs.
– most pages will end up in free page list Operating Systems

Files and Records


– might have to search many pages to place a (large) record Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
41
Storage
Heap Files
Torsten Grust
Operation insertRecord(f ,r) for linked list of pages

1 Try to find a page p in the free list with free space > | r |;
should this fail, ask the disk space manager to allocate a new
page p Magnetic Disks
Access Time
2 Write record r to page p Sequential vs. Random
Access
3 Since, generally, | r || p |, p will belong to the list of pages I/O Parallelism
with free space RAID Levels 1, 0, and 5

Alternative Storage
4 A unique rid for r is generated and returned to the caller Techniques
Solid-State Disks
Network-Based Storage

 Generating sensible record ids (rid) Managing Space


Free Space Management

Buffer Manager
Given that rids are used like record addresses: what would be a Pinning and Unpinning
Replacement Policies
feasible rid generation method? Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
42
Storage
Heap Files
Torsten Grust
Operation insertRecord(f ,r) for linked list of pages

1 Try to find a page p in the free list with free space > | r |;
should this fail, ask the disk space manager to allocate a new
page p Magnetic Disks
Access Time
2 Write record r to page p Sequential vs. Random
Access
3 Since, generally, | r || p |, p will belong to the list of pages I/O Parallelism
with free space RAID Levels 1, 0, and 5

Alternative Storage
4 A unique rid for r is generated and returned to the caller Techniques
Solid-State Disks
Network-Based Storage

 Generating sensible record ids (rid) Managing Space


Free Space Management

Buffer Manager
Given that rids are used like record addresses: what would be a Pinning and Unpinning
Replacement Policies
feasible rid generation method? Databases vs.
Generate a composite rid consisting of the address of page p and Operating Systems

the placement (offset/slot) of r inside p: Files and Records


Heap Files
Free Space Management

hpageno p, slotno ri Inside a Page


Alternative Page Layouts

Recap
42
Storage
Heap Files
Torsten Grust

Directory of pages:

Magnetic Disks
data Access Time

page data Sequential vs. Random


Access
page
I/O Parallelism
data RAID Levels 1, 0, and 5

page Alternative Storage


Techniques
···
Solid-State Disks
Network-Based Storage

• Use as space map with information about free space on Managing Space
Free Space Management

each page Buffer Manager

• granularity as trade-off space ↔ accuracy Pinning and Unpinning


Replacement Policies

(may range from open/closed bit to exact information) Databases vs.


Operating Systems
+ free space search more efficient Files and Records
Heap Files
– memory overhead to host the page directory Free Space Management
Inside a Page
Alternative Page Layouts

Recap
43
Storage
Free Space Management
Torsten Grust

Which page to pick for the insertion of a new record?


Append Only
Magnetic Disks
Always insert into last page. Otherwise, create a new page. Access Time
Sequential vs. Random
Best Fit Access

Reduces fragmentation, but requires searching the entire I/O Parallelism


RAID Levels 1, 0, and 5

free list/space map for each insert. Alternative Storage


Techniques
First Fit Solid-State Disks
Network-Based Storage
Search from beginning, take first page with sufficient space. Managing Space
(⇒ These pages quickly fill up, system may waste a lot of Free Space Management

search effort in these first pages later on.) Buffer Manager


Pinning and Unpinning

Next Fit Replacement Policies

Databases vs.
Maintain cursor and continue searching where search Operating Systems

stopped last time. Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
44
Storage
Free Space Witnesses
Torsten Grust

We can accelerate the search by remembering witnesses: Magnetic Disks


Access Time

• Classify pages into buckets, e.g., “75 %–100 % full”, Sequential vs. Random
Access

“50 %–75 % full”, “25 %–50 % full”, and “0 %–25 % full”. I/O Parallelism
RAID Levels 1, 0, and 5
• For each bucket, remember some witness pages. Alternative Storage

• Do a regular best/first/next fit search only if no witness is


Techniques
Solid-State Disks
Network-Based Storage
recorded for the specific bucket.
Managing Space
• Populate witness information, e.g., as a side effect when Free Space Management

searching for a best/first/next fit page. Buffer Manager


Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
45
Storage
Inside a Page — Fixed-Length Records
Torsten Grust

Now turn to the internal page structure:

ID NAME SEX 4711John M1723M


arc M6381Betty Magnetic Disks
4711 John M F Access Time

1723 Marc M Sequential vs. Random


Access

6381 Betty F I/O Parallelism


RAID Levels 1, 0, and 5

Alternative Storage

• Record identifier (rid): Techniques


Solid-State Disks

hpageno, slotnoi Network-Based Storage

Managing Space
• Record position (within page): Free Space Management

slotno × bytes per slot Buffer Manager


Pinning and Unpinning
no. of records
• Record deletion? in this page
Replacement Policies

Databases vs.
• record id should not Operating Systems

Files and Records


change Heap Files

⇒ slot directory (bitmap) 3 Header Free Space Management


Inside a Page
Alternative Page Layouts

Recap
46
Storage
Inside a Page — Fixed-Length Records
Torsten Grust

Now turn to the internal page structure:

ID NAME SEX 4711John M1723M


arc M6381Betty Magnetic Disks
4711 John M F Access Time

1723 Marc M Sequential vs. Random


Access

6381 Betty F I/O Parallelism


RAID Levels 1, 0, and 5

Alternative Storage

• Record identifier (rid): Techniques


Solid-State Disks

hpageno, slotnoi Network-Based Storage

Managing Space
• Record position (within page): Free Space Management

slotno × bytes per slot Buffer Manager


Pinning and Unpinning
slot no. of records
• Record deletion? directory in this page
Replacement Policies

Databases vs.
• record id should not Operating Systems

Files and Records


change Heap Files

⇒ slot directory (bitmap) 101 3 Header Free Space Management


Inside a Page
Alternative Page Layouts

Recap
46
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust

• Variable-sized fields moved to


end of each record. 4 7 1 1 • M J o h n⊥1 7 2 3 • M M
a r c⊥6 3 8 1 • F B e t t y⊥
• Placeholder points to
Magnetic Disks
location.

Access Time

• Why? Sequential vs. Random


Access

I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

no. of records Buffer Manager


Pinning and Unpinning
in this page Replacement Policies

Databases vs.
Operating Systems

Files and Records


3 Header
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust

• Variable-sized fields moved to


end of each record. 4 7 1 1 • M J o h n⊥1 7 2 3 • M M
a r c⊥6 3 8 1 • F B e t t y⊥
• Placeholder points to
Magnetic Disks
location.

Access Time

• Why? Sequential vs. Random


Access

• Slot directory points to start of I/O Parallelism


RAID Levels 1, 0, and 5
each record. Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

slot no. of records Buffer Manager


Pinning and Unpinning
directory in this page Replacement Policies

Databases vs.
Operating Systems

• • • Files and Records


3 Header
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust

• Variable-sized fields moved to


end of each record. 4 7 1 1 • M J o h n⊥1 7 2 3 MM
a r c⊥6 3 8 1 • F B e t t y⊥1 7 2
• Placeholder points to 3 • M T i m o t h y⊥ Magnetic Disks
location.

Access Time

• Why? Sequential vs. Random


Access

• Slot directory points to start of I/O Parallelism


RAID Levels 1, 0, and 5
each record. Alternative Storage

• Records can move on page. Techniques


Solid-State Disks

• E.g., if field size changes or Network-Based Storage

Managing Space
page is compacted. Free Space Management

slot no. of records Buffer Manager


Pinning and Unpinning
directory in this page Replacement Policies

Databases vs.
Operating Systems

• • • Files and Records


3 Header
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust

• Variable-sized fields moved to


end of each record. 4 7 1 1 • M J o h n⊥1 7 2 3 MM
a r c⊥6 3 8 1 • F B e t t y⊥
• Placeholder points to •
Magnetic Disks
location.

Access Time
forward
• Why? Sequential vs. Random
Access

• Slot directory points to start of I/O Parallelism


RAID Levels 1, 0, and 5
each record. Alternative Storage

• Records can move on page. Techniques


Solid-State Disks

• E.g., if field size changes or Network-Based Storage

Managing Space
page is compacted. Free Space Management

• Create “forward address” if slot no. of records Buffer Manager


Pinning and Unpinning
directory in this page
record won’t fit on page.

Replacement Policies

• Future updates? Databases vs.


Operating Systems

• • • Files and Records


3 Header
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
47
Storage
Inside a Page—Variable-Sized Fields
Torsten Grust

• Variable-sized fields moved to


end of each record. 4 7 1 1 • M J o h n⊥1 7 2 3 MM
a r c⊥6 3 8 1 • F B e t t y⊥
• Placeholder points to
Magnetic Disks
location.

Access Time

• Why? Sequential vs. Random


Access

• Slot directory points to start of I/O Parallelism


RAID Levels 1, 0, and 5
each record. Alternative Storage

• Records can move on page. Techniques


Solid-State Disks

• E.g., if field size changes or Network-Based Storage

Managing Space
page is compacted. Free Space Management

• Create “forward address” if slot no. of records Buffer Manager


Pinning and Unpinning
directory in this page
record won’t fit on page.

Replacement Policies

• Future updates? Databases vs.


Operating Systems

• Related issue: space-efficient • • • 3 Header Files and Records


Heap Files
representation of NULL values. Free Space Management
Inside a Page
Alternative Page Layouts

Recap
47
Storage
IBM DB2 Data Pages
Torsten Grust

Data page and layout details

• Support for 4 K, 8 K, 16 K, 32 K data pages in separate table


spaces. Buffer manager pages match in size. Magnetic Disks
Access Time

• 68 bytes of database manager overhead per page. On a 4 K Sequential vs. Random


Access

page: maximum of 4,028 bytes of user data (maximum I/O Parallelism

record size: 4,005 bytes). Records do not span pages. RAID Levels 1, 0, and 5

Alternative Storage
• Maximum table size: 512 GB (with 32 K pages). Maximum Techniques
Solid-State Disks


number of columns: 1,012 (4 K page: 500), maximum Network-Based Storage

number of rows/page: 255. IBM DB2 RID format? Managing Space


Free Space Management

• Columns of type LONG VARCHAR, CLOB, etc. maintained Buffer Manager


Pinning and Unpinning
outside regular data pages (pages contain descriptors only). Replacement Policies

• Free space management: first-fit order. Free space map Databases vs.
Operating Systems
distributed on every 500th page in FSCR (free space control Files and Records
records). Records updated in-place if possible, otherwises Heap Files
Free Space Management
uses forward records. Inside a Page
Alternative Page Layouts

Recap
48
Storage
IBM DB2 Data Pages
Torsten Grust

Taken directly from the DB2 V9.5 Information Center

Magnetic Disks
Access Time
Sequential vs. Random
Access

I/O Parallelism
RAID Levels 1, 0, and 5

Alternative Storage
Techniques
Solid-State Disks
Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
Replacement Policies

Databases vs.
Operating Systems

Files and Records


Heap Files
Free Space Management
Inside a Page
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/ Alternative Page Layouts

Recap
49
Storage
Alternative Page Layouts
Torsten Grust

We have just populated data pages in a row-wise fashion:

a1 b1 c1 a4 b4 c4
Magnetic Disks
a1 b1 c1 d1 c1 d1 a2 c4 d4 Access Time

a2 b2 c2 d2 b2 c2 d2 Sequential vs. Random


d2 a3 b3 Access

a3 b3 c3 d3 c3 d3 I/O Parallelism
a4 b4 c4 d4 RAID Levels 1, 0, and 5

page 0 page 1
Alternative Storage
Techniques
Solid-State Disks

We could as well do that column-wise: Network-Based Storage

Managing Space
Free Space Management

Buffer Manager
Pinning and Unpinning
a1 a2 a3 b1 b2 b3 Replacement Policies
a1 b1 c1 d1 a3 a4 b3 b4
Databases vs.
a2 b2 c2 d2 Operating Systems

a3 b3 c3 d3 Files and Records


Heap Files
a4 b4 c4 d4 Free Space Management
page 0 page 1 Inside a Page
Alternative Page Layouts

Recap
50
Storage
Alternative Page Layouts
Torsten Grust

These two approaches are also known as NSM (n-ary storage


model) and DSM (decomposition storage model).1
• Tuning knob for certain workload types (e.g., OLAP)
Magnetic Disks
• Suitable for narrow projections and in-memory database Access Time
Sequential vs. Random
systems Access

(% Database Systems and Modern CPU Architecture) I/O Parallelism


RAID Levels 1, 0, and 5

• Different behavior with respect to compression. Alternative Storage


Techniques
Solid-State Disks
Network-Based Storage

A hybrid approach is the PAX minipage 3 Managing Space

(Partition Attributes Across) layout: Free Space Management

minipage 2 Buffer Manager


• Divide each page into minipages. Pinning and Unpinning
Replacement Policies

• Group attributes into them. minipage 1 Databases vs.


Operating Systems

% Ailamaki et al. Weaving Relations for Cache minipage 0 Files and Records
Heap Files
Performance. VLDB 2001. Free Space Management
page 0
Inside a Page
Alternative Page Layouts

Recap
1 Recently, the terms row-store and column-store have become popular, too. 51
Storage
Recap
Torsten Grust

Magnetic Disks
Random access orders of magnitude slower than Magnetic Disks
sequential. Access Time
Sequential vs. Random
Access
Disk Space Manager I/O Parallelism
Abstracts from hardware details and maps RAID Levels 1, 0, and 5

page number 7→ physical location. Alternative Storage


Techniques
Solid-State Disks
Buffer Manager Network-Based Storage

Page caching in main memory; pin ()/unpin () interface; Managing Space


Free Space Management
replacement policy crucial for effectiveness.
Buffer Manager
Pinning and Unpinning
File Organization Replacement Policies

Stable record identifiers (rids); maintenance with Databases vs.


Operating Systems
fixed-sized records and variable-sized fields; NSM vs. DSM.
Files and Records
Heap Files
Free Space Management
Inside a Page
Alternative Page Layouts

Recap
52

You might also like