
File Organizations and Indexing

15-415
R&G Chapter 8

Review: Memory, Disks

• Storage Hierarchy: cache, RAM, disk, tape, …
– Can’t fit everything in RAM (usually).
• “Page” or “Frame” - unit of buffer management in RAM.
• “Page” or “Block” - unit of interaction with disk.
• Importance of “locality” and sequential access for good disk performance.
• Buffer pool management
– Slots in RAM to hold Pages
– Policy to move Pages between RAM & disk
Faloutsos, Olston 2
CMU SCS 15-415

Review: File Storage

• Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records.
• We saw:
– How to organize records within pages.
– How to keep pages of records on disk
• Today we’ll see:
– How to support operations on files of records efficiently.

Files

• FILE: A collection of pages, each containing a collection of records.
• Must support:
– insert/delete/modify record
– read a particular record (specified using record id)
– scan all records (possibly with some conditions on the records to be retrieved)

Alternative File Organizations

Many alternatives exist, each good for some situations, and not so good in others:
– Heap files: Suitable when typical access is a file scan retrieving all records.
– Sorted Files: Best for retrieval in some order, or for retrieving a ‘range’ of records.
– Index File Organizations: (will cover shortly..)

Heap File Implemented Using Lists

[Figure: a header page linked to two lists of data pages, one of full pages and one of pages with free space.]

• The header page id and Heap file name must be stored someplace.
• Each page contains 2 ‘pointers’ plus data.

How to find records quickly?

• E.g., student.gpa = ‘3’

Q: On a heap organization, with B blocks, how many disk accesses?
A: B
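The answer "B" for a heap file can be illustrated with a minimal sketch. The record layout (name, gpa), the three-page file, and the function name below are assumptions made for illustration only; the point is that an unordered file gives no way to skip pages, so every page is read.

```python
from typing import List, Tuple

Record = Tuple[str, float]   # (name, gpa): a hypothetical record layout
Page = List[Record]

def heap_equality_search(pages: List[Page], gpa: float) -> Tuple[List[Record], int]:
    """Scan every page of an unordered heap file.

    Returns the matching records and the number of pages read.
    """
    matches: List[Record] = []
    pages_read = 0
    for page in pages:
        pages_read += 1                      # one disk access per page
        matches.extend(r for r in page if r[1] == gpa)
    return matches, pages_read

# A made-up heap file with B = 3 pages of two records each.
heap_file = [
    [("ann", 3.0), ("bob", 2.4)],
    [("cal", 3.9), ("dee", 3.0)],
    [("eve", 1.8), ("fay", 2.7)],
]
found, cost = heap_equality_search(heap_file, 3.0)
print(found, cost)
```

Because matches may sit on any page, the scan cannot stop early: the cost equals the number of pages in the file regardless of how many records qualify.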

How to accelerate searches?

• A: Indices, like:

Example: Simple Index on GPA

Directory: 2  2.5  3  3.5

Data entries (index file):
1.2* 1.7* 1.8* 1.9* 2.2* 2.4* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0*

Data records (data file)

An index contains a collection of data entries, and supports efficient retrieval of records matching a given search condition.
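A rough sketch of how the directory in the GPA example might direct a search. The grouping of the sixteen data entries into five leaves is an assumption (the slide only shows the separator values 2, 2.5, 3, 3.5), and `find_leaf` and `entries_greater_than` are hypothetical helper names.

```python
from bisect import bisect_right
from typing import List

# Separator keys from the directory of the example index.
directory = [2.0, 2.5, 3.0, 3.5]

# Data entries grouped into leaves; this grouping is assumed, not given.
leaves = [
    [1.2, 1.7, 1.8, 1.9],   # keys < 2
    [2.2, 2.4],             # 2 <= keys < 2.5
    [2.7, 2.7, 2.9],        # 2.5 <= keys < 3
    [3.2, 3.3, 3.3],        # 3 <= keys < 3.5
    [3.6, 3.8, 3.9, 4.0],   # keys >= 3.5
]

def find_leaf(key: float) -> int:
    """Return the position of the leaf that would hold `key`."""
    return bisect_right(directory, key)

def entries_greater_than(low: float) -> List[float]:
    """Answer 'gpa > low' by starting at the right leaf and scanning forward."""
    result: List[float] = []
    for leaf in leaves[find_leaf(low):]:
        result.extend(k for k in leaf if k > low)
    return result

print(find_leaf(2.7))
print(entries_greater_than(3.0))
```

The directory is small enough to search cheaply, so only the leaves from the starting position onward need to be touched.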

Indexes

• Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g.,
– Find all students in the “CS” department
– Find all students with a gpa > 3
• An index on a file speeds up selections on the search key fields for the index.
– Any subset of the fields of a relation can be the search key for an index on the relation.
– Search key is not the same as key (e.g., doesn’t have to be unique).

Index Search Conditions

• Search condition = <search key, comparison operator>

Examples…
(1) Condition: Department = “CS”
– Search key: “CS”
– Comparison operator: equality (=)
(2) Condition: GPA > 3
– Search key: 3
– Comparison operator: greater-than (>)
Index Classification - high level

• Primary vs. Secondary
– depending on whether the index key is primary or not (= duplicates or not)
• Clustered vs. Unclustered (~ sparse vs dense)
• (B-tree-like index vs hash index)

Indexing - clustering index example

Clustering/sparse index on ssn

Index entries: 123 (>=123), 456 (>=456), …

STUDENT
Ssn  Name     Address
123  smith    main str
234  jones    forbes ave
345  tomson   main str
456  stevens  forbes ave
567  smith    forbes ave

Indexing - non-clustering

Non-clustering / dense index

Index entries: 123, 234, 345, 456, 567

Ssn  Name     Address
345  tomson   main str
234  jones    forbes ave
567  smith    forbes ave
456  stevens  forbes ave
123  smith    main str

Index Classification - detailed

• Representation of data entries in index
– i.e., what is at the bottom of the index?
– 3 alternatives here
• Clustered vs. Unclustered
• Primary vs. Secondary
• Dense vs. Sparse
• Single Key vs. Composite
• Indexing technique
– Tree-based, hash-based, other

Details

• ‘data entries’ == what we store at the bottom of the index pages
• what would you use as data entries?
Alternatives for Data Entry k* in Index

1. Actual data record (with key value k)
2. <k, rid of matching data record>
3. <k, list of rids of matching data records>

• Choice is orthogonal to the indexing technique.
– Examples of indexing techniques: B+ trees, hash-based structures, R trees, …
– Typically, index contains auxiliary info that directs searches to the desired data entries
• Can have multiple (different) indexes per file.
– E.g. file sorted on age, with a hash index on name and a B+tree index on salary.

Alternatives for Data Entries (Contd.)

Alternative 1: Actual data record (with key value k)
– If this is used, index structure is a file organization for data records (like Heap files or sorted files).
– At most one index on a given collection of data records can use Alternative 1.
– This alternative saves pointer lookups but can be expensive to maintain with insertions and deletions.
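The three data-entry alternatives can be compared side by side in a small sketch. The rid layout (page number, slot number) and the four sample records are assumptions for illustration; a real DBMS stores these entries in pages, not Python lists.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rid = Tuple[int, int]   # (page number, slot number): an assumed rid layout

# Hypothetical data records, addressed by rid: (name, gpa).
records: Dict[Rid, Tuple[str, float]] = {
    (0, 0): ("ann", 3.3),
    (0, 1): ("bob", 2.7),
    (1, 0): ("cal", 3.3),
    (1, 1): ("dee", 2.7),
}

# Alternative 1: the data entries are the records themselves, in key order.
alt1 = sorted(records.values(), key=lambda rec: rec[1])

# Alternative 2: one <k, rid> pair per data record.
alt2 = sorted((rec[1], rid) for rid, rec in records.items())

# Alternative 3: one <k, list of rids> entry per distinct key value.
groups: Dict[float, List[Rid]] = defaultdict(list)
for k, rid in alt2:
    groups[k].append(rid)
alt3 = sorted(groups.items())

print(alt1)
print(alt2)
print(alt3)
```

With duplicate keys, Alternative 3 stores each key once, which is where its compactness comes from; the price is that entries vary in length with the number of rids.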

Alternatives for Data Entries (Contd.)

Alternative 2: <k, rid of matching data record>
and Alternative 3: <k, list of rids of matching data records>
– Easier to maintain than Alternative 1.
– If more than one index is required on a given file, at most one index can use Alternative 1; rest must use Alternatives 2 or 3.
– Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
– Even worse, for large rid lists the data entry would have to span multiple pages!

BREAK: World’s Largest Databases
Where were we?: Index Classification

• Representation of data entries in index
– i.e., what is at the bottom of the index?
– 3 alternatives here
• Clustered vs. Unclustered
• Primary vs. Secondary
• Dense vs. Sparse
• Single Key vs. Composite
• Indexing technique
– Tree-based, hash-based, other

Index Classification - clustering

• Clustered vs. unclustered: If order of data records is the same as, or ‘close to’, order of index data entries, then called clustered index.

[Figure: CLUSTERED vs. UNCLUSTERED. In both panels, index entries direct the search for data entries in the index file, and the data entries refer to data records in the data file.]

Index Classification - clustering

– A file can have a clustered index on at most one search key.
– Cost of retrieving data records through index varies greatly based on whether index is clustered!
– Note: Alternative 1 implies clustered, but not vice-versa.

Clustered vs. Unclustered Index

• Cost of retrieving records found in range scan:
– Clustered: cost = # pages in file w/matching records
– Unclustered: cost ≈ # of matching index data entries
• What are the tradeoffs????
– Clustered Pros:
• Efficient for range searches
• May be able to do some types of compression
– Clustered Cons:
• Expensive to maintain (on the fly or sloppy with reorganization)
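The gap between the two range-scan costs can be seen in a toy model. Everything below is assumed for illustration: 10 records per page, integer keys, a one-page buffer, and two made-up placement functions (`clustered` keeps key order, `unclustered` scatters neighbouring keys across pages).

```python
from typing import Callable, Optional

def range_scan_ios(page_of: Callable[[int], int], low: int, high: int) -> int:
    """Count page fetches when following data entries in key order.

    Assumes a one-page buffer: a fetch is charged whenever the next
    record lives on a different page than the previous one.
    """
    ios = 0
    last_page: Optional[int] = None
    for key in range(low, high + 1):
        page = page_of(key)
        if page != last_page:
            ios += 1
            last_page = page
    return ios

# Hypothetical layouts with 10 records per page.
def clustered(key: int) -> int:
    return key // 10      # records stored in key order

def unclustered(key: int) -> int:
    return key % 10       # neighbouring keys land on different pages

print(range_scan_ios(clustered, 20, 39))
print(range_scan_ios(unclustered, 20, 39))
```

For the 20 qualifying records, the clustered layout touches only the pages that hold them, while the unclustered layout pays roughly one fetch per matching data entry.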

Primary vs. Secondary Index

• Primary: index key includes the file’s primary key
• Secondary: any other index
– Sometimes confused with Alt. 1 vs. Alt. 2/3
– Primary index never contains duplicates
– Secondary index may contain duplicates
• If index key contains a candidate key, no duplicates => unique index

Dense vs. Sparse Index

• Dense: at least one data entry per key value
• Sparse: an entry per data page in file
– Every sparse index is clustered!
– Sparse indexes are smaller; however, some useful optimizations are based on dense indexes.
– Alternative 1 always leads to dense index.

[Figure: a sparse index on Name (entries Ashby, Cass, Smith) and a dense index on Age (entries 22, 25, 30, 33, 40, 44, 44, 50) over the same data file.]

Data File
Ashby, 25, 3000
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Jones, 40, 6003
Smith, 44, 3000
Tracy, 44, 5004

Composite Search Keys

• Search on combination of fields.
– Equality query: Every field is equal to a constant value. E.g. wrt <sal,age> index:
• age=12 and sal =75
– Range query: Some field value is not a constant. E.g.:
• age =12; or age=12 and sal > 20
• Data entries in index sorted by search key for range queries.
– “Lexicographic” order.

Examples of composite key indexes using lexicographic order.

Data records sorted by name:
name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75

Data entries in index, sorted by search key:
<age, sal>: 11,80  12,10  12,20  13,75
<age>:      11  12  12  13
<sal, age>: 10,12  20,12  75,13  80,11
<sal>:      10  20  75  80
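The lexicographic ordering of composite keys can be reproduced with tuple sorting. The sketch below reuses the four sample records (bob, cal, joe, sue) from the composite-key example; the helper `age_equals` and the use of `bisect` over an in-memory list are assumptions standing in for a real index lookup, and the helper relies on ages being integers.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# (name, age, sal) records from the composite-key example.
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries for two composite indexes; tuples compare lexicographically.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)

def age_equals(age: int, min_sal_exclusive: int = -1) -> List[Tuple[int, int]]:
    """Range query on <age, sal>: age = `age` and sal > `min_sal_exclusive`."""
    lo = bisect_right(age_sal, (age, min_sal_exclusive))
    hi = bisect_left(age_sal, (age + 1,))   # first entry of the next age
    return age_sal[lo:hi]

print(age_sal)
print(sal_age)
print(age_equals(12))
print(age_equals(12, 10))
```

Because `age` is the leading field, all entries with age = 12 are adjacent in the <age, sal> index. The same query on the <sal, age> index would have to inspect every entry, since equal ages are scattered across different salary values.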
Tree vs. Hash-based index

• Hash-based index
– Good for equality selections.
• File = a collection of buckets. Bucket = primary page plus 0 or more overflow pages.
• Hash function h: h(r.search_key) = bucket in which record r belongs.
• Tree-based index
– Good for range selections.
• Hierarchical structure (Tree) directs searches
• Leaves contain data entries sorted by search key value
• B+ tree: all root->leaf paths have equal length (height)
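Why hashing suits equality and sorted leaves suit ranges can be sketched with in-memory stand-ins. This is not a B+ tree or a disk-based hash file: a Python dict plays the role of the buckets, a sorted list plays the role of the leaf level, and the sample entries and rids ("r1" to "r4") are made up.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, List

# Hypothetical <key, rid> data entries.
entries = [(3.3, "r1"), (2.7, "r2"), (3.9, "r3"), (2.7, "r4")]

# Hash-based stand-in: key -> bucket of rids.
hash_index: Dict[float, List[str]] = defaultdict(list)
for k, rid in entries:
    hash_index[k].append(rid)

# Tree-based stand-in: data entries kept sorted by search key.
sorted_entries = sorted(entries)
keys = [k for k, _ in sorted_entries]

def hash_equality(k: float) -> List[str]:
    """Equality selection: jump straight to the bucket for k."""
    return hash_index.get(k, [])

def tree_range(lo: float, hi: float) -> List[str]:
    """Range selection lo <= key <= hi: locate the start, then scan in order."""
    start = bisect_left(keys, lo)
    end = bisect_right(keys, hi)
    return [rid for _, rid in sorted_entries[start:end]]

print(hash_equality(2.7))
print(tree_range(2.5, 3.5))
```

The hash structure keeps no order among keys, so a range query on it would have to visit every bucket, whereas the sorted entries answer it with one lookup followed by a sequential scan.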

Cost estimation

Methods:
• Heap file
• Sorted
• Clustered
• Unclustered tree index
• Unclustered hash index

Operations:
• scan
• equality search
• range search
• insertion
• deletion

• Consider only I/O cost;
• suppose file spans B pages

Cost estimation

         scan   eq   range   ins   del
Heap     B
sorted   B
Clust.   1.5B
u-tree   ~B
u-hash   ~B

Assume that:
• Clustered index spans 1.5B pages (due to empty space)
• Data entry = 1/10 of data record
Cost estimation

– heap: seq. scan
– sorted: binary search
– index search ..

[Figure: file pages #1, #2, …, #B]

Cost estimation

index – cost? In general
– levels of index +
– blocks w/ qual. tuples

for primary key – cost:
h for clustering index
h’+1 for non-clustering

[Figure: index of height h’ over file pages #1, #2, …, #B]

Cost estimation

         scan   eq       range   ins   del
Heap     B      B/2
sorted   B      log2 B
Clust.   1.5B   h
u-tree   ~B     1+h’
u-hash   ~B     ~2

h = height of btree ~ logF (1.5B)
h’ = height of unclustered index btree ~ logF (0.15B)

Cost estimation

index – cost?
– levels of index +
– blocks w/ qual. tuples

sec. key – clustering index:
h + #qual-pages

[Figure: index of height h over file pages #1, #2, …, #B]

Cost estimation

index – cost?
– levels of index +
– blocks w/ qual. tuples

sec. key – non-clust. index:
h’ + #qual-records
(actually, a bit less...)

[Figure: index of height h’ over file pages #1, #2, …, #B]

Cost estimation

         scan   eq       range         ins   del
Heap     B      B/2      B
sorted   B      log2 B   log2 B + m
Clust.   1.5B   h        h + m
u-tree   ~B     1+h’     1+h’ + m’
u-hash   ~B     ~2       B

m: # of qualifying pages
m’: # of qualifying records
Cost estimation

         scan   eq       range         ins        del
Heap     B      B/2      B             2          Search+1
sorted   B      log2 B   log2 B + m    Search+B   Search+B
Clust.   1.5B   h        h + m         Search+1   Search+1
u-tree   ~B     1+h’     1+h’ + m’     Search+2   Search+2
u-hash   ~B     ~2       B             Search+2   Search+2

Cost estimation - big-O notation:

         scan   eq       range    ins      del
Heap     B      B        B        2        B
sorted   B      log2 B   log2 B   B        B
Clust.   B      logF B   logF B   logF B   logF B
u-tree   B      logF B   logF B   logF B   logF B
u-hash   B      1        B        1        1
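To get a feel for the magnitudes, the scan, equality, and range columns of the cost table can be evaluated for concrete numbers. The function name `cost_table` and the sample values (B = 1024 pages, fanout F = 100, m = 5 qualifying pages, m’ = 50 qualifying records) are assumptions; the formulas are the approximations stated in the tables, with tree heights rounded up to whole levels and "~B" and "~2" taken as B and 2.

```python
import math
from typing import Dict

def cost_table(B: int, F: int, m: int, m_prime: int) -> Dict[str, Dict[str, float]]:
    """Approximate I/O costs (in pages) for scan, equality and range search."""
    h = math.ceil(math.log(1.5 * B, F))         # clustered index height
    h_prime = math.ceil(math.log(0.15 * B, F))  # unclustered index height
    log2_b = math.log2(B)                       # binary search on a sorted file
    return {
        "heap":   {"scan": B,       "eq": B / 2,       "range": B},
        "sorted": {"scan": B,       "eq": log2_b,      "range": log2_b + m},
        "clust":  {"scan": 1.5 * B, "eq": h,           "range": h + m},
        "u-tree": {"scan": B,       "eq": 1 + h_prime, "range": 1 + h_prime + m_prime},
        "u-hash": {"scan": B,       "eq": 2,           "range": B},
    }

table = cost_table(B=1024, F=100, m=5, m_prime=50)
for method, costs in table.items():
    print(method, costs)
```

Even in this small setting the unclustered tree pays for every qualifying record on a range search, while the clustered index pays only for the qualifying pages.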

Index specification in SQL:1999

CREATE INDEX IndAgeRating ON Students
    WITH STRUCTURE=BTREE,
    KEY = (age, gpa)

Summary

• To speed up selection queries: index.
• Terminology:
– Clustered / non-clustered index
– primary / secondary index
• Typically, B-tree index (see later)
– hashing is only good for equality search
• At most one clustered index per table
– many non-clustered ones are possible
– composite indexes are possible
