Files Indexing

Review: Memory, Disks
File Organizations and Indexing

• Storage Hierarchy: cache, RAM, disk, tape, …
– Can’t fit everything in RAM (usually).
15-415 • “Page” or “Frame” - unit of buffer

management in RAM.
R&G Chapter 8
• “Page” or “Block” unit of interaction with disk.
• Importance of “locality” and sequential access
for good disk performance.
• Buffer pool management
– Slots in RAM to hold Pages
– Policy to move Pages between RAM & disk
Faloutsos, Olston 2
CMU SCS 15-415
Review: File Storage Files
• Page or block is OK when doing I/O,

• FILE: A collection of pages, each containing a
but higher levels of DBMS operate on
collection of records.
records, and files of records.
• Must support:
• We saw: – insert/delete/modify record
– How to organize records within pages. – read a particular record (specified using record id)
– How to keep pages of records on disk – scan all records (possibly with some conditions on
the records to be retrieved)
• Today we’ll see:
– How to support operations on files of
records efficiently.
Faloutsos, Olston 3 Faloutsos, Olston 4
CMU SCS 15-415 CMU SCS 15-415
Alternative File Organizations How to find records quickly?

Many alternatives exist, each good for some
situations, and not so good in others: • E.g., student.gpa = ‘3’
– Heap files: Suitable when typical access is a
file scan retrieving all records. Q: On a heap organization, with B blocks,
– Sorted Files: Best for retrieval in some order, how many disk accesses?
or for retrieving a `range’ of records.
– Index File Organizations: (will cover shortly..)

1
Heap File Implemented Using Lists How to find records quickly?
• E.g., student.gpa = ‘3’

Data Data Data Full Pages
Page Page Page
Header Q: On a heap organization, with B blocks,
Page
Data Data Data
how many disk accesses?
Pages with
Page Page Page
Free Space A: B
• The header page id and Heap file name must be

stored someplace.
• Each page contains 2 `pointers’ plus data.

How to accelerate searches? Example: Simple Index on GPA
• A: Indices, like: Directory
2 2.5 3 3.5
Data entries:
1.2* 1.7* 1.8* 1.9* 2.2* 2.4* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0*
(Index File)
(Data file)
Data Records
An index contains a collection of data entries, and

supports efficient retrieval of records matching a
given search condition
Indexes Index Search Conditions

• Sometimes, we want to retrieve records by specifying the
values in one or more fields, e.g., • Search condition =
– Find all students in the “CS” department
<search key, comparison operator>
– Find all students with a gpa > 3 Examples…
• An index on a file speeds up selections on the search key (1) Condition: Department = “CS”
fields for the index.
– Any subset of the fields of a relation can be the search key for an
– Search key: “CS”
index on the relation. – Comparison operator: equality (=)
– Search key is not the same as key (e.g., doesn’t have to be
unique). (2) Condition: GPA > 3
– Search key: 3
– Comparison operator: greater-than (>)

2
Index Classification - high level Index Classification - high level
• Primary vs. Secondary • Primary vs. Secondary

– depending whether the index key is primary or not – depending whether the index key is primary or not
(= duplicates or not) (= duplicates or not)
• Clustered vs. Unclustered (~ sparse vs • Clustered vs. Unclustered (~ sparse vs
dense) dense)
• (B-tree-like index vs hash index)

Indexing - clustering index example Indexing - non-clustering
Clustering/sparse index on ssn Non-clustering / dense index

123 123
456 >=123 234
… 345
STUDENT 456 Ssn Name Address
Ssn Name Address 567 345 tomson main str
123 smith main str 234 jones forbes ave
>=456
234 jones forbes ave 567 smith forbes ave
345 tomson main str 456 stevens forbes ave
456 stevens forbes ave
123 smith main str
567 smith forbes ave

Index Classification - detailed Details
• Representation of data entries in index • ‘data entries’ == what we store at the bottom
– i.e., what is at the bottom of the index? of the index pages
– 3 alternatives here • what would you use as data entries?
• Clustered vs. Unclustered
• Primary vs. Secondary
• Dense vs. Sparse
• Single Key vs. Composite
• Indexing technique
– Tree-based, hash-based, other
3
Example: Simple Index on GPA Alternatives for Data Entry k* in Index
1. Actual data record (with key value k)
Directory
2 2.5 3 3.5
2. <k, rid of matching data record>
3. <k, list of rids of matching data records>
Data entries:
1.2* 1.7* 1.8* 1.9* 2.2* 2.4* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0*
(Index File)
(Data file)
Data Records
An index contains a collection of data entries, and

supports efficient retrieval of records matching a
given search condition
Alternatives for Data Entry k* in Index Alternatives for Data Entries (Contd.)
1. Actual data record (with key value k)
Alternative 1:
2. <k, rid of matching data record>
Actual data record (with key value k)
3. <k, list of rids of matching data records>
– If this is used, index structure is a file
• Choice is orthogonal to the indexing technique.
organization for data records (like Heap
– Examples of indexing techniques: B+ trees, files or sorted files).
hash-based structures, R trees, …
– At most one index on a given collection of
– Typically, index contains auxiliary info that
directs searches to the desired data entries data records can use Alternative 1.
• Can have multiple (different) indexes per file. – This alternative saves pointer lookups but
can be expensive to maintain with
– E.g. file sorted on age, with a hash index on
insertions and deletions.
name and a B+tree index on salary.
Alternatives for Data Entries (Contd.) BREAK: World’s Largest Databases

Alternative 2
<k, rid of matching data record>
and Alternative 3
<k, list of rids of matching data records>
– Easier to maintain than Alternative 1.
– If more than one index is required on a given file, at
most one index can use Alternative 1; rest must use
Alternatives 2 or 3.
– Alternative 3 more compact than Alternative 2, but
leads to variable sized data entries even if search keys
are of fixed length.
– Even worse, for large rid lists the data entry would
have to span multiple pages!
4
Where were we?: Index Classification Indexing - clustering index example
• Representation of data entries in index Clustering/sparse index on ssn

– i.e., what is at the bottom of the index? 123
– 3 alternatives here 456 >=123
…
• Clustered vs. Unclustered STUDENT
• Primary vs. Secondary Ssn Name Address
123 smith main str
>=456
• Dense vs. Sparse 234 jones forbes ave
345 tomson main str
• Single Key vs. Composite 456 stevens forbes ave
• Indexing technique 567 smith forbes ave
Indexing - non-clustering Index Classification - clustering
Non-clustering / dense index • Clustered vs. unclustered: If order of data

123
records is the same as, or `close to’, order of
234 index data entries, then called clustered index.
345
456 Ssn Name Address
567 345 tomson main str Index entries
UNCLUSTERED
234 jones forbes ave CLUSTERED direct search for
data entries
567 smith forbes ave

456 stevens forbes ave
Data entries Data entries
123 smith main str (Index File)
(Data file)
Data Records Data Records

Index Classification - clustering Clustered vs. Unclustered Index
– A file can have a clustered index on at most • Cost of retrieving records found in range scan:
one search key. – Clustered: cost =
– Cost of retrieving data records through index – Unclustered: cost ˜
varies greatly based on whether index is • What are the tradeoffs????
clustered!
– Note: Alternative 1 implies clustered, but not
vice-versa.

5
Clustered vs. Unclustered Index Clustered vs. Unclustered Index
• Cost of retrieving records found in range scan: • Cost of retrieving records found in range scan:
– Clustered: cost = # pages in file w/matching records – Clustered: cost = # pages in file w/matching records
– Unclustered: cost ˜ # of matching index data entries – Unclustered: cost ˜ # of matching index data entries
• What are the tradeoffs???? • What are the tradeoffs????
– Clustered Pros:
• Efficient for range searches
• May be able to do some types of compression
– Clustered Cons:
• Expensive to maintain (on the fly or sloppy with
reorganization)
Primary vs. Secondary Index Index Classification - detailed
• Primary: index key includes the file’s primary • Representation of data entries in index
key – i.e., what is at the bottom of the index?
– 3 alternatives here
• Secondary: any other index
• Clustered vs. Unclustered
– Sometimes confused with Alt. 1 vs. Alt. 2/3 • Primary vs. Secondary
– Primary index never contains duplicates • Dense vs. Sparse
– Secondary index may contain duplicates • Single Key vs. Composite
• If index key contains a candidate key, no
duplicates => unique index
Dense vs. Sparse Index Composite Search Keys

• Dense: at least one • Search on combination of fields.
data entry per key – Equality query: Every field is Examples of composite key
indexes using lexicographic order.
value equal to a constant value.
Ashby, 25, 3000
22
E.g. wrt <sal,age> index:
• Sparse: an entry per Basu, 33, 4003
Bristow, 30, 2007

25
• age=12 and sal =75
11,80
12,10
11
12
data page in file Ashby
30
12,20 name age sal 12
Cass Cass, 50, 5004
33
– Range query: Some field
– Every sparse index is Smith Daniels, 22, 6003
40 value is not a constant.
13,75 bob 12 10 13
<age, sal> cal 11 80 <age>
clustered! Jones, 40, 6003
44
E.g.:
44
joe 12 20
– Sparse indexes are Smith, 44, 3000
Tracy, 44, 5004

50
• age =12; or age=12 and 10,12 sue 13 75 10
smaller; however, some sal > 20 20,12 Data records 20
useful optimizations are Sparse Index
on
Dense Index
on
75,13 sorted by name 75
based on dense indexes. Name Data File Age • Data entries in index sorted by 80,11 80
search key for range queries. <sal, age> <sal>
– Alternative 1 always
leads to dense index. – “Lexicographic” order.
6
Index Classification - detailed Tree vs. Hash-based index
• Hash-based index
• Representation of data entries in index
– Good for equality selections.
– i.e., what is at the bottom of the index?
• File = a collection of buckets. Bucket = primary page
– 3 alternatives here plus 0 or more overflow pages.
• Clustered vs. Unclustered • Hash function h: h(r.search_key) = bucket in which
• Primary vs. Secondary record r belongs.
• Dense vs. Sparse • Tree-based index

– Good for range selections.
• Single Key vs. Composite
• Hierarchical structure (Tree) directs searches
• Leaves contain data entries sorted by search key value
• B+ tree: all root->leaf paths have equal length (height)
Cost estimation Cost estimation
• Heap file • Heap file • scan

• Sorted • Sorted • equality search
• Clustered • Clustered • range search
• Unclustured tree index • Unclustured tree index • insertion
• Unclustered hash index • Unclustered hash index • deletion
Methods Operations(?) Methods Operations
• Consider only I/O cost;

• suppose file spans B pages

scan eq range ins del scan eq range ins del
Heap Heap B
sorted sorted B
Clust. Clust. 1.5B
u-tree u-tree ~B
u-hash u-hash ~B
Assume that:
• Clustered index spans 1.5B pages (due to empty space)
• Data entry= 1/10 of data record
7
index – cost? In general

– levels of index +
– heap: seq. scan
– blocks w/ qual. tuples
– sorted: binary search #1 #1
– index search ..
#2 #2
for primary key – cost:
h for clustering index
… …
h’+1 for non-clustering
#B h’ #B


scan eq range ins del index – cost?
Heap B B/2 – levels of index +
sorted B log2B #1
Clust. 1.5B h
#2
u-tree ~B 1+h’ sec. key – clustering index
u-hash ~B ~2 h + #qual-pages
…
h= height of btree ~ logF (1.5B) #B

h’= height of unclustered index btree ~ logF (0.15B) h

index – cost? scan eq range ins del

– levels of index + Heap B B/2 B
#1 sorted B log2B <- +m
... Clust. 1.5B h <- +m

#2
sec. key – non-clust. u-tree ~B 1+h’ <- +m’
index
… u-hash ~B ~2 B
h’ + #qual-records
(actually, a bit less...) #B m: # of qualifying pages
h’ m’: # of qualifying records
8
Cost estimation Cost estimation - big-O notation:
scan eq range ins del scan eq range ins del
Search+1
Heap B B/2 B 2 Heap B B B 2 B
sorted B log2B <- +m Search+B Search+B sorted B log2B log2B B B

<- +m Search+1 Search+1
Clust. 1.5B h Clust. B logFB logFB logFB logFB
<- +m’ Search+2 Search+2
u-tree ~B 1+h’ u-tree B logFB logFB logFB logFB
Search+2 Search+2
u-hash ~B ~2 B u-hash B 1 B 1 1

Index specification in SQL:1999 Summary

• To speed up selection queries: index.
CREATE INDEX IndAgeRating ON Students
• Terminology:
WITH STRUCTURE=BTREE, – Clustered / non-clustered index
KEY = (age, gpa) – primary / secondary index
• Typically, B-tree index (see later)
– hashing is only good for equality search
• At most one clustered index per table
– many non-clustered ones are possible
– composite indexes are possible


Files Indexing

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Files Indexing

Uploaded by

Copyright:

Available Formats

Review: Memory, Disks

File Organizations and Indexing

15-415 • “Page” or “Frame” - unit of buffer

Review: File Storage Files

• Page or block is OK when doing I/O,

Alternative File Organizations How to find records quickly?

Faloutsos, Olston 5 Faloutsos, Olston 6

• E.g., student.gpa = ‘3’

• The header page id and Heap file name must be

Faloutsos, Olston 7 Faloutsos, Olston 8

How to accelerate searches? Example: Simple Index on GPA

• A: Indices, like: Directory

An index contains a collection of data entries, and

Indexes Index Search Conditions

Faloutsos, Olston 11 Faloutsos, Olston 12

• Primary vs. Secondary • Primary vs. Secondary

Faloutsos, Olston 13 Faloutsos, Olston 14

Indexing - clustering index example Indexing - non-clustering

Clustering/sparse index on ssn Non-clustering / dense index

Faloutsos, Olston 15 Faloutsos, Olston 16

Index Classification - detailed Details

An index contains a collection of data entries, and

Alternatives for Data Entries (Contd.) BREAK: World’s Largest Databases

• Representation of data entries in index Clustering/sparse index on ssn

Indexing - non-clustering Index Classification - clustering

Non-clustering / dense index • Clustered vs. unclustered: If order of data

567 smith forbes ave

Data Records Data Records

Index Classification - clustering Clustered vs. Unclustered Index

Faloutsos, Olston 29 Faloutsos, Olston 30

Primary vs. Secondary Index Index Classification - detailed

Dense vs. Sparse Index Composite Search Keys

Bristow, 30, 2007

Tracy, 44, 5004

• Dense vs. Sparse • Tree-based index

Cost estimation Cost estimation

• Heap file • Heap file • scan

Methods Operations(?) Methods Operations

• Consider only I/O cost;

Cost estimation Cost estimation

Clust. Clust. 1.5B

index – cost? In general

Faloutsos, Olston 43 Faloutsos, Olston 44

Cost estimation Cost estimation

h= height of btree ~ logF (1.5B) #B

Faloutsos, Olston 45 Faloutsos, Olston 46

Cost estimation Cost estimation

index – cost? scan eq range ins del

... Clust. 1.5B h <- +m

sorted B log2B <- +m Search+B Search+B sorted B log2B log2B B B

Faloutsos, Olston 49 Faloutsos, Olston 50

Index specification in SQL:1999 Summary

Faloutsos, Olston 51 Faloutsos, Olston 52

You might also like