Professional Documents
Culture Documents
1
Heap File Implemented Using Lists How to find records quickly?
2 2.5 3 3.5
Data entries:
1.2* 1.7* 1.8* 1.9* 2.2* 2.4* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0*
(Index File)
(Data file)
Data Records
2
Index Classification - high level Index Classification - high level
• Representation of data entries in index • ‘data entries’ == what we store at the bottom
– i.e., what is at the bottom of the index? of the index pages
– 3 alternatives here • what would you use as data entries?
• Clustered vs. Unclustered
• Primary vs. Secondary
• Dense vs. Sparse
• Single Key vs. Composite
• Indexing technique
– Tree-based, hash-based, other
Faloutsos, Olston 17 Faloutsos, Olston 18
CMU SCS 15-415 CMU SCS 15-415
3
Example: Simple Index on GPA Alternatives for Data Entry k* in Index
1. Actual data record (with key value k)
Directory
2 2.5 3 3.5
2. <k, rid of matching data record>
3. <k, list of rids of matching data records>
Data entries:
1.2* 1.7* 1.8* 1.9* 2.2* 2.4* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0*
(Index File)
(Data file)
Data Records
Alternatives for Data Entry k* in Index Alternatives for Data Entries (Contd.)
1. Actual data record (with key value k)
Alternative 1:
2. <k, rid of matching data record>
Actual data record (with key value k)
3. <k, list of rids of matching data records>
– If this is used, index structure is a file
• Choice is orthogonal to the indexing technique.
organization for data records (like Heap
– Examples of indexing techniques: B+ trees, files or sorted files).
hash-based structures, R trees, …
– At most one index on a given collection of
– Typically, index contains auxiliary info that
directs searches to the desired data entries data records can use Alternative 1.
• Can have multiple (different) indexes per file. – This alternative saves pointer lookups but
can be expensive to maintain with
– E.g. file sorted on age, with a hash index on
insertions and deletions.
name and a B+tree index on salary.
Faloutsos, Olston 21 Faloutsos, Olston 22
CMU SCS 15-415 CMU SCS 15-415
4
Where were we?: Index Classification Indexing - clustering index example
– A file can have a clustered index on at most • Cost of retrieving records found in range scan:
one search key. – Clustered: cost =
– Cost of retrieving data records through index – Unclustered: cost ˜
varies greatly based on whether index is • What are the tradeoffs????
clustered!
– Note: Alternative 1 implies clustered, but not
vice-versa.
5
Clustered vs. Unclustered Index Clustered vs. Unclustered Index
• Cost of retrieving records found in range scan: • Cost of retrieving records found in range scan:
– Clustered: cost = # pages in file w/matching records – Clustered: cost = # pages in file w/matching records
– Unclustered: cost ˜ # of matching index data entries – Unclustered: cost ˜ # of matching index data entries
• What are the tradeoffs???? • What are the tradeoffs????
– Clustered Pros:
• Efficient for range searches
• May be able to do some types of compression
– Clustered Cons:
• Expensive to maintain (on the fly or sloppy with
reorganization)
Faloutsos, Olston 31 Faloutsos, Olston 32
CMU SCS 15-415 CMU SCS 15-415
• Primary: index key includes the file’s primary • Representation of data entries in index
key – i.e., what is at the bottom of the index?
– 3 alternatives here
• Secondary: any other index
• Clustered vs. Unclustered
– Sometimes confused with Alt. 1 vs. Alt. 2/3 • Primary vs. Secondary
– Primary index never contains duplicates • Dense vs. Sparse
– Secondary index may contain duplicates • Single Key vs. Composite
• If index key contains a candidate key, no
• Indexing technique
duplicates => unique index
– Tree-based, hash-based, other
Faloutsos, Olston 33 Faloutsos, Olston 34
CMU SCS 15-415 CMU SCS 15-415
based on dense indexes. Name Data File Age • Data entries in index sorted by 80,11 80
search key for range queries. <sal, age> <sal>
– Alternative 1 always
leads to dense index. – “Lexicographic” order.
Faloutsos, Olston 35 Faloutsos, Olston 36
CMU SCS 15-415 CMU SCS 15-415
6
Index Classification - detailed Tree vs. Hash-based index
• Hash-based index
• Representation of data entries in index
– Good for equality selections.
– i.e., what is at the bottom of the index?
• File = a collection of buckets. Bucket = primary page
– 3 alternatives here plus 0 or more overflow pages.
• Clustered vs. Unclustered • Hash function h: h(r.search_key) = bucket in which
• Primary vs. Secondary record r belongs.
Heap Heap B
sorted sorted B
u-tree u-tree ~B
u-hash u-hash ~B
Assume that:
• Clustered index spans 1.5B pages (due to empty space)
• Data entry= 1/10 of data record
Faloutsos, Olston 41 Faloutsos, Olston 42
CMU SCS 15-415 CMU SCS 15-415
7
Cost estimation Cost estimation
u-hash ~B ~2 h + #qual-pages
…
8
Cost estimation Cost estimation - big-O notation:
scan eq range ins del scan eq range ins del
Search+1
Heap B B/2 B 2 Heap B B B 2 B