You are on page 1of 24

Indexing and Hashing

B.Ramamurthy
Chapter 11

12/08/21 B.Ramamurthy 1
Representing Data
Attributes are represented in fixed or variable length
collections called “fields”
Fields in turn are put into fixed or variable length
collections called records.
Records are stored in physical blocks.
A collection of records that forms a relation is stored as
a collection of blocks called a file.
 This file different than OS file. How?

 Organization is different.

 Extra indices to accommodate easy search and

access.

12/08/21 B.Ramamurthy 2
Basic Concepts (indexing)
Indexing works the same way as a
catalog for a book in a library.
Indexing needs to be efficient to allow
fast access to records.
Two types of indices:
 ordered indices and
 hash indices

12/08/21 B.Ramamurthy 3
Techniques and Evaluation
Access types : types of accesses that are supported
efficiently. Search by specific value or by range.
Access time: Time sit takes to find a particular data
or a set of data.
Insertion time: Time it takes to insert a new item.
Deletion time: Time it takes to delete an item.
Space overhead : Additional space occupied by the
index structure.

12/08/21 B.Ramamurthy 4
Ordered Indices
To gain fast access to records in a file we can
use an index structure.
If the file containing the records is
sequentially ordered, the index whose search
key specifies the sequential order of the file is
the primary key index.
Primary key indices are also called clustering
indices.

12/08/21 B.Ramamurthy 5
Primary Index
Assume that all files are ordered
sequentially on some search key.
Such files, with primary key on the
search key, are called index-sequential
files.
These files accommodate both
sequential and random access to
individual records.

12/08/21 B.Ramamurthy 6
Dense and Sparse Index
Dense index:
 An index record appears for every search key
value in the file.
 The index record contains the search key and a
pointer to the first data record with that search-
key value.
Sparse index:
 An index is created only for a few values. Each
index contains a value and pointer to first record
that contains that value.

12/08/21 B.Ramamurthy 7
Dense Index
Brighton Brighton A-217 750
Downtown Downtown A-101 500
Mianus Downtown A-110 600
Perryridge Mianus A-215 700
Redwood Perryridge A-102 400
Round Hill Perryridge A-201 900
Perryridge A-218 700
Redwood A-222 700
Round Hill A-305 350

12/08/21 B.Ramamurthy 8
Sparse Index
Brighton Brighton A-217 750
Mianus Downtown A-101 500
Redwood Downtown A-110 600
Mianus A-215 700
Perryridge A-102 400
Perryridge A-201 900
Perryridge A-218 700
Redwood A-222 700
Round Hill A-305 350

Which one is better? Dense or sparse? It is a trade off


Between access time and space overhead.

12/08/21 B.Ramamurthy 9
Multi-level Indices
Indices themselves may become too large for
efficient processing.
Example:
 Consider file with 100000 records with 10 records in a
block.
 With sparse index and one index per block we have about
10,000 indices.
 Assuming 100 indices fit into a block we need about 100
blocks.
 It is desirable to keep the index file in the main memory.
 Problem: Searching a large index file becomes expensive.

12/08/21 B.Ramamurthy 10
Multi-level Index
Solution: Index the index file. We treat
the index as we would treat any other
sequential file and construct a sparse
index on the primary index.
We binary-search the outer level index
to find the largest search key less than
or equal to the one we desire.
Two-level sparse index ; Figure 11.4

12/08/21 B.Ramamurthy 11
Secondary Index
Secondary index is on attributes whose
values are not stored sequentially.
If the search key of a secondary index
is not a candidate key, the index needs
to be dense too.
We can use an extra level of indirection
with buckets at the second level.
See fig.11.5

12/08/21 B.Ramamurthy 12
Secondary Index
350 Brighton A-217 750
400 Downtown A-101 500
500 Downtown A-110 600
600 Mianus A-215 700
700 Perryridge A-102 400
750 Perryridge A-201 900
900 Perryridge A-218 700
Redwood A-222 700
Round Hill A-305 350

12/08/21 B.Ramamurthy 13
B+ Tree Index Files
Main disadvantage of the index-sequential file
organization is that performance degrades as
the file grows both for index lookups and
sequential scans.
B+ tree index structure is most widely used
of several index structures that maintain their
efficiency despite insertion and deletion of
data.

12/08/21 B.Ramamurthy 14
B+ Tree Index files
A B+ index tree is a balanced tree in
which every path from root to leaf is of
same length and each non-leaf node
has between ceiling(n/2) and n nodes
where n is fixed.
Typical node is a B+ tree:
n-1 search keys K1, K2,… Kn-1
n pointers P1, P2, …Pn

12/08/21 B.Ramamurthy 15
B+ Tree Node

P1 K1 P2 K2 …… Pn-1 Kn-1 Pn

12/08/21 B.Ramamurthy 16
B+ Tree (contd.)
Structure of a B+ tree
Queries on B+ trees
Updates on B+ trees (insertion ,
deletion)
B+ file organization
B Tree variation of B+ tree : avoiding
redundancy

12/08/21 B.Ramamurthy 17
Hashing
Can we avoid the IO operations that the
result from accessing the index file?
Hashing offers a way.
It also provides a way of constructing
indices (which need nor be sequential).
We will study static and dynamic
hashing.

12/08/21 B.Ramamurthy 18
Hash File Organization
Address of the disk block containing a desired
record is computed using a function (hash
function) and the search key.
Let K denote set of all search keys, B denote
set of all bucket addresses. Hash function h is
a function that maps K to B.
Bucket is typically a disk block.

12/08/21 B.Ramamurthy 19
Operations
To insert a record with Ki as key, compute
h(Ki) which gives the address of the bucket
for the record. If there is space in the bucket
then it is stored that bucket. (else chaining?)
To lookup a record with key Ki, compute
h(ki). Check with every record in the bucket
to obtain the record.
To delete a similar hash, find and delete is
followed.

12/08/21 B.Ramamurthy 20
Hash Functions
Hash function should be chosen so that
 The distribution of records is uniform.
 The distribution is random.
Handling bucket overflows:
 May occur due to insufficient number of buckets.
 Due to bucket skew.
 Solution: Overflow buckets, chaining, double
hashing, linear probing, quadratic probing

12/08/21 B.Ramamurthy 21
Hash Indices
Hashing can be used for organizing
indices.Hash index organizes search
keys with their associated pointers.
See Fig.11.22
Typically only secondary indices need to
be organized using hashing.

12/08/21 B.Ramamurthy 22
Dynamic Hashing
Many of today’s databases grow very large in
(a short) time.
If you use static hash function we have three
option:
 Choose hash function based on current size,
 Choose hash function based on anticipated size.
 Periodically restructure the hash file in response to
growth.
Another solution: dynamic hashing.

12/08/21 B.Ramamurthy 23
Dynamic Hash Techniques
Dynamic hash techniques allow the hash
function to be modified dynamically to
accommodate the growth and shrinkage of
the database.
It is also known as extendable hashing.
Extendable hashing copes with the growth in
the database size by splitting and coalescing
buckets as the database grows and shrinks.

12/08/21 B.Ramamurthy 24

You might also like