
Efficient Storage and Retrieval of Data

Physical Data Organization

Management of large amounts of persistent, reliable, and shared data

large: data does not fit into main memory, so we have to use some secondary storage
persistent: data written into a file should persist after it is used, so that it can be used again
reliable: data should survive hardware and software failures, and the system should be able to recover from these failures
sharable: data can be shared by multiple users

Physical Storage Media (hierarchy)

Cache
main memory
flash memory
magnetic-disk storage
optical storage
magnetic-tape storage

Magnetic Disks

Access time is much larger than the processing time.


Access time consists of

seek time
rotation delay
block transfer time

A better organization of data requires fewer disk accesses
A relation can be stored in one or more files, with tuples as records and attributes as fields
Block size: 512 - 4096 bytes
Blocking factor: number of records that can fit in a block
f = floor(B/R), where f = blocking factor, R = record size, B = block size
E.g. B = 1024 bytes, R = 100 bytes, f = floor(1024/100) = 10
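The blocking-factor formula can be checked with a few lines of Python. This is a minimal sketch that assumes unspanned, fixed-length records; the function name `blocking_factor` is illustrative and does not come from the slides.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole records per block, assuming unspanned fixed-length records."""
    return block_size // record_size


f = blocking_factor(1024, 100)
unused_bytes = 1024 - f * 100  # space left over in each block
print(f, unused_bytes)  # 10 24
```

With B = 1024 and R = 100, each block holds 10 records and 24 bytes stay unused, which is why the division is rounded down.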

Files and Records

File Operations

Find, Delete, Modify, Insert

Block Allocation

contiguous: consecutive blocks are assigned; difficult to expand
linked: each block contains the address of the next block; easy to expand, but reading is slow
indexed: an index is stored in the file header

fixed-length vs variable-length records
spanned vs unspanned records

File Organization

Unordered (heap or pile)

a new record is placed in the last block
insertion is very cheap, but search, delete, update, and reading in order are expensive
search requires b/2 block accesses on average because of the linear search
For 4096 blocks, 4096/2 = 2048 block accesses are needed

Ordered

ordering field is the same as the key field
search, update, and reading in order are efficient
search requires log2(b) block accesses because binary search can be used
but insertion is expensive (overflow blocks can be used to reduce the cost)
For 4096 blocks, log2(4096) = 12 block accesses are needed
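The two search costs can be compared directly. The sketch below follows the slides' estimates (b/2 for a linear search, log2(b) for a binary search); the function names are illustrative, and rounding log2(b) up to a whole number of accesses is an assumption for block counts that are not powers of two.

```python
import math


def heap_search_cost(b: int) -> float:
    """Average block accesses for a linear search over b blocks."""
    return b / 2


def ordered_search_cost(b: int) -> int:
    """Approximate block accesses for a binary search over b blocks."""
    return math.ceil(math.log2(b))


print(heap_search_cost(4096))     # 2048.0
print(ordered_search_cost(4096))  # 12
```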

Hashing

used when fast access is required
whereas the access time for an ordered file is log2(b), the access time for hashing is constant
permits access on the basis of the key
miserable for range queries

Access Methods

Primary Key Access Methods

Hashing
Primary Key Indexing
Multilevel Indexing
B-Trees
B+-Trees

Secondary Key Access Methods

Secondary Key Indexing
Clustering Indexing

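Internal Hashing

Key -> Apply hash function -> Physical address

h(K) = K mod m
m = 70 - 90% of the expected number of records

Example: records with the fields Name, Department, Salary

h(James Adams) = (74+65) mod 17 = 139 mod 17 = 3

[Figure: hash table with columns Name, Department, Salary, and Overflow Pointer, and addresses 0 to 17. James Adams is stored at address 3, Mary Jones at address 15, and Henry Truman at address 17; the overflow pointers shown are -1.]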
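The worked example can be reproduced in code. In the sketch below, the hash sums the character codes of the initials, which is an inference from 74 + 65 ('J' and 'A') and not something the slides state as a rule. The overflow handling (append the colliding record after the home addresses and link it through an overflow pointer, with -1 meaning "no next record") is likewise an assumption, and the names `initials_hash` and `insert` are illustrative.

```python
M = 17  # number of home addresses, as in the worked example


def initials_hash(name: str, m: int = M) -> int:
    """Sum the character codes of each word's first letter, modulo m."""
    return sum(ord(word[0]) for word in name.split()) % m


def insert(table: list, name: str) -> int:
    """Store name at its home address, or chain it into the overflow area."""
    home = initials_hash(name)
    if table[home] is None:
        table[home] = [name, -1]  # -1 marks the end of the overflow chain
        return home
    slot = home
    while table[slot][1] != -1:   # walk to the end of the chain
        slot = table[slot][1]
    table.append([name, -1])      # overflow area starts at address M
    table[slot][1] = len(table) - 1
    return len(table) - 1


table = [None] * M
addresses = {n: insert(table, n)
             for n in ["James Adams", "Mary Jones", "Henry Truman"]}
print(addresses)
```

Under these assumptions, Henry Truman hashes to the same home address as James Adams and ends up at address 17, the first overflow position.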

External Hashing
- the number of disk accesses is never more than 2 and will usually be 1
- the file has 2 levels: the directory (bucket address table) and the buckets
- the buckets contain the actual records
- the key is to choose a good hash function h such that no more than n records have the same hash value, where n is the number of records that can be stored in a bucket
- if there is a collision, overflow buckets may be used
Part Number    Hash Function
2369           20 mod 8 = 4
3760           16 mod 8 = 0
4692           21 mod 8 = 5
4871           20 mod 8 = 4
5659           25 mod 8 = 1
7115           14 mod 8 = 6
1620            9 mod 8 = 1
2428           16 mod 8 = 0

Bucket    Part Numbers
0         3760, 2428
1         5659, 1620
2         null
3         null
4         2369, 4871
5         4692
6         7115
7         null
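The bucket assignment above can be reproduced with a short sketch. The hash values in the table (20, 16, 21, ...) equal the sum of the digits of each part number, so the function below assumes that rule; the slides do not name the hash function explicitly. The dictionary stands in for the directory (bucket address table), and the names are illustrative.

```python
def digit_sum_hash(part_number: int, buckets: int = 8) -> int:
    """Sum of the decimal digits of the part number, modulo the bucket count."""
    return sum(int(d) for d in str(part_number)) % buckets


part_numbers = [2369, 3760, 4692, 4871, 5659, 7115, 1620, 2428]

# Directory: bucket address -> records stored in that bucket.
directory = {address: [] for address in range(8)}
for part in part_numbers:
    directory[digit_sum_hash(part)].append(part)

for address, bucket in directory.items():
    print(address, bucket)
```

Buckets 2, 3, and 7 stay empty because no part number in the example hashes to them.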

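Primary Indexing

Index file (EMP #, Block Pointer), one entry per data block:
107, 201, 371, 624

EMPLOYEE data file, ordered on EMP # (NAME and DEPT values omitted); each index entry points to the block that begins with that EMP #:

EMP #    SALARY
107      10k
110      12k
112      20k
115      15k

201      25k
236      10k
307      30k
366      35k

371      12k
395      15k
524      33k
608      25k

624      20k
630      30k
724      30k
798      35k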
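A lookup through a primary index can be sketched as a binary search over the index keys followed by a search inside a single block. The sketch below uses the EMP # and SALARY values from the figure; grouping the records four to a block is an assumption based on the index entries 107, 201, 371, and 624, and list positions stand in for block pointers.

```python
from bisect import bisect_right

# Index file: the first EMP # of each data block.
index_keys = [107, 201, 371, 624]

# Data file ordered on EMP #, four records per block (EMP # -> SALARY).
blocks = [
    {107: "10k", 110: "12k", 112: "20k", 115: "15k"},
    {201: "25k", 236: "10k", 307: "30k", 366: "35k"},
    {371: "12k", 395: "15k", 524: "33k", 608: "25k"},
    {624: "20k", 630: "30k", 724: "30k", 798: "35k"},
]


def lookup(emp_no: int):
    """Return the salary for emp_no, or None if the record does not exist."""
    pos = bisect_right(index_keys, emp_no) - 1  # last index key <= emp_no
    if pos < 0:
        return None  # smaller than every key in the file
    return blocks[pos].get(emp_no)


print(lookup(307))  # 30k
print(lookup(300))  # None
```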

Example

Number of records, r = 30,000
Block size, B = 1024 bytes
Record length, R = 100 bytes
Blocking factor, f = floor(B/R) = floor(1024/100) = 10 records/block
Number of blocks needed, b = 30,000/10 = 3000 blocks
Key field, V = 9 bytes
Block pointer, P = 6 bytes

Index entry size = V + P = 15 bytes
Blocking factor for index entries = floor(1024/15) = 68
Number of blocks needed to store index entries = ceil(3000/68) = 45 blocks
Number of block accesses needed = ceil(log2(45)) + 1 = 6 + 1 = 7
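The arithmetic in this example can be reproduced step by step. The sketch below assumes one index entry per data block and a binary search over the index blocks followed by one access to the data block; the function name and the returned field names are illustrative.

```python
import math


def primary_index_cost(r: int, B: int, R: int, V: int, P: int) -> dict:
    """Sizes and access cost for a primary index with one entry per data block."""
    f = B // R                       # records per data block
    b = math.ceil(r / f)             # data blocks
    fan_out = B // (V + P)           # index entries per index block
    index_blocks = math.ceil(b / fan_out)
    # Binary search over the index blocks, plus one access for the data block.
    accesses = math.ceil(math.log2(index_blocks)) + 1
    return {"f": f, "b": b, "fan_out": fan_out,
            "index_blocks": index_blocks, "accesses": accesses}


print(primary_index_cost(r=30_000, B=1024, R=100, V=9, P=6))
```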

Clustering Indexing

Index file (SALARY, Block Pointer), one entry per distinct SALARY value:
10k, 12k, 15k, 20k, 25k, 30k, 33k, 35k

EMPLOYEE data file, ordered on SALARY (NAME and DEPT values omitted):

EMP #    SALARY
107      10k
236      10k
110      12k
371      12k
115      15k
395      15k
112      20k
624      20k
201      25k
608      25k
307      30k
630      30k
724      30k
524      33k
366      35k
798      35k
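A clustering index keeps one entry per distinct value of the ordering field and points to where that value first appears. The following sketch builds such an index over the figure's records; salaries are written as integers in thousands, list positions stand in for block pointers, and the function names are illustrative assumptions.

```python
# (EMP #, SALARY in thousands), ordered on SALARY as in the figure.
records = [
    (107, 10), (236, 10), (110, 12), (371, 12), (115, 15), (395, 15),
    (112, 20), (624, 20), (201, 25), (608, 25), (307, 30), (630, 30),
    (724, 30), (524, 33), (366, 35), (798, 35),
]


def build_clustering_index(sorted_records):
    """Map each distinct salary to the position of its first record."""
    index = {}
    for pos, (_, salary) in enumerate(sorted_records):
        index.setdefault(salary, pos)  # keep only the first occurrence
    return index


def find_by_salary(index, sorted_records, salary):
    """Follow the index entry, then scan forward while the salary matches."""
    pos = index.get(salary)
    matches = []
    while (pos is not None and pos < len(sorted_records)
           and sorted_records[pos][1] == salary):
        matches.append(sorted_records[pos][0])
        pos += 1
    return matches


index = build_clustering_index(records)
print(sorted(index))                        # distinct salaries in the index
print(find_by_salary(index, records, 30))   # [307, 630, 724]
```

Because the file is ordered on SALARY, all records with the same value are adjacent, so one index entry per value is enough.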

Secondary Indexing

Constructed on a nonordering field
Many secondary indexes can be created on the same file
If constructed on a key field, that field is called a secondary key

Index file (EMP #, Block Pointer), one entry per record:
107, 110, 112, 115, 201, 236, 307, 366, 371, 395, 524, 608, 624, 630, 724, 798

EMPLOYEE data file, not ordered on EMP # (NAME and DEPT values omitted):

EMP #    SALARY
201      25k
110      12k
366      35k
107      10k
115      15k
236      10k
307      30k
112      20k
798      35k
395      15k
524      33k
724      30k
624      20k
630      30k
608      25k
371      12k

Secondary Index

Example:
Number of records, r = 30,000
Block size, B = 1024 bytes
Record length, R = 100 bytes
Blocking factor, f = floor(B/R) = floor(1024/100) = 10 records/block
Number of blocks needed, b = 30,000/10 = 3000 blocks
Key field, V = 9 bytes
Block pointer, P = 6 bytes
Blocking factor for index entries = floor(1024/15) = 68
Number of blocks needed to store index entries = ceil(30,000/68) = 442 blocks
Number of block accesses needed = ceil(log2(442)) + 1 = 9 + 1 = 10
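For comparison, the sketch below recomputes the secondary-index figures and sets them against the b/2 linear-search estimate given earlier for an unordered file. It assumes one index entry per record, since the file is not ordered on the indexed field; the variable names are illustrative.

```python
import math

r, B, R, V, P = 30_000, 1024, 100, 9, 6

data_blocks = math.ceil(r / (B // R))        # blocks in the data file
fan_out = B // (V + P)                       # index entries per index block
index_blocks = math.ceil(r / fan_out)        # one entry per record

# Binary search over the index blocks, plus one access for the data block.
with_index = math.ceil(math.log2(index_blocks)) + 1
# Average linear search over the data file when no index exists.
without_index = data_blocks / 2

print(index_blocks, with_index, without_index)  # 442 10 1500.0
```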
A secondary index occupies more space than a primary index
It requires maintenance and is therefore expensive
It can be created on a non-key field

Multilevel Indexing
When the index file itself is large, we can construct an index on the index
The higher-level index is always a primary index, because the index file below it is ordered on the key

Blocking factor for index entries = floor(1024/15) = 68
Number of blocks needed to store index entries at level 1 = ceil(30,000/68) = 442
Number of blocks needed to store index entries at level 2 = ceil(442/68) = 7
Number of blocks needed to store index entries at level 3 = ceil(7/68) = 1
Number of block accesses needed = 3 + 1 = 4
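The level-by-level calculation can be written as a loop that keeps building an index on the previous level until a single block remains. This is a sketch with illustrative names; it assumes each higher level holds one entry per block of the level below.

```python
import math


def multilevel_index_blocks(first_level_entries: int, fan_out: int) -> list:
    """Blocks needed at each index level, from level 1 up to the top level."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:      # the top level fits in a single block
            break
        entries = blocks     # the next level indexes the blocks of this one
    return levels


levels = multilevel_index_blocks(30_000, 68)
accesses = len(levels) + 1   # one block per level, plus the data block
print(levels, accesses)      # [442, 7, 1] 4
```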
[Figure: multilevel index; each index entry consists of a Key and a Block Pointer.]

B-Trees and B+-Trees

B-Tree
Each node in a B-tree of order p is of the form
<P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq-1, Prq-1>, Pq>
where Pi is a tree pointer, Ki is a key field value, Pri is a data pointer, and q ≤ p
Each path from the root node to a leaf node has the same length
Each node (except the root and the leaves) has at least ceil(p/2) children
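The node layout described above maps naturally onto a small data structure. The sketch below shows search only (no insertion or rebalancing) on a hand-built tree; the class name, the field names, and the sample keys are assumptions made for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list                                     # K1 .. Kq-1, sorted
    data_ptrs: list                                # Pr1 .. Prq-1, one per key
    children: list = field(default_factory=list)  # P1 .. Pq, empty in a leaf


def search(node: BTreeNode, key):
    """Return (data pointer or None, number of nodes visited)."""
    visited = 0
    while True:
        visited += 1
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.data_ptrs[i], visited      # key found in this node
        if not node.children:
            return None, visited                   # reached a leaf, key absent
        node = node.children[i]                    # descend into subtree i


# A tiny hand-built tree with one root and two leaves.
root = BTreeNode(
    keys=[5], data_ptrs=["rec5"],
    children=[BTreeNode([1, 3], ["rec1", "rec3"]),
              BTreeNode([7, 9], ["rec7", "rec9"])],
)
print(search(root, 7))  # ('rec7', 2)
print(search(root, 4))  # (None, 2)
```

Because every internal key carries its own data pointer, a search can stop at an internal node, as it does for key 5 in this tree.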

B+-Trees
All the keys and the associated data pointers to the record
reside in the leaf nodes

Example of a B-Tree
