Efficient Storage and Retrieval of Data

Physical Data Organization
Management of large amounts of persistent, reliable, and shared data
large: data does not fit into main memory, so secondary storage must be used
persistent: data written into a file should persist after use so that it can be used again
reliable: data should survive hardware and software failures, and the system should be able to recover from them
sharable: data can be shared by multiple users
Physical Storage Media (hierarchy)
Cache
main memory
flash memory
magnetic-disk storage
optical storage
magnetic-tape storage
Magnetic Disks
Access time is much larger than the processing time.

Access time consists of
seek time
rotation delay
block transfer time
A better organization of data requires fewer disk accesses
A relation can be stored in one or more files with tuples as
records and attributes as fields
Block Size: 512 - 4096 bytes
Blocking Factor: Number of records that can fit in a block
f = floor(B/R) where f = blocking factor, R = record size, B = block size
E.g. B = 1024 bytes, R = 100 bytes, f = floor(1024/100) = 10
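As a quick check of the arithmetic above, the following sketch computes the blocking factor and the number of blocks a file would need. The function names are illustrative and not from the slides, and unspanned fixed-length records are assumed (a record never straddles two blocks), which is why the division rounds down.

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Records per block, assuming unspanned fixed-length records."""
    return block_size // record_size

def blocks_needed(num_records: int, block_size: int, record_size: int) -> int:
    """Blocks required to hold num_records records (last block may be partly empty)."""
    return math.ceil(num_records / blocking_factor(block_size, record_size))

print(blocking_factor(1024, 100))       # 10
print(blocks_needed(30000, 1024, 100))  # 3000
```

With B = 1024 and R = 100, each block wastes 24 bytes, which is the price of keeping records unspanned.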
Files and Records
File Operations: Find, Delete, Modify, Insert
Block Allocation
contiguous: consecutive blocks are assigned, difficult to expand
linked: each block contains the address of the next block, easy to expand but reading is slow
indexed: an index is stored in the file header
fixed length vs variable length records
spanned vs unspanned records
File Organization
unordered (heap or pile)
a new record is placed in the last block
insertion is very cheap, but search, delete, update or reading in order are expensive
search requires b/2 block accesses on average because of the linear search
For 4096 blocks, 4096/2 = 2048 block accesses are needed
ordered
the ordering field is the same as the key field
search, update or reading in order are efficient
search requires log2(b) block accesses because binary search can be used
but insertion is expensive (overflow blocks can be used to reduce the cost)
For 4096 blocks, log2(4096) = 12 block accesses are needed
hashing
used when fast access is required
whereas the access time for an ordered file is log2(b), the access time for hashing is constant
permits access on the basis of the key
poor for range queries, because hashing does not preserve key order
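The two search costs quoted for heap and ordered files can be compared with a small sketch. The function names are illustrative; the heap figure is the average for a successful linear search and the ordered figure is the binary-search bound, and log2(4096) evaluates to 12.

```python
import math

def heap_search_cost(b: int) -> float:
    """Average block accesses for a linear search over b blocks (record present)."""
    return b / 2

def ordered_search_cost(b: int) -> int:
    """Block accesses for a binary search over b blocks, rounded up."""
    return math.ceil(math.log2(b))

b = 4096
print(heap_search_cost(b))     # 2048.0
print(ordered_search_cost(b))  # 12
```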
Access Methods
Primary Key Access Methods
Hashing
Primary Key Indexing
Multilevel Indexing
B - Trees
B+-Trees
Secondary Key Access Methods

Secondary key indexing
Clustering Indexing
Internal Hashing
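Key -> apply hash function -> physical address
h(K) = K mod m
choose m so that the expected number of records fills about 70 - 90% of the table
Example: records with fields Name, Department, Salary, Overflow Pointer
h(James Adams) = (74+65) mod 17 = 139 mod 17 = 3 (74 and 65 are the ASCII codes of J and A)
Slot   Name
0
1
2
3      James Adams
...
15     Mary Jones
16
17     Henry Truman
An overflow pointer of -1 indicates that no overflow record follows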
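The worked example h(James Adams) = (74+65) mod 17 suggests that the key is formed by summing the ASCII codes of the initials. The sketch below assumes exactly that; the function name `initials_hash` and the use of the initials for every name are assumptions read off that single example, not something the slides state in general.

```python
def initials_hash(name: str, m: int = 17) -> int:
    """Sum the ASCII codes of each word's first letter, then take the sum mod m.

    Assumption: the key is built from the initials, as in (74 + 65) mod 17.
    """
    return sum(ord(word[0]) for word in name.split()) % m

for name in ("James Adams", "Mary Jones", "Henry Truman"):
    print(name, initials_hash(name))
# James Adams 3
# Mary Jones 15
# Henry Truman 3
```

Under this assumption Henry Truman hashes to the same slot as James Adams, which would be consistent with that record sitting in slot 17, outside the range 0 to 16 that `mod 17` can produce.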
External Hashing
- number of disk accesses is never more than 2 but will usually be 1
- the file has 2 levels, the directory (bucket address table) and buckets
- the bucket contains actual records
- the key is to choose a good hash function h such that no more than n records
have the same hash value, where n is the number of records that can be stored in a
bucket
- if there is a collision, overflow buckets may be used
Part Number   Hash Function (sum of digits mod 8)
2369          20 mod 8 = 4
3760          16 mod 8 = 0
4692          21 mod 8 = 5
4871          20 mod 8 = 4
5659          25 mod 8 = 1
7115          14 mod 8 = 6
1620          9 mod 8 = 1
2428          16 mod 8 = 0

Bucket   Part Numbers
0        3760, 2428
1        5659, 1620
2        null
3        null
4        2369, 4871
5        4692
6        7115
7        null
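The bucket assignment for the part numbers above can be reproduced with a short sketch. The hash values listed (20, 16, 21, ...) equal the sum of the digits of each part number, so the code assumes a digit-sum hash taken mod 8; the names `digit_sum_hash` and `buckets` are illustrative.

```python
from collections import defaultdict

def digit_sum_hash(part_number: int, num_buckets: int = 8) -> int:
    """Sum the decimal digits of the part number, then take the sum mod num_buckets."""
    return sum(int(d) for d in str(part_number)) % num_buckets

parts = [2369, 3760, 4692, 4871, 5659, 7115, 1620, 2428]

# Bucket number -> part numbers stored in that bucket, in insertion order.
buckets = defaultdict(list)
for p in parts:
    buckets[digit_sum_hash(p)].append(p)

for i in range(8):
    print(i, buckets.get(i, []))
# 0 [3760, 2428]
# 1 [5659, 1620]
# 2 []
# 3 []
# 4 [2369, 4871]
# 5 [4692]
# 6 [7115]
# 7 []
```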
Primary Indexing
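EMPLOYEE (data file ordered on EMP #; the index holds one entry per data block)
Index
EMP #   Block Pointer
107     block 1
201     block 2
371     block 3
624     block 4
Data file (columns EMP #, NAME, DEPT, SALARY; only EMP # and SALARY values shown)
Block   EMP #   SALARY
1       107     10k
1       110     12k
1       112     20k
1       115     15k
2       201     25k
2       236     10k
2       307     30k
2       366     35k
3       371     12k
3       395     15k
3       524     33k
3       608     25k
4       624     20k
4       630     30k
4       724     30k
4       798     35k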
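A lookup through the primary index above can be sketched with a binary search over the anchor keys 107, 201, 371 and 624. The grouping of four records per block is an assumption inferred from those anchors, block numbers are 0-based here, and `find` is an illustrative name.

```python
from bisect import bisect_right

# Anchor keys from the index: the first EMP # stored in each data block.
anchors = [107, 201, 371, 624]

# Data blocks ordered on EMP #; four records per block is an assumed grouping.
blocks = [
    [107, 110, 112, 115],
    [201, 236, 307, 366],
    [371, 395, 524, 608],
    [624, 630, 724, 798],
]

def find(emp_no: int):
    """Return the 0-based block number holding emp_no, or None if it is absent."""
    i = bisect_right(anchors, emp_no) - 1  # last anchor <= emp_no
    if i < 0:
        return None                        # smaller than every key in the file
    return i if emp_no in blocks[i] else None

print(find(395))  # 2
print(find(100))  # None
```

Only the small index is searched by key; a single data block is then read, which is where the "+1" in the block-access counts comes from.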
Example
Number of records, r = 30000
Block size, B = 1024 bytes
Record length, R = 100 bytes
Blocking factor, f = floor(B/R) = floor(1024/100) = 10 records/block
Number of blocks needed, b = 30000/10 = 3000 blocks
Key field, V = 9 bytes
Block pointer, P = 6 bytes
Index entry size = V + P = 15 bytes
Blocking factor for index entries = floor(1024/15) = 68
Number of blocks needed to store index entries = ceil(3000/68) = 45 blocks
Number of block accesses needed = ceil(log2(45)) + 1 = 6 + 1 = 7
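The primary-index example can be recomputed step by step. This is a sketch under the same assumptions as the example (one index entry per data block, binary search over the index blocks, one extra access for the data block); the function and parameter names are illustrative.

```python
import math

def primary_index_cost(r: int, B: int, R: int, V: int, P: int):
    """Return (index blocks, block accesses) for a single-level primary index."""
    f = B // R                          # records per data block
    b = math.ceil(r / f)                # data blocks
    fi = B // (V + P)                   # index entries per index block
    index_blocks = math.ceil(b / fi)    # one index entry per data block
    accesses = math.ceil(math.log2(index_blocks)) + 1  # binary search + data block
    return index_blocks, accesses

print(primary_index_cost(r=30000, B=1024, R=100, V=9, P=6))  # (45, 7)
```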
Clustering Indexing
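EMPLOYEE (data file ordered on SALARY, a non-key field; the index holds one entry per distinct salary value)
Index
Salary   Block Pointer
10k      first block containing 10k
12k      first block containing 12k
15k      first block containing 15k
20k      first block containing 20k
25k      first block containing 25k
30k      first block containing 30k
33k      first block containing 33k
35k      first block containing 35k
Data file (columns EMP #, NAME, DEPT, SALARY; only EMP # and SALARY values shown)
EMP #   SALARY
107     10k
236     10k
110     12k
371     12k
115     15k
395     15k
112     20k
624     20k
201     25k
608     25k
307     30k
630     30k
724     30k
524     33k
366     35k
798     35k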
Secondary Indexing
Constructed on a nonordering field
Can create many secondary indexes on the same file
If constructed on a key field, that field is called a secondary key
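EMPLOYEE (data file not ordered on EMP #; the index holds one entry per record)
Index (EMP #, Block Pointer), ordered on EMP #
107, 110, 112, 115, 201, 236, 307, 366, 371, 395, 524, 608, 624, 630, 724, 798
Data file (columns EMP #, NAME, DEPT, SALARY; only EMP # and SALARY values shown, in the order recovered from the figure)
EMP #   SALARY
201     25k
110     12k
366     35k
107     10k
115     15k
236     10k
307     30k
112     20k
798     35k
395     15k
524     33k
724     30k
624     20k
630     30k
608     25k
371     12k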
Secondary Index
Example:
Number of records, r = 30000
Block size, B = 1024 bytes
Record length, R = 100 bytes
Blocking factor, f = B/R = 1024/100 = 10 records/block
Number of blocks needed, b = 30,000/10 = 3000 blocks
Key field, V = 9 bytes
Block pointer, P = 6 bytes
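Blocking factor for index entries = floor(1024/15) = 68
Number of blocks needed to store index entries = ceil(30000/68) = 442 blocks (one entry per record, because the file is not ordered on this field)
Number of block accesses needed = ceil(log2(442)) + 1 = 9 + 1 = 10
occupies more space than a primary index
requires maintenance, hence expensive
can be created on a non-key field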
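The secondary-index figures can be checked the same way. The sketch assumes a dense index (one entry per record, 30000 entries in the example) searched by binary search, plus one access for the data block; the function name is illustrative.

```python
import math

def dense_index_cost(r: int, B: int, V: int, P: int):
    """Return (index blocks, block accesses) for a dense single-level index."""
    fi = B // (V + P)                   # index entries per index block
    index_blocks = math.ceil(r / fi)    # one index entry per record
    accesses = math.ceil(math.log2(index_blocks)) + 1  # binary search + data block
    return index_blocks, accesses

print(dense_index_cost(r=30000, B=1024, V=9, P=6))  # (442, 10)
```

The index is roughly ten times larger than one with an entry per data block, because it needs an entry for every record rather than every block.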
Multilevel Indexing
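When the index file itself is large, we can construct an index on the index
Each higher level is always a primary index, because the index level below it is ordered on the key
Number of blocks needed to store index entries at level 1 = ceil(30000/68) = 442
Number of blocks at level 2 = ceil(442/68) = 7
Number of blocks at level 3 = ceil(7/68) = 1
Number of block accesses needed = 3 + 1 = 4 (one per index level plus one for the data block)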
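The count of three index levels follows from repeatedly dividing by the index blocking factor until a single block remains. A minimal sketch, assuming 30000 first-level entries and 68 entries per index block as in the example (the function name is illustrative):

```python
import math

def multilevel_index_levels(first_level_entries: int, fan_out: int):
    """Return the number of index blocks at each level, from level 1 upward."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:       # a single block is the top level
            break
        entries = blocks      # the next level has one entry per block below it
    return levels

levels = multilevel_index_levels(30000, 68)
print(levels)           # [442, 7, 1]
print(len(levels) + 1)  # 4 block accesses: one per level plus the data block
```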
Figure: multilevel index, each index entry holding a Key and a Block Pointer
B-Trees and B+-Trees

B-Tree
Each node in a B-tree of order p is of the form
<P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq-1, Prq-1>, Pq> where Pi is a tree pointer, Ki
is the key field value, Pri is the data pointer, and q <= p
Each path from the root node to a leaf node has the same
length
Each node (except the root and leaf nodes) has at least ceil(p/2)
children
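The node layout described above can be sketched as a small data structure with a search routine. This is an illustration only: the class and field names are hypothetical, insertion and node splitting are omitted, and the tiny tree is built by hand.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class BTreeNode:
    keys: list = field(default_factory=list)       # K1 < K2 < ... < Kq-1
    data_ptrs: list = field(default_factory=list)  # Pr1 ... Prq-1, one per key
    children: list = field(default_factory=list)   # P1 ... Pq; empty for a leaf

def btree_search(node, key):
    """Return the data pointer stored with key, or None if key is absent."""
    while node is not None:
        i = bisect_left(node.keys, key)             # first position with Ki >= key
        if i < len(node.keys) and node.keys[i] == key:
            return node.data_ptrs[i]                # found in this node
        node = node.children[i] if node.children else None  # descend via Pi+1
    return None

# Hand-built tree of order p = 3: one root key and two leaves.
leaf1 = BTreeNode(keys=[107, 110], data_ptrs=["rec107", "rec110"])
leaf2 = BTreeNode(keys=[236, 307], data_ptrs=["rec236", "rec307"])
root = BTreeNode(keys=[201], data_ptrs=["rec201"], children=[leaf1, leaf2])

print(btree_search(root, 307))  # rec307
print(btree_search(root, 201))  # rec201
print(btree_search(root, 999))  # None
```

Because a B-tree keeps a data pointer next to every key, the search for 201 stops at the root without reaching a leaf.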
B+-Trees
All the keys and the associated data pointers to the records reside in the leaf nodes
Example of a B-Tree
