Data Management: INFO125

Data Management
INFO125
Lecture 14:
Physical Data Organization (part 2)
Mehdi Elahi
University of Bergen (UiB)
Memory
✤ Memory is used to store information within a computer, either programs or data.
Memory
✤ A memory system is a collection of various forms of memory, organised in a hierarchy
✤ The memory hierarchy is built by considering:
✤ Cost per storage unit
✤ Access speed
✤ Reliability
Operations on Files
✤ Operations for locating and accessing file records vary from system to system.
✤ Open
✤ Reset
✤ Find (or locate)
✤ Read (or get)
✤ FindNext
✤ Delete
✤ Modify
✤ Insert
✤ Close
File Header
✤ The file header (or file descriptor) contains metadata about the file that is needed to access the file records.
✤ The file header includes information to determine the disk addresses of the file blocks.
File headers
✤ The file header may also include:
✤ field lengths & field types
✤ order of fields within (fixed-length) records
✤ separator characters and record type codes
File headers
✤ But how do we search for and find the records?
✤ Searching for a record on disk:
(a) Copy one or a few blocks into main memory
(b) Check the file header to find the record
(c) If the file header doesn't help, do a linear search
Operations on Files
✤ This works in nearly the same way for different data types, even BLOBs
✤ A BLOB (= Binary Large Object) is a data item that consists of a large unstructured object, such as an image, a digitised video or audio stream, or free text.
✤ BLOBs are stored separately from the record, with a pointer in the record directed at them.
Organising Files
✤ Methods for organizing the records of a file so that retrieval & update are optimised:
✤ Heap files
✤ Sorted files
✤ Hashing
Heap Files
✤ The simplest organization, where records are placed in the file in the order in which they are inserted.
✤ Hence, new records are inserted at the end of the file.
✤ This organization is called a heap (or pile) file.
Heap Files
[Figure: a data block (data page) holding records]
Heap Files
✤ Advantage:
✤ Inserting a new record is efficient.
✤ Disadvantages:
✤ Searching for a record is expensive, since it involves a linear search through the file, block by block.
✤ Deletion is expensive
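The trade-off above can be sketched in a few lines. This is only an illustration, assuming a heap file is modelled as a Python list of blocks; the names `HeapFile` and `BLOCK_CAPACITY` and the sample records are hypothetical and not part of the lecture.

```python
BLOCK_CAPACITY = 3  # assumed number of records that fit in one block


class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # each inner list models one data block

    def insert(self, record):
        """Append at the end of the file: only the last block is touched."""
        if len(self.blocks[-1]) == BLOCK_CAPACITY:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, key):
        """Linear search, block by block; returns (record, blocks_read)."""
        for blocks_read, block in enumerate(self.blocks, start=1):
            for record in block:
                if record[0] == key:
                    return record, blocks_read
        return None, len(self.blocks)


heap = HeapFile()
for key in [42, 7, 19, 3, 88, 61, 25]:
    heap.insert((key, f"record-{key}"))

# The record inserted last sits in the last block, so every block is read.
found, blocks_read = heap.search(25)
```

Insertion never looks at more than the last block, while a search for a missing key has to read the whole file.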
Sorted files
✤ We can physically order the records of a file based on the values of one of the fields (the ordering field).
✤ This leads to an ordered file.
Sorted Files
✤ Example:
An ordered (sequential) file of EMPLOYEE records with Name as the ordering key field.
…
Sorted Files
✤ If the ordering field is also a key field of the file, meaning the field is guaranteed to have a unique value in each record, then the field is called the ordering key.
…
Sorted files
✤ Advantages:
✤ reading the records in order of the ordering field becomes efficient
✤ finding the next record from the current one usually requires no additional block accesses
✤ a search condition based on the value of the ordering key field gives faster access with binary search
✤ Disadvantages:
✤ no benefit when searching by non-ordering fields
✤ inserting and deleting is expensive (in time)
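A minimal sketch of binary search on an ordered file, assuming the file is a list of non-empty blocks whose records are sorted on the ordering key. The function name `binary_search_blocks` and the sample names are hypothetical; the lecture only states that binary search applies.

```python
def binary_search_blocks(blocks, key):
    """Binary search over the blocks of an ordered file.

    Returns (record, blocks_read); record is None if the key is absent.
    """
    low, high = 0, len(blocks) - 1
    blocks_read = 0
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]          # models reading one block from disk
        blocks_read += 1
        if key < block[0][0]:        # key sorts before this block
            high = mid - 1
        elif key > block[-1][0]:     # key sorts after this block
            low = mid + 1
        else:                        # key can only be inside this block
            for record in block:
                if record[0] == key:
                    return record, blocks_read
            return None, blocks_read
    return None, blocks_read


names = ["Aaron", "Adams", "Akers", "Allen", "Baker", "Chen",
         "Dahl", "Evans", "Ford", "Gray", "Hill", "Wong"]
records = [(name, f"rec-{name}") for name in names]
# Three records per block (assumed), giving four blocks.
ordered_file = [records[i:i + 3] for i in range(0, len(records), 3)]

hit = binary_search_blocks(ordered_file, "Ford")
miss = binary_search_blocks(ordered_file, "Zed")
```

With four blocks, the lookup for "Ford" reads two blocks instead of scanning from the start of the file.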
Hashing
✤ Another type of primary file organization is based on hashing, which provides very fast access to records under certain search conditions.
✤ This organization is usually called a hash file.
Hash files
✤ Hashing is a method for mapping digital data of arbitrary size to data of fixed size
✤ In the context of database storage, two forms of hashing exist:
✤ Internal Hashing
✤ External Hashing
Internal Hashing
✤ Hash table: the records are organised into an array of m ‘slots’, each containing one record
✤ The address of each slot corresponds to its index in the array
✤ A hash function is used to transform the hash field value into a slot number (0 to m − 1 when h(K) = K mod m is used)
Internal hashing
✤ Example:
[Figure: a hash key (hash field) mapped to the address of a disk block]
Internal Hashing
✤ Example:
[Figure: an array with m positions holding the records below; one field of each record is the hash key (hash field)]
John Smith    1112222   Manager     50K
Lisa Smith    1552233   CEO         100K
Sam Doe       3333221   Architect   77K
Sandra Dee    9991911   Engineer    92K
Quiz
✤ Which function could be a hash function:
i. h(K) = e^(K*M)
ii. h(K) = M mod K
iii. h(K) = K mod M
iv. h(K) = K / M
Answer
✤ Which function could be a hash function:
iii. h(K) = K mod M
✤ It maps every integer key K to an address in the range 0 to M − 1.
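A small check of why K mod M fits the role of a hash function: a usable hash address must be a whole number inside the address space 0 to M − 1. The sketch below tests the three arithmetic candidates from the quiz; the value M = 7, the key range, and the helper name `in_address_space` are assumptions made for illustration.

```python
M = 7  # assumed number of slots in the table


def in_address_space(value):
    """True if value is a whole number in the range 0 .. M-1."""
    return float(value).is_integer() and 0 <= value < M


candidates = {
    "M mod K": lambda k: M % k,
    "K mod M": lambda k: k % M,
    "K / M": lambda k: k / M,
}

keys = range(1, 51)  # K = 0 is left out because M mod 0 is undefined

# For each candidate: does every key land on a valid slot address?
valid = {
    name: all(in_address_space(h(k)) for k in keys)
    for name, h in candidates.items()
}
```

Only `K mod M` stays inside the address space for every key; `M mod K` returns M itself for keys larger than M, and `K / M` is usually not a whole number.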
Alternative Methods
✤ Folding: applying an arithmetic or a logical function to different portions of the hash field to get a hash address
✤ Example:
✤ if m = 1000, how do we store 235469? Split it into ‘235’ and ‘469’
✤ reversing the digits of the second part before adding: (235 + 964) mod 1000 = 199
Quiz
✤ Consider the K mod 1000 function and the folding method: which hash address corresponds to the hash field value 987654?
(A) 641
(B) 666
(C) 0
(D) 654
Answer
✤ Consider the K mod 1000 function and the folding method: which hash address corresponds to the hash field value 987654?
(A) 641
(B) 666
(C) 0
(D) 654
✤ m = 1000, store 987654 → ‘987’ & ‘654’
✤ (987 + 654) mod 1000 = 641, so the answer is (A)
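The two folding calculations in the slides can be reproduced with a short sketch. It assumes the key is split into two halves by digit count; the function name `fold` and the `reverse_second` flag are hypothetical, introduced only because the 235469 example adds the second part reversed (964) while the 987654 example adds it as is (654).

```python
def fold(key, m=1000, reverse_second=False):
    """Fold a numeric key into two halves, add them, and reduce mod m."""
    digits = str(key)
    half = len(digits) // 2
    first, second = digits[:half], digits[half:]
    if reverse_second:
        second = second[::-1]  # e.g. '469' becomes '964'
    return (int(first) + int(second)) % m


address_a = fold(235469, reverse_second=True)  # (235 + 964) mod 1000
address_b = fold(987654)                       # (987 + 654) mod 1000
```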
Collision
✤ A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record.
✤ In this situation, we must insert the new record in another position, since that address is occupied.
Collision
✤ A perfect hashing function never creates collisions
✤ Collision resolution: the process of finding another position in the collision situation.
Other Hash Methods
✤ Examples of collision resolution methods:
✤ Chaining
✤ Open addressing
✤ Multiple hashing
Other Hash Methods
✤ Collision resolution methods:
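✤ Chaining: various “overflow” locations are kept, with a pointer added to each record location
✤ Let's check an example.
[Figure: a pointer field is added to each record location; on a collision the pointer points to the overflow location that holds the colliding record]
Note: If no collision occurs, the pointers are null (-1)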
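A minimal sketch of chaining as described above, assuming m main slots followed by a small overflow area, with −1 as the null pointer. The class name `ChainedHashTable`, the sizes, and the sample keys are hypothetical.

```python
NULL = -1  # null pointer, as in the note above


class ChainedHashTable:
    def __init__(self, m=5, overflow_size=3):
        self.m = m
        # Every location is [key, pointer]; the first m are main slots,
        # the remaining ones form the overflow area.
        self.locations = [[None, NULL] for _ in range(m + overflow_size)]
        self.next_free = m  # next unused overflow location

    def insert(self, key):
        pos = key % self.m
        if self.locations[pos][0] is None:       # no collision
            self.locations[pos][0] = key
            return pos
        if self.next_free == len(self.locations):
            raise RuntimeError("overflow area is full")
        while self.locations[pos][1] != NULL:    # walk to the end of the chain
            pos = self.locations[pos][1]
        new_pos = self.next_free
        self.next_free += 1
        self.locations[new_pos][0] = key
        self.locations[pos][1] = new_pos         # link the new record into the chain
        return new_pos

    def search(self, key):
        pos = key % self.m
        while pos != NULL:
            if self.locations[pos][0] == key:
                return pos
            pos = self.locations[pos][1]
        return None


table = ChainedHashTable()
# 12, 7 and 22 all hash to slot 2 (mod 5); 4 hashes to slot 4.
positions = [table.insert(k) for k in [12, 7, 22, 4]]
```

The first key takes the main slot and the two colliding keys go to overflow locations 5 and 6, linked from slot 2.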
Other Hash Methods
✤ Open addressing: checking the subsequent positions in order until an empty slot is found
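Open addressing can be sketched as a probe loop. The version below assumes linear probing with wrap-around at the end of the table and h(K) = K mod m; the function name and the table size of 7 are illustrative choices.

```python
def insert_open_addressing(table, key):
    """Place key in the first free slot at or after its hash address."""
    m = len(table)
    start = key % m
    for step in range(m):
        pos = (start + step) % m   # wrap around at the end of the table
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


table = [None] * 7
# 10, 17, 24 and 3 all hash to address 3 (mod 7).
positions = [insert_open_addressing(table, k) for k in [10, 17, 24, 3]]
```

Each colliding key ends up one slot further along, which is why lookups must follow the same probe sequence.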
Other Hash Methods
✤ Multiple hashing: a second (and third…) hash function is applied when the first hash results in a collision
Good Hash Function
✤ Distribute records uniformly over the address space to minimise collisions, while not leaving many empty spaces
✤ It is better to keep the hash table 70% to 90% full
✤ Choose a prime number for m: it distributes the hash addresses better over the address space when the mod hashing function is used
External Hashing
✤ To suit the characteristics of disk storage, the target address space is made of buckets, each of which holds multiple records.
✤ The hashing function maps a key into a relative bucket number, rather than assigning an absolute block address to the bucket
Organising Files on Disk
✤ A cluster is a number of blocks that are consecutive on the storage medium.
✤ A bucket is either one disk block or a cluster of contiguous disk blocks.
External Hashing
✤ Hashing for disk files is called external hashing.
Hashing
✤ A big picture
Dynamic Hashing
✤ The hashing scheme described so far is called static hashing because a fixed number of buckets m is allocated.
✤ We are fixing the address space, which can be a serious drawback for dynamic files.
✤ Example: if the number of records grows a lot, many collisions will result and retrieval will be slowed down because of the long lists of overflow records.
Idea!
✤ h(K) = K mod m
✤ We need a way to make m a variable.
✤ What if we take h(K) to produce not the number of the row in the table, but a binary number?
✤ What if we use this binary number as the row number!
Dynamic Hashing
✤ Internal nodes have two pointers:
✤ the left pointer corresponds to the 0 bit (in the hashed address)
✤ the right pointer corresponds to the 1 bit
✤ Leaf nodes hold a pointer to a bucket with records
[Figure: a binary directory tree; the visible labels are 000, 001 and 01]
Extendible Hash Tables
✤ Extendible hashing uses an array of 2^d bucket addresses (called the directory), where d is the global depth of the directory.
✤ The integer value corresponding to the first (high-order) d bits of a hash value is used as an index into the array to determine a directory entry
✤ The address in that entry determines the bucket in which the corresponding records are stored
Extendible Hash Tables
✤ The local depth d′, stored with each bucket, specifies the number of bits on which the bucket contents are based
✤ If a bucket with d′ = d overflows, the number of entries in the directory doubles.
✤ Halving occurs if d > d′ for all the buckets after some deletions occur.
• The local depth d′ is never greater than the global depth d!
• If a bucket overflows when global depth d = local depth d′, the size of the directory is doubled!
Quiz (part A)
[Figure: a directory with global depth 2 and entries 00, 01, 10, 11, each pointing to a bucket with local depth 2]
✤ Check the above setup!
✤ Let's assume each bucket contains 4 records
Quiz (part A)
✤ Determine the current:
✤ global depth
✤ local depth
✤ Insert the given numbers into the right buckets using the following hash function:
h(K) = K mod M
Quiz (part A)
✤ Numbers to insert: 4, 6, 7, 9, 10, 22, 24, 16, 31
[Figure: the same directory (global depth 2, entries 00, 01, 10, 11) with four empty buckets of local depth 2]
Answer
✤ Global depth: 2
✤ Local depth: 2 (for each of the four buckets)
Answer
✤ Numbers to insert: 4, 6, 7, 9, 10, 22, 24, 16, 31
✤ With global depth 2, h(K) = K mod M becomes h(num) = num mod 4:

Directory   h(num)   Bucket (local depth 2)
00          0        4, 24, 16
01          1        9
10          2        6, 22, 10
11          3        7, 31
Quiz (part B)
✤ More numbers to insert: 20, 26
✤ Add these numbers by applying extendible hashing!
[Figure: directory with global depth 2; buckets 00: 4, 24, 16; 01: 9; 10: 6, 22, 10; 11: 7, 31; each with local depth 2]
Answer
✤ Inserting 20 (20 mod 4 = 0) puts it in the bucket of entry 00, which now holds 4, 24, 16, 20
✤ Now there is overflow here, and the global depth is equal to the local depth (both 2)
✤ So we need to double the size of the directory!
✤ Still to insert: 26
Answer
✤ Double the directory! The global depth becomes 3 and the entries are 000, 001, 010, 011, 100, 101, 110, 111
Ref: Gary D. Boetticher at UHCL

Answer
✤ The overflowing bucket 4, 24, 16, 20 is split and its local depth becomes 3; with global depth 3 the hash function is h(num) = num mod 8:

Directory   h(num)   Bucket       Local depth
000         0        24, 16       3
001, 101    1, 5     9            2
010, 110    2, 6     6, 22, 10    2
011, 111    3, 7     7, 31        2
100         4        4, 20        3

✤ Still to insert: 26
Answer
✤ Inserting 26 (26 mod 8 = 2) gives the bucket 6, 22, 10, 26: now there is overflow here!
✤ But the global depth (3) is more than the local depth (2) of this bucket, so the bucket is split without doubling the directory
Answer
✤ All numbers are inserted, with h(num) = num mod 8:

Directory   h(num)   Bucket    Local depth
000         0        24, 16    3
001, 101    1, 5     9         2
010         2        10, 26    3
011, 111    3, 7     7, 31     2
100         4        4, 20     3
110         6        6, 22     3
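The quiz can be replayed with a short sketch of extendible hashing. Two assumptions are made to match the worked answer rather than the general definition: the directory is indexed by K mod 2^d (the low-order bits, as in h(num) = num mod 4 and num mod 8), and a bucket is treated as overflowing as soon as it reaches 4 records. The names `Bucket`, `ExtendibleHash` and `split_at` are hypothetical, and the sketch does not handle a split that leaves all keys on one side.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth d'
        self.keys = []


class ExtendibleHash:
    def __init__(self, global_depth=2, split_at=4):
        self.d = global_depth     # global depth d
        self.split_at = split_at  # assumed: split once a bucket reaches this size
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def insert(self, key):
        bucket = self.directory[key % (2 ** self.d)]
        bucket.keys.append(key)
        if len(bucket.keys) >= self.split_at:
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.d:
            # Double the directory: entry i + 2^d starts out sharing
            # the bucket of entry i.
            self.directory = self.directory + self.directory
            self.d += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        new_bit = 1 << (bucket.depth - 1)  # the extra bit now examined
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if k & new_bit else bucket).keys.append(k)
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling


eh = ExtendibleHash()
for number in [4, 6, 7, 9, 10, 22, 24, 16, 31, 20, 26]:
    eh.insert(number)
```

Inserting 20 doubles the directory (local depth equals global depth), while inserting 26 only splits a bucket, because that bucket's local depth was still below the global depth.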
Linear hashing
✤ Linear hashing: a type of hashing that allows a hash file to expand and shrink its number of buckets dynamically
✤ The advantage is that there is no need for a directory!
Linear Hashing
✤ Example:
✤ Suppose that the file starts with m buckets that are numbered 0, 1, 2, … , m − 1
✤ with hash functions h_{i+j}(K) = K mod (2^j · m) and j = 0, 1, 2, …
✤ h_i(K) = K mod m is called the initial hash function
Linear Hashing
✤ We have the following parameters (meta-data):
• n: the number of available buckets
• r: the number of records
• i: a counter (how many low-order bits of the key are used as the bucket number)
Linear Hashing
✤ Records originally in bucket 0 are distributed into two buckets based on a new hashing function:
h_{i+j}(K) = K mod (2^j · m)
✤ where j = 0, 1, 2, …
✤ Records that are hashed to bucket 0 by hash function h_i will hash to either bucket 0 or bucket m based on h_{i+1}
Linear Hashing
✤ Again, new collisions lead to overflow records, and additional buckets are split in linear order 1, 2, 3, …
✤ If enough overflows occur, all buckets 0, 1, … , m − 1 will be split, resulting in 2m buckets (instead of m buckets)
✤ This means all buckets then use the hash function h_{i+1}
Linear Hashing
✤ But we need to define the splitting policy.
✤ It tells us when to split a bucket: when the file reaches a threshold close to its maximum capacity.
Linear Hashing
✤ Splitting is controlled by monitoring the file load factor (l):
l = r / ( bfr * n )
✤ where:
✤ r is the current number of file records
✤ bfr is the maximum number of records that fit in a bucket
✤ n is the current number of file buckets
Quiz
✤ Suppose the load factor l = r/(n*bfr) = 0.75
✤ Which statement is correct to say:
(i) 25% of the available memory is occupied
(ii) 75% of the available memory is free
(iii) 25% of the available addresses are used
(iv) 75% of the available addresses are used

Answer
✤ Suppose the load factor l = r/(n*bfr) = 0.75
✤ Which statement is correct to say:
(iv) 75% of the available addresses are used
✤ The r records occupy 75% of the n*bfr record positions, so 25% are still free.
Linear Hashing
✤ Example: let's assume each bucket contains 2 records (bfr = 2)
[Figure: two empty buckets, numbered 0 and 1]
Linear Hashing
✤ Split policy: let's assume a bucket should be split when the load factor (l) exceeds the 0.75 threshold
Linear Hashing
✤ Example: meta-info i = 1, n = 2, r = 0; numbers to insert: 8, 13, 10, 15, 19, 22; h_i(K) = K mod M
✤ h_1(8) = 8 mod 2 = 0: 8 goes to bucket 0 (r = 1)
✤ h_1(13) = 13 mod 2 = 1: 13 goes to bucket 1 (r = 2)
✤ h_1(10) = 10 mod 2 = 0: 10 goes to bucket 0 (r = 3)
✤ We do not exceed the load factor threshold: l = r/(n*bfr) = 3/(2*2) = 0.75
✤ h_1(15) = 15 mod 2 = 1: 15 goes to bucket 1 (r = 4)
✤ Buckets: 0: 8, 10 | 1: 13, 15
✤ Now we exceed the load factor threshold: l = r/(n*bfr) = 4/(2*2) = 1.0
Linear Hashing
✤ So what to do?
✤ The first bucket in the file (bucket 0) is split into two buckets:
(1) original bucket 0
(2) new bucket m
✤ We should split bucket 0! After the split: i = 2, n = 3, r = 4
✤ Buckets 0 and 2 now use h_2(K) = K mod 2M; bucket 1 (not yet split) still uses h_1(K) = K mod 1M
✤ h_2(8) = 8 mod 4 = 0: 8 stays in bucket 0
✤ h_2(10) = 10 mod 4 = 2: 10 moves to bucket 2
✤ Buckets: 0: 8 | 1: 13, 15 | 2: 10
✤ h_1(19) = 19 mod 2 = 1: 19 goes to bucket 1 (r = 5)
✤ No more space in bucket 1! This is overflow!
Linear Hashing
✤ A collision happens again, what to do?
✤ We maintain an individual overflow chain for each bucket
Linear Hashing
✤ Each bucket starts with an empty list of overflow pointers.
[Figure: buckets 0 to 2, each with an overflow pointer]
✤ Now we exceed the load factor threshold: l = r/(n*bfr) = 5/(3*2) = 0.83 (i = 2, n = 3, r = 5)
✤ Buckets: 0: 8 | 1: 13, 15 with 19 in overflow | 2: 10
✤ We split bucket 1! After the split: n = 4
✤ h_2(13) = 13 mod 4 = 1: 13 stays in bucket 1
✤ h_2(15) = 15 mod 4 = 3: 15 moves to bucket 3
✤ h_2(19) = 19 mod 4 = 3: 19 moves to bucket 3
✤ h_2(22) = 22 mod 4 = 2: 22 goes to bucket 2 (r = 6)
✤ Buckets: 0: 8 | 1: 13 | 2: 10, 22 | 3: 15, 19
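The whole example can be replayed with a small sketch of linear hashing. It assumes m = 2 initial buckets, bfr = 2 and a 0.75 threshold as in the slides, and it models overflow records simply as list entries beyond bfr. The class name `LinearHashFile`, the round counter `j` (starting at 0, unlike the slides' i) and the `split_next` pointer are hypothetical names.

```python
class LinearHashFile:
    def __init__(self, m=2, bfr=2, threshold=0.75):
        self.m = m                  # initial number of buckets
        self.bfr = bfr              # records that fit in one bucket
        self.threshold = threshold  # split when the load factor exceeds this
        self.j = 0                  # completed doubling rounds
        self.split_next = 0         # next bucket to split, in linear order
        self.r = 0                  # current number of records
        self.buckets = [[] for _ in range(m)]

    def _address(self, key):
        addr = key % (2 ** self.j * self.m)
        if addr < self.split_next:  # that bucket was already split this round
            addr = key % (2 ** (self.j + 1) * self.m)
        return addr

    def load_factor(self):
        return self.r / (self.bfr * len(self.buckets))

    def insert(self, key):
        # Entries beyond bfr stand in for records on an overflow chain.
        self.buckets[self._address(key)].append(key)
        self.r += 1
        if self.load_factor() > self.threshold:
            self._split()

    def _split(self):
        old = self.buckets[self.split_next]
        self.buckets[self.split_next] = []
        self.buckets.append([])     # new bucket number: split_next + 2^j * m
        modulus = 2 ** (self.j + 1) * self.m
        for key in old:
            self.buckets[key % modulus].append(key)
        self.split_next += 1
        if self.split_next == 2 ** self.j * self.m:  # every bucket of the round is split
            self.j += 1
            self.split_next = 0


lh = LinearHashFile()
for number in [8, 13, 10, 15, 19, 22]:
    lh.insert(number)
```

Note that the bucket chosen for splitting is always the next one in linear order, not necessarily the one that overflowed: inserting 15 fills bucket 1, yet bucket 0 is split first.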
Linear Hashing
✤ We can alternatively write the bucket numbers and the keys in binary:

Bucket   Keys      Keys in binary
00       8         1000
01       13        1101
10       10, 22    1010, 10110
11       15, 19    1111, 10011

✤ What can you see!
✤ It seems the i bits from the right of each key give its bucket number, and i is tied to the number of buckets (n):
n = 2^i, i.e. log2(n) = i
Linear Hashing
Link: youtu.be/h37Jhr21ByQ
Next Lecture
✤ Introduction to NoSQL
