
Data Management

INFO125

Lecture 14:
Physical Data Organization (part 2)

Mehdi Elahi
University of Bergen (UiB)

Memory

✤ Memory is used to store information within a computer, either programs or data.
Memory
✤ A memory system is a collection of various forms of memory, organised in a hierarchy

✤ The memory hierarchy is determined by considering:

✤ Cost per storage unit

✤ Access speed

✤ Reliability
Operations on Files
✤ Operations for locating and accessing file records vary from system to system.

✤ Open
✤ Reset
✤ Find (or locate)
✤ Read (or get)
✤ FindNext
✤ Delete
✤ Modify
✤ Insert
✤ Close
File Header

✤ File header (or file descriptor) contains metadata about the file needed to access the file records.

✤ File header includes information to determine the disk addresses of the file blocks.
File headers

✤ File header may also include:

✤ field lengths & field types

✤ order of fields within (fixed-length) records

✤ separator characters & record type
File headers

✤ But how do we search for and find the records?

✤ Searching for a record on a disk:

(a) Copy one or a few blocks into main memory

(b) Check the file header to find the record

(c) If the file header doesn't help, do a linear search
Operations on Files

✤ This is nearly the same for different data types, even BLOBs

✤ BLOB (= Binary Large Object) is a data item that consists of large unstructured objects which represent images, digitised video or audio streams, or free text.

✤ BLOBs are stored separately from the record, with a pointer directed at them.
Organising Files
✤ Methods for organizing the records of a file so that retrieval & update are optimised:

✤ Heap files

✤ Sorted files

✤ Hashing
Heap Files

✤ Simplest organization, where records are placed in the file in the order in which they are inserted.

✤ Hence, new records are inserted at the end of the file.

✤ As you may know, this organization is called a Heap (Pile) file
Heap Files

[Figure: a data block (data page) holding records]
Heap Files

✤ Advantage:

✤ Inserting a new record is efficient.

✤ Disadvantage:

✤ Searching for a record is expensive since it involves a linear search through the file, block by block.

✤ Deletion is expensive
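The trade-off above can be sketched in a few lines of Python. This is only an illustration: the `HeapFile` class, the block size of 3 and the sample keys are assumptions made for the example, not part of the lecture.

```python
class HeapFile:
    """Toy heap (pile) file: fixed-capacity blocks, records appended at the end."""

    def __init__(self, records_per_block=3):
        self.records_per_block = records_per_block
        self.blocks = [[]]  # list of blocks, each block is a list of records

    def insert(self, record):
        # New records always go into the last block; add a block when it is full.
        if len(self.blocks[-1]) == self.records_per_block:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, key):
        # Linear search, block by block; returns (record, number of blocks read).
        for blocks_read, block in enumerate(self.blocks, start=1):
            for record in block:
                if record[0] == key:
                    return record, blocks_read
        return None, len(self.blocks)


heap = HeapFile(records_per_block=3)
for key in [42, 7, 19, 3, 88, 61, 5]:
    heap.insert((key, f"record-{key}"))

found, blocks_read = heap.search(5)
print(found, blocks_read)
```

Insertion touches only the last block, while a search for the most recently inserted record has to read every block of the file.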
Sorted files

✤ We can physically order the records of a file based on the values of one of the fields (ordering field).

✤ This leads to an ordered file.
Sorted Files
✤ Example:
An ordered (sequential) file of EMPLOYEE records with Name as the ordering key field.
Sorted Files
✤ If the ordering field is also a key field of the file (i.e. the field is guaranteed to have a unique value in each record), then the field is called the ordering key.
Sorted files

✤ Advantages:
✤ reading the records in order becomes efficient
✤ finding the next record from the current one usually requires no additional block accesses
✤ using a search condition based on the value of the ordering key field results in faster access with binary search

✤ Disadvantages:

✤ no benefit when searching by non-ordering fields

✤ inserting and deleting is expensive (in time)
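A minimal sketch of why the ordering key helps, using Python's standard `bisect` module. The employee names and jobs are made-up sample data, and the in-memory lists stand in for the blocks of a real file.

```python
import bisect

# (name, job) records kept physically sorted by the ordering key field Name
employees = [
    ("Aaron", "Engineer"),
    ("Adams", "Architect"),
    ("Baker", "Manager"),
    ("Clark", "Analyst"),
    ("Wong", "CEO"),
]
names = [name for name, _ in employees]  # ordering key values, in file order


def find_by_name(name):
    """Binary search on the ordering key; returns the record or None."""
    pos = bisect.bisect_left(names, name)
    if pos < len(names) and names[pos] == name:
        return employees[pos]
    return None


def insert_sorted(record):
    """Insertion must keep the order, so all later records are shifted."""
    pos = bisect.bisect_left(names, record[0])
    names.insert(pos, record[0])
    employees.insert(pos, record)


print(find_by_name("Clark"))
insert_sorted(("Allen", "Clerk"))
print(names)
```

The lookup needs about log2(n) probes, whereas the insert has to shift every record after the insertion point, which is the expensive part named above.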
Hashing

✤ Another type of primary file organization is based on hashing, which provides very fast access to records under certain search conditions.

✤ This organization is usually called a hash file.
Hash files

✤ Hashing is a method for mapping digital data of arbitrary size to data of fixed size

✤ In the context of database storage, two forms of hashing exist:

✤ Internal Hashing

✤ External Hashing
Internal Hashing

✤ Hash table: files are organised into an array of m 'slots', each containing one record

✤ The address of each slot corresponds to the index of the array

✤ A hash function is used to transform the hash field value into a number from 0 to m − 1
Internal hashing

✤ Example:

[Figure: hash key (hash field) → address of disc block]
Internal Hashing

✤ Example: records stored in an array with m positions, addressed by the hash key (hash field)

John Smith    1112222   Manager     50K
Lisa Smith    1552233   CEO         100K
Sam Doe       3333221   Architect   77K
Sandra Dee    9991911   Engineer    92K
Quiz

✤ Which function could be a hash function:

i. h(K) = e^(K*M)

ii. h(K) = M mod K

iii. h(K) = K mod M

iv. h(K) = K / M
Answer

✤ Which function could be a hash function:

i. h(K) = e^(K*M)

ii. h(K) = M mod K

iii. h(K) = K mod M   ← correct: it maps every key K to an address in the range 0 to M − 1

iv. h(K) = K / M
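One way to check the four candidates is to test whether each always yields an integer slot address in the range 0 to M − 1. The sketch below does that for a few sample keys. The values M = 7 and the key list are assumptions chosen for illustration, and the first candidate is read as e^(K*M).

```python
import math

M = 7                  # number of slots (assumed value)
keys = [1, 3, 12, 25]  # sample hash field values (assumed)

candidates = {
    "e^(K*M)": lambda k: math.exp(k * M),
    "M mod K": lambda k: M % k,
    "K mod M": lambda k: k % M,
    "K / M": lambda k: k / M,
}


def is_valid_address(value):
    # A usable slot address is an integer in the range 0 .. M-1.
    return float(value).is_integer() and 0 <= value < M


for name, h in candidates.items():
    ok = all(is_valid_address(h(k)) for k in keys)
    print(f"{name}: {'always a valid slot' if ok else 'not always a valid slot'}")
```

Only `K mod M` passes for every key. `M mod K` fails as soon as a key is larger than M, because the result is then M itself.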
Alternative Methods

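✤ Folding: applying an arithmetic or a logical function to different portions of the hash field to get a hash address

✤ Example:

✤ if m = 1000, how to store 235469? Split it into '235' and '469'

✤ (235 + 964) mod 1000 = 199 (the second part is reversed before adding; without the reversal, (235 + 469) mod 1000 = 704)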
Quiz

✤ Consider the K mod 1000 function and the folding method: which hash address corresponds to the hash field 987654?

(A) 641

(B) 666

(C) 0

(D) 654
Answer

✤ Consider the K mod 1000 function and the folding method: which hash address corresponds to the hash field 987654?

(A) 641   ← correct

(B) 666

(C) 0

(D) 654

✤ m = 1000, store 987654 → '987' & '654'

✤ (987 + 654) mod 1000 = 641
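The two folding calculations in this part of the lecture can be reproduced with a short sketch. The function names `shift_fold` and `boundary_fold` are illustrative, and the 3-digit chunk width is an assumption that matches m = 1000.

```python
def split_parts(key, width=3):
    """Cut the decimal digits of key into chunks of `width` digits, left to right."""
    digits = str(key)
    return [digits[i:i + width] for i in range(0, len(digits), width)]


def shift_fold(key, m=1000, width=3):
    # Shift folding: add the parts as they are.
    return sum(int(p) for p in split_parts(key, width)) % m


def boundary_fold(key, m=1000, width=3):
    # Boundary folding: reverse every second part before adding.
    parts = split_parts(key, width)
    return sum(int(p[::-1]) if i % 2 else int(p) for i, p in enumerate(parts)) % m


print(shift_fold(987654))     # (987 + 654) mod 1000
print(boundary_fold(235469))  # (235 + 964) mod 1000
```

The quiz answer corresponds to the shift variant, while the 235469 example matches the variant that reverses the second part.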
Collision

✤ A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record.

✤ In this situation, we must insert the new record in another position, since that address is occupied.
Collision

✤ A perfect hashing function never creates collisions

✤ Collision resolution: the process of finding another position when a collision occurs.
Other Hash Methods

✤ Examples of collision resolution methods:

✤ Chaining

✤ Open addressing

✤ Multiple hashing
Other Hash Methods

✤ Collision resolution methods:

✤ Chaining: various "overflow" locations are kept, with a pointer added to each record location

✤ Let's check an example.

[Figure: hash table with a pointer field added to each record location; on a collision the pointer points to the overflow location]

Note: If no collision occurs, the pointers are null (-1)
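A minimal sketch of chaining with an overflow area, assuming h(K) = K mod m and using -1 as the null pointer as in the note above. The class name and the layout (primary slots first, overflow locations appended after them) are choices made for this example.

```python
NULL = -1  # "no next record"


class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        # Each location holds [key, next_pointer]; locations 0..m-1 are the
        # primary slots, anything appended later is an overflow location.
        self.locations = [[None, NULL] for _ in range(m)]

    def insert(self, key):
        addr = key % self.m
        if self.locations[addr][0] is None:
            self.locations[addr][0] = key
            return addr
        # Collision: walk to the end of the chain, then link a new overflow location.
        while self.locations[addr][1] != NULL:
            addr = self.locations[addr][1]
        self.locations.append([key, NULL])
        self.locations[addr][1] = len(self.locations) - 1
        return len(self.locations) - 1

    def search(self, key):
        addr = key % self.m
        while addr != NULL and self.locations[addr][0] is not None:
            if self.locations[addr][0] == key:
                return addr
            addr = self.locations[addr][1]
        return None


table = ChainedHashTable(m=5)
print([table.insert(k) for k in [12, 22, 32]])
print(table.locations[2])
```

All three sample keys hash to slot 2, so the second and third end up in overflow locations that are reached by following the pointers.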


Other Hash Methods
✤ Collision resolution methods:

✤ Open addressing: checking the subsequent positions in order until an empty slot is found
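Open addressing can be sketched as linear probing over a fixed-size array. The helper name `insert_open_addressing`, the table size 7 and the keys are assumptions chosen so that every key collides with the first one.

```python
def insert_open_addressing(table, key):
    """Linear probing: start at key mod m and scan forward (wrapping) for a free slot."""
    m = len(table)
    start = key % m
    for step in range(m):
        addr = (start + step) % m
        if table[addr] is None:
            table[addr] = key
            return addr
    raise OverflowError("hash table is full")


table = [None] * 7
placed = [insert_open_addressing(table, k) for k in [10, 17, 24, 3]]
print(placed)
print(table)
```

Each key hashes to address 3, so they end up in consecutive slots starting there.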
Other Hash Methods

✤ Collision resolution methods:

✤ Multiple hashing: a second (and third…) hash function is applied if the first hash results in a collision
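A sketch of multiple hashing, assuming the common variant in which the next hash function is tried only when the previous one collides, with linear probing as a last resort. The two hash functions `h1` and `h2` are illustrative choices, not taken from the lecture.

```python
def insert_multiple_hashing(table, key, hash_functions):
    """Try each hash function in turn; fall back to linear probing if all collide."""
    m = len(table)
    addr = None
    for h in hash_functions:
        addr = h(key) % m
        if table[addr] is None:
            table[addr] = key
            return addr
    # Every hash function collided: probe forward from the last address tried.
    for step in range(1, m):
        probe = (addr + step) % m
        if table[probe] is None:
            table[probe] = key
            return probe
    raise OverflowError("hash table is full")


# Two illustrative hash functions (assumptions for this example).
def h1(k):
    return k % 7


def h2(k):
    return (k // 7) % 7


table = [None] * 7
placed = [insert_multiple_hashing(table, k, [h1, h2]) for k in [10, 17, 24]]
print(placed)
```

Here 10 is placed by `h1`, 17 collides under `h1` and is placed by `h2`, and 24 collides under both and is placed by probing.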
Good Hash Function

✤ Distribute records uniformly over the address space to minimise collisions, while not leaving many empty spaces

✤ It is better to keep the hash table 70% to 90% full

✤ Choose a prime number for m: it distributes the hash addresses better over the address space when the mod hashing function is used
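The effect of a prime m can be seen with keys that share a common factor with a non-prime m. The key pattern below (multiples of 10) is an assumption picked to make the effect obvious.

```python
from collections import Counter

keys = range(0, 1100, 10)  # 110 keys that are all multiples of 10 (assumed pattern)


def bucket_counts(keys, m):
    """How many keys land on each address under h(K) = K mod m."""
    return Counter(k % m for k in keys)


print(len(bucket_counts(keys, 10)))  # distinct addresses used with m = 10
print(len(bucket_counts(keys, 11)))  # distinct addresses used with m = 11
```

With m = 10 every key collides on a single address, while the prime m = 11 spreads the same keys evenly over all 11 addresses.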
External Hashing

✤ To suit the characteristics of disk storage, the target address space is made of buckets, each of which holds multiple records.

✤ The hashing function maps a key into a relative bucket number, rather than assigning an absolute block address to the bucket
Organising Files on Disk

✤ A cluster is a number of blocks that are consecutive on the storage medium.

✤ A bucket is either one disk block or a cluster of contiguous disk blocks.
External Hashing

✤ Hashing for disk files is called external hashing.
Hashing

✤ A big picture
Dynamic Hashing
✤ The hashing scheme described so far is called static hashing because a fixed number of buckets m is allocated.

✤ Fixing the address space can be a serious drawback for dynamic files.

✤ Example: if the number of records grows a lot, many collisions will result and retrieval will be slowed down because of the long lists of overflow records.
Idea!

✤ h(K) = K mod m

✤ We need a way to make m variable.

✤ What if h(K) produced a binary number instead of the number of the row in the table?

✤ What if we used this binary number as the row number?

Dynamic Hashing

✤ Internal nodes have two pointers:

✤ the left pointer corresponds to the 0 bit (in the hashed address)

✤ the right pointer corresponds to the 1 bit

✤ Leaf nodes hold a pointer to a bucket with records

[Figure: binary tree directory whose leaves are reached by the bit strings 000, 001 and 01]
Extendible Hash Tables
✤ Extendible hashing uses an array of 2^d bucket addresses (called the directory), where d is the global depth of the directory.

✤ The integer value corresponding to the first (high-order) d bits of a hash value is used as an index into the array to determine a directory entry

✤ The address in that entry determines the bucket in which the corresponding records are stored
Extendible Hash Tables

✤ Local depth d′ specifies the number of bits on which the bucket contents are based

✤ If a bucket overflows when d = d′, the number of entries in the directory doubles.

✤ Halving occurs if d > d′ for all the buckets after some deletions occur.
• Local depth d′ is smaller than global depth d!

• If global depth d = local depth d′ when the bucket overflows, the size of the directory is doubled!
Quiz (part A)

Let's assume each bucket holds 4 records

Directory (2)   Buckets
00              (2) empty
01              (2) empty
10              (2) empty
11              (2) empty

Check the above setup!
Quiz (part A)
✤ Determine the current:

✤ global depth

✤ local depth

✤ How do we insert the given numbers into the right bucket with the following hash function?

h(K) = K mod M
Quiz (part A)

Numbers to insert: 4, 6, 7, 9, 10, 22, 24, 16, 31

Directory (2)   Buckets
00              (2) empty
01              (2) empty
10              (2) empty
11              (2) empty
Answer

✤ Global depth of the directory: 2

✤ Local depth of every bucket: 2
Answer

Numbers to insert: 4, 6, 7, 9, 10, 22, 24, 16, 31

h(K) = K mod M, here h(num) = num mod 4

Directory (global depth 2)   h   Bucket (local depth)
00                           0   4, 24, 16 (2)
01                           1   9 (2)
10                           2   6, 22, 10 (2)
11                           3   7, 31 (2)
Quiz (part B)

More numbers to insert: 20, 26

Directory (global depth 2)   Bucket (local depth)
00                           4, 24, 16 (2)
01                           9 (2)
10                           6, 22, 10 (2)
11                           7, 31 (2)

Add more numbers by applying extendible hashing!
Answer

✤ Insert 20: 20 mod 4 = 0

Directory (global depth 2)   Bucket (local depth)
00                           4, 24, 16, 20 (2)   ← overflow
01                           9 (2)
10                           6, 22, 10 (2)
11                           7, 31 (2)

Now there is overflow here, and the global depth is equal to the local depth, so we need to double the size of the directory!

Still to insert: 26
Answer

✤ Double the directory: the global depth becomes 3, with entries 000, 001, 010, 011, 100, 101, 110, 111

✤ The overflowing bucket gets local depth 3; the other buckets keep local depth 2

Ref: Gary D. Boetticher at UHCL
Answer

✤ Split the overflowing bucket: a new bucket with local depth 3 is added

Directory (global depth 3)   Bucket (local depth)
000                          4, 24, 16, 20 (3)   ← split!
001                          9 (2)
010                          6, 22, 10 (2)
011                          7, 31 (2)
100                          new bucket, empty (3)
101                          same bucket as 001
110                          same bucket as 010
111                          same bucket as 011

Still to insert: 26
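Answer

✤ Redistribute the split bucket with h(num) = num mod 8

Directory (global depth 3)   h   Bucket (local depth)
000                          0   24, 16 (3)
001                          1   9 (2)
010                          2   6, 22, 10 (2)
011                          3   7, 31 (2)
100                          4   4, 20 (3)
101                          5   same bucket as 001
110                          6   same bucket as 010
111                          7   same bucket as 011

Still to insert: 26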
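Answer

✤ Insert 26: 26 mod 8 = 2

Directory (global depth 3)   h   Bucket (local depth)
000                          0   24, 16 (3)
001                          1   9 (2)
010                          2   6, 22, 10, 26 (2)   ← overflow
011                          3   7, 31 (2)
100                          4   4, 20 (3)
101                          5   same bucket as 001
110                          6   same bucket as 010
111                          7   same bucket as 011

Now there is overflow here, but the global depth is greater than the local depth, so the bucket is split without doubling the directory.

h(num) = num mod 8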
Answer

All numbers are inserted

Directory (global depth 3)   h   Bucket (local depth)
000                          0   24, 16 (3)
001                          1   9 (2)
010                          2   10, 26 (3)
011                          3   7, 31 (2)
100                          4   4, 20 (3)
101                          5   same bucket as 001
110                          6   6, 22 (3)
111                          7   same bucket as 011

h(num) = num mod 8
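The quiz can be replayed with a small extendible hashing sketch. It follows the quiz in using h(K) = K mod 2^d, i.e. the low-order d bits, as the directory index. The class names are illustrative. The slides treat a bucket as overflowing once it holds 4 records, so the sketch assumes a capacity of 3 records before a split is triggered.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=3):
        self.global_depth = global_depth
        self.capacity = capacity  # assumed: a bucket splits when it exceeds this
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def insert(self, key):
        bucket = self.directory[key % (2 ** self.global_depth)]
        bucket.keys.append(key)
        while len(bucket.keys) > self.capacity:
            bucket = self._split(bucket, key)

    def _split(self, bucket, key):
        if bucket.local_depth == self.global_depth:
            # d = d': double the directory; the new entries share the old buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        # Redistribute the keys on one more low-order bit.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if k & new_bit else bucket).keys.append(k)
        # Repoint the directory entries whose new bit is 1.
        for idx, b in enumerate(self.directory):
            if b is bucket and idx & new_bit:
                self.directory[idx] = sibling
        return self.directory[key % (2 ** self.global_depth)]


table = ExtendibleHash(global_depth=2, capacity=3)
for number in [4, 6, 7, 9, 10, 22, 24, 16, 31, 20, 26]:
    table.insert(number)

for idx, bucket in enumerate(table.directory):
    print(f"{idx:0{table.global_depth}b}", sorted(bucket.keys), bucket.local_depth)
```

Inserting 20 doubles the directory because the local depth equals the global depth, whereas inserting 26 only splits a bucket because its local depth is still smaller than the global depth.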
Linear hashing

✤ Linear hashing: a type of hashing which allows a hash file to expand and shrink its number of buckets dynamically

✤ The advantage is that there is no need for a directory!
Linear Hashing

✤ Example:

✤ Suppose that the file starts with m buckets that are numbered 0, 1, 2, …, m − 1

✤ with hash function h_{i+j}(K) = K mod (2^j m) and j = 0, 1, 2, …

✤ h_i(K) = K mod m is called the initial hash function
Linear Hashing

✤ We have the following parameters (meta-data):

• n: number of available buckets

• r: the number of records

• i: counter (how many bits of the hash value are used)
Linear Hashing

✤ Records originally in bucket 0 are distributed into two buckets based on a new hashing function:

h_{i+j}(K) = K mod (2^j m)

✤ Where j = 0, 1, 2, …

✤ Records that are hashed to bucket 0 by hash function h_i will hash to either bucket 0 or bucket m based on h_{i+1}
Linear Hashing

✤ Again, new collisions lead to overflow records, and additional buckets are split in linear order 1, 2, 3, …

✤ If enough overflows occur, all buckets 0, 1, …, m − 1 will be split, resulting in 2m buckets (instead of m buckets)

✤ This means all buckets then use the hash function h_{i+1}
Linear Hashing

✤ But we need to define the splitting policy.

✤ This helps us decide when to split a bucket, for example when the file reaches a threshold close to its maximum capacity.
Linear Hashing
✤ Splitting is controlled by monitoring the file load factor (l):

l = r / (bfr * n)

✤ where:

✤ r is the current number of file records

✤ bfr is the maximum number of records that fit in a bucket

✤ n is the current number of file buckets
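The load factor formula translates directly into a helper function. The sample values below (2 buckets with room for 2 records each) are assumptions for illustration.

```python
def load_factor(r, n, bfr):
    """l = r / (bfr * n): fraction of the record slots in the file that are occupied."""
    return r / (bfr * n)


print(load_factor(r=3, n=2, bfr=2))  # 3 of 4 slots used
print(load_factor(r=4, n=2, bfr=2))  # all 4 slots used
```

A value of 1.0 means every bucket is full on average, so a split threshold is typically set somewhat below that.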
Quiz

✤ Suppose the load factor l = r/(n*bfr) = 0.75

✤ Which statement is correct to say:

(i) 25% of the available memory is occupied

(ii) 75% of the available memory is free

(iii) 25% of the available addresses are used

(iv) 75% of the available addresses are used


Answer

✤ Suppose the load factor l = r/(n*bfr) = 0.75

✤ Which statement is correct to say:

(i) 25% of the available memory is occupied

(ii) 75% of the available memory is free

(iii) 25% of the available addresses are used

(iv) 75% of the available addresses are used   ← correct: r records occupy 75% of the n * bfr available record positions
Linear Hashing
✤ Example:

Let's assume each bucket contains 2 records

Bucket   Records
0        (empty)
1        (empty)
Linear Hashing

✤ Split policy: Let's assume a bucket should be split when the load factor (l) exceeds the 0.75 threshold
Linear Hashing

Meta-info: i = 1, n = 2, r = 0

✤ Example:

Numbers to insert: 8, 13, 10, 15, 19, 22

Bucket   Records
0        (empty)
1        (empty)

h_i(K) = K mod M
Linear Hashing

Meta-info: i = 1, n = 2, r = 1

✤ Insert 8: h_1(8) = 8 mod 2 = 0

Bucket   Records
0        8
1        (empty)
Linear Hashing

Meta-info: i = 1, n = 2, r = 2

✤ Insert 13: h_1(13) = 13 mod 2 = 1

Bucket   Records
0        8
1        13
Linear Hashing

Meta-info: i = 1, n = 2, r = 3

✤ Insert 10: h_1(10) = 10 mod 2 = 0

Bucket   Records
0        8, 10
1        13
Linear Hashing

Meta-info: i = 1, n = 2, r = 3

We do not exceed the load factor threshold:

l = r/(n*bfr) = 3/(2*2) = 0.75

Bucket   Records
0        8, 10
1        13
Linear Hashing

Meta-info: i = 1, n = 2, r = 4

✤ Insert 15: h_1(15) = 15 mod 2 = 1

Bucket   Records
0        8, 10
1        13, 15
Linear Hashing

Meta-info: i = 1, n = 2, r = 4

Now we exceed the load factor threshold:

l = r/(n*bfr) = 4/(2*2) = 1.0

Bucket   Records
0        8, 10
1        13, 15
Linear Hashing
✤ So what to do?

✤ The first bucket in the file (bucket 0) is split into two buckets:

(1) original bucket 0

(2) new bucket m
Linear Hashing

We should split bucket 0!

Meta-info before the split: i = 1, n = 2, r = 4

Bucket   Records
0        8, 10
1        13, 15

Meta-info after the split: i = 2, n = 3, r = 4

Bucket   Records
0        8, 10   ← split!
1        13, 15
2        (empty)
Linear Hashing

Meta-info: i = 2, n = 3, r = 4

Bucket   Records   Hash function
0        8, 10     h_2(K) = K mod 2M   ← split!
1        13, 15    h_1(K) = K mod 1M
2        (empty)   h_2(K) = K mod 2M
Linear Hashing

Meta-info: i = 2, n = 3, r = 4

✤ Redistribute the records of bucket 0: h_2(8) = 8 mod 4 = 0 and h_2(10) = 10 mod 4 = 2

Bucket   Records
0        8
1        13, 15
2        10

Numbers still to insert: 19, 22
Linear Hashing

Meta-info: i = 2, n = 3, r = 4

✤ Insert 19: h_1(19) = 19 mod 2 = 1

Bucket   Records
0        8
1        13, 15
2        10

Numbers still to insert: 19, 22
Linear Hashing

Meta-info: i = 2, n = 3, r = 5

No more space in bucket 1! This is overflow!

Bucket   Records
0        8
1        13, 15
2        10
Linear Hashing

✤ Collisions happen again, what to do?

✤ We maintain an individual overflow chain for each bucket
Linear Hashing

✤ Each bucket has an (initially empty) list of overflow pointers.

[Figure: buckets 0 to 2, each with an overflow pointer]
Linear Hashing

Meta-info: i = 2, n = 3, r = 5

Now we exceed the load factor threshold:

l = r/(n*bfr) = 5/(3*2) = 0.83

Bucket   Records   Overflow
0        8
1        13, 15    19
2        10
Linear Hashing

Meta-info: i = 2, n = 4, r = 5

We should split bucket 1!

Bucket   Records   Overflow
0        8
1        13, 15    19         ← split!
2        10
3        (empty)
Linear Hashing

Meta-info: i = 2, n = 4, r = 5

✤ Redistribute the records of bucket 1: h_2(13) = 13 mod 4 = 1, h_2(15) = 15 mod 4 = 3 and h_2(19) = 19 mod 4 = 3

Bucket   Records
0        8
1        13
2        10
3        15, 19

Numbers still to insert: 22
Linear Hashing

Meta-info: i = 2, n = 4, r = 6

✤ Insert 22: h_2(22) = 22 mod 4 = 2

Bucket   Records
0        8
1        13
2        10, 22
3        15, 19
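The walkthrough can be replayed with a compact linear hashing sketch, assuming m = 2 initial buckets, bfr = 2 and a 0.75 split threshold as in the example. The class and attribute names are illustrative. The sketch tracks a split pointer and a round counter `level` rather than the slides' counter i, and it stores each bucket together with its overflow chain in one Python list.

```python
class LinearHashFile:
    def __init__(self, m=2, bfr=2, threshold=0.75):
        self.m = m                  # initial number of buckets
        self.bfr = bfr              # records that fit in a bucket
        self.threshold = threshold  # split when the load factor exceeds this
        self.buckets = [[] for _ in range(m)]  # each list = bucket + overflow chain
        self.next_split = 0         # next bucket to split, in linear order
        self.level = 0              # j: number of completed doubling rounds
        self.r = 0                  # number of records

    def _address(self, key):
        addr = key % (2 ** self.level * self.m)
        if addr < self.next_split:  # that bucket was already split in this round
            addr = key % (2 ** (self.level + 1) * self.m)
        return addr

    def load_factor(self):
        return self.r / (self.bfr * len(self.buckets))

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.r += 1
        if self.load_factor() > self.threshold:
            self._split()

    def _split(self):
        size = 2 ** self.level * self.m
        old = self.buckets[self.next_split]
        self.buckets[self.next_split] = []
        self.buckets.append([])     # new bucket number next_split + size
        for key in old:
            self.buckets[key % (2 * size)].append(key)
        self.next_split += 1
        if self.next_split == size:  # every bucket of this round has been split
            self.next_split = 0
            self.level += 1


hash_file = LinearHashFile(m=2, bfr=2, threshold=0.75)
for number in [8, 13, 10, 15, 19, 22]:
    hash_file.insert(number)

for number, bucket in enumerate(hash_file.buckets):
    print(number, bucket)
```

Note that the bucket being split is always the next one in linear order, not necessarily the one that overflowed. Inserting 19 overflows bucket 1, which happens to be next in line at that point.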
Linear Hashing

We can alternatively write the bucket numbers in binary:

Bucket   Records
00       8
01       13
10       10, 22
11       15, 19
Linear Hashing

We can alternatively write the records in binary as well. What can you see?

Bucket   Records
00       1000
01       1101
10       1010, 10110
11       1111, 10011
Linear Hashing

It seems that the i rightmost bits of each record give its bucket number, and that i is tied to the number of buckets (n):

n = 2^i, i.e. log2(n) = i

Bucket   Records
00       1000
01       1101
10       1010, 10110
11       1111, 10011
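The observation can be checked directly: with n = 4 buckets, the last i = 2 bits of each key equal its bucket number. This is a small sketch; the key list is the one used in the example.

```python
keys = [8, 13, 10, 22, 15, 19]
n = 4                   # number of buckets after the splits
i = n.bit_length() - 1  # i = log2(n) when n is a power of two

for key in keys:
    bucket = key % n
    low_bits = format(key, "b")[-i:]
    print(f"{key:>2} = {key:b} -> bucket {bucket:0{i}b} (last {i} bits: {low_bits})")
```

Taking K mod 2^i and keeping the i low-order bits of K are the same operation, which is why the bucket number can be read off the binary form.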
Linear Hashing

Link: youtu.be/h37Jhr21ByQ
Next Lecture

✤ Introduction to NoSQL
