
Tree

CHAPTER 5 LEC 3
Outline
B-Tree (introduction)
Hash function
Data compression
B-tree
Suppose you have 100,000 items in a BST
• Levels: ⌈log₂ 100,000⌉ ≈ 17
• Meaning: the disk may need to be accessed ~17 times per search
• Note: a portion of the data may already be in memory
 Data Access Times
 RAM: ~50–150 ns
 Hard Disk Drive (HDD): ~9–15 ms
 An HDD can be ~100,000 times slower!

 Conclusion: BSTs not good enough for BIG data


Multi-way Search Tree
 A multi-way search tree of order m:
• Each node has at most m children and at most m−1 keys
• In the figure: a tree of order 4 (m = 4)
• Up to 3 keys per node (m − 1)
• Up to 4 children per node (m)
Multi-way Search Tree
 Keys are arranged analogously to BSTs
• Take the key “60”
• Items to its left are smaller
• Items to its right are bigger
Performance of Multi-way Trees
 Suppose you have 1,000,000 items.
• Balanced binary tree: ~log₂(10⁶) ≈ 20 levels
• Balanced 10th-order tree: ~log₁₀(10⁶) = 6 levels
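The heights quoted above can be checked with a short integer computation; a minimal sketch (class and method names are illustrative):

```java
public class TreeHeight {
    // Smallest h such that an m-way tree of height h can index n items,
    // i.e. ceil(log_m(n)), computed with integer arithmetic.
    static int height(long n, int m) {
        int h = 0;
        for (long span = 1; span < n; span *= m) h++;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(height(100_000, 2));    // 17 levels for the BST example
        System.out.println(height(1_000_000, 2));  // ~20 for a binary tree
        System.out.println(height(1_000_000, 10)); // 6 for a 10th-order tree
    }
}
```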
B+ Tree Properties
1. All data items are stored at the leaves
• This speeds up traversal
2. Non-leaf nodes store up to M−1 keys to guide the search
3. The root has between 2 and M children (unless it is a leaf)
B+ Tree Properties
4. All non-leaf nodes (except the root) have between
ceil(M/2) and M children
• In other words: a node must be at least half full.
5. All leaves are at the same depth.

 These two requirements enforce balance


B+ Tree Illustrated
Hash Function

This lecture is taken from the book “Data Structures and Algorithm Analysis in Java” by Mark Allen Weiss.
Motivation

We need a technique for performing insertions, deletions, and searches in constant average time.
General Idea
 The ideal hash table data structure is merely an array of
some fixed size, containing the items.
 A search is performed on some part (that is, data field) of
the item. This is called the key.
 Each key is mapped into some number in the range 0 to
TableSize − 1 and placed in the appropriate cell.
 The mapping is called a hash function, which ideally should be simple to compute and should map any two distinct keys to different cells (when two keys map to the same cell, we have a collision).
General Idea
Example:
h(john)=3
h(phil)=4
h(dave)=6
h(mary)=7
Hash Function

Two major issues in designing a hash function:
 How do we come up with a hash function?
 What do we do when two keys hash to the same value (a collision)?
Hash Function - example
Division: if the input keys are integers, a simple hash function is Key mod TableSize:
h(K) = K mod TSize
Choose TSize to be prime.
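The division method can be sketched in a few lines; the table size below is an illustrative prime:

```java
public class DivisionHash {
    static final int TABLE_SIZE = 10_007; // a prime, as recommended above

    // h(K) = K mod TSize; floorMod keeps the result non-negative
    static int hash(int key) {
        return Math.floorMod(key, TABLE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(hash(10_008)); // wraps around to 1
        System.out.println(hash(-3));     // still lands in 0..10006
    }
}
```

`Math.floorMod` is used instead of `%` so that negative keys still map into the table range.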

 Phone Numbers: 251-912-34-56-78
• What if we take the first or second three digits as the key? It’s a bad idea: those prefixes are shared by many numbers, so keys cluster.
 IP Addresses: 213.124.67.90
• What if we take the first part? (213 in the above case)
• What if we take the last part? (90 in the above case)
Folding: in the case of strings, one approach processes all characters of the string by adding up the ASCII (or Unicode) values of the characters and using the result as the address.
Given TableSize = 10,007 and strings of at most 8 characters (127 × 8 = 1016), keys can only land in the first ~1016 cells: clearly not an equitable distribution.

 A better approach: for a string k₀k₁…kₙ
• h = k₀ + 37k₁ + 37²k₂ + … + 37ⁿkₙ
• The result has to be compressed (e.g. mod TableSize) to the appropriate range
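This polynomial is usually evaluated with Horner's rule; a sketch (note it weights k₀ highest rather than kₙ as in the formula above, but either convention spreads keys equally well):

```java
public class StringHash {
    // Polynomial string hash evaluated with Horner's rule,
    // then compressed into the table range with mod.
    static int hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++)
            hashVal = 37 * hashVal + key.charAt(i);
        hashVal %= tableSize;   // compress to 0..tableSize-1
        if (hashVal < 0)        // fix the sign after int overflow
            hashVal += tableSize;
        return hashVal;
    }

    public static void main(String[] args) {
        System.out.println(hash("java", 10_007));
    }
}
```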
Collision Resolution
 Factors that minimize the number of collisions:
• the hash function
• the table size
 No matter how good a hash function may be, collisions are likely to occur

 Collision Resolution Mechanisms
• Separate Chaining
• Open Addressing
Open Addressing
 When a key collides with another key, the collision is resolved by finding an available table entry other than the position (address) to which the colliding key originally hashed.
 If position h(K) is occupied, then the positions in the
probing sequence:
norm(h(K) + p(1)), norm(h(K) + p(2)), . . . , norm(h(K) + p(i)), . . .
• p is a probing function
• norm is a normalization function
Open Addressing - Linear Probing
p(i) = i, so the probe sequence is (h(K) + 1) mod TSize, (h(K) + 2) mod TSize, …
Linear Probing - Search
Searching:
• 1. Calculate the hash value of the key: i = h(key)
• 2. Go to the position calculated by the hash function (the i-th position)
• 3. Search linearly from there until the key is found
• 4. If you arrive at an empty bucket while searching, the item is not in the table
Deletion:
• Deletion is NOT accomplished by simply removing a data item from a cell, leaving it empty: an empty cell would stop later searches too early, hiding keys that probed past that cell when they were inserted.
• For this reason a deleted item is replaced by an item with a special key value that identifies it as deleted.
 Linear probing is prone to clustering.
 If a cluster is long, performance degrades.
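The search and lazy-deletion steps above can be sketched as follows, assuming integer keys and a fixed table size (the sketch also assumes the table never fills completely):

```java
public class LinearProbing {
    private static final int EMPTY = -1, DELETED = -2; // DELETED = tombstone
    private final int[] table;

    public LinearProbing(int size) {
        table = new int[size];
        java.util.Arrays.fill(table, EMPTY);
    }

    private int hash(int key) { return Math.floorMod(key, table.length); }

    public void insert(int key) {
        int i = hash(key);
        while (table[i] != EMPTY && table[i] != DELETED)
            i = (i + 1) % table.length;   // linear probe: step by 1
        table[i] = key;
    }

    public boolean contains(int key) {
        int i = hash(key);
        while (table[i] != EMPTY) {       // only a truly empty cell ends the search
            if (table[i] == key) return true;
            i = (i + 1) % table.length;
        }
        return false;
    }

    public void delete(int key) {
        int i = hash(key);
        while (table[i] != EMPTY) {
            if (table[i] == key) { table[i] = DELETED; return; } // leave a tombstone
            i = (i + 1) % table.length;
        }
    }
}
```

The tombstone value keeps probe chains intact: a search walking past a deleted cell keeps going instead of stopping early.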
Open Addressing - Quadratic Probing
h(K) + i², h(K) − i² for i = 1, 2, …, (TSize − 1)/2
Open Addressing - Double Hashing
 In double hashing we have two functions
• The usual hash function: i = h(key)
• A function for calculating the number of cells to “jump” in case of collision: step = h2(key)
 When a collision occurs, we will try to insert the new data at i + step.
 If i + step is also occupied, we try i + 2·step, and so on

 Performance is a little better than linear probing
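A sketch of the probe sequence. The second hash below (R − K mod R for a prime R smaller than the table size) is an assumed textbook choice, since the slides leave h2 unspecified; its key property is that the step is never zero, so probing always advances:

```java
public class DoubleHash {
    static final int SIZE = 11; // table size
    static final int R = 7;     // a prime smaller than SIZE (assumed choice)

    static int h(int key)  { return Math.floorMod(key, SIZE); }

    // Step size for colliding keys; always in 1..R, never zero
    static int h2(int key) { return R - Math.floorMod(key, R); }

    // Position tried on the i-th probe
    static int probe(int key, int i) {
        return Math.floorMod(h(key) + i * h2(key), SIZE);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++)
            System.out.println(probe(23, i)); // 1, 6, 0
    }
}
```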


Separate Chaining
 “install” a linked list at each index in the hash table
 When multiple keys hash to the same spot, they are
inserted into the linked list

Search:
• Step 1: We determine the location
of the item in the table using the
hash function
• Step 2: We traverse the linked list
to extract the data
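The two search steps above map directly onto a minimal separate-chaining sketch (String keys; class and method names are illustrative):

```java
import java.util.LinkedList;

public class ChainedTable {
    private final LinkedList<String>[] lists; // one linked list per index

    @SuppressWarnings("unchecked")
    public ChainedTable(int size) {
        lists = new LinkedList[size];
        for (int i = 0; i < size; i++)
            lists[i] = new LinkedList<>();
    }

    private int hash(String key) {
        return Math.floorMod(key.hashCode(), lists.length);
    }

    // Colliding keys simply go into the same list
    public void insert(String key) {
        lists[hash(key)].add(key);
    }

    // Step 1: locate the chain; step 2: traverse it
    public boolean contains(String key) {
        return lists[hash(key)].contains(key);
    }
}
```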
Separate Chaining…
 Searching an item requires computing the hash function,
and then traversing the list.
• Computing Hash Function: takes nearly constant time
• Traversing the List: depends on load factor
 Load factor: L = (number of elements M) / (table size N)
 Unsuccessful search: O(1 + L)

 Tables with smaller load factor are faster


 Rehashing is one technique to keep the load factor
small
Rehashing

 If the table gets too full, the running time for the
operations will start taking too long

 Rehashing is an expensive operation
• It must be avoided whenever possible
• If we have an initial estimate of the number of items to be inserted, the table size should be adjusted accordingly
Rehashing
Rehashing has two steps
1. Create a new array that is larger
2. Copy keys from the old table to the new one using the new hash function
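The two steps can be sketched for an open-addressing table of int keys (−1 marks an empty cell; roughly doubling the size stands in for moving to the next larger prime):

```java
import java.util.Arrays;

public class Rehash {
    static final int EMPTY = -1;

    // Step 1: create a larger array; step 2: re-insert every key,
    // recomputing its position with the new table size.
    static int[] rehash(int[] oldTable) {
        int[] newTable = new int[2 * oldTable.length + 1];
        Arrays.fill(newTable, EMPTY);
        for (int key : oldTable) {
            if (key == EMPTY) continue;
            int i = Math.floorMod(key, newTable.length); // new hash function
            while (newTable[i] != EMPTY)                 // linear probing
                i = (i + 1) % newTable.length;
            newTable[i] = key;
        }
        return newTable;
    }
}
```

Every key must be re-inserted because its position depends on the table size, which is why rehashing costs O(N).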
Data Compression
 Text files are usually stored by representing each character with an 8-bit ASCII code.
 The ASCII encoding is an example of fixed-length encoding.
 In order to reduce the space required to store a text file, we can exploit the fact that some characters are more likely to occur than others.
 Variable-length encoding uses binary codes of different lengths for different characters. Thus we can assign fewer bits to frequently used characters and more bits to rarely used characters.
File compression - example
An encoding example
Text: “java”
Encoding: a=‘0’, j=‘11’, v=‘10’
Encoded text: “110100” (java)
How to decode? (the problem of ambiguity)
Encoding: a=‘0’, j=‘01’, v=‘00’
Encoded text: 010000
Could be “java”, “jvv”, “jaaaa”
Encoding
 To prevent ambiguity in decoding, we require that the encoding satisfies the prefix rule: no code is a prefix of another.

 a=‘0’, j=‘11’, v=‘10’  satisfies the prefix rule

 a=‘0’, j=‘01’, v=‘00’  does not satisfy the prefix rule (‘0’ is a prefix of both ‘01’ and ‘00’)
Encoding
 We use an encoding trie to satisfy the prefix rule.
• The characters are stored at the external nodes.
• A left child (edge) means 0
• A right child (edge) means 1
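With a prefix-free code, decoding is unambiguous: scan the bits and emit a character as soon as the accumulated bits match a code word. A sketch using the “java” example codes above (a map stands in for walking the trie):

```java
import java.util.Map;

public class PrefixDecode {
    static String decode(String bits, Map<String, Character> code) {
        StringBuilder out = new StringBuilder();
        StringBuilder cur = new StringBuilder(); // bits read since last match
        for (char b : bits.toCharArray()) {
            cur.append(b);
            Character c = code.get(cur.toString());
            if (c != null) {          // prefix rule: first match is the only match
                out.append(c);
                cur.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> code = Map.of("0", 'a', "11", 'j', "10", 'v');
        System.out.println(decode("110100", code)); // prints "java"
    }
}
```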
Huffman Coding
 Five letters A, B, C, D, and E with probabilities .39, .21, .19, .12, and .09.
 Which data structure is preferable for implementing Huffman coding?
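A priority queue (min-heap) is the usual answer: repeatedly remove the two lowest-frequency trees and merge them until one tree remains. A sketch using the five letters above, with the probabilities scaled to integer frequencies:

```java
import java.util.Map;
import java.util.PriorityQueue;

public class Huffman {
    static class Node implements Comparable<Node> {
        int freq; char ch; Node left, right;
        Node(char ch, int freq) { this.ch = ch; this.freq = freq; }
        Node(Node l, Node r) { left = l; right = r; freq = l.freq + r.freq; }
        boolean isLeaf() { return left == null; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    // Build the Huffman tree: merge the two smallest trees until one remains
    static Node build(Map<Character, Integer> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        freqs.forEach((c, f) -> pq.add(new Node(c, f)));
        while (pq.size() > 1)
            pq.add(new Node(pq.poll(), pq.poll()));
        return pq.poll();
    }

    // Read codes off the tree: left edge = 0, right edge = 1
    static void codes(Node n, String prefix, Map<Character, String> out) {
        if (n.isLeaf()) { out.put(n.ch, prefix); return; }
        codes(n.left, prefix + "0", out);
        codes(n.right, prefix + "1", out);
    }
}
```

For these frequencies the frequent letters A, B, C get 2-bit codes and the rare D and E get 3-bit codes, however the priority queue breaks ties.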
See you next class!
