
Data Structure using 'C'

Hashing

Department of CSE & IT


C.V. Raman College of Engineering
Bhubaneswar



 Introduction and Definition
 Hash Tables as a Data Structure
 Choosing a Hash Function
     Truncation Method
     Mid Square Method
     Folding Method
     Modular Method
 Collision Resolution (Open Hashing)
     Separate Chaining
 Closed Hashing (Open Addressing)
     Linear Probing
     Quadratic Probing
     Double Hashing
 Storage Management
     Garbage Collection
     Dynamic Memory Management
     Buddy System
         Binary Buddy System
         Fibonacci Buddy System
 Application

Introduction to Hashing

 Hashing is a technique for performing almost constant time search for the
insert, delete and find operations.

 The essence of hashing is to provide the next level of searching when
compared with linear or binary search.

 It is the process of mapping a large amount of data items to a smaller table
with the help of a hashing function.

 The advantage of this searching method is its efficiency in handling vast
amounts of data items in a given collection.

 The result of this hashing process is a hash data structure that can store or
retrieve data items in an average time that does not depend on the collection
size.



Continued…

 A table is one of the important data structures used for information
retrieval.
 Let there be n distinct records with keys K1, K2, ..., Kn stored in a file,
and suppose we want to retrieve the record with a given key value K.
 One way is to start from the first record, comparing K with its key value;
if found, stop, otherwise proceed to the next record. The searching time is
directly proportional to the number of records in the file.
 Another way is to use an access table, where the searching time is
significantly reduced.



Continued…

 Let us assume a function f; applying this function to K returns an index i,
that is, f(K) = i. The ith entry of the access table gives the location of the
record with key value K.
 One type of such access table is known as a hash table.
 The hash table is an array which contains key values with pointers to the
corresponding records.
 The basic idea here is to place a key inside the hash table, where the
location/index of that key is calculated from the given key value itself.
 The one-to-one correspondence between a key value and an index in the hash
table is known as hashing.



Example: The Search Problem

 Find items with keys matching a given search key.

 Given an array A containing n keys, and a search key x, find the index i
such that x = A[i].
 As in the case of sorting, a key could be part of a larger record.



Applications
 Keeping track of customer account information at a bank
     Search through records to check balances and perform transactions
 Keeping track of reservations on flights
     Search to find empty seats, cancel/modify reservations
 Search engines
     Look for all documents containing a given word



Hash Table
 There are some types of tables which help to retrieve information in an
efficient manner; the hash table is one of them.

 This table is an array of constant size, which depends on the application.

 The hash table contains key values with pointers to the corresponding
records.

 The basic idea is to place a key value into a location in the hash table,
where this location is calculated from the key value itself.

 The one-to-one correspondence between the key value and an index in the
hash table is known as hashing.



Continued…
 The ideal hash table structure is merely an array of some fixed size, containing
the items.
 A stored item needs to have a data member, called the key, that will be used in
computing the index value for the item.
 The key could be an integer, a string, etc.
 e.g. a name or ID that is part of a larger employee structure
 The size of the array is TableSize.
 The items that are stored in the hash table are indexed by values from 0 to
TableSize – 1.
 Each key is mapped into some number in the range 0 to TableSize – 1.
 The mapping is called a hash function.

A hash table, or a hash map, is a data structure that associates keys (names)
with values (attributes). It is also known as a:

• Look-Up Table
• Dictionary
• Cache
• Extended Array



Continued…
Example: Hash Table (figure)

A hash function h maps keys drawn from the universe U of keys into the slots
0 .. m-1 of the table. The actual keys k1 .. k5 are mapped to positions
h(k1), h(k2), h(k3), h(k4); here h(k2) = h(k5), i.e. two different keys
collide in the same slot.



Choosing a Hash Function
 A hash function maps keys to small integers (buckets). An ideal hash function
maps the keys to the integers in a random-like manner, so that bucket values are
evenly distributed even if there are regularities in the input data.

 This process can be divided into two steps:


 Map the key to an integer.
 Map the integer to a bucket.
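
The sketch below illustrates these two steps in C for a string key; the table
size, the key type and the multiplier 31 are illustrative assumptions rather
than anything prescribed by the slides.

#include <stddef.h>

#define TABLE_SIZE 101                 /* illustrative table size (a prime) */

/* Step 1: map the (string) key to an integer. */
static unsigned long key_to_int(const char *key)
{
    unsigned long h = 0;
    while (*key != '\0')
        h = 31UL * h + (unsigned char)*key++;   /* simple polynomial hash */
    return h;
}

/* Step 2: map the integer to a bucket in 0 .. TABLE_SIZE-1. */
static size_t hash(const char *key)
{
    return key_to_int(key) % TABLE_SIZE;
}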



Example
(Figure) A hash function maps each item's key into a 10-slot hash table
(slots 0-9): (john, 25000) goes to slot 3, (phil, 31250) to slot 4,
(dave, 27500) to slot 6 and (mary, 28200) to slot 7.



 The hash function:
 must be simple to compute.
 must distribute the keys evenly among the cells.

Different approaches to generate a hash function

1. Truncation Method
2. Mid Square Method
3. Folding Method
4. Modular Method



1. Truncation Method
Ignore a part of the key and use the remaining part directly as the index.

Ex 1: If a hash table contains at most 999 entries (i.e. 999 different key
indexes may be kept), then a hash function may be defined such that, from an
eight-digit integer such as 12345678, the first, second and fifth digits from
the right are used to form the key index, i.e. 478, which is the position in
the hash lookup table where this element will be inserted. Any other
combination of digits may be used.

Ex 2: If students have a 9-digit identification number, take the last 3 digits
as the table position, e.g. 925371622 becomes 622.
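
A small C sketch of Ex 1; the function name and the fixed digit positions are
just illustrative choices.

/* Truncation: form the index from the 5th, 2nd and 1st digits
 * (counting from the right) of an 8-digit key.
 * Example: 12345678 -> digits 4, 7, 8 -> index 478. */
static unsigned truncation_hash(unsigned long key)
{
    unsigned d1 = key % 10;            /* 1st digit from the right */
    unsigned d2 = (key / 10) % 10;     /* 2nd digit from the right */
    unsigned d5 = (key / 10000) % 10;  /* 5th digit from the right */
    return d5 * 100 + d2 * 10 + d1;    /* 12345678 -> 478 */
}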



2. Mid Square Method
 In the mid-square method, the key element is multiplied by itself to yield
the square of the given key.
 If the hash table contains at most 999 entries, then three digits from the
result of the square may be chosen as the hash key for mapping the key element
into the lookup table.
 It generates random-looking sequences of hash keys, which are generally key
dependent.
 The mid-square method is a flexible method of selecting the hash key value.

Example: If the input is the number 4567, squaring yields an 8-digit number,
20857489. The middle two digits of this result are 57. All digits of the
original key value (equivalently, all bits when the number is viewed in
binary) contribute to the middle two digits of the squared value. Thus, the
result is not dominated by the distribution of the bottom digit or the top
digit of the original key value.
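
A possible C sketch of the mid-square step for a 4-digit key; the way the two
middle digits are extracted is one illustrative choice.

/* Mid-square: square the key and keep the middle digits of the result.
 * For a 4-digit key the square has up to 8 digits; the two middle
 * digits are the 4th and 5th from the right.
 * Example: 4567 * 4567 = 20857489 -> middle two digits 57. */
static unsigned midsquare_hash(unsigned long key)
{
    unsigned long sq = key * key;
    return (unsigned)((sq / 1000) % 100);   /* extract digits 4 and 5 */
}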



3. Folding Method
 Partition the key into several parts and combine the parts in a convenient
way (often addition or multiplication) to obtain the index.
 Ex 1: An eight-digit integer can be divided into groups of three, three and
two digits (or any other combination), the groups added together, and the
result truncated if necessary to lie in the proper range of indices.
Hence 12345678 maps to 123 + 456 + 78 = 657; since any digit can affect the
index, this offers a wide distribution of key values.
 Ex 2: Split a 9-digit number into three 3-digit numbers and add them,
e.g. 925371622 becomes 925 + 371 + 622 = 1918.
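
A C sketch of Ex 1, splitting an 8-digit key into 3 + 3 + 2 digits; the
grouping and the final truncation by table size follow the example above.

/* Folding: split the key into digit groups and add the groups.
 * Example: 12345678 -> 123 + 456 + 78 = 657. */
static unsigned folding_hash(unsigned long key, unsigned table_size)
{
    unsigned part1 = (unsigned)(key / 100000);        /* first three digits: 123 */
    unsigned part2 = (unsigned)((key / 100) % 1000);  /* next three digits:  456 */
    unsigned part3 = (unsigned)(key % 100);           /* last two digits:     78 */
    return (part1 + part2 + part3) % table_size;      /* truncate to range */
}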

4. Modular Method
 To map a given key element into the hash table, the mod operation is applied
to the individual key; the remainder denotes the address position of that
element.
 The key is divided by an integer, usually taken to be the size of the hash
table, and the remainder is used as the hash key to place that element in the
lookup table.



5. The Division Method

 Idea:
     Map a key k into one of the m slots by taking the remainder of k divided
by m:
         h(k) = k mod m
 Advantage:
     fast, requires only one operation
 Disadvantage:
     Certain values of m are bad, e.g.,
         powers of 2
         non-prime numbers
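
A direct C rendering of the division method; m = 97 is just the illustrative
prime used in the following example.

#define M 97                      /* a prime, not close to a power of 2 */

/* Division method: h(k) = k mod m */
static unsigned division_hash(unsigned long k)
{
    return (unsigned)(k % M);
}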



Example - The Division Method
 If m = 2^p, then h(k) is just the least significant p bits of k:
     p = 1, m = 2:  h(k) ∈ {0, 1}, the least significant 1 bit of k
     p = 2, m = 4:  h(k) ∈ {0, 1, 2, 3}, the least significant 2 bits of k
 Therefore, choose m to be a prime, not close to a power of 2.
 (The slide's table compares, for each key k, the value k mod 97 (a prime)
with k mod 100.)



Collision

 Two or more keys hash to the same slot!!

 For a given set K of keys


 If |K| ≤ m, collisions may or may not happen, depending on the
hash function
 If |K| > m, collisions will definitely happen (i.e., there must be at
least two keys that have the same hash value)
 Avoiding collisions completely is hard, even with a good hash
function



Collision Resolution(Open Hashing)

 If, when an element is inserted, it hashes to the same value as an already
inserted element, then we have a collision and need to resolve it.
 There are several methods for dealing with this:
     Separate chaining
     Open addressing
         Linear Probing
         Quadratic Probing
         Double Hashing



Separate Chaining

 The idea is to keep a list of all elements that hash to the same
value.
 The array elements are pointers to the first nodes of the lists.
 A new item is inserted to the front of the list.

 Advantages:
 Better space utilization for large items.
 Simple collision handling: searching linked list.
 Overflow: we can store more items than the hash table size.
 Deletion is quick and easy: deletion from the linked list.

Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
The resulting chains (figure):
    slot 0: 0
    slot 1: 81 -> 1
    slot 4: 64 -> 4
    slot 5: 25
    slot 6: 36 -> 16
    slot 9: 49 -> 9
    (all other slots are empty)

Operations
 Initialization: all entries are set to NULL
 Find:
 locate the cell using hash function.
 sequential search on the linked list in that cell.
 Insertion:
 Locate the cell using hash function.
 (If the item does not exist) insert it as the first item in the list.
 Deletion:
 Locate the cell using hash function.
 Delete the item from the linked list.
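
A minimal C sketch of these operations for integer keys, using
hash(key) = key % 10 as in the earlier example; the struct and function names
are illustrative.

#include <stdlib.h>

#define TABLE_SIZE 10

struct node {
    int key;
    struct node *next;
};

static struct node *table[TABLE_SIZE];    /* initialization: all heads NULL */

static int hash(int key) { return key % TABLE_SIZE; }

/* Find: locate the cell, then search the chain sequentially. */
static struct node *find(int key)
{
    struct node *p = table[hash(key)];
    while (p != NULL && p->key != key)
        p = p->next;
    return p;                             /* NULL if the key is not present */
}

/* Insertion: if the key does not exist, insert it at the front of the list. */
static void insert(int key)
{
    if (find(key) != NULL)
        return;                           /* already present */
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return;                           /* out of memory */
    n->key = key;
    n->next = table[hash(key)];
    table[hash(key)] = n;
}

/* Deletion: locate the cell and unlink the node from the linked list. */
static void erase(int key)
{
    struct node **pp = &table[hash(key)];
    while (*pp != NULL && (*pp)->key != key)
        pp = &(*pp)->next;
    if (*pp != NULL) {
        struct node *dead = *pp;
        *pp = dead->next;
        free(dead);
    }
}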

Handling Collisions Using Separate Chaining
 Idea:
 Put all elements that hash to the same slot into a linked list

 Slot j contains a pointer to the head of the list of all elements that
hash to j
Closed Hashing(Open Addressing)

 If we have enough contiguous memory to store all the keys (m > N), we can
store the keys in the table itself.
 No need to use linked lists anymore.

 Basic idea:
     Insertion: if a slot is full, try another one, until you find an empty one.
     Search: follow the same sequence of probes.
     Deletion: more difficult ... (we'll see why)

 Search time depends on the length of the probe sequence!



Generalized hash function notation:

 A hash function now takes two arguments:
     (i) the key value, and (ii) the probe number
     h(k,p), p = 0, 1, ..., m-1

 Probe sequence:
     <h(k,0), h(k,1), ..., h(k,m-1)>
     Must be a permutation of <0, 1, ..., m-1>

 There are m! possible permutations
 Good hash functions should be able to produce all m! probe sequences

 Example probe sequence: <1, 5, 9, ...>
Common Open Addressing Methods

 Linear probing
 Quadratic probing
 Double hashing

 Note: None of these methods can generate more than m² different probing
sequences!



Linear probing: Inserting a key

 Idea: when there is a collision, check the next available position in the
table (i.e., probing).

h(k,i) = (h1(k) + i) mod m,   i = 0, 1, 2, ...

 First slot probed: h1(k)
 Second slot probed: h1(k) + 1
 Third slot probed: h1(k) + 2, and so on (all taken mod m, so the probe
sequence wraps around to the beginning of the table)

Probe sequence: < h1(k), h1(k)+1, h1(k)+2, ... >

 Can generate at most m probe sequences. Why? Because the entire sequence is
determined by the first position h1(k), and there are only m possible first
positions.
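
A sketch of linear-probing insertion in C; the table size, the EMPTY sentinel,
and the assumption that all slots are pre-initialised to EMPTY are
illustrative.

#define M      13                 /* illustrative table size */
#define EMPTY  (-1)               /* sentinel for an unused slot */

static int slots[M];              /* assume every slot starts out as EMPTY */

static int h1(int k) { return k % M; }

/* Probe h1(k), h1(k)+1, h1(k)+2, ... (mod M, wrapping around) until an
 * empty slot is found.  Returns the slot used, or -1 if the table is full. */
static int insert_linear(int k)
{
    for (int i = 0; i < M; i++) {
        int pos = (h1(k) + i) % M;
        if (slots[pos] == EMPTY) {
            slots[pos] = k;
            return pos;
        }
    }
    return -1;                    /* all M slots are occupied */
}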
Linear probing: Searching for a key

 Three cases when probing a position:
     (1) The position in the table is occupied by an element with an equal
key: the search succeeds.
     (2) The position in the table is empty: the key is not in the table.
     (3) The position in the table is occupied by a different element: probe
the next higher index until the element is found or an empty position is
found.
 The probing process wraps around to the beginning of the table.

(Figure: a table with slots 0 .. m-1 marking h(k1), h(k4), h(k2) = h(k5)
and h(k3).)



Linear probing: Deleting a key

 Problems:
     We cannot simply mark a deleted slot as empty.
     Doing so would make it impossible to retrieve keys inserted after that
slot was occupied.
 Solution:
     Mark the slot with a sentinel value DELETED.
     The deleted slot can later be used for insertion.
     Searching will still be able to find all the keys.



Quadratic Probing

 Idea: when there is a collision, the i-th probe is offset by i² from the
home position:

h(k,i) = (h1(k) + i²) mod m,   i = 0, 1, 2, ...
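
A one-function C sketch of the quadratic probe sequence (purely illustrative):

/* i-th probe position for key k in a table of size m:
 * h(k,i) = (h1(k) + i*i) mod m */
static int probe_quadratic(int k, int i, int m)
{
    int home = k % m;             /* h1(k) */
    return (home + i * i) % m;    /* offset grows quadratically: 0, 1, 4, 9, ... */
}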



Double Hashing
(1) Use one hash function to determine the first slot.
(2) Use a second hash function to determine the increment for the probe
sequence:

h(k,i) = (h1(k) + i·h2(k)) mod m,   i = 0, 1, ...

 Initial probe: h1(k)
 Second probe is offset by h2(k) (mod m), and so on ...
 Advantage: avoids clustering
 Disadvantage: harder to delete an element
 Can generate at most m² probe sequences



Double Hashing: Example

h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i·h2(k)) mod 13

Table contents before the insertion (figure): slot 1: 79, slot 4: 69,
slot 5: 98, slot 7: 72, slot 11: 50; all other slots empty.

 Insert key 14 (note h2(14) = 1 + (14 mod 11) = 4):
h(14,0) = h1(14) = 14 mod 13 = 1                        -> slot 1 occupied (79)
h(14,1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5 -> slot 5 occupied (98)
h(14,2) = (h1(14) + 2·h2(14)) mod 13 = (1 + 8) mod 13 = 9 -> slot 9 empty,
so key 14 is placed in slot 9


Example: Linear (Closed) Hashing
 Table size D = 8; keys a, b, c, d have hash values h(a)=3, h(b)=0, h(c)=4,
h(d)=3.
 Keys a, b, c go to their home slots: slot 0: b, slot 3: a, slot 4: c.
 Where do we insert d using linear probing? Slot h(d) = 3 is already filled.
     First probe: (h(d)+1) % 8 = 4 % 8 = 4, which is already filled (by c).
     Second probe: (h(d)+2) % 8 = 5 % 8 = 5, which is empty, so d is inserted
in slot 5.
 Resulting table: 0: b, 3: a, 4: c, 5: d (slots 1, 2, 6 and 7 remain empty).
 A probe sequence that runs past the end of the table (e.g. 7, 0, 1, 2)
wraps around to the beginning of the table.
Analysis of Open Addressing

 Let α = n/m be the load factor (n keys stored in a table of m slots), α < 1.
 Assuming uniform hashing, the expected number of probes in an unsuccessful
search is at most

     Σ_{k=0}^{∞} α^k = 1 / (1 − α)



Applications of Hashing
 Compilers use hash tables to keep track of declared variables.

 A hash table can be used for on-line spelling checkers — if misspelling
detection (rather than correction) is important, an entire dictionary can be
hashed and words checked in constant time.

 Game-playing programs use hash tables to store seen positions, thereby
saving computation time if the position is encountered again.

 Hash functions can be used to quickly check for inequality — if two elements
hash to different values they must be different.

 Storing sparse data.



Storage Management
An executing program uses memory (storage) for many different purposes, such
as for the machine instructions that represent the executable part of the
program, the values of data objects, and the return location for a function
invocation.

Static Allocation:
 allocation during translation that remains fixed throughout execution.
 does not allow recursive subprograms.
Dynamic Allocation (Heap Storage Management):
 Memory used for dynamic allocation of data objects in a somewhat
unstructured manner is called heap storage.



Memory Management

 Memory Areas and their use

 Memory Manager Tasks:


 acquire
 release

 Free List Implementations


 Singly Linked List
 Doubly Linked List
 Buddy Systems



Continued…
 Memory areas: In languages like C or Java, the memory used
by a program can be allocated from three different areas:
 Static: laid out at compilation time, and allocated when the
program starts.
 Used for Global variables and constants
 Stack: memory is allocated and freed dynamically, in LIFO
order.
 Used for Local variables and parameters
 Heap: memory is allocated and freed dynamically, in any
order.
 Used for data outliving the method which created them. In
Java all objects are stored in the heap



Memory Manager

 The memory manager is part of the Operating System.


 It must keep track of which parts of the heap are free, and which are
allocated.

 A memory manager supports the following operations:


 acquire: allocates memory needed by programs
 release: deallocates memory no longer needed by programs

 It also defragments memory when needed



Problems faced in memory allocation
 Memory fragmentation:
 External fragmentation: Memory wasted outside allocated
blocks
 Internal fragmentation: Memory wasted inside allocated
block. Results when memory allocated is larger than memory
requested.

 Overhead: Additional memory that must be allocated, above and


beyond that requested by programs, in order to provide for the
management of the heap.



Free List
 Memory manager uses a free list data structure that keeps track of free
memory blocks in a scheme for dynamic memory allocation.
 Common implementations for free list:
 Singly-linked list
 Doubly-linked list
 Buddy systems: an array of doubly-linked lists

 Allocation Policies:
 First fit chooses the first block in the free list big enough to satisfy the
request, and splits it.
 Next fit is like first fit, except that the search for a fitting block will start
where the last one stopped, instead of at the beginning of the free list.
 Best fit chooses the smallest block bigger than the requested one.
 Worst fit chooses the biggest, with the aim of avoiding the creation of too
many small fragments – but doesn’t work well in practice.



Singly-linked list implementation of free-list

 Each node represents a free block of memory

 Nodes must be sorted according to start addresses of free blocks so


that adjacent free memory blocks can be combined.

 acquire( ) and release( ) operations are O(n); where n is the number of


blocks in the heap.

 In order to acquire a block, a node is searched following one of the
allocation policies. If the block is bigger than requested, it is divided into
two: one part is allocated and one remains in the list.

 In order to release a block,


 a new node must be inserted (if the adjacent block is not on the free
list)
 or a node, which contains the adjacent free block, must be modified.
 Searching for the place of the new or existing node has complexity
O(n).
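
Below is a rough C sketch of a first-fit acquire() over such a singly linked
free list; the node layout, the split rule, and the omission of a size tag on
the returned piece are all simplifying assumptions, not the slides' exact
scheme.

#include <stddef.h>

/* One node per free block; the list is kept sorted by start address. */
struct free_block {
    size_t size;                   /* usable bytes in this block */
    struct free_block *next;
};

static struct free_block *free_list;   /* head of the free list */

/* First-fit acquire (O(n)): take the first block that is big enough.
 * If the block is larger than requested, it is divided into two:
 * the tail is handed out and the head stays on the free list. */
static void *acquire(size_t bytes)
{
    for (struct free_block **pp = &free_list; *pp != NULL; pp = &(*pp)->next) {
        struct free_block *blk = *pp;
        if (blk->size < bytes)
            continue;
        if (blk->size >= bytes + sizeof(struct free_block)) {
            /* Split: shrink this free block and return its tail.
             * (A real allocator would also tag the returned piece with
             * its size so that release() can later find it.) */
            blk->size -= bytes;
            return (char *)(blk + 1) + blk->size;
        }
        /* Close fit: unlink the whole block and return its data area. */
        *pp = blk->next;
        return (void *)(blk + 1);
    }
    return NULL;                   /* no block large enough */
}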
Doubly-linked list implementation of free-list
 In this implementation
 Nodes are not sorted according to start addresses of free blocks.
 All memory blocks have boundary tags between them. The tag has
information about the size and status (allocated/free)
 Each node in the doubly linked list represents a free block. It keeps
size & start address of the free block and start addresses & sizes of
the previous and next memory blocks. The adjacent blocks may be
or may not be free
 The release operation does not combine adjacent free blocks. It simply
prepends a node corresponding to the released block to the front of the
free list. This operation is thus O(1). Adjacent free blocks are combined
by acquire().
 The acquire operation traverses the free list in order to find a free area
of a suitable size. As it does so it also combines adjacent free blocks.



Doubly Linked List Example
 Node structure: (figure)

 Initial state of memory (shaded = allocated, grayed = boundary tags): (figure)

 The corresponding free list: (figure)



Doubly Linked List Example (Cont.)
 The operation release(400, 4000) will result in:

 The node corresponding to the freed block is inserted at the front of the
free-list. The nodes x, y, and z correspond to the three free blocks that
have not yet been combined.



Doubly Linked List Example (Cont.)
 The operation acquire(600) using the first-fit allocation policy
will first result in the combination of the three adjacent free
blocks:

 At this point the corresponding free list is:



Doubly Linked List Example (Cont.)
 The required 600 bytes are then allocated, resulting in:

 The corresponding free list is:



Buddy Systems implementation of free-list
 Instead of having a single free list, it has an array of free lists, each
element of the array holding blocks of the same size. One type of buddy
system is the binary buddy system.
 For a memory of size m, there are free lists for blocks of size 2^0, 2^1,
2^2, ..., 2^k, where 2^k ≤ m.
 The heap is viewed as one large block which can be split into two equal
smaller blocks, called buddies. Each of these smaller blocks can again be
split into two equal smaller buddies, and so on. Each memory block has its
“buddy”. The “buddy” of a block of size 2^k that starts at address x is the
block of size 2^k that starts at address y = complementBit_k(x), where the
address bits are numbered from right to left starting with 0.



Buddies
 If each block is of size 8 bytes (i.e., 2^3 bytes), then the buddy of a
block is obtained by complementing bit 3 of its starting address. If each
block is of size 4 bytes (i.e., 2^2 bytes), then the buddy of a block is
obtained by complementing bit 2 of its starting address.

 Example: What is the starting address of the buddy of a block that starts
at address 1100101010101101 if each block is 16 bytes?
 Solution: 16 = 2^4; the starting address of the buddy is obtained by
complementing bit 4:
1100101010111101
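
Complementing bit k is just an XOR with the block size, so the buddy address
can be computed with a one-line C function (names are illustrative):

/* Buddy of a block of size 2^k starting at address addr:
 * complement bit k of the address, i.e. XOR with the block size. */
static unsigned long buddy_of(unsigned long addr, unsigned k)
{
    return addr ^ (1UL << k);
}

For the example above, buddy_of(0xCAAD, 4) flips bit 4 of 1100101010101101
(0xCAAD) and returns 1100101010111101 (0xCABD).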



Binary Buddy System implementation of free-list
 Each array element is a list of free blocks of the same size. The size of
each block is a power of 2.



Binary Buddy System Algorithms
 acquire(x): with x ≤ 2^k, the free list for blocks of size 2^k is searched.
     If there is a block in this list, it is allocated;
     otherwise a block of size 2^(k+1), 2^(k+2), and so on is searched for and
taken off its free list. The block is divided into two buddies. One buddy is
put on the free list for the next lower size and the other is either allocated
or further split if needed.
 release(x): The block is placed back in the free list of its size, and
     if its buddy is also free, they are combined to form a free block of size
2^(k+1). This block is then moved to the corresponding free list.
     If the buddy of this combined block is also free, they are combined to
form a free block of size 2^(k+2), which is then moved to the appropriate
free list, and so on.
Buddy Systems Advantages/Disadvantages

• Advantage:
– Both acquire( ) and release( ) operations are fast.

• Disadvantages:
– Only blocks whose size is a power of 2 can be allocated, which leads to
internal fragmentation if the memory request is not a power of 2.
– When a block is released, its buddy may not be free, resulting in
external fragmentation.



QUIZ Test
1. Technique for direct search is
(A) Binary Search (B) Linear Search (C) Tree Search (D) Hashing
2. If h is any hashing function and is used to hash n keys into a table of size m,
where n <= m, the expected number of collisions involving a particular key x is:
(A) less than 1. (B) less than n. (C) less than m. (D) less than n/2.
3. The OS of a computer may periodically collect all the free memory space to
form contiguous block of free space. This is called
(A) Concatenation (B) Garbage collection (C) Collision (D) Dynamic Memory Allocation
4. What is the best definition of a collision in a hash table?
A. Two entries are identical except for their keys.
B. Two entries with different data have the exact same key.
C. Two entries with different keys have the same exact hash value.
D. Two entries with the exact same key have different hash values.
5. In an open-address hash table there is a difference between those spots which
have never been used and those spots which have previously been used but no
longer contain an item. Which function has a better implementation because of
this difference?
A. insert   B. is present   C. remove   D. size   E. Two or more of the above
functions



Thank You
