
UNIT III DISJOINT SET ADT, Hashing

THE DISJOINT SET ADT

Introduction

The disjoint set ADT is an efficient data structure for solving the equivalence problem. Each routine requires only a few lines of code, and a simple array can be used. The implementation is also extremely fast, requiring constant average time per operation.

1. Equivalence Relations

A relation R is defined on a set S if for every pair of elements (a, b), a, b ∈ S, a R b is either true or false. If a R b is true, then we say that a is related to b. An equivalence relation is a relation R that satisfies three properties:

1. (Reflexive) a R a, for all a ∈ S.
2. (Symmetric) a R b if and only if b R a.
3. (Transitive) a R b and b R c implies that a R c.

Examples.

1. The ≤ relationship is not an equivalence relationship. Although it is reflexive, since a ≤ a, and transitive, since a ≤ b and b ≤ c implies a ≤ c, it is not symmetric, since a ≤ b does not imply b ≤ a.

Data structures and Algorithms

R.Punithavathi Department of IT

2. Electrical connectivity, where all connections are by metal wires, is an equivalence relation. The relation is clearly reflexive, as any component is connected to itself. If a is electrically connected to b, then b must be electrically connected to a, so the relation is symmetric. Finally, if a is connected to b and b is connected to c, then a is connected to c. Thus electrical connectivity is an equivalence relation.

3. Two cities are related if they are in the same country. It is easily verified that this is an equivalence relation. Suppose instead that town a is related to b if it is possible to travel from a to b by taking roads. This relation is an equivalence relation if all the roads are two-way.

2. The Dynamic Equivalence Problem

The input is initially a collection of n sets, each with one element. In this initial state, all relations (except reflexive relations) are false. Each set has a different element, so that Si ∩ Sj = ∅; this makes the sets disjoint.

There are two permissible operations. The first is find, which returns the name of the set (that is, the equivalence class) containing a given element. The second operation adds relations. If we want to add the relation a ~ b, then we first see if a and b are already related. This is done by performing finds on both a and b and checking whether they are in the same equivalence class. If they are not, then we apply union. This operation merges the two equivalence classes containing a and b into a new equivalence class. From a set point of view, the result of ∪ is to create a new set Sk = Si ∪ Sj, destroying the originals and preserving the disjointness of all the sets. The algorithm to do this is frequently known as the disjoint set union/find algorithm for this reason.


This algorithm is dynamic because, during the course of the algorithm, the sets can change via the union operation. The algorithm must also operate on-line: when a find is performed, it must give an answer before continuing. Another possibility would be an off-line algorithm. Such an algorithm would be allowed to see the entire sequence of unions and finds. The answer it provides for each find must still be consistent with all the unions that were performed up until the find, but the algorithm can give all its answers after it has seen all the questions. The difference is similar to taking a written exam (which is generally off-line, since you only have to give the answers before time expires) and an oral exam (which is on-line, because you must answer the current question before proceeding to the next question).

Notice that we do not perform any operations comparing the relative values of elements, but merely require knowledge of their location. For this reason, we can assume that all the elements have been numbered sequentially from 1 to n and that the numbering can be determined easily by some hashing scheme. Thus, initially we have Si = {i} for i = 1 through n.

Our second observation is that the name of the set returned by find is actually fairly arbitrary. All that really matters is that find(x) = find(y) if and only if x and y are in the same set.

These operations are important in many graph theory problems and also in compilers which process equivalence (or type) declarations. We will see an application later.

There are two strategies to solve this problem. One ensures that the find instruction can be executed in constant worst-case time, and the other ensures that the union instruction can be executed in constant worst-case time. It has recently been shown that both cannot be done simultaneously in constant worst-case time.


We will now briefly discuss the first approach. For the find operation to be fast, we could maintain, in an array, the name of the equivalence class for each element. Then find is just a simple O(1) lookup. Suppose we want to perform union(a, b). Suppose that a is in equivalence class i and b is in equivalence class j. Then we scan down the array, changing every i to j. Unfortunately, this scan takes Θ(n). Thus, a sequence of n - 1 unions (the maximum, since then everything is in one set) would take Θ(n^2) time. If there are Ω(n^2) find operations, this performance is fine, since the total running time would then amount to O(1) for each union or find operation over the course of the algorithm. If there are fewer finds, this bound is not acceptable.

One idea is to keep all the elements that are in the same equivalence class in a linked list. This saves time when updating, because we do not have to search through the entire array. This by itself does not reduce the asymptotic running time, because it is still possible to perform Θ(n^2) equivalence class updates over the course of the algorithm.

If we also keep track of the size of each equivalence class, and when performing unions we change the name of the smaller equivalence class to the larger, then the total time spent for n - 1 merges is O(n log n). The reason for this is that each element can have its equivalence class changed at most log n times, since every time its class is changed, its new equivalence class is at least twice as large as its old. Using this strategy, any sequence of m finds and up to n - 1 unions takes at most O(m + n log n) time.

In the remainder of this chapter, we will examine a solution to the union/find problem that makes unions easy but finds hard. Even so, the running time for any sequence of at most m finds and up to n - 1 unions will be only a little more than O(m + n).

3. Basic Data Structure


Recall that the problem does not require that a find operation return any specific name, just that finds on two elements return the same answer if and only if they are in the same set. One idea might be to use a tree to represent each set, since each element in a tree has the same root. Thus, the root can be used to name the set. We will represent each set by a tree. (Recall that a collection of trees is known as a forest.) Initially, each set contains one element. The trees we will use are not necessarily binary trees, but their representation is easy, because the only information we will need is a parent pointer. The name of a set is given by the node at the root. Since only the name of the parent is required, we can assume that this tree is stored implicitly in an array: each entry p[i] in the array represents the parent of element i. If i is a root, then p[i] = 0. In the forest in Figure 1, p[i] = 0 for 1 ≤ i ≤ 8. As with heaps, we will draw the trees explicitly, with the understanding that an array is being used. Figure 1 shows the explicit representation. We will draw the root's parent pointer vertically for convenience.

To perform a union of two sets, we merge the two trees by making the root of one tree point to the root of the other. It should be clear that this operation takes constant time. Figures 2, 3, and 4 represent the forest after each of union(5,6), union(7,8), and union(5,7), where we have adopted the convention that the new root after union(x,y) is x. The implicit representation of the last forest is shown in Figure 5.

A find(x) on element x is performed by returning the root of the tree containing x. The time to perform this operation is proportional to the depth of the node representing x, assuming, of course, that we can find the node representing x in constant time. Using the strategy above, it is possible to create a tree of depth n - 1, so the worst-case running time of a find is O(n). Typically, the running time is computed for a sequence of m intermixed instructions. In this case, m consecutive operations could take O(mn) time in the worst case.

The code in Figures 6 through 9 represents an implementation of the basic algorithm, assuming that error checks have already been performed. In our routine, unions are performed on the roots of the trees. Sometimes the operation is performed by passing any two elements, and having the union perform two finds to determine the roots.
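As a rough sketch of that second convention, the routine below accepts any two elements and performs the two finds itself. It assumes the array layout described above (entry 0 marks a root, elements numbered from 1). The names `NUM_ELEMS`, `disj_set`, `ds_init`, `ds_find`, and `ds_union_elements` are illustrative only and do not come from the figures.

```c
#include <assert.h>

#define NUM_ELEMS 8   /* assumed size, matching the eight-element example */

/* s[i] is the parent of element i; 0 marks a root. Index 0 is unused. */
typedef int disj_set[NUM_ELEMS + 1];

void ds_init(disj_set s)
{
    for (int i = 1; i <= NUM_ELEMS; i++)
        s[i] = 0;     /* every element starts as the root of its own tree */
}

/* Follow parent links until a root is reached. */
int ds_find(const disj_set s, int x)
{
    assert(x >= 1 && x <= NUM_ELEMS);
    while (s[x] > 0)
        x = s[x];
    return x;
}

/* Merge the sets containing a and b; the root of a's tree becomes the
   new root. Returns 1 if a merge happened, 0 if a and b were already
   in the same set. */
int ds_union_elements(disj_set s, int a, int b)
{
    int ra = ds_find(s, a);
    int rb = ds_find(s, b);
    if (ra == rb)
        return 0;
    s[rb] = ra;
    return 1;
}
```

The return value is a design choice of this sketch: it lets a caller tell whether a new relation was actually added, which is the check described for adding a relation a ~ b.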

Figure 1 Eight elements, initially in different sets

Figure 2 After union (5, 6)

Figure 3 After union (7, 8)


Figure 4 After union (5, 7)

Figure 5 Implicit representation of previous tree

    typedef int DISJ_SET[ NUM_SETS+1 ];
    typedef unsigned int set_type;
    typedef unsigned int element_type;

Figure 6 Disjoint set type declaration

    void
    initialize( DISJ_SET S )
    {
        int i;

        for( i = NUM_SETS; i > 0; i-- )
            S[i] = 0;
    }

Figure 7 Disjoint set initialization routine

    /* Assumes root1 and root2 are roots. */
    /* union is a C keyword, so this routine is named set_union. */
    void
    set_union( DISJ_SET S, set_type root1, set_type root2 )
    {
        S[root2] = root1;
    }

Figure 8 Union (not the best way)

    set_type
    find( element_type x, DISJ_SET S )
    {
        if( S[x] <= 0 )
            return x;
        else
            return( find( S[x], S ) );
    }

Figure 9 A simple disjoint set find algorithm

4. Smart Union Algorithms

The unions above were performed rather arbitrarily, by making the second tree a subtree of the first. A simple improvement is always to make the smaller tree a subtree of the larger, breaking ties by any method; we call this approach union-by-size. The three unions in the preceding example were all ties, and so we can consider that they were performed by size. If the next operation were union(4, 5), then the forest in Figure 10 would form. Had the size heuristic not been used, a deeper forest would have been formed (Figure 11).

Figure 10 Result of union-by-size


Figure 11 Result of an arbitrary union

Figure 12 Worst-case tree for n = 16

We can prove that if unions are done by size, the depth of any node is never more than log n. To see this, note that a node is initially at depth 0. When its depth increases as a result of a union, it is placed in a tree that is at least twice as large as before. Thus, its depth can be increased at most log n times. (We used this argument in the quick-find algorithm at the end of Section 2.) This implies that the running time for a find operation is O(log n), and a sequence of m operations takes O(m log n). The tree in Figure 12 shows the worst tree possible for 16 elements (that is, after 15 unions) and is obtained if all unions are between equal-sized trees.

To implement this strategy, we need to keep track of the size of each tree. Since we are really just using an array, we can have the array entry of each root contain the negative of the size of its tree. Thus, initially the array representation of the tree is all -1s (and Figure 7 needs to be changed accordingly). When a union is performed, check the sizes; the new size is the sum of the old. Thus, union-by-size is not at all difficult to implement and requires no extra space. It is also fast, on average. For virtually all reasonable models, it has been shown that a sequence of m operations requires O(m) average time if union-by-size is used. This is because when random unions are performed, generally very small (usually one-element) sets are merged with large sets throughout the algorithm.

An alternative implementation, which also guarantees that all the trees have depth at most O(log n), is union-by-height: keep track of the height of each tree instead of its size, and make the shallower tree a subtree of the deeper one. The following figures show a tree and its implicit representation for both union-by-size and union-by-height. The code in Figure 13 implements union-by-height.
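Union-by-size itself, as described above, might be sketched as follows. This is only an illustration under stated assumptions: each root stores the negative of its tree size, every entry starts at -1, and the names `NUM_ELEMS`, `disj_set`, `size_init`, `size_find`, and `union_by_size` are hypothetical rather than taken from the figures.

```c
#include <assert.h>

#define NUM_ELEMS 8   /* assumed size for illustration */

/* s[i] > 0: parent of i.  s[i] < 0: i is a root and -s[i] is its tree size. */
typedef int disj_set[NUM_ELEMS + 1];

void size_init(disj_set s)
{
    for (int i = 1; i <= NUM_ELEMS; i++)
        s[i] = -1;    /* each element is a root of a one-element tree */
}

int size_find(const disj_set s, int x)
{
    while (s[x] > 0)
        x = s[x];
    return x;
}

/* root1 and root2 must be distinct roots. */
void union_by_size(disj_set s, int root1, int root2)
{
    assert(root1 != root2 && s[root1] < 0 && s[root2] < 0);
    if (s[root2] < s[root1]) {   /* more negative: root2's tree is larger */
        s[root2] += s[root1];    /* new size is the sum of the old sizes */
        s[root1] = root2;
    } else {                     /* root1's tree is larger, or a tie */
        s[root1] += s[root2];
        s[root2] = root1;
    }
}
```

With this sketch, performing the unions (5,6), (7,8), (5,7) and then (4,5) makes element 4 a child of root 5, which is the shape the text attributes to the size heuristic.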

    /* assume root1 and root2 are roots */
    /* union is a C keyword, so this routine is named set_union */
    void
    set_union( DISJ_SET S, set_type root1, set_type root2 )
    {
        if( S[root2] < S[root1] )        /* root2 is deeper set */
            S[root1] = root2;            /* make root2 new root */
        else
        {
            if( S[root2] == S[root1] )   /* same height, so update */
                S[root1]--;
            S[root2] = root1;            /* make root1 new root */
        }
    }

Figure 13 Code for union-by-height (rank)

5. Path Compression

The union/find algorithm, as described so far, is quite acceptable for most cases. It is very simple and linear on average for a sequence of m instructions (under all models). However, the worst case of O(m log n) can occur fairly easily and naturally. For instance, if we put all the sets on a queue and repeatedly dequeue the first two sets and enqueue the union, the worst case occurs. If there are many more finds than unions, this running time is worse than that of the quick-find algorithm. Moreover, it should be clear that there are probably no more improvements possible for the union algorithm. This is based on the observation that any method to perform the unions will yield the same worst-case trees, since it must break ties arbitrarily. Therefore, the only way to speed the algorithm up, without reworking the data structure entirely, is to do something clever on the find operation.

The clever operation is known as path compression. Path compression is performed during a find operation and is independent of the strategy used to perform unions. Suppose the operation is find(x). Then the effect of path compression is that every node on the path from x to the root has its parent changed to the root. Figure 14 shows the effect of path compression after find(15) on the generic worst tree of Figure 12.


The effect of path compression is that with an extra two pointer moves, nodes 13 and 14 are now one position closer to the root and nodes 15 and 16 are now two positions closer. Thus, the faster future accesses on these nodes will pay (we hope) for the extra work to do the path compression.

As the code in Figure 15 shows, path compression is a trivial change to the basic find algorithm. The only change to the find routine is that S[x] is made equal to the value returned by find; thus after the root of the set is found recursively, x is made to point directly to it. This occurs recursively to every node on the path to the root, so this implements path compression. As we stated when we implemented stacks and queues, modifying a parameter passed to a function is not necessarily in line with current software engineering rules. Some languages will not allow this, so this code may well need changes.
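The same effect (every node on the path from x to the root has its parent changed to the root) can also be obtained without recursion. The following two-pass loop is a minimal sketch under the array conventions used so far, where an entry less than or equal to 0 marks a root. The name `find_compress` is illustrative and is not one of the routines in the figures.

```c
#include <assert.h>

/* Entries <= 0 mark roots (0, or a negative size or height);
   positive entries are parent indices. */
int find_compress(int s[], int x)
{
    assert(x >= 1);

    int root = x;
    while (s[root] > 0)      /* first pass: locate the root */
        root = s[root];

    while (s[x] > 0) {       /* second pass: repoint each node on the path */
        int next = s[x];
        s[x] = root;
        x = next;
    }
    return root;
}
```

An explicit loop avoids deep recursion on a long path, at the cost of walking the path twice.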

Figure 14 An example of path compression

    set_type
    find( element_type x, DISJ_SET S )
    {
        if( S[x] <= 0 )
            return x;
        else
            return( S[x] = find( S[x], S ) );
    }

Figure 15 Code for disjoint set find with path compression

When unions are done arbitrarily, path compression is a good idea, because there is an abundance of deep nodes and these are brought near the root by path compression. It has been proven that when path compression is done in this case, a sequence of m operations requires at most O(m log n) time. It is still an open problem to determine what the average-case behavior is in this situation.

Path compression is perfectly compatible with union-by-size, and thus both routines can be implemented at the same time. Since doing union-by-size by itself is expected to execute a sequence of m operations in linear time, it is not clear that the extra pass involved in path compression is worthwhile on average. Indeed, this problem is still open. However, as we shall see later, the combination of path compression and a smart union rule guarantees a very efficient algorithm in all cases.

Path compression is not entirely compatible with union-by-height, because path compression can change the heights of the trees. It is not at all clear how to recompute them efficiently. The answer is: do not! Then the heights stored for each tree become estimated heights (sometimes known as ranks), but it turns out that union-by-rank (which is what this has now become) is just as efficient in theory as union-by-size. Furthermore, heights are updated less often than sizes. As with union-by-size, it is not clear whether path compression is worthwhile on average. What we will show in the next section is that with either union heuristic, path compression significantly reduces the worst-case running time.

6. Worst Case for Union-by-Rank and Path Compression

When both heuristics are used, the algorithm is almost linear in the worst case. Specifically, the time required in the worst case is Θ(m α(m, n)) (provided m ≥ n), where α(m, n) is a functional inverse of Ackermann's function, which is defined below:*


    A(1, j) = 2^j                      for j ≥ 1
    A(i, 1) = A(i - 1, 2)              for i ≥ 2
    A(i, j) = A(i - 1, A(i, j - 1))    for i, j ≥ 2

* Ackermann's function is frequently defined with A(1, j) = j + 1 for j ≥ 1. The form in this text grows faster; thus, the inverse grows more slowly.

From this, we define

    α(m, n) = min{ i ≥ 1 | A(i, ⌊m/n⌋) > log n }

You may want to compute some values, but for all practical purposes, α(m, n) ≤ 4, which is all that is really important here. The single-variable inverse Ackermann function, sometimes written as log* n, is the number of times the logarithm of n needs to be applied until n ≤ 1. Thus, log* 65536 = 4, because log log log log 65536 = 1. log* 2^65536 = 5, but keep in mind that 2^65536 is a 20,000-digit number. α(m, n) actually grows even more slowly than log* n. However, α(m, n) is not a constant, so the running time is not linear.

A slightly weaker result is that any sequence of m = Ω(n) union/find operations takes a total of O(m log* n) running time. The same bound holds if union-by-rank is replaced with union-by-size. This analysis is probably the most complex in this text and one of the first truly complex worst-case analyses ever performed for an algorithm that is essentially trivial to implement.
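To get a feel for how slowly log* n grows, the definition above can be evaluated directly. The sketch below is one possible way to do it, not part of the original analysis. It assumes base-2 logarithms and avoids floating point by growing the tower 1, 2, 4, 16, 65536, ... until it reaches n. The name `log_star` is illustrative.

```c
#include <assert.h>
#include <limits.h>

/* Iterated logarithm, base 2: the number of times log2 must be applied
   to n before the result is <= 1. Equivalently, the smallest count such
   that a tower of `count` twos (2^2^...^2) is at least n. */
int log_star(unsigned long long n)
{
    unsigned long long tower = 1;   /* tower of `count` twos */
    int count = 0;

    while (tower < n) {
        /* next tower value is 2^tower; saturate once it no longer fits */
        tower = (tower >= 64) ? ULLONG_MAX : (1ULL << tower);
        count++;
    }
    assert(tower >= n);
    return count;
}
```

For every value an `unsigned long long` can hold, the result is at most 5, which matches the observation that these functions are tiny for all practical inputs.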
An Application

We have a network of computers and a list of bidirectional connections; each of these connections allows a file transfer from one computer to another. Is it possible to send a file from any computer on the network to any other? An extra restriction is that the problem must be solved on-line. Thus, the list of connections is presented one at a time, and the algorithm must be prepared to give an answer at any point.


An algorithm to solve this problem can initially put every computer in its own set. Our invariant is that two computers can transfer files if and only if they are in the same set. We can see that the ability to transfer files forms an equivalence relation. We then read connections one at a time. When we read some connection, say (u, v), we test to see whether u and v are in the same set and do nothing if they are. If they are in different sets, we merge their sets. At the end of the algorithm, the graph is connected if and only if there is exactly one set. If there are m connections and n computers, the space requirement is O(n). Using union-by-size and path compression, we obtain a worst-case running time of O(m α(m, n)), since there are 2m finds and at most n - 1 unions. This running time is linear for all practical purposes.

Summary

We have seen a very simple data structure to maintain disjoint sets. When the union operation is performed, it does not matter, as far as correctness is concerned, which set retains its name. A valuable lesson that should be learned here is that it can be very important to consider the alternatives when a particular step is not totally specified. The union step is flexible; by taking advantage of this, we are able to get a much more efficient algorithm.

Path compression is one of the earliest forms of self-adjustment, which we have seen elsewhere (splay trees, skew heaps). Its use is extremely interesting, especially from a theoretical point of view, because it was one of the first examples of a simple algorithm with a not-so-simple worst-case analysis.
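The network connectivity application described in this unit can be sketched in a few lines by combining union-by-size with path compression. This is an illustrative sketch only: the `network` type, the bound `MAX_NODES`, and the function names are assumptions made for the example, and computers are assumed to be numbered from 1 to n.

```c
#include <assert.h>

#define MAX_NODES 64   /* assumed upper bound on the number of computers */

typedef struct {
    /* link[i] > 0: parent of i.  link[i] < 0: i is a root, -link[i] is its set size. */
    int link[MAX_NODES + 1];
    int num_sets;      /* number of disjoint sets remaining */
} network;

void net_init(network *net, int n)
{
    assert(n >= 1 && n <= MAX_NODES);
    net->num_sets = n;
    for (int i = 1; i <= n; i++)
        net->link[i] = -1;               /* every computer in its own set */
}

int net_find(network *net, int x)
{
    if (net->link[x] < 0)
        return x;
    return net->link[x] = net_find(net, net->link[x]);   /* path compression */
}

/* Process one connection (u, v) as it arrives. */
void net_connect(network *net, int u, int v)
{
    int ru = net_find(net, u);
    int rv = net_find(net, v);
    if (ru == rv)
        return;                          /* already in the same set: do nothing */
    if (net->link[rv] < net->link[ru]) { /* rv's set is larger: swap the roles */
        int t = ru;
        ru = rv;
        rv = t;
    }
    net->link[ru] += net->link[rv];      /* new size is the sum of the old */
    net->link[rv] = ru;
    net->num_sets--;
}

/* May be asked at any point in the stream of connections. */
int net_is_connected(const network *net)
{
    return net->num_sets == 1;
}
```

Keeping a running count of sets is a choice made here so that the on-line question "is the network connected yet?" can be answered in constant time after each connection.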

HASHING

INTRODUCTION

Hashing is a method of storing data in an array so that searching, inserting and deleting data are fast. For this, every record needs a unique key.


The basic idea is not to search for the correct position of a record with comparisons but to compute the position within the array. The function that returns the position is called the 'hash function' and the array is called a 'hash table'.

WHY HASHING?

In the other types of searching, the record is stored in a table and it is necessary to pass through some number of keys before finding the desired one. An efficient search technique is one that minimizes these comparisons, so we need a search technique in which there are no unnecessary comparisons. If we want to access a key in a single retrieval, then the location of the record within the table must depend only on the key, not on the location of other keys (as it does in other types of searching, for example in a tree). The most efficient way to organize such a table is as an array, and this is what hashing makes possible.

HASH CLASH

Suppose two keys k1 and k2 are such that h(k1) equals h(k2). When a record with key k1 has already been stored at position h(k1), a record with key k2 hashes to that same position, but two records cannot occupy the same position. Such a situation is called a hash collision or hash clash.

METHODS OF DEALING WITH HASH CLASH

There are three basic methods of dealing with a hash clash. They are:

1. Chaining
2. Rehashing
3. Separate chaining

Hashing Techniques

CHAINING


Chaining builds a linked list of all items whose keys hash to the same value. During a search, this sorted linked list is traversed sequentially for the desired key. It involves adding an extra link field to each table position. There are three types of chaining:

1. Standard Coalesced Hashing
2. General Coalesced Hashing
3. Varied Insertion Coalesced Hashing

Standard Coalesced Hashing

It is the simplest of the chaining methods. It reduces the average number of probes for an unsuccessful search. It supports deletion without affecting efficiency.

General Coalesced Hashing

It is the generalization of the standard coalesced chaining method. In this method, we add extra positions to the hash table that can be used to hold nodes when a collision occurs.

SOLVING HASH CLASHES BY LINEAR PROBING

The simplest method is that when a clash occurs, the record is inserted in the next available place in the table. For example, if position 645 is occupied and the next position, 646, is empty, the record with key 012345645 can be inserted at position 646. Similarly, if a record whose key satisfies key % 1000 = 646 then appears, it will be inserted in the next empty position. This technique is called linear probing and is an example of a general method for resolving hash clashes called rehashing or open addressing.

DISADVANTAGES OF LINEAR PROBING

It may happen, however, that the probing loop executes forever. There are two possible reasons for this. First, the table may be full, so that it is impossible to insert any new record. This situation can be detected by keeping a count of the number of records in the table. When the count equals the table size, no further insertion should be done.

Second, the loop may run forever even though the table is not full. Suppose all the odd positions are empty, all the even positions are full, and a key that hashes to an even position is rehashed with rh(i) = (i + 2) % 1000. Only even positions are ever probed, so no empty position is found. Of course, it is very unlikely that all the odd positions are empty and all the even positions are full. However, if the rehash function rh(i) = (i + 200) % 1000 is used, each key can be placed in one of only five positions (1000 / 200 = 5). Here the loop can run infinitely, too.

SEPARATE CHAINING

As we have seen earlier, we cannot insert more items than the table size allows. In other cases, we allocate much more space than required, which wastes space. To tackle these problems, there is a separate method of resolving clashes called separate chaining. It keeps a distinct linked list for all records whose keys hash into a particular value. In the example, items whose keys end with the same units digit are placed in the same linked list, as shown in the figure; the tens and hundreds digits are not taken into account. The pointer in each node points to the next node, and when there are no more nodes, the pointer is NULL.

ADVANTAGES OF SEPARATE CHAINING

1. There is no worry about filling up the table, whatever the number of items.
2. The list items need not be in contiguous storage.
3. It allows traversal of items in hash key order.

SITUATION OF HASH CLASH

What would happen if we want to insert a new part number 012345645 in the table? Using the hash function key % 1000 we get 645, so the record for the part belongs in position 645. However, position 645 is already occupied by the record with key 011345645. Therefore the record with the key 012345645 must be inserted somewhere else in the table; this is a hash clash. This is illustrated in the given table:

    POSITION    KEY          RECORD
    0           258001201
    2           698321903
    3           986453204
    .
    .
    450         256894450
    451         158965451
    .
    .
    647         214563647
    648         782154648
    649         325649649
    .
    .
    997         011239997
    998         231452998
    999         011232999
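The clash situation and its resolution by linear probing can be sketched as follows. This is a minimal illustration, not the notes' own code: it assumes non-negative integer keys, uses -1 as an "empty" marker, does not check for duplicate keys, and the names `hash_table`, `table_init`, `hash_key`, `rh`, and `table_insert` are hypothetical. Part numbers are written without leading zeros because a leading 0 would make a C literal octal.

```c
#include <assert.h>

#define TABLE_SIZE 1000
#define EMPTY_KEY  (-1L)   /* assumed sentinel: real keys are non-negative */

typedef struct {
    long keys[TABLE_SIZE];
    int  count;            /* number of records stored, to detect a full table */
} hash_table;

void table_init(hash_table *t)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        t->keys[i] = EMPTY_KEY;
    t->count = 0;
}

int hash_key(long key) { return (int)(key % TABLE_SIZE); }   /* key % 1000 */
int rh(int i)          { return (i + 1) % TABLE_SIZE; }      /* linear probing */

/* Returns the position used, or -1 if the table is full. */
int table_insert(hash_table *t, long key)
{
    assert(key >= 0);
    if (t->count == TABLE_SIZE)
        return -1;                 /* full: probing would never terminate */

    int i = hash_key(key);
    while (t->keys[i] != EMPTY_KEY)
        i = rh(i);                 /* clash: try the next position */
    t->keys[i] = key;
    t->count++;
    return i;
}
```

Inserting 11345645 and then 12345645 into an empty table places the first at position 645 and the second at 646. Because the step is 1, every position is eventually probed, so the count check alone is enough to rule out an endless loop.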

DOUBLE HASHING

Double hashing is a method of open addressing for a hash table in which a collision is resolved by searching the table for an empty place at intervals given by a different hash function, thus minimizing clustering. Unlike linear collision resolution, double hashing uses a second hashing function that normally limits multiple collisions. The idea is that if two values hash to the same spot in the table, a step size can be calculated from the initial value using the second hashing function. That step size changes the sequence of locations probed in the table while still giving access to the entire table.

CLUSTERING

There are mainly two types of clustering.

PRIMARY CLUSTERING

When the entire array is empty, it is equally likely that a record is inserted at any position in the array. However, once entries have been inserted and several hash clashes have been resolved, this no longer remains true. For example, in the table given above, it is five times as likely for a record to be inserted at position 994 as at position 401. This is because any record whose key hashes into 990, 991, 992, 993 or 994 will be placed in 994, whereas only a record whose key hashes into 401 will be placed there. This phenomenon, where two keys that hash into different values compete with each other in successive rehashes, is called primary clustering.

CAUSE OF PRIMARY CLUSTERING

Any rehash function that depends solely on the index to be rehashed causes primary clustering.

WAYS OF ELIMINATING PRIMARY CLUSTERING

One way of eliminating primary clustering is to allow the rehash function to depend on the number of times that the function is applied to a particular hash value. Another way is to use a random permutation of the numbers between 1 and e, where e is the table size minus 1 (the largest index of the table). One more method is to allow the rehash to depend on the hash value. All these methods allow keys that hash into different locations to follow separate rehash paths.

SECONDARY CLUSTERING

In this type, different keys that hash to the same value follow the same rehash path.

WAYS TO ELIMINATE SECONDARY CLUSTERING

Both types of clustering can be eliminated by double hashing, which involves the use of two hash functions h1(key) and h2(key). h1 is known as the primary hash function and is applied first to get the position where the key will be inserted. If that position is already occupied, the rehash function rh(i, key) = (i + h2(key)) % table size is used successively until an empty position is found. As long as h2(key1) does not equal h2(key2), records with keys key1 and key2 do not compete for the same sequence of positions. Therefore one should choose functions h1 and h2 that distribute the hashes and rehashes uniformly in the table and also minimize clustering.
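As a hedged illustration of the rehash function rh(i, key) = (i + h2(key)) % table size, the sketch below uses a table of size 11 with h1(key) = key mod 11 and h2(key) = 1 + (key mod 10). These particular functions, the `EMPTY` marker, and the names `dh_init` and `dh_insert` are assumptions chosen for the example. The table size is prime and h2 is never 0, so the probe sequence can reach every position.

```c
#include <assert.h>

#define M 11          /* table size; prime, so any step from 1 to 10 covers the table */
#define EMPTY (-1)    /* assumed sentinel: keys are non-negative */

int h1(int key)        { return key % M; }              /* primary hash function */
int h2(int key)        { return 1 + key % (M - 1); }    /* step size, never 0 */
int rh(int i, int key) { return (i + h2(key)) % M; }    /* rehash function */

void dh_init(int table[M])
{
    for (int i = 0; i < M; i++)
        table[i] = EMPTY;
}

/* Returns the position used, or -1 if no empty position exists. */
int dh_insert(int table[M], int key)
{
    assert(key >= 0);
    int i = h1(key);
    for (int probes = 0; probes < M; probes++) {
        if (table[i] == EMPTY) {
            table[i] = key;
            return i;
        }
        i = rh(i, key);   /* step depends on the key, not only on i */
    }
    return -1;            /* all M positions were probed: table is full */
}
```

Two keys that clash under h1 usually get different step sizes from h2, so they follow different rehash paths, which is the property the text relies on to avoid secondary clustering.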


DELETING AN ITEM FROM THE HASH TABLE

It is very difficult to delete an item from a hash table that uses rehashes for search and insertion. Suppose that a record r is placed at some specific location, and another record r1 then hashes to the same location. Record r1 will be inserted in the next empty location after that original location. Now suppose that record r is deleted, so its location becomes empty. If we then search for record r1, the search reaches the empty location where r used to be and erroneously concludes that r1 is absent from the table.

One possible solution to this problem is to mark a deleted record's position as "deleted" rather than "empty" and to continue the search whenever a "deleted" position is encountered. But this is practical only when there are few deletions; otherwise an unsuccessful search will have to search the entire table, since most of the positions will be marked "deleted" rather than "empty".
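The "deleted" marker idea can be sketched as a three-state slot. The following is an illustrative sketch only: it assumes linear probing with h(key) = key mod table size, non-negative keys that are unique, a deliberately tiny table, and hypothetical names (`probe_table`, `pt_init`, `pt_search`, `pt_insert`, `pt_delete`).

```c
#include <assert.h>

#define TABLE_SIZE 10     /* small assumed size for illustration */

enum slot_state { SLOT_EMPTY, SLOT_DELETED, SLOT_FULL };

typedef struct {
    int key[TABLE_SIZE];
    enum slot_state state[TABLE_SIZE];
} probe_table;

void pt_init(probe_table *t)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        t->state[i] = SLOT_EMPTY;
}

/* Returns the position of key, or -1 if absent. The search continues
   past "deleted" positions and stops only at a truly empty one. */
int pt_search(const probe_table *t, int key)
{
    int i = key % TABLE_SIZE;
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->state[i] == SLOT_EMPTY)
            return -1;
        if (t->state[i] == SLOT_FULL && t->key[i] == key)
            return i;
        i = (i + 1) % TABLE_SIZE;
    }
    return -1;            /* whole table scanned without finding the key */
}

/* Assumes key is not already present. Reuses empty or deleted positions. */
int pt_insert(probe_table *t, int key)
{
    assert(key >= 0);
    int i = key % TABLE_SIZE;
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->state[i] != SLOT_FULL) {
            t->key[i] = key;
            t->state[i] = SLOT_FULL;
            return i;
        }
        i = (i + 1) % TABLE_SIZE;
    }
    return -1;            /* table is full */
}

/* Returns 1 if the key was found and marked deleted, 0 otherwise. */
int pt_delete(probe_table *t, int key)
{
    int i = pt_search(t, key);
    if (i < 0)
        return 0;
    t->state[i] = SLOT_DELETED;   /* mark "deleted", not "empty" */
    return 1;
}
```

After inserting 15 and 25 (both hash to position 5) and deleting 15, a search for 25 still succeeds because position 5 is marked "deleted" rather than "empty". The bounded probe loop also shows the cost mentioned above: with many "deleted" positions, an unsuccessful search may scan the whole table.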

Part A Questions

1. Chaining is a much better resolution technique than open addressing. Justify the statement.
2. Briefly explain collision resolution by open addressing. Explain its disadvantages.
3. Which problem occurs during quadratic rehash? How can it be avoided?
4. Write a short note on the mid-square method adopted to calculate the address of a record.
5. Explain the division method to obtain the address of a record.
6. Briefly explain the multiplicative method used to obtain the address of a record.
7. Summarize the hash table organization in terms of advantages and disadvantages.
8. What is universal hashing? What should be the correct table size m in the division method used in hashing?
9. a) Explain the hashing function: digit analysis. b) What should be done when a key is large?
10. Briefly explain the folding method used to calculate the address for a given record. What is the major drawback of the hashing functions?
11. What is direct addressing? When is it used?
12. What is the advantage of using a hash function over direct addressing? Explain.
13. When does a 'collision' occur? What are the methods to resolve it? Explain giving examples.
14. Explain the 'division method' for creating hash functions.
15. Consider a version of the division method in which h(k) = k mod m, where m = 2^p - 1 and k is a character string interpreted in radix 2^p. Show that if string x can be derived from string y by permuting its characters, then x and y hash to the same value.
16. What is 'universal hashing' and how will you design a universal class of hash functions?
17. Differentiate between linear probing and quadratic probing.
18. Insert the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of length m = 11 using open addressing with the primary hash function h'(k) = k mod m. Illustrate the result of inserting keys using linear probing, using quadratic probing with c1 = 1 and c2 = 3, and using double hashing with h2(k) = 1 + (k mod (m - 1)).


PART B Questions
1. Explain any two techniques to overcome hash collisions.
   Separate chaining: example, explanation, coding
   Open addressing: linear probing, quadratic probing
2. Explain the implementation of different hashing techniques.

3. Given the input {4371, 1323, 6173, 4199, 4344, 9679, 1989} and the hash function h(x) = x mod 10, show the resulting:
   (a) separate chaining table (4)
   (b) open addressing hash table using linear probing (4)
   (c) open addressing hash table using quadratic probing (4)
   (d) open addressing hash table with second hash function h2(x) = 7 - (x mod 7).
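The separate chaining technique named in question 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not a model answer: the class name ChainedHashTable, the table size 7 and the sample keys are all hypothetical, and the hash function is the plain division method.

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list (chain) of the keys that hash to it."""

    def __init__(self, size=10):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one empty chain per slot

    def _hash(self, key):
        # Division method: h(x) = x mod size
        return key % self.size

    def insert(self, key):
        chain = self.slots[self._hash(key)]
        if key not in chain:  # ignore duplicate keys
            chain.append(key)

    def contains(self, key):
        return key in self.slots[self._hash(key)]

    def remove(self, key):
        chain = self.slots[self._hash(key)]
        if key in chain:
            chain.remove(key)
            return True
        return False


# Hypothetical keys: 15, 22 and 8 all hash to slot 1 when size is 7.
table = ChainedHashTable(size=7)
for k in (15, 22, 8, 30):
    table.insert(k)
print(table.slots[1])  # the chain of colliding keys
```

Colliding keys simply share a chain, so an insertion never fails for lack of space; the cost is that a lookup must scan the chain of its slot.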

Objective Type Questions


1) Name the function which, when applied to the key, produces an integer which can be used as an address in a hash table.
a) Greatest Integer Function
b) Modulus Function
c) Signum Function
d) Hash Function

2) Which phenomenon occurs when a hash function maps 2 different keys to the same table address?
a) Collision
b) Overlaps
c) Nothing Happens
d) Rebound


3) What is the name of the collision resolution technique in which the next slot in the table is checked on a collision?
a) Chaining
b) Linear Probing
c) Quadratic Probing
d) Clustering

4) Name the hashing scheme in which a second-order function of the hash index is used to calculate the address.
a) Quadratic Probing
b) Sequential Search
c) Linear Probing
d) Chaining

5) What is the name of the collision resolution technique in which colliding records are chained into a special overflow area which is separate from the prime area?
a) Chaining
b) Mid Square Method
c) Folding Method
d) Open Addressing

6) Which number corresponds to the mapping of 72963195 according to the folding method?
a) 455
b) 1455
c) 451
d) 456

7) What will be the address of the key 12345 according to the mid-square method if the positions 5 and 6 are taken into account?
a) 21
b) 91
c) 99
d) 19

8) Which type of address is obtained by a perfect hash function?
a) Odd
b) Unique
c) Even
d) No address is obtained

9) What is the name of the collision sequences generated by addresses calculated with the help of quadratic probing?
a) Rings
b) Stacks
c) Secondary Clustering
d) Linked Lists

10) Name the technique used to avoid secondary clustering.
a) Quadratic Probe
b) Double Hashing
c) Linear Probe
d) Open Addressing

Answers:
1. Answer: d) Hash Function

Explanation: A hash function performs a key-to-address transformation, i.e. when applied to the key it produces an integer which can be used as an address in a hash table.
2. Answer: a) Collision

Explanation: When a hash function maps 2 different keys to the same table address, a collision is said to occur. Suppose that 2 keys K1 and K2 map to the same address. When a record with the key K1 is entered, it is inserted at the hashed address. But when another record with key K2 is entered, K2 is hashed and an attempt is made to insert the record at the position where the record with key K1 is already stored, because K2 hashes to the same address as K1. This results in a hash collision.
3. Answer: b) Linear Probing

Explanation: Linear probing belongs to the category of collision resolution techniques called open addressing. The address in the hash table is first calculated using a suitable hash function. If that position is not empty, the next slot of the table is checked, and so on, until an empty position is found, where the record is inserted.
4. Answer: a) Quadratic Probing

Explanation: Quadratic probing is a rehashing scheme in which a second-order function of the hash index is usually used to calculate the address. If there is a collision at the hash address h, this method probes the array at the locations h+1, h+4, h+9, and so on.
5. Answer: a) Chaining
Explanation: Chaining is a collision resolution technique in which colliding records are chained into a special overflow area which is separate from the prime area. The prime area is the part of the array into which records are initially hashed. Each record in the prime area contains a pointer field to maintain a separate linked list, in the overflow area, for each set of colliding records.
6. Answer: a) 455
Explanation: In the folding method a key is broken into several parts. Each part has the same length as the required address, except possibly the last part. The parts are added together and the final carry is ignored. Therefore 72963195 maps to 729 + 631 + 95 = 1455, which gives 455 after ignoring the final carry of 1.
7. Answer: c) 99
Explanation: In the mid-square method a key is multiplied by itself and the hash value is obtained by selecting an appropriate number of digits from the middle of the square. For the key 12345, the square of the key is 152399025. If we choose the digits at the 5th and 6th positions (counting from the left) for the address, then the address is 99.
8. Answer: b) Unique

Explanation: A perfect hash function is a function which, when applied to all the members of the set of items to be stored in a hash table, produces a unique set of integers within some suitable range.
9. Answer: c) Secondary Clustering

Explanation: Secondary clustering refers to the collision sequences generated by addresses calculated with quadratic probing. Different keys that hash to the same initial address follow the same sequence of rehashed addresses, so they keep colliding with each other.
10. Answer: b) Double Hashing

Explanation: Double hashing can be used to avoid secondary clustering. For example, suppose h1 is a hash function which generates the same address i for 2 different keys x1 and x2, and we have a second function h2 which generates different values for x1 and x2. We can use h1 as the primary hash function and h2 as the rehash function. The functions h1 and h2 should be chosen so as to distribute the hashes and rehashes uniformly over the array indices and to minimize clustering.
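The three open addressing schemes discussed in answers 3, 4 and 10 differ only in how the i-th probe position is computed. The sketch below illustrates this under stated assumptions: the function names, the table size m = 7 and the sample keys are hypothetical, the quadratic scheme uses the h+1, h+4, h+9 offsets from answer 4, and the second hash function is the h2(k) = 1 + (k mod (m - 1)) form that appears in the exercises.

```python
def linear(key, i, m):
    # i-th probe: h, h+1, h+2, ... (mod m)
    return (key % m + i) % m


def quadratic(key, i, m):
    # i-th probe: h, h+1, h+4, h+9, ... (mod m)
    return (key % m + i * i) % m


def double_hashing(key, i, m):
    # The step size depends on the key, so keys with the same
    # primary hash follow different probe sequences.
    h2 = 1 + key % (m - 1)  # assumed second hash; never zero
    return (key % m + i * h2) % m


def build(keys, m, probe):
    """Insert keys into an open-addressed table of size m using the given probe function."""
    table = [None] * m
    for key in keys:
        for i in range(m):
            slot = probe(key, i, m)
            if table[slot] is None:
                table[slot] = key
                break
        else:
            raise RuntimeError(f"no free slot found for key {key}")
    return table


# Hypothetical keys that all have primary hash 3 when m is 7.
keys = [10, 17, 24]
print(build(keys, 7, linear))          # [None, None, None, 10, 17, 24, None]
print(build(keys, 7, quadratic))       # [24, None, None, 10, 17, None, None]
print(build(keys, 7, double_hashing))  # [None, None, 17, 10, 24, None, None]
```

With linear and quadratic probing, every key that starts at slot 3 retraces the same sequence of slots. With double hashing, 17 steps by 6 and 24 steps by 1, so the two keys separate after the first collision.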

11. What is the best definition of a collision in a hash table?
A. Two entries are identical except for their keys.
B. Two entries with different data have the exact same key.
C. Two entries with different keys have the same exact hash value.
D. Two entries with the exact same key have different hash values.

12. Which guideline is NOT suggested by empirical or theoretical studies of hash tables?
A. Hash table size should be the product of two primes.
B. Hash table size should be the upper of a pair of twin primes.
C. Hash table size should have the form 4K+3 for some K.
D. Hash table size should not be too near a power of two.

13. In an open-address hash table there is a difference between those spots which have never been used and those spots which have previously been used but no longer contain an item. Which function has a better implementation because of this difference?
A. insert
B. is_present
C. remove
D. size
E. Two or more of the above functions
14. What kind of initialization needs to be done for an open-address hash table?
A. None.


B. The key at each array location must be initialized.
C. The head pointer of each chain must be set to NULL.
D. Both B and C must be carried out.
15. What kind of initialization needs to be done for a chained hash table?
A. None.
B. The key at each array location must be initialized.
C. The head pointer of each chain must be set to NULL.
D. Both B and C must be carried out.
16. A chained hash table has an array size of 512. What is the maximum number of entries that can be placed in the table?
A. 256
B. 511
C. 512
D. 1024
E. There is no maximum.
17. Suppose you place m items in a hash table with an array size of s. What is the correct formula for the load factor?
A. s + m
B. s - m
C. m - s
D. m * s
E. m / s
18. The constructor of the Hashtable class initializes data members and creates the Hashtable.
o True
o False
19. Hash tables are very good if there are many insertions and deletions.
o True
o False
