
2-3 Trees
 One idea for balancing a search tree is to allow more than one key in the same
node of the tree.
 The simplest implementation of this idea is the 2-3 tree, introduced by the U.S.
computer scientist John Hopcroft in 1970.
 A 2-3 tree is a tree that can have nodes of two kinds: 2-nodes and 3-nodes.
 A 2-node contains a single key K and has two children: the left child serves as
the root of a subtree whose keys are less than K, and the right child serves as
the root of a subtree whose keys are greater than K. (In other words, a 2-node
is the same kind of node we have in the classical binary search tree.)
 A 3-node contains two ordered keys K1 and K2 (K1 < K2) and has three
children. The leftmost child serves as the root of a subtree with keys less than
K1, the middle child serves as the root of a subtree with keys between K1 and
K2, and the rightmost child serves as the root of a subtree with keys greater
than K2.
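A minimal C sketch of this node structure and the search logic it implies (the struct layout and all names below are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stddef.h>

struct node23 {
    int nkeys;                 /* 1 for a 2-node, 2 for a 3-node */
    int key[2];                /* key[0] = K1; key[1] = K2 (3-nodes only) */
    struct node23 *child[3];   /* all NULL in a leaf */
};

bool search23(const struct node23 *t, int k) {
    while (t != NULL) {
        if (k == t->key[0] || (t->nkeys == 2 && k == t->key[1]))
            return true;                 /* key found in this node */
        if (k < t->key[0])
            t = t->child[0];             /* subtree with keys < K1 */
        else if (t->nkeys == 1 || k < t->key[1])
            t = t->child[1];             /* keys > K (2-node) or between K1 and K2 */
        else
            t = t->child[2];             /* subtree with keys > K2 */
    }
    return false;
}

Insertion keeps the tree perfectly balanced by splitting an overflowing 3-node and promoting its middle key to the parent, which is what the two construction exercises below trace.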
 Construct a 2-3 tree for the list 1,2,3,4,5,6

 Construct a 2-3 tree for the list 6,5,4,3,2,1


 The data structure called the “heap” is a clever, partially
ordered data structure that is especially suitable for
implementing priority queues.
 Recall that a priority queue is a multiset of items with an
orderable characteristic called an item’s priority, with
the following operations:
• finding an item with the highest (i.e., largest) priority
• deleting an item with the highest priority
• adding a new item to the multiset
BOTTOM UP HEAP CONSTRUCTION
TOP DOWN HEAP CONSTRUCTION
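A minimal C sketch of the bottom-up approach, assuming a max-heap stored in h[1..n] (the function names are mine, not from the slides): every parental node, from the last one down to the root, is sifted down until parental dominance holds. Top-down construction would instead insert the keys one at a time, sifting each new key up.

#include <stdio.h>

/* Sift the element at index i down until it dominates both children;
   the heap occupies h[1..n]. */
static void sift_down(int h[], int i, int n) {
    int v = h[i];
    while (2 * i <= n) {               /* while i has at least one child */
        int j = 2 * i;                 /* left child */
        if (j < n && h[j + 1] > h[j])  /* pick the larger child */
            j++;
        if (v >= h[j]) break;
        h[i] = h[j];
        i = j;
    }
    h[i] = v;
}

/* Bottom-up construction: heapify parental nodes n/2, n/2-1, ..., 1. */
void heap_bottom_up(int h[], int n) {
    for (int i = n / 2; i >= 1; i--)
        sift_down(h, i, n);
}

int main(void) {
    int h[] = {0, 2, 9, 7, 6, 5, 8};   /* h[0] unused; keys 2,9,7,6,5,8 */
    heap_bottom_up(h, 6);
    for (int i = 1; i <= 6; i++)
        printf("%d ", h[i]);           /* prints the max-heap 9 6 8 2 5 7 */
    return 0;
}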
 The idea is to preprocess the problem’s input, in whole or in
part, and store the additional information obtained to
accelerate solving the problem afterward. We call this
approach input enhancement.

 The other type of technique that exploits space-for-time
trade-offs simply uses extra space to facilitate faster and/or
more flexible access to the data. We call this approach
prestructuring.

 There is one more algorithm design technique related to the
space-for-time trade-off idea: dynamic programming. This
strategy is based on recording solutions to overlapping
subproblems of a given problem in a table from which a
solution to the problem in question is then obtained.
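As an illustrative sketch of this strategy (the Fibonacci example is mine, not from the slides): each overlapping subproblem F(i) is solved once, recorded in a table, and reused.

#include <stdio.h>

/* Dynamic programming: solutions to the overlapping subproblems
   F(0), ..., F(n) are recorded in table f[], from which F(n) is read. */
long long fib(int n) {
    long long f[93];                   /* F(92) is the largest that fits in 64 bits */
    f[0] = 0; f[1] = 1;
    for (int i = 2; i <= n; i++)
        f[i] = f[i - 1] + f[i - 2];
    return f[n];
}

int main(void) {
    printf("F(10) = %lld\n", fib(10)); /* prints 55 */
    return 0;
}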
Analysis
 A simple example can demonstrate that the worst-case efficiency of
Horspool’s algorithm is in O(nm).
 But for random texts, it is in Θ(n), and, although in the same efficiency class,
Horspool’s algorithm is obviously faster on average than the brute-force
algorithm.

 EXAMPLE 2
LAB PROGRAM
Consider the problem of searching for genes in DNA sequences using Horspool’s algorithm. A DNA
sequence is represented by a text on the alphabet {A, C, G, T}, and the gene or gene segment is the
pattern. A gene segment of your chromosome 10 has the pattern TCCTATTCTT . Design and
develop a program in C to locate the above pattern in the following DNA sequence by applying
Horspool’s algorithm.
TTATAGATCTCGTATTCTTTTATAGATCTCCTATTCTT.
Also compute the number of comparisons using this method as compared to the linear search method.
 Aim: To find the given pattern in the text using Horspool’s algorithm.

 Theory :

 The technique of input enhancement can be applied to the problem of string matching. The
problem of string matching requires finding an occurrence of a given string of m characters
called the pattern in a longer string of n characters called the text. The brute-force algorithm for
this problem simply matches corresponding pairs of characters in the pattern and the text left to
right and, if a mismatch occurs, shifts the pattern one position to the right for the next trial.
Since the maximum number of such trials is n − m + 1 and, in the worst case, m comparisons
need to be made on each of them, the worst-case efficiency of the brute-force algorithm is in the
O(nm) class. On average, however, we should expect just a few comparisons before a pattern’s
shift, and for random natural-language texts, the average-case efficiency indeed turns out to be
in O(n + m). The worst-case efficiency of Horspool’s algorithm is in O(nm). But for random
texts, it is in Θ(n), and, although in the same efficiency class, Horspool’s algorithm is
obviously faster on average than the brute-force algorithm.
ALGORITHM ShiftTable(P[0..m − 1])
//Fills the shift table used by Horspool’s and Boyer-Moore algorithms
//Input: Pattern P[0..m − 1] and an alphabet of possible characters
//Output: Table[0..size − 1] indexed by the alphabet’s characters and
//        filled with shift sizes computed by formula (7.1)
for i ← 0 to size − 1 do Table[i] ← m
for j ← 0 to m − 2 do Table[P[j]] ← m − 1 − j
return Table
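 For the lab pattern TCCTATTCTT (m = 10), scanning P[0..8] gives last
occurrences of T at j = 8, C at j = 7, and A at j = 4; hence, by formula (7.1),
Table[T] = 1, Table[C] = 2, Table[A] = 5, and Table[c] = 10 for every other
character c (including G).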

ALGORITHM HorspoolMatching(P[0..m − 1], T[0..n − 1])
//Implements Horspool’s algorithm for string matching
//Input: Pattern P[0..m − 1] and text T[0..n − 1]
//Output: The index of the left end of the first matching substring
//        or −1 if there are no matches
ShiftTable(P[0..m − 1])    //generate Table of shifts
i ← m − 1                  //position of the pattern’s right end
while i ≤ n − 1 do
    k ← 0                  //number of matched characters
    while k ≤ m − 1 and P[m − 1 − k] = T[i − k] do
        k ← k + 1
    if k = m
        return i − m + 1
    else i ← i + Table[T[i]]
return −1
#include <stdio.h>
#include <string.h>
#define MAX 500

int t[MAX];     /* shift table indexed by character code */
int count = 0;  /* character-comparison counter, added here to meet the
                   lab requirement of reporting the number of comparisons */

void shifttable(char p[])
{
    int i, j, m;
    m = strlen(p);
    for (i = 0; i < MAX; i++)        /* default shift: pattern length m */
        t[i] = m;
    for (j = 0; j < m - 1; j++)      /* shifts for characters in p[0..m-2] */
        t[(unsigned char)p[j]] = m - 1 - j;
}

int horspool(char src[], char p[])
{
    int i, k, m, n;
    n = strlen(src);
    m = strlen(p);
    printf("\nLength of text = %d", n);
    printf("\nLength of pattern = %d", m);
    i = m - 1;                       /* position of the pattern's right end */
    while (i < n) {
        k = 0;                       /* number of matched characters */
        while (k < m) {
            count++;                 /* one character comparison */
            if (p[m - 1 - k] != src[i - k])
                break;
            k++;
        }
        if (k == m)
            return i - m + 1;        /* left end of the matching substring */
        i += t[(unsigned char)src[i]];
    }
    return -1;
}

int main(void)
{
    char src[100], p[100];
    int pos;
    printf("Enter the text in which the pattern is to be searched:\n");
    fgets(src, sizeof src, stdin);
    src[strcspn(src, "\n")] = '\0';  /* fgets replaces the unsafe gets */
    printf("Enter the pattern to be searched:\n");
    fgets(p, sizeof p, stdin);
    p[strcspn(p, "\n")] = '\0';
    shifttable(p);
    pos = horspool(src, p);
    if (pos >= 0)
        printf("\nThe desired pattern was found starting from position %d", pos + 1);
    else
        printf("\nThe pattern was not found in the given text");
    printf("\nNumber of character comparisons = %d\n", count);
    return 0;
}
 Obviously, if we choose a hash table’s size m to be smaller than the
number of keys n, we will get collisions—a phenomenon of two (or more)
keys being hashed into the same cell of the hash table.
 But collisions should be expected even if m is considerably larger than n.
 In fact, in the worst case, all the keys could be hashed to the same cell of
the hash table.
 Fortunately, with an appropriately chosen hash table size and a good hash
function, this situation happens very rarely.
 Still, every hashing scheme must have a collision resolution mechanism.
 This mechanism is different in the two principal versions of hashing: open
hashing (also called separate chaining) and closed hashing (also called
open addressing).
Open Hashing (Separate Chaining)
 If the hash function distributes n keys among m cells of the hash table
about evenly, each list will be about n/m keys long.
 The ratio α = n/m, called the load factor of the hash table, plays a
crucial role in the efficiency of hashing.
 In particular, the average number of pointers (chain links) inspected in
successful searches, S, and unsuccessful searches, U, turns out to be

 S ≈ 1 + α/2 and U = α, respectively.
 The two other dictionary operations—insertion and deletion—are almost
identical to searching.
 Insertions are normally done at the end of a list.
 Deletion is performed by searching for a key to be deleted and then removing
it from its list.
 Hence, the efficiency of these operations is identical to that of searching, and
they are all in Θ(1) in the average case if the number of keys n is about equal to
the hash table’s size m.
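A minimal C sketch of open hashing (the hash function and all names below are assumptions for illustration, not from the slides): each table cell holds the head of a linked list of the keys hashed to it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define M 13                     /* table size; the load factor is α = n/M */

struct chain { char *key; struct chain *next; };
struct chain *table[M];          /* M chains, initially all NULL */

/* Toy hash (an assumption for this sketch): sum of character codes mod M. */
unsigned h(const char *k) {
    unsigned s = 0;
    while (*k) s += (unsigned char)*k++;
    return s % M;
}

/* Insert at the front of the key's chain for brevity; the slides
   describe insertion at the end of the list. */
void insert(const char *k) {
    struct chain *p = malloc(sizeof *p);
    p->key = strdup(k);
    p->next = table[h(k)];
    table[h(k)] = p;
}

/* A successful search inspects about 1 + α/2 links on average. */
int search(const char *k) {
    for (struct chain *p = table[h(k)]; p != NULL; p = p->next)
        if (strcmp(p->key, k) == 0) return 1;
    return 0;
}

int main(void) {
    insert("ARE"); insert("SOON");
    printf("%d %d\n", search("SOON"), search("LIT"));   /* prints 1 0 */
    return 0;
}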
Closed Hashing (Open Addressing)
 In closed hashing, all keys are stored in the hash table
itself without the use of linked lists.
 Of course, this implies that the table size m must be at
least as large as the number of keys n.
 Different strategies can be employed for collision
resolution.
 The simplest one—called linear probing—checks the cell
following the one where the collision occurs.
 If that cell is empty, the new key is installed there; if the
next cell is already occupied, the availability of that cell’s
immediate successor is checked, and so on.
 To search for a given key K, we start by computing h(K)
where h is the hash function used in the table
construction.
 If the cell h(K) is empty, the search is unsuccessful.
 If the cell is not empty, we must compare K with the cell’s
occupant: if they are equal, we have found a matching
key; if they are not, we compare K with a key in the next
cell and continue in this manner until we encounter either
a matching key (a successful search) or an empty cell
(unsuccessful search).
 For example, if we search for the word LIT in the table of
Figure 7.6,
we will get h(LIT) = (12 + 9 + 20) mod 13 = 2 and,
since cell 2 is empty, we can stop immediately.
However, if we search for KID with
h(KID) = (11 + 9 + 4) mod 13 = 11,
we will have to compare KID with ARE, SOON, PARTED,
and A before we can declare the search unsuccessful.
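A minimal C sketch of linear probing (the names are mine; the hash mirrors the letter-position hash of the Figure 7.6 example):

#include <stdio.h>
#include <string.h>

#define M 13
static const char *cells[M];     /* NULL marks an empty cell */

/* The hash of the Figure 7.6 example: sum of the letter positions
   (A = 1, ..., Z = 26) mod 13; assumes uppercase keys. */
unsigned h(const char *k) {
    unsigned s = 0;
    for (; *k; k++) s += (unsigned)(*k - 'A' + 1);
    return s % M;
}

/* Linear probing: on a collision, try the next cell, wrapping around.
   The table is assumed not full (m >= n), so an empty cell exists. */
void insert(const char *k) {
    unsigned i = h(k);
    while (cells[i] != NULL)
        i = (i + 1) % M;
    cells[i] = k;
}

/* Search stops at a matching key (success) or an empty cell (failure). */
int search(const char *k) {
    unsigned i = h(k);
    while (cells[i] != NULL) {
        if (strcmp(cells[i], k) == 0) return 1;
        i = (i + 1) % M;
    }
    return 0;
}

int main(void) {
    insert("ARE"); insert("KID");    /* both hash to 11: KID probes to cell 12 */
    printf("%d %d\n", search("KID"), search("LIT"));    /* prints 1 0 */
    return 0;
}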
 Although the search and insertion operations are
straightforward for this version of hashing, deletion is not.
 For example, if we simply delete the key ARE from the
last state of the hash table in Figure 7.6, we will be
unable to find the key SOON afterward.
 Indeed, after computing h(SOON) = 11, the algorithm
would find this location empty and report the
unsuccessful search result.
 A simple solution is to use “lazy deletion,” i.e., to mark
previously occupied locations by a special symbol to
distinguish them from locations that have not been
occupied.
 Still, as the hash table gets closer to being full, the
performance of linear probing deteriorates because of a
phenomenon called clustering.
 A cluster in linear probing is a sequence of contiguously
occupied cells (with a possible wrapping).
 For example, the final state of the hash table of Figure
7.6 has two clusters.
 Clusters are bad news in hashing because they make the
dictionary operations less efficient.
 Several other collision resolution strategies have been
suggested to alleviate this problem. One of the most
important is double hashing.
 Under this scheme, we use another hash function, s(K), to
determine a fixed increment for the probing sequence to
be used after a collision at location
 l = h(K): (l + s(K)) mod m, (l + 2s(K)) mod m, . . .
 Mathematical analysis of double hashing has proved to be
quite difficult.
 Some partial results and considerable practical
experience with the method suggest that with good
hashing functions—both primary and secondary—double
hashing is superior to linear probing.
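A minimal C sketch of the double-hashing probe sequence (both hash functions below are common textbook choices assumed for illustration, not taken from the slides):

#include <stdio.h>

#define M 13                         /* table size; a prime works best */
static int cells[M];                 /* 0 marks an empty cell; keys are nonzero */

unsigned h(int k) { return (unsigned)k % M; }           /* primary hash */
unsigned s(int k) { return 1 + (unsigned)k % (M - 1); } /* secondary hash, never 0 */

/* Probe l, (l + s(K)) mod m, (l + 2s(K)) mod m, ... until an empty
   cell is found; assumes the table is not full. */
void insert(int k) {
    unsigned i = h(k), step = s(k);
    while (cells[i] != 0)
        i = (i + step) % M;
    cells[i] = k;
}

int main(void) {
    insert(5); insert(18);           /* both hash to 5; 18 probes with step s(18) = 7 */
    printf("%d %d\n", cells[5], cells[(5 + 7) % M]);    /* prints 5 18 */
    return 0;
}

For every cell of the table to be reachable by this probe sequence, s(K) and m must be relatively prime, which is guaranteed when m itself is prime.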
 But its performance also deteriorates when the table gets
close to being full.
 A natural solution in such a situation is rehashing: the
current table is scanned, and all its keys are relocated
into a larger table.
 Since its discovery in the 1950s by IBM researchers,
hashing has found many important applications.
 In particular, it has become a standard technique for
storing a symbol table—a table of a computer program’s
symbols generated during compilation.
 Hashing is quite handy for such AI applications as checking
whether positions generated by a chess-playing computer
program have already been considered.
 With some modifications, it has also proved to be useful
for storing very large dictionaries on disks; this variation
of hashing is called extendible hashing.
 Since disk access is expensive compared with probes
performed in the main memory, it is preferable to make
many more probes than disk accesses.
 Accordingly, a location computed by a hash function in
extendible hashing indicates a disk address of a bucket
that can hold up to b keys.
 When a key’s bucket is identified, all its keys are read
into main memory and then searched for the key in
question.
S.No.  Separate Chaining                              Open Addressing
1.     Chaining is simpler to implement.              Open addressing requires more
                                                      computation.
2.     The hash table never fills up; more            The table may become full.
       elements can always be added to a chain.
3.     Chaining is less sensitive to the hash         Open addressing requires extra care
       function or the load factor.                   to avoid clustering and a high load
                                                      factor.
4.     Chaining is mostly used when it is unknown     Open addressing is used when the
       how many and how frequently keys may be        frequency and number of keys are
       inserted or deleted.                           known.
5.     Cache performance of chaining is not good,     Open addressing provides better
       as keys are stored in linked lists.            cache performance, as everything is
                                                      stored in the same table.
6.     Space is wasted (some parts of the hash        A slot can be used even if no input
       table are never used).                         maps to it.
7.     Chaining uses extra space for links.           No links are needed in open
                                                      addressing.
