
DESIGN AND ANALYSIS OF ALGORITHMS (III - CSE) – I SEM

UNIT V

NP-Hard and NP-Complete problems: Basic concepts, Cook’s Theorem.


String Matching: Introduction, String Matching-Meaning and Application, Naïve String
Matching Algorithm, Rabin-Karp Algorithm, Knuth-Morris-Pratt Automata, Tries, Suffix Tree.

P, NP and NP Complete Problems


A problem can be classified into one of two groups. The first group
consists of the problems that can be solved in polynomial time.

For example: searching for an element in a sorted list, O(log n); sorting of elements,
O(n log n).
The second group consists of problems that can be solved in non-deterministic
polynomial time, but for which no deterministic polynomial time algorithm is known.
For example: Knapsack problem, O(2^(n/2)), and Travelling Salesperson
problem, O(n^2 2^n).

• Any problem for which the answer is either yes or no is called a decision problem. The
algorithm for a decision problem is called a decision algorithm.
• Any problem that involves the identification of an optimal cost (minimum or
maximum) is called an optimization problem. The algorithm for an optimization
problem is called an optimization algorithm.
• Definition of P - Problems that can be solved in polynomial time. (“P” stands for
polynomial).
Examples - Searching for a key element, Sorting of elements, All pairs shortest path.
• Definition of NP - It stands for “non-deterministic polynomial time”. Note
that NP does not stand for “non-polynomial”.
Examples - Travelling salesperson problem, Graph coloring problem, Knapsack
problem, Hamiltonian circuit problem.
• Two further categories are defined relative to NP: NP-complete and NP-hard
problems.

UNIT-5 [1] K.PRASANTHI


• If an NP-hard problem can be solved in polynomial time, then all NP-complete problems
can also be solved in polynomial time.

• All NP-complete problems are NP-hard, but not all NP-hard problems are NP-complete.

• The NP class problems are the decision problems that can be solved by non-deterministic
polynomial algorithms.

Deterministic and non-deterministic algorithms:


Deterministic: An algorithm in which the result of every operation is uniquely defined is called
a deterministic algorithm.
Non-Deterministic: An algorithm in which the result of an operation is not uniquely defined but
is limited to a specified set of possibilities is called a non-deterministic algorithm.

The non-deterministic algorithms use the following functions:


1. Choice: Arbitrarily chooses one of the elements of a given set.
2. Failure: Indicates an unsuccessful completion.
3. Success: Indicates a successful completion.

A non-deterministic algorithm terminates unsuccessfully if and only if there exists no set of
choices leading to a success signal. Whenever there is a set of choices that leads to a successful
completion, one such set of choices is selected and the algorithm terminates successfully.



If successful completion is not possible, the time required is taken to be O(1). If
successful completion is possible, the time required is the minimum number of steps needed to
reach a successful completion, expressed as a function of the input size n.

The problems that can be solved in polynomial time are called tractable problems, and the problems
that require super-polynomial time are called intractable problems. Problems with deterministic
polynomial time algorithms are tractable. Problems for which only non-deterministic polynomial
time algorithms are known are regarded as intractable in practice, although this has not been proved.

Satisfiability Problem:
A Boolean formula is constructed from the following elements:
1. Literals: a literal is either a variable or the negation of a variable.
2. Operators that connect the literals: ∨, ∧, ⇒, ⇔
3. Parentheses
The satisfiability problem is to determine whether a Boolean formula is true for
some assignment of truth values to the variables. In general, formulas are
expressed in Conjunctive Normal Form (CNF).



A Boolean formula is in 3CNF if each clause has exactly 3 distinct literals.
Example: a non-deterministic algorithm that terminates successfully iff a given formula
E(x1, x2, x3) is satisfiable.
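
A non-deterministic algorithm cannot be run directly on an ordinary computer, but it can be simulated deterministically by trying every outcome of Choice in turn. The sketch below does this for a 3-variable formula. The formula E used here is a hypothetical example chosen for illustration, not one taken from these notes, and the function names are assumptions.

```python
from itertools import product

def E(x1, x2, x3):
    # Hypothetical 3CNF formula, chosen only for illustration:
    # (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR NOT x3)
    return (x1 or not x2 or x3) and (not x1 or x2 or not x3)

def satisfiable(formula, n):
    """Simulate Choice deterministically by trying all 2^n truth assignments."""
    for assignment in product([False, True], repeat=n):
        if formula(*assignment):
            return assignment      # corresponds to Success
    return None                    # corresponds to Failure

print(satisfiable(E, 3))
```

The simulation takes exponential time in the worst case, whereas the non-deterministic algorithm is charged only for one successful sequence of choices.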

Types of Problems:
 Tractable
 Intractable
 Decision
 Optimization
Tractable: Problems that can be solved in a reasonable (polynomial) time.
Intractable: Problems that, as their inputs grow large, cannot be solved in
reasonable time.

Decision Problem:
• Any problem for which the answer is either yes or no is called a decision problem. The
algorithm for a decision problem is called a decision algorithm.
• Example: Sum of subsets problem.



Optimization Problem:
• Any problem that involves the identification of an optimal value
(maximum or minimum) is called an optimization problem.
• Example: Knapsack problem, travelling salesperson problem.

Reducibility:
A problem Q1 can be reduced to Q2 if any instance of Q1 can be easily rephrased as an
instance of Q2 such that the solution to the instance of Q2 provides a solution to the
instance of Q1.
Let L1 and L2 be two problems. L1 reduces to L2 iff there is a way to solve L1 by
a deterministic polynomial time algorithm using a deterministic algorithm that solves L2
in polynomial time. This is denoted by L1 ∝ L2.
If we have a polynomial time algorithm for L2, then we can solve L1 in polynomial time.
Two problems L1 and L2 are said to be polynomially equivalent iff L1 ∝ L2 and L2 ∝ L1.

Example: Let P1 be the problem of selection and P2 be the problem of sorting. Let the
input have n numbers. If the numbers are sorted in array A[ ], the ith smallest element of
the input can be obtained as A[i]. Thus P1 reduces to P2 with only O(1) additional time.
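
The selection-to-sorting reduction can be sketched in a few lines of Python. The function name select is an assumption made for this illustration; the built-in sorted plays the role of the algorithm for P2.

```python
def select(numbers, i):
    """Return the i-th smallest element (1-based) by reducing selection to sorting."""
    a = sorted(numbers)    # solve the sorting problem P2
    return a[i - 1]        # O(1) lookup then answers the selection problem P1

print(select([7, 2, 9, 4], 2))
```

All the real work is done by the sorting step; the reduction itself only adds a constant-time array access.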

Class P:
P: the class of decision problems that are solvable in O(p(n)) time, where p(n) is a polynomial of
problem’s input size n
Examples:
• searching
• element uniqueness
• graph connectivity
• graph acyclicity
• primality testing
Class NP
NP (nondeterministic polynomial): the class of decision problems whose proposed solutions can be
verified in polynomial time, i.e., problems solvable by a nondeterministic polynomial algorithm.
A nondeterministic polynomial algorithm is an abstract two-stage procedure that:
• generates a random string purported to solve the problem
• checks whether this solution is correct in polynomial time
By definition, it solves the problem if it is capable of generating and verifying a solution on one
of its tries.
Example: CNF satisfiability
Problem: Is a Boolean expression in conjunctive normal form (CNF) satisfiable, i.e., are there
values of its variables that make it true? This problem is in NP.
Nondeterministic algorithm:
• Guess truth assignment
• Substitute the values into the CNF formula to see if it evaluates to true
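
The second (checking) stage is what places CNF satisfiability in NP: given a guessed assignment, verification takes time linear in the size of the formula. Below is a minimal sketch, assuming a clause is encoded as a list of non-zero integers where k stands for variable x_k and -k for its negation. This encoding and the name verify are assumptions made for the illustration.

```python
def verify(cnf, assignment):
    """Check a guessed truth assignment against a CNF formula.

    cnf: list of clauses, each a list of ints (k means x_k, -k means NOT x_k).
    assignment: dict mapping variable index to True/False.
    """
    # Every clause must contain at least one literal that evaluates to True.
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
print(verify(cnf, {1: True, 2: True, 3: False}))
print(verify(cnf, {1: True, 2: False, 3: True}))
```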

What problems are in NP?


• Hamiltonian circuit existence
• Partition problem: Is it possible to partition a set of n integers into two disjoint subsets with
the same sum?
• Decision versions of TSP, knapsack problem, graph coloring, and many other combinatorial
optimization problems. (A few combinatorial optimization problems, such as MST and shortest
paths, are known to be solvable in polynomial time.)
• All the problems in P can also be solved in this manner (but no guessing is necessary), so we
have:
P ⊆ NP
• Big question: P = NP ?



NP HARD AND NP COMPLETE
Polynomial time algorithms
Algorithms whose running times are bounded by polynomials of small degree are called
polynomial time algorithms.
Example: Linear search, quick sort, all pairs shortest path etc.
Non-polynomial time algorithms
Algorithms whose running times are not bounded by any polynomial are called non-polynomial
time algorithms.
Examples: The best known algorithms for the travelling salesman problem, 0/1 knapsack problem etc.
No polynomial time algorithms are known for these problems, and the computing times of
non-polynomial algorithms grow faster than any polynomial. A problem that can be solved in
polynomial time in one reasonable model of computation can also be solved in polynomial time
in another.
NP-Hard and NP-Complete Problem:
Let P denote the set of all decision problems solvable by deterministic algorithms in polynomial
time. NP denotes the set of decision problems solvable by nondeterministic algorithms in
polynomial time. Since deterministic algorithms are a special case of nondeterministic
algorithms, P ⊆ NP. Two further classes are defined in terms of reducibility. They are
1. NP-Hard and
2. NP-Complete
NP-Hard: A problem L is NP-Hard iff satisfiability reduces to L. Because every problem in NP
reduces to satisfiability, this means that every problem in NP reduces to L.
Example: Halting Problem, Flow shop scheduling problem

NP-Complete: A problem L is NP-Complete iff L is NP-Hard and L belongs to NP
(nondeterministic polynomial).
A problem that is NP-Complete has the property that it can be solved in polynomial time iff all
other NP-Complete problems can also be solved in polynomial time (in which case NP = P).



If an NP-hard problem can be solved in polynomial time, then all NP-complete problems can be
solved in polynomial time. All NP-complete problems are NP-hard, but some NP-hard problems
are not NP-complete. The halting problem, for instance, is NP-hard but undecidable, so it is
not in NP.

Normally the decision problems are NP-complete but the optimization problems are NP-hard.
However, if problem L1 is a decision problem and L2 is an optimization problem, then it is
possible that L1 ∝ L2.
Example: The knapsack decision problem can be reduced to the knapsack optimization problem.

Relationship between P, NP, NP-hard, NP-Complete


Let P and NP be the sets of all decision problems that are solvable in polynomial time by
deterministic and non-deterministic algorithms respectively, and let NP-hard and NP-complete
be the sets of NP-hard and NP-complete problems. Then P ⊆ NP, the NP-complete problems are
exactly the problems that are both in NP and NP-hard, and, assuming P ≠ NP, no NP-complete
problem lies in P. This relationship is usually drawn as a Venn diagram.



Cook’s Theorem – Satisfiability is NP-complete
• Cook's theorem states that satisfiability is in P if and only if P = NP. We
now outline the proof of this important theorem.
• Satisfiability is in NP. Hence if P = NP, then satisfiability is in P. It remains to be shown
that if satisfiability is in P, then P = NP.

The Boolean satisfiability problem is in NP. This is because a non-deterministic
algorithm can guess an assignment of truth values to the variables. This algorithm can
also determine the value of the expression for the corresponding assignment and accept
if the entire expression is true.
For the converse, the non-deterministic algorithm is modelled as a machine composed of:
• An input tape, which is divided into a finite number of cells.
• A read/write head, which reads one symbol at a time from the tape.
• Each cell contains only one symbol at a time.
• Computation is performed in a number of states.
• The algorithm terminates when it reaches the accept state.
The conjunction clauses for the Boolean expression are given in the following table.



Note that H denotes head, Q denotes states and T denotes tape. The disjunction
clause for this algorithm can be written as :

String Matching

A string matching algorithm is also called a "string searching algorithm." This is a vital class
of string algorithms, described as "the method to find a place where one or several
strings are found within a larger string."

Given a text array T[1..n] of n characters and a pattern array P[1..m] of m characters,
the problem is to find an integer s, called a valid shift, where 0 ≤ s ≤ n-m and
T[s+1..s+m] = P[1..m]. In other words, the problem is to find whether P occurs in T, i.e.,
whether P is a substring of T. The items of P and T are characters drawn from some finite
alphabet such as {0, 1} or {A, B, ..., Z, a, b, ..., z}.

Given a string T[1..n], its substrings are represented as T[i..j] for some 1 ≤ i ≤ j ≤ n:
the string formed by the characters in T from index i to index j, inclusive. This means that a
string is a substring of itself (take i = 1 and j = n).

A proper substring of string T[1..n] is T[i..j] for some 1 ≤ i ≤ j ≤ n where
either i > 1 or j < n.

UNIT-5 [10] Dr.R.Satheeskumar, Professor


Applications of String Matching Algorithms:

Plagiarism Detection:
The documents to be compared are decomposed into string tokens and compared using
string matching algorithms. Thus, these algorithms are used to detect similarities between
them and declare if the work is plagiarized or original.

Bioinformatics and DNA Sequencing: Bioinformatics involves applying information
technology and computer science to problems involving genetic sequences to find DNA
patterns. String matching algorithms and DNA analysis are used together for finding
the occurrences of the pattern set.

Digital Forensics: String matching algorithms are used to locate specific text strings of
interest in the digital forensic text, which are useful for the investigation.

Spelling Checker: Trie is built based on a predefined set of patterns. Then, this trie is used for
string matching. The text is taken as input, and if any such pattern occurs, it is shown by
reaching the acceptance state.



Spam filters: Spam filters use string matching to discard the spam. For example, to categorize
an email as spam or not, suspected spam keywords are searched in the content of the email by
string matching algorithms. Hence, the content is classified as spam or not.

Search engines or content search in large databases: To categorize and organize data
efficiently, string matching algorithms are used. Categorization is done based on the search
keywords. Thus, string matching algorithms make it easier for one to find the information they
are searching for.



Algorithms used for String Matching:

Different methods are used for finding a string:

1. The Naive String Matching Algorithm
2. The Rabin-Karp Algorithm
3. Finite Automata
4. The Knuth-Morris-Pratt Algorithm

The naive string-matching algorithm:

The naïve approach tests all the possible placements of pattern P[1..m] relative to text
T[1..n]. We try the shifts s = 0, 1, ..., n-m successively, and for each shift s we compare
T[s+1..s+m] to P[1..m].

The naïve algorithm finds all valid shifts using a loop that checks the condition
P[1..m] = T[s+1..s+m] for each of the n - m + 1 possible values of s.

Example:

The operation of the naive string matcher for the pattern P = aab and the text T = acaabc. We
can imagine the pattern P as a template that we slide next to the text. (a)–(d) The four successive
alignments tried by the naive string matcher. In each part, vertical lines connect corresponding
regions found to match (shown shaded), and a jagged line connects the first mismatched
character found, if any. The algorithm finds one occurrence of the pattern, at shift s = 2, shown
in part (c).
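
The naive matcher described above can be sketched in Python as follows. Python strings are 0-indexed, so the slice text[s:s+m] plays the role of T[s+1..s+m]; the function name naive_match is an assumption made for this illustration.

```python
def naive_match(text, pattern):
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # try s = 0, 1, ..., n - m
        if text[s:s + m] == pattern:      # compare the window with the pattern
            shifts.append(s)
    return shifts

print(naive_match("acaabc", "aab"))   # the example above: one match at shift 2
```

Each of the n - m + 1 shifts may cost up to m character comparisons, which gives the O((n - m + 1) m) worst case.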



The Rabin-Karp Algorithm
• The Rabin-Karp string matching algorithm calculates a hash value for the pattern, as well
as for each M-character subsequence of the text to be compared.

• If the hash values are unequal, the algorithm moves on and determines the hash value for
the next M-character sequence.

• If the hash values are equal, the algorithm compares the pattern and the M-character
sequence character by character.

• In this way, there is only one hash comparison per text subsequence, and character matching
is only required when the hash values match.

Algorithm:
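
A minimal Python sketch of the steps above, using a rolling polynomial hash. The radix d = 256 and the prime modulus q = 101 are illustrative defaults chosen here, not values given in these notes, and the function name rabin_karp is likewise an assumption.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all valid shifts of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # weight of the leading character, mod q
    p = t = 0
    for i in range(m):            # hash of the pattern and of the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Compare characters only when the hashes agree (rules out spurious hits).
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll the hash: drop text[s], append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts

print(rabin_karp("31415926535", "26"))
```

Updating the hash takes constant time per shift, which is what makes the expected running time close to linear when spurious hits are rare.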

Example:

For string matching working modulo q = 11, how many spurious hits does the Rabin-Karp
matcher encounter in the text T = 31415926535 when looking for the pattern P = 26?

T = 31415926535.......
P = 26
Here q = 11 (T also happens to have length 11)
And P mod q = 26 mod 11 = 4
Now find the exact match of P mod q...
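
The remaining steps of this example can be checked mechanically. The sketch below treats each 2-digit window of T as a number, reduces it modulo q, and separates true matches from spurious hits. The function name classify_hits is an assumption made for this illustration.

```python
def classify_hits(text, pattern, q):
    """Split the shifts whose window value equals pattern mod q into
    valid matches and spurious hits."""
    m = len(pattern)
    target = int(pattern) % q
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == target:       # hash hit
            (valid if window == pattern else spurious).append(s)
    return valid, spurious

valid, spurious = classify_hits("31415926535", "26", 11)
print(valid, spurious)
```

With q = 11 the windows 15, 59 and 92 (shifts 3, 4 and 5, counting from 0) all reduce to 4, so there are 3 spurious hits, and the single valid match is the window 26 at shift 6.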

Complexity:
The running time of RABIN-KARP-MATCHER in the worst case is O((n-m+1) m), but it
has a good average case running time. If the expected number of valid shifts is small, O(1), and
the prime q is chosen to be quite large, then the Rabin-Karp algorithm can be expected to run in
time O(n+m) plus the time required to process spurious hits.

String Matching with Finite Automata:


The string-matching automaton is a very useful tool which is used in string matching algorithm.
It examines every character in the text exactly once and reports all the valid shifts in O (n) time.
The goal of string matching is to find the location of specific text pattern within the larger body
of text (a sentence, a paragraph, a book, etc.)
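
A minimal Python sketch of a string-matching automaton follows. State k means that the last k characters read equal the first k characters of the pattern, and the transition table is built by brute force for clarity rather than speed. The names build_transitions and automaton_match, and the choice to derive the alphabet from the inputs, are assumptions made for this illustration.

```python
def build_transitions(pattern, alphabet):
    """delta[state][ch] = length of the longest prefix of pattern
    that is a suffix of pattern[:state] + ch."""
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(m, state + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            delta[state][ch] = k
    return delta

def automaton_match(text, pattern):
    """Scan text once, reporting every valid shift."""
    m = len(pattern)
    if m == 0:
        return []
    delta = build_transitions(pattern, set(text) | set(pattern))
    state, shifts = 0, []
    for i, ch in enumerate(text):
        state = delta[state][ch]          # exactly one transition per character
        if state == m:                    # accepting state: a full match ends at i
            shifts.append(i - m + 1)
    return shifts

print(automaton_match("acaabc", "aab"))
```

The scan itself is O(n); the cost of this approach lies in the preprocessing, which depends on both the pattern length and the alphabet size.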

The Knuth-Morris-Pratt (KMP)Algorithm:

Knuth, Morris and Pratt introduced a linear time algorithm for the string matching problem. A
matching time of O(n) is achieved by avoiding comparisons with elements of the string 'S' that
have previously been involved in a comparison with some element of the pattern 'p' to be
matched, i.e., backtracking on the string 'S' never occurs.

Components of KMP Algorithm:

1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates knowledge about
how the pattern matches against shifts of itself. This information can be used to avoid
useless shifts of the pattern 'p.' In other words, this enables avoiding backtracking on the string 'S.'

2. The KMP Matcher: With string 'S,' pattern 'p' and prefix function 'Π' as inputs, it finds the
occurrences of 'p' in 'S' and returns the number of shifts of 'p' after which occurrences are found.

The Prefix Function (Π):

The following pseudocode computes the prefix function, Π:
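
A minimal Python sketch of the prefix-function computation. It uses 0-based indexing, so pi[q] is the length of the longest proper prefix of pattern[:q+1] that is also a suffix of it, and the step numbers cited in the running time analysis refer to the original pseudocode rather than to this sketch. The pattern "ababaca" is an illustrative 7-character pattern chosen here, not necessarily the one used in the worked example.

```python
def compute_prefix(pattern):
    """Return the prefix function (failure table) of pattern as a list."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1                          # extend the matched prefix
        pi[q] = k
    return pi

print(compute_prefix("ababaca"))
```

For "ababaca" this yields the values 0, 0, 1, 2, 3, 0, 1.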

Running Time Analysis:

In the above pseudo code for calculating the prefix function, the for loop from step 4 to step 10
runs 'm' times. Step1 to Step3 take constant time. Hence the running time of computing prefix
function is O (m).

Example: Compute Π for the pattern 'p' below:



Solution:

Initially: m = length [p] = 7


Π [1] = 0
k=0



The KMP Matcher:

The KMP Matcher, with the pattern 'p,' the string 'S' and the prefix function 'Π' as input, finds a
match of p in S. The following pseudocode computes the matching component of the KMP algorithm:
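
A minimal, self-contained Python sketch of the matcher. It recomputes the prefix function internally so that it can run on its own, uses 0-based shifts, and reports every occurrence rather than stopping at the first. The name kmp_search is an assumption made for this illustration.

```python
def kmp_search(text, pattern):
    """Return all shifts at which pattern occurs in text, never moving
    backwards in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Prefix function of the pattern (0-based).
    pi, k = [0] * m, 0
    for j in range(1, m):
        while k > 0 and pattern[k] != pattern[j]:
            k = pi[k - 1]
        if pattern[k] == pattern[j]:
            k += 1
        pi[j] = k
    shifts, q = [], 0                       # q = number of pattern characters matched
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                   # shift the pattern using the prefix function
        if pattern[q] == text[i]:
            q += 1
        if q == m:                          # full match ending at position i
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # continue looking for further matches
    return shifts

print(kmp_search("abababa", "aba"))
```

The index i into the text only ever increases, which is the "no backtracking" property described earlier.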

Running Time Analysis:

The for loop beginning in step 5 runs 'n' times, i.e., once for each character of the string 'S.'
Since steps 1 to 4 take constant time, the running time is dominated by this for loop. Thus the
running time of the matching function is O(n).

Let us execute the KMP algorithm to find whether 'P' occurs in 'T.'

For 'p', the prefix function Π was computed previously and is as follows:



Initially: n = size of T = 15
m = size of P = 7

Pattern 'P' has been found to occur completely in the string 'T.' The total number of shifts that
took place for the match to be found is i-m = 13 - 7 = 6 shifts.



Tries:
• A trie is a tree-based data structure for storing strings in order to make pattern matching
faster.

• Tries can be used to perform prefix queries for information retrieval. Prefix queries
search for the longest prefix of a given string X that matches a prefix of some string in
the trie.

• A trie supports the following operations on a set S of strings:

insert(X): Insert the string X into S

Input: String Output: None

remove(X): Remove string X from S

Input: String Output: None

prefixes(X): Return all the strings in S that share the longest matching prefix of X

Input: String Output: Enumeration of strings

For example, consider the standard trie over the alphabet Σ = {a, b} for the set {aabab, abaab,
babbb, bbaaa, bbab}.

• An internal node can have 1 to d children, where d is the size of the alphabet. Our example
is essentially a binary tree.

• A path from the root of T to an internal node v at depth i corresponds to an i-character
prefix of a string of S.

• We can implement a trie with an ordered tree by storing the character associated with an
edge at the child node below it.
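
The three operations can be sketched in Python with a dictionary of children at each node. The class and attribute names are assumptions made for this illustration, remove is simplified (it unmarks the end node without pruning empty branches), and prefixes follows one reading of the operation above: it walks as far along X as the trie allows and returns every stored string below that point.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # edge character -> child node
        self.is_end = False    # True if a string of S ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, x):
        node = self.root
        for ch in x:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def remove(self, x):
        node = self.root
        for ch in x:
            node = node.children.get(ch)
            if node is None:
                return             # x is not in the trie
        node.is_end = False        # simplification: branches are not pruned

    def prefixes(self, x):
        node, matched = self.root, ""
        for ch in x:               # follow x as far as possible
            if ch not in node.children:
                break
            node = node.children[ch]
            matched += ch
        if x and not matched:
            return []
        results, stack = [], [(node, matched)]
        while stack:               # collect every stored string below that node
            cur, word = stack.pop()
            if cur.is_end:
                results.append(word)
            for ch, child in cur.children.items():
                stack.append((child, word + ch))
        return sorted(results)

trie = Trie()
for word in ["aabab", "abaab", "babbb", "bbaaa", "bbab"]:
    trie.insert(word)
print(trie.prefixes("bba"))
```

Insertion and lookup both take time proportional to the length of the string, independent of how many strings are stored.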

Compressed Tries:

• A compressed trie is like a standard trie but makes sure that each internal node has a degree
of at least 2. Chains of single-child nodes are compressed into a single edge.

• A critical node is a node v such that v is labeled with a string from S, v has at least 2
children, or v is the root.

• To convert a standard trie to a compressed trie, we replace with an edge (v0, vk) each chain of
nodes (v0, v1, ..., vk) for k ≥ 2 such that

- v0 and vk are critical but vi is not critical for 0<i<k

- each vi (0<i<k) has only one child



Insertion & Deletion:



Suffix Trees:

Suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees help in solving a
lot of string related problems like pattern matching, finding distinct substrings in a given string,
finding longest palindrome etc. In this tutorial following points will be covered:

 Compressed Trie
 Suffix Tree Construction (Brute Force)
 Brief description of Ukkonen's Algorithm

Before going to suffix tree, let's first try to understand what a compressed trie is.



And a compressed trie for the given set of strings will look like:

As is clear from the images shown above, in a compressed trie, edges that lead to a
node having a single child are combined to form a single edge, and their edge labels are
concatenated. This means that each internal node in a compressed trie has at least two children.
Also, it has at most N leaves, where N is the number of strings inserted in the compressed trie.
Together, these two facts (each internal node has at least two children, and there are at most N
leaves) imply that there are at most 2N - 1 nodes in the trie. So the number of nodes in a
compressed trie is O(N), whereas the number of nodes in a normal trie grows with the total
length of the inserted strings.

So that is one reason to use compressed tries over normal tries.

Before going to the construction of suffix trees, there is one more thing that should be
understood: the implicit suffix tree. In an implicit suffix tree there are at most N leaves, while
in a normal one there should be exactly N leaves, where N is the number of suffixes. The reason
for at most N leaves is that one suffix may be a prefix of another suffix. The following example
will make it clear. Consider the string "banana".

Implicit Suffix Tree for the above string is shown in image below:



To avoid getting an implicit suffix tree, we append a special character that is not equal to any
other character of the string. Suppose we append $ to the given string, so that the new string
is "banana$". Its suffix tree then has one leaf for each of its 7 suffixes.
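
The effect of the terminator can be checked with a brute-force construction. The sketch below builds an uncompressed suffix trie (every edge carries a single character) by inserting each suffix in turn, which takes O(n^2) time and space. It is a simplification for illustration, not a compressed suffix tree and not Ukkonen's algorithm, and the function names are assumptions.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dictionaries."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(root, pattern):
    """pattern is a substring of s iff it labels a path starting at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

def count_leaves(node):
    """Count the leaves (nodes without children) below node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

implicit = build_suffix_trie("banana")
explicit = build_suffix_trie("banana$")
print(count_leaves(implicit), count_leaves(explicit))
print(contains(explicit, "nan"), contains(explicit, "nab"))
```

Without the terminator, the suffixes "a", "na" and "ana" end at internal nodes, so only 3 leaves exist; with "$" appended, each of the 7 suffixes ends at its own leaf.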
