Find Shortest Common Superstring Using Greedy Heuristic

PROBLEM DEFINITION :
Given a set of strings S1, ..., Sn, find the shortest string S which contains each Si as a substring.
PROBLEM DESCRIPTION :
The shortest superstring problem takes as input several strings, possibly of different
lengths, and finds the shortest string that contains all the input strings as
substrings. This is helpful in genome sequencing projects, since it allows researchers to
reconstruct entire coding regions from a collection of fragments. The shortest
common superstring also arises in other applications, including sparse matrix
compression. Suppose we have an (n x m) matrix in which most of the elements are zero.
We can partition each row into (m / k) runs of k elements each and construct the shortest
common superstring S' of these runs. We have now reduced the problem to storing the
superstring, plus an (n x (m / k)) array of pointers into the superstring denoting where
each of the runs starts. Accessing a particular element M[i,j] still takes constant time, but
there is a space saving when |S'| << mn.
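The constant-time access described above can be sketched in C as follows. This is only an illustration: the names CompressedMatrix, superstring, start and cm_get are assumptions, not part of this report, and the sketch assumes that k divides m and that the superstring and the pointer array have already been built.

```c
#include <stddef.h>

/* Hypothetical layout: an n x m matrix whose rows are cut into runs of k
   elements; every run is stored somewhere inside `superstring`. */
typedef struct {
    const int *superstring; /* common superstring S' of all runs */
    const size_t *start;    /* n * (m / k) offsets into superstring, row by row */
    size_t m;               /* number of columns */
    size_t k;               /* run length (assumed to divide m) */
} CompressedMatrix;

/* Return M[i][j] in constant time: locate the run, then the offset inside it. */
int cm_get(const CompressedMatrix *cm, size_t i, size_t j)
{
    size_t runs_per_row = cm->m / cm->k;
    size_t run = i * runs_per_row + j / cm->k;
    return cm->superstring[cm->start[run] + j % cm->k];
}
```

For a 2 x 4 matrix with rows (0 0 5 0) and (5 0 0 0) and k = 2, the runs (0 0) and (5 0) fit into the superstring (5 0 0), so three stored elements plus four offsets replace the eight matrix entries.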
INPUT DESCRIPTION :
Given a set of n strings, S = {S1,...,Sn}, we want to find the shortest string s that contains
every Si as a substring.
OUTPUT DESCRIPTION :
The output is the shortest common superstring found for the given set of
substrings, followed by the substrings themselves, each printed shifted to the right so
that it lines up with its position in the superstring.
ASSUMPTIONS :
We assume that no Si belonging to S is a substring of another Sj belonging to S. This problem is
NP-hard. No polynomial-time exact algorithm is known, so the running time of exact methods
grows exponentially with the input size and large instances cannot be solved exactly in a
practical amount of time.
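The assumption that no string is a substring of another can be checked before any of the methods below is run. A minimal sketch using the standard strstr function; the helper name has_contained_string is an assumption made for illustration.

```c
#include <string.h>

/* Return 1 if some string in strs[0..n-1] occurs inside a different string
   of the set (two identical strings also count), 0 otherwise. */
int has_contained_string(const char *strs[], int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (i != j && strstr(strs[j], strs[i]) != NULL)
                return 1;
    return 0;
}
```

If the function returns 1, the contained strings can simply be dropped from the input, since any superstring of the remaining strings already contains them.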
TECHNIQUES THAT CAN BE APPLIED:
1. GREEDY HEURISTIC METHOD :

The greedy heuristic is the standard approach to approximating the
shortest common superstring.
ALGORITHM:
Step 1 : Input the set of strings S = { S1, S2, ..., Sn }.
Step 2 : For every ordered pair of strings, find the maximum overlap (the longest suffix of
one string that is also a prefix of the other), using the brute-force algorithm or the
Knuth-Morris-Pratt algorithm.
Step 3 : Replace the pair of strings with the maximum overlap by their merged string. Repeat
steps 2 and 3 until only one string remains.
Step 4 : Output the superstring on one line, followed by the substrings, each shifted
appropriately to the right so that it lines up with its position in the superstring.
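Steps 1 to 3 can be sketched as below. This is a minimal illustration, not the program listed later in this report: the names overlap and greedy_superstring and the limits MAX_N and MAX_LEN are assumptions, the overlap is found by brute force, and ties between equally good pairs are broken by scan order.

```c
#include <string.h>

#define MAX_N 10     /* assumed limit on the number of strings */
#define MAX_LEN 128  /* assumed limit on the superstring length, including '\0' */

/* Longest suffix of a that is also a prefix of b (brute force). */
static int overlap(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int k = la < lb ? la : lb;
    for (; k > 0; k--)
        if (strncmp(a + la - k, b, (size_t)k) == 0)
            return k;
    return 0;
}

/* Greedy heuristic: repeatedly merge the pair with the largest overlap.
   Assumes n <= MAX_N, that no input is a substring of another, and that
   the result fits in MAX_LEN - 1 characters. Writes the result to out. */
void greedy_superstring(const char *in[], int n, char *out)
{
    char s[MAX_N][MAX_LEN];
    int i, j;
    for (i = 0; i < n; i++)
        strcpy(s[i], in[i]);
    while (n > 1) {
        int best = -1, bi = 0, bj = 1;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                if (i != j) {
                    int k = overlap(s[i], s[j]);
                    if (k > best) { best = k; bi = i; bj = j; }
                }
        /* Merge s[bj] onto the end of s[bi], skipping the shared part. */
        strcat(s[bi], s[bj] + best);
        /* Remove s[bj] by moving the last string into its slot. */
        if (bj != n - 1)
            strcpy(s[bj], s[n - 1]);
        n--;
    }
    strcpy(out, n == 1 ? s[0] : "");
}
```

For the inputs ABCDE, BCDEF, DEFGH, CDEFG the sketch produces ABCDEFGH. Because the heuristic is only an approximation, other inputs may yield a superstring that is longer than the optimum.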
2. USING TRAVELLING SALESMAN PROBLEM APPROACH :

The travelling salesman problem (TSP) is one of the best-known hard problems in
computer science. A salesperson must visit n cities, passing through each city only once,
beginning from one of the cities, which is considered the base or starting city, and
returning to it. The cost of transportation between each pair of cities is given. The
problem is to find the minimum-cost route, that is, the order of visiting the cities for
which the total cost is minimum.
To solve the shortest common superstring problem using TSP we perform the following operations:
1. Create an overlap graph G where vertex vi represents string Si.

2. Assign edge (vi,vj) a weight equal to the length of Si minus the overlap of Si with Sj
(the longest suffix of Si that is also a prefix of Sj).
Thus weight W(vi,vj) = 3 - 2 = 1 for Si = abc and Sj = bcd.
3. The minimum weight path visiting all the vertices defines the SCS. These edge
weights are not symmetric.
4. For example, W(v1,v2) = 5 - 2 = 3 for the two strings S1 = ABRAC and
S2 = ACADA.
5. Now the TSP is applied.
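The edge weights of the overlap graph defined in the list above can be computed as in the following sketch. The function names max_overlap and edge_weight are assumptions chosen for illustration; the overlap is found by brute force.

```c
#include <string.h>

/* Length of the longest suffix of si that is also a prefix of sj. */
int max_overlap(const char *si, const char *sj)
{
    int li = (int)strlen(si), lj = (int)strlen(sj);
    int k = li < lj ? li : lj;
    for (; k > 0; k--)
        if (strncmp(si + li - k, sj, (size_t)k) == 0)
            return k;
    return 0;
}

/* Weight of edge (vi, vj) in the overlap graph: |Si| minus the overlap. */
int edge_weight(const char *si, const char *sj)
{
    return (int)strlen(si) - max_overlap(si, sj);
}
```

The asymmetry is easy to see with the strings used in this report: the weight from ABRAC to ACADA is 3, while the weight from ACADA to ABRAC is 4, because only a single character overlaps in that direction.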
ALGORITHM TSP:
Step 1 : First, enumerate all (n - 1)! possible solutions, where n is the number of input strings.
Step 2 : Next, compute the cost of every one of these
(n - 1)! solutions.
Step 3 : Finally, keep the one with the minimum cost.
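The exhaustive search in the steps above can be sketched as follows. The sketch is an assumption-laden illustration rather than the method used by the program in this report: it enumerates orderings of the strings (open paths) instead of closed tours, so it examines up to n! orderings, it prunes partial orderings that are already no better than the best one found, and all names (shortest_superstring_length, search, suffix_prefix, MAX_N) are hypothetical.

```c
#include <limits.h>
#include <string.h>

#define MAX_N 10  /* assumed limit; n! orderings make larger inputs impractical */

static int ov[MAX_N][MAX_N]; /* ov[i][j]: suffix of string i matching prefix of j */
static int len[MAX_N];
static int used[MAX_N];
static int best_len;

static int suffix_prefix(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int k = la < lb ? la : lb;
    for (; k > 0; k--)
        if (strncmp(a + la - k, b, (size_t)k) == 0)
            return k;
    return 0;
}

/* Extend a partial ordering that ends at string `last`, contains `count`
   strings and has superstring length `cur`. */
static void search(int n, int count, int last, int cur)
{
    int j;
    if (cur >= best_len)
        return;                 /* cannot improve on the best ordering */
    if (count == n) {
        best_len = cur;
        return;
    }
    for (j = 0; j < n; j++)
        if (!used[j]) {
            used[j] = 1;
            search(n, count + 1, j, cur + len[j] - ov[last][j]);
            used[j] = 0;
        }
}

/* Length of the shortest superstring obtained by overlapping the strings
   in the best possible order. Assumes n <= MAX_N and that no string is a
   substring of another. */
int shortest_superstring_length(const char *strs[], int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        len[i] = (int)strlen(strs[i]);
        used[i] = 0;
        for (j = 0; j < n; j++)
            ov[i][j] = (i == j) ? 0 : suffix_prefix(strs[i], strs[j]);
    }
    best_len = INT_MAX;
    for (i = 0; i < n; i++) {
        used[i] = 1;
        search(n, 1, i, len[i]);
        used[i] = 0;
    }
    return n > 0 ? best_len : 0;
}
```

For the five strings ABRAC, ACADA, ADABR, DABRA, RACAD the function returns 10, and for ABCDE, BCDEF, DEFGH, CDEFG it returns 8.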
3. THE SET COVER ALGORITHM APPROACH :
Using the set cover method, we obtain a 2Hn factor approximation algorithm, where
Hn = 1 + 1/2 + ... + 1/n.
Given the input S = {S1,...,Sn}, we construct a string rijk for every pair Si, Sj
belonging to S and every k > 0 such that the last k characters of Si match the first k
characters of Sj; rijk is Si and Sj merged with an overlap of k characters. Let R be the
set of all such strings. For a string v, define sub(v) = {s belongs to S | s is
a substring of v}. The set cover instance C has S as its universe and contains the set
sub(v), with cost equal to the length of v, for every v belonging to S U R.
ALGORITHM (SET COVER):
Step 1 : Use the greedy set cover algorithm to find a cover for the instance C.
Step 2 : From the sets selected by the algorithm, recover the strings v1, ..., vk such that
sub(v1) U ... U sub(vk) is the cover for C.
Step 3 : Concatenating the strings v1, ..., vk gives a superstring whose length is within a
factor of 2Hn of the shortest superstring.
4. KRUSKAL’S MAXIMUM SPANNING TREE ALGORITHM :

We can also approach the problem by finding a maximum spanning tree with
Kruskal's algorithm, after creating a graph G from the given set of strings. T represents the
tree.
ALGORITHM :
One method for computing the maximum weight spanning tree of a network G,
due to Kruskal, can be summarized as follows.
Step 1 : Sort the edges of G into decreasing order by weight. Let T be the set of edges
comprising the maximum weight spanning tree. Set T to the empty set.
Step 2 : Add the first edge to T.
Step 3 : Add the next edge to T if and only if it does not form a cycle in T. If there are no
remaining edges, exit and report G to be disconnected.
Step 4 : If T has n−1 edges (where n is the number of vertices in G), stop and output T.
Otherwise go to step 3.
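The four steps above can be sketched with a union-find structure for the cycle test. This is a generic illustration for an undirected graph, not the routine used by the program in this report; the type Edge, the function names and the limit of 64 vertices are assumptions.

```c
#include <stdlib.h>

typedef struct { int u, v, weight; } Edge;

static int parent[64];  /* assumed upper bound on the number of vertices */

/* Representative of the component containing x (with path halving). */
static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* qsort comparator: larger weights first. */
static int by_weight_desc(const void *a, const void *b)
{
    const Edge *ea = (const Edge *)a, *eb = (const Edge *)b;
    return (eb->weight > ea->weight) - (eb->weight < ea->weight);
}

/* Kruskal's algorithm for a maximum weight spanning tree.
   Sorts edges[0..e-1] in place, stores the chosen edges in tree[] and
   returns how many were chosen: n - 1 if G is connected, fewer otherwise.
   Vertices are numbered 0..n-1 with n <= 64. */
int max_spanning_tree(Edge *edges, int e, int n, Edge *tree)
{
    int i, count = 0;
    for (i = 0; i < n; i++)
        parent[i] = i;
    qsort(edges, (size_t)e, sizeof edges[0], by_weight_desc);
    for (i = 0; i < e && count < n - 1; i++) {
        int ru = find_root(edges[i].u);
        int rv = find_root(edges[i].v);
        if (ru != rv) {          /* the edge joins two components: no cycle */
            parent[ru] = rv;
            tree[count++] = edges[i];
        }
    }
    return count;
}
```

A return value smaller than n - 1 corresponds to the "G is disconnected" exit in step 3.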
OUR LOGICAL APPROACH :
We begin by reading ‘n’ substrings from the user and storing them in
a 2D array. The user may enter a maximum of 10 substrings, which is the boundary
condition of the program. The program is implemented using several structures.
First, once the inputs have been read, we use the structure ‘matrix’, which keeps a
record of the number of common characters between each pair of substrings. We also use a
structure ‘edgelist’ to represent each substring as a vertex and the number of common
characters between a pair of substrings as the weight of the edge joining them. In this
way, the whole input is represented as a graph. We then sort ‘edgelist’ into non-increasing
order of weight and use Kruskal’s algorithm to form the maximal spanning tree, which is
stored in the structure ‘sequence’.
Finally, a function is invoked which rearranges the edges of the tree so that the
shortest common superstring can be formed.
DATA STRUCTURES USED :

We solve this problem using only the array data structure. We use a 2D
array ‘IS’ to store the input substrings and a 1D array ‘OS’ to store the shortest
common superstring, which is the required output. We have chosen the array
as the primary data structure because strings are most naturally represented as
character arrays. String manipulation also becomes easier, since traversing an array
by index adds little overhead.
PROGRAM IMPLEMENTATION USING C-CODE :
/*Inclusion of Header Files*/

#include<stdio.h>
#include<string.h>
/*declaration of global variables*/

char IS[11][11];/*input strings, stored from index 1: up to 10 strings of up to 10 characters*/
char OS[50];/*output string*/
int total;/*total no of substrings*/
int length;/*length of a substring*/
int edge_count=0;/*no. of matches found*/
int sequence_count=0;/*no. of matches actually considered in formation of the output string*/
/*declaration of global structures*/

struct string_matrix/*structure which keeps the record of common character between each pair of substring*/
{
int value[11];
}matrix[11];
struct edgelist/*structure which stores the non-zero entries of the matrix*/
{
int u,v,weight;
}edgelist[91];/*10 strings give at most 10*9 = 90 entries, stored from index 1*/
struct sequence/*structure which holds the maximal spanning tree*/
{
int u,v,weight;
}sequence[11];
struct dummy_sequence/*scratch copy used while reordering 'sequence'*/
{
int u,v,weight;
}dseq[11];
/*declaration of global function*/

void display_sub_strings(void);
void create_matrix(void);
void display_matrix(void);
void create_edgelist(void);
void display_edgelist(void);
void arrange_edgelist(void);
void create_sequence(void);
int check_cycle(int);
void arrange_sequence(void);
void display_sequence(void);
void create_super_string(void);
void display_super_string(void);
int main(void)
{
int i,j;
printf("ENTER THE TOTAL NO. OF SUBSTRINGS : ");
if(scanf("%d",&total)!=1 || total<2 || total>10)
{
printf("The number of substrings must be between 2 and 10\n");
return 1;
}
printf("ENTER %d SUBSTRINGS (each terminated by an enter)\n"
"EACH SUBSTRING MUST BE OF SAME LENGTH :\n",total);
/*read at most 10 characters per substring so that IS[i] cannot overflow*/
for(i=1;i<=total;i++)
if(scanf("%10s",IS[i])!=1)
return 1;
length=(int)strlen(IS[1]);
/*every substring must have the same length as the first one*/
for(i=2;i<=total;i++)
if((int)strlen(IS[i])!=length)
{
printf("All substrings must be of the same length\n");
return 1;
}
/*initialization of string_matrix*/
for(i=1;i<=total;i++)
for(j=1;j<=total;j++)
matrix[i].value[j]=0;
display_sub_strings();
create_matrix();
display_matrix();
create_edgelist();
arrange_edgelist();
display_edgelist();
create_sequence();
arrange_sequence();
display_sequence();
create_super_string();
display_super_string();
return 0;
}/*end of main*/
/*definition of global functions*/

/*function to display each substring entered*/
void display_sub_strings(void)
{
int i;
printf("\nENTERED SUBSTRINGS :\n");
printf("-------------------------------------\n");
for(i=1;i<=total;i++)
{
printf("IS[%d] = ",i);
puts(IS[i]);
}
}/*end of function*/
/*function to create the string_matrix*/

void create_matrix(void)
{
int i,j,l;
length=(int)strlen(IS[1]);
for(i=1;i<=total;i++)
{
for(j=1;j<=total;j++)
{
if(i==j)
continue;/*MATRIX[i][i] stays 0*/
/*try the longest overlap first: compare the last 'l' characters of
the i-th string with the first 'l' characters of the j-th string*/
for(l=length-1;l>=1;l--)
{
if(strncmp(IS[i]+length-l,IS[j],l)==0)
{
matrix[i].value[j]=l;/*match found for last 'l' characters of i-th string*/
break;
}
}
}
}
}
/*function to display the string_matrix*/
void display_matrix(void)
{
int i,j;
printf("\nMATRIX :\n");
printf("--------------\n");
printf("Here MATRIX[i][j] = max. matching characters between two strings and\n"
"MATRIX[i][j] = 0 if i=j\n");
/*header row*/
printf("      ");
for(i=1;i<=total;i++)
printf("IS[%d] ",i);
printf("\n");
/*one row per substring*/
for(i=1;i<=total;i++)
{
printf("IS[%d] ",i);
for(j=1;j<=total;j++)
printf("%-6d",matrix[i].value[j]);
printf("\n");
}
}
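/*function to create the edge_list*/

void create_edgelist(void)
{
int i,j;
int max_edges=(int)(sizeof(edgelist)/sizeof(edgelist[0]))-1;/*index 0 is unused*/
for(i=1;i<=total;i++)
{
for(j=1;j<=total;j++)
{
/*store every non-zero entry, as long as there is room in the list*/
if(matrix[i].value[j] && edge_count<max_edges)
{
edge_count++;
edgelist[edge_count].u=i;
edgelist[edge_count].v=j;
edgelist[edge_count].weight=matrix[i].value[j];
}
}
}
}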
/*function to arrange the edge_list in non-increasing order, implemented bubble sort*/

void arrange_edgelist(void)
{
int i,flag=1,j;
struct edgelist temp;
for(i=1;i<=edge_count && flag;i++)
{
j=edge_count;
flag=0;/*stays 0 if this pass makes no swap, i.e. the list is sorted*/
while(j>i)
{
if(edgelist[j].weight > edgelist[j-1].weight)
{
/*swap the two entries as whole structures*/
temp=edgelist[j];
edgelist[j]=edgelist[j-1];
edgelist[j-1]=temp;
flag=1;
}
j--;
}
}
}
/*function to display the edge_list*/

void display_edgelist(void)
{
int i;
printf("\nEDGELIST :\n");
printf("-----------------\n");
printf("Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist\n");
printf("VERTEX 1 VERTEX 2 EDGE\n");
for(i=1;i<=edge_count;i++)
printf("IS[%d] IS[%d] %d\n",edgelist[i].u,edgelist[i].v,edgelist[i].weight);
}
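/*function to create the maximal spanning tree using kruskal algorithm*/

void create_sequence(void)
{
int i=1,j,k,end,flag;
while((sequence_count < total-1) && (i<=edge_count))
{
/*reject the edge if u already has an outgoing edge in the tree*/
flag=check_cycle(edgelist[i].u);
/*reject the edge if v already has an incoming edge in the tree*/
for(j=1;j<=sequence_count && !flag;j++)
if(sequence[j].v==edgelist[i].v)
flag=1;
/*reject the edge if the chain that starts at v already ends at u,
because adding the edge would then close a cycle*/
if(!flag)
{
end=edgelist[i].v;
for(k=1;k<=sequence_count;k++)
for(j=1;j<=sequence_count;j++)
if(sequence[j].u==end)
{
end=sequence[j].v;
break;
}
if(end==edgelist[i].u)
flag=1;
}
if(!flag)
{
sequence_count++;
sequence[sequence_count].u=edgelist[i].u;
sequence[sequence_count].v=edgelist[i].v;
sequence[sequence_count].weight=edgelist[i].weight;
}
i++;
}
}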
/*function to check whether vertex u already has an outgoing edge in the tree;
a second edge leaving the same vertex must not be included*/

int check_cycle(int u)
{
int i;
for(i=1;i<=sequence_count;i++)
if(sequence[i].u==u)
return(1);
return(0);
}
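/*function to form the final sequence of substrings*/

void arrange_sequence(void)
{
int i,j,k,start=1,has_predecessor;
/*the chain starts at the edge whose u is not the v of any other edge*/
for(i=1;i<=sequence_count;i++)
{
has_predecessor=0;
for(j=1;j<=sequence_count;j++)
if(sequence[j].v==sequence[i].u)
has_predecessor=1;
if(!has_predecessor)
{
start=i;
break;
}
}
/*follow the chain from that edge, copying each edge into dseq in visiting order*/
i=start;
for(k=1;k<=sequence_count;k++)
{
dseq[k].u= sequence[i].u;
dseq[k].v= sequence[i].v;
dseq[k].weight= sequence[i].weight;
for(j=1;j<=sequence_count;j++)
if(sequence[j].u==sequence[i].v)
break;
if(j>sequence_count)
break;/*no edge leaves v: end of the chain*/
i=j;
}
/*copy into sequence, but only if the chain covered every edge*/
if(k!=sequence_count)
return;
for(i=1;i<=sequence_count;i++)
{
sequence[i].u=dseq[i].u;
sequence[i].v=dseq[i].v;
sequence[i].weight=dseq[i].weight;
}
}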
/*function to display the final sequence of substrings*/

void display_sequence(void)
{
int i;
printf("\nSEQUENCE :\n");
printf("--------\n");
printf("Here we Represent The Maximal Spanning Tree in Form of a List :\n");
printf("VERTEX 1 VERTEX 2 EDGE\n");
for(i=1;i<=sequence_count;i++)
printf("IS[%d] IS[%d] %d\n",sequence[i].u,sequence[i].v,sequence[i].weight);
}
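/*function to form the shortest common string*/

void create_super_string(void)
{
int i,j,k;
int limit=(int)sizeof(OS)-1;/*leave room for the terminating null character*/
/*start with the whole of the first substring in the sequence*/
for(i=0,j=0;i<length && j<limit;i++,j++)
OS[j]=IS[sequence[1].u][i];
/*append the non-overlapping tail of every following substring*/
for(i=1;i<=sequence_count;i++)
{
for(k=sequence[i].weight;k<length && j<limit;k++,j++)
{
OS[j]=IS[sequence[i].v][k];
}
}
OS[j]='\0';
}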
/*function to display the shortest common string*/

void display_super_string(void)
{
int i,j,k,count_blank=0;
printf("\nSHORTEST COMMON SUPERSTRING :\n");
puts(OS);
printf("------------\n");
/*printing a formatted output*/
puts(IS[sequence[1].u]);
for(i=1;i<=sequence_count;i++)
{
/*indent by the shift accumulated so far*/
for(j=1;j<=count_blank;j++)
printf(" ");
/*shift further by the part of the previous substring that is not overlapped*/
for(k=sequence[i].weight;k<length;k++)
{
printf(" ");
count_blank++;
}
puts(IS[sequence[i].v]);
}
}
/*definition of global functions finished*/
OUTPUT INSTANCE 1 :
ENTER THE TOTAL NO. OF SUBSTRINGS : 5
ENTER 5 SUBSTRINGS (each terminated by an enter)

EACH SUBSTRING MUST BE OF SAME LENGTH :
ABRAC
ACADA
ADABR
DABRA
RACAD
ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABRAC
IS[2] = ACADA
IS[3] = ADABR
IS[4] = DABRA
IS[5] = RACAD
MATRIX :
-------------
Here MATRIX[i][j] = max. matching characters between two strings and
MATRIX[i][j] = 0 if i=j
      IS[1] IS[2] IS[3] IS[4] IS[5]
IS[1] 0     2     0     0     3
IS[2] 1     0     3     2     0
IS[3] 3     0     0     4     1
IS[4] 4     1     1     0     2
IS[5] 0     4     2     1     0
EDGELIST :
-----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist
VERTEX 1 VERTEX 2 EDGE

IS[3] IS[4] 4
IS[4] IS[1] 4
IS[5] IS[2] 4
IS[1] IS[5] 3
IS[2] IS[3] 3
IS[3] IS[1] 3
IS[1] IS[2] 2
IS[2] IS[4] 2
IS[4] IS[5] 2
IS[5] IS[3] 2
IS[2] IS[1] 1
IS[3] IS[5] 1
IS[4] IS[2] 1
IS[4] IS[3] 1
IS[5] IS[4] 1
SEQUENCE :
-------------------
Here we Represent The Maximal Spanning Tree in Form of a List :
VERTEX 1 VERTEX 2 EDGE
IS[3] IS[4] 4
IS[4] IS[1] 4
IS[1] IS[5] 3
IS[5] IS[2] 4
SHORTEST COMMON SUPERSTRING :
ADABRACADA
---------------------
ADABR
 DABRA
  ABRAC
    RACAD
     ACADA
OUTPUT INSTANCE 2 :
ENTER THE TOTAL NO. OF SUBSTRINGS : 4
ENTER 4 SUBSTRINGS (each terminated by an enter)

EACH SUBSTRING MUST BE OF SAME LENGTH :
ABCDE
BCDEF
DEFGH
CDEFG
ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABCDE
IS[2] = BCDEF
IS[3] = DEFGH
IS[4] = CDEFG
MATRIX :
-------------
Here MATRIX[i][j] = max. matching characters between two strings and
MATRIX[i][j] = 0 if i=j
      IS[1] IS[2] IS[3] IS[4]
IS[1] 0     4     2     3
IS[2] 0     0     3     4
IS[3] 0     0     0     0
IS[4] 0     0     4     0
EDGELIST :
----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist
VERTEX 1 VERTEX 2 EDGE
IS[1] IS[2] 4
IS[2] IS[4] 4
IS[4] IS[3] 4
IS[1] IS[4] 3
IS[2] IS[3] 3
IS[1] IS[3] 2
SEQUENCE :
------------------
Here we Represent The Maximal Spanning Tree in Form of a List :
VERTEX 1 VERTEX 2 EDGE
IS[1] IS[2] 4
IS[2] IS[4] 4
IS[4] IS[3] 4
SHORTEST COMMON SUPERSTRING :
ABCDEFGH
-----------------
ABCDE
 BCDEF
  CDEFG
   DEFGH
DISCUSSION :
1. The code is implemented considering certain basic assumptions, such as:

i. Each substring entered must be of equal length.
ii. Every substring entered must share some overlapping
characters with at least one other substring.
iii. No Si is a substring of Sj, where both Si and Sj belong to the input set S.
2. Certain boundary conditions also have to be maintained, such as:

i. Each substring entered must be at most 10 characters long.
ii. A maximum of 10 substrings may be entered.
iii. The output string is a 1D array of 50 characters, so the superstring can hold
at most 49 characters plus the terminating null character.
3. The output is displayed in a formatted way so that it is easier for the user to
understand the formation of the shortest common superstring.
4. Kruskal’s algorithm is generally used to compute the minimal spanning tree,
but here it is used to find the maximal spanning tree. This is possible because
the structure ‘edgelist’ is sorted into non-increasing order of weight before the
edges are considered. Kruskal’s algorithm starts by sorting all edges of the graph. The time
complexity of this sorting operation is O(E log E) if there are ‘E’ edges in
the graph and an efficient sort is used (the bubble sort in this program takes O(E^2)).
The loop in the algorithm makes ‘E’ iterations in the
worst case. In each iteration, the major task is to find whether the current edge
introduces a cycle. The complexity of detecting a cycle is O(log n) in the worst
case if the graph contains ‘n’ vertices and a suitable structure, for example union-find,
is used. Thus the overall time complexity of the
algorithm is O(E log E) + O(E log n).
5. This program can be improved further by using suffix trees. This can be done
by building a tree containing all suffixes of all strings of S. String Si overlaps with
Sj iff a suffix of Si matches a prefix of Sj; traversing these vertices in order of
distance from the root defines the approximate merging order.
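The bound stated in point 4 can be simplified. A short derivation, under the assumption that the overlap graph on n strings has at most n(n - 1) directed edges:

```latex
E \le n(n-1) < n^2 \;\Rightarrow\; \log E < 2\log n,
\qquad
O(E\log E) + O(E\log n) = O(E\log n) = O(n^2 \log n).
```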
APPLICATION OF THIS PROBLEM:

The shortest common superstring problem (SCS) has been extensively studied
for its applications in string compression and DNA sequence assembly. Although the
problem is known to be Max-SNP hard, the simple greedy algorithm performs extremely
well in practice. To explain the good performance, previous researchers proved that the
greedy algorithm is asymptotically optimal on random instances. Unfortunately, the
practical instances in DNA sequence assembly are very different from the random
instances.