Longest Common Subsequence

Longest Common Subsequence (LCS)
Let P be an alphabet, and let A and B be two strings over P of lengths m and n respectively,
i.e. A = a1,a2,.....,am and B = b1,b2,.....,bn. We have to find a longest common subsequence
of A and B.
Define L[i,j] as
L[i,j] = length of a longest common subsequence of the prefixes a1,a2,.....,ai and b1,b2,.....,bj.
L[i,j] can be defined in terms of the previously computed values
L[i-1,j-1], L[i-1,j] and L[i,j-1] as below.
L[i,j] = L[i-1,j-1]+1 if ai = bj
L[i,j] = Max(L[i,j-1], L[i-1,j]) otherwise.
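Written as a single display, together with the boundary case that the algorithm initialises explicitly (an empty prefix has only the empty subsequence in common with anything), the recurrence can be stated as follows. The indexing assumes 1-based characters, as in the text.

```latex
L[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
L[i-1,j-1] + 1 & \text{if } i, j > 0 \text{ and } a_i = b_j,\\
\max\bigl(L[i,j-1],\; L[i-1,j]\bigr) & \text{if } i, j > 0 \text{ and } a_i \neq b_j.
\end{cases}
```

For example, with A = "AB" and B = "B": L[1,1] = max(L[1,0], L[0,1]) = 0 because a1 differs from b1, and L[2,1] = L[1,0] + 1 = 1 because a2 = b1.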
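Let the matrix C, with rows 0..m and columns 0..n, hold these lengths, so that C[m,n] is the
length of a longest common subsequence of a1,a2,.....,am and b1,b2,.....,bn, and let the
matrix D record which case of the recurrence produced each entry of C.
If D[i,j] = "↖" then ai (which is equal to bj) is part of the longest common
subsequence; "↑" and "←" mean that the value was copied from C[i-1,j] or C[i,j-1] respectively.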
The algorithm is given below.
Method
Input: Two arrays A[1..m] and B[1..n].
Output: Two matrices C and D, indexed [0..m, 0..n].
LCS
m = length[A];
n = length[B];
for i = 0 upto m
    C[i,0] = 0;
for j = 0 upto n
    C[0,j] = 0;
for i = 1 upto m
    for j = 1 upto n
        if (ai = bj)
            C[i,j] = C[i-1,j-1]+1;
            D[i,j] = "↖";
        else if (C[i-1,j] ≥ C[i,j-1])
            C[i,j] = C[i-1,j];
            D[i,j] = "↑";
        else
            C[i,j] = C[i,j-1];
            D[i,j] = "←";
        endif
    endfor
endfor
return C and D
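The method above might be written in C roughly as follows. This is only a sketch under some assumptions that are not in the notes: the function name `lcs_table`, the row-major storage of both matrices in caller-allocated arrays of (m+1)*(n+1) entries, and the codes `DIAG`, `UP` and `LEFT` standing in for the three marks stored in D are all illustrative choices.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical codes standing in for the three marks stored in D. */
enum { DIAG = 0, UP = 1, LEFT = 2 };

/*
 * Fills C and D, both stored row-major with (m+1)*(n+1) entries, and
 * returns C[m][n], the LCS length of A and B. The caller allocates both
 * arrays; row 0 and column 0 of D are left untouched.
 */
int lcs_table(const char *A, const char *B, int *C, unsigned char *D)
{
    assert(A != NULL && B != NULL && C != NULL && D != NULL);
    size_t m = strlen(A), n = strlen(B), w = n + 1;  /* w = row width */

    for (size_t i = 0; i <= m; i++) C[i * w] = 0;    /* column 0 */
    for (size_t j = 0; j <= n; j++) C[j] = 0;        /* row 0 */

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            if (A[i - 1] == B[j - 1]) {              /* C strings are 0-based */
                C[i * w + j] = C[(i - 1) * w + (j - 1)] + 1;
                D[i * w + j] = DIAG;
            } else if (C[(i - 1) * w + j] >= C[i * w + (j - 1)]) {
                C[i * w + j] = C[(i - 1) * w + j];
                D[i * w + j] = UP;
            } else {
                C[i * w + j] = C[i * w + (j - 1)];
                D[i * w + j] = LEFT;
            }
        }
    }
    return C[m * w + n];
}
```

For the strings "subsequence" (m = 11) and "opsubset" (n = 8), the arrays need 12 * 9 entries each and the function returns 5.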
4.1 Construction Of Longest Common Subsequence:
From the matrix D we can recover an LCS as below.
Method
Input: The matrix D output by the LCS algorithm, the array A, and indices i and j (initially m and n).
Output: Longest Common Subsequence.
PrintLCS(D,A,i,j)
if (i = 0 or j = 0)
    then return;
endif
if (D[i,j] = "↖")
    then PrintLCS(D,A,i-1,j-1);
         print ai;
else
    if (D[i,j] = "↑")
        then PrintLCS(D,A,i-1,j);
        else PrintLCS(D,A,i,j-1);
    endif
endif
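A possible C version of the reconstruction is sketched below. It departs from PrintLCS in two ways, both of which are choices made here rather than part of the notes: it walks back through D with a loop instead of recursion, filling the output buffer from its last position, and it returns the subsequence in a caller-supplied buffer instead of printing it. The name `lcs_string` and the codes `DIAG`, `UP`, `LEFT` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { DIAG, UP, LEFT };  /* hypothetical stand-ins for the marks in D */

/*
 * Writes one LCS of A and B into out (NUL-terminated) and returns its
 * length, or -1 if allocation fails. out must have room for
 * min(strlen(A), strlen(B)) + 1 bytes.
 */
int lcs_string(const char *A, const char *B, char *out)
{
    assert(A != NULL && B != NULL && out != NULL);
    size_t m = strlen(A), n = strlen(B), w = n + 1;
    int *C = calloc((m + 1) * w, sizeof *C);   /* zeroed: row 0 and column 0 */
    unsigned char *D = calloc((m + 1) * w, 1);
    if (C == NULL || D == NULL) { free(C); free(D); return -1; }

    /* Fill C and D with the same three cases as the LCS method. */
    for (size_t i = 1; i <= m; i++)
        for (size_t j = 1; j <= n; j++) {
            size_t k = i * w + j;              /* index of entry (i, j) */
            if (A[i - 1] == B[j - 1]) { C[k] = C[k - w - 1] + 1; D[k] = DIAG; }
            else if (C[k - w] >= C[k - 1]) { C[k] = C[k - w]; D[k] = UP; }
            else { C[k] = C[k - 1]; D[k] = LEFT; }
        }

    /* Walk back from (m, n), writing matched characters right to left. */
    int len = C[m * w + n];
    size_t i = m, j = n, pos = (size_t)len;
    out[pos] = '\0';
    while (i > 0 && j > 0) {
        unsigned char d = D[i * w + j];
        if (d == DIAG) { out[--pos] = A[i - 1]; i--; j--; }
        else if (d == UP) i--;
        else j--;
    }
    free(C);
    free(D);
    return len;
}
```

Called with "subsequence" and "opsubset", it returns 5 and leaves "subse" in the buffer, which is the only common subsequence of that length for this pair.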
4.2 Complexity analysis
4.2.1 Time Complexity
From the LCS algorithm above, the time complexity is Θ(m*n), as there are two nested
for loops of length m and n and each iteration does a constant amount of work.
4.2.2 Space Complexity
As we need to store the matrices C and D, the space complexity is also Θ(m*n).
But if only the length is needed, we can reduce the space used for C. Computing row i
of C needs only row i-1 and row i itself (or the corresponding two columns, whichever
is shorter). Thus we can improve the space complexity to Θ(Min(m,n)). The matrix D
still takes Θ(m*n) space if the subsequence itself has to be printed.
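The space saving can be illustrated with the following sketch, which computes only the length. It keeps two rows of min(m,n)+1 integers and swaps them after each row; the function name `lcs_length_two_rows` and the decision to swap the two strings so that the shorter one indexes the columns are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns the LCS length of A and B using two rows of
 * min(m, n) + 1 integers, or -1 if allocation fails.
 */
int lcs_length_two_rows(const char *A, const char *B)
{
    assert(A != NULL && B != NULL);
    size_t m = strlen(A), n = strlen(B);
    if (n > m) {                               /* make B the shorter string */
        const char *ts = A; A = B; B = ts;
        size_t tn = m; m = n; n = tn;
    }
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1, starts as row 0 */
    int *curr = calloc(n + 1, sizeof *curr);   /* row i */
    if (prev == NULL || curr == NULL) { free(prev); free(curr); return -1; }

    for (size_t i = 1; i <= m; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= n; j++) {
            if (A[i - 1] == B[j - 1])
                curr[j] = prev[j - 1] + 1;
            else
                curr[j] = prev[j] >= curr[j - 1] ? prev[j] : curr[j - 1];
        }
        int *t = prev; prev = curr; curr = t;  /* current row becomes previous */
    }
    int result = prev[n];                      /* after the swap, prev is row m */
    free(prev);
    free(curr);
    return result;
}
```

The result does not depend on the argument order, since the LCS length of A and B equals that of B and A.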
Longest common subsequence problem
Why might we want to solve the longest common subsequence problem? There are
several motivating applications.
a) Molecular biology.
DNA sequences (genes) can be represented as sequences of four letters
ACGT (A=adenine, C=cytosine, G=guanine and T=thymine), corresponding to the four
submolecules forming DNA. When biologists find a new sequence, they typically want
to know what other sequences it is most similar to. One way of computing how similar
two sequences are is to find the length of their longest common subsequence.
b) File comparison.
The Unix program "diff" is used to compare two different versions of the same file, to
determine what changes have been made to the file. It works by finding a longest
common subsequence of the lines of the two files; any line in the subsequence has not
been changed, so what it displays is the remaining set of lines that have changed. In this
instance of the problem we should think of each line of a file as being a single
complicated character in a string.
c) Screen redisplay.
Many text editors like "emacs" display part of a file on the screen, updating the screen
image as the file is changed. For slow dial-in terminals, these programs want to send the
terminal as few characters as possible to cause it to update its display correctly. It is
possible to view the computation of the minimum length sequence of characters needed
to update the terminal as being a sort of common subsequence problem (the common
subsequence tells you the parts of the display that are already correct and don't need to be
changed).
Brute-force method:
We first look at a brute-force solution to the LCS problem. If we have two strings, say
"subsequence" and "opsubset", we can represent a subsequence as a way of writing the
two so that certain letters line up:
  subsequence
  |||||
opsubset
If we draw lines connecting the letters in the first string to the corresponding letters in the
second, no two lines cross (the top and bottom endpoints occur in the same order, the
order of the letters in the subsequence). Conversely any set of lines drawn like this,
without crossings, represents a subsequence.
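If the two strings start with the same character, it is always safe to match those two
characters and take that character as the first one of the common subsequence.
On the other hand, suppose that, like the example above, the two first characters differ.
Then it is not possible for both of them to be part of a common subsequence - one or the
other (or maybe both) will have to be removed.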
Finally, observe that once we've decided what to do with the first characters of the
strings, the remaining subproblem is again a longest common subsequence problem, on
two shorter strings. Therefore we can solve it recursively.
These observations give us the following, very inefficient, recursive algorithm.
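Recursive LCS:
/* max is not a standard C function, so a small helper is defined here */
int max(int a, int b) { return a > b ? a : b; }

int lcs_length(const char *A, const char *B)
{
    if (*A == '\0' || *B == '\0') return 0;   /* one string is exhausted */
    else if (*A == *B) return 1 + lcs_length(A+1, B+1);
    else return max(lcs_length(A+1, B), lcs_length(A, B+1));
}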
This is a correct solution but it is very time consuming. For example, if the two strings
have no matching characters, so that the last line always gets executed, the number of
recursive calls is given by binomial coefficients, which (if m=n) grow exponentially in n,
roughly like 4^n/sqrt(n).
