
# Finite Automata based Regular Expression Constrained Longest Common Subsequence

CSCI-589

## Under the guidance of: Dr. Abdullah Arslan

Abstract. The paper Finite Automata Based Algorithms for the Generalized Constrained Longest Common Subsequence Problems [i] solves four problems: STR-IC-LCS, SEQ-IC-LCS, STR-EC-LCS and SEQ-EC-LCS. For the generalized constrained longest common subsequence (GC-LCS) problem on strings $S_1$, $S_2$ with respect to a pattern $P$, the time complexity of the solutions is $O(r(n+m)+nm)$ for a fixed-size alphabet, where $r$, $n$ and $m$ are the lengths of $P$, $S_1$ and $S_2$ respectively. These problems and solutions can be extended to the Regular Expression Constrained Longest Common Subsequence problem. In this paper I show how the finite automata based algorithm can be applied to that problem.

## Introduction and related work

The paper Finite Automata Based Algorithms for the Generalized Constrained Longest Common Subsequence Problems [i] solves four problems: STR-IC-LCS, SEQ-IC-LCS, STR-EC-LCS and SEQ-EC-LCS. For the generalized constrained longest common subsequence (GC-LCS) problem on strings $S_1$, $S_2$ with respect to a pattern $P$, the time complexity of the solutions is $O(r(n+m)+nm)$ for a fixed-size alphabet, where $r$, $n$ and $m$ are the lengths of $P$, $S_1$ and $S_2$ respectively.

One of the closest problems to the above is the sequence alignment problem. The Regular Expression Constrained Sequence Alignment problem was introduced, together with a solution, by A. N. Arslan [ii]. The problem is stated as follows: given strings $S_1$, $S_2$ and a regular expression $R$, find the maximum alignment score between $S_1$ and $S_2$ over all alignments in which there exists a segment where some substring $s_1$ of $S_1$ is aligned to some substring $s_2$ of $S_2$, and both $s_1$ and $s_2$ match $R$, i.e. $s_1, s_2 \in L(R)$, where $L(R)$ is the regular language described by $R$. The solution in [ii] is an $O(nmr)$ time algorithm, where $r = O(t^4)$ and $t$ is the number of states of a nondeterministic finite automaton $N$ that accepts $L(R)$. In this paper I present a finite automata based algorithm for the Regular Expression Constrained Sequence Alignment problem.
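To make the constraint concrete, here is a brute-force reference for the longest common subsequence analogue of the problem: the longest common subsequence of $S_1$ and $S_2$ that contains a substring matching $R$. It is only a sketch for checking small inputs, and it rests on assumptions that are not in the cited papers: the function names are hypothetical, the constraint is given as a Python `re` pattern (a superset of classical regular expressions), and the running time is exponential.

```python
import re
from itertools import combinations


def subsequences(s):
    """Return the set of all distinct subsequences of s (exponential in len(s))."""
    return {"".join(chars)
            for k in range(len(s) + 1)
            for chars in combinations(s, k)}


def brute_force_constrained_lcs(s1, s2, pattern):
    """Longest common subsequence of s1 and s2 having a substring that matches pattern.

    Returns None when no common subsequence satisfies the constraint.
    """
    rx = re.compile(pattern)
    common = subsequences(s1) & subsequences(s2)
    # rx.search succeeds exactly when some substring of w matches the pattern.
    feasible = [w for w in common if rx.search(w)]
    return max(feasible, key=len, default=None)


answer = brute_force_constrained_lcs("abcbdab", "bdcaba", "ab")
print(answer)
```

Several answers of the same length may exist, so only the length of the result is determined by the input.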

## Finite Automata based Algorithm

The algorithm to find an LCS of $S_1$ and $S_2$ constrained by a regular expression $R$ as a substring is presented below.

Step 1. Construct the minimal automaton $M_R$ for $R$ such that $L(M_R) = L(R)$. The method used to construct the minimal automaton from a regular expression is presented in [iii]. The time complexity is given as $O(r \log r)$, where $r$ is the size of the regular expression $R$. The space complexity, i.e. the number of states in the resulting minimal automaton, is $O(r)$.

Step 2. Construct the subsequence automaton $M_1$ for $S_1$. We can construct a Directed Acyclic Sequence Graph (DASG) for the string $S_1$ in $O(n \log n)$ time and $O(n)$ space, where $n$ is the size of $S_1$, as given in [iv]. $L(M_1)$ is the set of all possible subsequences of $S_1$ (up to $2^n$ of them).

Step 3. Construct the subsequence automaton $M_2$ for $S_2$. As in Step 2, we can construct a DASG for the string $S_2$ in $O(m \log m)$ time and $O(m)$ space, where $m$ is the size of $S_2$, as given in [iv]. $L(M_2)$ is the set of all possible subsequences of $S_2$ (up to $2^m$ of them).

Step 4. Construct the intersection automaton $M_{1R}$ of $M_1$ and $M_R$. $L(M_{1R}) = L(M_1) \cap L(M_R)$ contains all the subsequences of $S_1$ that satisfy the regular expression constraint $R$. This step takes $O(nr)$ time and at most $O(nr)$ space.

Step 5. Construct the intersection automaton $M_{2R}$ of $M_2$ and $M_R$. $L(M_{2R}) = L(M_2) \cap L(M_R)$ contains all the subsequences of $S_2$ that satisfy the regular expression constraint $R$. This step takes $O(mr)$ time and at most $O(mr)$ space.

Step 6. Construct the intersection automaton $M_{12R}$ of $M_{1R}$ and $M_{2R}$. $L(M_{12R}) = L(M_{1R}) \cap L(M_{2R})$ contains all the common subsequences of $S_1$ and $S_2$ that satisfy the regular expression constraint $R$. This step takes $O(nmr^2)$ time and at most $O(nmr^2)$ space.

Step 7. Find the maximum value path for the alignment score in $M_{12R}$. Following Dijkstra's algorithm for finding maximum paths, with the weights on the edges taken from the transition function for each edit operation $x \to y$, we can find the maximum alignment (the longest common subsequence with maximum alignment score). The simplest implementation of Dijkstra's algorithm, backed by a binary heap, takes $O(nmr^2 + \log(nmr^2))$ time.

The overall running time is dominated by Step 7, which takes $O(nmr^2 + \log(nmr^2))$ time.
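Steps 2 to 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the constructions of [iii] or [iv]. The names `subsequence_automaton`, `constrained_lcs` and `contains_ab_dfa` are hypothetical. The automaton for the constraint is written by hand as a DFA instead of being derived from a regular expression as in Step 1. The subsequence automata are built as simple next-occurrence tables. The product of the three automata is explored lazily rather than materialized. Every matched character has weight 1, so the result is a longest string, not a general alignment score. Because both subsequence automata are acyclic, the product is acyclic too, and this sketch uses a memoized depth-first search for the longest path in place of Dijkstra's algorithm.

```python
from functools import lru_cache


def subsequence_automaton(s):
    """Return nxt where nxt[i][c] is the state reached from state i on symbol c.

    State i means "the first i characters of s have been skipped or used".
    Every state is accepting, so the accepted language is the set of
    subsequences of s.
    """
    nxt = [dict() for _ in range(len(s) + 1)]
    for i in range(len(s) - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])  # inherit transitions to later occurrences
        nxt[i][s[i]] = i + 1       # the nearest occurrence of s[i] takes priority
    return nxt


def constrained_lcs(s1, s2, delta, start, accepting):
    """Longest common subsequence of s1 and s2 accepted by the DFA (delta, start, accepting).

    delta maps (state, symbol) to the next state. Returns None if no common
    subsequence is accepted.
    """
    m1 = subsequence_automaton(s1)
    m2 = subsequence_automaton(s2)
    alphabet = sorted(set(s1) & set(s2))

    @lru_cache(maxsize=None)
    def best(i, j, q):
        # Longest string leading from product state (i, j, q) to an accepting state.
        result = "" if q in accepting else None
        for c in alphabet:
            if c in m1[i] and c in m2[j] and (q, c) in delta:
                tail = best(m1[i][c], m2[j][c], delta[(q, c)])
                if tail is not None and (result is None or len(tail) + 1 > len(result)):
                    result = c + tail
        return result

    return best(0, 0, start)


def contains_ab_dfa(alphabet):
    """Hand-built DFA for strings over alphabet that contain "ab" as a substring."""
    delta = {}
    for c in alphabet:
        delta[(0, c)] = 1 if c == "a" else 0
        delta[(1, c)] = 1 if c == "a" else (2 if c == "b" else 0)
        delta[(2, c)] = 2  # accepting sink: "ab" has already been seen
    return delta, 0, {2}


delta, start, accepting = contains_ab_dfa("abcd")
result = constrained_lcs("abcbdab", "bdcaba", delta, start, accepting)
print(result)
```

To constrain the whole common subsequence to match $R$ rather than to contain a match of $R$ as a substring, pass a DFA for $L(R)$ itself instead of a DFA for strings that contain a match.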

[i] Effat Farhana, Jannatul Ferdous, Tanaeem Moosa and M. Sohel Rahman, Finite Automata Based Algorithms for the Generalized Constrained Longest Common Subsequence Problems.

[ii] A. N. Arslan, Regular Expression Constrained Sequence Alignment, Journal of Discrete Algorithms.

[iii] Sanjay Bhargava and G. N. Purohit, Construction of a Minimal Deterministic Finite Automaton from a Regular Expression.

[iv] Ricardo A. Baeza-Yates, Searching Subsequences.