Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Download
Standard view
Full view
of .
Save to My Library
Look up keyword
Like this
4Activity
0 of .
Results for:
No results containing your search query
P. 1
HS-MSA: New Algorithm Based on Meta-heuristic Harmony Search for Solving Multiple Sequence Alignment

HS-MSA: New Algorithm Based on Meta-heuristic Harmony Search for Solving Multiple Sequence Alignment

Ratings: (0)|Views: 254 |Likes:
Published by ijcsis
Aligning multiple biological sequences such as in protein or DNA/RNA is a fundamental task in bioinformatics and sequence analysis. In the functional, structural and evolutionary studies of sequence data the role of multiple sequence alignment (MSA) cannot be denied. It is imperative that there is accurate alignment when predicting the RNA structure. MSA is a major bioinformatics challenge as it is NP-complete. In addition, the lack of a reliable scoring method makes it harder to align the sequences and evaluate the alignment outcomes. Scalability, biological accuracy, and computational complexity must be taken into consideration when solving MSA problem. The harmony search algorithm is a recent meta-heuristic method which has been successfully applied to a number of optimization problems. In this paper, an adapted harmony search algorithm (HS-MSA) methodology is proposed to solve MSA problem. In addition, a hybrid method of finding the conserved regions using the Divide-and-Conquer (DAC) method is proposed to reduce the search space. The proposed method (HS-MSA) is extended to a parallel approach in order to exploit the benefits of the multi-core and GPU system so as to reduce computational complexity and time.
Aligning multiple biological sequences such as in protein or DNA/RNA is a fundamental task in bioinformatics and sequence analysis. In the functional, structural and evolutionary studies of sequence data the role of multiple sequence alignment (MSA) cannot be denied. It is imperative that there is accurate alignment when predicting the RNA structure. MSA is a major bioinformatics challenge as it is NP-complete. In addition, the lack of a reliable scoring method makes it harder to align the sequences and evaluate the alignment outcomes. Scalability, biological accuracy, and computational complexity must be taken into consideration when solving MSA problem. The harmony search algorithm is a recent meta-heuristic method which has been successfully applied to a number of optimization problems. In this paper, an adapted harmony search algorithm (HS-MSA) methodology is proposed to solve MSA problem. In addition, a hybrid method of finding the conserved regions using the Divide-and-Conquer (DAC) method is proposed to reduce the search space. The proposed method (HS-MSA) is extended to a parallel approach in order to exploit the benefits of the multi-core and GPU system so as to reduce computational complexity and time.

More info:

Published by: ijcsis on Mar 08, 2011
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

09/13/2013

pdf

text

original

 
Abproseqstu(MaligbiolacseqbiointseabeeInmehyandspaapGP
alg
evoanctomulettnuseqnugu
HS-Har 
S
tractAlignintein or DNA/Ruence analysis.ies of sequencA) cannot benment when p
 
informatics chof a reliableuences and elogical accuracconsideratiorch algorithmn successfullyhis paper, anhodology is prid method of -Conquer (Dce. The proposroach in ordeU system so as
Keyword: RNArithm.
Living organlution. A paiestor in the piscover the sitations that tooA sequence iers of the alpleotides for uence is writtleotides symnine (G) and
SA:ony
Mubarak S.
hool of Compniversiti SaiPenang,obarak_seif  
g multiple biNA is a fundaIn the functioe data the roldenied. It is iredicting thellenge as it isscoring methvaluate the aly, and computwhen solvinis a recent mpplied to a nuadapted harmoposed to solfinding the conC) method ised method (Hr to exploit thto reduce com
, Multiple sequ
I.
 
I
NT
isms are relar of organisst from whicilarities amok place.s an orderedhabet, S (20RNA/DNA).en as s = AUols comprisiracil (U): S =
ewSearc
Mohsen,
uter Sciences,s Malaysia,alaysia,yahoo.com.
ological sequeental task in bnal, structuralof multiple seperative thatNA structure.NP-complete.d makes it haignment outctional complexMSA probleeta-heuristicmber of optimny search alge MSA probleserved regionsproposed to r-MSA) is extebenefits of tutational com
ence alignment
ODUCTION
 
ted to eachs sometimesthey were evg the sequenclist of symbomino acids f In bioinfor UCUGUAA.g adenine ({A, C, G, U}.
(
lgoritfor Ali
Survey an
 
nces such asioinformatics aand evolutionquence alignmthere is accurMSA is a maIn addition,rder to alignmes. Scalabiliity must be ta. The harmethod whichization problerithm (HS-Mm. In additionusing the Divieduce the seaded to a parae multi-core alexity and tim
, Harmony sea
ther throughhas a commlved. MSA tr e and recover ls from a setor protein anatics, a R It is a string), cytosine (
Figure 1.
IJCSIS) Interna
m Bolvingnme
d Propose
 
inndryntateorhehety,ennyass.A), ae-chllelnd.
ch
utoniesheof 4Aof ),Alithe otresiduemutatisymbolTo imcorrespusuallysequenthe alithe besFor using tof seqFindinIt isprobledifficulfunctioInproblescalabisequenfindingobjectia simpproblecalculaMSlocallengthand loFigure
Global and local
tional Journal
sed oMult
Work 
SchU
 
gnment is aer to shows. A columnn has occur s indicates throve the aligond to a spacecalled a gap.ce and deletionment perfor t alignment.clarity’s sakee following dences in ordan accuratea time con[2, 3]. Theties, that is,n.fact, the cos must be sity, is aboutces. The secthe alignmene function ae objective f , the objectivtion in order tA covers twoSA. Globalwhile local Mates conserve1.
MSA
f Computer Sci
Metiple S
Rosni Abd
ool of Computniversiti SainsPenang, Marosni@cs.usethod to arrathe match ahich has maed whereast several mutment score, tintroduced inThe gap is viin the other.ance. The hi, the genericeclaration: “Inr to maximizSA from thesuming andSA problemscalability, oplexity thatlved simultanfinding theond problem,with the higong the sequenction is ane function (OFmeasure the aclosely relatedSA aligns seSA aligns cer d regions alo
nce and InformVol.
-heur eque
llah,
 
er Sciences,Malaysia,laysia,m.my.ge the sequend mismatchch residues sa column wition events ae character “the sequence.wed as an insA score is usehest score of SA problemsert gaps withia similaritysequences iscomputationacan be dividtimization, aarises fromeously. The f alignment of optimization,est score basences. OptimizP-hard proble), involves splignment.problems: gloquences acrosain parts of tg with them
ation Security,9, No. 2, 2011
sticce
ces one over between theows that no
 
th mismatche happening.” is used toThis space isertion in oned to measureone indicatesis expressedn a given setcriterion”[1].ery difficult.lly NP-harded into threend objectivell the threeirst problem,many longdeals withd on a givention of evenm. The thirdeding up theal MSA ands their wholee sequences,as shown in
70 http://sites.google.com/site/ijcsis/ISSN 1947-5500
 
(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 2, 2011
In bioinformatics, MSA is a major interesting problem andconstitutes the basis for other molecular biology analyses.MSA has been used to address many critical problems inbioinformatics. Studying these alignments provides scientistswith information needed to determine the evolutionaryrelationships between them, find the sequences of the family,detect the structure of protein/DNA, reveal the sequencehomologies, predict the functions of protein/DNA sequences,and predict the patient’s diseases or discover drug-likecompounds that can bind to the sequences.In general, the primary step in the secondary structureprediction is through MSA, particularly in the prediction of thestructure of RNA sequences. The RNA structure predictionmethod is extremely affected by the quality of thealignment[4]. Indeed, prediction of an accurate RNA secondarystructure relies on multiple sequence alignments to provide dataon co-varying bases[5]. MSA significantly improves theaccuracy of protein/RNA structure prediction. For example,current RNA secondary structure prediction methods usingaligned sequences have been successful in gaining a higher prediction accuracy than those using a single sequence[6].Nucleic acid sequences are of primary concern in our proposedmethod to evaluate and improve the influence of the alignmenttools on RNA secondary structure prediction.Many different approaches have been proposed to solve theMSA problem. Dynamic programming, progressive, iterative,consistency and segment-based approaches are the mostcommonly used approaches[7]. Although many MSAalgorithms are available, a solution has yet to been found that isapplicable to all possible alignment situations[7].It is well-known fact that the MSA problem can be solvedby using the dynamic programming (DP) algorithm[8, 9].Unfortunately, such an approach is notorious for its largeconsumption of processing time. DP methods with the sum-of-pairs score have been shown to be a NP-completeproblem[10],[11]. Algorithms that provide the optimal solutionis time consuming and have a running time that growsexponentially with the increase in the number of sequences andtheir lengths.In essence, all widely used MSA tools seek an alignmentwith a high sum-of-pairs score. This optimization problem isNP-complete[2, 3] and thus motivates the research intoheuristics. Over the last decade, the evolutionary and meta-heuristic approaches are one of the most recent approaches thathave been used to solve the optimization problem.Evolutionary and meta-heuristic algorithms have been used inseveral problem domains, including science, commerce, andengineering. Consequently, most of the practical MSAalgorithms are based on heuristics to obtain a reasonablyaccurate MSA within a moderate computational time and thatwhich usually produces quasi-optimal alignment. Althoughmany algorithms are now available, there is still room toimprove its computational complexity, accuracy, andscalability.In this paper, a novel algorithm (HS-MSA), that is, a meta-heuristic technique known as harmony search algorithm, isproposed to solve the old MSA problem. The MSA problem isviewed as an optimization problem and can be resolved byadapting a harmony search algorithm. Since the search space inHS is wide, a modified algorithm is proposed (MHS-MSA) tofind the conserved blocks using well-known regions, and thenalign the mismatch regions between the successive blocks toform a final alignment. HS-MSA is extended to include thedivide-and-conquer (DCA) approach in which DCA is used tocut and combine the sub-sequence to form the final MSA.Another proposed technique is to use the harmony searchalgorithm as an MSA improver (HSI-MSA) in which the initialalignment can be obtained from the conventional algorithms or their combinations. HS-MSA can be extended to the parallelalgorithm (PHS-MSA) in order to exploit the benefits of themulti-core and GPU system to reduce computationalcomplexity and time.This paper is organized as follows: Section 2 reviews therelated literature and describes the state-of-the-art MSAapproaches. Section 3 explains the proposed algorithm. Theevaluation and analysis methodology that is used to assess our proposed algorithm is explained in Section 4. Lastly, Section 5provides the conclusion and summary of the paper.II.
 
L
ITERATURE
EVIEW
 There are several MSA algorithms reported in the literaturereview. For a deeper understanding about the MSA algorithms,the basic concepts of MSA alignment representation, gappenalty, alignment scores, dataset benchmarks, MSAapproaches, and harmony search algorithm need to beunderstood. As such subsection 2.1 briefly reviews therepresentation of MSA alignment followed by the details aboutgap penalty in subsection 2.2. The alignment scores, RNAdatasets and benchmarks, and current MSA approaches areexplained in subsections 2.3, 2.4 and 2.5 respectively.Subsection 2.6 provides a summary of the MSA algorithms andconcludes with the harmony search algorithm in subsection 2.7.
A.
 
Representation of MSA Alignment 
There are several ways to represent a multiple sequencealignment. Usually, the final sequences are an aligned listing of the entire sequence of one over the other. However, during thealignment process, it is helpful to represent the alignment of thesequences in a manner known as a representation. Some of therepresentations that have been used in previous algorithmsinclude a bit matrix as used in[12], a matrix of gaps position asused in[13], multiple number-strings as usedin[14],[15],[16],[17], string representation[18],[19],[20] as usedin SAGA[18], four parallel chromosomes as used in[21],directed acyclic graph (DAG) as used in[22, 23], A-Bruijngraph as used in[24-26] , and dispersion Graph as used in[27].
B.
 
Gaps Penalty
A negative score or a penalty can be assigned to a set of gaps. Two types of gaps which were mentioned in the previousreviews[28] are defined as follows:
-
 
Linear gap model – in this model a Gap is always giventhe same penalty wherever it is placed in the alignment.The penalty is proportional to the length of the gap and is
71 http://sites.google.com/site/ijcsis/ISSN 1947-5500
 
(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 2, 2011
given by gap = n×go, where go < 0 is the opening penaltyof a gap and n is the number of consecutive gaps.
-
 
Affine gap model – in this model both the new gap andextension gap are not given the same penalty. Theinsertion of a new gap has a greater penalty than theextension of an existing gap and is given by gap = go + (n
1) × ge, where go < 0 is the gap opening penalty and ge< 0 is the gap extension penalty and are such that |ge| <|go|.
C.
 
Alignment Score
The MSA objective function is defined for assessing thealignment quality either explicitly or implicitly. An efficientalgorithm is used to find the optimal or a near optimalalignment according to the objective function. Matches,mismatches, substitutions, insertions, and deletions need to bescored in the scoring function. The scoring function can bedivided into two parts: substitution matrices and gap penalties.The former provides a numerical score for matches andmismatches while the latter allows for numerical quantificationof insertions and deletions. All possible transitions between the20 amino acids, or the 4 nucleic acids are represented in asubstitution matrix which is an array of two dimensions of 20 x20 for amino acid and 4 x 4 for nucleic acids.Usually a simple matrix used for DNA or RNA sequencesinvolves assigning a positive value for a match and a negativevalue for a mismatch[20]. Meanwhile, the scores for proteinaligned residues are given as log-odds[29] substitution matricessuch as PAM[30], GONNET[31], or BLOSUM[32].There are several models for assessing the score of a givenMSA. Many MSA tools have adopted the score method. Abrief review of the score method that has been used to calculatethe alignment score is as follows:
-
 
Sum-of-Pairs (SP): It was introduced by Carrillo andLipman[10]. More details about the sum-of-Pairs will bepresented later.
-
 
Weighted sum-of-pairs score[33],[34]: The weighted sum-of-pairs (WSP) score is an extension of the SP score sothat each pair-wise alignment score contributes differentlyto the whole score.
-
 
Maximal expected accuracy (MEA)[35]: The basic idea of MEA is to maximize the expected number of “correctly”aligned residue pairs[36]. It has been used in PRIME[37],and ProbCons[38] algorithms.
-
 
Consistency-based Scoring: This consistency concept wasoriginally introduced by Gotoh [9] and later refined byVingron and Argos[39]. Consistency-based scoring is usedin T-Coffee[40], MAFFT[41], and Align-m[42]algorithms.
-
 
Probabilistic consistency Scoring function: This scoringfunction is introduced in ProbCons[38]. It is a novelmodification of the traditional sum-of-pairs scoringsystem. This promising idea is implemented and extendedin the PECAN[43], MUMMALS[44], PROMALS[45],ProbAlign[46] , ProDA[47], and PicXAA[48] programs.
-
 
Segment-to-segment objective function: It is used byDIALIGN[49] to construct an alignment throughcomparison of the whole segments of the sequences rather than the residue-to-residue comparison.
-
 
NorMD[50] objective function: It is a conservation-basedscore which measures the mean distance between thesimilarities of the residue pairs at each alignment column.NorMD is used in RASCAL[51] and AQUA[52].
-
 
Muscle profile scoring function: MUSCLE[53] uses ascoring function which is defined for a pair of profilepositions. In addition to PSP, MUSCLE uses a new profilefunction which is called the log-expectation (LE) score.
D.
 
RNA Database and Benchmarks
Typically, a benchmark of reference alignments is used tovalidate the MSA program. The accurate score is given bycomparing the aligned sequence (test sequences) produced bythe program with the corresponding reference alignment. Mostalignment programs have been extensively investigated for protein. To date, few attempts have been made to benchmark nucleic acid sequences.RNA reference alignments exist in several databases. Itmust be noted that although these databases provide asubstantial amount of information to the specialist, they dodiffer in the file formats used and the data obtained. Herein, abrief review of the benchmarks and database that have beenused for multiple RNA sequence alignment is explained inTable 1.
TABLE I. D
ATABASE AND
B
ENCHMARKS
 RNA Database Description Website
Rfam[54]
,
[55] It is a compilation of alignment and covariance models including manyregular non-coding RNA families[55]http://rfam.sanger.ac.uk/http://rfam.janelia.org/index.html.BRAliBase[56]
,
[57] It is a compilation of RNA reference alignments especially designed for thebenchmark of RNA alignment methods[57].http://www.biophys.uni-duesseldorf.de/bralibase/http://projects.binf.ku.dk/pgardner/bralibase/Comparative RNA Website(CRW)[58]It has alignments for rRNA (5S / 16S / 23S), Group I Intron, Group IIintron, and tRNA for various organisms[58]http://www.rna.ccbb.utexas.edu/European Ribosomal RNADatabase[59]
,
[60]It is a collection of all complete or nearly complete SSU (small subunit) andLSU (large subunit) ribosomal RNA sequences available from publicsequence databases[60].http://bioinformatics.psb.ugent.be/webtools/rRNA/The Ribonuclease PDatabase[61]It contains a collection of sequence alignments, RNase P sequences, threedimensional models, secondary structures, and accessory information[61].http://www.mbio.ncsu.edu/RnaseP/5S Ribosomal RNA It is a collection of the large subunit of most organellar ribosomes and all http://biobases.ibch.poznan.pl/5SData/
72 http://sites.google.com/site/ijcsis/ISSN 1947-5500

Activity (4)

You've already reviewed this. Edit your review.
1 thousand reads
1 hundred reads
Ruchi Gupta added this note
i also work on multiple sequence alignment problem that has to be solved by genetic algorithm .suggest me any approach of genetic through which i solve the problem of msa.
Yaswanth liked this

You're Reading a Free Preview

Download
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->