GLB1_GLYDI ---------GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFGFSGAS 51
HBB_HUMAN --------VHLTPEEKSAVTALWGKV----NVDEVGGEALGRLLVVYPWTQRFFESFGDL 48
HBA_HUMAN ---------VLSPADKTNVKAAWGKVG--AHAGEYGAEALERMFLSFPTTKTYFPHF-DL 48
MYG_PHYCA ---------VLSEGEWQLVLHVWAKVE--ADVAGHGQDILIRLFKSHPETLEKFDRFKHL 49
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETSGVDILVKFFTSTPAAQEFFPKFKGL 58
GLB3_CHITP ----------LSADQISTVQASFDKVK------GDPVGILYAVFKADPSIMAKFTQFAGK 44
LGB2_LUPLU --------GALTESQAALVKSSWEEFN--ANIPKHTHRFFILVLEIAPAAKDLFSFLKGT 50
*: : : : . .: : .: * *
• Proteins:
• Align residues that are homologous in both the evolutionary
and structural sense
• Evolutionary homology means the residues “descended” from a
common ancestral residue. (Actually, their coding DNA did…)
• Structural homology means that the residues are superimposable
when the protein structures are aligned
• Structural homology becomes impossible to determine for many residues
when pairwise sequence identity falls below 30%
• Core structural elements tend to be visible in multiple alignments
• DNA:
• Align residues that are homologous in the evolutionary sense.
• Regulatory regions are often more conserved than the surrounding DNA
Purposes of Multiple Alignment
• Simplest Form:
A single sequence which represents the most common amino
acid/base in that position
Y D D G A V - E A L
Y D G G - - - E A L
F E G G I L V E A L
F D - G I L V Q A V
Y E G G A V V Q A L
Y D G G A/I V/L V E A L
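The consensus row above can be reproduced with a short script. This is a minimal sketch, assuming that gaps are ignored when counting and that ties are reported with a "/" as in the slide; the function name `consensus` is only illustrative.

```python
from collections import Counter

def consensus(rows, gap="-"):
    """Return the most common residue(s) in each column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        # Assumption: gap characters are not counted as residues.
        counts = Counter(symbol for symbol in column if symbol != gap)
        if not counts:
            result.append(gap)  # the column holds only gaps
            continue
        top = max(counts.values())
        # Keep every residue that reaches the top count, in order of first appearance.
        result.append("/".join(s for s in counts if counts[s] == top))
    return result

rows = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(rows)))  # prints: Y D G G A/I V/L V E A L
```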
Elements of multiple alignment methods
• Scoring model
• A multiple alignment “implies” a pairwise alignment for each pair of sequences;
summing the scores of all these pairs gives the Sum of Pairs score
• SP (Sum of Pairs) defines the score of a multiple alignment as the sum of the scores of all
implied pairwise alignments
• Each column is scored over all possible pairs, counting matches, mismatches and gap
penalties
SP Definition
• If Aij is the score of the alignment implied for sequence pair
(i,j), then the total score is
$$S = \sum_{i<j} A_{ij}$$
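The sum-of-pairs definition can be illustrated in a few lines. This is a sketch only: the match, mismatch and gap values are hypothetical placeholders (a real program would use a substitution matrix such as BLOSUM62), gaps are charged per position, and columns where both sequences have a gap are scored 0, which is a common convention but an assumption here.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned position of an implied pairwise alignment."""
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap positions contribute nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(rows):
    """Sum the scores of all implied pairwise alignments (i < j)."""
    total = 0
    for x, y in combinations(rows, 2):
        total += sum(pair_score(a, b) for a, b in zip(x, y))
    return total

print(sp_score(["ACGT", "ACGT", "A-GT"]))  # prints: 6
```

In the example, the pair of identical sequences scores 4 and each pair involving the gapped sequence scores 1, giving 6 in total.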
MSA algorithms
• Exhaustive algorithms
• Heuristic Algorithms
• Progressive Alignment methods
• Iterative Alignment methods
• Block-Based Alignment methods
Exhaustive Algorithms
• Involve examination of all possible aligned positions
• Similar to the 2-D matrix in pairwise dynamic programming, except that
MSA requires a multi-dimensional matrix
• Traceback finds the alignment with the highest score
• For N sequences, an N-dimensional matrix is needed
• Drawback
• Examining all possibilities takes much more time
• Computationally intensive for more than 10 short sequences
• Therefore, semi-exhaustive programs like DCA (Divide-and-Conquer
Alignment) are used, which rely on heuristics for some steps
• Principle of DCA: breaks longer sequences into smaller ones based
on local similarities (longer sequences sometimes need further
divisions)
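To see why the exhaustive approach becomes impractical, it helps to count the cells of the N-dimensional matrix. The sketch below assumes the usual dynamic-programming layout, with one extra row or column per sequence for the empty prefix; the function names are illustrative.

```python
def dp_cells(lengths):
    """Number of cells in the N-dimensional DP matrix for the given sequence lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra index per dimension for the empty prefix
    return cells

def predecessors_per_cell(num_sequences):
    """Each cell is reached from every neighbour that steps back in at least one dimension."""
    return 2 ** num_sequences - 1

print(dp_cells([100] * 2))  # prints: 10201
print(dp_cells([100] * 3))  # prints: 1030301
print(predecessors_per_cell(2), predecessors_per_cell(3))  # prints: 3 7
```

Both the matrix size and the work per cell grow exponentially with the number of sequences, which is the reason heuristic methods are preferred.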
Heuristic algorithms
• Because exhaustive algorithms are infeasible, faster heuristic algorithms
have been developed
• Types
• Progressive
• Iterative
• Block-based
Progressive Alignment Method
• They build the alignment bottom up as follows:
Progressive Multiple Alignment: Step 3
• A phylogenetic tree is constructed using the Neighbor-Joining
method
• This tree is only the Guide Tree and is not
necessarily identical to a formally constructed
phylogenetic tree
• Based on Guide Tree:
• The 2 most closely related sequences are
re-aligned first
• These aligned sequences are converted into a
consensus sequence, taking gaps into account
• The consensus sequence is treated as a single
sequence and is aligned with the next
closest sequence in the guide tree using
dynamic programming
• The new alignment is used to generate a new
consensus, which is again aligned with
the next closest sequence
• The process is repeated until all the
sequences are aligned
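The consensus-based procedure described in this step can be sketched as follows. This is a simplified illustration, not the ClustalW implementation: the input list is assumed to be already ordered by the guide tree (closest sequences first), the tree is flattened to that linear order, the scoring values are hypothetical, and all function names are invented for the example.

```python
from collections import Counter

def align_path(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (i, j) index pairs, None marks a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    path, i, j = [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            path.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            path.append((i - 1, None)); i -= 1
        else:
            path.append((None, j - 1)); j -= 1
    return path[::-1]

def consensus(rows):
    """Most common symbol per column, counting gaps as symbols."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def progressive_align(seqs_in_guide_order):
    """Add one sequence at a time by aligning it to the current consensus."""
    rows = [seqs_in_guide_order[0]]
    for seq in seqs_in_guide_order[1:]:
        path = align_path(consensus(rows), seq)
        # Gaps opened in the consensus are copied into every row already aligned.
        rows = ["".join(r[i] if i is not None else "-" for i, _ in path) for r in rows]
        rows.append("".join(seq[j] if j is not None else "-" for _, j in path))
    return rows

print(progressive_align(["ACGT", "ACGT", "AGT"]))  # prints: ['ACGT', 'ACGT', 'A-GT']
```

Because each new sequence is aligned only to the consensus, a gap placed in an early round is never revisited, which is the "greedy" behaviour of progressive methods.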
CLUSTAL
www.ebi.ac.uk/clustalw
• Well-known progressive multiple alignment program
• ClustalW runs on UNIX and ClustalX provides a user-friendly graphical
interface
• Special features
• Uses multiple substitution matrices, chosen according to the evolutionary
information provided by the guide tree
• For closely related sequences, it uses BLOSUM62 or PAM120
• For more divergent sequences, it uses BLOSUM45 or PAM250
• Uses adjustable gap penalties
• Gaps in conserved regions are discouraged (not allowed, or fewer allowed)
• Uses a weighting scheme to increase the reliability of aligning divergent
sequences (<25% identity)
Pros and Cons of Progressive MSA
• Produces a global alignment, so it is not suitable for comparing sequences of
different lengths
• Use of an affine gap penalty discourages long gaps and, in some
cases, this may limit the accuracy of the method.
• “Greedy” in nature
• Depends on the initial alignment; errors made in early steps, for example
when inserting gaps, cannot be corrected later
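For reference, an affine gap penalty charges one cost for opening a gap and a smaller cost for each additional position. The sketch below uses one common convention (the opening cost covers the first position) and hypothetical penalty values; neither is taken from a specific program's defaults.

```python
def linear_gap_cost(length, per_position=2.0):
    """Linear penalty: every gap position costs the same."""
    return per_position * length

def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty: gap_open for the first position, gap_extend for each further one."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

for length in (1, 5, 20):
    print(length, linear_gap_cost(length), affine_gap_cost(length))
# prints:
# 1 2.0 10.0
# 5 10.0 12.0
# 20 40.0 19.5
```

The cost still rises with gap length, so a long insertion or deletion remains penalized under either scheme.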
Iterative Alignment
• Basic idea
An optimal solution can be found by repeatedly modifying existing suboptimal
solutions
• Starts with a low-quality alignment and gradually improves it by iterative
realignment with well-defined procedures until no better score can be obtained
• Overcomes the “greedy” nature of progressive methods because the order of sequences used
for alignment is different in each iteration
• But this method does not guarantee finding the optimal alignment, as it is also
heuristic in nature.
• E.g. PRRN
PRRN
https://www.genome.jp/tools/prrn/prrn_help.html
• Web-based program for iterative MSA
• 2 sets of iterations
• Outer iteration
• An initial random alignment is generated and used to build a UPGMA guide tree
• Weights are used to optimize the alignment
• Inner iteration
• Sequences are randomly divided into 2 groups
• The two groups are aligned using global dynamic programming, treating each group as a single
sequence
• The process is repeated until no better SP score is obtained
• This final alignment is used to generate a new guide tree based on UPGMA
• New weights are applied to optimize the alignment
• This whole process is repeated until there is no further improvement in the overall alignment
Fig: Schematic of iterative
alignment procedure for PRRN
04/04/2008 Bioinformatics (BIOT 305)
Block Based Method
• Local Multiple Sequence alignment
• Used to identify conserved domains and motifs
• Identifies BLOCKS of ungapped alignment shared by all sequences
• Two block-based programs are used
• DIALIGN2
• Web-based program
• Does not use gap penalties, so it is less sensitive to long gaps
• This method breaks each sequence down into smaller segments and
looks for all possible pairwise alignments of blocks.
• All aligned blocks are then compiled in a progressive manner to assemble
the full MSA
• The sequences between blocks are left unaligned
• Match-Box www.sciences.fundp.ac.be/biologie/bms/matchbox_submit.shtml
• Identifies conserved blocks among sequences
• Conserved blocks are about 9 residues long
• If the similarity is above a certain threshold, the blocks are used as anchors to assemble
the multiple alignment, and residues between blocks are left unaligned
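The idea of ungapped blocks shared by all sequences can be illustrated with a much simplified search. The sketch below only finds segments of a fixed length that occur identically in every sequence; DIALIGN2 and Match-Box score similar (not just identical) segments, so this is an assumption-laden toy and not either program's method. The function name and the test sequences are invented.

```python
def shared_blocks(seqs, k):
    """Return every length-k segment present in all sequences, with its first offset in each."""
    common = {seqs[0][i:i + k] for i in range(len(seqs[0]) - k + 1)}
    for s in seqs[1:]:
        # Keep only the segments that also occur in this sequence.
        common &= {s[i:i + k] for i in range(len(s) - k + 1)}
    return {block: [s.find(block) for s in seqs] for block in sorted(common)}

seqs = ["TTGACGTAA", "GACGTCC", "AAGACGT"]
print(shared_blocks(seqs, 5))  # prints: {'GACGT': [2, 0, 2]}
```

The offsets show where the block would be anchored in each sequence; the residues on either side of it would be left unaligned, as described above.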