
Multiple Sequence Analysis

WHY Multiple sequence analysis?


• To align multiple sequences so that their residues match optimally
• Alignment is also performed as part of database similarity searching
• Structural alignment is good for finding the role of an amino
acid residue in a protein
• Homology-based alignment is good for finding evolutionary
relationships
Multiple sequence alignment: an Example

CLUSTAL W (1.82) multiple sequence alignment

GLB1_GLYDI ---------GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFGFSGAS 51
HBB_HUMAN --------VHLTPEEKSAVTALWGKV----NVDEVGGEALGRLLVVYPWTQRFFESFGDL 48
HBA_HUMAN ---------VLSPADKTNVKAAWGKVG--AHAGEYGAEALERMFLSFPTTKTYFPHF-DL 48
MYG_PHYCA ---------VLSEGEWQLVLHVWAKVE--ADVAGHGQDILIRLFKSHPETLEKFDRFKHL 49
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETSGVDILVKFFTSTPAAQEFFPKFKGL 58
GLB3_CHITP ----------LSADQISTVQASFDKVK------GDPVGILYAVFKADPSIMAKFTQFAGK 44
LGB2_LUPLU --------GALTESQAALVKSSWEEFN--ANIPKHTHRFFILVLEIAPAAKDLFSFLKGT 50
*: : : : . .: : .: * *

GLB1_GLYDI DP--------GVAALGAKVLAQIGVAVSHLGDE--GKMVAQMKAVGVRHKGYGNKHIKAQ 101
HBB_HUMAN STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHC--DKLHVDPE 101
HBA_HUMAN SH-----GSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHA--HKLRVDPV 96
MYG_PHYCA KTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHA--TKHKIPIK 102
GLB5_PETMA TTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHA--KSFQVDPQ 114
GLB3_CHITP DLES-IKGTAPFETHANRIVGFFSKIIGELPN-----IEADVNTFVASHK---PRGVTHD 95
LGB2_LUPLU SEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHV---SKGVADA 105
. . : . . . * . :

GLB1_GLYDI YFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS----- 147
HBB_HUMAN NFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------ 146
HBA_HUMAN NFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------ 141
MYG_PHYCA YLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
GLB5_PETMA YFKVLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY------- 149
GLB3_CHITP QLNNFRAGFVSYMKAHTD---FAGAEAAWGATLDTFFGMIFSKM------- 136
LGB2_LUPLU HFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA--- 153
: . : : . . . . : .

10/04/2011 Bioinformatics (BIOT 305) 3


The Meaning of a Multiple Alignment

• Proteins:
• Align residues that are homologous in both the evolutionary
and structural sense
• Evolutionary homology means the residues “descended” from a
common ancestral residue. (Actually, their coding DNA did…)
• Structural homology means that the residues are superimposable
when the protein structures are aligned
• Structural homology disappears for many residues (is impossible to
determine) when (pairwise) sequence identity gets below 30%
• Core structural elements tend to be visible in multiple alignments
• DNA:
• Align residues that are homologous in the evolutionary sense.
• Regulatory regions are often more conserved than the surrounding DNA
Purposes of Multiple Alignment

• Why perform an MSA?


• Visualize trends between homologous sequences:
• Shared regions of homology
• Regions unique to a sequence within a family
• Structural/functional motif
• As the first step in a phylogenetic analysis
• Improve accuracy of structure predictions
• How does one perform an MSA?
• Automated alignment and manual editing (expert)
• Considerations:
• Select sequences carefully:
• Homologous over their full length
• No unrelated sequences
• Computationally intensive: demands a lot of computational work
• When does it go wrong?
• Non-homologous sequences
• Multidomain proteins containing repeated domains
• Domains in different orders
• Wrong choice of alignment parameters
• Gap penalties are important
A popular MSA program
• ClustalW progressive alignment: pairwise alignment, then cluster analysis
Step one:
• Align the easy ones first
• How do you find them? → Make a tree from pairwise distances
• This is a Drag n Drop tree
• Use this tree to align pairs and to align alignments
Figure labels: Similarity; Align most similar pair; Gaps to optimize alignment;
Align next most similar pair; New gaps to optimize alignment
Types of Multiple Alignments (I)

• Global multiple alignment (ClustalW)
• Proteins, nucleotides
• Long stretches of conservation essential
• Identification of protein family profiles
• Gaps are scored
Types of Multiple Alignments (II)

• Local multiple alignments (motif detection)
• Proteins, nucleotides
• Short stretches of conservation (12 nt, 6 aa)
• Identification of regulatory motifs (DNA, proteins)
• No gap scoring
The Correct Alignment

[Figure: alternative alignments of ATTGCGC with ATCCGC that differ only in
gap placement, e.g. ATTGCGC / AT-CCGC versus ATTGCGC / ATC-CGC]
Consensus Sequences
• A pseudo-sequence that summarizes the residue information
contained in the multiple alignment

• Simplest Form:
A single sequence which represents the most common amino
acid/base in that position

Y  D  D  G  A    V    -  E  A  L
Y  D  G  G  -    -    -  E  A  L
F  E  G  G  I    L    V  E  A  L
F  D  -  G  I    L    V  Q  A  V
Y  E  G  G  A    V    V  Q  A  L
--------------------------------
Y  D  G  G  A/I  V/L  V  E  A  L   (consensus)
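The simplest consensus can be computed column by column. The following is a minimal sketch, assuming gaps are ignored when counting and ties are reported as `A/I`; the function name `consensus` is illustrative and not taken from any particular tool.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Return the most common residue(s) of each column of an alignment.

    Assumptions of this sketch: gaps are not counted, and tied residues
    are joined with '/' in order of first appearance in the column.
    """
    result = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:                      # column made only of gaps
            result.append(gap)
            continue
        top = max(counts.values())
        winners = [r for r, c in counts.items() if c == top]
        result.append("/".join(winners))
    return result

rows = ["YDDGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
print(" ".join(consensus(rows)))
```

Run on the five sequences from the slide, this reproduces the consensus row shown there, including the two tied positions.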
Elements of multiple alignment methods
• Scoring model

• Algorithm for finding the best alignment


Scoring model
• For pairwise alignments, the scoring is performed by
• Using score matrices (PAM, BLOSUM etc.)
• Gap penalties (linear, affine)

• A multiple alignment “implies” a pairwise alignment for each pair of sequences;
summing the scores of all these pairs gives the Sum of Pairs (SP) score

• SP (Sum of Pairs) defines the score of a multiple alignment as the sum of the scores of all
implied pairwise alignments
• Score each column using all possible pairwise matches, mismatches and gap
penalties
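The two gap penalty schemes named in this slide differ only in how the cost grows with gap length. A small sketch, assuming the convention that the opening penalty covers the first gapped position; the numeric penalties are illustrative values, not defaults of any program.

```python
def linear_gap(length, per_residue=-2):
    """Linear scheme: every gapped position costs the same."""
    return per_residue * length

def affine_gap(length, gap_open=-10, gap_extend=-1):
    """Affine scheme: opening a gap is expensive, extending it is cheap.

    Assumed convention: gap_open pays for the first gapped position,
    gap_extend for each additional one.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)

for length in (1, 5, 20):
    print(length, linear_gap(length), affine_gap(length))
```

With these example values a single-residue gap is much costlier under the affine scheme, while a 20-residue gap is cheaper than under the linear one.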
SP - Definition
• If A_ij is the score of the alignment implied for sequence pair
(i,j), then the total score is

$$SP = \sum_{i<j} A_{ij}$$
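The sum-of-pairs score can be computed directly from the columns of an alignment. The sketch below assumes a toy scoring scheme (match +1, mismatch -1, residue against gap -2, gap against gap 0) in place of a PAM or BLOSUM matrix; these values and the function names are illustrative assumptions.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols under the toy scheme."""
    if a == "-" and b == "-":
        return 0                 # two gaps carry no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum the pairwise scores over all sequence pairs and all columns."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

Summing column by column gives the same total as summing the scores of the implied pairwise alignments, because each column contributes once to every sequence pair.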
MSA algorithms
• Exhaustive algorithms

• Heuristic Algorithms
• Progressive Alignment methods
• Iterative Alignment methods
• Block-Based Alignment methods
Exhaustive Algorithms
• Involves examination of all possible aligned positions
• Similar to 2-D matrix in pairwise dynamic programming except
MSA involves multi-dimensional matrix
• Traceback involves finding the best alignment with highest score
• For N sequences, we then need N-dimensional matrix
• Drawback
• Needs more time for examining all possibilities
• Computationally prohibitive for more than about 10 short sequences
• Therefore, semi-exhaustive programs such as DCA (Divide-and-Conquer
Alignment) are used, which rely on heuristics for some steps
• Principle of DCA: breaks longer sequences into smaller segments based
on local similarities (longer sequences sometimes need further
division)
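A quick calculation shows why the exhaustive approach does not scale. This sketch only counts matrix cells and the predecessor cells examined per cell; the sequence length of 100 is an arbitrary example.

```python
def dp_cells(lengths):
    """Number of cells in the full N-dimensional dynamic-programming matrix."""
    cells = 1
    for n in lengths:
        cells *= n + 1           # one extra index per dimension for the empty prefix
    return cells

def moves_per_cell(n_sequences):
    """Predecessors examined per cell: any non-empty subset of sequences may advance."""
    return 2 ** n_sequences - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells([100] * n), moves_per_cell(n))
```

Both quantities grow exponentially with the number of sequences, which is the reason heuristics take over beyond a handful of sequences.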
Heuristic algorithms
• Because exhaustive algorithms are infeasible for most data sets, faster
heuristic algorithms have been developed
• Types
• Progressive
• Iterative
• Block-based
Progressive Alignment Method
• They build the alignment bottom-up as follows:
• Perform all pairwise alignments
• Construct a tree, joining the most similar sequences first (guide tree)
• Align sequences sequentially, following the guide tree
Progressive Multiple Alignment: Step 1
• Pairwise alignments for every possible pair of sequences using the
Needleman-Wunsch global alignment method
• Calculate a score for each pair, either as % identity or with a
substitution matrix

Figure labels: Use gaps to align them globally; Score each pair using a matrix
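Step 1 can be sketched with a plain Needleman-Wunsch implementation. This is a minimal version assuming a linear gap penalty and a match/mismatch score instead of a substitution matrix; the parameter values are illustrative, and percent identity is taken over the full alignment length, which is one of several conventions.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Globally align s and t; return (aligned_s, aligned_t, score)."""
    def sub(a, b):
        return match if a == b else mismatch

    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), score[n][m]

def percent_identity(a, b):
    """Identical aligned residues as a percentage of the alignment length."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

aligned_s, aligned_t, best = needleman_wunsch("ATTGCGC", "ATCCGC")
print(aligned_s, aligned_t, best, round(percent_identity(aligned_s, aligned_t), 1))
```

In a progressive method this function would be called once for every pair of input sequences, giving N(N-1)/2 pairwise scores for N sequences.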
Progressive Multiple Alignment: Step 2

• Scores correlate with evolutionary distances between sequences
• Convert these scores into evolutionary distances to generate a
distance matrix
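One simple way to turn pairwise scores into distances is to use the fraction of differing positions. The sketch below assumes the scores are percent identities and applies d = 1 - identity/100 with no correction for multiple substitutions; the identity values in the example are made up for illustration.

```python
def distance_matrix(identity):
    """Convert a square matrix of percent identities into distances.

    Assumption of this sketch: distance = 1 - identity / 100 (uncorrected).
    """
    n = len(identity)
    return [[0.0 if i == j else 1.0 - identity[i][j] / 100.0
             for j in range(n)]
            for i in range(n)]

# Hypothetical percent identities for three sequences.
identities = [[100, 80, 40],
              [80, 100, 50],
              [40, 50, 100]]
for row in distance_matrix(identities):
    print([round(x, 2) for x in row])
```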

Progressive Multiple Alignment: Step 3
• A phylogenetic tree is constructed using the Neighbor-Joining
method
• This tree is only the guide tree and does not necessarily
match a formally constructed phylogenetic tree
• Based on the guide tree:
• The 2 most closely related sequences are
re-aligned first
• Convert these aligned sequences into a
consensus sequence, taking gaps into account
• The consensus is treated as a single
sequence and aligned with the next closest
sequence, following the guide tree, using
dynamic programming
• The new alignment is used to generate a new
consensus, which is again aligned with the
next closest sequence
• The process is repeated until all the
sequences are aligned
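How a guide tree fixes the order of alignment can be sketched with a small clustering routine. Note that this sketch uses UPGMA (average linkage), not the Neighbor-Joining method named in the slide, because UPGMA is much shorter to write; the distances and sequence names are hypothetical.

```python
def upgma(names, dist):
    """Build a guide tree by average-linkage clustering (UPGMA).

    Returns (tree, merges): the tree as nested tuples and the list of
    merges in the order a progressive aligner would perform them.
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}   # id -> (subtree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                 # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merges.append((tree_a, tree_b))
        new_d = {}
        for c in clusters:                       # size-weighted average distance
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges

names = ["A", "B", "C", "D"]
dist = [[0.0, 0.2, 0.6, 0.7],
        [0.2, 0.0, 0.5, 0.8],
        [0.6, 0.5, 0.0, 0.3],
        [0.7, 0.8, 0.3, 0.0]]
tree, merges = upgma(names, dist)
print(tree)
for step, (left, right) in enumerate(merges, 1):
    print(step, left, right)
```

Each entry in `merges` corresponds to one alignment step: first two single sequences, later a sequence or group against another group.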
CLUSTAL
www.ebi.ac.uk/clustalw
• Well-known progressive multiple alignment program
• ClustalW runs on UNIX, and ClustalX provides a user-friendly graphical
interface
• Special features
• Uses multiple substitution matrices based on evolutionary
information provided from guide tree
• For closely related sequences, it uses BLOSUM62 or PAM120
• For more divergent sequences, it uses BLOSUM45 or PAM250
• Use of adjustable gap penalties
• Gaps in the conserved regions are discouraged (Not allowed or fewer allowed)
• Weighting scheme to increase the reliability of aligning divergent
sequences (<25% identity)
Pros and Cons of Progressive MSA
• Global alignment, so poorly suited to comparing sequences of
very different lengths
• Use of an affine gap penalty discourages long gaps; in some
cases, this may limit the accuracy of the method
• “Greedy” in nature
• Depends on the initial alignments; errors made in early steps, such as
misplaced gaps, cannot be corrected later
Iterative Alignment
• Basic idea
An optimal solution can be found by repeatedly modifying existing suboptimal
solutions
• Starts with a low-quality alignment and gradually improves it by iterative
realignment with well-defined procedures until no better score can be obtained
• Overcomes the “greedy” nature of progressive alignment because the order in which
sequences are aligned differs in each iteration
• However, this method does not guarantee finding the optimal alignment, as it is also
heuristic in nature
• E.g., PRRN
PRRN
https://www.genome.jp/tools/prrn/prrn_help.html
• Web-based program for iterative MSA
• 2 sets of iterations
• Outer iteration
• An initial random alignment is generated and used to build a UPGMA guide tree
• Weights are used to optimize the alignment
• Inner iteration
• Sequences are randomly divided into 2 groups
• The two groups are aligned using global dynamic programming, treating each group as a single
sequence
• The process is repeated until no better SP score is obtained
• This final alignment is used to generate a new guide tree based on UPGMA
• New weights are applied to optimize the alignment
• This whole process is repeated until there is no further improvement in the overall alignment
Fig: Schematic of iterative
alignment procedure for PRRN
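The inner iteration described for PRRN (split into two groups, realign the groups, keep the result if the SP score improves) can be sketched as follows. This is not PRRN's implementation: it assumes a toy scoring scheme, omits the sequence weights and the guide tree, and runs a fixed number of rounds instead of stopping when no improvement is found. All names and parameter values are illustrative.

```python
import random
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sp_score(rows):
    return sum(pair_score(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def strip_gap_columns(rows):
    """Drop columns that contain only gaps within this group."""
    cols = [c for c in zip(*rows) if any(x != GAP for x in c)]
    return ["".join(r) for r in zip(*cols)]

def align_groups(g1, g2):
    """Global DP between two groups, each treated as one sequence of columns."""
    c1, c2 = list(zip(*g1)), list(zip(*g2))
    gap1, gap2 = (GAP,) * len(g1), (GAP,) * len(g2)

    def cross(x, y):                       # SP contribution between the groups
        return sum(pair_score(a, b) for a in x for b in y)

    n, m = len(c1), len(c2)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + cross(c1[i - 1], gap2)
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + cross(gap1, c2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + cross(c1[i - 1], c2[j - 1]),
                          S[i - 1][j] + cross(c1[i - 1], gap2),
                          S[i][j - 1] + cross(gap1, c2[j - 1]))
    out, i, j = [], n, m
    while i > 0 or j > 0:                  # traceback, building merged columns
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + cross(c1[i - 1], c2[j - 1]):
            out.append(c1[i - 1] + c2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + cross(c1[i - 1], gap2):
            out.append(c1[i - 1] + gap2); i -= 1
        else:
            out.append(gap1 + c2[j - 1]); j -= 1
    out.reverse()
    return ["".join(r) for r in zip(*out)]

def refine(rows, rounds=20, seed=0):
    """Repeatedly split the alignment at random and realign the two groups."""
    rng = random.Random(seed)
    best, best_score = list(rows), sp_score(rows)
    for _ in range(rounds):
        idx = list(range(len(best)))
        rng.shuffle(idx)
        cut = rng.randint(1, len(idx) - 1)
        left, right = idx[:cut], idx[cut:]
        merged = align_groups(strip_gap_columns([best[k] for k in left]),
                              strip_gap_columns([best[k] for k in right]))
        candidate = [None] * len(best)
        for pos, k in enumerate(left + right):
            candidate[k] = merged[pos]     # restore the original row order
        score = sp_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

start = ["ATTGCGC", "ATCCGC-", "ATGCGC-"]   # naive left-justified starting alignment
refined, score = refine(start)
print(sp_score(start), score)
print("\n".join(refined))
```

Because a candidate is only accepted when its SP score is higher, the score never decreases, but the procedure can still stop at a local optimum.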

04/04/2008 Bioinformatics (BIOT 305)
Block Based Method
• Local Multiple Sequence alignment
• Used to identify conserved domains and motifs
• Identifies BLOCKS of ungapped alignment shared by all sequences
• Two block-based programs:
• DIALIGN2
• Web-based program
• Does not use gap penalties, so it is less sensitive to long gaps
• Breaks each sequence down into smaller segments and
looks for all possible pairwise alignments of blocks
• All aligned blocks are then compiled in a progressive manner to assemble
the full MSA
• The sequences between blocks are left unaligned
• Match-Box www.sciences.fundp.ac.be/biologie/bms/matchbox_submit.shtml
• Identifies conserved blocks among sequences
• Conserved blocks are of ~9 residues
• If the similarity is above a certain threshold, the blocks are used as anchors to assemble
the multiple alignment, and residues between blocks are left unaligned
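The idea of ungapped blocks shared by all sequences can be illustrated with a deliberately simplified search. This sketch assumes exact matching of fixed-length segments, whereas DIALIGN2 and Match-Box score similarity against a threshold; it is not how either program works internally, and the sequences are invented examples.

```python
def shared_blocks(seqs, k):
    """Find ungapped segments of length k that occur in every sequence.

    Returns {segment: [first start position in each sequence]}.
    Simplification: segments must match exactly.
    """
    first = seqs[0]
    blocks = {}
    for i in range(len(first) - k + 1):
        segment = first[i:i + k]
        if segment in blocks:
            continue
        positions = [s.find(segment) for s in seqs]
        if all(p >= 0 for p in positions):
            blocks[segment] = positions
    return blocks

seqs = ["GGATTGCGCAA", "TTATTGCCC", "CATTGCA"]
print(shared_blocks(seqs, 5))
```

The reported start positions could serve as anchors: residues inside the block are aligned, and the regions between blocks are left unaligned.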
