Multiple Seq Alignments

Multiple Sequence Analysis
Why multiple sequence analysis?

• To align multiple sequences so that matching among them is optimal
• Multiple sequence comparison is also done through database similarity searching
• Structural alignment is good for finding the role of an amino acid residue in a protein
• Homology-based alignment is good for finding evolutionary relationships
Multiple sequence alignment: an Example
CLUSTAL W (1.82) multiple sequence alignment
GLB1_GLYDI ---------GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFGFSGAS 51
HBB_HUMAN --------VHLTPEEKSAVTALWGKV----NVDEVGGEALGRLLVVYPWTQRFFESFGDL 48
HBA_HUMAN ---------VLSPADKTNVKAAWGKVG--AHAGEYGAEALERMFLSFPTTKTYFPHF-DL 48
MYG_PHYCA ---------VLSEGEWQLVLHVWAKVE--ADVAGHGQDILIRLFKSHPETLEKFDRFKHL 49
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETSGVDILVKFFTSTPAAQEFFPKFKGL 58
GLB3_CHITP ----------LSADQISTVQASFDKVK------GDPVGILYAVFKADPSIMAKFTQFAGK 44
LGB2_LUPLU --------GALTESQAALVKSSWEEFN--ANIPKHTHRFFILVLEIAPAAKDLFSFLKGT 50
*: : : : . .: : .: * *
GLB1_GLYDI DP--------GVAALGAKVLAQIGVAVSHLGDE--GKMVAQMKAVGVRHKGYGNKHIKAQ 101

HBB_HUMAN STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHC--DKLHVDPE 101
HBA_HUMAN SH-----GSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHA--HKLRVDPV 96
MYG_PHYCA KTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHA--TKHKIPIK 102
GLB5_PETMA TTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHA--KSFQVDPQ 114
GLB3_CHITP DLES-IKGTAPFETHANRIVGFFSKIIGELPN-----IEADVNTFVASHK---PRGVTHD 95
LGB2_LUPLU SEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHV---SKGVADA 105
. . : . . . * . :
GLB1_GLYDI YFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS----- 147

HBB_HUMAN NFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------ 146
HBA_HUMAN NFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------ 141
MYG_PHYCA YLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
GLB5_PETMA YFKVLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY------- 149
GLB3_CHITP QLNNFRAGFVSYMKAHTD---FAGAEAAWGATLDTFFGMIFSKM------- 136
LGB2_LUPLU HFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA--- 153
: . : : . . . . : .
10/04/2011 Bioinformatics (BIOT 305) 3

The Meaning of a Multiple Alignment
• Proteins:
• Align residues that are homologous in both the evolutionary
and structural sense
• Evolutionary homology means the residues “descended” from a
common ancestral residue. (Actually, their coding DNA did…)
• Structural homology means that the residues are superimposable
when the protein structures are aligned
• Structural homology disappears for many residues (is impossible to
determine) when (pairwise) sequence identity gets below 30%
• Core structural elements tend to be visible in multiple alignments
• DNA:
• Align residues that are homologous in the evolutionary sense.
• Regulatory regions are often more conserved than the surrounding DNA
Purposes of Multiple Alignment
• Why perform an MSA?

• Visualize trends between homologous sequences:
• Shared regions of homology
• Regions unique to a sequence within a family
• Structural/functional motifs
• As the first step in a phylogenetic analysis
• Improve accuracy of structure predictions
• How does one perform an MSA?
• Automated alignment and manual editing (expert)
• Considerations:
• Select sequences carefully:
• Homologous over their whole length
• No unrelated sequences
• Computationally intensive: demands a lot of computational work
• When does it go wrong?
• Non-homologous sequences
• Multidomain proteins containing repeated domains
• Domains in different orders
• Choice of wrong alignment parameters
• Gap penalties are important
A popular MSA program
• ClustalW progressive alignment: pairwise alignment, then cluster analysis
• Step one: align the easy ones first
• How do you find them? Make a tree from pairwise distances
• This is a "Drag n Drop" tree
• Use this tree to align pairs and to align alignments
[Figure labels: Similarity; Align most similar pair; Gaps to optimize alignment; Align next most similar pair; New gaps to optimize alignment]
Types of Multiple Alignments (I)
• Global multiple alignment (ClustalW)
• Proteins, nucleotides
• Long stretches of
conservation essential
• Identification of protein
family profiles
• Score gaps
Types of Multiple Alignments (II)
• Local multiple alignments (motif detection)
• Proteins, nucleotides
• Short stretches of conservation (12 nt, 6 aa)
• Identification of regulatory
motifs (DNA, proteins)
• No gap scoring
The Correct Alignment
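[Figure: ATTGCGC is shown unchanged in one lineage (ATTGCGC → ATTGCGC) and changed to ATCCGC in another. The derived sequence ATCCGC can be aligned against ATTGCGC either as AT-CCGC or as ATC-CGC.]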
Consensus Sequences
• A pseudo-sequence that summarizes the residue information contained in the multiple alignment
• Simplest form: a single sequence that represents the most common amino acid/base at each position
Y   D   D   G   A   V   -   E   A   L
Y   D   G   G   -   -   -   E   A   L
F   E   G   G   I   L   V   E   A   L
F   D   -   G   I   L   V   Q   A   V
Y   E   G   G   A   V   V   Q   A   L
Y   D   G   G   A/I V/L V   E   A   L   (consensus)
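The simplest consensus rule can be sketched in a few lines of Python. This is a minimal illustration, assuming that gaps are ignored when counting and that ties are written with a slash (as in the A/I and V/L columns of the example); the function name `consensus` is hypothetical and not taken from any particular program.

```python
from collections import Counter

def consensus(rows, gap="-"):
    """Return one consensus symbol per column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        # Count residues only; gap characters do not vote (an assumption).
        counts = Counter(r for r in column if r != gap)
        if not counts:
            result.append(gap)
            continue
        best = max(counts.values())
        # Counter keeps first-seen order, so tied residues appear top to bottom.
        result.append("/".join(r for r, n in counts.items() if n == best))
    return result

alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(alignment)))  # Y D G G A/I V/L V E A L
```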
Elements of multiple alignment methods
• Scoring model
• Algorithm for finding the best alignment

Scoring model
• For pairwise alignments, the scoring is performed by
• Using score matrices (PAM, BLOSUM etc.)
• Gap penalties (linear, affine)
• A multiple alignment "implies" a pairwise alignment for each pair of sequences; summing the scores of all these pairs gives the Sum of Pairs (SP) score
• SP (Sum of Pairs) defines the score of a multiple alignment as the sum of the scores of all implied pairwise alignments
• Each column is scored over all possible pairs in it, counting matches, mismatches and gap penalties
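The difference between the two gap-penalty models listed above can be made concrete with a short sketch. The cost values and the function names `linear_gap` and `affine_gap` are assumptions chosen for illustration. The affine form here charges the opening cost for the first gap position and the extension cost for each further position; some programs instead charge opening plus extension for every position.

```python
def linear_gap(length, per_residue=2):
    """Linear model: every gap position costs the same."""
    return length * per_residue

def affine_gap(length, open_cost=10, extend_cost=1):
    """Affine model: opening a gap is expensive, extending it is cheap."""
    if length == 0:
        return 0
    return open_cost + (length - 1) * extend_cost

for n in (1, 2, 5):
    print(n, linear_gap(n), affine_gap(n))
```

Under the affine model one gap of length 5 costs 14, while five separate gaps of length 1 cost 50, so a few long gaps are preferred over many short ones.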
SP: Definition
• If A_ij is the score of the alignment implied for sequence pair (i, j), then the total score over N sequences is
$$\mathrm{SP} = \sum_{1 \le i < j \le N} A_{ij}$$
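A minimal sketch of the Sum of Pairs score follows. It assumes a toy scoring scheme (match +1, mismatch -1, residue against gap -2, gap against gap 0) instead of a PAM or BLOSUM matrix, and a linear gap cost per column; the names `pair_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (toy values, not a real matrix)."""
    if a == "-" and b == "-":
        return 0          # a gap aligned with a gap is ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum the pair scores over every column and every pair of sequences."""
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

rows = ["ATTGCGC", "AT-CCGC", "ATTGCGC"]
print(sum_of_pairs(rows))  # 11
```

Summing column by column gives the same total as scoring each implied pairwise alignment separately and adding the results: here 2 + 7 + 2 = 11.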
MSA algorithms
• Exhaustive algorithms
• Heuristic Algorithms
• Progressive Alignment methods
• Iterative Alignment methods
• Block-Based Alignment methods
Exhaustive Algorithms
• Involves examining all possible aligned positions
• Similar to the 2-D matrix in pairwise dynamic programming, except that MSA uses a multi-dimensional matrix
• Traceback finds the alignment with the highest score
• For N sequences, an N-dimensional matrix is needed
• Drawbacks
• Needs more time to examine all possibilities
• Computationally intensive for more than about 10 short sequences
• Therefore, semi-exhaustive programs such as DCA (Divide-and-Conquer Alignment) are used, which rely on heuristics for some steps
• Principle of DCA: breaks longer sequences into smaller segments based on local similarities (very long sequences may need further division)
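The growth of the N-dimensional matrix can be estimated with a small calculation. This sketch assumes the standard dynamic-programming layout with one extra row per dimension for the empty prefix, and counts the neighbouring cells each cell looks back to (every non-empty subset of sequences may advance); the function names are hypothetical.

```python
def dp_cells(lengths):
    """Number of cells in the N-dimensional matrix for the given sequence lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

def predecessors_per_cell(num_sequences):
    """Each cell is filled from up to 2^N - 1 earlier cells."""
    return 2 ** num_sequences - 1

for n_seqs in (2, 3, 5, 10):
    print(n_seqs, dp_cells([100] * n_seqs), predecessors_per_cell(n_seqs))
```

For sequences of length 100, two sequences need 10,201 cells and three need 1,030,301, which is why the exhaustive approach becomes impractical so quickly.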
Heuristic algorithms
• Because exhaustive algorithms are infeasible for most data sets, faster heuristic algorithms have been developed
• Types
• Progressive
• Iterative
• Block-based
Progressive Alignment Method
• They build the alignment bottom-up as follows:
• Perform pairwise alignments
• Construct a tree, joining the most similar sequences first (guide tree)
• Align sequences sequentially, following the guide tree
Progressive Multiple Alignment: Step 1
• Perform pairwise alignments for every possible pair of sequences using the Needleman-Wunsch global alignment method
• Calculate a score for each pair, either as % identity or with a substitution matrix
• Use gaps to align each pair globally
• Score each pair using the matrix
• Scores correlate with evolutionary distances between sequences
• Convert these scores into evolutionary distances to generate a distance matrix
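Step 1 can be sketched as follows. This is a simplified illustration and not the ClustalW implementation: it assumes a toy scoring scheme (match +1, mismatch -1, linear gap -2) in place of a substitution matrix, and converts % identity to a distance as 1 - identity. All function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences; returns the two gapped strings."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b, i, j = [], [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def identity(x, y):
    """Fraction of alignment columns holding identical residues."""
    return sum(1 for p, q in zip(x, y) if p == q and p != "-") / len(x)

def distance_matrix(seqs):
    """Pairwise distances, taken here as 1 - identity (an assumption)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - identity(*needleman_wunsch(seqs[i], seqs[j]))
            dist[i][j] = dist[j][i] = d
    return dist

seqs = ["ATTGCGC", "ATCCGC", "ATTGCGC"]
for row in distance_matrix(seqs):
    print(["%.3f" % d for d in row])
```

Real programs usually apply a correction when turning identity into an evolutionary distance; the plain 1 - identity used here only preserves the ranking of the pairs.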
• A phylogenetic tree is constructed from the distance matrix using the Neighbor-Joining method
• This tree is only a guide tree and is not identical to a formally constructed phylogenetic tree
• Based on the guide tree:
• The two most closely related sequences are re-aligned first
• The aligned pair is converted into a consensus sequence, with gaps taken into account
• The consensus is treated as a single sequence and aligned by dynamic programming with the next closest sequence in the guide tree
• The new alignment is used to generate a new consensus, which is then aligned with the next closest sequence
• The process is repeated until all the sequences are aligned
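How a distance matrix turns into a joining order can be sketched as below. Note that this sketch swaps in simple average-linkage clustering (UPGMA-style) instead of the Neighbor-Joining method named above, because it is shorter to write down; the function name `guide_tree` and the example distances are hypothetical.

```python
def guide_tree(dist, names):
    """Build a nested-tuple guide tree by repeatedly joining the closest clusters.

    Uses average linkage (UPGMA-style), not Neighbor-Joining.
    """
    # Each cluster is (subtree, list of member indices into dist).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def average_distance(c1, c2):
        total = sum(dist[i][j] for i in c1[1] for j in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        candidates = [(average_distance(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters))]
        _, x, y = min(candidates)  # closest pair of clusters is joined next
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]

names = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.1, 0.4, 0.6],
    [0.1, 0.0, 0.5, 0.7],
    [0.4, 0.5, 0.0, 0.3],
    [0.6, 0.7, 0.3, 0.0],
]
print(guide_tree(dist, names))  # (('A', 'B'), ('C', 'D'))
```

Reading the nested tuple from the inside out gives the alignment order: A with B, then C with D, then the two resulting alignments with each other.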
CLUSTAL
www.ebi.ac.uk/clustalw
• Well-known progressive multiple alignment program
• ClustalW runs on UNIX, and ClustalX provides a user-friendly graphical interface
• Special features
• Uses multiple substitution matrices, chosen according to the evolutionary information provided by the guide tree
• For closely related sequences, it uses BLOSUM62 or PAM120
• For more divergent sequences, it uses BLOSUM45 or PAM250
• Uses adjustable gap penalties
• Gaps in conserved regions are discouraged (not allowed, or fewer allowed)
• Uses a weighting scheme to increase the reliability of aligning divergent sequences (<25% identity)
Pros and Cons of Progressive MSA
• Global alignment, so it is poorly suited to comparing sequences of very different lengths
• The affine gap penalty discourages long gaps, which in some cases may limit the accuracy of the method
• "Greedy" in nature
• Depends on the initial alignments; errors made in the early steps (for example, misplaced gaps) cannot be corrected later
Iterative Alignment
• Basic idea: an optimal solution can be approached by repeatedly modifying existing suboptimal solutions
• Starts with a low-quality alignment and gradually improves it by iterative realignment with well-defined procedures until no better score can be obtained
• Overcomes the "greedy" nature of the progressive method because the order in which sequences are aligned is different in each iteration
• Does not guarantee finding the optimal alignment, because it is also heuristic in nature
• E.g. PRRN
PRRN
https://www.genome.jp/tools/prrn/prrn_help.html
• Web-based program for iterative MSA
• Two sets of iterations
• Outer iteration
• An initial random alignment is generated and used to build a UPGMA guide tree
• Weights are used to optimize the alignment
• Inner iteration
• Sequences are randomly divided into two groups
• The two groups are aligned using global dynamic programming, treating each group as a single sequence
• The process is repeated until no better SP score is obtained
• This final alignment is used to generate a new guide tree based on UPGMA
• New weights are applied to optimize the alignment
• The whole process is repeated until there is no further improvement in the overall alignment
Fig: Schematic of iterative
alignment procedure for PRRN
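The inner iteration can be sketched as follows. This is a strongly simplified illustration and not the PRRN implementation: it assumes a toy SP scoring scheme (match +1, mismatch -1, gap -2), omits the sequence weights and the guide-tree update of the outer iteration, assumes every sequence has at least one residue, and runs a fixed number of rounds. All function names are hypothetical.

```python
import random
from itertools import combinations

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_score(rows):
    return sum(pair_score(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def cross(col1, col2):
    """Score of one column of group 1 placed against one column of group 2."""
    return sum(pair_score(a, b) for a in col1 for b in col2)

def drop_gap_columns(rows):
    cols = [c for c in zip(*rows) if any(x != "-" for x in c)]
    return ["".join(r) for r in zip(*cols)]

def align_groups(g1, g2):
    """Global DP that aligns two groups, treating each group as a single sequence."""
    c1, c2 = list(zip(*g1)), list(zip(*g2))
    gap1, gap2 = ("-",) * len(g1), ("-",) * len(g2)
    n, m = len(c1), len(c2)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + cross(c1[i - 1], gap2)
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + cross(gap1, c2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + cross(c1[i - 1], c2[j - 1]),
                          S[i - 1][j] + cross(c1[i - 1], gap2),
                          S[i][j - 1] + cross(gap1, c2[j - 1]))
    cols, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + cross(c1[i - 1], c2[j - 1]):
            cols.append(c1[i - 1] + c2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + cross(c1[i - 1], gap2):
            cols.append(c1[i - 1] + gap2); i -= 1
        else:
            cols.append(gap1 + c2[j - 1]); j -= 1
    cols.reverse()
    return ["".join(r) for r in zip(*cols)]

def refine(rows, rounds=20, seed=0):
    """Randomly split, realign the two groups, keep the result if SP improves."""
    rng = random.Random(seed)
    best = list(rows)
    for _ in range(rounds):
        order = list(range(len(best)))
        rng.shuffle(order)
        cut = rng.randint(1, len(order) - 1)
        left, right = order[:cut], order[cut:]
        merged = align_groups(drop_gap_columns([best[k] for k in left]),
                              drop_gap_columns([best[k] for k in right]))
        candidate = [None] * len(best)
        for k, row in zip(left + right, merged):
            candidate[k] = row
        if sp_score(candidate) > sp_score(best):
            best = candidate
    return best

start = ["ATTGCGC", "-ATCCGC", "ATTGCGC"]  # deliberately poor starting alignment
refined = refine(start)
print(sp_score(start), sp_score(refined))
```

Because a candidate is kept only when its SP score is higher, the score never decreases, but the loop can still stop at a local optimum.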
04/04/2008 Bioinformatics (BIOT 305)
Block Based Method
• Local multiple sequence alignment
• Used to identify conserved domains and motifs
• Identifies BLOCKS of ungapped alignment shared by all sequences
• Two block-based programs:
• DIALIGN2
• Web-based program
• Does not use gap penalties, so it is less sensitive to long gaps
• Breaks each sequence down into smaller segments and looks for all possible pairwise alignments of blocks
• All aligned blocks are then compiled in a progressive manner to assemble the full MSA
• The sequences between blocks are left unaligned
• Match-Box www.sciences.fundp.ac.be/biologie/bms/matchbox_submit.shtml
• Identifies conserved blocks among sequences
• Conserved blocks are about 9 residues long
• If the similarity is above a certain threshold, the blocks are used as anchors to assemble the multiple alignment, and residues between blocks are left unaligned
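The idea of ungapped blocks shared by all sequences can be sketched with exact substring matching. This is a deliberate simplification: DIALIGN2 and Match-Box score similar (not only identical) segments against a threshold, which this sketch does not do. The function name `shared_blocks`, the toy sequences and the block length are hypothetical.

```python
def shared_blocks(seqs, k=9):
    """Find k-residue segments that occur, ungapped and identical, in every sequence.

    Returns {block: [first start position in each sequence]}.
    """
    def first_positions(s):
        table = {}
        for i in range(len(s) - k + 1):
            table.setdefault(s[i:i + k], i)  # remember the first occurrence only
        return table

    tables = [first_positions(s) for s in seqs]
    common = set(tables[0]).intersection(*tables[1:])
    # Report blocks in the order they appear in the first sequence.
    return {b: [t[b] for t in tables] for b in sorted(common, key=tables[0].get)}

seqs = ["MKVLAAGHELLO", "TTVLAAGWW", "VLAAGPP"]
print(shared_blocks(seqs, k=5))  # {'VLAAG': [2, 2, 0]}
```

The start positions of a shared block act as anchors: the block columns are aligned across all sequences, and the residues between anchors are left unaligned.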
