
Bioinformatics
Lecture 5
Multiple Sequence Alignment & Evolutionary Analysis

Multiple Sequence Alignment (MSA)

chicken     PLVSS---PLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN
xenopus     ALVSG---PQDNELDGMQLQPQEYQKMKRGIVEQCCHSTCSLFQLESYCN
human       LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
monkey      PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
dog         LQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN
hamster     PQVAQLELGGGPGADDLQTLALEVAQQKRGIVDQCCTSICSLYQLENYCN
bovine      PQVGALELAGGPGAGG-----LEGPPQKRGIVEQCCASVCSLYQLENYCN
guinea pig  PQVEQTELGMGLGAGGLQPLALEMALQKRGIVDQCCTGTCTRHQLQSYCN

Bring the greatest number of similar characters into the same column of the alignment
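As a rough illustration of what "similar characters in the same column" means, the sketch below marks the columns in which every row carries the same residue, much like the `*` line in Clustal output. The function names are made up for this example, the rows are the last 23 columns of the human, dog and guinea pig sequences above, and the treatment of gaps (a column with a gap is never marked) is an assumption.

```python
def column_identity(alignment):
    """For each column, return the fraction of rows sharing the most common residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]  # gaps never count as matches
        if not residues:
            scores.append(0.0)
            continue
        most_common = max(residues.count(r) for r in set(residues))
        scores.append(most_common / len(col))
    return scores


def conserved_marker(alignment):
    """Build a marker line with '*' under columns that are identical in every row."""
    return "".join("*" if s == 1.0 else " " for s in column_identity(alignment))


# Last 23 columns of three rows from the alignment slide.
rows = [
    "KRGIVEQCCTSICSLYQLENYCN",  # human
    "KRGIVEQCCTSICSLYQLENYCN",  # dog
    "KRGIVDQCCTGTCTRHQLQSYCN",  # guinea pig
]
for row in rows:
    print(row)
print(conserved_marker(rows))
```

A real aligner optimizes a score over all columns at once; this only inspects an alignment that already exists.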

Why do we do MSA?

Check similarity among sequences (e.g., a gene family) to find common features, motifs, and conserved regions, etc. Used in the prediction of structure and function for new sequences.

(figure: Human Hox genes)

Why do we do MSA?

Get the distances among homologs for use in evolutionary analysis. MSA is the basis for phylogenetic tree construction.

(figure: gene tree with tips a, b, c beside a species tree with tips A, B, C)

We often assume that gene trees give us species trees

Concepts: Paralogy & Orthology
How to do MSA?

• Dynamic programming: accurate, but slow
• Heuristic algorithms:
1. progressive methods: Clustal, T-Coffee, MUSCLE
2. iterative methods: PRRP, DIALIGN
3. others: Partial Order Algorithm, profile HMM, meta-methods (MAFFT)…

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
Current Opinion in Structural Biology 2006, 16:368–373

Using Clustal

• Clustal: the most popular MSA program
• can be used online
• can be run locally
• Input and output formats

Input formats: FASTA, NBRF/PIR, EMBL/SWISSPROT, ALN, GCG/MSF, GCG9/RSF, GDE
Output formats: ALN, NBRF/PIR, GCG/MSF, PHYLIP, NEXUS, GDE/FASTA

Example input (FASTA):
>sequence 1
ATTGCAGTTCG
CA……
>sequence 2
ATAGCACATCG
CA……
>sequence 3
ATGCCACTCCG
CC……
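The "dynamic programming" entry under "How to do MSA?" can be made concrete for the two-sequence case. Below is a minimal sketch of Needleman-Wunsch global alignment; the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration, not Clustal's defaults. For two sequences the table has (n+1) x (m+1) cells. Extending the same idea to N sequences needs an N-dimensional table, which is why exact dynamic programming is slow for MSA and heuristics are used instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# First lines of "sequence 1" and "sequence 2" from the FASTA example.
s, x, y = needleman_wunsch("ATTGCAGTTCG", "ATAGCACATCG")
print(s)
print(x)
print(y)
```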

Clustal W/X algorithm

• Clustal online (ClustalW)

ClustalW @ EBI
http://www.ebi.ac.uk/Tools/clustalw2/

Paste or upload sequences

Adjust parameters

MSA Output

http://www.ebi.ac.uk/Tools/clustalw/help.html
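Sequences pasted or uploaded to ClustalW are usually in FASTA format, as in the example on the "Using Clustal" slide. The sketch below shows one way to read such text into (header, sequence) pairs; `parse_fasta` is a hypothetical helper written for this note, not part of Clustal or any library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>sequence 1
ATTGCAGTTCG
>sequence 2
ATAGCACATCG
>sequence 3
ATGCCACTCCG
"""
for name, seq in parse_fasta(example):
    print(name, len(seq), seq)
```

Only the first line of each slide sequence is used here, since the remainder is elided on the slide.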

• Clustal local (ClustalX)

Download & Install
Help file: Using ClustalX for multiple sequence alignment, by Jarno Tuimala

Two work modes:
• Multiple Alignment
• Profile Alignment

Step 1: input sequences

File – Load sequences

Step 2: set parameters

Step 3: do MSA

Step 4: save the output in a selected format

• Decorate MSA output (1)
• Boxshade: highlights the identical/similar sites
(http://www.ch.embnet.org/software/BOX_form.html)

Copy the output from the EBI ClustalW Output page
Paste it into the “Boxshade” page, choose “ALN” from “Input sequence format”, and then select “RTF_new” from “Output format”
In the result page, click “here is your output number 1”

Decorated MSA output

• Decorate MSA output (2)
• ESPript: multiple functions for highlighting
http://espript.ibcp.fr/ESPript/cgi-bin/ESPript.cgi

Download the “Alignment file” (ALN file) from the EBI ClustalW Output page
Upload the ALN file to the “Aligned Sequences” field at the ESPript Analysis page
Select “Output layout” & “Output file or device”

Decorated MSA output

• Decorate MSA output (3)
• GeneDoc
http://www.nrbsc.org/gfx/genedoc

File – Import
Choose input format (e.g., ALN file)

Decorated output

Sequence Logos

http://weblogo.berkeley.edu/logo.cgi
http://weblogo.threeplusone.com/create.cgi
http://genome.tugraz.at/Logo/
T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, Vol. 18, No 20, p. 6097-6100.

2. Phylogenetic analysis

• Analyze evolutionary relationships for genes and proteins
• Construct Phylogenetic Trees

A tree showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor.
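For the sequence logos mentioned above, the height of each column stack is its information content in bits, and each letter's height is its frequency times that value (Schneider and Stephens). A minimal sketch follows, assuming a DNA alphabet of 4 symbols, ignoring gaps, and leaving out the small-sample correction used by the real logo tools; the function names are invented for this example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits of one alignment column (no small-sample correction)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment, alphabet_size=4):
    """For each column, map residue -> letter height (frequency * information)."""
    heights = []
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]
        info = column_information(col, alphabet_size)
        counts = Counter(residues)
        heights.append({r: (n / len(residues)) * info for r, n in counts.items()})
    return heights


# Toy DNA alignment (hypothetical data, not taken from the slides).
toy = ["ACGT", "ACGA", "ACTA", "AGTA"]
for i, stack in enumerate(logo_heights(toy), start=1):
    print(i, {r: round(h, 2) for r, h in stack.items()})
```

A fully conserved DNA column reaches the maximum of 2 bits; for proteins, `alphabet_size=20` would give a maximum of log2(20) bits.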

Methods for Phylogenetic Analysis

Classic Evolutionary Biology:
Comparison: Morphology, Structure, Fossils

Molecular evolution:
Compare DNA and Protein sequences

Terms for a Phylogenetic Tree

Root
Branch
Node
Internal / divergence node: possible ancestors (HTU)
End node (A, B, C, D, E): can be species, population, or protein, DNA, RNA molecules, etc. (OTU)
Newick format: ((A, (B,C)), (D, E))
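The Newick string on this slide, ((A, (B,C)), (D, E)), encodes the tree as nested parentheses. A small sketch of reading it into nested tuples is given below. It handles only leaf names and parentheses, as on the slide; branch lengths and internal node labels, which full Newick allows, are deliberately not supported, and the function names are made up for this example.

```python
def parse_newick(text):
    """Parse a Newick string without branch lengths into nested tuples of leaf names."""
    s = text.replace(" ", "").rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ')'
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        if start == pos:
            raise ValueError("missing leaf name")
        return s[start:pos]

    tree = parse_node()
    if pos != len(s):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(tree):
    """End nodes (OTUs) in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def count_internal_nodes(tree):
    """Number of internal nodes (HTUs), including the root."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(count_internal_nodes(child) for child in tree)


tree = parse_newick("((A, (B,C)), (D, E))")
print(tree)
print(leaves(tree), count_internal_nodes(tree))
```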

Terms for a Phylogenetic Tree

A clade is a group of organisms that includes an ancestor and all descendants of that ancestor.

Terms for a Phylogenetic Tree

Branch length
Scaled branches: the length of the branch is proportional to the number of changes.
The distance between 2 species is the sum of the length of all branches connecting them.

Meaning of branch length by tree type:
Cladogram: no meaning
Phylogram: genetic change
Ultrametric tree: time

(figure: Taxon A, Taxon B, Taxon C and Taxon D drawn as a cladogram, a phylogram with numbered branch lengths, and an ultrametric tree)
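The rule that the distance between two taxa is the sum of the branch lengths connecting them can be sketched as a walk up to the common ancestor. The tree below is hypothetical: its shape and branch lengths are invented for illustration and are not read off the slide figure. Each node maps to its parent and the length of the branch leading to that parent.

```python
def tree_distance(parent, a, b):
    """Sum of branch lengths on the path between nodes a and b.

    `parent` maps node -> (parent_node, branch_length); the root has no entry.
    """
    # Distance from a to each of its ancestors (including a itself).
    dist_from_a = {}
    node, d = a, 0.0
    while True:
        dist_from_a[node] = d
        entry = parent.get(node)
        if entry is None:
            break
        node, d = entry[0], d + entry[1]

    # Climb from b until we reach a node that is also an ancestor of a.
    node, d = b, 0.0
    while node not in dist_from_a:
        entry = parent.get(node)
        if entry is None:
            raise ValueError("nodes do not share a common ancestor")
        node, d = entry[0], d + entry[1]
    return dist_from_a[node] + d


# Hypothetical phylogram: root -> X (1), root -> D (5), X -> A (3), X -> Y (1),
# Y -> B (1), Y -> C (2).
parent = {
    "X": ("root", 1.0),
    "D": ("root", 5.0),
    "A": ("X", 3.0),
    "Y": ("X", 1.0),
    "B": ("Y", 1.0),
    "C": ("Y", 2.0),
}
print(tree_distance(parent, "B", "C"))
print(tree_distance(parent, "A", "B"))
print(tree_distance(parent, "B", "D"))
```

In a cladogram these sums would carry no meaning, since only the branching order is drawn.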

Terms for a Phylogenetic Tree

Rooted tree vs. Unrooted tree

(figure: an unrooted tree of taxa A, B, C and D and the corresponding rooted tree)

Two major ways to root a tree:
By midpoint or distance
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
By outgroup

Construction Procedure

Multiple Sequence Alignment (automatic output, manual corrections)
Choose Methods (substitution model)
  distance: UPGMA, Neighbor-joining (NJ), minimum evolution
  maximum parsimony (MP)
  maximum likelihood (ML)
  Bayesian inference
Tree Construction
Tree Evaluation (statistical analysis): Bootstrap, Likelihood Ratio Test, ……
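To make the "distance" branch of the procedure above more tangible, here is a minimal sketch of UPGMA, the simplest of the distance methods listed. It takes a distance matrix, repeatedly joins the two closest clusters, and averages distances weighted by cluster size. The p-distance helper (fraction of differing sites, skipping gapped columns) stands in for a real substitution model and is an assumption of this sketch, as is the toy matrix. Branch lengths are omitted from the output to keep it short.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two aligned rows, skipping gapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, matrix):
    """Cluster taxa by UPGMA and return the topology as a Newick string."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    d = {
        frozenset((i, j)): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        i, j = sorted(pair)
        (label_i, n_i), (label_j, n_j) = clusters.pop(i), clusters.pop(j)
        d.pop(pair)
        # Distance from the merged cluster to every other cluster:
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = (f"({label_i},{label_j})", n_i + n_j)
        next_id += 1
    label, _ = next(iter(clusters.values()))
    return label + ";"


# Toy distance matrix (hypothetical values).
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, matrix))
```

UPGMA assumes all lineages evolve at the same rate, which is one reason Neighbor-joining is generally preferred among the distance methods. Whichever method is used, the resulting tree still needs the evaluation step.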

Choosing a Method for Phylogenetic Prediction

Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. p254
http://cshprotocols.cshlp.org/cgi/content/full/2008/5/pdb.ip49

MSA is the key step for tree construction

MSA can be done for any sequences, and choosing the proper sequences is very important!!

Molecular Biology and Evolution 2005 22(3):792-802

Homologous sequences are needed!

What Sequences to Study?

• Different sequences accumulate changes at different rates - choose a level of variation that is appropriate to the group of organisms being studied.
– Proteins (or protein coding DNAs) are constrained by natural selection - better for very distant relationships
– Some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions)
– Different regions within a single gene can evolve at different rates (conserved vs. variable domains)

• Phylogenetic tree construction (ClustalW)

Select “tree type” from the “PHYLOGENETIC TREE” field at the EBI ClustalW page
Input the MSA output (or upload an ALN file)
Cladogram Tree shown at the page bottom
Click “Show as Phylogram Tree” to display Phylogram Tree

Not recommended, since only a distance method is used and there is no evaluation of the generated trees

• Tools for Visualization
• TreeView: software for editing and printing evolution trees
(http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)

Choose “tree type” from PHYLOGENETIC TREE field at EBI ClustalW page
Input MSA output (or upload an ALN file)
Download “Phylip tree file” (ph file)
Open the above files with TreeView program
Display trees in different forms (1, 2, 3)

Software for Phylogenetic Analysis

PHYLIP http://evolution.genetics.washington.edu/phylip.html
free and integrated tool for evolutionary analysis
PAUP http://paup.csit.fsu.edu/
commercial and integrated tool for evolutionary analysis
MEGA http://www.megasoftware.net/
free and graphic integrated tool, including the ML algorithm in the latest version
PHYML http://atgc.lirmm.fr/phyml/
fastest ML tree construction software
PAML http://abacus.gene.ucl.ac.uk/software/paml.html
ML tree construction software
Tree-puzzle http://www.tree-puzzle.de/
faster ML tree construction software
MrBayes http://mrbayes.csit.fsu.edu/
Tree construction software based on Bayesian inference
More tools: http://evolution.gs.washington.edu/phylip/software.html

• Phylogenetic tree construction

http://www.megasoftware.net/

Merits: graphic interface, integrated environment for sequence query, alignment and tree construction, free and well documented
Defects: slow when using ML algorithms

Supplies maximum likelihood (ML), maximum parsimony (MP) and distance methods for tree construction; the distance methods include 3 algorithms: Neighbor-joining (NJ), Minimum Evolution and UPGMA.

The latest version (MEGA5) extends its functions by adding ML algorithms and the option to select substitution models.
(figure: example tree with node values 92, 98 and 100 and a scale bar of 0.02; tip labels as follows)
Pig gi|218855168|gb|ACL12051.1| FAD24 pr
Cattle gi|146186885|gb|AAI40653.1| NOC3L
Human gi|18389433|dbj|BAB84194.1| AD24 H
Mouse gi|18389431|dbj|BAB84193.1| AD24 M
Chicken gi|118092837|ref|XP 421670.2| PR
Zebrafish gi|50838808|ref|NP 001002863.1
