Phylogeny for the faint of heart: a tutorial
Sandra L. Baldauf
Department of Biology, University of York, Box 373, York, UK YO10 5YW
Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else’s trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.
Phylogenetics is the science of estimating the evolutionary past, in the case of molecular phylogeny, based on the comparison of DNA or protein sequences. The idea of representing these hypotheses as trees probably dates back to Darwin, but the numerical calculation of trees using quantitative methods is relatively recent [1], and their application to molecular data even more so [2]. In the age of rapid and rampant gene sequencing, molecular phylogeny has truly come into its own, emerging as a major tool for making sense of a sometimes overwhelming amount of information.
This tutorial aims to introduce the basic principles behind and programs for constructing evolutionary trees (phylogenetic analysis). It is intended primarily for those who want to read other people’s trees, but also as a general introduction for those who might wish to begin to try building their own. In the latter case the reader is warned that phylogenetic analysis and evolutionary theory are not trivial pursuits; as with any new methodology, it is advisable to seek expert help before getting in too deep.
Some basics
A phylogenetic tree is composed of branches (edges) and nodes. Branches connect nodes; a node is the point at which two (or more) branches diverge. Branches and nodes can be internal or external (terminal). An internal node corresponds to the hypothetical last common ancestor (LCA) of everything arising from it. Terminal nodes correspond to the sequences from which the tree was derived (also referred to as operational taxonomic units or ‘OTUs’). Trees can be made up of multigene families (gene trees) or a single gene from many taxa (species trees, at least theoretically) or a combination of the two. In the first case, the internal nodes correspond to gene duplication events, in the second to speciation events.
Trees are about groupings (Fig. 1). A node and everything arising from it is a ‘clade’ or a ‘monophyletic group’. A monophyletic group is a natural group; all members are derived from a unique common ancestor (with respect to the rest of the tree) and have inherited a set of unique common traits (characters) from it. A group excluding some of its descendents is a paraphyletic group (e.g. animals excluding humans). A hodge-podge of distantly related OTUs, perhaps superficially resembling one another or retaining similar primitive characteristics, is polyphyletic; that is, not a group at all.
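The definition of a clade can be made concrete with a small sketch. It assumes a rooted tree stored as nested Python tuples with OTU names as strings; the taxa, the topology and the function names are invented for illustration and do not come from any phylogenetics package.

```python
# A rooted tree as nested tuples; leaves are strings (OTU names).
# Taxa and topology are hypothetical, chosen only for illustration.
tree = ((("human", "chimp"), "mouse"), ("chicken", "lizard"))


def leaves(node):
    """Return the set of leaf names found at or below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """Yield the leaf set of every node (terminal and internal) in the tree."""
    yield frozenset(leaves(node))
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some node has exactly `group` as its set of descendants."""
    return frozenset(group) in set(clades(tree))


# A node plus everything arising from it: a clade.
print(is_monophyletic(tree, {"human", "chimp", "mouse"}))
# The same clade with one descendant left out: not monophyletic.
print(is_monophyletic(tree, {"chimp", "mouse"}))
```

A group that fails this test is either paraphyletic or polyphyletic; telling those two apart needs more than the leaf sets used here.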
Fig. 1. Trees are about groups: monophyletic (holophyletic), paraphyletic and ‘polyphyletic’.
Corresponding author: Sandra L. Baldauf (slb14@york.ac.uk).
TRENDS in Genetics Vol.19 No.6 June 2003, p. 345
http://tigs.trends.com 0168-9525/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S0168-9525(03)00112-4
Intuitively we draw trees from the ground up like real trees (Fig. 2a). However, as these trees get larger and more complex, they can become cluttered and difficult to read. As an alternative we can expand the nodes (Fig. 2b) and turn the tree on its side (Fig. 2c). Now the tree grows left to right, and all the labels are horizontal. This makes the tree easier to read and to annotate. Thus, the widths of the nodes have no meaning; they are simply adjusted to give even spacing to the branches. To make things slightly more complicated, all branches can rotate freely about the plane of their nodes, so all trees in Fig. 2 are identical (except that tree (f) is ‘unrooted’, see below).
Molecular phylogenetic trees are usually drawn with proportional branch lengths; that is, the lengths of the branches correspond to the amount of evolution (roughly, percent sequence difference) between the two nodes they connect (Fig. 2a–f). Thus, the longer the branches the more relatively divergent (highly evolved) are the sequences attached to them. Alternatively, trees can be drawn to display branching patterns only (‘cladograms’), in which case the lengths of the branches have no meaning (Fig. 2g), but this is rarely done with molecular sequence trees.
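The point that branches rotate freely about their nodes can be checked mechanically. The sketch below assumes trees written as nested tuples and compares them after discarding the left-to-right order of the children at every node; the tip names, the topologies and the helper name `canonical` are illustrative assumptions, and branch lengths are ignored.

```python
def canonical(node):
    """Return an order-independent form of a nested-tuple tree.

    Children of each node are stored in a frozenset, so swapping
    (rotating) the branches at a node does not change the result.
    """
    if isinstance(node, str):
        return node
    return frozenset(canonical(child) for child in node)


# Hypothetical four-taxon trees.
tree_a = (("Caviar", "Oyster"), ("Truffle", "Nori"))
tree_b = (("Nori", "Truffle"), ("Oyster", "Caviar"))  # tree_a with branches rotated
tree_c = (("Caviar", "Truffle"), ("Oyster", "Nori"))  # a different grouping

print(canonical(tree_a) == canonical(tree_b))  # same branching pattern
print(canonical(tree_a) == canonical(tree_c))  # different branching pattern
```

Two drawings that look different on the page but compare equal here carry the same phylogenetic statement.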
At the base of a phylogenetic tree is its ‘root’. This is the oldest point in the tree, and it, in turn, implies the order of branching in the rest of the tree; that is, who shares a more recent common ancestor with whom. The only way to root a tree is with an ‘outgroup’, an external point of reference. An outgroup is anything that is not a natural member of the group of interest (i.e. the ‘ingroup’). This might not seem like a difficult concept, but do not be misled. The excluded member of a monophyletic group (i.e. the exclusion that makes it paraphyletic, Fig. 1) is not an outgroup (just an outcast); for example, humans are not an outgroup to animals. In the absence of an outgroup, the best guess is to place the root in the middle of the tree (at its midpoint), or, better yet, not root it at all (Fig. 2f). Alternatively you can use extrinsic, more traditional taxonomic information, such as the fossil record in the case of species trees. This is obviously more difficult with gene trees.
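Midpoint rooting, mentioned above as the fallback when no outgroup is available, can be sketched in a few lines. The example assumes an unrooted tree given as a list of (node, node, branch length) edges; the tree, its branch lengths and all function names are made up for illustration, and real programs handle ties and zero-length branches more carefully.

```python
from itertools import combinations

# Unrooted tree as an undirected edge list: (node, node, branch length).
# Topology and lengths are hypothetical.
edges = [
    ("A", "n1", 0.1),
    ("B", "n1", 0.2),
    ("n1", "n2", 0.3),
    ("C", "n2", 0.1),
    ("D", "n2", 0.6),
]


def build_adjacency(edges):
    """Map each node to a list of (neighbour, branch length) pairs."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def path(adj, start, goal):
    """Return the unique path as a list of (node, distance from start)."""
    stack = [(start, None, [(start, 0.0)])]
    while stack:
        node, parent, trail = stack.pop()
        if node == goal:
            return trail
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, trail + [(nxt, trail[-1][1] + w)]))
    raise ValueError("nodes are not connected")


def midpoint_root(edges):
    """Return (u, v, offset): the root lies on branch u-v, `offset` away from u."""
    adj = build_adjacency(edges)
    tips = [n for n, nbrs in adj.items() if len(nbrs) == 1]
    # The longest tip-to-tip path defines the 'middle of the tree'.
    longest = max((path(adj, a, b) for a, b in combinations(tips, 2)),
                  key=lambda p: p[-1][1])
    half = longest[-1][1] / 2
    for (u, du), (v, dv) in zip(longest, longest[1:]):
        if du <= half <= dv:
            return u, v, half - du


u, v, offset = midpoint_root(edges)
print(f"root on branch {u}-{v}, {offset:.2f} from {u}")
```

In this toy tree the two most distant tips are B and D, so the root falls on the long branch leading to D. That illustrates why midpoint rooting is only a best guess: one fast-evolving sequence is enough to pull the root towards itself.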
Evolution is about homology; that is, the similarity due to common ancestry. Homologues can be orthologues or paralogues (Fig. 3). Orthologues only duplicate when their host divides, along with the rest of the genome (Fig. 3a). They are strictly vertically transmitted (parent to offspring), so their phylogeny traces that of their host lineage (Fig. 3b). Paralogues are members of multigene families; they arise by gene duplication (Fig. 3a). If you try to infer species relationships with paralogues you can run into trouble; if some of the copies are missing, you can be very convincingly misled (Fig. 3c). However, if you have all copies of two paralogues in your tree, then you are fine. Better still, you have two mirror phylogenies (Fig. 3b). In this case, paralogues can serve as each other’s natural outgroup. This was the method used to infer the root of the universal tree of life [3–5].
Fig. 2. Phylogenetic tree styles. All these trees have identical branching patterns. The only differences are that (f) is unrooted and (g) is a cladogram, so its branch lengths are right justified and not drawn to scale (i.e. they are not proportional to estimated evolutionary difference).
Fig. 3. The problem with paralogues. (a) Paralogous genes are created by gene duplication events. Gene X is duplicated in a common ancestor to species A and B resulting in two paralogous genes, X and X′. All subsequent species inherit both copies of the gene (unless one or the other is lost somewhere along the way). (b) Phylogenetic analysis of the X/X′ gene family gives two parallel phylogenies. All sequences of gene X are orthologues of each other, and all the sequences of gene X′ are orthologues of each other. However, X and X′ are paralogues. Both the X and X′ subtrees show the true relationships among the three species. The subtrees are also each other’s natural outgroup, and as a result each subtree is rooted with the other (reciprocally rooting). (c) A tree of the X/X′ gene family can be misleading if not all the sequences are included (because of incomplete sampling or gene loss). If the broken branches are missing, then the true species relationships are misrepresented.
Step 1. Assembling a dataset
The first step in constructing a tree is building the dataset. For most of us, this means finding and retrieving sequences from the public domain. The main repository for these data is the public nucleotide database (Box 1), stored independently in the USA (GenBank), EU (EMBL) and Japan (DDBJ). Primary entries are redundant among them, and they are updated against each other nightly. Some of the most exciting molecular evolutionary data are coming from genome sequencing projects (Box 1). Many of these data, both in-progress and completed, are deposited in the public database, with some in-progress data partitioned off separately. Other genome project data are available only from their own websites; for example, The Institute for Genomic Research (TIGR, Box 1) and the Joint Genome Research Institute (DOE, Box 1). Comprehensive lists and progress reports of on-going genome sequencing projects are available from several sources (Box 1).
There are two basic kinds of search strategy for finding a set of related sequences: keywords and similarity. A keywords search identifies sequences by looking through their written descriptions (i.e. the annotation section of a database file); a similarity search looks at the sequences themselves (e.g. using ‘BLAST’ software, Box 1). Keywords searching is easier and seems more intuitive, but it is far from exhaustive. This is mostly because a lot of data entries are very scantily annotated or even mis-annotated (sometimes quite entertainingly so). This is particularly true for genomic data where high throughput is the priority. The best-annotated data are the painstakingly annotated protein data found in the SwissProt database [6]. This is accessible directly or through the main database sites (Box 1), but this is only a subset of all that is available. The main search engines for keywords searching are Entrez (NCBI) and SRS (everywhere else); both have excellent online tutorials (Box 1). Beginners might find SRS easier, with its simple forms and obvious blanks to fill in.
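The difference between the two search strategies can be illustrated with a toy in-memory ‘database’. Everything in this sketch is an assumption made for illustration: the records, their annotations and short sequences are invented, and the shared k-mer score is only a crude stand-in for a real similarity search such as BLAST, which works quite differently.

```python
# Toy records; ids, descriptions and sequences are hypothetical.
records = [
    {"id": "rec1", "description": "elongation factor 1-alpha",
     "seq": "MGKEKTHINIVVIGHVDSGKST"},
    {"id": "rec2", "description": "hypothetical protein",  # scantily annotated
     "seq": "MGKEKSHINVVVIGHVDSGKST"},
    {"id": "rec3", "description": "actin",
     "seq": "MDDDIAALVVDNGSGMCKAGFA"},
]


def keyword_search(records, keyword):
    """Match on the written description only."""
    return [r["id"] for r in records if keyword.lower() in r["description"].lower()]


def kmers(seq, k=3):
    """Return the set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_search(records, query, k=3, threshold=0.3):
    """Match on the sequence itself, using the fraction of shared k-mers."""
    query_words = kmers(query, k)
    hits = []
    for r in records:
        words = kmers(r["seq"], k)
        score = len(query_words & words) / len(query_words | words)
        if score >= threshold:
            hits.append(r["id"])
    return hits


# The keyword search misses the poorly annotated homologue ...
print(keyword_search(records, "elongation factor"))
# ... while the sequence comparison recovers it.
print(similarity_search(records, records[0]["seq"]))
```

The threshold of 0.3 is arbitrary; the only point is that a record annotated as ‘hypothetical protein’ is invisible to a keyword search and visible to a sequence comparison.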
The main search engine for similarity searching is the ‘BLAST’ software [7], available at all databanks and most genome websites (Box 1). The NCBI BLAST server is the most sophisticated with numerous ‘flavours’ and options such as honing a BLAST search using keywords, searching with alignment profiles to find distant homologues (PSI-BLAST), and much more.
A word on database ‘etiquette’. A large body of unpublished genomic data is now freely available over the Internet. It is generally (although not universally) felt that these data should be treated as privileged communications, with any significant or large-scale analyses cleared with the submitters before publication and, obviously, gratefully acknowledged. This is basically a courtesy to the authors, most of whom are as publication-dependent as the rest of us [8].
Step 2. Multiple sequence alignment – the heart of the matter
Molecular trees are based on multiple sequence alignments. Until 1989 these were all assembled by hand (e.g. [9]) because the exhaustive alignment of more than six or eight sequences was, and more or less still is, computationally unfeasible. Now, most multiple sequence alignments are constructed by the method known as ‘progressive sequence alignment’ [10,11]. This method builds an alignment up stepwise, starting with the most similar sequences and progressively adding the more dissimilar (‘divergent’) ones (Fig. 4a). The process begins with the construction of a crude ‘guide tree’ (Fig. 4a). This tree then determines the order in which the sequences are progressively added to build the alignment (Fig. 4b). Note that the guide tree is included as part of the alignment output, but only to show the user how the alignment was assembled.
The cardinal rule of progressive sequence alignment is ‘once a gap always a gap’; gaps can only be added or enlarged, never moved or removed [10]. This is based on the assumption that the best information on gap placement will be found among the most similar
Box 1. Bioinformatic resources
Databases
DDBJ (Japan): http://www.ddbj.nig.ac.jp/
EMBL (EU): http://www.ebi.ac.uk/Databases/
GenBank (USA): http://www.ncbi.nlm.nih.gov/
TIGR: http://www.tigr.org/tdb/mdb/mdbcomplete.html
DOE: http://www.jgi.doe.gov/JGI_microbial/html/index.html
http://www.sanger.ac.uk/Projects/Microbes/
http://ncbi.nlm.nih.gov/Genomes/index.html
Lists of genomes in progress
http://wit.integratedgenomics.com/GOLD/
http://www.tigr.org/tdb/mdb/mdbinprogress.html
Data acquisition (search engines)
SRS: http://srs.ebi.ac.uk/ (tutorials can be found at http://www.icgeb.trieste.it/~netsrv/courses/RH/srs/ or http://www.no.embnet.org/Programs/DB/srs_tut.php3)
Entrez: http://www.ncbi.nlm.nih.gov/Entrez/ (tutorial can be found at http://www.ncbi.nlm.nih.gov:80/Database/tut1.html)
BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
Multiple sequence alignment
http://www.mbio.ncsu.edu/BioEdit/bioedit.html
BCM search launcher: http://searchlauncher.bcm.tmc.edu/
http://www.accelrys.com/products/gcg_wisconsin_package
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAliwelcome.html/
Phylogenetic analysis
http://paup.csit.fsu.edu/index.html (tutorial can be found at http://paup.csit.fsu.edu/Quick_start_v1.pdf)
http://www.megasoftware.net/
http://evolution.genetics.washington.edu/phylip.html
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
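The guide-tree stage of progressive alignment described in Step 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any particular alignment program: the four sequences are invented and already of equal length, distances are plain mismatch fractions rather than pairwise alignment scores, and the clustering is a simple greedy average-linkage scheme. The function names are hypothetical.

```python
from itertools import combinations

# Hypothetical equal-length sequences.
seqs = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "ACGAACGTTT",
    "s4": "TCGAACCTTT",
}


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_distance(seqs, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(p_distance(seqs[a], seqs[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def guide_order(seqs):
    """Join the closest clusters first; return the joins in the order made."""
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(seqs, *pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Each join is one alignment step: most similar sequences first,
# more divergent ones (or groups of sequences) later.
for step, (left, right) in enumerate(guide_order(seqs), start=1):
    print(step, left, "+", right)
```

In a real progressive aligner each join would trigger a pairwise or profile alignment, and any gaps introduced at that step would be kept in all later steps.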