outgroup. This was the method used to infer the root of the universal tree of life [3–5].
Step 1. Assembling a dataset
The first step in constructing a tree is building the dataset. For most of us, this means finding and retrieving sequences from the public domain. The main repository for these data is the public nucleotide database (Box 1), stored independently in the USA (GenBank), EU (EMBL) and Japan (DDBJ). Primary entries are redundant among them, and they are updated against each other nightly. Some of the most exciting molecular evolutionary data are coming from genome sequencing projects (Box 1). Much of this data, both in-progress and completed, is deposited in the public database, with some in-progress data partitioned off separately. Other genome project data are available only from their own websites; for example, The Institute for Genomic Research (TIGR, Box 1) and the Joint Genome Research Institute (DOE, Box 1). Comprehensive lists and progress reports of on-going genome sequencing projects are available from several sources (Box 1).
There are two basic kinds of search strategy for finding a set of related sequences – Keywords and similarity. A Keywords search identifies sequences by looking through their written descriptions (i.e. the annotation section of a database file); a similarity search looks at the sequences themselves (e.g. using ‘BLAST’ software, Box 1). Keywords searching is easier and seems more intuitive, but it is far from exhaustive. This is mostly because a lot of data entries are very scantily annotated or even mis-annotated (sometimes quite entertainingly so). This is particularly true for genomic data where high throughput is the priority. The best-annotated data are the painstakingly annotated protein data found in the SwissProt database. This is accessible directly or through the main database sites (Box 1), but this is only a subset of all that is available.
The main search engines for Keywords searching are Entrez (NCBI) and SRS (everywhere else); both have excellent online tutorials (Box 1). Beginners might find SRS easier, with its simple forms and obvious blanks to fill in.
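The difference between the two search strategies can be illustrated with a minimal sketch. The records, identifiers and sequences below are hypothetical toy data, not real database entries, and the shared k-mer count is only a crude stand-in for similarity searching; it is not how BLAST scores matches. The point is simply that a poorly annotated entry is invisible to a Keywords search but is still found by looking at the sequence itself.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy "database": id -> (annotation, sequence).
# rec2 is a near-copy of rec1 but carries an uninformative annotation.
RECORDS: Dict[str, Tuple[str, str]] = {
    "rec1": ("elongation factor 1-alpha", "ATGGGTAAGGAAAAGACTCACATCAACATCGTC"),
    "rec2": ("hypothetical protein", "ATGGGTAAGGAGAAGACTCACATTAACATCGTC"),
    "rec3": ("actin", "ATGTGTGACGACGAGGAGACCACCGCTCTTGTG"),
}


def keyword_search(records: Dict[str, Tuple[str, str]], keyword: str) -> List[str]:
    """Return ids whose annotation text contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return sorted(rid for rid, (annotation, _) in records.items()
                  if kw in annotation.lower())


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_search(records: Dict[str, Tuple[str, str]], query: str,
                      k: int = 8, min_shared: int = 3) -> List[str]:
    """Return ids whose sequence shares at least min_shared k-mers with the query.

    k and min_shared are arbitrary illustrative thresholds.
    """
    query_kmers = kmers(query, k)
    return sorted(rid for rid, (_, seq) in records.items()
                  if len(query_kmers & kmers(seq, k)) >= min_shared)


if __name__ == "__main__":
    query_seq = RECORDS["rec1"][1]
    print("Keywords search:  ", keyword_search(RECORDS, "elongation factor"))
    print("Similarity search:", similarity_search(RECORDS, query_seq))
```

Here the Keywords search returns only the well-annotated entry, whereas the sequence-based search also recovers the entry labelled "hypothetical protein".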
The main search engine for similarity searching is the ‘BLAST’ software, available at all databanks and most genome websites (Box 1). The NCBI BLAST server is the most sophisticated, with numerous ‘flavours’ and options such as honing a BLAST search using keywords, searching with alignment profiles to find distant homologues (PSI-BLAST), and much more.
A word on database ‘etiquette’. A large body of unpublished genomic data is now freely available over the Internet. It is generally (although not universally) felt that these data should be treated as privileged communications, with any significant or large-scale analyses cleared with the submitters before publication and, obviously, gratefully acknowledged. This is basically a courtesy to the authors, most of whom are as publication-dependent as the rest of us.
Step 2. Multiple sequence alignment – the heart of the matter
Molecular trees are based on multiple sequence alignments. Until 1989 these were all assembled by hand (e.g.) because the exhaustive alignment of more than six or eight sequences was, and more or less still is, computationally unfeasible. Now, most multiple sequence alignments are constructed by the method known as ‘progressive sequence alignment’ [10,11]. This method builds an alignment up stepwise, starting with the most similar sequences and progressively adding the more dissimilar (‘divergent’) ones (Fig. 4a). The process begins with the construction of a crude ‘guide tree’ (Fig. 4a). This tree then determines the order in which the sequences are progressively added to build the alignment (Fig. 4b). Note that the guide tree is included as part of the alignment output, but only to show the user how the alignment was assembled.
The cardinal rule of progressive sequence alignment is ‘once a gap always a gap’; gaps can only be added or enlarged, never moved or removed. This is based on the assumption that the best information on gap placement will be found among the most similar
Box 1. Bioinformatic resources
Databases
Lists of genomes in progress
Data acquisition (search engines)
: http://srs.ebi.ac.uk/ (tutorials can be found at http://www.icgeb.trieste.it/~netsrv/courses/RH/srs/ or http://www.no.embnet.org/Programs/DB/srs_tut.php3)
Multiple sequence alignment
BCM search launcher
: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html/
: http://paup.csit.fsu.edu/index.html (tutorial can be found at http://paup.csit.fsu.edu/Quick_start_v1.pdf)
TRENDS in Genetics Vol.19 No.6 June 2003