Phylogenetic Trees Made Easy
A How-To Manual
Second Edition

Barry G. Hall
University of Rochester, Emeritus

Sinauer Associates, Inc., Publishers
Sunderland, Massachusetts, U.S.A.

PHYLOGENETIC TREES MADE EASY: A How-To Manual, Second Edition

Copyright © 2004 by Sinauer Associates, Inc. All rights reserved.

For information address Sinauer Associates, Inc., 23 Plumtree Road, Sunderland, MA 01375 U.S.A.
FAX: 413-549-1118
orders@sinauer.com
publish@sinauer.com

Downloadable files to be used with this text are available on the accompanying CD and at http://www.sinauer.com/hall/

Notice of Liability

Due precaution has been taken in the preparation of this book. However, information and instructions described herein are distributed on an "As Is" basis, without warranty. Neither the author nor Sinauer Associates, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused, directly or indirectly, by the instructions contained in this book or by the computer software and hardware products described.

Notice of Trademarks

Throughout this book trademark names have been used and depicted, including, but not limited to, Macintosh, Microsoft Windows, Microsoft Explorer, Netscape, and Adobe. In lieu of appending the trademark symbol to each occurrence, the author and publisher state that these trademarked product names are used in an editorial fashion, to the benefit of the trademark owners, and with no intent to infringe upon the trademarks.

Library of Congress Cataloging-in-Publication Data

Hall, Barry G., 1942-
Phylogenetic trees made easy : a how-to manual / Barry G. Hall. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-87893-312-3 (paperbound)
1. Phylogeny -- Data processing. I. Title.
QH367.5.H27 2004
576.8'8'0285--dc22    2004010747

Acknowledgments

I am grateful to Joe Felsenstein for advice and help in learning to use PHYLIP, and to Jim Wilgenbusch for help in learning the command line interface of PAUP*.

I am grateful to Dave Swofford and Jim Wilgenbusch of the PAUP* team, to Joe Felsenstein of PHYLIP, and to Rod Page of TreeView for agreeing to incorporate into future versions of their programs some changes that I thought would be helpful to the readers of this book.

I thank David Fitch for perusing the manuscript for this book and for his many valuable suggestions for improvement. Errors that remain are mine, not his, and are probably attributable to my stubborn refusal to accept some of his suggestions.

I am grateful to Melinda and Bill for joyously sharing adventures that we would not otherwise have taken on.

I am especially grateful to my wife, Sue, for her patience and encouragement during the writing of this book. Much of my time that should have been spent on the process of moving into a new home was, in fact, spent writing.

Finally, I take deep pleasure in thanking my children, Steve, Scott and Rebecca, for their companionship and the pleasure of their company. This book is dedicated to them.

Table of Contents

Introduction: Read Me First
  A Brief Overview of the Second Edition
  Time-Limited Copy of the PAUP*4.0 Program
  Learn More about the Principles
  Computer Programs Discussed and Where to Obtain Them
  ClustalX
  TreeView
  PAUP*
  PHYLIP (PHYLogeny Inference Package)
  Tree-Puzzle
  MrBayes
  CodonAlign
  Other Programs
  Files and Utilities on the "Phylogenetics Made Easy" Website and CD
  Some Conventions Used in This Book
Chapter 1 Tutorial: Create a Tree!
  Why Create Phylogenetic Trees?
  Obtaining Related Sequences by a BLAST Search
  Step 1: Go to the BLAST Website
  Step 2: Use BLAST to Search for Sequences Related to Your Sequence
  Step 3: Decide Which Related Sequences to Include on Your Tree
  Downloading the Selected Sequences
  Creating the Multiple Alignment
  Creating the Input File
  Getting the Data into ClustalX
  Some General Comments about Creating Alignments
  Setting the Alignment Parameters
  Creating the Alignment
  Refining and Improving the Alignment
  Aligning New Sequences to an Existing Alignment, or Aligning Two Existing Alignments
  Phylogenetic Analysis
  Exactly What Is a Phylogenetic Tree?
  Methods for Constructing Phylogenies
  LEARN MORE ABOUT PHYLOGENETIC TREES
  Using ClustalX to Create a Neighbor Joining Tree
  Drawing the Tree Using TreeView
  Tree Formats: Different Appearances of the Same Tree
  Bootstrapping a Tree
  LEARN MORE ABOUT ESTIMATING THE RELIABILITY OF PHYLOGENETIC TREES
  Placing the Root of a Tree
  LEARN MORE ABOUT ROOTING PHYLOGENETIC TREES
  Printing and Saving the Tree
  Summary
Chapter 2 Basic Elements in Creating and Presenting Trees
  Selecting Homologs: What Sequences Can Be Put on a Single Tree?
  Fine-Tuning Alignments
  Major Methods for Creating Trees
  Which Method Should You Use?
  Distance versus Character-Based Methods
  LEARN MORE ABOUT TREE-SEARCHING METHODS
  LEARN MORE ABOUT DISTANCE METHODS
  Data Files Used to Illustrate Methods
  Using PAUP* to Create Trees
  Opening the Input File
  Creating Neighbor-Joining Trees Using PAUP*
  Creating Parsimony Trees Using PAUP*
  LEARN MORE ABOUT PARSIMONY
  Creating a Consensus Tree Using PAUP*
  Creating Maximum Likelihood DNA Trees Using PAUP*
  LEARN MORE ABOUT MAXIMUM LIKELIHOOD
  LEARN MORE ABOUT EVOLUTIONARY METHODS
  Creating Maximum-Likelihood Protein Trees Using Tree-Puzzle
  Creating Bayesian Trees Using MrBayes
  Creating the Execution File
  LEARN MORE ABOUT BAYESIAN ANALYSIS
  What the Statements in the Example MrBayes Block Do
  Interpreting MrBayes Results
  Sample Blocks for MrBayes
  Getting Help
  Presenting and Printing Your Trees
  Opening Tree Files in PAUP*
  To Root or Not to Root?
  Choosing What Form of a Tree to Publish
  Making a Tree Pretty: Not Just a Cosmetic Matter
  DNA Phylogeny or Protein Phylogeny: Which Is Better?
  Using CodonAlign 2.0
Chapter 3 Advanced Elements in Constructing Trees
  Reconstructing Ancestral DNA Sequences
  Ancestral Sequences for Parsimony and Maximum Likelihood Trees Using PAUP*
  Ancestral Sequences for Parsimony using PHYLIP
  Using Protein Structure Information to Construct Very Deep Phylogenies
  Analyzing Trees for Evidence of Adaptive Evolution by Detecting Positive Selection in a Phylogeny
Chapter 4 Using Alternative Software to Construct and Present Trees
  Using PAUP* with Windows or Unix
  Opening and Viewing Files
  Using PAUP Blocks
  Bootstrapping
  Using PHYLIP
  General Features of PHYLIP Programs
  Parsimony Trees
  Consensus Trees
  Neighbor Joining Trees
  Maximum Likelihood Trees
  Bootstrapping
Appendix I File Formats and Their Interconversion
  Formats Used by Programs Discussed in this Book
  The FASTA Format
  The Clustal Format
  The Nexus Format
  PHYLIP 3.x
  Other File Formats
  The GCG/MSF Format and PileUp
  The NBRF/PIR format
  Interconverting Formats Using PAUP*
  Importing Various File Formats into PAUP*
  Exporting Various File Formats from PAUP*
  Interconverting Interleaved and Sequential Formatted PHYLIP Files
Appendix II Printing Alignments
  Printing to Assess the Quality of the Alignment
  Printing Alignments for Publication
Literature Cited
Index to Major Programs Discussed
Subject Index

Introduction
Read Me First

This is a "cookbook" intended as a tool to aid beginners in creating phylogenetic trees from protein or nucleic acid sequence data. It assumes basic familiarity with personal computers and with accessing the World Wide Web using browsers such as Netscape or Microsoft Explorer. I have not attempted to explore all the alternative approaches that might be used, intending only to give the beginner an approach that will work well most of the time and is easy to carry out. I hope the book will also serve the investigator who has a modest familiarity with phylogenetic tree construction but needs to address some aspects and problem areas in more depth.

Whatever the user's level of experience, this book devotes a significant amount of attention to the problem of aligning proteins and nucleic acids. Although one of the major purposes of alignments is to create phylogenetic trees, it is far from the only important application of alignments.
Studying alignments of a series of related proteins can provide insight into possible active sites by identifying those regions that are most strongly conserved and that contain probable catalytic residues. Thus those investigators whose interests are more in protein function than in phylogeny per se are also the intended audience for this book.

This book is not intended to be used as a primary text in a systematics or phylogenetics course, and it is not appropriate for that purpose. It can, however, be used as a supplement to the primary text and can serve as a tool for making the transition between a theoretical understanding of phylogenetics and a practical application of the methodology.

A Brief Overview of the Second Edition

As pointed out in the first edition, software changes quickly, and some screens may well look different from those depicted in this book. Indeed, within a few weeks of publication of the first edition, the appearance of the Web-based BLAST screens was substantially different. It is not only cosmetic appearances that change. Programs change the names of input parameters, what is contained within various output files, etc. Such changes are intended to simplify things for users and are often part and parcel of improving a program's functionality, but they do make it awkward for a beginner trying to follow a book in order to learn both a method and the software to implement that method. One of the major motives for this edition was to make the book current with respect to the software as of the end of 2003.

When a simple repair of a shower in my home revealed serious and extensive dry rot, my contractor said, "You know, you don't have to put this bathroom back exactly the way it was. If you are ever going to remodel it, this is the time." That turned out to be good advice. In the same sense, I decided that if I was ever going to reorganize this book to make it more useful, this is the time.
One of the critics who reviewed the first edition was kind enough to take the first sentence of the book seriously and compared it to a well-written cookbook. To take that analogy a bit farther, this edition is organized as one might organize a cooking class for someone who had never prepared food (indeed, had never even seen a kitchen) and assumed that food simply appeared by magic.

One might begin such a class with an entire evening devoted to boiling an egg, beginning with purchasing the egg, boiling water, putting the egg into the water for a specific duration, cracking the egg, and finally, presenting the egg at a table. It's not a lot, but it allows the student to experience the basic elements of cooking and results in something edible at the end of the class. That is exactly the purpose of Chapter 1, "Tutorial: Create a Tree!". The reader learns one way to create a tree that is a valid representation of the historical relationships among a set of related sequences. It is not enough for a meal, but it is a start.

The next chapter, "Basic Elements in Creating and Presenting Trees," might be compared with demonstrating sautéing, baking, broiling, and roasting. Each is an important method and it is necessary to know something about all of them if one is to prepare any but the most limited meals. Chapter 2 explains how to fine-tune alignments (the raw data from which phylogenies are constructed) and then presents each of the major methods for constructing phylogenies in some detail, with examples. The chapter finally turns to methods of presenting trees in publications and how to troubleshoot some common problems. Chapter 2 is the meat of this book. Understanding Chapter 2 will permit confident construction of phylogenetic trees under circumstances that anyone but an expert systematist is likely to encounter.

Throughout Chapter 2 you will find notes such as PAUP* for Windows/Unix: pages xxx-yyy and PHYLIP: pages xxx-yyy.
These notes guide you to the pages in Chapter 4 where the same method is discussed.

Chapter 3 presents some advanced topics for those who may want to go beyond the basics. It is not necessary to understand, or even to read, Chapter 3 in order to construct sound, valid phylogenies. Chapter 3 is like learning to make soufflés or spectacular flaming desserts. It is not necessary, but it can sure add a nice touch. The advanced topics in Chapter 3 include reconstructing ancestral states (i.e., estimating the sequences of ancestral genes); analyzing trees for evidence of adaptive evolution; and constructing very deep phylogenies from protein crystal structure data.

Chapter 4 recognizes that not everyone will want to invest in the purchase of PAUP*, the primary software package discussed here for tree construction. It is also the case that PAUP* for Windows does not use the same interface as PAUP* for Macintosh computers, from which the examples in this book are drawn. Chapter 4 therefore covers the details of tree construction using PAUP* for Windows and using the free and widely used PHYLIP program. The only cooking analogy for this chapter is that it is like low-calorie cooking. Not as much fun, but sometimes necessary.

Time-Limited Version of the PAUP*4.0 Program

The CD accompanying this book is compatible with both Windows and Macintosh computer platforms. It includes a fully functional but time-limited version of PAUP*4.0 beta 10 for each platform. Users can install PAUP* onto one computer, and it will be functional for a period of six months from the date of installation.

The purpose of including this time-limited PAUP* is to permit students to become familiar with the program, especially in the context of a course for which this book may be required. Those students who wish to continue using PAUP* for phylogenetic analysis after that time are encouraged to order PAUP* from Sinauer Associates (http://www.sinauer.com).
An advantage of purchasing PAUP* is that a registered purchaser is entitled to free updates of all beta releases, and to the final package release (which is expected to include a user's manual). Registered purchasers are also notified by email of any issues that may arise with respect to the use and/or performance of PAUP*.

Learn More about the Principles

Just as it is possible to implement molecular methods without understanding them by following the protocols in commercial "kits," it is also possible to implement phylogenetic methods without understanding them by following the protocols in this book. Most of us insist that our students understand the principles underlying the methods implemented by these kits, because we know that without such an understanding it is impossible to spot and troubleshoot many problems.

It is in this spirit that the reader will find "Learn More" boxes scattered throughout the text. These boxes present somewhat more detailed background on the various methods and suggest further reading. It is not necessary to read the boxes to be able to construct a reliable, valid phylogenetic tree, but understanding the principles outlined will help troubleshoot the phylogenetic problems that arise when creating trees from molecular alignment data.

Readers who want to go beyond the Learn More boxes will find Dan Graur and Wen-Hsiung Li's Fundamentals of Molecular Evolution (Graur and Li 2000) very helpful and enjoyable to read. Li's Molecular Evolution (Li 1997) and Chapters 11 (Swofford et al. 1996) and 12 (Hillis et al. 1996) of Molecular Systematics (David Hillis, Craig Moritz, and Barbara Mable, eds.) provide more detailed insights into these topics.
Computer Programs Discussed and Where to Obtain Them

All of the programs described in this book are available for both Macintosh and Windows platforms, and most are available for Unix as well. The examples in this book use the Macintosh versions of software that were current in November of 2003. The Windows and Unix versions of the programs may vary slightly in detail, but not so much as to preclude using this book as a guide for those platforms. The Windows versions of these programs have been tested by the author, and where they differ significantly from the described Macintosh programs the differences are noted in the text. Software authors are continuously upgrading (and sometimes even improving) their software, so some menus, windows, and dialogs may differ from the examples shown. If you don't see a feature that is mentioned here, look around a bit. It may be in a different menu or have a slightly different name. Users should ALWAYS register these programs when registration is available. Registration often allows software authors to alert users of updates and new features.

If you have access to a Macintosh computer I strongly recommend that you use that machine for phylogenetic analyses and that you obtain PAUP* for Macintosh. Modern Macs, especially the new G5 computers, are as fast or faster than the fastest Windows machines. The Macintosh implementation of PAUP* is superior to the Windows/Unix implementation because it includes a high-resolution tree-drawing program that allows you to present your trees easily and conveniently. If you can possibly do so, buy PAUP*. It is considerably more convenient to use than the free PHYLIP.

ClustalX

ClustalX (Thompson et al. 1997) is the primary multiple alignment program. Files created by ClustalX can be used by other programs to display and print trees, as well as to display alignments in ways that facilitate recognizing regions of high similarity.
ClustalX for Macintosh, Windows, Linux, and some Unix machines is available free over the Web at

ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/

Help for ClustalX is available at

www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html

Examples in this book were created using ClustalX version 1.81. If you have an earlier version of ClustalX, or if you are still using ClustalW, you will find it useful to download the most recent version of Clustal from one of the above sites. If your version is more recent than 1.81, the appearance of some screens may differ slightly from the illustrations.

TreeView

TreeView is a free program for drawing phylogenetic trees. It does not create those trees; it simply uses files created by phylogeny programs to display and print the trees. TreeView allows you to modify the appearance of a tree to suit your taste and needs. TreeView for Macintosh, Windows, Unix, and Linux is available at

http://taxonomy.zoology.gla.ac.uk/rod/treeview.html

TreeView will soon be replaced by TreeViewX, currently under development. The incomplete development version is currently available from

http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/

TreeViewX promises some interesting new features, so it is worth checking on the progress of this major revision.

PAUP*

PAUP* 4.0 (Swofford 2000) is the primary tree-building program discussed in this book. PAUP* uses the files created by ClustalX to build trees by any one of several methods. PAUP*4.0 beta is available for Macintosh, Windows, and Linux/Unix. PAUP* is an inexpensive commercial program available from Sinauer Associates, Sunderland, MA. Information about ordering PAUP* 4.0 is available at

http://www.sinauer.com/detail.php

and at the PAUP* home page,

http://paup.csit.fsu.edu/

Wherever you order PAUP*, be sure to go to the PAUP* home page and download both the Command Reference Document and the Quick Start Tutorial.
Although it is a superb program, PAUP*4.0 is not yet in its final form. The current release is PAUP*4.0b10, meaning beta version ten. It is not clear that PAUP*4.0 will ever be finalized or that a manual for PAUP* will ever be written, but each beta release represents a significant improvement over the previous beta. In fact, each beta release corresponds to a minor new version.

If you cannot be comfortable without a manual for PAUP*, use PHYLIP to construct your trees and TreeView to draw them. Despite the absence of a manual, I recommend PAUP* and I strongly recommend using the Macintosh version if a Macintosh computer is available to you. The Macintosh version (but not the Windows or Unix versions) includes an excellent interface for drawing and printing trees.

PHYLIP (PHYLogeny Inference Package)

This is a package of programs for inferring phylogenies (evolutionary trees). It is available free at

http://evolution.genetics.washington.edu/phylip.html

PHYLIP is written to work on as many different kinds of computer systems as possible. The source code is distributed (in C), and for some operating systems executables are also distributed. In particular, already-compiled executables are available for Windows 95/98/NT, Windows 3.x, DOS, and Macintosh systems. Complete documentation is available in documentation files that come with the package. The PHYLIP source code is available to be compiled on any Unix machine.

Tree-Puzzle

Tree-Puzzle is a program for constructing Maximum Likelihood (ML) trees from DNA and protein sequences. It is available free of charge from

http://www.tree-puzzle.de/ (Tree-Puzzle home page)
http://iubio.bio.indiana.edu/soft/molbio/evolve (IUBio archive www, USA)
ftp://iubio.bio.indiana.edu/molbio/evolve (IUBio archive ftp, USA)
ftp://ftp.pasteur.fr/pub/GenSoft (Institut Pasteur, France)

MrBayes

MrBayes (Huelsenbeck and Ronquist 2001) is a program for constructing phylogenetic trees by the Bayesian method.
MrBayes is available from

www.mrbayes.net

The currently available version (November, 2003) is a beta version, MrBayes 3.0b5, but the final MrBayes 3.0 should be released by the time you read this. Please note that some commands have changed from MrBayes 2.0, so readers interested in using the examples in this book should update to MrBayes 3.0.

CodonAlign

CodonAlign is a simple program that creates a DNA alignment based on alignments of the corresponding proteins. It introduces into each DNA sequence a triplet gap at the position of each gap in the aligned protein sequence. CodonAlign is available from this book's companion website,

http://sinauer.com/hall/

The new version, CodonAlign 2.0, is somewhat easier to use than the original. Users of the earlier version should be sure to read the new documentation for input file format.

Other Programs

There are many phylogenetics programs that I have not mentioned here, some of them widely distributed. My failure to mention a program does not imply that the program is not useful or valuable. I have chosen to include a minimal set of programs that will allow you to implement the methods I discuss in this book, and I have tried to include only programs that are available for Macintosh, Windows, and Unix platforms. See

http://evolution.genetics.washington.edu/phylip/software.html

for a more extensive list of phylogenetics programs.

Files and Utilities on the "Phylogenetics Made Easy" Website and CD

The "Phylogenetics Made Easy" (PME) website was created especially for this book. Located at

http://www.sinauer.com/hall/

the site contains a variety of files that will make it easier for you to follow the tutorial without actually downloading the sequences to create an input file for ClustalX. It also includes copies of all of the relevant output files so you can compare your results with those I obtained in the event that you have difficulties.
In addition, the CD that comes with this book includes all of the files that are on the website at the time of publication. Instead of downloading, you can simply copy those files from the CD.

The site and CD also include templates, or boilerplate, for various input files. You can copy those templates directly into your own input files in lieu of typing everything manually. It is amazingly easy to make a tiny typing error, such as "lst_pos" instead of "1st_pos", and spend hours trying to figure out why your program won't run. Copying blocks of critical text, then modifying those blocks for your specific needs, is a good way to avoid such problems. Indeed, many of us use such templates routinely.

Throughout the text these files and templates are indicated with an icon. For example,

Chapter 1 files: Clustal#1: Metalloaln

means on the website in the Chapter 1 files folder, in the Clustal#1 folder, the Metalloaln file.

The site and CD also include a copy of the program CodonAlign 2.0, mentioned earlier. They do not include copies of ClustalX, TreeView, or MrBayes. Although these programs are distributed at no charge, it is much better for you to download the most recent versions from the websites listed above than to work with the versions that were current at the time this book was assembled.

Some Conventions Used in This Book

• Click, as in "click the OK button," means to use the mouse to position the cursor over the indicated button on the screen and to depress and quickly release the mouse button. Double Click means to click twice rapidly, without moving the mouse.
• Drag means to position the mouse and, while holding down the mouse button, move the mouse to another position.
• Select means to highlight a section of text or an object on the screen by dragging across the indicated region or by double clicking on the object.
• In the text, the Chicago font indicates a menu item or a button that you will see on the screen.
• For command-line programs such as PHYLIP, MrBayes, and CodonAlign, the Courier font indicates text that you will see on the screen or that you will type into an input file.

Chapter 1
Tutorial: Create a Tree!

Why Create Phylogenetic Trees?

Today phylogenetic trees appear frequently in molecular papers that are unrelated to phylogenetics or to evolution per se. Their inclusion reflects the growing recognition of trees as a tool for understanding biological processes. Phylogenetic trees allow you to organize your thinking about a protein of interest in terms of its relationship to other proteins, and may allow you to draw con- [...]

>gi|6977548|emb|CAB75346.1| metallo beta-lactamase [Stenotrophomonas maltophilia]

The next sequence, however, is shorter and will not be included (Figure 1.7). For that sequence the first amino acid aligns with residue 23 of the query sequence. It and the next four sequences are simply mature L1 in which the leader peptide is not given. They will not be included.

Continuing down the alignments in this fashion, select the sequences to be included, in each case ticking the selection box at the left. I've chosen to include only one of the GOB variants, but you might choose to include several of these if you want a more exhaustive phylogeny.

When I reach >gi|20090857|ref|NP_616832.1| I notice that the sequence alignments have become short and the E values have risen above 10^[...] (Figure 1.8). Since for my purposes I am only interested in sequences that align over most of the length of the query, I will exclude these sequences and all sequences below them on the list. I emphasize "for my purposes" because these decisions depend entirely on the purpose of your phylogeny.
You might well want to go much farther down the list if you are interested in related proteins that have greatly diverged from your protein of interest.

The decisions about which sequences to keep and which to eliminate cannot be reduced to an algorithm; they depend on what you intend to accomplish with your phylogenetic tree. Do you want as complete a tree as possible? In that case, you will keep everything that is probably a true homolog and is not identical to another sequence. (Note that more than one investigator may submit the same sequence to GenBank, resulting in duplicate entries.) If you only want to show representatives of the major groups, you will be much more selective.

Downloading the Selected Sequences

The BLAST Report Web page provides a convenient means of downloading to your own computer the sequences you have just identified. Scroll to the beginning of the alignments and click the Get Selected Sequences button (Figure 1.9, arrow).

You will now see the screen in Figure 1.10.

Ten sequences were selected (scroll down to see all of them). Tick the box at the left of each sequence (Figure 1.10, arrow) to select them all. Next, change the Display choice from Summary to FASTA and the Send To choice so that it reads File. Click the Display button to see Figure 1.11.
FASTA is the file format that you will use in the next step to create an alignment. Again, be sure to tick each of the selection boxes, then click the Send To button. In the resulting dialog, name the file something like MetalloBla.fasta.

You can compare the FASTA file you just saved with the one I created by downloading MetalloBla.fasta from the Chapter 1 files.

Chapter 1 files: MetalloBla.fasta

Now change the Display to GenPept and click the Display button to see Figure 1.12. GenPept files provide a wealth of information about each sequence, information that you may well need later when you write your paper. You will again need to scroll down to select each file, then click the Send to (File) button. In the resulting dialog change its name to something like MetalloBla.GenPept and save it.

The files you just saved are text (ASCII) files that can be opened in any word processor.

Creating the Multiple Alignment

If you are familiar with ClustalW or earlier versions of Clustal (Higgins and Sharp 1988; Thompson et al. 1997; Thompson et al. 1994), you should probably skim this section to become aware of the new "windows" style interface offered by ClustalX. You should also look at Chapter 2 to see some of the new capabilities of ClustalX that were not available in ClustalW.

A pair of sequences can be aligned by writing one sequence above the other in such a way as to maximize the number of residues (nucleotides or amino acids) that match by introducing gaps (spaces) into one or the other sequence. Biologically, those gaps are assumed to represent insertions or deletions that occurred as the sequences diverged from a common ancestor.
If we could insert as many gaps as we choose, we could align any two random, unrelated sequences so that all residues either matched perfectly or were across from a gap in the other sequence. Such an alignment would be meaningless, however. It is necessary to somehow constrain the number of gaps so that the resulting alignment makes biological sense. To do that, a scoring system is used so that matching residues get some sort of positive numerical score, and gaps get some sort of negative score, or gap penalty. An alignment program seeks an arrangement that maximizes the net score.

For nucleic acid alignments, matching residues usually get a score of 1 and mismatches get a score of 0. For protein sequences, scoring is more complicated because mismatches between biochemically similar amino acids usually get an intermediate score. Those scores are usually determined by the alignment program itself. Details of those scoring methods will be discussed in Chapter 2 in the section on "Fine-Tuning Alignments" (page 64).

Gap penalties, on the other hand, are typically set by the user, and typically there is a penalty for creating a gap plus an extra penalty for the length of the gap. Details are covered later in this chapter, in the section on "Changing Gap Penalties" (page 34).

Aligning a pair of sequences is not a computationally difficult process, and a variety of programs exist to align sequence pairs. Multiple alignments are considerably more complex, and only a few programs do a really good job. The ClustalX program is one of the best tools for creating multiple alignments.

ClustalX is an updated version of ClustalW, an old-fashioned "menu-driven" program. In terms of what it does, ClustalX is virtually identical to ClustalW, but ClustalX has a windows environment that will be familiar to Macintosh, PC, and Unix users alike. To better understand what ClustalX does, pull down the Help menu in ClustalX and read the various entries.
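
As a rough illustration of the scoring idea described above, the sketch below scores two nucleic acid sequences that have already been aligned. It is not ClustalX's algorithm, and the function name and penalty values are assumptions chosen only for the example: matches score 1, mismatches score 0, and each gap costs an opening penalty plus an extension penalty for every position in the gap.

```python
def alignment_score(seq_a, seq_b, match=1.0, mismatch=0.0,
                    gap_open=2.0, gap_extend=0.5):
    """Score two aligned sequences of equal length; '-' marks a gap.

    The penalty values are illustrative assumptions, not ClustalX defaults.
    A gap of length n costs gap_open + n * gap_extend.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    score = 0.0
    in_gap_a = False  # currently inside a run of gaps in seq_a
    in_gap_b = False  # currently inside a run of gaps in seq_b
    for x, y in zip(seq_a, seq_b):
        if x == "-" and y == "-":
            continue  # a column with gaps in both sequences carries no information
        if x == "-":
            if not in_gap_a:
                score -= gap_open      # penalty for creating the gap
            score -= gap_extend        # penalty for each gap position
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            if not in_gap_b:
                score -= gap_open
            score -= gap_extend
            in_gap_b, in_gap_a = True, False
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# One gap of length 2 versus two separate gaps of length 1.
print(alignment_score("ACGTTA", "AC--TA"))  # 1.0
print(alignment_score("ACGTTA", "A-C-TA"))  # -2.0
```

With these assumed penalties the single two-position gap scores higher than two one-position gaps, which is the behavior a separate gap-creation penalty is meant to produce.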
For more details on ClustalX, you can go to the online ClustalX help file on the Web at: www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html

Creating the Input File

ClustalX, like any other computer program, requires that the data it manipulates (the input file) be in a format that it can recognize. You can use your favorite word processor to create the input file. The input file must contain each of the sequences that are to be aligned.

In our example use the MetalloBla.fasta file that you saved from the BLAST search. For convenience, you will edit that file. Just to be safe, though, first make a copy of the MetalloBla.fasta file and name it something like Metallo.fasta. It is always a good idea to work on copies rather than originals of important files.

ClustalX will recognize several formats for the sequences, but we will use the FASTA format (see Appendix I) because we downloaded sequences in that format. The FASTA format can be recognized because the first line begins with the “>” character. That character is followed by a single word that ClustalX will use as the name for the sequence in the multiple alignment that it creates. Open the file using your favorite word processing program. The first sequence in the file looks like this:

>gi|1705478|sp|P52700|BLA1_XANMA Metallo-beta-lactamase L1 precursor (Beta-lactamase, type II) (Penicillinase)
HRS TLLAPALAVALPRAETSAREVPLBOLRAYTVDASHLQPVAP.ATACHTWQIGTEDLIALNGTEDA ‘VuLnoare aS. @CARGVTPRDLRL TLL SHARADHRGPVRELARRTGAKVAANASSAVEEARSSS DDLSIP EDGE TYPPANADRTVMDGEVET VOGT FALE UBCRITPOSTANTHTCTRNGKEVREAYADSLSAP (HOLA SYPRY PL TED RS EATVRAL-PCLVLLTPHPGRSNDYARGARRGARALTCXRYADAASOREDS CETAGR

Clustal treats everything between “>” and the first space as the sequence name. Because it wouldn’t be very helpful to have the sequence name appear as >gi|1705478|sp|P52700|BLA1_XANMA in the multiple alignment display, we need to change the name to something more useful.
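The naming rule just described (the name is everything between “>” and the first space) is easy to check with a short sketch. The helper names `clustal_name` and `prepend_name` are hypothetical and exist only for illustration; this is not code from ClustalX, and the short name used in the example is simply one possible choice.

```python
def clustal_name(header_line):
    """Return the text between '>' and the first space of a FASTA header."""
    header_line = header_line.rstrip("\n")
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header line must begin with '>'")
    return header_line[1:].split(" ", 1)[0]


def prepend_name(header_line, short_name):
    """Insert a short name (followed by a space) right after the '>'."""
    if not short_name or any(ch.isspace() for ch in short_name):
        raise ValueError("the short name must be non-empty and contain no spaces")
    return ">" + short_name + " " + header_line.rstrip("\n")[1:]


header = (">gi|1705478|sp|P52700|BLA1_XANMA "
          "Metallo-beta-lactamase L1 precursor")

print(clustal_name(header))                      # gi|1705478|sp|P52700|BLA1_XANMA
print(clustal_name(prepend_name(header, "L1")))  # L1
```

Editing the headers by hand in a word processor, as the text describes, achieves exactly the same result; a script like this is only worth the trouble when a file holds many sequences.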
The choice of sequence names can make a lot of difference. Some of the programs we will use later will only recognize the first 10 characters of the sequence name; others will not accept certain characters in the name. In particular, the name cannot include any spaces because then Clustal will only read the first part of the sequence name. The safest thing to do is to always pick names of 10 or fewer characters that use only letters and numbers.

Let's insert L1 immediately after the “>” character so that the first line now reads

>L1 gi|1705478|sp|P52700|BLA1_XANMA Metallo-beta-lactamase L1 precursor (Beta-lactamase, type II) (Penicillinase)

Continue down the list of sequences, inserting a recognizable name (followed by a space) after each “>” symbol.

In this example, the query sequence you used for the BLAST search was already in the databases. If your query were a new, unpublished sequence, that would not be the case and your FASTA file would not include the query. In such a case, just enter the sequence manually by adding >sequenceName on a new line and pasting the sequence on the line directly below that.

Finally, we will save the file in plain text, or ASCII, format. This is important because ClustalX will not recognize Microsoft Word, WordPerfect, or other word processor files.

Getting the Data into ClustalX

Start ClustalX and you will see a window that looks something like Figure 1.13. Pull down the File menu and choose the Load Sequences menu item. Navigate to the folder (subdirectory) that contains the input file (in this case, Metallo.fasta) and choose that file. Clustal will load the data from that file and the window will now look like Figure 1.14. The left pane lists the sequences according to the name that follows the “>” symbol in the input file. The right pane shows the beginning of each sequence.
You can scroll to the right to see the rest of each sequence by using the scroll bar at the bottom of that pane.

Figure 1.13

(Chapter 1 files: Metallo.fasta)

You may note that many of the residues are shaded in Figure 1.14. On your screen, those shades of gray will be different colors. The colors are applied according to a scheme that indicates the group of amino acids to which the consensus (most common) residue at each position belongs. At this point, however, the sequences have not yet been aligned and the consensus colors are meaningless.

Figure 1.14

Some General Comments about Creating Alignments

An alignment is not an absolute thing. It is a “best guess” according to some algorithm used by a computer program. One cannot simply have a program compute an alignment and, without further thought, use that alignment to create a phylogeny. It is necessary for the user to carefully and thoughtfully examine each alignment to see whether it makes biological sense. Often it will be useful to modify some of the parameters used by the computer program in order to improve the alignment. I will discuss such modifications in Chapter 2, in the section on “Fine-Tuning Alignments.”

Setting the Alignment Parameters

ClustalX creates a multiple alignment in three stages:

1. It individually aligns each sequence to each of the other sequences in a series of pairwise alignments.
2. It uses that set of pairwise alignments to create a guide tree.
3. It uses that guide tree to help create the multiple alignment.

In order to create pairwise alignments, ClustalX needs to know what penalties to assign for the creation of a gap and for the “extension” (length) of that gap. Pull the Alignment menu down to choose the Alignment Parameters menu item, which will reveal a submenu from which you should choose Pairwise Alignment Parameters (Figure 1.15).
Figure 1.15 (the Alignment menu with the Alignment Parameters submenu)

You will then see a dialog box that looks like Figure 1.16.

Figure 1.16 (the Pairwise Parameters dialog: Pairwise Alignments method, Gap Opening and Gap Extension penalties, Protein Weight Matrix, and DNA Weight Matrix)

Alignment and gap penalty parameters. The first choice, Pairwise Alignments, allows you to choose between a Slow-Accurate method and a Fast-Approximate method. The Slow-Accurate method is preferred, but if you are aligning so many sequences or the sequences are so long that the program takes a long time to run, you may want to use the Fast-Approximate method. Most modern computers are so speedy that you probably will not need the Fast-Approximate method.

The box shows the default values for the Gap Opening penalty (10.00) and the Gap Extension penalty (0.10). Decreasing the gap penalties will allow the introduction of more gaps and will thus produce fewer mismatches in the alignment, but may also result in spurious matches that do not really reflect homology (identity by descent). Increasing the gap penalties will have the opposite effect: increasing the rigor of the alignment may result in missing matches that actually do reflect homology.

For aligning DNA sequences, I prefer the default parameters shown in Figure 1.16.
For aligning protein sequences, I prefer increasing the gap opening penalty to 35 and the gap extension penalty to 0.75 as a starting point. The important thing to remember is that after the multiple alignment is complete, we will examine the alignment and see if changing the parameters will improve it.

Weight matrix parameters. As pointed out earlier, ClustalX seeks to maximize the score of the alignment by giving high scores to matching residues and low or zero scores to mismatching residues. The IUB DNA Weight Matrix scores matches as 1.9 and mismatches as 0, except that it scores all X’s and N’s as matches to any IUB ambiguity symbols.

The Protein Weight Matrix is more complicated because during alignment ClustalX takes into account not only identity, but also biochemical and coding similarity of residues when calculating the score of the alignment. The various protein weight matrices weight different mismatches slightly differently. Each gives the highest weight to identical residues (e.g., Tyr-Tyr), but some mismatches get higher scores than others based on the biochemical and functional similarities of the different amino acids (Tyr-Phe scores higher than Tyr-Pro, for instance).

The BLOSUM matrix appears to be the best for searching databases. The PAM matrix has been used widely for about 20 years, and the default Gonnet matrix amounts to an updated PAM matrix that is based on a far larger data set. For now, I suggest using the default Gonnet series, but you should feel free to choose alternative matrices and to realign the sequences to get a feel for the effect of the matrix on the alignment. If you do change the matrix, be sure to make the same change in the next set of settings, Multiple Alignment Parameters.

Multiple alignment parameters.
Choose Multiple Alignment Parameters from the Alignment Parameters menu to see a dialog box that looks like Figure 1.17.

Figure 1.17 (the Multiple Parameters dialog: Gap Opening, Gap Extension, Delay Divergent Sequences, DNA Transition Weight, Use Negative Matrix, Protein Weight Matrix, and DNA Weight Matrix)

Again, for DNA sequences I like the default settings, but for protein sequences I prefer to change the Gap Opening Penalty to 15.00 and the Gap Extension Penalty to 0. Delay Divergent Sequences determines how different two sequences must be in order for their incorporation into the multiple alignment to be delayed. I prefer to set this to 25%, but you can use the default value of 30% if you like.

If you chose an alternate Protein Weight Matrix for pairwise alignments, be sure to choose the same matrix for multiple alignments now.

Format. The last setting you need to apply before performing the alignment is the format for the output. When it creates an alignment, ClustalX writes that alignment to your hard drive in the form of an output file. The format of that file is user-determined, and the user makes the decision based on the needs of the program that will use the alignment file to construct a phylogeny, or for any other purpose. Choosing Output Format Options under the Alignment menu will display the dialog shown in Figure 1.18.

Figure 1.18 (the Output Format Options dialog: CLUSTAL, NBRF/PIR, GCG/MSF, PHYLIP, GDE, and NEXUS format check boxes; GDE output case; CLUSTALW sequence numbers; Output order; Parameter output)

In the Output Files section you can check any or all of the boxes (you must check at least one); the default is CLUSTAL format. ClustalX will create and write an output file for each of the boxes you checked.
The files will be written to the same folder (directory) that contained the input file.

We will eventually be using PAUP* to create phylogenies from the ClustalX output, and for that purpose the Nexus format is most convenient. That output file will have the suffix .nxs appended to it. (If the Nexus file format is not listed you have an older version of ClustalX. Go to one of the sites listed on page 5 of this book and download the most recent version.) If you want to use the free program PHYLIP to construct your tree, also tick the PHYLIP format box.

You may eventually want to publish the alignment, or simply print it for your own purposes. The Clustal format, which has the suffix .aln, is the most convenient for publication, and it is much more useful if it includes sequence numbers. Therefore, change Clustal Sequence Numbers to On.

The remaining choices affect the appearance of the output file and are discussed in Chapter 2. For now, leave them in the default state.

Creating the Alignment

Finally, it is time to actually create the alignment. Simply choose Do Complete Alignment under the Alignment menu.

ClustalX will let you know what it is doing as it creates first the series of pairwise alignments and, from that, the multiple alignment. When it is all done, the alignment window will look like Figure 1.19.

Figure 1.19

The colors (shades of gray here, but colors on your screen) indicate the amino acid family to which the consensus residue belongs. If there is no color, it means there was so much variation at that position that there is no consensus. The histogram below the ruler line indicates the degree of similarity: peaks indicate positions of high similarity, valleys positions of low similarity.
The gray line just above the sequences is used to mark strongly conserved positions. The “*” character indicates positions that have been fully conserved, the “:” character indicates that one of the “strong” groups of amino acids is fully conserved (i.e., all of the amino acids at that position belong to the same “strong” group), and the “.” character indicates that one of the “weak” groups of amino acids is similarly fully conserved.

The “strong” groups are: STA, NEQK, NHQK, NDEQ, QHRK, MILV, MILF, HY, FYW

The “weak” groups are: CSA, ATV, SAG, STNK, STPA, SGND, SNDEQK, NDEQHK, NEQHRK, FVLIM, HFY

For more details, see the ClustalX documentation.

Refining and Improving the Alignment

(Chapter 1 files: Clustal #1f)

Some people simply stop at this point. An alignment now exists, and ClustalX has saved the alignment file in the same folder (directory) where it found the input file. The alignment file will have a name such as Metallo.aln for an alignment in Clustal format or Metallo.nxs for an alignment in Nexus format. There is also a file called Metallo.dnd in the same folder. The .dnd extension to the file name indicates this is the guide tree file that ClustalX created from the pairwise alignments.

It is possible to use the files we have just created to construct a phylogenetic tree, but the quality and value of that tree will be no better than the quality of the alignment, and we have not yet considered that quality. It is also essential to understand that no matter how unrelated the sequences are, ClustalX will always generate an alignment. The mere existence of an alignment with a few patches of color highlighted does not mean that the sequences are related. It is up to the user to ensure that the sequences in the dataset are actually homologous.

At this stage you need to examine the alignment to see if most of the gaps make sense. If many of the gaps seem to be arbitrary (i.e., you think you could have done better by eye), then you will need to improve the alignment.
Likewise, if there are large regions that are present in only one or two sequences (i.e., they appear as gaps in all other sequences), you may need to delete those regions in the sequence input file. Such regions do not share homology with the other sequences, and their presence will only contribute to artifacts when a tree is eventually generated.

Eliminate truncated sequences. Although it does not occur in this example, you may find that you have included a sequence that is significantly shorter than the other sequences, and that after the end of that sequence there are no asterisks indicating identity. If a gap is present in any sequence there is no asterisk, even if all of the other sequences are identical. This makes it more difficult to notice areas of high similarity, so it is often better to delete truncated sequences from consideration. Click on the name of the sequence in the list of sequence names to select it, then choose Cut Sequences from the Edit menu to eliminate that sequence. The identical and similar amino acids will now be identified above the alignment. Choosing Do Alignment from Guide Tree from the Alignment menu will rewrite the alignment files without the truncated sequence.

Delete nonhomologous regions from the sequences. Again, it does not occur in this set of example sequences, but it sometimes happens that some sequences are much longer than your sequence of interest, so that your sequence aligns to only a portion of those sequences. In that event, the nonhomologous regions of the large sequences may align well to each other, but not at all to your sequence, which can cause ClustalX to incorrectly interpret the distances on the eventual tree.

Make a copy of your original FASTA input file. Working with the copy, you should then delete the nonhomologous portion from each of the large sequences. Use this copy to repeat the alignment.

Note that this does not mean you should remove every region that covers a gap of 20-30 residues.
Some judgment is required here.

Change the gap penalties. The ultimate achievement would be to create an alignment in which all the gaps represent the real deletion or insertion events that occurred during the divergence from a common ancestral sequence. But since we cannot know those real events, we settle for a reasonable approximation by assigning and adjusting gap penalties.

If we could introduce an unlimited number of gaps of unlimited lengths, we could align any two unrelated sequences perfectly, in the sense that all characters would either be across from a gap or across from the same character. Obviously, such an alignment would be meaningless. Alignment programs prevent that by penalizing the alignment score for each gap and for each additional residue in a gap.

For the novice, one of the most troublesome aspects of creating alignments is the problem of modifying the gap penalties. How do you know when you should change the penalties? How can you determine how much to change them, and in which direction? The statements “To a large extent it is a matter of experience,” or “It requires a practiced eye” are unhelpful and frustrating.

We should attempt to minimize the number and size of gaps while maximizing the extent of conserved blocks. A “conserved block” is a region in which similar or identical residues occur across all or most of the sequences. In the alignment window, a column in which all of the residues are identical will be one solid color; heavily colored regions represent highly conserved blocks. Regions with many gaps and very little color are relatively divergent.

If the gap penalties are set too high, the gaps necessary to bring truly homologous residues into alignment will not be introduced and blocks of true homology will be broken up (i.e., not very colored in the alignment window).
If gap penalties are set too low, there will be many gaps and lots of individual residues that are aligned in two or three of the sequences, but very little color because only “consensus” residues are colored.

As we increase gap penalties, we check to see if we are reducing the number of gaps, which is good, or whether we are starting to break up homologous blocks that were present at lower gap penalties, which is not good. As we lower gap penalties, we check to see whether we start to generate more homologous blocks, which is good, and continue to check that we are not breaking up existing homologous blocks.

Scrolling across the alignment to the extreme right end, notice that the region between residues 236 and 265 is somewhat conserved (Figure 1.20). There is one completely conserved residue indicated by “*”; several other conserved residues indicated by “.” or “:” appear in the gray region above the alignment. From 266 to about 288 there are no conserved residues, and there is another conserved region centering about residue 301.

Another useful way to get an impression of the overall quality of the alignment as you scroll across is to notice the peaks in the histogram below the ruler. The 266-283 residue region is pretty flat.

Figure 1.20 (panels A, B, and C)

We now want to determine whether we can improve the existing alignment. We can begin by increasing the pairwise gap penalties to 100 and 7.5 and the multiple alignment penalties to 100 and 3.0. We also need to choose Reset All Gaps Before Alignment under Alignment Parameters. Once that is done, we do the Complete Alignment again and see how the alignment has changed:

• Figure 1.20B illustrates that by increasing the gap penalties we have eliminated the gaps that allowed the conserved residues to align, and the histogram is even flatter than before.
These extreme gap settings are clearly inappropriate.

• Figure 1.20C shows the results when we reduce the pairwise alignment parameters to 2 and 0.02 and the multiple alignment parameters to 1.0 and 0.01. A lot more gaps have been introduced and the alignment has gotten longer (40 versus 320 sites), but not much has been gained in terms of increasing the number of conserved residues. Since each gap represents an insertion or deletion that occurred during the history of those sequences, the addition of a lot more gaps is clearly much less parsimonious because it requires many more insertion/deletion events. These extremely low gap penalties are also inappropriate.

Although adequate for illustrative purposes, determining the effects of changed gap penalties on only one region is not enough to really assess the effects of those changes. The problem is that we cannot really retain an image of an entire alignment before the change to compare with the alignment after the change. We need to be able to compare hard copies of the alignments after each modification of the parameters. In order to do that, we need to print the alignment, as described in Appendix II. Use the printed alignment to compare the results of changing gap penalties and to decide the penalties that are best for your dataset.

It is important to realize that each time ClustalX realigns a sequence, it writes over the existing .aln, .phy, and .nxs files, and that each time it writes an alignment as PostScript (see Appendix II), it writes over the existing .ps file. Thus it is necessary to either move the files to a new folder or to rename each file immediately after it is created, before changing gap penalties and realigning the sequences.

One would not normally increase gap penalties as dramatically as in the example shown in Figure 1.20. Typically, one would increase penalties about 50% and see what is happening over the entire sequence.
If there has been improvement, increase the penalties another 50%, and so on. When changes start to show no improvement or to make things worse, then back down toward the last values that improved the alignment. Although it can be time-consuming, attempting to improve the alignment through this process of examination and modification of penalties is probably the single most important thing you can do to ensure a high-quality alignment and make a high-quality phylogeny possible.

The reader may wonder if it is possible to alter gap penalties for only one section of an alignment so as to avoid perturbing portions that are well aligned already. Until ClustalX, that was a difficult thing to do. ClustalX has implemented a method for accomplishing that goal; that method is discussed in Chapter 2 under “Fine-Tuning Alignments” (pp. 64-68).

Aligning New Sequences to an Existing Alignment, or Aligning Two Existing Alignments

Because creating alignments can be time-consuming if there are many sequences or if the sequences are very long, it can be useful to be able to add a new sequence to an existing alignment. Similarly, you may need to align two different sets of sequences in which the sets are more distantly related to each other than are the sequences within each set. Each of these procedures is done using the Profile Alignments option. ClustalX uses the term “profile” to refer to an existing alignment.

After aligning the Metallo protein set, I discovered that there were three additional proteins available from various genome sequencing projects. I acquired those sequences and created another input file called genomes.fasta.

(Chapter 1 files: genomes.fasta)

To add those to the final alignment of Metallo, start the ClustalX application (if it is not already running) and change Multiple Alignments to Profile Alignments in the main window.
The window now displays two alignment areas, each of which can display a separate alignment (Figure 1.21).

Figure 1.21

We begin by choosing Load Profile 1 from the File menu, and select the Metallo.aln file. The existing Metallo alignment will be loaded into the upper window. Now choose Load Profile 2 and select genomes.fasta, which is not really a profile at all but is simply another file of sequences. The window now looks like Figure 1.22.

Figure 1.22

(Chapter 1 files: Clustal #2f)

Now choose Align Sequences to Profile 1 from the Alignment menu. Doing this aligns the sequences in the lower window to the existing alignment in the upper window and creates an alignment file genomes.aln (i.e., the new file is named according to the name of the input file for the new sequences). The lower window now shows the new sequences aligned to each other and to the preexisting alignment.

By checking the Lock Scroll box in the alignment window, we can scroll the two windows together to look across the alignment. Alternatively, we can choose the Multiple Alignment mode to bring the entire alignment into a single window (Figure 1.23).

Figure 1.23

If we want to print this alignment, we can write the alignment as a PostScript file as described in Appendix II.

Sometimes it is useful to align one existing alignment to another existing alignment for the purpose of creating an overall phylogeny that includes both data sets. For that purpose, we would again choose the Profile Alignment Mode and load the two alignments exactly as in the example above, but we would do the alignment by choosing Align Profile 2 to Profile 1 from the Alignment menu. It is important to remember to use the Align Profile 2 to Profile 1 choice only when both datasets have already been aligned.
At this point, you have an alignment in hand that we will assume is the best alignment possible at this stage. Remember that a phylogeny is meaningless, or worse, misleading, unless it is based on a reasonably well-done alignment. For more on aligning sequences see Chapter 2, “Fine-Tuning Alignments.”

Phylogenetic Analysis

This section is a brief introduction to the methods of phylogenetic analysis, with emphasis on implementation and briefly on interpretation of phylogenetic trees.

Exactly What Is a Phylogenetic Tree?

Traditionally, phylogenetic trees have been used to represent the historical relationships of groups of organisms. Each group is called a taxon. Until some 25 years ago, those relationships were based primarily on morphological character data from extant taxa and the fossil record. With the advent of molecular sequencing, an almost unbelievably extensive new data set entered the picture.

Systematists are usually interested in the relationships of the taxa (very often species), and not very interested in the relationships of the underlying data. Typically, molecular biologists have little interest in the relationships of the taxa per se, but instead are interested in the relationships of the sequences. Because phylogenetics programs such as PAUP*, PHYLIP, and MrBayes are written by and for systematists and are intended to be used with both sequence and morphological data, they refer to taxa, not sequences, in their menus and documentation. For the purposes of this book, the words “taxa” and “sequences” are used almost interchangeably.

A phylogenetic tree is a simple object consisting of two elements: nodes and branches. A branch is a line that connects two nodes. Nodes can be either external nodes, which are the tips of the tree that are the taxa being considered, or internal nodes, which are points that represent a common ancestor of two or more other nodes. (For more details see “Learn More about Phylogenetic Trees,” p. 42.)
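Because a tree is just nodes connected by branches, it maps naturally onto a small data structure. The sketch below is an illustration only; the class name `Node`, the helper functions, the taxon labels, and the branch lengths are all hypothetical and are not taken from PAUP*, PHYLIP, or any other program mentioned in this book. Each node stores the branches that lead to its descendants; a node with no descendants is an external node (a taxon), and every other node is an internal node (an inferred ancestor).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    # Each branch connects this node to a descendant node and has a length.
    branches: List[Tuple["Node", float]] = field(default_factory=list)

    def is_external(self) -> bool:
        """External nodes are the tips of the tree: they have no descendants."""
        return not self.branches


def external_names(node: Node) -> List[str]:
    """Names of the tips (taxa or sequences) found at or below this node."""
    if node.is_external():
        return [node.name]
    names: List[str] = []
    for child, _length in node.branches:
        names.extend(external_names(child))
    return names


def internal_names(node: Node) -> List[str]:
    """Names of the ancestral (internal) nodes found at or below this node."""
    if node.is_external():
        return []
    names = [node.name]
    for child, _length in node.branches:
        names.extend(internal_names(child))
    return names


# A made-up three-taxon tree: X is the common ancestor of A and B,
# and R is the common ancestor of X and C.
a, b, c = Node("A"), Node("B"), Node("C")
x = Node("X", [(a, 1.0), (b, 2.0)])
r = Node("R", [(x, 1.5), (c, 3.0)])

print(external_names(r))  # ['A', 'B', 'C']
print(internal_names(r))  # ['R', 'X']
```

Storing branches as parent-to-child links implies a root, which is convenient for illustration; an unrooted tree would need a representation that does not single out any one node as the starting point.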
Methods for Constructing Phylogenies

There are currently four primary methods for constructing phylogenies from protein and nucleic acid sequence alignments:

1. Distance methods, of which Neighbor Joining (NJ) is currently the favored implementation.
2. Maximum Parsimony (MP).
3. Maximum Likelihood (ML), which is currently implemented only for the analysis of nucleic acid sequences in PAUP* but is implemented for proteins in the program Tree-Puzzle.
4. Bayesian (BAY), a new method that is gaining rapid acceptance in the phylogenetics field.

No single method is the best for all circumstances. The method of choice depends both on what you want to learn and on the size and complexity of the data set. In practical terms, it will also depend on the speed of your computer and the ease of implementing the particular method. Each method is covered in more detail in Chapter 2. For the remainder of this chapter, we will use the NJ method implemented by ClustalX to create a phylogenetic tree.

LEARN MORE ABOUT Phylogenetic Trees

A phylogenetic tree is composed of lines called branches that intersect and terminate at nodes. The nodes at the tips of the branches represent the taxa (or, in the case of sequence data, the sequences) that exist today and that we can actually examine. The internal nodes represent ancestral taxa, whose properties we can only infer from the existing taxa.

Figure 1 is a rooted tree whose branch tips represent five taxa (A-E) in a clade, with four internal nodes (R, X, Y, Z) representing ancestral taxa, including the root (R). The numbers on the branches indicate the number of changes in a particular sequence that occurred along that branch; for example, between X and Y three changes occurred, whereas there is only one difference in sequence between Y and D. These numbers represent the branch length.
Even if exact values are not provided, the relative lengths of the branches may be drawn in proportion to the number of changes along that branch.

The tree in Figure 1 is additive because the distance between any two nodes equals the sum of the lengths of all the branches between them. While it might seem intuitive that all trees must be additive, that is not the case. If multiple substitutions have occurred at any particular site, then additivity will not hold unless the distances are corrected for multiple substitutions.

A node is bifurcating if it has only two immediate descendant lineages. We usually assume that evolutionary speciation is a binary process that results in the formation of two species from a single ancestral species. That may not always be the case, or available data may not make it possible to resolve the order in which species descended from a single common ancestor, in which case a node is multifurcating. Another term for a multifurcating node is polytomy. Because nodes N1, N2, and N3 in the left tree of Figure 2 are all bifurcating, that tree is strictly bifurcating. In the right tree, node N1 has three descendant lineages (i.e., there is a polytomy at node N1), so the right tree is not strictly bifurcating.

(Phylogenetic Trees continued)

For four taxa there are only the three possible unrooted trees shown in Figure 4. Once a root is identified, five different rooted trees can be created for each of these unrooted trees, each with a distinctive branching pattern reflecting a different evolutionary history for the relationships shown in Figure 4. There are thus 15 possible rooted trees for four taxa.

Figure 4 (Unrooted Trees)

The number of possible trees, both rooted and unrooted, increases rather dramatically as the number of taxa increases. Where s is the number of taxa, the number of possible unrooted trees is

$$N_{\text{unrooted}} = \frac{(2s-5)!}{2^{s-3}\,(s-3)!}$$

and the number of possible rooted trees is

$$N_{\text{rooted}} = \frac{(2s-3)!}{2^{s-2}\,(s-2)!}$$

Shown in tabular form, the results are startling:

Taxa   Unrooted trees   Rooted trees   Comment
4      3                15
8      10,395           135,135
10     2,027,025        34,459,425
22     3 × 10^23                       Almost a mole of trees
50     3 × 10^74                       More trees than the number of atoms in the universe
100    2 × 10^182

As databases grow, needing to construct trees of >100 sequences becomes ever more of a possibility.

Most phylogenetic methods produce unrooted trees, but unless you specifically choose Unrooted Phylogram or Unrooted Cladogram, when PAUP* prints those trees they appear to be rooted. For display purposes, PAUP* has put a bend in one branch or another, but that does not actually root an unrooted tree. Unless you know that a tree is rooted, either because you rooted it yourself or because an author tells you it is rooted, assume that it is unrooted. An advantage of printing a tree in the Unrooted or radial format is that it makes its unrooted status absolutely clear.

It is important to distinguish between a tree and the way that tree is drawn. For instance, it is obvious that the two trees in Figure 5 are the same and that nothing is really different as the result of swapping the way the branches to taxa A and B were drawn. It is not as obvious, but it is also true that the trees in Figure 6 are the same.

A succinct overview of phylogenetics emphasizing the molecular aspects can be found in Chapter 5 of Graur and Li 2000.

Using ClustalX to Create a Neighbor Joining Tree

If it is not already running, start ClustalX, choose Load Sequences from the File menu, and open the genomes.aln alignment file. Everything having to do with creating an NJ tree is implemented from the Trees menu (Figure 1.24).
[Figure 1.24: the ClustalX Trees menu, with the items Draw N-J Tree, Bootstrap N-J Tree, Exclude Positions with Gaps, Correct for Multiple Substitutions, Save Log File, and Output Format Options]

Although ClustalX creates trees, it does not draw or display those trees on the screen. Instead, it saves trees in files that can be understood by other programs that do draw trees. Just as there are several formats for alignments, there are several formats in which ClustalX can write trees. From the Trees menu choose Output Format Options (Figure 1.24) to display the dialog in Figure 1.25.

[Figure 1.25: the Output Tree Format Options dialog, with checkboxes for Clustal format tree, Phylip format tree, Phylip distance matrix, and Nexus format tree, and the Bootstrap labels on: setting]

Both TreeView, which you will use during this tutorial, and PAUP*, which you will use later, use the Nexus format. Check the Nexus Format Tree box, and change Bootstrap labels on: to Node, then click the Close button to dismiss the dialog.

Choose Draw N-J Tree from the Trees menu (Figure 1.24). A dialog will appear showing that the files will be saved into the same location as the input file, the .dnd file, and the various alignment files. Click OK to save the tree files as genomes.tre and genomes.treb. Now choose Bootstrap N-J Tree from the Trees menu and again click the OK button on the resulting dialog. The bootstrap operation will take a few seconds. (For the moment, don't worry about what bootstrap is. We will get to it later.) You are done with ClustalX and you can quit the program by choosing Quit under the File menu.

Drawing the Tree Using TreeView

Start TreeView and open the genomes.tre file to see the TreeView tree window that displays the NJ tree of the sequences in the genomes.aln alignment (Figure 1.26).

In Figure 1.26, GOBS and AnoGam are two external nodes. Branches from those two nodes join to create the internal node labeled I1.
(I have labeled some of the nodes for discussion purposes.) FEZ1 is another external node that connects to node I1 at internal node I2. SSO1157 and MJ0296 are two other external nodes that join at the internal node labeled I3, which connects to external node EC2 at I4. I2 and I4 are two internal nodes that join at internal node I5.

Tree Formats: Different Appearances of the Same Tree

At this point, it is worthwhile to point out the advantages of certain styles of representations of phylogenies over others. The terms "cladogram" and "phylogram" are used here as they are in PAUP* (to refer to styles of drawing trees) and are not used in their historical senses within the field of phylogenetics. A cladogram shows only the branching order of nodes. Cladograms can be presented as either slanted (Figure 1.26) or rectangular. Clicking the square cladogram button (arrow, Figure 1.27) displays the more familiar rectangular cladogram (Figure 1.28).

Figures 1.26 and 1.28 show exactly the same information. Notice that in each case the various internal nodes are lined up vertically above one another. In a cladogram, whether slanted or rectangular, the lengths of the branches convey no information whatsoever; only the branching order is displayed. A cladogram thus displays only the topology of a tree.

At this point it is useful to introduce another term, clade. All of the descendants of a common ancestor represented by a node belong to the same clade defined by that node; a clade is also called a monophyletic group. FEZ1, GOBS, and AnoGam all belong to the same clade stemming from node I2.

A phylogram displays both branching order and distance information.
Click the Rectangular Phylogram button (arrow, Figure 1.29) to see the rectangular phylogram view of the same tree (Figure 1.30).

Notice that in Figure 1.30, GOBS and AnoGam are still connected at internal node I1, and node I1 is still connected to internal node I2. Now, however, we see that the branches connecting GOBS and AnoGam to I1 are shorter than the branches connecting SSO1157 and MJ0296 to I3. This simply means that there were fewer sequence changes between the common ancestor I1 and GOBS and AnoGam than there were between the common ancestor I3 and SSO1157 and MJ0296. The branches are drawn so that their lengths are proportional to the evolutionary distance along that branch.

Distance is the number of changes that have taken place along a branch, usually expressed as the number of substitutions per site. A scale near the bottom of Figure 1.30 relates the length of a branch to the distance.

The appearance of the tree can be changed from the Trees menu (Figure 1.31) as well as by using the various buttons.

[Figure 1.31: the TreeView Trees menu, with the items Radial, Slanted cladogram, Rectangular cladogram, Phylogram, Show Internal Edge Labels, Define outgroup, and Root with outgroup]

Bootstrapping a Tree

Although we have a sense of the topology of the tree (the order in which the different sequences diverge), we do not have a sense of how reliable these groupings are. Often it is important to get a statistical estimate of the reliability of some groupings. Bootstrapping is a widely used method for this purpose. (For more details see "Learn More about Estimating the Reliability of Phylogenetic Trees," p. 52.)

Bootstrapping is a method in which one takes a subsample of the sites in an alignment and creates trees based on those subsamples. That process is iterated multiple times (a typical number is 1000, although a minimum of 100 can
be used, and about 2000 replicates have been recommended for 95% reproducibility) and the results are compiled to allow an estimate of the reliability of a particular grouping. Fortunately, we do not have to create a thousand trees ourselves; ClustalX did it for us in the last operation before we quit ClustalX.

The ClustalX dialog that you simply accepted when you saved the bootstrap tree looked like Figure 1.32. The Random number generator seed is simply a starting point for the bootstrap trials. The actual number is not important, except that it is a good idea to change this number if you repeat the bootstrap process for the same data set (or else you are not running independent trials). Neighbor Joining analysis is not resource-intensive, and I recommend that you always use at least 1000 trials for this kind of analysis.

[Figure 1.32: the ClustalX Bootstrap Tree dialog, with fields for the random number generator seed, the number of bootstrap trials, and the name of the Nexus tree file to save]

The tree file created by the bootstrapping operation was named genomes.treb. In TreeView, open the genomes.treb file. The tree (Figure 1.33) looks identical to Figure 1.26 except that the nodes are numbered, the word Trichotomy appears at the node at the extreme left, and the button for showing internal edge labels (arrow) is active and darkened.

LEARN MORE ABOUT
Estimating the Reliability of Phylogenetic Trees

How well can you trust the tree you have just constructed? It depends on what features of the tree are important to you. Most of the time, "reliability" refers to the topology, or branching order, of a tree, not to the lengths of the branches. In essence, reliability is measured as the probability that the members of a given clade are always members of that clade.

An experimental scientist who wants to test the reliability of a conclusion repeats the experiment with independent data. Since the data in this case are the sequences themselves, and sequences are what they are, there seems to be little point to repeating the data unless we just want to test the reliability of the sequencing. We might repeat the alignment, but unless we change the gap parameters we will simply regenerate the same alignment we used before.

Phylogeneticists use a sampling method called bootstrapping that pseudo-repeats data collecting as a method to estimate the reliability of the tree. Consider the following alignment:

[Example alignment of five sequences, ten sites each, with the sites numbered 1234567890]

A random site (i.e., a column) is taken from the alignment and used as the first site in a pseudoalignment. Another random site is taken and used as the second site in the pseudoalignment, and the process is continued until the pseudoalignment contains the same number of sites as the original alignment. Sampling of the original alignment is with replacement, which means that the same site may be placed in the pseudoalignment more than once, or may not appear in the pseudoalignment at all. The pseudoalignment might look like this:

[Example pseudoalignment of the same five sequences, built from sites 4936024951]

A tree is then constructed from the pseudoalignment by the same method, and under the same parameter settings, used to construct the original tree. The original tree is then compared with the new tree. For every clade in the original tree, a score of 1 is assigned if that clade is present in the new tree; a score of 0 is [...]

The numbers indicate the number of times, out of 1000 bootstrap replications, all the members of the clade descended from the indicated node were together. Bootstrap numbers are usually placed at nodes, but you will sometimes find them placed along branches.
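The column-resampling step described in the box "Estimating the Reliability of Phylogenetic Trees" can be sketched in a few lines of Python. This is only an illustration of sampling sites with replacement, not ClustalX's own code; the function name pseudoalignment, the toy sequences, and the seed value are assumptions made for the example, and the tree-building and clade-scoring steps are left out.

```python
import random


def pseudoalignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment is a list of equal-length strings, one per sequence.
    The result has the same number of sequences and sites as the input.
    """
    n_sites = len(alignment[0])
    # Draw n_sites column indices; an index may repeat or never be drawn.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


# Hypothetical three-sequence, ten-site alignment (not the book's example).
toy = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGAAC"]

# A fixed seed makes the run repeatable; change it for independent trials.
rng = random.Random(111)
for seq in pseudoalignment(toy, rng):
    print(seq)
```

Calling the function again with the same generator produces a different pseudoalignment each time, which is what one bootstrap replicate amounts to; a tree would then be built from each replicate and compared with the original tree.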
You can use the menu or the buttons to display the bootstrap tree in rectangular cladogram or rectangular phylogram form.

Placing the Root of a Tree

The root of a tree is a representation of the common ancestor of all of the taxa being considered. Note also the way the tree is presented in Figures 1.26 and 1.33. It would appear that the node labeled Trichotomy represents the common ancestor of all the sequences. This is not necessarily the case! The program has merely chosen one of the sequences arbitrarily to represent the root of the tree. A common mistake would be to use this tree as presented for the final phylogeny. ClustalX creates unrooted NJ trees. The least arbitrary (and therefore most correct) means to present the tree is to use the unrooted phylogram method. To display this style, choose Radial from the Trees menu or click the Unrooted Tree button (arrow, Figure 1.34). The tree will look like Figure 1.35 (the internal nodes have been labeled as in Figures 1.26-1.30).

The unrooted tree is unfamiliar to most molecular biologists and does not, at first glance, look like a tree at all. The point of a radial tree is to avoid implying that we know where the root lies when in fact we do not.

Obviously this set of taxa had some common ancestor; the problem is where we should place the node that represents that ancestor, the root. The sequence alignment alone does not provide sufficient information to make that determination, and it is clearly inappropriate to place such an important piece of information completely arbitrarily. The choice of a root is often made on the basis of other information, which must be justified.

To root a tree simply means to choose a point on the tree as representing the earliest time in the evolutionary history of those sequences.
This can be done either by midpoint rooting, or by selecting any one of the sequences as an outgroup (a designated outsider to the rest of the sequences). (For more details see "Learn More about Rooting Trees," p. 56.) TreeView only provides the option of outgroup rooting. We could arbitrarily designate any of the 13 taxa as the outgroup, but obviously not all rootings are equally likely. Often, a judgment can be made on the basis of what the proteins do and where they come from.

Thus, in our example, I know from external data that the SSO1157 and MJ0296 sequences come from the Archaea, and that they are more distantly related to the remaining sequences than the remaining Eubacterial sequences are to each other. We will therefore designate SSO1157 and MJ0296 as the outgroup by choosing Define Outgroup from the Trees menu (Figure 1.31) to show the dialog in Figure 1.36.

[Figure 1.36: the TreeView Define outgroup dialog, with Ingroup and Outgroup lists]

LEARN MORE ABOUT
Rooting Phylogenetic Trees

Unrooted trees tell us only about phylogenetic relationships; they tell us nothing about the directions of evolution, that is, the order of descent. Rooted trees tell us about the order of descent from the root toward the tips of the tree. While unrooted trees are always more "correct" in that they don't imply knowledge that we do not have, they are considerably less informative. The problem is deciding where to place the root.

Midpoint rooting places the root at the middle of the longest path between the two most distantly related taxa (the heavy lines in Figure 1). Such a placement implies that the rate of evolution has been the same along all branches, something we know is often not the case.

Suppose that two groups descended from a common ancestor, but that one group evolved much faster than the other. Faster evolution means more sequence differences accumulated in one group than in the other over the same period of time.
Thus the true tree might in fact look like Figure 2, but midpoint rooting would produce Figure 3. Unless we are sure that evolutionary rates have been constant across the taxa being considered, midpoint rooting is risky.

[Figures 1-3: (1) Unrooted Tree, (2) True Tree, (3) Midpoint Rooted Tree]

The alternative to midpoint rooting is rooting with an outgroup. An outgroup is a taxon that is more distantly related to each of the ingroup taxa than any of the ingroup taxa are to each other. That definition makes it appear that it should be easy to identify the outgroup among any set of taxa. The problem is that if evolutionary rates are unequal, as in the midpoint example, that definition may break down.

The usual solution is to find a taxon that is distantly related to all of the taxa being considered, then add that taxon to the tree and use it as an outgroup to root the tree. Sometimes finding an outgroup sequence is no more difficult than doing a BLAST search; other times it is infuriatingly difficult. The problem is that a distantly related sequence may be so distantly related that it does not share a common ancestor with the ingroup sequences; i.e., it is not homologous.

Suppose we want to find a sequence to use as an outgroup for the taxa on the tree shown in Figure 1.35. We could turn to the BLAST output from which we picked the sequences to include in the alignment (Figure 1.5) and scroll down near the end of the list, where the E values range from 0.001 up to 0.007. Those proteins should certainly be more distantly related to the set of proteins on the tree than members of the "tree set" are to each other. One of these potential outgroups is a putative Zn-dependent hydrolase with the GI number 19552860 and an E-score of 0.002 when using the L1 sequence as a query. The question is: Is that protein really a homolog of the other proteins on the tree?
A BLAST search finds sequences and parts of sequences with some homology to the query sequence. It is also possible to do a BLAST alignment of two sequences to evaluate their homology. Go to the BLAST home page, http://www.ncbi.nlm.nih.gov/BLAST/. In the Special section, choose Align two sequences (bl2seq). In the resulting form, change to BLASTP and enter the accession numbers of the two proteins to be compared. If we compare the putative hydrolase with the protein designated AnoGam on the tree, we find that the hydrolase aligns over about 60% of the length of the AnoGam query sequence and has an E-score of 0.18; it thus appears that the putative Zn-dependent hydrolase is a legitimate outgroup for the set of proteins on the tree. Knowing that, we can download the sequence of this protein, add it to the alignment, and reconstruct the tree using the putative Zn-dependent hydrolase as the outgroup.

When we cannot identify a distantly related sequence that exhibits homology over most of the length of the sequence, we are forced to turn to other means of identifying an outgroup. Typically the "other means" is the classically accepted phylogeny of the organisms from which the sequences are obtained. In the example used in this Tutorial, sequences SSO11157 and MJ0296 are from Archaea and all the other sequences are from Eubacteria, so we used SSO11157 and MJ0296 as the outgroup. Perhaps more typically, if all of our sequences are from mammals, we might look for the sequence of a homologous protein from, say, birds. Again, however, we need to be sure (by doing a pairwise BLAST) that the sequences have been suffi[...]

[...] ">>" button to assign them. If you make a mistake, use the "<<" button to move a name back to the list of ingroup sequences.

Now choose Root with Outgroup from the Trees menu (Figure 1.37), and click the Slanted Cladogram button to see the rooted tree (Figure 1.38).
[Figure 1.37: the TreeView Trees menu]

Compare Figure 1.38 with Figure 1.26, the unrooted tree. The underlying data are unchanged, but rooting the tree now indicates directional evolution. Clicking the Rectangular Phylogram and the Show Internal Edge Labels buttons displays the rooted NJ tree with branch lengths scaled to distances and the bootstrap values indicated at the nodes (Figure 1.39).

Printing and Saving the Tree

Having done all this work to get the tree into just the format you want, now you will certainly want to save that formatted tree. You also will probably want to print it. You may have noticed that when you re-size the tree window, the tree stretches to fill that window. To see how your tree will actually look on the printed page, choose Print Preview from the File menu or click the Preview button (Figure 1.40).

The resulting window displays the tree as it will be printed. The Print button at the top left of the screen (Figure 1.41) allows you to print the tree; the Copy button copies the image to the clipboard so that you can paste it into your favorite drawing program (Canvas, Adobe Illustrator, CorelDraw, etc.); and the Picture button saves the image in a format that can be opened by a drawing program.

[Figure 1.41: the Print Preview buttons Copy, Picture, Print, and Close]

You can also Print and Print Preview from the File menu and can Copy from the Edit menu. Alternatively, you can choose Save Graphic from the File menu. In either case, TreeView for Macintosh will save the drawing in PICT format, while TreeView for Windows will save it in Windows Metafile format. Drawing programs for those platforms will open files in those formats.

Summary

At this point you should be able to

* Use a BLAST search to identify a set of sequences that are homologous to a sequence of interest.
* Select from that set the sequences that will be used to create a phylogeny.
* Download those sequences in both FASTA and GenPept formats.
* Use ClustalX to create an alignment from the FASTA file of sequences.
* Save the results of that alignment in any of several formats.
* Use ClustalX to create a Neighbor Joining tree from the alignment and save the resulting tree file.
* Use ClustalX to bootstrap that tree and save the resulting file.
* Use TreeView to draw the tree.
* Use TreeView to modify the appearance of the tree, to print the tree, and to save the tree in a format that drawing programs can use.

It would now be valuable to pick a sequence of interest to you and to go through the same steps to put that sequence onto a phylogenetic tree. After that, move on to Chapter 2.

Chapter 2
Basic Elements in Creating and Presenting Trees

Selecting Homologs: What Sequences Can Be Put on a Single Tree?

Homology must be distinguished from similarity. Homology means that two taxa or sequences are descended from a common ancestor and implies that, in an alignment, identical residues at a site are identical by descent. Similarity merely reflects the proportion of sites that are identical. Two unrelated sequences can be aligned and some sites will be identical, but that identity is not the result of descent from a common ancestor. Obviously, no matter how similar they may be, it is meaningless to put two unrelated sequences onto the same tree because the purpose of the tree is to show a process of descent from a common ancestor.

Of course, in one sense, all sequences may be descended from a common ancestral sequence. However, as genes or proteins evolve, they diverge from each other to the point where two genes may share so little sequence in common that they resemble each other no more than any two randomly chosen sequences. At that point their sequence homology has disappeared and those two sequences should never appear together on the same sequence-based tree. Trees that include non-homologous sequences are published surprisingly often.
It is not uncommon for a set of enzymes that share similar catalytic properties and mechanisms to be given a common designation and subsequently placed onto the same tree, regardless of actual sequence homology.

Suppose you have just cloned and sequenced a dibibliacmuctinase (DBM) gene from the Uncommon Vole. You are familiar with a variety of DBM genes from organisms ranging from flatworms to humans and you want to see where your DBM gene fits in. A BLAST search using your DBM sequence as a query turns up many DBM genes, but the list fails to include a number of well-known DBM genes. The obvious thing to do is to use the Entrez browser at NCBI (http://www.ncbi.nlm.nih.gov/Entrez/) to individually download those well-known sequences and add them to the FASTA file that BLAST created for you.

This is exactly the point at which you are likely to get into trouble. By including those well-known DBM sequences together with your DBM sequence and the homologs that BLAST identified, you will in all likelihood be including non-homologous sequences on the resulting tree. If all the sequences you added are related to each other then the tree will probably show two distinct groups (your sequence plus its homologs, and the sequences you added) connected by a very long branch. The tree will be misleading because it will imply an ancestral relationship that is false. More than that, the presence of two unrelated groups of sequences in the alignment will probably reduce the quality of the alignments within the groups, thus distorting the trees of the separate groups. In particular, branch lengths are likely to be distorted.

As long as you are choosing sequences from a list generated by a BLAST search you are pretty safe, but what can you do if you need to add other sequences (especially unpublished sequences) to your list? How do you know if those sequences are homologous to the sequences already on your list?
There is no single, well-accepted criterion for determining whether two sequences are or are not homologous, but a reasonable criterion of non-homology is that a pairwise BLAST alignment of the two sequences fails to find a significant alignment of the two sequences. The first step is to construct a phylogeny using only the sequences that you identified using the BLAST search. Next, pick out representative sequences from each of the major clades on that tree. Finally, under the Pairwise Blast section of the BLAST homepage (Figure 1.1) click the Blast 2 Sequences link to bring up the dialog in Figure 2.1. If you are testing protein sequences, change the Program designation from blastn to blastp (arrow, Figure 2.1). Paste the sequence in question into the upper box, paste one of your representative sequences into the lower box, and click the Align button.

Obviously, if the search reports that No significant similarity was found (Figure 2.2), you will not want to include the sequence in question on your list. If you want to be more conservative you may decide to impose a more stringent criterion (for instance, an E value of 10-°) for inclusion on the list.

Figure 2.3A shows an example of a false tree that was created when all of the known aminoglycoside-6'-acetyltransferases were included on the same tree; Figure 2.3B shows the correct trees for each of the major clades separately. Notice the very long branches connecting the three clades in Figure 2.3A. Such long branches may be a "red flag," and if you see them on a tree you have created it is at least worth checking to make sure members of the different clades are in fact homologous to one another.

A related issue arises when dealing with multidomain proteins. The bacterial PTS sugar transport proteins, which have four separate functional domains, provide a good example of such a problem.
In some cases, the four domains are fused into a single protein; in others two or three domains are fused, with the remaining domains existing as separate proteins. When multiple domains are present in a single protein, they may be arranged in different orders in different proteins. In that situation, alignments of the complete proteins or the genes encoding them are meaningless. It is necessary to treat the different domains as though they are separate proteins/genes. Cut the domains apart at domain boundaries as best you can and create separate trees for each domain. Don't worry too much about the precise boundary positions; a few bases one way or another is unlikely to have a major effect on either the alignment or the tree.

[Figure 2.1: the Blast 2 Sequences form]

[Figure 2.2: Blast 2 Sequences results reporting that No significant similarity was found]

Fine-Tuning Alignments

To follow this discussion download the CelF sequence from the "Phylogenetic Trees Made Easy" website. Load CelF.aln into ClustalX.

In Chapter 1, I stressed that the quality of a tree can be no better than the quality of the sequence alignment that underlies that tree. ClustalX offers quite a few tools to help refine and improve alignments. The easiest of these tools to use is the histogram displayed below the alignment. The height of each bar indicates the similarity of the characters at that site. In the CelF alignment, the 120-170 residue region looks pretty good, whereas the histogram in the 230-280 region is pretty flat (Figure 2.4).
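To make the idea of a per-site similarity histogram concrete, here is a minimal Python sketch that scores each column of an alignment by the fraction of sequences sharing its most common character. This is a deliberately simplified stand-in, not the scoring that ClustalX actually uses; the function name column_conservation and the toy alignment are assumptions made for the example, and gap characters are counted like any other character.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each column as the fraction of sequences that share
    the most common character in that column (1.0 = fully conserved)."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        # Count of the single most frequent character in this column.
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / n_seqs)
    return scores


# Hypothetical four-sequence protein alignment; "-" marks a gap.
toy = ["MKTAYIA", "MKSAYLA", "MRTA-IA", "MKTGYVA"]

# Print a crude sideways histogram, one bar per alignment column.
for position, score in enumerate(column_conservation(toy), start=1):
    print(f"{position:2d} {'#' * round(score * 10)}")
```

Long bars correspond to well-conserved sites and short bars to variable ones, which is the same visual cue the histogram below the alignment is meant to give.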
[Figure 2.4: the CelF alignment in ClustalX, with the quality histogram below the alignment]

ClustalX provides an entire menu, Quality, to deal with determining the local quality of the alignment (Figure 2.5A). It also provides an excellent Help menu (Figure 2.5B) with tips on using various parts of the ClustalX program, including the Quality menu. When in doubt, turn to the Help menu.

[Figure 2.5: (A) the ClustalX Quality menu; (B) the ClustalX Help menu]

Selecting Show Low-Scoring Segments highlights the residues that are causing the low scores. There will always be some highlighted residues as the result of divergence of the sequences during evolution, but strong clustering of highlighted residues suggests misalignment (Figure 2.6).

Review the discussion of gap penalties in Chapter 1 (pp. 24, 29-30, 34-37). When penalties are too high, similar residues will not align, resulting in poor quality. When they are too low, there will be too many gaps, also resulting in poor quality. One way to deal with the problem is to realign problematic residue ranges using different gap penalties while leaving the bulk of the alignment alone. ClustalX provides the means to do that.

Under the Alignment menu in the Alignment Parameters options, choose Reset All Gaps before Alignment (Figure 2.7).
Next, change the gap penalties using the Pairwise Alignment Parameters and Multiple Alignment Parameters menu choices. Start within a fairly well-conserved region at the left flank of the low-scoring region (the left anchor), then select the range of residues you want to manipulate by clicking and dragging in the alignment pane below the residues until you are within a fairly well-conserved region at the right flank of the low-scoring region (the right anchor). Finally, under the Alignment menu choose Realign Selected Residue Range.

[Figure 2.7: the ClustalX Alignment menu]

As you vary the gap penalties, note the effects on both the low-scoring region and on the anchor regions. Gap penalties that disrupt the anchor regions are to be avoided; those that improve the low-scoring region while maintaining the conserved flanking regions are helping. There is no firm guide to modifying gap penalties, but if the low-scoring region seems to have few gaps and many mismatching residues, it makes sense to decrease gap penalties; if it seems to have a lot of gaps, try increasing the penalties.

It must be understood that all of this manipulation is an attempt to reflect real events in the histories of those regions. It may well be the case that those are simply regions that have diverged a lot, and no amount of valid manipulation is going to change that.

It is very difficult to hold an image of the alignment in mind for comparison against the changed alignment. For that reason, it is useful to print the original alignment before doing any manipulations.
Because each manipulation will result in overwriting existing alignment files, you should move the original output files to a new folder (directory) before doing any manipulations.

Better than just printing the alignment file is printing the alignment as it is displayed in the alignment pane, complete with shaded residues and the quality histogram. See Appendix II for instructions on printing the alignment in that format.

Major Methods for Creating Trees

Which Method Should You Use?

You may already be aware that there are a variety of methods currently being used to construct trees from sequence data, and you may even be aware that the field of phylogenetics is quite contentious with respect to which method is best. If you ask an evolutionary colleague which method to use, you are likely to get an answer such as, “You must use Parsimony” (or Neighbor Joining or Maximum Likelihood, etc., depending on which colleague you ask). “Other methods are just shoddy or worse.” Much of the opinion amounts to religious conviction, and you need not worry about it. You could just stick with Neighbor Joining, but the other methods offer some advantages and some disadvantages when compared with Neighbor Joining. It is important to understand several methods, to make your choices based on the situation at hand, and not limit yourself simply because NJ was used in the Chapter 1 tutorial.

There are two primary approaches to tree construction: algorithmic and tree-searching. The algorithmic approach uses an algorithm to construct a tree from the data. The tree-searching method constructs many trees, then uses some criterion to decide which is the best tree or best set of trees (see “Learn More about Tree-Searching Methods,” p. 70).

The algorithmic approach has two advantages: It is fast, and it yields only a single tree from any given dataset.
The two algorithmic methods in current use are Neighbor Joining, with which you are already familiar, and UPGMA (which stands for Unweighted Pair-Group Method with Arithmetic Mean). NJ has almost completely replaced UPGMA in the current literature. Both NJ and UPGMA are distance methods.

All the other methods in current use are tree-searching methods. They generally are slower, and some will produce several equally good trees. At first it might seem that the algorithmic methods are the obvious choice because they are fast and they result in a single tree that you can publish and get on with other things. At one time the speed issue was important, especially when a dataset included many sequences. Today’s fast, powerful desktop computers have greatly reduced the speed problem, and for most datasets the speed advantage of algorithmic methods is negligible. Although it may appear advantageous to have only a single tree to think about, that comfort can be quite misleading because it gives the impression that the tree you see is the right tree.

It is essential to understand that the “right tree” doesn’t exist. We are trying to deduce the order in which existing taxa (sequences) diverged from a hypothetical common ancestor and the amount of change along the branches between the diverging events. It is extremely unlikely that those deductions will be correct in every detail, so the tree we see will not be an accurate depiction of historical events. Even if we are only concerned with tree topology, we can never be assured that the topology of the tree accurately reflects the historical branching order.

The best we can hope for is a tree that pretty well reflects what happened in the past, while realizing that we don’t know what happened in the past, so we can never be entirely sure how accurate the tree is.
Tree-searching methods may yield one tree or several, but all methods implicitly acknowledge that the trees produced are only a subset of the possible trees that are consistent with the data.

Distance versus Character-Based Methods

I have already mentioned that NJ and UPGMA are distance methods. Distance methods convert the aligned sequences into a distance matrix of pairwise differences (distances) between the sequences (see “Learn More about Distance Methods,” p. 74). The matrix is much like the tables of “% homology” that often appear when only a few sequences are being compared. Distance methods use that matrix as the data from which branching order and branch lengths are computed. Character-based methods, including Parsimony, Maximum Likelihood, and Bayesian methods, all use the multiple alignment directly by comparing characters within each column (each site) in the alignment.

Parsimony looks for the tree or trees with the minimum number of changes (see “Learn More about Parsimony,” p. 94). It is often the case that there are several trees, typically differing only slightly, that are consistent with the same number of events and that are therefore equally parsimonious.

Maximum Likelihood looks for the tree that, under some model of evolution, maximizes the likelihood of observing the data (see “Learn More about Maximum Likelihood,” p. 104). ML almost always recovers a single tree, but programs such as PAUP* can be instructed to save multiple trees. An advantage of the ML method is that the likelihood of the resulting tree is known. A disadvantage is that ML is considerably slower than either Parsimony or NJ, and it is not difficult to exceed the capacity of even the most up-to-date desktop computer.

Bayesian analysis is a recent variant of Maximum Likelihood.
Instead of seeking the tree that maximizes the likelihood of observing the data, it seeks those trees with the greatest likelihoods given the data (see “Learn More about Bayesian Analysis,” p. 120). Instead of producing a single tree, Bayesian analysis produces a set of trees of roughly equal likelihoods. The results of a Bayesian analysis are easy to interpret because the frequency of a given clade in that set of trees is virtually identical to the probability of that clade, so no bootstrapping is necessary to assess the confidence in the structure of the tree.

LEARN MORE ABOUT Tree-Searching Methods

An exhaustive search is carried out by finding each of the possible trees by a branch-addition algorithm. The first three taxa are connected to form the only possible three-taxon tree, one that contains three branches (tree A in Figure 1). The fourth taxon is added by adding a new branch to the middle of each of the existing branches to generate the three possible four-taxon trees (trees B1, B2, and B3). Adding the fifth taxon requires adding a new branch to the middle of each of the five branches in each of the four-taxon trees to generate 15 trees. This is accomplished by adding each of the five possible branches to tree B1 to construct trees C11-C15, then backing down to tree B2 and adding each of the five branches to make trees C21-C25, then backing down to tree B3 and again adding the five possible branches to make trees C31-C35. If there were six taxa, starting with tree C11 and going through tree C35, seven branches would be added to each tree to make all of the possible trees at the D level.

There is an alternative, the branch-and-bound algorithm, that also guarantees finding the best tree but does not require searching every tree. A random tree containing all taxa is generated and evaluated. Then, starting at A, the three-taxon tree in Figure 1, the search moves out toward the tips. It does not attempt to construct all possible trees at each level of the search; instead it constructs a single tree, say B1, and evaluates it. If the criterion is minimum evolution and the current tree has a better (lower) score than the random starting tree, the search moves on to the next level by adding another branch. If the current tree has a worse score than the random tree, then it and all other trees that can be derived from it by adding more branches will have worse scores. The branch-and-bound search can thus discard all of its descendants without evaluating them. When that occurs, the search backs up one level, adds a branch somewhere else, and again starts searching toward the tip.

If the search gets all the way to the tip and finds a score that is better than that of the random tree, that score now becomes the score against which all other scores are judged. As in the exhaustive search, the entire tree is covered by eventually backing down to the root level and starting out along the path that begins with B2, and then along the path that begins with tree B3.

When the number of trees is large and evaluating each tree would be too slow to permit using the branch-and-bound algorithm, a heuristic strategy is used. A heuristic approach is essentially a hill-climbing algorithm in which an initial tree is selected, then rearrangements are sought that improve the tree.

There are too many heuristic algorithms to describe them in detail, but one common approach (with many variants) is the stepwise addition method. It is similar to branch-and-bound in that it starts with a three-taxon tree, then adds branches to make each of the three possible four-taxon trees. The difference is that at this point each of the trees is evaluated and the one with the best score is selected to make the five possible five-taxon trees that can be derived from it. At each level, only the best of the trees at that level is used to add the next taxon.

It would be lovely if there were some objective way to select the “best” method for constructing evolutionary trees, but no such way exists. No method is ideal for all performance criteria. Some of the criteria that have been considered are efficiency, robustness, computational speed, and discriminating ability. Efficiency is a measure of how quickly the method converges on the correct tree as the amount of data (lengths of the sequences) increases; robustness is a measure of how well the method can tolerate deviations from its assumptions and still recover the correct tree; computational speed is obvious; and discriminating ability is how well the method guarantees recovering the correct tree. There are often tradeoffs among these criteria in that methods that increase one measure decrease another (Hillis et al. 1996).

One might well ask: if we don’t know which tree is the true tree, how can we measure how well a method recovers that tree? Usually, with real data, we cannot. The exception is some experimental evolutionary systems in which all of the descendants of a single clonal organism are available and the true tree can be known. Attempts to measure the relative effectiveness of methods are usually based on simulations in which a computer generates descendants of some starting sequence according to some evolutionary model. In the end, a set of taxon sequences is generated, but all of the intermediate steps are known, so the “true” tree is known.
Various methods are then compared to see which best recovers the true tree and under what conditions they do so. The problem is that the methods that work best are those that incorporate the same assumptions that were used to generate the tree, so it is very difficult to extrapolate simulation studies of method effectiveness to estimate effectiveness with real data.

Choosing among the methods is often just a pragmatic matter: If your computer takes longer to calculate the tree than you are willing to take, then use a faster method. My own rule of thumb is that I am willing to use a method that will run overnight while I am home. Therefore, if it takes longer than about 14 hours, I will probably choose another method.

If speed is not an issue, I prefer a Bayesian analysis for several reasons. First, I can easily evaluate the reliability of the tree without bootstrapping, which is often impractical with ML. Second, I am uncomfortable with seeing only the single tree that NJ and ML produce and having no idea how much it differs from other trees that might be as good. Third, other methods do not allow me to have branch lengths on consensus or bootstrap trees, whereas Bayesian analysis as implemented by MrBayes does that. I emphasize that these are my reasons for a preference, not general reasons. They are personal and should not be interpreted as recommendations.

Because time is often the basis for deciding which method to use, I have applied all four methods to the same datasets in Table 2.1. (Don’t worry if you don’t understand the table legend yet. You will after reading the section on that method.)

Table 2.1.
Comparison of times required for the four major phylogenetic methods*

Number of    Neighbor                    Maximum
sequences    Joining      Parsimony      Likelihood       Bayesian
10           <0.01 sec    …              …                55 min 52 sec
20           <0.01 sec    …              … min 28 sec     … hr 32 min
30           <0.01 sec    …              …                …
40           <0.01 sec    …              …                2 hr 37 min
50           <0.01 sec    …              …                …

[Figure 2.10]

[Figure 2.11]

Oops! What do you do if you forgot to choose Nexus (or PHYLIP) as the output format before you aligned the sequences with ClustalX? You do not have to redo the entire alignment. Start ClustalX, pull down the Alignment menu, choose Output Format Options, and select the output format that you forgot. Load the .aln file into ClustalX, then use the mouse to select a column of characters in the alignment pane (preferably a column of identical characters) by clicking above that column. Pull down the Alignment menu and choose Realign Selected Residue Range. ClustalX will now write the alignment in the format you forgot to specify earlier.

Creating Neighbor-Joining Trees Using PAUP*

PAUP* for Windows/Unix: pages 179-183
PHYLIP: pages 188-190

Pull down the Analysis menu and choose Distance. Using the same Analysis menu, choose Neighbor Joining/UPGMA (Figure 2.12). On choosing Neighbor Joining/UPGMA, you will see the dialog box shown in Figure 2.13.
[Figure 2.12: the Analysis menu]

Be sure the Neighbor Joining button is selected and that the Randomly, initial seed button (arrow) is selected. The initial seed is a number that is used as a seed to generate a random number that is used to break ties. The initial seed is usually based on the time since the computer was started. The actual number is not important, except that it is a good idea to change this number if you repeat the process for the same dataset; otherwise you are not running independent trials. Ordinarily you can accept the computer-generated number and click OK.

The PAUP* Main Display window will then show something like Figure 2.14. The Main Display window, incidentally, will keep a record of everything you do. You can choose to print this record in the end if you like. To do this, pull down the File menu and choose Print Display Buffer.

[Figure 2.13: the Options for Clustering Methods dialog]

[Figure 2.14: the Main Display window after the Neighbor-Joining run]

Saving the NJ Tree. It is always a good idea to save a tree as soon as it is created. Choose Save Trees to File... from the Trees menu (Figure 2.15). Remember, you will be saving the tree, not any particular appearance of the tree.
[Figure 2.15: the Trees menu]

The resulting dialog (Figure 2.16) allows you to assign a name to the tree file. Before you name the file and click the Save button, you need to deal with some options and with the format in which the tree will be saved.

[Figure 2.16: the Save Trees dialog]

You will always want to save the branch lengths with the tree. Unfortunately, the default is to save tree files without branch lengths. To include branch lengths, click the Options... button (Figure 2.16) to bring up the dialog in Figure 2.17. Tick the Include branch lengths box and click the OK button to dismiss the Options dialog.

[Figure 2.17: the Options dialog]

[Figure 2.18: the Format menu, whose choices include NEXUS (no translation table), FREQPARS, PHYLIP, and Hennig86]

Next, pull down the Format menu seen in Figure 2.16 to display the list of formats (Figure 2.18) in which you can save the tree file.
If you choose the default Nexus format, the file you save will be a text file that looks like this:

    #NEXUS

    Begin trees;  [Treefile saved Tuesday, August 12, 2003  5:27 PM]
    [!
    >Data file = smallData.NJ.tre
    >Neighbor-joining search settings:
    >  Ties (if encountered) will be broken randomly; initial seed = 634602657
    >  Distance measure = uncorrected ("p")
    >  (Tree is unrooted)
    ]
    Translate
        1 …, 2 GOB1, 3 TRINB, 4 Fez, 5 mbli, 6 mbisit, 7 caul, 8 Lic, 9 Lia, 10 La
    tree PAUP_1 = …

    [!
    >Data file = smallData.NJ.tre
    >Neighbor-joining search settings:
    >  Ties (if encountered) will be broken systematically
    >  Distance measure = uncorrected ("p")
    >  (Tree is unrooted)
    ]
    [&U] (((((aaes0:0.201175,((mb1511:0.002257,L4e:0.008450):0.051525,(ad:0.023972,L1:0.039591):0.027702):0.031829):0.161984,THIN:0.257175):0.005792,(mb11:0.002373,CAUL:0.002241):0.245462):0.095017,PEZ2:0.245693):0.271364,GOB1:0);
    End;

Most programs that use Nexus files can use this format. Choose whichever format you prefer and click the Save button to save the tree file.

Please read the section on “Presenting and Printing Your Trees” (pp. 135-147) for important information on opening tree files within PAUP*.

Printing the Tree. You now have a Neighbor-Joining tree, which you can view or print by pulling down the Trees menu and choosing Print Trees (Figure 2.19). On choosing Print Trees, you will see the dialog box in Figure 2.20.

[Figure 2.19: the Trees menu]

[Figure 2.20: the Print Trees dialog]

PAUP* Windows/Unix and PHYLIP users will have to use TreeView as described in Chapter 1 and in the Chapter 2 section on “Presenting and Printing Your Trees” to display and print trees.
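The parenthesized tree description inside a Nexus tree file, like the one shown earlier in this section, is written in the notation commonly called Newick: each pair of parentheses encloses a group of descendants, and the number after each colon is a branch length. The following is a minimal Python sketch of how such a description can be read; it is not how PAUP* or TreeView reads the file, the function names parse_newick and leaves are hypothetical, and the example tree with taxa A to E is made up for illustration. It assumes simple, unquoted taxon names and no bracketed comments.

```python
import re

PUNCT = set("(),;:")

def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples.

    Assumes unquoted names and no [bracketed] comments.
    """
    # Split into punctuation tokens and runs of other characters (names, numbers).
    tokens = re.findall(r"[(),;:]|[^(),;:\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                          # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                      # consume ","
                children.append(node())
            pos += 1                          # consume ")"
        name = None
        if pos < len(tokens) and tokens[pos] not in PUNCT:
            name = tokens[pos]                # taxon (or internal node) label
            pos += 1
        length = None
        if pos < len(tokens) and tokens[pos] == ":":
            length = float(tokens[pos + 1])   # branch length follows the colon
            pos += 2
        return (name, length, children)

    return node()

def leaves(tree):
    """List (taxon, branch length) pairs for the tips, left to right."""
    name, length, children = tree
    if not children:
        return [(name, length)]
    tips = []
    for child in children:
        tips.extend(leaves(child))
    return tips

# Hypothetical example tree (made-up taxa and branch lengths).
example = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05,E:0.5);"
tree = parse_newick(example)
for taxon, length in leaves(tree):
    print(taxon, length)
```

Reading the tips back out this way is a quick check that the branch lengths were actually saved with the tree: if they were not, every length in the output is None.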
You can now use the Plot type pulldown menu (arrow, Figure 2.20) to see the different available options to view the tree (Figure 2.21).

[Figure 2.21: the Plot type menu]

The choices are simply different ways to visually represent the same information and correspond to the different tree formats in TreeView discussed in Chapter 1. Cladograms show only branching order, and phylograms show branch lengths as well. For the moment, leave the Slanted Cladogram choice selected and click the Preview button to see a slanted cladogram tree (Figure 2.22).

[Figure 2.22]

The buttons at the left of the tree (Figure 2.23) allow you to Copy the tree to the clipboard so that you can paste it into a drawing program, to Save the tree as a PICT file that most Macintosh drawing programs can open, or to dismiss the tree (Done). If there is more than one tree in memory, the Next Page and Previous Page buttons allow you to scroll through those trees.

[Figure 2.23]

If you want to root the tree, click the Rooting button in the Print dialog (Figure 2.20) to see the rooting window (Figure 2.24).

[Figure 2.24: the rooting window]

PAUP* allows you either to root the tree at its midpoint or to use an outgroup as you did in Chapter 1 using TreeView. I suggest that you choose Outgroup rooting, Make ingroup monophyletic, and Make outgroup a monophyletic sister group to ingroup, as shown in Figure 2.24. Click the Define Outgroup button to bring up the dialog in Figure 2.25.
The outgroup selection dialog in PAUP* works exactly as it does in TreeView (Chapter 1, p. 55). Just double-click the names of taxa in the Ingroup list that you want to add to the Outgroup list.

[Figure 2.25: the outgroup selection dialog]

The PAUP* tree-drawing interface gives you more flexibility than does TreeView. Not only can you display trees with branches drawn proportionally to their lengths (phylogram formats), you can print the branch lengths next to the branches in any of the formats. To do so, tick the Show Branch Lengths box in the print dialog (Figure 2.26). Doing this allows you to modify the fonts for both the taxon labels and the branch length labels. I like Helvetica Bold for the taxon labels, but I prefer Palatino for the branch length labels. You can also determine the width of the lines used to draw the tree. I prefer slightly heavier 1.5 point lines. Finally, if you tick the Include Title box (Figure 2.26) you can define a title that will be printed on the tree. To define that title, click the Set button.

[Figure 2.26]

It is always wise to click the Preview button to see that the tree looks the way you want it to. When the appearance is satisfactory (Figure 2.27), click the Print button to print the tree.

[Figure 2.27]

Figure 2.28 shows the NJ tree from the largeData set in the phylogram format. Note that the scale for branch lengths is substitutions per site.

[Figure 2.28]

For more about displaying and printing trees, including using TreeView, see “Presenting and Printing Your Trees” later in Chapter 2 (pp. 135-147).

Bootstrapping the NJ Tree. It is always a good idea to estimate the confidence you should have in your tree. PAUP* makes it easy to obtain bootstrap estimates of that confidence.
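As a reminder of what a bootstrap replicate is, each replicate is a pseudo-alignment of the same length as the original, built by sampling alignment columns (characters) at random with replacement; a tree is then made from each replicate. The following is a minimal Python sketch of the resampling step only. It is an illustration under stated assumptions, not PAUP*'s implementation: the function name bootstrap_replicate and the toy sequences are hypothetical, and tree building is left out.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Return one pseudo-alignment built by sampling columns with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices, with replacement, from the original columns.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}

# Hypothetical toy alignment (made-up sequences, not the book's data).
toy = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTTCGTAC",
    "taxonC": "ACGAACGTTC",
}

rng = random.Random(12345)   # fixed seed so that the run can be repeated
replicates = [bootstrap_replicate(toy, rng) for _ in range(3)]
for rep in replicates:
    for name, seq in rep.items():
        print(name, seq)
    print()
```

Fixing the seed reproduces the same replicates every time, which is why changing the random number seed matters when you want independent runs on the same dataset.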
Refresh your memory about bootstrapping by reading Chapter 1 (pp. 50-53).

From the Analysis menu choose Bootstrap/Jackknife... (Figure 2.29). Be sure that the analysis method selected (Parsimony, Likelihood, or Distance) is the same as was used to create the tree.

[Figure 2.29: the Analysis menu]

In the resulting dialog (Figure 2.30) be sure Bootstrap is selected, enter the desired number of replicates in the box indicated by the arrow, and click Continue. In the resulting dialog, click Search.

[Figure 2.30: the Bootstrap/Jackknife dialog]

To display and print the consensus tree, choose Print Bootstrap Consensus... from the Trees menu (Figure 2.31).
[Figure 2.31: the Trees menu]

The bootstrap tree for the smallData is not very interesting; almost all clades have 100% confidence. The bootstrap tree for the largeData (Figure 2.32), however, shows confidences ranging from 56% (not very good) all the way to 100%.

Creating Parsimony Trees Using PAUP*

PAUP* for Windows/Unix: pages 179-184
PHYLIP: pages 187-190

You can use the same Nexus file that you used for the NJ tree to make a parsimony tree. Pull down the Analysis menu, be sure that Parsimony is checked, then choose Heuristic Search from the same menu. In the resulting dialog, just leave everything in its default state and click the Search button. A status window (Figure 2.33) will show you how the search is progressing. When the search is complete, it will show a Close button and will indicate the number of trees that were created.

The trees are now in memory. Save them to a file just as you did for the NJ tree above. You can preview and print the trees just as you did the NJ tree by selecting Print Trees from the Trees menu.

[Figure 2.32]

[Figure 2.33: the Heuristic Search status window]

LEARN MORE ABOUT Parsimony

Parsimony is based on the assumption that the most likely tree is the one that requires the fewest number of changes to explain the data in the alignment. The basic premise of parsimony is that taxa sharing a common characteristic do so because they inherited that characteristic from a common ancestor.
When conflicts with that assumption occur (and they often do), they are explained by reversal (a characteristic changed but then reverted back to its original state), convergence (unrelated taxa evolved the same characteristic independently), or parallelism (different taxa may have similar embryological mechanisms that predispose a characteristic to develop in a certain way). These explanations are gathered together under the term homoplasy. Homoplasies are regarded as extra steps or hypotheses that are required to explain the data. More formally, parsimony assumes that a character is more likely to be common to two taxa because it was inherited from a common ancestor than it is to be common because of homoplasy.

Parsimony operates by selecting the tree or trees that minimize the number of evolutionary steps, including homoplasies, required to explain the data. Parsimony, or minimum change, is the criterion for choosing the best tree.

For protein or nucleotide sequences, the data are the aligned sequences. Each site in the alignment is a character, and each character can have different states in different taxa. Not all characters are useful in constructing a parsimony tree. Invariant characters, those that have the same state in all taxa, are obviously useless and are ignored by the method. Also ignored are characters in which a state occurs in only one taxon.

An algorithm is used to determine the minimum number of steps necessary for any given tree (i.e., any given branching order) to be consistent with the data. That number is the score for the tree, and the tree or trees with the lowest scores are the most parsimonious trees.

The algorithm is used to evaluate a possible tree at each informative site. Consider a set of six taxa, conveniently named 1-6. At some site (character) in the alignment, the states of that character are: … 5 = G, 6 = C

There are 105 possible unrooted trees of six taxa.
We will pick the unrooted tree in Figure 1 as our example, but all will be evaluated by the computer.

Because two trees were saved, the Print Trees dialog will list two trees instead of one (Figure 2.34).

[Figure 2.34: the Print Trees dialog]

You can select individual trees to preview or print, or you can select all of the trees at once. Just as was the case for the NJ tree, you can display branch lengths, and you can root the tree with an outgroup. Figure 2.36 shows the trees rooted with GOB1 as the outgroup.

To display both trees on the same page, click the Trees per Page button (Figure 2.34) to reveal the dialog in Figure 2.35 and click to select the two boxes that would position the trees above each other.

[Figure 2.35: the Trees per Page dialog]

Please read the section on “Presenting and Printing Your Trees” (pp. 135-147) for important information on opening tree files within PAUP*.

Notice that the branch lengths in Figure 2.36 are not displayed as decimal fractions but as integers. For Parsimony, the default is to have the branches indicate the number of changes along that branch.

How can you choose between the two equally parsimonious trees in Figure 2.36? In one sense it doesn’t matter; each of the trees is equally parsimonious and therefore as good as the other tree, so you can pick a tree at random.
Another possibility is to compare the Parsimony trees with the NJ tree (Figure 2.27) and pick the Parsimony tree that most resembles the NJ tree. In either case you should indicate, either in the text or in a figure legend, that you are showing only one of n equally parsimonious trees.

Creating a Consensus Tree Using PAUP*

PAUP* for Windows/Unix: pages 179-185
PHYLIP: pages 187-190

Another option is to present a consensus tree. From the Trees menu select Compute Consensus. In the resulting dialog (Figure 2.37), select all of the trees you want to include in the consensus (usually all the trees). I like to use the 50% majority rule to compute the consensus, but you can use either strict or semi-strict rules if you prefer. In fact, PAUP* will calculate a consensus tree for each of the options that you check.

[Figure 2.37: the Compute Consensus dialog]

To view the consensus tree, select Print Consensus Tree(s) from the Trees menu. The print dialog will by now be familiar, and you can decide on the plot type as you did before. For the consensus tree, the plot type is always a cladogram and your only choice is the shape of that cladogram. This is because the branch lengths are not determined.

You can preview the consensus tree as you would any other. Figure 2.38 shows the consensus tree derived from the two trees in Figure 2.36. The numbers are not branch lengths; instead they show the percentage of trees in which the taxa above the indicated node are together.

Notice that the consensus tree has a polytomy: three branches arising from a single node. The polytomy is more obvious when the consensus tree is displayed as a slanted cladogram, as it is in the tree shown in Figure 2.39. If you show a consensus tree you might want to point out that the polytomies represent uncertainty about the branching order.
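The percentages on a majority-rule consensus tree can be understood with a small calculation: list the clades (groups of taxa) found in each tree, count the share of trees containing each clade, and keep only the clades found in more than 50% of them. The following is a minimal Python sketch of that bookkeeping. It is not PAUP*'s algorithm; the function names clade_support and majority_rule are hypothetical, and the two five-taxon trees are made up for illustration.

```python
from collections import Counter

def clade_support(trees):
    """Percentage of trees in which each clade appears.

    Each tree is given as a collection of clades; each clade is a frozenset of taxa.
    """
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))            # count each clade once per tree
    return {clade: 100.0 * n / len(trees) for clade, n in counts.items()}

def majority_rule(trees, threshold=50.0):
    """Keep only the clades found in more than `threshold` percent of the trees."""
    return {c: p for c, p in clade_support(trees).items() if p > threshold}

# Two hypothetical, equally good trees of the taxa A-E (made-up example):
#   tree1 = (((A,B),C),(D,E))    tree2 = (((A,B),(C,D)),E)
tree1 = [frozenset("AB"), frozenset("ABC"), frozenset("DE")]
tree2 = [frozenset("AB"), frozenset("CD"), frozenset("ABCD")]

support = clade_support([tree1, tree2])
kept = majority_rule([tree1, tree2])
for clade, pct in sorted(kept.items(), key=lambda item: sorted(item[0])):
    print(sorted(clade), pct)
```

In this toy case only the clade of A and B survives, because every other grouping occurs in just one of the two trees. The taxa whose grouping is not retained collapse into a polytomy in the consensus.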
If you wish to choose a single tree to present, in many cases you can choose the tree that most closely represents the consensus. In this case, where the two trees differ by a single node, neither "more closely represents" the consensus tree, so the choice is completely arbitrary.

[Figure 2.38]

[Figure 2.39]

Finally, your last option is to bootstrap the Parsimony analysis. (If you don't remember about bootstrapping, refer to pp. 50-53 in Chapter 1 and pp. 90-92 in this chapter.) The bootstrap tree, like the consensus tree, will not show branch lengths, but it will show the fraction of the time that the members of a particular clade (group of taxa) are together.

It should be understood that the existence of several equally parsimonious trees is not a flaw in the program, nor does it indicate a problem with the data. Multiple trees are often the result of very real polytomies in the tree. Like most of us, phylogeneticists prefer to keep things simple. The simplest situation is a strictly bifurcating tree: from every internal node there are exactly two branches (see "Learn More about Phylogenetic Trees," p. 42). Sadly, evolutionary history is not always so simple, and at times an ancestor may have given rise to multiple descendants within such a short span of time that the order of descent cannot be resolved. The result is multiple branches from an internal node, that is, a polytomy. When the tree representing the history of a large set of sequences includes many polytomies, there may be hundreds of equally parsimonious trees.

Is the inconvenience of dealing with consensus trees a reason to simply accept the Neighbor-Joining tree and get on with it? Not necessarily. Compare Figure 2.38 with Figure 2.27. Both are derived from the same data. Which is a more accurate representation of history?
If the polytomy is real, there is a problem with the NJ tree in that in PAUP* distance trees are strictly bifurcating; no polytomies are allowed.

The largeData alignment produces 32 equally parsimonious trees, far too many to show here, but the consensus parsimony tree for largeData is shown in Figure 2.40. Because the consensus tree does not show branch lengths, I would probably publish one of the trees with branch lengths displayed, indicate in the legend that it is one of 32 equally parsimonious trees, and also publish the consensus tree or a bootstrap tree as a second part of the same figure.

[Figure 2.40: consensus parsimony tree for the largeData alignment]

Creating Maximum Likelihood DNA Trees Using PAUP*

PAUP* for Windows/Unix: Use the instructions in this section without modification.
PHYLIP: Although PHYLIP does create ML trees, it does not do so using the GTR model discussed below. Using PHYLIP to create ML trees is sufficiently complex that it is beyond the scope of this book.

PAUP* cannot create Maximum Likelihood (ML) trees from protein sequences, but it does a very nice job with DNA sequences.

The number of possible trees depends on the number of sequences in the alignment, and it quickly becomes huge. It also depends on whether the tree is rooted or not. For unrooted trees it is

$$\frac{(2s-5)!}{2^{s-3}\,(s-3)!}$$

where s is the number of sequences. The number of possible rooted trees is

$$\frac{(2s-3)!}{2^{s-2}\,(s-2)!}$$

Thus, for just 10 sequences there are $2.03 \times 10^{6}$ unrooted trees and $3.4 \times 10^{7}$ rooted trees. It is not possible to compare the likelihoods of all possible trees, so the program searches by comparing a tree in memory with a closely related tree and retaining the more likely, repeating that process until no improvement is obtained.

Visualize a surface consisting of each of the possible trees for a given number of sequences. The height of each point (tree) above that surface is the likelihood of that tree, i.e., the probability of the alignment data given that tree and the specified model of evolution.
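The tree counts quoted for 10 sequences are easy to check. The short Python sketch below evaluates the two standard formulas for strictly bifurcating trees; the function names are my own and are only for illustration.

```python
from math import factorial


def unrooted_trees(s: int) -> int:
    """Number of strictly bifurcating unrooted trees for s sequences (s >= 3)."""
    if s < 3:
        raise ValueError("need at least 3 sequences for an unrooted tree")
    return factorial(2 * s - 5) // (2 ** (s - 3) * factorial(s - 3))


def rooted_trees(s: int) -> int:
    """Number of strictly bifurcating rooted trees for s sequences (s >= 2)."""
    if s < 2:
        raise ValueError("need at least 2 sequences for a rooted tree")
    return factorial(2 * s - 3) // (2 ** (s - 2) * factorial(s - 2))


if __name__ == "__main__":
    for s in (4, 10, 20):
        print(s, unrooted_trees(s), rooted_trees(s))
```

For s = 10 this gives 2,027,025 unrooted and 34,459,425 rooted trees, which is why no program can simply evaluate every tree once the alignment contains more than a handful of sequences.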
On that likelihood surface, the more closely related trees are to each other, the closer together they are. The surface thus consists of hills and valleys, with the most likely tree being the point at the top of the highest hill. The ML method starts at some point (some tree) and tries to find the top of the highest hill by moving from tree to tree, always accepting moves that go up and rejecting moves that go down in probability.

So far we have created trees by executing a data file (the Nexus alignment file that was produced by ClustalX), then using the mouse to select menu items, click buttons, etc. The number of instructions that need to be given to create an ML tree is large enough that it is actually easier to use an alternative way, the command-line interface, to tell PAUP* what to do with the data. Indeed, the command-line interface is the only option available to those who run PAUP* under Windows or Unix. While it is possible to issue individual commands by typing in the little one-line window at the bottom of PAUP*'s main screen, that is generally a bad idea. It is far too easy to mistype one word and not have anything work.

Chapter 2 Files: ML coding PAUP block and ML non-coding PAUP block

The better way to use the command-line interface is to put all of the commands together into a PAUP block that follows the data block in the input file. To make things easy I have included some example PAUP blocks as files on the web site. You can copy those blocks, then modify them slightly to create ML trees from your own data. Which block to use depends on whether or not your sequences are coding regions. If they are coding regions, you can make a better tree by considering the first, second, and third positions in each codon differently.

Duplicate your Nexus alignment file, rename the copy something like MyFile ML.nxs, and do everything to the copy! The alignment file looks like this:

    #NEXUS
    Begin data;
    Dimensions ntax=10 nchar=360;
    Format datatype=DNA ...;
    Matrix
    [one row per sequence: the sequence name followed by its aligned nucleotides]
    ;
    end;

It consists of a single block, the data block, that immediately follows the word #NEXUS. It begins with Begin Data; and ends with End;. Similarly, the PAUP block will begin with Begin Paup; and end with End;. In the Nexus format all command lines end with a semicolon.

The ML coding PAUP block looks like this:

    begin paup;
    set autoclose=yes warnreset=no increase=auto;
    charset first = 1-.\3;
    charset second = 2-.\3;
    charset third = 3-.\3;
    charpartition by_codon = 1:first,2:second,3:third;
    set criterion=parsimony;
    hsearch;
    set criterion=likelihood;
    lset nst=6 rmatrix=estimate basefreq=estimate rates=sitespec siterates=partition:by_codon;
    lscores 1;
    lset rmatrix=prev basefreq=prev rates=sitespec siterates=prev;
    hsearch start=1;
    [this is a comment]
    savetrees brlens=yes maxDecimals=4 file=output.ml.trees replace=yes;
    end;

The format for PAUP commands is to begin a line with a command such as set, followed by one or more option settings for that command. The command is terminated by a semicolon. A command does not have to be typed on a single line because it is the semicolon that terminates the command. The first command in the above PAUP block is

    set autoclose=yes warnreset=no increase=auto;

A Command Reference documentation file, Cmd_ref_v2.pdf, is available from http://paup.csit.fsu.edu/downl.html; you should be sure to download it.
That documentation is a command reference list that is not very user-friendly, but you can use it to look up each of the commands that PAUP* will recognize. Let's consider each line in the PAUP block to understand what it does.

The set command on the first line of the block sets a variety of options. The option autoclose=yes sets the status window to close at the end of the search; warnreset=no turns off a user warning that a data block has already been processed; increase=auto automatically increases the maximum number of trees if that maximum is reached.

The next four commands are

    charset first = 1-.\3;
    charset second = 2-.\3;
    charset third = 3-.\3;
    charpartition by_codon = 1:first,2:second,3:third;

These commands partition the characters as being first, second or third position in a codon.

The command set criterion=parsimony; sets the tree-building method to parsimony. It is the equivalent of choosing parsimony from the Analysis menu. The command hsearch; on the following line is the same as choosing heuristic search to initiate building a parsimony tree. The purpose of those two commands is to create a parsimony tree that the ML method can use as a starting point. One could just as easily use set criterion=distance; nj; to make a Neighbor-Joining tree instead.

The command set criterion=likelihood; now sets the tree-building method to maximum likelihood.

The ML method requires the user to specify the model for evolution. (See "Learn More about Evolutionary Models," p. 110.) The command lset specifies the model to be used. In this case, it is the General Time Reversible model with estimated base frequencies and with rates that are site-specific by codon position. Setting nst=6 specifies the number of substitution types. The options rmatrix=estimate and basefreq=estimate require the program to estimate both the instantaneous rate matrix and the base frequencies.
Setting rates=sitespec siterates=partition:by_codon makes the rates site-specific, with the sites partitioned by codon position. We know from experience that bases in third positions of codon triplets evolve much faster than do those in the first position, and that those in the second position evolve most slowly of all. The reason for these differences is that many third-position mutations are silent, i.e., they do not change the encoded amino acid. Some first-position mutations are also silent, but no second-position mutation is silent. By having the program estimate the substitution rates separately for each codon position, we more closely mimic real evolutionary outcomes.

The command lscores 1; instructs PAUP* to calculate the likelihood of the first tree in memory using the model specified by the lset command.

The command lset rmatrix=prev basefreq=prev rates=sitespec siterates=prev; ensures that the previously specified model will be applied during the search for the ML tree.

The command hsearch start=1; initiates a heuristic search for ML trees starting with the first tree in memory.

Anything enclosed in square brackets is interpreted as a comment and is ignored. You can add your own comments to the PAUP block to help you remember what each command does if you like.

The command savetrees tells PAUP* to save the resulting tree to a file. The option brlens=yes says to save the branch lengths with the tree, while the option maxDecimals=4 says to limit those lengths to four decimal places. You can set maxDecimals to any value you want, but anything more than 6 seems excessive and makes it harder to read the tree when it is printed. The option file=output.ml.trees assigns the name of the file.
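If the charset ranges in the ML coding block look cryptic, the following Python sketch may help. In a Nexus character range the period stands for the last character and \3 means "every third site", so 1-.\3 selects sites 1, 4, 7 and so on. The function name and the 12-site example are my own, used only for illustration.

```python
def charset_sites(start: int, nchar: int, step: int = 3) -> list:
    r"""1-based site numbers selected by the Nexus range 'start-.\step'.

    '.' means the last character in the alignment (nchar) and
    '\step' means every step-th site counting from 'start'.
    """
    return list(range(start, nchar + 1, step))


if __name__ == "__main__":
    nchar = 12  # a tiny hypothetical alignment of four codons
    print("first :", charset_sites(1, nchar))
    print("second:", charset_sites(2, nchar))
    print("third :", charset_sites(3, nchar))
```

For the 360-site alignment used in this chapter, each of the three character sets would hold 120 sites, and the charpartition command then tells PAUP* to estimate a separate rate for each set.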
Remember to change the file option of savetrees to whatever name you choose; for instance, file=myFile.ml.tre will remind you that the file is a tree file of an ML tree. Finally, replace=yes says that if a file with the name you chose already exists in the same directory (folder) as the input file, the new tree file will replace the old one.

The final command, end; ends the PAUP block.

If your sequence is not a coding sequence, use the ML non-coding PAUP block instead:

    begin paup;
    set autoclose=yes warnreset=no increase=auto;
    set criterion=parsimony;
    hsearch;
    set criterion=likelihood;
    lset nst=6 rmatrix=estimate basefreq=estimate rates=gamma shape=estimate;
    lscores 1;
    lset rmatrix=prev basefreq=prev rates=prev shape=prev;
    hsearch start=1;
    savetrees brlens=yes maxDecimals=4 file=output.ml.trees replace=yes;
    end;

In this block, everything having to do with partitioning by codon is eliminated, and lset uses gamma-distributed rates, with the shape of the gamma distribution being estimated by PAUP*.

Now that you understand the PAUP blocks, it is time to use them to create an ML tree. Paste the PAUP block at the end of the copied data file, after the word End; and save the file. When you open and execute that file, PAUP* will first read in the alignment and then execute each of the commands in the PAUP block. You should be aware that with a lot of sequences, and depending on the speed of your computer, this might well take all night! Don't give up on ML without trying it, however. Your computer probably has little to do overnight, so let it work for you.

LEARN MORE ABOUT Evolutionary Models

Sequences diverge from a common ancestor because mutations occur and some fraction of those mutations are fixed into the evolving population by selection and by chance, resulting in the substitution of one nucleotide for another at various sites. In order to reconstruct evolutionary trees, we must make some assumptions about that substitution process and state those assumptions in the form of a model. When you use a program such as PAUP*, unless you explicitly state a model, you are using the default model for the method you choose.

The easiest model to consider is one in which the probabilities of any nucleotide changing to any other nucleotide are equal. In order to predict the probability that a particular nucleotide at a particular site will change to some other specific nucleotide over some time interval, we need only know the instantaneous rate of change (i.e., the rate at which nucleotide substitutions occur). This simple model has only one parameter, the substitution rate, and is known as the one-parameter model or the Jukes-Cantor model (Jukes and Cantor 1969).

If we know that there is a G at some site at time t = 0, we can ask what is the probability that there will still be a G at that site at some later time t, and what is the probability that there will be, for instance, an A at that site instead. These are expressed, respectively, as $P_{GG}(t)$ and $P_{GA}(t)$. If the substitution rate is $\alpha$ per time unit, then

$$P_{GG}(t) = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t} \quad\text{and}\quad P_{GA}(t) = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$$

Because according to the one-parameter model all substitutions are equally likely, a more general statement is that

$$P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t} \quad\text{and}\quad P_{ij}(t) = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$$

When t is very close to zero, the probability that the site has not changed, $P_{ii}$, is very close to 1, while $P_{ij}$, the probability that the nucleotide at that site has changed from i to some other nucleotide j, is close to 0. As time goes on, both probabilities approach 0.25; the time required for that approach depends on $\alpha$.

We can write a table that shows the instantaneous rates for each of the possibilities for change at a site:

[Table of instantaneous substitution rates]
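The behavior of the one-parameter (Jukes-Cantor) model described in "Learn More about Evolutionary Models" can be explored numerically. This is only a sketch of the two standard formulas, assuming a substitution rate alpha per time unit; the function names and the rate of 0.01 are arbitrary choices for illustration.

```python
from math import exp


def p_same(alpha: float, t: float) -> float:
    """P_ii(t): probability that a site shows the same nucleotide after time t."""
    return 0.25 + 0.75 * exp(-4.0 * alpha * t)


def p_change(alpha: float, t: float) -> float:
    """P_ij(t): probability that nucleotide i has become one specific other nucleotide j."""
    return 0.25 - 0.25 * exp(-4.0 * alpha * t)


if __name__ == "__main__":
    alpha = 0.01  # hypothetical substitution rate per time unit
    for t in (0, 1, 10, 100, 1000):
        same, change = p_same(alpha, t), p_change(alpha, t)
        # The four probabilities (one "same" plus three "change") sum to 1.
        print(t, round(same, 4), round(change, 4), round(same + 3 * change, 4))
```

At t = 0 the site is certain to be unchanged, and as t grows both probabilities drift toward 0.25; a larger alpha simply makes that happen sooner.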
When PAUP* has finished, you can display or print the tree as for NJ and parsimony trees. The ML tree created using the smallData.nex alignment is shown in Figure 2.41. Compare the ML tree in Figure 2.41 with the NJ tree in Figure 2.27 and the parsimony trees in Figure 2.36.

[Figure 2.41]

The ML tree for the large dataset alignment looks like Figure 2.42.

[Figure 2.42]

Creating Maximum-Likelihood Protein Trees Using Tree-Puzzle

Tree-Puzzle is a free program for creating ML trees from protein and DNA sequences. It can be obtained at http://www.tree-puzzle.de/. When you download Tree-Puzzle, the file will include documentation that provides considerably more detail than I provide here. This documentation is worth reading.

You need to pay attention to a few details when you use ClustalX to create the alignment. When you select Output Format Options under the Alignment menu, you must check both the PHYLIP Format and the Nexus Format boxes. When the alignment is done, you must move the PHYLIP file (.phy) into the folder (directory) where the Tree-Puzzle program is. The .phy file will serve as the input file for Tree-Puzzle. Tree-Puzzle is a traditional menu-driven program. It will ask you for the name of the input file (Figure 2.43), and you must type that name correctly. Tree-Puzzle is case-sensitive when it reads file names, so be careful. If you have forgotten to put the input file into the same folder as Tree-Puzzle, Tree-Puzzle won't be able to find it and will request an alternative name.

[Figure 2.43]

Tree-Puzzle will next present you with a list of options (Figure 2.44). To change a setting, simply type the letter at the left of the choice (e.g., to change the outgroup, type the letter o). Some options are simply switches, so that selecting the option automatically changes it.
Others, like Outgroup, will present you with a list of choices or other instructions.

[Figure 2.44]

Notice that Tree-Puzzle automatically chose the first sequence as the outgroup. To change that, type o followed by the number of the sequence you wish to select as the outgroup. When everything is set the way you want it, type y to start the program running.

When the run is complete, Tree-Puzzle asks for a name for the tree files, then writes three files (outfile, outdist, and the tree file) to the same folder where the Tree-Puzzle program and the input file are located (Figure 2.45).

[Figure 2.45]

The file named outfile contains all of the information about your tree, including the time for the run, a diagram of the tree showing support (equivalent to bootstrap values) for each branch, branch lengths, and finally the tree itself in the PHYLIP format. Using the alignment from Chapter 1, it took only 17 seconds on a 1.42 GHz G4 PowerMac to create the tree. However, if your alignment includes a lot of sequences, you may want to let the program run overnight.

You can use either TreeView or PAUP* for Macintosh to view and print the tree, but PAUP* cannot read the tree file created by Tree-Puzzle directly. Use any word processing program to create a template Nexus tree file that looks like this:

    #NEXUS
    Begin trees;
    tree PAUP_1 = [&U]
    End;

Copy the tree and paste it into the Nexus format tree file just after the [&U] so that the file now looks like this:

    #NEXUS
    Begin trees;
    tree PAUP_1 = [&U] (GOB1:0.00343,((((((L1c:0.06530,L1a:0.05858)95:0.04915,L1b:0.12135)100:0.64275,ThinB:0.66097)100:0.19739,(b1z6230:0.52657,Cau:0.51875)96:0.21397)88:0.14623,mBlaSalty:0.63788)100:0.29610,FEZ1:0.42836)100:0.68123,AnoGam:0.05447);
    End;

To view the tree with PAUP*, you must first open and execute the .nxs file that ClustalX created, then you can get the tree file you just made and preview or print the file as described on pages 85-90.

Creating Bayesian Trees Using MrBayes

Distance, Parsimony, and Maximum Likelihood methods for constructing phylogenetic trees are well established and will be familiar to phylogeneticists who might serve as reviewers of manuscripts that include a major phylogenetic component. In contrast, the Bayesian approach (Mau and Newton 1997; Mau et al. 1999; Rannala and Yang 1996) is new and remains less familiar to most systematists. I have included Bayesian methodology because I judge it to be a powerful approach that is gaining popularity, and because, as implemented by the program MrBayes, it offers some distinct advantages over other methods. MrBayes is extraordinarily easy to use, is quite fast, and is capable of dealing with very large phylogenies. MrBayes uses a command-line interface in which you type commands to instruct the program what to do with the data file. It is considerably easier, however, to add a MrBayes block to a Nexus format alignment file, much as is done when using PAUP* for DNA Maximum Likelihood.

When executed, the program starts with a tree (either a random tree or one specified by the user in the execution file), evaluates that tree according to the model specified in the execution file, changes the tree, evaluates the new tree, and if the new tree is better accepts that tree. That process constitutes a "generation." Every so many generations (specified by the user), the program records the current tree and its likelihood in a file.
The user specifies the number of generations, and eventually the program calculates a consensus of the recorded trees and writes that consensus, complete with branch lengths, to a file. The user can then open that consensus tree file in TreeView or PAUP* to view and print the tree. The user can also determine the fraction of the trees that contain any particular group (clade) of sequences. Those probabilities are the equivalent of bootstrap values and will tell you how much confidence you can have in that part of the tree.

Creating the Execution File

The execution file is simply the .nxs output file created by ClustalX with a "block" of MrBayes commands added after the data block, exactly as a PAUP block was added when using PAUP* to create an ML tree. After adding the MrBayes block, I usually rename the file with a .bay extension. See Appendix I to learn about blocks in the Nexus format.

Caution: If the data block of the Nexus file is in the interleaved format, the format statement must include interleave=yes. When ClustalX writes interleaved Nexus files it uses the shorthand interleave. In MrBayes you must change that manually to interleave=yes.

To modify the .nxs file for MrBayes, use any text editing program to open the .nxs file and type or paste in a MrBayes block following the end; statement at the end of the data block. A MrBayes block begins with begin mrbayes; and ends with end; in all cases. The semicolon at the end of each of those elements is essential. Between begin mrbayes; and end; are a series of statements, each ending with a semicolon, that tell MrBayes what to do with the data.
An example of a MrBayes block that can be used for coding sequences is

    begin mrbayes;
    log start filename=MyFile.log replace;
    charset 1st_pos = 1-.\3;
    charset 2nd_pos = 2-.\3;
    charset 3rd_pos = 3-.\3;
    partition by_codon = 3:1st_pos, 2nd_pos, 3rd_pos;
    set partition = by_codon;
    lset nst=6;
    prset ratepr=variable;
    set autoclose=yes;
    mcmcp ngen=10000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=MyFile;
    mcmc;
    plot filename=MyFile.p;
    sumt filename=MyFile.t burnin=20 contype=halfcompat;
    log stop;
    end;

What the Statements in the Example MrBayes Block Do

You will notice that the commands in MrBayes resemble the commands used by PAUP* for ML trees. In the example MrBayes block, the log start filename=MyFile.log replace; command starts recording everything that appears in the MrBayes window to a file named MyFile.log. The reason for recording that information will be discussed later in this chapter, in the section "Interpreting MrBayes Results."

In the above example, which is appropriate for coding sequences, the set of commands

    charset 1st_pos = 1-.\3;
    charset 2nd_pos = 2-.\3;
    charset 3rd_pos = 3-.\3;
    partition by_codon = 3:1st_pos, 2nd_pos, 3rd_pos;
    set partition = by_codon;

serves the same purpose as in the ML coding PAUP block for Maximum Likelihood trees (pp. 107-108). The command lset nst=6; sets the number of substitution types (nst) to 6, which is the GTR (General Time Reversible) model. The command prset ratepr=variable; specifies a site-specific rates model with the rates varying according to the defined partition, i.e., by codon position.

The command set autoclose=yes; tells MrBayes that after it has run the chains as directed by the mcmc statement, it should close the chains and go on to the next statement. Without the autoclose statement, MrBayes will ask if you want to run more generations, and it will wait for an answer before continuing.

As is the case for PAUP*, MrBayes treats anything within square brackets as a comment.

LEARN MORE ABOUT Bayesian Analysis

Bayesian inference is based on the notion of posterior probabilities: probabilities that are estimated, based on some model (prior expectations), after learning something about the data. For instance, suppose you have been told that 90% of the coins in a bag are true coins and 10% are coins that are biased to turn up heads 80% of the time. You are blindfolded and asked to pick a coin at random; then you are asked "What is the probability that this coin is a biased coin?" Having nothing more to go on than your model that 90% of the coins are true, your obvious answer is 0.1. If, however, you are allowed to toss the coin you chose 10 times and then are asked the probability that it is biased, you would revise your estimate based on your model of the expected distribution of outcomes from true and biased coins (the biased coins are expected to come up heads 80% of the time), and your expectation of the initial proportion of true and biased coins. The probability you estimate after observing the outcomes, the posterior probability, should be a better estimate than the 0.1 probability you estimated with no knowledge.

Suppose you observe the following result of your tosses: HHTHHTTHHH. We will use X to symbolize that result. The probability of that result given that the coin is true, symbolized P[X|True] where the vertical line means "given that," is

$$P[X \mid \mathrm{True}] = 0.5^{10} = 9.76 \times 10^{-4}$$

The probability of that result given a biased coin is

$$P[X \mid \mathrm{Biased}] = 0.8^{7} \times 0.2^{3} = 1.67 \times 10^{-3}$$

The posterior probability that the coin is biased, i.e., the probability that it is biased given the result HHTHHTTHHH, is given by Bayes' formula as

$$P[\mathrm{Biased} \mid X] = \frac{P[X \mid \mathrm{Biased}] \times P[\mathrm{Biased}]}{\left(P[X \mid \mathrm{Biased}] \times P[\mathrm{Biased}]\right) + \left(P[X \mid \mathrm{True}] \times P[\mathrm{True}]\right)}$$

$$P[\mathrm{Biased} \mid X] = \frac{1.67 \times 10^{-3} \times 0.1}{\left(1.67 \times 10^{-3} \times 0.1\right) + \left(9.76 \times 10^{-4} \times 0.9\right)}$$
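The coin example in "Learn More about Bayesian Analysis" can be worked through in a few lines of Python. This is just a sketch of Bayes' formula applied to that example; the function names are mine, and the default values are the ones given in the example (10% biased coins, biased coins landing heads 80% of the time).

```python
def likelihood(tosses: str, p_heads: float) -> float:
    """Probability of one specific sequence of tosses for a coin with P(heads) = p_heads."""
    heads = tosses.count("H")
    tails = tosses.count("T")
    return p_heads ** heads * (1.0 - p_heads) ** tails


def posterior_biased(tosses: str, prior_biased: float = 0.1,
                     p_heads_biased: float = 0.8, p_heads_true: float = 0.5) -> float:
    """Posterior probability that the coin is biased, given the observed tosses."""
    biased = likelihood(tosses, p_heads_biased) * prior_biased
    true = likelihood(tosses, p_heads_true) * (1.0 - prior_biased)
    return biased / (biased + true)


if __name__ == "__main__":
    x = "HHTHHTTHHH"  # the result observed in the example: 7 heads, 3 tails
    print("P[X|True]   =", likelihood(x, 0.5))
    print("P[X|Biased] =", likelihood(x, 0.8))
    print("P[Biased|X] =", posterior_biased(x))
```

Seven heads in ten tosses raises the probability that the coin is biased from the prior value of 0.1 to roughly 0.16. The evidence moves the estimate, but not dramatically, because true coins are so much more common in the bag.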
The command mcmcp sets the parameters for the run, and the statement

    mcmcp ngen=10000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=MyFile;

tells MrBayes to run for 10,000 generations, to print to the screen every 1000 generations, to save the current tree to a file every 100 generations, to run four simultaneous Monte Carlo chains, and to save the tree with branch lengths. The statement also stipulates that the basic file name for the output files will be MyFile. With these settings the run writes two files: MyFile.t, which is the tree file that records every tree that was saved, and MyFile.p, which has information on the parameters, including the likelihoods of the trees at each step. The default file name is the input file name; if you do not specify a file name with the mcmcp command, and your input file was Tem.bay, the files will be named Tem.bay.t, Tem.bay.p, etc.

The command mcmc is the "run" command.

The command plot tells MrBayes to draw a rough plot of the log likelihood scores of the trees. It uses the parameter file MyFile.p as the source of those likelihoods, so you must tell it filename=MyFile.p so it knows where to look for the information. The resulting plot will look something like Figure 2.46.

[Figure 2.46: plot of log likelihood versus generation]

The penultimate statement is

    sumt filename=MyFile.t burnin=20 contype=halfcompat;

This sumt statement tells MrBayes how to summarize the trees. It says that the name of the tree file is MyFile.t, that in summarizing it should ignore the first 20 trees, and that the consensus tree should be a "majority rule" tree. Because none of those sumt instructions is intuitive, let's consider them one at a time.

Because we specified MyFile as the file name for all output files, in the sumt command filename=MyFile.t must be used to tell MrBayes where to look for the trees it must summarize.

The burnin=20 option requires a detailed explanation.
As MrBayes searches for trees, it saves increasingly likely trees to the tree file. At first the likelihoods of the trees increase rapidly. Eventually the increase in likelihood declines, and the likelihoods converge around a steady value. After that the saved trees are roughly equally likely, and it is these roughly equally likely trees that will be used to create a consensus tree. The burnin value is the number of trees (not the number of generations) that will be ignored when the consensus is created. I discuss how to choose a good value for burnin under discussion of the mcmc command below.

The sumt command instructs MrBayes to summarize the trees, starting after the burnin, in the form of a consensus tree. That consensus tree can either be "strict" or "majority rule"; contype=halfcompat says to use the majority rule tree.

The sumt command causes three output files to be written: MyFile.parts (which has information on the partitions), MyFile.tprobs (which lists all of the credible trees and their probabilities), and MyFile.con (which gives the consensus tree).

The log stop; command stops recording the display buffer to the log file.

Choosing Values for ngen and burnin

MrBayes is unusual in that it is the user who determines how long the run will take by setting ngen, the number of generations. There are two factors to take into consideration in deciding the value of ngen: (1) the number of generations required to converge on a stable value for the likelihoods of the trees; and (2) the amount of time required for the program to run. The easiest way to choose a reasonable setting for ngen is to do a preliminary run of a few thousand generations.

To run MrBayes be sure that the .bay file is in the same folder as MrBayes, start MrBayes, and type execute filename, where filename is the name of the execution file. In our example we would type execute smallData.bay.
As the program runs, it displays a column of numbers each time it prints to the screen, as shown in Figure 2.47. We need only concern ourselves with the first and last columns for the moment. The first is the generation number and the last is the time MrBayes estimates it will require to complete the specified number of generations. (The interior four columns show the log likelihoods of the trees in each of the chains.) MrBayes pretty accurately predicts how long you have to wait to complete the analysis. Run time increases linearly with generations, so we can easily estimate the run time for any number of generations. A reasonable approach is to do an initial run of 10,000 generations just to estimate the time per generation, then to do a second run with ngen set to a value that will result in an overnight run and with the burnin option of sumt set to 10% of the trees.

[Figure 2.47: MrBayes screen output for a 10,000-generation run, ending with "Chain completed in 39 seconds"]

In the example in Figure 2.47, 10,000 generations required 39 seconds, or 0.0039 seconds per generation; thus an overnight run of 14 hours (50,400 seconds) will produce 12,923,076 generations. That's a lot of generations and, with a tree saved every 100 generations, would produce about 129,230 trees. You will usually be pretty safe setting the burnin to 10% of the trees, so set ngen=13000000 and burnin=13000 and run MrBayes overnight. I generally figure a minimum of a million generations for DNA sequences, and a minimum of 204,000 generations for protein sequences.

When the run is complete open the .p file (MyFile.p in our example).
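The arithmetic for planning an overnight run is simple enough to script. The sketch below assumes, as the text does, that run time scales linearly with the number of generations; the function names are hypothetical, and the sample frequency of 100 and the 10% burnin are the values used in this chapter.

```python
def generations_in(seconds_available: int, trial_generations: int, trial_seconds: int) -> int:
    """Whole generations that fit in the available time, scaled from a trial run."""
    return seconds_available * trial_generations // trial_seconds


def run_plan(hours: int, trial_generations: int, trial_seconds: int,
             samplefreq: int = 100, burnin_percent: int = 10):
    """Return (ngen, number of saved trees, burnin in trees) for a run of `hours` hours."""
    ngen = generations_in(hours * 3600, trial_generations, trial_seconds)
    trees = ngen // samplefreq           # one tree is saved every samplefreq generations
    burnin = trees * burnin_percent // 100
    return ngen, trees, burnin


if __name__ == "__main__":
    # Trial run from the example: 10,000 generations took 39 seconds.
    ngen, trees, burnin = run_plan(14, 10000, 39)
    print("ngen   =", ngen)
    print("trees  =", trees)
    print("burnin =", burnin)
```

Rounding the result up to ngen=13000000 and burnin=13000, as suggested above, keeps the numbers easy to type and changes the run time only slightly.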
The first column is the generation number and the second column is the log likelihood of the current tree in the cold chain. (The remaining columns in the file are not of interest to us here.)

    Gen      LnL
    1        -10669.690
    100      -8376.450
    200      -7035.258
    300      -6869.159
    400      -6832.282
    500      -6802.460
    600      -6770.949
    700      -6757.845
    800      -6743.862
    900      -6742.035
    1000     -6742.012
    ...
    8100     -6726.400
    8200     -6725.089
    8300     -6728.693
    8400     -6733.872
    8500     -6728.069
    8600     -6730.283
    8700     -6722.274
    8800     -6726.466
    8900     -6729.968
    9000     -6728.072

In the above example, the log likelihood started at -10,669 and by generation 1000 had increased to -6742. The likelihood had not converged on a stable value by generation 1000, but by generation 8,000 it had converged on values of -6720 to -6730.

The importance of that log likelihood is that we want to be sure we have set the burnin value to discard all those trees prior to the stable log likelihood value. The problem is that we don't know in advance how long that will take. Examination of the .p file has told us that the smallData dataset required only about 8,000 generations to converge. With a total of 130,000 trees and a burnin of 13,000 trees, we are very much on the safe side!

Another way to check that the log likelihoods have converged on a stable value is to look at the plot of lnLikelihood (Figure 2.46). (If you happened to quit MrBayes, or you can't see the plot, you will find the plot near the end of the log file.) In Figure 2.46, MrBayes was run for a million generations and the likelihood values converged by about 30,000 generations, so any burnin value greater than 300 would have been safe. The plot shows likelihoods versus generations. Remember that burnin is the number of trees, not the number of generations, required for convergence.

If examination of either the plot or the .p file itself shows that the values converged well before the burnin that you set, you are done.
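The by-eye check of the .p file described above can also be sketched in code. The following is a minimal sketch, not a MrBayes feature: it assumes only what the text states about the .p file (whitespace-separated columns, generation first, log likelihood second). The function names, the window of 5 samples, and the tolerance of 10 log-likelihood units are all hypothetical choices made for illustration, and the sample rows are a subset of the values listed above.

```python
def parse_p_file(lines):
    """Return (generation, log_likelihood) pairs from the lines of a .p file.

    Assumes whitespace-separated columns with the generation first and the
    log likelihood second. Lines that do not begin with two numbers, such
    as a header line, are skipped.
    """
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            samples.append((int(fields[0]), float(fields[1])))
        except ValueError:
            continue  # header or other non-numeric line
    return samples


def first_stable_generation(samples, window=5, tolerance=10.0):
    """Hypothetical heuristic for "converged on a stable value".

    Takes the mean of the last `window` log likelihoods as the plateau and
    returns the earliest sampled generation from which every later sample
    lies within `tolerance` of that plateau.
    """
    if len(samples) < window:
        raise ValueError("not enough samples to estimate a plateau")
    plateau = sum(lnl for _, lnl in samples[-window:]) / window
    stable_from = len(samples)
    # Walk backwards from the end until a sample falls outside the band.
    for index in range(len(samples) - 1, -1, -1):
        if abs(samples[index][1] - plateau) > tolerance:
            break
        stable_from = index
    if stable_from == len(samples):
        raise ValueError("no stable plateau found with this tolerance")
    return samples[stable_from][0]


def burnin_trees(generation, samplefreq=100):
    """Convert a generation number to a count of saved trees, rounding up."""
    return -(-generation // samplefreq)


example_rows = """Gen LnL
1 -10669.690
300 -6869.159
500 -6802.460
1000 -6742.012
8300 -6728.693
8400 -6733.872
8500 -6728.069
8600 -6730.283
8700 -6722.274
8800 -6726.466
8900 -6729.968
9000 -6728.072
"""

samples = parse_p_file(example_rows.splitlines())
stable = first_stable_generation(samples)
print(stable, burnin_trees(stable))  # prints: 8300 83
```

The result is only a lower bound on a sensible burnin for these rows. A burnin comfortably larger than it, such as the 10% rule used in the text, keeps a safety margin, and the plot of log likelihood against generation remains the more trustworthy check.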
If it happens that convergence occurred after the burnin that you set, you can do the sumt command again manually, choosing a better burnin value. Suppose you had used a burnin of 1000 trees but you actually needed a burnin of 2000 trees. In MrBayes, just type sumt filename=MyFile.t burnin=2000 contype=halfcompat. MrBayes will recalculate the consensus tree and rewrite the files. It will, in each case, ask if you want to overwrite the old file. Just say yes.

OOPS! What if I quit MrBayes before I realized that I needed to do sumt again? Do I need to take another night to do the tree again? No! Just be sure the MrBayes program is in the same folder (directory) as your .t and .p files, start MrBayes, and issue the sumt command as described in the previous paragraph.

Keep in mind that the time required to run a given number of generations will depend both on the computer being used and on the dataset. The 39 seconds required for 10,000 generations is true on my Macintosh 1.4 GHz G4. On my old 366 MHz G3, the same run takes 25 minutes. Most datasets will take tens or even hundreds of thousands of generations to converge. You should always check the plot or the .p file to be sure that you set burnin to a large enough value.

Interpreting MrBayes Results

At completion of the run, MrBayes prints a lot of summary information, and prints low-resolution consensus trees to the screen. The first tree in Figure 2.48 shows the credibility of the clades. The numbers are equivalent to bootstrap percentages. One of the big advantages of MrBayes is that it is not necessary to bootstrap a tree to estimate clade credibility, which tells you about the credibility of the tree in the same way that bootstrap values do. The second tree is a phylogram.
[Figure 2.48: the consensus trees that MrBayes prints to the screen, a clade credibility tree followed by a phylogram.]

The trees in Figure 2.48 were saved to the file MyFile.con, and those trees can be printed or viewed using either PAUP* for Macintosh or TreeView, as described in the section on "Presenting and Printing Your Trees," pages 135-147. It is important to realize that MrBayes 3.0 saves the consensus tree in that file twice. The first tree includes branch lengths and the clade credibilities (as branch labels). TreeView will print the clade credibilities, but PAUP* will not.

The file MyFile.log that was created by the log start filename=MyFile.log replace; command contains the same trees that were displayed in the MrBayes window (Figure 2.48). Open that file to see the credibilities or to add them to your published tree. Clade credibilities, like bootstrap values, are typically placed at nodes, often within circles, and can be added manually with a drawing program as described in the section on "Presenting and Printing Your Trees," pages 135-147.

The Bayesian tree of the smallData alignment (one million generations, burnin of 500 trees) looks like Figure 2.49. Compare Figure 2.49 with Figures 2.27, 2.36, and 2.41.

[Figure 2.49: Bayesian tree of the smallData alignment.]

Sample Blocks for MrBayes

Chapter 2 Files: MrBayes v3 coding block, MrBayes v3 non-coding block, and MrBayes v3 protein block

MrBayes can create Bayesian trees from both protein and nucleic acid sequences. The following coding, non-coding, and protein blocks can serve as templates for running MrBayes.

Chapter 2 Files: MrBayes v3 coding block

If your sequences are coding regions, use the following block.
    begin mrbayes;
    log start filename=MyFile.log replace;
    charset 1st_pos = 1-.\3;
    charset 2nd_pos = 2-.\3;
    charset 3rd_pos = 3-.\3;
    partition by_codon = 3:1st_pos,2nd_pos,3rd_pos;
    set partition = by_codon;
    lset nst=6;
    prset ratepr=variable;
    set autoclose = yes;
    mcmcp ngen=10000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=MyFile;
    plot filename=MyFile.p;
    sumt filename=MyFile.t burnin=20 contype=halfcompat;
    log stop;
    end;

Chapter 2 Files: MrBayes v3 non-coding block

If your sequences are not coding regions, use the following block.

    begin mrbayes;
    log start filename=MyFile.log replace;
    lset nst=6 rates=gamma;
    set autoclose=yes;
    mcmcp ngen=1000 printfreq=100 samplefreq=100 nchains=4 savebrlens=yes filename=MyFile;
    plot filename=MyFile.p;
    sumt filename=MyFile.t burnin=2 contype=halfcompat;
    log stop;
    end;

Chapter 2 Files: MrBayes v3 protein block

If your sequences are proteins, use the following block.

    begin mrbayes;
    log start filename=MyFile.log replace;
    lset rates=gamma;
    prset aamodelpr=mixed;
    set autoclose = yes;
    mcmcp ngen=10000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=MyFile;
    sumt filename=MyFile.t burnin=20 contype=halfcompat;
    plot filename=MyFile.p;
    log stop;
    end;

Note: In addition to adding the MrBayes block for proteins, you will need to modify the beginning of the data block. ClustalX writes Nexus protein alignment files like this:

    #NEXUS
    BEGIN DATA;
    dimensions ntax=10 nchar=320;
    format missing=?
    symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
    interleave datatype=PROTEIN gap= -;

MrBayes requires that the datatype immediately follows the word format, and does not allow the symbols="ABCDEFGHIKLMNPQRSTUVWXYZ" statement. For protein sequences you must modify the data block so that it looks like this:

    #NEXUS
    BEGIN DATA;
    dimensions ntax=10 nchar=320;
    format datatype=PROTEIN ...
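The data-block change described above (datatype moved to directly after the word format, and the symbols statement removed) can be sketched as a small text transformation. This is a hedged illustration only: the function name fix_format_statement is hypothetical, the sample input is abbreviated from the ClustalX output quoted above, and the sketch applies just the two changes stated in the text. Any further adjustment that MrBayes may require is outside its scope.

```python
import re


def fix_format_statement(statement):
    """Rewrite a NEXUS 'format ...;' statement for MrBayes.

    Applies only the two changes named in the text:
      1. the datatype=... token is moved to directly after 'format';
      2. any symbols="..." token is dropped.
    All other tokens are kept in their original order.
    """
    body = statement.strip().rstrip(";").strip()
    # Remove the leading keyword so only the option tokens remain.
    body = re.sub(r"^format\s+", "", body, flags=re.IGNORECASE)
    # Drop the symbols="..." token entirely.
    body = re.sub(r'symbols\s*=\s*"[^"]*"', "", body, flags=re.IGNORECASE)
    match = re.search(r"datatype\s*=\s*\w+", body, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no datatype= token found in format statement")
    datatype = match.group(0)
    remaining = (body[:match.start()] + body[match.end():]).split()
    return "format " + " ".join([datatype] + remaining) + ";"


clustalx_line = (
    'format missing=? symbols="ABCDEFGHIKLMNPQRSTUVWXYZ" '
    "interleave datatype=PROTEIN;"
)
print(fix_format_statement(clustalx_line))
# prints: format datatype=PROTEIN missing=? interleave;
```

For a single alignment the edit is just as easily made by hand in a text editor; a helper like this is mainly useful when many ClustalX files need the same change.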
Finally, these blocks are set up for a very quick initial run to determine the time required per generation, as suggested in steps 3 and 4 of the summary box on page 132.

Summarizing MrBayes

1. Create an alignment, specifying Nexus as one of the output formats.
2. Use any text editor to open the .nxs file and paste in an appropriate sample MrBayes block, changing the filename under the mcmc and sumt commands to whatever you like. Remember that the file name under sumt must be the same as under mcmc, but with a ".t" added. Save the file with a new name such as MyFile.bay.
3. Open MrBayes and type execute <filename>, where <filename> is the name of the new file containing the MrBayes block.
4. Calculate how many generations will be run overnight, open the .bay file, and change ngen to that number. Set the burnin option of the sumt command to 0.001 times ngen (this is 10% of the trees). Save the file.
5. Discard the output files that were created during the first run. In MrBayes, execute the file again.
6. When the overnight run is completed, open the .p file, or look at the log likelihood plot, to see if the likelihood value converged to a stable value before the burnin that you chose. If it did not, manually issue the sumt command from within MrBayes, choosing a burnin value that is well in excess of convergence.
7. When the run is completed (with an acceptable burnin value), view and/or print the consensus tree that was saved in the .con file.

Getting Help

When you download MrBayes, you can also download the documentation, and it is a good idea to do so. The first part of the documentation explains the various commands in detail, while the second part explains the basis of Bayesian phylogenetic reconstruction. For most purposes, the settings in the sample MrBayes blocks (pp. 130-131) will work just fine.
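The arithmetic behind step 4 of the summary (time per generation from the short run, generations that fit into an overnight run, and a burnin of 10% of the saved trees) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name plan_overnight_run and its parameters are hypothetical, and the defaults of one tree saved per 100 generations and a 10% burnin are taken from the sample blocks and the rule of thumb in the text.

```python
def plan_overnight_run(pilot_generations, pilot_seconds, overnight_seconds,
                       samplefreq=100, burnin_percent=10):
    """Estimate ngen, the number of saved trees, and a burnin in trees.

    pilot_generations, pilot_seconds: size and duration of the short trial run.
    overnight_seconds: time available for the long run.
    samplefreq: generations between saved trees (samplefreq in the block).
    burnin_percent: percentage of the saved trees to discard.
    """
    # Run time grows linearly with generations, so scale the pilot run.
    ngen = int(overnight_seconds * pilot_generations // pilot_seconds)
    trees = ngen // samplefreq
    burnin = trees * burnin_percent // 100
    return ngen, trees, burnin


# Values from the worked example: 10,000 generations in 39 seconds,
# and a 14-hour (50,400-second) overnight run.
ngen, trees, burnin = plan_overnight_run(10000, 39, 14 * 3600)
print(ngen, trees, burnin)  # prints: 12923076 129230 12923
```

The text rounds these figures up to ngen=13000000 and burnin=13000. With samplefreq=100 and a 10% burnin, the burnin is always 0.001 times ngen, which is the shortcut given in step 4.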
By far the easiest way to get more detailed information on the various statements in the block is to use MrBayes' outstanding online help facility. (A particularly nice feature is that typing "help manual" results in MrBayes saving the entire online help manual to a text file that you can print out.)

Start MrBayes and type the word help to see a complete list of the available commands:

    Commands that are available from the command
    line or from a MrBayes block include:

    About            -- Describes the program
    Acknowledgments  -- Shows program acknowledgments
    Charset          -- Assigns a group of sites to a set
    Citations        -- Appropriate citation of program
    Comparetree      -- Compares the trees from two tree files
    Ctype            -- Assigns ordering for the characters
    Databreaks       -- Defines nucleotide pairs (doublets) for stem models
    Delete           -- Deletes taxa from the analysis
    Disclaimer       -- Describes program disclaimer
    Exclude          -- Excludes sites from the analysis
    Execute          -- Executes a file
    Help             -- Provides detailed description of commands
    Include          -- Includes sites
    Link             -- Links parameters across character partitions
    Log              -- Logs screen output to a file
    Lset             -- Sets the parameters of the likelihood model
    Manual           -- Prints a command reference to a text file
    Mcmc             -- Starts Markov chain Monte Carlo analysis
    Mcmcp            -- Sets the parameters of a chain (without starting analysis)
    Outgroup         -- Changes outgroup taxon
    Pairs            -- Defines nucleotide pairs (doublets) for stem models
    Partition        -- Assigns a character partition
    Plot             -- Plots parameters from MCMC analysis
    Prset            -- Sets the priors for the parameters
    Props            -- Set proposal probabilities
    Quit             -- Quits the program
    Report           -- Controls how certain model parameters are reported
    Restore          -- Restores taxa
    Set              -- Sets run conditions and defines active data partition
    Showmatrix       -- Shows current character matrix
    Showmodel        -- Shows model settings
    Showtree         -- Shows user tree
    Sump             -- Summarizes parameters from MCMC analysis
    Sumt             -- Summarizes trees from MCMC analysis
    Taxastat         -- Shows status of taxa
    Taxset           -- Assigns a group of taxa to a set
    Unlink           -- Unlinks parameters across character partitions
    Usertree         -- Defines a single user tree

    Commands that should be in a NEXUS file (data
    block or trees block) include:

    Begin            -- Denotes beginning of block in file
    Dimensions       -- Defines size of character matrix
    End              -- Denotes end of a block in file
    Format           -- Defines character format in data block
    Matrix           -- Defines matrix of characters in data block
    Translate        -- Defines alternative names for taxa
    Tree             -- Defines a tree from MCMC analysis

    Note that this program supports the use of the shortest
    unambiguous spelling of the above commands (e.g., "exe"
    instead of "execute").

Any of these commands can be issued either from the MrBayes command line or from the execution file. Most commands allow you to set a variety of parameters. To see the parameters and possible settings for any command, type help <command>, where <command> is one of the commands in the list above. For instance, to see how to use mcmcp, type help mcmcp and see:

    Mcmcp

    This command sets the parameters of the Markov chain Monte Carlo
    (MCMC) analysis without actually starting the chain. This command
    is identical in all respects to Mcmc, except that the analysis will
    not start after this command is issued. For more details on the
    options, check the help menu for Mcmc.

    Parameter      Options              Current Setting
    Seed           <number>             1064159393
    Ngen           <number>             1000000
    Samplefreq     <number>             100
    Swapfreq       <number>             1
    Printfreq      <number>             100
    Nchains        <number>             4
    Temp           <number>             0.200000
    Reweight       <number>,<number>    0.00 v 0.00 ^
    Filename       <name>               temp.out.<p/t>
    Burnin         <number>             0
    Startingtree   Random/User          Random
    Nperts         <number>             0
    Savebrlens     Yes/No               No

Notice that the help information describes the current setting for each parameter. The settings listed are the default settings, but if you have changed any of those settings, either via the command line or via an execution file, the current settings will be displayed. If you are ever unsure of a parameter setting, simply type help <command> to see the current setting.

Don't be intimidated by all of those choices. Most are of interest only to professional phylogeneticists. You can use the program quite well just by using the example MrBayes blocks provided earlier.

The Bayesian tree of the largeData alignment (one million generations, burnin of 1000 trees) looks like Figure 2.50.

[Figure 2.50: Bayesian tree of the largeData alignment.]

Presenting and Printing Your Trees

Opening Tree Files in PAUP*

When PAUP* prints the branch lengths on a tree, the format of those lengths depends on what kind of analysis is selected. A Parsimony tree shows the number of changes along the branch, which is thus an integer. Neighbor Joining and Maximum Likelihood trees show the changes per site, and on phylograms PAUP* prints a scale that allows the reader to correlate branch length with number of changes per site.

PAUP* for Windows and Unix: Use TreeView (see Chapter 1) to open tree files.

If you use PAUP* to create trees, when you preview and/or print the tree, the program will automatically pick the correct format for labeling branch lengths. When PAUP* saves tree files with branch lengths, it also automatically includes those lengths in the correct format. However, when PAUP* opens a saved tree file, it automatically converts the branch length format into the format that is appropriate for the currently selected tree construction method.

There is also the question of what format to use when the tree has been constructed by another program such as Puzzle or MrBayes.
To ensure that branch lengths are displayed correctly, it is important to be sure that when PAUP* gets trees from a file it (1) stores the branch lengths from that file and (2) displays the user-provided branch lengths when it displays and prints the trees.

PAUP* cannot get trees from a file unless it has the data from which the tree was constructed in memory, so the first step is to open and execute the relevant data file (Figure 2.51).

[Figure 2.51: the Open dialog, with Initial mode set to Execute.]

Next, from the Trees menu choose Get Trees From File... (Figure 2.52).

[Figure 2.52: the Trees menu.]

When the resulting dialog opens (Figure 2.53) do not immediately click the Get Trees button. Instead click the Options button, and in the resulting dialog (Figure 2.54) tick the Store branch lengths (if present) box, click the OK button, then click the Get Trees button.

[Figure 2.53: the file dialog listing the MrBayes output files, including MyFile.con, MyFile.p, and MyFile.t.]

[Figure 2.54: the Get Trees options dialog, including the Store branch lengths (if present) and Store tree weights (if present) boxes.]

You may now see the dialog in Figure 2.55. If so, unless you specifically rooted the tree before saving it, click the Yes button. You can always re-root the tree from within the Print dialog.

[Figure 2.55: dialog reading "Rooted tree(s) input but current criterion and/or option settings would specify unrooted trees. Do you want to 'deroot' the tree(s)?"]

Finally, when you preview or print the tree it is essential to tick the Use user-provided branch lengths box in the Print Trees dialog (Figure 2.56) to be sure that the branch lengths displayed are the same as those that were stored in the tree files.
[Figure 2.56: the Print Trees dialog, showing the plot type, the Use user-provided branch lengths box, and the Show branch lengths box.]

To Root or Not to Root?

As discussed in Chapter 1, you need to decide whether and how to root your tree before you present it to the public. The root is the most internal node on a tree, the one that represents the common ancestor of all of the taxa (sequences) on the tree. Visually, we tend to think of that root as a midpoint at the base of a tree, but that may well be misleading both to ourselves and to our audience.

Consider the tree in Figure 2.57. It is a Parsimony tree in the Rectangular Cladogram format. The tree is difficult to read because our eyes tend to place the root at the midpoint on the left, and it appears that there are a multitude of clades descended from that root. In fact, that is not the case at all, as can be seen in the Unrooted Phylogram format of the same tree (Figure 2.58).

[Figure 2.57: Parsimony tree in Rectangular Cladogram format.]

[Figure 2.58: the same tree in Unrooted Phylogram format.]

It is virtually impossible to read the taxon labels in Figure 2.58, but it is clear that there are two clades separated by a very long branch with length 96. Another look at the cladogram in Figure 2.57 shows that there is indeed a branch with length 97, and that everything on one end of that branch is a TEM whereas everything on the other end is an SHV.

Neither of those formats is very helpful in terms of understanding the evolution of the two clades. What about another format, the Unrooted Cladogram? That format doesn't suggest a root, and branch lengths are not proportional to distance (Figure 2.59).

[Figure 2.59: the same tree in Unrooted Cladogram format.]

Both the unrooted phylogram and the unrooted cladogram are "honest" representations in that they don't imply anything about a root about which we may be uncertain.
The problem with the former (Figure 2.58) is that it is unreadable; the problem with the latter (Figure 2.59) is that it seems to be uninterpretable. It is certainly unfamiliar to all but phylogeneticists and systematists. Molecular biologists are unlikely to get much information from Figure 2.58, and that alone is enough to discourage the use of that format. The primary purpose of a printed tree is to help the reader interpret evolutionary relationships. If it does not serve that purpose, it is useless.

The phylogram format (Figure 2.60) might help you out. Figure 2.60 makes it clear that there are two clades that are separated by a very long branch, and the labels are easy to read. However, any sense of the branch lengths within the clades is completely lost because the branch lengths within each clade are very short compared with the length of the branch between the clades.

It is important to realize that, whatever their appearance, the trees in Figures 2.57-2.60 are all unrooted trees. It seems likely that these two clades descended from a common ancestor so long ago that there have been many changes since diverging from that ancestor, but there have been relatively few changes within each of the two clades. To convey all that information, we need to root the tree.

Midpoint Rooting. PAUP* allows us to root the tree in either of two ways. We can choose Print Trees from the Trees menu and click the Rooting button, or we can choose Rooting from the Options menu to specify how we want to root the tree. Whichever we choose, we will see the dialog shown in Figure 2.61.
[Figure 2.61: the dialog for choosing a method for rooting unrooted trees, with options for Outgroup rooting, Lundberg rooting, and Midpoint rooting.]

PAUP* for Windows and Unix: TreeView does not permit midpoint rooting.

PHYLIP: TreeView does not permit midpoint rooting.

There are a variety of choices available for rooting. When the two clades are separated by a very long branch, midpoint rooting makes perfectly good sense. If we use Midpoint rooting, then decide on the phylogram format, we see Figure 2.62. Figure 2.62 still makes it clear that there are two clades, and it now strongly conveys the sense of descent from a common ancestor. However, we still have no sense of the fine structure of either clade.

[Figure 2.62: midpoint-rooted phylogram.]

Despite the fact that phylograms give a much stronger visual sense of branch lengths than do cladograms, we might consider a slanted cladogram that shows branch lengths in order to see the fine structures of the clades (Figure 2.63). The slanted cladogram makes the root obvious, and it allows us to see the branch lengths. It seems like a good choice in this case. It is essential to understand the importance of the phrase "in this case" in the preceding sentence. A slanted cladogram will not always be the best choice. There is no single best choice; it depends entirely on the tree and on what message is most important to convey to the reader.

[Figure 2.63: midpoint-rooted slanted cladogram showing branch lengths.]

Rooting with an Outgroup. You should review the section on rooting trees with an outgroup in Chapter 1 (pp. 53-59), and PAUP* for Macintosh users should review pages 87-88 in this chapter.
There is not always an especially long branch between two clades to help us decide where to place a root. Typically you will need to root the tree with an outgroup. Given a pure molecular clock in which all lineages evolved at the same constant rate, an outgroup sequence would be more distantly related to any of the ingroup sequences than any of them are to each other. Sadly, that tidy molecular clock rarely applies. Indeed, choosing a legitimate outgroup can be one of the more difficult aspects of creating a phylogenetic tree.

It is important to understand that it is not always necessary to root a tree. The sole purpose of rooting is to provide information about the direction of evolution. If you do not root the tree, however, it is important to note in the figure legend that the tree is unrooted. It may even be wise to put an "Unrooted Tree" label somewhere within the figure itself. Even better, present the tree in an "unrooted" or "radial" format to ensure that the reader does not infer a root where none is given.

If you do root a tree with an outgroup, how do you choose that outgroup? Assume that you have used BLAST to locate a set of sequences that are related to your sequence of interest, downloaded those sequences, aligned them, and constructed a tree for which there is no obvious root. You would like to find an outgroup sequence to root the tree. Another BLAST search could identify some sequences that are less closely related than any of those you chose. The problem is that the more distantly related the sequence, the more you can be sure it is a legitimate outgroup, and the worse the resulting alignment will be. The opposite is also typically true: the better a sequence aligns, the less likely it is to be a legitimate outgroup.

There is no easy solution to the problem, but a general approach is to pick a few candidates for the outgroup that themselves constitute a monophyletic clade.
Putting them together with the sequences you already have should result in an unrooted tree that clusters the two clades much as was done with the TEM/SHV sequences in Figures 2.62 and 2.63. Note that it is not necessary for the outgroup members to belong to a monophyletic clade, but it often helps to pick out the outgroup if they are monophyletic.

Another good approach is to use prior knowledge of the evolution of the species from which the sequences were obtained to decide on an outgroup. For instance, if some of the sequences are from Eubacteria and others from Archaea, and you can rule out the possibility of horizontal transfer, then you put the eubacterial sequences into the ingroup and the archaeal sequences into the outgroup.

A Word about Orthology, Paralogy, and Horizontal Transfer. Choosing an outgroup based on prior knowledge of the species involved makes the assumption that the evolution of the sequences is the same as the evolution of the species from which those sequences came. While that will often be the case, it is not always so. At least two kinds of events, horizontal transfer and gene duplication, can lead to the above assumption being false.

Horizontal Transfer. We are most familiar with the problem of horizontal transfer, the transfer of genes between species, in microorganisms, but as more genomes are being sequenced we are realizing that horizontal transfer has also occurred among multicellular eukaryotes. If I chose for my outgroup a sequence that came from Archaea, but that sequence was in fact a recent horizontal transfer from a eubacterial species, my "outgroup" sequence would probably belong within the ingroup. Using that sequence as an outgroup would probably distort the tree so that all congruence between sequence evolution and species evolution is lost. If after rooting a tree you see that the tree bears little resemblance to the generally accepted species evolution tree, it is likely that you have rooted the tree incorrectly.
Gene Duplication. When a gene duplication occurs, the two gene copies begin to accumulate differences and to diverge from each other within the species in which the duplication occurred; call those two copies α and β. Later a speciation event occurs so that we have two species, each descended from the common ancestor in which the duplication event occurred, and each having an α and a β gene. As time goes on the α and β genes continue to diverge. Clearly, the α genes in the two species are more closely related to each other than either is to a β gene, because the α genes only began to diverge after the speciation event, whereas α and β began to diverge immediately after the duplication event, before speciation occurred. The four genes (two α genes and two β genes) are all homologs because they are descended from a common ancestral gene (the gene that was present before the duplication), but they are different kinds of homolog. We define the α and β genes as paralogs because they are derived from a duplication event. The two α genes are orthologs because they are derived from a speciation event. The genes for α- and β-hemoglobin are examples of paralogs.

If we construct a tree that includes both orthologous and paralogous sequences, the members of orthologous groups will cluster together. If we are unaware of the duplication situation (especially if modern species tend to have lost one of the duplicates), the sequence tree will bear little resemblance to the accepted species tree. Were we to choose, say, reptile sequences to use as an outgroup to otherwise mammalian sequences, the "outgroup" would probably include sequences that diverged from each other after members of different orthologs of the ingroup diverged.

Using prior knowledge of species phylogenies to assign outgroups is safe only when you are confident that you are dealing with orthologs and that horizontal transfer has not reared its ugly head.
Choosing What Form of a Tree to Publish

The choice of a tree for publication depends entirely on what makes the information most clear to your audience. That decision requires some consideration of the intended audience. If you are publishing in a molecular evolution journal, an unrooted tree may be a good choice because that audience is likely to be familiar with unrooted trees. The same choice may be a poor one in a molecular biology journal, whose audience is unlikely to be familiar with them.

Cladogram or phylogram? In a phylogram, the branch lengths that are proportional to evolutionary distance (differences) have strong and unambiguous visual impact, but if there are a few very long branches and a lot of short ones, the structure of the short branches may not be visible (e.g., see Figure 2.62).

If you are going to all the trouble of finding and downloading homologous sequences, aligning them, and constructing a tree, it is certainly worth your while to take some time to think seriously about the form of the tree. The main point is that you should make a thoughtful choice. The decision should never be an automatic or default decision.

Making a Tree Pretty: Not Just a Cosmetic Matter

The trees displayed and printed by PAUP* and by TreeView are perfectly good tools for our understanding of the evolution of our sequences, but for the reader who has not been looking at or thinking about those sequences in all their iterations, the final tree could often use some improvement. The purpose of making a tree "pretty" is to make it easier for the reader to understand.

Some of the choices about which tree format to present are a matter of taste and personal prejudice. I like slanted cladograms because they make polytomies so clear; that is, they clearly show when several taxa are descended from a single interior node.
I judge that labeling branch lengths is sufficient to convey evolutionary distance, and that the clarity of the polytomies outweighs the loss of visual information conveyed by visual branch lengths in phylograms. Others will see the matter differently.

Both PAUP* for Macintosh and TreeView allow the user to specify the font and font size for taxon names, and PAUP* allows the user to specify the font and font size for branch lengths and to specify line thickness. Both programs permit the tree drawing to be saved in PICT format (Macintosh) or Metafile (.wmf) format (Windows), which can be used by most drawing programs for those platforms. The ability to open the drawing in drawing programs means that the appearance can be modified fairly easily. Figure 2.64 shows a Bayesian tree that was created by MrBayes, saved as a tree file, and displayed by TreeView. TreeView was used to define the outgroup and to root the tree with the outgroup. Figure 2.65 shows the same tree after making it pretty with a drawing program.

[Figure 2.64: Bayesian tree as displayed by TreeView, rooted with the outgroup.]

[Figure 2.65: the same tree after editing in a drawing program.]

Use a drawing program such as CorelDraw for Windows or Canvas for Macintosh to open the image file that was saved from PAUP* or TreeView. The image will be present as a single object that can be selected. As a single object, there is not much you can do with the image. Select, then Ungroup the object. Ungrouping results in each of the elements of the image (the various lines, numbers, and text features) becoming individual objects that can be manipulated. You are now free to move or eliminate any of those elements as you choose.

The most important adjustment in this case involved manually adding branch lengths to the branches. Note that TreeView X is expected to include the ability to display branches labeled with their lengths, in which case manual addition of branch lengths will be unnecessary.
Manual addition of branch lengths requires being able to interpret the tree description in the tree file. The format for that description is called the Newickian format, named after the restaurant where a group of phylogeneticists gathered to devise the format. It is a surprisingly good format.

    (((((((((OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567):0.039130,
    OXA3:0.076322):0.864728,MAGNETO:1.410002):0.298565,((((OXA10:0.003034,
    OXA17:0.006253):0.025583,(OXA28:0.003536,OXA35:0.001634,OXA19:0.003397):0.011213)
    :0.041238,OXA5:0.270908):0.739753,SPUTREFASCIENS:0.433353):0.254041):0.156620,
    (((OXA25:0.008265,OXA26:0.004931,OXA24:0.008550):0.544334,OXA27:0.180197)
    :0.774431,PAERUGINOSA:0.975052):0.412950):0.133023,CHEJUNI:1.586750):0.124534,
    (BSUBTILIS:0.970492,(((SEPIDERM:0.004029,SAUREUS:0.002604):0.802940,
    CDIFFICILE:0.705384):0.651208,((OXA30:0.002559,OXA31:0.011029,OXA1:0.010656)
    :1.050799,(((OXA6:0.539250,ATUME:0.633595):0.382442,(OXA22:0.601324,
    BPSEUDO:0.454682):0.454261):0.602838,(OXA29:0.221556,LPNEUMO:0.097217):0.625383)
    :0.540152):1.102884):0.262226):0.252886):0.205949,CHUTCHISONII:0.966228)
    :1.087491,NPUNCTIFORME:0);

The interpretation is that the length of a branch leading to a taxon follows the taxon name, separated from that name by a colon. Thus OXA2:0.002297 tells us to label the branch leading to OXA2 as 0.0023 (rounding to four decimal places). By enclosing everything in a single set of parentheses,

    (OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567):0.039130

tells us that OXA2, OXA32, OXA15, and OXA34 form a clade that is descended from a single node, and that the length of the branch leading to that node is 0.0391.
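The colon-and-parentheses convention can be made concrete with a short parser. The following is a minimal sketch, not the code of any program named in this chapter: parse_newick, tip_names, and branch_labels are hypothetical helper names, the parser handles only plain names, colons, commas, and parentheses, and the example uses the first few taxa of the tree description above. Lengths are rounded to four decimal places, as in the text.

```python
def parse_newick(text):
    """Parse a Newick description into nested (name, length, children) tuples."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and final ';'
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while peek() == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ')'
        start = pos
        while peek() and peek() not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if peek() == ":":
            pos += 1  # the branch length follows the colon
            start = pos
            while peek() and peek() not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text after the tree")
    return root


def tip_names(node):
    """List the taxon names at the tips below (or at) a node."""
    name, _, children = node
    if not children:
        return [name]
    return [tip for child in children for tip in tip_names(child)]


def branch_labels(node, places=4):
    """Return (tips joined by '+', rounded length) for every branch with a length."""
    _, length, children = node
    labels = []
    for child in children:
        labels.extend(branch_labels(child, places))
    if length is not None:
        labels.append(("+".join(tip_names(node)), round(length, places)))
    return labels


fragment = ("((OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567)"
            ":0.039130,OXA3:0.076322):0.864728")
for tips, length in branch_labels(parse_newick(fragment)):
    print(tips, length)
```

Each printed line pairs a branch with the label you would type onto it in the drawing program: a single name for a branch leading to a taxon, or the names of all the tips in a clade for the branch leading to that clade.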
((OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567):0.039130,OXA3:0.076322):0.864728

says that the first clade and OXA3 form a clade, that the length of the branch to OXA3 is 0.0763, and that the length of the branch leading to the entire clade is 0.8647. The rule, then, is that an entire clade is enclosed in a set of parentheses and that the length of the branch leading to that clade is separated from the right-most parenthesis by a colon.

Another major adjustment was to change the taxon labels to something that is more easily interpreted by the reader. It really is not helpful to a reader to ask him to keep in mind all of the taxon abbreviations that you chose.

Finally, where there were polytomies I used diagonal lines leading to the horizontal branches to make those polytomies clear to the casual reader.

Figure 2.66 is the upper portion of Figure 2.63. A zero branch length, such as the branch leading to TEM1 in Figure 2.66, means that TEM1 is actually at that node. I think that it makes it much easier to understand the tree if the taxa at interior nodes are represented by taxon labels placed at those nodes. In the drawing program, therefore, I delete such branches and move the labels to the node itself (Figure 2.67). The result is that it is now clear that TEM1 is the ancestor of its entire clade, that TEM77 is descended from TEM30, and that TEM31 is the ancestor of the clade that includes TEM54, TEM73, and TEM74. Again, these are personal preferences based on my judgments about what makes the information clear to the reader.

Figure 2.66

Figure 2.67

Whatever your preferences, the primary goal is to make the tree as easy to interpret as possible for your reader. That requires taking the time to think about things from the perspective of your likely readers.
A tree published in the Journal of Molecular Evolution, where you can reasonably expect your readers to be familiar with evolutionary trees, might look quite different from the same tree published in Cell, where many readers will not be familiar with evolutionary trees.

DNA Phylogeny or Protein Phylogeny: Which Is Better?

Whether we obtain our sequences from DNA or protein databases, it is almost always the case that the original data were in the form of DNA sequence. Thus we almost always have the choice of constructing phylogenies from amino acid or nucleotide alignments. At first it might appear that it simply does not matter, but that is not the case.

Consider the amino acid alignment used to illustrate Chapter 1. Those sequences are quite divergent, even at the protein level. At the DNA level they are likely to be so divergent, particularly at third positions of codons, that it is impossible to obtain a meaningful alignment. Although there are fewer sites in an amino acid alignment, there are 20 possible states at each site instead of four, making it possible to obtain good alignments and therefore to construct valid protein phylogenies when phylogenies of the corresponding DNA sequences would be meaningless.

It might appear from this discussion that protein phylogenies are always preferable to nucleic acid phylogenies. As is usually the case, however, things are not quite that simple.

When the phylogeny is quite shallow (that is, when there has been little divergence among the taxa) there is again likely to be more divergence at the DNA level than at the protein level. In the case of a shallow phylogeny, this fact is helpful. Consider the phylogeny illustrated in Figure 2.63. Those sequences diverged over a very short interval.
Many of the nucleotide substitutions were silent (they did not result in an amino acid replacement). At the amino acid level, the phylogeny has much less structure (that is, many more polytomies) than at the nucleotide level. In this case, the DNA phylogeny is much more useful than the protein phylogeny.

Yet another problem arises when we use ClustalX to align DNA coding sequences. When multiple alignment programs such as ClustalX introduce gaps in order to maximize the alignment score, they do so without regard to codons. Translation of the gapped DNA sequences often produces frameshifted proteins that bear no resemblance to the actual proteins encoded by the genes used to create the alignment.

Gaps represent insertions and deletions (often called indels) that occurred as the ancestral sequence diverged to give rise to the extant sequences in the alignment. Because it seems unlikely that most genes passed through a period of being inactive pseudogenes (due to the indels represented by the gaps), it follows that it is also unlikely that within-codon gaps often represent actual historical indels. Thus, the gaps introduced into DNA coding sequences are likely to be misplaced, with the result that homologous nucleotides within sequences are often misaligned. Trees based on such alignments of DNA coding sequences have the potential for reduced accuracy.

The problem becomes acute when we are interested in reconstructing ancestral states (see Chapter 3). The presence of within-codon gaps easily results in the estimation of ancestral sequences whose protein products bear no resemblance to existing proteins. In those cases, predicting ancestral sequences from DNA sequences is quite useless. The following pages describe how to use a program called CodonAlign to solve the problems caused by introducing within-codon gaps.
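One way to keep gaps out of the middle of codons is to place them in the DNA as whole triplets, guided by an already aligned protein sequence. The Python sketch below illustrates that idea for a single sequence. It is an assumption-laden illustration, not CodonAlign's actual code: the function name thread_gaps and the toy sequences are hypothetical, and the sketch assumes that the coding sequence excludes the termination codon and is exactly three nucleotides per residue.

```python
def thread_gaps(aligned_protein, coding_dna, gap="-"):
    """Copy the gaps of an aligned protein into its coding DNA as triplet gaps."""
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(coding_dna) != 3 * residues:
        # The DNA must cover the protein exactly, with no stop codon.
        raise ValueError("DNA length must be three times the number of residues")
    pieces = []
    position = 0  # index of the next unused nucleotide
    for aa in aligned_protein:
        if aa == gap:
            pieces.append(gap * 3)            # one protein gap becomes one codon-sized gap
        else:
            pieces.append(coding_dna[position:position + 3])
            position += 3
    return "".join(pieces)

# Toy example: a three-residue protein aligned with one gap.
gapped = thread_gaps("MK-L", "ATGAAACTG")
print(gapped)  # ATGAAA---CTG
```

Because every gap is three nucleotides long and falls between codons, removing the gaps and translating the result gives back the original protein, which is the property that makes such an alignment usable for ancestral sequence reconstruction.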
There is another important consideration when deciding between nucleotide and protein phylogenies: if you are using a desktop computer, protein phylogenies of more than about 50 taxa are limited to Neighbor Joining and Parsimony methods. Maximum Likelihood of protein sequences is not implemented in PAUP*. Tree-Puzzle implements protein ML, but large phylogenies would require days to run on typical desktop computers. MrBayes does Bayesian analysis of protein sequences, but it also requires a lot of memory and a lot of time for large datasets.

It would appear that for datasets larger than about 50 taxa involving deep phylogenies, we are pretty much limited to using NJ and Parsimony to construct trees from the protein sequences. If you prefer those methods and you are not interested in estimating ancestral states, there is no problem. If you prefer ML or Bayesian analyses, you can use a little program, CodonAlign 2.0, that is provided on the website to solve the problem.

Using CodonAlign 2.0

Files: CodonAlign 2.0 packages for Mac, Windows, and Unix.

CodonAlign 2.0 uses a protein alignment to introduce gaps (actually triplet gaps) into coding sequences at positions corresponding to the gaps in the aligned protein sequence. The result is a set of aligned DNA coding sequences which, if translated, will regenerate the original protein sequences. Alignments of coding sequences that are done in this fashion are much more biologically realistic than are alignments that are done directly by ClustalX or by other alignment programs. The resulting DNA alignment can then be used for ML or Bayesian tree construction in a fraction of the time that would be required for the corresponding protein alignment.

CodonAlign requires two input files: (1) a file of aligned protein sequences and (2) a file of the corresponding DNA coding sequences.

Aligned protein sequence format.
The protein file must be a text (ASCII) file in PHYLIP interleaved format (see Appendix I, File Formats, for more information about these formats). This is the format in which ClustalX writes PHYLIP files. Sequence or taxon names must not contain any spaces, periods, or dashes and must not exceed nine characters. If you wish, spaces may be replaced by the underscore (_) character. Note that taxon names are case-sensitive.

DNA sequence file format. The DNA sequence file must be a text (ASCII) file in FASTA format (see Appendix I for more on file formats). The file must include only the coding region corresponding to the protein sequence used to create the aligned protein sequences, and the sequences must not include the termination (nonsense) codon. The names of the sequences (taxa) must be identical to the names of the corresponding proteins (i.e., they must not contain any spaces, periods, or dashes). The DNA sequences must be in exactly the same order as the proteins in the aligned protein file.

Creating a Protein File using ClustalX
1. Choose Output Format Options under ClustalX's Alignment menu.
2. In the resulting dialog box, check the PHYLIP box, then change the default Output Order from Aligned to Input.
3. Load the protein sequences into ClustalX and create the alignment as described in Chapter 1. ClustalX will create an output file with the same name as the file you used to input the protein sequences, except that it will have the extension .phy.
4. Use the .phy file as your protein input file for CodonAlign.

Running CodonAlign. The two input files must be in the same folder (directory) as the CodonAlign application (program). Double-click the CodonAlign application icon. CodonAlign will ask you for the name of the aligned protein file. Type the name exactly as it appears in the folder (directory). The file name is case sensitive. CodonAlign will next ask you for the name of the DNA sequence file.
Type the name exactly as it appears in the folder (directory). Finally, CodonAlign will ask you for the name of your output file. The name must not exceed 25 characters.

When the program is done, Quit to close the console window. There is no need to save the contents of the console because it contains no useful information.

Warning!! CodonAlign is a very picky program. Any deviation from the correct formats for the input files, or anything else for that matter, will result in an error. See the description of error messages below.

The Output File. CodonAlign will create two output files using whatever name you choose, appending the extension .nex for the Nexus formatted file and .phylip for the PHYLIP formatted file. The output files will contain the DNA sequences gapped according to the protein alignment. The Nexus file can be used directly by PAUP* and many other programs, including MrBayes. The PHYLIP file can be used as input for PHYLIP and other programs such as Puzzle.

Error messages. If an error is encountered, the program will terminate, but the console window will remain open and will display an error message. Errors include:
* Can't find the protein sequence file. Either you mistyped the name of the file, or the file is not in the same folder (directory) as the application. No output files are saved.
* Can't find the DNA sequence file. Either you mistyped the name of the file, or the file is not in the same folder (directory) as the application. No output files are saved.
* Names of protein and DNA sequence are not identical. Look at the input files. Either some pair of names failed to match exactly, or the sequences are not in the same order in the two files. The output files are saved and include the gapped DNA sequences up to the point of the error.

If you give as the name of an input file the name of some other file that is in the same folder as CodonAlign, then CodonAlign will try to read that file.
If it is a nontext (non-ASCII) file or is not in the correct format for an input file, CodonAlign will lock up and will probably crash your computer. Be careful!

Obtaining CodonAlign 2.0. Download CodonAlign from the website. Three packages are available: one for Macintosh, one for Windows, and one for Unix. The Macintosh and Windows packages include the CodonAlign 2.0 program, documentation in PDF format, and some example files. The Unix package includes the C source code, documentation, and example files.

Advanced Elements in Constructing Trees

This chapter discusses some advanced topics for those who would like to go beyond the basics. You do not need to understand, or even to read, this chapter in order to construct valid, robust trees.

Reconstructing Ancestral DNA Sequences

In some situations, it may be very valuable to know the sequence of a particular length of DNA in the common ancestor of extant taxa. Lacking the ancestral organism itself, it is impossible to determine that sequence experimentally, so we can never be certain of the sequence. We can, however, estimate that sequence.

Imagine that we have constructed a phylogeny of a group of glycosidases, a part of which includes two distinct clades, one consisting entirely of α-glucosidases and the other entirely of α-galactosidases. We would like to identify the amino acid changes that are most likely to be responsible for the different substrate specificities. It would be useful to compare the sequence of the node from which all galactosidases are descended with the node from which all glucosidases are descended, and to compare those with the node that is their immediate ancestor. We might want to go even further and use protein modeling software to model the structures of those ancestral proteins in order to visualize the structural changes that accompanied the substrate changes.
If those comparisons identify a small number of amino acid substitutions, we could introduce those substitutions into extant sequences to determine whether those changes would shift the substrate specificities as expected.

Ancestral Sequences for Parsimony and Maximum Likelihood Trees Using PAUP*

Parsimony Using PAUP* for Macintosh. After creating your parsimony tree, choose Log output to disk... from the File menu. In the resulting Save dialog (Figure 3.1), pick a name for the log file. That file will contain everything that appears in the main display buffer until you turn the logging option off.

Figure 3.1

Next, choose Describe Trees from the Trees menu (Figure 3.2) to reveal the dialog in Figure 3.3.

Figure 3.2

Figure 3.3

Select whichever tree you prefer, and be sure the States for internal nodes and Label internal nodes boxes are checked as shown in Figure 3.3. The current version, 4.0b10, does not permit writing the ancestral sequences in sequential format so that they can be copied and used. The next version, 4.0b11, which will be current in summer of 2004, will permit doing that by providing an option to write an ancestral sequence file. Look for an option or box that will allow you to choose a name for that sequence file and to choose a format (Sequential or Interleaved). Choose Sequential, name the file, then when all is ready click the Describe button. Finally, again choose Log output to disk... from the File menu, and in the resulting dialog click the Stop Saving button.
The ancestral sequences are saved in the log file that is discussed below.

Parsimony Using PAUP* for Windows/Unix. Simply add the following three commands to the end of the PAUP* block in your parsimony execution file:

    Log File = 'myFile.log' Replace = yes Start = yes; [saves the output buffer to a log file]
    DescribeTrees 1 / BrLens = yes LabelNode = yes Xout = internal file = myAncFile interleave = no;
    Log stop = yes; [stops saving the log file]

The first Log File command starts saving the output buffer to a log file.

The DescribeTrees command begins with a tree list, in this case tree number 1. If there is more than one tree, use this option to enter the number of the tree to use for the ancestral state reconstruction. You must provide a tree list followed by the slash. BrLens = yes and LabelNode = yes cause the tree to be printed with branch lengths and with the internal nodes labeled with numbers. That is essential if you are to match the ancestral sequences with their corresponding nodes. Xout = internal tells PAUP* to calculate the sequences of the internal nodes; file = myAncFile and interleave = no tell PAUP* to write the internal sequences in sequential format to a file named myAncFile. Note that the file = and interleave = options are not available in the current version, 4.0b10, but will be available in the next version, 4.0b11, which should be current by May of 2004.

As usual, choose any name you like for the log file and the ancestral state file.

Maximum Likelihood. For both Macintosh and Windows/Unix, add the same three lines to the end of the PAUP* block as described above for parsimony trees using Windows.

Interpreting the log file. The log file includes what amounts to an alignment of the ancestral sequences in interleaved format (Figure 3.4). The numbers across the top of Figure 3.4 are the sites in the alignment. Each internal node is given a number.
A tree is printed at the bottom of the file with the internal nodes numbered, as in Figure 3.5.

Figure 3.4 Reconstructed states for internal nodes (interleaved alignment of the ancestral sequences, with site numbers across the top)

Figure 3.5

Notice that there is a node 19 in the alignment, but no node 19 on the tree.
The highest numbered node is that of the root, in this case based on a default outgroup: the first taxon in the list. This is the same tree as the first tree in Figure 2.36, so we can use it to number the nodes in the Figure 2.36 tree, as shown in Figure 3.6. With the nodes correctly numbered, we can identify the sequence for any node, and we can copy that sequence from the ancestral sequences file for any purpose we wish.

Figure 3.6

Ancestral Sequences for Parsimony using PHYLIP

See Chapter 4 for using PHYLIP to construct protein and DNA parsimony trees. Both the Protpars program and the Dnapars program include a choice, under menu option 5, to Print sequences at all nodes of tree, i.e., the ancestral sequences. Type "5" to change that option to Yes. Upon doing so, a new option labeled "." appears immediately below option 5. Type "." to change Use dot-differencing to display them to No. Now when you run the program, the outfile will include a tree with all of the nodes numbered and a table showing the branch lengths from each node x to node y.

The example in Table 3.1 shows the sequence at each node. The table makes it easy to see exactly how each character changed from node to node, and it can (with a bit of effort) be converted into a usable data file from which individual sequences can be copied. The resemblance between Table 3.1 and an interleaved data file is obvious. The problem is one of conversion. There is no program that does the conversion, but it is not difficult to do manually using Microsoft Word. Begin by copying the entire table into a new Word file and save that file as text only. The first column in the table is From, the second is To, and the third is Any Steps?. The last is the sequence data itself. For the first block, we will eliminate the first and third columns, and for the remaining blocks we will eliminate the first three columns, leaving only the data intact.
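The same column removal can also be sketched in a few lines of Python instead of being done by hand. This is only an illustration under stated assumptions: the helper names are hypothetical, and the column positions used here (From in columns 0 to 9, To in columns 10 to 19, Any Steps? in columns 20 to 27, data from column 28) are invented for the example. The real positions must be read off your own outfile before using anything like this.

```python
def keep_label_and_data(rows, label_start, label_end, data_start, name_width=10):
    """First block: keep the 'To' label, padded to name_width, plus the sequence data."""
    cleaned = []
    for row in rows:
        label = row[label_start:label_end].strip()
        data = row[data_start:].rstrip()
        cleaned.append(label.ljust(name_width) + data)
    return cleaned

def keep_data_only(rows, data_start):
    """Remaining blocks: drop the three leading columns and keep only the data."""
    return [row[data_start:].rstrip() for row in rows]

# Hypothetical fixed-width rows: From (10 wide), To (10 wide), Any Steps? (8 wide), data.
first_block = [
    f"{'':<10}{'1':<10}{'':<8}ATGCGTTCTA CCCTGCTCGC",
    f"{'1':<10}{'7':<10}{'maybe':<8}ATGCGTTCTA CCCTGCTCGC",
]
later_block = [
    f"{'':<10}{'1':<10}{'':<8}CTTCGCCCTG",
    f"{'1':<10}{'7':<10}{'maybe':<8}CTTCGCCCTG",
]

phylip_first = keep_label_and_data(first_block, 10, 20, 28)
phylip_later = keep_data_only(later_block, 28)
for line in phylip_first + [""] + phylip_later:
    print(line)
```

Slicing by fixed column positions, rather than splitting on whitespace, mirrors a rectangular selection and still works for rows in which the From and Any Steps? cells are blank.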
Table 3.1 (columns: From, To, Any Steps?, State at upper node)

Word has a feature called Rectangular selection that allows you to select a vertical portion of text without selecting entire lines. While holding down the option key (Macintosh) or its equivalent in Windows, for the first block only, select the first column right up to the edge of the first taxon name, as shown in Figure 3.7. Delete the selection, then click the | button so that the first block now looks like Figure 3.8.

Figure 3.7

Figure 3.8
Place the cursor at the left end of the top line, then use the arrow keys to move exactly 10 spaces to the right. Hold down the option key and select everything in the first block up to the edge of the sequences, as in Figure 3.9.

Figure 3.9

Delete that material. What remains in the left column is the names of the taxa and nodes, each name exactly 10 characters (including spaces) long. Next, holding down the option key, select everything to the left of the sequences for the remaining blocks and delete that. When you are done the file should look like Figure 3.10.
Figure 3.10

What remains is to delete the top line and, on the empty line below it, enter the number of sequences and the number of characters in each sequence. There are 18 sequences: 10 taxa plus 8 interior nodes. There are 960 characters in the alignment (if you don't remember, just open the infile), so the top of the file now looks like Figure 3.11. This is now a proper PHYLIP file in interleaved format. Save it.

Figure 3.11

If you want to copy a particular node sequence, you need to convert the interleaved file to a sequential file using Seqboot as described in the appendices.

Using Protein Structure Information to Construct Very Deep Phylogenies

In Chapter 2, I emphasized the importance of not including sequences on the same tree unless there is evidence that those sequences are truly homologous.
I suggested that all sequences on the same tree should exhibit significant homology in a pairwise BLAST alignment. What do you do if sequences have diverged so much that no sequence homology can be detected, but other evidence suggests that they are homologs? Typically, that "other evidence" is likely to be similarity of protein structures.

It is not uncommon for two groups of proteins to exhibit so much structural similarity that it is almost certain they descended from a common ancestor, even though sequences within the two groups exhibit no detectable sequence similarity. Is it then reasonable to put members of the two groups into the same alignment and onto the same tree? The short answer is "No! It is not OK." The sequences have diverged so much that alignment programs will not be able to line up homologous amino acids in the same site, and the resulting alignment and tree will be meaningless. We cannot put those two groups onto the same sequence-based tree.

On the other hand, we can put those sequences onto the same structure-based tree. If we assume that homologous sites occupy the same positions in the pro-
