Burnin        <number>      0
Startingtree  Random/User   Random
Nperts        <number>      0
Savebrlens    Yes/No        No

Notice that the help information describes the current setting for each parameter. The settings listed are the default settings, but if you have changed any of those settings, either via the command line or via an execution file, the current settings will be displayed. If you are ever unsure of a parameter setting, simply type help to see the current settings.

Don't be intimidated by all of those choices. Most are of interest only to professional phylogeneticists. You can use the program quite well just by using the example MrBayes blocks provided earlier.

The Bayesian tree of the largeData alignment (one million generations, burnin of 1000 trees) looks like Figure 2.50.

[Figure 2.50]

Basic Elements in Creating and Presenting Trees (Chapter 2)

Presenting and Printing Your Trees

Opening Tree Files in PAUP*

When PAUP* prints the branch lengths on a tree, the format of those lengths depends on what kind of analysis is selected. A Parsimony tree shows the number of changes along the branch, which is thus an integer. Neighbor Joining and Maximum Likelihood trees show the changes per site, and on phylograms they print a scale that allows the reader to correlate branch length with number of changes per site.

PAUP* for Windows and Unix: Use TreeView (see Chapter 1) to open tree files.

If you use PAUP* to create trees, when you preview and/or print the tree, the program will automatically pick the correct format for labeling branch lengths. When PAUP* saves tree files with branch lengths, it also automatically includes those lengths in the correct format. However, when PAUP* opens a saved tree file, it automatically converts the branch length format into the format that is appropriate for the currently selected tree construction method. There is also the question of what format to use when the tree has been constructed by another program such as Puzzle or MrBayes.
To ensure that branch lengths are displayed correctly, it is important to be sure that when PAUP* gets trees from a file it (1) stores the branch lengths from that file and (2) displays the user-provided branch lengths when it displays and prints the trees.

PAUP* cannot get trees from a file unless it has in memory the data from which the tree was constructed, so the first step is to open and execute the relevant data file (Figure 2.51).

[Figure 2.51: Open dialog with Initial mode set to Execute]

Next, from the Trees menu choose Get Trees From File... (Figure 2.52).

[Figure 2.52]

When the resulting dialog opens (Figure 2.53), do not immediately click the Get Trees button. Instead click the Options button, and in the resulting dialog (Figure 2.54) tick the Store branch lengths (if present) box, click the OK button, then click the Get Trees button.

[Figure 2.53: File selection dialog]

[Figure 2.54: Get Trees options dialog, including the Eliminate duplicate trees, Store branch lengths (if present), and Store tree weights (if present) boxes]

You may now see the dialog in Figure 2.55. If so, unless you specifically rooted the tree before saving it, click the Yes button. You can always re-root the tree from within the Print dialog.

[Figure 2.55: Dialog reading "Rooted tree(s) input but current criterion and/or option settings would specify unrooted trees. Do you want to 'deroot' the tree(s)?"]

Finally, when you preview or print the tree it is essential to tick the Use user-provided branch lengths box in the Print Trees dialog (Figure 2.56) to be sure that the branch lengths displayed are the same as those that were stored in the tree files.
[Figure 2.56: Print Trees dialog, including the Use user-provided branch lengths and Show branch lengths boxes]

To Root or Not to Root?

As discussed in Chapter 1, you need to decide whether and how to root your tree before you present it to the public. The root is the most internal node on a tree, the one that represents the common ancestor of all of the taxa (sequences) on the tree. Visually, we tend to think of that root as a midpoint at the base of a tree, but that may well be misleading both to ourselves and to our audience.

Consider the tree in Figure 2.57. It is a Parsimony tree in the Rectangular Cladogram format. The tree is difficult to read because our eyes tend to place the root at the midpoint on the left, and it appears that there are a multitude of clades descended from that root. In fact, that is not the case at all, as can be seen in the Unrooted Phylogram format of the same tree (Figure 2.58).

[Figure 2.57]

[Figure 2.58]

It is virtually impossible to read the taxon labels in Figure 2.58, but it is clear that there are two clades separated by a very long branch with length 96. Another look at the cladogram in Figure 2.57 shows that there is indeed a branch with length 97, and that everything on one end of that branch is a TEM whereas everything on the other end is an SHV.

Neither of those formats is very helpful in terms of understanding the evolution of the two clades. What about another format, the Unrooted Cladogram? That format doesn't suggest a root, and branch lengths are not proportional to distance (Figure 2.59).

[Figure 2.59]

Both the unrooted phylogram and the unrooted cladogram are "honest" representations in that they don't imply anything about a root about which we may be uncertain.
The problem with the former (Figure 2.58) is that it is unreadable; the problem with the latter (Figure 2.59) is that it seems to be uninterpretable. It is certainly unfamiliar to all but phylogeneticists and systematists. Molecular biologists are unlikely to get much information from Figure 2.58, and that alone is enough to discourage the use of that format. The primary purpose of a printed tree is to help the reader interpret evolutionary relationships. If it does not serve that purpose, it is useless.

The phylogram format (Figure 2.60) might help you out. Figure 2.60 makes it clear that there are two clades that are separated by a very long branch, and the labels are easy to read. However, any sense of the branch lengths within the clades is completely lost because the branch lengths within each clade are very short compared with the length of the branch between the clades.

[Figure 2.60]

It is important to realize that, whatever their appearance, the trees in Figures 2.57-2.60 are all unrooted trees. It seems likely that these two clades descended from a common ancestor so long ago that there have been many changes since diverging from that ancestor, but there have been relatively few changes within each of the two clades. To convey all that information, we need to root the tree.

Midpoint Rooting. PAUP* allows us to root the tree in either of two ways. We can choose Print Trees from the Trees menu and click the Rooting button, or we can choose Rooting from the Options menu to specify how we want to root the tree.
Whichever we choose, we will see the dialog shown in Figure 2.61.

[Figure 2.61: Rooting options dialog ("Choose method for rooting unrooted trees"), offering outgroup rooting, Lundberg rooting, and midpoint rooting]

PAUP* for Windows and Unix: TreeView does not permit midpoint rooting.

PHYLIP: TreeView does not permit midpoint rooting.

There are a variety of choices available for rooting. When the two clades are separated by a very long branch, midpoint rooting makes perfectly good sense. If we use Midpoint rooting, then decide on the phylogram format, we see Figure 2.62. Figure 2.62 still makes it clear that there are two clades, and it now strongly conveys the sense of descent from a common ancestor. However, we still have no sense of the fine structure of either clade.

[Figure 2.62]

Despite the fact that phylograms give a much stronger visual sense of branch lengths than do cladograms, we might consider a slanted cladogram that shows branch lengths in order to see the fine structures of the clades (Figure 2.63). The slanted cladogram makes the root obvious, and it allows us to see the branch lengths. It seems like a good choice in this case. It is essential to understand the importance of the phrase "in this case" in the preceding sentence. A slanted cladogram will not always be the best choice. There is no single best choice; it depends entirely on the tree and on what message is most important to convey to the reader.

[Figure 2.63]

Rooting with an Outgroup. You should review the section on rooting trees with an outgroup in Chapter 1 (pp. 53-59), and PAUP* for Macintosh users should review pages 87-88 in this chapter.
There is not always an especially long branch between two clades to help us decide where to place a root. Typically you will need to root the tree with an outgroup. Given a pure molecular clock in which all lineages evolved at the same constant rate, an outgroup sequence would be more distantly related to any of the ingroup sequences than any of them are to each other. Sadly, that tidy molecular clock rarely applies. Indeed, choosing a legitimate outgroup can be one of the more difficult aspects of creating a phylogenetic tree.

It is important to understand that it is not always necessary to root a tree. The sole purpose of rooting is to provide information about the direction of evolution. If you do not root the tree, however, it is important to note in the figure legend that the tree is unrooted. It may even be wise to put an "Unrooted Tree" label somewhere within the figure itself. Even better, present the tree in an "unrooted" or "radial" format to ensure that the reader does not infer a root where none is given.

If you do root a tree with an outgroup, how do you choose that outgroup? Assume that you have used BLAST to locate a set of sequences that are related to your sequence of interest, downloaded those sequences, aligned them, and constructed a tree for which there is no obvious root. You would like to find an outgroup sequence to root the tree. Another BLAST search could identify some sequences that are less closely related than any of those you chose. The problem is that the more distantly related the sequence, the more you can be sure it is a legitimate outgroup, and the worse the resulting alignment will be. The opposite is also typically true: the better a sequence aligns, the less likely it is to be a legitimate outgroup.

There is no easy solution to the problem, but a general approach is to pick a few candidates for the outgroup that themselves constitute a monophyletic clade.
Putting them together with the sequences you already have should result in an unrooted tree that clusters the two clades much as was done with the TEM/SHV sequences in Figures 2.62 and 2.63. Note that it is not necessary for the outgroup members to belong to a monophyletic clade, but it often helps to pick out the outgroup if they are monophyletic.

Another good approach is to use prior knowledge of the evolution of the species from which the sequences were obtained to decide on an outgroup. For instance, if some of the sequences are from Eubacteria and others from Archaea, and you can rule out the possibility of horizontal transfer, then you put the eubacterial sequences into the ingroup and the archaeal sequences into the outgroup.

A Word about Orthology, Paralogy, and Horizontal Transfer. Choosing an outgroup based on prior knowledge of the species involved makes the assumption that the evolution of the sequences is the same as the evolution of the species from which those sequences came. While that will often be the case, it is not always so. At least two kinds of events, horizontal transfer and gene duplication, can lead to the above assumption being false.

Horizontal Transfer. We are most familiar with the problem of horizontal transfer, the transfer of genes between species, in microorganisms, but as more genomes are being sequenced we are realizing that horizontal transfer has occurred among multicellular eukaryotes. If I chose for my outgroup a sequence that came from Archaea, but that sequence was in fact a recent horizontal transfer from a eubacterial species, my "outgroup" sequence would probably belong within the ingroup. Using that sequence as an outgroup would probably distort the tree so that all congruence between sequence evolution and species evolution is lost.
If after rooting a tree you see that the tree bears little resemblance to the generally accepted species evolution tree, it is likely that you have rooted the tree incorrectly.

Gene Duplication. When a gene duplication occurs, the two gene copies begin to accumulate differences and to diverge from each other within the species in which the duplication occurred; call those two copies α and β. Later a speciation event occurs so that we have two species, each descended from the common ancestor in which the duplication event occurred, and each having an α and a β gene. As time goes on the α and β genes continue to diverge. Clearly, the α genes in the two species are more closely related to each other than either is to a β gene, because the α genes only began to diverge after the speciation event, whereas α and β began to diverge immediately after the duplication event, before speciation occurred. The four genes (two α genes and two β genes) are all homologs because they are descended from a common ancestral gene (the gene that was present before the duplication), but they are different kinds of homolog. We define the α and β genes as paralogs because they are derived from a duplication event. The two α genes are orthologs because they are derived from a speciation event. The genes for α- and β-hemoglobin are examples of paralogs.

If we construct a tree that includes both orthologous and paralogous sequences, the members of orthologous groups will cluster together. If we are unaware of the duplication situation (especially if modern species tend to have lost one of the duplicates), the sequence tree will bear little resemblance to the accepted species tree. Were we to choose, say, reptile sequences to use as an outgroup to otherwise mammalian sequences, the "outgroup" would probably include sequences that diverged from each other after members of different orthologs of the ingroup diverged.
Using prior knowledge of species phylogenies to assign outgroups is safe only when you are confident that you are dealing with orthologs and that horizontal transfer has not reared its ugly head.

Choosing What Form of a Tree to Publish

The choice of a tree for publication depends entirely on what makes the information most clear to your audience. That decision requires some consideration of the intended audience. If you are publishing in a molecular evolution journal, an unrooted tree may be a good choice because that audience is likely to be familiar with unrooted trees. The same choice may be a poor one in a molecular biology journal, whose audience is unlikely to be familiar with them.

Cladogram or phylogram? In a phylogram, the branch lengths that are proportional to evolutionary distance (differences) have strong and unambiguous visual impact, but if there are a few very long branches and a lot of short ones, the structure of the short branches may not be visible (e.g., see Figure 2.62).

If you are going to all the trouble of finding and downloading homologous sequences, aligning them, and constructing a tree, it is certainly worth your while to take some time to think seriously about the form of the tree. The main point is that you should make a thoughtful choice. The decision should never be an automatic or default decision.

Making a Tree Pretty: Not Just a Cosmetic Matter

The trees displayed and printed by PAUP* and by TreeView are perfectly good tools for our understanding of the evolution of our sequences, but for the reader who has not been looking at or thinking about those sequences in all their iterations, the final tree could often use some improvement. The purpose of making a tree "pretty" is to make it easier for the reader to understand.

Some of the choices about which tree format to present are a matter of taste and personal prejudice.
I like slanted cladograms because they make polytomies so clear; that is, they clearly show when several taxa are descended from a single interior node. I judge that labeling branch lengths is sufficient to convey evolutionary distance, and that the clarity of the polytomies outweighs the loss of visual information conveyed by visual branch lengths in phylograms. Others will see the matter differently.

Both PAUP* for Macintosh and TreeView allow the user to specify the font and font size for taxon names, and PAUP* allows the user to specify the font and font size for branch lengths and to specify line thickness. Both programs permit the tree drawing to be saved in PICT format (Macintosh) or Metafile (.wmf) format (Windows) that can be used by most drawing programs for those platforms. The ability to open the drawing in drawing programs means that the appearance can be modified fairly easily.

Figure 2.64 shows a Bayesian tree that was created by MrBayes, saved as a tree file, and displayed by TreeView. TreeView was used to define the outgroup and to root the tree with the outgroup. Figure 2.65 shows the same tree after making it pretty with a drawing program.

[Figure 2.64]

[Figure 2.65]

Use a drawing program such as CorelDraw for Windows or Canvas for Macintosh to open the image file that was saved from PAUP* or TreeView. The image will be present as a single object that can be selected. As a single object, there is not much you can do with the image. Select, then Ungroup the object. Ungrouping results in each of the elements of the image (the various lines, numbers, and text features) becoming individual objects that can be manipulated. You are now free to move or eliminate any of those elements as you choose.

The most important adjustment in this case involved manually adding branch lengths to the branches.
Note that TreeView X is expected to include the ability to display branches labeled with their lengths, in which case manual addition of branch lengths will be unnecessary. Manual addition of branch lengths requires being able to interpret the tree description in the tree file. The format for that description is called the Newick format, named after the restaurant where a group of phylogeneticists gathered to devise the format. It is a surprisingly good format.

    (((((((((OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567):0.039130,
    OXA3:0.076322):0.864728,MAGNETO:1.410002):0.298565,((((OXA10:0.003034,
    OXA17:0.006253):0.025583,(OXA28:0.003536,OXA35:0.001634,OXA19:0.003397):0.011213)
    :0.041238,OXA5:0.270908):0.739753,SPUTREFASCIENS:0.433353):0.254041):0.156620,
    (((OXA25:0.008265,OXA26:0.004931,OXA24:0.008550):0.544334,OXA27:0.180197)
    :0.774431,PAERUGINOSA:0.975052):0.412950):0.133023,CHEJUNI:1.586750):0.124534,
    (BSUBTILIS:0.970492,(((SEPIDERM:0.004029,SAUREUS:0.002604):0.802940,
    CDIFFICILE:0.705384):0.651208,((OXA30:0.002559,OXA31:0.011029,OXA1:0.010656)
    :1.050799,(((OXA6:0.539250,ATUME:0.633595):0.382442,(OXA22:0.601324,
    BPSEUDO:0.454682):0.454261):0.602838,(OXA29:0.221556,LPNEUMO:0.097217)
    :0.625383):0.540152):1.102884):0.262226):0.252886):0.205949,
    CHUTCHISONII:0.966228):1.087491,NPUNCTIFORME:0);

The interpretation is that the length of a branch leading to a taxon is given immediately after the taxon name, separated from that name by a colon. Thus OXA2:0.002297 tells us to label the branch leading to OXA2 as 0.0023 (rounding to four decimal places). By enclosing everything in a single set of parentheses,

    (OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567):0.039130

tells us that OXA2, OXA32, OXA15, and OXA34 form a clade that is descended from a single node, and that the length of the branch leading to that node is 0.0391.
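The reading of the tree description given so far can be sketched in a few lines of Python. This is only an illustration, not part of PAUP*, TreeView, or MrBayes: the function names leaf_branch_lengths and clade_branch_lengths are hypothetical, and the sketch assumes plain decimal branch lengths (no scientific notation) and taxon names made of letters, digits, and underscores.

```python
import re


def leaf_branch_lengths(newick, digits=4):
    """Map each taxon to the rounded length that follows 'name:'.

    Assumes names contain only letters, digits, and underscores and that
    lengths are plain decimals; both are assumptions of this sketch.
    """
    pairs = re.findall(r"([A-Za-z_][A-Za-z0-9_]*):([0-9.]+)", newick)
    return {name: round(float(length), digits) for name, length in pairs}


def clade_branch_lengths(newick, digits=4):
    """List the rounded lengths that follow a closing parenthesis, in order.

    Each of these is the length of the branch leading to a whole clade.
    """
    return [round(float(x), digits) for x in re.findall(r"\):([0-9.]+)", newick)]


# A small piece of the tree description, closed off so it stands alone.
piece = ("((OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567)"
         ":0.039130,OXA3:0.076322):0.864728;")

print(leaf_branch_lengths(piece)["OXA2"])   # -> 0.0023
print(clade_branch_lengths(piece))          # -> [0.0391, 0.8647]
```

The regular expressions only pull out labels for annotating a drawing; they do not rebuild the tree structure, so a full Newick parser would be needed for anything beyond reading off the lengths.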
Extending the same example by one level,

    ((OXA2:0.002297,OXA32:0.004205,OXA15:0.004515,OXA34:0.004567):0.039130,OXA3:0.076322):0.864728

says that the first clade and OXA3 form a clade, that the length of the branch to OXA3 is 0.0763, and that the length of the branch leading to the entire clade is 0.8647. The rule, then, is that an entire clade is enclosed in a set of parentheses and that the length of a branch leading to that clade is separated from the right-most parenthesis by a colon.

Another major adjustment was to change the taxon labels to something that is more easily interpreted by the reader. It really is not helpful to ask readers to keep in mind all of the taxon abbreviations that you chose.

Finally, where there were polytomies I used diagonal lines leading to the horizontal branches to make those polytomies clear to the casual reader.

Figure 2.66 is the upper portion of Figure 2.63. A zero branch length, such as the branch leading to TEM1 in Figure 2.66, means that TEM1 is actually at that node. I think that it makes it much easier to understand the tree if the taxa at interior nodes are represented by taxon labels placed at those nodes. In the drawing program, therefore, I delete such branches and move the labels to the node itself (Figure 2.67). The result is that it is now clear that TEM1 is the ancestor of its entire clade, that TEM77 is descended from TEM30, and that TEM31 is the ancestor of the clade that includes TEM54, TEM73, and TEM74. Again, these are personal preferences based on my judgments about what makes the information clear to the reader.

[Figure 2.66]

[Figure 2.67]

Whatever your preferences, the primary goal is to make the tree as easy to interpret as possible for your reader. That requires taking the time to think about things from the perspective of your likely readers.
A tree published in the Journal of Molecular Evolution, where you can reasonably expect your readers to be familiar with evolutionary trees, might look quite different from the same tree published in Cell, where many readers will not be familiar with evolutionary trees.

DNA Phylogeny or Protein Phylogeny: Which Is Better?

Whether we obtain our sequences from DNA or protein databases, it is almost always the case that the original data were in the form of DNA sequence. Thus we almost always have the choice of constructing phylogenies from amino acid or nucleotide alignments. At first it might appear that it simply does not matter, but that is not the case.

Consider the amino acid alignment used to illustrate Chapter 1. Those sequences are quite divergent, even at the protein level. At the DNA level they are likely to be so divergent, particularly at third positions of codons, that it is impossible to obtain a meaningful alignment. Although there are fewer sites in an amino acid alignment, there are 20 possible states at each site instead of four, making it possible to obtain good alignments and therefore to construct valid protein phylogenies when phylogenies of the corresponding DNA sequences would be meaningless.

It might appear from this discussion that protein phylogenies are always preferable to nucleic acid phylogenies. As is usually the case, however, things are not quite that simple.

When the phylogeny is quite shallow (that is, when there has been little divergence among the taxa), there is again likely to be more divergence at the DNA than at the protein level. In the case of a shallow phylogeny, this fact is helpful. Consider the phylogeny illustrated in Figure 2.63. Those sequences diverged over a very short interval.
Many of the nucleotide substitutions were silent (they did not result in an amino acid replacement). At the amino acid level, the phylogeny has much less structure (that is, many more polytomies) than at the nucleotide level. In this case, the DNA phylogeny is much more useful than the protein phylogeny.

Yet another problem arises when we use ClustalX to align DNA coding sequences. When multiple alignment programs such as ClustalX introduce gaps in order to maximize the alignment score, they do so without regard to codons. Translation of the gapped DNA sequences often produces frameshifted proteins that bear no resemblance to the actual proteins encoded by the genes used to create the alignment.

Gaps represent insertions and deletions (often called indels) that occurred as the ancestral sequence diverged to give rise to the extant sequences in the alignment. Because it seems unlikely that most genes passed through a period of being inactive pseudogenes (due to the indels represented by the gaps), it follows that it is also unlikely that within-codon gaps often represent actual historical indels. Thus, the gaps introduced into DNA coding sequences are likely to be misplaced, with the result that homologous nucleotides within sequences are often misaligned. Trees based on such alignments of DNA coding sequences have the potential for reduced accuracy.

The problem becomes acute when we are interested in reconstructing ancestral states (see Chapter 3). The presence of within-codon gaps easily results in the estimation of ancestral sequences whose protein products bear no resemblance to existing proteins. In those cases, predicting ancestral sequences from DNA sequences is quite useless. The following pages describe how to use a program called CodonAlign to solve the problems caused by introducing within-codon gaps.
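One way to avoid within-codon gaps is to place gaps in the coding sequence only as whole triplets, guided by the gaps in an aligned protein sequence. The Python sketch below illustrates that idea for a single sequence. It is not the CodonAlign source code; the function names thread_codons and gaps_respect_codons are hypothetical, and the sketch assumes that the coding sequence is in frame, excludes the termination codon, and corresponds exactly to the ungapped protein.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Gap a coding sequence in whole-codon units, following a protein alignment.

    aligned_protein: one row of a protein alignment, with gap characters.
    cds: the ungapped coding sequence for that protein (no stop codon).
    """
    residues = aligned_protein.replace(gap, "")
    if len(cds) != 3 * len(residues):
        raise ValueError("coding sequence must be 3 x the number of residues")
    pieces = []
    pos = 0
    for symbol in aligned_protein:
        if symbol == gap:
            pieces.append(gap * 3)           # one protein gap -> one triplet gap
        else:
            pieces.append(cds[pos:pos + 3])  # one residue -> its codon
            pos += 3
    return "".join(pieces)


def gaps_respect_codons(aligned_cds, gap="-"):
    """Return True if every codon position is either gap-free or all gap."""
    if len(aligned_cds) % 3 != 0:
        return False
    for i in range(0, len(aligned_cds), 3):
        codon = aligned_cds[i:i + 3]
        if gap in codon and codon != gap * 3:
            return False
    return True


gapped = thread_codons("MK-L", "ATGAAACTG")
print(gapped)                                # -> ATGAAA---CTG
print(gaps_respect_codons(gapped))           # -> True
print(gaps_respect_codons("ATGA-AACTG--"))   # -> False
```

The second function is a quick check that can be run on any aligned coding sequence: a False result means at least one gap falls inside a codon, which is the situation that leads to frameshifted translations.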
There is another important consideration when deciding between nucleotide and protein phylogenies: If you are using a desktop computer, protein phylogenies of more than about 50 taxa are limited to Neighbor Joining and Parsimony methods. Maximum Likelihood of protein sequences is not implemented in PAUP*. Tree-Puzzle implements protein ML, but large phylogenies would require days to run on typical desktop computers. MrBayes does Bayesian analysis of protein sequences, but also requires a lot of memory and a lot of time for large datasets.

It would appear that for datasets larger than about 50 taxa involving deep phylogenies, we are pretty much limited to using NJ and Parsimony to construct trees from the protein sequences. If you prefer those methods and you are not interested in estimating ancestral states, there is no problem. If you prefer ML or Bayesian analyses, you can use a little program, CodonAlign 2.0, that is provided on the website to solve the problem.

Using CodonAlign 2.0

On the website:
CodonAlign: CodonAlign 2.0 Mac Package
CodonAlign: CodonAlign 2.0 Win Package
CodonAlign: CodonAlign 2.0 Unix Package

CodonAlign 2.0 uses a protein alignment to introduce gaps (actually triplet gaps) into coding sequences at positions corresponding to the gaps in the aligned protein sequence. The result is a set of aligned DNA coding sequences which, if translated, will regenerate the original protein sequences. Alignments of coding sequences that are done in this fashion are much more biologically realistic than are alignments that are done directly by ClustalX or by other alignment programs. The resulting DNA alignment can then be used for ML or Bayesian tree construction in a fraction of the time that would be required for the corresponding protein alignment.

CodonAlign requires two input files: (1) a file of aligned protein sequences and (2) a file of the corresponding DNA coding sequences.

Aligned protein sequence format.
The protein file must be a text (ASCII) file in PHYLIP interleaved format (see Appendix I, File Formats, for more information about these formats). This is the format in which ClustalX writes PHYLIP files. Sequence or taxon names must not contain any spaces, periods, or dashes and must not exceed nine characters. If you wish, spaces may be replaced by the underscore (_) character. Note that taxon names are case-sensitive.

DNA sequence file format. The DNA sequence file must be a text (ASCII) file in FASTA format (see Appendix I for more on file formats). The file must include only the coding region corresponding to the protein sequence used to create the aligned protein sequences, and the sequences must not include the termination (nonsense) codon.

Creating a Protein File Using ClustalX

1. Choose ClustalX's Output Format Options under the Alignment menu.
2. In the resulting dialog box, check the PHYLIP box, then change the default Output Order from Aligned to Input.
3. Load the protein sequences into ClustalX and create the alignment as described in Chapter 1. ClustalX will create an output file with the same name as the file you used to input the protein sequences except that it will have the extension .phy.
4. Use the .phy file as your protein input file for CodonAlign.

The names of the sequences (taxa) must be identical to the names of the corresponding proteins (i.e., they must not contain any spaces, periods, or dashes). The DNA sequences must be in exactly the same order as the proteins in the aligned protein file.

Running CodonAlign. The two input files must be in the same folder (directory) as the CodonAlign application (program).

Double-click the CodonAlign application icon. CodonAlign will ask you for the name of the aligned protein file. Type the name exactly as it appears in the folder (directory). The file name is case-sensitive.

CodonAlign will next ask you for the name of the DNA sequence file.
Type the name exactly as it appears in the folder (directory).

Finally, CodonAlign will ask you for the name of your output file. The name must not exceed 25 characters.

When the program is done, Quit to close the console window. There is no need to save the contents of the console because it contains no useful information.

Warning!! CodonAlign is a very picky program. Any deviations from the correct formats for the input files, or anything else for that matter, will result in an error. See the description of error messages below.

The Output File. CodonAlign will create two output files using whatever name you choose and append the extensions .nex for the Nexus formatted file and .phylip for the PHYLIP formatted file. The output files will contain the DNA sequence gapped according to the protein alignment. The Nexus file can be used directly by PAUP* and many other programs, including MrBayes. The PHYLIP file can be used as input for PHYLIP and other programs such as Puzzle.

Error messages. If an error is encountered, the program will terminate but the console window will remain open and will display an error message. Errors include:

* Can't find the protein sequence file. Either you mistyped the name of the file, or the file is not in the same folder (directory) as the application. No output files are saved.
* Can't find the DNA sequence file. Either you mistyped the name of the file, or the file is not in the same folder (directory) as the application. No output files are saved.
* Names of protein and DNA sequence are not identical. Look at the input files. Either some pair of names failed to match exactly, or the sequences are not in the same order in the two files. The output files are saved and include the gapped DNA sequences up to the point of the error.

If you give as the name of an input file the name of some other file that is in the same folder as CodonAlign, then CodonAlign will try to read that file.
If it is a nontext (non-ASCII) file or is not in the correct format for an input file, CodonAlign will lock up and will probably crash your computer. Be careful!

Obtaining CodonAlign 2.0. Download CodonAlign from the website. Three packages are available: one for Macintosh, one for Windows, and one for Unix. The Macintosh and Windows packages include the CodonAlign 2.0 program, documentation in PDF format, and some example files. The Unix package includes the C source code, documentation, and example files.

Chapter 3: Advanced Elements in Constructing Trees

This chapter discusses some advanced topics for those who would like to go beyond the basics. You do not need to understand, or even to read, this chapter in order to construct valid, robust trees.

Reconstructing Ancestral DNA Sequences

In some situations, it may be very valuable to know the sequence of a particular length of DNA in the common ancestor of extant taxa. Lacking the ancestral organism itself, it is impossible to determine that sequence experimentally, so we can never be certain of the sequence. We can, however, estimate that sequence.

Imagine that we have constructed a phylogeny of a group of glycosidases, a part of which includes two distinct clades, one consisting entirely of α-glucosidases and the other entirely of α-galactosidases. We would like to identify the amino acid changes that are most likely to be responsible for the different substrate specificities. It would be useful to compare the sequence of the node from which all galactosidases are descended with the node from which all glucosidases are descended, and to compare those with the node that is their immediate ancestor. We might want to go even further and use protein modeling software to model the structures of those ancestral proteins in order to visualize the structural changes that accompanied the substrate changes.
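Comparing two node sequences, as described above, amounts to listing the aligned sites at which they differ. The following Python sketch illustrates the idea under stated assumptions: the function name `differing_sites` and the two short sequences are invented for illustration, and the sequences are assumed to be already aligned to the same length. It is not part of PAUP*, PHYLIP, or any other program discussed here.

```python
def differing_sites(seq_a, seq_b, skip="-?"):
    """List (site, state_a, state_b) for aligned sites where two sequences differ.

    Sites are numbered from 1. A site is ignored when either sequence holds
    a gap or unknown state there (any character listed in `skip`).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for site, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a in skip or b in skip:
            continue
        if a != b:
            changes.append((site, a, b))
    return changes


# Made-up sequences standing in for two reconstructed ancestral nodes.
node_glucosidase = "MKTAYIAK-RQ"
node_galactosidase = "MKSAYIAKQRE"

for site, a, b in differing_sites(node_glucosidase, node_galactosidase):
    print(f"site {site}: {a} -> {b}")
```

A short list of differences is the kind of result that makes it practical to test individual substitutions experimentally.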
If those comparisons identify a small number of amino acid substitutions, we could introduce those substitutions into extant sequences to determine whether those changes would shift the substrate specificities as expected.

Ancestral Sequences for Parsimony and Maximum Likelihood Trees Using PAUP*

Parsimony Using PAUP* for Macintosh. After creating your parsimony tree, choose Log output to disk... from the File menu. In the resulting Save dialog (Figure 3.1), pick a name for the log file. That file will contain everything that appears in the main display buffer until you turn the logging option off.

Figure 3.1

Next, choose Describe Trees from the Trees menu (Figure 3.2) to reveal the dialog in Figure 3.3.

Figure 3.2 (the Trees menu)

Figure 3.3 (the Tree Description Options dialog)

Select whichever tree you prefer, and be sure the States for internal nodes and Label internal nodes boxes are checked as shown in Figure 3.3. The current version, 4.0b10, does not permit writing the ancestral sequences in sequential format so that they can be copied and used. The next version, 4.0b11, which will be current in summer of 2004, will permit doing that by providing an option to write an ancestral sequence file. Look for an option or box that will allow you to choose a name for that sequence file and to choose a format (Sequential or Interleaved). Choose sequential, name the file, then when all is ready click the Describe button. Finally, again choose Log output to disk... from the File menu, and in the resulting dialog click the Stop Saving button.
The ancestral sequences are saved in the log file that is discussed below.

Parsimony Using PAUP* for Windows/Unix. Simply add the following three commands to the end of the PAUP* block in your parsimony execution file:

    Log File = 'myFile.log' Replace = yes Start = Yes; [saves the output buffer to a log file]
    DescribeTrees 1/ BrLens = yes LabelNode = yes Xout = internal file = myAncFile interleave = no;
    Log stop = yes; [stops saving the log file]

The first Log File command starts saving the output buffer to a log file. The DescribeTrees command begins with a tree list, in this case tree number 1. If there is more than one tree, use this option to enter the number of the tree to use for the ancestral state reconstruction. You must provide a tree list followed by the slash. BrLens = yes and LabelNode = yes cause the tree to be printed with branch lengths and with the internal nodes labeled with numbers. That is essential if you are to match the ancestral sequences with their corresponding nodes. Xout = internal tells PAUP* to calculate the sequences of the internal nodes; file = myAncFile and interleave = no tell PAUP* to write the internal sequences in sequential format to a file named myAncFile. Note that the file = and interleave = options are not available in the current version, 4.0b10, but will be available in the next version, 4.0b11, which should be current by May of 2004. As usual, choose any name you like for the log file and the ancestral state file.

Maximum Likelihood. For both Macintosh and Windows/Unix, add the same three lines to the end of the PAUP* block as described above for parsimony trees using Windows.

Interpreting the log file. The log file includes what amounts to an alignment of the ancestral sequences in interleaved format (Figure 3.4). The numbers across the top of Figure 3.4 are the sites in the alignment. Each internal node is given a number.
A tree is printed at the bottom of the file with the internal nodes numbered, as in Figure 3.5.

Figure 3.4 Reconstructed states for internal nodes

Figure 3.5

Notice that there is a node 19 in the alignment, but no node 19 on the tree.
The highest numbered node is that of the root, in this case based on a default outgroup, the first taxon in the list. This is the same tree as the first tree in Figure 2.36, so we can use it to number the nodes in the Figure 2.36 tree, as shown in Figure 3.6. With the nodes correctly numbered, we can identify the sequence for any node and we can copy that sequence from the ancestral sequences file for any purpose we wish.

Figure 3.6

Ancestral Sequences for Parsimony using PHYLIP

See Chapter 4 for using PHYLIP to construct protein and DNA parsimony trees. Both the Protpars program and the Dnapars program include a choice, under menu option 5, to Print sequences at all nodes of tree, i.e., the ancestral sequences. Type "5" to change that option to Yes. Upon doing so, a new option labeled "." appears immediately below option 5. Type "." to change Use dot-differencing to display them to No. Now when you run the program, the outfile will include a tree with all of the nodes numbered and a table showing the branch lengths from each node x to node y.

The example in Table 3.1 shows the sequence at each node. The table makes it easy to see exactly how each character changed from node to node, and it can (with a bit of effort) be converted into a usable data file from which individual sequences can be copied. The resemblance between Table 3.1 and an interleaved data file is obvious. The problem is one of conversion. There is no program that does the conversion, but it is not difficult to do manually using Microsoft Word. Begin by copying the entire table into a new Word file and save that file as text only. The first column in the table is From, the second is To, and the third is Any Steps?. The last is the sequence data itself.
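For readers comfortable with scripting, the column layout just described can also be handled by a short script. The Python sketch below is not part of PHYLIP and is only an illustration under several assumptions: the function names `parse_node_table` and `to_sequential_phylip` are invented; every data row is assumed to read From, To, Any Steps?, then the sequence groups, except the first row of each block, which is assumed to carry only a To label; the Any Steps? values are assumed to be the lowercase words yes, no, or maybe; and node and taxon labels are assumed to contain no spaces. The small table in `SAMPLE` is made up.

```python
STEP_FLAGS = {"yes", "no", "maybe"}  # assumed values of the "Any Steps?" column


def parse_node_table(text):
    """Join the blocks of a From/To/Any Steps?/State table into full sequences.

    Returns a dict mapping each node or taxon label (the To column) to its
    complete sequence, in order of first appearance.
    """
    sequences = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "From" or tokens[0].startswith("("):
            continue  # blank line, column header, or explanatory note
        if len(tokens) >= 3 and tokens[2] in STEP_FLAGS:
            label, chunks = tokens[1], tokens[3:]  # drop From and Any Steps?
        else:
            label, chunks = tokens[0], tokens[1:]  # first row: To label only
        sequences[label] = sequences.get(label, "") + "".join(chunks)
    return sequences


def to_sequential_phylip(sequences):
    """Format the sequences as sequential PHYLIP text with 10-character names."""
    n_chars = len(next(iter(sequences.values())))
    lines = [f" {len(sequences)} {n_chars}"]
    for label, seq in sequences.items():
        lines.append(label[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"


# A made-up two-block table in the assumed layout.
SAMPLE = """\
From    To     Any Steps?    State at upper node

          1                ATGCGTT?TA CCCTG
  1       7        maybe   ATGCGTT?TA CCCTG
  7       nia       yes    ATGCGTTCTA CCCTG

From    To     Any Steps?    State at upper node

          1                CTTCG
  1       7        maybe   CTTCG
  7       nia       yes    CTTCA
"""

print(to_sequential_phylip(parse_node_table(SAMPLE)))
```

Because the script writes sequential rather than interleaved output, each node's full sequence sits on one line and can be copied directly. Check the result against the outfile before relying on it, since a real outfile may differ from the layout assumed here.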
For the first block, we will eliminate the first and third columns, and for the remaining blocks we will eliminate the first three columns, leaving only the data intact.

Table 3.1 Sequences at each node as written to the PHYLIP outfile (columns: From, To, Any Steps?, State at upper node)

Word has a feature called Rectangular selection that allows you to select a vertical portion of text without selecting entire lines. While holding down the option key (Macintosh) or its equivalent in Windows, for the first block only, select the first column right up to the edge of the first taxon name, as shown in Figure 3.7. Delete the selection, then click the | button so that the first block now looks like Figure 3.8.

Figure 3.7

Figure 3.8

Place the cursor at the left end of the top line, then use the arrow keys to move exactly 10 spaces to the right. Hold down the option key and select everything in the first block up to the edge of the sequences, as in Figure 3.9.

Figure 3.9

Delete that material. What remains in the left column is the names of the taxa and nodes, each name exactly 10 characters (including spaces) long. Next, holding down the option key, select everything to the left of the sequences for the remaining blocks and delete that. When you are done, the file should look like Figure 3.10.

Figure 3.10

What remains is to delete the top line, and on the empty line below it enter the number of sequences and the number of characters in each sequence. There are 18 sequences: 10 taxa plus 8 interior nodes. There are 960 characters in the alignment (if you don't remember, just open the infile), so the top of the file now looks like Figure 3.11. This is now a proper PHYLIP file in interleaved format. Save it.

Figure 3.11

If you want to copy a particular node sequence, you need to convert the interleaved file to a sequential file using Seqboot as described in the Appendix.

Using Protein Structure Information to Construct Very Deep Phylogenies

In Chapter 2, I emphasized the importance of not including sequences on the same tree unless there is evidence that those structures are truly homologous.
I suggested that all sequences on the same tree should exhibit significant homology in a pairwise BLAST alignment. What do you do if sequences have diverged so much that no sequence homology can be detected, but other evidence suggests that they are homologs? Typically, that "other evidence" is likely to be similarity of protein structures.

It is not uncommon for two groups of proteins to exhibit so much structural similarity that it is almost certain they descended from a common ancestor, even though sequences within the two groups exhibit no detectable sequence similarity. Is it then reasonable to put members of the two groups into the same alignment and onto the same tree? The short answer is "No! It is not OK." The sequences have diverged so much that alignment programs will not be able to line up homologous amino acids in the same site, and the resulting alignment and tree will be meaningless. We cannot put those two groups onto the same sequence-based tree.

On the other hand, we can put those sequences onto the same structure-based tree. If we assume that homologous sites occupy the same positions in the pro-