
FUNDAMENTALS OF BIOINFORMATICS

Module 23: Phylogenetic Tree Construction – Tools

Hello all… and welcome to a new session on Fundamentals of Bioinformatics.


In this session, we will discuss the tools or programs employed for
Phylogenetic tree construction.

PHYLOGENETIC PROGRAMS

Genome sequencing is generating vast amounts of DNA sequence data from a
wide range of organisms. As a result, gene sequence databases are growing
rapidly. In order to conduct efficient analyses of these data, there is a need
for easy-to-use computer programs, containing fast computational algorithms
and useful statistical methods. Phylogenetic tree reconstruction is not a
trivial task. Although there are numerous phylogenetic programs available,
knowing the theoretical background, capabilities, and limitations of each is
very important.

With the development of different phylogenetic methods and technological
advancement, various programs and packages have been built. These programs
allow the analysis of thousands of sequences, which would be impossible to
carry out manually. Generally, the programs for phylogenetic analysis use
different input file formats, such as fasta, meg, nexus, phylip, clustal,
and MSF. These formats are generated during sequence alignment, which can be
performed in programs such as Clustal X, Clustal W, BioEdit, and AliView.
Once the alignment is properly formatted, you can run the analyses in the
desired program. In this session, I will present some programs for
phylogenetic analysis and their general characteristics.
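
To make the idea of moving an alignment between input formats concrete, here
is a minimal Python sketch. It is not taken from any of the programs discussed
in this session. The function names parse_fasta and to_phylip and the three
example sequences are made up for illustration, and the sketch assumes the
strict sequential PHYLIP layout with 10-character names. In practice an
established converter would normally be used.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned records as sequential PHYLIP text (assumed strict layout)."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    # Header line: number of sequences, then number of aligned sites.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Names are truncated or padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical toy alignment, already aligned (note the gap character).
fasta = """>human
ACGTACGT
>chimp
ACGTACGA
>mouse
ACG-ACTA
"""
phylip_text = to_phylip(parse_fasta(fasta))
print(phylip_text)
```

The check on sequence lengths matters because phylogenetic programs expect an
alignment, not a set of raw sequences of different lengths.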

1. PAUP (Phylogenetic Analysis Using Parsimony)

(by David Swofford, http://paup.csit.fsu.edu/)

It is a commercial phylogenetic package, available from Sinauer Publishers,
and is probably one of the most widely used phylogenetic programs. It is a
Macintosh program (a UNIX version is available in the GCG package) with a very
user-friendly graphical interface. PAUP was originally developed as a
parsimony program, but it has expanded into a comprehensive package that is
capable of performing distance, parsimony, and likelihood analyses. The
distance options include neighbor joining (NJ), minimum evolution (ME),
Fitch-Margoliash (FM), and UPGMA. For distance or maximum likelihood (ML)
analyses, PAUP offers detailed specification of substitution models, base
frequencies, and among-site rate heterogeneity (γ-shape parameter, proportion
of invariant sites). PAUP can also perform nonparametric bootstrapping,
jackknifing, and the Kishino-Hasegawa (KH) and Shimodaira-Hasegawa (SH) tests.
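
Of the distance options listed above, UPGMA is the simplest to show in a few
lines. The Python sketch below is a generic illustration and not PAUP's
implementation. The function name upgma, the distance table keyed by pairs of
taxa, and the four-taxon distances are assumptions made up for the example,
and branch lengths are left out to keep it short.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    `distances` maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(distances)
    sizes = {name: 1 for name in names}  # cluster -> number of taxa in it
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Distance to the new cluster is the size-weighted average.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances for four taxa.
names = ["A", "B", "C", "D"]
raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
distances = {frozenset(pair): value for pair, value in raw.items()}
tree = upgma(names, distances)
print(tree)
```

With these distances A and B are joined first, then C, then D, which is the
grouping one would expect from the table.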

Until version 5.0 of PAUP* is officially released, you can download
time-expiring test versions of PAUP* from http://phylosolutions.com/paup-test

Module 24 |1
2. MEGA (Molecular Evolutionary Genetics Analysis)

The objective of the MEGA software has been to provide tools for exploring,
discovering, and analyzing DNA and protein sequences from an evolutionary
perspective. The first version was developed for the limited computational
resources that were available on the average personal computer in the early
1990s. MEGA1 made many methods of evolutionary analysis easily accessible
to the scientific community for research and education.

MEGA2 was designed to harness the exponentially greater computing power
and the graphical interfaces of the late 1990s, fulfilling the fast-growing
need for more extensive biological sequence analysis and exploration software.
It expanded the scope of its predecessor from single-gene to genome-wide
analyses. Two versions were developed (2.0 and 2.1), each supporting the
analysis of molecular sequence data (DNA and protein sequences) and pairwise
distance data. Both allowed users to specify domains and genes for multi-gene
comparative sequence analysis and to create groups of sequences, which
facilitated the estimation of within- and among-group diversities and the
inference of higher-level evolutionary relationships of genes and species.
MEGA2 implemented many methods for the estimation of evolutionary distances,
the calculation of molecular sequence and genetic diversities within and among
groups, and the inference of phylogenetic trees under minimum evolution and
maximum parsimony criteria. It included the bootstrap and the confidence
probability tests of reliability of the inferred phylogenies, and the disparity
index test for examining the heterogeneity of substitution patterns between
lineages.

MEGA 4 continues where MEGA2 left off, emphasizing the integration of
sequence acquisition with evolutionary analysis. It contains an array of input
data and results explorers for the visual representation, handling, and
editing of sequence data, sequence alignments, inferred phylogenetic trees,
and estimated evolutionary distances. The results explorers allow users
to browse, edit, summarize, export, and generate publication-quality captions
for their results. MEGA 4 also includes distance matrix and phylogeny
explorers as well as advanced graphical modules for the visual representation
of input data and output results.

MEGA 5 is specifically designed to reduce the time needed for mundane tasks
in data analysis and to provide statistical methods of molecular evolutionary
genetic analysis in an easy-to-use computing workbench. While MEGA 5 is
distinct from previous versions, the developers have made a special effort to
retain the user-friendly interface that researchers have come to identify with
MEGA. Later versions followed, each with progressive improvements.

MEGA is useful software for constructing and visualizing phylogenies, and
also for data conversion. It can easily convert alignment files to other
formats such as nexus, paup, phylip, and fasta. The MEGA tree explorer makes
it easy to edit trees, and subtrees can also be selected and edited
separately. Some tree image export options are also available. The input
formats are newick, phylip, mega, and nexus. A phylogenetic tree can be
exported in newick format, but MEGA falls short in converting it into other
formats such as phylip, which is required by other analyses such as selection
analysis.

The latest version of the MEGA software is MEGA X, which is available from
https://www.megasoftware.net/.

3. PHYLIP (Phylogeny Inference Package; by Joe Felsenstein)

PHYLIP is a package of programs for inferring phylogenies (evolutionary trees).
It is available free over the Internet, and is written to work on as many
different kinds of computer systems as possible. The source code is
distributed (in C), and executables are also distributed. In particular,
already-compiled executables are available for Windows
(95/98/NT/2000/Me/XP/Vista), Mac OS X, and Linux systems. Older executables
are also available for Mac OS 8 or 9 systems. Complete documentation is
available in the documentation files that come with the package.

Methods that are available in the package include parsimony, distance
matrix, and likelihood methods, including bootstrapping and consensus
trees. Data types that can be handled include molecular sequences, gene
frequencies, restriction sites and fragments, distance matrices, and discrete
characters.

The programs are controlled through a menu, which asks the users which
options they want to set, and allows them to start the computation. The data
are read into the program from a text file, which the user can prepare using
any word processor or text editor (but it is important that this text file not be
in the special format of that word processor -- it should instead be in "flat
ASCII" or "Text Only" format). Some sequence analysis programs such as the
ClustalW alignment program can write data files in the PHYLIP format. Most
of the programs look for the data in a file called "infile" -- if they do not find
this file they then ask the user to type in the file name of the data file.

Output is written onto special files with names like "outfile" and "outtree".
Trees written onto "outtree" are in the Newick format, an informal standard
agreed to in 1986 by authors of a number of major phylogeny packages.
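
To give a feel for what a Newick tree in "outtree" looks like, the following
Python sketch reads a small Newick string into nested tuples and lists its
leaves. It is only an illustration and not part of PHYLIP. The function names
parse_newick and leaf_names and the example tree are made up, and the parser
assumes unquoted labels and no comments, so it does not cover the full format.

```python
def parse_newick(text):
    """Parse a simple Newick string into (name, length, children) tuples.

    Assumes unquoted labels and no bracketed comments (illustration only).
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # step over '('
            children.append(node())
            while text[pos] == ",":
                pos += 1  # step over ','
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # step over ')'
        # Optional label (leaf name or internal node label).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ':'.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(tree):
    """Return the leaf labels of a parsed tree, left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


# Hypothetical three-taxon tree with branch lengths.
tree = parse_newick("((human:0.1,chimp:0.2):0.05,mouse:0.4);")
print(leaf_names(tree))
```

Parentheses group sister lineages, commas separate them, the number after a
colon is a branch length, and the semicolon ends the tree.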

PHYLIP is probably the most widely distributed phylogeny package. It is the
sixth most frequently cited phylogeny package, after MrBayes, PAUP*, RAxML,
PhyML, and MEGA.

The program package is downloadable from
http://evolution.genetics.washington.edu/phylip.html.

4. Dendroscope

Researchers studying phylogenetic relationships need software that is able to
visualize rooted phylogenetic trees and networks efficiently, increasingly for
large datasets involving hundreds of thousands of taxa. The program should
be user-friendly (easy to run on all popular operating systems), facilitate
interactive browsing and editing of trees, and allow one to export the result
in multiple file formats in publication quality. In addition, there is a need
for a program that allows one to compute rooted phylogenetic networks from
trees.

The platform-independent tree and rooted network viewer Dendroscope was
developed to address these issues.

Feature list of this software:

• Large trees with hundreds of thousands of taxa can be easily displayed,
browsed, and edited
• Multiple trees and networks from a single file can be displayed together
in an m by n grid
• Novel magnifying features for zooming into detailed views
• Find-and-replace toolbar that uses regular expressions
• Subtrees can be collapsed and coloured
• All labels (leaves/inner nodes and edges) can be edited
• Trees can be re-rooted
• Seven different views are available, including rectangular, slanted,
circular, and radial views
• Input formats: Newick, Nexus, extended Newick (for rooted
phylogenetic networks), and Dendroscope format
• Multiple graphic export formats: .eps, .svg, .png, .jpg, .gif, .bmp, .pdf
• Trees and networks can be copied and pasted between different
windows
• Platform independent (Java, installers for common operating systems
available)
• Consensus trees and rooted phylogenetic networks can be computed
from a set of trees
• Hybridization networks and tanglegrams for multifurcating trees on
unequal taxon sets
• Command-line mode

5. TREE-PUZZLE

TREE-PUZZLE is a program that performs quartet puzzling. Its advantage is
that it allows various substitution models for likelihood score estimation and
incorporates a discrete γ model for rate heterogeneity among sites. Because
of the heuristic nature of the program, it allows ML analyses of large datasets.
Internal branches of the resulting puzzle trees are automatically assigned
puzzle support values. These values are percentages of consistent quartet trees
and do not have the same meaning as bootstrap values.

TREE-PUZZLE version 5.0 is available for Mac, UNIX, and Windows and can
be downloaded from www.tree-puzzle.de/.

6. PHYML

PHYML is a web-based phylogenetic program that searches for the ML tree with
a hill-climbing heuristic. It first builds an NJ tree and uses it as a
starting tree for subsequent iterative refinement through subtree swapping.
Branch lengths are simultaneously optimized during this process. The tree
searching stops when the total ML score no longer increases. The main
advantage of this program is the ability to build trees from very large
datasets with hundreds of taxa and to complete tree searching within a
relatively short time frame.
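
The search pattern described above (start from an initial solution, try
rearrangements, and stop when the score no longer increases) can be sketched
generically in Python. This is not PHYML's code. The function name hill_climb
is made up, and the toy "state" is just an integer with a made-up score
function standing in for a tree and its likelihood.

```python
def hill_climb(start, neighbours, score):
    """Greedy local search: move to the best neighbour until none is better."""
    current, current_score = start, score(start)
    while True:
        # Score every candidate rearrangement of the current state.
        candidates = [(n, score(n)) for n in neighbours(current)]
        best, best_score = max(candidates, key=lambda pair: pair[1])
        if best_score <= current_score:
            # No rearrangement improves the score, so the search stops.
            return current, current_score
        current, current_score = best, best_score


# Toy stand-ins (assumptions for illustration only):
# a "state" is an integer, its neighbours are x - 1 and x + 1,
# and the score has a single peak at x = 7.
def toy_neighbours(x):
    return [x - 1, x + 1]


def toy_score(x):
    return -(x - 7) ** 2


result = hill_climb(0, toy_neighbours, toy_score)
print(result)
```

A search of this kind is fast, but it can stop at a local optimum, which is
one reason the choice of starting tree matters.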

It is available at http://atgc.lirmm.fr/phyml/.

7. MrBayes: Bayesian Inference of Phylogeny

MrBayes is a program for Bayesian inference and model choice across a wide
range of phylogenetic and evolutionary models. MrBayes uses Markov chain
Monte Carlo (MCMC) methods to estimate the posterior distribution of model
parameters.
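
To illustrate what "estimating a posterior distribution by MCMC" means, here
is a minimal Metropolis sampler in Python for a deliberately simple problem:
the bias of a coin that gave 7 heads in 10 flips, under a uniform prior. This
is a toy under stated assumptions and has nothing to do with MrBayes's own
code, which samples trees and model parameters. The function names, the data,
the step size, and the burn-in length are all made up for the example.

```python
import math
import random


def log_posterior(theta, heads=7, flips=10):
    """Unnormalised log posterior of a coin bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return float("-inf")  # outside the prior's support
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(n_steps, step_size=0.2, seed=1):
    """Sample the coin bias with a symmetric random-walk proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.uniform(-step_size, step_size)
        candidate = log_posterior(proposal)
        log_ratio = candidate - current
        # Accept with probability min(1, posterior ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(20000)
kept = samples[2000:]  # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

For this toy problem the exact posterior is Beta(8, 4), with mean 8/12, so
the sample mean can be checked against a known answer. For phylogenies no
such closed form exists, which is why sampling is used.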

Program features include:

• A common command-line interface across Macintosh, Windows, and
UNIX operating systems;
• Extensive help available from the command line;
• Analysis of nucleotide, amino acid, restriction site, and morphological
data;
• Mixing of data types, such as molecular and morphological characters,
in a single analysis;
• Easy linking and unlinking of parameters across data partitions;
• An abundance of evolutionary models, including 4x4, doublet, and
codon models for nucleotide data and many of the standard rate
matrices for amino acid data;
• Estimation of positively selected sites in a fully hierarchical Bayesian
framework;
• Full integration of the BEST algorithms for the multi-species coalescent;
• Estimation of time-calibrated (clock) trees using a variety of (strict and)
relaxed-clock models;
• Support for complex combinations of positive, negative, and backbone
constraints on topologies;
• Model jumping across the GTR model space and across fixed rate
matrices for amino acid data;
• Monitoring of convergence during the analysis, and access to a wide
range of convergence diagnostics tools after the analysis has finished;
• Rich summaries of posterior samples of branch and node parameters
printed to majority rule consensus trees in FigTree format;
• Implementation of the stepping-stone method for accurate estimation
of model likelihoods for Bayesian model choice using Bayes factors;
• The ability to spread jobs over a cluster of computers using MPI (for
Macintosh (OS X) and UNIX environments only);
• Support for the BEAGLE library, resulting in dramatic speedups for
codon and amino acid models on compatible hardware (NVIDIA
graphics cards);
• Checkpointing across all models, allowing the user to seamlessly
extend a previous analysis or recover from a system crash.

The program is available in multiplatform versions and can be downloaded
from https://nbisweden.github.io/MrBayes/download.html.

8. RAxML (Randomized Axelerated Maximum Likelihood)

RAxML (Randomized Axelerated Maximum Likelihood) is a program for
sequential and parallel Maximum Likelihood based inference of large
phylogenetic trees. It can also be used for post-analyses of sets of
phylogenetic trees, analyses of alignments, and evolutionary placement of
short reads. It was originally derived from fastDNAml, which in turn was
derived from part of the PHYLIP package.

9. FastTree

The software is open source and can be installed on different platforms
(Mac, Linux/Unix, and Windows). Its purpose is to carry out ML analyses of
thousands of DNA, RNA, or protein sequences much faster than other programs
(about 100–1000 times faster). For DNA analysis, only the Jukes-Cantor and
GTR substitution models are available, which is a limitation. For protein
data, it uses the Jones-Taylor-Thornton 1992 (JTT) [50], Whelan and Goldman
2001 (WAG) [51], and Le and Gascuel 2008 (LG) [52] models. One of the great
advantages of the program is its use of a rate category for each site (the
CAT model approach), which reduces the computational time of the analyses,
mainly for amino acids [53, 54, 55]. The program uses a specific type of
support value, called local-bootstrap support values, that can vary throughout
the search, but the traditional bootstrap can be obtained by using the SEQBOOT
program (part of the PHYLIP package), which resamples the data. The Perl
script CompareToBootstrap.pl can then be used to compare the tree generated by
FastTree with the trees obtained from the resampled data. The program accepts
multiple sequence alignments (MSAs) in fasta and interleaved phylip formats.
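
The resampling step behind the traditional bootstrap can be sketched in a few
lines of Python: each replicate is built by drawing alignment columns at
random, with replacement, until the replicate is as long as the original.
This is a generic illustration and not the code of SEQBOOT or FastTree. The
function name bootstrap_replicate, the toy alignment, and the random seed are
assumptions made up for the example.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw column indices with replacement; the same column may repeat.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in picks)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment.
alignment = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACG-ACTA"}
rng = random.Random(42)
replicates = [bootstrap_replicate(alignment, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```

The same column indices are used for every sequence, so each replicate is
still a valid alignment. A tree is then built from each replicate, and the
support for a branch is the fraction of replicate trees that contain it.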
