Phylo-MCOA: A Fast and Efficient Method to Detect Outlier Genes and Species in Phylogenomics Using Multiple Co-inertia Analysis

Damien M. de Vienne,*,1,2 Sébastien Ollier,2 and Gabriela Aguileta1,2

1 Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Barcelona, Spain
2 Laboratoire Ecologie, Systématique et Evolution, UMR 8079, Université Paris-Sud 11, Orsay cedex, France
*Corresponding author: E-mail: damien.de-vienne@crg.es.
Associate editor: Hervé Philippe

MBE Advance Access published February 14, 2012

Research article

Abstract

Full genome data sets are currently being explored on a regular basis to infer phylogenetic trees, but there are often discordances among the trees produced by different genes. An important goal in phylogenomics is to identify which individual genes and species produce the same phylogenetic tree and are thus likely to share the same evolutionary history. On the other hand, it is also essential to identify which genes and species produce discordant topologies and therefore evolve in a different way or represent noise in the data. The latter are outlier genes or species, and they can provide a wealth of information on potentially interesting biological processes, such as incomplete lineage sorting, hybridization, and horizontal gene transfers. Here, we propose a new method to explore the genomic tree space and detect outlier genes and species based on multiple co-inertia analysis (MCOA), which efficiently captures and compares the similarities in the phylogenetic topologies produced by individual genes. Our method allows the rapid identification of outlier genes and species by extracting the similarities and discrepancies, in terms of the pairwise distances, between all the species in all the trees, simultaneously. This is achieved by using MCOA, which finds successive decomposition axes from individual ordinations (i.e., derived from distance matrices) that maximize a covariance function. The method is freely available as a set of R functions. The source code and tutorial can be found online at http://phylomcoa.cgenomics.org.

Key words: phylogenomics, outlier genes, phylogenetic markers, multivariate analysis, fungal genomes.

Introduction
The analysis of genomic data sets has generated a pressing need to compare large numbers of phylogenetic trees. The trees under comparison may, for example, represent a set of shared orthologous genes from a given organism or taxonomic group, sampled across their full genomes. The goal of such comparisons can be 2-fold: either to identify the set of genes producing a concordant topology that may reflect the species history or to identify the set of outlier genes that generate topological incongruities and are likely to evolve differently or represent noise. While most efforts have been devoted to detecting genes that produce concordant gene trees, less attention has been given to identifying outliers, although they are very important candidates for studying interesting biological processes. Detecting outlier genes and species is particularly relevant in phylogenomics because genes whose tree topologies are discordant with respect to an overall topological consensus are good candidates to be evolving at significantly different rates, to be under positive selection for functional divergence, to have been horizontally transferred, to be paralogous instead of orthologous genes, or to be randomly sampled from incompletely sorted lineages. On the other hand, species that appear as outliers (i.e., that are placed very differently in different gene trees) may represent species that are more likely to experience gene transfers, to have sequencing errors, to have undergone bottlenecks, or to be affected by other biological processes at the species level. Even though topological discordances can be detected at the gene or the species level, most methods so far are designed to look at genes only.

In the context of phylogenetic gene tree comparison and phylogenomics, several approaches and methods have been proposed to detect concordant and discordant topologies while trying to integrate as much of the available information as possible. A solution that attempts to deal with all the available data is the construction of a super matrix from the alignment of all or a large number of the available individual genes, an approach known as "total evidence" (Gadagkar et al. 2005; Aguileta et al. 2008). However, most methods involve some sort of consensus solution, either by classical consensus of multiple competing phylogenies (Felsenstein 1985), by building a super tree (Gordon 1986; Sanderson et al. 1998), by Bayesian sampling (Ané et al. 2007), or by a summary by agreement subtrees (Cranston and Rannala 2007). Other approaches work by trying to filter out the noise and/or increase the phylogenetic signal (Rodríguez-Ezpeleta et al. 2007; Roure et al. 2007) or by clustering congruent loci (Leigh et al. 2008, 2011). Another possibility involves identifying the conflicting bipartitions and representing all the possible alternatives in a graphic way, an approach that results in networks of genes (Huson and Bryant 2006), or a projection of the conflicting signals in 2D (White et al. 2007).

Also, in order to obtain a single tree from multiple individual trees of the same organism or species, the coalescent approach can be used (Rosenberg 2002; Suchard et al. 2003; Suchard 2005; Degnan and Rosenberg 2006) or the Bayesian reconstruction of gene trees taking the species tree as a prior (BEST) (Edwards et al. 2007). Finally, approaches for graphically comparing phylogenetic trees include the use of multidimensional scaling to visualize tree-to-tree pairwise distances (Hillis et al. 2005); the use of principal component analysis (PCA) to detect events of horizontal gene transfer (Brochier et al. 2002; Susko et al. 2006); and the use of Hadamard spectral analysis, represented in a triangular graph called Treeness Triangles, to investigate the loss of phylogenetic signal (White et al. 2007).

However, by concatenating the multilocus data or by summarizing or obtaining a consensus of their individual gene trees, most of the latter methods lose a wealth of potentially interesting information, especially by removing outlier data or trees. Phylo-MCOA is unique in that it analyzes all the trees simultaneously without discarding data a priori. Furthermore, as far as we know, the possibility of also detecting outlier species when comparing phylogenetic trees has not been proposed before.

Here, we present Phylo-MCOA, a fast method for comparing multiple phylogenetic trees and detecting outlier genes and species based on multiple co-inertia analysis (MCOA) (Chessel and Hanafi 1996). The method allows the efficient identification of concordant and outlier genes and species by enabling the quick comparison of a large number of trees. It builds a consensus typology by extracting the similarities, in terms of the pairwise distances between all the species in all the trees, simultaneously. The consensus typology is a 2D graphic representation of the distance interrelationships between species supported by the majority of the genes. Phylo-MCOA finds successive decomposition axes from individual ordinations (derived from matrices of distances between species) that maximize a covariance function in order to build the consensus typology, and it estimates the contribution of each individual gene tree to this typology (i.e., to what extent each gene deviates from or agrees with what the majority of genes support). MCOA is expected to reveal the common features (i.e., pairwise distances and species placements) of single-tree topologies, build a consensus typology, and compare individual trees with this reference. Until now, MCOA has been used with ecological (Bady et al. 2004; Hedde et al. 2005) and genetic data (Laloë et al. 2007; Berthouly et al. 2008); however, as far as we know, this is the first time that MCOA is applied to phylogenetic analysis. Phylo-MCOA is unique in tracing the position of each species with respect to all other species across the compared phylogenies. The simultaneous analysis of all the species placements in all trees makes it possible to detect not only concordant genes but also outlier gene trees and, even more importantly, it can show which species and genes explain those differences. Our method is thus new in allowing the analysis of the data at the species level for comparing the histories of different species across individual gene trees.

Phylo-MCOA can be very useful for phylogenomic analyses, where big genomic data sets are typically analyzed. First, Phylo-MCOA readily identifies the outlier genes and species that cause topological incongruence. This is useful for studying the evolution of specific genes, particularly those that seem to differ from the rest of the genome. The method will also identify candidate outlier species, which may undergo interesting biological processes at the species level. Phylo-MCOA is unique in pointing not only to outlier genes but also to outlier species. Second, when the goal is to construct a species tree or a tree that reflects the evolution of most genes in the genome, Phylo-MCOA can be used to identify which genes produce discordant trees and remove them for subsequent phylogenetic analysis. In this way, only the genes that share the same evolutionary history are kept in order to build a tree that reflects the evolution of the species or that of most of the genome. Third, Phylo-MCOA can be used to investigate specific evolutionary hypotheses including, but not limited to, which genes may have been laterally transferred, which ones may be more readily lost or replaced, or which species produce phylogenetic discordance.

© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Mol. Biol. Evol. doi:10.1093/molbev/msr317

Advance Access publication January 3, 2012

Materials and Methods
MCOA in the Context of Phylogenetic Analysis
In phylogenomics, there is often the need to compare sets of phylogenetic trees, to determine how the position of each species changes in every tree, and to find out which trees are more similar to each other in terms of the pairwise topological distance (i.e., either nodal or patristic). It would thus be advantageous to compare the position of each species in all trees at the same time, to know whether there is a correlation among the species positions and distances represented in all trees (i.e., to create a consensus typology that serves as a reference), and to assess to what extent each individual tree deviates from such a consensus. The problem at hand involves measuring the relationship between the position of each species and the distances between them in all the trees of a given set. In the context of multivariate data analysis, one is interested in ordering objects (e.g., variables, sites, and tables) by their similarity, and the relationships between the objects are measured by different metrics (Manly 2004). Generally, objects are projected onto principal axes and are thus characterized numerically or graphically. This procedure places similar objects near each other and dissimilar ones far apart in geometric space. Traditional ordination techniques (e.g., PCA) typically rely on the singular value decomposition of a single table. In the case of our proposed method, MCOA allows the simultaneous ordination of many Euclidean distance matrices, one derived from each tree in the set, by finding successive decomposition axes from each distance matrix that maximize a covariance function. Doing this allows the extraction of the consensus typology contributed by all trees.

Detecting Outlier Genes and Species in Phylogenomics · doi:10.1093/molbev/msr317

FIG. 1. Principle of MCOA as applied to phylogenetic trees. (a) From one tree to a 2D representation of the tree. Step 1: A single phylogenetic tree can be converted into a distance matrix using either nodal or patristic distances between the leaves. Step 2: A PCO on the square root of the distance matrix allows retrieving the coordinates of the species in a Euclidean space. Step 3: Positions of the species in a 2D space formed by the first two axes of the PCO. (b) MCOA from K trees to the cohesion plot. Steps 1 and 2: From K trees, K distance matrices and K Euclidean spaces are obtained, as in (a). Each Euclidean space can then be seen as a scatter plot (or set of points) in an (n − 1)-dimensional space (n being the number of leaves). Step 3: MCOA finds K auxiliary vectors ($u^1_1, u^1_2, \ldots, u^1_K$) and a reference vector $m^1$ so that the sum of the covariances between each $u^1$ vector and the unique $m^1$ vector is maximized. This is done recursively for $m^2$, $m^3$, etc. Step 4: Cohesion plot for the first two axes of the MCOA analysis, $m^1$ and $m^2$. Labels represent the reference position of the species (consensus typology). Black points represent the positions of the species in each individual gene tree.

Here, we introduce the basic concepts of the method; interested readers are referred to the appendix in the supplementary material (Supplementary Material online) and to Chessel and Hanafi (1996) for its mathematical description. To better understand how MCOA is applied to phylogenetic data (i.e., a set of K trees), we can decompose the method into two steps. We first explain how one tree can be represented in a 2D space through Principal Coordinate Analysis (PCO; Gower 1984; Gower and Legendre 1986) (fig. 1a), and we then explain how K trees can be simultaneously analyzed through MCOA (fig. 1b).

A single tree can be transformed into a distance matrix by calculating the pairwise distances between species in the tree (fig. 1a, step 1). The distance between species can be either nodal (number of nodes separating two leaves in the tree) or patristic (sum of the branch lengths separating two leaves). We then compute the square root of this distance matrix in order to get a Euclidean distance matrix, which is an optimal condition for the PCO to be performed. Whereas the methods typically used to obtain Euclidean distances impose a geometric distortion on them, taking their square root does not affect the results (de Vienne et al. 2011). The principal coordinates of the leaves (i.e., terminal branches) in a Euclidean space are then derived from the Euclidean distance matrix by a PCO analysis (fig. 1a, step 2). Each axis defined by the PCO analysis maximizes the variance of the coordinates of the species. When the original distances between species are Euclidean, all the eigenvalues are positive and correspond to the variance associated with each axis. In that case, the distances between species in the new space defined by the PCO analysis are the same as the original distances defined by the tree. Because most of the variance is generally concentrated in the first axes (depending on the topology of the tree), we can represent the respective positions of the leaves in a reduced space formed by these axes of the PCO (fig. 1a, step 3). The relative positions of the taxa in this reduced space represent the main dichotomies observed in the tree.

The first steps of the MCOA analysis, as applied to multiple phylogenetic trees, are the same as before. Each one of the K trees is transformed into a distance matrix (fig. 1b, step 1), and the square roots of the distance matrices are then computed and used to perform K independent PCOs (fig. 1b, step 2). We thus obtain K Euclidean spaces, one for each initial tree. Each one of these spaces can be seen as a set of points (scatter plots in fig. 1b) distributed in a space having n − 1 dimensions (or axes), where n is the number of leaves in each tree. The basic goal of MCOA is then to find a set of K vectors (one for each of the K Euclidean spaces) called auxiliary vectors (from $u^1_1$ to $u^1_K$), as well as a single reference vector (called $m^1$), so that the sum of the covariances between each of the K auxiliary vectors and the $m^1$ vector is maximized (fig. 1b, step 3). An analytic demonstration of this maximization has been given by Chessel and Hanafi (1996). The obtained $m^1$ vector is the first axis of the MCOA analysis. The same method is then applied to obtain an $m^2$ vector by maximizing the sum of the covariances between this vector and K $u^2$ vectors (from $u^2_1$ to $u^2_K$). Thus $m^2$ becomes the second axis of the MCOA. This is done recursively to retrieve the n axes of the MCOA analysis.

FIG. 2. Detection of complete outliers by Phylo-MCOA. (a) The 2WR matrix is computed by calculating, for every species, the distance separating its position in each gene tree from its reference position (see text). The 2WR matrix provides information about species in rows and genes in columns. (b) The 2WR matrix is transformed into two binary matrices (2WRBgn and 2WRBsp) by detecting outlier cells as described in the Methods section. (c) A score is computed for each gene from the 2WRBgn matrix and for each species from the 2WRBsp matrix and represented by a bar plot. The dashed gray line on the bar plot represents the threshold chosen by the user above which a gene or a species is considered a complete outlier.
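The single-tree steps described in the text (fig. 1a, steps 1 and 2) can be illustrated with a short sketch. Phylo-MCOA itself is distributed as R functions; the Python below is not that code but a minimal illustration under stated assumptions: the tree is a hypothetical four-leaf example given as an edge list, the nodal distance is taken as the number of nodes lying strictly between two leaves, and the helper names `nodal_distances` and `gower_matrix` are invented here. The sketch stops at the double-centred matrix that a PCO would diagonalise and only checks that this matrix encodes the square-rooted distances.

```python
from collections import deque
from math import sqrt

def nodal_distances(edges, leaves):
    """Number of nodes lying strictly between each pair of leaves (assumed definition)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    matrix = []
    for src in leaves:
        # Breadth-first search gives the number of edges from src to every node.
        steps = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in steps:
                    steps[nxt] = steps[node] + 1
                    queue.append(nxt)
        # Nodes between two leaves = edges on the path minus one.
        matrix.append([0 if dst == src else steps[dst] - 1 for dst in leaves])
    return matrix

def gower_matrix(dist):
    """Double-centre -1/2 * dist**2; PCO coordinates come from its eigenvectors."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    total_mean = sum(row_mean) / n
    # a is symmetric, so row means and column means coincide.
    return [[a[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

# Hypothetical unrooted tree ((A,B),(C,D)) with internal nodes x and y.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
leaves = ["A", "B", "C", "D"]

nodal = nodal_distances(edges, leaves)                   # fig. 1a, step 1
root_dist = [[sqrt(v) for v in row] for row in nodal]    # square-root transform
gram = gower_matrix(root_dist)                           # input to the PCO

# Squared distance between A and C recovered from the double-centred matrix.
recovered = gram[0][0] + gram[2][2] - 2 * gram[0][2]
```

Here `recovered` equals the nodal distance between A and C up to rounding, since the squared square-rooted distance is the nodal distance itself. The eigen-decomposition of `gram`, which would give the principal coordinates, is left to a linear algebra library.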

Phylo-MCOA: Using MCOA for Phylogenetic Analysis
Following the analysis described above, the reference position of each species (coordinates of the species projected on the $m^1, m^2, \ldots, m^n$ vectors) as well as the position of the species in each individual tree (coordinates of the species projected on the $u^1, u^2, \ldots, u^n$ auxiliary vectors) can be computed. It is possible to plot the reference positions of each species, as well as their position in each individual tree, using the first two axes of the MCOA analysis (fig. 1b, step 4). This plot is usually called the cohesion plot. The set of reference positions of the species in the cohesion plot is called the consensus typology. For a given species, the distance between its reference position and its position in each individual tree is represented by lines. Long lines represent cases where the species has a position in a given tree that is not concordant with its position in all other trees (and therefore not with the consensus typology either). On the contrary, short lines represent cases where the position of the species in a given tree is roughly the same as in all the other trees (and therefore in the consensus typology). The cohesion plot thus contains N × T dots, where N represents the number of species in each tree and T the number of trees.

The cohesion plot is very informative for small data sets, where one can assess the position of each species in each individual tree relative to its reference position. However, this plot becomes difficult to read and to interpret as the number of genes and species increases because species labels superimpose and the cloud of points and lines becomes overcrowded. Moreover, the cohesion plot gives only a graphical representation of the data and does not allow the extraction of statistically significant outlier genes and species. Therefore, we present the information contained in cohesion plots using a different and more readable representation: the two-Way Reference matrix (hereafter referred to as the 2WR matrix). The 2WR matrix is built by computing, for every species, the distance between its reference position and its position in each individual tree in the multidimensional space. The rows in the 2WR matrix represent species and the columns represent genes (fig. 2a). Note that the order of species (rows) in the 2WR matrix follows the average relative position of species in the gene trees. In other words, two species that are close in the 2WR matrix (two contiguous rows) are two species that are closely related in the gene trees.

FIG. 3. Detection of cell-by-cell outliers by Phylo-MCOA. (a) The 2WR matrix is obtained as described in the text. (b) The 2WR matrix is then transformed into the 2WRBgnsp binary matrix by successive normalizations and detection of outliers (see text). This matrix contains "islands" of outliers representing groups of closely related species erroneously detected as outliers surrounding the true outlier. (c) Islands are removed by considering as the true outlier only the cell having the maximum value in the island. This maximum is taken from the 2WRgnsp matrix.
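As an illustration of how the 2WR matrix described in the text is assembled, the following Python sketch computes, for each species and gene, the Euclidean distance between a reference position and a per-tree position. It is not the authors' R implementation: the function name `two_way_reference`, the dictionary layout, and the coordinates are hypothetical, and the sketch does not reorder the rows by the average relative position of the species as the text describes.

```python
from math import dist

def two_way_reference(reference, per_tree):
    """Build a 2WR-style matrix: species in rows, genes in columns.

    reference: {species: coordinates on the MCOA axes}
    per_tree:  {gene: {species: coordinates in that gene tree}}
    """
    species = list(reference)
    genes = list(per_tree)
    matrix = [[dist(reference[sp], per_tree[g][sp]) for g in genes]
              for sp in species]
    return species, genes, matrix

# Hypothetical coordinates on two axes for two species and two genes.
reference = {"A": (0.0, 0.0), "B": (1.0, 0.0)}
per_tree = {
    "gene1": {"A": (0.0, 0.0), "B": (1.0, 0.0)},  # agrees with the reference
    "gene2": {"A": (3.0, 4.0), "B": (1.0, 0.0)},  # species A is displaced
}

species, genes, twr = two_way_reference(reference, per_tree)
```

In this toy case the only nonzero cell is the one for species A in gene2, which is the kind of cell that the outlier detection would be expected to flag.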

Detection of Outliers
Phylo-MCOA can detect two types of outliers: "complete" outliers and "cell-by-cell" outliers. Complete outliers are of two types: complete outlier species and complete outlier genes. Complete outlier genes place all the species in a position different from their positions in the other gene trees, whereas complete outlier species are species whose position with respect to other species is different in all the gene trees. Cell-by-cell outliers represent species whose position in some specific gene trees is not concordant with their position in the other gene trees (i.e., single cells in the 2WR matrix whose value appears to be an outlier with respect to the other values in the matrix). Cell-by-cell outliers cannot be detected by Phylo-MCOA from the 2WR matrix if complete outliers are present because they hide the cell-by-cell signal. The detection of outliers by Phylo-MCOA is thus performed in multiple steps: 1) Phylo-MCOA analysis of the original data set; 2) detection of complete outliers; 3) removal of the complete outliers from the original data set; 4) Phylo-MCOA analysis of the new data set; and 5) detection of cell-by-cell outliers. If no complete outliers are detected, steps 3 and 4 are omitted.

For detecting complete outliers (fig. 2), the 2WR matrix is converted into two binary matrices (2WRBsp and 2WRBgn, fig. 2b) by detecting outlier cells. For constructing the 2WRBsp matrix, a cell is considered an outlier (and assigned the value 1) if its value in the 2WR matrix is higher than Q3 + k × IQR, where Q3 represents the third quartile of the distribution of values of its column and IQR represents the interquartile range of these values. Symmetrically, for the 2WRBgn matrix, a cell is considered an outlier if its value in the 2WR matrix is higher than Q3 + k × IQR, where Q3 and IQR are now computed based on the values in the row. We set k = 1.5 as the default value, as is classically done when detecting outliers with this method. The two obtained binary matrices are represented in figure 2b. From the 2WRBgn matrix, a score is computed for each gene by computing the proportion of 1s in its column. From the 2WRBsp matrix, a score is computed for each species by computing the proportion of 1s in its row. This leads to the two bar plots presented in figure 2c. In this example, gene 2 and gene 13 are detected as outliers by more than 90% of the species, and t16 and t33 are detected as outlier species by more than 90% of the genes. The user can fix a threshold for considering that a species or a gene is a complete outlier. By default, we set this threshold to 50% (i.e., species detected as outliers for more than 50% of the genes are considered complete outliers, and genes detected as outliers for more than 50% of the species are considered complete outliers).

For the detection of cell-by-cell outliers (fig. 3), the 2WR matrix (fig. 3a) is transformed into the 2WRgn and 2WRsp matrices by normalizing the 2WR matrix by rows and columns, respectively; the product of these two normalized matrices gives the 2WRgnsp matrix that is used for the cell-by-cell outlier detection. The 2WRgnsp matrix is first transformed into a binary matrix by detecting outlier cells according to the values in the rows, and into a second binary matrix by detecting outlier cells according to the values in the columns. These two binary matrices are finally multiplied in order to conserve only cells detected as outliers for both rows and columns. This last binary matrix is called 2WRBgnsp (fig. 3b).

This procedure for detecting cell-by-cell outliers is, however, too permissive, and therefore a second filtering step is necessary. As explained before, Phylo-MCOA starts by transforming trees into distance matrices. Species in the trees are not independent of each other. Therefore, even a single modification in a tree will change all the values in the associated distance matrix. When we detect a cell-by-cell outlier, we have to distinguish between "true" outliers and species that are affected by this outlier just because they are close in the distance matrix. To solve this problem, we take advantage of a Phylo-MCOA property that orders species in the 2WR matrix by their mean proximity in the cohesion plot. When two or more species that are contiguous in the 2WRBgnsp matrix for a given gene are detected as outliers, we define this group as an "island" (dashed circles in fig. 3b). In each island, we need to distinguish between the true outlier species and the contiguous species that are erroneously detected as outliers. To do so, we consider that the outlier cell with the highest value in the 2WRgnsp matrix is the true outlier and that the contiguous cells are artifacts. This filtering step yields a binary matrix free of islands that contains only "true" cell-by-cell outliers (fig. 3c).

Table 1. Effect of the Size of the Data on the Ability of Phylo-MCOA to Correctly Retrieve Complete Outlier Genes and Species.

Retrieved Species/Genes

                   Number of species
Number of genes    10     20     30     50     80     100
10                 0/0    0/0    1/0    3/0    3/0    3/0
20                 0/0    1/2    3/3    3/3    3/3    3/3
30                 0/0    0/3    3/3    3/3    3/3    3/3
50                 0/0    0/3    3/3    3/3    3/3    3/3
80                 0/0    0/3    3/3    3/3    3/3    3/3
100                0/1    0/3    3/3    3/3    3/3    3/3

NOTE: Three outlier genes and three outlier species were simulated. Each cell contains two values separated by a slash (/): number of outlier species retrieved/number of outlier genes retrieved. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.
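The quartile rule for complete outliers and the island filter for cell-by-cell outliers, both described in this section, can be sketched in Python as follows. This is an illustration rather than the authors' R code. The function names, the toy matrix, and the quartile convention are assumptions: the text does not say which quartile estimator is used, so the sketch takes `statistics.quantiles` with `method="inclusive"`. The normalization that produces the 2WRgnsp matrix is not reproduced; `filter_islands` simply takes a binary matrix and the matrix of values from which maxima are read.

```python
from statistics import quantiles

def upper_fence(values, k=1.5):
    """Return Q3 + k * IQR for a list of numbers (assumed quartile convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def complete_outlier_scores(twr, k=1.5):
    """twr: 2WR-style matrix, species in rows and genes in columns."""
    n_sp, n_gn = len(twr), len(twr[0])
    col_fence = [upper_fence([twr[i][j] for i in range(n_sp)], k)
                 for j in range(n_gn)]
    row_fence = [upper_fence(twr[i], k) for i in range(n_sp)]
    # 2WRBsp: each cell is compared with the values of its column (gene).
    b_sp = [[int(twr[i][j] > col_fence[j]) for j in range(n_gn)]
            for i in range(n_sp)]
    # 2WRBgn: each cell is compared with the values of its row (species).
    b_gn = [[int(twr[i][j] > row_fence[i]) for j in range(n_gn)]
            for i in range(n_sp)]
    species_score = [sum(row) / n_gn for row in b_sp]
    gene_score = [sum(b_gn[i][j] for i in range(n_sp)) / n_sp
                  for j in range(n_gn)]
    return species_score, gene_score

def filter_islands(binary, values):
    """In each column, keep only the highest-valued cell of every run of
    contiguous flagged rows; isolated flagged cells are kept as they are."""
    n_sp, n_gn = len(binary), len(binary[0])
    kept = [[0] * n_gn for _ in range(n_sp)]
    for j in range(n_gn):
        i = 0
        while i < n_sp:
            if binary[i][j]:
                end = i
                while end + 1 < n_sp and binary[end + 1][j]:
                    end += 1
                best = max(range(i, end + 1), key=lambda r: values[r][j])
                kept[best][j] = 1
                i = end + 1
            else:
                i += 1
    return kept

# Toy 2WR matrix: the third gene and the fifth species are displaced everywhere.
twr = [
    [1.0, 1.2, 9.0, 0.8, 1.1],
    [0.9, 1.0, 8.8, 1.1, 1.0],
    [1.1, 0.9, 9.2, 1.0, 1.2],
    [1.0, 1.1, 9.1, 0.9, 1.0],
    [7.0, 7.5, 8.9, 8.0, 7.2],
]
species_score, gene_score = complete_outlier_scores(twr)
outlier_species = [i for i, s in enumerate(species_score) if s > 0.5]
outlier_genes = [j for j, s in enumerate(gene_score) if s > 0.5]

# One column with an island (rows 1 to 3) and an isolated outlier (row 5).
flags = [[0], [1], [1], [1], [0], [1]]
cell_values = [[0.1], [2.0], [5.0], [3.0], [0.2], [4.0]]
kept = filter_islands(flags, cell_values)
```

With the default 50% threshold, the toy matrix yields the fifth species and the third gene as complete outliers, and the island collapses to its highest-valued cell while the isolated cell is retained.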

Handling Tree and Node Support
When comparing phylogenetic trees, it is very important to consider the statistical support of both trees and nodes. To take into account measures of tree support such as likelihood scores, posterior probabilities, or any other suitable scores, Phylo-MCOA can assign different weights, specified by the user, to the different trees according to their support. To do so, during the MCOA analysis, each table (representing Euclidean spaces, see fig. 1) is multiplied by the square root of the weight that is associated with it (in absolute values). As a result, the values in the tables corresponding to a specific gene with a high weight are higher, and the importance of the corresponding auxiliary vectors for retrieving the m vectors also increases (fig. 1b, step 3). In this sense, those genes will weigh more in the MCOA. It is also possible to take into account node support (e.g., bootstrap values) when building the consensus typology by choosing a cutoff level of node confidence below which nodes will be collapsed. In this way, poorly supported nodes (i.e., unlikely species bipartitions) will have less or no influence on the consensus typology.
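The weighting step can be pictured with a very small Python sketch. It assumes that "in absolute values" means the absolute value of the weight is taken before the square root, which is one reading of the text; the function name `weight_tables` and the numbers are hypothetical, and this is not the authors' R code.

```python
from math import sqrt

def weight_tables(tables, weights):
    """Scale every coordinate of table k by sqrt(|weights[k]|)."""
    return [[[sqrt(abs(w)) * x for x in row] for row in table]
            for table, w in zip(tables, weights)]

# Two hypothetical one-species tables with identical coordinates.
tables = [[[1.0, 2.0]], [[1.0, 2.0]]]
weights = [1.0, 4.0]  # the second tree is given four times the weight
weighted = weight_tables(tables, weights)
```

The coordinates of the second table are doubled, so its point cloud has four times the variance and contributes correspondingly more when the reference axes are sought.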

Handling Missing Data
Phylo-MCOA is capable of analyzing data sets that are incompletely sampled (i.e., with missing data), in cases where a given gene is not available for one or more species but is found in all other species. To handle incomplete samples, the distances that correspond to missing species in a given pairwise distance matrix are assigned the value of the average distance obtained from all the other distance matrices where the species is present.

Results
Phylo-MCOA Evaluation: Simulated Data Sets
Simulated data sets were generated in order to evaluate the efficiency and robustness of Phylo-MCOA. All the data sets were created in the same way: we generated a random tree with N species by random splitting of edges, as implemented in the function rtree of the "ape" package in the R language (Paradis et al. 2004), and duplicated this gene tree in order to get T identical gene trees. A new gene tree was randomly generated for each test performed. To create a complete outlier species, we changed its position randomly in all T trees. To generate a complete outlier gene, we replaced one of the gene trees by a completely random tree. Subsequently, we evaluated the efficiency of Phylo-MCOA by its ability to retrieve the correct complete outlier genes and species, using the corresponding 2WR matrix. All outliers were detected based on nodal distances, using the option distance = "nodal" in Phylo-MCOA. To simulate cell-by-cell outliers, we randomly sampled one species and one tree and randomly changed the position of the selected species in the selected tree. By repeating this operation multiple times, we obtained data sets with a variable proportion of cell-by-cell outliers.

Effect of Sample Size on the Ability of Phylo-MCOA to Retrieve Complete Outliers
Simulated data sets were obtained containing between 10 and 100 genes (10, 20, 30, 50, 80, and 100) and between 10 and 100 species (10, 20, 30, 50, 80, and 100); all possible combinations were explored. In all these data sets, we artificially introduced three complete outlier genes and three complete outlier species. We analyzed these data sets with Phylo-MCOA and evaluated the proportion of complete outlier genes and species that the method was able to retrieve (table 1). We observed that as long as the number of both species and genes is large enough (≥20), Phylo-MCOA always correctly retrieved the outliers (black cells in table 1). On the other hand, when the number of species is small (≤10), Phylo-MCOA is unable to retrieve the outlier genes correctly, even for a large number of genes (first column, light gray cells in table 1). However, outlier species are easier to retrieve as long as the sample size is large enough (≥20), even if there is a small number of genes (≤10, first row, dark gray cells in table 1). Finally, for a moderately small number of species (=20), outlier genes are easier to retrieve than outlier species (second column, dark gray cells in table 1).

Effect of the Proportion of Complete Outliers on the Ability of Phylo-MCOA to Retrieve Them
We created a data set of 50 genes and 50 species and tested the ability of Phylo-MCOA to retrieve complete outliers as their proportion increased. We tested data sets where between 2% and 60% of the genes, and between 2% and 60% of the species, were outliers (with values 2%, 10%, 20%, 40%, and 60%). All possible combinations of these proportions were tested. We then evaluated the number of complete outlier genes and species that Phylo-MCOA was able to retrieve correctly in each case (table 2). We observed that even when 20% of genes and species are outliers, Phylo-MCOA was still able to retrieve almost all of them correctly (black cells in table 2). A large proportion of outlier genes (40%) does not prevent retrieving outlier species. Similarly, a large proportion of outlier species is not a problem for retrieving outlier genes (dark gray cells in table 2). However, if a large proportion of both genes and species are outliers, it becomes difficult to retrieve outliers (light gray cells in table 2).

Table 2. Effect of the Proportion of Outlier Genes and Species on the Ability of Phylo-MCOA to Correctly Retrieve Them.

Retrieved Species/Genes

Proportion of complete        Proportion of complete outlier species, % [number]
outlier genes, % [number]     2 [1]    10 [5]    20 [10]    40 [20]    60 [30]
2 [1]                         1/1      5/1       9/1        0/1        0/0
10 [5]                        1/5      5/5       8/5        0/5        0/0
20 [10]                       1/10     5/10      10/10      0/0        0/0
40 [20]                       1/0      5/0       0/0        0/0        0/0
60 [30]                       0/0      0/0       0/0        0/0        0/0

NOTE: The data set was composed of 50 genes and 50 species. The proportions of outliers are given in percentages and the absolute number of outliers is given in brackets ([ ]). Each cell contains two values separated by a slash (/): number of complete outlier species retrieved/number of complete outlier genes retrieved. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.

Effect of the Proportion of Missing Data on the Ability of Phylo-MCOA to Retrieve Complete Outliers
In typical biological data sets, there are plenty of missing data; therefore, methods for real data set analysis should be robust to incompletely sampled data sets. In order to evaluate the ability of Phylo-MCOA to retrieve outlier genes and species despite missing data, we created a data set with 100 genes and 100 species containing three complete outlier species and five complete outlier genes. We increased the proportion of missing data progressively by randomly removing species from the original set of genes. We removed 1%, 2%, 5%, 10%, 15%, 20%, 30%, and 50% of the data, respectively (table 3). Note that a proportion of 20% of missing data means that, on average, each tree has lost 20% of its leaves. We observed that Phylo-MCOA is very robust to missing data. Even with 20% of the data missing, the three complete outlier species and one of the five complete outlier genes were correctly retrieved (table 3).
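The imputation rule described under Handling Missing Data, where a missing distance is replaced by the average of the same distance in the trees in which the species is present, can be sketched as follows. This is a Python illustration, not the authors' R code. The function name `fill_missing`, the use of `None` to mark a missing distance, and the toy values are assumptions, and the sketch further assumes that every pair of species is observed in at least one tree.

```python
def fill_missing(matrices):
    """Replace None entries by the mean of that entry over the other trees.

    matrices: list of dicts {(species_a, species_b): distance or None},
    all sharing the same keys.
    """
    filled = []
    for current in matrices:
        out = {}
        for pair, value in current.items():
            if value is None:
                # Average over the trees in which this distance is available.
                known = [m[pair] for m in matrices if m[pair] is not None]
                value = sum(known) / len(known)
            out[pair] = value
        filled.append(out)
    return filled

# Hypothetical distances for three gene trees; species C is absent from the third.
tree1 = {("A", "B"): 2.0, ("A", "C"): 4.0}
tree2 = {("A", "B"): 4.0, ("A", "C"): 6.0}
tree3 = {("A", "B"): 3.0, ("A", "C"): None}

filled = fill_missing([tree1, tree2, tree3])
```

In the third tree the missing A to C distance becomes the mean of the two observed values, while all observed distances are left unchanged.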

Capacity of Phylo-MCOA to Detect Complete Outliers in the Presence of Cell-by-Cell Outliers
Phylo-MCOA normally detects cell-by-cell outliers on a data set from which complete outliers have been removed. In order to verify whether Phylo-MCOA is able to correctly detect complete outliers in the presence of cell-by-cell outliers, we simulated a data set with 100 genes and 100 species containing three complete outlier species and three complete outlier genes, and evaluated the ability of Phylo-MCOA to correctly detect the complete outliers given different numbers of cell-by-cell outliers, from 1 to 1,500. Results are shown in table 4. If 1,200 or fewer cell-by-cell outliers are present, Phylo-MCOA always retrieves the complete outliers correctly. For 1,500 cell-by-cell outliers, all complete outlier genes are still retrieved, but the complete outlier species are not.

Capacity of Phylo-MCOA to Retrieve Cell-by-Cell Outliers
As previously explained, the detection of cell-by-cell outliers by Phylo-MCOA is performed on a data set free of complete outliers; complete outliers were therefore excluded from this test. We analyzed 100 genes and 100 species with a variable number of cell-by-cell outliers and evaluated the proportion that was detected by Phylo-MCOA. For each number of cell-by-cell outliers, we repeated the operation 20 times in order to estimate the mean proportion of cell-by-cell outliers retrieved and its standard deviation (table 5). We observed that the mean proportion of cell-by-cell outliers retrieved decreased very slowly, from 100% (SD = 0) with two and five cell-by-cell outliers included to more than 85% with 1,000 cell-by-cell outliers included. Phylo-MCOA is thus able to detect cell-by-cell outliers even when there are many of them.
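The summary statistics reported for this test (mean and standard deviation of the proportion retrieved across repetitions) can be sketched as follows. This is an illustration only: the cells, the three toy repetitions, and the use of the sample standard deviation are assumptions, not the values or the exact estimator used in the study.

```python
import statistics


def proportion_retrieved(true_cells, detected_cells):
    """Fraction of simulated cell-by-cell outliers that were detected.

    Both arguments are sets of (gene, species) pairs.
    """
    return len(true_cells & detected_cells) / len(true_cells)


def summarize(proportions):
    """Mean and sample standard deviation across repetitions (assumed estimator)."""
    return statistics.mean(proportions), statistics.stdev(proportions)


# Hypothetical example: four simulated outlier cells, three repetitions.
truth = {("g1", "spA"), ("g2", "spB"), ("g3", "spC"), ("g4", "spD")}
runs = [
    {("g1", "spA"), ("g2", "spB"), ("g3", "spC"), ("g4", "spD")},  # all retrieved
    {("g1", "spA"), ("g2", "spB"), ("g3", "spC")},                 # one missed
    {("g1", "spA"), ("g2", "spB"), ("g3", "spC"), ("g9", "spZ")},  # one missed, one false positive
]
proportions = [proportion_retrieved(truth, detected) for detected in runs]
mean, sd = summarize(proportions)
print(proportions, mean, sd)
```

Note that a false positive does not change the proportion retrieved, which only measures how many of the simulated outliers were found.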

Table 3. Effect of the Proportion of Missing Data on the Number of Complete Outliers Retrieved (genes and species).

Proportion of Missing Data    Species retrieved    Genes retrieved
1%                            3                    5
2%                            3                    5
5%                            3                    5
10%                           3                    5
20%                           3                    1
30%                           0                    0
50%                           0                    0

NOTE.—The data set was comprised of 100 genes and 100 species with five complete outlier genes and three complete outlier species. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.

Table 4. Number of Complete Outliers Retrieved with Increasing Number of Cell-by-Cell Outliers.

Number of Cell-by-Cell Outliers    Species retrieved    Genes retrieved
1                                  3                    3
10                                 3                    3
50                                 3                    3
100                                3                    3
300                                3                    3
500                                3                    3
1,000                              3                    3
1,200                              3                    3
1,500                              0                    3

NOTE.—The data set was comprised of 100 genes and 100 species with three complete outlier genes and three complete outlier species. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.


de Vienne et al. · doi:10.1093/molbev/msr317

MBE
Table 5. Mean Proportion of Cell-by-Cell Outliers Retrieved Out of 20 Repetitions with Increasing the Number of Cell-by-Cell Outliers.

Number of Cell-by-Cell Outliers    Proportion Retrieved (Mean)    SD
2                                  1                              0
5                                  1                              0
10                                 0.975                          0.055
20                                 0.973                          0.034
50                                 0.960                          0.030
100                                0.947                          0.022
200                                0.938                          0.018
300                                0.932                          0.015
400                                0.923                          0.011
500                                0.907                          0.011
600                                0.897                          0.014
700                                0.888                          0.011
800                                0.888                          0.010
900                                0.871                          0.008
1,000                              0.863                          0.009

NOTE.—The data set was comprised of 100 genes and 100 species. Nodal distances were used in the Phylo-MCOA analysis.

Phylo-MCOA Evaluation: Real Data Set
We used Phylo-MCOA on a real data set comprising 246 gene trees for 21 fungal species. This data set was used previously by Aguileta et al. (2008) to identify the best genes for reconstructing the phylogeny of fungi, and the data are available in FUNYBASE (Marthey et al. 2008). Aguileta et al. (2008) compared each individual gene tree with the species tree obtained from the concatenation of the 246 genes. From this comparison, they obtained a score for each gene tree representing how suitable it was for recovering the reference (concatenated) tree. Using Phylo-MCOA, we analyzed the data set of Aguileta et al. (2008) using nodal distances between species and obtained a score for each species and gene as explained in the Materials and Methods section. These scores are plotted in figure 4 for the genes (top) and for the species (bottom). Phylo-MCOA detected five complete outlier genes (black bars in fig. 4) and no complete outlier species. The absence of outlier species in this data set is not surprising: the goal of Aguileta et al. (2008) was to find optimal gene markers for reconstructing the fungal species tree, so species producing incongruent trees had been removed from their data set. In order to check for consistency, we compared the score that Aguileta et al. (2008) obtained for each gene (a topological similarity score referred to as topscore) with the score we obtained with Phylo-MCOA. Figure 5 (white dots) shows the scatter plot of these two scores. Phylo-MCOA is able to retrieve the correct ranking of trees and to identify genes that can be considered as outliers. We found that trees considered as outliers in Aguileta et al. (2008) were indeed also outliers according to the analysis using Phylo-MCOA (gray arrows in fig. 5). When using patristic distances instead of nodal distances for the Phylo-MCOA analysis (black dots in fig. 5), the same trend is still present but the correlation with the topscore values of Aguileta et al. (2008) appears to decrease slightly. This is expected because the topscore used by Aguileta et al. (2008) is a topological distance measure and is thus more similar to Phylo-MCOA with nodal distances than to Phylo-MCOA with patristic distances. Overall, this illustrates the general utility of Phylo-MCOA for analyzing real data sets. Moreover, the method is very fast: on the set of trees from Aguileta et al. (2008), the analysis took about 1 min on a PC laptop.
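The difference between the two distance types compared above can be made concrete with a small sketch. It assumes that the nodal distance between two species is the number of internal nodes on the path joining them, and that the patristic distance is the sum of the branch lengths along that path. The toy tree, its labels, its branch lengths, and the function names are hypothetical.

```python
def build_tree(edges):
    """Build an undirected adjacency map {node: {neighbour: branch_length}}."""
    tree = {}
    for u, v, length in edges:
        tree.setdefault(u, {})[v] = length
        tree.setdefault(v, {})[u] = length
    return tree


def path_between(tree, start, end):
    """Return the nodes on the unique path from start to end (depth-first search)."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            return path
        for neighbour in tree[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    raise ValueError("nodes are not connected")


def patristic_distance(tree, a, b):
    """Sum of branch lengths along the path between two leaves."""
    path = path_between(tree, a, b)
    return sum(tree[u][v] for u, v in zip(path, path[1:]))


def nodal_distance(tree, a, b):
    """Number of internal nodes on the path between two leaves (assumed definition)."""
    return len(path_between(tree, a, b)) - 2


# Toy unrooted tree ((A,B),(C,D)) in which D sits on a long branch.
tree = build_tree([
    ("A", "n1", 0.1), ("B", "n1", 0.2), ("n1", "n2", 0.3),
    ("C", "n2", 0.4), ("D", "n2", 2.0),
])
for leaf in ("B", "C", "D"):
    print("A", leaf, nodal_distance(tree, "A", leaf), round(patristic_distance(tree, "A", leaf), 3))
```

In this toy tree, C and D are at the same nodal distance from A but at very different patristic distances, which is the kind of rate effect that only the patristic analysis can reveal.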

Using Phylo-MCOA for Filtering Phylogenomic Data Sets
Deep relationships in the animal phylogeny are still debated. Schierwater et al. (2009) proposed a phylogeny of the Metazoa that was questioned because of the low quality of the data set used to construct it. More precisely, Philippe et al. (2011) detected important problems in the data set of Schierwater et al. (2009), such as contaminations, deep paralogies, or horizontal gene transfers. They
[Figure 4 bar plots. Top panel: Genes (one bar per gene tree; gene identifiers prefixed FG or MS). Bottom panel: Species (Spa, Cne, Uma, Dha, Sce, Pch, Clu, Spo, Afu, Ani, Ssc, Cim, Ncr, Tre, Yli, Ago, Cgl, Sba, Mgr, Fgr, Kla). Vertical axis in both panels: outlier detection proportion.]

FIG. 4. Bar plots for the fungal data showing the score calculated for each gene (top) and each species (bottom). Black bars represent significant outliers according to the threshold chosen by the user (here 0.4). Nodal distances were used for the Phylo-MCOA analysis in this case.
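The thresholding step mentioned in the caption of figure 4 can be sketched as follows. The sketch takes the per-gene (or per-species) scores as given and simply flags those above a user-chosen threshold; the strict inequality, the function name, and the example scores are assumptions made for illustration and are not values from the fungal data set.

```python
def flag_outliers(scores, threshold=0.4):
    """Return the names whose score exceeds the threshold, highest score first.

    scores: dict mapping a gene (or species) name to its outlier score.
    Assumption: a score exactly equal to the threshold is not flagged.
    """
    flagged = [name for name, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)


# Hypothetical scores for four genes.
example_scores = {"geneA": 0.05, "geneB": 0.62, "geneC": 0.41, "geneD": 0.40}
print(flag_outliers(example_scores))
```

Lowering the threshold flags more genes or species, so the choice of threshold directly controls how conservative the detection is.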


Detecting Outlier Genes and Species in Phylogenomics · doi:10.1093/molbev/msr317


FIG. 5. Scatter plot of the topological congruence (topscore) between 246 fungal gene trees and the consensus species tree obtained from the concatenation of individual trees (x axis) against the score of the genes according to Phylo-MCOA (y axis). Black dots represent the scores when using patristic distances. White dots represent the scores when using nodal distances. Outlier genes according to Phylo-MCOA and identified by black bars in figure 4 are shown with small gray arrows. They appear to be poorly congruent with the species tree (small values on the x axis).
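One way to quantify the agreement between the two scores plotted in figure 5 is a rank correlation. The sketch below uses Spearman's coefficient, which is our choice for illustration and not necessarily the measure used in the study; the score values are hypothetical. Because outlier genes have a high Phylo-MCOA score and a low topscore, good agreement between the two methods corresponds to a strongly negative coefficient.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four genes.
topscore = [0.9, 0.8, 0.6, 0.3]
mcoa_score = [0.05, 0.10, 0.20, 0.70]
print(spearman(topscore, mcoa_score))
```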

reanalyzed the data set of Schierwater et al. (2009) after replacing problematic sequences with correct ones and showed that the resulting tree, even though it had fewer well-supported nodes, was more realistic given previous studies. In order to test whether Phylo-MCOA is able to filter problematic sequences from a real data set, we analyzed the set of phylogenetic trees used by Schierwater et al. (2009). We removed the outliers detected by Phylo-MCOA, built a phylogeny using only the filtered data set (i.e., without problematic sequences), and compared our tree with the one obtained without filtering and the one produced by Philippe et al. (2011) after filtering the data. To obtain the animal phylogeny, we used the original data set of Schierwater et al. (2009). It was first aligned and trimmed with the same tools and parameters as in Philippe et al. (2011), and the individual gene trees were reconstructed using the same model as Philippe et al. (2011). As in Philippe et al. (2011), four genes (AT8, DCR, N4L, and ND6) were removed from the analysis because they contained too few positions after trimming. All the trees were analyzed with Phylo-MCOA using patristic distances and all default parameters for the outlier detection. Four complete outlier genes were detected by Phylo-MCOA and removed before performing the cell-by-cell detection: rrna16S, rrna18S, rrna28S, and rrna58S. Of the 984 cells in the 2WR matrix (41 genes × 24 species), 21 cell-by-cell outliers were detected (see supplementary table S1, Supplementary Material online) and were removed from the original data set (by removing the sequences in the original sequence files). The remaining sequences (available as supplementary data set S1, Supplementary Material online) were then concatenated and used to reconstruct a phylogenetic tree with 100 bootstrap replicates using RAxML (Stamatakis 2006) and the same model as in Schierwater et al. (2009). The tree we obtained without filtering (fig. 6a) was similar to the one proposed by Schierwater et al. (2009), whose data set is known to contain noisy data and thus to produce an inconsistent topology. After filtering, there was a general decrease of bootstrap support for nodes in the “diploblasts” group (fig. 6b), while the support

FIG. 6. Comparison of the phylogeny of animals from the Schierwater et al. (2009) data set. (a) Tree obtained from the original sequences. (b) Tree obtained after removal of cell-by-cell outliers using Phylo-MCOA. Species with a star (*) represent Porifera. The gray arrows point to the node support of the “diploblasts” group in the two trees. Nodes with black dots represent bootstrap supports higher than 95. A hundred bootstrap replicates were performed for each tree.

of the nodes outside the “diploblasts” was almost unaffected. Overall, the bootstrap support for the monophyly of “diploblasts” went from 100 in the original data set to 63 in the filtered data set (gray arrows in fig. 6), which is in accordance with the accepted view that “diploblasts” are certainly not monophyletic. The monophyly of Porifera (species with an * in fig. 6) was lost, which is also in accordance with previous studies and consistent with the result of Philippe et al. (2011) on this data set, who obtained a node for the Porifera with a support value of 36. Phylo-MCOA was thus able to identify true outliers that need to be filtered out prior to further phylogenetic analysis in order to estimate a more consistent topology. However, we could not retrieve exactly the same topology as the one obtained by Philippe et al. (2011). This may be because we did not replace the sequences identified as problematic with correct sequences, as was done in Philippe et al. (2011). Moreover, Phylo-MCOA considers that the correct signal is the one that is present in the majority of the genes, but this is not true in this data set for the Hexactinellida, where the majority of the sequences are slow-evolving mitochondrial sequences that are contaminations from the Demospongiae, while the true nuclear sequences of Hexactinellida are fast evolving (Philippe et al. 2011). As a consequence, Hexactinellida and Demospongiae still form a very solid group after filtering, and the branch leading to Hexactinellida is shorter in the new tree (fig. 6b) than in the original tree (fig. 6a).

The Phylo-MCOA Program: Availability and Requirements
Phylo-MCOA is provided as a set of R functions that make use of the ade4 (Chessel et al. 2004; Dray et al. 2007), ape (Paradis et al. 2004), and lattice (for graphical functions; Sarkar 2010) packages as well as newly developed functions. It requires a recent version of R (>2.8.0) installed on the machine. Phylo-MCOA is freely distributed under the GNU General Public License. It can be downloaded at http://phylomcoa.cgenomics.org.

Discussion
It has become standard practice to use multiple genes in order to obtain a reliable and concordant phylogeny. When sampling many different genes for the same set of species, the individual gene trees will often be partly incongruent. In this case, one is interested in detecting which subset of genes and species is responsible for the discrepancies, in other words, in finding the outliers. Phylo-MCOA, the method proposed here, provides a quick and powerful way to detect outliers by integrating the phylogenetic information contained in all the individual gene trees. Phylo-MCOA determines how different the compared topologies are but, more importantly, it identifies which genes and species explain those differences. This may prove very useful when determining possible causes for incongruities, such as horizontal gene transfers, different selective pressures, and recombination events, among others. Outlier genes are typically affected by these factors, so identifying which subset of species explains the differences may help to focus the efforts for studying the affected species. On the other hand, by building a consensus typology, Phylo-MCOA also finds the genes that produce concordant phylogenies. These genes are excellent candidates for recovering the species tree, which can be subsequently verified by other methods. Phylo-MCOA thus provides good candidates, not only for outliers, but also for phylogenetic markers producing concordant topologies.

To our knowledge, this is the first time that MCOA has been applied directly to analyze phylogenetic distances. Based on a covariation optimization criterion, MCOA is a method that captures the similarities (and dissimilarities) obtained from the ordination of several tables (types) of data, as has been previously demonstrated for the analysis of ecological data (Bady et al. 2004; Hedde et al. 2005; Laloe et al. 2007). Furthermore, it is a statistically solid technique that lends itself well to the analysis of phylogenetic data (i.e., in terms of topological distances), where the covariation of the histories of genes and species can be integrated in a single ordination analysis. Even though a consensus typology is proposed by the method, MCOA takes into account all the gene trees in order to build it. We believe this makes better use of the individual tree information than other forms of phylogenetic consensus that do not take into account outliers or use only a part of the available data. Our method is unique in simultaneously detecting outlier species and genes from the comparison of multiple gene trees.

One important aspect of tree comparisons involves the estimation of pairwise distance matrices. The eigendecomposition and further ordination represented in the consensus typology take the distances as absolute distances. Therefore, the consensus typology will change depending on the type of distance that is actually being measured. There are two ways to calculate the distance between two species in a phylogenetic tree: the sum of the branch lengths or the number of nodes separating the species in the tree. Both distance metrics have advantages and disadvantages: using the sum of the branch lengths (patristic distances) allows incorporating the relative rates of evolution but can artificially bring closer species that have short branches and separate species with longer branches. Conversely, if the internodal distance separating species is used, then the consensus typology informs us about a degree of phylogenetic proximity, but it tells us nothing about the relative rates of evolution. It is important to be explicit about the pairwise distance method employed in order to derive conclusions, as genes identified as outliers when using one method may seem concordant with other genes when using another one. Comparing the results given by both methods can be very informative. A gene that is identified as an outlier when using patristic distances but not when using nodal distances will represent a gene evolving at higher or slower rates in some species compared with other genes.

Phylo-MCOA allows weighting trees according to different scores (e.g., likelihood scores), so that more credible trees can be given more weight in the analysis. It is equally important to assign weights to nodes to distinguish supported relationships from spurious or dubious ones. In principle, one could incorporate a measure of node support into MCOA metrics; however, this is not straightforward. At present, we have arrived at a solution that may alleviate the problem of node support assignment to some extent: we have incorporated the possibility of collapsing poorly supported nodes, according to a cutoff value given by the user. In this way, the nodes that are poorly supported will contribute little or nothing to the construction of the consensus typology. Assessing node support across phylogenies is difficult, especially with cases of missing or duplicated data, as not all nodes will be present in all trees. For this reason, we suggest the use of TreeKO (Marcet-Houben and Gabaldón 2011) for trees with duplicated genes, prior to analysis with Phylo-MCOA. Interestingly, our method can be used even in the absence of completely sampled data sets. The program can deal with gene losses, as demonstrated by randomly deleting species from the simulated data analyzed.

We have shown that Phylo-MCOA is able to correctly detect outlier genes and species from the comparison of multiple phylogenetic trees, both on simulated data sets and, more interestingly, on real data sets. We have shown, with two different examples, how Phylo-MCOA can detect true outliers: first, with the fungal data set of Aguileta et al. (2008), where we found the same concordant and discordant genes as in the original study; and second, with the animal data set of Schierwater et al. (2009), where we used Phylo-MCOA to predict which cell-by-cell outliers need to be removed from phylogenetic reconstruction in order to recover the animal phylogeny obtained after removing noisy data, as done by Philippe et al. (2011). Phylo-MCOA can be used to detect candidate genes for further analysis, or for filtering out problematic genes in phylogenetic reconstruction.

Phylo-MCOA can detect two types of outliers. Complete outliers identify which genes and species deviate from the evolutionary history of most genes analyzed. This information can be very useful for filtering out noisy data in phylogenetic reconstruction. On the other hand, cell-by-cell outliers point to particular cases of topological discordance, where one specific species is an outlier in a given gene. This information can be used to postulate specific hypotheses concerning particular genes or species. Although cell-by-cell outlier detection is implemented in the current Phylo-MCOA version, there is room for improvement, specifically in the way outliers are scored to define significance. Nevertheless, caution should always be exercised because the method gives more weight to the information given by the majority of genes under study, and if the data set analyzed is plagued with dubious sequences, the method will favor an incorrect signal. This is what we observed when we analyzed the data set of Schierwater et al. (2009), which is known to be riddled with faulty sequences. External information on the quality of the sequences should be employed whenever possible prior to analysis with our method.
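The use of both outlier types for filtering a data set can be sketched as follows. This is a minimal illustration that assumes the outliers have already been identified; it only applies the removals (whole genes and species for complete outliers, individual sequences for cell-by-cell outliers) and does not perform any detection. The data layout, the function name, and the toy data are hypothetical.

```python
def filter_dataset(sequences, outlier_genes=(), outlier_species=(), outlier_cells=()):
    """Remove complete outliers and cell-by-cell outliers from a data set.

    sequences: dict mapping gene -> dict mapping species -> sequence string.
    outlier_genes / outlier_species: complete outliers, removed entirely.
    outlier_cells: (gene, species) pairs whose single sequence is removed.
    Returns a new nested dict; the input is left unchanged.
    """
    genes_out = set(outlier_genes)
    species_out = set(outlier_species)
    cells_out = set(outlier_cells)
    filtered = {}
    for gene, by_species in sequences.items():
        if gene in genes_out:
            continue  # complete outlier gene: drop the whole gene
        filtered[gene] = {
            species: seq
            for species, seq in by_species.items()
            if species not in species_out and (gene, species) not in cells_out
        }
    return filtered


# Toy data set: 3 genes x 3 species, all sequences present.
toy = {g: {s: "ACGT" for s in ("sp1", "sp2", "sp3")} for g in ("g1", "g2", "g3")}
filtered = filter_dataset(toy, outlier_genes={"g3"}, outlier_cells={("g1", "sp2")})
print({gene: sorted(kept) for gene, kept in filtered.items()})
```

As stressed above, such filtering follows the majority signal, so it is best combined with external information on sequence quality.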

Supplementary Material
Supplementary table S1 and data set S1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments
We would like to thank Hervé Philippe for the time he spent on this manuscript and the helpful comments and suggestions he made. We thank David Moreira for constructive discussions, Avner Bar-Hen for interesting comments on the use of MCOA for phylogenomics analysis, Vincent Berry for suggestions on future developments and comments on distances between species, and Fran Supek for his suggestions on statistical significance testing. Finally, we thank Tatiana Giraud and Toni Gabaldón for insightful comments on the manuscript as well as three anonymous reviewers. D.M.V. acknowledges postdoctoral grants from the Centre National de la Recherche Scientifique (CNRS) and Université Paris-Sud 11. G.A. acknowledges postdoctoral grants from the CNRS, Université Paris-Sud 11, and Agence National de la Recherche (ANR06-BLAN-0201).

References
Aguileta G, Marthey S, Chiapello H, Lebrun MH, Rodolphe F, Fournier E, Gendrault-Jacquemard A, Giraud T. 2008. Assessing the performance of single-copy genes for recovering robust phylogenies. Syst Biol. 57:613–627.
Ane C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian estimation of concordance among gene trees. Mol Biol Evol. 24:412–426.
Bady P, Doledec S, Dumont B, Fruget JF. 2004. Multiple co-inertia analysis: a tool for assessing synchrony in the temporal variability of aquatic communities. C R Biol. 327:29–36.
Berthouly C, Bed’Hom B, Tixier-Boichard M, Chen CF, Lee YP, Laloe D, Legros H, Verrier E, Rognon X. 2008. Using molecular markers and multivariate methods to study the genetic diversity of local European and Asian chicken breeds. Anim Genet. 39:121–129.
Brochier C, Bapteste E, Moreira D, Philippe H. 2002. Eubacterial phylogeny based on translational apparatus proteins. Trends Genet. 18:1–5.
Chessel D, Dufour AB, Thioulouse J. 2004. The ade4 package-I: one-table methods. R News. 4:5–10.
Chessel D, Hanafi M. 1996. Analyses de la co-inertie de K nuages de points. Rev Stat Appl. 44:35–60.
Cranston KA, Rannala B. 2007. Summarizing a posterior distribution of trees using agreement subtrees. Syst Biol. 56:578–590.
de Vienne DM, Aguileta G, Ollier S. 2011. The Euclidean nature of phylogenetic distance matrices. Syst Biol. 60:826–832.
Degnan JH, Rosenberg NA. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2:762–768.
Dray S, Dufour AB, Chessel D. 2007. The ade4 package-II: two-table and K-table methods. R News. 7:47–52.
Edwards SV, Liu L, Pearl DK. 2007. High-resolution species trees without concatenation. PNAS 104:5936–5941.
Felsenstein J. 1985. Confidence-limits on phylogenies—an approach using the bootstrap. Evolution 39:783–791.
Gadagkar SR, Rosenberg MS, Kumar S. 2005. Inferring species phylogenies from multiple genes: concatenated sequence tree versus consensus gene tree. J Exp Zool B Mol Dev Evol. 304:64–74.
Gordon AD. 1986. Consensus supertrees—the synthesis of rooted trees containing overlapping sets of labeled leaves. J Classif. 3:335–348.
Gower JC. 1984. Distance matrices and their Euclidean approximation. In: Jambu M, Diday E, Lebart L, Pages J, Tomassone R, editors. Data analysis and informatics III. Amsterdam (The Netherlands): Elsevier. p. 3–21.
Gower JC, Legendre P. 1986. Metric and Euclidean properties of dissimilarity coefficients. J Classif. 3:5–48.
Hedde M, Lavelle P, Joffre R, Jimenez JJ, Decaens T. 2005. Specific functional signature in soil macro-invertebrate biostructures. Funct Ecol. 19:785–793.
Hillis DM, Heath TA, St John K. 2005. Analysis and visualization of tree space. Syst Biol. 54:471–482.
Huson DH, Bryant D. 2006. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 23:254–267.
Laloe D, Jombart T, Dufour AB, Moazami-Goudarzi K. 2007. Consensus genetic structuring and typological value of markers using multiple co-inertia analysis. Genet Sel Evol. 39:545–567.
Leigh JW, Schliep K, Lopez P, Bapteste E. 2011. Let them fall where they may: congruence analysis in massive, phylogenetically messy datasets. Mol Biol Evol. 28:2773–2785.
Leigh JW, Susko E, Baumgartner M, Roger AJ. 2008. Testing congruence in phylogenomic analysis. Syst Biol. 57:104–115.
Manly BFJ. 2004. Multivariate statistical methods: a primer. 3rd ed. London: Chapman & Hall, Ltd.
Marcet-Houben M, Gabaldón T. 2011. TreeKO: a duplication-aware algorithm for the comparison of phylogenetic trees. Nucleic Acids Res. 39(10):e66.
Marthey S, Aguileta G, Rodolphe F, et al. (11 co-authors). 2008. FUNYBASE: a FUNgal phYlogenomic dataBASE. BMC Bioinformatics. 9:456.
Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20:289–290.
Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D. 2011. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 9:e1000602.
Rodriguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H. 2007. Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol. 56:389–399.
Rosenberg NA. 2002. The probability of topological concordance of gene trees and species trees. Theor Popul Biol. 61:225–247.
Roure B, Rodriguez-Ezpeleta N, Philippe H. 2007. SCaFoS: a tool for selection, concatenation and fusion of sequences for phylogenomics. BMC Evol Biol. 7(1 Suppl):S2.
Sanderson MJ, Purvis A, Henze C. 1998. Phylogenetic supertrees: assembling the trees of life. Trends Ecol Evol. 13:105–109.
Sarkar D. 2010. Lattice: lattice graphics [Internet]. R package version 0.18-8. Available from: http://CRAN.R-project.org/package=lattice.
Schierwater B, Eitel M, Jakob W, Osigus H-J, Hadrys H, Dellaporta SL, Kolokotronis S-O, DeSalle R. 2009. Concatenated analysis sheds light on early metazoan evolution and fuels a modern “urmetazoon” hypothesis. PLoS Biol. 7:e1000020.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690.
Suchard MA. 2005. Stochastic models for horizontal gene transfer: taking a random walk through tree space. Genetics 170:419–431.
Suchard MA, Weiss RE, Sinsheimer JS, Dorman KS, Patel M, McCabe ERB. 2003. Evolutionary similarity among genes. J Am Stat Assoc. 98:653–662.
Susko E, Leigh J, Doolittle WF, Bapteste E. 2006. Visualizing and assessing phylogenetic congruence of core gene sets: a case study of the gamma-proteobacteria. Mol Biol Evol. 23:1019–1030.
White WT, Hills SF, Gaddam R, Holland BR, Penny D. 2007. Treeness triangles: visualizing the loss of phylogenetic signal. Mol Biol Evol. 24:2029–2039.
