Using Consensus Networks to Visualize Contradictory Evidence for Species Phylogeny
Advances in Molecular Biology, 2004, Issue 7
* Allan Wilson Centre for Molecular Ecology and Evolution, Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand
Department of Biometry and Engineering, The Swedish University of Agricultural Sciences, Uppsala, Sweden
The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden
Allan Wilson Centre for Molecular Ecology and Evolution, Institute for Molecular BioSciences, Massey University, Palmerston North, New Zealand
E-mail: P.J.Lockhart@massey.ac.nz.
Abstract
Building species phylogenies from genome data requires the evaluation of phylogenetic evidence from independent gene loci. We propose an approach to do this using consensus networks. We compare gene trees for eight yeast genomes and show that consensus networks have potential for helping to visualize contradictory evidence for species phylogenies.
Although gene trees are important for inferring species relationships, they differ in a number of ways from species trees. These differences are caused by the complex (nonbifurcating) evolutionary histories of species, as well as evolutionary processes that violate the assumptions of simple sequence substitution models. These processes may include gene conversion and concerted evolution (Aguilar, Rossello, and Feliner 1999; Buckler, Ippolito, and Holtsford 1997), as well as asymmetric sequence evolution (Moores and Holmes 2000; Steel, Huson, and Lockhart 2000). Despite the difficulty of interpretation, gene trees are central to the problem of building a species phylogeny. An important issue is how to combine evidence from different genome loci without losing information about independent gene histories. This may not be achievable using current methods, including those that (1) reconstruct the optimal evolutionary tree for single genes and then find their consensus (Bryant 2003) or build a supertree (Bininda-Emonds, Gittleman, and Steel 2002), (2) identify optimal substitution models for different genes and then find the tree that best fits these mixed substitution models (Yang 1996; Ronquist and Huelsenbeck 2003), and (3) concatenate genes before fitting an evolutionary model (Huelsenbeck, Bull, and Cunningham 1996). The last method, in particular, is highly controversial. Although concatenation can provide long enough sequences to overcome sampling error, the problem of model misspecification has the potential to lead tree-building methods to converge (e.g., with 100% nonparametric bootstrap support) to an incorrect tree.
A recent study by Rokas et al. (2003) has allowed us to test a new approach to visualizing the extent of contradictory genome evidence for building a species phylogeny. Our approach uses consensus networks (Bandelt 1995; Holland and Moulton 2003) and involves combining "splits" (partitions of the taxa into two groups) from different genes into a potentially hyperdimensional graph. Computational details for constructing consensus networks are described in Holland and Moulton (2003), where they are used to visualize phylogenetic uncertainty in large collections of trees. The consensus network approach is now freely available and implemented in SplitsTree version 4.0 (http://www-ab.informatik.uni-tuebingen.de/software/jsplits/welcome_en.html).
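To make the notion of a split concrete, the following Python sketch extracts the nontrivial splits from a tree written in Newick format. It is only an illustration and not the SplitsTree implementation: the function name `newick_splits` and the five-taxon toy tree are assumptions made for this example, and the parser handles only simple Newick strings (optional branch lengths and internal node labels, no quoted names).

```python
import re


def newick_splits(newick):
    """Return (taxa, splits) for a tree given as a simple Newick string.

    Each split is stored as the frozenset of taxa on the side that does NOT
    contain the reference taxon (the alphabetically first name), so the same
    split found in two different trees compares equal.
    Trivial splits (one taxon versus the rest) are omitted.
    """
    # Drop branch lengths such as ":0.12", then tokenise.
    stripped = re.sub(r":[^,();]*", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;:\s]+", stripped)

    stack, clusters, prev = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append("(")
        elif tok == ")":
            members = set()
            while stack[-1] != "(":
                members |= stack.pop()
            stack.pop()  # discard the "(" marker
            cluster = frozenset(members)
            stack.append(cluster)
            clusters.append(cluster)
        elif tok not in (",", ";") and prev != ")":
            stack.append(frozenset([tok]))  # a leaf name
        # A name directly after ")" is an internal label and is ignored.
        prev = tok

    taxa = clusters[-1]  # the outermost parentheses enclose all taxa
    ref = min(taxa)
    splits = set()
    for cluster in clusters:
        side = cluster if ref not in cluster else taxa - cluster
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
    return taxa, splits


# Hypothetical toy tree on five taxa, with branch lengths and one support label.
taxa, splits = newick_splits("((A:0.1,B:0.2)95:0.3,(C,D),E);")
for s in sorted(splits, key=sorted):
    print(sorted(taxa - s), "|", sorted(s))
```

Storing each split as the side that excludes a fixed reference taxon is one convenient convention; any canonical form works as long as it is applied to every gene tree before the splits are compared.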
In this brief communication, we use the gene trees from Rokas et al. (2003) to construct a consensus network for 106 orthologs common to Candida albicans and seven species of Saccharomyces. We compare our results with those obtained in their analyses. They concatenated the 106 genes from the yeast genomes and then exhaustively evaluated the best-fitting parsimony and maximum-likelihood trees. A single tree was identified with 100% nonparametric bootstrap support for all internal edges. These authors noted that this level of support was obtained even though some individual gene trees were not congruent with the tree derived from the concatenated sequences.
To gain insight into the extent of incongruence between their 106 individual gene trees, we constructed a consensus network from the splits occurring in the different gene trees. In doing this, we used median network construction (Bandelt 1995; Holland and Moulton 2003) and included in our consensus network all splits that occurred above a threshold value of 10% (i.e., all splits that occurred in at least 10 of the 106 gene trees). Ours is a novel application of median networks, which typically are used to analyze sequence site pattern variation in population studies (Bandelt, Forster, and Rohl 1999; Huber et al. 2001). Because splits are compared for a large number of independent gene loci, we show for the Rokas et al. (2003) data that our approach provides a more informative indicator of the species phylogeny than does concatenation of sequences.
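The split-selection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: each gene tree is assumed to be already reduced to a set of splits (each split a `frozenset` naming one side of the bipartition), the function name `consensus_splits` is hypothetical, and the four toy trees on taxa A to E are invented for the example. The threshold is passed as an absolute count, matching the "at least 10 of the 106 gene trees" wording.

```python
from collections import Counter


def consensus_splits(trees, min_count):
    """Count how many trees contain each split and keep the frequent ones.

    trees     -- iterable of trees, each an iterable of hashable splits
    min_count -- keep splits that occur in at least this many trees
    Returns a dict mapping each retained split to its tree count.
    """
    counts = Counter(split for tree in trees for split in set(tree))
    return {split: n for split, n in counts.items() if n >= min_count}


# Hypothetical toy data: four trees on taxa A..E, each listed by its two
# nontrivial splits (written as the side that does not contain taxon A).
toy_trees = [
    {frozenset("CDE"), frozenset("DE")},
    {frozenset("CDE"), frozenset("DE")},
    {frozenset("BDE"), frozenset("DE")},
    {frozenset("CDE"), frozenset("CE")},
]

kept = consensus_splits(toy_trees, min_count=2)
for split, n in kept.items():
    print(sorted(split), "occurs in", n, "of", len(toy_trees), "trees")
```

The retained counts are also what the edge lengths in a consensus network are drawn from, because each edge length is proportional to the number of trees that contain the corresponding split.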
Expectations are clear when reconstructing a species phylogeny from a collection of gene trees using consensus networks. At one extreme, in the absence of any common phylogenetic patterns among the gene trees, the consensus network will be a structure consisting of high-dimensional hypercubes (e.g., as in figure 1a). At the other extreme, and assuming no stochasticity associated with a bifurcating evolutionary process, the consensus network will be a unique bifurcating tree.
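The two extremes can be related to pairwise compatibility of splits. Two splits A|B and C|D of the same taxon set are compatible when at least one of the four intersections (A with C, A with D, B with C, B with D) is empty; a set of pairwise compatible splits fits on a single tree, whereas incompatible pairs are what produce boxes in the network. The Python sketch below checks this condition. The function names and the toy splits are assumptions made for illustration.

```python
from itertools import combinations


def compatible(s1, s2, taxa):
    """True if two splits can appear together in one tree.

    s1, s2 -- frozensets naming one side of each split
    taxa   -- frozenset of all taxa (the other side is taxa minus the split)
    """
    a, b = s1, taxa - s1
    c, d = s2, taxa - s2
    return not (a & c) or not (a & d) or not (b & c) or not (b & d)


def incompatible_pairs(splits, taxa):
    """List every pair of splits that cannot be displayed on the same tree."""
    return [
        (s1, s2)
        for s1, s2 in combinations(splits, 2)
        if not compatible(s1, s2, taxa)
    ]


# Hypothetical example on five taxa.
taxa = frozenset("ABCDE")
splits = [frozenset("CDE"), frozenset("DE"), frozenset("BDE")]

for s1, s2 in incompatible_pairs(splits, taxa):
    print(sorted(s1), "conflicts with", sorted(s2))
```

In this toy case only the splits CDE and BDE conflict, so a network displaying all three splits would contain a single box, while the split DE would still be drawn as an ordinary tree edge.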
FIG. 1. Consensus networks have been reconstructed for (a) 106 random bifurcating trees on eight taxa [A, B, C, ... , H], (b) the 106 maximum-likelihood trees obtained in Rokas et al. (2003), and (c) the 106 maximum-parsimony trees obtained in Rokas et al. (2003). The presence of boxes in these networks indicates contradictory evidence for grouping certain species together. The lengths of the edges are proportional to the number of gene trees in which a particular edge occurs. Each network displays all those edges that are represented in at least 10 of the 106 trees.
The consensus networks for the 106 maximum-likelihood (ML) and maximum-parsimony (MP) trees of Rokas et al. (2003) (figure 1b and c, respectively) are both very similar to the concatenated gene tree reported earlier by these authors. The largely bifurcating nature of these consensus networks indicates that the data are very treelike and that there is common phylogenetic signal between a large number of independent genes. However, there is some uncertainty in the species phylogeny with respect to the relationships among S. castellii, S. kluyveri, and C. albicans, as well as uncertainty in the placement of S. bayanus and S. kudriavzevii. This uncertainty is hidden when genes are concatenated before phylogenetic analysis, as was done in the study of Rokas et al. (2003).
It is interesting to note that evaluating the extent of the incongruence between optimal gene trees for many loci in yeast is relevant to earlier work that sought to test the theory of evolution. The finding of highly similar gene trees for Candida albicans and seven species of Saccharomyces provides strong evidence that corroborates the conclusions of Penny, Foulds, and Hendy (1982). These authors sought to test the theory of evolution by asking whether gene trees for the same species of mammals were more similar than one would expect by chance. In doing so, and because of limited data at the time, they compared trees reconstructed from very few genes. The gene trees of Rokas et al. (2003) allow comparisons to be made for a very large number of genes (106). It is evident from comparing figures 1a–c that the gene trees of Rokas et al. (2003) are far more congruent with each other than would be expected by chance. As shown in table 1, this finding is also evident when the analytical approach of Penny, Foulds, and Hendy (1982) is used.
Table 1 Distribution of the Robinson-Foulds Distance Between Pairs of Trees.
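For readers unfamiliar with the Robinson-Foulds distance named in the table title, the Python sketch below shows how a distribution of pairwise distances can be tabulated. It assumes each tree has already been reduced to its set of nontrivial splits and uses the convention that the distance is the number of splits found in exactly one of the two trees (some authors halve this value). The function names and the four toy trees are hypothetical and are not the data behind table 1.

```python
from collections import Counter
from itertools import combinations


def rf_distance(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    return len(set(tree1) ^ set(tree2))


def rf_distribution(trees):
    """Tally the Robinson-Foulds distance over all unordered pairs of trees."""
    return Counter(rf_distance(a, b) for a, b in combinations(trees, 2))


# Hypothetical toy data: four trees on taxa A..E, each listed by its two
# nontrivial splits (written as the side that does not contain taxon A).
toy_trees = [
    {frozenset("CDE"), frozenset("DE")},
    {frozenset("CDE"), frozenset("DE")},
    {frozenset("BDE"), frozenset("DE")},
    {frozenset("CDE"), frozenset("CE")},
]

for distance, n_pairs in sorted(rf_distribution(toy_trees).items()):
    print("distance", distance, ":", n_pairs, "pairs")
```

Comparing such a distribution for real gene trees with the distribution for random trees on the same number of taxa is the kind of test the text describes: congruent gene trees pile up at small distances, whereas random trees concentrate near the maximum.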
Our study highlights the potential of using consensus networks to visualize species phylogenies for large numbers of independent genes. Their main advantage is that, unlike other methods, they can display conflicting evolutionary hypotheses simultaneously. Such conflict might arise because of stochastic processes and sampling error. It may also arise because of complex biological processes such as hybridization or lateral gene transfer. The incongruence generated by these processes in the evolution of genomes is easily quantified using consensus networks.
Acknowledgements
We thank the anonymous reviewers and AE for their helpful and constructive comments. This work was financially supported by the New Zealand Marsden Fund (P.J.L.), the Swedish Research Council (K.H.), and the Swedish Foundation for International Cooperation in Research and Higher Education (STINT).
Literature Cited
Aguilar, J. F., J. A. Rossello, and G. N. Feliner. 1999. Nuclear ribosomal DNA (nrDNA) concerted evolution in natural and artificial hybrids of Armeria (Plumbaginaceae). Mol. Ecol. 8:1341-1346.
Bandelt, H.-J. 1995. Combination of data in phylogenetic analysis. Plant Syst. Evol. (suppl) 9:355-361.
Bandelt, H.-J., P. Forster, and A. Rohl. 1999. Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol. 16:37-48.
Bininda-Emonds, O. R. P., J. L. Gittleman, and M. A. Steel. 2002. The (super) tree of life: procedures, problems, and prospects. Annu. Rev. Ecol. Syst. 33:265-289.
Buckler, E. S., A. Ippolito, and T. P. Holtsford. 1997. The evolution of ribosomal DNA: divergent paralogues and phylogenetic implications. Genetics 145:821-832.
Bryant, D. 2003. A classification of consensus methods for phylogenetics. Pp. 1–21 in M. Janowitz, F. J. Lapointe, F. McMorris, B. Mirkin, and F. Roberts, eds. Bioconsensus. American Mathematical Society Publications—DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), Piscataway, NJ.
Holland, B., and V. Moulton. 2003. Consensus networks: a method for visualising incompatibilities in collections of trees. Pp. 165–176 in G. Benson and R. Page, eds. Algorithms in bioinformatics, WABI 2003. Springer-Verlag, Berlin, Germany.
Huber, K. T., V. Moulton, P. J. Lockhart, and A. Dress. 2001. Pruned median networks: A technique for studying plant speciations. Mol. Phylogenet. Evol. 19:302-310.
Huelsenbeck, J. P., J. J. Bull, and C. W. Cunningham. 1996. Combining data in phylogenetic analysis. Trends Ecol. Evol. 11:152-158.
Moores, A., and E. C. Holmes. 2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15:365-369.
Penny, D., L. R. Foulds, and M. D. Hendy. 1982. Testing the theory of evolution by comparing phylogenetic trees constructed from 5 different protein sequences. Nature 297:197-200.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-803.
Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.
Steel, M. A., D. Huson, and P. J. Lockhart. 2000. Invariable site models and their use in phylogeny reconstruction. Syst. Biol. 49:225-232.
Yang, Z. H. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42:587-596.
(Barbara R. Holland*, Kath)