The Chloroplast Genome of Nymphaea alba: Whole-Genome Analyses and the Problem of Identifying the Most Basal Angiosperm
Molecular Biology and Evolution, 2004, Issue 7
* Institut für Spezielle Botanik, Universität Jena, Jena, Germany
Zentrum Pharmakologie und Toxikologie, Universität Göttingen, Göttingen, Germany
Klinik für Innere Medizin, Universität Jena, Jena, Germany
E-mail: Vadim.Goremykin@uni-jena.de.
Abstract
Angiosperms (flowering plants) dominate contemporary terrestrial flora with roughly 250,000 species, but their origin and early evolution are still poorly understood. In recent years, molecular evidence has accumulated suggesting a dicotyledonous origin of monocots. Phylogenetic reconstructions have suggested that several dicotyledonous groups that include taxa such as Amborella, Austrobaileya, and Nymphaea branch off as the most basal among angiosperms. This has led to the concept of monocots, "eudicots," "basal dicots," and "ANITA" groupings. Here, we present the sequence and phylogenetic analyses of the chloroplast DNA of Nymphaea alba. Phylogenetic analyses of our 14-species data set, consisting of 29,991 aligned nucleotide positions per chloroplast genome, revealed consistent support for Nymphaea being a divergent member of a monophyletic dicot assemblage. Three distinct angiosperm lineages were supported in the majority of our phylogenetic analyses: eudicots, Magnoliopsida, and monocots. However, the monocot lineage leading to the grasses was the deepest branching. Although analysis of only one individual gene alignment (out of 61) is consistent with some recently proposed hypotheses for the paraphyly of dicots, we also report observations that nine genes do not support paraphyly of dicots. Instead, they support the basal monocot-dicot split. Consistent with this finding, we also report observations suggesting that the monocot lineage leading to the grasses has the strongest phylogenetic affinity to gymnosperms. Our findings have general implications for studies of substitution model specification and analyses of concatenated genome data.
Key Words: Nymphaea, chloroplast genomes, angiosperms, gymnosperms, molecular evolution, substitution rates
Introduction
A new consensus view of higher-level angiosperm systematics is currently emerging, based mostly on the analysis of three genes (reviewed in Savolainen and Chase [2003]). In this view, the dicot lineages, including Nymphaea and Amborella, are regarded as the deepest branching among angiosperms. Recent prominent studies (Adams et al. 2002; Bergthorsson et al. 2003) have taken for granted that this consensus is correct.
In our previous studies involving chloroplast genome analyses (Goremykin et al. 2003a, 2003b), we noticed that although the optimal symmetrical model (GTR + I + Γ) identified by Modeltest (Posada and Crandall 1998) on concatenated chloroplast data was consistent with a basal position of the magnoliids Calycanthus and Amborella, tree building with many other symmetric substitution models did not support this hypothesis. In fact, an alternative hypothesis placing monocots as basal was favored in most analyses with 100% nonparametric bootstrap support. In our paper, we raised the concern of model specification and whether or not the best-fitting symmetric model for our concatenated data gave the most biologically realistic result. In an effort to shed more light on these earlier findings, we have sequenced the chloroplast genome of Nymphaea alba. We hoped that adding this putatively basal species would help break up long branches and thus help stabilize the angiosperm tree topology.
Nymphaea alba belongs to Nymphaeaceae (Nymphaeales). Like Amborella, Nymphaeales (excluding Nelumbonaceae) have no vessels (Cronquist 1981; Takhtajan 1966). Their position has been viewed by classical taxonomists as somehow intermediate between monocotyledonous and dicotyledonous flowering plants. Some botanists (Rohweder and Endress 1983) noted certain similarities, such as floral organs in trimerous whorls, uniting members of this order (Cabomba) and "primitive" monocotyledons (Alisma). Others considered the morphological similarities between Nymphaeales and the monocots (rhizodermis differentiated into short and long cells, dispersed vascular bundles, compound midrib, and operculate seeds) important enough to actually include the Nymphaeales in the monocotyledons (Schaffner 1904, 1934; Guttenberg and Müller-Schröder 1958; Haines and Lye 1975). A detailed investigation of embryonal development of Nymphaeales (Lodkina 1988) revealed asynchronous and asymmetric development of cotyledons, which was interpreted by the author as evidence of an ancestral position of Nymphaeales to monocots. Indeed, a number of scientists for different reasons believed that Nymphaeales belong to the stock of early lineages from which monocots arose (Arber 1920; Takhtajan 1973; Cronquist 1981; Dahlgren and Clifford 1982). Many character states of Nymphaeales considered to be primitive (e.g., uniaperturate pollen, apocarpous gynoecium, numerous stamens, and laminar placentation) point to the antiquity of the order. The complex polymerous flowers of water lilies with no clear differentiation between sepals and carpels fit very well into the euanthian school of flower origin theories.
The fossil record supports the great age of Nymphaeales. Friis, Pedersen, and Crane (2001) found fossilized nymphaealean flowers dating back to the early Cretaceous period (125 to 115 Myr) and belonging to the oldest fossil assemblages that contain unequivocal angiosperm stamens and carpels. The representatives of this order tend to be among the first taxa to diverge from the other angiosperm lineages in many molecular studies. They are either included in the ANITA (Amborella, Nymphaeales, and Illiciales-Trimeniales-Aristolochiales) grade (Qiu et al. 1999; Soltis, Soltis, and Chase 1999; Barkman et al. 2000; Graham and Olmstead 2000; Zanis et al. 2002) or form a clade with Amborella, which is a sister to all other angiosperms (Barkman et al. 2000; Graham and Olmstead 2000; Zanis et al. 2002).
Recent publications have not been able to discriminate between those hypotheses. It seems that resolution of this issue will require larger amounts of sequence data. The data set of the 61 chloroplast-coding genes we present and investigate here is the largest available to date for addressing the issue of early-branching angiosperms.
Materials and Methods
Genomic Sequencing
Nymphaea alba leaves were harvested from a plant growing in the botanical gardens of the University of Jena, Germany. Total DNA was extracted from the leaves employing the CTAB method (Murray and Thompson 1980) and further purified with Qiagen columns (Qiagen, Valencia, Calif.) according to the manufacturer's protocol. The Nymphaea alba plastome sequence was amplified by implementing a long-range PCR strategy with primers developed from the alignment of known chloroplast genome sequences, as described previously (Goremykin et al. 2003a, 2003b). We covered the entire Nymphaea alba chloroplast genome with PCR products, which ranged in length from approximately 4 to 20 kb. The inverted repeat regions of the cpDNA were amplified separately, each with two PCR products extending to the flanking sequences of the single-copy regions and overlapping in the middle of the respective repeat. PCR products were purified by electrophoresis through low-melting agarose followed by digestion with agarase and were subsequently sheared by nebulization, yielding fragments of 0.5 to 1.5 kb in length. The fragments were cloned into the pCR4Blunt-TOPO vector employing the TOPO Shotgun Subcloning kit (Invitrogen, Groningen, The Netherlands), according to the manufacturer's protocol. Recombinant plasmids containing individual fragments were isolated from transformed E. coli clones with the Montage Plasmid Miniprep kit (Millipore, Eschborn, Germany). Sequencing reactions with plasmid DNA were prepared using the Big Dye Terminator sequencing kit (ABI, Foster City, Calif.). Automated sequencing was performed on ABI 3100, ABI 377 (ABI), and MegaBACE 1000 (Amersham/Pharmacia Biotech, Uppsala, Sweden) sequencers.
Sequence Assembly and Annotation
All automated sequencer traces were base-called with the PHRED program (Ewing et al. 1998), and sequence masking and assembly were performed with the STADEN package (Staden, Beal, and Bonfield 2000). Sequencing data were accumulated to 10x coverage for all PCR fragments; remaining gaps were closed by PCR. The Nymphaea alba chloroplast genome sequence has been deposited in the EMBL database under the accession number AJ627251. The primer sequences used for the amplification of plastome sequences by PCR and the alignments employed for phylogenetic analyses are available upon request. The genome was annotated as described previously (Goremykin et al. 2003b).
Results
General Genome Properties
The chloroplast DNA of Nymphaea alba is a circular molecule of 159,930 bases, which has a structure typical for many land plants: large and small single-copy regions separated by inverted repeat regions. Using PHRED confidence values, we estimated that the total plastome assembly contains 0.03 incorrectly read bases, which suggests that the genomic sequence is free of errors. The G+C content of the plastome is 39.15%, which is close to that of the other angiosperms (Amborella has, for example, 38.3% G+C, Nicotiana has 37.8%, and Zea has 38.4%). The G+C content of the protein-coding genes of known function found on the Nymphaea alba cpDNA (40.1%) is close to the overall value. However, G and C are not uniformly distributed across the different codon positions. The synonymous third codon positions have a G+C content of 31.5%, whereas the first and the second codon positions have a G+C content closer to uniform (i.e., 44.5%). Both the gene order and the gene content of the genome under study are identical to those of Amborella trichopoda cpDNA (Goremykin et al. 2003a) and are very similar to the gene order and the gene content of Calycanthus fertilis cpDNA (Goremykin et al. 2003b). The exceptions are the hypothetical ACR-toxin sensitivity gene (ACRS) open reading frame (ORF) found in the latter genome and the shorter inverted repeat region of the Calycanthus plastome, which does not include the rpl2 gene. The gene map of the Nymphaea alba cpDNA is presented in figure 1.
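The per-position G+C figures quoted above can be reproduced in principle with a few lines of code. The following is a minimal sketch, not the authors' script: the function name and the toy sequence are hypothetical, and it assumes an in-frame coding sequence with no introns.

```python
def gc_by_codon_position(cds):
    """Return the G+C fraction at codon positions 1, 2 and 3.

    Assumes `cds` is an in-frame coding sequence (hypothetical input);
    trailing bases that do not fill a codon are ignored, as are
    ambiguous characters such as N.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for offset in range(3):
        bases = [cds[i] for i in range(offset, usable, 3) if cds[i] in "ACGT"]
        gc = sum(1 for b in bases if b in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Toy sequence for illustration only (not from the Nymphaea plastome).
toy_cds = "ATGGCTAAAGGTTAA"
print(gc_by_codon_position(toy_cds))
```

Applied to each annotated gene and pooled, a calculation of this kind would show whether third positions are poorer in G+C than first and second positions, which is the pattern reported here.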
FIG. 1. Nymphaea alba cpDNA. The topmost part of the map corresponds to the start and the end of the EMBL sequence entry AJ627251. Genes shown inside the circle are transcribed clockwise, and genes outside the circle are transcribed counterclockwise. The genes of the genetic apparatus are shown in red, photosynthesis genes are indicated as green, and genes of NADH dehydrogenase are shown in violet. The ORFs, ycfs, and genes of unknown function are designated as gray. Intron-containing genes (names of which are indicated in blue) are represented by their exons. In the cases when two genes overlap, one of them is shifted off the map to show its position
Alignment and Data Properties
The chloroplast genome of Nymphaea alba contains all 61 genes common to completely sequenced chloroplast genomes of the land plants (Ohyama et al. 1986; Shinozaki et al. 1986; Hiratsuka et al. 1989; Wakasugi et al. 1994; Maier et al. 1995; Sato et al. 1999; Hupfer et al. 2000; Kato et al. 2000; Schmitz-Linneweber et al. 2001; Ogihara et al. 2002; Goremykin et al. 2003a, 2003b). Individual alignments of the first and the second codon positions of 61 genes and alignments of their translated sequences were produced with our ClustalW-embedded Perl script. They were manually concatenated and edited to produce a 29,991-position-long nucleotide alignment and a 14,811-amino acid alignment used in phylogenetic analyses. In the nucleotide alignment, we excluded the third codon positions because they could pose problems for model fitting and tree building with symmetric substitution models. These sites tend to exhibit high and irregular AT-contents and were found to be very divergent in comparisons of Pinus with angiosperms (Goremykin et al. 2003a, 2003b). Even at the first + second codon positions, several species in the data set do not pass a 5% chi-square test of compositional homogeneity. This also raises some concern for phylogenetic analysis of angiosperm/outgroup 1+2 sequence data sets, which we address.
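The two data-preparation steps described above, dropping third codon positions and checking compositional homogeneity, can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the test compares each sequence with the pooled base composition (3 degrees of freedom), which is one common form of the test and may differ in detail from the implementation the authors used.

```python
def first_second_positions(cds):
    """Keep codon positions 1 and 2 of an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3
    return "".join(cds[i:i + 2] for i in range(0, usable, 3))


def composition_chi2(alignment):
    """Chi-square statistic of each sequence against pooled base frequencies.

    `alignment` maps a taxon name to its sequence. Only A, C, G and T
    are counted. Each statistic has 3 degrees of freedom.
    """
    bases = "ACGT"
    counts = {name: [seq.upper().count(b) for b in bases]
              for name, seq in alignment.items()}
    totals = [sum(c[k] for c in counts.values()) for k in range(4)]
    grand = sum(totals)
    result = {}
    for name, observed in counts.items():
        n = sum(observed)
        stat = 0.0
        for k in range(4):
            expected = n * totals[k] / grand
            if expected > 0:
                stat += (observed[k] - expected) ** 2 / expected
        result[name] = stat
    return result


# 5% critical value of the chi-square distribution with 3 degrees of freedom.
CHI2_CRIT_5PCT_DF3 = 7.815

toy = {"taxonA": "AAAA", "taxonB": "CCCC"}  # toy data, not real sequences
for name, stat in composition_chi2(toy).items():
    print(name, stat, "fails" if stat > CHI2_CRIT_5PCT_DF3 else "passes")
```

A taxon whose statistic exceeds the critical value would be flagged as compositionally deviant at the 5% level.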
Analyses of Concatenated Nucleotide Alignment
The tree depicting the inferred phylogenetic relationship of the species under analysis is presented in figure 2. This topology was found in distance analysis of the 29,991-position alignment of the first and the second codon positions of 61 chloroplast genes employing the Tajima-Nei substitution model as implemented in the Treecon package (Van de Peer et al. 1994) and further confirmed in distance analyses with Jukes-Cantor, Kimura two-parameter, Felsenstein F81, Felsenstein F84, Kimura three-parameter, Hasegawa, Kishino and Yano, Tajima-Nei, Tamura-Nei, and General time-reversible (GTR) models as implemented in the PAUP* package (Swofford 2002), both with and without gamma correction (employing alpha shape parameter 0.27). The branches uniting dicotyledons, magnoliids, and Nymphaea with Amborella were recovered with 100/100 bootstrap proportion (BP) support in all of the above analyses.
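To make the effect of the gamma correction concrete, the simplest of the listed models, Jukes-Cantor, can be written out in a few lines. This is a minimal sketch for illustration only: it is not the Treecon or PAUP* implementation, the function names are hypothetical, and the only value taken from the text is the shape parameter alpha = 0.27.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, over sites where both have A/C/G/T."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p, alpha=None):
    """Jukes-Cantor distance from an observed proportion of differences.

    With alpha=None, rates are assumed equal across sites:
        d = -(3/4) * ln(1 - 4p/3)
    With a gamma shape parameter alpha:
        d = (3/4) * alpha * ((1 - 4p/3) ** (-1/alpha) - 1)
    """
    base = 1.0 - 4.0 * p / 3.0
    if base <= 0:
        raise ValueError("p is too large for the Jukes-Cantor correction")
    if alpha is None:
        return -0.75 * math.log(base)
    return 0.75 * alpha * (base ** (-1.0 / alpha) - 1.0)


# Toy sequences for illustration only.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jukes_cantor(p), jukes_cantor(p, alpha=0.27))
```

For the same observed difference, the gamma-corrected distance is larger, because strong rate heterogeneity (small alpha) implies more hidden substitutions at the fast sites.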
FIG. 2. Neighbor-joining tree built from Tajima-Nei distances derived from analysis of the alignment of the first and the second codon positions from 61 protein-coding genes common to the plastomes of land plants
Several species in the data set do not pass the 5% chi-square test of compositional homogeneity. The compositional biases in these sequences, however, would not be expected to affect the phylogeny reconstruction because the topology presented on figure 2 was also recovered in the LogDet analysis—with 100/100 BP support for all aforementioned branches.
This topology was further confirmed in maximum-parsimony analyses performed with heuristic and branch and bound searches. The bootstrap values supporting the monophyly of dicotyledonous plants, monophyly of Magnoliopsida, and sister group relationship between Amborella and Nymphaea remained on the maximum level in these analyses.
The same result was recovered in maximum-likelihood (ML) analyses performed with the Tree-Puzzle program (Strimmer and von Haeseler 1996). The ML tree built with the Hasegawa, Kishino, and Yano model of substitution was congruent to the topology shown on figure 2. The monophyletic status of dicots, Magnoliids, and Nymphaea with Amborella received, respectively, 95, 98, and 99 quartet puzzling support (QPS). Applying the Tamura-Nei model of substitution resulted in no changes in topology and in a slight change of support for the above three branches: 95, 97, 99 QPS (respectively). The hypothetical branch uniting all angiosperms under analysis with the exception of Nymphaea and Amborella, a branch that would be in compliance with the ANITA grade hypothesis, received no support in these QP ML analyses.
The heuristic ML searches performed with PAUP* yielded different results. The quartet-puzzling algorithm implemented in this program found an alternative topology ((outgroups (Calycanthus (Nymphaea, Amborella)))(monocots, eudicots)) with exactly the same settings and models that we used with Tree-Puzzle. The branch separating outgroups with magnoliids from the rest of the species received low (<60%) QPS support in these analyses. As with our earlier findings (Goremykin et al. 2003a), the "best fit" model found by the Modeltest program (Posada and Crandall 1998) suggested a General Time-Reversible model with positional rate heterogeneity. Tree building under heuristic ML (PAUP*) with the optimal symmetric model yielded a topology with a clade bearing Nymphaea and Amborella branching first among the angiosperms, followed by Calycanthus, and then by the dichotomy monocots-eudicots. However, as with the findings reported previously, deviations in the choice of model and parameter values gave trees wherein branches united dicotyledons, magnoliids, and Nymphaea with Amborella. In these analyses monocots were basal.
Analyses of Concatenated Amino Acid Alignment
These results were further tested with analyses of the 14,811-position-long alignment of the translated sequences. Heuristic search employing maximum-parsimony algorithm (PAUP*) yielded the topology congruent with monocots basal (fig. 2). The branch bearing Nymphaea and Amborella and the one uniting all dicotyledonous plants were found in 100/100 bootstrap trees, whereas the branch bearing the three magnoliids was recovered in 99/100 bootstrap replicas.
Distance analyses of the protein alignment were performed with the Treecon package employing Kimura and Tajima-Nei models and with the PHYLIP package (Felsenstein 1989) with Dayhoff model. These analyses resulted in topologies identical to the one presented on figure 2 with 100/100 BP support values for the three aforementioned branches.
The neighbor-net (Bryant and Moulton 2004) tree built with the Protein LogDet method (Tholesson 2004) with all amino acid sites included had the topology congruent to the one shown in figure 2.
Maximum-likelihood analyses were performed with the Tree-Puzzle program with default settings and root assigned to Marchantia. The branching order of the eudicot clades, (Nicotiana/Spinacia), (Arabidopsis/Lotus), and the one leading to Oenothera, could not be resolved in all ML analyses. The eudicot monophyly, though, as well as the monophyly of magnoliids and dicots and the sister group relationship between Amborella and Nymphaea, received strong support. With Müller-Vingron, BLOSUM, Adachi-Hasegawa, Dayhoff, and Jones and Jones substitution models, the lowest QPS value supporting the eudicot branch in the above ML analyses was 96 QPS. The single lowest quartet-puzzling support value obtained for the other three branches was 97.
Analyses of the Individual Alignments
Rearrangements in 11 plastomes of the spermatophytes affecting gene order involve large chunks of DNA. Therefore, on the level of at least spermatophytes, orthology of every gene under analysis as well as common evolutionary history of all 61 genes can easily be proved on the basis of gene order identity and general sequence similarity along the large stretches of cpDNA from different species. Yet the land plant tree topologies derived from the chloroplast genes are often different.
To investigate possible phylogenetic biases of individual genes, we counted the bootstrap values supporting the branches bearing outgroups with Nymphaea, Amborella or with Nymphaea and Amborella taken together, Calycanthus, Magnoliopsida, Rosopsida, and grasses in NJ/GTR trees built from the alignments of the first and the second codon positions from 61 coding genes common to the genomes of the land plants. The results of these analyses are shown in table 1. Here, we label a certain branch to be "supported" by a protein alignment when corresponding BP support value is no lower than the arbitrary value of 60%.
Table 1 Results of the Individual Analysis of the Alignments of the First and the Second Codon Positions.
One can see that one of six branches is supported by a much larger number of genes than the other five. This is the branch uniting outgroups with grasses that we recovered from the majority of analyses of the concatenated data set. This branch found some support in the analysis of atpF (67% BP), matK (60% BP), rpoB (99% BP), rpoC1 (100% BP), rpoC2 (99% BP), rps12 (61% BP), rps18 (76% BP), rps3 (77% BP), and rps8 alignments (90% BP). These alignments exhibit, respectively, 117, 479, 483, 357, 939, 24, 50, 146, and 81 informative positions.
We found no clear cases of support on the level of individual proteins for the branches uniting outgroups with (1) Nymphaea, (2) Amborella, and (3) Magnoliopsida. The branches bearing outgroups with (1) Nymphaea + Amborella, (2) Calycanthus, and (3) Rosopsida were each supported by a single protein, with, respectively, 74% BP (psbF), 73% BP (psaJ), and 70% BP (ycf3). The numbers of informative positions in the alignments of these proteins are, respectively, 7, 17, and 35.
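The tally behind these statements can be sketched as a simple threshold count. The bootstrap values below are only those quoted in the text; values for genes below the 60% cutoff are not reported here and are therefore absent. The dictionary layout and names are hypothetical.

```python
# Bootstrap proportions (%) quoted in the text, grouped by the branch
# that places the named group next to the outgroups.
REPORTED_BP = {
    "grasses": {"atpF": 67, "matK": 60, "rpoB": 99, "rpoC1": 100,
                "rpoC2": 99, "rps12": 61, "rps18": 76, "rps3": 77,
                "rps8": 90},
    "Nymphaea + Amborella": {"psbF": 74},
    "Calycanthus": {"psaJ": 73},
    "Rosopsida": {"ycf3": 70},
}

THRESHOLD = 60  # the arbitrary cutoff used in the text


def supporting_genes(bp_by_gene, threshold=THRESHOLD):
    """Return the genes whose bootstrap support is no lower than the cutoff."""
    return sorted(g for g, bp in bp_by_gene.items() if bp >= threshold)


support_counts = {branch: len(supporting_genes(genes))
                  for branch, genes in REPORTED_BP.items()}
print(support_counts)
```

The count gives nine genes for the grass branch and one gene for each of the three alternatives, matching the summary in the text.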
Additional Tests
We wished to evaluate support for the clade grouping the monocotyledonous species with outgroups in our analyses. One simple way of testing outgroup affinity of different angiosperm branches would be to count the number of positions in which a group to be tested shares with the most closely related outgroup a character state that is not observed in any other angiosperm sequence. Given a low level of homoplasy (for example, after excluding the highly homoplastic third positions), one would expect such positions in a group to be tested to contain the ancestral character states that existed before the split of angiosperms and the outgroup and that subsequently mutated in the other angiosperm lineages. Another way to test the basal monocot-dicot split would be to count the number of synapomorphies supporting alternative branches within the angiosperm ingroup. One can expect more synapomorphies between more closely related ingroup taxa.
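The first counting procedure can be sketched as a column scan over the alignment. This is one reading of the description, not the authors' code: a column is counted when every member of the tested group carries the outgroup state and no other ingroup sequence does. The function name and the toy alignment are hypothetical.

```python
def outgroup_affinity_positions(alignment, group, outgroup, others):
    """Return 0-based columns where `group` shares the outgroup state exclusively.

    alignment: dict mapping taxon name to an aligned sequence (equal lengths)
    group:     taxa being tested for outgroup affinity
    outgroup:  the single outgroup taxon used as reference
    others:    all remaining ingroup taxa
    """
    hits = []
    for i, state in enumerate(alignment[outgroup]):
        if state == "-":
            continue  # skip gaps in the reference outgroup
        shared = all(alignment[t][i] == state for t in group)
        exclusive = all(alignment[t][i] != state for t in others)
        if shared and exclusive:
            hits.append(i)
    return hits


# Toy alignment for illustration only (not real plastome data).
toy = {
    "Pinus":     "ACGT",
    "Zea":       "ACGA",
    "Oryza":     "ACGA",
    "Nicotiana": "ATGA",
    "Amborella": "ATGT",
}
print(outgroup_affinity_positions(
    toy, ["Zea", "Oryza"], "Pinus", ["Nicotiana", "Amborella"]))
```

Running the same function with a different tested group, and the remaining ingroup taxa as `others`, gives the competing counts that the text compares.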
We found the concatenated alignment of the first and the second codon positions sampled from 61 chloroplast genes to contain 68 positions with bases shared between three grasses and Pinus to the exclusion of other angiosperms. Three species of Magnoliopsida (Calycanthus, Nymphaea, and Amborella) share only 13 such positions with Pinus, and five species of Rosopsida share only five. The part of the ANITA grade (Nymphaea + Amborella + Pinus) is supported by 10 positions. We deleted nonspermatophyte outgroups and repeated the analysis. The number of positions supporting outgroup affinity of the above four groups changed to 151 (grasses + Pinus), 56 (Nymphaea + Amborella + Pinus), 30 (Calycanthus + Amborella + Nymphaea + Pinus), and 19 (Rosopsida + Pinus). For comparison, monophyly of angiosperms and spermatophytes is supported by 532 and 631 positions, respectively. The positions supporting outgroup affinity of grasses and of the species from the ANITA grade are shown in figure 3.
FIG. 3. The alignment positions in which grasses (above) and members of ANITA group (below) share characters with Pinus to the exclusion of all other angiosperm lines in the alignment of the first and the second codon positions of 61 protein-coding genes common to 14 genomes of land plants
In the total alignment of the first and the second codon positions, the mean distance across the range Pinus to monocots is 0.17 substitutions/position (ML estimation with Tamura-Nei model of substitution). Given that value, the probability that a position mutated twice since the separation of the gymnosperm and monocot lines is 0.0289, which corresponds to one twice-mutated position out of 34.6. Therefore, if the rate of substitutions in 151 alignment positions supporting grouping of monocots with Pinus (fig. 3) does not exceed the mean one characteristic of the whole alignment, this subset could be expected to contain approximately four homoplastic positions. The mean distance among eight dicotyledonous plants in the 151-position subset is 0.11, which is higher than the corresponding distance observed in the total alignment of 61 genes (0.065 substitutions/position, same model). Taking into account this somewhat elevated substitution rate, the 151-position subset can be expected to contain approximately seven homoplastic positions. One can also note that the GC content of the 151-position subset (49.3%) is close to equilibrium and is similar to the total GC content of the alignment (44.4%).
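The arithmetic in this paragraph can be checked directly. The sketch below uses only numbers stated in the text; the linear scaling by the ratio of the two dicot distances is an assumption about how the figure of approximately seven was obtained, chosen because it reproduces that figure.

```python
# Values taken from the text.
mean_distance = 0.17         # substitutions/position, Pinus to monocots
subset_size = 151            # positions grouping grasses with Pinus
dicot_dist_subset = 0.11     # mean dicot distance within the subset
dicot_dist_total = 0.065     # mean dicot distance in the full alignment

# Probability that one position was hit twice, treating hits as independent.
p_twice = mean_distance ** 2
one_in = 1 / p_twice

# Expected homoplastic positions at the alignment-wide rate.
expected_homoplastic = subset_size * p_twice

# Assumed adjustment: scale linearly by the relative rate of the subset.
rate_ratio = dicot_dist_subset / dicot_dist_total
expected_adjusted = expected_homoplastic * rate_ratio

print(round(p_twice, 4), round(one_in, 1))
print(round(expected_homoplastic, 1), round(expected_adjusted, 1))
```

Under these assumptions the expected number of homoplastic positions is about 4.4 at the average rate and about 7.4 after the rate adjustment, both small relative to the 151 positions in the subset.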
However, because the first codon positions can undergo synonymous substitutions, it is not immediately clear from the above observations whether support for gymnosperm affinity of the line leading to Zea, Oryza, and Triticum would be reflected in the protein sequences. So we repeated the analysis, this time using a 14,811–amino acid alignment. It was found to contain 45 positions that clearly support the separation of grasses and Pinus from the rest of the species under analysis. By contrast, the affinity of ANITA members to Pinus was supported by 19 positions, and that of Magnoliopsida and Rosopsida was supported by nine and four positions, respectively. As in the analyses of nucleotide sequences, removal of fern and liverwort sequences resulted in a stronger signal. In the alignment containing only spermatophyte species, there are 86 positions in which Pinus and three grasses share a character to the exclusion of other angiosperms, as opposed to 42, 13, and 11 positions supporting affinity of Pinus to, respectively, ANITA members, Magnoliopsida, and Rosopsida.
The nucleotide alignment of 14 OTUs with 29,991 positions per species contains 93 synapomorphies shared among the dicots, whereas the number of synapomorphies supporting monophyly of eudicots and monocots is 17. In the 14,811–amino acid alignment, there are 93 synapomorphies that unite all dicots. The number of synapomorphies that unite monocots and eudicots in this alignment is 11.
Discussion
Our findings of strong contradictory signals between different phylogenetic analyses of angiosperm chloroplast genome data highlight the concern recently raised in Molecular Biology and Evolution over appropriate analysis of complete genome data and the problems of sequence concatenation (Holland et al. 2004). Our findings are also cautionary and provide insight into the importance of further complete genome sequences before strong conclusions can be drawn in respect of identification of the most basal angiosperms.
In the majority of our analyses, we detected neither support for (1) monocot affinity of a line leading to modern Nymphaeales nor support for (2) an early appearance of Nymphaeales in the evolutionary history of flowering plants. These analyses suggested that Nymphaea alba is a derived representative of a dicot lineage, which is, of all species under analysis, most closely related to Amborella trichopoda. The clade bearing Amborella with Nymphaea was detected with lower support previously (Barkman et al. 2000, Graham and Olmstead 2000), although not its derived position, because it appeared as the first branch to split off among angiosperms in these trees. Because our taxon sampling remains limited, it is, however, premature to state close taxonomical relationship of Amborella and Nymphaeales.
In analyses of the individual genes we registered multiple cases of support for the sister group relationship of those two plants, yet little support for their close gymnosperm affinity. It is the line leading to three grasses that was found in a most-basal position within the angiosperms in the highest number of phylogenetic analyses of the individual gene alignments. This branching was supported by the alignments of genes (matK, rpoB, rpoC1, and rpoC2) containing the greatest numbers of informative positions among the 61 individual gene alignments we built. All alternative branches tested either received no support in these analyses or were supported each by a single alignment of a short gene with a low number of informative positions.
The interpretation of these results is straightforward: There are no bona fide cases of support for alternative rootings of angiosperms on the level of individual gene alignments. The appearance of alternative topologies in analyses of individual genes is probably the result of stochastic variations in the substitution process becoming visible when character sampling becomes small, as was recently observed in the analysis of 108 genes from yeasts (Rokas et al. 2003). Because bootstrap proportion values are affected by the ratio of characters supporting different topologies but not by their absolute numbers, such small biased samples could still exhibit high BP support values for incorrectly inferred branches.
Another conclusion that can be drawn from our analyses of individual genes is that the character-wise small data sets can be generally unreliable for elucidating phylogeny. Analyzed individually, no alignment recovered the very stable topology (fig. 2) we inferred from the analysis of the concatenated nucleotide alignment of 29,991 positions employing the same method of tree construction. This is consistent with the newer findings of Rokas et al. (2003) and is furthermore consistent with earlier analyses of chloroplast genome phylogeny comparing individual and concatenated alignments (Goremykin, Hansmann, and Martin 1997; Martin et al. 1998; Lockhart et al. 1999).
An attempt to additionally test the outgroup affinity of the grasses revealed that, among the branches checked, they share the largest number of characters with Pinus to the exclusion of all other angiosperms. The positions supporting that affinity are not hypermutated and do not exhibit any strong compositional bias. The number of such positions exceeds the number of positions supporting outgroup affinity of ANITA members approximately three times in the nucleotide and two times in the amino acid alignment; other conflicting signals are substantially less pronounced. It is possible for nonadjacent and nonrelated taxa on a tree to have more sequence identities than adjacent and related taxa if the molecular clock is violated in such a way that the substitution rate among the nonadjacent taxa becomes comparatively small. However, the above considerations cannot be applied to explain the result in figure 3, because both the grasses and Pinus are borne on the branches that are the longest among the spermatophytes. One can note that the above results are in good accord with the numbers of synapomorphies that can be observed within the angiosperm ingroup. In concatenated protein and nucleotide alignments, the clade uniting all dicots is supported by, respectively, eightfold and fivefold larger numbers of synapomorphies than an alternative clade bearing eudicots and monocots.
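The fold differences quoted in this paragraph follow from the counts reported in the Results (spermatophyte-only alignments for the Pinus affinity counts). A short check, using only those reported numbers and hypothetical variable names:

```python
# Positions shared with Pinus (spermatophyte-only alignments), from Results.
grasses_nt, anita_nt = 151, 56     # nucleotide alignment
grasses_aa, anita_aa = 86, 42      # amino acid alignment

# Synapomorphies within the angiosperm ingroup, from Results.
dicots_nt, monocot_eudicot_nt = 93, 17
dicots_aa, monocot_eudicot_aa = 93, 11

ratios = {
    "grasses vs ANITA, nucleotide": grasses_nt / anita_nt,
    "grasses vs ANITA, amino acid": grasses_aa / anita_aa,
    "dicot clade vs monocot+eudicot, amino acid": dicots_aa / monocot_eudicot_aa,
    "dicot clade vs monocot+eudicot, nucleotide": dicots_nt / monocot_eudicot_nt,
}
for label, value in ratios.items():
    print(f"{label}: {value:.1f}")
```

The ratios come out at roughly 2.7, 2.0, 8.5, and 5.5, in line with the rounded statements of approximately three times, two times, eightfold, and fivefold.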
These observations and results of individual gene analyses stand in sharp contrast to the topology favored by the optimal symmetric model applied to concatenated sequence data. This apparent contradiction may be easily explained. When the internal branches of the true underlying tree are short compared with the length of the external branches, tree building is expected to be problematic (Hendy and Penny 1987). This problem is exacerbated when sequence evolution is not well described by the assumed substitution model (Jermiin et al. 2004). In such cases, small deviations in model parameters can potentially lead to very different tree topologies supported by high BP support values. Once concatenated, the optimal symmetric model will merely be the best average fit to all these genes.
The majority of analyses presented here converge on the basal monocot-dicot split. However, the extreme difference between internal ingroup and external outgroup branch lengths in the concatenated gene angiosperm tree shown in figure 2 suggests that we may be some ways from being confident of identifying the most basal angiosperm. Clearly, the sequencing of genomes for more closely related outgroups and putatively basal angiosperms will be important for overcoming potential problems of model misspecification and long-branch attraction.
Supplementary Material
The sequence reported in this paper has been deposited in the EMBL database (accession number AJ627251).
Acknowledgements
This publication was supported by a grant of the Deutsche Forschungsgemeinschaft.
Literature Cited
Adams, K. L., Y. L. Qiu, M. Stoutemyer, and J. D. Palmer. 2002. Punctuated evolution of mitochondrial gene content: High and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proc. Natl. Acad. Sci. USA 99:9905-9912.
Arber, A. 1920. Water plants: A study of the aquatic angiosperm. Cambridge University Press. London.
Barkman, T. J., G. Chenery, J. R. McNeal, J. Lyons-Weiler, W. J. Ellisens, G. Moore, A. D. Wolfe, and C. W. dePamphilis. 2000. Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proc. Natl. Acad. Sci. USA 97:13166-13171.
Bergthorsson, U., K. L. Adams, B. Thomason, and J. D. Palmer. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197-201.
Bryant, D., and V. Moulton. 2004. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21:255-265.
Cronquist, A. 1981. An integrated system of classification of flowering plants. Columbia University Press, New York.
Dahlgren, R. M. T., and H. T. Clifford. 1982. The Monocotyledons: a comparative study. Academic Press, New York.
Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.
Felsenstein, J. 1989. PHYLIP (phylogeny inference package). Version 3.2. Cladistics 5:164-166.
Friis, E. M., K. R. Pedersen, and P. R. Crane. 2001. Fossil evidence of water lilies (Nymphaeales) in the early Cretaceous. Nature 410:357-360.
Goremykin, V., S. Hansmann, and W. Martin. 1997. Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Syst. Evol. 206:337-351.
Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003a. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499-1505.
Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003b. The chloroplast genome of the "basal" angiosperm Calycanthus fertilis – structural and phylogenetic analyses. Plant Syst. Evol. 242:119-135.
Graham, S. W., and R. G. Olmstead. 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87:1712-1730.
Guttenberg, H. V., and R. Müller-Schröder. 1958. Untersuchungen über die Entwicklung des Embryos und der Keimpflanze von Nuphar luteum. Planta 51:481-510.
Haines, R. W., and K. A. Lye. 1975. Seedlings of Nymphaeaceae. J. Linn. Soc. Bot. 70:255-265.
Hendy, M. D., and D. Penny. 1987. Edge lengths of trees from sequence data. Math. Biosci. 83:157-165.
Hiratsuka, J., H. Shimada, and R. Whittier, et al. (16 co-authors). 1989. The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct tRNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Mol. Gen. Genet. 217:185-194.
Holland, B. R., K. T. Huber, V. Moulton, and P. J. Lockhart. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. (in press).
Hupfer, H., M. Swiatek, S. Hornung, R. G. Herrmann, R. M. Maier, W.-L. Chiu, and B. Sears. 2000. Complete nucleotide sequence of the Oenothera elata plastid chromosome, representing plastome I of the five distinguishable Euoenothera plastomes. Mol. Gen. Genet. 263:581-585.
Jermiin, L. S., S. Y. W. Ho, F. Ababneh, J. Robinson, and A. W. D. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. (in press).
Kato, T., T. Kaneko, S. Sato, Y. Nakamura, and S. Tabata. 2000. Complete structure of the chloroplast genome of a legume, Lotus japonicus. DNA Res. 7:323-330.
Lockhart, P. J., C. J. Howe, A. C. Barbrook, A. W. D. Larkum, and D. Penny. 1999. Spectral analysis, systematic bias, and the evolution of chloroplasts. Mol. Biol. Evol. 16:573-576.
Lodkina, M. M. 1988. Evolutionary relationships of monocots and dicots derived from studies of embryo and seedlings data. Botanichesky Zhurnal 73:617-629 (in Russian).
Maier, R. M., K. Neckermann, G. L. Igloi, and H. Kössel. 1995. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. J. Mol. Biol. 251:614-628.
Martin, W., B. Stoebe, V. Goremykin, S. Hansmann, M. Hasegawa, and K. V. Kowallik. 1998. Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393:162-165.
Murray, M. G., and W. F. Thompson. 1980. Rapid isolation of high molecular weight DNA. Nucleic Acids Res. 8:4321-4325.
Ogihara, Y., K. Isono, and T. Kojima, et al. (19 co-authors). 2002. Structural features of a wheat plastome as revealed by complete sequencing of chloroplast DNA. Mol. Genet. Genomics 266:740-746.
Ohyama, K., H. Fukuzawa, and T. Kohchi, et al. (13 co-authors). 1986. Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322:572-574.
Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817-818.
Qiu, Y.-L., J. Lee, F. Bernasconi-Quadroni, D. E. Soltis, P. S. Soltis, M. Zanis, E. A. Zimmer, Z. Chen, V. Savolainen, and M. W. Chase. 1999. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404-407.
Rohweder, O., and P. K. Endress. 1983. Samenpflanzen. Georg Thieme Verlag, Stuttgart.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804.
Sato, S., Y. Nakamura, T. Kaneko, E. Asamizu, and S. Tabata. 1999. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6:283-290.
Savolainen, V., and M. W. Chase. 2003. A decade of progress in plant molecular phylogenetics. Trends Genet. 19:717-724.
Schaffner, J. H. 1904. Some morphological peculiarities of the Nymphaeaceae and Helobiae. Ohio Nat. 4:83-92.
Schaffner, J. H. 1934. Phylogenetic taxonomy of plants. Quart. Rev. Biol. 9:129-160.
Schmitz-Linneweber, C., R. M. Maier, J. P. Alcaraz, A. Cottet, R. G. Herrmann, and R. Mache. 2001. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization. Plant Mol. Biol. 45:307-315.
Shinozaki, K., M. Ohme, and M. Tanaka, et al. (23 co-authors). 1986. The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J. 5:2043-2049.
Soltis, P. S., D. E. Soltis, and M. W. Chase. 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402:402-403.
Staden, R., K. F. Beal, and J. K. Bonfield. 2000. The Staden package, 1998. Methods Mol. Biol. 132:115-130.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.
Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Takhtajan, A. 1966. Systema et phylogenia Magnoliophytorum. Nauka, Moscow, Leningrad.
Takhtajan, A. 1973. Evolution und Ausbreitung der Blütenpflanzen. Gustav Fischer Verlag, Jena.
Tholesson, M. 2004. LDDist: a Perl module for calculating LogDet pair-wise distances for protein and nucleotide sequences. Bioinformatics 20:416-418.
Van de Peer, Y., and R. De Wachter. 1994. TREECON for Windows: a software package for the construction and drawing of evolutionary trees for the Microsoft Windows environment. Comput. Applic. Biosci. 10:569-570.
Wakasugi, T., J. Tsudzuki, S. Ito, K. Nakashima, T. Tsudzuki, and M. Sugiura. 1994. Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc. Natl. Acad. Sci. USA 91:9794-9798.
Zanis, M. J., D. E. Soltis, P. S. Soltis, S. Mathews, and M. J. Donoghue. 2002. The root of the angiosperms revisited. Proc. Natl. Acad. Sci. USA 99:6848-6853.
Key Words: Nymphaea · chloroplast genomes · angiosperms · gymnosperms · molecular evolution · substitution rates
Introduction
A new consensus view of higher-level angiosperm systematics is currently emerging, based mostly on the analysis of three genes (reviewed in Savolainen and Chase [2003]). In this view, the dicot lineages, including Nymphaea and Amborella, are regarded as the deepest branching among angiosperms. Recent prominent studies (Adams et al. 2002; Bergthorsson et al. 2003) have taken for granted that this consensus is correct.
In our previous studies involving chloroplast genome analyses (Goremykin et al. 2003a, 2003b), we noticed that although the optimal symmetrical model (GTR + I + Γ) identified by Modeltest (Posada and Crandall 1998) on concatenated chloroplast data was consistent with a basal position of the magnoliids Calycanthus and Amborella, tree building with many other symmetric substitution models did not support this hypothesis. In fact, an alternative hypothesis placing monocots as basal was favored in most analyses with 100% nonparametric bootstrap support. In those papers, we raised the concern of model specification and whether the best-fitting symmetric model for our concatenated data gave the most biologically realistic result. In an effort to shed more light on these earlier findings, we have sequenced the chloroplast genome of Nymphaea alba. It was hoped that addition of this putatively basal species would help break up long branches and, thus, help stabilize the angiosperm tree topology.
Nymphaea alba belongs to Nymphaeaceae (Nymphaeales). Like Amborella, Nymphaeales (excluding Nelumbonaceae) have no vessels (Cronquist 1981; Takhtajan 1966). Their position has been viewed by classical taxonomists as somehow intermediate between monocotyledonous and dicotyledonous flowering plants. Some botanists (Rohweder and Endress 1983) noted certain similarities, such as floral organs in trimerous whorls, uniting members of this order (Cabomba) and "primitive" monocotyledons (Alisma). Others considered the morphological similarities between Nymphaeales and the monocots (rhizodermis differentiated into short and long cells, dispersed vascular bundles, compound midrib, and operculate seeds) important enough to actually include the Nymphaeales in the monocotyledons (Schaffner 1904, 1934; Guttenberg and Müller-Schröder 1958; Haines and Lye 1975). A detailed investigation of embryonal development of Nymphaeales (Lodkina 1988) revealed asynchronous and asymmetric development of cotyledons, which was interpreted by the author as evidence of an ancestral position of Nymphaeales to monocots. Indeed, a number of scientists for different reasons believed that Nymphaeales belong to the stock of early lineages from which monocots arose (Arber 1920; Takhtajan 1973; Cronquist 1981; Dahlgren and Clifford 1982). Many character states of Nymphaeales considered to be primitive (e.g., uniaperturate pollen, apocarpous gynoecium, numerous stamens, and laminar placentation) point to the antiquity of the order. The complex polymerous flowers of water lilies with no clear differentiation between sepals and carpels fit very well into the euanthian school of flower origin theories.
The fossil record supports the great age of Nymphaeales. Friis, Pedersen, and Crane (2001) found fossilized nymphaealean flowers dating back to the early Cretaceous period (125 to 115 Myr) and belonging to the oldest fossil assemblages that contain unequivocal angiosperm stamens and carpels. The representatives of this order tend to be among the first taxa to diverge from the other angiosperm lineages in many molecular studies. They are either included in the ANITA (Amborella, Nymphaeales, and Illiciales-Trimeniales-Aristolochiales) grade (Qiu et al. 1999; Soltis, Soltis, and Chase 1999; Barkman et al. 2000; Graham and Olmstead 2000; Zanis et al. 2002) or form a clade with Amborella, which is a sister to all other angiosperms (Barkman et al. 2000; Graham and Olmstead 2000; Zanis et al. 2002).
Recent publications have not been able to discriminate between those hypotheses. It seems that resolution of this issue will require larger amounts of sequence data. The data set of the 61 chloroplast-coding genes we present and investigate here is the largest available to date for addressing the issue of early-branching angiosperms.
Materials and Methods
Genomic Sequencing
Nymphaea alba leaves were harvested from a plant growing in the botanical gardens of the University of Jena, Germany. Total DNA was extracted from the leaves employing the CTAB method (Murray and Thompson 1980) and further purified with Qiagen columns (Qiagen, Valencia, Calif.) according to the manufacturer's protocol. The Nymphaea alba plastome sequence was amplified by implementing a long-range PCR strategy with primers developed from the alignment of known chloroplast genome sequences, as described previously (Goremykin et al. 2003a, 2003b). We covered the entire Nymphaea alba chloroplast genome with PCR products, which exhibited lengths between approximately 4 and 20 kb. The inverted repeat regions of the cpDNA were amplified separately, each with two PCR products extending to the flanking sequences of the single-copy regions and overlapping in the middle of the respective repeat. PCR products were purified by electrophoresis through low-melting agarose followed by digestion with agarase and were subsequently sheared by nebulization, yielding fragments of 0.5 to 1.5 kb in length. The fragments were cloned into the pCR4Blunt-TOPO vector employing the TOPO Shotgun Subcloning kit (Invitrogen, Groningen, The Netherlands), according to the manufacturer's protocol. Recombinant plasmids containing individual fragments were isolated from transformed E. coli clones with the Montage Plasmid Miniprep kit (Millipore, Eschborn, Germany). Sequencing reactions with plasmid DNA were prepared using the Big Dye Terminator sequencing kit (ABI, Foster City, Calif.). Automated sequencing was performed on ABI 3100, ABI 377 (ABI), and MegaBACE 1000 (Amersham/Pharmacia Biotech, Uppsala, Sweden) sequencers.
Sequence Assembly and Annotation
All automated sequencer traces were base-called with the PHRED program (Ewing et al. 1998), and sequence masking and assembly were performed with the STADEN package (Staden, Beal, and Bonfield 2000). Sequencing data were accumulated to 10x coverage for all PCR fragments; remaining gaps were closed by PCR. The Nymphaea alba chloroplast genome sequence has been deposited in the EMBL database under the accession number AJ627251. The primer sequences used for the amplification of plastome sequences by PCR and the alignments employed for phylogenetic analyses are available upon request. The genome was annotated as described previously (Goremykin et al. 2003b).
Results
General Genome Properties
The chloroplast DNA of Nymphaea alba is a 159,930-bp circular molecule, which has a structure typical for many land plants: large and small single-copy regions separated by inverted repeat regions. Using PHRED confidence values, we estimated that the total plastome assembly contains 0.03 incorrectly read bases, which suggests no mistakes in the genomic sequence. The G+C content of the plastome is 39.15%, which is close to that of the other angiosperms (Amborella has, for example, 38.3% G+C, Nicotiana has 37.8%, and Zea has 38.4%). The G+C content of the protein-coding genes of known function found on the Nymphaea alba cpDNA (40.1%) is close to the overall value. However, G and C are not uniformly distributed across the different codon positions. The synonymous third codon positions have a G+C content of 31.5%, whereas the first and the second codon positions have a G+C content closer to uniform (i.e., 44.5%). Both the gene order and the gene content of the genome under study are identical to those of Amborella trichopoda cpDNA (Goremykin et al. 2003a) and are very similar to the gene order and the gene content of Calycanthus fertilis cpDNA (Goremykin et al. 2003b), with the exception of the hypothetical ACR-toxin sensitivity gene (ACRS) open reading frame (ORF) found in the latter genome and the shorter inverted repeat region of the Calycanthus plastome, which does not include the rpl2 gene. The gene map of the Nymphaea alba cpDNA is presented in figure 1.
FIG. 1. Nymphaea alba cpDNA. The topmost part of the map corresponds to the start and the end of the EMBL sequence entry AJ627251. Genes shown inside the circle are transcribed clockwise, and genes outside the circle are transcribed counterclockwise. The genes of the genetic apparatus are shown in red, photosynthesis genes are indicated as green, and genes of NADH dehydrogenase are shown in violet. The ORFs, ycfs, and genes of unknown function are designated as gray. Intron-containing genes (names of which are indicated in blue) are represented by their exons. In the cases when two genes overlap, one of them is shifted off the map to show its position
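The codon-position G+C values reported under General Genome Properties can be obtained by slicing each in-frame coding sequence at offsets 0, 1, and 2. The following is only a minimal sketch of that calculation; the function name and the toy sequence are hypothetical and are not taken from the authors' scripts.

```python
def gc_by_codon_position(cds):
    """Return the G+C fraction at codon positions 1, 2 and 3.

    `cds` is assumed to be an in-frame coding sequence; trailing bases
    that do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for offset in range(3):
        bases = cds[offset:usable:3]          # every third base from this offset
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


# Hypothetical toy sequence (six codons), for illustration only.
toy_cds = "ATGGCTAAAGGTTTATAA"
first, second, third = gc_by_codon_position(toy_cds)
print(f"GC1={first:.3f} GC2={second:.3f} GC3={third:.3f}")
```

Applied to the concatenated protein-coding genes of a plastome, the same slicing would separate the first and second positions from the AT-rich third positions.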
Alignment and Data Properties
The chloroplast genome of Nymphaea alba contains all 61 genes common to completely sequenced chloroplast genomes of the land plants (Ohyama et al. 1986; Shinozaki et al. 1986; Hiratsuka et al. 1989; Wakasugi et al. 1994; Maier et al. 1995; Sato et al. 1999; Hupfer et al. 2000; Kato et al. 2000; Schmitz-Linneweber et al. 2001; Ogihara et al. 2002; Goremykin et al. 2003a, 2003b). Individual alignments of the first and the second codon positions of 61 genes and alignments of their translated sequences were produced with our ClustalW-embedded Perl script. They were manually concatenated and edited to produce a 29,991-position-long nucleotide alignment and a 14,811-amino-acid alignment used in phylogenetic analyses. In the nucleotide alignment, we excluded the third codon positions because they could pose problems for model fitting and tree building with symmetric substitution models. These sites tend to exhibit high and irregular AT contents and were found to be very divergent in comparisons of Pinus with angiosperms (Goremykin et al. 2003a, 2003b). Even at the first + second codon positions, several species in the data set do not pass a 5% chi-square test of compositional homogeneity. This also raises some concern for phylogenetic analysis of angiosperm/outgroup 1+2 sequence data sets, which we address.
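A 5% chi-square test of compositional homogeneity can be sketched as follows. This is one common form of the test, in which each sequence's base counts are compared with the frequencies pooled over all sequences (3 degrees of freedom for four bases); the text does not state which implementation was used, so the function names, the pooled-frequency choice, and the toy data below are assumptions.

```python
# Chi-square critical value for 3 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF3_5PCT = 7.815


def pooled_frequencies(seqs):
    """Base frequencies pooled over all sequences (gaps and ambiguities ignored)."""
    totals = {base: 0 for base in "ACGT"}
    for seq in seqs.values():
        for base in "ACGT":
            totals[base] += seq.upper().count(base)
    n = sum(totals.values())
    return {base: totals[base] / n for base in "ACGT"}


def composition_chi2(seq, expected):
    """Chi-square statistic of one sequence against the expected frequencies."""
    counts = {base: seq.upper().count(base) for base in "ACGT"}
    n = sum(counts.values())
    return sum(
        (counts[base] - n * expected[base]) ** 2 / (n * expected[base])
        for base in "ACGT"
    )


def failing_taxa(seqs):
    """Names of sequences whose composition deviates at the 5% level."""
    expected = pooled_frequencies(seqs)
    return [
        name for name, seq in seqs.items()
        if composition_chi2(seq, expected) > CHI2_CRIT_DF3_5PCT
    ]


# Hypothetical toy data: taxon "c" is strongly A-biased.
toy = {
    "a": "ACGT" * 10,
    "b": "ACGT" * 10,
    "c": "A" * 34 + "CC" + "GG" + "TT",
}
print(failing_taxa(toy))
```

The sketch assumes every base occurs at least once in the pooled data; otherwise the expected count would be zero and the statistic undefined.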
Analyses of Concatenated Nucleotide Alignment
The tree depicting the inferred phylogenetic relationship of the species under analysis is presented in figure 2. This topology was found in distance analysis of the 29,991-position alignment of the first and the second codon positions of 61 chloroplast genes employing the Tajima-Nei substitution model as implemented in the Treecon package (Van de Peer and De Wachter 1994) and further confirmed in distance analyses with Jukes-Cantor, Kimura two-parameter, Felsenstein F81, Felsenstein F84, Kimura three-parameter, Hasegawa, Kishino and Yano, Tajima-Nei, Tamura-Nei, and General time-reversible (GTR) models as implemented in the PAUP* package (Swofford 2002), both with and without gamma correction (employing alpha shape parameter 0.27). The branches uniting dicotyledons, magnoliids and Nymphaea with Amborella were recovered with 100/100 bootstrap proportion (BP) support in all above analyses.
FIG. 2. Neighbor-joining tree built from Tajima-Nei distances derived from analysis of the alignment of the first and the second codon positions from 61 protein-coding genes common to the plastomes of land plants
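The distance analyses above all start from pairwise distances corrected under a substitution model. As an illustration of the simplest model in the list, the sketch below computes the uncorrected p-distance between two aligned sequences and its Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The function names and the toy sequences are hypothetical; this is not the Treecon or PAUP* implementation, and gamma correction is not included.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping gaps and ambiguous characters."""
    pairs = [
        (a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned fragments differing at one of ten sites.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(f"p = {p:.3f}, JC distance = {jukes_cantor(p):.4f}")
```

The corrected distance is always at least as large as p, because the correction accounts for multiple substitutions at the same site.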
Several species in the data set do not pass the 5% chi-square test of compositional homogeneity. The compositional biases in these sequences, however, would not be expected to affect the phylogeny reconstruction because the topology presented on figure 2 was also recovered in the LogDet analysis—with 100/100 BP support for all aforementioned branches.
This topology was further confirmed in maximum-parsimony analyses performed with heuristic and branch and bound searches. The bootstrap values supporting the monophyly of dicotyledonous plants, monophyly of Magnoliopsida, and sister group relationship between Amborella and Nymphaea remained on the maximum level in these analyses.
The same result was recovered in maximum-likelihood (ML) analyses performed with the Tree-Puzzle program (Strimmer and von Haeseler 1996). The ML tree built with the Hasegawa, Kishino, and Yano model of substitution was congruent to the topology shown on figure 2. The monophyletic status of dicots, Magnoliids, and Nymphaea with Amborella received, respectively, 95, 98, and 99 quartet puzzling support (QPS). Applying the Tamura-Nei model of substitution resulted in no changes in topology and in a slight change of support for the above three branches: 95, 97, 99 QPS (respectively). The hypothetical branch uniting all angiosperms under analysis with the exception of Nymphaea and Amborella, a branch that would be in compliance with the ANITA grade hypothesis, received no support in these QP ML analyses.
The heuristic ML searches performed with PAUP* yielded different results. The quartet-puzzling algorithm implemented in this program found an alternative topology ((outgroups (Calycanthus (Nymphaea, Amborella)))(monocots, eudicots)) with exactly the same settings and models that we used with Tree-Puzzle. The branch separating outgroups with magnoliids from the rest of the species received low (<60%) QPS support in these analyses. As with our earlier findings (Goremykin et al. 2003a), the "best fit" model found by the Modeltest program (Posada and Crandall 1998) suggested a General Time-Reversible model with positional rate heterogeneity. Tree building under heuristic ML (PAUP*) with the optimal symmetric model yielded a topology with a clade bearing Nymphaea and Amborella branching first among the angiosperms, followed by Calycanthus, and then by the dichotomy monocots-eudicots. However, as with the findings reported previously, deviations in the choice of model and parameter values gave trees wherein branches united dicotyledons, magnoliids, and Nymphaea with Amborella. In these analyses monocots were basal.
Analyses of Concatenated Amino Acid Alignment
These results were further tested with analyses of the 14,811-position-long alignment of the translated sequences. Heuristic search employing maximum-parsimony algorithm (PAUP*) yielded the topology congruent with monocots basal (fig. 2). The branch bearing Nymphaea and Amborella and the one uniting all dicotyledonous plants were found in 100/100 bootstrap trees, whereas the branch bearing the three magnoliids was recovered in 99/100 bootstrap replicas.
Distance analyses of the protein alignment were performed with the Treecon package employing Kimura and Tajima-Nei models and with the PHYLIP package (Felsenstein 1989) with Dayhoff model. These analyses resulted in topologies identical to the one presented on figure 2 with 100/100 BP support values for the three aforementioned branches.
The neighbor-net (Bryant and Moulton 2004) tree built with the Protein LogDet method (Tholesson 2004) with all amino acid sites included had the topology congruent to the one shown in figure 2.
Maximum-likelihood analyses were performed with the Tree-Puzzle program with default settings and root assigned to Marchantia. The branching order of the eudicot clades—(Nicotiana/Spinacia), (Arabidopsis/Lotus), and the one leading to Oenothera could not be resolved in all ML analyses. The eudicot monophyly though, as well as monophyly of magnoliids, dicots, and the sister group relationship between Amborella and Nymphaea, received strong support. With Müller-Vingron, BLOSUM, Adachi-Hasegawa, Dayhoff, and Jones and Jones substitution models, the lowest QPS value supporting the eudicot branch in the above ML analyses was 96 QPS. The single lowest quartet-puzzling support value obtained for the other three branches was 97.
Analyses of the Individual Alignments
Rearrangements in 11 plastomes of the spermatophytes affecting gene order involve large chunks of DNA. Therefore, on the level of at least spermatophytes, orthology of every gene under analysis as well as common evolutionary history of all 61 genes can easily be proved on the basis of gene order identity and general sequence similarity along the large stretches of cpDNA from different species. Yet the land plant tree topologies derived from the chloroplast genes are often different.
To investigate possible phylogenetic biases of individual genes, we counted the bootstrap values supporting the branches bearing outgroups with Nymphaea, Amborella or with Nymphaea and Amborella taken together, Calycanthus, Magnoliopsida, Rosopsida, and grasses in NJ/GTR trees built from the alignments of the first and the second codon positions from 61 coding genes common to the genomes of the land plants. The results of these analyses are shown in table 1. Here, we label a certain branch to be "supported" by a protein alignment when corresponding BP support value is no lower than the arbitrary value of 60%.
Table 1 Results of the Individual Analysis of the Alignments of the First and the Second Codon Positions.
One can see that one of six branches is supported by a much larger number of genes than the other five. This is the branch uniting outgroups with grasses that we recovered from the majority of analyses of the concatenated data set. This branch found some support in the analysis of atpF (67% BP), matK (60% BP), rpoB (99% BP), rpoC1 (100% BP), rpoC2 (99% BP), rps12 (61% BP), rps18 (76% BP), rps3 (77% BP), and rps8 alignments (90% BP). These alignments exhibit, respectively, 117, 479, 483, 357, 939, 24, 50, 146, and 81 informative positions.
We found no clear cases of support on the level of individual proteins for the branches uniting outgroups with (1) Nymphaea, (2) Amborella, and (3) Magnoliopsida. The branches bearing outgroups with (1) Nymphaea + Amborella, (2) Calycanthus, and (3) Rosopsida were supported each by a single protein, by, respectively, 74% BP (psbF), 73% BP (psaJ), and 70% BP ( ycf 3). The number of informative positions in alignments of these proteins are, respectively, 7, 17, and 35.
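The tally behind these statements can be reproduced from the bootstrap values and informative-position counts quoted in the text. The sketch below is only an illustration: the dictionary layout and variable names are assumptions, while the numbers are those given above for the branch uniting outgroups with grasses and for the three singly supported alternatives.

```python
SUPPORT_THRESHOLD = 60  # BP value (in %) at or above which a branch counts as "supported"

# gene -> (bootstrap support in %, number of informative positions)
grasses_with_outgroups = {
    "atpF": (67, 117), "matK": (60, 479), "rpoB": (99, 483),
    "rpoC1": (100, 357), "rpoC2": (99, 939), "rps12": (61, 24),
    "rps18": (76, 50), "rps3": (77, 146), "rps8": (90, 81),
}
alternative_branches = {
    "Nymphaea + Amborella": {"psbF": (74, 7)},
    "Calycanthus": {"psaJ": (73, 17)},
    "Rosopsida": {"ycf3": (70, 35)},
}


def summarize(genes):
    """Return (number of supporting genes, total informative positions)."""
    supporting = {g: v for g, v in genes.items() if v[0] >= SUPPORT_THRESHOLD}
    return len(supporting), sum(info for _, info in supporting.values())


print("grasses + outgroups:", summarize(grasses_with_outgroups))
for branch, genes in alternative_branches.items():
    print(branch + " + outgroups:", summarize(genes))
```

The contrast is in both columns: nine supporting genes with 2,676 informative positions in total, against single short genes with 7 to 35 informative positions.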
Additional Tests
We wished to evaluate support for the clade grouping the monocotyledonous species with outgroups in our analyses. One simple way of testing outgroup affinity of different angiosperm branches would be to count the number of positions in which the group to be tested shares with the most closely related outgroup a character state that is not observed in any other angiosperm sequence. Given a low level of homoplasy (achieved, for example, by excluding the highly homoplastic third positions), one would expect such positions in the group to be tested to contain the ancestral character states that existed before the splitting of angiosperms and the outgroup and that subsequently mutated in the other angiosperm lineages. Another way to test the basal monocot-dicot split would be to count the number of synapomorphies supporting alternative branches within the angiosperm ingroup. One can expect more synapomorphies between more closely related ingroup taxa.
We found the concatenated alignment of the first and the second codon positions sampled from 61 chloroplast genes to contain 68 positions with bases shared between three grasses and Pinus to the exclusion of other angiosperms. Three species of Magnoliopsida (Calycanthus, Nymphaea, and Amborella) share only 13 such positions with Pinus, and five species of Rosopsida share only five. The part of the ANITA grade (Nymphaea + Amborella + Pinus) is supported by 10 positions. We deleted nonspermatophyte outgroups and repeated the analysis. The number of positions supporting outgroup affinity of the above four groups changed to 151 (grasses + Pinus), 56 (Nymphaea + Amborella + Pinus), 30 (Calycanthus + Amborella + Nymphaea + Pinus), and 19 (Rosopsida + Pinus). For comparison, monophyly of angiosperms and spermatophytes is supported by 532 and 631 positions, respectively. The positions supporting outgroup affinity of grasses and of the species from the ANITA grade are shown in figure 3.
FIG. 3. The alignment positions in which grasses (above) and members of ANITA group (below) share characters with Pinus to the exclusion of all other angiosperm lines in the alignment of the first and the second codon positions of 61 protein-coding genes common to 14 genomes of land plants
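The position count described above can be sketched as a scan over alignment columns. The version below assumes that a column qualifies when every member of the tested group carries the outgroup's state and no other angiosperm does; the function name, that reading of the criterion, and the toy alignment are assumptions made for illustration, not the authors' script.

```python
def outgroup_affinity_positions(alignment, group, outgroup, others):
    """Indices of columns where `group` shares the outgroup state exclusively.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths)
    group:     taxa being tested for outgroup affinity
    outgroup:  name of the reference outgroup taxon
    others:    all remaining ingroup (angiosperm) taxa
    """
    hits = []
    for i, state in enumerate(alignment[outgroup]):
        if state not in "ACGT":
            continue  # skip gaps and ambiguous outgroup states
        shared = all(alignment[taxon][i] == state for taxon in group)
        exclusive = all(alignment[taxon][i] != state for taxon in others)
        if shared and exclusive:
            hits.append(i)
    return hits


# Hypothetical six-column toy alignment; real data would have 29,991 columns.
toy_alignment = {
    "Pinus":       "ACGTAC",
    "Zea":         "ACGTAA",
    "Oryza":       "ACGTAA",
    "Nicotiana":   "ATGTCA",
    "Arabidopsis": "ATGTCA",
    "Amborella":   "ATGTCA",
}
grasses = ["Zea", "Oryza"]
rest = ["Nicotiana", "Arabidopsis", "Amborella"]
print(outgroup_affinity_positions(toy_alignment, grasses, "Pinus", rest))
```

Running the same function with each candidate group in turn, and with the remaining angiosperms as `others`, yields counts comparable to those reported for grasses, ANITA members, Magnoliopsida, and Rosopsida.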
In the total alignment of the first and the second codon positions, the mean distance across the range Pinus to monocots is 0.17 substitutions/position (ML estimation with Tamura-Nei model of substitution). Given that value, the probability that a position mutated twice since the separation of the gymnosperm and monocot lines is 0.0289, which corresponds to one twice-mutated position out of 34.6. Therefore, if the rate of substitutions in 151 alignment positions supporting grouping of monocots with Pinus (fig. 3) does not exceed the mean one characteristic of the whole alignment, this subset could be expected to contain approximately four homoplastic positions. The mean distance among eight dicotyledonous plants in the 151-position subset is 0.11, which is higher than the corresponding distance observed in the total alignment of 61 genes (0.065 substitutions/position, same model). Taking into account this somewhat elevated substitution rate, the 151-position subset can be expected to contain approximately seven homoplastic positions. One can also note that the GC content of the 151-position subset (49.3%) is close to equilibrium and is similar to the total GC content of the alignment (44.4%).
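The arithmetic in this paragraph can be retraced in a few lines. The sketch below takes the probability of a double hit as the square of the mean distance, as the text does, and assumes that the "elevated substitution rate" adjustment is a simple scaling by the ratio of the two dicot distances (0.11 / 0.065); that scaling is our reading, chosen because it reproduces the reported figure of approximately seven. The function and variable names are hypothetical.

```python
def expected_homoplastic_positions(n_positions, mean_distance, rate_ratio=1.0):
    """Expected number of twice-mutated positions in a subset of the alignment.

    mean_distance: substitutions per position along the path considered
    rate_ratio:    relative rate of the subset versus the whole alignment
                   (assumed to scale the expectation linearly)
    """
    p_double_hit = mean_distance ** 2
    return n_positions * p_double_hit * rate_ratio


pinus_to_monocots = 0.17        # ML estimate, Tamura-Nei model (from the text)
p_double = pinus_to_monocots ** 2
print(f"P(double hit) = {p_double:.4f}, i.e. one position in {1 / p_double:.1f}")

baseline = expected_homoplastic_positions(151, pinus_to_monocots)
adjusted = expected_homoplastic_positions(151, pinus_to_monocots, 0.11 / 0.065)
print(f"expected homoplastic positions: {baseline:.2f} (mean rate), "
      f"{adjusted:.2f} (elevated rate)")
```

Either way, the expected number of homoplastic positions is small relative to the 151 positions grouping the grasses with Pinus.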
However, because the first codon positions can undergo synonymous substitutions, it is not immediately clear from the above observations whether support for gymnosperm affinity of the line leading to Zea, Oryza, and Triticum would be reflected in the protein sequences. So we repeated the analysis, this time using a 14,811–amino acid alignment. It was found to contain 45 positions that clearly support the separation of grasses and Pinus from the rest of the species under analysis. By contrast, the affinity of ANITA members to Pinus was supported by 19 positions, and that of Magnoliopsida and Rosopsida was supported by nine and four positions, respectively. As in the analyses of nucleotide sequences, removal of fern and liverwort sequences resulted in a stronger signal. In the alignment containing only spermatophyte species, there are 86 positions in which Pinus and three grasses share a character to the exclusion of other angiosperms, as opposed to 42, 13, and 11 positions supporting affinity of Pinus to, respectively, ANITA members, Magnoliopsida, and Rosopsida.
The nucleotide alignment of 14 OTUs with 29,991 positions per species contains 93 synapomorphies shared among the dicots, whereas the number of synapomorphies supporting monophyly of eudicots and monocots is 17. In the 14,811–amino acid alignment, there are 93 synapomorphies that unite all dicots. The number of synapomorphies that unite monocots and eudicots in this alignment is 11.
Discussion
Our findings of strong contradictory signals between different phylogenetic analyses of angiosperm chloroplast genome data highlight the concern recently raised in Molecular Biology and Evolution over the appropriate analysis of complete genome data and the problems of sequence concatenation (Holland et al. 2004). Our findings are also cautionary: they underline the need for further complete genome sequences before strong conclusions can be drawn regarding the identification of the most basal angiosperms.
In the majority of our analyses, we detected support neither for (1) a monocot affinity of the line leading to modern Nymphaeales nor for (2) an early appearance of Nymphaeales in the evolutionary history of flowering plants. These analyses suggested that Nymphaea alba is a derived representative of a dicot lineage and, of all species under analysis, is most closely related to Amborella trichopoda. A clade uniting Amborella with Nymphaea was detected previously with lower support (Barkman et al. 2000; Graham and Olmstead 2000), but not in a derived position: in those trees it appeared as the first branch to split off among angiosperms. Because our taxon sampling remains limited, however, it is premature to assert a close taxonomic relationship between Amborella and Nymphaeales.
In analyses of the individual genes, we registered multiple cases of support for the sister-group relationship of those two plants, yet little support for their close gymnosperm affinity. It is the line leading to the three grasses that was found in the most basal position within the angiosperms in the highest number of phylogenetic analyses of the individual gene alignments. This branching was supported by the alignments of the genes (matK, rpoB, rpoC1, and rpoC2) containing the greatest numbers of informative positions among the 61 individual gene alignments we built. All alternative branches tested either received no support in these analyses or were each supported by a single alignment of a short gene with a low number of informative positions.
The interpretation of these results is straightforward: There are no bona fide cases of support for alternative rootings of angiosperms on the level of individual gene alignments. The appearance of alternative topologies in analyses of individual genes is probably the result of stochastic variations in the substitution process becoming visible when character sampling becomes small, as was recently observed in the analysis of 108 genes from yeasts (Rokas et al. 2003). Because bootstrap proportion values are affected by the ratio of characters supporting different topologies but not by their absolute numbers, such small biased samples could still exhibit high BP support values for incorrectly inferred branches.
Another conclusion that can be drawn from our analyses of individual genes is that data sets with few characters can be generally unreliable for elucidating phylogeny. Analyzed individually, no alignment recovered the very stable topology (fig. 2) that we inferred from the concatenated nucleotide alignment of 29,991 positions using the same method of tree construction. This is consistent with the more recent findings of Rokas et al. (2003) and with earlier analyses of chloroplast genome phylogeny comparing individual and concatenated alignments (Goremykin, Hansmann, and Martin 1997; Martin et al. 1998; Lockhart et al. 1999).
An attempt to additionally test the outgroup affinity of the grasses revealed that, among the branches checked, they share the largest number of characters with Pinus to the exclusion of all other angiosperms. The positions supporting that affinity are not hypermutated and do not exhibit any strong compositional bias. The number of such positions exceeds the number of positions supporting outgroup affinity of ANITA members approximately threefold in the nucleotide alignment and twofold in the amino acid alignment; other conflicting signals are substantially less pronounced. It is possible for nonadjacent and unrelated taxa on a tree to have more sequence identities than adjacent and related taxa if the molecular clock is violated in such a way that the substitution rate among the nonadjacent taxa becomes comparatively small. However, this consideration cannot explain the result in figure 3, because both the grasses and Pinus are borne on the branches that are the longest among the spermatophytes. The above results are also in good accord with the numbers of synapomorphies that can be observed within the angiosperm ingroup. In the concatenated protein and nucleotide alignments, the clade uniting all dicots is supported by, respectively, eightfold and fivefold larger numbers of synapomorphies than an alternative clade bearing eudicots and monocots.
These observations and the results of the individual gene analyses stand in sharp contrast to the topology favored by the optimal symmetric model applied to the concatenated sequence data. This apparent contradiction may be easily explained. When the internal branches of the true underlying tree are short compared with the external branches, tree building is expected to be problematic (Hendy and Penny 1987). This problem is exacerbated when sequence evolution is not well described by the assumed substitution model (Jermiin et al. 2004). In such cases, small deviations in model parameters can potentially lead to very different tree topologies supported by high BP values. Once the genes are concatenated, the optimal symmetric model will merely be the best average fit to all of them.
The majority of analyses presented here converge on the basal monocot-dicot split. However, the extreme difference between internal ingroup and external outgroup branch lengths in the concatenated gene angiosperm tree shown in figure 2 suggests that we may be some way from confidently identifying the most basal angiosperm. Clearly, the sequencing of genomes for more closely related outgroups and putatively basal angiosperms will be important for overcoming potential problems of model misspecification and long-branch attraction.
Supplementary Material
The sequence reported in this paper has been deposited in the EMBL database (accession number AJ627251).
Acknowledgements
This publication was supported by a grant of the Deutsche Forschungsgemeinschaft.
Literature Cited
Adams, K. L., Y. L. Qiu, M. Stoutemyer, and J. D. Palmer. 2002. Punctuated evolution of mitochondrial gene content: High and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proc. Natl. Acad. Sci. USA 99:9905-9912.
Arber, A. 1920. Water plants: A study of the aquatic angiosperm. Cambridge University Press, London.
Barkman, T. J., G. Chenery, J. R. McNeal, J. Lyons-Weiler, W. J. Ellisens, G. Moore, A. D. Wolfe, and C. W. dePamphilis. 2000. Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proc. Natl. Acad. Sci. USA 97:13166-13171.
Bergthorsson, U., K. L. Adams, B. Thomason, and J. D. Palmer. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197-201.
Bryant, D., and V. Moulton. 2004. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21:255-265.
Cronquist, A. 1981. An integrated system of classification of flowering plants. Columbia University Press, New York.
Dahlgren, R. M. T., and H. T. Clifford. 1982. The Monocotyledons: a comparative study. Academic Press, New York.
Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.
Felsenstein, J. 1989. PHYLIP (phylogeny inference package). Version 3.2. Cladistics 5:164-166.
Friis, E. M., K. R. Pedersen, and P. R. Crane. 2001. Fossil evidence of water lilies (Nymphaeales) in the early Cretaceous. Nature 410:357-360.
Goremykin, V., S. Hansmann, and W. Martin. 1997. Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Syst. Evol. 206:337-351.
Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003a. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499-1505.
Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003b. The chloroplast genome of the "basal" angiosperm Calycanthus fertilis – structural and phylogenetic analyses. Plant Syst. Evol. 242:119-135.
Graham, S. W., and R. G. Olmstead. 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87:1712-1730.
Guttenberg, H. V., and R. Müller-Schröder. 1958. Untersuchungen über die Entwicklung des Embryos und der Keimpflanze von Nuphar luteum. Planta 51:481-510.
Haines, R. W., and K. A. Lye. 1975. Seedlings of Nymphaeaceae. J. Linn. Soc. Bot. 70:255-265.
Hendy, M. D., and D. Penny. 1987. Edge lengths of trees from sequence data. Math. Biosci. 83:157-165.
Hiratsuka, J., H. Shimada, and R. Whittier, et al. (16 co-authors). 1989. The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct tRNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Mol. Gen. Genet. 217:185-194.
Holland, B. R., K. T. Huber, V. Moulton, and P. J. Lockhart. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. (in press).
Hupfer, H., M. Swiatek, S. Hornung, R. G. Herrmann, R. M. Maier, W.-L. Chiu, and B. Sears. 2000. Complete nucleotide sequence of the Oenothera elata plastid chromosome, representing plastome I of the five distinguishable Euoenothera plastomes. Mol. Gen. Genet. 263:581-585.
Jermiin, L. S., S. Y. W. Ho, F. Ababneh, J. Robinson, and A. W. D. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. (in press).
Kato, T., T. Kaneko, S. Sato, Y. Nakamura, and S. Tabata. 2000. Complete structure of the chloroplast genome of a legume, Lotus japonicus. DNA Res. 7:323-330.
Lockhart, P. J., C. J. Howe, A. C. Barbrook, A. W. D. Larkum, and D. Penny. 1999. Spectral analysis, systematic bias, and the evolution of chloroplasts. Mol. Biol. Evol. 16:573-576.
Lodkina, M. M. 1988. Evolutionary relationships of monocots and dicots derived from studies of embryo and seedlings data. Botanichesky Zhurnal 73:617-629 (in Russian).
Maier, R. M., K. Neckermann, G. L. Igloi, and H. Kossel. 1995. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. J. Mol. Biol. 251:614-628.
Martin, W., B. Stoebe, V. Goremykin, S. Hansmann, M. Hasegawa, and K. V. Kowallik. 1998. Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393:162-165.
Murray, M. G., and W. F. Thompson. 1980. Rapid isolation of high molecular weight DNA. Nucleic Acids Res. 8:4321-4325.
Ogihara, Y., K. Isono, and T. Kojima, et al. (19 co-authors). 2002. Structural features of a wheat plastome as revealed by complete sequencing of chloroplast DNA. Mol. Genet. Genomics 266:740-746.
Ohyama, K., H. Fukuzawa, and T. Kohchi, et al. (13 co-authors). 1986. Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322:572-574.
Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817-818.
Qiu, Y.-L., J. Lee, F. Bernasconi-Quadroni, D. E. Soltis, P. S. Soltis, M. Zanis, E. A. Zimmer, Z. Chen, V. Savolainen, and M. W. Chase. 1999. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404-407.
Rohweder, O., and P. K. Endress. 1983. Samenpflanzen. Georg Thieme Verlag, Stuttgart.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804.
Sato, S., Y. Nakamura, T. Kaneko, E. Asamizu, and S. Tabata. 1999. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6:283-290.
Savolainen, V., and M. W. Chase. 2003. A decade of progress in plant molecular phylogenetics. Trends Genet. 19:717-724.
Schaffner, J. H. 1904. Some morphological peculiarities of the Nymphaeaceae and Helobiae. Ohio Nat. 4:83-92.
Schaffner, J. H. 1934. Phylogenetic taxonomy of plants. Quart. Rev. Biol. 9:129-160.
Schmitz-Linneweber, C., R. M. Maier, J. P. Alcaraz, A. Cottet, R. G. Herrmann, and R. Mache. 2001. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization. Plant Mol. Biol. 45:307-315.
Shinozaki, K., M. Ohme, and M. Tanaka, et al. (23 co-authors). 1986. The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J. 5:2043-2049.
Soltis, P. S., D. E. Soltis, and M. W. Chase. 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402:402-403.
Staden, R., K. F. Beal, and J. K. Bonfield. 2000. The Staden package, 1998. Methods Mol. Biol. 132:115-130.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.
Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Takhtajan, A. 1966. Systema et phylogenia Magnoliophytorum. Nauka, Moscow, Leningrad.
Takhtajan, A. 1973. Evolution und Ausbreitung der Blütenpflanzen. Gustav Fischer Verlag, Jena.
Tholesson, M. 2004. LDDist: a Perl module for calculating LogDet pair-wise distances for protein and nucleotide sequences. Bioinformatics 20:416-418.
Van de Peer, Y., and R. De Wachter. 1994. TREECON for Windows: a software package for the construction and drawing of evolutionary trees for the Microsoft Windows environment. Comput. Applic. Biosci. 10:569-570.
Wakasugi, T., J. Tsudzuki, S. Ito, K. Nakashima, T. Tsudzuki, and M. Sugiura. 1994. Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc. Natl. Acad. Sci. USA 91:9794-9798.
Zanis, M. J., D. E. Soltis, P. S. Soltis, S. Mathews, and M. J. Donoghue. 2002. The root of the angiosperms revisited. Proc. Natl. Acad. Sci. USA 99:6848-6853.