Analysis of Acorus calamus Chloroplast Genome and Its Phylogenetic Implications(文章精)

Analysis of Acorus calamus Chloroplast Genome and Its Phylogenetic Implications

http://www.100md.com 分子生物学进展 2005年第9期

     * Institut für Spezielle Botanik, Universit?t Jena, Jena, Germany; Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand; Zentrum Pharmakologie und Toxikologie, Universit?t G?ttingen, G?ttingen, Germany

    E-mail: vadim.goremykin@uni-jena.de.

    Abstract

    Determining the phylogenetic relationships among the major lines of angiosperms is a long-standing problem, yet the uncertainty as to the phylogenetic affinity of these lines persists. While a number of studies have suggested that the ANITA (Amborella-Nymphaeales-Illiciales-Trimeniales-Aristolochiales) grade is basal within angiosperms, studies of complete chloroplast genome sequences also suggested an alternative tree, wherein the line leading to the grasses branches first among the angiosperms. To improve taxon sampling in the existing chloroplast genome data, we sequenced the chloroplast genome of the monocot Acorus calamus. We generated a concatenated alignment (89,436 positions for 15 taxa), encompassing almost all sequences usable for phylogeny reconstruction within spermatophytes. The data still contain support for both the ANITA-basal and grasses-basal hypotheses. Using simulations we can show that were the ANITA-basal hypothesis true, parsimony (and distance-based methods with many models) would be expected to fail to recover it. The self-evident explanation for this failure appears to be a long-branch attraction (LBA) between the clade of grasses and the out-group. However, this LBA cannot explain the discrepancies observed between tree topology recovered using the maximum likelihood (ML) method and the topologies recovered using the parsimony and distance-based methods when grasses are deleted. Furthermore, the fact that neither maximum parsimony nor distance methods consistently recover the ML tree, when according to the simulations they would be expected to, when the out-group (Pinus) is deleted, suggests that either the generating tree is not correct or the best symmetric model is misspecified (or both). We demonstrate that the tree recovered under ML is extremely sensitive to model specification and that the best symmetric model is misspecified. Hence, we remain agnostic regarding phylogenetic relationships among basal angiosperm lineages.

    Key Words: Acorus calamus ? chloroplast genomes ? angiosperms ? gymnosperms ? molecular evolution ? systematic phylogenetic error

    Introduction

    So far, application of molecular techniques has not been able to clarify the phylogenetic relationships among the major angiosperm lineages. Topologies suggested by cladistic studies are highly contradictory, especially regarding the placement of monocots, eumagnoliids, Piperales, Nymphaeales, Amborella, and Ceratophyllum. In our previous studies we investigated this issue by using large genomic data sets including sequence information from all genes shared by the land plant plastomes. We sequenced chloroplast genomes of Calycanthus fertilis (Goremykin et al. 2003a), Amborella trichopoda (Goremykin et al. 2003b), and Nymphaea alba (Goremykin et al. 2004). With a number of methods a topology in which the grasses clade is basal within the angiosperm group was found with high bootstrap support. Recently, Stefanovic, Rice, and Palmer (2004) and Soltis et al. (2004) argued that this result is due to poor taxon sampling causing a long-branch attraction (LBA) between the grasses and the out-group for some phylogenetic methods. Stefanovic, Rice, and Palmer (2004) produced 20 full and partial sequences of chloroplast protein-coding genes from Acorus americanus, a member of a genus considered by a number of botanists (see for example Duvall et al. 1993; Chase et al. 2000; Tamura et al. 2004) to be a representative of a line most ancient within Liliopsida. Their main conclusion—namely that the members of the ANITA (Amborella-Nymphaeales-Illiciales-Trimeniales-Aristolochiales) group (Amborella and Nymphaea) are basal among the angiosperms—stems mostly from the analyses in which grasses were substituted for Acorus rather than when Acorus was added to an alignment containing grasses as well. The authors argued in favor of the exclusion of grasses on the assumptions that (1) all the monocots are monophyletic as suggested in earlier studies and (2) that the basal position of the grasses we previously observed was due to the LBA between the out-group and the grasses. Their results gave reason for Soltis et al. (2004) to express the view that sampling of complete genomes is of relatively less benefit to accurate estimation of phylogeny than the broad sampling of taxa.

    Taking a different position from the one expressed by Soltis et al. (2004), we hypothesize that careful sampling of chloroplast genomes from species representing major lines in angiosperm evolution will allow us to resolve phylogenetic relationships among them. Because, based on recent analyses (Graham and Olmstead 2000; Zanis et al. 2002; Savolainen and Chase 2003), most angiosperms appear to be divided into just a few species-rich clades, this could be achieved perhaps with as few as 10–20 carefully chosen genomes.

    In order to improve taxon sampling for monocots, we have sequenced the chloroplast genome of Acorus calamus. This monocotyledonous plant possesses several morphological characters which might be interpreted as evidence of its close relationship to the dicots (Grayum 1987). Among these are the cellular type of endosperm development, a feature common among the dicots but unique among the monocots, and the dicotyledonous type of anther wall formation. However, one should note that until the sister group to the monocots is revealed, correct evaluation of morphological characters among monocots will remain difficult.

    We compiled an alignment consisting of all genes and conserved introns, as well as the conservative spacers common to the A. calamus genome and other known genomes of seed plants and used this data to assess the phylogenetic position of Acorus and its influence on basal angiosperm topology. Recent commentaries on the debate about angiosperm origins by Lockhart and Penny (2005) and Martin et al. (2005) both highlight why we expect this to be a challenging problem for current phylogenetic estimation methods. Hence, we have taken a cautious approach and tried to identify which if any of the phylogenies resulting from different phylogenetic methods can be trusted. Unfortunately, at this stage, it seems the answer is none.

    Materials and Methods

    DNA Extraction and Genomic Sequencing

    Fresh leaves of A. calamus were harvested from a plant growing in the botanical garden of the University of Jena, Germany. A voucher specimen has been deposited at the Herbarium Haussknecht (JE). Total DNA was extracted using the CTAB-based method (Murray and Thompson 1980) and purified with columns (Qiagen, Valencia, Calif.) according the manufacturer's protocol. Series of polymerase chain reaction (PCR) products covering the entire plastome of A. calamus were generated employing long-range PCR techniques as described previously (Goremykin et al. 2003a, 2003b, 2004). Shotgun libraries were generated from these PCR products as described previously (above papers). Several shorter amplificates were sequenced by primer walking. Automated sequencing was performed on ABI 3100 sequencers as described previously (above papers). All reads were base-called with the PHRED program (Ewing et al. 1998). Sequence masking and assembly was performed with the STADEN Package (Staden, Beal, and Bonfield 2000) on a Linux Pentium III PC. The shotgun sequencing data were accumulated until 8x coverage for all PCR fragments was achieved. A 3x coverage was generated for all the shorter fragments sequenced by primer walking.

    Results

    Genome Properties

    The chloroplast genome of A. calamus (fig. 1) is a circular molecule, 153,821 bp long, with a structure often found among land plant plastomes—two inverted repeats separated by the small single-copy region and large single-copy region (LSCR). A number of rearrangements and deletions characteristic of the completely sequenced plastomes of monocots (all from the family Poaceae)—Oryza (Hiratsuka et al. 1989), Zea (Maier et al. 1995), Triticum (Ogihara et al. 2002), and Saccharum (Asano et al. 2004)—were not found in the chloroplast genome of Acorus, namely: three inversions in the LSCR; translocation of the rpl23 gene from the inverted repeat to the LSCR; loss of an intron in the rpoC1 gene; and a large insertion in the rpoC2 gene. The genome is colinear to the chloroplast genomes of N. alba (Goremykin et al. 2004) and A. trichopoda (Goremykin et al. 2003b) in respect to gene order and overall homology. We estimated, based on Blast searches against our local database of chloroplast genes (as described earlier, Goremykin et al. 2003a), that the total number of gene species encoded by the A. calamus chloroplast genome is at least 112. The difference in gene content between the chloroplast DNA (cpDNA) of A. calamus and the genomes of N. alba and A. trichopoda is minimal. Compared to these two genomes, Acorus cpDNA lacks sequences for the accD and ycf15 genes. Absence of the accD gene, however, cannot be taken as evidence of a close relationship between Acorus and grasses (also lacking this gene in their plastome sequence) because a number of other monocot plastomes have a copy of accD (Katayama and Ogihara 1996). The inverted repeat in the genome includes one tRNA gene species—trnH (gug), which plastomes of Amborella and Nymphaea harbor only in the LSCR.

    FIG. 1.— Acorus calamus cpDNA. The topmost part of the map corresponds to the start and the end of the EMBL sequence entry (AJ879453 [GenBank] ). Genes shown inside the circle are transcribed clockwise, while genes outside the circle are transcribed counterclockwise. The genes of the genetic apparatus are shown in red, photosynthesis genes are indicated as green, and genes of NADH dehydrogenase are shown in violet. The ORFs, ycfs and genes of unknown function are designated as grey. Intron-containing genes (names of which are indicated in blue) are represented by their exons. In the cases of overlapping genes, one of them is shifted off the map to show its position.

    Data Choice

    In our previous studies we used only the protein-coding subset of the cpDNA for elucidating phylogeny. This was done mainly because of the ease of sampling annotated genes from public databases. Here, we followed a more inclusive strategy: we aligned all the genomic regions with strong sequence similarity among the species under analysis. Previously we successfully used spacer and RNA gene sequences for land plant phylogeny reconstruction (Goremykin et al. 1996). Our recent estimates (Goremykin et al. 2003a) showed that angiosperm genomes harbor approximately 2.5-fold more DNA stretches that are possible to align compared to the approximately 28-kb data set of the first and the second codon positions we previously analyzed (under 0.4 substitutions per position divergence threshold in comparisons among Pinus and Calycanthus). Performing Blast searches with relaxed parameters, we estimated a set of continuous DNA stretches for which a similarity exists between Pinus and Acorus and Acorus and Oryza. The former cpDNA subset was considerably shorter than the latter because the Pinus chloroplast genome (Wakasugi et al. 1994) is reduced in size and contains fewer genes than the chloroplast genomes of angiosperms. We Blasted the set of the sequences similar between Pinus and Acorus against chloroplast genomes of Nicotiana (Shinozaki et al. 1986), Oryza (Hiratsuka et al. 1989), Zea (Maier et al. 1995), Arabidopsis (Sato et al. 1999), Oenothera (Hupfer et al. 2000), Lotus (Kato et al. 2000), Spinacia (Schmitz-Linneweber et al. 2001), Triticum (Ogihara et al. 2002), Calycanthus (Goremykin et al. 2003a), Amborella (Goremykin et al. 2003b), Nymphaea (Goremykin et al. 2004), Panax (Kim and Lee 2004), and Saccharum (Asano et al. 2004); upon extraction of Blast hits, we created a series of files containing homologous sequences from these species in Fasta format. To increase the amount of homologous material for phylogenetic analyses, we repeated the above procedure using the set of the sequences similar between Acorus and Oryza cpDNAs. At the next step we detected all redundant sequences among these two file sets and merged overlapping sequence files. A final set of files containing homologous sequences was automatically aligned with the ClustalW program, manually edited, and concatenated in one file. Genes for which the mean distance between Pinus and the angiosperms was greater than the arbitrary value of 0.5 substitutions per site at their synonymous sites had their third codon positions removed from the alignment (previous estimation by Goremykin et al. 2003b). Sequences of genes exhibiting 0.5 and lower dS values were left intact (rps12, rps7, psbL, psaI, lhbA, petG, ycf3), as well as sequences of all genes coding for subunits of NAD dehydrogenase, for which no functional homologs exist in Pinus cpDNA. The resulting alignment of 89,436 bp for 15 taxa (available upon request) contains almost all sequences usable for phylogeny reconstruction within spermatophytes which the above genomes harbor. In addition to the 61 protein-coding genes common to the plastomes of land plants, this alignment contains all RNA genes (34 gene species), all introns, and a number of conserved spacer sequences characteristic of the plastomes of the land plants.

    Phylogenetic Reconstruction

    Determining the Best-Fitting Reversible Model

    Using the hierarchical likelihood ratio test (hLRT) as implemented in Modeltest version 3.06 (Posada and Crandall 1998) we found that, of the 56 models tested, the GTR + G + I model of nucleotide substitution provided the best fit to the data. We estimated the maximum likelihood (ML) tree using PAUP version 4.0b10 (Swofford 2002) by fixing the parameters of the GTR + G + I model to those suggested by Modeltest. PAUPs default heuristic search settings were used. The resulting ML tree is shown in figure 2A; it places the clade containing Amborella and Nymphaea basal to the rest of the angiosperms, and the monocots form a monophyletic group. In order to perform parametric bootstrap analyses we repeated the above process for a reduced alignment of 54,177 sites that contained no gaps. It was necessary to remove gaps as simulated DNA sequences cannot contain any characters other than A, C, G, or T, and we wished to compare the parametric bootstrap alignments to the observed alignment. Again the preferred model under the hLRT was GTR + G + I, and the ML tree topology was identical to that shown in figure 2A with highly similar branch lengths. Using the Seq-Gen (v1.3) program of Rambaut and Grassly (1997), we simulated 100 data sets of length 54,177 under the ML tree using the parameters determined by Modeltest.

    FIG. 2.— (A) The ML tree found using PAUP (Swofford 2002) using a GTR + G + I model with parameters set to those found by Modeltest (Posada and Crandall 1998). (B) The tree recovered by parsimony for 100 parametric bootstrap alignments (generated on the tree in A). (C) The tree recovered by NJ with GTR-based distances for 66 out of 100 parametric bootstrap alignments (generated on the tree in A).

    Performance of Parsimony and Distance Methods: Testing for LBA

    The output of the parametric bootstrap was used in several ways. First, we wanted to assess if parsimony and distance methods would be expected to correctly estimate the tree in figure 2A if it were true. As a control we also checked that ML with a correctly specified model could recover the generating topology. This indeed was the case, the generating tree was recovered in 100/100 parametric bootstrap samples for all three data sets described below. For each parametric bootstrap alignment we constructed both a parsimony tree and a neighbor-joining (NJ) tree. For parsimony we used PAUPs default heuristic search settings with the exception that multrees was set to no; this forces the program to return a single tree for each data set. For NJ, for each parametric bootstrap alignment and for the observed alignment, we did the following: we constructed an initial NJ tree using p distances; fixing this tree we estimated the rate matrix parameters of the general time reversible (GTR) model using ML; fixing these rate matrix parameters we constructed a NJ tree using ML distances.

    In the parsimony analysis the tree shown in figure 2B was returned by all 100 data sets. This tree is very similar to the parsimony tree for the observed alignment (both complete and ungapped), it differs only in the placement of Acorus which for the observed alignment is in a subtree (Acorus, (Calycanthus, (Amborella, Nymphaea))). For both the complete and ungapped alignments there was 100% bootstrap support for grasses being basal and for the paraphyly of the monocots. In the NJ analysis the parametric bootstrap always returned one of two trees, the most common variant (66 cases) is shown in figure 2C, the only difference in the other variant (34 cases) was that it contained the clade (Calycanthus, (Amborella, Nymphaea)) instead of (Calycanthus, Acorus). The tree in figure 2C is exactly the same as the NJ tree recovered from the observed alignment (both complete and ungapped), and all the bootstrap values in the NJ trees for the complete and ungapped alignments were at least 94%. The trees in figures 2B and C differ from the tree used to generate the data, in particular, grasses are basal and monocots are paraphyletic. In other words, if the tree in figure 2A and the model of nucleotide substitution used were true, we would not expect parsimony or NJ with GTR-based distances to recover the tree correctly.

    It appears that parsimony and NJ are failing to recover the generating tree due to LBA; to test this hypothesis we removed Pinus, one of the two long branches, and repeated the parametric bootstrap analyses. In this case the generating topology (fig. 2A, excluding Pinus) was recovered by parsimony for all 100 of the parametric bootstrap alignments and by NJ in 90 out of 100 alignments (in the other 10 NJ trees Acorus was a sister to a clade comprising magnoliids, and grasses and eudicots form a clade together). In the observed data the tree estimated when Pinus was removed differed depending on both the method used and whether the gapped or ungapped alignment was used. Parsimony on the complete alignment placed Acorus as a sister to a clade containing all eudicots and grasses with 94% bootstrap support, but the ungapped alignment gave a tree identical to that in figure 2A (excluding Pinus) with 82% bootstrap support for placing Acorus as sister taxa to the grasses. The NJ analysis gave a tree identical to that shown in figure 2C (excluding Pinus) with 98% and 63% bootstrap support for the clade (Acorus, Calycanthus) in the complete and ungapped alignments respectively. If the tree in figure 2A (excluding Pinus) and the model of nucleotide substitution used were true, we would expect both parsimony and NJ, with GTR-based distances, to recover it correctly. The fact that these methods fail to recover the generating tree (except in the case of parsimony applied to the ungapped alignment) might be interpreted as evidence that either the generating tree or the model of nucleotide substitution are not an accurate description of the observed data. LBA may be insufficient to explain the paraphyly of the monocots in the parsimony and NJ trees, built on the basis of the observed data.

    The other long branch is that leading to the grasses, so we also tried analyses in which Oryza, Triticum, Zea, and Saccharum were excluded. The generating topology (fig. 2A, excluding grasses) was never recovered by parsimony in 100 parametric bootstrap alignments. However, the only error was in the placement of Pinus; it either paired with Amborella (66 cases) or with Nymphaea (34 cases). In the observed data the parsimony tree differs depending on whether the gapped or ungapped alignment is used. The complete alignment gives the tree shown in figure 2C, but with grasses excluded; the ungapped alignment gives a parsimony tree which, apart from the eudicot clade, has no edges in common with the generating tree. Applying NJ to the parametric bootstrap alignments also failed to recover the generating tree, instead the tree in figure 2C (excluding the grasses) was recovered 95 times out of 100. This same tree was also found, with bootstrap support >95%, on applying NJ to both the complete and ungapped observed alignments. These analyses suggest that, if the tree in figure 2A (this time with grasses removed) and the model of nucleotide substitution used were true, we would not expect parsimony or NJ with GTR-based distances to recover the tree correctly. While this is not an obvious case of LBA (i.e., there are no two long branches relative to the rest of the tree), both NJ and parsimony still suffer from (different) systematic phylogenetic errors when they are used to analyze the parametric bootstrap alignments. LBA also seems to be an unlikely explanation for the non-basal position of Amborella and Nymphaea on the NJ and parsimony trees, obtained from the original data.

    NJ results under a variety of other models (Jukes-Cantor [JC], F81, F84, K2P, K3P, Hasegawa, Kishino, and Yano [HKY], Tajima-Nei, Tamura-Nei, GTR, and LogDet) give the topology shown in figure 2C with high bootstrap support. This suggests that, if no gamma correction is used, the type of correction for multiple changes that is used is not of importance. Some further distance-based analyses with ML distances that do use a gamma correction follow later in Testing Robustness to Model Specification.

    Assessing the Fit of the Model to the Data

    The second use of the parametric bootstrap was to assess whether the GTR + G + I model in combination with the ML tree is sufficient to explain the observed data. In other words: is the best symmetric model of nucleotide substitution a good model? Following the method used in Spectronet (Huber et al. 2002) we converted each alignment (100 parametric bootstrap alignments and 1 observed ungapped alignment) to a spectrum of direct splits (fig. 3 shows the split spectrum for the observed data). These spectra give a count of how many sites support different splits (bipartitions) of the taxa set (sites with more than two states are RY coded). For instance, we see in figure 3 that there are 1,280 sites in the observed data that support the split of the grasses from everything else; these would be sites where the grasses all share one state and the other taxa share another. There are 263 sites that support a Pinus + grasses group, 125 that support Amborella + Pinus, 117 that support Nymphaea + Pinus, and 96 supporting Amborella + Nymphaea + Pinus. For all splits in the observed alignment that had more than 20 supporting sites, of which there were 88, we compared the number of supporting sites in the observed ungapped alignment to the numbers found in the 100 parametric bootstrap alignments. Each split was assigned a P value that recorded a proportion of parametric bootstrap alignments with fewer supporting sites for that split than in the observed alignment, and a summary of these P values is shown in figure 4. If the model fitted well we would expect this distribution to be approximately uniform (with a slight selection bias towards low P values as we have chosen splits on the basis that they have high support in the observed alignment). Instead the distribution looks bimodal; 49 out of 88 values fall within the two 10% tails, and 26 out of 88 values are entirely outside the range of the parametric bootstrap.

    FIG. 3.— Lentoplot generated by Spectronet (Huber et al. 2002). Bars above the x axis indicate the numbers of supporting sites for various splits. Bars below the x axis are a measure of the amount of conflict for that split (note the change in scale). Dots for each bar indicate taxa on one half of the split. Sites which had more than two different states were RY coded.

    FIG. 4.— Histogram of P values for the 88 splits in the observed alignment with more than 20 supporting sites. P values for each split are calculated as the proportion of parametric bootstrap alignments with fewer supporting sites for that split than in the observed alignment. The black shading of the rightmost and leftmost bars indicate the number of splits with support values of exactly 0 or 1, that is, outside the range of the parametric bootstrap.

    We also tested the fit of the ML tree and GTR + G + I model by applying the test described in Reeves (1992) in which the unconstrained likelihood is compared to the likelihood of the ML tree. This test was applied to the ungapped alignment. The unconstrained likelihood is –ln L = 219,336.43; there are 6,856 distinct patterns in the alignment, so this model has 6,855 free parameters. The likelihood of the best symmetric model and the tree found using PAUP was –ln L = 247,497.95, this model has 10 parameters. The test statistic, which should have a chi-square distribution, is 2(ln L1 – ln L2) = 56,323.04 and has 6,845 degrees of freedom. The ML tree + GTR + G + I model is rejected with a P value of zero (to over 20 decimal places). The Akaike information criterion (AIC) also rejects this model.

    Testing Robustness to Model Specification

    Given that the best reversible model appears to be misspecified, we wanted to know if the ML tree would be robust to changes in the model. The work of Stefanovic, Rice, and Palmer (2004) suggests that the gamma-shape parameter is of particular importance. Using the complete alignment (with gapped positions) we fitted a GTR + G model, allowing the value of the shape parameter to vary from 0.1 to 2 in steps of 0.1. For each value of we used the following procedure in PAUP: we constructed a NJ tree using p distances; fixing this tree we estimated the parameters of the rate matrix using ML; and fixing these rate matrix parameters we constructed 100 NJ trees using ML distances. The results are shown in figure 5. Depending on the value of , any of four different groups are found to be basal with bootstrap support of greater than 50%.

    FIG. 5.— The different lines show the bootstrap support received by different topologies in a NJ analysis. Distances were estimated by ML using the GTR + G model for different values of the gamma-shape parameter.

    To further investigate sensitivity of the results to model specification we used the PHYML program (Guindon and Gascuel 2003). This was used instead of PAUP due to its greater speed. We produced a set of 72 ML trees from our alignment of 15 OTUs (operational taxonomic units), utilizing six available models—JC, Felsenstein F81, Kimura two parameter, HKY, Tamura-Nei, and GTR—with numerous settings per model, including eight with four gamma rate categories. Similar sets of 72 trees were produced on the basis of the alignment of 14 OTUs (with Pinus sequence removed) and an alignment of 11 OTUs (with the members of Poaceae removed). The results of these experiments and model settings used are summarized in table 1. The topologies numbered in this table are given in figure 6. None of the trees produced with the PHYML program had the topology with the ANITA members (Nymphaea and/or Amborella) forming the basalmost clade in the flowering plant cluster; also, none of the PHYML trees find monophyly of the monocots. The topology found by PAUP shown in figure 2A has a higher likelihood than the topology found by PHYML under comparable settings (model 8 as per table 1), so there appears to be a problem with the PHYML search procedure becoming trapped in a local optima. However, this does not detract from the main result, namely, the sensitivity of tree estimation to changes in model specification and taxon sampling.

    Table 1 Results of the ML Analyses

    FIG. 6.— Results of PHYML analyses. Trees are numbered as in table 1, which also contains information on models used in their estimation. (Note that although trees without Pinus are shown as rooted trees they should be considered unrooted.)

    Discussion

    Sequencing the chloroplast genome of A. calamus was motivated primarily by the question "What is the basalmost lineage of angiosperms?" Recently this question has been intensely debated (Goremykin et al. 2003a, 2003b, 2004; Soltis et al. 2004; Stefanovic, Rice, and Palmer 2004), and previous work has suggested that both taxon sampling and the suitability of current models of sequence evolution are important issues (Soltis et al. 2004; Stefanovic, Rice, and Palmer 2004). It was hoped that the addition of Acorus would break up the long branch leading to the grasses and thus improve the balance of the taxon sampling. However, instead of being able to answer the original question posed we find ourselves with a new question "Are the monocots monophyletic?" Two recent commentaries (Lockhart and Penny 2005; Martin et al. 2005) suggest that we should expect angiosperm phylogeny to be difficult to resolve, hence we have taken a cautious approach and tried to determine if any of the phylogenetic methods we used provide trustworthy answers.

    In order to test if LBA was influencing the results found by parsimony and distance-based methods we used a parametric bootstrap approach, simulating alignments along the best tree found using ML (fig. 2A) under the model parameters determined by Modeltest. We found that neither parsimony nor NJ (with GTR-based distances) would be expected to recover this tree were it true; this shows that—if the generating topology reflects the true underlying tree—LBA might be expected to be a problem for these methods. Removing the grasses, and therefore potential for LBA, however, does not eliminate the problem; the parametric bootstrap showed that neither parsimony nor NJ would be expected to recover the ML tree were it correct and the substitution model used in simulations were correctly specified. Focusing on the issue of monocot monophyly we removed Pinus from the analysis; the parametric bootstrap indicated that, with Pinus excluded, both parsimony and NJ should be able to recover the generating topology. However, results with the observed alignment varied depending on method and treatment of gaps: parsimony gave monophyly of the monocots with the ungapped alignment, but with the complete alignment it supported a tree wherein a clade containing Acorus and Calycanthus is a sister to a clade bearing Nymphaea and Amborella. NJ yielded this last topology for both the complete and the ungapped alignment.

    The results of the parametric bootstrap after removing Pinus suggest that LBA is not sufficient to explain the paraphyly of the monocots observed using parsimony and distance-based methods. The distribution of morphological characters in Acorus is such that a scenario in which this genus was derived from a lineage that includes Aristolochiaceae and/or Piperaceae/Saururaceae cannot be excluded a priori. The palaeoherb affinity of Acorus is supported by the presence of idioblasts containing ethereal oil, the absence of calcium oxalate raphides, the presence of secretory anther tapetum instead of the periplasmodial tapetum widespread in monocots, and the presence of perisperm and cellular endosperm development (Grayum 1987; Bogner and Mayo 1998). Topologies in which Acorus was embedded within Magnoliopsida and distant from the other monocots were found in analyses of 18S-RNA sequences (Bharathan and Zimmer 1995; Duvall and Bricker Ervin 2004). Future genomic studies with better sampling of monocots will surely shed more light onto the issue of monocot paraphyly versus monophyly (that is, the robustness of cotyledon number as a taxonomic trait at higher levels).

    The non-basal position, which Amborella and Nymphaea assumed on a number of trees built from the observed data after removal of the grasses, obviously, also cannot be attributed to the LBA between the branch of grasses and the outgroup. Recent paleontological findings (Friis, Pedersen, and Crane 2004) extended the paleontological record of monocots into the earliest fossil assemblages containing angiosperm stamens and carpels. This indicates that the possibility of a monocots-basal topology would not be at odds with currently available fossil evidence.

    We also wanted to test how well the ML tree and model, found using PAUP in combination with Modeltest, fit the observed data. The first approach was to compare the split-support spectrum for the (ungapped) observed alignment to the split-support spectra for the parametric bootstrap alignments. This revealed that many splits had numbers of supporting sites that were significantly higher or lower than was predicted by the model. We also performed the likelihood ratio test described in Reeves (1992), which compares the unconstrained likelihood to that of the ML tree. This test rejected the ML tree model. Ané et al. (2005) recently found evidence for covarion structure (i.e., for the sites altering their rate class in different lineages) in chloroplast coding regions, which may explain why reversible models that cannot take covarion data structures into account are not a good description of this data. Model misspecification may also arise due to fitting a single model to an alignment that contains many concatenated genes.

    Previous studies (Stefanovic, Rice, and Palmer 2004) have found that the tree estimated by ML is very sensitive to the choice of model. Our results presented here corroborate this finding. The tree estimated by PHYML was dependent on model choice; also the support for different basalmost lineages in a NJ analysis varied depending on the choice of gamma-shape parameter. The fact that the best reversible model is misspecified taken in combination with the sensitivity of the trees to model choice means that the ML results should also be treated with extreme caution.

    Despite large numbers of characters for analysis, the issues of the basalmost angiosperms and the monophyly of monocots are not, in our view, presently resolved. This confirms that the early phase of angiosperm lineage diversification is a difficult problem in molecular phylogenetics that will require a larger taxon sample and a large number of sites to resolve. Our results suggest a criterion which should be met before we accept one topology of basal angiosperms—namely, insensitivity of the tree to the model changes. Our results also suggest that the need for more realistic ML models which provide a better approximation of lineage-specific evolution is great.

    References

    Ané, C., J. G. Burleigh, M. M. McMahon, and M. J. Sanderson. 2005. Covarion structure in plastid genome evolution: a new statistical test. Mol. Biol. Evol. 22:914–924.

    Asano, T., T. Tsudzuki, S. Takahashi, H. Shimada, and K. Kadowaki. 2004. Complete nucleotide sequence of the sugarcane (Saccharum officinarum) chloroplast genome: a comparative analysis of four monocot chloroplast genomes. DNA Res. 11:93–99.

    Bharathan, G., and E. A. Zimmer. 1995. Early branching events in monocotyledons—partial 18s ribosomal DNA sequence analysis. Pp. 81–107 in P. J. Rudall, P. J. Cribb, D. F. Cutler, and C. J. Humphries, eds. Monocotyledons: systematics and evolution. Royal Botanic Gardens, Kew, United Kingdom.

    Bogner, J., and S. J. Mayo. 1998. Acoraceae. Pp. 7–11 in K. Kubitzki, ed. The families and genera of vascular plants. Vol. 4. Springer Verlag, N. Y.

    Chase, M. W., D. E. Soltis, P. S. Soltis et al. (13 co-authors). 2000. Higher-level systematics of the monocotyledons: an assessment of current knowledge and a new classification. Pp. 3–16 in K. L. Wilson and D. A. Morrison, eds. Monocots: systematics and evolution. CSIRO, Melbourne, Australia.

    Duvall, M. R., and A. Bricker Ervin. 2004. 18S gene trees are positively misleading for monocot/dicot phylogenetics. Mol. Phylogenet. Evol. 30:97–106.

    Duvall, M. R., G. H. Learn Jr, L. E. Eguiarte, and M. T. Clegg. 1993. Phylogenetic analysis of rbcL sequences identifies Acorus calamus as the primal extant monocotyledon. Proc. Natl. Acad. Sci. USA 90:4641–4644.

    Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185.

    Friis, E. M., K. R. Pedersen, and P. R. Crane. 2004. Araceae from the early cretaceous of Portugal: evidence on the emergence of monocotyledons. Proc. Natl. Acad. Sci. USA 101:16565–16570.

    Goremykin, V., V. Bobrova, J. Pahnke, A. Troitsky, A. Antonov, and W. Martin. 1996. Noncoding sequences from the slowly evolving chloroplast inverted repeat in addition to rbcL data do not support gnetalean affinities of angiosperms. Mol. Biol. Evol. 13:383–396.

    Goremykin, V. V., K. I. Hirsch-Ernst, S. W?lfl, and F. H. Hellwig. 2003a. The chloroplast genome of the "basal" angiosperm Calycanthus fertilis—structural and phylogenetic analyses. Plant Syst. Evol. 242:19–135.

    ———. 2003b. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499–1505.

    Goremykin, V. V., K. I. Hirsch-Ernst, S. W?lfl, and F. H. Hellwig. 2004. The chloroplast genome of Nymphaea alba: whole-genome analyses and the problem of identifying the most basal angiosperm. Mol. Biol. Evol. 21:1445–1454.

    Graham, S. W., and R. G. Olmstead. 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87:1712–1730.

    Grayum, M. H. 1987. A summary of evidence and arguments supporting the removal of Acorus from the Araceae. Taxon 36:723–729.

    Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–504.

    Hiratsuka, J., H. Shimada, R. Whittier et al. (16 co-authors). 1989. The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct tRNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Mol. Gen. Genet. 217:185–194.

    Huber, K. T., M. Langton, D. Penny, V. Moulton, and M. Hendy. 2002. Spectronet: a package for computing spectra and median networks. Appl. Bioinformatics 1:159–161.

    Hupfer, H., M. Swiatek, S. Hornung, R. G. Hermann, R. M. Maier, W.-L. Chiu, and B. Sears. 2000. Complete nucleotide sequence of the Oenothera elata plastid chromosome, representing plastome I of the five distinguishable Euoenothera plastomes. Mol. Gen. Genet. 263:581–585.

    Katayama, H., and Y. Ogihara. 1996. Phylogenetic affinities of the grasses to other monocots as revealed by molecular analysis of chloroplast DNA. Curr. Genet. 29:572–581.

    Kato, T., T. Kaneko, S. Sato, Y. Nakamura, and S. Tabata. 2000. Complete structure of the chloroplast genome of a legume, Lotus japonicus. DNA Res. 7:323–330.

    Kim, K.-J., and H.-L. Lee, 2004. Complete chloroplast genome sequence from Korea Ginseng (Panax schinseng Nees) and comparative analysis of sequence evolution among 17 vascular plants. DNA Res. 11:247–261.

    Lockhart, P., and D. Penny. 2005. The place of Amborella within the radiation of angiosperms. Trends Plant Sci. 10:201–202.

    Maier, R. M., K. Neckermann, G. L. Igloi, and H. Kossel. 1995. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. J. Mol. Biol. 251:614–628.

    Martin, W., O. Deusch, N. Stawski, N. Grünheit, and V. Goremykin. 2005. Chloroplast genome phylogenetics: why we need independent approaches to plant molecular evolution. Trends Plant Sci. 10:203–209.

    Murray, M. G., and W. F. Thompson. 1980. Rapid isolation of high molecular weight plant DNA. Nucleic Acids Res. 19:4321–4325.

    Ogihara, Y., K. Isono, T. Kojim et al. (19 co-authors). 2002. Structural features of a wheat plastome as revealed by complete sequencing of chloroplast DNA. Mol. Genet. Genomics 266:740–746.

    Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818.

    Rambaut, A. E., and N. C. Grassly. 1997. Seq-gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 1:235–238.

    Reeves, J. H. 1992. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J. Mol. Evol. 35:17–31.

    Sato, S., Y. Nakamura, T. Kaneko, E. Asamizu, and S. Tabata. 1999. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6:283–290.

    Savolainen, V., and M. W. Chase. 2003. A decade of progress in plant molecular phylogenetics. Trends Genet. 19:717–724.

    Schmitz-Linneweber, C., R. M. Maier, J. P. Alcaraz, A. Cottet, R. G. Herrmann, and R. Mache. 2001. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization. Plant Mol. Biol. 45:307–315.

    Shinozaki, K., M. Ohme, M. Tanaka et al. (23 co-authors). 1986. The complete nucleotide sequence of the tobacco chloroplast genome: its gene organisation and expression. EMBO J. 5:2043–2049.

    Soltis, D. E., V. A. Albert, V. Savolainen et al. (11 co-authors). 2004. Genome-scale data, angiosperm relationships, and "ending incongruence": a cautionary tale in phylogenetics. Trends Plant Sci. 9:477–483.

    Staden, R., K. F. Beal, and J. K. Bonfield. 2000. The Staden package, 1998. Methods Mol. Biol. 132:115–130.

    Stefanovic, S., D. W. Rice, and J. D. Palmer. 2004. Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol. Biol. 4:35.

    Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

    Tamura, M. N., J. Yamashita, S. Fuse, and M. Haraguchi. 2004. Molecular phylogeny of monocotyledons inferred from combined analysis of plastid matK and rbcL gene sequences. J. Plant Res. 117:109–120.

    Wakasugi, T., J. Tsudzuki, S. Ito, K. Nakashima, T. Tsudzuki, and M. Sugiura. 1994. Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc. Natl. Acad. Sci. USA 91:9794–9798.

    Zanis, M. J., D. E. Soltis, P. S. Soltis, S. Mathews, and M. J. Donoghue. 2002. The root of the angiosperms revisited. Proc. Natl. Acad. Sci. USA 99:6848–6853.(Vadim V. Goremykin*, Barb)

http://www.100md.com/html/DirDu/2006/10/17/25/81/35.htm