Molecular Evolution and Phylogenetic Utility of the petD Group II Intron: A Case Study in Basal Angiosperms

http://www.100md.com Molecular Biology and Evolution

     Nees Institute for Biodiversity of Plants, University of Bonn, Bonn, Germany

    Correspondence: E-mail: c.loehne@uni-bonn.de.

    Abstract

    Sequences of spacers and group I introns in plant chloroplast genomes have recently been shown to be very effective in phylogenetic reconstruction at higher taxonomic levels and not only for inferring relationships among species. Group II introns, being more frequent in those genomes than group I introns, may be additional promising markers. Because group II introns are structurally constrained, we assumed that sequences of a group II intron should be alignable across seed plants. We designed universal amplification primers for the petD intron and sequenced this intron in a representative selection of 47 angiosperms and three gymnosperms. Our sampling of taxa is the most representative of major seed plant lineages to date for group II introns. Through differential analysis of structural partitions, we studied patterns of molecular evolution and their contribution to phylogenetic signal. Nonpairing stretches (loops, bulges, and interhelical nucleotides) were considerably more variable in both substitutions and indels than helical elements. Differences among the domains are basically a function of their structural composition. After the exclusion of four mutational hotspots accounting for less than 18% of sequence length, which are located in loops of domains I and IV, all sequences could be aligned unambiguously across seed plants. Microstructural changes predominantly occurred in loop regions and are mostly simple sequence repeats. An indel matrix comprising 241 characters revealed microstructural changes to be of lower homoplasy than are substitutions. In showing Amborella as first branching and providing support for a magnoliid clade through a synapomorphic indel, the petD data set proved effective in testing between alternative hypotheses on the basal nodes of the angiosperm tree. Within angiosperms, group II introns offer phylogenetic signal that is intermediate in information content between that of spacers and group I introns on the one hand and coding sequences on the other.

    Key Words: chloroplast noncoding DNA • group II intron • petD • phylogeny • microstructural changes • basal angiosperms

    Introduction

    Phylogenetic studies based on DNA sequences from all three plant genomes have greatly improved our understanding of the early evolution of angiosperms. Combining three (Soltis et al. 2000), five (Qiu et al. 1999), and six to 11 genes (Zanis et al. 2002) from two or three genomes or up to 17 genes from the chloroplast genome (Graham and Olmstead 2000b) led to the identification of Amborella as the first branch in angiosperms followed by Nymphaeales and Austrobaileyales and of major clades such as the magnoliids consisting of Piperales + Canellales and Magnoliales + Laurales (Qiu et al. 1999; Graham and Olmstead 2000b; Mathews and Donoghue 2000; Zanis et al. 2002). Still, phylogenetic affinities of the monocots, eudicots, Ceratophyllales, and Chloranthales remain unclear. Qiu et al. (1999) argued that the inclusion of even more genes might help to resolve these questions. Recent analyses of whole-chloroplast genome sequences refreshed the discussion on the root of angiosperms by inferring Poaceae as basal (Goremykin et al. 2003). However, given the effort of sequencing numerous genes, the number of taxa that can be included usually is a limiting factor. Finding markers that offer strong historic signal for the amount of sequence generated, therefore, becomes crucial. This signal might be provided not only by substitutions but also by length mutations and genomic rearrangements.

    To infer phylogenetic relationships at deeper levels, rather conserved genes such as rbcL have been sequenced, whereas rapidly evolving noncoding regions served to infer relationships among species and genera. Examples of such rapidly evolving parts in the chloroplast genome are the intergenic spacers between atpB and rbcL, trnT and trnL, trnL and trnF, the group I intron in trnL, and the group II introns in rpl16, rps16, rpoC1, trnV, and ndhA (see Soltis and Soltis [1998] and Kelchner [2000, 2002] for reviews), all of which are located in the large and small single-copy regions. The group II introns in rpl2, rps12, and ndhB are situated in the highly conserved inverted repeat and were sequenced for a set of basal angiosperms by Graham and Olmstead (2000a) and Graham et al. (2000). Introns in particular possess a mosaic of highly conserved core elements that are responsible for their function alternating with sequence stretches that might be more or less freely evolving. Nevertheless, the overall variability of spacers and introns is higher than in most coding DNA. Therefore, it has been assumed that high substitution rates are present, leading to saturation and homoplasy, and that frequent length mutations cause homology assessment (alignment) of noncoding sequences to be difficult or even impossible in data sets covering a broad taxonomic spectrum.

    Recently, Borsch et al. (2003) employed the noncoding parts of the trnT-trnF region, consisting of two spacers and the trnL group I intron, for deep-level phylogenetic analysis of angiosperms. Their study revealed that high length variability in trnT-trnF was confined to mutational hotspots in the intron and that these corresponded to certain stem-loop elements of the proposed secondary structure. Considering length mutations as single evolutionary events involving one or more nucleotides allowed for reliable alignment of trnT-trnF sequences. This finding and others suggest that microstructural changes in chloroplast genomes follow certain patterns and that understanding the nature of these patterns is essential for phylogenetic interpretation of length variability of sequences (Graham et al. 2000; Kelchner 2000). Analysis of the trnT-trnF data revealed a tree for basal angiosperms largely congruent with multigene, multigenome studies (Qiu et al. 1999; Soltis et al. 2000; Zanis et al. 2002) with most nodes gaining high statistical support. Several questions arise concerning the general significance of these findings. Does extreme variability in other noncoding cpDNA regions also correspond to particular structural elements (resulting in mutational hotspots), and can they be confidently aligned similar to trnT-trnF? Are other noncoding regions effective for deep-level phylogenetic studies as well, or is trnT-trnF an exception? Are there differences in the phylogenetic utility of different types of noncoding regions, such as spacers and group I and group II introns?

    Variation in intron sequences is to a large extent correlated with the secondary structure of their RNA, which is essential for the self-splicing function of the intron (Learn et al. 1992). Based on differing RNA folding patterns, organelle introns are classified into group I and II (Michel, Umesono, and Ozeki 1989; Michel and Ferat 1995). Because the trnL intron employed by Borsch et al. (2003) is a group I intron, we searched for an omnipresent and alignable group II intron in the chloroplast genome to compare information content of group I and II introns. We restricted our analyses to the chloroplast genome because it is inherited as a single linkage unit. Consequently, differences between group I and group II introns could be more clearly linked to evolutionary processes operating at structurally and functionally different loci, without having to worry about effects of recombination, hybridization, or lineage sorting (Doyle 1992).

    Structure and function of group II introns have been studied in detail by several authors (Michel, Umesono, and Ozeki 1989; Knoop and Brennicke 1993; Michel and Ferat 1995; Bonen and Vogel 2001; Federova, Mitros, and Pyle 2003). The secondary structure model (Michel, Umesono, and Ozeki 1989; Michel and Ferat 1995) has largely been validated in numerous experiments (see Kelchner [2002] for review). Where group II introns have been used as phylogenetic tools, their presence or absence was found to provide valuable information among land plant lineages (e.g., Qiu et al. 1998; Pruchner et al. 2002). In angiosperms, intron losses in different plant lineages have been reported from several chloroplast genes, such as rps16, rpoC1, and rpl16 (see Kelchner [2002] for review). At the sequence level, sound knowledge of the secondary structure allows recognition of structure-linked mutation patterns and, subsequently, their evaluation in a phylogenetic context (Kelchner 2002). So far, chloroplast group II intron sequences have yielded well-resolved phylogenies at the genus level (e.g., Kelchner and Clark 1997; Asmussen and Chase 2001; Clausing and Renner 2001). The mitochondrial nad5 intron proved useful among ferns and allies (Vangerow, Teerkorn, and Knoop 1999). To examine the broader applicability of group II introns, we generated a data set covering the range of seed plants.

    We selected the intron in petD for study because (1) no losses of this locus have been reported for angiosperms and gymnosperms, (2) initial alignment using sequences of available chloroplast genomes was successful, and (3) highly conserved sequences in the flanking regions are suitable for designing universal primers for amplification. To our knowledge, neither the petD intron nor the petB-petD intergenic spacer had previously been used in phylogenetic studies. The petD gene is part of the psbB operon (Westhoff and Herrmann 1988). The pentacistronic primary transcript of this operon is processed into monocistronic and dicistronic mRNAs, with petB and petD normally staying connected as a dicistronic mRNA (Rock, Barkan, and Taylor 1987; Tanaka et al. 1987; Dixit et al. 1999; Monde, Greene, and Stern 2000).

    The present study has involved characterization of patterns of variability and homoplasy in the petD intron of flowering plants. We discuss the impact of structural (and functional) constraints on substitutions and microstructural changes in petD and test hypotheses concerning the phylogenetic relationships of basal angiosperms. This second aspect of our work has allowed us to evaluate the potential of group II introns as molecular markers for deeper-level phylogenetic problems.

    Materials and Methods

    Taxon Sampling and Plant Material

    Sequences of the petD intron and the petB-petD intergenic spacer were obtained from 47 angiosperms (representing 30 families) and three gymnosperms. The Pinus sequence was taken from GenBank (Tsudzuki et al. 1992). All taxa included in this study are listed in Appendix 1 of Supplementary Material online, along with their respective families, the origin of material, and GenBank accession numbers. To compare the results of this study with the analysis based on trnT-trnF (Borsch et al. 2003), we also carried out analyses of a reduced data set congruent with the 42 taxa in the trnT-trnF study (Chloranthus officinalis, Impatiens noli-tangere, Magnolia sieboldii, Nymphaea micrantha, Nymphaea nouchalii, Piper crocatum, Pseudowintera colorata and Tasmannia insipida were excluded).

    DNA Isolation

    Total genomic DNA was isolated from fresh or silica gel–dried leaf tissue. To gain an optimal quantity of high-quality DNA, a modified CTAB method with triple extractions was used (Borsch et al. 2003). After chloroform extraction, DNA was precipitated with isopropanol, resuspended in TE, and further purified by ammonium acetate and sodium acetate washing steps followed by ethanol precipitation.

    Primer Design

    Universal primers to amplify the petD region (consisting of the petB-petD intergenic spacer, the petD 5' exon, and the petD intron) in seed plants were designed based on the completely sequenced chloroplast genomes of Arabidopsis thaliana (GenBank accession number NC001284), Lotus japonicus (GenBank accession number NC002694), Nicotiana tabacum (GenBank accession number NC001879), Pinus thunbergii (GenBank accession number NC001631), and Spinacia oleracea (GenBank accession number NC001631). The petD region was amplified in one fragment with the forward primers PIpetB1411F (5'-GCCGTMTTTATGTTAATGC-3') or PIpetB1365F (5'-TTGACYCGTTTTTATAGTTTAC-3') that anneal to the 3' exon of petB and the reverse primer PIpetD738R (5'-AATTTAGCYCTTAATACAGG-3') that anneals to the 3' exon of petD (fig. 1). PIpetD346R (5'-TCTTCCTYAGATCCC-3') was designed as an additional internal sequencing primer located in domain I of the petD intron because electropherograms were not readable after homonucleotide strings in the petB-petD spacer of Aristolochia and Ginkgo.
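
    The degenerate positions in these primers (M = A or C, Y = C or T under the standard IUPAC nucleotide code) mean that each primer name stands for a small pool of oligonucleotides. As a minimal illustrative sketch, not part of the original protocol, the following Python snippet expands the primer sequences quoted above into their unambiguous variants. The function name expand_degenerate is hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes (the four bases plus common ambiguity codes).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "N": "ACGT",
}

# Primer sequences (5' to 3') exactly as quoted in the text.
PRIMERS = {
    "PIpetB1411F": "GCCGTMTTTATGTTAATGC",
    "PIpetB1365F": "TTGACYCGTTTTTATAGTTTAC",
    "PIpetD738R": "AATTTAGCYCTTAATACAGG",
    "PIpetD346R": "TCTTCCTYAGATCCC",
}


def expand_degenerate(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for name, sequence in PRIMERS.items():
        variants = expand_degenerate(sequence)
        print(name, len(sequence), "nt,", len(variants), "variants")
```

    Each of the four primers contains a single twofold-degenerate position, so each expands to two sequences.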

    FIG. 1.— Schematic of the psbB operon and the region studied in basal angiosperms and gymnosperms. Coding stretches are marked by black boxes; noncoding regions are gray. (a) psbB operon: the psbN gene is not part of this operon but is located on the counterstrand between psbH and psbT. (b) Enlarged section of the psbB operon including the petB-petD intergenic spacer, the petD 5' exon (8 bp: ATGGGAGT), and the petD intron. Minimum and maximum size of spacers and intron among the taxa sequenced are indicated below the bar. Arrows mark positions of primers.

    Amplification and Sequencing

    PCR was performed in a T3 Thermocycler (Biometra, Göttingen, Germany) with initial denaturation (1.5 min at 96°C), 34 cycles of denaturation (0.5 min at 95°C), annealing (1 min at 50°C), and extension (1.5 min at 72°C), and a final extension step (20 min at 72°C). Reaction mixtures (total 50 μl) contained 4 μl DNA template (1:10), 26.7 μl H2O, 5 μl Taq-buffer S (Peqlab, Erlangen, Germany [including 15 mM MgCl2]), 2 μl of each primer, 10 μl dNTPs (each 1.25 mM), and 1.5 units of Taq-DNA-polymerase (Peqlab). PCR products were purified using the QIAquick gel extraction kit (QIAGEN, Hilden, Germany) and labeled using ABI Prism BigDye Terminator Cycle Sequencing Ready Reaction Kits version 1.0 or 1.1 (Applied Biosystems, Foster City, Calif.) in 30 cycles of denaturation (5 s at 96°C), annealing (15 s at 50°C), and elongation (4 min at 50°C). Extension products were run on ABI Prism 310 and 373XL automated sequencers. Sequences were edited manually with EditView version 1.0.1. For single-base length mutations, positions were checked again against the original electropherograms after alignment to ensure correct reads. Gene-spacer and intron-exon boundaries are based on annotations in completely sequenced chloroplast genomes (Tsudzuki et al. 1992; Schmitz-Linneweber et al. 2001).

    Sequence Alignment

    Noncoding regions are characterized by relatively high numbers of length mutations in addition to substitutions. Levinson and Gutman (1987) suggested slipped strand mispairing (SSM) as a mechanism that generates length mutations. SSM might be the underlying process for simple sequence repeats, whereas hairpin structures have been shown to favor inversions (Kelchner and Wendel 1996). For correct primary homology assessment (De Pinna 1991) the molecular processes leading to microstructural changes, first pointed out by Gu and Li (1995), have to be considered. Unfortunately, our understanding of the exact mechanisms leading to microstructural changes is still poor, but the resulting sequence motifs can be observed. Although in recent years, there has been considerable progress in automated multiple-sequence alignment methods, ranging from early work on global Needleman-Wunsch scoring criteria to more recent local segment–based methods (e.g., Dialign2 [Morgenstern 1999]), alignment programs may nevertheless still fail to correctly align sequences containing repeats or inversions. In the present study, such motifs were identified by eye and aligned manually using QuickAlign version 1.5.6 (Müller and Müller 2003). Rules for alignment were proposed on the basis of inferred mechanisms of sequence evolution and the similarity-based criteria for homology assessment suggested by Golenberg et al. (1993), Kelchner and Clark (1997), Hoot and Douglas (1998), Graham et al. (2000), Kelchner (2000), Simmons and Ochoterena (2000) and Borsch et al. (2003). We also suggest additional principles to those described by Borsch et al. (2003). The necessity for these became apparent after considering sequence variability patterns in the petD data set. The principles for alignment follow and are illustrated in figure 2. The petD alignment can be obtained from the corresponding author.

    FIG. 2.— Illustration of the alignment principles applied in this study. (a) Gap placement (principle 2). Primary homology assessment would allow alternative positions of the gap (a and b), but according to principle 2 alternative (b) was given priority because it reflects a possible simple sequence repeat (SSR; indel 241, positions 1792 to 1795). (b) Multiple repeats (principle 2) were recognized as independent events. Repeat motifs are marked by dashed (1st indel) or full lines (2nd indel) (indels 190 and 191, positions 1505 to 1515). (c) Entire indels (principle 4): If sequence composition adjacent to an entire indel was not sufficient to determine the position of the indel, gaps were placed arbitrarily. If such gaps were shared by two or more sequences, they were placed in the same column. In this example, one of alternative (b) or (c) would be chosen arbitrarily (indel 77, positions 564 to 568). (d) Substitution in indels (principle 5): If substitutions in either repeat or template occurred, they were replaced by ambiguity codes (indel 64, positions 424 to 431). (e) Inversions (principle 8) were reverse complemented (indel 41, positions 282 to 285).

    (1) For gap insertion, microstructural changes were assumed to be less likely than substitutions (exceptions are microsatellites; see principle 3). Attention was, therefore, given to both the potentially inserted sequence and its neighboring nucleotides. Additional gaps were inserted only if they prevented the inclusion of more than two substitutions among closely adjacent nucleotides.

    (2) For gap placement, positional homology of sequence motifs was given priority when alternative gap placements were possible (fig. 2a). Thus, simple sequence repeats were accounted for as internal paralogs. Based on the sequence variability found in this data set, we considered multiple repeats as multiple events (fig. 2b). A prerequisite, of course, is that primary sequences are sufficiently complex to allow unambiguous motif recognition.

    (3) Homonucleotide strings strictly involving only one kind of nucleotide (microsatellites) occurred in different positions. Formally, homonucleotide strings can be considered as stepwise indels (overlapping indels) and can be aligned according to a parsimony principle of individual steps as outlined in principle 6. However, the probability of reversal or parallel length mutations may be quite high, depending on the number and size of repeat units. Borsch et al. (2003) preferred not to align such microsatellites, because prevalence of single nucleotides hinders motif recognition. Microsatellites were frequent in this petD data set, and they involved no substitutions that could lead to false phylogenetic signal. Thus, for practical reasons, we treated them as overlapping indels.

    (4) Entire indels, that is, indels of the same positional extension occurring in several taxa, were aligned in the same column(s). In cases where sequence composition adjacent to an entire indel was not sufficient to restrict the placement of this indel to only a single position, we followed the suggestion of Simmons and Ochoterena (2000) and placed the gap in the same position (same column) in all sequences, because it would be most parsimonious to assume a single event in all taxa (fig. 2c).

    (5) Substitutions in indels occurring only in one copy, template or repeat, were excluded from phylogenetic analysis by introduction of ambiguity codes (fig. 2d).

    (6) In the case of overlapping indels, a parsimony principle was employed to arrange gaps in a way that globally requires the least number of length mutational events.

    (7) Regions of uncertain homology were excluded from analysis (hotspots sensu Borsch et al. [2003]).

    (8) Inversions in the petB-petD spacer were reverse complemented in the alignment whenever recognized (fig. 2e). Left unchanged, this would result in substitutions giving false signal (Kelchner 2000; Quandt, Müller, and Huttunen 2003).
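
    The effect of principle 8 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the procedure or software used in this study: the sequences, coordinates, and function names (restore_inversion, count_mismatches) are hypothetical. It shows that a short inversion left in place appears as several substitutions, whereas reverse complementing the inverted segment removes them.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def restore_inversion(seq, start, end):
    """Reverse-complement seq[start:end] (0-based, end exclusive)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def count_mismatches(a, b):
    """Count differing positions between two equally long sequences."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical example: the taxon carries an inversion of the segment
# CCAT (positions 4 to 7, 0-based) relative to the reference.
reference = "TTGACCATGGA"
inverted_taxon = "TTGAATGGGGA"

before = count_mismatches(reference, inverted_taxon)
after = count_mismatches(reference, restore_inversion(inverted_taxon, 4, 8))
print(before, after)
```

    In this toy case a single inversion event would otherwise be counted as four independent substitutions.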

    Coding of Length Mutational Events

    Several workers recently have developed methods to code information from length mutational events and to utilize them in phylogeny reconstruction (e.g., Graham and Olmstead 2000b; Graham et al. 2000; Kelchner 2000; Simmons and Ochoterena 2000). Simmons and Ochoterena (2000) proposed two kinds of formalized coding strategies, both recognizing gaps as characters. We basically used the simple indel-coding approach, thereby assigning character state 1 when the sequence was present in the respective taxon and character state 0 if there was a gap. However, we found patterns of microstructural changes present in our data set that apparently were not covered by the existing simple indel-coding principles. These limitations stem from strictly focusing on gaps as characters, rather than considering any microstructural change as a character. The following additional principles were, thus, employed:

    (1) Inversions were coded as a single binary character (1 = present, 0 = lacking [fig. 5]).

    FIG. 5.— Illustration of observed inversions. (a) Inversion in Cabomba, surrounded by a 7-bp inverted repeat (indel 41, positions 282 to 285). (b) Inversion in Impatiens, surrounded by a 10-bp inverted repeat (indel 55, positions 361 to 483). Indel number 53, a simple sequence repeat (AACT) shared by all taxa that belong to the eumagnoliid clade. Substitutions within this indel element occur in Aristolochia (AACC), Myristica (AGCT), and Tasmannia (AATT). The most-parsimonious explanation for the sequence of Saruma is a subsequent loss of 3 nt (indel number 54) affecting 1 nt of indel number 53. Indel number 68 is not present in the sequences shown here.

    (2) Independent adjacent gaps need to be distinguished as different characters. As a consequence of alignment principle 2 (recognition of sequence motifs has priority), situations of adjacent but different indels occur (see indels 64/65 and 66/67 in figure 5b). Such situations seem to be frequent in noncoding cpDNA and probably in intron sequences in general, where structural conditions facilitate the insertion of short repeats. These SSRs appear in the same DNA region but are independent events involving different motifs. Therefore, gaps may appear adjacent, although they do not result from a single event and, thus, cannot be interpreted as one character. This can be illustrated by repeat events that result in palindromic motifs, in which there is no way to distinguish which of the nucleotides are the template and which are the repeat. The columns could therefore be placed freely on either side of the motif, making the gaps adjacent or nonadjacent, respectively. Adding a rule to the alignment principles that governs the placement of columns does not remedy the problem, because SSRs might be frequent, leading to adjacent gaps regardless of their placement.

    (3) Multiple repeats within a given sequence were coded as separate indels. This procedure is a consequence of extended alignment principle 2.

    (4) Length mutations within homonucleotide strings were not coded. For microsatellites, increased rates of length mutational events may lead to high levels of homoplasy. This condition becomes readily apparent when including sequences from a broad spectrum of plants. Lutzoni et al. (2000) suggested using multiple states for length-variable homonucleotide strands. We omitted these indels in a more conservative approach because current understanding of the evolutionary processes involved at these sites is still very limited.
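
    The basic binary coding described above (state 1 where sequence is present, state 0 where there is a gap) can be sketched as follows. This is a simplified illustration only: the toy alignment and the function names are hypothetical, the handling of a gap nested inside a longer gap as unknown ("?") is an assumption in the spirit of simple indel coding, and the additional principles 1 to 4 (inversions, adjacent gaps, multiple repeats, homonucleotide strings) are not implemented.

```python
import re


def gap_intervals(seq):
    """Return (start, end) for every run of '-' in an aligned sequence."""
    return [match.span() for match in re.finditer(r"-+", seq)]


def simple_indel_coding(alignment):
    """Code every distinct gap in the alignment as one binary character.

    '0': the taxon shows exactly this gap.
    '1': the taxon has sequence here (this gap is absent).
    '?': the region lies inside a longer gap, so the state is unknown.
    """
    gaps = {name: gap_intervals(seq) for name, seq in alignment.items()}
    characters = sorted({gap for own in gaps.values() for gap in own})
    matrix = {}
    for name, own in gaps.items():
        row = []
        for start, end in characters:
            if (start, end) in own:
                row.append("0")
            elif any(s <= start and end <= e for s, e in own):
                row.append("?")
            else:
                row.append("1")
        matrix[name] = "".join(row)
    return characters, matrix


# Hypothetical toy alignment (not taken from the petD data set).
toy = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACG--CGTAC",
    "taxonC": "ACG--CG-AC",
    "taxonD": "AC-----TAC",
}
characters, matrix = simple_indel_coding(toy)
for name, row in matrix.items():
    print(name, row)
```

    Each column of the resulting matrix corresponds to one gap defined by its start and end position, so gaps shared by several taxa are scored as a single character.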

    Secondary Structure

    The large-scale study on group II introns by Michel, Umesono, and Ozeki (1989) has provided a secondary structure model that is nowadays widely accepted. For the purpose of this study, the calculation of secondary structure appeared to be unnecessary, because group II intron core structures are highly conserved, and visual examination of sequences allowed the recognition of reverse-complement regions and the demarcation of domain boundaries and structural elements. This approach was facilitated by the petD intron sequences of maize, tobacco, spinach, and the liverwort Marchantia polymorpha already examined by Michel, Umesono, and Ozeki (1989). Classification of elements such as stems, loops, bulges, and interhelical sequences followed Vawter and Brown (1993) and the modifications by Kelchner (2002). Stems are helices formed by complementarily pairing nucleotides (including G-U wobble-pairs where they were not terminating a helix). Single-stranded nucleotide stretches terminating a helix are termed loops, whereas unpaired nucleotides within stems are bulges. Interhelical sequences are those single-stranded nucleotides connecting helices of adjacent domains and subdomains.
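
    The stem definition given above (complementary pairing, with G-U wobble pairs accepted except where they would terminate a helix) can be expressed as a small check. The sketch below is illustrative only; stems in this study were identified visually, and the function name is_stem and the example RNA strands are hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def is_stem(five_prime, three_prime):
    """Check whether two RNA strands (both written 5' to 3') form a helix.

    Watson-Crick pairs are accepted everywhere; G-U wobble pairs are
    accepted only at internal positions, not at either end of the helix.
    """
    if not five_prime or len(five_prime) != len(three_prime):
        return False
    # The strands pair antiparallel, so read the 3' strand backwards.
    pairs = list(zip(five_prime, reversed(three_prime)))
    last = len(pairs) - 1
    for index, pair in enumerate(pairs):
        if pair in WATSON_CRICK:
            continue
        if pair in WOBBLE and index not in (0, last):
            continue
        return False
    return True


print(is_stem("GGCAU", "AUGUC"))  # internal G-U wobble
print(is_stem("GGCAU", "AUGUU"))  # G-U wobble at the helix end
print(is_stem("GGCAU", "AUGAC"))  # G-A mismatch
```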

    Phylogenetic Analysis

    For phylogeny reconstruction, the following data partitions were analyzed: intron sequences alone (= intron matrix), intron and spacer sequences combined (= intron + spacer matrix), all indels alone (= indel matrix), intron sequences plus respective indels (= intron + indel matrix), and intron and spacer sequences plus all indels (= combined matrix). Furthermore, all characters of the intron matrix that were assigned to stems and nonpairing elements were analyzed as separate partitions. All characters were equally weighted, and gaps were treated as missing characters. Before combining individual matrices, incongruence-length difference tests were performed in 1,000 random addition replicates using PAUP* version 4.0b10 (Swofford 2002). Parsimony analysis (MP) with PAUP* 4.0b10 employed heuristic searches with 1,000 replicates of random addition and tree bisection and reconnection (TBR) branch swapping. For small matrices, the limit of trees saved was set to 10,000. Measures of support for each node were obtained through bootstrapping (BS) 500 replicates (each with 10 random addition replicates) using PAUP* 4.0b10 and Bremer support (= decay) analysis using PRAP (10 random addition replicates per constraint tree, parsimony ratchet not employed) (Müller 2004).

    Bayesian inference (BI) of the substitution-based matrices (intron matrix and intron + spacer matrix) was performed using MrBayes version 2.01 (Huelsenbeck and Ronquist 2001). Following the Akaike Information Criterion in Modeltest version 3.06 (Posada and Crandall 1998), a GTR+I+G model of molecular evolution was implemented. We conducted four runs of Metropolis-coupled Markov Chain Monte Carlo analysis, each with four chains and saving one tree every 100 generations for 1,000,000 generations, starting with a random tree. The temperature for heating was set at 0.2. After 50,000 generations in the first two runs and 70,000 generations in the third and fourth runs, likelihood scores appeared to be stationary. Consequently, the burn-in for each run was set at the respective generation, sampling only the trees obtained thereafter. GC content and transition:transversion (ti:tv) ratios were calculated using MEGA version 2.1 (Kumar et al. 2001). Indel and substitution characters were optimized on one of the shortest trees found in the combined data set using Winclada version 1.00.08 (Nixon 2002), assuming accelerated transformation (ACCTRAN).
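
    The sampling scheme above implies a simple piece of arithmetic for the number of trees retained after burn-in. The following sketch works it through under the assumption that the starting tree is not counted as a sample (whether it is depends on the program and version); the function name retained_samples is hypothetical.

```python
def retained_samples(generations, sample_every, burn_in_generations):
    """Number of sampled trees kept after discarding the burn-in.

    Assumes one tree is saved every `sample_every` generations and that
    the starting tree itself is not counted.
    """
    total = generations // sample_every
    discarded = burn_in_generations // sample_every
    return total - discarded


# Values quoted in the text: 1,000,000 generations, one tree every 100
# generations, burn-in of 50,000 (runs 1 and 2) or 70,000 (runs 3 and 4).
burn_ins = [50_000, 50_000, 70_000, 70_000]
per_run = [retained_samples(1_000_000, 100, b) for b in burn_ins]
print(per_run, sum(per_run))
```

    Under these assumptions each run contributes 9,500 or 9,300 post-burn-in trees.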

    Results

    Overall Sequence Variability

    The petD region, including the petB-petD intergenic spacer, the petD 5' exon, and the petD intron, was found to be present in all angiosperms and gymnosperms studied. The overall length ranges from 842 to 979 bp in the taxa sampled, with the spacer accounting for 167 to 228 bp, the exon accounting for 8 bp, and the intron accounting for 634 to 784 bp. The spacer is shortest in the three genera of Austrobaileyales (173 bp, because of a 24-bp deletion) and longest in Araucaria (215 bp). The intron is shortest in Nymphaeales (approximately 643 bp, because of a 76-bp deletion in domain IV) and longest in Impatiens (785 bp). High numbers of length mutations lead to 1,821 characters in the overall sequence alignment (spacer 508, 5' exon 8, intron 1,304), and the mean sequence length is 927 nt. Stretches of high sequence variability are confined to four distinct mutational hotspots in the intron comprising 13% to 18% of its total length. These hotspots were mainly caused by strongly deviating sequences of the gymnosperms. The spacer could be aligned and analyzed as a whole without excluding any hotspots. After exclusion of hotspots, total length of the aligned sequences was 1,582 characters. At the 3' terminus of the petB gene, no length variability was observed, and all stop codons shared homologous positions.

    Microstructural Changes

    A total of 241 length mutations (Appendix 2 in Supplementary Material online) were coded in a binary matrix, 73 of which belong to the spacer and 168 of which belong to the intron. Most indels were found to be simple sequence repeats (SSRs; 58% in the spacer and 46% in the intron). All other indels are insertions of unknown origin or deletions. Within repeats, substitutional differences were rare; in 33% of the SSRs, repeat and template differ by one or, rarely, more substitutions. Indel length varies between 1 and 76 nt. Indels longer than 10 nt were relatively rare (only 7% of all indels). Among the shorter indels, most were SSRs, with single-base insertions or deletions representing the most frequent size class (25%), followed by 4-nt and 5-nt indels (16% and 17%, respectively). As mentioned above, the highest number of indels was observed in domains I and IV, whereas in domain V and domain VI, only one length mutational event could be detected (table 1). Concerning structural partitions, indels were most frequent in loops and less frequent in stems (table 2). Actually, only three indels were found in stem regions, all of them being single-base length mutations. Two inversions could be recognized in the spacer (fig. 5), a 4-bp to 6-bp inversion in Cabomba (accurate length cannot be detected because of a palindromic motif) and a 33-nt inversion in Impatiens. In both cases, the inversions were flanked by short (6 bp) inverted-repeat stretches. No inversions were detected in the intron.

    Table 1 Characteristics of petB-petD Spacer and petD Intron Sequences in 47 Angiosperms and Three Gymnospermous Outgroup Taxa

    Table 2 Characteristics of Structural Partitions of the petD Intron Sequences

    Secondary Structure

    The majority (89%) of all characters could be assigned to one of the four structural classes (stems, loops, bulges, or interhelical sequences). Ambiguity in structure assignment was limited to some areas of domain I, which is very complex and, by far, the largest domain, spanning more than half of the intron sequence length (table 1). Domain I contains 33% of all indels in the intron. Domain II is quite simple and comprises a 7-bp stem that is terminated by a loop of 41 to 62 nt. The 9-bp stem of domain III is interrupted by 2-nt to 3-nt bulges and terminated by an 18-nt to 34-nt loop. The second largest domain of the petD intron is domain IV, consisting of a short stem (5 bp) and a terminal loop varying from 72 to 197 nt in this data set. This loop also contains 44% of all indels in the intron. Domains V and VI are the shortest domains. Whereas domain V is built of a long stem of 14 bp (interrupted by a 2-nt bulge) and a small terminal loop of 3 or 4 nt (AAA in Araucaria and Pinus, AAAA in all other taxa), domain VI consists of a 5-bp stem and a 19-nt to 23-nt loop. The branch point A in domain VI, which plays an essential role in the transesterification reaction during the self-splicing process, is present in all taxa. The number of characters assigned to loop elements is highest in domain IV and lowest in domain V (fig. 4). On average, 51% of the nucleotides of the intron were assigned to loop regions (corresponding to 67% of all characters in the matrix), and 25% were assigned to stems (corresponding to 15% of all characters). Nucleotides assigned to bulges and interhelical sequences amount to 11% and 5%, respectively (table 2). All four hotspots turned out to correspond to loop regions of the intron (fig. 3).

    FIG. 4.— Structure and size of the six group II intron domains in petD, illustrated as number of characters per structural element (stems, loops, bulges, and interhelical sequences) in each domain. Circles indicate numbers of length mutations that have been coded in a binary matrix.

    FIG. 3.— Secondary structure model of group II introns (subgroup B1) modified after Michel, Umesono, and Ozeki (1989). Visual identification of reverse complement stretches (stem regions) in the petD data set was based on this model. Positions of the four mutational hotspots in petD intron sequences are illustrated by gray stripes (HS1 positions 641 to 689, HS2 positions 862 to 974, HS3 positions 1433 to 1494, HS4 positions 1679 to 1688). Mean sequence lengths (and standard deviation [SD]) are 19 in HS1 (SD = 6), 46 in HS2 (SD = 7), 28 in HS3 (SD = 13), and 5 in HS4 (SD = 2).

    Variable and Informative Characters, GC Content, and ti:tv Ratio

    Within the petD data set, the percentage of variable and potentially parsimony informative characters differs between spacer and intron. In the intron, 38% of the characters are variable and 26% are informative, whereas in the spacer, only 30% of the characters are variable and 23% are informative. However, if the amount of variability is calculated using average actual sequence lengths (corrected values according to Borsch et al. [2003]), all values are considerably higher (table 1). Nucleotide frequencies and ti:tv ratio are also different between intron and spacer (table 1). GC content is 39% (SD = 1.2) in the intron, 30% in the spacer (SD = 1.8), and 37% (SD = 1.0) in the whole data set. The ti:tv ratio is higher in the intron (2.8, SD = 0.5) and lower in the spacer (1.9, SD = 0.6), whereas ti:tv in the whole-sequence data set is 2.3 (SD = 0.3).
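
    For readers who want to see what the two statistics measure, GC content and a raw transition:transversion count can be sketched as follows. This excerpt does not say how the reported means and SDs were estimated, so the code below is an assumption-laden illustration: the function names and the toy sequences are hypothetical, and the pairwise count applies no correction for multiple hits.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (gaps and N are ignored)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def count_ti_tv(seq1: str, seq2: str) -> tuple:
    """Count transitions and transversions between two aligned sequences.

    Positions with a gap or an ambiguous base in either sequence are skipped.
    """
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ti += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    return ti, tv


# Toy aligned sequences, invented for illustration.
gc = gc_content("ATGC-N")
ti, tv = count_ti_tv("AACCGT-A", "GATCTTAA")
ratio = ti / tv if tv else float("inf")
```

    Averaging such ratios over all sequence pairs in a partition would give a figure comparable in kind, though not necessarily in value, to those listed in table 1.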

    Considering the intron secondary structure, differences in percentage of variable and informative characters as well as GC content and ti:tv ratio between the six domains (table 1) and between structural components (table 2) become evident. Domain IV has the highest percentage of variable and informative characters but lowest GC content. In contrast, domains V and VI, both containing relatively small loops, are characterized by low percentage of variable and informative characters but high GC content. Generally, stems have much higher GC contents (49%, SD = 1.0) than do loops (31%, SD = 1.8 [table 2]). Stems are also characterized by higher ti:tv ratios (4.1, SD = 1.0) than are loop stretches (2.5, SD = 0.9), but ratios are even higher in bulges (4.6, SD = 3.3) and interhelical sequences (4.7, SD = 6.4).

    Trees Obtained from Individual and Combined Data Partitions

    Incongruence-length difference tests (substitutions versus indels and spacer versus intron) indicated that data partitions are not significantly incongruent (P values ranging from 0.09 to 1.0) and, therefore, were combined for phylogenetic analysis. Table 3 gives an overview of the trees obtained from parsimony analyses. Analysis of the intron matrix (first column of table 3) revealed 114 shortest trees with a CI of 0.564 and a RC of 0.365. By combining intron and spacer sequences (second column) the number of trees was reduced to 32, but CI and RC increased only slightly. Most-parsimonious (MP) analysis of the indel matrix revealed a lower homoplasy (CI = 0.898, RC = 0.796 [Appendix 3 in Supplementary Material online]). In the combined analyses of indels and substitutions, six or two MP trees were recovered (fig. 6; table 3, columns 4 and 5). CI and RC values were considerably higher than in analyses of substitutions alone (columns 1 and 2) but lower than in the indel matrix (column 3). The stem partition comprises only 192 characters (nonpairing = 800) and resolves only a few clades, such as Nymphaeaceae and Piperaceae (Appendix 5 in Supplementary Material online).
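
    The homoplasy measures quoted above follow the standard definitions: the consistency index is CI = m/s, the retention index is RI = (g - s)/(g - m), and the rescaled consistency index is RC = CI x RI, where m is the minimum possible number of steps, s the number of steps observed on the tree, and g the maximum possible number of steps. The sketch below merely restates these formulas. The function names and the step counts in the example are hypothetical; only the CI and RC values passed to `implied_retention_index` are taken from the text.

```python
def consistency_index(min_steps: int, observed_steps: int) -> float:
    """CI = m / s; equals 1.0 when there is no homoplasy."""
    return min_steps / observed_steps


def retention_index(min_steps: int, max_steps: int, observed_steps: int) -> float:
    """RI = (g - s) / (g - m)."""
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)


def rescaled_consistency_index(min_steps: int, max_steps: int,
                               observed_steps: int) -> float:
    """RC = CI * RI."""
    return (consistency_index(min_steps, observed_steps)
            * retention_index(min_steps, max_steps, observed_steps))


def implied_retention_index(ci: float, rc: float) -> float:
    """Recover RI from reported CI and RC values (RI = RC / CI)."""
    return rc / ci


# Hypothetical step counts, for illustration only.
ci = consistency_index(10, 15)
ri = retention_index(10, 30, 15)
rc = rescaled_consistency_index(10, 30, 15)

# RI implied by the values reported for the intron and indel matrices.
ri_intron = implied_retention_index(0.564, 0.365)
ri_indels = implied_retention_index(0.898, 0.796)
```

    Under these definitions the reported values imply an RI of roughly 0.65 for the intron substitutions and roughly 0.89 for the indel matrix, in line with the lower homoplasy of indels described in the text.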

    Table 3 Results of Phylogenetic Analyses Using Different Approaches and Different Data Sets

    FIG. 6.— One of two most-parsimonious trees obtained from MP analysis of the combined matrix. Bootstrap values (if higher than 50) and decay values are given above the branches. The arrow indicates the node that is collapsing in the strict consensus. Black boxes indicate synapomorphic indel characters. White boxes indicate homoplastic indels. Uninformative indels are not shown, but number of autapomorphies (if present) is given behind taxon names.

    Table 3 summarizes details on topology and support of the trees recovered in different analyses. It becomes evident that major clades such as eudicots and monocots are resolved by all data partitions, but other clades are only resolved by indels (e.g., the magnoliids) or by substitutions (e.g., Magnoliales and Piperales), respectively. The strict consensus tree of the combined data set (column 5) is the best-resolved tree (see also figure 6). Furthermore, nodes in MP trees inferred from the combined matrix gained highest BS and decay values, whereas support was generally lower in trees based on substitutions alone. Bayesian inference of spacer and intron sequence data resulted in a total of 37,600 trees sampled. The consensus is fully resolved but shows high posterior probabilities ( 95%), largely for terminal nodes at the family or order level but also for monocots and eudicots (fig. 7). A similar trend can be observed for BS values in MP analyses (fig. 7).

    FIG. 7.— A 50% majority-rule consensus of 37,603 trees obtained from four runs of Bayesian analysis implementing the GTR+G+I model. Posterior probabilities are given above the branches. Note that the consensus is fully resolved, but some branches are too short to be visualized; for example, there seems to be a trichotomy in Nymphaeales, although Nuphar is placed sister to Nymphaea+Victoria in the consensus.

    BI of the intron + spacer matrix and MP of the combined matrix revealed largely congruent topologies. First branching is Amborella, followed by Nymphaeales (Cabombaceae and Nymphaeaceae) (figs. 6 and 7). Within Nymphaeales, there is a clearly supported core Nymphaeaceae (Nymphaea, Victoria) whereas Nuphar appears in a clade either with Cabombaceae (MP) or core Nymphaeaceae (BI, unsupported). Austrobaileyales are third with BI but resolved sister to eumagnoliids with MP. However, the latter topology does not receive significant statistical support. Monocots appear with either Acorus (MP) or Nypa (BI) branching first. BI indicates affinities of Ceratophyllum and Chloranthaceae to monocots, but their positions are not resolved with MP. Magnoliids are generally resolved and even possess a synapomorphic SSR (fig. 6). Support for Magnoliales, Laurales, Canellales, and Piperales is particularly high. Within eudicots, Dicentra (Ranunculales) comes first, followed by Platanus+Nelumbo, Buxus, Trochodendron, and terminal core eudicots (Aextoxicon, Impatiens). Branches leading to Piper, to Nymphaeales, and, among Magnoliales, to Annonaceae (fig. 7) are particularly long. The same nodes are supported by high numbers of microstructural changes (fig. 6).

    Discussion

    Sequence Variability Within the petD Region

    Sequences of the petD intron and the petB-petD spacer are not as variable in their length as the intron and spacers of the trnT-trnF region (Borsch et al. 2003). Standard deviation (SD) of sequence length is not higher than 5% of the average sequence length in petD, but 10% of sequence length in the trnL intron, 32% in the trnT-trnL, and 14% in the trnT-trnF spacer. Most of the highly length-variable sequence stretches in trnT-trnF are located in hotspots. Thus, length deviation was significantly reduced by the exclusion of hotspots. In contrast, exclusion of hotspots in the petD data set did not decrease SD, indicating a more even distribution of microstructural changes. Nevertheless, the matrix of the aligned petD intron and spacer sequences was, with 1,821 characters, almost twice as long as the mean sequence length, indicating that length mutations play a major role in the evolution of both intron and spacer sequences. For a meaningful comparison of sequence variability between petD and other data sets, the increased character numbers (caused by length mutations) as compared with the nucleotide numbers have to be considered. As a better approximation, percentage of variable or potentially parsimony informative characters in relation to mean sequence length instead of character number has been suggested (Borsch et al. 2003). These were also calculated here. The petD intron contains a similar number of variable (65%) and potentially informative (24%) characters as the trnL intron (63% variable and 19% informative). The petB-petD spacer is less variable (76%) than the trnT-trnL (83%) and the trnL-trnF spacer (98%), but the percentage of informative characters is not decreased (21% in petB-petD spacer versus 21% and 18%, respectively). Obviously, the petB-petD spacer is relatively conserved when it is compared with the spacers of the trnT-trnF region. This might be explained by its functional role in the expression of the gene cluster.
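
    The difference between the two ways of expressing variability can be made concrete with a short calculation. The counts of 406 variable characters and 1,055 intron characters are those given elsewhere in this paper; the mean sequence length used below is a purely hypothetical placeholder, because the actual value (table 1) is not reproduced in this text. The function name is likewise an assumption.

```python
def percent_variable(variable_characters: int, denominator: float) -> float:
    """Variable characters as a percentage of the chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * variable_characters / denominator


# Matrix-based value: variable characters relative to aligned characters.
matrix_based = percent_variable(406, 1055)

# Length-based value in the sense of Borsch et al. (2003): relative to the
# mean unaligned sequence length. 700 nt is a HYPOTHETICAL placeholder, not
# a figure from the paper.
HYPOTHETICAL_MEAN_LENGTH = 700
length_based = percent_variable(406, HYPOTHETICAL_MEAN_LENGTH)
```

    Because gaps inflate the number of aligned characters, the mean sequence length is always smaller than the matrix length, so the length-based percentage comes out higher than the matrix-based one whatever the true mean length is.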

    Kind and Distribution of Length Mutations

    Analysis of length mutations in the petD data set allowed identification and coding of 219 indels. The resulting indel matrix is one of the largest ever compiled (Simmons, Ochoterena, and Carr 2001), allowing a thorough analysis of frequency, size distribution, and kind of length mutations. Most indels (57%) in the petD region are simple sequence repeats. This is congruent with the findings of Graham et al. (2000) in the chloroplast inverted repeat and Borsch et al. (2003) in the trnT-trnF region. Among the SSRs, most indels result from single-base events or are 4 to 5 bp long. Indels of 2 to 3 bp or more than 6 bp are considerably less frequent, as also has been observed with other noncoding cp DNA regions (Graham et al. 2000; Borsch et al. 2003).

    Structural Effects on Sequence Evolution in the Group II Intron

    Group II introns can be subdivided into six domains, with domain I being the largest and most complex, whereas domains II to VI are simple stem-loop structures (Michel, Umesono, and Ozeki 1989; Michel and Ferat 1995). Categorization of these components for group II introns makes it possible to compare sequence evolution of different structural partitions and allows analysis of the contribution of different structural elements to phylogeny reconstructions. Similar to the rpl16 intron in Myoporaceae (Kelchner 2002), almost 90% of sequence could be unequivocally assigned to elements such as stems, loops, bulges, and interhelical stretches in the petD intron. The comparison of petD intron sequences across basal angiosperms shows stem regions to contain fewer variable sites and also fewer potentially informative characters than loops and other nonpairing stretches. The same pattern is found in the rpl16 intron in Myoporaceae (Kelchner 2002), although distances are generally smaller in this family level data set. Thus, the group II intron core structure seems to be a governing factor for mutational dynamics, regardless of the gene in which the intron is inserted.

    Within the petD intron, GC content and ti:tv ratios are higher in stems and lower in loops and, corresponding to that, GC content and ti:tv ratios are higher in domains that contain large proportions of stem regions (domains I, V, and VI) and lower in domains consisting mainly of loop stretches (table 1). GC-rich stems have a ti:tv of 4.12 (SD = 1.0), whereas AT-rich loops have a ti:tv of 2.52 (SD = 0.9 [table 2]). It appears that ti:tv ratios are a function of the respective GC content, in line with observations of Bakker et al. (2000) on trnL-F sequences. This finding is contrary to prevailing thinking that high ti:tv ratios depend on saturation through multiple transitions (Hillis, Allard, and Miyamoto 1993), thereby reflecting their evolutionary distance. Maintenance of secondary structure in GC-rich stem regions through compensatory mutations might further favor transitions (Rousset, Pélandakis, and Solignac 1991; Bakker et al. 2000). Obviously, stems evolve under high functional constraints because they are essential for the secondary structure and splicing function of the group II intron. Therefore, stems are less variable regarding nucleotide substitutions as well as length mutations, in line with findings in the mitochondrial rps3 intron (Laroche and Bousquet 1999). In fact, only three indels have been observed in stems, all of which are 1 bp. Indels longer than 1 bp only occur in nonpairing sequence stretches. In contrast, nonpairing DNA is more AT rich and bears more variable and potentially informative characters, as well as length mutations.

    The six intron domains are characterized not only by a typical secondary structure but also by their specific function in the splicing reaction (Dib-Hajj et al. 1993; Costa, Michel, and Westhof 2000). Domain VI, for instance, is one of the most important structures in the group II intron. Functional constraints on the evolution of this particular domain are supposed to be high. In fact, the frequency of length mutations in domain VI is considerably lower than in domain I, although both have almost the same amount of loop stretches relative to their total sequence length. Differences in domain conservation imply differences in phylogenetic utility. Less-conserved regions (e.g., domains II to IV and loops in general) should provide information for the terminal nodes, whereas more-conserved regions (domains V and VI and stems in general) possibly provide information for basal nodes. Because findings on different molecular evolution in domains of the petD intron corroborate findings made in rpl16 (Kelchner 2002), these considerations might be valid for group II introns in general. Parts of the less-conserved domain Ic and d2 loops, as well as of the domain IV loop of the petD intron, were so variable that they could not be aligned unambiguously across seed plants. Similar to the trnL intron (Borsch et al. 2003), hotspots are strictly confined to loop stretches and do not comprise more than 18% of the whole-intron sequence length (20% in trnL). Thus, they do not impair the general utility of that kind of noncoding cpDNA for phylogenetic reconstruction on higher taxonomic levels.

    Molecular Evolution of the Spacer

    The petB-petD intergenic spacer is less variable than the trnT-trnL and trnL-trnF spacers. Because the whole petB-petD region persists as a dicistronic mRNA after transcription, it can be assumed that RNA secondary structure is important for the evolution of the petB-petD spacer. Moreover, this spacer is considered to play an important role in translation of the petD gene, by containing a sequence motif that allows ribosomes to detect the petD initiation codon (Monde, Greene, and Stern 2000). However, the secondary structure of the petB-petD intergenic spacer has not been analyzed, and information on sites relevant for translation of the petD gene is not yet available. The two inversions observed in Cabomba and Impatiens are associated with short inverted repeats, indicating the presence of stem-loop structures. In other taxa, the regions enclosed by these inverted repeats show numerous length mutations as well as substitutions (see figure 5b).

    Phylogenetic Signal of petD Sequence Data

    Differential analysis of data partitions and the respective tree statistics provide evidence for the high potential of microstructural changes as phylogenetic characters. Overall, indel characters are considerably less homoplastic than substitutions in this petD data set (table 3, columns 2 and 3). This finding becomes evident when comparing CI and RC values of the indel tree (Appendix 3 in Supplementary Material online) with the trees inferred from substitutions only (indels: CI = 0.895, RC = 0.789; substitutions: CI = 0.566, RC = 0.368). The resolution of the indel tree is even more striking, given that only 59 out of 168 indels in the intron are parsimony informative (35%). The spacer yields 21 parsimony-informative indels out of 73. Most of the parsimony-informative indels (71%) are synapomorphies (CI, RI = 1), whereas 23 out of 79 informative indels were reconstructed to have originated two or more times independently (empty boxes in figure 6). Homoplastic indels in the intron are generally located in loop regions. In the spacer, only indel 59 is homoplastic, which is part of a possible loop between inverted-repeat stretches (fig. 5b). This confirms the theoretical expectation that structural constraints have effects on the frequency of length mutations.
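
    What makes an indel character parsimony informative can be stated precisely: in a binary matrix, a character is informative only if both states occur in at least two taxa each. The sketch below applies that rule to a toy matrix. The taxon labels, the matrix, and the function names are hypothetical, and treating "?" and "-" as missing data is an assumption about coding conventions rather than a description of the matrix in Appendix 2.

```python
from collections import Counter


def is_parsimony_informative(column) -> bool:
    """True if at least two states each occur in two or more taxa.

    Cells other than '0' and '1' (e.g. '?' or '-') are treated as missing.
    """
    counts = Counter(state for state in column if state in ("0", "1"))
    return sum(n >= 2 for n in counts.values()) >= 2


def informative_flags(matrix: dict) -> list:
    """Evaluate every character (column) of a taxon -> string matrix."""
    rows = list(matrix.values())
    return [is_parsimony_informative(col) for col in zip(*rows)]


# Toy binary indel matrix with invented taxa and characters.
toy_matrix = {
    "taxon_A": "110?",
    "taxon_B": "1001",
    "taxon_C": "0001",
    "taxon_D": "0010",
    "taxon_E": "0000",
}
flags = informative_flags(toy_matrix)
share_informative = 100.0 * sum(flags) / len(flags)

# The proportion reported for the intron: 59 informative out of 168 indels.
intron_share = 100.0 * 59 / 168
```

    In the toy matrix the second and third characters are autapomorphies (a single taxon carries the derived state) and are therefore uninformative, which corresponds to the uninformative indels that are omitted from figure 6.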

    Indels are relatively rare at the deepest nodes within angiosperms. Nevertheless, indel number 53, a simple sequence repeat of 4 nt, is synapomorphic for the magnoliid clade (comprising Magnoliales, Laurales, Canellales, and Piperales [fig. 5b]). The magnoliids are one of the major recently recovered angiosperm clades. This clade has also been revealed by substitution-based trees of this and of other noncoding (Borsch et al. 2003) and coding data sets (e.g., Qiu et al. 1999, Hilu et al. 2003) and now is clearly substantiated by indel information. Compared with substitutions, indels may be regarded as independent evidence because microstructural changes result from different mutational processes.

    Although petD sequences reveal Amborella and Nymphaeales as first branching angiosperms, they provide no statistical support for the basal grade (figs. 6 and 7). Short branch lengths (phylogram in figure 7) indicate a possible lack of mutations that were fixed during early divergence of angiosperm lineages. Similarly, there is no support from the indels for these nodes (Appendix 3 in Supplementary Material online and figure 6). Plotting substitutions on the tree (Appendix 4 in Supplementary Material online) allowed assessment of homoplasy at particular sites and an assignment of individual characters to structural elements of the group II intron. In total, the backbone of angiosperms (nodes from Amborella to eudicots) is supported by only 28 substitutions among the intron characters, averaging four to five per node. All of these substitutions occur in nonpairing elements (21 in loops, five in bulges, and two in interhelical elements). Synapomorphic states supporting angiosperms above Amborella are in character 1509 and, for angiosperms above Nymphaeales, in character 1014 (exon binding site 1). A substitution in character 1564 is synapomorphic for magnoliids. All other characters variable at the basal nodes are homoplastic, often exhibiting repeated substitutions or reversals within eudicots and monocots. Thus, short branches at the base of the angiosperm tree are predominantly caused by a lack of mutations, which may be explained by either a rapid radiation of these lineages or substitutional rates increasing through time. In addition, the respective variable characters are likely to be homoplastic because they are mostly located in rather freely evolving loops of the intron. Microstructural changes also have accumulated in unpaired elements, whereas stems are structurally conserved and show almost no length variability across angiosperms. The rates of length mutations present in unpaired elements of petD obviously have not led to noticeable homoplasy. 

    The above-mentioned structural conservation of stems extends to their conservation in primary sequence. Across the angiosperm tree, substitutions are very few in the stem partition of petD (Appendix 5b in Supplementary Material online). Contrary to the expectation that stem elements would provide no information for terminal nodes, some terminals are resolved in the stem partition tree, such as Nymphaeales, the genus Piper, or the Annonaceae. These lineages accumulate both indels and substitutions (see Appendix 5 in Supplementary Material online and figure 6), which points to a lineage-specific acceleration of mutational rates. Higher variability, even in stem elements, may be caused by possibly relaxed constraints on helical parts of the petD intron in these lineages. Accumulation of length mutations may be a trend in the chloroplast genome of Piperaceae and the Nymphaeales because this accumulation was also observed in the trnT-trnF region (Borsch et al. 2003).

    Substitutional patterns in stem regions are often biased by compensatory mutations because maintenance of secondary structure is essential for the self-splicing mechanism. Because single events, thus, lead to double substitutions, signal from respective substitutions might be overweighted in phylogenetic analysis. Therefore, several authors argued for not including such characters in phylogenetic analyses. In the petD intron, 192 out of 1,055 characters belong to stem regions, but out of these characters, only 64 are variable (in contrast to 406 variable characters in the whole intron). Thus, a maximum of 16% of all substitutions might be compensatory mutations. This percentage is higher than in the trnL intron (7% [Borsch et al. 2003]) but still considerably lower than in 18S rDNA (73% [Soltis et al. 1999]). Compensatory mutations, therefore, probably play a minor role in sequence evolution of the petD intron.
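
    The upper bound given above follows from simple proportions, spelled out here as a worked example. All four counts are taken from the paragraph itself; the function name is hypothetical. Note that the calculation counts variable characters rather than individual substitution events, so it is a rough bound rather than an exact estimate.

```python
def share(part: int, whole: int) -> float:
    """Percentage that 'part' represents of 'whole'."""
    return 100.0 * part / whole


stem_characters = 192        # stem characters in the intron matrix
intron_characters = 1055     # all characters in the intron matrix
variable_in_stems = 64       # variable stem characters
variable_in_intron = 406     # variable characters in the whole intron

# Proportion of the matrix that lies in stems (about 18%).
stem_share = share(stem_characters, intron_characters)

# If every variable stem character were compensatory, this would be the
# largest possible compensatory fraction (about 16%).
max_compensatory = share(variable_in_stems, variable_in_intron)
```

    Stems thus make up about 18% of the characters but at most about 16% of the variable ones, which is the basis for regarding compensatory change as a minor factor here.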

    This study has demonstrated that the partitioned analysis of structural elements (Appendix 5 in Supplementary Material online) is a valuable tool for unraveling effects of mutational dynamics on phylogenetic signal. However, the present petD data set only contains 192 stem characters, 45 of which are potentially parsimony informative across seed plants. Thus, stem partitions will not have the potential to fully resolve parts of the angiosperm tree, simply because there is insufficient information. Additional group II intron data sets may well provide further insight.

    Phylogeny of Basal Angiosperms

    The trees inferred from the petD region are consistent with those based on multigene, multigenome data sets (Qiu et al. 1999; Soltis et al. 2000; Zanis et al. 2002) and with trees obtained from other rapidly evolving genomic regions such as trnT-trnF (Borsch et al. 2003) and matK (Hilu et al. 2003). The petD data further substantiate Amborella as sister to all other angiosperms, although a contrary hypothesis on the root of angiosperms has recently been proposed (Goremykin et al. 2003).

    Within Nymphaeales, three clades are clearly supported: (1) Cabombaceae, (2) the genus Nuphar, and (3) a clade consisting of Nymphaea + Victoria. However, the monophyly of Nymphaeaceae as comprising Nuphar, Nymphaea, and Victoria is not substantiated by the petD data set. Actually, Nuphar is found either sister to Cabombaceae (MP [fig. 6]) or first branching in Nymphaeaceae (BI [fig. 7]). Further studies are needed to clarify relationships, because resolution and support among the three major Nymphaeales lineages are also low in other studies (e.g., Borsch et al. 2003, Soltis et al. 2000, Qiu et al. 1999, Zanis et al. 2002), or monophyly of Nymphaeaceae s.str. was assumed a priori (Les et al. 1999).

    The magnoliids consisting of Laurales + Magnoliales and Piperales + Canellales are consistently resolved in petD analyses. This underscores the utility of rapidly evolving regions, including petD (Borsch et al. 2003; Hilu et al. 2003). Contrary to trees based on single, slowly evolving genes, magnoliids were otherwise only inferred by multigene analyses combining all three genomic compartments (Qiu et al. 1999). Within magnoliids, results of this petD data set are particularly relevant because BI and MP infer Piperales and Canellales as sister groups. In all previous analyses, there has been medium or only low statistical support for this sister group, and BI of matK sequences even found Piperales first branching within magnoliids (Hilu et al. 2003). Thus, molecular evidence seems to converge upon close relationships between Piperales and Canellales, contrary to phylogenetic analyses of morphological characters (Doyle and Endress 2000). Within Magnoliales, the petD data set provides high support and a topology that mirrors exactly the conclusions drawn in a recent analysis by Sauquet et al. (2003) using molecules and morphology. The respective positions of Ceratophyllum and Chloranthaceae, as well as monocots and eudicots, could not be clarified here because of the small number of characters in petD (half of the trnT-trnF data set [Borsch et al. 2003]); this clarification will have to await future combined analyses. Within eudicots, the topology is fully congruent with data sets containing large numbers of taxa (Soltis et al. 2000; Hilu et al. 2003), suggesting that petD will provide valuable information for further resolving eudicot relationships.

    Conclusion

    Group II introns such as in petD can be aligned across seed plants. Similar to observations made with the trnT-trnF region, high length variability is confined to mutational hotspots located in certain loops in their secondary structure. Overall, the petD intron, representing a structurally conserved group II intron with a large proportion of helical elements, is less variable than group I introns. This petD intron data set shows that mutational dynamics in helical elements clearly differ from those in unpaired stretches. The combination of sequences from several group II introns will possibly serve as a model for contrasting molecular evolution and phylogenetic signal in different data partitions. Group II introns seem to reflect a mosaic pattern of very conserved stems and rather freely evolving unpaired elements. Mutational patterns and rates in petD loops, bulges, and interhelical stretches result in phylogenetic signal that resolves major clades of angiosperms such as magnoliids and eudicots as well as relationships within them. Spacers and group I introns may offer a wider spectrum of differently evolving sites in comparison to the rather uniform partitions in group II introns. Thus, group I introns may be expected to provide support for a greater number of angiosperm clades than do group II introns. This suggestion might explain the better performance of the trnT-trnF region for resolving an identical set of 42 basal angiosperm taxa (Borsch et al. 2003). This earlier study and the observations and analyses that we report here demonstrate the potential of microstructural changes as phylogenetic markers. Indel information corroborates signal inferred from substitutions and leads to increased resolution and support of the recovered trees. Considering that levels of homoplasy are even lower compared with substitutions, a high indel frequency in introns and spacers underscores their effectiveness in phylogenetic reconstruction.

    Supplementary Material

    A list of all taxa included in this study with their respective families, the origin of material, and GenBank accession numbers is provided in Appendix 1. Appendix 2 contains detailed descriptions of 241 length mutations coded in the binary indel matrix. The strict consensus tree obtained from the indel matrix, with the indel characters mapped on, is added as Appendix 3. The strict consensus obtained from substitutions only is shown in Appendix 4, complemented by one of the most-parsimonious trees with details on the characters at the basal nodes of this tree (parts B and C). Appendix 5 contains consensus trees obtained from the partitioned analysis of stems and nonpairing elements of the intron matrix. The complete sequence alignment of the petD region (including the petB-petD spacer) is available from the corresponding author upon request.

    Acknowledgements

    This paper serves in part for the fulfillment of a Ph.D. dissertation of C.L., which is carried out in the framework of the Graduiertenkolleg "Evolution und Biodiversität in Raum und Zeit" and funded by the Deutsche Forschungsgemeinschaft. We thank W. Barthlott, T. Litt, and K.P. Sauer (all Bonn) for continuous support during this program. We are grateful to Volker Knoop (Bonn) and Scot Kelchner (Canberra) for valuable comments on earlier drafts of this manuscript; to Kai Müller, Dietmar Quandt, Rolf Blöcher (all Bonn), Khidir W. Hilu (Blacksburg), and Helga Ochoterena (Mexico City) for critical discussions; and to Peter Lockhart and two anonymous reviewers, who improved the manuscript by their constructive comments.

    References

    Asmussen, C. B., and M. W. Chase. 2001. Coding and noncoding plastid DNA in palm systematics. Am. J. Bot. 88:1103–1117.

    Bakker, F., A. Culham, R. Gomez-Martinez, J. Carvalho, J. Compton, R. Dawtrey, and M. Gibby. 2000. Patterns of nucleotide substitution in angiosperm cpDNA trnL (UAA)-trnF (GAA) regions. Mol. Biol. Evol. 17:1146–1155.

    Bonen, L., and J. Vogel. 2001. The ins and outs of group II introns. Trends Genet. 17:322–331.

    Borsch, T., K. W. Hilu, D. Quandt, V. Wilde, C. Neinhuis, and W. Barthlott. 2003. Non-coding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. J. Evol. Biol. 16:558–576.

    Clausing, G., and S. S. Renner. 2001. Molecular phylogenetics of Melastomataceae and Memecylaceae: implications for character evolution. Am. J. Bot. 88:486–498.

    Costa, M., F. Michel, and E. Westhof. 2000. A three-dimensional perspective on exon binding by a group II self-splicing intron. EMBO J. 19:5007–5018.

    De Pinna, M. C. C. 1991. Concepts and tests of homology in the cladistic paradigm. Cladistics 7:367–394.

    Dib-Hajj, S. D., S. C. Boulanger, S. K. Hebbar, C. L. Peebles, J. S. Franzen, and P. S. Perlman. 1993. Domain 5 interacts with domain 6 and influences the second transesterification reaction of group II intron self-splicing. Nucleic Acids Res. 21:1797–1804.

    Dixit, R., P. K. Trivedi, P. Nath, and R. V. Sane. 1999. Organization and post-transcriptional processing of the psbB operon from chloroplasts of Populus deltoides. Curr. Genet. 36:165–172.

    Doyle, J. A., and P. K. Endress. 2000. Morphological phylogenetic analysis of basal angiosperms: comparison and combination with molecular data. Int. J. Plant Sci. 161:S121–S153.

    Doyle, J. J. 1992. Gene trees and species trees: molecular systematics as one-character taxonomy. Syst. Bot. 17:144–163.

    Fedorova, O., T. Mitros, and A. M. Pyle. 2003. Domains 2 and 3 interact to form critical elements of the group II intron active site. J. Mol. Biol. 330:197–209.

    Golenberg, E. M., M. T. Clegg, M. L. Durbin, J. Doebley, and D. P. Ma. 1993. Evolution of a non-coding region of the chloroplast genome. Mol. Phylogenet. Evol. 2:52–64.

    Goremykin, V. V., K. I. Hirsch-Ernst, S. Wolfl, and F. H. Hellwig. 2003. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499–1505.

    Graham, S. W., and R. G. Olmstead. 2000a. Evolutionary significance of an unusual chloroplast DNA inversion found in two basal angiosperm lineages. Curr. Genet. 37:183–188.

    ———. 2000b. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87:1712–1730.

    Graham, S. W., P. A. Reeves, A. C. E. Burns, and R. G. Olmstead. 2000. Microstructural changes in noncoding chloroplast DNA: interpretation, evolution, and utility of indels and inversions in basal angiosperm phylogenetic inference. Int. J. Plant Sci. 161:S83–S96.

    Gu, X., and W. H. Li. 1995. The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol. 40:464–473.

    Hillis, D. M., M. W. Allard, and M. M. Miyamoto. 1993. Analysis of DNA sequence data: phylogenetic inference. Methods Enzymol. 224:456–490.

    Hilu, K. W., T. Borsch, K. Müller et al. (16 co-authors). 2003. Angiosperm phylogeny based on matK sequence information. Am. J. Bot. 90:1758–1776.

    Hoot, S. B., and A. W. Douglas. 1998. Phylogeny of the Proteaceae based on atpB and atpB-rbcL intergenic spacer region sequences. Aust. Syst. Bot. 11:301–320.

    Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.

    Kelchner, S. A. 2000. The evolution of non-coding chloroplast DNA and its application in plant systematics. Ann. Mo. Bot. Gard. 87:482–498.

    ———. 2002. Group II introns as phylogenetic tools: structure, function, and evolutionary constraints. Am. J. Bot. 89:1651–1669.

    Kelchner, S. A., and L. G. Clark. 1997. Molecular evolution and phylogenetic utility of the chloroplast rpl16 intron in Chusquea and the Bambusoideae (Poaceae). Mol. Phylogenet. Evol. 8:385–397.

    Kelchner, S. A., and J. F. Wendel. 1996. Hairpins create minute inversions in non-coding regions of chloroplast DNA. Curr. Genet. 30:259–262.

    Knoop, V., and A. Brennicke. 1993. Group II introns in plant mitochondria—Trans-splicing, RNA editing, evolution and promiscuity. Pp. 221–232 in Brennicke, A. and U. Kück, eds. Plant mitochondria. VCH Verlag, Weinheim, Germany.

    Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA 2.1: molecular evolutionary genetics analysis software. Arizona State University, Tempe, Arizona.

    Laroche, J., and J. Bousquet. 1999. Evolution of the mitochondrial rps3 intron in perennial and annual angiosperms and homology to nad5 intron 1. Mol. Biol. Evol. 16:441–452.

    Learn, G. H. J., J. S. Shore, G. R. Furnier, G. Zurawski, and M. T. Clegg. 1992. Constraints on the evolution of plastid introns: the Group II intron in the gene encoding tRNA-Val (UAC). Mol. Biol. Evol. 9:856–871.

    Les, D. H., E. L. Schneider, D. J. Padgett, P. S. Soltis, D. E. Soltis, and M. Zanis. 1999. Phylogeny, classification and floral evolution of water lilies (Nymphaeaceae; Nymphaeales): a synthesis of non-molecular, rbcL, matK, and 18S rDNA data. Syst. Bot. 24:28–46.

    Levinson, G., and G. Gutman. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4:203–221.

    Lutzoni, F., P. Wagner, V. Reeb, and S. Zoller. 2000. Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst. Biol. 49:628–651.

    Mathews, S., and M. J. Donoghue. 2000. Basal angiosperm phylogeny inferred from duplicate phytochromes A and C. Int. J. Plant Sci. 161:S41–S55.

    Michel, F., and J.-L. Ferat. 1995. Structure and activities of group II introns. Ann. Rev. Biochem. 64:435–461.

    Michel, F., K. Umesono, and H. Ozeki. 1989. Comparative and functional anatomy of group II catalytic introns—a review. Gene 82:5–30.

    Monde, R.-A., J. C. Greene, and D. B. Stern. 2000. Disruption of the petB-petD intergenic region on tobacco chloroplasts affects petD RNA accumulation and translation. Mol. Gen. Genet. 263:610–618.

    Morgenstern, B. 1999. Dialign 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211–218.

    Müller, J., and K. Müller. 2003. QuickAlign: A new alignment editor. Plant Mol. Biol. Reporter 21:5.

    Müller, K. 2004. PRAP – computation of Bremer support for large data sets. Mol. Phylogenet. Evol. 31:780–782.

    Nixon, K. C. 2002. WinClada. Version 1.00.08. Published by the author, Ithaca, NY.

    Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818.

    Pruchner, D., S. Beckert, H. Muhle, and V. Knoop. 2002. Divergent intron conservation in the mitochondrial nad2 gene: signatures for the three bryophyte classes (mosses, liverworts, and hornworts) and the lycophytes. J. Mol. Evol. 55:265–271.

    Qiu, Y.-L., Y. Cho, J. C. Cox, and J. D. Palmer. 1998. The gain of three mitochondrial introns identifies liverworts as the earliest land plants. Nature 394:671–674.

    Qiu, Y.-L., J. L. Lee, F. Bernasconi-Quadroni, D. E. Soltis, P. S. Soltis, M. Zanis, E. A. Zimmer, Z. Chen, V. Savolainen, and M. W. Chase. 1999. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404–407.

    Quandt, D., K. Müller, and S. Huttunen. 2003. Characterisation of the chloroplast DNA psbT-H region and the influence of dyad symmetrical elements on phylogenetic reconstructions. Plant Biol. 5:400–410.

    Rock, C. D., A. Barkan, and W. C. Taylor. 1987. The maize plastid psbB-psbF-petB-petD gene cluster: spliced and unspliced petB and petD RNAs encode alternative products. Curr. Genet. 12:69–77.

    Rousset, F., M. Pélandakis, and M. Solignac. 1991. Evolution of compensatory substitutions through G-U intermediate state in Drosophila rRNA. Proc. Natl. Acad. Sci. USA 88:10032–10036.

    Sauquet, H., J. A. Doyle, T. Scharaschkin, T. Borsch, K. W. Hilu, L. W. Chatrou, and A. Le Thomas. 2003. Phylogenetic analysis of Magnoliales and Myristicaceae based on multiple data sets: implications for character evolution. Bot. J. Linnean Soc. 142:125–186.

    Schmitz-Linneweber, C., R. M. Maier, J. P. Alcaraz, A. Cottet, R. G. Herrmann, and R. Mache. 2001. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization. Plant Mol. Biol. 45:307–315.

    Simmons, M. P., and H. Ochoterena. 2000. Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 49:369–381.

    Simmons, M. P., H. Ochoterena, and T. G. Carr. 2001. Incorporation, relative homoplasy, and effect of gap characters in sequence-based phylogenetic analyses. Syst. Biol. 50:454–462.

    Soltis, D. E., and P. S. Soltis. 1998. Choosing an approach and an appropriate gene for phylogenetic analysis. Pp. 1–42 in D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds. Molecular systematics of plants II: DNA sequencing. Kluwer Academic Publishers, Boston.

    Soltis, D. E., P. S. Soltis, M. W. Chase et al. (16 co-authors). 2000. Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot. J. Linnean Soc. 133:381–461.

    Soltis, P. S., D. E. Soltis, P. G. Wolf, D. L. Nickrent, S.-M. Chaw, and R. L. Chapman. 1999. The phylogeny of land plants inferred from 18S rDNA sequences: pushing the limits of rDNA signal? Mol. Biol. Evol. 16:1774–1784.

    Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4.0b10. Sinauer Associates, Sunderland, Mass.

    Tanaka, M., J. Obokata, J. Chunwongse, J. Shinozaki, and M. Sugiura. 1987. Rapid splicing and stepwise processing of a transcript from the psbB operon in tobacco chloroplasts: determination of the intron sites in petB and petD. Mol. Gen. Genet. 209:427–431.

    Tsudzuki, J., K. Nakashima, T. Tsudzuki, J. Hiratsuka, M. Shibata, T. Wakasugi, and M. Sugiura. 1992. Chloroplast DNA of black pine retains a residual inverted repeat lacking rRNA genes: nucleotide sequences of trnQ, trnK, psbA, trnI and trnH and the absence of rps16. Mol. Gen. Genet. 232:206–214.

    Vangerow, S., T. Teerkorn, and V. Knoop. 1999. Phylogenetic information in the mitochondrial nad5 gene of pteridophytes: RNA editing and intron sequences. Plant Biol. 1:235–243.

    Vawter, L., and W. M. Brown. 1993. Rates and patterns of base change in the small subunit ribosomal RNA gene. Genetics 134:597–608.

    Westhoff, P., and R. G. Herrmann. 1988. Complex RNA maturation in the chloroplast: the psbB operon from spinach. Eur. J. Biochem. 171:551–564.

    Zanis, M., D. E. Soltis, P. S. Soltis, S. Mathews, and M. J. Donoghue. 2002. The root of the angiosperms revisited. Proc. Natl. Acad. Sci. USA 99:6848–6853.

    (Cornelia Löhne and Thomas)