Covarion Shifts Cause a Long-Branch Attraction Artifact That Unites Microsporidia and Archaebacteria in EF-1 Phylogenies
http://www.100md.com
分子生物学进展 2004年第7期
* Program in Evolutionary Biology, Canadian Institute for Advanced Research and Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
Department of Bioscience, Nagahama Institute of Bioscience and Technology, 1266 Tamura, Nagahama, Shiga 526-0829, Japan
Department ofMathematics and Statistics and Genome Atlantic, Dalhousie University, Halifax, Nova Scotia, Canada
Departmentof Botany, University of British Columbia, Vancouver, British Columbia, Canada
E-mail: yinagai@dal.ca.
Abstract
Microsporidia branch at the base of eukaryotic phylogenies inferred from translation elongation factor 1 (EF-1) sequences. Because these parasitic eukaryotes are fungi (or close relatives of fungi), it is widely accepted that fast-evolving microsporidian sequences are artifactually "attracted" to the long branch leading to the archaebacterial (outgroup) sequences ("long-branch attraction," or "LBA"). However, no previous studies have explicitly determined the reason(s) why the artifactual allegiance of microsporidia and archaebacteria ("M + A") is recovered by all phylogenetic methods, including maximum likelihood, a method that is supposed to be resistant to classical LBA. Here we show that the M + A affinity can be attributed to those alignment sites associated with large differences in evolutionary site rates between the eukaryotic and archaebacterial subtrees. Therefore, failure to model the significant evolutionary rate distribution differences (covarion shifts) between the ingroup and outgroup sequences is apparently responsible for the artifactual basal position of microsporidia in phylogenetic analyses of EF-1 sequences. Currently, no evolutionary model that accounts for discrete changes in the site rate distribution on particular branches is available for either protein or nucleotide level phylogenetic analysis, so the same artifacts may affect many other "deep" phylogenies. Furthermore, given the relative similarity of the site rate patterns of microsporidian and archaebacterial EF-1 proteins ("parallel site rate variation"), we suggest that the microsporidian orthologs may have lost some eukaryotic EF-1–specific nontranslational functions, exemplifying the extreme degree of reduction in this parasitic lineage.
Key Words: phylogenetic artifact ? long-branch attraction ? EF-1 ? microsporidia ? covarion model ? parallel site rate variation
Introduction
Microsporidia are a monophyletic assemblage of obligate intracellular parasites that generally infect animals (particularly arthropods and fish). These unicellular eukaryotes lack classical mitochondria, peroxisomes and Golgi apparatus at the ultrastructural level (reviewed in Keeling and Fast [2002]). At the molecular level, these organisms appear to possess unusual ribosomes and ribosomal RNA (rRNA). The sedimentation coefficient of microsporidian ribosomes is more similar to that of prokaryotic (i.e., eubacterial or archaebacterial) 70S ribosomes than to that of typical eukaryotic 80S ribosomes (Ishihara and Hayashi 1968). Fusion of the large subunit (LSU) and 5.8S rRNAs is another trait shared between microsporidia and prokaryotes (Vossbrinck and Woese 1986). Furthermore, microsporidia were robustly estimated to be one of the basal lineages among eukaryotes in the first molecular phylogenies that include microsporidia, those of small subunit (SSU) rRNA sequences (Vossbrinck et al. 1987). This basal position was recovered in later SSU phylogenies regardless of the tree reconstruction method, model of nucleotide substitution, or taxonomic sampling (e.g., Cavalier-Smith and Chao [1996] and Kumar and Rzhetsky [1996]). Importantly, similar results were reproduced by early phylogenetic studies based on the amino acid sequences of translation elongation factors 1 and 2 (EF-1 and EF-2) (Kamaishi et al. 1996a, 1996b). Taken together, these results appeared to suggest that microsporidia were one of the earliest offshoots in eukaryotic evolution, diverging from the main trunk of eukaryotic lineages before the mitochondrial endosymbiosis.
However, in the past decade, this evolutionary scenario has drastically changed. Nuclear genes encoding proteins of mitochondrial origin were found in microsporidia, strongly suggesting the secondary loss of typical mitochondria in this lineage (Germot, Philippe, and Le Guyader 1997; Hirt et al. 1997; Peyretaillade et al. 1998; Arisue et al. 2002). Katinka et al. (2001) identified additional genes for mitochondrial proteins in the completed genome of the microsporidian, Encephalitozoon cuniculi. In line with these findings, Williams et al. (2002) recently ultrastructurally identified tiny, relic mitochondria in the microsporidian Trachipleistophora hominis. Secondly, further analyses of protein sequences (except elongation factors) have indicated the evolutionary affinity between microsporidia and fungi (reviewed in Keeling and Fast [2002]). In particular, relatively strong support for a relationship between microsporidia and fungi ("M + F") is recovered in phylogenies based on tubulins (Edlind et al. 1996; Keeling and Doolittle 1996; Keeling, Luker, and Palmer 2000; Keeling 2003), the largest subunit of RNA polymerase II (Hirt et al. 1999), seryl-tRNA synthetase, transcription initiation factor IIB, a subunit of the vacuolar ATPase and a GTPase (Katinka et al. 2001). Multigene phylogenies that include microsporidia also recover the M + F connection with high support values (Baldauf et al. 2000; Hashimoto, Arisue, and Hasegawa 2002). In sum, results accumulating since the late 1990s strongly suggest that microsporidia are fungi or close relatives of fungi. The basal position of microsporidia in SSU rRNA and elongation factor phylogenies has been rationalized as an artifact resulting from "long-branch attraction" (LBA), whereby the fast-evolving microsporidian and the long branch leading to the archaebacterial sequences were artifactually attracted to one another in tree reconstruction (e.g., Embley and Hirt [1998] and Philippe and Laurent [1998]). Under this hypothesis, the relatively simple architecture of microsporidian cells can be interpreted as the result of extensive adaptation to their parasitic lifestyle.
Unlike unweighted maximum-parsimony (MP), maximum-likelihood (ML) methods are generally expected to be statistically consistent if the Markov model of sequence change closely matches reality (for discussion, see Felsenstein [2003]). Simulation studies concur with this result, indicating that ML methods are less susceptible to LBA with finite data than MP or distance methods (e.g., Hasegawa and Fujiwara [1993] and Swofford et al. [2001]). However, in practice, neither infinite data nor perfectly correct models are available, and ML methods can yield biased estimates of phylogenies. For instance, ML methods display sensitivity to LBA when the evolutionary model is violated by the data, such as failure to model among-site rate variation (ASRV) (e.g., Sullivan and Swofford [2001]). Therefore, microsporidian phylogenies estimated under simple models that assume an equal rate of evolution among sites may be severely affected by LBA. Indeed, the EF-1 ML analyses under a more realistic model that takes into account ASRV failed to statistically reject the M + F relationship, although the basal position of microsporidia was still the ML estimate (Hirt et al. 1999; Hashimoto, Arisue, and Hasegawa 2002). A similar trend was observed in concatenated SSU plus LSU rRNA gene phylogenies (Philippe and Germot 2000). Importantly, the LSU rRNA (rooted) distance phylogeny and EF-2 (unrooted) distance and ML phylogenies under an ASRV model actually recovered the M + F relationship, albeit with weak support values (Hirt et al. 1999; Van de Peer, Ben Ali, and Meyer 2000; Keeling and Fast 2002). These results consistently suggest that the fast-evolving microsporidian sequences incorrectly branch at the base of eukaryotes because of LBA.
ASRV models are thought to be essential for phylogenetic analyses of problematic data sets, including fast-evolving sequences (e.g., Philippe and Laurent [1998]). However, these models are obviously insufficient to describe the complex processes of amino acid and nucleotide evolution. One of the crucial processes ignored in ASRV models is across-tree site rate variation (XTSRV), also called "heterotachy" by some authors (Lopez, Casane, and Philippe 2002). Functional and/or structural constraints on a biological molecule can be changed over evolutionary time, so that the rates at sites may vary across the tree (i.e., among the sequences in an alignment). This evolutionary process was first described as the concomitantly variable codon (covarion) model that allows a site to shift between two states, invariable ("off") or variable ("on"), in a protein over time (Fitch and Markowitz 1970). Explicit formulations of Markov models of covarion evolution have only recently been conceived (Tuffley and Steel 1998; Penny et al. 2001). The original covarion and ASRV models have also been merged into "covarion-like" models for both protein and nucleotide evolution that, rather than having simply invariable and variable sites, allow sites to "switch" between a number of different site rates across subtrees (reviewed in Penny and Hasegawa [2001]). The covarionlike models in Tuffley and Steel (1998), Penny et al. (2001), Galtier (2001), Lopez, Casane and Philippe (2002) and Huelsenbeck (2002) assume XTSRV to be a stochastic process varying across both sites and subtrees with frequent changes in rates throughout the larger tree. Although rates do vary across both sites and subtrees in these models, the process is stationary, implying that the marginal distributions of rates in any two subtrees will be the same. This stationarity of the frequency of changes throughout the larger tree does not provide a natural model for biological molecules subjected to a "covarion shift," where many sites experience a shift in evolutionary rates within a short period of time associated with a modification in functional and/or structural properties (Inagaki et al. 2003). The XTSRV in covarion shift models (e.g., modeled by a bivariate rate distribution in Susko et al. [2002]) should allow different marginal distributions in rates for subtrees on either side of the shift, and the rate shifts should occur with lower frequency in evolutionary time than stochastic covarion drift. Although the two types of covarion-like models are intrinsically different, they are not mutually exclusive and both serve to make some adjustment for covarionlike effects.
In this study, we investigated the impact of XTSRV in the EF-1 data set on the phylogenetic position of microsporidia. We divided the data into the archaebacterial and eukaryotic sub-trees, and identified the sites associated with a large difference in evolutionary rates across the subtrees ("SRN" sites) estimated by the bivariate rate ML method (Susko et al. 2002). The difference in log-likelihood (lnL) between the two tree topologies representing the competing hypotheses of (1) microsporidia as an early diverging eukaryotic lineage versus (2) microsporidia as relatives of fungi then was calculated based on the data sets excluding SRN sites. Similar analyses based on data sets excluding random sites (jackknifing) were then conducted to evaluate the significance of SRN site exclusions. Finally, the data sets with and without SRN sites were analyzed using the ML method.
Philippe and Germot (2000) previously investigated the impact of the "covarion structure" on putative LBA artifacts in a concatenated SSU plus LSU rRNA data set using a similar approach to ours. The authors excluded the "covarion" sites identified by a MP-based method (Lopez, Forterre, and Philippe 1999), and tested whether this procedure lowered the support for putative LBA artifacts such as the basal branching of diplomonads (e.g., Giardia), Trichomonas (a parabasalid), and microsporidia (Philippe and Germot [2000]). However, we have identified two potential deficiencies in these analyses. First, this pioneering work lacks control experiments that would allow the assessment of "covarion" site deletion on support for potential LBA artifacts (i.e., jackknifing of random sites). Another difficulty arises from the tripartition of the data (i.e., archaebacteria, "basal branching," and "crown" eukaryotes) to identify the covarion sites. Therefore, it is unclear which partition of the rRNA data bears the significant rate difference that allegedly produces the LBA artifacts.
By contrast, our results unequivocally illustrate that the failure to model covarion shifts across the archaebacterium-eukaryote split in the data results in an artifactual affinity between the long branches leading to the archaebacterial (outgroup) and microsporidian sequences—a kind of LBA that originates from model misspecification. Furthermore, we detect similarity between the patterns of site rates for the microsporidian and archaebacterial sequences ("parallel site rate variation") relative to eukaryotic homologs, and propose the loss of some eukaryotic EF-1–specific nontranslational functions in the microsporidian lineage to explain this "convergent" property of the two EF-1 subtrees.
Materials and Methods
Data Sets
The experimental details for the isolation of an EF-1 gene from a microsporidian Antonospora locustae (Nosema locustae was renamed to Antonospora locustae; see Slamovits, Williams, and Keeling [2004]) were described in Roger (1996). The novel DNA sequence of Antonospora EF-1 was deposited under DDBJ/EMBL/GenBank accession number AY452734.
We manually aligned the deduced EF-1 amino acid sequences of 22 phylogenetically divergent eukaryotes (19 "short-branch" eukaryotes [SB eukaryotes] and three microsporidia) and 20 archaebacteria. The accession numbers of the sequences considered in this study are listed in the caption of figure 1. To focus on LBA between the microsporidian and archaebacterial EF-1 sequences, other fast-evolving eukaryotic sequences (e.g., those from diplomonads and parabasalids) were not considered in this study. We excluded ambiguously aligned sites and sites that included gaps, creating the "Micro" data set consisting of 42 sequences with 349 sites (table 1). Subsequently, the "Basic" data set was generated from the Micro data by exclusion of the microsporidian sequences for the estimation of conditional mode rates at sites (site rates) and evaluating the rate distribution difference across eukaryotic and archaebacterial subtrees (see below). Smaller data sets, "Micro*" and "Basic*," were derived from the Micro and Basic data sets by exclusion of all archaebacterial sequences except those of Methanosarcina and Aeropyrum, respectively (table 1). These data sets were subjected to computationally intensive ML phylogenetic analyses.
FIG. 1. (A) A histogram of the absolute value of SR calculated between the SB-eukaryotic and archaebacterial subtrees in the Basic data set. Arrows indicate the ranges of top 5% to 40% absolute SR. The absolute value SR range, from which 245 sites were randomly jackknifed to generate the "JK*" data sets, is shaded. (B) The tree topologies examined for the phylogenetic positions of microsporidia. The archaebacterial clade is omitted. The microsporidian clade was attached to the edge A or B, creating the "M + A" or "M + F" tree topology, respectively. Accession numbers for the sequence considered are as follows: Sulfolobus acidocaldaris, X52382; S. solfataricus, X70701; S. tokodaii, BAB65236; Desulfurococcus mobilis, X73582; Aeropyrum pernix, AP000062; Pyrobaculum aerophilum, U94347; uncultured crenarchaeote 74A4, AF393466; Thermococcus celer, X52383; Pyrococcus horikohii, AP000006; P. abysii, AJ248285; P. woesei, X59857; Archaeoglobus fulgidus, AE001039; Methanococcus jannaschii, U67486; M vannielii, X05698; Methanosarcina barkeri, ZP_00079008; Ferroplasma acidarmanus, ZP_000005; Halobacterium sp., AAG20682; H. halobium, D32120; Haloarcula marismotui, X16677; Thermoplasma acidophilum, AL445064; Homo sapiens, BC002845; Caenorhabditis elegans, AAA81688; Schizosaccharomyces pombe, D82571; Podospora anserina, X74799; Arabidopsis thaliana, AB026655; Nicotiana tabacum, D63396; Euglena gracilis, X16890; Trypanosoma brucei, U10562; Acrasis rosea, AF190771; Dictyostelium discoideum, X55972; Mastigamoeba balamuthi, a contig of BE636534, BE636560, BE636596, and BE636646; Entamoeba histolytica, M92073; Porphyra purpurea, U08844; Plasmodium knowlesi, AJ224153; Cryptosporidium parvum, U71180; Stylonychia lemnae, X57926; Blastocystis hominis, AF091355; Phytophthora capsici, AF109666; Cyanophora paradoxa, AF092951; Antonospora (Nosema) locustae, AY452734; Glugea plecoglossi, BAA06865; Encephalitozoon cuniculi, CAD25298
Table 1 Details of the Data Sets Used in This Study.
Site rates of the SB-eukaryotic and archaebacterial subtrees in the Basic data set were estimated by the bivariate ML rate method with the PAM001 amino acid substitution model using Bivar (Susko et al. 2002). In this method, ASRV in the two subtrees was described by a bivariate discrete ML rate distribution with a matrix of 25 x 25 equally spaced rate categories. The upper limit for rates was set as 6.0. The probability of each rate category was estimated from the data.
Absolute values of the difference in bivariate site rates (SR) across the SB-eukaryotic and archaebacterial subtrees were calculated, and then eight sets of sites with top 5% to 40% absolute SR with 5% interval were defined (fig. 1A). Henceforth, the sites that associate with top N% SR across the two subtrees in the Basic data set, as well as their homologous positions in other data sets, are designated as the SRN sites (N = 5, 10, 15, 20, 25, 30, 35, or 40). The SRN sites were excluded from the Micro data set, generating eight "Micro-SRN" data sets. These data sets were subjected to the examination of the impact of SRN sites on the phylogenetic position of microsporidia (see below). Likewise, the "Micro*-SR15," "Basic*-SR15," "Basic-SR15," and "Basic-SR30" data sets were further generated from the Micro*, Basic*, and Basic data sets by exclusion of the SR15 or SR30 sites. The first two data sets were used for ML phylogenetic analyses, and the last two were used for evaluation of the rate distribution difference across the archaebacteria-eukaryote split (see below). The details of the data sets described above are listed in table 1.
Phylogenetic Analyses
All distance analyses in this study were conducted as follows: ML distance matrices were calculated under the JTT amino acid substitution model with ASRV approximated using a gamma () distribution with eight equally probable discrete rate categories (JTT+ model) in Tree-Puzzle version 5.0 (Schmidt et al. 2002). Shape parameters (s) for distributions were estimated from each data set (table 1). Phylogenetic trees were reconstructed from ML distance matrices using the Fitch-Margoliash weighted least squares (Fitch) method implemented in PHYLIP version 3.6a (Felsenstein 1993) with the input sequence order randomized five times and global rearrangements. Distance bootstrap analyses (500 replicates) were conducted using Seqboot, Fitch, Consense in PHYLIP (Felsenstein 1993), Tree-Puzzle (Schmidt et al. 2002), and Puzzleboot version 1.02 (A. J. Roger and M. E. Holder, http://hades.biochem.dal.ca/Rogerlab/Software/Software.html). The parameters and settings for the bootstrap analyses were the same as described above, except a single randomized input sequence order per replicate was performed.
ML trees were reconstructed under the JTT+ model using ProML in PHYLIP (Felsenstein 1993) with global rearrangements and the input sequence order randomized five times. ASRV was modeled using discrete approximation of distributions (eight equally probable categories). All parameters required were estimated from the data using Tree-Puzzle (Schmidt et al. 2002). The parameters and settings for the ML bootstrap analyses (100 replicates) were the same as described above, except (1) a single randomized input sequence order per replicate was performed in the tree search, and (2) ASRV was modeled using discrete distributions with four equally probable rate categories.
Parametric Bootstrap Tests for the Overall Rate Distribution Difference Across Two Subtrees
We have previously established a parametric bootstrap test to examine the rate distribution difference between two subtrees in a combined data set—the overall rate distribution difference is represented by the sum of absolute values of "weighted" SR (henceforth designated as "abrsum") (Susko et al. 2002). The observed value is compared with a null distribution from 1,000 Monte Carlo simulated data sets to evaluate its significance. In this study, the rate distribution difference across the SB-eukaryotic and archaebacterial subtrees was assessed using the Basic, Basic-SR15, and Basic-SR30 data sets (table 1). The details of the parametric bootstrap test are described in Susko et al. (2002).
Comparison of the Two Hypotheses for the Phylogenetic Position of Microsporidia
We generated tree topologies to test two alternative hypotheses for the phylogenetic position of microsporidia: the "M + A" hypothesis (under which microsporidia are one of the earliest diverged lineages among the extant eukaryotes) and the "M + F" hypothesis (under which microsporidia branch with or among fungi). First, a distance phylogenetic tree was reconstructed under the JTT+ model from the Basic data set as described above. The microsporidian clade was then attached to the edge A or B in the optimal distance tree topology (see figure 1B), representing the M + A or M + F hypothesis, respectively. The log-likelihoods (lnLs) of the M + A or M + F trees were calculated from the Micro and Micro-SRN data sets by the ML method under the JTT+ model using Tree-Puzzle (Schmidt et al. 2002). The parameters and branch lengths for each tree were re-estimated from each data set. Difference in the lnLs of the two trees (lnL) was then calculated by subtraction of the former from the latter, assessing the dominance of the M + A topology over the other.
We generated, at each of eight values of N (N = 5, 10, 15, 20, 25, 30, 35, or 40 [see above]), 100 "JK" data sets by randomly deleting (jackknifing) the same number of sites as the SRN sites, using Jacksite (H. Philippe, Université de Montreal). Similarly, we prepared another eight sets of 100 "JK*" data sets as follows. The 245 alignment sites in the Micro data set that correspond neither to the constant sites nor to the SR10 sites in the Basic data set were jackknifed randomly to generate JK* data sets (shaded area in figure 1A). The lnLs between the M + A and M + F trees were calculated from the JK and JK* data sets (null distributions) and compared with the corresponding values obtained from the SRN site exclusion (observed values).
Estimation of Site Rate Pattern of EF-1 Subtrees
We investigated the difference in site rate pattern (1) between the SB-eukaryotic and archaebacterial subtrees, (2) between the SB-eukaryotic and microsporidian subtrees, and (3) between the archaebacterial and microsporidian subtrees. SR between two subtrees and their 95% confidence intervals were estimated using the bivariate ML method (Susko et al. 2002). Because large errors were expected in the estimation of microsporidian site rates as a result of the poor sequence sampling (only three sequences are available for this study), we considered only sites associating with significant SR ("SRS" sites), for which 95 % confidence intervals do not contain zero. The absolute values of these significant SR sites were summed to represent the "overall rate distance" between two subtrees (henceforth designated as "arsum").
Results and Discussion
Robustness of the ML Method Against the Artifactual Microsporidia Plus Archaebacteria Grouping
Microsporidia were estimated to be the earliest diverging lineage among eukaryotes considered (M + A grouping) in the optimal distance trees based on both Micro and Micro* data sets, which differ from each other only in the sampling of archaebacterial sequences (20 versus 2, respectively [table 1]). Likewise, the distance bootstrap analyses on the two data sets display similar results, except the monophyletic clades of the animal sequences and of the euglenozoan sequences (the first two columns in table 2). These results indicate that the selection of archaebacterial sequences has no large impact on the overall relationship among eukaryotic sequences. Thus, the smaller Micro* data set was used for more stringent phylogenetic analyses using the ML method under the JTT+ model.
Table 2 Impact of Phylogenetic Methods and SR15 Site Exclusion on BP Values.
The M + A grouping was recovered in the optimal ML tree based on the Micro* data set (fig. 2A) in accordance with previously published works assessing a microsporidian (Glugea plecoglossi) EF-1 sequence (Kamaishi et al. 1996a, 1996b). The M + A grouping receives relatively high support (BP = 83%) in the distance analysis, whereas the ML analysis gives very weak support (BP = 34%) to the same relationship (fig. 2A). These results clearly indicate that the ML method under the JTT+ model is less susceptible to LBA producing the M + A grouping in the EF-1 data than is the distance method under the same model.
FIG. 2. Phylogenetic analyses of the microsporidian EF-1 sequences. (A) An ML tree from the Micro* data set (349 sites) rooted by the archaebacterial sequences. Only ML BP values over 50% are shown. The BP values for particular nodes are also listed in table 2. (B) An ML tree based on the Micro*-SR15 data set (298 sites) rooted by the archaebacterial sequences. The details were as described in (A)
The ML or distance bootstrap analyses under the JTT+ model fail to resolve deep level relationships among eukaryotic lineages with BPs greater than 50%, except the M + A grouping in the latter analysis (figure 2A and the second and third columns in table 2). Significantly, no other large incongruities between the BP values from the ML method and the corresponding values from the distance method were observed for other major eukaryotic clades recovered, indicating that the ML bootstrap analysis is not overly stringent (the first three columns in table 2).
A Covarion Shift Across the SB-Eukaryotic and Archaebacterial Subtrees
As discussed above, the optimal ML tree reconstructed under the JTT+ model has the artifactual M + A grouping (fig. 2A). Unless significant model misspecification is present, the ML method may be robust against classical LBA from coincidental concurrence of character states between two distantly related fast-evolving lineages. In addition, we detect no significant homoplasy of amino acid characters across the microsporidian and archaebacterial sequences in the EF-1 data set (see Supplementary Material online). Therefore, our phylogenetic analyses under the JTT+ model most probably fails to account for a crucial aspect of the evolutionary process that generated the data.
Our group previously reported a significant rate distribution difference (covarion shift) between the archaebacterial and eukaryotic subtrees in the combined EF-1 tree (Susko et al. 2002; Inagaki et al. 2003). This covarion shift clearly violates the assumption of the JTT+ model that requires that the rates at sites be stationary over the tree. Therefore, we hypothesized that the failure to account for the covarion shift across the two EF-1 subtrees in the phylogenetic analyses was responsible for the artifactual M + A grouping.
We reassessed the covarion shift across the archaebacterium-eukaryote split in the Basic data sets using parametric bootstrap tests (Susko et al. 2002; Inagaki et al. 2003). The observed abrsum value appears to fall outside of the bootstrap distribution (P < 0.001, 5.7 standard errors [SE] away from the mean of bootstrap values [fig. 3A]), suggesting that the overall rate distributions in the two subtrees are significantly different. However, the distance between the observed value and the mean of bootstrap values appears to be reduced to 3.2 and 2.6 SE by both exclusions of the SR15 and SR30 groups of sites respectively, although both P values are still less than 0.01 (fig. 3B and C). These results indicate that the magnitude of the covarion shift across the two subtrees can be mitigated by SRN site exclusion.
FIG. 3. (A–C) Rate distribution differences across the SB-eukaryotic and archaebacterial subtrees. The significance of the overall rate distribution difference across the SB-eukaryotic and archaebacterial subtrees was examined using parametric bootstrap tests (Susko et al. 2002). The results from the analyses based on the Basic, Basic-SR15, and Basic-SR30 data sets are shown in (A), (B), and (C), respectively. The histograms and arrows indicate the bootstrap distributions and observed values, respectively. (D–E) Impact of the SRN site exclusion on the M + A grouping. Log-likelihoods of the M + F and M + A trees was calculated from each of the Micro-SRN data sets, and then the former was subtracted from the latter. The bootstrap distributions generated from 100 JK and JK* data sets are presented using box plots in (D) and (E), respectively. The observed values are indicated by diamond symbols. In the case that the null hypothesis of lowering of lnL solely by the reduction of data was rejected at the 1% or 5% level, the observed values are indicated by open or double diamond symbols, respectively. The lnL = 12.58 from the Micro data set (no site exclusion) is indicated by an arrow on the Y-axis. Details of the two tree topologies considered are described in figure 1B
Impact of SRN Site Exclusion on the Microsporidia Plus Archaebacteria Grouping
The hypothesis that the artifactual M + A grouping was caused by the covarion shift in the EF-1 data was assessed by comparing the M + A and M + F tree topologies using lnLs ("tree lnL comparison"). If our working hypothesis is correct, the dominance of the M + A tree topology over the alternative should be significantly reduced by the SRN site exclusions that diminish the magnitude of the covarion shift (see figures 3A–C).
Without any SRN site exclusions, the M + A tree appears to be better than the alternative tree by 12.58 lnL units (arrow on the Y-axis in figures 3D and E). The lnLs appear to drop from 12.58 to 1.14 lnL units by the progressive exclusion of the SRN sites (figs. 3C and D). In comparison between each observed value and the corresponding null distribution from 100 JK data sets, a significant decrease in 1nL (P < 0.05) was obtained after the exclusion of the SR10 sites (open diamond symbols in figure 3D). P < 0.01 was obtained by the exclusion of the SR35 and SR40 sites (double diamond symbols in figure 3D). We can conclude that the decrease in the lnL because of the exclusion of SRN sites is greater than one would expect based on random deletion.
The SRN sites never contain any of 70 constant sites among 19 SB-eukaryotic and 20 archaebacterial sequences (20% of 349 analyzed sites in the Basic data set). Therefore, to mimic this effect, we created the JK* data sets that contain the sites corresponding to the constant and SR10 sites in the Basic data set. However, the jackknife values calculated from the JK* data sets appear to be similar to those from the JK data sets (fig. 3E). P < 0.01 was obtained after the exclusion of the SR15 sites (double diamond symbols in figure 3E). The tree lnL comparisons clearly indicate that the sites associated with a covarion shift across the SB-eukaryotic and archaebacterial subtrees are responsible for the dominance of the artifactual M + A tree, independent of the jackknife distribution methods used (JK or JK*).
Impact of the SR15 Site Exclusion on ML Phylogeny
We further investigated the impact of the covarion shift on ML phylogenetic analyses using the Micro*-SR15 data set. The optimal ML topology appears to have no M + A grouping (fig. 2B), suggesting that the SR15 site exclusion sufficiently eliminates the source of the LBA uniting the microsporidian and archaebacterial sequences in the Micro* data set. Significantly, much of the phylogenetic information in the original data may remain after the SR15 site exclusion because the ML BP values for the major eukaryotic clades were not drastically changed by the presence or absence of the SR15 sites, except the bootstrap support for the animal plus fungi clade reduced nearly 20% by the site exclusion (the third and fourth columns in table 2). Likewise, these support values appear not to be affected by the presence or absence of the microsporidian sequences, except the apicomplexan clade (compare the third and fifth column or the fourth and sixth columns in table 2). Unfortunately, none of the sequences in the data displays strong affinity (BP > 50%) to the microsporidian sequences in the ML bootstrap analysis (fig. 2B). The true phylogenetic signal in the microsporidian sequences has probably been erased by their extremely high evolutionary rates. It is also possible that the number of phylogenetically informative sites may be relatively small in this data set. Taken together the results from the tree lnL comparisons and ML phylogeny on the Micro*-SR15 data set, we conclude that the artifactual M + A grouping in the EF-1 phylogenies originates from the failure to modeling the covarion shift across the eukaryotic and archaebacterial subtrees.
Our strategy of compensating for a covarion shift in the EF-1 data set is generally applicable to rooted phylogenies. Regardless of the phylogenetic marker or the data type (amino acid or nucleotide sequences), the difference in site rate distribution across ingroup and outgroup sequences may be significant. When the basal branching of a fast-evolving sequence is suspected to be an artifact and an alternative hypothesis is available (e.g., M + A versus M + F tree topologies in this study), one should examine whether model misspecification from covarion shift biases phylogenetic estimates using the strategy described above. The identification and exclusion of sites that exhibit a significant XTSRV across the ingroup and outgroup sequences by various methods (e.g., those described in this study and in Lopez, Forterre, and Philippe [1999]) should be carried out routinely for "deep" phylogenies to compensate or prevent potential artifacts from covarion shifts.
It would be intriguing to assess whether the basal position of microsporidia in SSU rRNA phylogenies is a phylogenetic artifact caused by the covarion shift across the eukaryotic and prokaryotic (outgroup) subtrees using the same strategy developed in this study. If SSU rRNA phylogenies are biased by the failure to model of the covarion shift across the ingroup and outgroup subtrees in a SSU rRNA phylogeny, we can expect significant diminishment of the support for the basal position of microsporidia by SRN exclusion across the two SSU rRNA subtrees. Likewise, the fast-evolving EF-1 sequences from diplomonads and/or parabasalids may be misplaced at the base of the eukaryotic subtree by the same model misspecification. Nevertheless, it is difficult to assess whether the basal positions of diplomonads and parabasalids in the EF-1 phylogeny are artifacts, because, currently, there are no clearly supported alternative hypotheses for their phylogenetic positions.
Parallel Site Rate Variation in EF-1
The rate heterogeneity in the data set including SB-eukaryotic EF-1 sequences ("SB-euk" data set [table 1]) is more extreme than that in the data set including both SB-eukaryotic and microsporidian orthologs ("SB-euk plus Micro" [table 1]); the distribution shape parameters (s) for the former and latter data sets are estimated to be 0.33 and 0.74, respectively. Similarly, the degree of rate heterogeneity in the SB-euk data set ( = 0.33) and that in the data set including 20 archaebacterial sequences are largely different ( = 0.74 ["Arch" in table 1]). These observations circumstantially suggest that the overall evolutionary mode of the microsporidian subtree is more similar to that of the archaebacterial subtree than to that of the SB-eukaryotic subtree. Thus, it is possible that the eukaryote-microsporidian split represented a second covarion shift in the data. To examine the potential parallel variation of evolutionary modes in the EF-1 phylogeny, we estimated and compared the site rates of the three subtrees.
We detected 44 SRS sites with relatively large SR between the archaebacterial and SB-eukaryotic subtrees (arsum = 75.25 [fig. 4A]), corresponding to significant covarion shift identified by the parametric bootstrap test shown in figure 3A. Interestingly, 75 SRS sites with arsum = 58.15 were detected between the SB-eukaryotic and microsporidian subtrees, whereas only 27 SRS sites with arsum = 19.71 were found between the archaebacterial and microsporidian subtrees (figs. 4B and C). These results led us to propose a probable scenario for EF-1 evolution in archaebacteria and eukaryotes. First, the EF-1 site rate pattern was largely differentiated at the archaebacterium-eukaryote transition by acquisition of a number of nontranslational (auxiliary) functions in the eukaryote proteins. Then, after microsporidia split from fungi, extreme reduction of the EF-1 functions occurred in the former lineage, causing in the site rate pattern of the microsporidian EF-1 orthologs to change again. In this view, microsporidian and archaebacterial EF-1 proteins could convergently share relatively similar site rate distributions, a phenomenon we call "parallel site rate variation" or "PSRV."
FIG. 4. Site rate pattern comparisons. (A) SB-eukaryotic (abbreviated as "SB-euk") EF-1 versus archaebacterial ("Arch") EF-1. (B) SB-eukaryotic EF-1 versus microsporidian ("Micro") EF-1. (C) Archaebacterial EF-1 versus microsporidian EF-1. Only sites associated with significant SR are plotted. Hash marks and vertical bars indicate SR and their 95% confidence intervals, respectively
This scenario is congruent with our knowledge about eukaryotic EF-1 functions and biology of microsporidia. Eukaryotic EF-1 is known to have auxiliary functions that are most probably absent in the archaebacterial orthologs, as well as its core function as a translation elongation factor. For instance, biochemical/genetic studies using SB-eukaryotes (e.g., Dictyostelium and ciliates) indicated that eukaryotic EF-1 binds to cytoskeletal proteins and calmodulin and is involved in the ubiquitin-dependent proteolytic system (reviewed in Negrutskii, and El'skaya [1998]). It is also likely that additional (potentially many) eukaryotic EF-1–specific protein-protein interactions remain unidentified. Taken together, it is rational to assume that some sites in the SB-eukaryotic proteins are strictly constrained by the eukaryotic EF-1–specific auxiliary functions, whereas the corresponding sites in the archaebacterial proteins are free to vary because of the primary absence of these functions as schematically drawn in figure 5.
FIG. 5. A scheme for the parallel site rate variation hypothesis between the microsporidian and archaebacterial EF-1. EF-1 in archaebacteria (abbreviated as "Arch," top left) is constrained solely by the core function related to translation (primary lack of nontranslational [auxiliary] functions). The area in EF-1 constrained by this core function is highlighted in gray. At the archaebacterium-eukaryote transition (numeral 1), EF-1 acquired auxiliary functions (top right). The areas constrained from the SB-eukaryotic ("SB-euk") EF-1–specific auxiliary functions are shown in black spots. During the reduction of functionality of microsporidian cellular system (numeral 2), some auxiliary functions in microsporidian ("Micro") EF-1 were lost or weakened (bottom). Steps during eukaryotic evolution are shaded
In contrast to eukaryotes in general, the entire cellular system of microsporidia (including the nuclear genome structure; see Katinka et al. [2001]) is highly reduced, most probably because of their intracellular parasitic lifestyle (Keeling and Fast 2002). Likewise, some of the EF-1 auxiliary functions may have been lost/weakened during the reductive evolution of microsporidia (schematically drawn in figure 5). Actually, 62 SRS sites with small negative SR detected in the site rate pattern comparison between the SB-eukaryotic and microsporidian subtrees confirms this notion (fig. 4B). At these alignment positions, the amino acid character appears to be constant in the SB-eukaryotic subtree but variable in the microsporidian subtree (data not shown), indicating that the microsporidian proteins experience less functional constraints than the SB-eukaryotic orthologs. As the core translational functions of EF-1 are absolutely indispensable for cell viability and cannot be reduced, secondary loss of the auxiliary functions in the microsporidian proteins is the most plausible explanation (fig. 5). Under the parallel absence of eukaryotic EF-1–specific auxiliary functions in microsporidian and archaebacterial EF-1 proteins, the two subtrees could have evolved a relatively similar rate distribution independently.
Obviously, this supposed PSRV in the EF-1 data set should be examined by more accurate site rates estimated from a larger set of microsporidian EF-1 sequences in future studies. In-depth studies of the auxiliary functions of both SB-eukaryotic and microsporidian EF-1 proteins would also be essential for testing this hypothesis.
Conclusion
Here we have shown that the covarion shift across the in-group and out-group sequences most likely is responsible for the artifactual "deep" branching position of microsporidia in EF-1 phylogenies. Furthermore, we detected PSRV between the microsporidian and archaebacterial EF-1 sequences. Covarion shifts, and, in particular, PSRV, need to be pursued as intriguing potential sources of artifacts in many deep phylogenetic inferences. Indeed, a study undertaken by some of us shows analytically that under simple Markov models of sequence change, neighbor-joining and least squares distance methods will succumb to the LBA form of inconsistency when an ASRV model is employed if significant PSRV is occurring (Susko, Inagaki, and Roger, 2004). To further investigate the phylogenetic artifacts that arise from PSRV, studies on data simulated with a more complex protein evolutionary models employing both ASRV and XTSRV will be key.
Supplementary Material
EF-1 alignment considered in this study is shown in the Supplementary Material online.
Acknowledgements
We thank J. Leigh for critical reading and the members of Statistical and Evolutionary Bioinformatics Group (Dalhousie University, Halifax, Canada) for valuable discussions. We also thank H. Philippe (Université de Montreal, Montreal, Canada) for coding the Jacksite program. Y.I. is supported by the "Understanding Prokaryote Genome Diversity and Evolution Project" of Genome Atlantic. A.J.R. is supported as a Scholar by the CIAR program in Evolutionary Biology. This work is supported by operating 227085-00 from NSERC (awarded to A.J.R.). This collaboration is part of a Genome Atlantic/Genome Canada large-scale project.
Literature Cited
Arisue, N., L. B. Sanchez, L. M. Weiss, M. Muller, and T. Hashimoto. 2002. Mitochondrial-type hsp70 genes of the amitochondriate protists, Giardia intestinalis, Entamoeba histolytica and two microsporidians. Parasitol. Int. 51:9-16.
Baldauf, S. L., A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle. 2000. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972-977.
Cavalier-Smith, T., and E. E. Chao. 1996. Molecular phylogeny of the free-living archezoan Trepomonas agilis and the nature of the first eukaryote. J. Mol. Evol. 43:551-562.
Edlind, T. D., J. Li, G. S. Visvesvara, M. H. Vodkin, G. L. McLaughlin, and S. K. Katiyar. 1996. Phylogenetic analysis of beta-tubulin sequences from amitochondrial protozoa. Mol. Phylogenet. Evol. 5:359-367.
Embley, T. M., and R. P. Hirt. 1998. Early branching eukaryotes? Curr. Opin. Genet. Dev. 8:624-629.
Felsenstein, J. 1993. Phylogeny inference package. Version 3.2. Cladistics 5:166.
Felsenstein, J. 2003. Likelihood method. Pp. 248–274 in Inferring phylogenies. Sinauer Associates, Sunderland, Mass.
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579-593.
Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol. Evol. 18:866-873.
Germot, A., H. Philippe, and H. Le Guyader. 1997. Evidence for loss of mitochondria in Microsporidia from a mitochondrial-type HSP70 in Nosema locustae. Mol. Biochem. Parasitol. 87:159-168.
Hasegawa, M., and M. Fujiwara. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:1-5.
Hashimoto, T., N. Arisue, and M. Hasegawa. 2002. Application of molecular phylogenetic inference and associated problems: illustrative data analysis on early eukaryotic evolution. Proc Inst. Stat. Math. 50:45-68.
Hirt, R. P., B. Healy, C. R. Vossbrinck, E. U. Canning, and T. M. Embley. 1997. A mitochondrial Hsp70 orthologue in Vairimorpha necatrix: molecular evidence that microsporidia once contained mitochondria. Curr. Biol. 7:995-998.
Hirt, R. P., J. M. Logsdon, Jr., B. Healy, M. W. Dorey, W. F. Doolittle, and T. M. Embley. 1999. Microsporidia are related to Fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc. Natl. Acad. Sci. USA 96:580-585.
Huelsenbeck, J. P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:698-707.
Inagaki, Y., C. Blouin, E. Susko, and A. J. Roger. 2003. Assessing functional divergence in EF-1 and its paralogs in eukaryotes and archaebacteria. Nucleic Acids Res. 31:4227-4237.
Ishihara, R., and Y. Hayashi. 1968. Some properties of ribosomes from the sporoplasm of Nosema bombycis. J. Invert. Pathol. 11:377-385.
Kamaishi, T., T. Hashimoto, Y. Nakamura, Y. Masuda, F. Nakamura, K. Okamoto, M. Shimizu, and M. Hasegawa. 1996a. Complete nucleotide sequences of the genes encoding translation elongation factors 1 and 2 from a microsporidian parasite, Glugea plecoglossi: implications for the deepest branching of eukaryotes. J. Biochem. (Tokyo) 120:1095-1103.
Kamaishi, T., T. Hashimoto, Y. Nakamura, F. Nakamura, S. Murata, N. Okada, K. Okamoto, M. Shimizu, and M. Hasegawa. 1996b. Protein phylogeny of translation elongation factor EF-1 suggests microsporidians are extremely ancient eukaryotes. J. Mol. Evol. 42:257-263.
Katinka, M. D., S. Duprat, and E. Cornillot, et al. 2001 (17 co-authors). Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 414:450-453.
Keeling, P. J. 2003. Congruent evidence from alpha-tubulin and beta-tubulin gene phylogenies for a zygomycete origin of microsporidia. Fungal Genet. Biol. 38:298-309.
Keeling, P. J., and W. F. Doolittle. 1996. Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. Mol. Biol. Evol. 13:1297-1305.
Keeling, P. J., and N. M. Fast. 2002. Microsporidia: biology and evolution of highly reduced intracellular parasites. Annu. Rev. Microbiol. 56:93-116.
Keeling, P. J., M. A. Luker, and J. D. Palmer. 2000. Evidence from beta-tubulin phylogeny that microsporidia evolved from within the fungi. Mol. Biol. Evol. 17:23-31.
Kumar, S., and A. Rzhetsky. 1996. Evolutionary relationships of eukaryotic kingdoms. J. Mol. Evol. 42:183-193.
Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:1-7.
Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496-508.
Negrutskii, B. S., and A. V. El'skaya. 1998. Eukaryotic elongation factor 1: structure, expression, functions and possible aminoacyl-tRNA channeling. Prog. Nucleic Acid Res. Mol. Biol. 60:47-78.
Penny, D., and M. Hasegawa. 2001. Covarion model of molecular evolution. Pp. 473–477 in S. Brenner and J. H. Miller, eds. Encyclopedia of genetics. Academic Press, San Diego.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53:711-723.
Peyretaillade, E., V. Broussolle, P. Peyret, G. Metenier, M. Gouy, and C. P. Vivares. 1998. Microsporidia, amitochondrial protists, possess a 70-kDa heat shock protein gene of mitochondrial evolutionary origin. Mol. Biol. Evol. 15:683-689.
Philippe, H., and A. Germot. 2000. Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Mol. Biol. Evol. 17:830-834.
Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616-623.
Roger, A. J. 1996. Studies on the phylogenetic and gene structure of early-branching eukaryotes. PhD Thesis. Dalhousie University, Halifax, Canada.
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.
Slamovits, C. H., B. A. P. Williams, and P. J. Keeling. 2004. Transfer of Nosema locustae (Microporidia) to Antonospora locustae n. comb. based on molecular and ultrastructural data. J Euk. Microbiol. 51:270-213.
Sullivan, J., and D. L. Swofford. 2001. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 50:723-729.
Susko, E., Y. Inagaki, and A. J. Roger. 2004. On inconsistency of the neighbor-joining, lease squares and minimum evolutions estimation when substitution processes are incorrectly modeled. Mol. Biol. Evol. (in press).
Susko, E., Y. Inagaki, C. Field, M. E. Holder, and A. J. Roger. 2002. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol. Biol. Evol. 19:1514-1523.
Swofford, D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, and J. S. Rogers. 2001. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. 50:525-539.
Tuffley, C., and M. Steel. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63-91.
Van de Peer, Y., A. Ben Ali, and A. Meyer. 2000. Microsporidia: accumulating molecular evidence that a group of amitochondriate and suspectedly primitive eukaryotes are just curious fungi. Gene 246:1-8.
Vossbrinck, C. R., J. V. Maddox, S. Friedman, B. A. Debrunner-Vossbrinck, and C. R. Woese. 1987. Ribosomal RNA sequence suggests microsporidia are extremely ancient eukaryotes. Nature 326:411-414.
Vossbrinck, C. R., and C. R. Woese. 1986. Eukaryotic ribosomes that lack a 5.8S RNA. Nature 320:287-288.
Williams, B. A., R. P. Hirt, J. M. Lucocq, and T. M. Embley. 2002. A mitochondrial remnant in the microsporidian Trachipleistophora hominis. Nature 418:865-869.(Yuji Inagaki*,, Edward Su)
Department of Bioscience, Nagahama Institute of Bioscience and Technology, 1266 Tamura, Nagahama, Shiga 526-0829, Japan
Department ofMathematics and Statistics and Genome Atlantic, Dalhousie University, Halifax, Nova Scotia, Canada
Departmentof Botany, University of British Columbia, Vancouver, British Columbia, Canada
E-mail: yinagai@dal.ca.
Abstract
Microsporidia branch at the base of eukaryotic phylogenies inferred from translation elongation factor 1 (EF-1) sequences. Because these parasitic eukaryotes are fungi (or close relatives of fungi), it is widely accepted that fast-evolving microsporidian sequences are artifactually "attracted" to the long branch leading to the archaebacterial (outgroup) sequences ("long-branch attraction," or "LBA"). However, no previous studies have explicitly determined the reason(s) why the artifactual allegiance of microsporidia and archaebacteria ("M + A") is recovered by all phylogenetic methods, including maximum likelihood, a method that is supposed to be resistant to classical LBA. Here we show that the M + A affinity can be attributed to those alignment sites associated with large differences in evolutionary site rates between the eukaryotic and archaebacterial subtrees. Therefore, failure to model the significant evolutionary rate distribution differences (covarion shifts) between the ingroup and outgroup sequences is apparently responsible for the artifactual basal position of microsporidia in phylogenetic analyses of EF-1 sequences. Currently, no evolutionary model that accounts for discrete changes in the site rate distribution on particular branches is available for either protein or nucleotide level phylogenetic analysis, so the same artifacts may affect many other "deep" phylogenies. Furthermore, given the relative similarity of the site rate patterns of microsporidian and archaebacterial EF-1 proteins ("parallel site rate variation"), we suggest that the microsporidian orthologs may have lost some eukaryotic EF-1–specific nontranslational functions, exemplifying the extreme degree of reduction in this parasitic lineage.
Key Words: phylogenetic artifact ? long-branch attraction ? EF-1 ? microsporidia ? covarion model ? parallel site rate variation
Introduction
Microsporidia are a monophyletic assemblage of obligate intracellular parasites that generally infect animals (particularly arthropods and fish). These unicellular eukaryotes lack classical mitochondria, peroxisomes and Golgi apparatus at the ultrastructural level (reviewed in Keeling and Fast [2002]). At the molecular level, these organisms appear to possess unusual ribosomes and ribosomal RNA (rRNA). The sedimentation coefficient of microsporidian ribosomes is more similar to that of prokaryotic (i.e., eubacterial or archaebacterial) 70S ribosomes than to that of typical eukaryotic 80S ribosomes (Ishihara and Hayashi 1968). Fusion of the large subunit (LSU) and 5.8S rRNAs is another trait shared between microsporidia and prokaryotes (Vossbrinck and Woese 1986). Furthermore, microsporidia were robustly estimated to be one of the basal lineages among eukaryotes in the first molecular phylogenies that include microsporidia, those of small subunit (SSU) rRNA sequences (Vossbrinck et al. 1987). This basal position was recovered in later SSU phylogenies regardless of the tree reconstruction method, model of nucleotide substitution, or taxonomic sampling (e.g., Cavalier-Smith and Chao [1996] and Kumar and Rzhetsky [1996]). Importantly, similar results were reproduced by early phylogenetic studies based on the amino acid sequences of translation elongation factors 1 and 2 (EF-1 and EF-2) (Kamaishi et al. 1996a, 1996b). Taken together, these results appeared to suggest that microsporidia were one of the earliest offshoots in eukaryotic evolution, diverging from the main trunk of eukaryotic lineages before the mitochondrial endosymbiosis.
However, in the past decade, this evolutionary scenario has drastically changed. Nuclear genes encoding proteins of mitochondrial origin were found in microsporidia, strongly suggesting the secondary loss of typical mitochondria in this lineage (Germot, Philippe, and Le Guyader 1997; Hirt et al. 1997; Peyretaillade et al. 1998; Arisue et al. 2002). Katinka et al. (2001) identified additional genes for mitochondrial proteins in the completed genome of the microsporidian, Encephalitozoon cuniculi. In line with these findings, Williams et al. (2002) recently ultrastructurally identified tiny, relic mitochondria in the microsporidian Trachipleistophora hominis. Secondly, further analyses of protein sequences (except elongation factors) have indicated the evolutionary affinity between microsporidia and fungi (reviewed in Keeling and Fast [2002]). In particular, relatively strong support for a relationship between microsporidia and fungi ("M + F") is recovered in phylogenies based on tubulins (Edlind et al. 1996; Keeling and Doolittle 1996; Keeling, Luker, and Palmer 2000; Keeling 2003), the largest subunit of RNA polymerase II (Hirt et al. 1999), seryl-tRNA synthetase, transcription initiation factor IIB, a subunit of the vacuolar ATPase and a GTPase (Katinka et al. 2001). Multigene phylogenies that include microsporidia also recover the M + F connection with high support values (Baldauf et al. 2000; Hashimoto, Arisue, and Hasegawa 2002). In sum, results accumulating since the late 1990s strongly suggest that microsporidia are fungi or close relatives of fungi. The basal position of microsporidia in SSU rRNA and elongation factor phylogenies has been rationalized as an artifact resulting from "long-branch attraction" (LBA), whereby the fast-evolving microsporidian and the long branch leading to the archaebacterial sequences were artifactually attracted to one another in tree reconstruction (e.g., Embley and Hirt [1998] and Philippe and Laurent [1998]). Under this hypothesis, the relatively simple architecture of microsporidian cells can be interpreted as the result of extensive adaptation to their parasitic lifestyle.
Unlike unweighted maximum-parsimony (MP), maximum-likelihood (ML) methods are generally expected to be statistically consistent if the Markov model of sequence change closely matches reality (for discussion, see Felsenstein [2003]). Simulation studies concur with this result, indicating that ML methods are less susceptible to LBA with finite data than MP or distance methods (e.g., Hasegawa and Fujiwara [1993] and Swofford et al. [2001]). However, in practice, neither infinite data nor perfectly correct models are available, and ML methods can yield biased estimates of phylogenies. For instance, ML methods display sensitivity to LBA when the evolutionary model is violated by the data, such as failure to model among-site rate variation (ASRV) (e.g., Sullivan and Swofford [2001]). Therefore, microsporidian phylogenies estimated under simple models that assume an equal rate of evolution among sites may be severely affected by LBA. Indeed, the EF-1 ML analyses under a more realistic model that takes into account ASRV failed to statistically reject the M + F relationship, although the basal position of microsporidia was still the ML estimate (Hirt et al. 1999; Hashimoto, Arisue, and Hasegawa 2002). A similar trend was observed in concatenated SSU plus LSU rRNA gene phylogenies (Philippe and Germot 2000). Importantly, the LSU rRNA (rooted) distance phylogeny and EF-2 (unrooted) distance and ML phylogenies under an ASRV model actually recovered the M + F relationship, albeit with weak support values (Hirt et al. 1999; Van de Peer, Ben Ali, and Meyer 2000; Keeling and Fast 2002). These results consistently suggest that the fast-evolving microsporidian sequences incorrectly branch at the base of eukaryotes because of LBA.
ASRV models are thought to be essential for phylogenetic analyses of problematic data sets, including fast-evolving sequences (e.g., Philippe and Laurent [1998]). However, these models are obviously insufficient to describe the complex processes of amino acid and nucleotide evolution. One of the crucial processes ignored in ASRV models is across-tree site rate variation (XTSRV), also called "heterotachy" by some authors (Lopez, Casane, and Philippe 2002). Functional and/or structural constraints on a biological molecule can be changed over evolutionary time, so that the rates at sites may vary across the tree (i.e., among the sequences in an alignment). This evolutionary process was first described as the concomitantly variable codon (covarion) model that allows a site to shift between two states, invariable ("off") or variable ("on"), in a protein over time (Fitch and Markowitz 1970). Explicit formulations of Markov models of covarion evolution have only recently been conceived (Tuffley and Steel 1998; Penny et al. 2001). The original covarion and ASRV models have also been merged into "covarion-like" models for both protein and nucleotide evolution that, rather than having simply invariable and variable sites, allow sites to "switch" between a number of different site rates across subtrees (reviewed in Penny and Hasegawa [2001]). The covarionlike models in Tuffley and Steel (1998), Penny et al. (2001), Galtier (2001), Lopez, Casane and Philippe (2002) and Huelsenbeck (2002) assume XTSRV to be a stochastic process varying across both sites and subtrees with frequent changes in rates throughout the larger tree. Although rates do vary across both sites and subtrees in these models, the process is stationary, implying that the marginal distributions of rates in any two subtrees will be the same. This stationarity of the frequency of changes throughout the larger tree does not provide a natural model for biological molecules subjected to a "covarion shift," where many sites experience a shift in evolutionary rates within a short period of time associated with a modification in functional and/or structural properties (Inagaki et al. 2003). The XTSRV in covarion shift models (e.g., modeled by a bivariate rate distribution in Susko et al. [2002]) should allow different marginal distributions in rates for subtrees on either side of the shift, and the rate shifts should occur with lower frequency in evolutionary time than stochastic covarion drift. Although the two types of covarion-like models are intrinsically different, they are not mutually exclusive and both serve to make some adjustment for covarionlike effects.
In this study, we investigated the impact of XTSRV in the EF-1 data set on the phylogenetic position of microsporidia. We divided the data into the archaebacterial and eukaryotic sub-trees, and identified the sites associated with a large difference in evolutionary rates across the subtrees ("SRN" sites) estimated by the bivariate rate ML method (Susko et al. 2002). The difference in log-likelihood (lnL) between the two tree topologies representing the competing hypotheses of (1) microsporidia as an early diverging eukaryotic lineage versus (2) microsporidia as relatives of fungi then was calculated based on the data sets excluding SRN sites. Similar analyses based on data sets excluding random sites (jackknifing) were then conducted to evaluate the significance of SRN site exclusions. Finally, the data sets with and without SRN sites were analyzed using the ML method.
Philippe and Germot (2000) previously investigated the impact of the "covarion structure" on putative LBA artifacts in a concatenated SSU plus LSU rRNA data set using a similar approach to ours. The authors excluded the "covarion" sites identified by a MP-based method (Lopez, Forterre, and Philippe 1999), and tested whether this procedure lowered the support for putative LBA artifacts such as the basal branching of diplomonads (e.g., Giardia), Trichomonas (a parabasalid), and microsporidia (Philippe and Germot [2000]). However, we have identified two potential deficiencies in these analyses. First, this pioneering work lacks control experiments that would allow the assessment of "covarion" site deletion on support for potential LBA artifacts (i.e., jackknifing of random sites). Another difficulty arises from the tripartition of the data (i.e., archaebacteria, "basal branching," and "crown" eukaryotes) to identify the covarion sites. Therefore, it is unclear which partition of the rRNA data bears the significant rate difference that allegedly produces the LBA artifacts.
By contrast, our results unequivocally illustrate that the failure to model covarion shifts across the archaebacterium-eukaryote split in the data results in an artifactual affinity between the long branches leading to the archaebacterial (outgroup) and microsporidian sequences—a kind of LBA that originates from model misspecification. Furthermore, we detect similarity between the patterns of site rates for the microsporidian and archaebacterial sequences ("parallel site rate variation") relative to eukaryotic homologs, and propose the loss of some eukaryotic EF-1–specific nontranslational functions in the microsporidian lineage to explain this "convergent" property of the two EF-1 subtrees.
Materials and Methods
Data Sets
The experimental details for the isolation of an EF-1 gene from a microsporidian Antonospora locustae (Nosema locustae was renamed to Antonospora locustae; see Slamovits, Williams, and Keeling [2004]) were described in Roger (1996). The novel DNA sequence of Antonospora EF-1 was deposited under DDBJ/EMBL/GenBank accession number AY452734.
We manually aligned the deduced EF-1 amino acid sequences of 22 phylogenetically divergent eukaryotes (19 "short-branch" eukaryotes [SB eukaryotes] and three microsporidia) and 20 archaebacteria. The accession numbers of the sequences considered in this study are listed in the caption of figure 1. To focus on LBA between the microsporidian and archaebacterial EF-1 sequences, other fast-evolving eukaryotic sequences (e.g., those from diplomonads and parabasalids) were not considered in this study. We excluded ambiguously aligned sites and sites that included gaps, creating the "Micro" data set consisting of 42 sequences with 349 sites (table 1). Subsequently, the "Basic" data set was generated from the Micro data by exclusion of the microsporidian sequences for the estimation of conditional mode rates at sites (site rates) and evaluating the rate distribution difference across eukaryotic and archaebacterial subtrees (see below). Smaller data sets, "Micro*" and "Basic*," were derived from the Micro and Basic data sets by exclusion of all archaebacterial sequences except those of Methanosarcina and Aeropyrum, respectively (table 1). These data sets were subjected to computationally intensive ML phylogenetic analyses.
FIG. 1. (A) A histogram of the absolute value of SR calculated between the SB-eukaryotic and archaebacterial subtrees in the Basic data set. Arrows indicate the ranges of top 5% to 40% absolute SR. The absolute value SR range, from which 245 sites were randomly jackknifed to generate the "JK*" data sets, is shaded. (B) The tree topologies examined for the phylogenetic positions of microsporidia. The archaebacterial clade is omitted. The microsporidian clade was attached to the edge A or B, creating the "M + A" or "M + F" tree topology, respectively. Accession numbers for the sequence considered are as follows: Sulfolobus acidocaldaris, X52382; S. solfataricus, X70701; S. tokodaii, BAB65236; Desulfurococcus mobilis, X73582; Aeropyrum pernix, AP000062; Pyrobaculum aerophilum, U94347; uncultured crenarchaeote 74A4, AF393466; Thermococcus celer, X52383; Pyrococcus horikohii, AP000006; P. abysii, AJ248285; P. woesei, X59857; Archaeoglobus fulgidus, AE001039; Methanococcus jannaschii, U67486; M vannielii, X05698; Methanosarcina barkeri, ZP_00079008; Ferroplasma acidarmanus, ZP_000005; Halobacterium sp., AAG20682; H. halobium, D32120; Haloarcula marismotui, X16677; Thermoplasma acidophilum, AL445064; Homo sapiens, BC002845; Caenorhabditis elegans, AAA81688; Schizosaccharomyces pombe, D82571; Podospora anserina, X74799; Arabidopsis thaliana, AB026655; Nicotiana tabacum, D63396; Euglena gracilis, X16890; Trypanosoma brucei, U10562; Acrasis rosea, AF190771; Dictyostelium discoideum, X55972; Mastigamoeba balamuthi, a contig of BE636534, BE636560, BE636596, and BE636646; Entamoeba histolytica, M92073; Porphyra purpurea, U08844; Plasmodium knowlesi, AJ224153; Cryptosporidium parvum, U71180; Stylonychia lemnae, X57926; Blastocystis hominis, AF091355; Phytophthora capsici, AF109666; Cyanophora paradoxa, AF092951; Antonospora (Nosema) locustae, AY452734; Glugea plecoglossi, BAA06865; Encephalitozoon cuniculi, CAD25298
Table 1 Details of the Data Sets Used in This Study.
Site rates of the SB-eukaryotic and archaebacterial subtrees in the Basic data set were estimated by the bivariate ML rate method with the PAM001 amino acid substitution model using Bivar (Susko et al. 2002). In this method, ASRV in the two subtrees was described by a bivariate discrete ML rate distribution with a matrix of 25 x 25 equally spaced rate categories. The upper limit for rates was set as 6.0. The probability of each rate category was estimated from the data.
Absolute values of the difference in bivariate site rates (SR) across the SB-eukaryotic and archaebacterial subtrees were calculated, and then eight sets of sites with top 5% to 40% absolute SR with 5% interval were defined (fig. 1A). Henceforth, the sites that associate with top N% SR across the two subtrees in the Basic data set, as well as their homologous positions in other data sets, are designated as the SRN sites (N = 5, 10, 15, 20, 25, 30, 35, or 40). The SRN sites were excluded from the Micro data set, generating eight "Micro-SRN" data sets. These data sets were subjected to the examination of the impact of SRN sites on the phylogenetic position of microsporidia (see below). Likewise, the "Micro*-SR15," "Basic*-SR15," "Basic-SR15," and "Basic-SR30" data sets were further generated from the Micro*, Basic*, and Basic data sets by exclusion of the SR15 or SR30 sites. The first two data sets were used for ML phylogenetic analyses, and the last two were used for evaluation of the rate distribution difference across the archaebacteria-eukaryote split (see below). The details of the data sets described above are listed in table 1.
Phylogenetic Analyses
All distance analyses in this study were conducted as follows: ML distance matrices were calculated under the JTT amino acid substitution model with ASRV approximated using a gamma () distribution with eight equally probable discrete rate categories (JTT+ model) in Tree-Puzzle version 5.0 (Schmidt et al. 2002). Shape parameters (s) for distributions were estimated from each data set (table 1). Phylogenetic trees were reconstructed from ML distance matrices using the Fitch-Margoliash weighted least squares (Fitch) method implemented in PHYLIP version 3.6a (Felsenstein 1993) with the input sequence order randomized five times and global rearrangements. Distance bootstrap analyses (500 replicates) were conducted using Seqboot, Fitch, Consense in PHYLIP (Felsenstein 1993), Tree-Puzzle (Schmidt et al. 2002), and Puzzleboot version 1.02 (A. J. Roger and M. E. Holder, http://hades.biochem.dal.ca/Rogerlab/Software/Software.html). The parameters and settings for the bootstrap analyses were the same as described above, except a single randomized input sequence order per replicate was performed.
ML trees were reconstructed under the JTT+ model using ProML in PHYLIP (Felsenstein 1993) with global rearrangements and the input sequence order randomized five times. ASRV was modeled using discrete approximation of distributions (eight equally probable categories). All parameters required were estimated from the data using Tree-Puzzle (Schmidt et al. 2002). The parameters and settings for the ML bootstrap analyses (100 replicates) were the same as described above, except (1) a single randomized input sequence order per replicate was performed in the tree search, and (2) ASRV was modeled using discrete distributions with four equally probable rate categories.
Parametric Bootstrap Tests for the Overall Rate Distribution Difference Across Two Subtrees
We have previously established a parametric bootstrap test to examine the rate distribution difference between two subtrees in a combined data set—the overall rate distribution difference is represented by the sum of absolute values of "weighted" SR (henceforth designated as "abrsum") (Susko et al. 2002). The observed value is compared with a null distribution from 1,000 Monte Carlo simulated data sets to evaluate its significance. In this study, the rate distribution difference across the SB-eukaryotic and archaebacterial subtrees was assessed using the Basic, Basic-SR15, and Basic-SR30 data sets (table 1). The details of the parametric bootstrap test are described in Susko et al. (2002).
Comparison of the Two Hypotheses for the Phylogenetic Position of Microsporidia
We generated tree topologies to test two alternative hypotheses for the phylogenetic position of microsporidia: the "M + A" hypothesis (under which microsporidia are one of the earliest diverged lineages among the extant eukaryotes) and the "M + F" hypothesis (under which microsporidia branch with or among fungi). First, a distance phylogenetic tree was reconstructed under the JTT+ model from the Basic data set as described above. The microsporidian clade was then attached to the edge A or B in the optimal distance tree topology (see figure 1B), representing the M + A or M + F hypothesis, respectively. The log-likelihoods (lnLs) of the M + A or M + F trees were calculated from the Micro and Micro-SRN data sets by the ML method under the JTT+ model using Tree-Puzzle (Schmidt et al. 2002). The parameters and branch lengths for each tree were re-estimated from each data set. Difference in the lnLs of the two trees (lnL) was then calculated by subtraction of the former from the latter, assessing the dominance of the M + A topology over the other.
We generated, at each of eight values of N (N = 5, 10, 15, 20, 25, 30, 35, or 40 [see above]), 100 "JK" data sets by randomly deleting (jackknifing) the same number of sites as the SRN sites, using Jacksite (H. Philippe, Université de Montreal). Similarly, we prepared another eight sets of 100 "JK*" data sets as follows. The 245 alignment sites in the Micro data set that correspond neither to the constant sites nor to the SR10 sites in the Basic data set were jackknifed randomly to generate JK* data sets (shaded area in figure 1A). The lnLs between the M + A and M + F trees were calculated from the JK and JK* data sets (null distributions) and compared with the corresponding values obtained from the SRN site exclusion (observed values).
Estimation of Site Rate Pattern of EF-1 Subtrees
We investigated the difference in site rate pattern (1) between the SB-eukaryotic and archaebacterial subtrees, (2) between the SB-eukaryotic and microsporidian subtrees, and (3) between the archaebacterial and microsporidian subtrees. SR between two subtrees and their 95% confidence intervals were estimated using the bivariate ML method (Susko et al. 2002). Because large errors were expected in the estimation of microsporidian site rates as a result of the poor sequence sampling (only three sequences are available for this study), we considered only sites associating with significant SR ("SRS" sites), for which 95 % confidence intervals do not contain zero. The absolute values of these significant SR sites were summed to represent the "overall rate distance" between two subtrees (henceforth designated as "arsum").
Results and Discussion
Robustness of the ML Method Against the Artifactual Microsporidia Plus Archaebacteria Grouping
Microsporidia were estimated to be the earliest diverging lineage among eukaryotes considered (M + A grouping) in the optimal distance trees based on both Micro and Micro* data sets, which differ from each other only in the sampling of archaebacterial sequences (20 versus 2, respectively [table 1]). Likewise, the distance bootstrap analyses on the two data sets display similar results, except the monophyletic clades of the animal sequences and of the euglenozoan sequences (the first two columns in table 2). These results indicate that the selection of archaebacterial sequences has no large impact on the overall relationship among eukaryotic sequences. Thus, the smaller Micro* data set was used for more stringent phylogenetic analyses using the ML method under the JTT+ model.
Table 2 Impact of Phylogenetic Methods and SR15 Site Exclusion on BP Values.
The M + A grouping was recovered in the optimal ML tree based on the Micro* data set (fig. 2A) in accordance with previously published works assessing a microsporidian (Glugea plecoglossi) EF-1 sequence (Kamaishi et al. 1996a, 1996b). The M + A grouping receives relatively high support (BP = 83%) in the distance analysis, whereas the ML analysis gives very weak support (BP = 34%) to the same relationship (fig. 2A). These results clearly indicate that the ML method under the JTT+ model is less susceptible to LBA producing the M + A grouping in the EF-1 data than is the distance method under the same model.
FIG. 2. Phylogenetic analyses of the microsporidian EF-1 sequences. (A) An ML tree from the Micro* data set (349 sites) rooted by the archaebacterial sequences. Only ML BP values over 50% are shown. The BP values for particular nodes are also listed in table 2. (B) An ML tree based on the Micro*-SR15 data set (298 sites) rooted by the archaebacterial sequences. The details were as described in (A)
The ML or distance bootstrap analyses under the JTT+ model fail to resolve deep level relationships among eukaryotic lineages with BPs greater than 50%, except the M + A grouping in the latter analysis (figure 2A and the second and third columns in table 2). Significantly, no other large incongruities between the BP values from the ML method and the corresponding values from the distance method were observed for other major eukaryotic clades recovered, indicating that the ML bootstrap analysis is not overly stringent (the first three columns in table 2).
A Covarion Shift Across the SB-Eukaryotic and Archaebacterial Subtrees
As discussed above, the optimal ML tree reconstructed under the JTT+ model has the artifactual M + A grouping (fig. 2A). Unless significant model misspecification is present, the ML method may be robust against classical LBA from coincidental concurrence of character states between two distantly related fast-evolving lineages. In addition, we detect no significant homoplasy of amino acid characters across the microsporidian and archaebacterial sequences in the EF-1 data set (see Supplementary Material online). Therefore, our phylogenetic analyses under the JTT+ model most probably fails to account for a crucial aspect of the evolutionary process that generated the data.
Our group previously reported a significant rate distribution difference (covarion shift) between the archaebacterial and eukaryotic subtrees in the combined EF-1 tree (Susko et al. 2002; Inagaki et al. 2003). This covarion shift clearly violates the assumption of the JTT+ model that requires that the rates at sites be stationary over the tree. Therefore, we hypothesized that the failure to account for the covarion shift across the two EF-1 subtrees in the phylogenetic analyses was responsible for the artifactual M + A grouping.
We reassessed the covarion shift across the archaebacterium-eukaryote split in the Basic data sets using parametric bootstrap tests (Susko et al. 2002; Inagaki et al. 2003). The observed abrsum value appears to fall outside of the bootstrap distribution (P < 0.001, 5.7 standard errors [SE] away from the mean of bootstrap values [fig. 3A]), suggesting that the overall rate distributions in the two subtrees are significantly different. However, the distance between the observed value and the mean of bootstrap values appears to be reduced to 3.2 and 2.6 SE by both exclusions of the SR15 and SR30 groups of sites respectively, although both P values are still less than 0.01 (fig. 3B and C). These results indicate that the magnitude of the covarion shift across the two subtrees can be mitigated by SRN site exclusion.
FIG. 3. (A–C) Rate distribution differences across the SB-eukaryotic and archaebacterial subtrees. The significance of the overall rate distribution difference across the SB-eukaryotic and archaebacterial subtrees was examined using parametric bootstrap tests (Susko et al. 2002). The results from the analyses based on the Basic, Basic-SR15, and Basic-SR30 data sets are shown in (A), (B), and (C), respectively. The histograms and arrows indicate the bootstrap distributions and observed values, respectively. (D–E) Impact of the SRN site exclusion on the M + A grouping. Log-likelihoods of the M + F and M + A trees was calculated from each of the Micro-SRN data sets, and then the former was subtracted from the latter. The bootstrap distributions generated from 100 JK and JK* data sets are presented using box plots in (D) and (E), respectively. The observed values are indicated by diamond symbols. In the case that the null hypothesis of lowering of lnL solely by the reduction of data was rejected at the 1% or 5% level, the observed values are indicated by open or double diamond symbols, respectively. The lnL = 12.58 from the Micro data set (no site exclusion) is indicated by an arrow on the Y-axis. Details of the two tree topologies considered are described in figure 1B
Impact of SRN Site Exclusion on the Microsporidia Plus Archaebacteria Grouping
The hypothesis that the artifactual M + A grouping was caused by the covarion shift in the EF-1 data was assessed by comparing the M + A and M + F tree topologies using lnLs ("tree lnL comparison"). If our working hypothesis is correct, the dominance of the M + A tree topology over the alternative should be significantly reduced by the SRN site exclusions that diminish the magnitude of the covarion shift (see figures 3A–C).
Without any SRN site exclusions, the M + A tree appears to be better than the alternative tree by 12.58 lnL units (arrow on the Y-axis in figures 3D and E). The lnLs appear to drop from 12.58 to 1.14 lnL units by the progressive exclusion of the SRN sites (figs. 3C and D). In comparison between each observed value and the corresponding null distribution from 100 JK data sets, a significant decrease in 1nL (P < 0.05) was obtained after the exclusion of the SR10 sites (open diamond symbols in figure 3D). P < 0.01 was obtained by the exclusion of the SR35 and SR40 sites (double diamond symbols in figure 3D). We can conclude that the decrease in the lnL because of the exclusion of SRN sites is greater than one would expect based on random deletion.
The SRN sites never contain any of 70 constant sites among 19 SB-eukaryotic and 20 archaebacterial sequences (20% of 349 analyzed sites in the Basic data set). Therefore, to mimic this effect, we created the JK* data sets that contain the sites corresponding to the constant and SR10 sites in the Basic data set. However, the jackknife values calculated from the JK* data sets appear to be similar to those from the JK data sets (fig. 3E). P < 0.01 was obtained after the exclusion of the SR15 sites (double diamond symbols in figure 3E). The tree lnL comparisons clearly indicate that the sites associated with a covarion shift across the SB-eukaryotic and archaebacterial subtrees are responsible for the dominance of the artifactual M + A tree, independent of the jackknife distribution methods used (JK or JK*).
Impact of the SR15 Site Exclusion on ML Phylogeny
We further investigated the impact of the covarion shift on ML phylogenetic analyses using the Micro*-SR15 data set. The optimal ML topology appears to have no M + A grouping (fig. 2B), suggesting that the SR15 site exclusion sufficiently eliminates the source of the LBA uniting the microsporidian and archaebacterial sequences in the Micro* data set. Significantly, much of the phylogenetic information in the original data may remain after the SR15 site exclusion because the ML BP values for the major eukaryotic clades were not drastically changed by the presence or absence of the SR15 sites, except the bootstrap support for the animal plus fungi clade reduced nearly 20% by the site exclusion (the third and fourth columns in table 2). Likewise, these support values appear not to be affected by the presence or absence of the microsporidian sequences, except the apicomplexan clade (compare the third and fifth column or the fourth and sixth columns in table 2). Unfortunately, none of the sequences in the data displays strong affinity (BP > 50%) to the microsporidian sequences in the ML bootstrap analysis (fig. 2B). The true phylogenetic signal in the microsporidian sequences has probably been erased by their extremely high evolutionary rates. It is also possible that the number of phylogenetically informative sites may be relatively small in this data set. Taken together the results from the tree lnL comparisons and ML phylogeny on the Micro*-SR15 data set, we conclude that the artifactual M + A grouping in the EF-1 phylogenies originates from the failure to modeling the covarion shift across the eukaryotic and archaebacterial subtrees.
Our strategy of compensating for a covarion shift in the EF-1 data set is generally applicable to rooted phylogenies. Regardless of the phylogenetic marker or the data type (amino acid or nucleotide sequences), the difference in site rate distribution across ingroup and outgroup sequences may be significant. When the basal branching of a fast-evolving sequence is suspected to be an artifact and an alternative hypothesis is available (e.g., M + A versus M + F tree topologies in this study), one should examine whether model misspecification from covarion shift biases phylogenetic estimates using the strategy described above. The identification and exclusion of sites that exhibit a significant XTSRV across the ingroup and outgroup sequences by various methods (e.g., those described in this study and in Lopez, Forterre, and Philippe [1999]) should be carried out routinely for "deep" phylogenies to compensate or prevent potential artifacts from covarion shifts.
It would be intriguing to assess whether the basal position of microsporidia in SSU rRNA phylogenies is a phylogenetic artifact caused by the covarion shift across the eukaryotic and prokaryotic (outgroup) subtrees using the same strategy developed in this study. If SSU rRNA phylogenies are biased by the failure to model of the covarion shift across the ingroup and outgroup subtrees in a SSU rRNA phylogeny, we can expect significant diminishment of the support for the basal position of microsporidia by SRN exclusion across the two SSU rRNA subtrees. Likewise, the fast-evolving EF-1 sequences from diplomonads and/or parabasalids may be misplaced at the base of the eukaryotic subtree by the same model misspecification. Nevertheless, it is difficult to assess whether the basal positions of diplomonads and parabasalids in the EF-1 phylogeny are artifacts, because, currently, there are no clearly supported alternative hypotheses for their phylogenetic positions.
Parallel Site Rate Variation in EF-1
The rate heterogeneity in the data set including SB-eukaryotic EF-1 sequences ("SB-euk" data set [table 1]) is more extreme than that in the data set including both SB-eukaryotic and microsporidian orthologs ("SB-euk plus Micro" [table 1]); the distribution shape parameters (s) for the former and latter data sets are estimated to be 0.33 and 0.74, respectively. Similarly, the degree of rate heterogeneity in the SB-euk data set ( = 0.33) and that in the data set including 20 archaebacterial sequences are largely different ( = 0.74 ["Arch" in table 1]). These observations circumstantially suggest that the overall evolutionary mode of the microsporidian subtree is more similar to that of the archaebacterial subtree than to that of the SB-eukaryotic subtree. Thus, it is possible that the eukaryote-microsporidian split represented a second covarion shift in the data. To examine the potential parallel variation of evolutionary modes in the EF-1 phylogeny, we estimated and compared the site rates of the three subtrees.
We detected 44 SRS sites with relatively large SR between the archaebacterial and SB-eukaryotic subtrees (arsum = 75.25 [fig. 4A]), corresponding to significant covarion shift identified by the parametric bootstrap test shown in figure 3A. Interestingly, 75 SRS sites with arsum = 58.15 were detected between the SB-eukaryotic and microsporidian subtrees, whereas only 27 SRS sites with arsum = 19.71 were found between the archaebacterial and microsporidian subtrees (figs. 4B and C). These results led us to propose a probable scenario for EF-1 evolution in archaebacteria and eukaryotes. First, the EF-1 site rate pattern was largely differentiated at the archaebacterium-eukaryote transition by acquisition of a number of nontranslational (auxiliary) functions in the eukaryote proteins. Then, after microsporidia split from fungi, extreme reduction of the EF-1 functions occurred in the former lineage, causing in the site rate pattern of the microsporidian EF-1 orthologs to change again. In this view, microsporidian and archaebacterial EF-1 proteins could convergently share relatively similar site rate distributions, a phenomenon we call "parallel site rate variation" or "PSRV."
FIG. 4. Site rate pattern comparisons. (A) SB-eukaryotic (abbreviated as "SB-euk") EF-1 versus archaebacterial ("Arch") EF-1. (B) SB-eukaryotic EF-1 versus microsporidian ("Micro") EF-1. (C) Archaebacterial EF-1 versus microsporidian EF-1. Only sites associated with significant SR are plotted. Hash marks and vertical bars indicate SR and their 95% confidence intervals, respectively
This scenario is congruent with our knowledge about eukaryotic EF-1 functions and biology of microsporidia. Eukaryotic EF-1 is known to have auxiliary functions that are most probably absent in the archaebacterial orthologs, as well as its core function as a translation elongation factor. For instance, biochemical/genetic studies using SB-eukaryotes (e.g., Dictyostelium and ciliates) indicated that eukaryotic EF-1 binds to cytoskeletal proteins and calmodulin and is involved in the ubiquitin-dependent proteolytic system (reviewed in Negrutskii, and El'skaya [1998]). It is also likely that additional (potentially many) eukaryotic EF-1–specific protein-protein interactions remain unidentified. Taken together, it is rational to assume that some sites in the SB-eukaryotic proteins are strictly constrained by the eukaryotic EF-1–specific auxiliary functions, whereas the corresponding sites in the archaebacterial proteins are free to vary because of the primary absence of these functions as schematically drawn in figure 5.
FIG. 5. A scheme for the parallel site rate variation hypothesis between the microsporidian and archaebacterial EF-1. EF-1 in archaebacteria (abbreviated as "Arch," top left) is constrained solely by the core function related to translation (primary lack of nontranslational [auxiliary] functions). The area in EF-1 constrained by this core function is highlighted in gray. At the archaebacterium-eukaryote transition (numeral 1), EF-1 acquired auxiliary functions (top right). The areas constrained from the SB-eukaryotic ("SB-euk") EF-1–specific auxiliary functions are shown in black spots. During the reduction of functionality of microsporidian cellular system (numeral 2), some auxiliary functions in microsporidian ("Micro") EF-1 were lost or weakened (bottom). Steps during eukaryotic evolution are shaded
In contrast to eukaryotes in general, the entire cellular system of microsporidia (including the nuclear genome structure; see Katinka et al. [2001]) is highly reduced, most probably because of their intracellular parasitic lifestyle (Keeling and Fast 2002). Likewise, some of the EF-1 auxiliary functions may have been lost/weakened during the reductive evolution of microsporidia (schematically drawn in figure 5). Actually, 62 SRS sites with small negative SR detected in the site rate pattern comparison between the SB-eukaryotic and microsporidian subtrees confirms this notion (fig. 4B). At these alignment positions, the amino acid character appears to be constant in the SB-eukaryotic subtree but variable in the microsporidian subtree (data not shown), indicating that the microsporidian proteins experience less functional constraints than the SB-eukaryotic orthologs. As the core translational functions of EF-1 are absolutely indispensable for cell viability and cannot be reduced, secondary loss of the auxiliary functions in the microsporidian proteins is the most plausible explanation (fig. 5). Under the parallel absence of eukaryotic EF-1–specific auxiliary functions in microsporidian and archaebacterial EF-1 proteins, the two subtrees could have evolved a relatively similar rate distribution independently.
Obviously, this supposed PSRV in the EF-1 data set should be examined by more accurate site rates estimated from a larger set of microsporidian EF-1 sequences in future studies. In-depth studies of the auxiliary functions of both SB-eukaryotic and microsporidian EF-1 proteins would also be essential for testing this hypothesis.
Conclusion
Here we have shown that the covarion shift across the in-group and out-group sequences most likely is responsible for the artifactual "deep" branching position of microsporidia in EF-1 phylogenies. Furthermore, we detected PSRV between the microsporidian and archaebacterial EF-1 sequences. Covarion shifts, and, in particular, PSRV, need to be pursued as intriguing potential sources of artifacts in many deep phylogenetic inferences. Indeed, a study undertaken by some of us shows analytically that under simple Markov models of sequence change, neighbor-joining and least squares distance methods will succumb to the LBA form of inconsistency when an ASRV model is employed if significant PSRV is occurring (Susko, Inagaki, and Roger, 2004). To further investigate the phylogenetic artifacts that arise from PSRV, studies on data simulated with a more complex protein evolutionary models employing both ASRV and XTSRV will be key.
Supplementary Material
EF-1 alignment considered in this study is shown in the Supplementary Material online.
Acknowledgements
We thank J. Leigh for critical reading and the members of Statistical and Evolutionary Bioinformatics Group (Dalhousie University, Halifax, Canada) for valuable discussions. We also thank H. Philippe (Université de Montreal, Montreal, Canada) for coding the Jacksite program. Y.I. is supported by the "Understanding Prokaryote Genome Diversity and Evolution Project" of Genome Atlantic. A.J.R. is supported as a Scholar by the CIAR program in Evolutionary Biology. This work is supported by operating 227085-00 from NSERC (awarded to A.J.R.). This collaboration is part of a Genome Atlantic/Genome Canada large-scale project.
Literature Cited
Arisue, N., L. B. Sanchez, L. M. Weiss, M. Muller, and T. Hashimoto. 2002. Mitochondrial-type hsp70 genes of the amitochondriate protists, Giardia intestinalis, Entamoeba histolytica and two microsporidians. Parasitol. Int. 51:9-16.
Baldauf, S. L., A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle. 2000. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972-977.
Cavalier-Smith, T., and E. E. Chao. 1996. Molecular phylogeny of the free-living archezoan Trepomonas agilis and the nature of the first eukaryote. J. Mol. Evol. 43:551-562.
Edlind, T. D., J. Li, G. S. Visvesvara, M. H. Vodkin, G. L. McLaughlin, and S. K. Katiyar. 1996. Phylogenetic analysis of beta-tubulin sequences from amitochondrial protozoa. Mol. Phylogenet. Evol. 5:359-367.
Embley, T. M., and R. P. Hirt. 1998. Early branching eukaryotes? Curr. Opin. Genet. Dev. 8:624-629.
Felsenstein, J. 1993. Phylogeny inference package. Version 3.2. Cladistics 5:166.
Felsenstein, J. 2003. Likelihood method. Pp. 248–274 in Inferring phylogenies. Sinauer Associates, Sunderland, Mass.
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579-593.
Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol. Evol. 18:866-873.
Germot, A., H. Philippe, and H. Le Guyader. 1997. Evidence for loss of mitochondria in Microsporidia from a mitochondrial-type HSP70 in Nosema locustae. Mol. Biochem. Parasitol. 87:159-168.
Hasegawa, M., and M. Fujiwara. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:1-5.
Hashimoto, T., N. Arisue, and M. Hasegawa. 2002. Application of molecular phylogenetic inference and associated problems: illustrative data analysis on early eukaryotic evolution. Proc Inst. Stat. Math. 50:45-68.
Hirt, R. P., B. Healy, C. R. Vossbrinck, E. U. Canning, and T. M. Embley. 1997. A mitochondrial Hsp70 orthologue in Vairimorpha necatrix: molecular evidence that microsporidia once contained mitochondria. Curr. Biol. 7:995-998.
Hirt, R. P., J. M. Logsdon, Jr., B. Healy, M. W. Dorey, W. F. Doolittle, and T. M. Embley. 1999. Microsporidia are related to Fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc. Natl. Acad. Sci. USA 96:580-585.
Huelsenbeck, J. P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:698-707.
Inagaki, Y., C. Blouin, E. Susko, and A. J. Roger. 2003. Assessing functional divergence in EF-1 and its paralogs in eukaryotes and archaebacteria. Nucleic Acids Res. 31:4227-4237.
Ishihara, R., and Y. Hayashi. 1968. Some properties of ribosomes from the sporoplasm of Nosema bombycis. J. Invert. Pathol. 11:377-385.
Kamaishi, T., T. Hashimoto, Y. Nakamura, Y. Masuda, F. Nakamura, K. Okamoto, M. Shimizu, and M. Hasegawa. 1996a. Complete nucleotide sequences of the genes encoding translation elongation factors 1 and 2 from a microsporidian parasite, Glugea plecoglossi: implications for the deepest branching of eukaryotes. J. Biochem. (Tokyo) 120:1095-1103.
Kamaishi, T., T. Hashimoto, Y. Nakamura, F. Nakamura, S. Murata, N. Okada, K. Okamoto, M. Shimizu, and M. Hasegawa. 1996b. Protein phylogeny of translation elongation factor EF-1 suggests microsporidians are extremely ancient eukaryotes. J. Mol. Evol. 42:257-263.
Katinka, M. D., S. Duprat, and E. Cornillot, et al. 2001 (17 co-authors). Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 414:450-453.
Keeling, P. J. 2003. Congruent evidence from alpha-tubulin and beta-tubulin gene phylogenies for a zygomycete origin of microsporidia. Fungal Genet. Biol. 38:298-309.
Keeling, P. J., and W. F. Doolittle. 1996. Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. Mol. Biol. Evol. 13:1297-1305.
Keeling, P. J., and N. M. Fast. 2002. Microsporidia: biology and evolution of highly reduced intracellular parasites. Annu. Rev. Microbiol. 56:93-116.
Keeling, P. J., M. A. Luker, and J. D. Palmer. 2000. Evidence from beta-tubulin phylogeny that microsporidia evolved from within the fungi. Mol. Biol. Evol. 17:23-31.
Kumar, S., and A. Rzhetsky. 1996. Evolutionary relationships of eukaryotic kingdoms. J. Mol. Evol. 42:183-193.
Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:1-7.
Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496-508.
Negrutskii, B. S., and A. V. El'skaya. 1998. Eukaryotic elongation factor 1: structure, expression, functions and possible aminoacyl-tRNA channeling. Prog. Nucleic Acid Res. Mol. Biol. 60:47-78.
Penny, D., and M. Hasegawa. 2001. Covarion model of molecular evolution. Pp. 473–477 in S. Brenner and J. H. Miller, eds. Encyclopedia of genetics. Academic Press, San Diego.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53:711-723.
Peyretaillade, E., V. Broussolle, P. Peyret, G. Metenier, M. Gouy, and C. P. Vivares. 1998. Microsporidia, amitochondrial protists, possess a 70-kDa heat shock protein gene of mitochondrial evolutionary origin. Mol. Biol. Evol. 15:683-689.
Philippe, H., and A. Germot. 2000. Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Mol. Biol. Evol. 17:830-834.
Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616-623.
Roger, A. J. 1996. Studies on the phylogenetic and gene structure of early-branching eukaryotes. PhD Thesis. Dalhousie University, Halifax, Canada.
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.
Slamovits, C. H., B. A. P. Williams, and P. J. Keeling. 2004. Transfer of Nosema locustae (Microporidia) to Antonospora locustae n. comb. based on molecular and ultrastructural data. J Euk. Microbiol. 51:270-213.
Sullivan, J., and D. L. Swofford. 2001. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 50:723-729.
Susko, E., Y. Inagaki, and A. J. Roger. 2004. On inconsistency of the neighbor-joining, lease squares and minimum evolutions estimation when substitution processes are incorrectly modeled. Mol. Biol. Evol. (in press).
Susko, E., Y. Inagaki, C. Field, M. E. Holder, and A. J. Roger. 2002. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol. Biol. Evol. 19:1514-1523.
Swofford, D. L., P. J. Waddell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, and J. S. Rogers. 2001. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. 50:525-539.
Tuffley, C., and M. Steel. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63-91.
Van de Peer, Y., A. Ben Ali, and A. Meyer. 2000. Microsporidia: accumulating molecular evidence that a group of amitochondriate and suspectedly primitive eukaryotes are just curious fungi. Gene 246:1-8.
Vossbrinck, C. R., J. V. Maddox, S. Friedman, B. A. Debrunner-Vossbrinck, and C. R. Woese. 1987. Ribosomal RNA sequence suggests microsporidia are extremely ancient eukaryotes. Nature 326:411-414.
Vossbrinck, C. R., and C. R. Woese. 1986. Eukaryotic ribosomes that lack a 5.8S RNA. Nature 320:287-288.
Williams, B. A., R. P. Hirt, J. M. Lucocq, and T. M. Embley. 2002. A mitochondrial remnant in the microsporidian Trachipleistophora hominis. Nature 418:865-869.(Yuji Inagaki*,, Edward Su)