The Serine Repeat Antigen (SERA) Gene Family Phylogeny in Plasmodium: The Impact of GC Content and Reconciliation of Gene and Species Trees
     * Department of Statistics, University of California, Berkeley; The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia; Swiss Institute for Experimental Cancer Research, NCCR Molecular Oncology, Bioinformatics, Lausanne, Switzerland

    E-mail: terry@stat.berkeley.edu.

    Abstract

    Plasmodium falciparum is the parasite responsible for the most acute form of malaria in humans. Recently, the serine repeat antigen (SERA) in P. falciparum has attracted attention as a potential vaccine and drug target, and it has been shown to be a member of a large gene family. To clarify the relationships among the numerous P. falciparum SERAs and to identify orthologs to SERA5 and SERA6 in Plasmodium species affecting rodents, gene trees were inferred from nucleotide and amino acid sequence data for 33 putative SERA homologs in seven different species. (A distance method for nucleotide sequences that is specifically designed to accommodate differing GC content yielded results that were largely compatible with the amino acid tree. Standard-distance and maximum-likelihood methods for nucleotide sequences, on the other hand, yielded gene trees that differed in important respects.) To infer the pattern of duplication, speciation, and gene loss events in the SERA gene family history, the resulting gene trees were then "reconciled" with two competing Plasmodium species tree topologies that have been identified by previous phylogenetic studies. Parsimony of reconciliation was used as a criterion for selecting a gene tree/species tree pair and provided (1) support for one of the two species trees and for the core topology of the amino acid–derived gene tree, (2) a basis for critiquing fine detail in a poorly resolved region of the gene tree, (3) a set of predicted "missing genes" in some species, (4) clarification of the relationship among the P. falciparum SERA, and (5) some information about SERA5 and SERA6 orthologs in the rodent malaria parasites. Parsimony of reconciliation and a second criterion—implied mutational pattern at two key active sites in the SERA proteins—were also seen to be useful supplements to standard "bootstrap" analysis for inferred topologies.

    Key Words: Plasmodium • serine repeat antigen (SERA) • reconciliation • orthology • paralogy • GC content

    Introduction

    Malaria inflicts serious health and economic burdens on many countries throughout the world. Plasmodium falciparum is responsible for the most severe form of the disease, causing more than 300 million acute illnesses and at least one million deaths—primarily in children in sub-Saharan Africa—each year. An effective vaccine and new drugs are urgently required, and one molecule of potential use in both capacities is the highly expressed blood-stage protein known as the serine repeat antigen (SERA5) (Delplace et al. 1987; Bzik et al. 1988; Knapp et al. 1989; Hodder et al. 2003).

    Recently, it has become clear that SERA5 and another well-known P. falciparum SERA5 paralog known as SERPH, or SERA6, belong to a large gene family (Knapp et al. 1991). This family includes at least nine members in P. falciparum, four members in P. yoelii, and numerous members in other Plasmodium genomes that have not yet been completely sequenced (Kiefer et al. 1996; Gor, Li, and Rosenthal 1998; Carlton et al. 2002; Gardner et al. 2002).

    Orthologous gene family members that have diverged as a result of speciation are more likely to conserve protein function than paralogous family members, which diverged as a result of gene duplication (Thornton and DeSalle 2000; Zmasek and Eddy 2001). In this paper, we aimed to shed light on aspects of SERA function by establishing the evolutionary history of the members of the SERA gene family. In particular, we sought to clarify the relationship among the P. falciparum SERAs and to identify orthologs to SERA5 and SERA6 in Plasmodium species affecting rodents.

    Both objectives required inference about the pattern of duplication, speciation, and gene loss events in the SERA gene family history. "Reconciliation" is one approach for making such inferences (Goodman et al. 1979; Mirkin, Muchnik, and Smith 1995; Guigo, Muchnik, and Smith 1996; Page and Charleston 1997; Page 1998; Thornton and DeSalle 2000; Zmasek and Eddy 2001). The reconciliation algorithm relies, however, upon a trusted or hypothesized gene tree and a species tree. (Here and throughout, we use the phrase "gene tree" to denote the evolutionary history of a gene family—which may be inferred directly from the DNA sequences or from the amino acid sequences of the derived proteins. "Species tree," in contrast, will refer to the evolutionary relationships among the species in which members of the gene family are found.)

    To produce a set of candidate gene trees, we considered SERA proteins from seven species of Plasmodium. All possess a central domain that shows homology to the papain family of cysteine proteases, although some exhibit an unusual cysteine-to-serine substitution at the active site cysteine residue (Bzik et al. 1988; Kiefer et al. 1996; Gor, Li, and Rosenthal 1998; Hodder et al. 2003). We first used the amino acid sequences from this domain to infer an unrooted gene tree.

    At the nucleotide level, GC content is relatively consistent among SERA sequences within a single Plasmodium species, but it differs widely between species. Such compositional differences have previously been shown to have an impact on phylogenetic inference, with similarity of GC content creating "spurious attraction." It has been further suggested that GC content differences may also affect phylogenetic analysis based on the corresponding amino acid sequences (Galtier and Gouy 1995, 1998; Foster and Hickey 1999). Galtier and Gouy have proposed both distance (1995) and maximum-likelihood (1998) approaches that explicitly address the issue, and to assess whether spurious attraction may have influenced the amino acid tree, we also applied their distance approach to the SERA nucleotide sequences.

    For candidate species trees, we reviewed the literature. In previous studies that have included some or all of these species, the use of different genes—and even of different inference methods applied to the same gene—has produced two contradictory results, shown in figure 1 (Escalante and Ayala 1994; McCutchan et al. 1996; Templeton and Kaslow 1997; Escalante et al. 1998; Rich and Ayala 2000; Rathore et al. 2001; Michon et al. 2002; Perkins and Schall 2002). It is not clear that a single species tree need apply to the entire Plasmodium genome, and we, therefore, considered both alternatives as competing hypotheses.

    FIG. 1.— In previous phylogenetic studies involving some or all of the seven Plasmodium species from which presumptive SERA homologs were obtained, two conflicting species trees have appeared. The trees differ only in their placement of the closely related P. falciparum/P. reichenowi pair.

    To select a final gene/species tree pair, we applied a parsimony criterion, attempting to minimize the implied number of duplication and gene loss events in the gene family history. The results of this process provided (1) support for one of the two species trees, (2) a basis for critiquing fine detail in a poorly resolved region of the gene tree, (3) a set of predicted "missing genes" in some species, (4) clarification of the relationship among the P. falciparum SERAs, and (5) some information about SERA5 and SERA6 orthologs in the rodent malaria parasites.

    Methods

    Sequence Data

    Amino acid and nucleotide sequence data for 33 presumptive SERA genes from seven different Plasmodium species were obtained using BlastP and TBlastN against GenBank and all the available Plasmodium databases at NCBI, PlasmoDB (Bahl et al. 2002, http://plasmodb.org), the Sanger Centre, and TIGR. (Sequence data for P. falciparum chromosomes 2 and 9 were obtained from the Sanger Institute Web site at http://www.sanger.ac.uk/Projects/P_falciparum/. Sequencing of P. falciparum chromosomes 2 and 9 was accomplished as part of the Malaria Genome Project with support by the Wellcome Trust. Preliminary sequence data for P. vivax and P. yoelii were obtained from the Institute for Genomic Research through the Web site at http://www.tigr.org. Preliminary sequence data for P. chabaudi, P. knowlesi, and P. reichenowi were obtained from the Wellcome Trust Sanger Institute Pathogen Sequencing Unit's partial shotgun sequencing of these genomes.) P. falciparum SERA8, a suspected pseudogene, was excluded from the analysis (Miller et al. 2002). Plasmodium reichenowi reads were used to construct additional sequences for incorporation into the tree. Reads were initially assigned to one or more potential P. falciparum SERA genes via homology search and were assembled by hand and with the aid of the TIGR assembler. Although partial P. reichenowi sequences with homology to all P. falciparum genes except SERA3 were found, only five were complete enough for inclusion in the present analysis. The amino acid and nucleotide sequences are available as Supplementary Material online.

    Gene Tree Inference

    The SERA proteins possess a central domain that shows strong homology to the papain family of cysteine proteases (Bzik et al. 1988; Kiefer et al. 1996; Gor, Li, and Rosenthal 1998). All amino acid sequence data for the 33 putative homologs were aligned using ClustalW with default parameters (Thompson, Higgins, and Gibson 1994), and approximately 266 residues from this readily alignable protease domain were used for subsequent analysis.

    Pairwise distances were computed as maximum-likelihood estimates of the expected number of substitutions per site, under the Jones-Taylor-Thornton (JTT) matrix substitution model (Jones, Taylor, and Thornton 1992). These distances were calculated with PROTDIST, and unrooted trees were constructed with the neighbor-joining algorithm (Saitou and Nei 1987) as implemented in NEIGHBOR, both in the PHYLIP package (Felsenstein 1996). Except for choice of substitution model, default parameters were used here and below.
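
    As a rough illustration of the tree-building step, the following is a minimal sketch of the neighbor-joining procedure of Saitou and Nei (1987) applied to a pairwise distance matrix. It is not the NEIGHBOR program from PHYLIP; the function name, the node labels, and the toy distances are all assumptions made for illustration, and the distances are not SERA data.

```python
def neighbor_joining(names, dist):
    """Sketch of neighbor-joining on a symmetric distance table.

    names: list of leaf labels.
    dist:  dict of dicts with dist[a][b] defined for every pair a != b.
    Returns a list of joins (child_a, length_a, child_b, length_b, parent);
    the final entry connects the last two nodes and has parent None.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    joins = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair that minimizes the neighbor-joining criterion Q.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to each joined child.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        parent = f"node{next_id}"  # hypothetical label for the new node
        next_id += 1
        joins.append((a, len_a, b, len_b, parent))
        # Distances from the new node to every remaining node.
        d[parent] = {}
        for c in nodes:
            if c not in (a, b):
                d[parent][c] = d[c][parent] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [parent]
    a, b = nodes
    joins.append((a, d[a][b], b, 0.0, None))  # single edge joining the last two
    return joins


# Toy example with made-up distances between five hypothetical sequences.
labels = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy_dist = {
    x: {y: matrix[i][j] for j, y in enumerate(labels) if i != j}
    for i, x in enumerate(labels)
}
toy_joins = neighbor_joining(labels, toy_dist)
```

    The result is an unrooted topology expressed as a join history; rooting is a separate decision, which the text addresses through reconciliation.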

    To assess the degree to which the inferred topology depended on choice of substitution model, an amino acid pairwise distance matrix was also computed using the Dayhoff PAM model (Dayhoff 1979). The resulting neighbor-joining tree was topologically identical to the JTT tree.

    Neighbor-joining trees were also constructed for two additional distance matrices computed from the corresponding DNA sequences. The GC content for the aligned regions differed substantially from one species to the next (table 1 and fig. 2). To assess the impact, if any, of this compositional bias on the JTT gene tree, both the Galtier and Gouy substitution model (which explicitly addresses varying GC content) and the PHYLIP default substitution model for DNA sequence data were applied (Galtier and Gouy 1995; Felsenstein 1996). We will refer to the two DNA substitution models as "GG" and "F84." (Note that nucleotide data were available for only one of the five P. vivax sequences.)
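
    For readers who wish to check compositional differences in their own alignments, GC content can be computed as in the short sketch below. The function name and the treatment of gaps and ambiguous bases (both ignored) are assumptions, and the example strings are made up rather than taken from the SERA alignment.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T).

    Gap characters and ambiguity codes are skipped; this convention is an
    assumption of the sketch, not something stated in the paper.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


# Hypothetical aligned fragments, keyed by made-up sequence labels.
toy_fragments = {"seq_low_gc": "ATTAATGCAT--", "seq_high_gc": "GCCGATGCGGCN"}
toy_gc = {name: gc_content(seq) for name, seq in toy_fragments.items()}
```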

    Table 1 SERA Gene Family Sequences Used for Phylogenetic Inference

    FIG. 2.— GC content for nucleotide sequences corresponding to the protease domain in the papain homolog. (This region was used for all gene tree inference.) Within-species variability is low, but there are substantial differences between species.

    For all trees, SEQBOOT in the PHYLIP package was used to compute bootstrap support for key internal edges (Felsenstein 1985, 1996). We used 200 bootstrap trees per substitution model. (We suggest that the reported bootstrap values act as rough guides only and that they serve as just one among a set of criteria—that is, parsimony of reconciliation, implied mutational pattern at active site residues, and support under multiple substitution models—that can be used to assess the reliability of inferred topologies.)
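
    The resampling idea behind bootstrap support can be sketched as follows: alignment columns are drawn with replacement to form each replicate, a tree is inferred from each replicate, and support for an edge is the percentage of replicate trees that contain it. This is only an illustration of the principle, not the SEQBOOT code; the function names and the toy alignment are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name to an aligned string.
    rng:       a random.Random instance, so replicates are reproducible.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][c] for c in columns) for n in names}


def support_percent(edge_present):
    """Percentage of replicates in which an edge of interest was recovered.

    edge_present: iterable of booleans, one per bootstrap tree.
    """
    flags = list(edge_present)
    return 100.0 * sum(flags) / len(flags)


# Made-up four-column alignment; 200 replicates, matching B = 200 in the text.
toy_alignment = {"g1": "CSHN", "g2": "CSHN", "g3": "SSMN"}
toy_replicates = [
    bootstrap_alignment(toy_alignment, random.Random(seed)) for seed in range(200)
]
```

    Each replicate would then be passed through the same distance and tree-building steps as the original alignment.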

    Reconciliation of Gene and Species Trees

    Several authors have presented a straightforward technique for reconciling a rooted gene tree with a trusted or hypothesized species tree (Goodman et al. 1979; Guigo, Muchnik, and Smith 1996; Page and Charleston 1997; Thornton and DeSalle 2000; Zmasek and Eddy 2001). This technique assumes that (1) the evolutionary histories of both the gene family and the species from which the genes derive can be represented as binary trees, and (2) members of the gene family have arisen through duplication, speciation, and deletion events only, not through horizontal gene transfer.

    The algorithm provides a means of selecting the set of internal nodes in the gene tree that represent duplication events. In addition, unobserved genes (genes that have disappeared through deletion, have mutated to such an extent that they are no longer recognizable members of the gene family, or have simply been overlooked or omitted from analysis) are inferred as well. A minimal set of such genes, together with their most-parsimonious placement on the gene tree, is returned along with the list of inferred duplication events. The reconciliation algorithm was applied to the SERA gene family using the GeneTree software package (Page 1998).

    Although a detailed description of the reconciliation algorithm can be found in the references, we give a brief overview here for convenience. We first present the reconciliation algorithm informally, in a way that highlights the intuition upon which it is based:

    Under assumptions (1) and (2) above, a split in the gene family tree that resulted from a speciation event must, by definition, lead to two subtrees composed of genes from disjoint sets of species. Splits that correspond to duplication events, on the other hand, lead to two subtrees composed of genes from identical sets of species—provided, of course, that no genes are deleted or missing. Even in the more realistic case, where gene deletion and incomplete gene family data are permitted, if the sets of species included in the two subtrees arising from a given internal node contain at least one common member, we infer that the node represents a duplication event. If, on the other hand, these two sets are disjoint and, moreover, they reflect a division of species that is consistent with the hypothesized species tree, then we infer that the node represents a speciation event.

    The reconciliation algorithm can be stated formally as follows: Let $S$ and $G$ denote the set of nodes of the species tree and gene tree, respectively. (Both trees are assumed to be rooted and binary.) For any $g \in G$, define $\sigma(g)$ to be the set of species contained in the subtree beginning at node $g$ and more recent than $g$. Define $\sigma(s)$ similarly for any $s \in S$.

    We may now define a map from $G$ to $S$: for every $g \in G$, let $M(g)$ be the lowest (most recent) $s \in S$ for which $\sigma(g) \subseteq \sigma(s)$. Now for any internal $g \in G$, with child nodes $g_1$ and $g_2$, we infer that $g$ represents a duplication event if and only if $M(g)$ is equal to either $M(g_1)$ or $M(g_2)$; that is, if the node $g$ maps to the same position in the species tree as one of its children.

    As discussed previously, in the simplest case in which no genes are missing or have been deleted, a duplication at node $g$ produces identical sets of species in both subsequent subtrees. Thus, $\sigma(g) = \sigma(g_1) = \sigma(g_2)$, and so $M(g) = M(g_1) = M(g_2)$. It is straightforward to show that if $\sigma(g_1) \cap \sigma(g_2)$ is nonempty, then $M(g)$ equals either $M(g_1)$ or $M(g_2)$, and the informal description given above is equivalent to the formal algorithm.
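
    The mapping M(g) and the duplication rule can be sketched in a few lines. The code below is an illustrative sketch only and is not the GeneTree implementation: trees are represented as nested two-element tuples with string leaves, and the species tree, gene names (fA, fB, rA, vA), and function names are all hypothetical.

```python
def leaves(tree):
    """Set of leaf labels below a node of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)


def map_to_species_tree(species_below, species_tree):
    """M(g): lowest species-tree node whose leaf set contains species_below."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if species_below <= leaves(child):
                node = child  # a child still contains every species; descend
                break
        else:
            return node  # neither child suffices, so this node is the lowest
    return node


def reconcile(gene_tree, gene_to_species, species_tree):
    """Return (species set below g, M(g), list of inferred duplication nodes)."""
    if isinstance(gene_tree, str):
        species = frozenset([gene_to_species[gene_tree]])
        return species, map_to_species_tree(species, species_tree), []
    (s1, m1, d1), (s2, m2, d2) = (
        reconcile(child, gene_to_species, species_tree) for child in gene_tree
    )
    species = s1 | s2
    mapped = map_to_species_tree(species, species_tree)
    duplications = d1 + d2
    # Duplication if g maps to the same species-tree node as one of its children.
    if mapped == m1 or mapped == m2:
        duplications.append(gene_tree)
    return species, mapped, duplications


# Toy data: three species, four hypothetical genes.
toy_species_tree = (("falciparum", "reichenowi"), "vivax")
toy_gene_to_species = {
    "fA": "falciparum", "fB": "falciparum", "rA": "reichenowi", "vA": "vivax",
}
toy_gene_tree = ((("fA", "rA"), "fB"), "vA")
_, toy_root_map, toy_duplications = reconcile(
    toy_gene_tree, toy_gene_to_species, toy_species_tree
)
```

    In this toy case the only node flagged as a duplication is the one joining the (fA, rA) pair with fB, because both that node and its (fA, rA) child map to the falciparum/reichenowi ancestor.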

    In addition to inferring which nodes in the gene tree represent speciation events and which represent duplication events, the algorithm also permits inference of a minimal set of missing or deleted genes, as well as the most-parsimonious placement of such genes in the gene tree. Briefly, a node $g \in G$, which is inferred to be a duplication, is assumed to propagate all species in $\sigma(M(g))$ into both child subtrees; any members of $\sigma(M(g))$ not found in the subtree children of $g$ are inferred to be lost or missing. The inferred deletion events are then placed so as to account for multiple missing genes with fewer deletion events and to require fewer missing genes. A complete explanation of the parsimonious placement of such lost or missing genes can be found in the references.
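
    The first step of this inference, listing which species are expected but absent below each child of a duplication node, can be sketched as below. The sketch deliberately stops there: it does not attempt the parsimonious placement of deletion events described in the references. The function name and the example species sets are assumptions.

```python
def missing_after_duplication(species_at_mapped_node, child_species_sets):
    """Species expected in each child subtree of a duplication but not observed.

    species_at_mapped_node: species below the species-tree node that the
                            duplication node maps to.
    child_species_sets:     the observed species sets of the two child subtrees.
    Returns one sorted list of missing species per child.
    """
    expected = frozenset(species_at_mapped_node)
    return [sorted(expected - frozenset(child)) for child in child_species_sets]


# Hypothetical duplication mapping to a falciparum/reichenowi ancestor, where
# one gene copy was found in both species and the other only in falciparum.
toy_missing = missing_after_duplication(
    {"falciparum", "reichenowi"},
    [{"falciparum", "reichenowi"}, {"falciparum"}],
)
```

    Here the second child lineage would be inferred to lack a reichenowi gene, which could reflect either a true deletion or a sequence that has not yet been found.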

    The set of inferred duplications and missing or deleted genes is of inherent interest, and it also provides a parsimony-based criterion for rooting unrooted trees and for selecting among alternative gene and species tree topologies. If duplication and true deletions (as opposed to genes that are present in the organism but were overlooked in the analysis) are assumed to be rare, then it is sensible, ceteris paribus, to minimize the total number of such events. The use of such a criterion amounts to a maximum-parsimony approach to duplication and gene loss events. Although we do not utilize it here, an explicit-probability model for such events has recently been proposed, permitting a maximum-likelihood approach as well (Arvestad et al. 2003). Maximum-likelihood inference has several advantages over cruder parsimony arguments, such as the ability to compute standard errors and confidence intervals. The validity of such computed quantities, however, hangs on the assumption of homogeneity of the birth-death process throughout the evolutionary history of the gene family and on our ability to accurately model this process.

    Results

    Gene Family Trees

    As mentioned above, only the SERA protein protease domain was used for phylogenetic inference. This domain aligned readily, with minimal introduction of gaps (six or fewer gap characters per sequence). Pairwise amino acid similarity for this domain ranged from 49.6% to 99.5%, with an average of 61.0%.
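
    The text does not state how pairwise similarity was defined. One common choice, assumed here purely for illustration, is percent identity over alignment columns in which neither sequence has a gap; the sketch below computes that quantity and summarizes it over all pairs. The function names and toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of identical residues over columns where neither sequence is gapped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def summarize_identity(alignment):
    """Return (minimum, maximum, mean) percent identity over all sequence pairs."""
    values = [
        percent_identity(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    ]
    return min(values), max(values), sum(values) / len(values)


# Made-up aligned fragments, not SERA sequences.
toy_summary = summarize_identity({"x": "ACDF", "y": "ACDF", "z": "ACEF"})
```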

    Figure 3A shows the unrooted gene tree returned by the neighbor-joining algorithm when applied to the JTT distance matrix. Bootstrap support values (percentages, with B = 200) for several important internal edges are superimposed. In addition, the amino acids present at two sites that are important for proteolytic activity are indicated. Note that the cysteine and serine protease families segregated, with 100% bootstrap support. Two further subsets of the serine family also feature a second active site mutation: histidine to methionine in one case, and to leucine in the other. Again, there was 100% bootstrap support for segregation by residue at this second active site. The fact that the inferred gene tree neatly accommodated these unusual mutations at two sites with functional significance in the papain homolog lends support to its core topology.

    FIG. 3.— (A) Gene tree inferred from amino acid data using the JTT substitution model and neighbor-joining. Bootstrap support values (%) shown for some internal edges. The large shaded region indicates the serine protease family; sequences not in this region have cysteine at the key active site residue. Within the serine family, two subfamilies—identified by another unusual histidine-to-methionine or histidine-to-leucine active site mutation—are also indicated. The relative configuration of subtrees I, II, and III plays an important role in reconciliation. Gene trees were also inferred from nucleotide sequences using (B) the Galtier and Gouy (1995) model, which accounts for nonhomogeneous GC content, and (C) the Felsenstein (1985) model, which does not account for nonhomogeneous GC content. Note the change in position for f6 and f7/r7 in both nucleotide-based trees. (D) Reconciliation suggests repositioning the f6 and f7/r7 leaves in the JTT tree.

    Figure 3B shows the distance tree inferred from nucleotide sequence data using the Galtier and Gouy substitution model, which explicitly permits varying GC content throughout the tree. In this tree, we again see segregation by amino acid residue at the two key active sites described above, with near perfect bootstrap support. There was a rearrangement among P. vivax sequences in the serine-leucine subtree, but it has little impact on our analysis. Of more interest is the shift in position for f6 and f7/r7. These sequences play an important role in the reconciliation of gene and species trees, and we will return to them shortly.

    Another point worth noting is that bootstrap support for the relationship among the three main subtrees of the serine family—the P. falciparum/P. reichenowi subtree (I), the serine/methionine subtree (II), and the P. vivax/P. knowlesi subtree (III)—was substantially weaker than that seen in the amino acid tree. The relative location of these subtrees did not change, but the small edge separating I and II in this tree had only 43% bootstrap support. Further, table 2 shows that among the bootstrap trees considered, two alternative configurations for the three subtrees appeared with nearly equal frequency. Thus, nucleotide data used in combination with the Galtier and Gouy substitution model were insufficient for resolution of the core topology in this region of the gene tree.

    Table 2 Bootstrap Analysis of Gene Trees

    In figure 3C, we see another distance tree inferred from nucleotide data, this time using the F84 substitution model. We first note that the f6 and f7/r7 nodes were again not where they appeared in the JTT tree. Indeed, none of the methods we considered was able to place these nodes with confidence.

    We next note the suspicious exchange of the k3-k4 and v2-k5 branches in subtree III. GC content for all sequences in the v1-v3-v4-v5b group is above 50%. GC content for v2 and k5 is 51% and 45%, respectively, compared with only 37% for both k3 and k4. One is tempted, therefore, to attribute this swap to "attraction" between sequences of similar GC content. It is doubtful, however, that the F84 tree correctly represents the configuration of sequences in this subtree: as mentioned above, the active site histidine-to-leucine mutation is unusual. In the absence of good evidence to the contrary, parsimony suggests that a gene tree requiring such a mutation to occur twice in the gene family's evolutionary history—as is the case for the F84 tree shown—is less plausible than a gene tree that requires only one such mutation.

    The exchange of position of subtrees I and II is interesting—the possibility of such an exchange will come up in the context of gene and species tree reconciliation—but bootstrap support for this exchange was very poor. In fact, the core topology of the F84 gene tree proved to be highly unstable among the bootstrap trees. As can be seen in table 2, the depicted F84 gene tree was not even the consensus tree among bootstrap results. The consensus tree (not shown) placed subtree III at B (fig. 3C), in close proximity to the P. vivax and P. knowlesi cysteine proteases. That subtree III should move to A or B—adjacent to the other high-GC P. vivax and P. knowlesi sequences—is again what one would expect if GC content induced attraction among sequences of similar composition. As will be discussed below, placement of subtree III at A creates problems for reconciliation unless a new, third species tree previously unreported in the literature is entertained. Placement of subtree III at B is even less justified. It requires both a novel species tree for satisfactory reconciliation and that the unusual cysteine-to-serine mutation has occurred twice in the SERA family's evolutionary history.

    Reconciliation: Root Placement, Duplications, and Deletions

    Parsimony suggests that needless inference of gene duplication or loss should be avoided, and this criterion provides a basis for rooting the gene trees produced by neighbor-joining. (More specifically, it suggests an optimal internal edge but does not provide for exact placement of the root along this edge.) For the topology found in the JTT and GG gene trees, placement of the root on the long edge connecting the cysteine and serine protease families substantially reduced the number of lost or missing genes implied by reconciliation (using either of the two species trees). This edge was also selected by the traditional "midpoint" method.

    Figure 4 shows the pattern of duplication and missing or lost genes inferred by reconciliation of the so-rooted JTT tree with species trees #1 and #2. In both cases, a substantial number of lost genes were required. Replacing the JTT tree with the GG tree produced qualitatively similar results (not shown), and a summary of the number of implied duplications and deletions is given in table 3.

    FIG. 4.— (A) and (B) Attempts to reconcile the original JTT tree with both hypothesized species trees required numerous duplication events and missing or deleted gene family members. Small squares denote inferred duplication events. Dashed branches and corresponding leaves denote a minimal set of missing or deleted genes. Here, branch length was selected for clarity only and is not proportional to estimated evolutionary distance. (C) After repositioning f6 and f7/r7 as shown in figure 3D, the modified JTT tree was again reconciled with species tree #2.

    Table 3 Duplications and Deleted/Missing Genes Implied by Reconciliation

    Most of the inferred duplications arise from subtrees I and III in the serine family, and indeed the large number of genes in these subtrees can only be explained by numerous duplication events. Note that the number of actual "events" may be reduced if chunks of the genome that already contained several paralogs were duplicated in a single event. The f1 through f7 genes, for example, all lie within a small region (35 kb) on P. falciparum chromosome 2 and exhibit the same orientation; it is conceivable that some subset of these genes may have arisen through duplication of a block of tandem repeats (Miller et al. 2002). Note also that although we do not here use a probabilistic model of gene duplication and loss, such as that given by Arvestad et al. (2003), the possibility of such block events creates additional difficulties for modeling.

    With respect to the missing or deleted genes, gene deletion is commonplace in the evolutionary history of many organisms. In addition, it is highly likely that some SERA family members were overlooked, particularly from species such as P. vinckei, P. chabaudi, and P. reichenowi, for which genome data are still incomplete. Nonetheless, the significant number of lost or missing genes inferred for the better-studied genomes suggests that we may be able to improve on this result.

    Modifications to Improve Reconciliation

    A substantially more parsimonious reconciliation of the gene and species trees is possible if we consider configurations around low confidence internal edges of the inferred gene trees to be flexible. In particular, bootstrap support for the placement of f6 and the f7/r7 pair in all gene trees was poor, and the placement of these genes was seen to be sensitive to inference method. If we accept the JTT/GG configuration of major subtrees in the serine family as correct but reposition f6 and f7/r7 as shown in figure 3D, the number of missing or deleted genes required for reconciliation with species tree #2 is substantially reduced, from as many as 25 to eight, all but one of which are in less-studied or incomplete genomes. If we consider instead the minimum number of deletion events required for reconciliation (loss of a single gene early in the SERA family's evolutionary history can produce multiple missing genes among present-day sequences), a comparable improvement is obtained, from as many as 14 to just seven. Figure 4C depicts the inferred pattern of duplication and missing or deleted genes after the suggested modifications were made to the JTT gene tree.

    If species tree #1 is to be used instead, a similar improvement in parsimony of reconciliation can also be achieved by exchanging subtrees I and II. (The f6 branch must also be repositioned slightly.) However, such a modification seems less justifiable because this configuration for the serine protease family subtrees appeared in only 14% of JTT bootstrap trees. (It did appear with slightly higher frequency among F84 bootstrap trees and just under half the time among GG bootstrap trees, but the low bootstrap support for any particular configuration of serine family subtrees in the GG and F84 analyses indicates that the nucleotide sequence data were insufficient for resolution of the subtree configuration.)

    Although it has little bearing on the larger analysis, one might also consider minor adjustments among f2/r2 and f4/r4 to reduce the number of inferred duplication events by one: bootstrap support for the f2/f4 clade shown in the JTT tree was a modest 64%.

    For completeness, we note that if subtree III is moved to A in figure 3C—or even to B as suggested by the F84 consensus tree, albeit with a different root location—comparable improvements in reconciliation parsimony can be obtained by using a species tree that features a P. vivax/P. knowlesi outgroup. (Again, f6 and f7/r7 must be repositioned slightly.) Because we have not seen such a species tree in the literature, and because there is little support for the required subtree swap in any of the inferred gene trees, we discard this configuration.

    Discussion

    Although we will rely upon the JTT gene tree produced from amino acid data for our final analysis, it is fruitful to compare this tree with gene trees inferred from the corresponding nucleotide data. In particular, we see the following: (1) The F84 model's failure to account for differing GC content between the Plasmodium species seemed to produce spurious attraction between sequences with similar GC content. A full maximum-likelihood gene tree (not shown) inferred with DNAML in PHYLIP exhibited similar GC attraction. (2) Although the impact of differing GC content has in some cases been found to carry through to amino acid–based inference (Foster and Hickey 1999), it does not appear to have been a problem with the SERA data: the JTT and GG trees differed only in trivial respects. (This may be caused in part, however, by the use of a short and relatively well-conserved domain in the present analysis.) (3) Not surprisingly, amino acid sequence data seemed to provide a better basis for inference. In table 2, we see that the GG model was unable to resolve the configuration of major subtrees in the serine protease family. The JTT model, on the other hand, produced stronger—although not ideal—bootstrap support for the inferred configuration.

    Although the JTT tree improved on both nucleotide-based trees, it could not be reconciled with either candidate species tree unless we accept that many SERA genes—including some in the better-studied species—have disappeared during the history of the gene family or have yet to be turned up among available sequence data, and that there are nontrivial hidden duplications and unanticipated paralogies among the SERA family.

    If, instead, we admit the existence of small errors in the inferred gene tree and make appropriate modifications, a substantially more satisfactory reconciliation can be achieved. Given that our stochastic substitution models are approximations at best, it makes sense to regard fine detail and small internal edges in inferred phylogenies with skepticism, and to entertain reasonable modifications to such features.

    To achieve satisfactory reconciliation with species tree #1, we had to exchange subtrees I and II in the serine protease family, and also reposition f6 slightly. The latter modification is easily justified, but bootstrap analysis of the JTT gene tree provided moderately strong evidence against the former and, thus, against use of species tree #1. For satisfactory reconciliation with species tree #2, on the other hand, only the f6 and f7/r7 nodes required minor repositioning. Because none of the models could place these nodes definitively, we feel that substantial gains in parsimony of reconciliation justify such modifications. (Curiously, the inferred set of missing genes under the species tree #1 scenario is identical to that produced when species tree #2 is used, so the composition of this set provides no further basis for selecting one scenario over the other. It is also interesting to note that the quantity of available Plasmodium sequence data in public databases increased significantly during the writing of this paper. Although they were not included in the present analysis, sequences that appear to correspond to three of the missing P. chabaudi genes in figure 4C, the predicted P. vivax ortholog of k2, and two of the three missing P. reichenowi genes are now available.)

    In summary, we conclude that the SERA family gene tree should be rooted as shown in figure 5, with f6 and f7/r7 repositioned as suggested by reconciliation. In repositioning these nodes, an effort was made to preserve relative pairwise distances from the original JTT tree, but depicted edge lengths in this region of the tree remain somewhat arbitrary. Parsimony of reconciliation provides no information about evolutionary distances, and future work to develop a method for smoothly integrating the results of reconciliation is still required.

    FIG. 5.— The rooted JTT gene tree, including modifications suggested by reconciliation with species tree #2, is shown with branch length proportional to estimated evolutionary distance. Repositioned edges and leaves are shown with dashed lines. Subfamilies defined by residues at two active sites are shown with shaded bands, and the most parsimonious placement of the mutations that produced these subfamilies is indicated. Small squares denote inferred duplication events.

    We also conclude that the inferred phylogeny of the SERA gene family is most consistent with species tree #2. This result should not, however, be construed as a definitive resolution of the species tree conflict. Lateral transfer of genetic material between organisms may produce distinct topologies for gene and species lineages (Guigo, Muchnik, and Smith 1996; Thornton and DeSalle 2000; Rathore et al. 2001), and lateral transfer of genetic material among Plasmodium species is likely (Rathore et al. 2001; Conway and Baum 2002). In other words, it is possible that different regions of the Plasmodium genome have distinct evolutionary histories, and it need not be the case that only one of the two species tree alternatives is correct.

    Our analysis has several implications for SERA biology. Clearly, there are two distinct evolutionary groupings of SERA proteins that separate according to the residue (cysteine or serine) in the catalytic position, and it would appear that at least two of each type are present in each Plasmodium species. It is also evident that there are very different numbers of SERA proteins in the different Plasmodia. P. falciparum, for example, has nine genes, whereas species affecting rodents (P. yoelii, P. vinckei, and P. chabaudi) appear to have only four. Furthermore, there has been substantially more duplication and sequence divergence among the serine family than among the cysteine family. This implies either that the former group is under greater selection pressure or that the latter is under greater functional constraint.

    Several questions remain. In many instances, it is still not clear which SERA proteins are likely to perform the same biological role across the species. Among species affecting rodents, for example, what are the functional orthologs of the dominantly expressed and apparently essential P. falciparum SERA5 and SERA6 genes? (The gene tree in figure 4C gives two alternatives for each.) Why do different species exhibit different numbers of SERA genes? What selective forces are promoting gene duplication (especially among the serine family), and to what degree are such forces species specific? Do the SERA paralogs within a single species all have distinct biological roles? In P. falciparum, expression profiling has shown that, although coregulated, different SERA genes are expressed at very different levels and that this profile is common across laboratory lines; that is, SERA genes are not differentially expressed like members of some other antigen-encoding gene families such as the var and rifin genes (Miller et al. 2002; Aoki et al. 2002). Furthermore, gene knockout analysis has shown that only strongly expressed SERA genes appear to be important for blood-stage growth (Miller et al. 2002).

    Such evidence, together with the phylogenetic analysis described in this paper, indicates that the number of different functions performed by the SERA proteins in blood stages is likely to be far more limited than the total number of family members would suggest—perhaps to as few as two different roles that separate according to the nature of the catalytic residue in the active site. As dominantly expressed blood-stage antigens, SERA proteins—most notably SERA5 and SERA6—are established vaccine candidates and/or emerging drug targets. The likelihood that other P. falciparum SERAs may have precisely the same biological function (perhaps simply requiring expression to the appropriate level to be fully functionally complementary) suggests that effective vaccines or drugs targeting the SERAs may need to target multiple family members.

    Acknowledgements

    Sequence data for P. falciparum chromosomes 2 and 9 were obtained from the Sanger Institute Web site at http://www.sanger.ac.uk/Projects/P_falciparum/. Sequencing of P. falciparum chromosomes 2 and 9 was accomplished as part of the Malaria Genome Project with support from the Wellcome Trust. Preliminary sequence data for P. chabaudi, P. knowlesi, and P. reichenowi were obtained from the Wellcome Trust Sanger Institute Pathogen Sequencing Unit's partial shotgun sequencing of these genomes. This work is being done in collaboration with Chris Newbold (Oxford), Mike Turner (Glasgow), and Karen Day (Oxford). Sequencing of P. vivax was accomplished with support from the National Institute of Allergy and Infectious Diseases, and the U. S. Department of Defense. Sequencing of P. yoelii was done in collaboration with the Naval Medical Research Center, with support from the U. S. Department of Defense through cooperative agreement with the U. S. Army and Naval Medical Research Center. The authors wish to thank Jake Baum and the two anonymous reviewers. Support was provided by the National Science Foundation, the Howard Hughes Medical Institute, and the National Health and Medical Research Council of Australia.

    References

    Aoki, S., J. Li, S. Itagaki, B. A. Okech, T. G. Egwang, H. Matsuoka, N. M. Q. Palacpac, T. Mitamura, and T. Horii. 2002. Serine repeat antigen (SERA5) is predominantly expressed among the SERA multigene family of Plasmodium falciparum, and the acquired antibody titers correlate with serum inhibition of the parasite growth. J. Biol. Chem. 277:47533–47540.

    Arvestad, L., A. C. Berglund, J. Lagergren, and B. Sennblad. 2003. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19(suppl 1):i7–i15.

    Bahl, A., B. Brunk, R. L. Coppel et al. (17 co-authors). 2002. PlasmoDB: the Plasmodium genome resource. An integrated database providing tools for accessing and analyzing mapping, expression, and sequence data (both finished and unfinished). Nucleic Acids Res. 30:87–90.

    Bzik, D. J., W. B. Li, T. Horii, and J. Inselburg. 1988. Amino acid sequence of the serine-repeat antigen (SERA) of Plasmodium falciparum determined from cloned cDNA. Mol. Biochem. Parasitol. 30:279–288.

    Carlton, J. M., S. V. Angiuoli, B. B. Suh et al. (44 co-authors). 2002. Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 419:512–519.

    Conway, D. J., and J. Baum. 2002. In the blood—the remarkable ancestry of Plasmodium falciparum. Trends Parasitol. 18:351–355.

    Dayhoff, M. O. 1979. Atlas of protein sequence and structure, Vol. 5, 1978. National Biomedical Research Foundation, Washington, DC.

    Delplace, P., B. Fortier, G. Tronchin, J. F. Dubremetz, and A. Vernes. 1987. Localization, biosynthesis, processing and isolation of a major 126 kDa antigen of the parasitophorous vacuole of Plasmodium falciparum. Mol. Biochem. Parasitol. 23:193–201.

    Escalante, A. A., and F. J. Ayala. 1994. Phylogeny of the malarial genus Plasmodium, derived from rRNA gene sequences. Proc. Natl. Acad. Sci. USA 91:11373–11377.

    Escalante, A. A., D. E. Freeland, W. E. Collins, and A. A. Lal. 1998. The evolution of primate malaria parasites based on the gene encoding cytochrome b from the linear mitochondrial genome. Proc. Natl. Acad. Sci. USA 95:8124–8129.

    Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.

    ———. 1996. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266:418–427.

    Foster, P. G., and D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284–290.

    Galtier, N., and M. Gouy. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92:11317–11321.

    ———. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–879.

    Gardner, M. J., N. Hall, E. Fung et al. (45 co-authors). 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419:498–511.

    Goodman, M., J. Czelusniak, G. W. Moore, A. E. Romero-Herrera, and G. Matsuda. 1979. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 28:132–168.

    Gor, D. O., A. C. Li, and P. J. Rosenthal. 1998. Protective immune responses against protease-like antigens of the murine malaria parasite Plasmodium vinckei. Vaccine 16:1193–1202.

    Guigo, R., I. Muchnik, and T. F. Smith. 1996. Reconstruction of ancient phylogenies. Mol. Phylogenet. Evol. 6:189–213.

    Hodder, A. N., D. R. Drew, V. C. Epa et al. (12 co-authors). 2003. Enzymic, phylogenetic, and structural characterization of the unusual papain-like protease domain of Plasmodium falciparum SERA5. J. Biol. Chem. 278:48169–48177.

    Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282.

    Kiefer, M. C., K. A. Crawford, L. J. Boley, K. E. Landsberg, H. L. Gibson, D. C. Kaslow, and P. J. Barr. 1996. Identification and cloning of a locus of serine repeat antigen (SERA)-related genes from Plasmodium vivax. Mol. Biochem. Parasitol. 78:55–65.

    Knapp, B., E. Hundt, U. Nau, and H. A. Kupper. 1989. Molecular cloning, genomic structure and localization in a blood stage antigen of Plasmodium falciparum characterized by a serine stretch. Mol. Biochem. Parasitol. 32:73–83.

    Knapp, B., U. Nau, E. Hundt, and H. A. Kupper. 1991. A new blood stage antigen of Plasmodium falciparum highly homologous to the serine-stretch protein SERP. Mol. Biochem. Parasitol. 44:1–13.

    McCutchan, T. F., J. C. Kissinger, M. G. Touray, M. J. Rogers, and J. Li. 1996. Comparison of circumsporozoite proteins from avian and mammalian malarias: biological and phylogenetic implications. Proc. Natl. Acad. Sci. USA 93:11889–11894.

    Michon, P., J. R. Stevens, O. Kaneko, and J. H. Adams. 2002. Evolutionary relationships of conserved cysteine-rich motifs in adhesive molecules of malaria parasites. Mol. Biol. Evol. 19:1128–1142.

    Miller, S. K., R. T. Good, D. R. Drew, M. Delorenzi, P. R. Sanders, A. N. Hodder, T. P. Speed, A. F. Cowman, T. F. de Koning-Ward, and B. S. Crabb. 2002. A subset of Plasmodium falciparum SERA genes are expressed and appear to play an important role in the erythrocytic cycle. J. Biol. Chem. 277:47524–47532.

    Mirkin, B., I. Muchnik, and T. F. Smith. 1995. A biologically consistent model for comparing molecular phylogenies. J. Comput. Biol. 2:493–507.

    Page, R. D. M. 1998. GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 14:819–820.

    Page, R. D. M., and M. A. Charleston. 1997. Reconciled trees and incongruent gene and species trees. Pp. 57–70 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies in biology, DIMACS series in discrete mathematics and theoretical computer science, Vol. 37. American Mathematical Society, Providence, RI.

    Perkins, S. L., and J. J. Schall. 2002. A molecular phylogeny of malarial parasites recovered from cytochrome b gene sequences. J. Parasitol. 88:972–978.

    Rathore, D., A. M. Wahl, M. Sullivan, and T. F. McCutchan. 2001. A phylogenetic comparison of gene trees constructed from plastid, mitochondrial and genomic DNA of Plasmodium species. Mol. Biochem. Parasitol. 114:89–94.

    Rich, S. M., and F. J. Ayala. 2000. Population structure and recent evolution of Plasmodium falciparum. Proc. Natl. Acad. Sci. USA 97:6994–7001.

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.

    Templeton, T. J., and D. C. Kaslow. 1997. Cloning and cross-species comparison of the thrombospondin-related anonymous protein (TRAP) gene from Plasmodium knowlesi, Plasmodium vivax and Plasmodium gallinaceum. Mol. Biochem. Parasitol. 84:13–24.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

    Thornton, J. W., and R. DeSalle. 2000. Gene family evolution and homology: genomics meets phylogenetics. Annu. Rev. Genomics Hum. Genet. 1:41–73.

    Zmasek, C. M., and S. R. Eddy. 2001. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17:821–828. (Richard Bourgon*, Mauro D)