婵犵妲呴崑鎾跺緤妤e啯鍋嬮柣妯款嚙杩濋梺璺ㄥ櫐閹凤拷
闂傚倷绀侀幖顐︽偋韫囨稑绐楅幖娣妼閸ㄥ倿鏌ㄩ悤鍌涘: 闂佽娴烽弫濠氬磻婵犲洤绐楅柡鍥╁枔閳瑰秴鈹戦悩鍙夋悙婵☆偅锕㈤弻娑㈠Ψ閵忊剝鐝栭悷婊冨簻閹凤拷 闂傚倷绶氬ḿ鑽ゆ嫻閻旂厧绀夐幖娣妼閸氬綊骞栧ǎ顒€鐏柍缁樻礋閺屻劑寮崹顔规寖濠电偟銆嬮幏锟� 闂備浇宕垫慨宥夊礃椤垳鐥梻浣告惈椤戝倿宕滃┑鍫㈢煓濠㈣泛澶囬崑鎾绘晲鎼粹€崇缂備椒绶ら幏锟� 闂傚倷鑳舵灙妞ゆ垵鍟村畷鏇㈡焼瀹ュ懐鐣洪悗骞垮劚椤︻垳鐥閺屾稓浠﹂悙顒傛缂備胶濯撮幏锟� 闂傚倷鑳堕、濠勭礄娴兼潙纾块梺顒€绉撮崹鍌炴煛閸愩劎澧涢柣鎺曨嚙椤法鎹勯搹鍦紘濠碉紕鍎戦幏锟� 闂傚倷鑳剁涵鍫曞疾閻愭祴鏋嶉柨婵嗩槶閳ь兛绶氬畷銊╁级閹寸媭妲洪梺鑽ゅТ濞层倕螣婵犲洤绀夐柨鐕傛嫹 婵犵數鍋為崹鍫曞箰鐠囧唽缂氭繛鍡樺灱婵娊鏌曟径鍡樻珔缂佲偓瀹€鍕仯闁搞儜鍕ㄦ灆闂佸憡妫戦幏锟� 闂傚倷娴囬崑鎰版偤閺冨牆鍨傚ù鍏兼儗閺佸棝鏌ㄩ悤鍌涘 闂備浇顕х€涒晠宕樻繝姘挃闁告洦鍓氶崣蹇涙煥閻曞倹瀚� 婵犵數鍋為崹鍫曞箹閳哄懎鍌ㄩ柣鎾崇瘍閻熸嫈鏃堝川椤撶媭妲洪梺鑽ゅТ濞层倕螣婵犲洤绀夐柨鐕傛嫹 闂傚倷绀侀幉锟犮€冭箛娑樼;闁糕剝绋戦弸渚€鏌熼幑鎰靛殭缂佺姵鍨块弻锟犲礋椤愶絿顩伴梺鍝ュ櫐閹凤拷
婵犵數鍎戠徊钘壝洪敂鐐床闁稿本绋撻々鐑芥煥閻曞倹瀚�: 闂傚倷绀侀幖顐﹀磹閻戣棄纭€闁告劕妯婂〒濠氭煥閻曞倹瀚� 闂備浇宕垫慨鏉懨洪妶鍥e亾濮樼厧鐏︽い銏$懇閺佹捇鏁撻敓锟� 闂備浇宕甸崰鎰版偡閵夈儙娑樷槈閵忕姷锛涢梺璺ㄥ櫐閹凤拷 闂備焦鐪归崺鍕垂闁秵鍋ら柡鍥ュ灪閸庡﹪鏌ㄩ悤鍌涘 闂傚倷鐒﹂惇褰掑磹閺囩喐娅犻柦妯侯樈濞兼牠鏌ㄩ悤鍌涘 闂傚倷鐒﹂惇褰掑磿閸楃伝娲Ω閿旇棄寮块梺璺ㄥ櫐閹凤拷 闂傚倷鑳舵灙缂佺粯顨呴悾鐑芥偨缁嬫寧鐎梺璺ㄥ櫐閹凤拷 闂傚倷鑳堕崕鐢稿磻閹捐绀夐煫鍥ㄦ尵閺嗐倝鏌ㄩ悤鍌涘 闂傚倷鑳堕、濠勭礄娴兼潙纾规俊銈呮噹缁犳牠鏌ㄩ悤鍌涘 闂傚倷娴囬鏍礂濞嗘挸纾块柡灞诲劚閻ら箖鏌ㄩ悤鍌涘 闂傚倷鑳舵灙妞ゆ垵鍟村畷鏇㈠箻椤旂瓔妫呴梺璺ㄥ櫐閹凤拷 缂傚倸鍊搁崐绋棵洪妶鍡╂闁归棿绶¢弫濠囨煥閻曞倹瀚� 婵犵數鍋為崹鍫曞箰閸洖纾归柡宥庡幖閻掑灚銇勯幒鎴敾閻庢熬鎷� 闂傚倷鑳堕崢褍顕i幆鑸汗闁告劦鍠栫粈澶愭煥閻曞倹瀚� 闂傚倷鐒﹀鍨熆娓氣偓楠炲繘鏁撻敓锟� 婵犵數濞€濞佳囧磻婵犲洤绠柨鐕傛嫹 闂傚倷鑳堕崢褎鎯斿⿰鍫濈闁跨噦鎷� 闂備浇顕х换鎰殽韫囨稑绠柨鐕傛嫹 闂傚倷鐒﹂幃鍫曞磿閼碱剛鐭欓柟杈惧瘜閺佸棝鏌ㄩ悤鍌涘 闂備浇宕垫慨鏉懨洪埡浣烘殾闁割煈鍋呭▍鐘绘煥閻曞倹瀚� 闂傚倷绀侀幖顐⒚洪敃鈧玻鍨枎閹惧秴娲弫鎾绘晸閿燂拷
婵犵數鍋為崹鍫曞箹閳哄懎鍌ㄩ柤娴嬫櫃閻掑﹪鏌ㄩ悤鍌涘: 闂備焦鐪归崺鍕垂闁秵鍋ら柡鍥ュ灪閸庡﹪鏌ㄩ悤鍌涘 闂傚倷娴囧銊╂倿閿曞倹鍋¢柨鏇楀亾瀹€锝呮健閺佹捇鏁撻敓锟� 闂傚倷娴囬鏍礈濮樿鲸宕查柛鈩冪☉閻掑灚銇勯幒鎴敾閻庢熬鎷� 婵犵數鍋為崹鍫曞箹閳哄懎鐭楅柍褜鍓涢埀顒冾潐閹碱偊骞忛敓锟� 闂傚倷绀侀幉锟犳偋閻愯尙鏆﹂柣銏⑶圭粻鏍煥閻曞倹瀚� 婵犵數鍋為崹鍫曞箰鐠囧唽缂氭繛鍡樺灱婵娊鏌ㄩ悤鍌涘 闂傚倸鍊峰ù鍥涢崟顖涘亱闁圭偓妞块弫渚€鏌ㄩ悤鍌涘 濠电姵顔栭崰妤勬懌闂佹悶鍔忓▔娑滅亱闂佽法鍣﹂幏锟� 闂傚倷绀侀幖顐﹀磹缁嬫5娲Χ閸ワ絽浜剧痪鏉款槹鐎氾拷 闂傚倷鐒﹂崹婵嬫倿閿曞倸桅闁绘劗鏁哥粈濠囨煥閻曞倹瀚� 婵犲痉鏉库偓妤佹叏閹绢喖瀚夋い鎺戝閽冪喖鏌ㄩ悤鍌涘 闂傚倷鐒﹂幃鍫曞磿閹绘帞鏆︽俊顖欒濞尖晠鏌ㄩ悤鍌涘 婵犵绱曢崑娑㈩敄閸ヮ剙绐楅柟鎹愵嚙閸戠娀鏌ㄩ悤鍌涘 闂傚倷娴囬崑鎰版偤閺冨牆鍨傞柧蹇e亝濞呯娀鏌ㄩ悤鍌涘 闂傚倷娴囬崑鎰版偤閺冨牆鍨傛い鏍ㄧ矌缁犳棃鏌ㄩ悤鍌涘 闂傚倷娴囬崑鎰版偤閺冨牆鍨傞柟娈垮枤閸楁岸鏌ㄩ悤鍌涘 闂傚倷绀侀幖顐﹀磹鐟欏嫬鍨旈柦妯侯槺閺嗐倝鏌ㄩ悤鍌涘 闂傚倷鑳堕幊鎾诲触鐎n剙鍨濋幖娣妼绾惧ジ鏌ㄩ悤鍌涘 闂傚倷绀侀崥瀣i幒鎾变粓闁归棿绀侀崙鐘绘煥閻曞倹瀚�
Source: Molecular Biology and Evolution, 2004, No. 7
Genome Phylogenetic Analysis Based on Extended Gene Contents

     * Department of Genetics, Development, and Cell Biology

    Center for Bioinformatics and Biological Statistics, Iowa State University

    Department of Mathematics and Statistics, University of West Florida

    E-mail: xgu@iastate.edu

    Abstract

    With the rapid growth of entire genome data, whole-genome approaches such as gene content have become popular for genome phylogeny inference, including the tree of life. However, the underlying model for genome evolution is unclear, and the proposed (ad hoc) genome distance measure may violate additivity. In this article, we formulate a stochastic framework for genome evolution, which provides a basis for defining an additive genome distance. However, we show that it is difficult to utilize the typical gene content data (i.e., the presence or absence of gene families across genomes) to estimate the genome distance. We solve this problem by introducing the concept of extended gene content; that is, the status of a gene family in a given genome could be absence, presence as single copy, or presence as duplicates, any of which can be used for genome distance estimation and phylogenetic inference. Computer simulation shows that the new tree-making method is efficient, consistent, and fairly robust. The example of 35 microbial complete genomes demonstrates that it is useful not only to study the universal tree of life but also to explore the evolutionary pattern of genomes.

    Key Words: gene content; additive genome distance; phylogenetic inference; comparative genomics

    Introduction

    Since the concept of the tree of life was proposed (Woese 1987), it was thought that more sequences of orthologous genes could improve the depth and resolution of our knowledge of life's history. This view has been challenged since the publication of the first microbial genome sequence, Haemophilus influenzae. To date the roster of complete genomes is close to 100 (for an overview, see http://www.tigr.org). In spite of more than 10 prokaryotic phyla plus a few eukaryotes represented, we are actually facing more difficulties in having a meaningful interpretation of the tree of life. Because phylogenetic analysis based on a single gene (family) has produced many conflicted gene trees, the long-term controversy over "vertical" (tree-like) evolution versus lateral (horizontal) gene transfer has become more heated rather than resolved in the genome era (Golding and Gupta 1995; Doolittle and Logsdon 1998; Jain, Rivera, and Lake 1999; Doolittle 1999a, 1999b; Nelson et al. 1999; Tekaia, Lazcano, and Dujon 1999; Huynen and Snel 2000; Wolf et al. 2002; Daubin, Moran, and Ochman 2003).

    Because phylogenetic trees of individual genes are inconsistent, the whole-genome analysis—e.g., the gene content (the presence/absence of gene families over genomes)—is becoming an attractive approach to extracting the bulk phylogenetic signals. For instance, several authors (Snel, Bork, and Huynen 1999; Huynen, Snel, and Bork 1999; Lin and Gerstein 2000; Korbel et al. 2002) estimated the fraction of shared genes for genome pairs and transformed that fraction to the genome distance matrix by some ad hoc distance measures. Other methods include the coefficient of co-occurrence of genomics (Natale et al. 2000) and the ratio of orthologs to the number of genes in the smaller genome (Clarke et al. 2002). In addition, various parsimony algorithms have also been used (e.g., Fitz-Gibbon and House 1999; House and Fitz-Gibbon 2002).

    Interestingly, these genome-level studies show a general similarity between the gene-content tree and the classical rRNA tree, implying that the vertical (tree-like) evolutionary history of an organism could be maintained at the genome level, which is not seriously affected by the lateral gene transfer. However, Doolittle (1999b) raised a fundamental question about whether a genome tree based on gene content alone, and not the evolutionary relationship, is the best phenotypic measure. In fact, any inferred topology (including molecular phylogeny) could be potentially misleading. For instance, the high variation of the GC% in bacterial genomes results in high variation of amino acid compositions (Gu 2001) that may complicate the phylogenetic inference based on protein sequences. An inferred topology turns out to be an estimate of the phylogenetic relationship only when the assumptions have been carefully examined. A common problem shared by these genome approaches is the lack of a clear-cut evolutionary model. Consequently, these studies at best lead to a much weaker statement: that the genome tree might be interpreted as only a prevailing trend in the evolution of genome-scale gene sets rather than as a dominant picture of evolution (Wolf et al. 2002).

    We have recognized the important role of modeling for phylogenomic analysis in justifying whether the inferred tree indeed represents the genome phylogeny. Because the likelihood framework for phylogenetic gene-content analysis (Gu 2000) may require a huge amount of computational time, the genome distance approach is in demand in practice. In this article, we first show that the gene-content distance is generally not additive, so its application for phylogenomic analysis could be misleading. We then tackle this problem by extending the concept of gene content into a more general framework such that the additive genome distance can be estimated. The efficiency of genome phylogenetic reconstruction is examined by extensive computer simulations. Finally, we apply the newly developed method to study the universal tree of life.

    The Stochastic Model

    The Joint Size Distribution of the Gene Family in Multiple Genomes

    The whole-genome comparison has revealed a high variation in the size of gene families among complete genomes, because a gene family can be generated, expanded, reduced, or lost during the course of genome evolution. Therefore, the joint size distribution of the gene family among genomes is useful for phylogenomic analysis.

    Nei et al. (1997) proposed a birth-death hypothesis for the evolution of young duplicate genes. Here we develop a general stochastic model, considering two major evolutionary processes that influence the size of a gene family: gene loss (nonfunctionalization or deletion) and gene proliferation (duplication). Let $\mu$ be the evolutionary rate of gene loss and $\lambda$ be the evolutionary rate of gene proliferation. If each gene is subject to the same chance of being lost or duplicated, then for a gene family with $r$ member genes at $t = 0$, the number of member genes after $t$ time units, denoted by $X_t$, has the distribution

    $$P(X_t = k \mid X_0 = r) = \sum_{j=0}^{\min(r,k)} \binom{r}{j}\binom{r+k-j-1}{r-1}\,\beta^{\,r-j}\,\alpha^{\,k-j}\,(1-\alpha-\beta)^{j}, \qquad (1)$$

    where the proliferation parameter $\alpha$ and the loss parameter $\beta$ are given by

    $$\alpha = \frac{\lambda\left(e^{(\lambda-\mu)t}-1\right)}{\lambda e^{(\lambda-\mu)t}-\mu}, \qquad \beta = \frac{\mu\left(e^{(\lambda-\mu)t}-1\right)}{\lambda e^{(\lambda-\mu)t}-\mu}, \qquad (2)$$

    respectively. Equation (2) implies $\alpha/\beta = \lambda/\mu$, which is called the P/L ratio. The size of the gene family under the birth-death model is expected to be $X_0 e^{(\lambda-\mu)t}$. Thus $\alpha > \beta$ (or P/L > 1) indicates, on average, an increase of gene family size during evolution, and vice versa.
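    The transition probability of the birth-death model can be checked numerically. The sketch below is not the authors' C++ program; it assumes the standard closed-form result for a linear birth-death process, with alpha denoting the proliferation parameter and beta the loss parameter, and the function names are illustrative only.

```python
from math import comb, exp

def alpha_beta(lam, mu, t):
    """Proliferation (alpha) and loss (beta) parameters for rates lam, mu and time t."""
    if lam == mu:
        # Limiting case of equal rates.
        a = lam * t / (1.0 + lam * t)
        return a, a
    e = exp((lam - mu) * t)
    denom = lam * e - mu
    return lam * (e - 1.0) / denom, mu * (e - 1.0) / denom

def transition_prob(k, r, alpha, beta):
    """P(X_t = k | X_0 = r) for a family that starts with r >= 1 genes."""
    total = 0.0
    for j in range(min(r, k) + 1):
        total += (comb(r, j) * comb(r + k - j - 1, r - 1)
                  * beta ** (r - j) * alpha ** (k - j)
                  * (1.0 - alpha - beta) ** j)
    return total

def expected_size(r, alpha, beta, k_max=200):
    """Mean family size, summing the distribution up to k_max (truncation is an assumption)."""
    return sum(k * transition_prob(k, r, alpha, beta) for k in range(k_max + 1))
```

    With lam > mu the computed mean exceeds r, and with lam < mu it falls below r, in line with the interpretation of the P/L ratio.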

    Consider two genomes that diverged $t$ time units ago (fig. 1). For a given gene family, we assume that there are $r$ member genes at $t = 0$ (in the common ancestor). Let $X_i$, $i = 1, 2$, denote the number of genes after $t$ time units for genome $i$. Under the assumption of independent evolution between lineages, the (conditional) joint probability is given by $P(X_1, X_2 \mid X_0 = r) = P(X_1 \mid X_0 = r) \times P(X_2 \mid X_0 = r)$. Because the size of a gene family in the ancestral genome is unknown, a (prior) distribution for $X_0 = r$ is assumed, denoted by $\pi(r)$. Thus, the joint probability of $X_1$ and $X_2$ is given by

    $$P(X_1, X_2) = \sum_{r \ge 1} \pi(r)\,P(X_1 \mid r)\,P(X_2 \mid r), \qquad (3)$$

    where $P(X_i \mid r)$ is short for $P(X_i \mid X_0 = r)$ defined by equation (1).

    FIG. 1. Schematic genome evolution for two genomes and four genomes, respectively. The gene family has r member genes in the root. After t evolutionary time units, the size of the gene family is x1 and x2 in genomes 1 and 2, respectively. For four genomes, the size of the gene family is xi (i = 1, ... , 4)

    For the general n-genomes, let Xi represent the size of a gene family in the ith genome, i = 1, ... , n. The joint size distribution of the gene family X = (X1, ... , Xn) can be derived according to the Markov chain model, similar to DNA sequence evolution (Felsenstein 1981). For example, for four genomes (fig. 1), it is given by

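    where $P(\cdot \mid \cdot\,; \alpha_i, \beta_i)$ is the transition probability for branch $i$, with proliferation parameter $\alpha_i$ and loss parameter $\beta_i$, defined by equation (1).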

    Two-Genome Model and Expression Distance

    The Additive Genome Distance Measures

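    Given the joint-size distribution, say, equation (4) for four genomes, maximum likelihood phylogeny can be implemented. Unfortunately, the complexity of the transition probability (eq. 1) makes it almost intractable for genome-level analysis. Thus, the distance method becomes highly desirable, but first one should define an additive genome distance measure. With some algebra from equation (2), two quantities, the proliferation measure $d_\lambda = \lambda t$ and the loss measure $d_\mu = \mu t$, are given in terms of the proliferation parameter $\alpha$ and the loss parameter $\beta$ by

    $$d_\lambda = \frac{\alpha}{\alpha-\beta}\,\ln\frac{1-\beta}{1-\alpha}$$

    and

    $$d_\mu = \frac{\beta}{\alpha-\beta}\,\ln\frac{1-\beta}{1-\alpha}, \qquad (5)$$

    respectively. For two genomes (fig. 1), let $\lambda_i$, $\mu_i$, $\alpha_i$, $\beta_i$, $d_{\lambda_i}$, and $d_{\mu_i}$ be the corresponding parameters in each lineage, $i = 1, 2$; see equations (2) and (5). Then we define the proliferation genome distance between two genomes (the P distance, for short) as $G_P = d_{\lambda_1} + d_{\lambda_2} = (\lambda_1 + \lambda_2)t$; from equation (5), it is given by

    $$G_P = \sum_{i=1}^{2} \frac{\alpha_i}{\alpha_i-\beta_i}\,\ln\frac{1-\beta_i}{1-\alpha_i}. \qquad (6)$$

    In the same manner, the loss genome distance (L distance, for short) between two genomes is defined as $G_L = d_{\mu_1} + d_{\mu_2} = (\mu_1 + \mu_2)t$, given by

    $$G_L = \sum_{i=1}^{2} \frac{\beta_i}{\alpha_i-\beta_i}\,\ln\frac{1-\beta_i}{1-\alpha_i}, \qquad (7)$$

    and the general genome distance measure is defined as $G = G_P + G_L$, i.e.,

    $$G = \sum_{i=1}^{2} \frac{\alpha_i+\beta_i}{\alpha_i-\beta_i}\,\ln\frac{1-\beta_i}{1-\alpha_i}. \qquad (8)$$

    These genome distance measures are additive because each is a sum of rate-by-time terms along the lineages, and $G_P/G_L$ equals the P/L ratio when the two lineages share the same ratio $\lambda/\mu$. Equations (6)–(8) provide the relationship between genome distances and parameters in the probabilistic model (eqs. 1–3). To estimate the genome distance, we shall develop a computationally efficient method for estimating the parameters ($\alpha_i$ and $\beta_i$).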
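    As a small illustration of how the distances follow from the per-lineage parameters, the sketch below converts (alpha, beta) pairs back into rate-by-time quantities and sums them over lineages. It assumes the standard birth-death relations alpha/beta = lambda/mu and (1 - beta)/(1 - alpha) = exp((lambda - mu)t); the function names are hypothetical.

```python
from math import exp, log

def alpha_beta(lam, mu, t):
    """Proliferation (alpha) and loss (beta) parameters for rates lam != mu and time t."""
    e = exp((lam - mu) * t)
    denom = lam * e - mu
    return lam * (e - 1.0) / denom, mu * (e - 1.0) / denom

def branch_measures(alpha, beta):
    """Return (d_lambda, d_mu) = (lambda * t, mu * t) for one lineage."""
    if alpha == beta:
        # Equal-rate limit: alpha = lambda*t / (1 + lambda*t).
        d = alpha / (1.0 - alpha)
        return d, d
    scale = log((1.0 - beta) / (1.0 - alpha)) / (alpha - beta)
    return alpha * scale, beta * scale

def genome_distances(lineages):
    """lineages: list of (alpha_i, beta_i) pairs. Returns (G_P, G_L, G)."""
    measures = [branch_measures(a, b) for a, b in lineages]
    g_p = sum(m[0] for m in measures)
    g_l = sum(m[1] for m in measures)
    return g_p, g_l, g_p + g_l
```

    For example, two lineages with (lambda, mu) = (0.3, 0.2) and (0.1, 0.4) over one time unit should give G_P = 0.4, G_L = 0.6, and G = 1.0 after the round trip through alpha_beta.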

    Gene Content: It's Not Sufficient

    The concept of gene content was introduced by several authors for studying the universal genome tree (e.g., Snel, Bork, and Huynen 1999; Tekaia, Lazcano, and Dujon 1999). For two genomes i = 1, 2, let Yi be the gene-content index of a gene family: Yi = 1 indicates at least one member gene found in the ith genome; otherwise Yi = 0. Therefore, gene-content pattern is the most degenerated size distribution of the gene family. In the following discussion we will show that it becomes insufficient for estimating the genome distance.

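    From equation (3) one can show that the joint probability of $Y_1$ and $Y_2$ is given by

    $$P(Y_1, Y_2) = \sum_{r \ge 1} \pi(r)\,P(Y_1 \mid r)\,P(Y_2 \mid r). \qquad (9)$$

    Because $P(Y_i = 0 \mid r) = \beta_i^{\,r}$ and $P(Y_i = 1 \mid r) = 1 - \beta_i^{\,r}$, $i = 1, 2$, where $\beta_i$ is the loss parameter of lineage $i$, the analytical form of $P(Y_1, Y_2)$ can be obtained if a geometric prior is assumed, i.e., $\pi(r) = (1 - f)^{r-1} f$. For simplicity, let $P(i, j) = P(Y_1 = i, Y_2 = j)$. Then, putting $\pi(r)$ into equation (9), we have

    $$\begin{aligned}
    P(0,0) &= Q(\beta_1\beta_2),\\
    P(1,0) &= Q(\beta_2) - Q(\beta_1\beta_2),\\
    P(0,1) &= Q(\beta_1) - Q(\beta_1\beta_2),\\
    P(1,1) &= 1 - Q(\beta_1) - Q(\beta_2) + Q(\beta_1\beta_2),
    \end{aligned} \qquad (10)$$

    where the function $Q(\beta)$ ($\beta = \beta_1$, $\beta_2$, or $\beta_1\beta_2$) is defined as

    $$Q(\beta) = \sum_{r \ge 1} \pi(r)\,\beta^{r} = \frac{f\beta}{1 - (1-f)\beta}. \qquad (11)$$

    Because equation (10) relies only on the loss parameters $\beta_1$ and $\beta_2$, we cannot estimate the proliferation parameters ($\alpha_1$ and $\alpha_2$). In other words, the additive genome distances defined by equations (6)–(8) in general cannot be estimated by the gene-content approach.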
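    The point that presence/absence data carry no information about proliferation can be seen directly in code: the pattern probabilities are a function of the two loss parameters and the prior parameter f only. The sketch below assumes the geometric prior and the relation P(Y = 0 | r) = beta^r; the function names are illustrative.

```python
def q_func(z, f):
    """Q(z) = sum over r >= 1 of f (1-f)^(r-1) z^r, in closed form."""
    return f * z / (1.0 - (1.0 - f) * z)

def gene_content_probs(beta1, beta2, f):
    """Joint probabilities of the four presence/absence patterns (Y1, Y2).

    Note that no proliferation parameter appears in the signature.
    """
    q1, q2, q12 = q_func(beta1, f), q_func(beta2, f), q_func(beta1 * beta2, f)
    return {
        (0, 0): q12,
        (1, 0): q2 - q12,
        (0, 1): q1 - q12,
        (1, 1): 1.0 - q1 - q2 + q12,
    }

def gene_content_probs_by_sum(beta1, beta2, f, r_max=400):
    """Same probabilities by direct (truncated) summation over the ancestral size r."""
    probs = {(0, 0): 0.0, (1, 0): 0.0, (0, 1): 0.0, (1, 1): 0.0}
    for r in range(1, r_max + 1):
        prior = f * (1.0 - f) ** (r - 1)
        absent1, absent2 = beta1 ** r, beta2 ** r
        probs[0, 0] += prior * absent1 * absent2
        probs[1, 0] += prior * (1.0 - absent1) * absent2
        probs[0, 1] += prior * absent1 * (1.0 - absent2)
        probs[1, 1] += prior * (1.0 - absent1) * (1.0 - absent2)
    return probs
```

    Any two parameter settings that share beta1, beta2, and f but differ in their proliferation parameters therefore produce identical gene-content data in expectation.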

    Extended Gene Content

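    We have found a plausible solution by further dividing the non-zero (member genes) case into two states: single-copy (one-member) genes or duplicates (more than one member gene). This extended gene-content analysis considers three possible states: no member gene ($Z = 0$), single-copy gene ($Z = 1$), and duplicate genes ($Z = 2$). According to equation (1), their probabilities are $P(Z = 0 \mid X_0 = r) = P(X_t = 0 \mid X_0 = r)$, $P(Z = 1 \mid X_0 = r) = P(X_t = 1 \mid X_0 = r)$, and $P(Z = 2 \mid X_0 = r) = \sum_{k \ge 2} P(X_t = k \mid X_0 = r)$, as given by

    $$\begin{aligned}
    P(Z = 0 \mid r) &= \beta^{r},\\
    P(Z = 1 \mid r) &= r\,\beta^{r-1}(1-\alpha)(1-\beta),\\
    P(Z = 2 \mid r) &= 1 - \beta^{r} - r\,\beta^{r-1}(1-\alpha)(1-\beta),
    \end{aligned} \qquad (12)$$

    respectively, where $\alpha$ is the proliferation parameter and $\beta$ is the loss parameter.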

    The Joint Distribution for Two Genomes

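    Consider two genomes that diverged $t$ time units ago (fig. 1). Let $Z_i = 0$, 1, or 2 be the extended gene-content index for a gene family in the $i$th genome, $i = 1, 2$. Similar to equation (3) and equation (9), the joint distribution of $Z_1$ and $Z_2$ is given by

    $$P(Z_1, Z_2) = \sum_{r \ge 1} \pi(r)\,P(Z_1 \mid r)\,P(Z_2 \mid r), \qquad (13)$$

    where $P(Z_i \mid r) = P(Z_i \mid X_0 = r)$. Given the geometric distribution $\pi(r) = f(1 - f)^{r-1}$, we obtain the analytical forms of equation (13) as follows. Writing $P(i, j) = P(Z_1 = i, Z_2 = j)$, with proliferation parameters $\alpha_i$ and loss parameters $\beta_i$,

    $$\begin{aligned}
    P(0,0) &= Q(\beta_1\beta_2),\\
    P(1,0) &= \gamma_1\beta_2\,R(\beta_1\beta_2),\\
    P(0,1) &= \gamma_2\beta_1\,R(\beta_1\beta_2),\\
    P(1,1) &= \gamma_1\gamma_2\,S(\beta_1\beta_2),\\
    P(2,0) &= Q(\beta_2) - Q(\beta_1\beta_2) - \gamma_1\beta_2\,R(\beta_1\beta_2),\\
    P(0,2) &= Q(\beta_1) - Q(\beta_1\beta_2) - \gamma_2\beta_1\,R(\beta_1\beta_2),\\
    P(2,1) &= \gamma_2\,R(\beta_2) - \gamma_2\beta_1\,R(\beta_1\beta_2) - \gamma_1\gamma_2\,S(\beta_1\beta_2),\\
    P(1,2) &= \gamma_1\,R(\beta_1) - \gamma_1\beta_2\,R(\beta_1\beta_2) - \gamma_1\gamma_2\,S(\beta_1\beta_2),\\
    P(2,2) &= 1 - \sum_{(i,j) \ne (2,2)} P(i,j),
    \end{aligned} \qquad (14)$$

    where $\gamma_1 = (1 - \beta_1)(1 - \alpha_1)$ and $\gamma_2 = (1 - \beta_2)(1 - \alpha_2)$; the function $Q(\beta)$ is given by equation (11), the function $R(\beta) = \sum_{r \ge 1} \pi(r)\,r\,\beta^{r-1}$ is given by

    $$R(\beta) = \frac{f}{[1 - (1-f)\beta]^{2}}, \qquad (15)$$

    and the function $S(\beta) = \sum_{r \ge 1} \pi(r)\,r^{2}\beta^{r-1}$ is given by

    $$S(\beta) = \frac{f\,[1 + (1-f)\beta]}{[1 - (1-f)\beta]^{3}}. \qquad (16)$$

    Here $\beta = \beta_1$, $\beta_2$, or $\beta_1\beta_2$.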
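    The nine joint probabilities of the extended gene-content states can be cross-checked numerically. The following sketch assumes the geometric prior and the single-lineage state probabilities P(Z = 0 | r) = beta^r and P(Z = 1 | r) = r beta^(r-1) (1 - alpha)(1 - beta). It computes the joint distribution both from closed-form sums and by direct truncated summation over r; the helper names and the symbol g for (1 - alpha)(1 - beta) are choices made here for illustration.

```python
def q_func(z, f):
    """Q(z) = sum_r pi(r) z^r for the geometric prior pi(r) = f (1-f)^(r-1)."""
    return f * z / (1.0 - (1.0 - f) * z)

def r_func(z, f):
    """R(z) = sum_r pi(r) r z^(r-1)."""
    return f / (1.0 - (1.0 - f) * z) ** 2

def s_func(z, f):
    """S(z) = sum_r pi(r) r^2 z^(r-1)."""
    u = (1.0 - f) * z
    return f * (1.0 + u) / (1.0 - u) ** 3

def joint_closed_form(a1, b1, a2, b2, f):
    """Joint P(Z1, Z2) over states 0 (absent), 1 (single copy), 2 (duplicates)."""
    g1 = (1.0 - a1) * (1.0 - b1)
    g2 = (1.0 - a2) * (1.0 - b2)
    b12 = b1 * b2
    p = {}
    p[0, 0] = q_func(b12, f)
    p[1, 0] = g1 * b2 * r_func(b12, f)
    p[0, 1] = g2 * b1 * r_func(b12, f)
    p[1, 1] = g1 * g2 * s_func(b12, f)
    p[2, 0] = q_func(b2, f) - p[0, 0] - p[1, 0]
    p[0, 2] = q_func(b1, f) - p[0, 0] - p[0, 1]
    p[2, 1] = g2 * r_func(b2, f) - p[0, 1] - p[1, 1]
    p[1, 2] = g1 * r_func(b1, f) - p[1, 0] - p[1, 1]
    p[2, 2] = 1.0 - sum(p.values())
    return p

def state_probs(r, alpha, beta):
    """P(Z = 0, 1, 2 | X_0 = r) for one lineage."""
    p0 = beta ** r
    p1 = r * beta ** (r - 1) * (1.0 - alpha) * (1.0 - beta)
    return (p0, p1, 1.0 - p0 - p1)

def joint_by_summation(a1, b1, a2, b2, f, r_max=400):
    """Same joint distribution by truncated summation over the ancestral size r."""
    joint = {(i, j): 0.0 for i in range(3) for j in range(3)}
    for r in range(1, r_max + 1):
        prior = f * (1.0 - f) ** (r - 1)
        z1, z2 = state_probs(r, a1, b1), state_probs(r, a2, b2)
        for i in range(3):
            for j in range(3):
                joint[i, j] += prior * z1[i] * z2[j]
    return joint
```

    Unlike the presence/absence case, several entries here depend on the proliferation parameters through g1 and g2, which is what makes all four parameters estimable.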

    Parameter Estimation

    When the extended gene-content data matrix for any two genomes 1 and 2 is given, we develop a maximum likelihood (ML)–based approach to estimating the genome distances. Usually the prior parameter $f$ can be estimated from the observed size frequencies of gene families. Because the pattern of double loss (i.e., $Z_1 = 0$ and $Z_2 = 0$) is not observable, one may use the following modified joint probability,

    $$P^{*}(Z_1, Z_2) = \frac{P(Z_1, Z_2)}{1 - P(0, 0)}, \qquad (17)$$

    for $Z_1, Z_2 = 0$, 1, or 2, except $Z_1 = Z_2 = 0$. Let $n_{ij}$ be the number of gene families with the pattern $Z_1 = i$ and $Z_2 = j$, where $i, j = 0, 1, 2$ except $i = j = 0$. Then, the likelihood for the two genomes can be written as

    $$L = \prod_{(i,j) \ne (0,0)} \left[P^{*}(i, j)\right]^{n_{ij}}. \qquad (18)$$

    We use Newton-Raphson numerical iteration to obtain the ML estimates of the proliferation parameters $\alpha_1$, $\alpha_2$ and the loss parameters $\beta_1$, $\beta_2$. Their sampling variance-covariance matrix is approximately computed by the inverse of Fisher's information matrix. When these parameters ($\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$) are estimated, the computation of genome distances by equations (6)–(8) is straightforward, and the sampling variance of a genome distance can be obtained by the delta method.
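    The likelihood described above can be sketched as a plain function of the four parameters. This is only an illustration under stated assumptions: a geometric prior with known f, joint probabilities obtained by truncated summation rather than closed forms, and a multinomial likelihood conditioned on the double-loss pattern being unobservable. It does not implement the Newton-Raphson iteration itself; any numerical optimizer could be applied to this function.

```python
from math import log

def state_probs(r, alpha, beta):
    """P(Z = 0, 1, 2 | X_0 = r): absent, single copy, duplicates."""
    p0 = beta ** r
    p1 = r * beta ** (r - 1) * (1.0 - alpha) * (1.0 - beta)
    return (p0, p1, 1.0 - p0 - p1)

def joint_probs(params, f, r_max=300):
    """Joint P(Z1, Z2) for params = (alpha1, beta1, alpha2, beta2)."""
    a1, b1, a2, b2 = params
    joint = {(i, j): 0.0 for i in range(3) for j in range(3)}
    for r in range(1, r_max + 1):
        prior = f * (1.0 - f) ** (r - 1)
        z1, z2 = state_probs(r, a1, b1), state_probs(r, a2, b2)
        for i in range(3):
            for j in range(3):
                joint[i, j] += prior * z1[i] * z2[j]
    return joint

def log_likelihood(params, counts, f):
    """Log-likelihood of pattern counts n_ij, conditional on (0, 0) being unobserved.

    counts maps (i, j) patterns, excluding (0, 0), to the number of gene families.
    """
    joint = joint_probs(params, f)
    observable = 1.0 - joint[0, 0]
    return sum(n * log(joint[pattern] / observable)
               for pattern, n in counts.items() if n > 0)
```

    When the counts are proportional to the conditional probabilities at some parameter vector, the log-likelihood is largest at that vector, which gives a quick sanity check before running an optimizer.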

    Computer Simulations

    We have conducted extensive computer simulations to examine the performance of phylogenetic reconstruction using the extended gene-content data. The computer program is written in C++. The number of replications in each simulation study is set at 2,000. Because of space limitations, we will discuss our main results briefly.

    Estimation of Genome Distance Is Asymptotically Unbiased

    We first simulate the stochastic process according to the two-genome evolution scenario (fig. 1), when the evolutionary parameters ($\lambda_i t$ and $\mu_i t$, $i = 1, 2$) are given. For each gene family, the number of genes at the root, $r$, is generated from a geometric distribution with the parameter $f = 0.5$. In each replicate, we implement the ML algorithm to estimate the proliferation parameter $\alpha_i$ and the loss parameter $\beta_i$ ($i = 1, 2$), and we then compute the genome distances according to equations (6)–(8). The mean and variance for each estimate are used for examining the statistical properties.
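    A two-genome simulation of this kind can be sketched as follows. Rather than simulating the birth-death process event by event, this sketch draws the final family size directly, using the property that each ancestral gene independently leaves no descendants with probability beta and otherwise a geometrically distributed number of copies. That shortcut, the function names, and the default seed are assumptions of the sketch, not a description of the authors' program.

```python
import random
from math import exp

def alpha_beta(lam_t, mu_t):
    """Proliferation and loss parameters from lambda*t and mu*t."""
    if lam_t == mu_t:
        a = lam_t / (1.0 + lam_t)
        return a, a
    e = exp(lam_t - mu_t)
    denom = lam_t * e - mu_t
    return lam_t * (e - 1.0) / denom, mu_t * (e - 1.0) / denom

def sample_root_size(f, rng):
    """Ancestral family size r >= 1 from the geometric prior f (1-f)^(r-1)."""
    r = 1
    while rng.random() >= f:
        r += 1
    return r

def sample_family_size(r, alpha, beta, rng):
    """Family size after one branch, given r genes at the start of the branch."""
    total = 0
    for _ in range(r):
        if rng.random() < beta:
            continue  # this ancestral gene left no descendants
        copies = 1
        while rng.random() < alpha:
            copies += 1
        total += copies
    return total

def simulate_pattern_counts(n_families, branch1, branch2, f=0.5, seed=0):
    """Counts n_ij of extended gene-content patterns; (0, 0) is dropped as unobservable.

    branch1 and branch2 are (lambda*t, mu*t) pairs for the two lineages.
    """
    rng = random.Random(seed)
    (a1, b1), (a2, b2) = alpha_beta(*branch1), alpha_beta(*branch2)
    counts = {}
    for _ in range(n_families):
        r = sample_root_size(f, rng)
        z1 = min(sample_family_size(r, a1, b1, rng), 2)
        z2 = min(sample_family_size(r, a2, b2, rng), 2)
        if (z1, z2) != (0, 0):
            counts[z1, z2] = counts.get((z1, z2), 0) + 1
    return counts
```

    Calling simulate_pattern_counts(1000, (0.3, 0.2), (0.1, 0.4)) would return the pattern counts for one replicate with N = 1,000 families, which can then be passed to an ML routine.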

    We have studied four typical cases: the gene-loss model ($\lambda = 0$), the growth model ($\lambda > \mu$), the equal model ($\lambda = \mu$), and the reduction model ($\lambda < \mu$). The number of gene families ($N$) is set at $N$ = 200, 500, and 1,000, respectively. We have examined a variety of combinations from these models in two lineages and have found that the estimates of these parameters and genome distances are asymptotically unbiased; the bias is virtually trivial when $N$ > 500. The sampling variances of genome distances decrease with the increase in the number of gene families, and the variances are usually acceptable if $N$ > 500.

    Genome Tree Inference Is Efficient and Consistent

    We have examined the tree-making performance of the extended gene-content approach, using a typical four-genome scenario (fig. 2). After the extended gene-content matrix of four genomes is simulated, we estimate the genome distance matrix and then infer the tree with the Neighbor-Joining (NJ) algorithm. The efficiency of phylogenetic inference is then measured by the percentage of correct topology inference over 1,000 replicates. After having examined many combinations, we concluded that our method is efficient; that is, except in some extreme cases, the correct percentage is satisfactory (>70%) when $N$ > 500. Our method is also consistent; that is, the correct percentage tends to be 100% when $N \to \infty$.
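    For four genomes, the topology choice made by Neighbor-Joining can be written very compactly, because with four taxa the NJ selection criterion reduces to picking the pairing with the smallest sum of within-pair distances. The sketch below uses that reduction; the helper that builds additive distances from branch lengths and the example branch lengths are hypothetical and serve only to exercise the function.

```python
SPLITS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def four_taxon_topology(d):
    """Return the pairing ((i, j), (k, l)) selected for a 4x4 distance matrix d."""
    def pair_sum(split):
        (i, j), (k, l) = split
        return d[i][j] + d[k][l]
    return min(SPLITS, key=pair_sum)

def four_taxon_distances(external, internal):
    """Additive distances on the tree ((0,1),(2,3)).

    external: four external branch lengths; internal: the internal branch length.
    """
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(i + 1, 4):
            same_side = (i < 2) == (j < 2)
            d[i][j] = d[j][i] = external[i] + external[j] + (0.0 if same_side else internal)
    return d

def percent_correct(matrices, true_split=((0, 1), (2, 3))):
    """Percentage of replicate distance matrices that recover the true pairing."""
    hits = sum(1 for d in matrices if four_taxon_topology(d) == true_split)
    return 100.0 * hits / len(matrices)
```

    With exactly additive distances the true pairing is recovered even when two non-sister branches are long and the internal branch is short; errors in practice come from the sampling variance of the estimated distances.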

    FIG. 2. The genome tree used for a computer simulation study. A. Equal external branch lengths. B. Unequal external branches (Felsenstein's zone). C. Unequal external branches (non-Felsenstein's zone)

    Table 1 shows the correct percentage of tree-making when the true tree has four equal external branch lengths (fig. 2A). When the internal branch length (c) is short, the genome tree inference can be significantly improved as N becomes larger. To examine the tree-making consistency, we consider two typical patterns when the external branches are highly unequal (fig. 2B and fig. 2C). As shown in table 2, the performance is poor when N is small and the internal branch length is short. Nevertheless, even in the very extreme case, the correct percentage of tree-making is close to 100% for a sufficiently large number of gene families.

    Table 1 Correct Percentage (%) of Tree Making: Equal External Branch Lengths (see fig. 2A).

    Table 2 Correct Percentage (%) of Tree Making: Unequal Branch Lengths (see fig. 2B and C).

    We have also investigated the effect of the prior distribution. We use several alternative distributions in our simulation model that have a longer tail than the geometric distribution. For instance, (r) = C(1 – f)f, or (r) = Cr– (C is the normalizing constant). After we examined many cases, we found that the performance of tree-making is very robust against the choice of a specific form of (r) (not shown).

    Example: The Universal Genome Tree of Life

    To compare it with previous genome phylogenetic inferences using gene-content data, we applied the newly developed extended gene-content method to infer the universal genome tree of 35 complete genomes, similar to Wolf et al. (2002). The extended gene-content data were obtained from the COG database (http://www.ncbi.nlm.nih.gov/COG/). Then, the pairwise genome distance (G) was estimated according to equation (8). We also estimated the proliferation (P) and the loss (L) genome distances, respectively (data not shown).

    We used the NJ method (Saitou and Nei 1987) to infer the genome phylogeny. The overall genome tree based on extended gene content (fig. 3) supports the concept of a universal tree, similar to previous gene-content trees (Snel, Bork, and Huynen 1999; Wolf et al. 2002) and the standard 16S rRNA tree (Olsen, Woese, and Overbeek 1994). That is, two major lineages of cellular life, the Archaea and the Bacteria, are monophyletic from the third lineage (Eukarya, represented by the yeast genome), supported by 100% bootstrap values. There are a few aspects in which our tree differs from other gene-content trees, however. We have compared our result to that of Wolf et al. (2002). In their study, the genome distance between species A and B was calculated as $D_{AB} = 1 - J_{AB}$, where $J_{AB}$ is the Jaccard coefficient, which reflects the similarity of gene content between A and B. Consider the phylogeny of Archaea, for instance. Both studies place Hbs (Halobacterium sp.) at the root of the tree and cluster the Euryarchaeota (Afu, Mja, Mth, and Pho; see fig. 3 for species abbreviations) together. However, our genome phylogeny suggests that the Crenarchaeota "Ape" (Aeropyrum pernix) may also branch off, whereas Wolf et al. (2002) showed that it was clustered with the Euryarchaeota Tac (Thermoplasma acidophilum). Though it requires further investigation, the genome distance measure used by Wolf et al. (2002) is unlikely to be additive, so the theoretical basis of their genome tree remains open to question. Indeed, our simulation study has shown that an ad hoc (non-additive) genome distance could be misleading under the "Felsenstein zone" (not shown).
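    For comparison, the Jaccard-based distance mentioned above is straightforward to compute from presence/absence data. The sketch below treats each genome as a set of gene-family identifiers; the identifiers in the usage note are made up for illustration.

```python
def jaccard_distance(families_a, families_b):
    """D_AB = 1 - J_AB, where J_AB = |A and B| / |A or B| for two sets of gene families."""
    a, b = set(families_a), set(families_b)
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical (a convention chosen here)
    return 1.0 - len(a & b) / len(union)
```

    For instance, jaccard_distance({"F1", "F2", "F3"}, {"F2", "F3", "F4"}) shares two of four families and therefore gives 0.5. The measure uses only presence or absence, so it carries no information on copy number.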

    FIG. 3. The genome phylogeny of 35 microbial complete genomes, inferred by the extended gene-content data set. Bootstrapping values <50% are not presented. Species abbreviations: Archaea: Afu, Archaeoglobus fulgidus; Hbs, Halobacterium sp. NRC-1; Mja, Methanococcus jannaschii; Mth, Methanothermobacter thermautotrophicus; Tac, Thermoplasma acidophilum; Pho, Pyrococcus horikoshii; Ape, Aeropyrum pernix. Eukaryota: Sce, Saccharomyces cerevisiae. Bacteria: Aae, Aquifex aeolicus; Tma, Thermotoga maritima; Dra, Deinococcus radiodurans; Mtu, Mycobacterium tuberculosis H37Rv; Lla, Lactococcus lactis; Spy, Streptococcus pyogenes M1 GAS; Bsu, Bacillus subtilis; Syn, Synechocystis sp.; Eco, Escherichia coli K12; Buc, Buchnera sp. APS; Vch, Vibrio cholerae; Pae, Pseudomonas aeruginosa; Hin, Haemophilus influenzae; Pmu, Pasteurella multocida; Xfa, Xylella fastidiosa 9a5c; Nme, Neisseria meningitidis MC58; Hpy, Helicobacter pylori 26695; Cje, Campylobacter jejuni; Mlo, Mesorhizobium loti; Ccr, Caulobacter crescentus; Rpr, Rickettsia prowazekii; Ctr, Chlamydia trachomatis; Cpn, Chlamydophila pneumoniae; Tpa, Treponema pallidum; Bbu, Borrelia burgdorferi; Uur, Ureaplasma urealyticum; Mge, Mycoplasma genitalium

    Discussion

    Individual gene families may have different phylogenetic trees because of orthology problems caused by fast evolution, gene/genome duplication, or lateral gene transfer (Doolittle 1999b; Eisen 2000; Gu, Wang, and Gu 2002; Jordan et al. 2001; Gu and Huang 2002). The whole-genome approach provides one feasible solution for overcoming this problem. Other methods, including merging individual trees to a biologically meaningful phylogeny, or concatenating well-selected proteins to make a single phylogeny, are certainly also valuable.

    We developed a stochastic model for genome evolution under a given phylogeny. However, we have found that it is difficult to use the widely cited gene-content data to estimate the additive genome distance. We solved this problem by using the extended gene contents that take duplicate genes into account. Computer simulation shows that the genome phylogeny inference is efficient, consistent, and fairly robust. Moreover, the example of 35 microbial complete genomes demonstrates that the new method is useful not only to study the universal tree of life but also to explore the evolutionary pattern of genomes.

    Though many reports of lateral gene transfer (Doolittle and Logsdon 1998; Lawrence and Ochman 1998) have made popular the view that it must be one of the "major forces," at the genome level there may be only a small portion of gene families that could be affected. Lateral gene transfer from one organism to another may only increase the size of an existing gene family (type A) in the host genome, or it may introduce new genes into the host genome (type B) (Snel, Bork, and Huynen 1999; Eisen 2000; Sankoff 2001). Our simulation study has shown that the genome tree is virtually unaffected by type A lateral gene transfer, and not very sensitive to type B lateral gene transfer except when it is overwhelming (unpublished result). Although the relative contributions of these two types of lateral gene transfer are yet to be determined, the genome tree seems to be robust against lateral gene transfer. Indeed, our example shows the correspondence of the genome tree (fig. 3) with the 16S rRNA tree (Snel, Bork, and Huynen 1999). Further study will show whether the genome tree can be used as an "independent" phylogenetic framework upon which to construct and test evolutionary hypotheses, including the pattern of lateral gene transfer.

    Further studies should take two directions. The first one is to improve the evolutionary model. For instance, the evolutionary rates of gene proliferation or gene loss ($\lambda$ and $\mu$) could vary not only among gene families but also among lineages (Aravind et al. 2000). One may try some techniques (Gu, Fu, and Li 1995; Gu 1999) developed for sequence evolution to relax the assumption of constant rate. All gene-content–based methods actually assume independent evolution of gene families, which may not be realistic. Because gene families within similar metabolic pathways may tend to co-evolve (Pellegrini et al. 1999), their presence/absence may not be independent among gene families; we shall study this problem under the phylogenetic framework in the future. It remains a challenge to find ways to model the effect of lateral gene transfer. The second direction for future studies involves means of implementing more sophisticated tree-making algorithms. We shall develop some fast but heuristic algorithms so that the ML phylogeny can be used in practice. The Bayesian inference in phylogenetics is also worth considering, though the controversy remains unresolved (Huelsenbeck et al. 2001; Suzuki, Glazko, and Nei 2002; Alfaro, Zoller, and Lutzoni 2003).

    Acknowledgements

    This work was supported by National Institutes of Health grant number RO 1 GM 62118. The computer program is available at the Web site http://xgu.zool.iastate.edu.

    Literature Cited

    Alfaro, M. E., S. Zoller, and F. Lutzoni. 2003. Bayes or Bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol. Biol. Evol. 20:255-266.

    Aravind, L., H. Watanabe, D. J. Lipman, and E. V. Koonin. 2000. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc. Natl. Acad. Sci. USA 97:11319-11324.

    Clarke, G. D. P., R. G. Beiko, M. A. Ragan, and R. L. Charlebois. 2002. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J. Bacteriol. 184:2072-2080.

    Daubin, V., N. A. Moran, and H. Ochman. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301:829-832.

    Doolittle, W. F. 1999a. Phylogenetic classification and the universal tree. Science 284:2124-2129.

    Doolittle, W. F. 1999b. Technical comments (Response) on Doolittle (1999a). Science 286:1443a.

    Doolittle, W. F., and J. M. Logsdon. 1998. Archaeal genomics: do Archaea have a mixed heritage? Curr. Biol. 8:R209-R211.

    Eisen, J. A. 2000. Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr. Opin. Genet. Dev. 10:606-611.

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.

    Fitz-Gibbon, S. T., and C. H. House. 1999. Whole genome–based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27:4218-4222.

    Golding, G. B., and R. S. Gupta. 1995. Protein-based phylogenies support a chimeric origin for the eukaryotic genome. Mol. Biol. Evol. 12:1-6.

    Gu, X., Y. X. Fu, and W. H. Li. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546-557.

    Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:1664-1674.

    Gu, X. 2000. A simple evolutionary model for genome phylogeny inference based on gene content. Pp. 515–524 in D. Sankoff and J. H. Nadeau, eds. Comparative genomics. Kluwer Academic Publishers, Dordrecht, The Netherlands.

    Gu, X. 2001. Maximum likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453-464.

    Gu, X., and W. Huang. 2002. Testing the parsimony test of genome duplications: a counterexample. Genome Res. 12:1-2.

    Gu, X., Y. Wang, and J. Gu. 2002. Age-distribution of human gene families showing equal roles of large and small-scale duplications in vertebrate evolution. Nature Genet. 31:205-209.

    House, C. H., and S. T. Fitz-Gibbon. 2002. Using homolog groups to create a whole-genomic tree of free-living organisms: an update. J. Mol. Evol. 54:539-547.

    Huelsenbeck, J. P., F. Ronquist, R. Nielsen, and J. P. Bollback. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310-2314.

    Huynen, M. A., B. Snel, and P. Bork. 1999. Technical comments on Doolittle (1999a). Science 286:1443a.

    Huynen, M. A., and B. Snel. 2000. Gene and context: integrative approaches to genome analysis. Adv. Prot. Chem. 54:345-379.

    Jordan, I. K., K. S. Makarova, J. L. Spouge, Y. I. Wolf, and E. V. Koonin. 2001. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res. 11:555-565.

    Korbel, J. O., B. Snel, M. A. Huynen, and P. Bork. 2002. SHOT: a Web server for the construction of genome phylogenies. Trends Genet. 18:158-162.

    Lawrence, J. G., and H. Ochman. 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95:9413-9417.

    Lin, J., and M. Gerstein. 2000. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10:808-818.

    Natale, D. A., U. T. Shankavaram, M. Y. Galperin, Y. I. Wolf, L. Aravind, and E. V. Koonin. 2000. Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs). Genome Biol. 1, RESEARCH0009.

    Nei, M., X. Gu, and R. Sitnikova. 1997. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc. Natl. Acad. Sci. USA 94:7799-7806.

    Nelson, K. E., R. A. Clayton, S. R. Gill, et al. (29 co-authors). 1999. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399:323-329.

    Olsen, G. J., C. R. Woese, and R. Overbeek. 1994. The winds of (evolutionary) change: breathing new life into microbiology. J. Bacteriol. 176:1-6.

    Pellegrini, M., E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96:4285-4288.

    Saitou, N., and M. Nei. 1987. The Neighbor-Joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.

    Sankoff, D. 2001. Gene and genome duplication. Curr. Opin. Genet. Dev. 11:681-684.

    Snel, B., P. Bork, and M. A. Huynen. 1999. Genome phylogeny based on gene content. Nat. Genet. 21:108-110.

    Suzuki, Y., G. V. Glazko, and M. Nei. 2002. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc. Natl. Acad. Sci. USA 99:16138-16143.

    Tekaia, F., A. Lazcano, and B. Dujon. 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9:550-557.

    Woese, C. 1998. The universal ancestor. Proc. Natl. Acad. Sci. USA 95:6854-6859.

    Wolf, Y. I., I. B. Rogozin, N. V. Grishin, and E. V. Koonin. 2002. Genome trees and the tree of life. Trends Genet. 18:472-479.

    (Xun Gu* and Hongmei Zhan)