The Temporal Distribution of Gene Duplication Events in a Set of Highly Conserved Human Gene Families
Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
Abstract
Using a data set of protein translations associated with map positions in the human genome, we identified 1,520 mapped highly conserved gene families. By comparing sharing of families between genomic windows, we identified 92 potentially duplicated blocks in the human genome containing 422 duplicated members of these families. Using branching order in the phylogenetic trees, we timed gene duplication events in these families relative to the primate-rodent divergence, the amniote-amphibian divergence, and the deuterostome-protostome divergence. The results showed similar patterns of gene duplication times within duplicated blocks and outside duplicated blocks. Both within and outside duplicated blocks, numerous duplications were timed prior to the deuterostome-protostome divergence, whereas others occurred after the amniote-amphibian divergence. Thus, neither gene duplication in general nor duplication of genomic blocks could be attributed entirely to polyploidization early in vertebrate history. The strongest signal in the data was a tendency for intrachromosomal duplications to be more recent than interchromosomal duplications, consistent with a model whereby tandem duplication (whether of single genes or of genomic blocks) may be followed by eventual separation of duplicates due to chromosomal rearrangements. The rate of separation of tandemly duplicated gene pairs onto separate chromosomes in the human lineage was estimated at 1.7 × 10⁻⁹ per gene-pair per year.
Key Words: block duplication • genome evolution • polyploidization • tandem duplication • vertebrate evolution
Introduction
A popular hypothesis regarding the evolution of the vertebrate genome holds that there were two rounds of genome duplication by polyploidization early in vertebrate history (the 2R hypothesis). Until recently, it has not been possible to test this hypothesis rigorously. However, a number of recent tests of this hypothesis have been conducted, and the results have in every case failed to support the 2R hypothesis. For example, it had previously been suggested that a large number of families show a pattern of four members in vertebrates but a single member in Drosophila melanogaster; the best known example of such a pattern is provided by the Hox clusters . pointed out that such a pattern supports the 2R hypothesis only if (1) the vertebrate genes duplicated early in vertebrate history and (2) the phylogeny of the vertebrate genes shows a topology of two clusters of two, a topology described as (AB) (CD).
found that these conditions were rarely met in a set of gene families encoding a set of developmentally important proteins. The applied the same test to 57 families and reported results contrary to the 2R hypothesis in 76% of cases. In the most extensive application to date of this approach, examined all four-member gene families in the available portion of the human genome. In 134 families with resolved phylogenies, 71% showed results inconsistent with the 2R hypothesis. In addition, compared all homologous gene families in human and Drosophila and found that less than 5% of such families show a 4:1 ratio of the number of family members in human to the number of family members in Drosophila.
Advocates of the 2R hypothesis have frequently cited as evidence in favor of this hypothesis the existence of sets of paralogous genes found on two or more different chromosomes in the genomes of human or other vertebrates . The implication is that these genes were duplicated simultaneously during polyploidization. However, the hypothesis that a set of linked paralogues were duplicated simultaneously can only be accepted if the phylogenies of the gene families are consistent with their duplication during the same time period. When phylogenetic analyses have been applied to sets of linked paralogues allegedly duplicated simultaneously by polyploidization, the phylogenies have revealed that these genes were in fact duplicated at widely different times over the history of life . Furthermore, such potentially duplicated blocks are usually identified in a subjective manner. In a genome with numerous gene families, however, members of two or more of these families may be found in close linkage merely by chance, without having been duplicated simultaneously. For this reason, it is desirable to employ a statistical test of the hypothesis that genes have been duplicated simultaneously .
The major alternative to polyploidization hypotheses for explaining the occurrence of paralogous genes on different chromosomes is a hypothesis of tandem duplication followed by translocation of one or both duplicates to other chromosomes. Note that such a tandem duplication may involve a single genetic locus or a chromosomal block including several loci. Completely sequenced genomes of eukaryotes include examples of recently duplicated intrachromosomal blocks including many loci. A variety of mechanisms exist by which linked duplicates can be separated over evolutionary time. These include chromosomal breakage and rearrangement. Comparisons of the genomic maps of various mammals provide evidence that such mechanisms have operated repeatedly over the course of mammalian evolution.
Two recent papers have provided evidence of a peak of gene duplications early in vertebrate history, which the authors claim to be evidence that one or more genome duplications occurred at that time . There are numerous problems with these authors' analyses, however, and their conclusions are not well supported. First, in both studies the evidence for a peak of gene duplications relies on divergence time estimates made under the assumption of a molecular clock. It is well known that this assumption is often not met by molecular data (, pp. 215–235; , pp. 187–206). Furthermore, in both of these studies the evidence of a peak in gene duplications was entirely subjective. No statistical method was applied in either study to detect whether the observed peak might be attributed to random fluctuations. Finally, although used a statistical approach to identify potentially duplicated blocks in the human genome, they did not compare the duplication times of genes located within these blocks with those of genes outside the blocks.
In the present paper, we assign to families a set of highly conserved human protein-coding genes for which map positions are available. Using these conserved gene families, we apply the method of to identify potentially duplicated blocks in the human genome. We then identify homologous genes from other representative vertebrates, from the completely sequenced genomes of the invertebrates Drosophila melanogaster and Caenorhabditis elegans, and from the genomes of yeast and Arabidopsis thaliana. By constructing phylogenies of these gene families, we time gene duplication events relative to major cladogenetic events in vertebrate evolutionary history without relying on the assumption of a molecular clock. Comparing the duplication times of gene pairs within potentially duplicated blocks enables us to test the hypothesis that these blocks arose by polyploidization early in vertebrate history. In addition, comparison of the duplication times of gene pairs located on the same and different chromosomes enables us to estimate the pattern and rate of separation of duplicated gene pairs over the evolutionary history of the vertebrates.
Methods
Sequences and Homology Searches
The sequence data for Caenorhabditis elegans was downloaded from the Sanger Center web site at . The "gadfly" data for Drosophila melanogaster was obtained from . The data for yeast Saccharomyces cerevisiae was obtained at ; and that for Arabidopsis thaliana, at . The human genome dataset used was from the October 7, 2000, freeze and is publicly available at . We used this version because it includes a features table providing map locations for a set of 16,663 known or predicted proteins. Proteome datasets from 38 complete genomes of Bacteria and all available sequences from a set of other representative organisms (the slime mold Dictyostelium discoideum, the zebrafish Danio rerio, the pufferfish Takifugu rubripes, the frog Xenopus laevis, and the mouse Mus musculus) were obtained from the National Center for Biotechnology Information (NCBI).
The location of each human protein was parsed from the features table and related to the protein sequence using the locus name as the unique identifier. When two predicted genes overlapped in location, which presumably occurs because of alternatively spliced transcripts from the same gene, one gene was chosen at random in order to eliminate redundancy. After redundancy was removed, 13,802 proteins remained in the human data set. This number represents 47.0% of the total number of known and predicted proteins (both mapped and unmapped), including splice variants.
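The redundancy filter described above can be sketched as follows. This is only an illustration, not the authors' script: the tuple layout `(locus, chromosome, start, end)` is a hypothetical stand-in for the features table, and treating a chain of mutually overlapping genes as one group from which a single gene is drawn is an assumption, since the text only describes the case of two overlapping genes.

```python
import random


def remove_overlapping(genes, seed=0):
    """Keep one randomly chosen gene from each group of overlapping genes.

    genes: list of (locus, chromosome, start, end) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene[1], []).append(gene)

    kept = []
    for chrom in sorted(by_chrom):
        group, group_end = [], None
        # Sweep along the chromosome in order of start coordinate.
        for gene in sorted(by_chrom[chrom], key=lambda g: (g[2], g[3])):
            if group and gene[2] <= group_end:
                # Overlaps the current group: extend it.
                group.append(gene)
                group_end = max(group_end, gene[3])
            else:
                if group:
                    kept.append(rng.choice(group))
                group, group_end = [gene], gene[3]
        if group:
            kept.append(rng.choice(group))
    return kept
```

A fixed seed makes the random choice reproducible, which is convenient when the downstream family assignment needs to be rerun.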
The text file of the nonredundant proteins was formatted as a database file using the BLAST tools obtained from the National Center for Biotechnology Information (NCBI) ftp site. After the nonredundant proteome was determined for each genome, each protein was used to search for homology among the rest of the proteome. This "all against all" BLAST search was performed using the blastall executable, which is packaged with the BLAST tools. Similarly, each protein in the nonredundant human proteome was searched against all proteins from other genomes. In searching human proteins against the remainder of the human proteome and in searching human proteins against nonhuman proteins, we used an expect value of E = 10⁻⁵⁰. The use of a strict search criterion has the advantage that it identifies as homologous only proteins showing evidence of homology throughout the length of the protein rather than in only one domain or a few domains. In all cases, we used the defaults of a BLOSUM62 substitution matrix and the SEG filter. The resultant records were filtered using MSPcrunch, a program to filter and convert the BLAST output to a tabular format.
Given all pairs of homologous proteins, a "single link" method was used to find the protein families. This step groups genes that share homology, directly or through intermediates. For example, if genes A and B share homology, and B and C share homology, then A, B, and C are placed in one family. Further, in this example, if A and D also share homology, then A, B, C, and D are in a family. Using the strict homology search criterion E = 10⁻⁵⁰ yielded a set of 1,520 highly conserved gene families with two or more members in human, containing a total of 5,475 genes. These were used in the identification of potentially duplicated genomic blocks and phylogenetic analyses (see below). Within families, amino acid sequences were aligned using ClustalW 1.81.
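Single-link grouping of this kind amounts to finding connected components in a graph whose edges are the homologous pairs. The sketch below uses a union-find structure for that purpose; it is one common way to do it and is not claimed to be the implementation used in the study. The function name and input format (an iterable of gene-name pairs) are illustrative assumptions.

```python
def single_link_families(pairs):
    """Group genes into families: any chain of homologous pairs joins a family."""
    parent = {}

    def find(gene):
        # Return the representative of gene's family, compressing the path.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])


# The example from the text: A-B, B-C, and A-D end up in one family.
print(single_link_families([("A", "B"), ("B", "C"), ("A", "D"), ("E", "F")]))
```

One property worth keeping in mind is that single linkage can chain distantly related proteins together through intermediates, which is one reason a strict E-value threshold matters here.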
Table 1 Distribution of Family Sizes in the Set of Conserved Human Gene Families Used in Analyses
Identification of Duplicated Blocks
The method of was used to identify potentially duplicated genomic blocks and to test the hypothesis that such blocks share gene families to a greater extent than expected by chance. This method involves comparing genomic windows for sharing of gene families. In applying this method to the mapped portion of human genome, we used the "paranome" or set of all genes in the genome belonging to families with two or more members (set of paralogues) obtained by homology search as described above with E = 10⁻⁵⁰. Fixed, nonoverlapping windows of m paranome members (where m is a positive integer) were identified along each chromosome (excluding Y, which had fewer paranome members than the values of m used). When the number of paranome members in a chromosome was not evenly divisible by m, r windows in that chromosome were randomly chosen to be increased by one member (where r < m is the remainder of division of the total number of paranome members in the chromosome by m). The number of families shared between all windows in the genome was computed and compared with the results for 10,000 random genomes (in which paranome members were randomly assigned to chromosome locations). A significantly higher level of sharing of gene families between windows as determined by this randomization test was taken as evidence of a duplicated genomic structure. Here we report the results of analyses using m = 10 and m = 30. For further details of this method see .
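The windowing and randomization procedure can be sketched roughly as below. Several details are assumptions made for illustration rather than taken from the cited method: the genome is represented as a dictionary mapping chromosome name to an ordered list of family labels, window pairs on the same chromosome are counted along with pairs on different chromosomes, and when the remainder r exceeds the number of windows the extra members are spread as evenly as possible. The returned fraction is an upper-tail value; the lower tail can be obtained analogously.

```python
import random
from itertools import combinations


def make_windows(families, m, rng):
    """Split one chromosome's ordered family labels into windows of about m members."""
    n = len(families)
    k = n // m                      # number of windows on this chromosome
    if k == 0:
        return []                   # fewer than m members: chromosome is skipped
    base, extra = divmod(n, k)      # base == m whenever the remainder r < k
    bigger = set(rng.sample(range(k), extra))  # windows that get one more member
    windows, pos = [], 0
    for i in range(k):
        size = base + (1 if i in bigger else 0)
        windows.append(set(families[pos:pos + size]))
        pos += size
    return windows


def count_matches(genome, m, min_shared, rng):
    """Count pairs of windows sharing at least min_shared families."""
    windows = [w for c in sorted(genome) for w in make_windows(genome[c], m, rng)]
    return sum(1 for a, b in combinations(windows, 2) if len(a & b) >= min_shared)


def randomization_test(genome, m, min_shared, n_random=1000, seed=0):
    """Return (observed count, fraction of random genomes with count >= observed)."""
    rng = random.Random(seed)
    observed = count_matches(genome, m, min_shared, rng)
    chroms = sorted(genome)
    pool = [f for c in chroms for f in genome[c]]
    at_least = 0
    for _ in range(n_random):
        rng.shuffle(pool)           # reassign paranome members to random locations
        shuffled, pos = {}, 0
        for c in chroms:
            shuffled[c] = pool[pos:pos + len(genome[c])]
            pos += len(genome[c])
        if count_matches(shuffled, m, min_shared, rng) >= observed:
            at_least += 1
    return observed, at_least / n_random
```

Shuffling the pooled labels while preserving each chromosome's member count keeps chromosome sizes fixed, so only the arrangement of family members is randomized.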
Note that used a method similar to that of to identify potentially duplicated blocks in the human genome. However, the method of these authors differed from that of in important ways. First, these authors did not use a strict homology search criterion; thus, they may have identified as homologues genes showing strong evidence of homology only in a portion of the sequence. Second, "collapsed" all tandem arrays of duplicated genes, counting each such array as only a single gene. Thus, their randomization test of the hypothesis that the same families occur in syntenic groups in separate genomic locations to a greater extent than expected by chance is biased and is more likely to reject the null hypothesis than is our test.
Phylogenetic Analyses
Phylogenetic trees were constructed by the quartet maximum-likelihood (ML) method as implemented in TREE-PUZZLE 5.0, using the JTT model of amino acid evolution and assuming that rate variation among sites followed a gamma distribution. All trees were treated as unrooted, and no attempt was made to assign an outgroup to root any tree. On the basis of tree topology, we determined the time of each human gene duplication event relative to the following cladogenetic events: the deuterostome-protostome divergence; the amniote-amphibian divergence; and the primate-rodent divergence. This method of timing duplication events does not assume a constant rate of molecular evolution ("molecular clock") and is independent of the rooting of the tree. We concluded that a gene duplicated prior to a cladogenetic event if the internal branch supporting that duplication was significantly supported. We considered a branch to be significantly supported if it was supported in 95% or more of 10,000 puzzling steps; this represents a highly conservative test for significance of an internal branch.
In counting the proportions of duplications in a set of trees that could be dated prior to a given cladogenetic event, we compared only those families for which a sufficient number of sequences were available for the hypothesis of duplication before that event to be tested. For example, in a gene family including two human sequences, a mouse sequence, and a Drosophila sequence, the only hypothesis that can be tested regarding the duplication of the human genes is that they duplicated prior to the primate-rodent divergence. This hypothesis would be supported if one of the human genes clustered with the mouse gene and one with the Drosophila gene, and if the internal branch separating these two clusters was significantly supported. By contrast, if the two human genes clustered together, the phylogeny would be taken as consistent with the hypothesis that the human genes duplicated after the primate-rodent divergence. In comparing proportions of gene duplications occurring before and after a given cladogenetic event, we compared only (1) those families in which there was significant support for duplication before that cladogenetic event and (2) those families in which it was possible to test the hypothesis of gene duplication prior to that cladogenetic event yet the topology of the tree was consistent with duplication after that event.
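The four-sequence example above can be written out as a small decision rule. The sketch below covers only that example (two human genes, one mouse gene, one Drosophila gene) and is not a general tree-reading routine; the labels `human1`, `human2`, `mouse`, and `fly`, the representation of the internal branch as a pair of sets, and the return strings are all hypothetical choices made for illustration. The 0.95 threshold corresponds to the 95% puzzling-step support criterion described in the text.

```python
def time_quartet_duplication(split, support, threshold=0.95):
    """Classify the human duplication in a human1/human2/mouse/fly quartet.

    split: pair of sets giving the two sides of the internal branch,
           or None if the quartet is unresolved (star-like).
    support: fraction of puzzling steps supporting that branch.
    """
    if split is None:
        return "unresolved"
    side_a, side_b = (set(side) for side in split)
    humans = {"human1", "human2"}
    if humans == side_a or humans == side_b:
        # The two human genes cluster together.
        return "consistent with duplication after the primate-rodent divergence"
    if support >= threshold:
        # One human gene groups with mouse, the other with fly, with strong support.
        return "duplicated before the primate-rodent divergence"
    return "unresolved"


print(time_quartet_duplication(({"human1", "mouse"}, {"human2", "fly"}), 0.97))
```

Families falling in the "unresolved" class would be excluded from the proportions compared in the text, since neither hypothesis is supported for them.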
Results
Duplicated Genomic Blocks
When a window size of 10 genes (paranome members) was used, the number of genomic windows sharing two or more families was significantly lower than expected by chance. This pattern presumably reflects the occurrence of numerous cases of closely linked duplicated genes in the human genome. Because a substantial number of paralogues are closely linked to each other, sharing of two or more families between genomic windows occurs less frequently in the actual human genome than in the random genomes. Similarly, the observed number of windows sharing 3 or more families was significantly lower than the number expected by chance at the 5% level. On the other hand, the number of windows sharing 4 or more families was significantly greater than expected by chance, and the number of windows sharing 6 or more families was significantly greater than expected by chance.
Table 2 Numbers of Matches Between Genomic Windows
When a window size of 30 genes was used, the number of windows sharing two or more families was again significantly lower than expected by chance. Likewise, the numbers of windows sharing three or more members and four or more members were significantly lower than expected by chance. Again, these results presumably reflect the occurrence of duplicated genes in close linkage with one another to a greater extent than expected by chance. By contrast, the number of windows sharing six or more families was greater than expected by chance, as was the number of windows sharing eight or more families. Note that the number of duplicated blocks identified by our method is lower than that identified by . This difference can probably be attributed in large part to the less strict homology criterion used by these authors, as well as to the less conservative nature of their statistical test for duplicated structure.
Given that the human genome includes numerous closely linked duplicated genes, it might be argued that sharing of two or more families between windows will be observed relatively infrequently simply because a large proportion of windows will include members of a single family. To test for this possibility, we examined the distribution of the number of members in the largest family observed in each window . With a window size of 10, only 7 of 548 windows (1.3%) included 10 genes all of the same family; and in only 28 of 548 windows (5.1%) was the largest family represented by 6 or more genes. The median value for the size of the largest family per window was 2.0, whereas the mean was 2.34 with a standard error of 0.07. Thus, with a window size of 10, there were some cases in which a single family accounted for over half the genes in a window, but these were very rare.
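The summary statistics reported here (size of the largest family per window, its median, mean, and standard error) are straightforward to compute. The sketch below shows one way, assuming each window is given as a list of family labels; the toy windows in the usage line are invented for illustration and are not data from the study.

```python
from collections import Counter
from math import sqrt
from statistics import mean, median, stdev


def largest_family_sizes(windows):
    """For each window, the number of genes belonging to its most common family."""
    return [max(Counter(window).values()) for window in windows]


def summarize(sizes):
    """Median, mean, and standard error of the mean (sample SD / sqrt(n))."""
    return {
        "median": median(sizes),
        "mean": mean(sizes),
        "se": stdev(sizes) / sqrt(len(sizes)),
    }


# Hypothetical toy windows, each a list of family labels.
toy = [["a", "a", "b"], ["c", "d", "e"], ["f", "f", "f"]]
print(summarize(largest_family_sizes(toy)))
```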
FIG. 1. Frequency distributions of the size of the largest gene family in fixed genomic windows (highest count) in windows of 10 paranome members (A) and windows of 30 paranome members (B)
With a window size of 30, there were no cases of a single window in which all genes were members of the same family. Rather, the highest number of genes in the same family was 22, which occurred in just one case; and in only 2 of 183 windows (1.1%) was the largest family represented by 16 or more genes. The median value for the size of the largest family was 3.0, whereas the mean was 3.96 with a standard error of 0.23. Thus, with a window size of 30, there was no evidence that the frequency of windows sharing two or more gene families was biased by the occurrence of windows in which a single family predominated.
To examine the maximal number of possibly duplicated regions, we combined the data on windows of size 10 sharing three or more families and windows of size 30 sharing six or more families to identify potentially duplicated genomic regions. The combined data included 422 duplicated gene pairs in 92 putatively duplicated genomic blocks. (A listing of the genes involved is available from the authors upon request.) shows a graph illustrating sharing of one or more duplicated blocks between chromosomes. All chromosomes were represented in the graph except chromosome 21, which shared no duplicated blocks with other chromosomes; and the graph was completely connected . In addition to between-chromosome duplicated blocks, within-chromosome duplicated blocks were observed on chromosomes 1, 3, 6, 8, 11, 15, 16, 17, 19, 22, and X.
FIG. 2. Schematic illustration of between-chromosome duplicated blocks. Circles correspond to human chromosomes, and lines correspond to sharing of at least one putatively duplicated block between a pair of chromosomes
Phylogenetic Analyses
On the basis of resolved phylogenetic trees, a total of 1,057 gene duplication events in 706 human gene families were timed relative to one or more of the following three cladogenetic events: the primate-rodent divergence, the amniote-amphibian divergence, and the deuterostome-protostome divergence. For all three of these events, there was a striking difference between duplications of genes mapping to the same chromosome (within-chromosome) and duplications of genes mapping to different chromosomes (between-chromosome). In the case of each of these cladogenetic events, a significantly higher proportion of between-chromosome duplications than of within-chromosome duplications were timed prior to the event.
FIG. 3. Proportions of gene duplications in conserved human gene families that could be dated, by significantly supported interior branches in phylogenetic trees, prior to three major cladogenetic events. Chi-square tests of the hypothesis of equality of the proportions in within-chromosome and between-chromosome duplicate pairs: ***P < 0.001
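A chi-square test of equality of two proportions, as named in the caption, can be computed from a 2 × 2 table of counts (within- versus between-chromosome pairs, dated before versus after the event). The sketch below uses the standard Pearson statistic with one degree of freedom and no continuity correction; whether the authors applied a correction is not stated, so that choice is an assumption, and the counts in the usage line are invented for illustration.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] with 1 degree of freedom.

    Rows: the two groups (e.g. within- and between-chromosome pairs).
    Columns: dated before the event / consistent with after the event.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 d.f., P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, not taken from the paper.
chi2, p = chi_square_2x2(30, 10, 10, 30)
print(f"chi2 = {chi2:.2f}, P = {p:.2g}")
```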
When we compared genes in putatively duplicated genomic blocks with other genes, no significant differences were seen with respect to the timing of duplication events . Genes in within-chromosome duplicated blocks showed a pattern very similar to that seen for other within-chromosome duplications . Likewise, genes in between-chromosome duplicated blocks showed a pattern similar to that seen for other between-chromosome duplications . Considering only genes within duplicated blocks, a significantly lower proportion of within-chromosome duplications than of between-chromosome duplications could be dated prior to the primate-rodent divergence or prior to the amniote-amphibian divergence .
Table 3 Timing of Human Gene Duplications in Duplicated Blocks with Respect to Cladogenetic Events
We analyzed separately duplication times of genes located on sets of chromosomes that have previously been cited as evidence for ancient genome duplications: chromosomes 1, 6, 9, and 19; chromosomes 2, 7, 12, and 17 (the chromosomes bearing the Hox clusters); and chromosomes 1, 11, 15, and 19. The pattern of timing of gene duplications between chromosomes in these sets showed no statistically significant differences from that for other between-chromosome duplications.
Table 4 Timing of Human Gene Duplications on Allegedly Duplicated Chromosomes with Respect to Cladogenetic Events
Separation of Duplicate Pairs
We used the results of the phylogenetic analyses to estimate the rate of separation of tandemly duplicated gene pairs onto separate chromosomes over the course of vertebrate evolution. We used a simple model assuming (1) that there is a constant probability of separation of tandemly duplicated gene pairs onto different chromosomes and (2) that the probability of gene pairs that have once been separated being translocated back to the same chromosome is negligible. If so, the probability that a tandemly duplicated pair of genes is still on the same chromosome after t years (P_t) is given by

$$P_t = e^{-\lambda t}$$

where λ denotes the rate of separation per gene pair per year.
We used the proportion of duplicate pairs estimated to have duplicated before a given cladogenetic event to estimate P_t. We then used linear regression of the natural logarithms of these estimated P_t values against estimates of the times of the cladogenetic events in order to estimate λ from the slope of the best fit line through the origin and through these three time points. We used the estimates of 110 MYA for the primate-rodent divergence; 360 MYA for the amniote-amphibian divergence; and 830 MYA for the deuterostome-protostome divergence.
Although only three data points were available, the fit of the regression (R² = 96.1%; P = 0.02) was quite good. The resulting estimate of λ was 1.7 × 10⁻⁹ per gene-pair per year. Because of the simplifying assumptions made, this represents a minimum estimate of the probability of separation of tandemly duplicated gene pairs.
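A constant per-year probability of separation with no return implies exponential decay of the fraction of pairs still on the same chromosome, so the logarithm of that fraction is linear in time with zero intercept. The sketch below fits such a line by least squares through the origin. The symbol `lam` for the separation rate and the function names are labels chosen here, and because the per-event proportions are not given in the text, the usage example generates synthetic proportions from a known rate rather than using real data. The three divergence times are those stated in the text.

```python
from math import exp, log

# Divergence times from the text, in millions of years (MY).
TIMES_MY = [110.0, 360.0, 830.0]


def fit_rate_through_origin(times, proportions):
    """Least-squares slope of ln(P_t) on t with the intercept fixed at zero.

    Returns lam = -slope, the estimated separation rate per unit of time.
    """
    log_p = [log(p) for p in proportions]
    slope = sum(t * y for t, y in zip(times, log_p)) / sum(t * t for t in times)
    return -slope


def fraction_same_chromosome(lam, t):
    """Expected fraction of tandem pairs still on one chromosome after time t."""
    return exp(-lam * t)


# Synthetic check: proportions generated from a known rate are recovered.
true_lam = 0.002  # per MY, arbitrary illustrative value
synthetic = [fraction_same_chromosome(true_lam, t) for t in TIMES_MY]
lam_per_my = fit_rate_through_origin(TIMES_MY, synthetic)
print(lam_per_my, lam_per_my / 1e6)  # per MY, then per year
```

With times expressed in millions of years, the fitted rate is per million years; dividing by 10⁶ converts it to a per-year rate comparable to the figure quoted in the text.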
FIG. 4. The natural logarithm of p, the proportion of duplications that can be dated to a given cladogenetic event, plotted against the estimated time of that event in millions of years (t). The line is the equation of the linear regression through the origin, Y = -0.311 - 0.0013 x (R² = 96.1%; P = 0.02)
Discussion
The present study examined the timing of gene duplication events in a set of highly conserved human gene families with known map locations. Consistent with some previous studies but not others, the results provided no strong signal suggestive of ancient polyploidization in the vertebrates. On the polyploidization hypothesis, a substantial number of paralogous gene pairs on different chromosomes are expected to have arisen prior to the amniote-amphibian divergence but after the deuterostome-protostome divergence. This should be reflected in a high percentage of between-chromosome pairs of genes that duplicated prior to the amniote-amphibian divergence. In fact, the proportion of between-chromosome duplications that could be dated prior to the amniote-amphibian divergence was intermediate between the proportion dated prior to the primate-rodent divergence and the proportion dated prior to the deuterostome-protostome divergence . and reported peaks of gene duplication early in vertebrate history on the basis of molecular clock analyses. However, many of these gene duplications may have been incorrectly timed because of defects inherent in molecular clock analyses. It is well known that gene duplication is often followed by a period of accelerated evolution at the amino acid level as daughter genes adapt to distinct functions. Such accelerated evolution will disrupt the molecular clock and cause the duplication to be dated earlier than it actually occurred. The existence of numerous such cases might create an artifactual "peak" of gene duplications at earlier dates.
did not attempt to weed out protein families not evolving in a clock-like manner; thus, their analyses probably included a high proportion of families in which the assumption of a molecular clock is not valid. did attempt to weed out families not evolving in a clock-like manner. They used the two-cluster test of , which tests for nonuniformity of rate between two groups of sequences in comparison to an outgroup. This method, however, will not be able to detect bursts of rapid evolution after gene duplication if they occur in both duplicates. In addition, the families assumed by to be evolving in a clock-like manner were in fact merely those families for which they lacked statistical power to detect deviations from a molecular clock. Thus, the data set on which they based their estimates probably included many families of short sequences and sequences with high rates of replacement per site, because in these cases the test will lack statistical power to detect deviations from the molecular clock even when such deviations are present.
Furthermore, it is worth noting that an apparent peak of gene duplications, even if it is not a statistical artifact, is not in itself evidence of polyploidization. Gene duplication occurs continually over the course of evolution, but most duplicate genes are quickly lost . Specialization of a duplicate gene for a new function substantially enhances its probability of being retained . Thus, if we observe an apparent peak of gene duplication in the past history of a species, what we are really observing is not a peak of gene duplication per se, but rather a peak of retention of duplicate gene copies. And because retention of duplicate genes is likely to be associated with the evolution of new functions , an apparent peak of gene duplication is likely to be the signature not of polyploidization but of adaptive radiation. Thus, even if the peaks of retention of duplicate gene copies early in vertebrate history reported by and are not artifacts, they provide no information one way or another regarding the hypothesis of polyploidization. Rather, they merely reflect the occurrence of adaptive radiation early in vertebrate history, which is unsurprising on the basis of our knowledge of vertebrate paleontology .
On the polyploidization hypothesis, duplicated genomic regions are the residue of ancient polyploidization events . Here we used a simple method to identify such duplicated regions . When this method was previously applied to the genome of yeast, it provided a strong signal of genome duplication, as had previously been proposed for this species by , an event estimated to have occurred 200–300 MYA . This method identified a number of potentially duplicated blocks in the human genome. However, when phylogenetic analysis was used to time gene duplications between these blocks, the results provided no strong signal of ancient polyploidization. On the polyploidization hypothesis, we would expect gene pairs in duplicated blocks to show a disproportionate number duplicated after the deuterostome-protostome divergence but before the amniote-amphibian divergence. In fact, the pattern of gene duplication times in duplicated blocks was not different from that outside the blocks . Likewise, when chromosome pairs previously alleged to show the effects of ancient polyploidization were analyzed, the pattern of duplication times was very similar to that for genes on other chromosome pairs .
Duplication of certain of the gene pairs in duplicated blocks could be dated with strong statistical support prior to the deuterostome-protostome divergence, whereas others duplicated after the amniote-amphibian divergence or after the primate-rodent divergence. A similar result was seen in the case of genes on allegedly duplicated chromosome pairs. Duplication times after the amniote-amphibian divergence or prior to the deuterostome-protostome divergence are not explainable by polyploidization early in vertebrate history. Thus, our results suggest that, if genome duplication did occur early in vertebrate history, it was not responsible for a large fraction of the duplicated genes or for a large fraction of the duplicated genomic blocks found in the genomes of current-day vertebrates.
The simplest alternative model to that of polyploidization to explain the increase in gene number in vertebrates is a model invoking repeated independent events of tandem gene duplication . These tandem duplications might involve individual genes or they might involve chromosomal blocks such as we detected on chromosomes 1, 3, 6, 8, 11, 15, 16, 17, 19, 22, and X. Indeed, recent evidence from the human genome suggests that duplication of genomic blocks is a recurring feature of vertebrate genome evolution . Once a gene or genomic segment has been duplicated, subsequent events of chromosome breakage and translocation of chromosomal segments can serve to break up tandemly duplicated gene pairs.
Several aspects of our data support this model. First, we found a consistent tendency for duplicated gene pairs mapping to the same chromosome to have duplicated more recently than those mapping to separate chromosomes. This was true of genes in duplicated blocks as well as of other genes. Furthermore, our results on the timing of gene duplications provided an excellent fit to a simple model assuming only tandem duplication and a constant probability of separation onto different chromosomes.
We estimated the rate of separation of tandemly duplicated gene pairs onto different chromosomes in the human lineage at $1.7 \times 10^{-9}$ per gene-pair per year. This estimate represents a long-term average for the human lineage, and it cannot be expected to apply to vertebrates with numbers of chromosomes either much larger or much smaller than those of humans. Given this rate, it is expected that after 100 million years, about one in six duplicated gene pairs will have separated onto different chromosomes. After 450 million years, the estimated time since the last common ancestor of bony fishes and tetrapods, about three-quarters of duplicated gene pairs are expected to be separated onto different chromosomes. The anticipated availability of complete genomic sequences from human, mouse, pufferfish, and zebrafish will make it possible to test these predictions. In addition, application of similar methods to a number of complete genomes will make it possible for us to develop more precise quantitative models of vertebrate chromosomal evolution.
Acknowledgements
This research was supported by National Institutes of Health grant GM066710 to A.L.H.
Literature Cited
Abi-Rached, L., A. Gilles, T. Shiina, P. Pontarotti, and H. Inoko. 2002. Evidence of en bloc duplication in vertebrate genomes. Nat. Genet 31:100-105.
Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Bailey, J. A., G. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler. 2002. Recent segmental duplications in the human genome. Science 297:1003-1007.
Carroll, R. L. 1988. Vertebrate paleontology and evolution. W. H. Freeman, New York.
Friedman, R., and A. L. Hughes. 2001a. Pattern and timing of gene duplication in animal genomes. Genome Res 11:1842-1847.
Friedman, R., and A. L. Hughes. 2001b. Gene duplication and the structure of eukaryotic genomes. Genome Res 11:373-381.
Gu, X., Y. Wang, and J. Gu. 2002. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate genomes. Nat. Genet 31:205-209.
Hughes, A. L. 1994. The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. Ser. B 256:119-125.
Hughes, A. L. 1998. Phylogenetic tests of the hypothesis of block duplication of homologous genes on human chromosomes 6, 9, and 1. Mol. Biol. Evol 15:854-870.
Hughes, A. L. 1999. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol. Evol 48:565-576.
Hughes, A. L., J. da Silva, and R. Friedman. 2001. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11:771-780.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Jékely, G., and P. Friedrich. 1999. The evolution of the calpain family as reflected in paralogous chromosome regions. J. Mol. Evol 49:272-281.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci 8:275-282.
Kasahara, M., J. Nakaya, Y. Satta, and N. Takahata. 1997. Chromosomal duplication and the emergence of the adaptive immune system. Trends Genet 13:90-92.
Kent, W. J., and D. Haussler. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Res 11:1461-1462.
Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917-919.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Lundin, L. G. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosome regions in man and the house mouse. Genomics 16:1-19.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155.
Lynch, M., M. O'Hely, B. Walsh, and A. Force. 2001. The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789-184.
McLysaght, A., K. Hokamp, and K. H. Wolfe. 2002. Extensive genomic duplication during early chordate evolution. Nat. Genet 31:200-204.
Meyer, A., and M. Schartl. 1999. Gene and genome duplication in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr. Opin. Cell Biol 11:699-704.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Nei, M., P. Xu, and G. Glazko. 2001. Estimation of divergence times from multiprotein sequences for a few mammalian species and several distantly related organisms. Proc. Natl. Acad. Sci. USA 98:2497-2502.
O'Brien, S. J., M. Menotti-Raymond, W. J. Murphy, W. G. Nash, J. Wiensburg, R. Stanyon, N. G. Copeland, N. A. Jenkins, J. Womack, and J. A. M. Graves. 1999. The promise of comparative genomics in mammals. Science 286:458-481.
Samonte, R. V., and E. E. Eichler. 2002. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet 3:65-72.
Sidow, A. 1996. Gen(om)e duplications in the evolution of early vertebrates. Curr. Opin. Genet. Dev 6:715-722.
Sonnhammer, E. L. L., and R. Durbin. 1994. A workbench for large scale sequence homology analysis. Comput. Appl. Biosci 10:301-307.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol 13:964-969.
Takezaki, N., A. Rzhetsky, and M. Nei. 1995. Phylogenetic test of the molecular clock and linearized trees. Mol. Biol. Evol 12:823-833.
Thompson, J. D., D. G. Higgins, and T. Gibson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
Wolfe, K. H. 2001. Yesterday's polyploids and the mystery of diploidization. Nat. Rev. Genet 2:333-341.
Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713.
Wootton, J. C., and S. Federhen. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comp. Chem 17:149-163.
Accepted for publication September 25, 2002. (Robert Friedman and Austin L. Hughes)
Abstract
Using a data set of protein translations associated with map positions in the human genome, we identified 1520 mapped highly conserved gene families. By comparing sharing of families between genomic windows, we identified 92 potentially duplicated blocks in the human genome containing 422 duplicated members of these families. Using branching order in the phylogenetic trees, we timed gene duplication events in these families relative to the primate-rodent divergence, the amniote-amphibian divergence, and the deuterostome-protostome divergence. The results showed similar patterns of gene duplication times within duplicated blocks and outside duplicated blocks. Both within and outside duplicated blocks, numerous duplications were timed prior to the deuterostome-protostome divergence, whereas others occurred after the amniote-amphibian divergence. Thus, neither gene duplication in general nor duplication of genomic blocks could be attributed entirely to polyploidization early in vertebrate history. The strongest signal in the data was a tendency for intrachromosomal duplications to be more recent than interchromosomal duplications, consistent with a model whereby tandem duplication, whether of single genes or of genomic blocks, may be followed by eventual separation of duplicates due to chromosomal rearrangements. The rate of separation of tandemly duplicated gene pairs onto separate chromosomes in the human lineage was estimated at $1.7 \times 10^{-9}$ per gene-pair per year.
Key Words: block duplication • genome evolution • polyploidization • tandem duplication • vertebrate evolution
Introduction
A popular hypothesis regarding the evolution of the vertebrate genome holds that there were two rounds of genome duplication by polyploidization early in vertebrate history (the 2R hypothesis). Until recently, it has not been possible to test this hypothesis rigorously. However, a number of recent tests of this hypothesis have been conducted, and the results have in every case failed to support the 2R hypothesis. For example, it had previously been suggested that a large number of families show a pattern of four members in vertebrates but a single member in Drosophila melanogaster; the best known example of such a pattern is provided by the Hox clusters . pointed out that such a pattern supports the 2R hypothesis only if (1) the vertebrate genes duplicated early in vertebrate history and (2) the phylogeny of the vertebrate genes shows a topology of two clusters of two, a topology described as (AB) (CD).
found that these conditions were rarely met in a set of gene families encoding a set of developmentally important proteins. The applied the same test to 57 families and reported results contrary to the 2R hypothesis in 76% of cases. In the most extensive application to date of this approach, examined all four-member gene families in the available portion of the human genome. In 134 families with resolved phylogenies, 71% showed results inconsistent with the 2R hypothesis. In addition, compared all homologous gene families in human and Drosophila and found that less than 5% of such families show a 4:1 ratio of the number of family members in human to the number of family members in Drosophila.
Advocates of the 2R hypothesis have frequently cited as evidence in favor of this hypothesis the existence of sets of paralogous genes found on two or more different chromosomes in the genomes of human or other vertebrates . The implication is that these genes were duplicated simultaneously during polyploidization. However, the hypothesis that a set of linked paralogues were duplicated simultaneously can only be accepted if the phylogenies of the gene families are consistent with their duplication during the same time period. When phylogenetic analyses have been applied to sets of linked paralogues allegedly duplicated simultaneously by polyploidization, the phylogenies have revealed that these genes were in fact duplicated at widely different times over the history of life . Furthermore, such potentially duplicated blocks are usually identified in a subjective manner. In a genome with numerous gene families, however, members of two or more of these families may be found in close linkage merely by chance, without having been duplicated simultaneously. For this reason, it is desirable to employ a statistical test of the hypothesis that genes have been duplicated simultaneously .
The major alternative to polyploidization hypotheses for explaining the occurrence of paralogous genes on different chromosomes is a hypothesis of tandem duplication followed by translocation of one or both duplicates to other chromosomes. Note that such a tandem duplication may involve a single genetic locus or a chromosomal block including several loci. Completely sequenced genomes of eukaryotes include examples of recently duplicated intrachromosomal blocks including many loci. A variety of mechanisms exist by which linked duplicates can be separated over evolutionary time. These include chromosomal breakage and rearrangement. Comparisons of the genomic maps of various mammals provide evidence that such mechanisms have operated repeatedly over the course of mammalian evolution.
Two recent papers have provided evidence of a peak of gene duplications early in vertebrate history, which the authors claim to be evidence that one or more genome duplications occurred at that time . There are numerous problems with these authors' analyses, however, and their conclusions are not well supported. First, in both studies the evidence for a peak of gene duplications relies on divergence time estimates made under the assumption of a molecular clock. It is well known that this assumption is often not met by molecular data (, pp. 215–235; , pp. 187–206). Furthermore, in both of these studies the evidence of a peak in gene duplications was entirely subjective. No statistical method was applied in either study to detect whether the observed peak might be attributed to random fluctuations. Finally, although used a statistical approach to identify potentially duplicated blocks in the human genome, they did not compare the duplication times of genes located within these blocks with those of genes outside the blocks.
In the present paper, we assign to families a set of highly conserved human protein-coding genes for which map positions are available. Using these conserved gene families, we apply the method of to identify potentially duplicated blocks in the human genome. We then identify homologous genes from other representative vertebrates, from the completely sequenced genomes of the invertebrates Drosophila melanogaster and Caenorhabditis elegans, and from the genomes of yeast and Arabidopsis thaliana. By constructing phylogenies of these gene families, we time gene duplication events relative to major cladogenetic events in vertebrate evolutionary history without relying on the assumption of a molecular clock. Comparing the duplication times of gene pairs within potentially duplicated blocks enables us to test the hypothesis that these blocks arose by polyploidization early in vertebrate history. In addition, comparison of the duplication times of gene pairs located on the same and different chromosomes enables us to estimate the pattern and rate of separation of duplicated gene pairs over the evolutionary history of the vertebrates.
Methods
Sequences and Homology Searches
The sequence data for Caenorhabditis elegans were downloaded from the Sanger Center web site at . The "gadfly" data for Drosophila melanogaster were obtained from . The data for the yeast Saccharomyces cerevisiae were obtained at ; and those for Arabidopsis thaliana, at . The human genome dataset used was from the October 7, 2000, freeze and is publicly available at . We used this version because it includes a features table providing map locations for a set of 16,663 known or predicted proteins. Proteome datasets from 38 complete genomes of Bacteria and all available sequences from a set of other representative organisms (the slime mold Dictyostelium discoideum, the zebrafish Danio rerio, the pufferfish Takifugu rubripes, the frog Xenopus laevis, and the mouse Mus musculus) were obtained from the National Center for Biotechnology Information (NCBI).
The location of each protein in human was parsed from the features table. The location was then related to the protein sequence using the locus name as the unique identifier. When two predicted genes overlapped in location, one gene was chosen at random in order to eliminate redundancy; such overlap presumably reflects alternatively spliced transcripts of the same gene. After redundancy was removed, 13,802 proteins remained in the human data set. This number represents 47.0% of the total number of known and predicted proteins (both mapped and unmapped), including splice variants.
The text file of the nonredundant proteins was formatted as a database file using the blast tools obtained from the National Center for Biotechnology Information (NCBI) ftp site. After the nonredundant proteome was determined for each genome, each protein was used to search for homology among the rest of the proteome. This "all against all" blast method was performed using the blastall executable, which is packaged with the blast tools. Similarly, each protein in the nonredundant human proteome was searched against all proteins from other genomes. In searching human proteins against the remainder of the human proteome and in searching human proteins against nonhuman proteins, we used an expect value of $E = 10^{-50}$. The use of a strict search criterion has the advantage that it identifies as homologous only proteins showing evidence of homology throughout the length of the protein rather than in only one domain or a few domains. In all cases, we used the defaults of a BLOSUM62 substitution matrix and the SEG filter. The resultant records were filtered using MSPcrunch, a program to filter and convert the blast output to a tabular format.
Given all pairs of homologous proteins, a "single link" method was used to find the protein families. This step groups genes that share homology. For example, if genes A and B are in a family, and B and C are in another family, then A, B, and C are in a family. Further, in this example, if A and D also share homology, then A, B, C, and D are in a family. Using the strict homology search criterion $E = 10^{-50}$ yielded a set of 1,520 highly conserved gene families with two or more members in human, containing a total of 5,475 genes. These were used in the identification of potentially duplicated genomic blocks and phylogenetic analyses (see below). Within families, amino acid sequences were aligned using ClustalW 1.81.
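The single-link grouping described above can be sketched with a union-find structure. This is only an illustration under stated assumptions: the function name `single_link_families` and the toy hit list are hypothetical, and the input is assumed to be pairs of protein identifiers that already passed the homology criterion.

```python
from collections import defaultdict


def single_link_families(pairs):
    """Group proteins so that any chain of pairwise hits ends up in one family."""
    parent = {}

    def find(x):
        # Return the representative of x's group, creating the group if needed.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    families = defaultdict(set)
    for protein in list(parent):
        families[find(protein)].add(protein)
    # Sort members and families so the result is deterministic.
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])


# Hypothetical hits mirroring the A, B, C, D example in the text.
hits = [("A", "B"), ("B", "C"), ("A", "D"), ("X", "Y")]
families = single_link_families(hits)
print(families)
```

Because membership is transitive, A, B, C, and D fall into one family even though C and D never hit each other directly; a strict E-value cutoff is what keeps such chains from merging unrelated proteins through a shared domain.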
Table 1 Distribution of Family Sizes in the Set of Conserved Human Gene Families Used in Analyses
Identification of Duplicated Blocks
The method of was used to identify potentially duplicated genomic blocks and to test the hypothesis that such blocks share gene families to a greater extent than expected by chance. This method involves comparing genomic windows for sharing of gene families. In applying this method to the mapped portion of the human genome, we used the "paranome" or set of all genes in the genome belonging to families with two or more members (set of paralogues) obtained by homology search as described above with $E = 10^{-50}$. Fixed, nonoverlapping windows of m paranome members (where m is a positive integer) were identified along each chromosome (excluding Y, which had fewer paranome members than the values of m used). When the number of paranome members in a chromosome was not evenly divisible by m, r windows in that chromosome were randomly chosen to be increased by one member (where r < m is the remainder of division of the total number of paranome members in the chromosome by m). The number of families shared between all windows in the genome was computed and compared with the results for 10,000 random genomes (in which paranome members were randomly assigned to chromosome locations). A significantly higher level of sharing of gene families between windows as determined by this randomization test was taken as evidence of a duplicated genomic structure. Here we report the results of analyses using m = 10 and m = 30. For further details of this method see .
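A minimal sketch of the windowing and randomization steps described above follows. It is not the authors' implementation: the function names, the toy genome, and the choice of statistic (the number of window pairs sharing at least k families) are assumptions made for illustration, and each chromosome is represented simply as an ordered list of family labels.

```python
import random
from itertools import combinations


def make_windows(genes, m, rng):
    """Cut an ordered list of family labels into nonoverlapping windows of m
    genes; the remainder r is absorbed by enlarging randomly chosen windows."""
    n_windows = len(genes) // m
    if n_windows == 0:
        return []  # chromosome has fewer than m paranome members; skip it
    order = list(range(n_windows))
    rng.shuffle(order)
    sizes = [m] * n_windows
    for i in range(len(genes) % m):
        # With r <= n_windows this enlarges r distinct random windows by one.
        sizes[order[i % n_windows]] += 1
    windows, start = [], 0
    for size in sizes:
        windows.append(genes[start:start + size])
        start += size
    return windows


def count_sharing(windows, k):
    """Number of window pairs that share at least k distinct families."""
    family_sets = [set(w) for w in windows]
    return sum(1 for a, b in combinations(family_sets, 2) if len(a & b) >= k)


def randomization_test(chromosomes, m, k, n_random=1000, seed=0):
    """Compare observed sharing with genomes whose gene positions are shuffled."""
    rng = random.Random(seed)

    def statistic(chroms):
        windows = [w for genes in chroms for w in make_windows(genes, m, rng)]
        return count_sharing(windows, k)

    observed = statistic(chromosomes)
    labels = [g for genes in chromosomes for g in genes]
    lengths = [len(genes) for genes in chromosomes]
    at_least_observed = 0
    for _ in range(n_random):
        rng.shuffle(labels)  # reassign paranome members to random locations
        it = iter(labels)
        shuffled = [[next(it) for _ in range(n)] for n in lengths]
        if statistic(shuffled) >= observed:
            at_least_observed += 1
    return observed, at_least_observed / n_random


# Toy genome (hypothetical): two chromosomes carrying the same two blocks.
block_f = [f"f{i}" for i in range(5)]
block_g = [f"g{i}" for i in range(5)]
toy_genome = [block_f + block_g, block_f + block_g]
observed, p_upper = randomization_test(toy_genome, m=5, k=3, n_random=200)
print(observed, p_upper)
```

The returned proportion is a one-sided estimate of how often a random genome shows at least as much sharing as the real one; a lower-tail count would be needed to test for the deficit of sharing reported for small numbers of shared families.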
Note that used a method similar to that of to identify potentially duplicated blocks in the human genome. However, the method of these authors differed from that of in important ways. First, these authors did not use a strict homology search criterion; thus, they may have identified as homologues genes showing strong evidence of homology only in a portion of the sequence. Second, "collapsed" all tandem arrays of duplicated genes, counting each such array as only a single gene. Thus, their randomization test of the hypothesis that the same families occur in syntenic groups in separate genomic locations to a greater extent than expected by chance is biased and is more likely to reject the null hypothesis than is our test.
Phylogenetic Analyses
Phylogenetic trees were constructed by the quartet maximum-likelihood (ML) method as implemented in TREEPUZZLE 5.0, using the JTT model of amino acid evolution and assuming that rate variation among sites followed a gamma distribution. All trees were treated as unrooted, and no attempt was made to assign an outgroup to root any tree. On the basis of tree topology, we determined the time of each human gene duplication event relative to the following cladogenetic events: the deuterostome-protostome divergence; the amniote-amphibian divergence; and the primate-rodent divergence. This method of timing duplication events does not assume a constant rate of molecular evolution ("molecular clock") and is independent of the rooting of the tree. We concluded that a gene duplicated prior to a cladogenetic event if the internal branch supporting that duplication was significantly supported. We considered a branch to be significantly supported if it was supported in 95% or more of 10,000 puzzling steps; this represents a highly conservative test for significance of an internal branch.
In counting the proportions of duplications in a set of trees that could be dated prior to a given cladogenetic event, we compared only those families for which a sufficient number of sequences were available for the hypothesis of duplication before that event to be tested. For example, in a gene family including two human sequences, a mouse sequence, and a Drosophila sequence, the only hypothesis that can be tested regarding the duplication of the human genes is that they duplicated prior to the primate-rodent divergence. This hypothesis would be supported if one of the human genes clustered with the mouse gene and one with the Drosophila gene, and if the internal branch separating these two clusters was significantly supported. By contrast, if the two human genes clustered together, the phylogeny would be taken as consistent with the hypothesis that the human genes duplicated after the primate-rodent divergence. In comparing proportions of gene duplications occurring before and after a given cladogenetic event, we compared only (1) those families in which there was significant support for duplication before that cladogenetic event and (2) those families in which it was possible to test the hypothesis of gene duplication prior to that cladogenetic event yet the topology of the tree was consistent with duplication after that event.
Results
Duplicated Genomic Blocks
When a window size of 10 genes (paranome members) was used, the number of genomic windows sharing two or more families was significantly lower than expected by chance. This pattern presumably reflects the occurrence of numerous cases of closely linked duplicated genes in the human genome. Because a substantial number of paralogues are closely linked to each other, sharing of two or more families between genomic windows occurs less frequently in the actual human genome than in the random genomes. Similarly, the observed number of windows sharing 3 or more families was significantly lower than the number expected by chance at the 5% level. On the other hand, the number of windows sharing 4 or more families was significantly greater than expected by chance, and the number of windows sharing 6 or more families was significantly greater than expected by chance.
Table 2 Numbers of Matches Between Genomic Windows
When a window size of 30 genes was used, the number of windows sharing two or more families was again significantly lower than expected by chance. Likewise, the numbers of windows sharing three or more members and four or more members were significantly lower than expected by chance. Again, these results presumably reflect the occurrence of duplicated genes in close linkage with one another to a greater extent than expected by chance. By contrast, the number of windows sharing six or more families was greater than expected by chance, as was the number of windows sharing eight or more families. Note that the number of duplicated blocks identified by our method is lower than that identified by . This difference can probably be attributed in large part to the less strict homology criterion used by these authors, as well as to the less conservative nature of their statistical test for duplicated structure.
Given that the human genome includes numerous closely linked duplicated genes, it might be argued that sharing of two or more families between windows will be observed relatively infrequently simply because a large proportion of windows will include members of a single family. To test for this possibility, we examined the distribution of the number of members in the largest family observed in each window . With a window size of 10, only 7 of 548 windows (1.3%) included 10 genes all of the same family; and in only 28 of 548 windows (5.1%) was the largest family represented by 6 or more genes. The median value for the size of the largest family per window was 2.0, whereas the mean was 2.34 with a standard error of 0.07. Thus, with a window size of 10, there were some cases in which a single family accounted for over half the genes in a window, but these were very rare.
FIG. 1. Frequency distributions of the size of the largest gene family in fixed genomic windows (highest count) in windows of 10 paranome members (A) and windows of 30 paranome members (B)
With a window size of 30, there were no cases of a single window in which all genes were members of the same family. Rather, the highest number of genes in the same family was 22, which occurred in just one case; and in only 2 of 183 windows (1.1%) was the largest family represented by 16 or more genes. The median value for the size of the largest family was 3.0, whereas the mean was 3.96 with a standard error of 0.23. Thus, with a window size of 30, there was no evidence that the frequency of windows sharing two or more gene families was biased by the occurrence of windows in which a single family predominated.
To examine the maximal number of possibly duplicated regions, we combined the data on windows of size 10 sharing three or more families and windows of size 30 sharing six or more families to identify potentially duplicated genomic regions. The combined data included 422 duplicated gene pairs in 92 putatively duplicated genomic blocks. (A listing of the genes involved is available from the authors upon request.) Figure 2 shows a graph illustrating sharing of one or more duplicated blocks between chromosomes. All chromosomes were represented in the graph except chromosome 21, which shared no duplicated blocks with other chromosomes; and the graph was completely connected. In addition to between-chromosome duplicated blocks, within-chromosome duplicated blocks were observed on chromosomes 1, 3, 6, 8, 11, 15, 16, 17, 19, 22, and X.
FIG. 2. Schematic illustration of between-chromosome duplicated blocks. Circles correspond to human chromosomes, and lines correspond to sharing of at least one putatively duplicated block between a pair of chromosomes
Phylogenetic Analyses
On the basis of resolved phylogenetic trees, a total of 1,057 gene duplication events in 706 human gene families were timed relative to one or more of the following three cladogenetic events: the primate-rodent divergence, the amniote-amphibian divergence, and the deuterostome-protostome divergence. For all three of these events, there was a striking difference between duplications of genes mapping to the same chromosome (within-chromosome) and duplications of genes mapping to different chromosomes (between-chromosome). In the case of each of these cladogenetic events, a significantly higher proportion of between-chromosome duplications than of within-chromosome duplications were timed prior to the event.
FIG. 3. Proportions of gene duplications in conserved human gene families that could be dated, by significantly supported interior branches in phylogenetic trees, prior to three major cladogenetic events. Chi-square tests of the hypothesis of equality of the proportions in within-chromosome and between-chromosome duplicate pairs: ***P < 0.001
When we compared genes in putatively duplicated genomic blocks with other genes, no significant differences were seen with respect to the timing of duplication events . Genes in within-chromosome duplicated blocks showed a pattern very similar to that seen for other within-chromosome duplications . Likewise, genes in between-chromosome duplicated blocks showed a pattern similar to that seen for other between-chromosome duplications . Considering only genes within duplicated blocks, a significantly lower proportion of within-chromosome duplications than of between-chromosome duplications could be dated prior to the primate-rodent divergence or prior to the amniote-amphibian divergence .
Table 3 Timing of Human Gene Duplications in Duplicated Blocks with Respect to Cladogenetic Events
We analyzed separately duplication times of genes located on sets of chromosomes that have previously been cited as evidence for ancient genome duplications: chromosomes 1, 6, 9, and 19; chromosomes 2, 7, 12, and 17 (the chromosomes bearing the Hox clusters); and chromosomes 1, 11, 15, and 19. The pattern of timing of gene duplications between chromosomes in these sets showed no statistically significant differences from that for other between-chromosome duplications.
Table 4 Timing of Human Gene Duplications on Allegedly Duplicated Chromosomes with Respect to Cladogenetic Events
Separation of Duplicate Pairs
We used the results of the phylogenetic analyses to estimate the rate of separation of tandemly duplicated gene pairs onto separate chromosomes over the course of vertebrate evolution. We used a simple model assuming (1) that there is a constant probability of separation of tandemly duplicated gene pairs onto different chromosomes and (2) that the probability of gene pairs that have once been separated being translocated back to the same chromosome is negligible. If so, the probability that a tandemly duplicated pair of genes is still on the same chromosome after t years ($P_t$) is given by
$$P_t = e^{-\lambda t},$$
where $\lambda$ is the probability of separation per gene pair per year.
We used the proportion of duplicate pairs estimated to have duplicated before a given cladogenetic event to estimate $P_t$. We then used linear regression of the natural logarithms of these estimated $P_t$ values against estimates of the times of the cladogenetic events in order to estimate the separation rate $\lambda$ from the slope of the best-fit line through the origin and through these three time points. We used the estimates of 110 MYA for the primate-rodent divergence; 360 MYA for the amniote-amphibian divergence; and 830 MYA for the deuterostome-protostome divergence.
Although only three data points were available, the fit of the regression ($R^2 = 96.1\%$; P = 0.02) was quite good. The resulting estimate of $\lambda$ was $1.7 \times 10^{-9}$ per gene-pair per year. Because of the simplifying assumptions made, this represents a minimum estimate of the probability of separation of tandemly duplicated gene pairs.
fig. omitted
FIG. 4. The natural logarithm of p, the proportion of duplications that can be dated to a given cladogenetic event, plotted against the estimated time of that event in millions of years (t). The line is the equation of the linear regression through the origin, Y = -0.311 - 0.0013x (R² = 96.1%; P = 0.02).
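The regression described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes the constant-rate model takes the exponential form Pt = exp(-rate x t), fits ln(Pt) against t with a line forced through the origin, and reports the negated slope as the separation rate. The function name and the synthetic check values are hypothetical; only the three divergence times (110, 360, and 830 MYA) and the rate of 1.7 x 10^-9 per gene pair per year come from the text.

```python
import math


def separation_rate(times_years, proportions_same_chromosome):
    """Estimate the separation rate per gene pair per year.

    Fits ln(P_t) = slope * t by least squares with no intercept
    (a line through the origin) and returns -slope.
    """
    if len(times_years) != len(proportions_same_chromosome):
        raise ValueError("need one proportion per time point")
    log_p = [math.log(p) for p in proportions_same_chromosome]
    sum_ty = sum(t * y for t, y in zip(times_years, log_p))
    sum_tt = sum(t * t for t in times_years)
    return -sum_ty / sum_tt


# Divergence times used in the text, converted from MYA to years.
TIMES = [110e6, 360e6, 830e6]

# Synthetic proportions generated from the assumed model itself (these are
# NOT data from the paper), so the fit should simply return the rate put in.
ASSUMED_RATE = 1.7e-9
synthetic = [math.exp(-ASSUMED_RATE * t) for t in TIMES]

recovered = separation_rate(TIMES, synthetic)
print(f"recovered rate: {recovered:.3e} per gene pair per year")
```

With real data, the second argument would be the observed proportions of duplicate pairs still on the same chromosome at each time point; the through-origin constraint reflects the assumption that every tandem pair starts on a single chromosome at t = 0.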
Discussion
The present study examined the timing of gene duplication events in a set of highly conserved human gene families with known map locations. Consistent with some previous studies but not others, the results provided no strong signal suggestive of ancient polyploidization in the vertebrates. On the polyploidization hypothesis, a substantial number of paralogous gene pairs on different chromosomes are expected to have arisen prior to the amniote-amphibian divergence but after the deuterostome-protostome divergence. This should be reflected in a high percentage of between-chromosome pairs of genes that duplicated prior to the amniote-amphibian divergence. In fact, the proportion of between-chromosome duplications that could be dated prior to the amniote-amphibian divergence was intermediate between the proportion dated prior to the primate-rodent divergence and the proportion dated prior to the deuterostome-protostome divergence. Two previous studies reported peaks of gene duplication early in vertebrate history on the basis of molecular clock analyses. However, many of these gene duplications may have been incorrectly timed because of defects inherent in molecular clock analyses. It is well known that gene duplication is often followed by a period of accelerated evolution at the amino acid level as daughter genes adapt to distinct functions. Such accelerated evolution will disrupt the molecular clock and cause the duplication to be dated earlier than it actually occurred. The existence of numerous such cases might create an artifactual "peak" of gene duplications at earlier dates.
One of these studies did not attempt to weed out protein families not evolving in a clock-like manner; thus, its analyses probably included a high proportion of families in which the assumption of a molecular clock is not valid. The other did attempt to weed out families not evolving in a clock-like manner, using the two-cluster test, which tests for nonuniformity of rate between two groups of sequences in comparison to an outgroup. This method, however, will not be able to detect bursts of rapid evolution after gene duplication if they occur in both duplicates. In addition, the families assumed in that study to be evolving in a clock-like manner were in fact merely those families for which the test lacked statistical power to detect deviations from a molecular clock. Thus, the data set on which the estimates were based probably included many families of short sequences and sequences with high rates of replacement per site, because in these cases the test will lack statistical power to detect deviations from the molecular clock even when such deviations are present.
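The argument about clock-based dating can be made concrete with a small numerical sketch in Python. It is illustrative only: the function names, the substitution rate, the true age, and the size and duration of the burst are all hypothetical values chosen for the example, not figures from any of the studies discussed. The sketch assumes a strict-clock estimate of the form age = distance / (2 x rate) and lets both duplicates pass through an equal burst of accelerated evolution immediately after duplication. The comparison of the two branches is a simplified stand-in for a relative-rate comparison, not an implementation of the two-cluster test.

```python
def branch_length(true_age, base_rate, burst_factor=1.0, burst_years=0.0):
    """Expected substitutions per site along one duplicate lineage.

    For the first `burst_years` after duplication the lineage evolves at
    `burst_factor` times `base_rate`; afterwards it evolves at `base_rate`.
    """
    burst = min(burst_years, true_age)
    return base_rate * (burst_factor * burst + (true_age - burst))


def clock_age(distance, base_rate):
    """Strict-clock age estimate from the distance between two duplicates."""
    return distance / (2.0 * base_rate)


# Hypothetical inputs, chosen only to illustrate the direction of the bias.
TRUE_AGE = 300e6     # years since the duplication
BASE_RATE = 1e-9     # substitutions per site per year
BURST_FACTOR = 3.0   # rate multiplier just after duplication
BURST_YEARS = 50e6   # duration of the accelerated period

# Both copies experience the same burst.
copy_a = branch_length(TRUE_AGE, BASE_RATE, BURST_FACTOR, BURST_YEARS)
copy_b = branch_length(TRUE_AGE, BASE_RATE, BURST_FACTOR, BURST_YEARS)

estimated_age = clock_age(copy_a + copy_b, BASE_RATE)
rate_asymmetry = copy_a - copy_b  # what a comparison of the two copies sees

print(f"true age:       {TRUE_AGE:.3g} years")
print(f"clock estimate: {estimated_age:.3g} years")
print(f"asymmetry between duplicates: {rate_asymmetry}")
```

With these inputs the clock estimate is about 400 million years against a true age of 300 million years, while the difference between the two duplicate branches is zero, so a comparison of the two lineages alone gives no hint of the acceleration.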
Furthermore, it is worth noting that an apparent peak of gene duplications, even if it is not a statistical artifact, is not in itself evidence of polyploidization. Gene duplication occurs continually over the course of evolution, but most duplicate genes are quickly lost. Specialization of a duplicate gene for a new function substantially enhances its probability of being retained. Thus, if we observe an apparent peak of gene duplication in the past history of a species, what we are really observing is not a peak of gene duplication per se, but rather a peak of retention of duplicate gene copies. And because retention of duplicate genes is likely to be associated with the evolution of new functions, an apparent peak of gene duplication is likely to be the signature not of polyploidization but of adaptive radiation. Thus, even if the peaks of retention of duplicate gene copies early in vertebrate history reported in those studies are not artifacts, they provide no information one way or another regarding the hypothesis of polyploidization. Rather, they merely reflect the occurrence of adaptive radiation early in vertebrate history, which is unsurprising on the basis of our knowledge of vertebrate paleontology.
On the polyploidization hypothesis, duplicated genomic regions are the residue of ancient polyploidization events. Here we used a simple method to identify such duplicated regions. When this method was previously applied to the genome of yeast, it provided a strong signal of genome duplication, as had previously been proposed for this species, an event estimated to have occurred 200–300 MYA. This method identified a number of potentially duplicated blocks in the human genome. However, when phylogenetic analysis was used to time gene duplications between these blocks, the results provided no strong signal of ancient polyploidization. On the polyploidization hypothesis, we would expect gene pairs in duplicated blocks to show a disproportionate number duplicated after the deuterostome-protostome divergence but before the amniote-amphibian divergence. In fact, the pattern of gene duplication times in duplicated blocks was not different from that outside the blocks. Likewise, when chromosome pairs previously alleged to show the effects of ancient polyploidization were analyzed, the pattern of duplication times was very similar to that for genes on other chromosome pairs.
Duplication of certain of the gene pairs in duplicated blocks could be dated with strong statistical support prior to the deuterostome-protostome divergence, whereas others duplicated after the amniote-amphibian divergence or after the primate-rodent divergence. A similar result was seen in the case of genes on allegedly duplicated chromosome pairs. Duplication times after the amniote-amphibian divergence or prior to the deuterostome-protostome divergence are not explainable by polyploidization early in vertebrate history. Thus, our results suggest that, if genome duplication did occur early in vertebrate history, it was not responsible for a large fraction of the duplicated genes or for a large fraction of the duplicated genomic blocks found in the genomes of current-day vertebrates.
The simplest alternative to polyploidization as an explanation for the increase in gene number in vertebrates is a model invoking repeated independent events of tandem gene duplication. These tandem duplications might involve individual genes or they might involve chromosomal blocks such as we detected on chromosomes 1, 3, 6, 8, 11, 15, 16, 17, 19, 22, and X. Indeed, recent evidence from the human genome suggests that duplication of genomic blocks is a recurring feature of vertebrate genome evolution. Once a gene or genomic segment has been duplicated, subsequent events of chromosome breakage and translocation of chromosomal segments can serve to break up tandemly duplicated gene pairs.
Several aspects of our data support this model. First, we found a consistent tendency for duplicated gene pairs mapping to the same chromosome to have duplicated more recently than those mapping to separate chromosomes. This was true of genes in duplicated blocks as well as of other genes. Furthermore, our results on the timing of gene duplications provided an excellent fit to a simple model assuming only tandem duplication and a constant probability of separation onto different chromosomes.
We estimated the rate of separation of tandemly duplicated gene pairs onto different chromosomes in the human lineage at 1.7 × 10⁻⁹ per gene-pair per year. This estimate represents a long-term average for the human lineage, and it cannot be expected to apply to vertebrates with numbers of chromosomes either much larger or much smaller than those of humans. Given this rate, it is expected that after 100 million years, about one in six duplicated gene pairs will have separated onto different chromosomes. After 450 million years, the estimated time since the last common ancestor of bony fishes and tetrapods, about three-quarters of duplicated gene pairs are expected to be separated onto different chromosomes. The anticipated availability of complete genomic sequences from human, mouse, pufferfish, and zebrafish will make it possible to test these predictions. In addition, application of similar methods to a number of complete genomes will make it possible for us to develop more precise quantitative models of vertebrate chromosomal evolution.
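The expected fractions quoted above can be worked through with a few lines of Python. This is a sketch under stated assumptions rather than the authors' calculation: the function names are hypothetical, and the exponential form 1 - exp(-rate x t) is our reading of a constant-rate separation model. The simple product rate x t is included as a first-order approximation for comparison. Only the rate (1.7 x 10^-9 per gene pair per year) and the two time spans (100 and 450 million years) are taken from the text.

```python
import math

RATE = 1.7e-9  # separations per gene pair per year (estimate from the text)


def fraction_separated_exponential(years, rate=RATE):
    """Fraction of pairs separated after `years`, assuming P_t = exp(-rate * t)."""
    return 1.0 - math.exp(-rate * years)


def fraction_separated_linear(years, rate=RATE):
    """First-order approximation: rate * t (reasonable only while rate * t is small)."""
    return rate * years


for years in (100e6, 450e6):
    exact = fraction_separated_exponential(years)
    approx = fraction_separated_linear(years)
    print(f"{years / 1e6:.0f} My: exponential {exact:.3f}, linear {approx:.3f}")
```

The product rate x t gives 0.17 at 100 million years and about 0.77 at 450 million years, close to the "one in six" and "three-quarters" stated above. Under the exponential form the corresponding values are about 0.16 and about 0.53, so the two forms agree at the shorter time span and diverge at the longer one.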
Acknowledgements
This research was supported by National Institutes of Health grant GM066710 to A.L.H.
Literature Cited
Abi-Rached, L., A. Gilles, T. Shiina, P. Pontarotti, and H. Inoko. 2002. Evidence of en bloc duplication in vertebrate genomes. Nat. Genet 31:100-105.
Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Bailey, J. A., G. Gu, R. A. Clark, K. Reinert, R. V. Samonte, S. Schwartz, M. D. Adams, E. W. Myers, P. W. Li, and E. E. Eichler. 2002. Recent segmental duplications in the human genome. Science 297:1003-1007.
Carroll, R. L. 1988. Vertebrate paleontology and evolution. W. H. Freeman, New York.
Friedman, R., and A. L. Hughes. 2001a. Pattern and timing of gene duplication in animal genomes. Genome Res 11:1842-1847.
Friedman, R., and A. L. Hughes. 2001b. Gene duplication and the structure of eukaryotic genomes. Genome Res 11:373-381.
Gu, X., Y. Wang, and J. Gu. 2002. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate genomes. Nat. Genet 31:205-209.
Hughes, A. L. 1994. The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. Ser. B 256:119-125.
Hughes, A. L. 1998. Phylogenetic tests of the hypothesis of block duplication of homologous genes on human chromosomes 6, 9, and 1. Mol. Biol. Evol 15:854-870.
Hughes, A. L. 1999. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol. Evol 48:565-576.
Hughes, A. L., J. da Silva, and R. Friedman. 2001. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11:771-780.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Jékely, G., and P. Friedrich. 1999. The evolution of the calpain family as reflected in paralogous chromosome regions. J. Mol. Evol 49:272-281.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci 8:275-282.
Kasahara, M., Y. Nayaka, Y. Satta, and N. Takahata. 1997. Chromosomal duplication and the emergence of the adaptive immune system. Trends Genet 13:90-92.
Kent, W. J., and D. Haussler. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Res 11:1461-1462.
Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917-919.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Lundin, L. G. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosome regions in man and the house mouse. Genomics 16:1-19.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155.
Lynch, M., M. O'Hely, B. Walsh, and A. Force. 2001. The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789-1804.
McLysaght, A., K. Hokamp, and K. H. Wolfe. 2002. Extensive genomic duplication during early chordate evolution. Nat. Genet 31:200-204.
Meyer, A., and M. Schartl. 1999. Gene and genome duplication in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr. Opin. Cell Biol 11:699-704.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Nei, M., P. Xu, and G. Glazko. 2001. Estimation of divergence times from multiprotein sequences for a few mammalian species and several distantly related organisms. Proc. Natl. Acad. Sci. USA 98:2497-2502.
O'Brien, S. J., M. Menotti-Raymond, W. J. Murphy, W. G. Nash, J. Wiensburg, R. Stanyon, N. G. Copeland, N. A. Jenkins, J. Womack, and J. A. M. Graves. 1999. The promise of comparative genomics in mammals. Science 286:458-481.
Samonte, R. V., and E. E. Eichler. 2002. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet 3:65-72.
Sidow, A. 1996. Gen(om)e duplications in the evolution of early vertebrates. Curr. Opin. Genet. Dev 6:715-722.
Sonnhammer, E. L. L., and R. Durbin. 1994. A workbench for large scale sequence homology analysis. Comput. Appl. Biosci 10:301-307.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol 13:964-969.
Takezaki, N., A. Rzhetsky, and M. Nei. 1995. Phylogenetic test of the molecular clock and linearized trees. Mol. Biol. Evol 12:823-833.
Thompson, J. D., D. G. Higgins, and T. Gibson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
Wolfe, K. H. 2001. Yesterday's polyploids and the mystery of diploidization. Nat. Rev. Genet 2:333-341.
Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713.
Wootton, J. C., and S. Federhen. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comp. Chem 17:149-163.
Accepted for publication September 25, 2002. (Robert Friedman and Austin L. Hughes)