Compositional Evolution of Noncoding DNA in the Human and Chimpanzee Genomes
Molecular Biology and Evolution (《分子生物学进展》), 2003, No. 2
    Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden

    Abstract

    We have examined the compositional evolution of noncoding DNA in the primate genome by comparing lineage-specific substitutions observed in 1.8 Mb of genomic alignments of human, chimpanzee, and baboon with 6542 human single-nucleotide polymorphisms (SNPs) rooted using chimpanzee sequence. The pattern of compositional evolution, measured in terms of the numbers of GC→AT and AT→GC changes, differs significantly between fixed and polymorphic sites, and indicates that there is a bias toward fixation of AT→GC mutations, which could result from weak directional selection or biased gene conversion in favor of high GC content. Comparison of the frequency distributions of a subset of the SNPs revealed no significant difference between GC→AT and AT→GC polymorphisms, although AT→GC polymorphisms in regions of high GC segregate at slightly higher frequencies on average than GC→AT polymorphisms, which is consistent with a fixation bias favoring high GC in these regions. However, the substitution data suggest that this fixation bias is relatively weak, because the compositional structure of the human and chimpanzee genomes is becoming homogenized, with regions of high GC decreasing in GC content and regions of low GC increasing in GC content. The rate and pattern of nucleotide substitution in 333 Alu repeats within the human-chimpanzee-baboon alignments are not significantly affected by the GC content of the region in which they are inserted, providing further evidence that, since the time of the human-chimpanzee ancestor, there has been little or no regional variation in mutation bias.

    Key Words: isochore • base composition • human • chimpanzee • selection • SNP

    Introduction

    The genomes of humans and many other eukaryotes exhibit striking variation in GC content, with regions of relatively homogeneous base composition known as isochores stretching up to the megabase scale. Although the degree of heterogeneity on the sequence level is not as extreme as initially suggested on the basis of physical separation of DNA fragments by gradient centrifugation, it is highly significant. One explanation for significant variation in GC content is that local variation occurs in the ratio between the rate of GC→AT mutations (u) and AT→GC mutations (v), which causes some regions to retain high GC contents. Alternatively, there could be no variation in the pattern of mutation, but instead some mutations could be more likely to become fixed than others in regions of different GC content. Such a fixation bias could be caused by weak directional selection or biased gene conversion.

    A number of different approaches have been used to distinguish between biases in the processes of mutation and fixation as explanations for the existence of isochores. The preferential fixation of certain mutations in regions of similar GC content would cause patterns of polymorphism and substitution between closely related species to be discordant. Comparison of patterns of polymorphism and substitution is the basis of the McDonald-Kreitman test of neutrality, which utilizes silent and amino acid replacement changes, and the same principle has also been applied to AT→GC and GC→AT mutations. Thus, if substitution patterns are due solely to variation in the mutation process, patterns of substitution and polymorphism caused by AT→GC and GC→AT mutations should be similar, assuming there have been no changes in mutation bias since the time of divergence of the two species (4.6–6.2 Myr for human-chimpanzee). However, previous studies using this prediction have relied on the genome composition being at equilibrium, which is not supported by recent analyses.

    Under a model of fixation bias, differences are predicted between the frequency distributions of noncoding polymorphisms caused by GC→AT and AT→GC mutations. If there is a bias toward fixation of GC nucleotides in GC-rich regions, then AT→GC mutations will segregate at higher average frequencies than GC→AT changes. The polymorphism frequency test is potentially more powerful than a test based on numbers of polymorphisms, as it is only sensitive to changes in mutation bias since the time of origin of extant neutral polymorphism (<1.5 Myr in human noncoding DNA).

    Sequences that have been inserted into new positions in the genome, such as processed pseudogenes and interspersed repeats, have been employed to study the process of compositional evolution, and it has been previously demonstrated that such sequences evolve so that their GC content approaches that of surrounding DNA. Although these observations are consistent with both the fixation and mutation bias hypotheses, they suggest the presence of "regional effects" on the nucleotide substitution process, which could possibly extend across whole isochores. Repetitive elements are believed to comprise >50% of the human genome and are thus of great utility for understanding the determinants of molecular evolution in noncoding DNA.

    We have performed a large-scale comparison of inter- and intraspecific mutational changes observed in noncoding regions of the human genome by inferring lineage-specific substitutions in 1.8 Mb of human-chimpanzee-baboon DNA alignments and determining the roots of 6542 single-nucleotide polymorphisms (SNPs) within noncoding regions using >25 Mb of human-chimpanzee DNA alignments. The frequency distributions of the SNPs for which allele frequencies were available were also analyzed. The contribution of regional effects to the process of compositional evolution was studied by analysis of patterns of substitution in Alu sequences within the human-chimpanzee-baboon alignments. The results shed light on the forces that govern evolution of "junk" DNA in the human and chimpanzee genomes.

    Materials and Methods

    Analysis of Human-Chimpanzee-Baboon DNA Alignments

    A data set consisting of 43 human-chimpanzee-baboon genomic alignments, with a total length of 1.8 Mb, was used to infer nucleotide substitutions along the human and chimpanzee lineages. Construction and critical evaluation of this data set have been described in detail previously. In brief, sequences derived from chimpanzee (Pan troglodytes) and baboon (Papio cynocephalus anubis) BAC clones orthologous to five large regions of human chromosome 7 were identified using BLAST searches, and alignments were then generated using the default values of ClustalW. Coding regions were masked using the annotation in human "contig" files, and repetitive elements were masked using RepeatMasker (A. F. A. Smit and P. Green).

    The alignments were divided into 2-kb windows based only on human and chimpanzee length (i.e., positions containing a gap in both the human and the chimpanzee sequence were ignored). Only windows with at least 100 bp of noncoding and nonrepetitive DNA were considered. The average GC content of each class of regions defined by GC content was weighted by number of substitutions to enable comparison with polymorphism data where the GC content around each observed SNP contributed to the average value.

    Alignments were analyzed for nucleotide substitutions occurring along the lineages leading to human and chimpanzee, with baboon as the outgroup. Only changes at sites where the human and chimpanzee differed and the baboon was equivalent to either human or chimpanzee were considered, allowing inference of the direction of the change by parsimony. Substitutions putatively due to CpG hypermutability (CpG to TpG or CpA) were identified. We performed a binomial test to determine whether the patterns of substitution on the human and chimpanzee lineages were significantly different. The probability of finding a bias in the numbers of AT→GC and GC→AT changes on the two lineages equal to or greater than the observed values, under the null assumption that each type of change occurs at the same rate on both lineages, was calculated using the binomial probability formula. A binomial test was also used to test the departure of AT→GC and GC→AT changes from a 1:1 ratio in regions of similar GC content in both substitution and polymorphism data.
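
    The parsimony rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `classify_site` and `count_substitutions` are hypothetical, the input is assumed to be three pre-aligned sequences of equal length, and masking of coding and repetitive sequence and the identification of CpG changes are left out.

```python
from collections import Counter

GC = {"G", "C"}
AT = {"A", "T"}


def classify_site(human, chimp, baboon):
    """Return (lineage, change_type) for one aligned site, or None.

    A site is used only if human and chimpanzee differ and the baboon
    outgroup matches one of them; the baboon base is then taken as the
    ancestral state (parsimony).
    """
    if not {human, chimp, baboon} <= (GC | AT):
        return None  # gap or ambiguity code at this site
    if human == chimp or baboon not in (human, chimp):
        return None  # not parsimony informative
    ancestral = baboon
    if human != ancestral:
        lineage, derived = "human", human
    else:
        lineage, derived = "chimpanzee", chimp
    if ancestral in GC and derived in AT:
        change = "GC->AT"
    elif ancestral in AT and derived in GC:
        change = "AT->GC"
    else:
        change = "other"  # G<->C or A<->T leaves GC content unchanged
    return lineage, change


def count_substitutions(human_seq, chimp_seq, baboon_seq):
    """Tally (lineage, change_type) pairs over three aligned sequences."""
    counts = Counter()
    for h, c, b in zip(human_seq.upper(), chimp_seq.upper(), baboon_seq.upper()):
        result = classify_site(h, c, b)
        if result is not None:
            counts[result] += 1
    return counts
```

    For the site pattern AGG (human, chimpanzee, baboon), `classify_site("A", "G", "G")` assigns a GC→AT change to the human lineage.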

    The location of Alu repeats in the alignments was determined using RepeatMasker and the substitution patterns along the human and chimpanzee lineages within these regions were analyzed, inferring ancestral states by parsimony. The GC content of 1 kb of noncoding nonrepetitive DNA surrounding each Alu repeat was calculated.

    Maximum Likelihood Inference of Substitution Patterns

    Inferring ancestral states using parsimony can be unreliable, particularly when base composition is biased and when base composition is not at equilibrium. However, parsimony is usually reasonably accurate when sequence divergence is low, as is the case in our primate comparisons. We checked the accuracy of parsimony in determining the numbers of AT→GC and GC→AT changes along the human and chimpanzee lineages by using a maximum likelihood implementation of a model that allows GC content to evolve nonhomogeneously. The 2-kb windows were pooled according to GC content, and the NHML program of Galtier and Gouy was applied to each GC category separately.

    The NHML program allows maximum likelihood testing of heterogeneous base composition evolution. More specifically, the maximum likelihood for the same human-chimpanzee-baboon data set is found under two models: (1) with the same equilibrium base composition (θ in Galtier and Gouy's terminology) applying to all four branches in the rooted human-chimpanzee-baboon tree, and (2) with four branch-specific θ's. Twice the increase in log likelihood from model one to model two (2ΔlogL) is expected to follow a chi-squared distribution with three degrees of freedom (since there are three additional parameters in model two relative to model one). The other fitted parameters were the same for both models: ancestral base composition, root location, branch lengths, transition/transversion ratio, and base composition at nodes (for details, see Galtier and Gouy). We did not model rate variation among sites; in other words, we assume that all sites evolve at the same rate.
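
    The likelihood ratio comparison of the two models can be sketched as follows. This is an illustration only: the function names are hypothetical, the log likelihoods would come from the NHML output, and the upper-tail probability uses the closed form available for a chi-squared distribution with three degrees of freedom rather than a statistics library.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 d.f.

    Closed form for 3 d.f.: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def likelihood_ratio_test(loglik_one_theta, loglik_four_thetas):
    """Return (statistic, P) for the one-theta versus four-theta models.

    The statistic is twice the gain in log likelihood from model one to
    model two; three extra parameters give three degrees of freedom.
    """
    statistic = 2.0 * (loglik_four_thetas - loglik_one_theta)
    return statistic, chi2_sf_3df(statistic)
```

    A gain of about 3.9 log-likelihood units (a statistic of 7.815) corresponds to the conventional 5% significance threshold for three degrees of freedom.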

    Under the Galtier and Gouy model it is straightforward to employ the maximum likelihood parameter estimates from model two to assess the accuracy of parsimony inference with an empirical Bayesian approach. For example, consider the case where the site pattern in human-chimpanzee-baboon is AGG (HCB = AGG). Parsimony yields an unambiguous inference of a G to A substitution in the human lineage, equivalent to the state of the human-chimpanzee ancestor being G. However, evolution does not always proceed according to parsimony, and it is this uncertainty that we seek to quantify. The substitution probabilities of the Galtier and Gouy model, given in the form p_IJ, the probability of state J at the bottom of a branch given state I at the top of the branch, can be used to determine the probability that the human-chimpanzee ancestor is a G (HCA = G) given that HCB = AGG, Pr(HCA = G | HCB = AGG), using Bayes' theorem, given here for the general case of HCA = W and HCB = XYZ:

    $$\Pr(\mathrm{HCA}=W \mid \mathrm{HCB}=XYZ) = \frac{\Pr(\mathrm{HCB}=XYZ \mid \mathrm{HCA}=W)\,\Pr(\mathrm{HCA}=W)}{\Pr(\mathrm{HCB}=XYZ)}$$

    Pr(HCB = XYZ | HCA = W) is the likelihood of observing the HCB data given the state of the HCA. The substitution probabilities under the Galtier and Gouy model are functions of branch lengths (λ), equilibrium base composition (θ), and the transition/transversion ratio (see the Appendix of Galtier and Gouy). Here we use θ and λ subscripts to refer to individual branches, with branch 1 from HCA to human, branch 2 from HCA to chimpanzee, branch 3 from the human-chimpanzee-baboon ancestor to HCA, and branch 4 from the human-chimpanzee-baboon ancestor to baboon.

    Pr(HCA = W) is the prior probability that the state of the HCA is W, which is given by the inferred base composition of the HCA node. Pr(HCB = XYZ) is the normalizing probability of the data calculated as the sum of the probabilities of all possible HCA states.

    We assessed the accuracy of parsimony by applying the above formulae to all 16 HCB site patterns that are GC→AT or AT→GC parsimony informative. In all cases, the parsimony-inferred state of HCA (and hence direction of mutation) was also the most probable according to empirical Bayesian analysis. However, in many cases the second most likely state with the opposite direction of mutation (HCA = A for HCB = AGG, indicating an A to G substitution) had an appreciable probability, sometimes over 10%. To check whether parsimony is reliable, we derived a ML estimate of the numbers of GC→AT and AT→GC substitutions by weighting the frequency of the 16 different HCB site patterns by the relative probabilities of the two most likely HCA states (the two remaining states had vanishingly small probabilities owing to the short branch lengths).
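
    The empirical Bayesian calculation can be sketched in Python under stated assumptions. The substitution matrices below come from a hypothetical placeholder (`toy_matrix`, in which every base changes with the same small probability) and are not the Galtier and Gouy substitution probabilities, which would be computed from the fitted θ, λ, and transition/transversion ratio. The sketch also derives the prior on the HCA state by propagating an assumed root composition down branch 3, which stands in for the inferred base composition of the HCA node.

```python
BASES = "ACGT"


def toy_matrix(eps):
    """Hypothetical substitution matrix, indexed as p[top][bottom].

    Each base changes with total probability eps, split evenly over the
    three alternatives. A placeholder, NOT the Galtier and Gouy model.
    """
    return {i: {j: (1.0 - eps if i == j else eps / 3.0) for j in BASES}
            for i in BASES}


def hca_posterior(pattern, root_prior, p1, p2, p3, p4):
    """Posterior Pr(HCA = W | HCB = XYZ) for each base W.

    pattern holds the human, chimpanzee, and baboon bases (e.g. "AGG").
    p1..p4 are the matrices for branches 1-4 as numbered in the text.
    The joint probability sums over the unobserved state R of the
    human-chimpanzee-baboon ancestor.
    """
    x, y, z = pattern
    joint = {}
    for w in BASES:
        via_root = sum(root_prior[r] * p3[r][w] * p4[r][z] for r in BASES)
        joint[w] = via_root * p1[w][x] * p2[w][y]
    total = sum(joint.values())  # Pr(HCB = XYZ), the normalizing term
    return {w: joint[w] / total for w in BASES}
```

    With short branches, the posterior for the pattern AGG concentrates on HCA = G, with HCA = A as the only other state of non-negligible probability, which is the situation described above.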

    Determination of Human SNP Roots Using Human-Chimpanzee Alignments

    A total of 37,580 human-chimpanzee alignments (20.1 Mb) were obtained using the draft quality chimpanzee BAC end sequences published as part of the RIKEN chimpanzee genome project. The chimpanzee sequences were extracted from NCBI Entrez along with their corresponding orthologous human sequences. Alignments were generated using the default settings of ClustalW, and those containing coding sequence were identified by BLAST searches against the RefSeq database and omitted from the analysis. A further set of 80 human-chimpanzee genomic alignments, masked for genes and with a total length of 5.1 Mb, was also included in the analysis. These were constructed from full or partial length chimpanzee BAC clones and their orthologous sequence in human contigs as described previously. The human-chimpanzee-baboon alignments described earlier comprise a subset of these 80 human-chimpanzee alignments for which baboon sequence was available (further details of all aligned sequences are available on request).

    The tenth release of The SNP Consortium (TSC) database, consisting of 1,250,611 mapped SNPs and their 3' and 5' flanking sequences (average of 690 bp flanking sequence per SNP), was obtained from TSC. To identify SNPs in regions where the chimpanzee root is available, BLAST searches were performed between all of the human-chimpanzee alignments on each human chromosome and databases consisting of the flanking sequences of all the available SNPs on that chromosome. SNPs whose flanking sequences perfectly matched the human sequence in an alignment were then compared to the aligned chimpanzee sequence in order to determine the root of the mutation. Only biallelic SNPs where the chimpanzee root was the same as one of the two human alleles were included in the analysis. SNPs within repetitive DNA were removed from the analysis as they could potentially show mutation patterns unrepresentative of the remainder of noncoding DNA. This was done by masking repeats in the sequence surrounding SNPs with RepeatMasker. Nonrepetitive DNA within the masked sequence flanking SNPs was used to calculate local GC content.
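
    The rooting step, once a SNP has been located in an alignment, reduces to a simple rule that can be sketched as follows. The function name `root_snp` is hypothetical, and the BLAST matching and repeat masking that precede this step are not shown.

```python
GC = {"G", "C"}
AT = {"A", "T"}


def root_snp(allele_1, allele_2, chimp_base):
    """Return (ancestral, derived, change_type), or None if unrootable.

    The SNP must be biallelic and the aligned chimpanzee base must equal
    one of the two human alleles; that allele is taken as ancestral.
    """
    alleles = {allele_1.upper(), allele_2.upper()}
    chimp = chimp_base.upper()
    if len(alleles) != 2 or not alleles <= (GC | AT):
        return None  # identical alleles or non-nucleotide symbols
    if chimp not in alleles:
        return None  # chimpanzee base matches neither human allele
    ancestral = chimp
    (derived,) = alleles - {ancestral}
    if ancestral in GC and derived in AT:
        change = "GC->AT"
    elif ancestral in AT and derived in GC:
        change = "AT->GC"
    else:
        change = "other"  # G<->C or A<->T leaves GC content unchanged
    return ancestral, derived, change
```

    For example, an A/G SNP aligned to a chimpanzee G is rooted as a GC→AT mutation, whereas the same SNP aligned to a chimpanzee C is excluded.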

    Files containing SNP allele frequencies were retrieved from TSC. To determine the frequency of the ancestral and variant alleles of biallelic SNPs, these were searched for SNPs where the chimpanzee root had been previously determined. Genotyping results from African American, Asian, and Caucasian samples were pooled, although there were no qualitative differences in the results when individual populations were considered (data not shown).

    Bootstrap Analysis of Equilibrium GC Content

    The predicted equilibrium GC content, f*, of a genomic region can be calculated by considering the present GC content, f, and the per base pair rates of GC→AT (u) and AT→GC (v) mutations using the following formula:

    $$f^{*} = \frac{v}{u + v}$$

    This equation predicts the equilibrium GC content of a neutrally evolving region assuming that the mutation bias remains constant over time, but it also applies to regions where a fixation bias operates (in which case u and v represent combined mutation and fixation probabilities) as long as the bias in substitution patterns remains constant.

    Values of f* were estimated for all GC categories using the parsimony-inferred patterns of substitution and polymorphism. Confidence intervals for f* were calculated by resampling all of the inferred mutational changes with replacement using 10,000 independent replicates, assuming that all mutations were independent events. We also tested the significance of differences between the predicted values of f* derived from polymorphism and divergence by comparison of each replicate from the different data sets. This procedure was also used to calculate the significance of differences in values of f* when sequences are divided into regions of high (≥0.4) and low (<0.4) GC content.
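
    The estimate of f* and its bootstrap confidence interval can be sketched as follows, assuming that u is estimated as the number of GC→AT changes per G or C site and v as the number of AT→GC changes per A or T site, so that the total sequence length cancels. The function names and the percentile-interval construction are assumptions for illustration, not a description of the original implementation.

```python
import random


def equilibrium_gc(n_at_to_gc, n_gc_to_at, gc_content):
    """f* = v / (u + v), with u and v expressed per GC and per AT site."""
    u = n_gc_to_at / gc_content          # GC->AT changes per G or C site
    v = n_at_to_gc / (1.0 - gc_content)  # AT->GC changes per A or T site
    return v / (u + v)


def bootstrap_equilibrium_gc(n_at_to_gc, n_gc_to_at, gc_content,
                             replicates=10_000, seed=0):
    """Percentile 95% interval for f* from resampled mutational changes.

    Each replicate redraws all changes with replacement, treating every
    change as an independent event, and recomputes f*.
    """
    rng = random.Random(seed)
    n = n_at_to_gc + n_gc_to_at
    p = n_at_to_gc / n
    estimates = []
    for _ in range(replicates):
        k = sum(rng.random() < p for _ in range(n))  # resampled AT->GC count
        estimates.append(equilibrium_gc(k, n - k, gc_content))
    estimates.sort()
    lower = estimates[int(0.025 * replicates)]
    upper = estimates[int(0.975 * replicates) - 1]
    return lower, upper
```

    Applied to the pooled counts reported in the Results with a mean GC content of 0.41, `equilibrium_gc(4525, 4472, 0.41)` gives approximately 0.41 for substitutions and `equilibrium_gc(1994, 2346, 0.41)` approximately 0.37 for polymorphisms.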

    Results

    Patterns of Nucleotide Substitution

    The patterns of substitution inferred along the human and chimpanzee lineages using the human-chimpanzee-baboon alignments showed no significant differences. The numbers of GC→AT changes along the human and chimpanzee lineages were 2216 and 2256, respectively (binomial test, P = 0.54; see Materials and Methods), whereas the numbers of AT→GC changes were 2280 and 2254 (binomial test, P = 0.68). The numbers of substitutions observed on each lineage were therefore combined for all the following analyses.
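
    The lineage comparison can be illustrated with an exact two-sided binomial test under the null hypothesis that a change of a given type is equally likely to fall on either lineage. This is a sketch under that assumption, with a hypothetical function name; the P values it returns may differ slightly from those reported, depending on exactly how the original test was formulated.

```python
def binomial_two_sided_p(k, n):
    """Exact two-sided P value for k successes in n trials when p = 0.5.

    Because the null distribution is symmetric, the P value is twice the
    probability of the smaller tail, capped at 1.
    """
    tail = min(k, n - k)
    term = 1        # C(n, 0)
    cumulative = 1  # running sum of C(n, 0) .. C(n, i)
    for i in range(tail):
        term = term * (n - i) // (i + 1)  # C(n, i + 1), exact integer step
        cumulative += term
    return min(1.0, 2 * cumulative / 2 ** n)
```

    For the GC→AT counts, the call would be `binomial_two_sided_p(2216, 2216 + 2256)`; as in the text, the result is far from significance.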

    There is a significant excess of GC→AT substitutions in regions of high GC content and an excess of AT→GC substitutions in lower GC regions compared with the predicted ratio of 1:1 when the GC content is at equilibrium (binomial test). Furthermore, the predicted equilibrium GC contents (f*) estimated from regions of different GC content within the human-chimpanzee-baboon alignments are similar, and suggest a homogenization of GC content toward the average GC content of all regions studied here (comparable to the genomic average of ~41%). However, when the observed changes are divided into regions of high (≥0.4) and low (<0.4) GC content, f* is significantly higher in regions of higher GC content (0.43 in high GC, 0.40 in low GC; bootstrap P = 0.003). This suggests that there is a slight bias toward AT→GC substitutions in regions of higher GC content.


    Table 1 Parsimony-Inferred Substitutions in Noncoding Nonrepetitive Genomic Regions Since the Human-Chimpanzee Common Ancestor.

    The maximum likelihood estimates of substitutional changes show extremely high concordance with the parsimony estimates. This indicates that parsimony is a reliable way to infer ancestral states at these evolutionary distances, even when non-equilibrium base composition is taken into account. Significant levels of 2ΔlogL were found in all of the GC content categories, providing further support for equilibrium GC content varying between branches.


    Table 2 Comparison of Maximum Likelihood and Parsimony Estimates of Substitution Patterns Along the Human and Chimpanzee Lineages.

    Mutation Pattern in Human SNPs

    The pattern of changes observed in a genome-wide sample of 6542 noncoding human SNPs shows a highly significant excess of GC→AT over AT→GC changes in regions of high GC content compared with a 1:1 ratio (binomial test) and indicates a similarity to the pattern of homogenization of GC shown by the nucleotide substitutions. However, similar values of f* are obtained for SNPs irrespective of current GC content, and there are no significant differences in f* between regions of high (≥0.4) and low (<0.4) GC content (0.37 in high GC, 0.36 in low GC). Assuming that patterns of polymorphism mainly reflect patterns of mutation, this suggests that there is little genomic variation in the level of mutation bias (u/v).


    Table 3 Pattern of Mutations Inferred by Parsimony in Human SNPs.

    Whereas there is no significant bias in the pattern of substitutions inferred by parsimony (4525 AT→GC, 4472 GC→AT), the polymorphism data contain a net excess of GC→AT mutations (1994 AT→GC, 2346 GC→AT). This excess is highly significant using a G test of independence and remains upon removal of possible CpG changes (4525 AT→GC and 3617 non-CpG GC→AT substitutions; 1994 AT→GC and 1770 non-CpG GC→AT polymorphisms; G = 7.019, P = 0.008). As the mean GC contents of the polymorphisms and substitutions, weighted by observed changes (see Materials and Methods), are both 0.41, this suggests that there is a fixation bias operating to increase the likelihood of fixation of AT→GC mutations.
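
    The G test of independence on the 2 × 2 table of change type by data set can be sketched as follows. The function name is hypothetical, no continuity or Williams correction is applied (an assumption), and the P value uses the closed form for a chi-squared distribution with one degree of freedom.

```python
import math


def g_test_2x2(a, b, c, d):
    """G test of independence for the 2 x 2 table [[a, b], [c, d]].

    Returns (G, P), with P taken from a chi-squared distribution with
    one degree of freedom: P = erfc(sqrt(G / 2)).
    """
    observed = [[a, b], [c, d]]
    row = [a + b, c + d]
    col = [a + c, b + d]
    total = a + b + c + d
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            if observed[i][j] > 0:
                g += observed[i][j] * math.log(observed[i][j] / expected)
    g = max(2.0 * g, 0.0)  # guard against tiny negative rounding error
    return g, math.erfc(math.sqrt(g / 2.0))
```

    With rows for substitutions and polymorphisms and columns for AT→GC and non-CpG GC→AT changes, `g_test_2x2(4525, 3617, 1994, 1770)` reproduces the reported G of about 7.02 and P of about 0.008.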

    We also compared the values of f* predicted by polymorphism and substitution data by comparing independent bootstrap replicates. When regions of all GC contents are analyzed together, f* predicted from polymorphisms is significantly lower than from substitutions (P < 0.001), indicating an overall bias toward fixation of AT→GC mutations. It is unclear from these data, however, whether this fixation bias is more prevalent in regions of high or low GC content, as significant differences between patterns of divergence and polymorphism are only observed in regions of intermediate GC content, where the amount of data is larger.

    Analysis of the frequency distributions of GC→AT and AT→GC SNPs in regions of high (≥0.5), medium (0.4–0.5), and low (<0.4) GC content separately reveals no significant differences using Mann-Whitney U tests. The mean frequency of the two types of SNP is similar in all regions. However, at higher (≥0.5) GC contents the average frequency of AT→GC mutations is slightly higher than that of GC→AT mutations, as reflected in the frequency distributions (fig. 1). This discrepancy is consistent with the presence of a weak fixation bias favoring mutations that increase GC content in regions where GC content is already high.


    Table 4 Average Frequencies of Human SNPs Classified by GC Content.


    FIG. 1. Allele frequency distributions of human SNPs resulting from AT→GC (black bars) and GC→AT (white bars) mutations with surrounding GC contents of <0.5 (a) and ≥0.5 (b).

    Regional Effects on Substitution Patterns

    As described previously, there is a significant positive correlation between the rate of divergence in noncoding, nonrepetitive sequences along the human and chimpanzee lineages and GC content, when all changes are considered (Spearman's rank correlation r = 0.298 and P < 0.001), and this correlation remains upon removal of potential CpG mutations (r = 0.232 and P < 0.001). The higher divergence in high GC regions appears mainly to be caused by the increase in GC→AT mutations. This is expected when the genome is not at equilibrium, as there are more GC nucleotides to mutate in high GC regions, leading to an ongoing decrease in GC content of these regions. There is also a significant positive correlation between GC content and the number of GC→AT substitutions divided by the number of G and C nucleotides in the sequence (r = 0.139 and P < 0.001) and between GC content and the number of AT→GC substitutions divided by the number of A and T nucleotides in the sequence (r = 0.239 and P < 0.001). Hence both AT→GC and GC→AT changes are more common in regions of higher GC than predicted by a simple model of constant mutation rates, even when compositional non-equilibrium is taken into account.


    FIG. 2. Patterns of substitution inferred along the lineages leading to human and chimpanzee in noncoding nonrepetitive regions of 1.8 Mb of human-chimpanzee-baboon alignments. The average divergences given are calculated by considering only changes from each particular category divided by total number of nucleotides compared.

    By analysis of patterns of substitution in 333 Alu elements identified within the human-chimpanzee-baboon alignments, we quantified regional effects in recent evolution since the time of the human-chimpanzee ancestor. If the correlation between GC and substitution rate is influenced by mutational or fixation pressures that act over large regions of sequence such as isochores, then we would expect the rate of divergence of an Alu repeat to be affected by the GC content of the surrounding DNA. However, there is no trend for Alu repeats to have higher substitution rates when they are inserted in regions of higher GC content (Spearman's rank correlation r = 0.029, P = 0.597). There is a significant correlation between the rate of divergence in Alu repeats and the GC content of the repeat (r = 0.208, P < 0.001), consistent with the effect of compositional non-equilibrium.


    FIG. 3. Variation of divergence with GC content in Alu repeats grouped by both internal GC content of the repeat and surrounding GC content. There is no significant correlation with the surrounding GC content (dashed line), but there is a significant positive correlation with internal GC content (P < 0.001; solid line).

    The pattern of changes in Alu repeats grouped according to surrounding GC content shows no differences: the ratio of GC→AT to AT→GC mutations is similar and biased toward GC→AT changes regardless of the GC content of surrounding DNA. This GC→AT bias can be explained by the high average GC content of Alu repeats (~0.51 in this sample) and suggests that all Alu repeats move to a similar equilibrium content to that estimated for nonrepetitive noncoding DNA. Comparison of 10,000 individual bootstrap replicates found no significant differences in the ratio of GC→AT to AT→GC substitutions between low (<40%) and high (≥40%) GC contents. The patterns of substitutions in Alu repeats are therefore broadly consistent with those in other noncoding regions and indicate little or no variation in mutation bias between different genomic regions.


    Table 5 Pattern of Substitution in 333 Alu Repeats Inferred from Human-Chimpanzee-Baboon DNA Alignments Classified by Repeat and Surrounding GC Content.

    Discussion

    Fixation Bias

    We have examined the evolution of GC content in noncoding DNA by analysis of substitutions since the human-chimpanzee common ancestor and by rooting a genomic sample of human SNPs. Both data sets indicate a homogenization of GC content in the human and chimpanzee genomes. However, there are significant differences between the divergence and polymorphism data when all GC classes are pooled, revealing a slight bias toward fixation of AT→GC mutations.

    It is unclear from the comparisons of polymorphism and divergence whether the bias toward fixation of GC nucleotides is more powerful in regions of any particular GC content. However, the equilibrium GC content predicted by substitutions in regions of higher (≥0.4) GC content is significantly higher than in regions of lower (<0.4) GC content, whereas no such pattern is exhibited by polymorphisms. In addition, there is a trend toward AT→GC mutations segregating at relatively higher mean frequencies in regions of high GC content. It is therefore possible that the bias toward fixation of AT→GC mutations operates most strongly in regions of high GC content. Nevertheless, as the polymorphism and divergence data sets both reveal that the GC content of all regions of the genome is heading toward an equilibrium of close to 40%, the observed patterns of substitution can be considered to be mainly caused by the pattern of mutation, with only a weak effect of fixation bias.

    It is possible that a genome-wide change in mutation bias occurring after the divergence of humans and chimpanzees could result in the observed discrepancies between patterns of intra- and inter-specific mutational changes. However, there are no significant differences between the patterns of inferred changes on the human and chimpanzee lineages, and it is unlikely that similar changes in mutation bias have occurred in the two lineages.

    Another potential complication is that more rapidly evolving sequences are under-represented in the human-chimpanzee-baboon alignments (used to infer substitutions) compared with the human-chimpanzee alignments (used to infer SNP roots) because of difficulties in assigning orthology with baboon sequences due to greater divergence. If rapidly evolving sequences have different substitution patterns, then this could cause a bias in the results. However, average divergence in noncoding, nonrepetitive regions of the human-chimpanzee alignments (excluding those of draft quality) is 0.0104, whereas in the human-chimpanzee-baboon alignments it is 0.0105, which suggests that the human-chimpanzee-baboon alignments do not represent particularly slowly evolving sequences.

    It is important to exclude the possibility that errors in the data set could result in any of the observed results. In the case of the patterns of substitution observed in human-chimpanzee-baboon alignments, this would happen only if sequencing were highly biased toward errors that change GC content of the sequence. This is unlikely, and furthermore would be expected to lead to differences in the patterns of substitutions inferred on human and chimpanzee lineages, which are not observed. A recent study found no significant differences between the substitutions observed by comparing human and chimpanzee sequences and all human SNPs, which means that differences reported here between substitution and polymorphism are only apparent when the roots of the polymorphisms are considered. Even though the chimpanzee genome project sequences used here are of draft quality, there would have to be a very large bias toward errors that change GC to result in incorrect rooting of a fraction of the human SNPs, which is unlikely because the level of divergence between chimpanzee draft sequences and human sequences is low (0.026 per nucleotide).

    Determinants of Mutation Rate Variation

    Under a model where the equilibrium GC frequency predicted by the combined effects of mutation and fixation processes is lower than 0.5, a positive correlation between divergence and GC content is expected, as exhibited by the data presented here. However, both the GC→AT and AT→GC substitution rates per nucleotide also increase with GC content, which suggests that, although compositional non-equilibrium is likely to be a major determinant of higher substitution rate in regions of high GC, other factors also act to increase the substitution rate in these regions. One intriguing possibility is the potentially mutagenic effect of recombination, which correlates with GC content.

    Further insight into the mechanisms determining substitution rate variation can be gained by examining the patterns of substitution in DNA elements inserted into backgrounds with different GC contents. A previous study examined substitutions in three processed pseudogenes of high GC content inserted into regions of intermediate GC contents and showed that the pseudogenes were approaching the GC contents of the surrounding DNA, consistent with a regional effect on the nucleotide substitution process leading to an attenuation of GC content of inserted sequences. However, these findings are also consistent with a hypothesis of a constant mutation bias across the entire genome where u > v, which leads to an excess of GC→AT substitutions in sequences of high GC content, regardless of their genomic location.

    An analysis of substitution patterns observed in a genome-wide sample of five different DNA transposons compared with their consensus sequences revealed that elements inserted into high GC regions accumulated a greater relative proportion of AT→GC substitutions than repeats in low GC regions. Another study reported an effect of surrounding GC content on the pattern of substitution in Alu repeats in the human α- (high GC content) and β-globin (intermediate GC content) gene clusters since the common ancestor of human and chimpanzee. However, a similar effect is not evident in our larger sample of Alu repeats, in which substitution patterns since the human-chimpanzee common ancestor are not significantly affected by surrounding GC content. It is possible, however, that such an effect would become apparent over a longer timescale. In addition, it is plausible that there were forces acting to maintain GC content earlier in vertebrate evolution, which left traces in the divergence of anciently inserted sequences, but that these factors have not been as effective since the human-chimpanzee common ancestor.

    Conclusions

    The data presented here are consistent with a model of compositional evolution driven by the combined effects of a uniform mutation bias and a weak bias toward fixation of GC nucleotides, which is likely to be most effective in regions of high GC content. However, these factors do not appear to have maintained isochore GC content since the human-chimpanzee common ancestor, which raises further questions regarding when isochores originated and the processes involved. As there is no evidence to suggest that the mutation bias has varied across the genome in recent evolution (since the human-chimpanzee split), there is no specific reason to suggest that this was the case earlier in vertebrate evolution. However, this does not preclude the possibility that patterns of mutation bias are dynamically evolving and that many regions are undergoing changes in GC content unrelated to their present composition. The efficacy of selection or biased gene conversion in favor of GC nucleotides is greater in species with large effective population sizes. Hence, as effective population sizes of primates are presumably low compared with the vertebrate common ancestor and we can detect a signal of fixation bias, it seems plausible that fixation bias processes were stronger in the past and may have been instrumental in the shaping of isochores in vertebrate genomes.
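
    The dependence of fixation bias on effective population size can be made concrete with a minimal sketch. It uses the standard weak-selection approximation for relative fixation probability, which is swapped in here for illustration and is not a method taken from this study. It assumes that AT→GC mutations carry a scaled advantage S = 4Ns (N the effective population size, s a hypothetical per-site coefficient standing in for selection or biased gene conversion), that GC→AT mutations carry -S, and that u and v are the GC→AT and AT→GC mutation rates. All names and values are hypothetical.

```python
import math


def relative_fixation(S):
    """Fixation probability relative to a neutral mutation, scaled coefficient S."""
    if abs(S) < 1e-12:
        return 1.0  # neutral limit
    return S / (1.0 - math.exp(-S))


def equilibrium_gc_with_bias(u, v, S):
    """Equilibrium GC when AT->GC changes have advantage S and GC->AT have -S."""
    at_to_gc = v * relative_fixation(S)
    gc_to_at = u * relative_fixation(-S)
    return at_to_gc / (at_to_gc + gc_to_at)


# Hypothetical mutation rates and coefficient; only the population size varies.
u, v, s = 2.0e-9, 1.0e-9, 1.0e-6
for n_e in (1.0e4, 1.0e5, 1.0e6):
    S = 4.0 * n_e * s
    print(f"Ne = {n_e:.0e}  S = {S:.2f}  equilibrium GC = "
          f"{equilibrium_gc_with_bias(u, v, S):.3f}")
```

    With the mutation rates held fixed, the equilibrium GC content in this sketch rises with effective population size alone, which is the sense in which a fixation bias that is weak in primates could have been more effective in a larger ancestral population.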

    Acknowledgements

    Financial support was obtained from the Swedish Research Council. H.E. is a Royal Swedish Academy of Sciences Research Fellow supported by a grant from the Knut and Alice Wallenberg Foundation.

    Literature Cited

    Akashi, H. 1995. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076.

    Akashi, H., and S. W. Schaeffer. 1997. Natural selection and the frequency distributions of "silent" DNA polymorphism in Drosophila. Genetics 146:295-307.

    Altschul, S. F., T. L. Madden, A. A. Schaffer, J. H. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953-958.

    Casane, D., S. Boissinot, B. H. Chang, L. C. Shimmin, and W.-H. Li. 1997. Mutation pattern variation among regions of the primate genome. J. Mol. Evol. 45:216-226.

    Chen, F. C., and W.-H. Li. 2001. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444-456.

    Duret, L., M. Semon, G. Piganeau, D. Mouchiroud, and N. Galtier. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics (in press).

    Ebersberger, I., D. Metzler, C. Schwarz, and S. Paabo. 2002. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70:1490-1497.

    Eyre-Walker, A. 1993. Recombination and mammalian genome evolution. Proc. R. Soc. Lond. Ser. B Biol. Sci. 252:237-243.

    Eyre-Walker, A. 1997. Differentiating between selection and mutation bias. Genetics 147:1983-1987.

    Eyre-Walker, A. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686-690.

    Eyre-Walker, A. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675-683.

    Eyre-Walker, A., and L. D. Hurst. 2001. The evolution of isochores. Nat. Rev. Genet. 2:549-555.

    Filipski, J., J. Salinas, and F. Rodier. 1989. Chromosome localization-dependent compositional bias of point mutations in Alu repetitive sequences. J. Mol. Biol. 206:563-566.

    Fujiyama, A., H. Watanabe, A. Toyoda, et al. (17 co-authors). 2002. Construction and analysis of a human-chimpanzee comparative clone map. Science 295:131-134.

    Fullerton, S. M., A. Bernardo Carvalho, and A. G. Clark. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. 18:1139-1142.

    Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871-879.

    Lander, E. S., L. M. Linton, B. Birren, et al. (248 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.

    Lercher, M. J., and L. D. Hurst. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18:337-340.

    McDonald, J. H., and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.

    Nekrutenko, A., and W.-H. Li. 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10:1986-1995.

    Piganeau, G., D. Mouchiroud, L. Duret, and C. Gautier. 2002. Expected relationship between the silent substitution rate and the GC content: implications for the evolution of isochores. J. Mol. Evol. 54:129-133.

    Pruitt, K. D., and D. R. Maglott. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29:137-140.

    Smith, N. G. C., and A. Eyre-Walker. 2001. Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans. Mol. Biol. Evol. 18:982-986.

    Smith, N. G. C., M. T. Webster, and H. Ellegren. 2002. Deterministic mutation rate variation in the human genome. Genome Res. 12:1350-1356.

    Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653-2657.

    Sueoka, N. 1993. Directional mutation pressure, mutator mutations, and dynamics of molecular evolution. J. Mol. Evol. 37:137-153.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. ClustalW–improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

    Webster, M. T., N. G. C. Smith, and H. Ellegren. 2002. Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proc. Natl. Acad. Sci. USA 99:8748-8753.

    Yu, N., Y. X. Fu, N. Sambuughin, M. Ramsay, T. Jenkins, E. Leskinen, L. Patthy, L. B. Jorde, T. Kuromori, and W.-H. Li. 2001. Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol. Biol. Evol. 18:214-222.

    Zhao, Z., L. Jin, Y. X. Fu, et al. (13 co-authors). 2000. Worldwide DNA sequence variation in a 10-kilobase non-coding region on human chromosome 22. Proc. Natl. Acad. Sci. USA 97:11354-11358.

    Accepted for publication October 16, 2002. (Matthew T. Webster, Nick G. C. Smith, and Hans Ellegren)