Gene Expression, Synteny, and Local Similarity in Human Noncoding Mutation Rates
http://www.100md.com
分子生物学进展 2004年第10期
* Department of Evolution, Genomics and Systematics, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden; Department of Biology & Biochemistry, University of Bath, Bath, UK
E-mail: matthew.webster@ebc.uu.se
Abstract
The human genome is organized with regard to many features such as isochores, Giemsa bands, clusters of genes with similar expression patterns, and contiguous regions with shared evolutionary histories (synteny blocks). In addition to these genomic features, it is clear that mutation rates also vary across the human genome. To address how mutation rates and genomic features are related, we analyzed substitution rates at three classes of putatively neutral noncoding sites (nongenic, intronic, and ancestral repeats) in 14 Mb of human–chimpanzee alignments covering human chromosome 7. Patterns of mutation rate variation inferred from substitution rate variation differ among the three site classes. In particular, we find that intronic mutation rates are strongly affected by the breadth of expression of the genes in which they reside, with broadly expressed genes exhibiting low mutation rates, probably as a consequence of the transcription-coupled repair process acting in the germ line. All site classes show significant local similarities in mutation rate at the megabase scale, and regional similarities in nongenic mutation rate covary with blocks of synteny between the human and mouse genomes, indicating that the evolutionary history of a genomic region is an important determinant of mutation rate.
Key Words: human-mouse synteny blocks ? gene expression ? mutation rate ? recombination ? gene density ? Giemsa bands
Introduction
The structure of mammalian chromosomes is not uniform. In addition to banding patterns revealed by chromosome staining such as Giemsa bands, regional variation is also observed in base composition (isochores; Bernardi et al. 1985; Eyre-Walker and Hurst 2001), repetitive element density, gene density, and patterns of gene expression (Caron et al. 2001; Lercher et al. 2002; Lercher et al. 2003). Understanding how these features evolved is problematic owing to their complex inter-relationships. For example, the location of Giemsa bands can be predicted on the basis of variation in GC content (Niimura and Gojobori 2002), although they do not directly correspond to isochores (Saccone et al. 2001). Different classes of interspersed repeat elements show different distributions within genomes. Broadly expressed genes have been shown to cluster in GC-rich isochores and in R and the lightest Giemsa bands (Lercher et al. 2003; Vinogradov 2003). An additional level of organization is indicated by contiguous chromosomal regions that are evolutionarily conserved between species (synteny blocks; MGSC 2002). Such regions may represent important evolutionary units, as it has been suggested that translocation breakpoints are often re-used in mammalian evolution (Pevzner and Tesler 2003).
Mutation rates are not constant across mammalian chromosomes. Several studies have now suggested the existence of regions of local similarity in mutation rates at scales >1 Mb by comparing rates of synonymous evolution (kS) in neighboring genes (Lercher et al. 2001; Malcom et al. 2003; Matassi et al. 1999). However, the relationship between these local similarities and other genomic features is not clear. One possibility is that similarities in levels of kS are caused by clustering of genes with similar expression patterns (Williams and Hurst 2000; Williams and Hurst 2002). This could either be due to a mutagenic effect of transcription (Datta and Jinks-Robertson 1995; Morey et al. 2000) or because linked genes have similar levels of transcription-coupled repair (TCR; Green et al. 2003; Svejstrup 2002). Alternatively, the underlying mutation rate could vary between genomic regions on a larger scale determined by genomic features not directly related to transcription (Hurst and Eyre-Walker 2000).
Base composition is a potential correlate of mutation rate. Recent studies indicate that GC content in mammals is not at equilibrium and that GC-rich isochores are decaying (Arndt et al. 2003; Duret et al. 2002; Smith et al. 2002). This would result in higher mutation rates in GC-rich regions if the rate of GCAT mutations is higher than the opposite ATGC (Piganeau et al. 2002). A number of studies support this prediction and report a positive correlation between GC content and substitution rates (Hurst and Williams 2000; Smith et al. 2002; Smith and Hurst 1999). However, the evidence for this is equivocal and a comparison of the human and mouse genomes suggested that the relationship was best described by regression with a quadratic curve, where regions of extremely low or high GC content have the highest substitution rates (MGSC 2002). Recombination rate has also been shown to correlate with divergence, leading to the suggestion that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002). The relative effects of these factors are not certain as they are strongly correlated to each other (Fullerton et al. 2001) and it is likely that recombination influences GC content (Galtier et al. 2001; Meunier and Duret 2004). In addition to the effects of recombination and GC, a recent study reported considerable variation in kS between synteny blocks, suggesting that these regions could define the unit of mutation rate variation (Malcom et al. 2003). However, because the relationship between these blocks and other features—such as clusters of genes of similar expression and cytogenic bands—has not been quantified, the precise cause of this observation is unclear.
In contrast to the majority of previous studies of mutation rate variation that analyzed differences in kS by using human-rodent genic alignments (Lercher et al. 2001; Malcom et al. 2003; Matassi et al. 1999), we have analyzed noncoding divergence in human–chimpanzee genomic alignments. This approach has a number of benefits. First, rates at silent sites in mammals may be under selection owing to constraints caused by translational efficiency (Urrutia and Hurst 2003), which could result in a lower kS in highly expressed genes. Although negative selection may act on regulatory sequences in noncoding DNA, our results are not likely to have been generated by selection (see Discussion). Secondly, estimation of substitution rates at silent sites is often sensitive to methodology and potentially affected by nonsynonymous substitution rates (Bierne and Eyre-Walker 2003; Hurst and Williams 2000; Li 1997; Smith and Hurst 1999). Indel mutations, which are usually selected against in coding regions, owing to disruption of protein function, can be studied in noncoding DNA. Analysis of substitutions in noncoding DNA also enables the comparison of mutation rates in transcribed sequences with surrounding intergenic regions, which allows quantification of the mutational pressures that act only on transcribed DNA, such as the possible effects of TCR. The estimation of mutation rates by using comparisons of closely related species such as human and chimpanzee has a number of advantages over comparisons of more distantly related species such as human and mouse. First, because of limited divergence, estimation of substitution rate is not greatly confounded by multiple hits. Furthermore, the impact of genomic rearrangements is much less, which means that the effects of base composition, recombination, and other features that may fluctuate over evolutionary time can be studied more accurately.
Herein we analyze the effects of a variety of features of genomic organization on noncoding human–chimpanzee divergence in order to attempt to define the causes of regional variation in mutation rate. We address these questions by using long (average length, 165 kb) genomic alignments of human and chimpanzee sequence distributed across human chromosome 7. For each alignment, three measures of the substitution rate at mutually exclusive sites were estimated: nongenic nonrepetitive (kNG), nongenic ancestral repeats (kAR), and intronic nonrepetitive (kIN). We analyze the effects of various genomic features on these measures of substitution rate, assay the extent of mutation rate banding patterns, and analyze the contribution of different correlates of substitution rate in generating local similarities in mutation rates.
Methods
Construction of Genomic Alignments
We downloaded all full-length chimpanzee BAC clones available in the EMBL database on April 29, 2003, using the SRS retrieval tool at http://www.be.embnet.org/. The orthologous human sequence for each chimpanzee clone was identified by using Blast searches (http://www.ncbi.nlm.nih.gov/blast/) against all human genome contigs from the April 2003 build. The majority of sequences were found to be orthologous to human chromosome 7. Because a number of studies indicate the existence of significant mutation rate heterogeneity between chromosomes (Ebersberger et al. 2002; Lercher et al. 2001; MGSC 2002) and there was not enough chimpanzee sequence to reliably estimate variation in noncoding divergence on other chromosomes, we restricted our analysis to regional variation on chromosome 7.
Orthologous human and chimpanzee sequences were identified from the Blast results using the program seqtool. This program provides a graphical output of Blast results, thereby allowing orthologous sequences to be distinguished from rearrangements. We compared the performance of two global alignment programs, ClustalW (Thompson et al. 1994) and AVID (Bray et al. 2003), by aligning the same pair of human and chimpanzee genomic sequences with both programs. The alignment created by ClustalW contained 0.1% more aligned bases but 1.5% fewer inferred substitutions, which suggests that for genomic sequences from closely related species such as human and chimpanzee, ClustalW performs slightly better. We, therefore, used ClustalW with default settings to produce 84 alignments used in the analysis of mutation rate variation. Repetitive sequences were identified in the alignments using RepeatMasker (http://repeatmasker.genome.washington.edu/). Sequences were masked for all the simple repeats identified by Sputnik (http://espressosoftware.com/). Alignments were then checked by eye and a number of poorly aligned regions were removed from subsequent analyses.
Analysis of Noncoding Substitution Patterns
The positions of genes and exons within each gene were identified from the GenBank annotation files corresponding to the human contig in the alignment. Substitutions in three main subsets of the alignments were identified, and substitution rates were calculated without multiple hits correction, i.e. human-chimp differences as proportion of ungapped sites: (1) nongenic, nonrepetitive (kNG); (2) nongenic, ancestral-repeats (kAR); and (3) intronic, nonrepetitive (kIN). In all subsets, simple repeats identified by Sputnik were masked. In subsets a) and c) repeats identified by RepeatMasker were also masked, whereas in subset b) only those sequences identified by RepeatMasker as repetitive were analyzed, not including simple repeats masked by Sputnik or genic regions. Indels in each of the three classes of genomic region were inferred from contiguous runs of alignment gaps matching only bases within the appropriate class.
Some alignments contained regions of overlap with neighboring alignments, which were only analyzed once. Substitutions were initially analyzed treating each full alignment as a separate data point, only those alignments containing >5 kb of the specific type of DNA were included in the analyses. In order to analyze the more fine-scale effects of GC content on divergence, we also performed the analyses using non-overlapping windows of 20 kb, only including windows containing >1 kb of the appropriate site class. We also analyzed changes in intronic regions of each gene within the alignments separately in order to compare these values to measures of expression rate and breadth within individual genes, only including genes with >5 kb of aligned intronic sequence. Furthermore, we estimated kNG in 10 kb of sequence flanking both sides of each gene.
Gene Expression Data
Summary measures of rate and breadth of gene expression were derived from three independent data sources: expressed sequence tags (EST), microarray data, and serial analysis of gene expression (SAGE).
ESTs contained in UniGene clusters (Schuler et al. 1996; UniGene build 161, obtained from the National Center for Biotechnology Information [NCBI] at http://www.ncbi.nlm.nih.gov/) were traced to libraries in the dbEST database (also at NCBI), excluding libraries containing <50 ESTs. On the basis of dbEST annotations, 235 libraries were sorted into 55 normal (nondisease) tissue types, each represented by ESTs mapped to at least 500 UniGene clusters. For each UniGene cluster, expression breadth was defined as the number of tissues with an EST contained in the cluster, divided by the total number of tissues types (55). This estimate was then mapped to LocusLink gene IDs.
Normalized microarray data were obtained from Su et al. (2002). Data encompassing 63 replicate hybridizations were sorted into 28 normal tissues. LocusLink gene IDs were identified using Affymetrix annotation (http://www.affymetrix.com/). Average counts were used where multiple probe sets were present for the same gene. For each gene, breadth of expression was defined as (average counts across tissues) / (maximum counts across tissues). Before this calculation, data <0 was set to 0; if the maximum across tissues was <50, the breadth was set to 0. The cutoff of 50 is lower than the value of 200 used by Su et al. (2002). This was chosen because our estimate of gene expression breadth takes a larger number of measurements into account. Rate of expression was estimated as log2 (of the maximum counts across tissues).
SAGE data were obtained from SAGEmap at NCBI (Lash et al. 2000). The dataset was curated to avoid possible GC biases following the approach of Margulies et al. (2001), by removing libraries with mean tag GC>0.5. The remaining 40 libraries were merged into 19 normal tissue types. Tag counts were converted to counts per million. If only one tag was found for a gene–tissue type combination, we discarded the observation as a likely sequencing error. The expression data were cross-linked to RefSeq mRNA sequences in Ensembl (http://www.ensembl.org/), by extracting the 3'-most NlaIII SAGE tag for each mRNA. If the same tag was cross-linked to more than one Ensembl gene, all corresponding genes were excluded. If several tags mapped to the same gene, we chose the maximum counts per million across tags. Genes were then mapped to LocusLink IDs. For each gene, breadth of expression was defined as (average counts across tissues) / (maximum counts across tissues); breadth was set to 0 if the maximum was zero. Rate of expression was defined as log2 (of the maximum counts across tissues).
For each of the autosomal genes for which EST expression breadth data was available, we extracted the transcribed sequences from GenBank genomic contigs, located by using the GenBank contig annotation files. We then calculated the proportion of G+T nucleotides on the coding strand in order to quantify the potential effect of TCR in generating strand asymmetry in transcribed regions of genes of different expression patterns.
Other Genomic Features
Local GC was calculated as the average GC content of the nonmasked human and chimpanzee DNA sequences in the region surveyed. Recombination rate was estimated from the rate at the nearest marker to region under analysis identified from the human contig for which recombination rate information was available from Kong et al. (2002). We calculated the density of transcribed DNA and exonic DNA within each alignment by using positional information from GenBank contig annotation files. These factors were defined as the number of bases in the alignment belonging to each category divided by the total number of bases. We downloaded tables describing the location of Giemsa bands and bands of contiguous synteny between human and mouse on the April 2003 build of the human genome from UCSC (http://genome.ucsc.edu/). These tables were used to identify the position of each alignment relative to these genomic features.
Statistical Analyses
Correlations between substitution rates and continuously varying alignment features (GC content, recombination rate, exon and transcript density, and gene expression measures) were analyzed using Spearman's rank correlations. The effect of categorical sequence features (Giemsa bands and synteny blocks) on substitution rates were analyzed by using analyses of variance (ANOVAs). Following the rationale presented in Smith and Eyre-Walker (2003), substitution rates were log transformed. We also tested for effects of categorical features by using the nonparametric Kruskal-Wallis test, with significance obtained by using the standard asymptotic method. To determine whether significant ANOVA results resulting from synteny blocks were caused specifically by these features and not as a secondary effect of local similarities in mutation rate at the megabase scale, we performed a randomization test. By using 1000 replicates, we randomly shuffled the order of synteny blocks along the chromosome, keeping the lengths constant, and we re-assigned alignments to the randomized blocks. For each randomized dataset we recalculated the ANOVA F-value and compared this to the original value from the real data.
The extent of local similarities in substitution rates and the effects of various genomic features on these similarities were assayed using methods developed in Lercher et al. (2001). For each alignment, the absolute substitution rate difference was calculated between that alignment and the mean of all other alignments with their midpoints within a specified distance of the midpoint of that alignment. Then this observed total difference was compared to the total difference expected when the order of the alignments was randomly shuffled (1000 shuffles). The statistic is defined as the ratio of the observed total difference to the expected total difference (the mean of the ratio taken over all shuffles). Lower values of indicate greater local similarity, and one-tailed significance values were obtained by comparison of the observed total difference with the set of shuffled total differences. This analysis was repeated using different scales to determine the range of substitution rate variation. The effects of different correlates of substitution rates on local similarities were tested by only shuffling alignments with values similar to those of a particular correlate.
Results
Deterministic Mutation Rate Variation
We analyzed substitution rates in three classes of noncoding site within a set of 84 genomic human–chimp alignments corresponding to human chromosome 7. When only alignments with >5kb of aligned sites within each site class were considered, the size of each dataset was reduced (table 1). The large proportion of sites in introns probably reflects the fact that many of the chimpanzee clones used in this study are derived from the NISC comparative vertebrate sequencing project (http://www.nisc.nih.gov/), which has specifically targeted gene-rich regions. Divergence in ancestral repeats (kAR) is significantly higher than at both nonrepetitive nongenic (kNG) and intronic (kIN) sites (Willcoxon rank-sum test; kNG vs. kAR, W = 797, P = 0.00219; kIN vs kAR, W = 2057, P = 5.6 x 10–5), whereas kNG and kIN were not significantly different (W = 2582, P = 0.226). The high rate of evolution in ancestral repeats may be due partly to higher levels of hypermutable CpG dinucleotides. The observed variances of the divergence estimates between full alignments were kNG = 7.08 x 10–6, kAR = 1.03 x 10–5, and kIN = 7.01 x 10–6. We simulated the expected variances of each distribution assuming a uniform mutation rate by using the observed number of sites in each class in each alignment. For kNG and kAR, the observed variances were 11 and 12 times expected, respectively, whereas for kIN the variance was 29 times expected. This indicates substantial variation in mutation rate at this scale, with particularly high variation exhibited by intronic regions. Variation in kNG across human chromosome 7 is illustrated in figure 1.
Table 1 Summary of human—chimpanzee alignments
FIG. 1.— Variation in kNG across human chromosome 7. Bars represent two standard errors and were calculated using the binomial formula for variance, under the assumption of no variation in substitution rate within each alignment. The estimated length of the chromosome is 159 Mb.
Indel and substitution rates are significantly correlated for kNG (r = 0.421, P = 0.0006) and kIN (r = 0.432, P = 0.0002) although the relationship is not as strong for kAR (r = 0.302, P = 0.062). Substitution rates for the three site classes in the same alignments are strongly correlated (fig. 2; kNG vs. kIN, r = 0.634, P < 10–6; kNG vs. kAR, r = 0.571, P = 0.00019; kIN vs. kAR, r = 0.530, P = 0.0021). However, a large part of the variation in different substitution rate estimates does not correlate with other estimates, which suggests that different factors may affect variation in each substitution type. In particular, the comparisons involving kAR have lower correlation coefficients, suggesting that substitution rates in ancestral repeats may be governed by additional factors. These results confirm those of previous studies that deterministic mutation rate variation exists in the human genome (Hardison et al. 2003; Smith et al. 2002); we now turn to factors that influence this variation.
FIG. 2.— Deterministic variation in mutation rate is indicated by a positive correlation between kNG and kAR (filled circles, dotted line) and kIN (empty circles, solid line) using entire human–chimpanzee genomic alignments as separate data points. Lines represent linear regressions.
The Effect of Recombination and GC
We find no significant correlation between GC content of the aligned sequences and any of the measures of substitution rate, when complete alignments are considered as single data points (kNG, r = –0.013, P = 0.917; kAR, r = 0.135, P = 0.413; kIN, r = –0.082, P = 0.492; figure 3). These values remain nonsignificant when alignments are broken into 20-kb non-overlapping windows (kNG, r = 0.051, P = 0.13; kAR, r = 0.065, P = 0.38; kIN, r = –0.024, P = 0.576). Some previous studies have suggested that GC content and substitution rate are positively correlated. However, comparison of the entire human and mouse genomes (MGSC 2002) revealed more complex relationships: at synonymous coding sites, substitution rate was found to be approximately constant at low and intermediate GC levels, with a sharp increase at extremely high GC level; in ancestral repeats, substitution rate showed a V-shaped dependence on GC, with a minimum at intermediate GC level. Our data probably do not rule out these relationships, although we do not have sufficient data to accurately define them (fig. 3). Furthermore, it is possible that mutation rate is affected by regionality in GC content on a larger scale, such as the scale of several megabases shown by isochores rather than a local effect. This is difficult to test using the alignments presented here because many alignments are close to each other, which would mean data points are not independent.
FIG. 3.— No significant correlations were observed between local GC content and substitution rate, using kNG (empty circles), kAR (filled circles), and kIN (crosses).
We found a positive correlation between divergence and recombination by using complete alignments, significant for kNG (r = 0.251, P = 0.049, figure 4) and suggestive for kIN (r = 0.215, P = 0.071). The cause of this correlation is a point of debate, but one suggestion is that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002). In contrast, kAR does not correlate with recombination rate (r = 0.079, P = 0.642). One potential problem with the previously described analyses of the relationships between recombination rate and substitution rates concerns the nonindependence of recombination rate measures: in several cases, neighboring alignments share the same recombination rate because recombination rate marker density is lower than alignment density. This problem means that the analyses may have been biased. To check this possibility we removed various alignments from consideration so that each recombination rate marker corresponded to just a single alignment. In the case of kNG, the Spearman rank correlation coefficient increased to r = 0.292, although the relationship became nonsignificant (P = 0.060) owing to reduction of the data set from 62 to 42 alignments. The increase in correlation coefficient suggests that the relationship is genuine. On the other hand, with kIN the Spearman rank correlation coefficient decreased markedly to r = 0.134, confirming the weaker relationship between recombination and intronic substitution rates.
FIG. 4.— Positive correlation between kNG from entire genomic alignments and recombination rate, defined as the mean rate of all markers within each alignment reported in (Kong et al. 2002). Line represents linear regression.
The Effect of Gene Expression
We calculated kIN separately for 122 genes within the alignments that could be identified using LocusLink IDs. This identifier was used to link to estimates of gene expression breadth and rate. Significant negative correlations were observed between kIN and expression breadth using the EST data (r = –0.405, P = 0.001, n = 60 genes; figure 5) and microarray data (r = –0.458, P = 0.006, n = 34 genes). A similar correlation coefficient was indicated by the SAGE data, although this was not significant (r = –0.388, P = 0.075, n = 22 genes). We find no significant correlation between kIN and rate of expression measured using microarray or SAGE data (microarray, r = 0.165, P = 0.393, n = 29 genes; SAGE, r = –0.042, P = 0.855, n = 22 genes). No significant correlations were observed between any measure of gene expression rate or breadth and kNG in 10 kb flanking both sides of each gene (data not shown).
FIG. 5.— Negative correlation between kIN and expression breadth for individual genes within human–chimpanzee genomic alignments. Expression breadth was estimated from EST data (see Methods section). Line represents linear regression.
A potential reason for the significantly lower intronic substitution rate (kIN) in broadly expressed genes is that such genes are more likely to be expressed in the germ line and that the effects of TCR will be passed on to the next generation. A mutational bias caused by TCR is believed to be responsible for an observed skew in base composition in transcribed regions, which exhibit an excess of G+T nucleotides on the coding strand. To examine the hypothesis that TCR has a greater effect on the substitution pattern in broadly expressed genes, we calculated the skew in base composition (GT/AC) on the coding strand of all autosomal transcripts in the human genome for which sequence and EST data were available. There is a significant positive correlation between expression breadth and compositional asymmetry (r = 0.368, P < 10–15, n = 12,723, fig. 6). This suggests that TCR has a significantly greater effect on the nucleotide substitution process in broadly expressed genes.
FIG. 6.— Positive correlation between expression breadth estimated from EST data (see Methods section) and compositional asymmetry measured by the ratio of GT–AC nucleotides on the coding strand of transcribed regions. Line represents linear regression.
As genes of similar expression are often clustered together (Caron et al. 2001; Lercher et al. 2002; Lercher et al. 2003), we tested whether the negative correlation between kIN and expression was caused by modified mutation rates in the surrounding genomic region, or whether the effect only acts on individual genes. We estimated the average expression breadth and rate for all of the genes found within single genomic alignments and compared these values with levels of divergence in each alignment. All of the correlations between kIN and expression breadth are reduced to nonsignificance and the expression rate measures remain nonsignificant (data not shown). This suggests that gene expression has a direct effect on the mutation rate of individual genes rather than the mutation rate being a function of the clustering of genes of similar mutation rates. In addition, there are no significant correlations between kNG or kAR and any of these average measures of expression breadth or rate (data not shown). Taken together, these results suggest that factors linked to gene expression reduce mutation rates in individual genes but have little or no effect on surrounding noncoding regions.
To test the proportion of variance in kIN that is due to TCR, we performed a multiple linear regression by using the three measures of gene expression breadth from EST, microarray, and SAGE data. The total R2 value was 0.397, suggesting that TCR can account for a large proportion of variation in mutation rates in transcribed regions. The variance in intronic divergence (kIN) was 29 times that expected under a uniform mutation rate compared with the variance in nontranscribed divergence (kNG and kAR) of 11 to 12 times expected (see previous text). If 40% of the variance in kIN can be accounted for by expression data, that still leaves the remaining variance in kIN roughly 17 times expected (i.e., more than for nontranscribed divergence). This could mean that further factors govern mutation rates in transcribed regions compared with nongenic regions, in addition to TCR.
Other Chromosomal Features
The density of genes is known to vary greatly across the genome, with genes clustering in GC-rich isochores and Giemsa negative chromosome bands (IHGSC 2001). To calculate whether this clustering affects mutation rate, we calculated the proportion of sites within exons (exon density) within each alignment. There is a significant negative correlation between alignment kIN and exon density (r = –0.276, P = 0.019) but not with transcript density (r = –0.027, P = 0.822). Similar negative, although non-significant, correlations with exon density are found for kAR (r = –0.313, P = 0.052) and kNG (r = –0.148, P = 0.243).
We also examined the effects of Giemsa bands and regions of human–mouse synteny on mutation rate by using analysis of variance. Giemsa band locations were classified as either Giemsa negative or one of four classes of Giemsa positive bands (classified by the intensity of staining). kNG (F = 1.463, P = 0.225), kAR (F = 0.664, P = 0.621), and kIN (F = 2.171, P = 0.082) were not significantly affected by Giemsa band. However, the suggestive result in the case of kIN is interesting to note as genes with similar expression breadth have been observed to cluster according to Giemsa band (Lercher et al. 2003), and we have shown gene expression breadth to be an important determinant of kIN. Similar results were obtained by using Wilcoxon rank sum tests to compare substitution rates in Giemsa negative with all classes of Giemsa positive bands (kNG, W = 513, P = 0.989; kAR, W = 203, P = 0.664; kIN, W = 515, P = 0.149). It is interesting to note that the region with the highest nongenic divergence (kNG = 0.022) is the closest to the end of chromosome 7 (3.4 Mb from the end of 7p22). Regions of high recombination rate have been identified close to telomeres (Kong et al. 2002), although it is unclear if this is the cause of the elevated substitution rate because the recombination estimate for this alignment is moderate (1.43 cM/Mb).
We then used a similar analysis to test whether regions that exist within the same contiguous segments in the mouse genome (synteny blocks) have similar mutation rates. Significant variation between synteny blocks was observed for both kNG (ANOVA F = 3.72, P = 1.94 x 10–4; Kruskal-Wallis P = 0.013) and kIN (F = 4.00, P = 3.13 x 10–5; Kruskal-Wallis P = 0.018). The effect of synteny blocks is not significant for kAR (ANOVA F = 1.98, P = 0.075; Kruskal-Wallis P = 0.240). Local similarities in mutation rate are known to exist at the megabase scale (Lercher et al. 2001; Malcom et al. 2003; Matassi et al. 1999), and these could account for significant differences between synteny blocks even if the blocks themselves did not influence this variation. In order to account for this, we recalculated the ANOVA F-values using datasets where the positions of synteny blocks had been randomized. For kNG, the F-value from the real dataset was significantly higher than that for the randomized samples (P = 0.04). However, for kIN, the real F-values were not significantly greater than those for randomized datasets (P = 0.20). These results suggest that evolutionary history is a major determinant of mutation rate variation in nongenic, nonrepetitive sequence at the megabase scale. The absence of this effect for kAR is interesting to note, as recent ancestral repeats do not share the same evolutionary history as their surrounding DNA.
The Extent and Causes of Local Similarities in Mutation Rate
We examined local similarities in noncoding divergence using the statistic (Lercher et al. 2001), which is equal to the observed local heterogeneity in divergence between alignments divided by the expected heterogeneity derived from randomized data (see methods). Values of significantly less than 1 indicate the presence of local similarities. We first considered the effect of the distance range on local similarity. Our alignments provide only patchy coverage of chromosome 7 and are not suitable for close consideration of the range of local similarity. Instead we simply wished to identify the distance range that gave the strongest signal of local similarity. The true level of local similarity may well be greatest at the shortest distances, but short distance ranges reduce the number of possible alignment comparisons; therefore, we expect to be minimized at an intermediate distance range. We considered only those alignments on chromosome 7 with sufficient data in each particular substitution rate class (64 for kNG, 39 for kAR, and 72 for kIN). When substitution rates were compared between alignments within a distance range of 1 Mb, NG = 0.852 (P = 0.115), AR = 0.907 (P = 0.249), and IN = 0.639 (P = 0.001). For a distance range of 2 Mb, NG = 0.694 (P < 0.001), AR = 0.735 (P = 0.018), and IN = 0.600 (P < 0.001); and for a distance range of 3 Mb NG = 0.782 (P = 0.002), AR = 0.814 (P = 0.035), and IN = 0.728 (P < 0.001). Hence the strongest signal of local similarity is shown by all measures of substitution rate at a distance range of 2 Mb. We used this distance range to test the effects of correlates of divergence in determining local similarities in mutation rate.
We investigated the causes of local similarity (see Lercher et al. 2001) by modifying the randomization algorithm so that only alignments with similar levels of a particular correlate were shuffled with each other. For example, if local similarities in substitution rates directly corresponded to local similarities in GC content, then shuffling among alignments of similar GC content would not be expected to make much difference to the pattern of substitution rate variation. In this case the observed and shuffled total differences would be similar and there would be no signal of local similarity.
Variation in kNG correlates with recombination rate. In order to test whether recombination can account for local similarities in kNG, the alignments were first ranked according to the recombination rate of the region from which they were taken and then grouped into five classes of equal size. The alignments were then shuffled within each of the five recombination rate classes, only considering the 62 alignments for which recombination rate data were available. With this slightly reduced data set we still found highly significant local similarity as determined by shuffling all alignments (NG = 0.694, P < 0.001). However, neither nor P changed greatly when shuffling was performed within recombination rate classes (NG = 0.709, P < 0.001), which suggests that recombination can only account for a very small part of the local similarities observed in kNG.
Alignments are significantly affected by exon density, for which a measure was available for all alignments. Shuffling within exon density classes had very little effect on local similarity (IN = 0.602, P < 0.001 compared with IN = 0.600, P < 0.001 for the unshuffled dataset). Because we did not find any genomic features that correlate with kAR, we did not perform any further randomizations using this dataset.
Although recombination rate and exon density correlate with substitution rate, neither of these features can explain the local similarities observed on a megabase scale. An additional feature that covaries with substitution rate is location within a particular human–mouse synteny block. Location within a certain synteny band could potentially explain a large proportion of the local similarities in substitution rate. However, the 64 alignments with sufficient sites to estimate kNG fit into 20 synteny blocks, and only seven of these contain more than three alignments. Hence shuffling within synteny bands would not greatly alter the distribution of alignments along chromosome 7 and hence would not be an adequate test of whether synteny blocks are responsible for local similarities in divergence.
Discussion
We have examined mutation rate variation across a single human chromosome by first identifying factors that have significant effects on substitution rates, and then by attempting to determine whether any of these factors could be responsible for the existence of regions of local similarity in mutation rate. Substitution rates in three classes of noncoding sites (kNG, kAR, and kIN) show significant evidence for local similarities at the megabase level. Factors such as recombination rate, gene expression breadth, and exon density were shown to significantly correlate with some or all of these measures of substitution rate, suggesting a direct or indirect effect on mutation rate, but none of these factors could explain the existence of large-scale local similarities.
The substitution rate within introns (kIN) exhibits the greatest variance compared to simulated levels. This could suggest that variation in kIN is controlled by additional factors compared with kNG and kAR. One such factor is gene expression. A previous analysis of human–mouse orthologous genes found a marginally significant negative correlation between EST expression breadth and evolutionary rate at synonymous coding sites (Duret and Mouchiroud 2000), which was interpreted as a consequence of substitutions at neighboring nonsynonymous sites. In contrast to this interpretation, we show here that broadly expressed genes have lower intronic substitution rates. A likely explanation for this is that such genes are more likely to be expressed in the germ line and hence have reduced mutation rates owing to the TCR mechanism (Svejstrup 2002). TCR is believed to cause a compositional asymmetry, where G+T nucleotides are over-represented on the coding strand of transcribed regions (Green et al. 2003). We have shown that this asymmetry is strongest in broadly expressed genes, suggesting an increased likelihood of such genes being affected by TCR.
One potential problem with analyzing divergence in introns is the effect of selective constraint on conserved regulatory elements. If the introns of broadly expressed genes were under greater constraint, this could lead to a negative correlation between kIN and gene expression breadth. However, housekeeping genes do not require complex regulation, as they are transcriptionally active in all cell types. Therefore, they are not likely to contain more regulatory elements in their introns than less broadly expressed genes. In addition, there is no correlation between total intron length and gene expression breadth in our data (using the EST data, r = 0.0237, P = 0.843, n = 60 genes). This suggests that regulatory elements do not take up a greater proportion of the introns of broadly expressed genes in this dataset. We, therefore, conclude that the lower kIN in broadly expressed genes is not likely to be influenced by selective constraint.
Nongenic regions may also contain regulatory elements and unidentified transcripts under selective constraint, which could potentially contribute to the observed variation in the divergence between genomic regions. However, the presence of conserved regions is not expected to have a large influence on divergence between closely related species such as human and chimpanzee: as the proportion of divergent sites is already very low, the reduction in divergence caused by conserved blocks cannot greatly increase its variance (Smith et al. 2002).
Substitutions in intergenic regions (kNG and kAR) do not appear to be affected by the average levels of gene expression breadth of nearby genes, and the relationship between kIN and gene expression breadth weakens when the substitution rates of entire alignments are compared to average expression levels. These observations indicate that the expression breadth of a particular gene affects kIN within that gene rather than this being a secondary effect of the clustering of genes of similar expression levels. The reason is unclear for the negative correlation between kIN and the proportion of sites within exons in an alignment (exon density). However, one possibility is that exon density is a correlate of local levels of germ line expression, as it is likely that highly expressed housekeeping genes cluster in gene dense regions and have shorter introns (Urrutia and Hurst 2003; Vinogradov 2003). Because we only have estimates of gene expression for approximately one third of the genes within the alignments, it is difficult to obtain accurate estimates of average gene expression breadth within entire alignments. It is, therefore, possible that exon density is a better indicator of average germ line expression and that it correlates negatively with kIN owing to the clustering of genes of similar expression patterns.
A positive correlation between substitution rate and recombination rate supports the hypothesis that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002). Our data suggest that recombination affects the mutation rate in introns and intergenic regions, but we find no relationship between kAR and recombination rate. Mean substitution rate in ancestral repeats (kAR) is significantly higher than that of other noncoding sites. Interspersed repeats serve as substrates for both homologous and nonhomologous recombination (Deininger et al. 2003), which means the number of recombination events may be governed by additional factors compared with surrounding regions. This could potentially explain both the significantly higher average levels of kAR and the lack of dependence of kAR with regional recombination rates.
The finding that substitution rates at the three site classes are affected by different factors has important implications. Given that mutation rates in transcribed regions are likely to be influenced by breadth of gene expression, they may not provide a good estimate of regional mutation rates. Furthermore, the difference between substitution rates in repetitive and nonrepetitive sequences suggests that estimates from ancestral repeats (e.g. IHGSC 2001; MGSC 2002) provide only limited information about substitution processes that affect neighboring nonrepetitive sequences.
Our analyses found the strongest evidence for local similarities in substitution rate at a scale of 2 Mb, with substitution rate differences between nearby alignments of 60% to 75% of those in randomized datasets. Such local similarities remain after accounting for variation in two different correlates of substitution rates (recombination rate and exon density). An additional factor affecting variation in nongenic substitution rates on the megabase scale is location within a particular region of human–mouse synteny (Malcom et al. 2003). Regions of shared evolutionary history could, therefore, partially define substitution rate domains. Because translocation breakpoints are often reused in mammalian evolution (Pevzner and Tesler 2003) it is also possible that the regional variation in mutation rates associated with synteny blocks is also evolutionarily conserved.
Acknowledgements
We are grateful to Mikael Brandstr?m for writing and adapting the seqtool application, and to Elena Jazin for comments on the manuscript. We acknowledge the Swedish Research Council and the Knut and Alice Wallenberg Foundation for financial support.
References
Arndt P. F., D. A. Petrov, and T. Hwa. 2003. Distinct changes of genomic biases in nucleotide substitution at the time of mammalian radiation. Mol. Biol. Evol. 20:1887–1896.
Bernardi G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953–958.
Bierne N., and A. Eyre-Walker. 2003. The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics 165:1587–1597.
Bray N., I. Dubchak, and L. Pachter. 2003. AVID: a global alignment program. Genome Res 13:97–102.
Caron H., B. van Schaik, M. van der Mee, et al. 2001. The human transcriptome map: Clustering of highly expressed genes in chromosomal domains. Science 291:1289–1292.
Datta A., and S. Jinks-Robertson. 1995. Association of increased spontaneous mutation rates with high levels of transcription in yeast. Science 268:1616–1619.
Deininger P. L., J. V. Moran, M. A. Batzer, and H. H. Kazazian. 2003. Mobile elements and mammalian genome evolution. Curr. Opin. Genet. Dev. 13:651–658.
Duret L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: Expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.
Duret L., M. Semon, G. Piganeau, D. Mouchiroud, and N. Galtier. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics 162:1837–1847.
Ebersberger I., D. Metzler, C. Schwarz, and S. P??bo. 2002. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70:1490–1497.
Eyre-Walker A., and L. D. Hurst. 2001. The evolution of isochores. Nat. Rev. Genet. 2:549–555.
Fullerton S. M., A. Bernardo Carvalho, and A. G. Clark. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. 18:1139–1142.
Galtier N., G. Piganeau, D. Mouchiroud, and L. Duret. 2001. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159:907–911.
Green P., B. Ewing, W. Miller, P. J. Thomas, and E. D. Green. 2003. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33:514–517.
Hardison R. C., K. M. Roskin, S. Yang, et al. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13:13–26.
Hellmann I., I. Ebersberger, S. E. Ptak, S. P??bo, and M. Przeworski. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am. J. Hum. Genet. 72:1527–1535.
Hurst L. D., and A. Eyre-Walker. 2000. Evolutionary genomics: reading the bands. Bioessays 22:105–107.
Hurst L. D., and E. J. Williams. 2000. Covariation of GC content and the silent site substitution rate in rodents: implications for methodology and for the evolution of isochores. Gene 261:107–114.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Kong A., D. F. Gudbjartsson, J. Sainz, et al. 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31:241–247.
Lash A. E., C. M. Tolstoshev, L. Wagner, G. D. Schuler, R. L. Strausberg, G. J. Riggins, and S. F. Altschul. 2000. SAGEmap: a Public Gene Expression Resource. Gen. Res. 10:1051–1060.
Lercher M. J., and L. D. Hurst. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18:337–340.
Lercher M. J., A. O. Urrutia, and L. D. Hurst. 2002. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31:180–183.
Lercher M. J., A. O. Urrutia, A. Pavlicek, and L. D. Hurst. 2003. A unification of mosaic structures in the human genome. Hum. Mol. Genet. 12:2411–2415.
Lercher M. J., E. J. Williams, and L. D. Hurst. 2001. Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias. Mol. Biol. Evol. 18:2032–2039.
Li W. H. (1997) Molecular evolution. Sinauer Associates, Sunderland.
Malcom C. M., G. J. Wyckoff, and B. T. Lahn. 2003. Genic mutation rates in mammals: local similarity, chromosomal heterogeneity, and X-versus-autosome disparity. Mol. Biol. Evol. 20:1633–1641.
Margulies E. H., S. L. Kardia, and J. W. Innis. 2001. Identification and prevention of a GC content bias in SAGE libraries. Nucleic Acids Res. 29:e60.
Matassi G., P. M. Sharp, and C. Gautier. 1999. Chromosomal location effects on gene sequence evolution in mammals. Curr. Biol. 9:786–791.
Meunier J., and L. Duret. 2004. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21:984–990.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.
Morey N. J., C. N. Greene, and S. Jinks-Robertson. 2000. Genetic analysis of transcription-associated mutation in Saccharomyces cerevisiae. Genetics 154:109–120.
Niimura Y., and T. Gojobori. 2002. In silico chromosome staining: reconstruction of Giemsa bands from the whole human genome sequence. Proc. Natl. Acad. Sci. USA 99:797–802.
Pevzner P., and G. Tesler. 2003. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc. Natl. Acad. Sci. USA 100:7672–7677.
Piganeau G., D. Mouchiroud, L. Duret, and C. Gautier. 2002. Expected relationship between the silent substitution rate and the GC content: implications for the evolution of isochores. J. Mol. Evol. 54:129–133.
Saccone S., A. Pavlicek, C. Federico, J. Paces, and G. Bernard. 2001. Genes, isochores and bands in human chromosomes 21 and 22. Chromosome Res 9:533–539.
Schuler G. D., M. S. Boguski, E. A. Stewart, et al. 1996. A gene map of the human genome. Science 274:540–546.
Smith N. G. C., and A. Eyre-Walker. 2003. Partitioning the variation in mammalian substitution rates. Mol. Biol. Evol. 20:10–17.
Smith N. G. C., M. T. Webster, and H. Ellegren. 2002. Deterministic mutation rate variation in the human genome. Genome Res. 12:1350–1356.
Smith N. G. C., and L. D. Hurst. 1999. The effect of tandem substitutions on the correlation between synonymous and nonsynonymous rates in rodents. Genetics 153:1395–1402.
Su A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.
Svejstrup J. Q. 2002. Mechanisms of transcription-coupled DNA repair. Nat. Rev. Mol. Cell. Biol. 3:21–29.
Thompson J. D., D. G. Higgins, and T. J. Gibson. 1994. Clustal-W—Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position- specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Urrutia A. O., and L. D. Hurst. 2003. The signature of selection mediated by expression on human genes. Genome Res 13:2260–2264.
Williams E. J., and L. D. Hurst. 2000. The proteins of linked genes evolve at similar rates. Nature 407:900–903.
Williams E. J., and L. D. Hurst. 2002. Clustering of tissue-specific genes underlies much of the similarity in rates of protein evolution of linked genes. J. Mol. Evol. 54:511–518.
Vinogradov A. E. 2003. Isochores and tissue-specificity. Nucleic Acids Res. 31:5212–5220.(Matthew T. Webster*, Nick)
E-mail: matthew.webster@ebc.uu.se
Abstract
The human genome is organized with regard to many features such as isochores, Giemsa bands, clusters of genes with similar expression patterns, and contiguous regions with shared evolutionary histories (synteny blocks). In addition to these genomic features, it is clear that mutation rates also vary across the human genome. To address how mutation rates and genomic features are related, we analyzed substitution rates at three classes of putatively neutral noncoding sites (nongenic, intronic, and ancestral repeats) in 14 Mb of human–chimpanzee alignments covering human chromosome 7. Patterns of mutation rate variation inferred from substitution rate variation differ among the three site classes. In particular, we find that intronic mutation rates are strongly affected by the breadth of expression of the genes in which they reside, with broadly expressed genes exhibiting low mutation rates, probably as a consequence of the transcription-coupled repair process acting in the germ line. All site classes show significant local similarities in mutation rate at the megabase scale, and regional similarities in nongenic mutation rate covary with blocks of synteny between the human and mouse genomes, indicating that the evolutionary history of a genomic region is an important determinant of mutation rate.
Key Words: human-mouse synteny blocks ? gene expression ? mutation rate ? recombination ? gene density ? Giemsa bands
Introduction
The structure of mammalian chromosomes is not uniform. In addition to banding patterns revealed by chromosome staining such as Giemsa bands, regional variation is also observed in base composition (isochores; Bernardi et al. 1985; Eyre-Walker and Hurst 2001), repetitive element density, gene density, and patterns of gene expression (Caron et al. 2001; Lercher et al. 2002; Lercher et al. 2003). Understanding how these features evolved is problematic owing to their complex inter-relationships. For example, the location of Giemsa bands can be predicted on the basis of variation in GC content (Niimura and Gojobori 2002), although they do not directly correspond to isochores (Saccone et al. 2001). Different classes of interspersed repeat elements show different distributions within genomes. Broadly expressed genes have been shown to cluster in GC-rich isochores and in R and the lightest Giemsa bands (Lercher et al. 2003; Vinogradov 2003). An additional level of organization is indicated by contiguous chromosomal regions that are evolutionarily conserved between species (synteny blocks; MGSC 2002). Such regions may represent important evolutionary units, as it has been suggested that translocation breakpoints are often re-used in mammalian evolution (Pevzner and Tesler 2003).
Mutation rates are not constant across mammalian chromosomes. Several studies have now suggested the existence of regions of local similarity in mutation rates at scales >1 Mb by comparing rates of synonymous evolution (kS) in neighboring genes (Lercher et al. 2001; Malcom et al. 2003; Matassi et al. 1999). However, the relationship between these local similarities and other genomic features is not clear. One possibility is that similarities in levels of kS are caused by clustering of genes with similar expression patterns (Williams and Hurst 2000; Williams and Hurst 2002). This could either be due to a mutagenic effect of transcription (Datta and Jinks-Robertson 1995; Morey et al. 2000) or because linked genes have similar levels of transcription-coupled repair (TCR; Green et al. 2003; Svejstrup 2002). Alternatively, the underlying mutation rate could vary between genomic regions on a larger scale determined by genomic features not directly related to transcription (Hurst and Eyre-Walker 2000).
Base composition is a potential correlate of mutation rate. Recent studies indicate that GC content in mammals is not at equilibrium and that GC-rich isochores are decaying (Arndt et al. 2003; Duret et al. 2002; Smith et al. 2002). This would result in higher mutation rates in GC-rich regions if the rate of GCAT mutations is higher than the opposite ATGC (Piganeau et al. 2002). A number of studies support this prediction and report a positive correlation between GC content and substitution rates (Hurst and Williams 2000; Smith et al. 2002; Smith and Hurst 1999). However, the evidence for this is equivocal and a comparison of the human and mouse genomes suggested that the relationship was best described by regression with a quadratic curve, where regions of extremely low or high GC content have the highest substitution rates (MGSC 2002). Recombination rate has also been shown to correlate with divergence, leading to the suggestion that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002). The relative effects of these factors are not certain as they are strongly correlated to each other (Fullerton et al. 2001) and it is likely that recombination influences GC content (Galtier et al. 2001; Meunier and Duret 2004). In addition to the effects of recombination and GC, a recent study reported considerable variation in kS between synteny blocks, suggesting that these regions could define the unit of mutation rate variation (Malcom et al. 2003). However, because the relationship between these blocks and other features—such as clusters of genes of similar expression and cytogenic bands—has not been quantified, the precise cause of this observation is unclear.
In contrast to the majority of previous studies of mutation rate variation that analyzed differences in kS by using human-rodent genic alignments (Lercher et al. 2001; Malcom et al. 2003; Matassi et al. 1999), we have analyzed noncoding divergence in human–chimpanzee genomic alignments. This approach has a number of benefits. First, rates at silent sites in mammals may be under selection owing to constraints caused by translational efficiency (Urrutia and Hurst 2003), which could result in a lower kS in highly expressed genes. Although negative selection may act on regulatory sequences in noncoding DNA, our results are not likely to have been generated by selection (see Discussion). Secondly, estimation of substitution rates at silent sites is often sensitive to methodology and potentially affected by nonsynonymous substitution rates (Bierne and Eyre-Walker 2003; Hurst and Williams 2000; Li 1997; Smith and Hurst 1999). Indel mutations, which are usually selected against in coding regions, owing to disruption of protein function, can be studied in noncoding DNA. Analysis of substitutions in noncoding DNA also enables the comparison of mutation rates in transcribed sequences with surrounding intergenic regions, which allows quantification of the mutational pressures that act only on transcribed DNA, such as the possible effects of TCR. The estimation of mutation rates by using comparisons of closely related species such as human and chimpanzee has a number of advantages over comparisons of more distantly related species such as human and mouse. First, because of limited divergence, estimation of substitution rate is not greatly confounded by multiple hits. Furthermore, the impact of genomic rearrangements is much less, which means that the effects of base composition, recombination, and other features that may fluctuate over evolutionary time can be studied more accurately.
Herein we analyze the effects of a variety of features of genomic organization on noncoding human–chimpanzee divergence in order to attempt to define the causes of regional variation in mutation rate. We address these questions by using long (average length, 165 kb) genomic alignments of human and chimpanzee sequence distributed across human chromosome 7. For each alignment, three measures of the substitution rate at mutually exclusive sites were estimated: nongenic nonrepetitive (kNG), nongenic ancestral repeats (kAR), and intronic nonrepetitive (kIN). We analyze the effects of various genomic features on these measures of substitution rate, assay the extent of mutation rate banding patterns, and analyze the contribution of different correlates of substitution rate in generating local similarities in mutation rates.
Methods
Construction of Genomic Alignments
We downloaded all full-length chimpanzee BAC clones available in the EMBL database on April 29, 2003, using the SRS retrieval tool at http://www.be.embnet.org/. The orthologous human sequence for each chimpanzee clone was identified by using Blast searches (http://www.ncbi.nlm.nih.gov/blast/) against all human genome contigs from the April 2003 build. The majority of sequences were found to be orthologous to human chromosome 7. Because a number of studies indicate the existence of significant mutation rate heterogeneity between chromosomes (Ebersberger et al. 2002; Lercher et al. 2001; MGSC 2002) and there was not enough chimpanzee sequence to reliably estimate variation in noncoding divergence on other chromosomes, we restricted our analysis to regional variation on chromosome 7.
Orthologous human and chimpanzee sequences were identified from the Blast results using the program seqtool. This program provides a graphical output of Blast results, thereby allowing orthologous sequences to be distinguished from rearrangements. We compared the performance of two global alignment programs, ClustalW (Thompson et al. 1994) and AVID (Bray et al. 2003), by aligning the same pair of human and chimpanzee genomic sequences with both programs. The alignment created by ClustalW contained 0.1% more aligned bases but 1.5% fewer inferred substitutions, which suggests that for genomic sequences from closely related species such as human and chimpanzee, ClustalW performs slightly better. We, therefore, used ClustalW with default settings to produce 84 alignments used in the analysis of mutation rate variation. Repetitive sequences were identified in the alignments using RepeatMasker (http://repeatmasker.genome.washington.edu/). Sequences were masked for all the simple repeats identified by Sputnik (http://espressosoftware.com/). Alignments were then checked by eye and a number of poorly aligned regions were removed from subsequent analyses.
Analysis of Noncoding Substitution Patterns
The positions of genes and exons within each gene were identified from the GenBank annotation files corresponding to the human contig in the alignment. Substitutions in three main subsets of the alignments were identified, and substitution rates were calculated without multiple hits correction, i.e. human-chimp differences as proportion of ungapped sites: (1) nongenic, nonrepetitive (kNG); (2) nongenic, ancestral-repeats (kAR); and (3) intronic, nonrepetitive (kIN). In all subsets, simple repeats identified by Sputnik were masked. In subsets a) and c) repeats identified by RepeatMasker were also masked, whereas in subset b) only those sequences identified by RepeatMasker as repetitive were analyzed, not including simple repeats masked by Sputnik or genic regions. Indels in each of the three classes of genomic region were inferred from contiguous runs of alignment gaps matching only bases within the appropriate class.
Some alignments contained regions of overlap with neighboring alignments, which were only analyzed once. Substitutions were initially analyzed treating each full alignment as a separate data point, only those alignments containing >5 kb of the specific type of DNA were included in the analyses. In order to analyze the more fine-scale effects of GC content on divergence, we also performed the analyses using non-overlapping windows of 20 kb, only including windows containing >1 kb of the appropriate site class. We also analyzed changes in intronic regions of each gene within the alignments separately in order to compare these values to measures of expression rate and breadth within individual genes, only including genes with >5 kb of aligned intronic sequence. Furthermore, we estimated kNG in 10 kb of sequence flanking both sides of each gene.
Gene Expression Data
Summary measures of rate and breadth of gene expression were derived from three independent data sources: expressed sequence tags (EST), microarray data, and serial analysis of gene expression (SAGE).
ESTs contained in UniGene clusters (Schuler et al. 1996; UniGene build 161, obtained from the National Center for Biotechnology Information [NCBI] at http://www.ncbi.nlm.nih.gov/) were traced to libraries in the dbEST database (also at NCBI), excluding libraries containing <50 ESTs. On the basis of dbEST annotations, 235 libraries were sorted into 55 normal (nondisease) tissue types, each represented by ESTs mapped to at least 500 UniGene clusters. For each UniGene cluster, expression breadth was defined as the number of tissues with an EST contained in the cluster, divided by the total number of tissues types (55). This estimate was then mapped to LocusLink gene IDs.
Normalized microarray data were obtained from Su et al. (2002). Data encompassing 63 replicate hybridizations were sorted into 28 normal tissues. LocusLink gene IDs were identified using Affymetrix annotation (http://www.affymetrix.com/). Average counts were used where multiple probe sets were present for the same gene. For each gene, breadth of expression was defined as (average counts across tissues) / (maximum counts across tissues). Before this calculation, data <0 was set to 0; if the maximum across tissues was <50, the breadth was set to 0. The cutoff of 50 is lower than the value of 200 used by Su et al. (2002). This was chosen because our estimate of gene expression breadth takes a larger number of measurements into account. Rate of expression was estimated as log2 (of the maximum counts across tissues).
SAGE data were obtained from SAGEmap at NCBI (Lash et al. 2000). The dataset was curated to avoid possible GC biases following the approach of Margulies et al. (2001), by removing libraries with mean tag GC>0.5. The remaining 40 libraries were merged into 19 normal tissue types. Tag counts were converted to counts per million. If only one tag was found for a gene–tissue type combination, we discarded the observation as a likely sequencing error. The expression data were cross-linked to RefSeq mRNA sequences in Ensembl (http://www.ensembl.org/), by extracting the 3'-most NlaIII SAGE tag for each mRNA. If the same tag was cross-linked to more than one Ensembl gene, all corresponding genes were excluded. If several tags mapped to the same gene, we chose the maximum counts per million across tags. Genes were then mapped to LocusLink IDs. For each gene, breadth of expression was defined as (average counts across tissues) / (maximum counts across tissues); breadth was set to 0 if the maximum was zero. Rate of expression was defined as log2 (of the maximum counts across tissues).
For each of the autosomal genes for which EST expression breadth data was available, we extracted the transcribed sequences from GenBank genomic contigs, located by using the GenBank contig annotation files. We then calculated the proportion of G+T nucleotides on the coding strand in order to quantify the potential effect of TCR in generating strand asymmetry in transcribed regions of genes of different expression patterns.
Other Genomic Features
Local GC was calculated as the average GC content of the nonmasked human and chimpanzee DNA sequences in the region surveyed. Recombination rate was estimated from the rate at the nearest marker to region under analysis identified from the human contig for which recombination rate information was available from Kong et al. (2002). We calculated the density of transcribed DNA and exonic DNA within each alignment by using positional information from GenBank contig annotation files. These factors were defined as the number of bases in the alignment belonging to each category divided by the total number of bases. We downloaded tables describing the location of Giemsa bands and bands of contiguous synteny between human and mouse on the April 2003 build of the human genome from UCSC (http://genome.ucsc.edu/). These tables were used to identify the position of each alignment relative to these genomic features.
Statistical Analyses
Correlations between substitution rates and continuously varying alignment features (GC content, recombination rate, exon and transcript density, and gene expression measures) were analyzed using Spearman's rank correlations. The effect of categorical sequence features (Giemsa bands and synteny blocks) on substitution rates were analyzed by using analyses of variance (ANOVAs). Following the rationale presented in Smith and Eyre-Walker (2003), substitution rates were log transformed. We also tested for effects of categorical features by using the nonparametric Kruskal-Wallis test, with significance obtained by using the standard asymptotic method. To determine whether significant ANOVA results resulting from synteny blocks were caused specifically by these features and not as a secondary effect of local similarities in mutation rate at the megabase scale, we performed a randomization test. By using 1000 replicates, we randomly shuffled the order of synteny blocks along the chromosome, keeping the lengths constant, and we re-assigned alignments to the randomized blocks. For each randomized dataset we recalculated the ANOVA F-value and compared this to the original value from the real data.
The extent of local similarities in substitution rates and the effects of various genomic features on these similarities were assayed using methods developed in Lercher et al. (2001). For each alignment, the absolute substitution rate difference was calculated between that alignment and the mean of all other alignments with their midpoints within a specified distance of the midpoint of that alignment. Then this observed total difference was compared to the total difference expected when the order of the alignments was randomly shuffled (1000 shuffles). The statistic is defined as the ratio of the observed total difference to the expected total difference (the mean of the ratio taken over all shuffles). Lower values of indicate greater local similarity, and one-tailed significance values were obtained by comparison of the observed total difference with the set of shuffled total differences. This analysis was repeated using different scales to determine the range of substitution rate variation. The effects of different correlates of substitution rates on local similarities were tested by only shuffling alignments with values similar to those of a particular correlate.
Results
Deterministic Mutation Rate Variation
We analyzed substitution rates in three classes of noncoding site within a set of 84 genomic human–chimp alignments corresponding to human chromosome 7. When only alignments with >5kb of aligned sites within each site class were considered, the size of each dataset was reduced (table 1). The large proportion of sites in introns probably reflects the fact that many of the chimpanzee clones used in this study are derived from the NISC comparative vertebrate sequencing project (http://www.nisc.nih.gov/), which has specifically targeted gene-rich regions. Divergence in ancestral repeats (kAR) is significantly higher than at both nonrepetitive nongenic (kNG) and intronic (kIN) sites (Willcoxon rank-sum test; kNG vs. kAR, W = 797, P = 0.00219; kIN vs kAR, W = 2057, P = 5.6 x 10–5), whereas kNG and kIN were not significantly different (W = 2582, P = 0.226). The high rate of evolution in ancestral repeats may be due partly to higher levels of hypermutable CpG dinucleotides. The observed variances of the divergence estimates between full alignments were kNG = 7.08 x 10–6, kAR = 1.03 x 10–5, and kIN = 7.01 x 10–6. We simulated the expected variances of each distribution assuming a uniform mutation rate by using the observed number of sites in each class in each alignment. For kNG and kAR, the observed variances were 11 and 12 times expected, respectively, whereas for kIN the variance was 29 times expected. This indicates substantial variation in mutation rate at this scale, with particularly high variation exhibited by intronic regions. Variation in kNG across human chromosome 7 is illustrated in figure 1.
Table 1 Summary of human—chimpanzee alignments
FIG. 1.— Variation in kNG across human chromosome 7. Bars represent two standard errors and were calculated using the binomial formula for variance, under the assumption of no variation in substitution rate within each alignment. The estimated length of the chromosome is 159 Mb.
Indel and substitution rates are significantly correlated for kNG (r = 0.421, P = 0.0006) and kIN (r = 0.432, P = 0.0002) although the relationship is not as strong for kAR (r = 0.302, P = 0.062). Substitution rates for the three site classes in the same alignments are strongly correlated (fig. 2; kNG vs. kIN, r = 0.634, P < 10–6; kNG vs. kAR, r = 0.571, P = 0.00019; kIN vs. kAR, r = 0.530, P = 0.0021). However, a large part of the variation in different substitution rate estimates does not correlate with other estimates, which suggests that different factors may affect variation in each substitution type. In particular, the comparisons involving kAR have lower correlation coefficients, suggesting that substitution rates in ancestral repeats may be governed by additional factors. These results confirm those of previous studies that deterministic mutation rate variation exists in the human genome (Hardison et al. 2003; Smith et al. 2002); we now turn to factors that influence this variation.
FIG. 2.— Deterministic variation in mutation rate is indicated by a positive correlation between kNG and kAR (filled circles, dotted line) and kIN (empty circles, solid line) using entire human–chimpanzee genomic alignments as separate data points. Lines represent linear regressions.
The Effect of Recombination and GC
We find no significant correlation between GC content of the aligned sequences and any of the measures of substitution rate, when complete alignments are considered as single data points (kNG, r = –0.013, P = 0.917; kAR, r = 0.135, P = 0.413; kIN, r = –0.082, P = 0.492; figure 3). These values remain nonsignificant when alignments are broken into 20-kb non-overlapping windows (kNG, r = 0.051, P = 0.13; kAR, r = 0.065, P = 0.38; kIN, r = –0.024, P = 0.576). Some previous studies have suggested that GC content and substitution rate are positively correlated. However, comparison of the entire human and mouse genomes (MGSC 2002) revealed more complex relationships: at synonymous coding sites, substitution rate was found to be approximately constant at low and intermediate GC levels, with a sharp increase at extremely high GC level; in ancestral repeats, substitution rate showed a V-shaped dependence on GC, with a minimum at intermediate GC level. Our data probably do not rule out these relationships, although we do not have sufficient data to accurately define them (fig. 3). Furthermore, it is possible that mutation rate is affected by regionality in GC content on a larger scale, such as the scale of several megabases shown by isochores rather than a local effect. This is difficult to test using the alignments presented here because many alignments are close to each other, which would mean data points are not independent.
FIG. 3.— No significant correlations were observed between local GC content and substitution rate, using kNG (empty circles), kAR (filled circles), and kIN (crosses).
We found a positive correlation between divergence and recombination by using complete alignments, significant for kNG (r = 0.251, P = 0.049, figure 4) and suggestive for kIN (r = 0.215, P = 0.071). The cause of this correlation is a point of debate, but one suggestion is that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002). In contrast, kAR does not correlate with recombination rate (r = 0.079, P = 0.642). One potential problem with the previously described analyses of the relationships between recombination rate and substitution rates concerns the nonindependence of recombination rate measures: in several cases, neighboring alignments share the same recombination rate because recombination rate marker density is lower than alignment density. This problem means that the analyses may have been biased. To check this possibility we removed various alignments from consideration so that each recombination rate marker corresponded to just a single alignment. In the case of kNG, the Spearman rank correlation coefficient increased to r = 0.292, although the relationship became nonsignificant (P = 0.060) owing to reduction of the data set from 62 to 42 alignments. The increase in correlation coefficient suggests that the relationship is genuine. On the other hand, with kIN the Spearman rank correlation coefficient decreased markedly to r = 0.134, confirming the weaker relationship between recombination and intronic substitution rates.
FIG. 4.— Positive correlation between kNG from entire genomic alignments and recombination rate, defined as the mean rate of all markers within each alignment reported in (Kong et al. 2002). Line represents linear regression.
The Effect of Gene Expression
We calculated kIN separately for 122 genes within the alignments that could be identified using LocusLink IDs. This identifier was used to link to estimates of gene expression breadth and rate. Significant negative correlations were observed between kIN and expression breadth using the EST data (r = –0.405, P = 0.001, n = 60 genes; figure 5) and microarray data (r = –0.458, P = 0.006, n = 34 genes). A similar correlation coefficient was indicated by the SAGE data, although this was not significant (r = –0.388, P = 0.075, n = 22 genes). We find no significant correlation between kIN and rate of expression measured using microarray or SAGE data (microarray, r = 0.165, P = 0.393, n = 29 genes; SAGE, r = –0.042, P = 0.855, n = 22 genes). No significant correlations were observed between any measure of gene expression rate or breadth and kNG in 10 kb flanking both sides of each gene (data not shown).
FIG. 5.— Negative correlation between kIN and expression breadth for individual genes within human–chimpanzee genomic alignments. Expression breadth was estimated from EST data (see Methods section). Line represents linear regression.
A potential reason for the significantly lower intronic substitution rate (kIN) in broadly expressed genes is that such genes are more likely to be expressed in the germ line and that the effects of TCR will be passed on to the next generation. A mutational bias caused by TCR is believed to be responsible for an observed skew in base composition in transcribed regions, which exhibit an excess of G+T nucleotides on the coding strand. To examine the hypothesis that TCR has a greater effect on the substitution pattern in broadly expressed genes, we calculated the skew in base composition (GT/AC) on the coding strand of all autosomal transcripts in the human genome for which sequence and EST data were available. There is a significant positive correlation between expression breadth and compositional asymmetry (r = 0.368, P < 10–15, n = 12,723, fig. 6). This suggests that TCR has a significantly greater effect on the nucleotide substitution process in broadly expressed genes.
FIG. 6.— Positive correlation between expression breadth estimated from EST data (see Methods section) and compositional asymmetry measured by the ratio of GT–AC nucleotides on the coding strand of transcribed regions. Line represents linear regression.
As genes of similar expression are often clustered together (Caron et al. 2001; Lercher et al. 2002; Lercher et al. 2003), we tested whether the negative correlation between kIN and expression was caused by modified mutation rates in the surrounding genomic region, or whether the effect only acts on individual genes. We estimated the average expression breadth and rate for all of the genes found within single genomic alignments and compared these values with levels of divergence in each alignment. All of the correlations between kIN and expression breadth are reduced to nonsignificance and the expression rate measures remain nonsignificant (data not shown). This suggests that gene expression has a direct effect on the mutation rate of individual genes rather than the mutation rate being a function of the clustering of genes of similar mutation rates. In addition, there are no significant correlations between kNG or kAR and any of these average measures of expression breadth or rate (data not shown). Taken together, these results suggest that factors linked to gene expression reduce mutation rates in individual genes but have little or no effect on surrounding noncoding regions.
To test the proportion of variance in kIN that is due to TCR, we performed a multiple linear regression by using the three measures of gene expression breadth from EST, microarray, and SAGE data. The total R2 value was 0.397, suggesting that TCR can account for a large proportion of variation in mutation rates in transcribed regions. The variance in intronic divergence (kIN) was 29 times that expected under a uniform mutation rate compared with the variance in nontranscribed divergence (kNG and kAR) of 11 to 12 times expected (see previous text). If 40% of the variance in kIN can be accounted for by expression data, that still leaves the remaining variance in kIN roughly 17 times expected (i.e., more than for nontranscribed divergence). This could mean that further factors govern mutation rates in transcribed regions compared with nongenic regions, in addition to TCR.
Other Chromosomal Features
The density of genes is known to vary greatly across the genome, with genes clustering in GC-rich isochores and Giemsa negative chromosome bands (IHGSC 2001). To calculate whether this clustering affects mutation rate, we calculated the proportion of sites within exons (exon density) within each alignment. There is a significant negative correlation between alignment kIN and exon density (r = –0.276, P = 0.019) but not with transcript density (r = –0.027, P = 0.822). Similar negative, although non-significant, correlations with exon density are found for kAR (r = –0.313, P = 0.052) and kNG (r = –0.148, P = 0.243).
We also examined the effects of Giemsa bands and regions of human–mouse synteny on mutation rate by using analysis of variance. Giemsa band locations were classified as either Giemsa negative or one of four classes of Giemsa positive bands (classified by the intensity of staining). kNG (F = 1.463, P = 0.225), kAR (F = 0.664, P = 0.621), and kIN (F = 2.171, P = 0.082) were not significantly affected by Giemsa band. However, the suggestive result in the case of kIN is interesting to note as genes with similar expression breadth have been observed to cluster according to Giemsa band (Lercher et al. 2003), and we have shown gene expression breadth to be an important determinant of kIN. Similar results were obtained by using Wilcoxon rank sum tests to compare substitution rates in Giemsa negative with all classes of Giemsa positive bands (kNG, W = 513, P = 0.989; kAR, W = 203, P = 0.664; kIN, W = 515, P = 0.149). It is interesting to note that the region with the highest nongenic divergence (kNG = 0.022) is the closest to the end of chromosome 7 (3.4 Mb from the end of 7p22). Regions of high recombination rate have been identified close to telomeres (Kong et al. 2002), although it is unclear if this is the cause of the elevated substitution rate because the recombination estimate for this alignment is moderate (1.43 cM/Mb).
We then used a similar analysis to test whether regions that exist within the same contiguous segments in the mouse genome (synteny blocks) have similar mutation rates. Significant variation between synteny blocks was observed for both kNG (ANOVA F = 3.72, P = 1.94 x 10–4; Kruskal-Wallis P = 0.013) and kIN (F = 4.00, P = 3.13 x 10–5; Kruskal-Wallis P = 0.018). The effect of synteny blocks is not significant for kAR (ANOVA F = 1.98, P = 0.075; Kruskal-Wallis P = 0.240). Local similarities in mutation rate are known to exist at the megabase scale (Lercher et al. 2001; Malcom et al. 2003; Matassi et al. 1999), and these could account for significant differences between synteny blocks even if the blocks themselves did not influence this variation. In order to account for this, we recalculated the ANOVA F-values using datasets where the positions of synteny blocks had been randomized. For kNG, the F-value from the real dataset was significantly higher than that for the randomized samples (P = 0.04). However, for kIN, the real F-values were not significantly greater than those for randomized datasets (P = 0.20). These results suggest that evolutionary history is a major determinant of mutation rate variation in nongenic, nonrepetitive sequence at the megabase scale. The absence of this effect for kAR is interesting to note, as recent ancestral repeats do not share the same evolutionary history as their surrounding DNA.
The Extent and Causes of Local Similarities in Mutation Rate
We examined local similarities in noncoding divergence using the statistic (Lercher et al. 2001), which is equal to the observed local heterogeneity in divergence between alignments divided by the expected heterogeneity derived from randomized data (see methods). Values of significantly less than 1 indicate the presence of local similarities. We first considered the effect of the distance range on local similarity. Our alignments provide only patchy coverage of chromosome 7 and are not suitable for close consideration of the range of local similarity. Instead we simply wished to identify the distance range that gave the strongest signal of local similarity. The true level of local similarity may well be greatest at the shortest distances, but short distance ranges reduce the number of possible alignment comparisons; therefore, we expect to be minimized at an intermediate distance range. We considered only those alignments on chromosome 7 with sufficient data in each particular substitution rate class (64 for kNG, 39 for kAR, and 72 for kIN). When substitution rates were compared between alignments within a distance range of 1 Mb, NG = 0.852 (P = 0.115), AR = 0.907 (P = 0.249), and IN = 0.639 (P = 0.001). For a distance range of 2 Mb, NG = 0.694 (P < 0.001), AR = 0.735 (P = 0.018), and IN = 0.600 (P < 0.001); and for a distance range of 3 Mb NG = 0.782 (P = 0.002), AR = 0.814 (P = 0.035), and IN = 0.728 (P < 0.001). Hence the strongest signal of local similarity is shown by all measures of substitution rate at a distance range of 2 Mb. We used this distance range to test the effects of correlates of divergence in determining local similarities in mutation rate.
We investigated the causes of local similarity (see Lercher et al. 2001) by modifying the randomization algorithm so that only alignments with similar levels of a particular correlate were shuffled with each other. For example, if local similarities in substitution rates directly corresponded to local similarities in GC content, then shuffling among alignments of similar GC content would not be expected to make much difference to the pattern of substitution rate variation. In this case the observed and shuffled total differences would be similar and there would be no signal of local similarity.
Variation in kNG correlates with recombination rate. In order to test whether recombination can account for local similarities in kNG, the alignments were first ranked according to the recombination rate of the region from which they were taken and then grouped into five classes of equal size. The alignments were then shuffled within each of the five recombination rate classes, only considering the 62 alignments for which recombination rate data were available. With this slightly reduced data set we still found highly significant local similarity as determined by shuffling all alignments (NG = 0.694, P < 0.001). However, neither nor P changed greatly when shuffling was performed within recombination rate classes (NG = 0.709, P < 0.001), which suggests that recombination can only account for a very small part of the local similarities observed in kNG.
Alignments are significantly affected by exon density, for which a measure was available for all alignments. Shuffling within exon density classes had very little effect on local similarity (IN = 0.602, P < 0.001 compared with IN = 0.600, P < 0.001 for the unshuffled dataset). Because we did not find any genomic features that correlate with kAR, we did not perform any further randomizations using this dataset.
Although recombination rate and exon density correlate with substitution rate, neither of these features can explain the local similarities observed on a megabase scale. An additional feature that covaries with substitution rate is location within a particular human–mouse synteny block. Location within a certain synteny band could potentially explain a large proportion of the local similarities in substitution rate. However, the 64 alignments with sufficient sites to estimate kNG fit into 20 synteny blocks, and only seven of these contain more than three alignments. Hence shuffling within synteny bands would not greatly alter the distribution of alignments along chromosome 7 and hence would not be an adequate test of whether synteny blocks are responsible for local similarities in divergence.
Discussion
We have examined mutation rate variation across a single human chromosome by first identifying factors that have significant effects on substitution rates, and then by attempting to determine whether any of these factors could be responsible for the existence of regions of local similarity in mutation rate. Substitution rates in three classes of noncoding sites (kNG, kAR, and kIN) show significant evidence for local similarities at the megabase level. Factors such as recombination rate, gene expression breadth, and exon density were shown to significantly correlate with some or all of these measures of substitution rate, suggesting a direct or indirect effect on mutation rate, but none of these factors could explain the existence of large-scale local similarities.
The substitution rate within introns (kIN) exhibits the greatest variance compared to simulated levels. This could suggest that variation in kIN is controlled by additional factors compared with kNG and kAR. One such factor is gene expression. A previous analysis of human–mouse orthologous genes found a marginally significant negative correlation between EST expression breadth and evolutionary rate at synonymous coding sites (Duret and Mouchiroud 2000), which was interpreted as a consequence of substitutions at neighboring nonsynonymous sites. In contrast to this interpretation, we show here that broadly expressed genes have lower intronic substitution rates. A likely explanation for this is that such genes are more likely to be expressed in the germ line and hence have reduced mutation rates owing to the TCR mechanism (Svejstrup 2002). TCR is believed to cause a compositional asymmetry, where G+T nucleotides are over-represented on the coding strand of transcribed regions (Green et al. 2003). We have shown that this asymmetry is strongest in broadly expressed genes, suggesting an increased likelihood of such genes being affected by TCR.
One potential problem with analyzing divergence in introns is the effect of selective constraint on conserved regulatory elements. If the introns of broadly expressed genes were under greater constraint, this could lead to a negative correlation between kIN and gene expression breadth. However, housekeeping genes do not require complex regulation, as they are transcriptionally active in all cell types. Therefore, they are not likely to contain more regulatory elements in their introns than less broadly expressed genes. In addition, there is no correlation between total intron length and gene expression breadth in our data (using the EST data, r = 0.0237, P = 0.843, n = 60 genes). This suggests that regulatory elements do not take up a greater proportion of the introns of broadly expressed genes in this dataset. We, therefore, conclude that the lower kIN in broadly expressed genes is not likely to be influenced by selective constraint.
Nongenic regions may also contain regulatory elements and unidentified transcripts under selective constraint, which could potentially contribute to the observed variation in the divergence between genomic regions. However, the presence of conserved regions is not expected to have a large influence on divergence between closely related species such as human and chimpanzee: as the proportion of divergent sites is already very low, the reduction in divergence caused by conserved blocks cannot greatly increase its variance (Smith et al. 2002).
Substitutions in intergenic regions (kNG and kAR) do not appear to be affected by the average levels of gene expression breadth of nearby genes, and the relationship between kIN and gene expression breadth weakens when the substitution rates of entire alignments are compared to average expression levels. These observations indicate that the expression breadth of a particular gene affects kIN within that gene rather than this being a secondary effect of the clustering of genes of similar expression levels. The reason is unclear for the negative correlation between kIN and the proportion of sites within exons in an alignment (exon density). However, one possibility is that exon density is a correlate of local levels of germ line expression, as it is likely that highly expressed housekeeping genes cluster in gene dense regions and have shorter introns (Urrutia and Hurst 2003; Vinogradov 2003). Because we only have estimates of gene expression for approximately one third of the genes within the alignments, it is difficult to obtain accurate estimates of average gene expression breadth within entire alignments. It is, therefore, possible that exon density is a better indicator of average germ line expression and that it correlates negatively with kIN owing to the clustering of genes of similar expression patterns.
A positive correlation between substitution rate and recombination rate supports the hypothesis that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002). Our data suggest that recombination affects the mutation rate in introns and intergenic regions, but we find no relationship between kAR and recombination rate. Mean substitution rate in ancestral repeats (kAR) is significantly higher than that of other noncoding sites. Interspersed repeats serve as substrates for both homologous and nonhomologous recombination (Deininger et al. 2003), which means the number of recombination events may be governed by additional factors compared with surrounding regions. This could potentially explain both the significantly higher average levels of kAR and the lack of dependence of kAR with regional recombination rates.
The finding that substitution rates at the three site classes are affected by different factors has important implications. Given that mutation rates in transcribed regions are likely to be influenced by breadth of gene expression, they may not provide a good estimate of regional mutation rates. Furthermore, the difference between substitution rates in repetitive and nonrepetitive sequences suggests that estimates from ancestral repeats (e.g. IHGSC 2001; MGSC 2002) provide only limited information about substitution processes that affect neighboring nonrepetitive sequences.
Our analyses found the strongest evidence for local similarities in substitution rate at a scale of 2 Mb, with substitution rate differences between nearby alignments of 60% to 75% of those in randomized datasets. Such local similarities remain after accounting for variation in two different correlates of substitution rates (recombination rate and exon density). An additional factor affecting variation in nongenic substitution rates on the megabase scale is location within a particular region of human–mouse synteny (Malcom et al. 2003). Regions of shared evolutionary history could, therefore, partially define substitution rate domains. Because translocation breakpoints are often reused in mammalian evolution (Pevzner and Tesler 2003) it is also possible that the regional variation in mutation rates associated with synteny blocks is also evolutionarily conserved.
Acknowledgements
We are grateful to Mikael Brandstr?m for writing and adapting the seqtool application, and to Elena Jazin for comments on the manuscript. We acknowledge the Swedish Research Council and the Knut and Alice Wallenberg Foundation for financial support.
References
Arndt P. F., D. A. Petrov, and T. Hwa. 2003. Distinct changes of genomic biases in nucleotide substitution at the time of mammalian radiation. Mol. Biol. Evol. 20:1887–1896.
Bernardi G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm-blooded vertebrates. Science 228:953–958.
Bierne N., and A. Eyre-Walker. 2003. The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics 165:1587–1597.
Bray N., I. Dubchak, and L. Pachter. 2003. AVID: a global alignment program. Genome Res 13:97–102.
Caron H., B. van Schaik, M. van der Mee, et al. 2001. The human transcriptome map: Clustering of highly expressed genes in chromosomal domains. Science 291:1289–1292.
Datta A., and S. Jinks-Robertson. 1995. Association of increased spontaneous mutation rates with high levels of transcription in yeast. Science 268:1616–1619.
Deininger P. L., J. V. Moran, M. A. Batzer, and H. H. Kazazian. 2003. Mobile elements and mammalian genome evolution. Curr. Opin. Genet. Dev. 13:651–658.
Duret L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: Expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68–74.
Duret L., M. Semon, G. Piganeau, D. Mouchiroud, and N. Galtier. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics 162:1837–1847.
Ebersberger I., D. Metzler, C. Schwarz, and S. P??bo. 2002. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70:1490–1497.
Eyre-Walker A., and L. D. Hurst. 2001. The evolution of isochores. Nat. Rev. Genet. 2:549–555.
Fullerton S. M., A. Bernardo Carvalho, and A. G. Clark. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. 18:1139–1142.
Galtier N., G. Piganeau, D. Mouchiroud, and L. Duret. 2001. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159:907–911.
Green P., B. Ewing, W. Miller, P. J. Thomas, and E. D. Green. 2003. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33:514–517.
Hardison R. C., K. M. Roskin, S. Yang, et al. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13:13–26.
Hellmann I., I. Ebersberger, S. E. Ptak, S. P??bo, and M. Przeworski. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am. J. Hum. Genet. 72:1527–1535.
Hurst L. D., and A. Eyre-Walker. 2000. Evolutionary genomics: reading the bands. Bioessays 22:105–107.
Hurst L. D., and E. J. Williams. 2000. Covariation of GC content and the silent site substitution rate in rodents: implications for methodology and for the evolution of isochores. Gene 261:107–114.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Kong A., D. F. Gudbjartsson, J. Sainz, et al. 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31:241–247.
Lash A. E., C. M. Tolstoshev, L. Wagner, G. D. Schuler, R. L. Strausberg, G. J. Riggins, and S. F. Altschul. 2000. SAGEmap: a Public Gene Expression Resource. Gen. Res. 10:1051–1060.
Lercher M. J., and L. D. Hurst. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18:337–340.
Lercher M. J., A. O. Urrutia, and L. D. Hurst. 2002. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31:180–183.
Lercher M. J., A. O. Urrutia, A. Pavlicek, and L. D. Hurst. 2003. A unification of mosaic structures in the human genome. Hum. Mol. Genet. 12:2411–2415.
Lercher M. J., E. J. Williams, and L. D. Hurst. 2001. Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias. Mol. Biol. Evol. 18:2032–2039.
Li W. H. (1997) Molecular evolution. Sinauer Associates, Sunderland.
Malcom C. M., G. J. Wyckoff, and B. T. Lahn. 2003. Genic mutation rates in mammals: local similarity, chromosomal heterogeneity, and X-versus-autosome disparity. Mol. Biol. Evol. 20:1633–1641.
Margulies E. H., S. L. Kardia, and J. W. Innis. 2001. Identification and prevention of a GC content bias in SAGE libraries. Nucleic Acids Res. 29:e60.
Matassi G., P. M. Sharp, and C. Gautier. 1999. Chromosomal location effects on gene sequence evolution in mammals. Curr. Biol. 9:786–791.
Meunier J., and L. Duret. 2004. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21:984–990.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.
Morey N. J., C. N. Greene, and S. Jinks-Robertson. 2000. Genetic analysis of transcription-associated mutation in Saccharomyces cerevisiae. Genetics 154:109–120.
Niimura Y., and T. Gojobori. 2002. In silico chromosome staining: reconstruction of Giemsa bands from the whole human genome sequence. Proc. Natl. Acad. Sci. USA 99:797–802.
Pevzner P., and G. Tesler. 2003. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc. Natl. Acad. Sci. USA 100:7672–7677.
Piganeau G., D. Mouchiroud, L. Duret, and C. Gautier. 2002. Expected relationship between the silent substitution rate and the GC content: implications for the evolution of isochores. J. Mol. Evol. 54:129–133.
Saccone S., A. Pavlicek, C. Federico, J. Paces, and G. Bernard. 2001. Genes, isochores and bands in human chromosomes 21 and 22. Chromosome Res 9:533–539.
Schuler G. D., M. S. Boguski, E. A. Stewart, et al. 1996. A gene map of the human genome. Science 274:540–546.
Smith N. G. C., and A. Eyre-Walker. 2003. Partitioning the variation in mammalian substitution rates. Mol. Biol. Evol. 20:10–17.
Smith N. G. C., M. T. Webster, and H. Ellegren. 2002. Deterministic mutation rate variation in the human genome. Genome Res. 12:1350–1356.
Smith N. G. C., and L. D. Hurst. 1999. The effect of tandem substitutions on the correlation between synonymous and nonsynonymous rates in rodents. Genetics 153:1395–1402.
Su A. I., M. P. Cooke, K. A. Ching, et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. USA 99:4465–4470.
Svejstrup J. Q. 2002. Mechanisms of transcription-coupled DNA repair. Nat. Rev. Mol. Cell. Biol. 3:21–29.
Thompson J. D., D. G. Higgins, and T. J. Gibson. 1994. Clustal-W—Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position- specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Urrutia A. O., and L. D. Hurst. 2003. The signature of selection mediated by expression on human genes. Genome Res 13:2260–2264.
Williams E. J., and L. D. Hurst. 2000. The proteins of linked genes evolve at similar rates. Nature 407:900–903.
Williams E. J., and L. D. Hurst. 2002. Clustering of tissue-specific genes underlies much of the similarity in rates of protein evolution of linked genes. J. Mol. Evol. 54:511–518.
Vinogradov A. E. 2003. Isochores and tissue-specificity. Nucleic Acids Res. 31:5212–5220.(Matthew T. Webster*, Nick)