Effects of Gene Expression on Molecular Evolution in Arabidopsis thaliana and Arabidopsis lyrata
http://www.100md.com
分子生物学进展 2004年第9期
* ICAPB, University of Edinburgh Ashworth Labs, Edinburgh, UK
Department of Plant and Soil Sciences, University of Delaware
E-mail: stephenw@yorku.ca.
Abstract
We analyzed the complete genome sequence of Arabidopsis thaliana and sequence data from 83 genes in the outcrossing A. lyrata, to better understand the role of gene expression on the strength of natural selection on synonymous and replacement sites in Arabidopsis. From data on tRNA gene abundance, we find a good concordance between codon preferences and the relative abundance of isoaccepting tRNAs in the complete A. thaliana genome, consistent with models of translational selection. Both EST-based and new quantitative measures of gene expression (MPSS) suggest that codon preferences derived from information on tRNA abundance are more strongly associated with gene expression than those obtained from multivariate analysis, which provides further support for the hypothesis that codon bias in Arabidopsis is under selection mediated by tRNA abundance. Consistent with previous results, analysis of protein evolution reveals a significant correlation between gene expression level and amino acid substitution rate. Analysis by MPSS estimates of gene expression suggests that this effect is primarily the result of a correlation between the number of tissues in which a gene is expressed and the rate of amino acid substitution, which indicates that the degree of tissue specialization may be an important determinant of the rate of protein evolution in Arabidopsis.
Key Words: Arabidopsis thaliana ? Arabidopsis lyrata ? tRNA ? MPSS ? codon bias, protein evolution
Introduction
Differences among genes in their expression appear to be an important cause of variation in rates and patterns of molecular evolution (Akashi 2001). First, positive correlations between gene expression and the level of codon usage bias on synonymous sites have been observed in a number of model eukaryotic genomes, consistent with models of translational selection on codon usage (Akashi 2001; Duret and Mouchiroud 1999). Further support for translational selection comes from a strong association between tRNA abundance and codon bias, which has been repeatedly demonstrated in bacteria and yeast (Ikemura 1985), although evidence is less for such an association in multicellular eukaryotes. In Drosophila, a general correspondence appears to exist between experimentally estimated tRNA abundance and codon bias (Moriyama and Powell 1997). Similarly, in C. elegans, when tRNA gene copy number is used as a proxy for tRNA abundance, a good concordance occurs between the most frequently used codons and the copy number of the corresponding isoaccepting tRNAs (Duret 2000).
Recently, an effect of gene expression on rates of amino acid substitution has also been detected in several taxa (Pal, Papp, and Hurst 2001; Duret and Mouchiroud 2000; Iida and Akashi 2000; Zhang, Vision, and Gaut 2002; Wright, Lauga, and Charlesworth 2002), but the underlying cause of these patterns remains uncertain. One hypothesis is that genes that experience a higher complexity and diversity of biochemical environments will be subject to stronger selective constraints (Hastings 1996; Kuma, Iwabe, and Miyata 1995). Alternatively, mutations that affect proteins expressed in a greater number of tissues may on average have larger effects on fitness (Duret 2000). Translational selection may also be acting on amino acid composition to optimize protein synthesis (Akashi 2003), which should be more intense in highly expressed genes, given the higher level of translation of each amino acid. Finally, the degree to which the observed effect of gene expression on amino acid substitution rate is driven by inherent differences in selective constraint among genes of different function is unclear, because the functional categories differ, on average, in their relative abundance among different expression levels (Akashi 2001).
In this study, we investigate the interaction between gene expression and the effectiveness of natural selection at replacement and synonymous sites in Arabidopsis thaliana and its outcrossing congener A. lyrata. Although a positive correlation between preferred synonymous codon frequencies and gene expression has been detected in A. thaliana and close relatives (Duret and Mouchiroud 1999; Wright, Lauga, and Charlesworth 2002), the corresponding correlation coefficients are weak, and codon preferences have been defined solely by use of genome-wide multivariate analysis of codon usage, by identifying codon preferences in genes at the extreme of the first axis of correspondence analysis (Chiapello et al. 1998). Thus, little independent evidence remains in either Arabidopsis species that the identified "preferred" codons are explained by translational selection. Preliminary evidence for a positive correlation between rates of amino acid substitution and EST-based estimates of gene expression level has also been found in Arabidopsis, both between duplicated genes within the A. thaliana genome (Zhang, Vision, and Gaut 2002) and for orthologous genes between A. thaliana and A. lyrata (Wright et al. 2002). However, because EST-based estimates represent fairly crude indices of gene expression, the attempt to uncouple the underlying causes of this correlation in these studies was difficult.
We use genome-wide data from Arabidopsis thaliana and sequence data from 83 genes in Arabidopsis lyrata to investigate the effects of gene expression on substitution patterns and synonymous codon usage in Arabidopsis. We take advantage of new fine-scale estimates of gene expression in A. thaliana provided by Massively Parallel Signature Sequencing (MPSS [Brenner et al. 2000]) and data on tRNA gene abundance in the A. thaliana genome (http://rna.wustl.edu/GtRDB/At/). We use these data to address two primary questions: (1) Does translational selection act on codon bias in Arabidopsis? (2) What is the role of gene expression on the rate of protein evolution in Arabidopsis?
Methods
Data Sets
The complete list of 83 loci used for the analysis of codon bias is shown in table A1 of Supplementary Material online. We include all genes from Wright, Lauga, and Charlesworth (2002), with the exception of two members of the AOP gene family (AOP1 and AOP2), to avoid redundancies of base composition and substitution patterns across gene family members. We have added 22 new A. lyrata genes that are now available from GenBank, eight unpublished loci provided by colleagues (E. Kamau, H. Kuittinen, and O. Savolainen, personal communication), as well as 32 new genes sequenced for this study. Partial sequencing of genes in A. lyrata was done as previously described (Wright, Lauga, and Charlesworth 2002) or by direct sequencing of both strands from large exons. All new sequences from this study have been deposited into GenBank, with accession numbers AY647924 to AY647956.
Data Analysis
Calculations of rates of synonymous and nonsynonymous substitutions in each species were made by DnaSP version 3.95 (Rozas and Rozas 1999). To investigate codon preferences, we conduct correspondence analysis of relative synonymous codon usage (RSCU) of the sequenced A. lyrata genes, by application of the program CodonW (http://www.molbiol.ox.ac.uk/cu/culong.html). We exclude all gene fragments with less than 100 synonymous sites for this analysis, because small sequences give poor estimates of codon usage (e.g., Novembre 2002). Following the approach of Chiapello et al. (1998) for A. thaliana, preferred codons were detected by comparing codon frequencies of the highest-biased and lowest-biased genes, defined by the two extremes of the first axis of the correspondence analysis. Chi-square analysis was implemented in CodonW to test the null hypothesis that the relative frequencies of individual codons are the same in the highest-biased and lowest-biased 30 genes against the alternative hypothesis of an increase in frequency. We compared these results with the genome-wide identification of codon preferences in A. thaliana. For further comparison, we also performed correspondence analysis in A. thaliana on the orthologous loci we examined in A. lyrata, to evaluate the assessment of codon preferences in this subset of genes. CodonW was also used to estimate Fop, the frequency of optimal codons (Stenico, Lloyd, and Sharp 1994) for each gene.
Data on tRNA gene abundance in the complete genome of A. thaliana was obtained from the genomic tRNA database (http://rna.wustl.edu/tRNAdb/). To compare with preferred codon frequencies, we group codons by their predicted major tRNA species, following the revised wobble rules for eukaryotic genomes (Percudani 2001). These rules assume that GNN tRNAs pair with both C-ending and U-ending codons, whereas ANN tRNA genes are modified to inosine and decode both U-ending and G-ending codons. Identified tRNA genes in A. thaliana generally adhere to these wobble rules, because almost every codon can only be decoded by a single class of tRNA (see table 1). However, three exceptions exist of single unexpected tRNA genes (Gly GGU, Ser GGA, Leu GAG [shown in table 1]), which have been hypothesized to represent pseudogenes or sequencing errors (Wang, Shi, and Hao 2002). For our estimates of relative tRNA abundance, we ignore these exceptional genes, although their inclusion does not change the rank order of tRNA gene frequencies. tRNA gene abundance has been found to correlate strongly with the corresponding tRNA abundance in a number of prokaryotic and eukaryotic genomes (Kanaya et al. 2001), which suggests that it is a good proxy for relative tRNA abundance in the genome. In accordance with this prediction, we find a moderately significant positive correlation between codon number in the high-biased genes from Chiapello et al. (1998) and the number of isoaccepting tRNAs (r = 0.40, P < 0.01). When we remove three outliers with the highest tRNA gene copy numbers, the correlation improves substantially (r = 0.84, P << 0.001).
Table 1 Codon Preferences and tRNA Gene Abundance in Arabidopsis.
We use measures of gene expression level for each gene from A. thaliana from two data sets. First, data on the maximum abundance of expressed sequence tags (ESTs) extracted from GenBank (www.ncbi.nlm.nih.gov) Blast search were obtained from G. Marais (personal communication). Second, quantitative estimates of gene expression from five tissues were obtained from a database of Massively Parallel Signature Sequencing (MPSS) data (Brenner et al. 2000) of mRNA in A. thaliana (Meyers et al. 2004; http://mpss.udel.edu/at). This method generates quantitative estimates of the abundance of short sequence tags from messenger RNA. The five tissues represented in the expression data included leaves, roots, flowers, siliques, and callus. To estimate the expression level for each gene, we searched the Arabidopsis MPSS database in two ways. First, a search by the Arabidopsis protein identifier number determined sequence tag matches in the vicinity of the queried gene. We estimated expression level for each tissue by summing all expressed tags in the sense strand either within the coding sequence of the annotated gene (class I) or within 500 bp 3' of the coding region (class II). However, because this database searches for sequence tag matches to genomic DNA, it does not include all tags potentially generated after intron splicing. We, therefore, also submitted the available full-length mRNA for each locus to a search in the database to identify additional expressed sequence tags, and the abundance of any additional tags was added to estimates of expression level. For a small subset of loci, expressed tags matched to multiple genes, and these loci were, therefore, excluded from subsequent analysis. To investigate the effects of gene expression, we use both the maximum expression level across tissues and the total expression level across tissues as our estimates. For genome-wide analysis of the A. thaliana genome, gene-by-gene estimates of MPSS expression level were used. These data only represent expression levels for sequence tags that match the genomic DNA and might, therefore, in some cases, represent underestimates of the gene expression level. Because of the potential for errors in the MPSS data set for estimates with low expression, we conservatively assume that a gene is expressed in a given tissue only if it has values greater than or equal to 4 parts per million.
Results and Discussion
Codon Preferences in Arabidopsis lyrata
Correspondence analysis provides a means of summarizing variation among genes in codon usage into a single major axis. Codon "preferences" can then be assigned by comparing the differences in codon usage between genes at the extremes of the first axis. A summary of inferred codon preferences in A. lyrata obtained by multivariate analysis is shown in table 1; the A. thaliana results for the orthologous genes and the genome-wide results of Chiapello et al. (1998) are given for comparison. In comparison with the results of Chiapello et al. (1998), we identified a moderate concordance among codon preferences; out of the 21 codons denoted as preferred in Chiapello et al. (1998) for A. thaliana, 13 of the same codons were also preferred in our analysis, and three additional codons show a nonsignificant increase in frequency. Conversely, a single codon that was unpreferred in the analysis of A. thaliana by Chiapello et al. (1998) was identified as preferred in A. lyrata in our study (ACG). However, a comparison with correspondence analysis from A. thaliana made by use of the same set of loci shows the identification of very similar codon preferences to those in A. lyrata; the rank order of relative codon frequencies is identical across the two species, and relative codon frequencies are very highly correlated (Spearman's r = 0.96, P << 0.01). This finding suggests that differences from the genome-wide analysis may reflect artifacts of multivariate analysis on a smaller subset of genes, rather than true differences in codon preferences between species. Thus, this data set gives little evidence for changes in codon preferences between species, although the power to detect differences in codon preference may be low, given their close evolutionary history.
Evidence for Translational Selection Favoring Preferred Codons
To test the hypothesis that the identified preferred codons are under translational selection, we investigated the correspondence between codon preferences and tRNA gene abundance in the A. thaliana genome. tRNA gene numbers, relative codon frequencies between high-biased and low-biased genes, and absolute frequencies in highly biased genes are shown in table 1. By combining codons on the basis of the same tRNA isoacceptor, we found a perfect correspondence between the most abundant tRNA gene class and codon classes that show the highest relative frequencies in high-biased genes.
Furthermore, with the exception of two sixfold degenerate codons (Leu and Arg), the rank order of codon frequencies corresponds exactly with the rank order of tRNA abundance. This result provided strong support for the hypothesis that the preferred codons are favored by translational selection. Interestingly, four of the five preferred codons that did not show an increase in frequency in our smaller data set of 83 sequences also studied in A. lyrata (AAG, CAG, GAG, and AGG) had only slight differences in tRNA abundance. This finding suggests that, as would be expected a priori, the smaller data set has less power to detect subtler differences in selection.
For codons recognized by the same tRNA, Ikemura (1985 ["rule 4"]) proposed that because of weak pairing interactions, tRNAs pairing with (A/U)-(A/U)-pyrimidine codons should favor C-ending codons. This rule appears to hold in the A. thaliana genome, which supports the hypothesis of preference for the C-ending codons in Asn, Tyr, Phe, and Ile (table 1). However, in other cases (Val, Ser, Thr, Pro, His, and Cys), an apparent preference of C-ending over U-ending codons occurs that is not expected according to these rules. Furthermore, in several other similar cases, the preferred and unpreferred codons identified by Chiapello et al. (1998) did not match precisely with the expectations based on tRNA abundance (Arg, Leu, Pro, and Ala [see table 1]). Because low-biased genes are generally AT rich (table 1), these apparent disagreements could represent artifacts of defining preferred codons as those that show a significant increase in frequency at one extreme of the first multivariate axis of correspondence analysis. Because the first multivariate axis maximizes the differences between genes in codon usage, it may be influenced by base composition as well as selection. For example, when the absolute frequencies of GCT, GTT, TCT, ACT, and CCT in high-biased genes are examined, no such preference for C-ending over U-ending codons is apparent (in contrast with the ratio of usage in high-biased versus low-biased genes). Thus, identification of preferred and unpreferred codons by multivariate analysis could possibly lead to the misidentification of some codons, because of the additional contribution of mutational biases. Alternatively, an additional selection pressure may be favoring C-ending codons, such as mRNA secondary structure. This possibility has been hypothesized for Drosophila melanogaster to explain the strong preference for C-ending codons (Carlini, Chen, and Stephan 2001), which is not always expected from information on tRNA gene abundance (Kanaya et al. 2001).
To further test this data set for the action and nature of translational selection, we examined the relationship between gene expression level and codon bias, with the expectation that codon bias indices that more accurately reflect translational selection should show a stronger correlation with gene expression. To investigate this hypothesis, we used estimates of gene expression level based on EST data and estimates based on the abundance of short sequence tags (MPSS [see Methods]). These latter estimates should represent more accurate estimates of gene expression (Meyers et al., unpublished data), because they do not use normalized expression libraries, and they are based on many more counts of mRNA abundance than are EST projects. Furthermore, we used two estimates of the frequency of optimal codons: one based on codon preferences identified in Chiapello et al. (1998) utilized multivariate analysis (Fopmult) and a second based on expectations derived from tRNA gene relative abundance (FoptRNA). For this latter measure, we identify preferred codon classes as those with the highest relative tRNA abundance for each amino acid. We also followed Ikemura's (1985) rule 4 and assumed a preference of C over T for (A/U)-(A/U)-pyrimidine codons. Preferred codons defined by these rules are shown in bold in table 1. Correlation analysis between three measures of gene expression, these two estimates of Fop, and the GC content at third codon positions (GC3) is shown in table 2. First, as expected, FoptRNA shows a weaker correlation with GC3 (r = 0.51) than Fopmult (r = 0.87) because of the higher proportion of A/T-ending preferred codons. For all three measures of codon usage, MPSS estimates of gene expression tend to show a slightly higher correlation with codon bias than EST-based estimates, as expected from more accurate estimates of gene expression.
Table 2 Pearson Correlations Between Gene Expression and Codon Usage Bias.
For both A. lyrata and A. thaliana, FoptRNA shows the highest correlation with all measures of gene expression. Furthermore, a partial correlation between Fopmult and gene expression, factoring out FoptRNA, is not significant, which suggests that there is no residual effect of this measure of codon bias, once the similarity with FoptRNA is factored out (table 2). Conversely, a residual effect of gene expression on FoptRNA is still present, factoring out Fopmult (table 2). Finally, GC3 is not positively correlated with gene expression in a partial correlation, after factoring out the effects of FoptRNA. Because selection on mRNA secondary structure is likely to be associated with gene expression level, this observation suggests that a residual action of selection on base composition does not occur, independent of translational selection.
Note that these results also suggest that the codon preferences identified by tRNA relative abundance may more accurately reflect those codons favored by translational selection than those identified by multivariate analysis and that codon preferences identified solely by correspondence analysis should be treated with caution. Codon preferences have also been identified by explicitly comparing codon frequencies in genes with contrasting expression levels (e.g., Duret and Mouchiroud 1999). For several amino acids (Leu, Ala, and Thr), the identification of codon preferences by comparison of genes of highest and lowest expression are more similar to those expected on the basis of tRNA abundance and pairing rules (last two columns of table 1), which suggests that this method might provide more accurate identification of codon preferences. However, this approach generates an inherent circularity when testing for the action of translational selection (Duret and Mouchiroud 1999), and by using explicit a priori hypotheses based on isoaccepting tRNA abundance, we appear to have identified the codon preferences most directly relevant to selection on codon usage.
Figure 1 summarizes the relationship between MPSS-based gene expression classes and total FoptRNA across expression classes. In contrast with previous results based on EST estimates of gene expression (Duret and Mouchiroud 1999), we see a considerable increase (30%) in the frequency of preferred codons across expression classes, and variation in gene expression level explains a low but nontrivial percentage of the variance in FoptRNA among individual genes (approximately 10% [see table 2]) and a high percentage of the variance in FoptRNA across the gene expression classes shown in figure 1 (82%). However, a strong increase in codon bias appears to be primarily restricted to the very highest expression classes (fig. 1). This finding suggests that, unlike in other taxa (Akashi 2003; Duret and Mouchiroud 1999), selection may be acting on codon usage only on the most highly expressed genes and that selection on codon usage may be generally weaker in A. thaliana than in other systems, perhaps because high levels of self-fertilization are reducing the effectiveness of natural selection on slightly deleterious mutations (Wright, Lauga, and Charlesworth 2002). As previously found (Duret and Mouchiroud 1999), a significant negative correlation also exists between gene length and FoptRNA (Pearson's r = –0.251, P < 0.01), even once the effect of gene expression is factored out (partial r = –0.239, P < 0.01). Together, gene expression and gene length explain approximately 15% of the variation in codon usage across individual genes. Other factors that may be important in driving variation in codon usage include genetic drift, local mutational biases, and the rate of amino acid substitution (Marais, Charlesworth, and Wright 2004), although the current data set provides no evidence for a role of amino acid substitution on codon bias (see below).
FIG. 1. Relationship between gene expression class and codon bias in A. thaliana, as measured by FoptRNA (see text). Genes were divided into approximately equal groups by their total expression level by use of MPSS and the total FoptRNA for each class is shown. All expression classes consist of at least 50,000 codons. On the x-axis is the midpoint value of total MPSS expression (in parts per million) for that class
An alternative neutral explanation for the significant correlation between gene expression level and codon bias is that levels of gene expression are correlated with mutational biases, and the relationship with tRNA abundance solely reflects the evolution of tRNA pools to match the codon usage of highly expressed genes. If so, we would expect to observe a similar correlation between gene expression level and intron base composition. As determined from the complete genome of A. thaliana, a very weak positive correlation exists between MPSS-based estimates of total expression level and intron GC (Pearson's r = 0.024, P < 0.01). However, if we factor out the effect of intron GC using a partial correlation, a residual effect of gene expression on FoptRNA (partial r = 0.323, P < 0.001) is still present, which suggests that mutational biases alone cannot explain the pattern. In fact, GC3 is negatively correlated with the GC content in introns (r = –0.065, P < 0.01 [Marais et al. 2004]), which suggests an uncoupling of base composition at synonymous and intron sites.
Forces Driving Rates of Nucleotide Substitution
Table 3 shows a summary of the average rates of substitution and the effects of gene expression on substitution rates between A. thaliana and A. lyrata. Consistent with previous results (Wright, Lauga, and Charlesworth 2002), we find no correlation between either gene expression or codon bias and the rate of synonymous substitution (Ks), by application of the new measures of gene expression (MPSS) and codon bias (FoptRNA). Although a significant correlation with Ks is expected when natural selection on codon usage is strong, theoretical work has shown that weak selection on codon bias has very little effect on the rate of synonymous substitution (McVean and Vieira 2001), despite significant effects on codon usage. These results are, therefore, still consistent with the interpretation that weak translational selection on codon usage occurs in Arabidopsis. In contrast with observations from Drosophila (e.g., Bettancourt and Presgraves 2002), we also find no evidence for a correlation between the rate of amino acid substitution and the level of codon bias (table 3).
Table 3 Correlation Analysis Between Gene Expression and Rates Of Substitution.
Also consistent with earlier work, we found a significant negative correlation between EST-based estimates of gene expression and the rate of amino acid substitution (Ka [table 3]). Furthermore, we found a stronger correlation between MPSS-based estimates of gene expression level and Ka, as expected, because these latter data represent more accurate estimates of gene expression. As noted, distinguishing between the effects of gene expression level and the breadth of gene expression is difficult, because these factors are likely to be correlated (Akashi 2001). If the observed relationship between gene expression and amino acid substitution rate is driven primarily by an effect of higher selective constraints on proteins expressed in a greater diversity of chemical environments (Hastings 1996; Kuma, Iwabe, and Miyata 1995) or a higher fitness effect of mutations expressed in many tissues (Duret and Mouchiroud 1999), we would expect that tissue number should show the highest correlation with Ka, and no residual effect of expression level would be present when correcting for tissue number. Alternatively, if this relationship reflects translational selection on amino acid usage (Akashi 2003), we would expect that total expression level should have a residual significant effect, because the strength of selection on amino acid composition is expected to be determined by the total amount of translation. The MPSS data on gene expression from five tissues allowed for a detailed test of this prediction. Analyzing these data, we find a stronger negative correlation between Ka and tissue number (table 3) than with estimates of gene expression level. Furthermore, a partial correlation shows no significant effect of either EST or MPSS expression level when controlling for tissue number (table 3). In contrast, a partial correlation between tissue number and expression remains significant once total or maximum expression levels are factored out. As a comparison, the number of tissues in which a gene is expressed shows no correlation with FoptRNA (P > 0.05), despite the observed correlation with estimates of expression level (table 2), as expected under a model of translational selection. These results suggest that the diversity of tissues in which a gene is expressed may be the primary factor that drives the observed effect on protein evolution and that the specificity of gene expression may be a general factor that drives variation in rates of protein evolution in Arabidopsis. However, differences in gene function may exist in our data set that are associated with tissue number, and part of the effect of gene expression on the rate of protein evolution could be driven by underlying differences in selective constraint across gene categories. This possibility should be examined in detail once a larger comparative data set becomes available.
Acknowledgements
We thank G. Marais, D. Charlesworth, B. Gaut, H. Akashi, and two anonymous reviewers for very helpful discussion and comments on the manuscript, and E. Kamau, H. Kuittinen, and O. Savolainen for allowing us to use their unpublished sequence data. S.W. was supported by a Commonwealth fellowship and an NSERC PGSB scholarship. The MPSS data were generated with support from the NSF Plant Genome Research Program (DBI-0110528) to B.C.M.
Literature Cited
Akashi, H. 2001. Gene expression and molecular evolution. Curr. Opin. Genet. Devel. 11:660-666.
Akashi, H. 2003. Translational selection and yeast proteome evolution. Genetics 164:1291-1303.
Bettancourt, A. J., and Presgraves, D. C. 2002. Linkage limits the power of natural selection in Drosophila. Proc. Natl. Acad. Sci. USA 99:13616-13620.
Brenner, S., M. Johnson, and J. Bridgham, et al. (21 co-authors). 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18:630-634.
Carlini, D. B., Y. Chen, and W. Stephan. 2001. The relationship between third-codon position nucleotide content, codon bias, mRNA secondary structure and gene expression in the drosophilid alcohol dehydrogenase genes Adh and Adhr. Genetics 159:623-633.
Chiapello, H., F. Lisacek, M. Caboche, and A. Henaut. 1998. Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene 209:GC1-GC38.
Duret, L. 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet. 16:287-289.
Duret, L., and D. Mouchiroud. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482-4487.
Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68-74.
Hastings, K. E. 1996. Strong evolutionary conservation of broadly expressed protein isoforms in the troponin I gene family and other vertebrate gene families. J. Mol. Evol. 42:631-640.
Iida, K., and H. Akashi. 2000. A test of translational selection at ‘silent’ sites in the human genome: base composition comparisons in alternatively spliced genes. Gene 261:93-105.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34.
Kanaya, S., Y. Yamada, M. Kinouchi, Y. Kudo, and T. Ikemura. 2001. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53:290-298.
Kuma, K., N. Iwabe, and T. Miyata. 1995. Functional constraints against variations on molecules from the tissue level: slowly evolving brain-specific genes demonstrated by protein kinase and immunoglobulin supergene families. Mol. Biol. Evol. 12:123-130.
Marais, G., Charlesworth, B., and S. I. Wright. 2004. Recombination and base composition: the case of the highly self-fertilizing plant Arabidopsis thaliana. Genome Biol. 5:R45.
McVean, G. A., and J. Vieira. 2001. Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics 157:245-257.
Meyers, B. C., D. K. Lee, T. H. Vu, S. S. Tej, S. B. Edberg, M. Matvienko, and L. D. Tindell. 2004. Arabidopsis MPSS. An online resource for quantitative expression analysis. Plant Physiol. 135:801-813.
Moriyama, E. N., and J. R. Powell. 1997. Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45:514-523.
Novembre, J. A. 2002. Accounting for background nucleotide composition when measuring codon usage bias. Mol. Biol. Evol. 19:1390-1394.
Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927-931.
Percudani, R. 2001. Restricted wobble rules for eukaryotic genomes. Trends Genet. 17:133-135.
Rozas, J., and R. Rozas. 1999. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15:174-175.
Stenico, M., A. T. Lloyd, and P. M. Sharp. 1994. Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases. Nucleic Acids Res. 22:2437-2446.
Wang, X., X. Shi, and B. Hao. 2002. The transfer RNA genes in Oryza sativa L. ssp. indica. Sci. China (Ser. C) 45:504-511.
Wright, S. I., B. Lauga, and D. Charlesworth. 2002. Rates and patterns of molecular evolution in inbred and outbred Arabidopsis. Mol. Biol. Evol. 19:1399-1406.
Zhang, L., T. J. Vision, and B. S. Gaut. 2002. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol. Biol. Evol. 19:1464-1473.(Stephen I. Wright*,1, C. )
Department of Plant and Soil Sciences, University of Delaware
E-mail: stephenw@yorku.ca.
Abstract
We analyzed the complete genome sequence of Arabidopsis thaliana and sequence data from 83 genes in the outcrossing A. lyrata, to better understand the role of gene expression on the strength of natural selection on synonymous and replacement sites in Arabidopsis. From data on tRNA gene abundance, we find a good concordance between codon preferences and the relative abundance of isoaccepting tRNAs in the complete A. thaliana genome, consistent with models of translational selection. Both EST-based and new quantitative measures of gene expression (MPSS) suggest that codon preferences derived from information on tRNA abundance are more strongly associated with gene expression than those obtained from multivariate analysis, which provides further support for the hypothesis that codon bias in Arabidopsis is under selection mediated by tRNA abundance. Consistent with previous results, analysis of protein evolution reveals a significant correlation between gene expression level and amino acid substitution rate. Analysis by MPSS estimates of gene expression suggests that this effect is primarily the result of a correlation between the number of tissues in which a gene is expressed and the rate of amino acid substitution, which indicates that the degree of tissue specialization may be an important determinant of the rate of protein evolution in Arabidopsis.
Key Words: Arabidopsis thaliana ? Arabidopsis lyrata ? tRNA ? MPSS ? codon bias, protein evolution
Introduction
Differences among genes in their expression appear to be an important cause of variation in rates and patterns of molecular evolution (Akashi 2001). First, positive correlations between gene expression and the level of codon usage bias on synonymous sites have been observed in a number of model eukaryotic genomes, consistent with models of translational selection on codon usage (Akashi 2001; Duret and Mouchiroud 1999). Further support for translational selection comes from a strong association between tRNA abundance and codon bias, which has been repeatedly demonstrated in bacteria and yeast (Ikemura 1985), although evidence is less for such an association in multicellular eukaryotes. In Drosophila, a general correspondence appears to exist between experimentally estimated tRNA abundance and codon bias (Moriyama and Powell 1997). Similarly, in C. elegans, when tRNA gene copy number is used as a proxy for tRNA abundance, a good concordance occurs between the most frequently used codons and the copy number of the corresponding isoaccepting tRNAs (Duret 2000).
Recently, an effect of gene expression on rates of amino acid substitution has also been detected in several taxa (Pal, Papp, and Hurst 2001; Duret and Mouchiroud 2000; Iida and Akashi 2000; Zhang, Vision, and Gaut 2002; Wright, Lauga, and Charlesworth 2002), but the underlying cause of these patterns remains uncertain. One hypothesis is that genes that experience a higher complexity and diversity of biochemical environments will be subject to stronger selective constraints (Hastings 1996; Kuma, Iwabe, and Miyata 1995). Alternatively, mutations that affect proteins expressed in a greater number of tissues may on average have larger effects on fitness (Duret 2000). Translational selection may also be acting on amino acid composition to optimize protein synthesis (Akashi 2003), which should be more intense in highly expressed genes, given the higher level of translation of each amino acid. Finally, the degree to which the observed effect of gene expression on amino acid substitution rate is driven by inherent differences in selective constraint among genes of different function is unclear, because the functional categories differ, on average, in their relative abundance among different expression levels (Akashi 2001).
In this study, we investigate the interaction between gene expression and the effectiveness of natural selection at replacement and synonymous sites in Arabidopsis thaliana and its outcrossing congener A. lyrata. Although a positive correlation between preferred synonymous codon frequencies and gene expression has been detected in A. thaliana and close relatives (Duret and Mouchiroud 1999; Wright, Lauga, and Charlesworth 2002), the corresponding correlation coefficients are weak, and codon preferences have been defined solely by use of genome-wide multivariate analysis of codon usage, by identifying codon preferences in genes at the extreme of the first axis of correspondence analysis (Chiapello et al. 1998). Thus, little independent evidence remains in either Arabidopsis species that the identified "preferred" codons are explained by translational selection. Preliminary evidence for a positive correlation between rates of amino acid substitution and EST-based estimates of gene expression level has also been found in Arabidopsis, both between duplicated genes within the A. thaliana genome (Zhang, Vision, and Gaut 2002) and for orthologous genes between A. thaliana and A. lyrata (Wright et al. 2002). However, because EST-based estimates represent fairly crude indices of gene expression, the attempt to uncouple the underlying causes of this correlation in these studies was difficult.
We use genome-wide data from Arabidopsis thaliana and sequence data from 83 genes in Arabidopsis lyrata to investigate the effects of gene expression on substitution patterns and synonymous codon usage in Arabidopsis. We take advantage of new fine-scale estimates of gene expression in A. thaliana provided by Massively Parallel Signature Sequencing (MPSS [Brenner et al. 2000]) and data on tRNA gene abundance in the A. thaliana genome (http://rna.wustl.edu/GtRDB/At/). We use these data to address two primary questions: (1) Does translational selection act on codon bias in Arabidopsis? (2) What is the role of gene expression on the rate of protein evolution in Arabidopsis?
Methods
Data Sets
The complete list of 83 loci used for the analysis of codon bias is shown in table A1 of Supplementary Material online. We include all genes from Wright, Lauga, and Charlesworth (2002), with the exception of two members of the AOP gene family (AOP1 and AOP2), to avoid redundancies of base composition and substitution patterns across gene family members. We have added 22 new A. lyrata genes that are now available from GenBank, eight unpublished loci provided by colleagues (E. Kamau, H. Kuittinen, and O. Savolainen, personal communication), as well as 32 new genes sequenced for this study. Partial sequencing of genes in A. lyrata was done as previously described (Wright, Lauga, and Charlesworth 2002) or by direct sequencing of both strands from large exons. All new sequences from this study have been deposited into GenBank, with accession numbers AY647924 to AY647956.
Data Analysis
Calculations of rates of synonymous and nonsynonymous substitutions in each species were made by DnaSP version 3.95 (Rozas and Rozas 1999). To investigate codon preferences, we conduct correspondence analysis of relative synonymous codon usage (RSCU) of the sequenced A. lyrata genes, by application of the program CodonW (http://www.molbiol.ox.ac.uk/cu/culong.html). We exclude all gene fragments with less than 100 synonymous sites for this analysis, because small sequences give poor estimates of codon usage (e.g., Novembre 2002). Following the approach of Chiapello et al. (1998) for A. thaliana, preferred codons were detected by comparing codon frequencies of the highest-biased and lowest-biased genes, defined by the two extremes of the first axis of the correspondence analysis. Chi-square analysis was implemented in CodonW to test the null hypothesis that the relative frequencies of individual codons are the same in the highest-biased and lowest-biased 30 genes against the alternative hypothesis of an increase in frequency. We compared these results with the genome-wide identification of codon preferences in A. thaliana. For further comparison, we also performed correspondence analysis in A. thaliana on the orthologous loci we examined in A. lyrata, to evaluate the assessment of codon preferences in this subset of genes. CodonW was also used to estimate Fop, the frequency of optimal codons (Stenico, Lloyd, and Sharp 1994) for each gene.
Data on tRNA gene abundance in the complete genome of A. thaliana was obtained from the genomic tRNA database (http://rna.wustl.edu/tRNAdb/). To compare with preferred codon frequencies, we group codons by their predicted major tRNA species, following the revised wobble rules for eukaryotic genomes (Percudani 2001). These rules assume that GNN tRNAs pair with both C-ending and U-ending codons, whereas ANN tRNA genes are modified to inosine and decode both U-ending and G-ending codons. Identified tRNA genes in A. thaliana generally adhere to these wobble rules, because almost every codon can only be decoded by a single class of tRNA (see table 1). However, three exceptions exist of single unexpected tRNA genes (Gly GGU, Ser GGA, Leu GAG [shown in table 1]), which have been hypothesized to represent pseudogenes or sequencing errors (Wang, Shi, and Hao 2002). For our estimates of relative tRNA abundance, we ignore these exceptional genes, although their inclusion does not change the rank order of tRNA gene frequencies. tRNA gene abundance has been found to correlate strongly with the corresponding tRNA abundance in a number of prokaryotic and eukaryotic genomes (Kanaya et al. 2001), which suggests that it is a good proxy for relative tRNA abundance in the genome. In accordance with this prediction, we find a moderately significant positive correlation between codon number in the high-biased genes from Chiapello et al. (1998) and the number of isoaccepting tRNAs (r = 0.40, P < 0.01). When we remove three outliers with the highest tRNA gene copy numbers, the correlation improves substantially (r = 0.84, P << 0.001).
Table 1 Codon Preferences and tRNA Gene Abundance in Arabidopsis.
We use measures of gene expression level for each gene from A. thaliana from two data sets. First, data on the maximum abundance of expressed sequence tags (ESTs) extracted from GenBank (www.ncbi.nlm.nih.gov) Blast search were obtained from G. Marais (personal communication). Second, quantitative estimates of gene expression from five tissues were obtained from a database of Massively Parallel Signature Sequencing (MPSS) data (Brenner et al. 2000) of mRNA in A. thaliana (Meyers et al. 2004; http://mpss.udel.edu/at). This method generates quantitative estimates of the abundance of short sequence tags from messenger RNA. The five tissues represented in the expression data included leaves, roots, flowers, siliques, and callus. To estimate the expression level for each gene, we searched the Arabidopsis MPSS database in two ways. First, a search by the Arabidopsis protein identifier number determined sequence tag matches in the vicinity of the queried gene. We estimated expression level for each tissue by summing all expressed tags in the sense strand either within the coding sequence of the annotated gene (class I) or within 500 bp 3' of the coding region (class II). However, because this database searches for sequence tag matches to genomic DNA, it does not include all tags potentially generated after intron splicing. We, therefore, also submitted the available full-length mRNA for each locus to a search in the database to identify additional expressed sequence tags, and the abundance of any additional tags was added to estimates of expression level. For a small subset of loci, expressed tags matched to multiple genes, and these loci were, therefore, excluded from subsequent analysis. To investigate the effects of gene expression, we use both the maximum expression level across tissues and the total expression level across tissues as our estimates. For genome-wide analysis of the A. thaliana genome, gene-by-gene estimates of MPSS expression level were used. These data only represent expression levels for sequence tags that match the genomic DNA and might, therefore, in some cases, represent underestimates of the gene expression level. Because of the potential for errors in the MPSS data set for estimates with low expression, we conservatively assume that a gene is expressed in a given tissue only if it has values greater than or equal to 4 parts per million.
Results and Discussion
Codon Preferences in Arabidopsis lyrata
Correspondence analysis provides a means of summarizing variation among genes in codon usage into a single major axis. Codon "preferences" can then be assigned by comparing the differences in codon usage between genes at the extremes of the first axis. A summary of inferred codon preferences in A. lyrata obtained by multivariate analysis is shown in table 1; the A. thaliana results for the orthologous genes and the genome-wide results of Chiapello et al. (1998) are given for comparison. In comparison with the results of Chiapello et al. (1998), we identified a moderate concordance among codon preferences; out of the 21 codons denoted as preferred in Chiapello et al. (1998) for A. thaliana, 13 of the same codons were also preferred in our analysis, and three additional codons show a nonsignificant increase in frequency. Conversely, a single codon that was unpreferred in the analysis of A. thaliana by Chiapello et al. (1998) was identified as preferred in A. lyrata in our study (ACG). However, a comparison with correspondence analysis from A. thaliana made by use of the same set of loci shows the identification of very similar codon preferences to those in A. lyrata; the rank order of relative codon frequencies is identical across the two species, and relative codon frequencies are very highly correlated (Spearman's r = 0.96, P << 0.01). This finding suggests that differences from the genome-wide analysis may reflect artifacts of multivariate analysis on a smaller subset of genes, rather than true differences in codon preferences between species. Thus, this data set gives little evidence for changes in codon preferences between species, although the power to detect differences in codon preference may be low, given their close evolutionary history.
Evidence for Translational Selection Favoring Preferred Codons
To test the hypothesis that the identified preferred codons are under translational selection, we investigated the correspondence between codon preferences and tRNA gene abundance in the A. thaliana genome. tRNA gene numbers, relative codon frequencies between high-biased and low-biased genes, and absolute frequencies in highly biased genes are shown in table 1. By combining codons on the basis of the same tRNA isoacceptor, we found a perfect correspondence between the most abundant tRNA gene class and codon classes that show the highest relative frequencies in high-biased genes.
Furthermore, with the exception of two sixfold degenerate codons (Leu and Arg), the rank order of codon frequencies corresponds exactly with the rank order of tRNA abundance. This result provided strong support for the hypothesis that the preferred codons are favored by translational selection. Interestingly, four of the five preferred codons that did not show an increase in frequency in our smaller data set of 83 sequences also studied in A. lyrata (AAG, CAG, GAG, and AGG) had only slight differences in tRNA abundance. This finding suggests that, as would be expected a priori, the smaller data set has less power to detect subtler differences in selection.
For codons recognized by the same tRNA, Ikemura (1985 ["rule 4"]) proposed that because of weak pairing interactions, tRNAs pairing with (A/U)-(A/U)-pyrimidine codons should favor C-ending codons. This rule appears to hold in the A. thaliana genome, which supports the hypothesis of preference for the C-ending codons in Asn, Tyr, Phe, and Ile (table 1). However, in other cases (Val, Ser, Thr, Pro, His, and Cys), an apparent preference of C-ending over U-ending codons occurs that is not expected according to these rules. Furthermore, in several other similar cases, the preferred and unpreferred codons identified by Chiapello et al. (1998) did not match precisely with the expectations based on tRNA abundance (Arg, Leu, Pro, and Ala [see table 1]). Because low-biased genes are generally AT rich (table 1), these apparent disagreements could represent artifacts of defining preferred codons as those that show a significant increase in frequency at one extreme of the first multivariate axis of correspondence analysis. Because the first multivariate axis maximizes the differences between genes in codon usage, it may be influenced by base composition as well as selection. For example, when the absolute frequencies of GCT, GTT, TCT, ACT, and CCT in high-biased genes are examined, no such preference for C-ending over U-ending codons is apparent (in contrast with the ratio of usage in high-biased versus low-biased genes). Thus, identification of preferred and unpreferred codons by multivariate analysis could possibly lead to the misidentification of some codons, because of the additional contribution of mutational biases. Alternatively, an additional selection pressure may be favoring C-ending codons, such as mRNA secondary structure. This possibility has been hypothesized for Drosophila melanogaster to explain the strong preference for C-ending codons (Carlini, Chen, and Stephan 2001), which is not always expected from information on tRNA gene abundance (Kanaya et al. 2001).
To further test this data set for the action and nature of translational selection, we examined the relationship between gene expression level and codon bias, with the expectation that codon bias indices that more accurately reflect translational selection should show a stronger correlation with gene expression. To investigate this hypothesis, we used estimates of gene expression level based on EST data and estimates based on the abundance of short sequence tags (MPSS [see Methods]). These latter estimates should represent more accurate estimates of gene expression (Meyers et al., unpublished data), because they do not use normalized expression libraries, and they are based on many more counts of mRNA abundance than are EST projects. Furthermore, we used two estimates of the frequency of optimal codons: one based on codon preferences identified in Chiapello et al. (1998) utilized multivariate analysis (Fopmult) and a second based on expectations derived from tRNA gene relative abundance (FoptRNA). For this latter measure, we identify preferred codon classes as those with the highest relative tRNA abundance for each amino acid. We also followed Ikemura's (1985) rule 4 and assumed a preference of C over T for (A/U)-(A/U)-pyrimidine codons. Preferred codons defined by these rules are shown in bold in table 1. Correlation analysis between three measures of gene expression, these two estimates of Fop, and the GC content at third codon positions (GC3) is shown in table 2. First, as expected, FoptRNA shows a weaker correlation with GC3 (r = 0.51) than Fopmult (r = 0.87) because of the higher proportion of A/T-ending preferred codons. For all three measures of codon usage, MPSS estimates of gene expression tend to show a slightly higher correlation with codon bias than EST-based estimates, as expected from more accurate estimates of gene expression.
Table 2 Pearson Correlations Between Gene Expression and Codon Usage Bias.
For both A. lyrata and A. thaliana, FoptRNA shows the highest correlation with all measures of gene expression. Furthermore, a partial correlation between Fopmult and gene expression, factoring out FoptRNA, is not significant, which suggests that there is no residual effect of this measure of codon bias, once the similarity with FoptRNA is factored out (table 2). Conversely, a residual effect of gene expression on FoptRNA is still present, factoring out Fopmult (table 2). Finally, GC3 is not positively correlated with gene expression in a partial correlation, after factoring out the effects of FoptRNA. Because selection on mRNA secondary structure is likely to be associated with gene expression level, this observation suggests that a residual action of selection on base composition does not occur, independent of translational selection.
Note that these results also suggest that the codon preferences identified by tRNA relative abundance may more accurately reflect those codons favored by translational selection than those identified by multivariate analysis and that codon preferences identified solely by correspondence analysis should be treated with caution. Codon preferences have also been identified by explicitly comparing codon frequencies in genes with contrasting expression levels (e.g., Duret and Mouchiroud 1999). For several amino acids (Leu, Ala, and Thr), the identification of codon preferences by comparison of genes of highest and lowest expression are more similar to those expected on the basis of tRNA abundance and pairing rules (last two columns of table 1), which suggests that this method might provide more accurate identification of codon preferences. However, this approach generates an inherent circularity when testing for the action of translational selection (Duret and Mouchiroud 1999), and by using explicit a priori hypotheses based on isoaccepting tRNA abundance, we appear to have identified the codon preferences most directly relevant to selection on codon usage.
Figure 1 summarizes the relationship between MPSS-based gene expression classes and total FoptRNA across expression classes. In contrast with previous results based on EST estimates of gene expression (Duret and Mouchiroud 1999), we see a considerable increase (30%) in the frequency of preferred codons across expression classes, and variation in gene expression level explains a low but nontrivial percentage of the variance in FoptRNA among individual genes (approximately 10% [see table 2]) and a high percentage of the variance in FoptRNA across the gene expression classes shown in figure 1 (82%). However, a strong increase in codon bias appears to be primarily restricted to the very highest expression classes (fig. 1). This finding suggests that, unlike in other taxa (Akashi 2003; Duret and Mouchiroud 1999), selection may be acting on codon usage only on the most highly expressed genes and that selection on codon usage may be generally weaker in A. thaliana than in other systems, perhaps because high levels of self-fertilization are reducing the effectiveness of natural selection on slightly deleterious mutations (Wright, Lauga, and Charlesworth 2002). As previously found (Duret and Mouchiroud 1999), a significant negative correlation also exists between gene length and FoptRNA (Pearson's r = –0.251, P < 0.01), even once the effect of gene expression is factored out (partial r = –0.239, P < 0.01). Together, gene expression and gene length explain approximately 15% of the variation in codon usage across individual genes. Other factors that may be important in driving variation in codon usage include genetic drift, local mutational biases, and the rate of amino acid substitution (Marais, Charlesworth, and Wright 2004), although the current data set provides no evidence for a role of amino acid substitution on codon bias (see below).
FIG. 1. Relationship between gene expression class and codon bias in A. thaliana, as measured by FoptRNA (see text). Genes were divided into approximately equal groups by their total expression level by use of MPSS and the total FoptRNA for each class is shown. All expression classes consist of at least 50,000 codons. On the x-axis is the midpoint value of total MPSS expression (in parts per million) for that class
An alternative neutral explanation for the significant correlation between gene expression level and codon bias is that levels of gene expression are correlated with mutational biases, and the relationship with tRNA abundance solely reflects the evolution of tRNA pools to match the codon usage of highly expressed genes. If so, we would expect to observe a similar correlation between gene expression level and intron base composition. As determined from the complete genome of A. thaliana, a very weak positive correlation exists between MPSS-based estimates of total expression level and intron GC (Pearson's r = 0.024, P < 0.01). However, if we factor out the effect of intron GC using a partial correlation, a residual effect of gene expression on FoptRNA (partial r = 0.323, P < 0.001) is still present, which suggests that mutational biases alone cannot explain the pattern. In fact, GC3 is negatively correlated with the GC content in introns (r = –0.065, P < 0.01 [Marais et al. 2004]), which suggests an uncoupling of base composition at synonymous and intron sites.
Forces Driving Rates of Nucleotide Substitution
Table 3 shows a summary of the average rates of substitution and the effects of gene expression on substitution rates between A. thaliana and A. lyrata. Consistent with previous results (Wright, Lauga, and Charlesworth 2002), we find no correlation between either gene expression or codon bias and the rate of synonymous substitution (Ks), by application of the new measures of gene expression (MPSS) and codon bias (FoptRNA). Although a significant correlation with Ks is expected when natural selection on codon usage is strong, theoretical work has shown that weak selection on codon bias has very little effect on the rate of synonymous substitution (McVean and Vieira 2001), despite significant effects on codon usage. These results are, therefore, still consistent with the interpretation that weak translational selection on codon usage occurs in Arabidopsis. In contrast with observations from Drosophila (e.g., Bettancourt and Presgraves 2002), we also find no evidence for a correlation between the rate of amino acid substitution and the level of codon bias (table 3).
Table 3 Correlation Analysis Between Gene Expression and Rates Of Substitution.
Also consistent with earlier work, we found a significant negative correlation between EST-based estimates of gene expression and the rate of amino acid substitution (Ka [table 3]). Furthermore, we found a stronger correlation between MPSS-based estimates of gene expression level and Ka, as expected, because these latter data represent more accurate estimates of gene expression. As noted, distinguishing between the effects of gene expression level and the breadth of gene expression is difficult, because these factors are likely to be correlated (Akashi 2001). If the observed relationship between gene expression and amino acid substitution rate is driven primarily by an effect of higher selective constraints on proteins expressed in a greater diversity of chemical environments (Hastings 1996; Kuma, Iwabe, and Miyata 1995) or a higher fitness effect of mutations expressed in many tissues (Duret and Mouchiroud 1999), we would expect that tissue number should show the highest correlation with Ka, and no residual effect of expression level would be present when correcting for tissue number. Alternatively, if this relationship reflects translational selection on amino acid usage (Akashi 2003), we would expect that total expression level should have a residual significant effect, because the strength of selection on amino acid composition is expected to be determined by the total amount of translation. The MPSS data on gene expression from five tissues allowed for a detailed test of this prediction. Analyzing these data, we find a stronger negative correlation between Ka and tissue number (table 3) than with estimates of gene expression level. Furthermore, a partial correlation shows no significant effect of either EST or MPSS expression level when controlling for tissue number (table 3). In contrast, a partial correlation between tissue number and expression remains significant once total or maximum expression levels are factored out. As a comparison, the number of tissues in which a gene is expressed shows no correlation with FoptRNA (P > 0.05), despite the observed correlation with estimates of expression level (table 2), as expected under a model of translational selection. These results suggest that the diversity of tissues in which a gene is expressed may be the primary factor that drives the observed effect on protein evolution and that the specificity of gene expression may be a general factor that drives variation in rates of protein evolution in Arabidopsis. However, differences in gene function may exist in our data set that are associated with tissue number, and part of the effect of gene expression on the rate of protein evolution could be driven by underlying differences in selective constraint across gene categories. This possibility should be examined in detail once a larger comparative data set becomes available.
Acknowledgements
We thank G. Marais, D. Charlesworth, B. Gaut, H. Akashi, and two anonymous reviewers for very helpful discussion and comments on the manuscript, and E. Kamau, H. Kuittinen, and O. Savolainen for allowing us to use their unpublished sequence data. S.W. was supported by a Commonwealth fellowship and an NSERC PGSB scholarship. The MPSS data were generated with support from the NSF Plant Genome Research Program (DBI-0110528) to B.C.M.
Literature Cited
Akashi, H. 2001. Gene expression and molecular evolution. Curr. Opin. Genet. Devel. 11:660-666.
Akashi, H. 2003. Translational selection and yeast proteome evolution. Genetics 164:1291-1303.
Bettancourt, A. J., and Presgraves, D. C. 2002. Linkage limits the power of natural selection in Drosophila. Proc. Natl. Acad. Sci. USA 99:13616-13620.
Brenner, S., M. Johnson, and J. Bridgham, et al. (21 co-authors). 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18:630-634.
Carlini, D. B., Y. Chen, and W. Stephan. 2001. The relationship between third-codon position nucleotide content, codon bias, mRNA secondary structure and gene expression in the drosophilid alcohol dehydrogenase genes Adh and Adhr. Genetics 159:623-633.
Chiapello, H., F. Lisacek, M. Caboche, and A. Henaut. 1998. Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene 209:GC1-GC38.
Duret, L. 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet. 16:287-289.
Duret, L., and D. Mouchiroud. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482-4487.
Duret, L., and D. Mouchiroud. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68-74.
Hastings, K. E. 1996. Strong evolutionary conservation of broadly expressed protein isoforms in the troponin I gene family and other vertebrate gene families. J. Mol. Evol. 42:631-640.
Iida, K., and H. Akashi. 2000. A test of translational selection at ‘silent’ sites in the human genome: base composition comparisons in alternatively spliced genes. Gene 261:93-105.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34.
Kanaya, S., Y. Yamada, M. Kinouchi, Y. Kudo, and T. Ikemura. 2001. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53:290-298.
Kuma, K., N. Iwabe, and T. Miyata. 1995. Functional constraints against variations on molecules from the tissue level: slowly evolving brain-specific genes demonstrated by protein kinase and immunoglobulin supergene families. Mol. Biol. Evol. 12:123-130.
Marais, G., Charlesworth, B., and S. I. Wright. 2004. Recombination and base composition: the case of the highly self-fertilizing plant Arabidopsis thaliana. Genome Biol. 5:R45.
McVean, G. A., and J. Vieira. 2001. Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics 157:245-257.
Meyers, B. C., D. K. Lee, T. H. Vu, S. S. Tej, S. B. Edberg, M. Matvienko, and L. D. Tindell. 2004. Arabidopsis MPSS. An online resource for quantitative expression analysis. Plant Physiol. 135:801-813.
Moriyama, E. N., and J. R. Powell. 1997. Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45:514-523.
Novembre, J. A. 2002. Accounting for background nucleotide composition when measuring codon usage bias. Mol. Biol. Evol. 19:1390-1394.
Pal, C., B. Papp, and L. D. Hurst. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927-931.
Percudani, R. 2001. Restricted wobble rules for eukaryotic genomes. Trends Genet. 17:133-135.
Rozas, J., and R. Rozas. 1999. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15:174-175.
Stenico, M., A. T. Lloyd, and P. M. Sharp. 1994. Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases. Nucleic Acids Res. 22:2437-2446.
Wang, X., X. Shi, and B. Hao. 2002. The transfer RNA genes in Oryza sativa L. ssp. indica. Sci. China (Ser. C) 45:504-511.
Wright, S. I., B. Lauga, and D. Charlesworth. 2002. Rates and patterns of molecular evolution in inbred and outbred Arabidopsis. Mol. Biol. Evol. 19:1399-1406.
Zhang, L., T. J. Vision, and B. S. Gaut. 2002. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol. Biol. Evol. 19:1464-1473.(Stephen I. Wright*,1, C. )