Variation in the strength of selected codon usage bias among bacteria
Nucleic Acids Research
Institute of Genetics, University of Nottingham, Queens Medical Centre Nottingham NG7 2UH, UK
*To whom correspondence should be addressed. Tel: +44 115 9709263; Fax: +44 115 9709906; Email: paul@evol.nott.ac.uk
ABSTRACT
Among bacteria, many species have synonymous codon usage patterns that have been influenced by natural selection for those codons that are translated more accurately and/or efficiently. However, in other species selection appears to have been ineffective. Here, we introduce a population genetics-based model for quantifying the extent to which selection has been effective. The approach is applied to 80 phylogenetically diverse bacterial species for which whole genome sequences are available. The strength of selected codon usage bias, S, is found to vary substantially among species; in 30% of the genomes examined, there was no significant evidence that selection had been effective. Values of S are highly positively correlated with both the number of rRNA operons and the number of tRNA genes. These results are consistent with the hypothesis that species exposed to selection for rapid growth have more rRNA operons, more tRNA genes and more strongly selected codon usage bias. For example, Clostridium perfringens, the species with the highest value of S, can have a generation time as short as 7 min.
INTRODUCTION
The frequency of use of alternative synonymous codons varies among species, and often also among genes from a single genome (1–3). The pattern of codon usage in any gene reflects a complex balance among biases generated by mutation, selection and random genetic drift (4–6). Among bacteria, genomic G+C content varies over a wide range, presumably reflecting variation in mutation biases (7), with a major impact on codon usage (8). In addition, three major factors have been found to contribute to codon usage variation among genes within a bacterial genome. First, mutation biases seem to differ between the leading and lagging strands of replication, since genes on the leading strand are often more G+T-rich (9,10). Second, in many species, there is evidence of natural selection on codon usage. Genes expressed at high levels exhibit a bias towards a subset of synonymous codons, which are those most accurately and/or efficiently recognized by the most abundant tRNA species, and the strength of this bias is correlated with the level of gene expression (2,11). Third, there is evidence of extensive horizontal gene transfer among bacteria (12), and genes recently acquired from sources other than close relatives have atypical codon usage. The extent or magnitude of all three factors varies greatly among species. Here, we focus on the manner in which selected codon usage bias varies among bacteria.
The first species in which codon usage was examined in detail, the bacterium Escherichia coli (13,14) and the yeast Saccharomyces cerevisiae (15,16), were both found to show strong evidence of natural selection on codon usage. Subsequently, it has often been assumed that such selection is ubiquitous, at least among unicellular organisms. However, there have been a number of reports of bacterial species exhibiting little or no evidence of selected codon usage bias. Some concern species with extremely A+T- or G+C-rich genomes (17–22), where mutational bias appears to swamp any selected bias. However, in other cases, there is no sign of selection, even though the genomic base composition is not extreme (23,24). In addition, there are species where codon selection has been detected, but the effect seems relatively minor (25–28). It would be useful to be able to quantify the strength of selected codon usage bias in such a way that the results can be compared between species. There are two particular difficulties. First, the extent of bias in the absence of selection varies among species due to mutational biases. Second, many of the codons favoured by selection vary between species, such that the nature of the bias within a set of synonyms for a particular amino acid can be quite different in different species.
To overcome the first of these problems, we use a population genetics model to assess the strength of selected codon usage bias (5), modifying it to take account of background mutation biases. To overcome the second problem, we focus on certain codons that are expected to be translationally advantageous in all bacterial species. For example, the two Phe codons (UUU and UUC) are recognized, through wobble, by a single species of tRNA with the anticodon sequence GAA. While the G at the wobble position may be modified, it appears that the UUC codon is always better recognized and is thus the translationally optimal codon (11).
The extent of selected synonymous codon usage bias might be expected to vary among species dependent on various factors. First, codons are thought to be selected for their effects on the efficiency and accuracy of translation, and ultimately for their effect on bacterial growth rate (30). Bacterial life styles vary markedly, with different species living within nutrient-rich eukaryotic cells but isolated from competitors, or as surface monocultures in oligotrophic external environments, or as complex mixed communities growing either in planktonic log phase within guts or as biofilms on rapidly cycling mucosal surfaces. Some species cross between diverse growth modes and rates, such as passing from terrestrial or aquatic environments to symbiotic relationships with eukaryotes. Thus, the relative importance of efficiency of rapid competitive growth as a component of fitness is likely to vary greatly among species. Second, the selection coefficient for a single synonymous mutation, in a genome with hundreds of thousands of synonymously variable sites, is expected to be extremely small. Then, although bacteria may have extremely large global population sizes, the population structure of the species may be such as to reduce the effective population size to the point where codon selection is less effective. Furthermore, the extent of recombination varies greatly among bacterial species (31), and in those with low recombination rates the linkage among numerous polymorphic synonymous sites on the bacterial chromosome may lead to interference in their selection (32). We consider these various factors in interpreting the variation in the strength of selected codon usage bias among species.
METHODS
Estimation of S
Following Bulmer (5), we can consider the case of an amino acid encoded by two synonyms, C1 and C2. The mutation rate from C1 to C2 is u; and from C2 to C1 is v. The selective difference between the two codons is s: the fitness of the optimal codon C1 is 1, while that of C2 is (1 – s). Under the combined effects of mutation, selection and random genetic drift, the equilibrium frequency (P) of C1 in a gene, or set of genes, is given by:
$$P = \frac{V e^{S}}{V e^{S} + U} \tag{1}$$
where $S = 2N_e s$, $U = 2N_e u$ and $V = 2N_e v$, and $N_e$ is the effective population size.
In genes where selection is strong enough to influence codon usage, the frequency of codons is determined by both the pattern of mutation and the strength of selection. The magnitude of S can be estimated from Equation 1:
$$S = \ln\left(\frac{P\,k}{1 - P}\right) \tag{2}$$
where k = U/V.
In genes where selection is so weak as to be ineffective, the frequency of the codons is determined by the pattern of mutation between them:
$$P = \frac{V}{U + V} \tag{3}$$
This allows the estimation of k = (1 – P)/P for use in Equation 2 above.
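As a brief worked step, the estimator can be rewritten as a log odds ratio. This is a sketch that assumes the equilibrium frequency takes the standard two-codon form of Bulmer's model; the subscripts H (highly expressed genes) and G (genome as a whole) are notation introduced here for illustration and do not appear in the original.

```latex
\begin{align}
\frac{P_H}{1 - P_H} &= \frac{V}{U}\, e^{S} = \frac{e^{S}}{k}
  && \text{(selection effective: highly expressed genes)} \\
k = \frac{U}{V} &= \frac{1 - P_G}{P_G}
  && \text{(selection ineffective: genome as a whole)} \\
S &= \ln\!\left( \frac{P_H}{1 - P_H} \cdot \frac{1 - P_G}{P_G} \right)
\end{align}
```

Under this reading, S is zero when the optimal codon is equally frequent in the two gene sets, and negative when it is rarer in the highly expressed genes than in the genome as a whole.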
This methodology was applied to codons for four amino acids (Phe, Tyr, Ile and Asn) where the nature of codon selection is expected to be the same in all species. For Tyr (codons UAU and UAC; anticodon GUA) and Asn (codons AAU and AAC; anticodon GUU), the situation is analogous to that for Phe described in the Introduction. For Ile, there are three synonyms, but one (AUA) is recognized by a distinct tRNA with the anticodon CAU; the other two synonyms (AUU and AUC) are recognized by a tRNA with anticodon GAU. Here, the AUA codon was ignored (it is often rare) and Ile was treated as if it were analogous to Phe, Tyr and Asn. There are no other amino acids for which it seems clear that the translationally optimal codon is the same in all species. S-values were calculated for each of the four amino acids: the overall value for a species was computed as the average weighted by the number of codons analysed for the highly expressed genes.
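The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the dictionary layout of the codon counts are assumptions, and zero counts (which would make the logarithm undefined) are not handled.

```python
import math

# C-ending codon (taken as optimal) and U-ending codon for each amino acid.
# AUA (Ile) is ignored, as described in the text.
CODON_PAIRS = {
    "Phe": ("UUC", "UUU"),
    "Tyr": ("UAC", "UAU"),
    "Ile": ("AUC", "AUU"),
    "Asn": ("AAC", "AAU"),
}


def s_for_pair(high, genome, optimal, other):
    """S for one amino acid, from codon counts in the two gene sets."""
    p_high = high[optimal] / (high[optimal] + high[other])
    p_genome = genome[optimal] / (genome[optimal] + genome[other])
    k = (1 - p_genome) / p_genome          # mutation-bias term, k = U/V
    return math.log(p_high * k / (1 - p_high))


def estimate_s(high, genome):
    """Average of the four S-values, weighted by the number of codons
    analysed in the highly expressed genes."""
    weighted_sum = 0.0
    total_codons = 0
    for optimal, other in CODON_PAIRS.values():
        n = high[optimal] + high[other]
        weighted_sum += n * s_for_pair(high, genome, optimal, other)
        total_codons += n
    return weighted_sum / total_codons
```

For instance, if the genome uses both codons of every pair equally and the highly expressed genes use UUC three times as often as UUU (with no bias for the other three amino acids and equal codon totals), the Phe component is ln 3 and the weighted average is ln 3 / 4.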
Sequence data
Complete genome sequences of bacterial species were obtained from GenBank release 136 (June 2003). Sequences were extracted using the ACNUC interface (33), and initial codon usage analyses performed using CodonW (34). Base composition statistics (GC3S and GT3S) were calculated as the frequency of these nucleotides at synonymously variable third positions of sense codons, i.e. excluding Met, Trp and termination codons.
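The two base composition statistics can be illustrated as follows. This is a sketch under stated assumptions rather than a reproduction of CodonW: it uses the standard genetic code (so it would need adjusting for species such as the Mollicutes that read UGA as Trp), and the function names and the codon-count dictionary are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Met and Trp have no synonyms; termination codons are not sense codons.
EXCLUDED = {"M", "W", "*"}


def third_position_freq(codon_counts, bases):
    """Frequency of the given bases at synonymously variable third positions."""
    hits = 0
    total = 0
    for codon, n in codon_counts.items():
        codon = codon.upper().replace("U", "T")
        if CODE[codon] in EXCLUDED:
            continue
        total += n
        if codon[2] in bases:
            hits += n
    return hits / total


def gc3s(codon_counts):
    return third_position_freq(codon_counts, "GC")


def gt3s(codon_counts):
    return third_position_freq(codon_counts, "GT")
```

Counts for Met, Trp and termination codons may be present in the input; they are simply skipped.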
In the case of species for which multiple strains have been sequenced, only one representative was selected. In addition, some other pairs of species are no more divergent than strains of a single species. To assess this, the average nucleotide sequence divergence across the genes rplA-C and rpsB-C was estimated. A criterion of at least 4% sequence divergence was used for inclusion of strains. This led to the exclusion of Mycobacterium bovis (0.05% different from Mycobacterium tuberculosis), Shigella flexneri (0.2% different from E.coli K12), Brucella suis (0.3% different from Brucella melitensis), Listeria innocua (1.5% different from L.monocytogenes) and Bacillus cereus (1.6% different from Bacillus anthracis). In contrast, Buchnera aphidicola strains Ap, Bp and Sg differed by 17–26%, and so all three were included. The least divergent pairs of species retained were Xanthomonas axonopodis and Xanthomonas campestris (4.0%) and E.coli and Salmonella enterica typhimurium (4.1%). With the exception of B.aphidicola, the 4% criterion would exclude all cases of multiple strains of a single species: the most divergent were Helicobacter pylori strains 26695 and J99 (3.2%) and Xylella fastidiosa strains 9a5c and Temecula (2.5%). Finally, Streptococcus mutans UA159 was excluded because several genes used in the analysis (see below) were incomplete or missing: the sequence has a deletion between the rplD and rpsS genes, truncating both and deleting the rplB and rplW genes that lie between rplD and rpsS in other Streptococcus species. The final data set included 80 different genomes (Table 1).
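The 4% divergence criterion can be sketched as below. The text does not say how representatives were chosen among close relatives, so the greedy "first listed wins" rule here is an assumption, as are the function names; divergence is taken as the simple proportion of differing sites in a pre-aligned nucleotide alignment, ignoring gapped positions.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.
    Positions with a gap ('-') in either sequence are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared


def keep_divergent(names, aligned, threshold=0.04):
    """Greedy filter (an assumed rule): keep a genome only if it is at least
    `threshold` divergent from every genome already kept."""
    kept = []
    for name in names:
        if all(p_distance(aligned[name], aligned[k]) >= threshold for k in kept):
            kept.append(name)
    return kept
```

Because the filter is order-dependent, the preferred representative of each group should be listed first.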
Table 1 The 80 bacterial genome sequences analysed
To represent genes under the weakest selection, the codon usage of the entire genome was used, on the assumption that the number of genes expressed at high levels is a very small fraction of the genome as a whole. To represent genes where codon usage would be expected to be subject to strong translational selection, codon usage was summed across a set of 40 genes expected to be expressed constitutively at very high levels. This set included the genes encoding translation elongation factors Tu (tufA), Ts (tsf) and G (fusA), and 37 of the larger ribosomal proteins (encoded by genes rplA-rplF, rplI-rplT and rpsB-rpsT). No homologue of rplI was found in Mycoplasma penetrans; in this species rplU was added to the data set. Otherwise, the same 40 genes were used for all species. Many bacteria have two copies of the translation elongation factor Tu gene, although these are usually very similar due to concerted evolution (35), while some species have two or more homologues of fusA or certain ribosomal protein genes. In each case, the gene with the highest S-value was retained.
To assess whether the S-values observed were significantly greater than zero, for each species S-values were also calculated for 1000 sets of genes randomly selected from the genome. For each genome, the set of 40 highly expressed genes contained on average 1000 codons used in the analysis (Table 1). For the random data sets, genes were added until a total of at least 1000 codons were present for the four amino acids analysed. The range of S-values including 95% of these samples was recorded.
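The randomization procedure can be sketched as follows. Several details are assumptions made for illustration: genes are drawn without replacement within each set, the "range including 95% of samples" is taken as the central range between the 2.5th and 97.5th percentiles, and each gene is represented by a dictionary of counts for the eight codons analysed. Zero counts in a sample are not handled.

```python
import math
import random

PAIRS = [("UUC", "UUU"), ("UAC", "UAU"), ("AUC", "AUU"), ("AAC", "AAU")]
CODONS = [c for pair in PAIRS for c in pair]


def pool(genes):
    """Sum the counts of the eight analysed codons over a set of genes."""
    return {c: sum(g.get(c, 0) for g in genes) for c in CODONS}


def weighted_s(sample, genome):
    """Codon-weighted average S of a gene set relative to the genome."""
    num = den = 0.0
    for opt, alt in PAIRS:
        n = sample[opt] + sample[alt]
        p = sample[opt] / n
        p_g = genome[opt] / (genome[opt] + genome[alt])
        num += n * math.log(p / (1 - p) * (1 - p_g) / p_g)
        den += n
    return num / den


def random_s_range(genes, n_sets=1000, min_codons=1000, seed=0):
    """S-values for random gene sets; returns the central 95% range."""
    rng = random.Random(seed)
    genome = pool(genes)
    values = []
    for _ in range(n_sets):
        picked, total = [], 0
        # Add genes in random order until enough codons are present.
        for gene in rng.sample(genes, len(genes)):
            picked.append(gene)
            total += sum(gene.get(c, 0) for c in CODONS)
            if total >= min_codons:
                break
        values.append(weighted_s(pool(picked), genome))
    values.sort()
    return values[int(0.025 * n_sets)], values[int(0.975 * n_sets) - 1]
```

The S-value of the highly expressed genes would then be compared with the upper limit returned by `random_s_range`.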
Phylogenetic analyses
The phylogenetic relationships of the 80 bacterial strains were estimated from a concatenated alignment of the proteins encoded by tuf, rplA-C and rpsB-C. Sequences were aligned using ClustalW (36), and sites with a gap in any sequence were removed. The tree was estimated by the Bayesian method implemented in MrBayesV3.0 (37), using the JTT model of protein evolution (38) with gamma distributed rates across sites. Phylogeny-independent correlations among species characters were estimated using the generalized least squares approach implemented in Continuous (39).
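The alignment preparation step (concatenating the per-protein alignments and removing every site with a gap in any sequence) can be sketched as below. The function names and the dictionary-of-strings representation are assumptions; the alignment itself and the tree estimation would still be done with the programs named above.

```python
def concatenate(alignments):
    """Join per-protein alignments (dicts of name -> aligned sequence)
    into one alignment; every alignment must contain the same names."""
    names = list(alignments[0])
    return {n: "".join(aln[n] for aln in alignments) for n in names}


def strip_gapped_columns(alignment):
    """Remove every column that has a gap ('-') in at least one sequence."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    keep = [i for i in range(length) if all(s[i] != "-" for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}
```

Removing gapped columns after concatenation gives the same result as removing them from each protein alignment first, since the filter acts column by column.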
RESULTS
The strength of selected codon usage bias (S)
The strength of selected codon usage bias (S) was analysed for 80 genomes representing diverse major lineages of bacteria (Table 1 and Figure 1). S was estimated from the codon frequencies in a set of 40 genes expressed at very high levels compared with those in the genome as a whole, with the latter taken as an indication of the frequencies generated by mutation biases in the absence of selection. The analysis focused on four amino acids (Phe, Tyr, Ile and Asn), where the same codon is expected to be translationally advantageous in all species. The components of S for each of the four amino acids were highly correlated across species, and there was no clear indication that the U-ending codon is ever the optimal codon for any of the four amino acids.
Figure 1 Phylogenetic relationships of the 80 bacterial genomes analysed. Species codes are given in Table 1.
Some species have either two chromosomes (i.e. the three Vibrio species, Agrobacterium tumefaciens, Brucella melitensis, Leptospira interrogans and Deinococcus radiodurans) or one or more plasmids of larger than 1 Mb (Ralstonia solanacearum and Sinorhizobium meliloti). In each case, most (if not all) of the 40 genes expressed at high levels reside on just one of these chromosomes. Using the codon usage of genes from only this chromosome, rather than both, as the guide to mutational biases had only a minor impact on the S-values estimated: in all seven cases where both replicons are regarded as chromosomes the value of S was reduced by <3%. The effect was also minor in R.solanacearum, where S changed from 0.02 to –0.06, but more marked in S.meliloti, where the value decreased from 0.64 to 0.53, indicating a small difference in the overall codon usage between the plasmids and the chromosome in this species.
The species analysed here have genomic G+C contents ranging from 22 to 72%. Since bacterial genomes have little non-coding DNA, and the first two positions within codons are constrained by protein-coding requirements, most of the variation is due to the third position of codons. Thus the overall G+C content at synonymously variable third positions (GC3S) ranged from 9 to 93% among the 80 genomes (Table 1). This base composition bias is so pervasive that it can be seen even when considering individual genes: e.g. for dnaA (a conserved gene with low selected codon usage bias), only one species (Xylella fastidiosa) showed a substantial deviation from the general trend, with a surprisingly low third position G+C content (28%) for a genome at 52% (Figure 2). This highlights the potential difficulty in estimating selected codon usage bias. The method used here for estimating S was explicitly designed to take account of genomic mutation biases, and indeed there was no correlation between S and the overall G+C content at synonymously variable third positions of codons (Figure 3). The optimal codons for the four amino acids analysed here are all C-ending, but there was no correlation between the S-value and the difference in GC3S values between the highly expressed gene data set and the genome as a whole; in fact, for 51 species the GC3S value for the highly expressed gene data set was the lower of the two (Table 1). This indicates that in species with high S-values many of the optimal codons for other amino acids are not C- or G-ending.
Figure 2 G+C content at the three codon positions within the dnaA gene, compared with the G+C content of the genome as a whole, for 79 bacterial genomes (no dnaA homologue has been found in W.glossinidia). Positions 1, 2 and 3 are indicated by open circles, open triangles and filled circles, respectively. The third position is strongly influenced by G+C bias; the first two positions are also influenced, implying an effect on amino acid composition (68).
Figure 3 Selected codon usage bias (S) and genomic G+C bias for 80 bacterial species. Genomic G+C bias is estimated by the overall GC3S. Open circles denote species where the S-value is not greater than found among randomly selected genes; filled triangles denote three Clostridium species.
The S-values showed a wide variation among species, ranging from –0.88 to 2.65 (Table 1). In most species, the 95% limits of the distribution of S-values for randomly selected genes were 0.2–0.3 either side of zero. For 24 species (i.e. 30% of the total), the S-value for the highly expressed genes was not as high as the upper 95% limit for the randomly selected genes, providing no immediate evidence that selection has affected codon usage in those genomes.
Negative S-values
The minimum S-values are expected to be around zero, but for five species the S-values were more highly negative than expected for randomly selected genes. This is surprising because the U-ending codons for the four amino acids analysed are unlikely to be translationally advantageous in any species, and the C-ending codons are not expected to be selected against in highly expressed genes. Two factors seem to contribute to these unexpectedly low S-values. First, in many species, there is a replication-dependent compositional skew between the leading and lagging strands, such that the leading strand is more G+T-rich, although the extent of this skew varies greatly among species (10). Most very highly expressed genes lie on the leading strand and so may have reduced frequencies of C-ending codons due to their location rather than because of selection. For example, in X.fastidiosa (S = –0.78), multivariate analysis of codon usage found that the primary source of variation among genes was associated with this strand skew (40): the mean G+T contents (at synonymously variable third positions; GT3S) of leading and lagging strand genes are 0.61 and 0.40, respectively. Of the 40 highly expressed genes analysed here, 37 are encoded by the leading strand. When the highly expressed genes were compared with only those encoded on the leading strand, the S-value was much less highly negative (–0.43). Similarly, in Buchnera aphidicola strain Bp (S = –0.59), the average GT3S is 0.57 and 0.42 for genes on the leading and lagging strands, respectively; when the 34 highly expressed genes lying on the leading strand are compared with other leading strand genes, the S-value is –0.18. By comparison, in the other two B.aphidicola genomes (strains Ap and Sg), the skew between the two strands is much less pronounced, and the S-values are close to zero.
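The effect of the choice of control set can be illustrated with a short sketch that computes S twice, once against all genes and once against leading-strand genes only. The data layout (a strand label paired with a dictionary of counts for the eight codons analysed) and the function names are assumptions for illustration, and the strand of each gene is taken as already known.

```python
import math

PAIRS = [("UUC", "UUU"), ("UAC", "UAU"), ("AUC", "AUU"), ("AAC", "AAU")]


def pool(count_dicts):
    """Sum the counts of the analysed codons over a set of genes."""
    total = {c: 0 for pair in PAIRS for c in pair}
    for counts in count_dicts:
        for codon in total:
            total[codon] += counts.get(codon, 0)
    return total


def weighted_s(high, background):
    """Codon-weighted average S of `high` relative to `background`."""
    num = den = 0.0
    for opt, alt in PAIRS:
        n = high[opt] + high[alt]
        p_h = high[opt] / n
        p_b = background[opt] / (background[opt] + background[alt])
        num += n * math.log(p_h / (1 - p_h) * (1 - p_b) / p_b)
        den += n
    return num / den


def s_with_controls(high_genes, all_genes):
    """all_genes: list of (strand, counts), strand 'leading' or 'lagging'.
    Returns S against the whole genome and against leading-strand genes."""
    high = pool(high_genes)
    s_genome = weighted_s(high, pool(c for _, c in all_genes))
    s_leading = weighted_s(
        high, pool(c for strand, c in all_genes if strand == "leading"))
    return s_genome, s_leading
```

When the leading strand is U-rich at third positions and the highly expressed genes simply share that composition, the first value is negative while the second is close to zero.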
Second, many bacterial genomes contain regions (‘islands’) of unusual base composition, generally inferred to reflect horizontal gene transfer. In Nitrosomonas europaea (S = –0.88), where the average G+C content at synonymously variable third positions (GC3S) was 0.53 for the chromosome as a whole, many of the highly expressed genes lie within two islands with unusually low G+C content: 18 of the 40 genes in the highly expressed data set lie within a region encompassing 27 genes (rpsJ-rpoA, genes 400–426) where the average GC3S is 0.29, while 7 more lie in a cluster of 13 genes (rplL-NE2059, genes 2047–2059) with an average GC3S of 0.34. The S-value for these 25 genes is –1.36. The other 15 genes included in the set of 40 highly expressed genes are scattered around the genome, having an average GC3S of 0.45, and an S-value of –0.23. Horizontal transfer is thought to be rare for ‘informational’ genes, such as those encoding ribosomal proteins (41). However, since both regions include other genes, not expected to be highly expressed but with similarly low GC3S values, and since the highly expressed genes at other locations do not have such low GC3S values, the anomalously low S-values do not appear to be related to selection.
Correlation of selected codon usage bias with rRNA and tRNA gene numbers
The strength of selection on synonymous codon usage is likely to be related to the degree to which speed and efficiency of growth and replication have been important during evolution. To investigate this, we have compared S-values with the numbers of rRNA operons and tRNA genes in each genome. Inter-specific variation in bacterial growth rate appears to be positively correlated with the number of rRNA operons (42). The abundance of different tRNAs is correlated with, and apparently largely determined by, gene copy number (11). The increased gene copy number, and consequent increased relative abundance, of particular tRNA species appears to be part of the strategy for optimizing translational efficiency (43,44). As expected, the numbers of rRNA and tRNA genes were found to be highly correlated in an analysis of 18 bacterial genomes (11). Among the 80 genomes analysed here, rRNA operon and tRNA gene copy numbers vary from 1 to 11, and 28 to 126, respectively (Table 1), and are very highly correlated (Figure 4).
Figure 4 Ribosomal RNA operon copy number and tRNA gene number for 80 bacterial species. Symbols are as in Figure 3.
S-values are positively correlated with both rRNA operon and tRNA gene copy numbers (Figures 5 and 6). The highest S-value of all (2.65) was found in Clostridium perfringens, a genome with 10 rRNA operons and 95 tRNA genes. The species with the largest number of tRNA genes, Vibrio parahaemolyticus, is also among those with the largest number of rRNA operons, and has a high S-value (1.89). All species with >6 rRNA operons, and all species with >70 tRNA genes, have stronger codon usage bias in the highly expressed genes than in randomly selected genes. Among the 30 species with S-values >1, only two have fewer than four rRNA operons, and only two have fewer than 50 tRNA genes. Conversely, a majority of the species with only one rRNA operon, or <40 tRNA genes, show no evidence of selected codon usage bias.
Figure 5 Selected codon usage bias (S) and ribosomal RNA operon copy number for 80 bacterial species. Symbols are as in Figure 3.
Figure 6 Selected codon usage bias (S) and tRNA gene number for 80 bacterial species. Symbols are as in Figure 3.
The strengths of these correlations among rRNA operon numbers, tRNA gene copy numbers and S are overestimated by a simple analysis of the data as presented in Figures 4–6, due to the nonindependence of the data points. The 80 genomes are linked by a phylogenetic tree (Figure 1), and closely related species often share similar numbers of rRNA and tRNA genes, and have similar S-values, which may simply be due to their recent common ancestry. Using an approach to estimate the correlations after removing the effects of shared ancestry (39), the correlation coefficient for rRNA and tRNA gene copy numbers is 0.82, while the correlations between S and rRNA and tRNA gene copy numbers are 0.49 and 0.44, respectively (all values are highly statistically significant). While the phylogenetic relationships shown in Figure 1 are broadly consistent with those derived from analyses of other sequence data sets (45,46), there are some differences, such as Escherichia and Haemophilus being more closely related to each other than to Vibrio and Wigglesworthia lying within the radiation of Buchnera strains (47). However, we found that using alternative trees with such minor differences in topology had very little impact on the magnitude of the correlation coefficients.
DISCUSSION
Previous analyses of codon usage in bacteria have mostly focussed on the analyses of particular species, with no quantitative attempt to compare the strength of selected codon usage bias across different species (a recent exception is discussed below). Some analyses have started from the assumption that there is selected codon usage bias, without testing whether that is indeed the case (48), while others have concluded that ‘codon usage in most bacteria, if not all, is constrained by translation efficiency’ (11). Here, we have described a measure of the strength of selected codon usage bias, S, and a method for testing whether S is larger than expected by chance. The approach should be applicable to all species, and provides a means of comparing the strength of selected codon usage bias among them. We have applied this approach to 80 species. For 30% of these species, there was no evidence of selected codon usage bias, while among the others the value of S ranged widely.
Comparisons with previous analyses of individual species
The archetypal example of a species with strongly selected codon usage bias has been E.coli, where the selective pressure exerted via tRNA relative abundance and anticodon sequence was first elucidated (13,14). The S-value calculated here for E.coli is high (1.49), but 15 other species (20% of the species analysed) have even higher values, indicating more strongly selected codon usage bias. Among these 15 species are 8 members of the Firmicutes (A+T-rich gram positive bacteria), including C.perfringens with the highest value of 2.65. A recent analysis detected the selected codon usage bias in C.perfringens, and also noted that the bias was stronger than in Clostridium acetobutylicum (49); here, the latter species has an S-value of 0.84. In fact, with the exception of the Mollicutes and C.acetobutylicum, all of the Firmicutes have S-values >1.0 (Table 1). An early analysis of one of these species, B.subtilis, concluded (from 56 genes) that the selected codon usage bias was weaker than in E.coli (50), but here the S-value for B.subtilis is 1.36, not substantially different from E.coli.
An early analysis of M.tuberculosis (using 41 genes) reported weak but significant selected codon usage bias (25), and this is confirmed by the S-value of 0.45, compared with 0.26 as the upper limit of the 95% range for randomly selected genes in that species. Analysis of the genome of Thermotoga maritima detected selected codon usage in highly expressed genes, but found this to be a relatively minor source of variation among genes (28). No significant difference in the use of Tyr, Ile or Asn codons was found between genes expressed at high and low levels; and since these are three of the four amino acids used here, it is not surprising that the S-value is very low (0.37), and only just above the value (0.28) for randomly selected genes.
For two other species, where weak selected codon usage bias has been reported, the present analysis yields S-values within the range of randomly selected genes. In Chlamydia trachomatis, the major trend among genes in codon usage is related to strand skew (26). The average GT3S values for genes on the leading and lagging strands are 0.57 and 0.48, respectively. If the 40 genes are compared with only those on the leading strand, the S-value becomes 0.42, indicative of weak selection. Pseudomonas aeruginosa has an S-value close to zero (–0.02), providing no evidence for selection, whereas we previously found small but significant differences in codon usage between highly expressed and other genes (27). This discrepancy arises because the largest components of selected bias in this species relate to codons for Ser (especially UCC), Thr (ACC), Ala (GCU), Arg (CGU) and Gly (GGU), whereas frequencies of the C-ending codons for Phe, Tyr, Ile and Asn (used to calculate S) differ little between highly expressed genes and the genome as a whole (27).
Otherwise, few analyses have commented on the relative strength of selected codon usage bias, except in those cases where it appears to be absent. Evidence of a lack of selected codon usage bias has been reported for Helicobacter pylori (24), Rickettsia prowazekii (18), Treponema pallidum (23), Buchnera strains (21) and Wigglesworthia (22), all of which have S-values close to zero. In addition, an absence of selected codon usage bias has been reported in Borrelia burgdorferi (23,51) and Mycoplasma genitalium (19,20), but for these two species these conclusions have been questioned (52). For B.burgdorferi, there is no sign of selected codon usage bias in the present analysis, since the S-value is negative (–0.31). However, in this species, there is extremely pronounced skew between the chromosome strands: the average GT3S values for genes on the leading and lagging strands are 0.62 and 0.39, respectively. Of the 40 genes in the highly expressed data set, 38 lie on the leading strand, and when these are compared with genes from the leading strand only, the B.burgdorferi value is –0.04, still providing no evidence for selection. For M.genitalium, the possibility of selected codon usage bias was invoked on the grounds that highly expressed genes tend to use more G+C-rich codons (52). Indeed, here the S-value for M.genitalium (0.32) is slightly higher than expected for randomly selected genes. However, it has been shown that the major source of variation among M.genitalium genes is in G+C content, which varies systematically in a wave around the genome, seemingly affecting all genes irrespective of their expression level (19,20). A total of 29 of the 40 highly expressed genes used here lie within the most G+C-rich 40% of the genome. When these 29 genes are compared with the 192 genes in this region, the S-value is lower (0.17), and within the range of values for randomly selected genes from this region. 
This suggests that the minor difference in codon usage between highly expressed genes (in total) and the genome as a whole reflects compositional variation, and provides no evidence for selected codon usage bias in this species.
Streptomyces species are extremely G+C-rich, and this compositional bias was found to dominate codon usage in an early study (17). However, it was noted that tufA (the only unambiguously highly expressed gene sequence then available) had slightly different codon usage that might indicate the action of weak translational selection. Here, Streptomyces coelicolor has an S-value of 0.99. This value is close to that expected for a genome with 6 rRNA operons (Figure 5) and 63 tRNA genes (Figure 6), and all of these features are consistent with moderately strong translational selection. However, the difficulty in interpreting codon usage variation in this species is shown by the unusually broad range of values observed for randomly selected genes (Table 1). Among 1000 randomly selected S.coelicolor data sets, 28 had S-values as large as that for the highly expressed genes. For Streptomyces avermitilis, the S-value is lower (0.69), but again just within the range of values for 95% of randomly selected gene data sets. Overall, it appears that the codon selection in Streptomyces has been marginally effective in overcoming the very strong mutational bias.
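The comparison with random gene sets quoted above comes down to a simple tail fraction, sketched here. The function name is hypothetical, and the reading of the "95% range" as a central range (2.5% of random sets excluded in each tail) is an assumption, although it is consistent with 28 of 1000 sets (2.8%) being described as just within the range.

```python
def upper_tail_fraction(random_values, observed):
    """Fraction of random gene sets with S at least as large as `observed`."""
    return sum(v >= observed for v in random_values) / len(random_values)


def exceeds_central_range(random_values, observed, tail=0.025):
    """True if fewer than `tail` of the random sets reach the observed S."""
    return upper_tail_fraction(random_values, observed) < tail
```

With 28 of 1000 random sets at or above the observed value, the tail fraction is 0.028, which is not below 0.025, so the observed S-value falls inside the central 95% range under this reading.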
Thus, the S-values obtained here are largely consistent with more detailed studies on individual species. However, because S is calculated from only four amino acids, where the choice is always between the translationally optimal C-ending codon and a U-ending codon, intragenomic variations in G+T content can impinge on the value obtained. Since most highly expressed genes lie on the leading (G+T-rich) strand this tends to reduce S, but the size of the effect, reflecting the extent of skew between the strands, varies substantially among species. For example, in E.coli the average GT3S values of genes on the leading and lagging strands are 0.55 and 0.51, respectively, and using only leading strand genes as the control for mutational bias leaves the S-value unaltered. It might be preferable always to use only genes on the leading strand as the control for mutational bias, but for many species this is impracticable because it is difficult to locate the origin and terminus of replication precisely. Furthermore, even closely related strains can show extensive genomic rearrangement, which can confound comparisons of leading and lagging strand genes.
Intragenomic variations in G+C content can also impinge on the value of S. With the exception of M.genitalium (discussed above), intragenomic G+C variation mostly reflects ‘islands’ of atypical base composition. Typically, as many as half of the 40 highly expressed genes examined here are located in a single cluster, and we have noticed that in a number of species this cluster is more A+T-rich than the genome as a whole, tending to reduce the S-value. Islands of atypical base composition are usually explained as the result of horizontal gene transfer, but it is generally not expected that ribosomal protein genes undergo this process. Thus, the reason(s) for this base composition difference warrant further investigation.
These caveats regarding intragenomic variations in base composition serve to emphasise that any automated analysis of codon usage, without some detailed consideration of the variation among genes, may be prone to errors. However, the advantage of calculating S-values by the method described here is that a uniform approach can be used for all species, enabling comparisons among them.
Variation among bacteria in the strength of selected codon usage bias
At a biochemical level, the C-ending codons for Phe, Tyr, Ile and Asn are expected to be translationally optimal in all bacteria, but the wide range of S-values observed (Table 1) indicates that the strength and/or efficacy of selection for these optimal codons has varied considerably among species. The strength of selected codon usage bias, as estimated by S, is highly correlated with the number of rRNA operons and the number of tRNA genes. We expect that codon usage will have been more strongly selected in species which replicate fast. Information regarding the growth rate of bacteria in the wild is sparse, and so we have used the number of rRNA operons as a (very approximate) guide to the growth rate of species. Remarkably, C.perfringens, the species with the highest S-value (2.65) and 10 rRNA operons, can grow with a generation time under 7 min in specific laboratory conditions (55). In contrast, Mycobacterium species are renowned for their very slow growth: M.tuberculosis and M.leprae have generation times of 1 and 14 days, respectively. Both species have one rRNA operon and low S-values (0.5). These observations are consistent with the effects of selection for efficiency of translation under rapid and competitive growth conditions; on this view, the lack of selected codon usage bias in some species would reflect the relative unimportance of an exponential growth phase during their life history.
Alternatively, a lack of selected codon usage bias may reflect the greater impact of random genetic drift, due to a population structure with a low long-term effective population size and/or interference between linked synonymous sites due to a lack of recombination. For most species, it is difficult to know the long-term evolutionary effective population size relevant to codon usage. For example, M.tuberculosis currently infects many more people worldwide than M.leprae, such that the former is likely to have much the larger ongoing effective population size. However, M.tuberculosis exhibits little genetic diversity (56) and is thought to be a recently emerged clone from M.canetti (57); this evolutionary bottleneck would have reduced the effective population size of M.tuberculosis. But even this may have little relevance: in the same way that it is thought that the codon usage of horizontally transferred genes may take many millions of years to ameliorate to that of a new host genome (58), strongly selectively biased codon usage may take a very long time to decay after a reduction in effective population size, i.e. the codon usage bias currently observed may still be due in some part to evolutionary processes that occurred millions of years ago. The two Mycobacterium species currently have similar levels of selected codon usage bias.
Nevertheless, it seems clear that the life histories of some of the bacteria analysed are likely to lead to low effective population sizes. Many of the species with very low S-values are obligate intracellular parasites or endosymbionts: these include species in the genera Buchnera, Wigglesworthia, Coxiella, Rickettsia and Tropheryma, the Mollicutes (Mycoplasma plus Ureaplasma) as well as the four Chlamydiales. Among these 18 species, all have S-values <0.5, and only the Mollicutes have values >0.2, which are marginally higher than expected from randomly selected genes. Most have reduced genome sizes (<1000 genes), all have only 1 or 2 rRNA operons, and most have <40 tRNA genes (Table 1). For example, Buchnera and Wigglesworthia are obligate endosymbionts of insects, with low effective population sizes (due to bottlenecks during their transmission) and limited recombination. It has been noted that, as well as an absence of selected codon usage bias, these species have rapid evolutionary rates, presumably reflecting the enhanced power of random genetic drift (21). In contrast, all of the bacteria with high S-values (say, >1.5) live outside host cells, typically in mixed environments, such as soil, water or the intestinal tracts of animals. Thus, this difference between an intracellular parasitic lifestyle and an extracellular existence appears to be a pervasive influence on S among the species included in this analysis.
A lack of recombination would be expected to impair the efficacy of selection on codon usage. Many of the intracellular parasitic species, noted above for their low S-values, are known or expected to be effectively clonal. Additionally, the primarily extracellular pathogenic spirochaete B.burgdorferi is extremely clonal (59) and has S near zero. In contrast, Streptococcus pneumoniae, Streptococcus pyogenes and Staphylococcus aureus all appear to have undergone high rates of recombination (60), and have high S-values (Table 1). However, E.coli and Haemophilus influenzae also have high S-values, despite apparently lower rates of recombination (60). It is clear that a high recombination rate alone is not enough to promote codon selection: H.pylori has perhaps the highest rate of recombination known among bacteria (61), and yet an S-value close to zero. In this case, the lack of selected codon usage bias has been interpreted as a consequence of the unimportance of competitive growth in the isolated acidic niche of this species (24).
Overall, it is difficult to disentangle the effects of low effective population size and a lack of recombination from the other aspects of these organisms' lifestyles discussed above. For example, among the spirochaetes, two (B.burgdorferi and T.pallidum) have S-values close to zero, whereas the third (L.interrogans) has a somewhat higher value (0.67). Both B.burgdorferi and T.pallidum are obligate parasites and grow slowly, whereas L.interrogans is a facultative parasite with many saprophytic relatives, is more metabolically versatile and can grow more rapidly. The stronger selected codon usage bias in L.interrogans appears to reflect this difference in lifestyles, although interestingly it is not accompanied by an increase in rRNA or tRNA gene number.
The correlations between S and rRNA and tRNA gene copy numbers are sufficiently strong that it is interesting to examine the outliers. For example, values for the three Clostridium species are highlighted in Figures 4–6. The S-value for C.acetobutylicum (0.84) is surprisingly low for a genome with 11 rRNA operons (Figure 5). It is similar to that of Clostridium tetani (1.00), with only 6 rRNA operons, but much lower than that of C.perfringens (2.65), a genome with 10 rRNA operons. However, the S-value for C.acetobutylicum is not unusual for a genome with 73 tRNA genes (Figure 6). Thus, it seems to be the high number of rRNA operons in C.acetobutylicum that is anomalous; this may reflect a very recent expansion in this gene family.
Perhaps the most surprising example of low codon usage bias is P.aeruginosa. This species can grow quite rapidly (doubling times <1 h) in laboratory planktonic cultures and is metabolically highly versatile. It is moderately recombinogenic via plasmid transfer, and there appear to be many horizontally transferred genes in its genome (27). The low selected bias was apparent in a full analysis of codon usage in this species (27), as well as the S-value calculated here. Selected codon usage bias is rather stronger in the two other Pseudomonas species analysed (Table 1). These paradoxical observations perhaps highlight our ignorance of the evolutionary history of even ‘well-known’ bacterial species.
Comparison with another estimate of S
Recently, another approach to estimating the strength of selected codon usage bias in a genome has been published by dos Reis and co-workers (62). These authors calculated two indices of codon usage bias. The first, based on the effective number of codons used in a gene (63), attempted to estimate the strength of general deviation from random codon usage in a gene. The second was a modification of the codon adaptation index, CAI (64), using tRNA gene copy number (as a surrogate for tRNA abundance) and the estimated strength of codon–anticodon interaction to assign fitness values to codons; the tRNA adaptation index for a gene was calculated as the average of these fitness values, as an attempt to estimate the adaptation of a gene's codon usage to the tRNA pool of the species. It was suggested that the strength of translationally selected codon usage bias, S (here termed St to distinguish it from S described above), could be estimated from the magnitude of the correlation between these two indices; the significance of St was estimated from a permutation test.
Dos Reis et al. (62) applied this methodology to 101 bacterial genomes, including 66 of those analysed here as well as another 20 genomes excluded here because of their close relationship to other strains. The St method found significant evidence for selection in only 26% of bacterial genomes analysed. Among the 66 species common to both analyses, S- and St-values are significantly correlated (coefficient = 0.46); 14 species were found to have significant evidence for selection in both analyses and 18 were found to lack such evidence in both analyses (Figure 7). However, 32 species found here to have significant S-values were not significant in the St analysis. These included a number of species where previous analyses have found clear evidence of selectively biased codon usage in highly expressed genes, such as B.subtilis (50,65), C.acetobutylicum (49) and Vibrio cholerae (40). Most strikingly, C.perfringens had the highest S-value among the 80 species analysed here, and yet was not significant in the St test; detailed analysis of codon usage in this species has revealed strongly selected bias in highly expressed genes (49).
Figure 7 Comparison of two estimates of selected codon usage bias: x-axis values are taken from this paper, y-axis values from dos Reis et al. (62). Values significantly greater than zero in the dos Reis et al. analysis are shown as circles; values significantly greater than zero in our analysis are shown as filled symbols.
Interestingly, two species found here not to have significant S-values, Neisseria meningitidis and Bacteroides thetaiotaomicron, were significant in the St test. Closer examination of these species revealed that, in both, the primary trends in codon usage variation among genes were associated with leading versus lagging strand composition bias and G+C content, but there was evidence for weak selected codon usage bias in highly expressed genes. Overall, it appears that the estimation of S described here is generally much more effective than the St test at detecting translationally selected codon usage bias, even though S can sometimes be reduced by compositional biases. One difference between the two approaches should be noted. The method described here asks how strong the selected bias is in a specified set of very highly expressed genes, but not how many genes exhibit selected bias. The dos Reis et al. method aimed at quantifying the extent to which variation among genes across the genome as a whole can be explained as adaptation to the tRNA pool of the species. Given this difference, further comparison of the results of the two methods may shed additional light on the causes of selected codon usage bias.
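The counts quoted for the 66 species common to both analyses can be laid out as a two-by-two tally, which makes it easy to check that they account for every shared species. The short sketch below uses only the four counts given in the text; the variable names are illustrative.

```python
# Keys are (significant by S, significant by St); counts are those quoted in the text.
counts = {
    (True, True): 14,    # significant in both analyses
    (True, False): 32,   # significant by S only
    (False, True): 2,    # significant by St only
    (False, False): 18,  # not significant in either analysis
}

total = sum(counts.values())
s_significant = sum(n for (s, _), n in counts.items() if s)
st_significant = sum(n for (_, st), n in counts.items() if st)
concordant = counts[(True, True)] + counts[(False, False)]

print(total, s_significant, st_significant, concordant)
```

The four cells sum to the 66 shared species, and the two methods reach the same verdict for just under half of them.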
Solving the riddle of codon usage preferences?
In their analysis dos Reis et al. included a small number of eukaryote genomes, as well as archaeal and bacterial species. They found that variation in the strength of codon usage bias among species was highly positively correlated with genome size and tRNA gene copy number (except in very large genomes), and concluded that these two factors ‘ultimately determine the action of natural selection’ on codon usage (62). They proposed a model whereby, from an ancestral bacterium with a small genome size, increases in genome size led to increases in tRNA gene copy number, which in turn led to selection for the optimization of codon usage. However, we find that genome size does not seem to determine tRNA gene copy number (among bacteria, at least), while it seems inappropriate to consider codon bias as the result of tRNA gene copy number. In contrast, we suggest that it is the biology of the organism (its ‘lifestyle’) that determines whether codon usage is affected by natural selection.
The overall results of dos Reis et al. were heavily influenced by the inclusion of eukaryote species, which contributed disproportionately to the variation in both genome size and tRNA gene number. Although there is a positive correlation between genome size and tRNA gene number among the 80 bacterial species examined here, this seems to be due only to species with small genomes. (Note that dos Reis et al. considered genome size in terms of DNA content, whereas we have used the estimated number of protein-coding genes; however, these two measures are extremely highly correlated among bacteria and so this difference should have no impact.) Among the larger bacterial genomes (e.g. the 42 species with >2500 genes), there is no significant correlation between genome size and tRNA copy number. For example, 10 of the 11 species with >5000 genes have <75 tRNA genes, while 10 of the 11 species with >75 tRNA genes have <5000 protein-coding genes; the single exception is B.anthracis with 5311 genes and 95 tRNA genes (Table 1). Thus, increases in genome size do not generally involve an increase in the number of tRNA genes. The forces that have led to reduced genome size (e.g. in Buchnera, Rickettsia and Mycoplasma species) may have impacted on tRNA gene copy number directly, but it seems more likely that these evolutionary pressures reflect the adoption of a lifestyle (typically intracellular parasitism), in which rapid replication was not advantageous (or perhaps even detrimental) and thus translational efficiency became less important, and additional tRNA genes became unnecessary.
It seems inappropriate to consider codon usage bias as simply being caused by tRNA abundances, since both factors are likely to co-evolve in response to selection for translational efficiency (44,66). Indeed, it is possible to consider circumstances where changes in codon usage bias, perhaps brought about by a change in the genome wide mutational bias, could select for a change in the tRNA pool (67). Thus, while we find correlations across species in the numbers of rRNA operons and tRNA genes, and the strength of selected codon usage bias, we do not invoke a causal relationship among any of these factors; rather, we take all three as indicative of the need for rapid and efficient bacterial growth.
ACKNOWLEDGEMENTS
We are very grateful to Michael Bulmer for discussion of his population genetic model of codon usage bias, and to Manolo Gouy and colleagues in Lyon for providing the ACNUC interface to GenBank. We also thank Mario dos Reis for discussion of his recent paper. This work was supported in part by studentships from the MRC (to R.J.G.) and the University of Nottingham (to J.F.P.). Funding to pay the Open Access publication charges for this article was provided by The University of Nottingham.
REFERENCES
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M., Mercier, R. (1981) Codon catalog usage is a genome strategy modulated for gene expressivity Nucleic Acids Res., 8, r43–r74 .
Ikemura, T. (1985) Codon usage and tRNA content in unicellular and multicellular organisms Mol. Biol. Evol., 2, 13–34 .
Sharp, P.M., Cowe, E., Higgins, D.G., Shields, D.C., Wolfe, K.H., Wright, F. (1988) Codon usage in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity Nucleic Acids Res., 16, 8207–8211 .
Sharp, P.M. and Li, W.-H. (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms J. Mol. Evol., 24, 28–38 .
Bulmer, M. (1991) The selection-mutation-drift theory of synonymous codon usage Genetics, 129, 897–907 .
Sharp, P.M., Stenico, M., Peden, J.F., Lloyd, A.T. (1993) Codon usage: mutational bias, translational selection, or both? Biochem. Soc. Trans., 21, 835–841 .
Sueoka, N. (1962) On the genetic basis of variation and heterogeneity of DNA base composition Proc. Natl Acad. Sci. USA, 48, 582–592 .
Muto, A. and Osawa, S. (1987) The guanine and cytosine content of genomic DNA and bacterial evolution Proc. Natl Acad. Sci. USA, 84, 166–169 .
Lobry, J.R. (1996) Asymmetric substitution patterns in the two DNA strands of bacteria Mol. Biol. Evol., 13, 660–665 .
McLean, M.J., Devine, K.M., Wolfe, K.H. (1997) Base composition skews, replication orientation, and gene orientation in 12 prokaryotic genomes J. Mol. Evol., 47, 691–696 .
Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T. (1999) Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis Gene, 238, 143–155 .
Ochman, H., Lawrence, J.G., Groisman, E.A. (2000) Lateral gene transfer and the nature of bacterial innovation Nature, 405, 299–304 .
Post, L.E. and Nomura, M. (1980) DNA sequences from the str operon of Escherichia coli J. Biol. Chem., 255, 4660–4666 .
Ikemura, T. (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes J. Mol. Biol., 146, 1–21 .
Ikemura, T. (1982) Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes J. Mol. Biol., 158, 573–597 .
Bennetzen, J.L. and Hall, B.D. (1982) Codon selection in yeast J. Biol. Chem., 257, 3026–3031 .
Wright, F. and Bibb, M.J. (1992) Codon usage in the G+C-rich Streptomyces genome Gene, 113, 55–65 .
Andersson, S.G.E. and Sharp, P.M. (1996) Codon usage and base composition in Rickettsia prowazekii J. Mol. Evol., 42, 525–536 .
Kerr, A.R.W., Peden, J.F., Sharp, P.M. (1997) Systematic base composition variation around the genome of Mycoplasma genitalium, but not Mycoplasma pneumoniae Mol. Microbiol., 25, 1177–1179 .
McInerney, J.O. (1997) Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns Microb. Comp. Genomics, 2, 1–10 .
Wernegreen, J.J. and Moran, N.A. (1999) Evidence for genetic drift in endosymbionts (Buchnera): analyses of protein-coding genes Mol. Biol. Evol., 16, 83–97 .
Herbeck, J.T., Wall, D.P., Wernegreen, J.J. (2003) Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia Microbiology, 149, 2585–2596 .
Lafay, B., Lloyd, A.T., McLean, M.J., Devine, K.M., Sharp, P.M., Wolfe, K.H. (1999) Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases Nucleic Acids Res., 27, 1642–1649 .
Lafay, B., Atherton, J.C., Sharp, P.M. (2000) Absence of translationally selected codon usage bias in Helicobacter pylori Microbiology, 146, 851–860 .
Andersson, S.G.E. and Sharp, P.M. (1996) Codon usage in the Mycobacterium tuberculosis complex Microbiology, 142, 915–925 .
Romero, H., Zavala, A., Musto, H. (2000) Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces Nucleic Acids Res., 28, 2084–2090 .
Grocock, R.J. and Sharp, P.M. (2002) Synonymous codon usage in Pseudomonas aeruginosa PAO1 Gene, 289, 131–139 .
Zavala, A., Naya, H., Romero, H., Musto, H. (2002) Trends in codon and amino acid usage in Thermotoga maritima J. Mol. Evol., 54, 563–568 .
Arnold, H.H. and Keith, G. (1977) The nucleotide sequence of phenylalanine tRNA from Bacillus subtilis Nucleic Acids Res., 4, 2821–2829 .
Kurland, C.G. (1987) Strategies for efficiency and accuracy in gene expression. 1. The major codon preference: a growth optimization strategy Trends Biochem. Sci., 12, 126–128 .
Maynard Smith, J., Smith, N.H., O'Rourke, M., Spratt, B.G. (1993) How clonal are bacteria? Proc. Natl Acad. Sci. USA, 90, 4384–4388 .
McVean, G.A.T. and Charlesworth, B. (2000) The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation Genetics, 155, 929–944 .
Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC—a portable retrieval system for nucleic acid sequence databases: logical and physical design and usage Comp. Appl. Biosci., 1, 167–172 .
Peden, J.F. (1999) Analysis of codon usage PhD Thesis, University of Nottingham, UK .
Sharp, P.M. (1991) Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position and concerted evolution J. Mol. Evol., 33, 23–33 .
Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence-weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res., 22, 4673–4680 .
Huelsenbeck, J.P. and Ronquist, F. (2001) MRBAYES: Bayesian inference of phylogeny Bioinformatics, 17, 754–755 .
Jones, D.T., Taylor, W.R., Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci., 8, 275–282 .
Pagel, M. (1999) Inferring the historical patterns of biological evolution Nature, 401, 877–884 .
Grocock, R.J. (2003) Evolution of codon usage among the gamma Proteobacteria PhD Thesis, University of Nottingham, UK .
Jain, R., Rivera, M., Lake, J.A. (1999) Horizontal gene transfer among genomes: the complexity hypothesis Proc. Natl Acad. Sci. USA, 96, 3801–3806 .
Klappenbach, J.A., Dunbar, J.M., Schmidt, T.M. (2000) rRNA operon copy number reflects ecological strategies of bacteria Appl. Environ. Microbiol., 66, 1328–1333 .
Ehrenberg, M. and Kurland, C.G. (1984) Costs of accuracy determined by a maximal growth rate constraint Q. Rev. Biophys., 17, 45–82 .
Berg, O.G. and Kurland, C.G. (1997) Growth rate-optimised tRNA abundance and codon usage J. Mol. Biol., 270, 544–550 .
Olsen, G.J., Woese, C.R., Overbeek, R. (1994) The winds of (evolutionary) change—breathing new life into microbiology J. Bacteriol., 176, 1–6 .
Haubold, B. and Wiehe, T. (2004) Comparative genomics: methods and applications Naturwissenschaften, 91, 405–421 .
Wernegreen, J.J., Degnan, P.H., Lazarus, A.B., Palacios, C., Bordenstein, S.R. (2003) Genome evolution in an insect cell: distinct features of an ant-bacterial partnership Biol. Bull., 204, 221–231 .
Karlin, S. and Mrazek, J. (2000) Predicted highly expressed genes of diverse prokaryotic genomes J. Bacteriol., 182, 5238–5250 .
Musto, H., Romero, H., Zavala, A. (2003) Translational selection is operative for synonymous codon usage in Clostridium perfringens and Clostridium acetobutylicum Microbiology, 149, 855–863 .
Shields, D.C. and Sharp, P.M. (1987) Synonymous codon usage in Bacillus subtilis reflects both translational selection and mutational biases Nucleic Acids Res., 15, 8023–8040 .
McInerney, J.O. (1998) Replicational and transcriptional selection on codon usage in Borrelia burgdorferi Proc. Natl Acad. Sci. USA, 95, 10698–10703 .
Perriere, G. and Thioulouse, J. (2002) Use and misuse of correspondence analysis in codon usage studies Nucleic Acids Res., 30, 4548–4555 .
Simpson, A.J.G., Reinach, F.C., Arruda, P., Abreu, F.A., Acencio, M., Alvarenga, R., Alves, L.M.C., Araya, J.E., Baia, G.S., Baptista, C.S., et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa Nature, 406, 151–159 .
Van Sluys, M.A., de Oliveira, M.C., Monteiro-Vitorello, C.B., Miyaki, C.Y., Furlan, L.R., Camargo, L.E.A., da Silva, A.C.R., Moon, D.H., Takita, M.A., Lemos, E.G.M., et al. (2003) Comparative analysis of the complete genome sequences of Pierce's disease and citrus variegated chlorosis strains of Xylella fastidiosa J. Bacteriol., 185, 1018–1026 .
Labbe, R.G. and Huang, T.H. (1995) Generation times and modeling of enterotoxin-positive and enterotoxin-negative strains of Clostridium perfringens in laboratory media and ground beef J. Food Prot., 58, 1303–1306 .
Sreevatsan, S., Pan, X., Stockbauer, K.E., Connell, N.D., Kreiswirth, B.N., Whittam, T.S., Musser, J.M. (1997) Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination Proc. Natl Acad. Sci. USA, 94, 9869–9874 .
Fabre, M., Koeck, J.-L., Le Fleche, P., Simon, F., Herve, V., Vergnaud, G., Pourcel, C. (2004) High genetic diversity revealed by variable-number tandem repeat genotyping and analysis of hsp65 gene polymorphism in a large collection of "Mycobacterium canetti" strains indicates that the Mycobacterium tuberculosis complex is a recently emerged clone of "M. canetti" J. Clin. Microbiol., 42, 3248–3255 .
Lawrence, J.G. and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange J. Mol. Evol., 44, 383–397 .
Dykhuizen, D.E., Polin, D.S., Dunn, J.J., Wilske, B., Preac-Mursic, V., Dattwyler, R.J., Luft, B.J. (1993) Borrelia burgdorferi is clonal: implications for taxonomy and vaccine development Proc. Natl Acad. Sci. USA, 90, 10163–10167 .
Feil, E.J., Holmes, E.C., Bessen, D.E., Chan, M.-S., Day, N.J.P., Enright, M.C., Goldstein, R., Hood, D.W., Kalia, A., Moore, C.E., Zhou, J., Spratt, B.G. (2001) Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences Proc. Natl Acad. Sci. USA, 98, 182–187 .
Suerbaum, S., Maynard Smith, J., Bapumia, K., Morelli, G., Smith, N.H., Kunstmann, E., Dyrek, I., Achtman, M. (1998) Free recombination within Helicobacter pylori Proc. Natl Acad. Sci. USA, 95, 12619–12624 .
Dos Reis, M., Savva, R., Wernisch, L. (2004) Solving the riddle of codon usage preferences: a test for translational selection Nucleic Acids Res., 32, 5036–5044 .
Wright, F. (1990) The ‘effective number of codons’ used in a gene Gene, 87, 23–29 .
Sharp, P.M. and Li, W.-H. (1987) The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications Nucleic Acids Res., 15, 1281–1295 .
Moszer, I., Rocha, E.P.C., Danchin, A. (1999) Codon usage and lateral gene transfer in Bacillus subtilis Curr. Opin. Microbiol., 2, 524–528 .
Bulmer, M. (1987) Co-evolution of codon usage and transfer RNA abundance Nature, 325, 728–730 .
Shields, D.C. (1990) Switches in species-specific codon preferences: the influence of mutation biases J. Mol. Evol., 31, 71–80 .
Gu, X., Hewett-Emmett, D., Li, W.-H. (1998) Directional mutational pressure affects the amino acid composition of proteins in bacteria Genetica, 102/103, 383–391 .
(Paul M. Sharp*, Elizabeth Bailes, Russel)
*To whom correspondence should be addressed. Tel: +44 115 9709263; Fax: +44 115 9709906; Email: paul@evol.nott.ac.uk
INTRODUCTION
The frequency of use of alternative synonymous codons varies among species, and often also among genes from a single genome (1–3). The pattern of codon usage in any gene reflects a complex balance among biases generated by mutation, selection and random genetic drift (4–6). Among bacteria, genomic G+C content varies over a wide range, presumably reflecting variation in mutation biases (7), with a major impact on codon usage (8). In addition, three major factors have been found to contribute to codon usage variation among genes within a bacterial genome. First, mutation biases seem to differ between the leading and lagging strands of replication, since genes on the leading strand are often more G+T-rich (9,10). Second, in many species, there is evidence of natural selection on codon usage. Genes expressed at high levels exhibit a bias towards a subset of synonymous codons, which are those most accurately and/or efficiently recognized by the most abundant tRNA species, and the strength of this bias is correlated with the level of gene expression (2,11). Third, there is evidence of extensive horizontal gene transfer among bacteria (12), and genes recently acquired from sources other than close relatives have atypical codon usage. The extent or magnitude of all three factors varies greatly among species. Here, we focus on the manner in which selected codon usage bias varies among bacteria.
The first species in which codon usage was examined in detail, the bacterium Escherichia coli (13,14) and the yeast Saccharomyces cerevisiae (15,16), were both found to show strong evidence of natural selection on codon usage. Subsequently, it has often been assumed that such selection is ubiquitous, at least among unicellular organisms. However, there have been a number of reports of bacterial species exhibiting little or no evidence of selected codon usage bias. Some concern species with extremely A+T- or G+C-rich genomes (17–22), where mutational bias appears to swamp any selected bias. However, in other cases, there is no sign of selection, even though the genomic base composition is not extreme (23,24). In addition, there are species where codon selection has been detected, but the effect seems relatively minor (25–28). It would be useful to be able to quantify the strength of selected codon usage bias in such a way that the results can be compared between species. There are two particular difficulties. First, the extent of bias in the absence of selection varies among species due to mutational biases. Second, many of the codons favoured by selection vary between species, such that the nature of the bias within a set of synonyms for a particular amino acid can be quite different in different species.
To overcome the first of these problems, we use a population genetics model to assess the strength of selected codon usage bias (5), modifying it to take account of background mutation biases. To overcome the second problem, we focus on certain codons that are expected to be translationally advantageous in all bacterial species. For example, the two Phe codons (UUU and UUC) are recognized, through wobble, by a single species of tRNA with the anticodon sequence GAA. While the G at the wobble position may be modified, it appears that the UUC codon is always better recognized and thus the translationally optimal codon (11).
The extent of selected synonymous codon usage bias might be expected to vary among species dependent on various factors. First, codons are thought to be selected for their effects on the efficiency and accuracy of translation, and ultimately for their effect on bacterial growth rate (30). Bacterial life styles vary markedly, with different species living within nutrient-rich eukaryotic cells but isolated from competitors, or as surface monocultures in oligotrophic external environments, or as complex mixed communities growing either in planktonic log phase within guts or as biofilms on rapidly cycling mucosal surfaces. Some species cross between diverse growth modes and rates, such as passing from terrestrial or aquatic environments to symbiotic relationships with eukaryotes. Thus, the relative importance of efficiency of rapid competitive growth as a component of fitness is likely to vary greatly among species. Second, the selection coefficient for a single synonymous mutation, in a genome with hundreds of thousands of synonymously variable sites, is expected to be extremely small. Then, although bacteria may have extremely large global population sizes, the population structure of the species may be such as to reduce the effective population size to the point where codon selection is less effective. Furthermore, the extent of recombination varies greatly among bacterial species (31), and in those with low recombination rates the linkage among numerous polymorphic synonymous sites on the bacterial chromosome may lead to interference in their selection (32). We consider these various factors in interpreting the variation in the strength of selected codon usage bias among species.
METHODS
Estimation of S
Following Bulmer (5), we can consider the case of an amino acid encoded by two synonyms, C1 and C2. The mutation rate from C1 to C2 is u; and from C2 to C1 is v. The selective difference between the two codons is s: the fitness of the optimal codon C1 is 1, while that of C2 is (1 – s). Under the combined effects of mutation, selection and random genetic drift, the equilibrium frequency (P) of C1 in a gene, or set of genes, is given by:
$$P = \frac{V e^{S}}{V e^{S} + U} \qquad (1)$$
where $S = 2N_e s$, $U = 2N_e u$ and $V = 2N_e v$, and $N_e$ is the effective population size.
In genes where selection is strong enough to influence codon usage, the frequency of codons is determined by both the pattern of mutation and the strength of selection. The magnitude of S can be estimated from Equation 1:
$$S = \ln\left(\frac{kP}{1 - P}\right) \qquad (2)$$
where k = U/V.
In genes where selection is so weak as to be ineffective, the frequency of the codons is determined by the pattern of mutation between them:
$$P = \frac{V}{U + V} = \frac{1}{1 + k} \qquad (3)$$
This allows the estimation of k = (1 – P)/P for use in Equation 2 above.
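The two-step estimate for a single amino acid can be sketched as below. This is a minimal illustration, assuming that Equation 2 has the form S = ln[kP/(1 – P)], which is the form implied by k = (1 – P)/P when S = 0. The function names and the codon counts are hypothetical and are not taken from Table 1.

```python
import math


def mutation_ratio(n_opt_genome: int, n_alt_genome: int) -> float:
    """Estimate k = U/V as (1 - P)/P from genome-wide codon counts.

    With P = n_opt / (n_opt + n_alt), (1 - P)/P reduces to n_alt / n_opt.
    """
    if n_opt_genome <= 0 or n_alt_genome < 0:
        raise ValueError("genome counts must be positive for the optimal codon")
    return n_alt_genome / n_opt_genome


def selection_strength(n_opt_high: int, n_alt_high: int, k: float) -> float:
    """Estimate S = ln(k * P / (1 - P)) from counts in highly expressed genes.

    P / (1 - P) reduces to n_opt / n_alt, so no explicit frequency is needed.
    """
    if n_opt_high <= 0 or n_alt_high <= 0 or k <= 0:
        raise ValueError("counts and k must all be positive")
    return math.log(k * n_opt_high / n_alt_high)


# Hypothetical Phe counts: genome as a whole (UUC, UUU), then highly expressed genes.
k_phe = mutation_ratio(4000, 6000)           # k = 1.5
s_phe = selection_strength(300, 100, k_phe)  # ln(4.5), roughly 1.5
```

If the highly expressed genes have the same codon frequencies as the genome as a whole, the estimate is zero, which is the behaviour expected in the absence of effective selection.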
This methodology was applied to codons for four amino acids (Phe, Tyr, Ile and Asn) where the nature of codon selection is expected to be the same in all species. For Tyr (codons UAU and UAC; anticodon GUA) and Asn (codons AAU and AAC; anticodon GUU), the situation is analogous to that for Phe described in the Introduction. For Ile, there are three synonyms, but one (AUA) is recognized by a distinct tRNA with the anticodon CAU; the other two synonyms (AUU and AUC) are recognized by a tRNA with anticodon GAU. Here, the AUA codon was ignored (it is often rare) and Ile was treated as if it were analogous to Phe, Tyr and Asn. There are no other amino acids for which it seems clear that the translationally optimal codon is the same in all species. S-values were calculated for each of the four amino acids: the overall value for a species was computed as the average weighted by the number of codons analysed for the highly expressed genes.
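The weighted average over the four amino acids can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: sequences are assumed to be in-frame coding sequences, the per-amino-acid S is assumed to be ln[kP/(1 – P)], and an amino acid with a zero count in either data set is simply skipped, a choice the text does not specify. All function names are hypothetical.

```python
import math
from collections import Counter

# (optimal C-ending codon, U-ending synonym), written in the DNA alphabet.
# The Ile codon AUA (ATA) is ignored, as described in the text.
CODON_PAIRS = {
    "Phe": ("TTC", "TTT"),
    "Tyr": ("TAC", "TAT"),
    "Ile": ("ATC", "ATT"),
    "Asn": ("AAC", "AAT"),
}


def count_codons(cds_list):
    """Count codons across in-frame coding sequences (DNA or RNA alphabet)."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def weighted_s(high_counts, genome_counts):
    """Average the per-amino-acid S, weighted by codons in the highly expressed set."""
    weighted_sum = 0.0
    total_weight = 0
    for opt, alt in CODON_PAIRS.values():
        h_opt, h_alt = high_counts.get(opt, 0), high_counts.get(alt, 0)
        g_opt, g_alt = genome_counts.get(opt, 0), genome_counts.get(alt, 0)
        if min(h_opt, h_alt, g_opt, g_alt) == 0:
            continue  # S is undefined for this amino acid (assumed handling)
        k = g_alt / g_opt                      # mutation bias from the genome
        s = math.log(k * h_opt / h_alt)        # per-amino-acid S
        n = h_opt + h_alt                      # weight: codons analysed
        weighted_sum += n * s
        total_weight += n
    if total_weight == 0:
        raise ValueError("no amino acid had usable counts")
    return weighted_sum / total_weight
```

Weighting by the number of codons in the highly expressed set means that an amino acid that is rare in those genes, and therefore gives a noisier estimate, contributes less to the species value.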
Sequence data
Complete genome sequences of bacterial species were obtained from GenBank release 136 (June 2003). Sequences were extracted using the ACNUC interface (33), and initial codon usage analyses performed using CodonW (34). Base composition statistics (GC3S and GT3S) were calculated as the frequency of these nucleotides at synonymously variable third positions of sense codons, i.e. excluding Met, Trp and termination codons.
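The base composition statistics described above can be sketched as follows. This is a minimal illustration, not the CodonW implementation, and it assumes the standard genetic code; in species that read UGA as Trp (such as the Mollicutes) the set of excluded codons would need adjusting. The function names are hypothetical.

```python
# Codons without a synonymous alternative (Met, Trp) and the termination codons,
# under the standard genetic code (an assumption of this sketch).
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def third_position_fraction(codon_counts, bases):
    """Fraction of synonymously variable codons whose third base is in `bases`."""
    total = 0
    hits = 0
    for codon, n in codon_counts.items():
        c = codon.upper().replace("U", "T")
        if len(c) != 3 or c in NON_SYNONYMOUS or set(c) - set("ACGT"):
            continue  # skip excluded codons and anything ambiguous
        total += n
        if c[2] in bases:
            hits += n
    if total == 0:
        raise ValueError("no synonymously variable codons counted")
    return hits / total


def gc3s(codon_counts):
    """G+C content at synonymously variable third positions."""
    return third_position_fraction(codon_counts, "GC")


def gt3s(codon_counts):
    """G+T content at synonymously variable third positions."""
    return third_position_fraction(codon_counts, "GT")
```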
In the case of species for which multiple strains have been sequenced, only one representative was selected. In addition, some other pairs of species are no more divergent than strains of a single species. To assess this, the average nucleotide sequence divergence across the genes rplA-C and rpsB-C was estimated. A criterion of at least 4% sequence divergence was used for inclusion of strains. This led to the exclusion of Mycobacterium bovis (0.05% different from Mycobacterium tuberculosis), Shigella flexneri (0.2% different from E.coli K12), Brucella suis (0.3% different from Brucella melitensis), Listeria innocua (1.5% different from L.monocytogenes) and Bacillus cereus (1.6% different from Bacillus anthracis). In contrast, Buchnera aphidicola strains Ap, Bp and Sg differed by 17–26%, and so all three were included. The least divergent pairs of species retained were Xanthomonas axonopodis and Xanthomonas campestris (4.0%) and E.coli and Salmonella enterica typhimurium (4.1%). With the exception of B.aphidicola, the 4% criterion would exclude all cases of multiple strains of a single species: the most divergent were Helicobacter pylori strains 26695 and J99 (3.2%) and Xylella fastidiosa strains 9a5c and Temecula (2.5%). Finally, Streptococcus mutans UA159 was excluded because several genes used in the analysis (see below) were incomplete or missing: the sequence has a deletion between the rplD and rpsS genes, truncating both and deleting the rplB and rplW genes that lie between rplD and rpsS in other Streptococcus species. The final data set included 80 different genomes (Table 1).
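The 4% inclusion criterion can be sketched as below. The text does not say how divergence was computed or in what order genomes were considered, so this sketch assumes an uncorrected proportion of differing sites over gap-free alignment columns and a greedy pass that keeps the first-listed genome of any close pair. The function names are hypothetical.

```python
def percent_divergence(aligned_a, aligned_b):
    """Uncorrected percentage of differing sites, ignoring columns with a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * diffs / compared


def select_representatives(names, divergence, threshold=4.0):
    """Keep a genome only if it is at least `threshold` % from every genome kept so far.

    `divergence` maps frozenset({name_a, name_b}) to a percentage.
    """
    kept = []
    for name in names:
        if all(divergence[frozenset((name, other))] >= threshold for other in kept):
            kept.append(name)
    return kept


# Two values are from the text; the S.enterica / S.flexneri value is a
# hypothetical placeholder needed to complete the example.
pairwise = {
    frozenset(("E.coli", "S.flexneri")): 0.2,
    frozenset(("E.coli", "S.enterica")): 4.1,
    frozenset(("S.enterica", "S.flexneri")): 4.1,  # hypothetical
}
kept = select_representatives(["E.coli", "S.enterica", "S.flexneri"], pairwise)
```

A greedy pass depends on the order of the input list, so the genome preferred as the representative of each close pair should be listed first.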
Table 1 The 80 bacterial genome sequences analysed
To represent genes under the weakest selection, the codon usage of the entire genome was used, on the assumption that the number of genes expressed at high levels is a very small fraction of the genome as a whole. To represent genes where codon usage would be expected to be subject to strong translational selection, codon usage was summed across a set of 40 genes expected to be expressed constitutively at very high levels. This set included the genes encoding translation elongation factors Tu (tufA), Ts (tsf) and G (fusA), and 37 of the larger ribosomal proteins (encoded by genes rplA-rplF, rplI-rplT and rpsB-rpsT). No homologue of rplI was found in Mycoplasma penetrans; in this species rplU was added to the data set. Otherwise, the same 40 genes were used for all species. Many bacteria have two copies of the translation elongation factor Tu gene, although these are usually very similar due to concerted evolution (35), while some species have two or more homologues of fusA or certain ribosomal protein genes. In each case, the gene with the highest S-value was retained.
To assess whether the S-values observed were significantly greater than zero, for each species S-values were also calculated for 1000 sets of genes randomly selected from the genome. For each genome, the set of 40 highly expressed genes contained on average 1000 codons used in the analysis (Table 1). For the random data sets, genes were added until a total of at least 1000 codons were present for the four amino acids analysed. The range of S-values including 95% of these samples was recorded.
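The randomization described above can be sketched as follows. Several details are assumptions, since the text does not give them: genes are drawn without replacement within each random set, the 95% range is taken as the central 95% of the sorted values, and S is the codon-weighted average of ln[kP/(1 – P)] over the four amino acids. Each gene is represented by a mapping from codon to count, and all names are hypothetical.

```python
import math
import random
from collections import Counter

PAIRS = [("TTC", "TTT"), ("TAC", "TAT"), ("ATC", "ATT"), ("AAC", "AAT")]
ANALYSED = tuple(codon for pair in PAIRS for codon in pair)


def weighted_s(sample, genome):
    """Codon-weighted average S over the four amino acids (NaN if undefined)."""
    num = 0.0
    den = 0
    for opt, alt in PAIRS:
        if min(sample[opt], sample[alt], genome[opt], genome[alt]) == 0:
            continue
        n = sample[opt] + sample[alt]
        num += n * math.log((genome[alt] / genome[opt]) * sample[opt] / sample[alt])
        den += n
    return num / den if den else float("nan")


def random_s_range(genes, n_sets=1000, min_codons=1000, coverage=0.95, seed=0):
    """Central `coverage` range of S over random gene sets drawn from the genome."""
    rng = random.Random(seed)
    genome = Counter()
    for gene in genes:
        genome.update(gene)
    values = []
    for _ in range(n_sets):
        sample = Counter()
        # Add genes in random order until enough analysed codons are present.
        for gene in rng.sample(genes, len(genes)):
            sample.update(gene)
            if sum(sample[c] for c in ANALYSED) >= min_codons:
                break
        s = weighted_s(sample, genome)
        if not math.isnan(s):
            values.append(s)
    if not values:
        raise ValueError("no random set gave a defined S-value")
    values.sort()
    tail = (1.0 - coverage) / 2.0
    lower = values[int(tail * len(values))]
    upper = values[max(int((1.0 - tail) * len(values)) - 1, 0)]
    return lower, upper
```

Under these assumptions, the S-value for the highly expressed genes would be treated as evidence of selection only if it exceeds the upper limit returned for the same genome.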
Phylogenetic analyses
The phylogenetic relationships of the 80 bacterial strains were estimated from a concatenated alignment of the proteins encoded by tuf, rplA-C and rpsB-C. Sequences were aligned using ClustalW (36), and sites with a gap in any sequence were removed. The tree was estimated by the Bayesian method implemented in MrBayesV3.0 (37), using the JTT model of protein evolution (38) with gamma distributed rates across sites. Phylogeny-independent correlations among species characters were estimated using the generalized least squares approach implemented in Continuous (39).
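The removal of gapped sites before tree estimation can be sketched as below. This is an illustration of the filtering step only, assuming that gaps are written as "-" and that the alignment is held as a mapping from sequence name to aligned sequence; the function name is hypothetical.

```python
def remove_gapped_columns(alignment):
    """Drop every alignment column that contains a gap in any sequence."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")
    rows = list(alignment.values())
    keep = [i for i in range(lengths.pop()) if all(row[i] != "-" for row in rows)]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical three-sequence protein alignment: columns 3 and 5 contain gaps.
trimmed = remove_gapped_columns({"a": "MK-LV", "b": "MKTL-", "c": "MRTLV"})
```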
RESULTS
The strength of selected codon usage bias (S)
The strength of selected codon usage bias (S) was analysed for 80 genomes representing diverse major lineages of bacteria (Table 1 and Figure 1). S was estimated from the codon frequencies in a set of 40 genes expressed at very high levels compared with those in the genome as a whole, with the latter taken as an indication of the frequencies generated by mutation biases in the absence of selection. The analysis focused on four amino acids (Phe, Tyr, Ile and Asn), where the same codon is expected to be translationally advantageous in all species. The components of S for each of the four amino acids were highly correlated across species, and there was no clear indication that the U-ending codon is ever the optimal codon for any of the four amino acids.
Figure 1 Phylogenetic relationships of the 80 bacterial genomes analysed. Species codes are given in Table 1.
Some species have either two chromosomes (i.e. the three Vibrio species, Agrobacterium tumefaciens, Brucella melitensis, Leptospira interrogans and Deinococcus radiodurans) or one or more plasmids of larger than 1 Mb (Ralstonia solanacearum and Sinorhizobium meliloti). In each case, most (if not all) of the 40 genes expressed at high levels reside on just one of these chromosomes. Using the codon usage of genes from only this chromosome, rather than both, as the guide to mutational biases had only a minor impact on the S-values estimated: in all seven cases where both replicons are regarded as chromosomes the value of S was reduced by <3%. The effect was also minor in R.solanacearum, where S changed from 0.02 to –0.06, but more marked in S.meliloti, where the value decreased from 0.64 to 0.53, indicating a small difference in the overall codon usage between the plasmids and the chromosome in this species.
The species analysed here have genomic G+C contents ranging from 22 to 72%. Since bacterial genomes have little non-coding DNA, and the first two positions within codons are constrained by protein-coding requirements, most of the variation is due to the third position of codons. Thus the overall G+C content at synonymously variable third positions (GC3S) ranged from 9 to 93% among the 80 genomes (Table 1). This base composition bias is so pervasive that it can be seen even when considering individual genes: e.g. for dnaA (a conserved gene with low selected codon usage bias), only one species (Xylella fastidiosa) showed a substantial deviation from the general trend, with a surprisingly low third position G+C content (28%) for a genome at 52% (Figure 2). This highlights the potential difficulty in estimating selected codon usage bias. The method used here for estimating S was explicitly designed to take account of genomic mutation biases, and indeed there was no correlation between S and the overall G+C content at synonymously variable third positions of codons (Figure 3). The optimal codons for the four amino acids analysed here are all C-ending, but there was no correlation between the S-value and the difference in GC3S values between the highly expressed gene data set and the genome as a whole; in fact, for 51 species the GC3S value for the highly expressed gene data set was the lower of the two (Table 1). This indicates that in species with high S-values many of the optimal codons for other amino acids are not C- or G-ending.
Figure 2 G+C content at the three codon positions within the dnaA gene, compared with the G+C content of the genome as a whole, for 79 bacterial genomes (no dnaA homologue has been found in W.glossinidia). Positions 1, 2 and 3 are indicated by open circles, open triangles and filled circles, respectively. The third position is strongly influenced by G+C bias; the first two positions are also influenced, implying an effect on amino acid composition (68).
Figure 3 Selected codon usage bias (S) and genomic G+C bias for 80 bacterial species. Genomic G+C bias is estimated by the overall GC3S. Open circles denote species where the S-value is not greater than found among randomly selected genes; filled triangles denote three Clostridium species.
The S-values showed a wide variation among species, ranging from –0.88 to 2.65 (Table 1). In most species, the 95% limits of the distribution of S-values for randomly selected genes were 0.2–0.3 either side of zero. For 24 species (i.e. 30% of the total), the S-value for the highly expressed genes was not as high as the upper 95% limit for the randomly selected genes, providing no immediate evidence that selection has affected codon usage in those genomes.
Negative S-values
The minimum S-values are expected to be around zero, but for five species the S-values were more highly negative than expected for randomly selected genes. This is surprising because the U-ending codons for the four amino acids analysed are unlikely to be translationally advantageous in any species, and the C-ending codons are not expected to be selected against in highly expressed genes. Two factors seem to contribute to these unexpectedly low S-values. First, in many species, there is a replication-dependent compositional skew between the leading and lagging strands, such that the leading strand is more G+T-rich, although the extent of this skew varies greatly among species (10). Most very highly expressed genes lie on the leading strand and so may have reduced frequencies of C-ending codons due to their location rather than because of selection. For example, in X.fastidiosa (S = –0.78), multivariate analysis of codon usage found that the primary source of variation among genes was associated with this strand skew (40): the mean G+T contents (at synonymously variable third positions; GT3S) of leading and lagging strand genes are 0.61 and 0.40, respectively. Of the 40 highly expressed genes analysed here, 37 are encoded by the leading strand. When the highly expressed genes were compared with only those encoded on the leading strand, the S-value was much less highly negative (–0.43). Similarly, in Buchnera aphidicola strain Bp (S = –0.59), the average GT3S is 0.57 and 0.42 for genes on the leading and lagging strands, respectively; when the 34 highly expressed genes lying on the leading strand are compared with other leading strand genes, the S-value is –0.18. By comparison, in the other two B.aphidicola genomes (strains Ap and Sg), the skew between the two strands is much less pronounced, and the S-values are close to zero.
Second, many bacterial genomes contain regions (‘islands’) of unusual base composition, generally inferred to reflect horizontal gene transfer. In Nitrosomonas europaea (S = –0.88), where the average G+C content at synonymously variable third positions (GC3S) was 0.53 for the chromosome as a whole, many of the highly expressed genes lie within two islands with unusually low G+C content: 18 of the 40 genes in the highly expressed data set lie within a region encompassing 27 genes (rpsJ-rpoA, genes 400–426) where the average GC3S is 0.29, while 7 more lie in a cluster of 13 genes (rplL-NE2059, genes 2047–2059) with an average GC3S of 0.34. The S-value for these 25 genes is –1.36. The other 15 genes included in the set of 40 highly expressed genes are scattered around the genome, having an average GC3S of 0.45, and an S-value of –0.23. Horizontal transfer is thought to be rare for ‘informational’ genes, such as those encoding ribosomal proteins (41). However, since both regions include other genes, not expected to be highly expressed but with similarly low GC3S values, and since the highly expressed genes at other locations do not have such low GC3S values, the anomalously low S-values do not appear to be related to selection.
Correlation of selected codon usage bias with rRNA and tRNA gene numbers
The strength of selection on synonymous codon usage is likely to be related to the degree to which speed and efficiency of growth and replication have been important during evolution. To investigate this, we have compared S-values with the numbers of rRNA operons and tRNA genes in each genome. Inter-specific variation in bacterial growth rate appears to be positively correlated with the number of rRNA operons (42). The abundance of different tRNAs is correlated with, and apparently largely determined by, gene copy number (11). The increased gene copy number, and consequent increased relative abundance, of particular tRNA species appears to be part of the strategy for optimizing translational efficiency (43,44). As expected, the numbers of rRNA and tRNA genes were found to be highly correlated in an analysis of 18 bacterial genomes (11). Among the 80 genomes analysed here, rRNA operon and tRNA gene copy numbers vary from 1 to 11, and 28 to 126, respectively (Table 1), and are very highly correlated (Figure 4).
Figure 4 Ribosomal RNA operon copy number and tRNA gene number for 80 bacterial species. Symbols are as in Figure 3.
S-values are positively correlated with both rRNA operon and tRNA gene copy numbers (Figures 5 and 6). The highest S-value of all (2.65) was found in Clostridium perfringens, a genome with 10 rRNA operons and 95 tRNA genes. The species with the largest number of tRNA genes, Vibrio parahaemolyticus, is also among those with the largest number of rRNA operons, and has a high S-value (1.89). All species with >6 rRNA operons, and all species with >70 tRNA genes, have stronger codon usage bias in the highly expressed genes than in randomly selected genes. Among the 30 species with S-values >1, only two have fewer than four rRNA operons, and only two have fewer than 50 tRNA genes. Conversely, a majority of the species with only one rRNA operon, or <40 tRNA genes, show no evidence of selected codon usage bias.
Figure 5 Selected codon usage bias (S) and ribosomal RNA operon copy number for 80 bacterial species. Symbols are as in Figure 3.
Figure 6 Selected codon usage bias (S) and tRNA gene number for 80 bacterial species. Symbols are as in Figure 3.
The strengths of these correlations among rRNA operon numbers, tRNA gene copy numbers and S are overestimated by a simple analysis of the data as presented in Figures 4–6, due to the nonindependence of the data points. The 80 genomes are linked by a phylogenetic tree (Figure 1), and closely related species often share similar numbers of rRNA and tRNA genes, and have similar S-values, which may simply be due to their recent common ancestry. Using an approach to estimate the correlations after removing the effects of shared ancestry (39), the correlation coefficient for rRNA and tRNA gene copy numbers is 0.82, while the correlations between S and rRNA and tRNA gene copy numbers are 0.49 and 0.44, respectively (all values are highly statistically significant). While the phylogenetic relationships shown in Figure 1 are broadly consistent with those derived from analyses of other sequence data sets (45,46), there are some differences, such as Escherichia and Haemophilus being more closely related to each other than to Vibrio and Wigglesworthia lying within the radiation of Buchnera strains (47). However, we found that using alternative trees with such minor differences in topology had very little impact on the magnitude of the correlation coefficients.
DISCUSSION
Previous analyses of codon usage in bacteria have mostly focussed on the analyses of particular species, with no quantitative attempt to compare the strength of selected codon usage bias across different species (a recent exception is discussed below). Some analyses have started from the assumption that there is selected codon usage bias, without testing whether that is indeed the case (48), while others have concluded that ‘codon usage in most bacteria, if not all, is constrained by translation efficiency’ (11). Here, we have described a measure of the strength of selected codon usage bias, S, and a method for testing whether S is larger than expected by chance. The approach should be applicable to all species, and provides a means of comparing the strength of selected codon usage bias among them. We have applied this approach to 80 species. For 30% of these species, there was no evidence of selected codon usage bias, while among the others the value of S ranged widely.
Comparisons with previous analyses of individual species
The archetypal example of a species with strongly selected codon usage bias has been E.coli, where the selective pressure exerted via tRNA relative abundance and anticodon sequence was first elucidated (13,14). The S-value calculated here for E.coli is high (1.49), but 15 other species (20% of the species analysed) have even higher values, indicating more strongly selected codon usage bias. Among these 15 species are 8 members of the Firmicutes (A+T-rich gram positive bacteria), including C.perfringens with the highest value of 2.65. A recent analysis detected the selected codon usage bias in C.perfringens, and also noted that the bias was stronger than in Clostridium acetobutylicum (49); here, the latter species has an S-value of 0.84. In fact, with the exception of the Mollicutes and C.acetobutylicum, all of the Firmicutes have S-values >1.0 (Table 1). An early analysis of one of these species, B.subtilis, concluded (from 56 genes) that the selected codon usage bias was weaker than in E.coli (50), but here the S-value for B.subtilis is 1.36, not substantially different from E.coli.
An early analysis of M.tuberculosis (using 41 genes) reported weak but significant selected codon usage bias (25), and this is confirmed by the S-value of 0.45, compared with 0.26 as the upper limit of the 95% range for randomly selected genes in that species. Analysis of the genome of Thermotoga maritima detected selected codon usage in highly expressed genes, but found this to be a relatively minor source of variation among genes (28). No significant difference in the use of Tyr, Ile or Asn codons was found between genes expressed at high and low levels; and since these are three of the four amino acids used here, it is not surprising that the S-value is very low (0.37), and only just above the value (0.28) for randomly selected genes.
For two other species, where weak selected codon usage bias has been reported, the present analysis yields S-values within the range of randomly selected genes. In Chlamydia trachomatis, the major trend among genes in codon usage is related to strand skew (26). The average GT3S values for genes on the leading and lagging strands are 0.57 and 0.48, respectively. If the 40 genes are compared with only those on the leading strand, the S-value becomes 0.42, indicative of weak selection. Pseudomonas aeruginosa has an S-value close to zero (–0.02), providing no evidence for selection, whereas we previously found small but significant differences in codon usage between highly expressed and other genes (27). This discrepancy arises because the largest components of selected bias in this species relate to codons for Ser (especially UCC), Thr (ACC), Ala (GCU), Arg (CGU) and Gly (GGU), whereas frequencies of the C-ending codons for Phe, Tyr, Ile and Asn (used to calculate S) differ little between highly expressed genes and the genome as a whole (27).
Otherwise, few analyses have commented on the relative strength of selected codon usage bias, except in those cases where it appears to be absent. Evidence of a lack of selected codon usage bias has been reported for Helicobacter pylori (24), Rickettsia prowazekii (18), Treponema pallidum (23), Buchnera strains (21) and Wigglesworthia (22), all of which have S-values close to zero. In addition, an absence of selected codon usage bias has been reported in Borrelia burgdorferi (23,51) and Mycoplasma genitalium (19,20), but for these two species these conclusions have been questioned (52). For B.burgdorferi, there is no sign of selected codon usage bias in the present analysis, since the S-value is negative (–0.31). However, in this species, there is extremely pronounced skew between the chromosome strands: the average GT3S values for genes on the leading and lagging strands are 0.62 and 0.39, respectively. Of the 40 genes in the highly expressed data set, 38 lie on the leading strand, and when these are compared with genes from the leading strand only, the B.burgdorferi value is –0.04, still providing no evidence for selection. For M.genitalium, the possibility of selected codon usage bias was invoked on the grounds that highly expressed genes tend to use more G+C-rich codons (52). Indeed, here the S-value for M.genitalium (0.32) is slightly higher than expected for randomly selected genes. However, it has been shown that the major source of variation among M.genitalium genes is in G+C content, which varies systematically in a wave around the genome, seemingly affecting all genes irrespective of their expression level (19,20). A total of 29 of the 40 highly expressed genes used here lie within the most G+C-rich 40% of the genome. When these 29 genes are compared with the 192 genes in this region, the S-value is lower (0.17), and within the range of values for randomly selected genes from this region. 
This suggests that the minor difference in codon usage between highly expressed genes (in total) and the genome as a whole reflects compositional variation, and provides no evidence for selected codon usage bias in this species.
Streptomyces species are extremely G+C-rich, and this compositional bias was found to dominate codon usage in an early study (17). However, it was noted that tufA (the only unambiguously highly expressed gene sequence then available) had slightly different codon usage that might indicate the action of weak translational selection. Here, Streptomyces coelicolor has an S-value of 0.99. This value is close to that expected for a genome with 6 rRNA operons (Figure 5) and 63 tRNA genes (Figure 6), and all of these features are consistent with moderately strong translational selection. However, the difficulty in interpreting codon usage variation in this species is shown by the unusually broad range of values observed for randomly selected genes (Table 1). Among 1000 randomly selected S.coelicolor data sets, 28 had S-values as large as that for the highly expressed genes. For Streptomyces avermitilis, the S-value is lower (0.69), but again just within the range of values for 95% of randomly selected gene data sets. Overall, it appears that the codon selection in Streptomyces has been marginally effective in overcoming the very strong mutational bias.
Thus, the S-values obtained here are largely consistent with more detailed studies on individual species. However, because S is calculated from only four amino acids, where the choice is always between the translationally optimal C-ending codon and a U-ending codon, intragenomic variations in G+T content can impinge on the value obtained. Since most highly expressed genes lie on the leading (G+T-rich) strand this tends to reduce S, but the size of the effect, reflecting the extent of skew between the strands, varies substantially among species. For example, in E.coli the average GT3S values of genes on the leading and lagging strands are 0.55 and 0.51, respectively, and using only leading strand genes as the control for mutational bias leaves the S-value unaltered. It might be preferable always to use only genes on the leading strand as the control for mutational bias, but for many species this is impracticable because it is difficult to locate the origin and terminus of replication precisely. Furthermore, even closely related strains can show extensive genomic rearrangement, which can confound comparisons of leading and lagging strand genes.
Intragenomic variations in G+C content can also impinge on the value of S. With the exception of M.genitalium (discussed above), intragenomic G+C variation mostly reflects ‘islands’ of atypical base composition. Typically, as many as half of the 40 highly expressed genes examined here are located in a single cluster, and we have noticed that in a number of species this cluster is more A+T-rich than the genome as a whole, tending to reduce the S-value. Islands of atypical base composition are usually explained as the result of horizontal gene transfer, but it is generally not expected that ribosomal protein genes undergo this process. Thus, the reason(s) for this base composition difference warrant further investigation.
These caveats regarding intragenomic variations in base composition serve to emphasise that any automated analysis of codon usage, without some detailed consideration of the variation among genes, may be prone to errors. However, the advantage of calculating S-values by the method described here is that a uniform approach can be used for all species, enabling comparisons among them.
Variation among bacteria in the strength of selected codon usage bias
At a biochemical level, the C-ending codons for Phe, Tyr, Ile and Asn are expected to be translationally optimal in all bacteria, but the wide range of S-values observed (Table 1) indicates that the strength and/or efficacy of selection for these optimal codons has varied considerably among species. The strength of selected codon usage bias, as estimated by S, is highly correlated with the number of rRNA operons and the number of tRNA genes. We expect that codon usage will have been more strongly selected in species which replicate fast. Information regarding the growth rate of bacteria in the wild is sparse, and so we have used the number of rRNA operons as a (very approximate) guide to the growth rate of species. Remarkably, C.perfringens, the species with the highest S-value (2.65) and 10 rRNA operons, can grow with a generation time under 7 min in specific laboratory conditions (55). In contrast, Mycobacterium species are renowned for their very slow growth: M.tuberculosis and M.leprae have generation times of 1 and 14 days, respectively. Both species have one rRNA operon and low S-values (0.5). These observations are consistent with the effects of selection for efficiency of translation under rapid and competitive growth conditions, and then the lack of selected codon usage bias in some species would reflect a relative unimportance of an exponential growth phase during their life history.
Alternatively, a lack of selected codon usage bias may reflect the greater impact of random genetic drift, due to a population structure with a low long-term effective population size and/or interference between linked synonymous sites due to a lack of recombination. For most species, it is difficult to know the long-term evolutionary effective population size relevant to codon usage. For example, M.tuberculosis currently infects many more people worldwide than M.leprae, such that the former is likely to have much the larger ongoing effective population size. However, M.tuberculosis exhibits little genetic diversity (56) and is thought to be a recently emerged clone from M.canetti (57); this evolutionary bottleneck would have reduced the effective population size of M.tuberculosis. But even this may have little relevance: in the same way that it is thought that the codon usage of horizontally transferred genes may take many millions of years to ameliorate to that of a new host genome (58), strongly selectively biased codon usage may take a very long time to decay after a reduction in effective population size, i.e. the codon usage bias currently observed may still be due in some part to evolutionary processes that occurred millions of years ago. The two Mycobacterium species currently have similar levels of selected codon usage bias.
Nevertheless, it seems clear that the life histories of some of the bacteria analysed are likely to lead to low effective population sizes. Many of the species with very low S-values are obligate intracellular parasites or endosymbionts: these include species in the genera Buchnera, Wigglesworthia, Coxiella, Rickettsia and Tropheryma, the Mollicutes (Mycoplasma plus Ureaplasma) as well as the four Chlamydiales. Among these 18 species, all have S-values <0.5, and only the Mollicutes have values >0.2, and marginally higher than expected from randomly selected genes. Most have reduced genome sizes (<1000 genes), all have only 1 or 2 rRNA operons, and most have <40 tRNA genes (Table 1). For example, Buchnera and Wigglesworthia are obligate endosymbionts of insects, with low effective population sizes (due to bottlenecks during their transmission) and limited recombination. It has been noted that, as well as an absence of selected codon usage bias, these species have rapid evolutionary rates, presumably reflecting the enhanced power of random genetic drift (21). In contrast, all of the bacteria with high S-values (say, >1.5) live outside host cells, typically in mixed environments, such as soil, water or the intestinal tracts of animals. Thus, this difference between an intracellular parasitic lifestyle and an extracellular existence appears to be a pervasive influence on S among the species included in this analysis.
A lack of recombination would be expected to impair the efficacy of selection on codon usage. Many of the intracellular parasitic species, noted above for their low S-values, are known or expected to be effectively clonal. Additionally, the primarily extracellular pathogenic spirochaete B.burgdorferi is extremely clonal (59) and has S near zero. In contrast, Streptococcus pneumoniae, Streptococcus pyogenes and Staphylococcus aureus all appear to have undergone high rates of recombination (60), and have high S-values (Table 1). However, E.coli and Haemophilus influenzae also have high S-values, despite apparently lower rates of recombination (60). It is clear that a high recombination rate alone is not enough to promote codon selection: H.pylori has perhaps the highest rate of recombination known among bacteria (61), and yet an S-value close to zero. In this case, the lack of selected codon usage bias has been interpreted as a consequence of the unimportance of competitive growth in the isolated acidic niche of this species (24).
Overall, it is difficult to disentangle the effects of low effective population size and a lack of recombination from the other aspects of these organisms' lifestyles discussed above. For example, among the spirochaetes, two (B.burgdorferi and T.pallidum) have S-values close to zero, whereas the third (L.interrogans) has a somewhat higher value (0.67). Both B.burgdorferi and T.pallidum are obligate parasites and grow slowly, whereas L.interrogans is a facultative parasite with many saprophytic relatives, is more metabolically versatile and can grow more rapidly. The stronger selected codon usage bias in L.interrogans appears to reflect this difference in lifestyles, although interestingly it is not accompanied by an increase in rRNA or tRNA gene number.
The correlations between S and rRNA and tRNA gene copy numbers are sufficiently strong that it is interesting to examine the outliers. For example, values for the three Clostridium species are highlighted in Figures 4–6. The S-value for C.acetobutylicum (0.84) is surprisingly low for a genome with 11 rRNA operons (Figure 5). It is similar to that of Clostridium tetani (1.00), with only 6 rRNA operons, but much lower than that of C.perfringens (2.65), a genome with 10 rRNA operons. However, the S-value for C.acetobutylicum is not unusual for a genome with 73 tRNA genes (Figure 6). Thus, it seems to be the high number of rRNA operons in C.acetobutylicum that is anomalous; this may reflect a very recent expansion in this gene family.
Perhaps the most surprising example of low codon usage bias is P.aeruginosa. This species can grow quite rapidly (doubling times <1 h) in laboratory planktonic cultures and is metabolically highly versatile. It is moderately recombinogenic via plasmid transfer, and there appear to be many horizontally transferred genes in its genome (27). The low selected bias was apparent in a full analysis of codon usage in this species (27), as well as the S-value calculated here. Selected codon usage bias is rather stronger in the two other Pseudomonas species analysed (Table 1). These paradoxical observations perhaps highlight our ignorance of the evolutionary history of even ‘well-known’ bacterial species.
Comparison with another estimate of S
Recently, another approach to estimating the strength of selected codon usage bias in a genome has been published by dos Reis and co-workers (62). These authors calculated two indices of codon usage bias. The first, based on the effective number of codons used in a gene (63), attempted to estimate the strength of general deviation from random codon usage in a gene. The second was a modification of the codon adaptation index, CAI (64), using tRNA gene copy number (as a surrogate for tRNA abundance) and the estimated strength of codon–anticodon interaction to assign fitness values to codons; the tRNA adaptation index for a gene was calculated as the average of these fitness values, as an attempt to estimate the adaptation of a gene's codon usage to the tRNA pool of the species. It was suggested that the strength of translationally selected codon usage bias, S (here termed St to distinguish it from S described above), could be estimated from the magnitude of the correlation between these two indices; the significance of St was estimated from a permutation test.
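The general logic of testing a correlation between two per-gene indices by permutation can be illustrated with a minimal sketch. This is not the implementation of dos Reis et al.: it assumes a plain Pearson correlation and a one-sided test in which one index is shuffled among genes, and the function names (`pearson`, `permutation_p_value`) and the eight per-gene values are hypothetical, chosen only for illustration.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Return (observed r, one-sided p): the share of shuffles with r >= observed."""
    rng = random.Random(seed)
    observed = pearson(x, y)
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any gene-by-gene pairing of the two indices
        if pearson(x, y_shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# HYPOTHETICAL per-gene values (not taken from any genome):
# a general codon bias index and a tRNA adaptation index for eight genes.
bias_index = [0.10, 0.15, 0.22, 0.30, 0.41, 0.48, 0.55, 0.63]
adaptation_index = [0.21, 0.25, 0.24, 0.33, 0.38, 0.45, 0.44, 0.52]

observed_r, p_value = permutation_p_value(bias_index, adaptation_index)
print(f"r = {observed_r:.2f}, one-sided p = {p_value:.4f}")
```

Shuffling one index preserves the distribution of both indices while removing their association, so the shuffled correlations approximate what would be expected if codon usage were unrelated to the tRNA pool.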
Dos Reis et al. (62) applied this methodology to 101 bacterial genomes, including 66 of those analysed here as well as another 20 genomes excluded here because of their close relationship to other strains. The St method found significant evidence for selection in only 26% of bacterial genomes analysed. Among the 66 species common to both analyses, S- and St-values are significantly correlated (coefficient = 0.46); 14 species were found to have significant evidence for selection in both analyses and 18 were found to lack such evidence in both analyses (Figure 7). However, 32 species found here to have significant S-values were not significant in the St analysis. These included a number of species where previous analyses have found clear evidence of selectively biased codon usage in highly expressed genes, such as B.subtilis (50,65), C.acetobutylicum (49) and Vibrio cholerae (40). Most strikingly, C.perfringens had the highest S-value among the 80 species analysed here, and yet was not significant in the St test; detailed analysis of codon usage in this species has revealed strongly selected bias in highly expressed genes (49).
Figure 7 Comparison of two estimates of selected codon usage bias: x-axis values are taken from this paper, y-axis values from dos Reis et al. (62). Values significantly greater than zero in the dos Reis et al. analysis are shown as circles; values significantly greater than zero in our analysis are shown as filled symbols.
Interestingly, two species found here not to have significant S-values, Neisseria meningitidis and Bacteroides thetaiotaomicron, were significant in the St test. Closer examination of these species revealed that, in both, the primary trends in codon usage variation among genes were associated with leading versus lagging strand composition bias and G+C content, but there was evidence for weak selected codon usage bias in highly expressed genes. Overall, it appears that the estimation of S described here is generally much more effective than the St test at detecting translationally selected codon usage bias, even though S can sometimes be reduced by compositional biases. One difference between the two approaches should be noted. The method described here asks how strong the selected bias is in a specified set of very highly expressed genes, but not how many genes exhibit selected bias. The dos Reis et al. method aimed at quantifying the extent to which variation among genes across the genome as a whole can be explained as adaptation to the tRNA pool of the species. Given this difference, further comparison of the results of the two methods may shed additional light on the causes of selected codon usage bias.
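The counts reported above for the 66 species common to both analyses can be tallied as a simple two-by-two table. The sketch below uses only the four counts stated in the text (14, 18, 32 and 2); the variable names are illustrative, and the agreement fraction is a descriptive summary rather than a statistic used in either study.

```python
# Counts of species as reported in the text for the 66 shared species.
both_significant = 14     # significant in both the S and St analyses
neither_significant = 18  # not significant in either analysis
only_s = 32               # significant S, but not significant in the St test
only_st = 2               # significant in the St test, but not for S

total = both_significant + neither_significant + only_s + only_st
agreement = (both_significant + neither_significant) / total

s_significant = both_significant + only_s    # species with significant S
st_significant = both_significant + only_st  # species with significant St

print(f"species compared: {total}")
print(f"significant S: {s_significant}; significant St: {st_significant}")
print(f"fraction on which the two tests agree: {agreement:.2f}")
```

The four categories sum to 66, confirming that the counts in the text account for every shared species, and the two methods reach the same verdict for just under half of them.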
Solving the riddle of codon usage preferences?
In their analysis dos Reis et al. included a small number of eukaryote genomes, as well as archaeal and bacterial species. They found that variation in the strength of codon usage bias among species was highly positively correlated with genome size and tRNA gene copy number (except in very large genomes), and concluded that these two factors ‘ultimately determine the action of natural selection’ on codon usage (62). They proposed a model whereby, from an ancestral bacterium with a small genome size, increases in genome size led to increases in tRNA gene copy number, which in turn led to selection for the optimization of codon usage. However, we find that genome size does not seem to determine tRNA gene copy number (among bacteria, at least), while it seems inappropriate to consider codon bias as the result of tRNA gene copy number. In contrast, we suggest that it is the biology of the organism (its ‘lifestyle’) that determines whether codon usage is affected by natural selection.
The overall results of dos Reis et al. were heavily influenced by the inclusion of eukaryote species, which contributed disproportionately to the variation in both genome size and tRNA gene number. Although there is a positive correlation between genome size and tRNA gene number among the 80 bacterial species examined here, this seems to be due only to species with small genomes. (Note that dos Reis et al. considered genome size in terms of DNA content, whereas we have used the estimated number of protein-coding genes; however, these two measures are extremely highly correlated among bacteria and so this difference should have no impact.) Among the larger bacterial genomes (e.g. the 42 species with >2500 genes), there is no significant correlation between genome size and tRNA copy number. For example, 10 of the 11 species with >5000 genes have <75 tRNA genes, while 10 of the 11 species with >75 tRNA genes have <5000 protein-coding genes; the single exception is B.anthracis with 5311 genes and 95 tRNA genes (Table 1). Thus, increases in genome size do not generally involve an increase in the number of tRNA genes. The forces that have led to reduced genome size (e.g. in Buchnera, Rickettsia and Mycoplasma species) may have impacted on tRNA gene copy number directly, but it seems more likely that these evolutionary pressures reflect the adoption of a lifestyle (typically intracellular parasitism), in which rapid replication was not advantageous (or perhaps even detrimental) and thus translational efficiency became less important, and additional tRNA genes became unnecessary.
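The way in which a correlation across all genomes can be driven by the small genomes alone may be illustrated with a short sketch. The (gene number, tRNA gene number) pairs below are invented placeholders, not the values from Table 1, and the helper names (`pearson`, `correlation_above`) are hypothetical; only the threshold of 2500 genes is taken from the text.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def correlation_above(records, min_genes):
    """Correlate gene number with tRNA gene number for genomes with > min_genes genes."""
    subset = [(genes, trnas) for genes, trnas in records if genes > min_genes]
    gene_counts, trna_counts = zip(*subset)
    return pearson(gene_counts, trna_counts), len(subset)

# HYPOTHETICAL (protein-coding genes, tRNA genes) pairs, invented for
# illustration only; these are NOT the species values from Table 1.
toy_genomes = [
    (500, 32), (600, 33), (800, 36), (900, 37), (1000, 40),
    (2600, 60), (3000, 45), (3500, 80), (4200, 50),
    (5000, 70), (5500, 55), (6000, 62),
]

r_all, n_all = correlation_above(toy_genomes, 0)
r_large, n_large = correlation_above(toy_genomes, 2500)
print(f"all genomes (n={n_all}): r = {r_all:.2f}")
print(f">2500 genes (n={n_large}): r = {r_large:.2f}")
```

In this toy set the small genomes have uniformly few tRNA genes, which produces a clear positive correlation overall even though the larger genomes show almost none among themselves; this is the pattern described above for the bacterial data.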
It seems inappropriate to consider codon usage bias as simply being caused by tRNA abundances, since both factors are likely to co-evolve in response to selection for translational efficiency (44,66). Indeed, it is possible to consider circumstances where changes in codon usage bias, perhaps brought about by a change in the genome wide mutational bias, could select for a change in the tRNA pool (67). Thus, while we find correlations across species in the numbers of rRNA operons and tRNA genes, and the strength of selected codon usage bias, we do not invoke a causal relationship among any of these factors; rather, we take all three as indicative of the need for rapid and efficient bacterial growth.
ACKNOWLEDGEMENTS
We are very grateful to Michael Bulmer for discussion of his population genetic model of codon usage bias, and to Manolo Gouy and colleagues in Lyon for providing the ACNUC interface to GenBank. We also thank Mario dos Reis for discussion of his recent paper. This work was supported in part by studentships from the MRC (to R.J.G.) and the University of Nottingham (to J.F.P.). Funding to pay the Open Access publication charges for this article was provided by The University of Nottingham.
REFERENCES
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M., Mercier, R. (1981) Codon catalog usage is a genome strategy modulated for gene expressivity Nucleic Acids Res., 8, r43–r74 .
Ikemura, T. (1985) Codon usage and tRNA content in unicellular and multicellular organisms Mol. Biol. Evol., 2, 13–34 .
Sharp, P.M., Cowe, E., Higgins, D.G., Shields, D.C., Wolfe, K.H., Wright, F. (1988) Codon usage in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity Nucleic Acids Res., 16, 8207–8211 .
Sharp, P.M. and Li, W.-H. (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms J. Mol. Evol., 24, 28–38 .
Bulmer, M. (1991) The selection-mutation-drift theory of synonymous codon usage Genetics, 129, 897–907 .
Sharp, P.M., Stenico, M., Peden, J.F., Lloyd, A.T. (1993) Codon usage: mutational bias, translational selection, or both? Biochem. Soc. Trans., 21, 835–841 .
Sueoka, N. (1962) On the genetic basis of variation and heterogeneity of DNA base composition Proc. Natl Acad. Sci. USA, 48, 582–592 .
Muto, A. and Osawa, S. (1987) The guanine and cytosine content of genomic DNA and bacterial evolution Proc. Natl Acad. Sci. USA, 84, 166–169 .
Lobry, J.R. (1996) Asymmetric substitution patterns in the two DNA strands of bacteria Mol. Biol. Evol., 13, 660–665 .
McLean, M.J., Devine, K.M., Wolfe, K.H. (1997) Base composition skews, replication orientation, and gene orientation in 12 prokaryotic genomes J. Mol. Evol., 47, 691–696 .
Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T. (1999) Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis Gene, 238, 143–155 .
Ochman, H., Lawrence, J.G., Groisman, E.A. (2000) Lateral gene transfer and the nature of bacterial innovation Nature, 405, 299–304 .
Post, L.E. and Nomura, M. (1980) DNA sequences from the str operon of Escherichia coli J. Biol. Chem., 255, 4660–4666 .
Ikemura, T. (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes J. Mol. Biol., 146, 1–21 .
Ikemura, T. (1982) Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes J. Mol. Biol., 158, 573–597 .
Bennetzen, J.L. and Hall, B.D. (1982) Codon selection in yeast J. Biol. Chem., 257, 3026–3031 .
Wright, F. and Bibb, M.J. (1992) Codon usage in the G+C-rich Streptomyces genome Gene, 113, 55–65 .
Andersson, S.G.E. and Sharp, P.M. (1996) Codon usage and base composition in Rickettsia prowazekii J. Mol. Evol., 42, 525–536 .
Kerr, A.R.W., Peden, J.F., Sharp, P.M. (1997) Systematic base composition variation around the genome of Mycoplasma genitalium, but not Mycoplasma pneumoniae Mol. Microbiol., 25, 1177–1179 .
McInerney, J.O. (1997) Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns Microb. Comp. Genomics, 2, 1–10 .
Wernegreen, J.J. and Moran, N.A. (1999) Evidence for genetic drift in endosymbionts (Buchnera): analyses of protein-coding genes Mol. Biol. Evol., 16, 83–97 .
Herbeck, J.T., Wall, D.P., Wernegreen, J.J. (2003) Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia Microbiology, 149, 2585–2596 .
Lafay, B., Lloyd, A.T., McLean, M.J., Devine, K.M., Sharp, P.M., Wolfe, K.H. (1999) Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases Nucleic Acids Res., 27, 1642–1649 .
Lafay, B., Atherton, J.C., Sharp, P.M. (2000) Absence of translationally selected codon usage bias in Helicobacter pylori Microbiology, 146, 851–860 .
Andersson, S.G.E. and Sharp, P.M. (1996) Codon usage in the Mycobacterium tuberculosis complex Microbiology, 142, 915–925 .
Romero, H., Zavala, A., Musto, H. (2000) Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces Nucleic Acids Res., 28, 2084–2090 .
Grocock, R.J. and Sharp, P.M. (2002) Synonymous codon usage in Pseudomonas aeruginosa PAO1 Gene, 289, 131–139 .
Zavala, A., Naya, H., Romero, H., Musto, H. (2002) Trends in codon and amino acid usage in Thermotoga maritima J. Mol. Evol., 54, 563–568 .
Arnold, H.H. and Keith, G. (1977) The nucleotide sequence of phenylalanine tRNA from Bacillus subtilis Nucleic Acids Res., 4, 2821–2829 .
Kurland, C.G. (1987) Strategies for efficiency and accuracy in gene expression. 1. The major codon preference: a growth optimization strategy Trends Biochem. Sci., 12, 126–128 .
Maynard Smith, J., Smith, N.H., O'Rourke, M., Spratt, B.G. (1993) How clonal are bacteria? Proc. Natl Acad. Sci. USA, 90, 4384–4388 .
McVean, G.A.T. and Charlesworth, B. (2000) The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation Genetics, 155, 929–944 .
Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC—a portable retrieval system for nucleic acid sequence databases: logical and physical design and usage Comp. Appl. Biosci., 1, 167–172 .
Peden, J.F. (1999) Analysis of codon usage PhD Thesis, University of Nottingham, UK .
Sharp, P.M. (1991) Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position and concerted evolution J. Mol. Evol., 33, 23–33 .
Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence-weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res., 22, 4673–4680 .
Huelsenbeck, J.P. and Ronquist, F. (2001) MRBAYES: Bayesian inference of phylogeny Bioinformatics, 17, 754–755 .
Jones, D.T., Taylor, W.R., Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci., 8, 275–282 .
Pagel, M. (1999) Inferring the historical patterns of biological evolution Nature, 401, 877–884 .
Grocock, R.J. (2003) Evolution of codon usage among the gamma Proteobacteria PhD Thesis, University of Nottingham, UK .
Jain, R., Rivera, M., Lake, J.A. (1999) Horizontal gene transfer among genomes: the complexity hypothesis Proc. Natl Acad. Sci. USA, 96, 3801–3806 .
Klappenbach, J.A., Dunbar, J.M., Schmidt, T.M. (2000) rRNA operon copy number reflects ecological strategies of bacteria Appl. Environ. Microbiol., 66, 1328–1333 .
Ehrenberg, M. and Kurland, C.G. (1984) Costs of accuracy determined by a maximal growth rate constraint Q. Rev. Biophys., 17, 45–82 .
Berg, O.G. and Kurland, C.G. (1997) Growth rate-optimised tRNA abundance and codon usage J. Mol. Biol., 270, 544–550 .
Olsen, G.J., Woese, C.R., Overbeek, R. (1994) The winds of (evolutionary) change—breathing new life into microbiology J. Bacteriol., 176, 1–6 .
Haubold, B. and Wiehe, T. (2004) Comparative genomics: methods and applications Naturwissenschaften, 91, 405–421 .
Wernegreen, J.J., Degnan, P.H., Lazarus, A.B., Palacios, C., Bordenstein, S.R. (2003) Genome evolution in an insect cell: distinct features of an ant-bacterial partnership Biol. Bull., 204, 221–231 .
Karlin, S. and Mrazek, J. (2000) Predicted highly expressed genes of diverse prokaryotic genomes J. Bacteriol., 182, 5238–5250 .
Musto, H., Romero, H., Zavala, A. (2003) Translational selection is operative for synonymous codon usage in Clostridium perfringens and Clostridium acetobutylicum Microbiology, 149, 855–863 .
Shields, D.C. and Sharp, P.M. (1987) Synonymous codon usage in Bacillus subtilis reflects both translational selection and mutational biases Nucleic Acids Res., 15, 8023–8040 .
McInerney, J.O. (1998) Replicational and transcriptional selection on codon usage in Borrelia burgdorferi Proc. Natl Acad. Sci. USA, 95, 10698–10703 .
Perriere, G. and Thioulouse, J. (2002) Use and misuse of correspondence analysis in codon usage studies Nucleic Acids Res., 30, 4548–4555 .
Simpson, A.J.G., Reinach, F.C., Arruda, P., Abreu, F.A., Acencio, M., Alvarenga, R., Alves, L.M.C., Araya, J.E., Baia, G.S., Baptista, C.S., et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa Nature, 406, 151–159 .
Van Sluys, M.A., de Oliveira, M.C., Monteiro-Vitorello, C.B., Miyaki, C.Y., Furlan, L.R., Camargo, L.E.A., da Silva, A.C.R., Moon, D.H., Takita, M.A., Lemos, E.G.M., et al. (2003) Comparative analysis of the complete genome sequences of Pierce's disease and citrus variegated chlorosis strains of Xylella fastidiosa J. Bacteriol., 185, 1018–1026 .
Labbe, R.G. and Huang, T.H. (1995) Generation times and modeling of enterotoxin-positive and enterotoxin-negative strains of Clostridium perfringens in laboratory media and ground beef J. Food Prot., 58, 1303–1306 .
Sreevatsan, S., Pan, X., Stockbauer, K.E., Connell, N.D., Kreiswirth, B.N., Whittam, T.S., Musser, J.M. (1997) Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination Proc. Natl Acad. Sci. USA, 94, 9869–9874 .
Fabre, M., Koeck, J.-L., Le Fleche, P., Simon, F., Herve, V., Vergnaud, G., Pourcel, C. (2004) High genetic diversity revealed by variable-number tandem repeat genotyping and analysis of hsp65 gene polymorphism in a large collection of "Mycobacterium canetti" strains indicates that the Mycobacterium tuberculosis complex is a recently emerged clone of "M. canetti" J. Clin. Microbiol., 42, 3248–3255 .
Lawrence, J.G. and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange J. Mol. Evol., 44, 383–397 .
Dykhuizen, D.E., Polin, D.S., Dunn, J.J., Wilske, B., Preac-Mursic, V., Dattwyler, R.J., Luft, B.J. (1993) Borrelia burgdorferi is clonal: implications for taxonomy and vaccine development Proc. Natl Acad. Sci. USA, 90, 10163–10167 .
Feil, E.J., Holmes, E.C., Bessen, D.E., Chan, M.-S., Day, N.J.P., Enright, M.C., Goldstein, R., Hood, D.W., Kalia, A., Moore, C.E., Zhou, J., Spratt, B.G. (2001) Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences Proc. Natl Acad. Sci. USA, 98, 182–187 .
Suerbaum, S., Maynard Smith, J., Bapumia, K., Morelli, G., Smith, N.H., Kunstmann, E., Dyrek, I., Achtman, M. (1998) Free recombination within Helicobacter pylori Proc. Natl Acad. Sci. USA, 95, 12619–12624 .
Dos Reis, M., Savva, R., Wernisch, L. (2004) Solving the riddle of codon usage preferences: a test for translational selection Nucleic Acids Res., 32, 5036–5044 .
Wright, F. (1990) The ‘effective number of codons’ used in a gene Gene, 87, 23–29 .
Sharp, P.M. and Li, W.-H. (1987) The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications Nucleic Acids Res., 15, 1281–1295 .
Moszer, I., Rocha, E.P.C., Danchin, A. (1999) Codon usage and lateral gene transfer in Bacillus subtilis Curr. Opin. Microbiol., 2, 524–528 .
Bulmer, M. (1987) Co-evolution of codon usage and transfer RNA abundance Nature, 325, 728–730 .
Shields, D.C. (1990) Switches in species-specific codon preferences: the influence of mutation biases J. Mol. Evol., 31, 71–80 .
Gu, X., Hewett-Emmett, D., Li, W.-H. (1998) Directional mutational pressure affects the amino acid composition of proteins in bacteria Genetica, 102/103, 383–391 .
(Paul M. Sharp*, Elizabeth Bailes, Russel)