Site-to-Site Variation of Synonymous Substitution Rates
http://www.100md.com
分子生物学进展 2005年第12期
* Antiviral Research Center, University of California, San Diego; and Bioinformatics Research Center, North Carolina State University
E-mail: muse@stat.ncsu.edu.
Abstract
We develop a new model for studying the molecular evolution of protein-coding DNA sequences. In contrast to existing models, we incorporate the potential for site-to-site heterogeneity of both synonymous and nonsynonymous substitution rates. We demonstrate that within-gene heterogeneity of synonymous substitution rates appears to be common. Using the new family of models, we investigate the utility of a variety of new statistical inference procedures, and we pay particular attention to issues surrounding the detection of sites undergoing positive selection. We discuss how failure to model synonymous rate variation in the model can lead to misidentification of sites as positively selected.
Key Words: synonymous substitution rate ? positive selection ? molecular evolution ? codon model ? model selection
Introduction
One of the most fundamental goals of molecular evolutionary studies is the identification and quantification of sources of heterogeneity in rates and patterns of nucleotide substitution. To facilitate these studies, an ever-growing list of stochastic models of sequence evolution has appeared in the literature. Beginning with the simple model of Jukes and Cantor (1969), which assumed that all changes at all sites occurred at equal rates, researchers have incorporated such effects as unequal base frequencies (Felsenstein 1981), transition/transversion bias (Kimura 1980; Hasegawa, Kishino, and Yano 1985), site-to-site rate heterogeneity (Uzzel and Corbin 1971; Yang 1993), differences in synonymous and nonsynonymous substitution rates (Goldman and Yang 1994; Muse and Gaut 1994), and correlated rates or patterns of evolution among sites (Sch?niger and von Haeseler 1994; Felsenstein and Churchill 1996). The development of these models allowed biologists to formally estimate the parameters governing the process of sequence evolution and to make statistically rigorous inferences about them. With rapid growth in molecular sequence databases coupled with continued improvements in the accessibility of likelihood methods in molecular evolution (Huelsenbeck and Crandall 1997), having a variety of biologically realistic models to use for data analysis is more important than ever.
Models describing the evolution of protein-coding DNA sequences are a relatively new addition to the molecular evolution landscape. In 1994, the first tractable models of codon evolution were introduced (Goldman and Yang 1994; Muse and Gaut 1994). In the decade since then, only a handful of modifications of these original models have been published (e.g., Nielsen and Yang 1998; Pedersen, Wiuf, and Christiansen 1998; Yang et al. 2000). Of particular note are models that allow for variation in substitution rates among sites within a gene. Nielsen and his colleagues have published a fascinating series of papers exploiting codon models to identify lineages and codon sites that are undergoing adaptive or positive selection (e.g., Nielsen and Yang 1998; Yang and Nielsen 2002). The key element of these models is the ratio of nonsynonymous to synonymous substitution rates, . By allowing to vary across sites using methods similar to those developed for nucleotide models (Yang 1993, 1994), individual sites with > 1—a hallmark of positive selection—may be identified. One important technical choice was made in the formulation of all of these models: the rate of synonymous substitution was assumed to be constant across sites, implying that site-to-site variation in is solely the result of corresponding variation in nonsynonymous rates. If this assumption is incorrect, variation in is actually related to variation in both synonymous and nonsynonymous rates, and subsequent inferences about may be flawed. We would like to test this assumption and develop appropriate inference procedures for cases where tests suggest that synonymous rates are variable. While there have been many studies that showed large variation in synonymous substitution rates between different organisms, genomic regions, and individual genes (see, for example, Miyata and Yasunaga 1980; Wolfe, Li, and Sharp 1987; Wolfe, Sharp, and Li 1989; Gaut et al. 1996; Hanada, Suzuki, and Gojobori 2004) and while some evidence for variable silent substitution rates within a gene has been presented (Hurst and Pal 2001; Tuplin et al. 2002), to our knowledge there are no published results on modeling the site-to-site variation of synonymous rates in a "single" gene. Toward this end, we develop and study a class of stochastic models that allow for simultaneous variation in both categories of rates in coding sequences. We then provide a series of statistical inference procedures, including methods for identifying and quantifying amounts of site-to-site heterogeneity, locating sites that are targets of positive selection, and estimation of evolutionary distances.
Materials and Methods
A Model for Codon Evolution with Variable Rates of Synonymous and Nonsynonymous Substitution
Existing Codon Models
We describe our model as a modification of the MG94 codon model (Muse and Gaut 1994), although the principles can be easily applied to any substitution model containing two or more substitution rates (such as HKY85 or GY94). The MG94 model is defined by the instantaneous rate matrix whose entry reflects the rate of change from codon x to codon y. Formally,
where refers to the frequency of the "target" nucleotide in codon y at codon position m. For instance, the target nucleotide for the synonymous substitution is T in the third codon position, and its corresponding rate is Under this model, the equilibrium frequency of codon ijk is the product of the constituent nucleotide frequencies, scaled to account for the absence of stop codons. For sequences obeying the universal genetic code these frequencies are
(1)
Other genetic codes can be accommodated by making the obvious modifications. Finally, note that the original version of the MG94 model assumed that nucleotide frequencies were shared by all three codon positions.
For a given evolutionary lineage, the estimable substitution parameters when using the MG94 model are t and ?t. Recognizing this natural confounding of evolutionary rates and times (Felsenstein 1981), we can arbitrarily set = 1. This formulation is similar to that of Goldman and Yang (1994), and it has the advantage of expressing the substitution rates in terms of the ratio of nonsynonymous and synonymous rates, = ?/, which provides an indication of the magnitude of selective pressure. Originally, was treated as a deterministic model parameter; however, one can easily envision varying across sites. Nielsen and Yang (1998) and Yang et al. (2000) studied models with codon-to-codon variation of described by a variety of probability distributions. Approaches of this type allow us to relate patterns of sequence evolution to various biological hypotheses of interest. For completeness and consistency of notation, we include details of how to define and evaluate the phylogenetic likelihood in this context as Supplementary Material online (originally discussed in Nielsen and Yang 1998; Yang et al. 2000).
Modifications for Variable Synonymous Substitution Rates
The basic modeling approach described above allows for site-to-site variation in by treating the synonymous rate as constant across sites and introducing variable nonsynonymous rates. Using the notation of the original MG94 model, is shared by all sites and ? varies over sites. Our extension of this approach allows rates of both synonymous and nonsynonymous substitutions to vary across sites. The revised rate matrix can be written in the same form as the original MG94 model, but with the substitution rates at codon s considered as random quantities (s and ?s) rather than deterministic model parameters. We describe (s, ?s) as a random vector with a bivariate density h(s, ?s;) parameterized by , which may also be a vector. Each site in a sequence is modeled with its own (unknown) value of (s, ?s), leading to variation in s = ?s/s.
Let our data consist of a set of aligned sequences related by a phylogenetic tree , with the sth column of the alignment denoted as s, s = 1, ..., S. Computing the likelihood function requires integrating over the positive quarter-plane
A closed form expression is unavailable and it is not practical to perform numerical integration, so we restrict ourselves to considering discrete (or discretized) distributions as Yang (1994) introduced for nucleotide models. For such distributions the likelihood can be computed by carrying out a double summation:
where (rij, qij), i = 1, ..., M, j = 1,..., N denote the possible values for (s, ?s). If s and ?s are sampled from independent distributions, this expression can be rewritten as:
The conditional probability can be computed efficiently using Felsenstein's pruning algorithm (Felsenstein 1981), with numerical matrix exponentiation routines (Moler and Loan 1978) used to evaluate the transition probabilities needed for the site likelihoods (c.f. Goldman and Yang 1994; Muse and Gaut 1994). Note that the S sites produce only M x N sets of transition probabilities, eliminating the need for repeated calculations at each site. Additional computational savings in evaluating the likelihood function are described in Larget and Simon (1998) and Kosakovsky Pond and Muse (2004). Maximum likelihood estimates (MLEs) of and the branch lengths in can then be obtained numerically. Likelihood function evaluation can be distributed across multiple processors using the algorithm described in Kosakovsky Pond and Frost (2005c).
Although the notation has grown cumbersome, the basic idea is simple: we have provided M rate categories for synonymous substitution rates, N rate categories for nonsynonymous substitution rates, and a discrete probability distribution for the M x N possible combinations of these categories. The model posits that each codon falls into one of the M x N categories, and the likelihood function is calculated by taking a weighted sum over all possible assignments of codons to rate categories.
As is typical for studies of substitution rates, only products of rates and times can be estimated (Felsenstein 1981), and we can restrict one of the distributions to have a fixed mean. We choose to restrict the distribution of s to have unit mean because this leads to a natural extension of current models, which set s = 1 for all sites. This restriction is analogous to current rate heterogeneity models (e.g., Yang 1993) that use one-parameter gamma distributions with unit mean instead of the more general two-parameter gamma.
Independent Gamma Distributions for Synonymous and Nonsynonymous Rates
For purposes of illustrating the modeling approach, we will assume that both synonymous rates s and nonsynonymous rates ?s are sampled independently from gamma distributions, with separate means and variances for the two types of substitutions. We define the gamma density with mean A and variance A2/B. The synonymous rates s are drawn from a unit mean gamma density g(s; 1, μ). In the simplest case, we assume that nonsynonymous rates are drawn from a gamma density g(?s; , μ?).
It is often desirable to include a category of invariable sites (Fitch 1971). To capture such behavior, we also introduce a model where nonsynonymous rates are described by a mixture of a gamma distribution with a point mass at 0. If a fraction P of the sites are invariable at the amino acid level, the density is given by gi (x; A, B, P) = P0(x) + (1 – P)g(x; A, B), where 0(x) denotes the point mass distribution at 0, i.e., 0(x) = 0 for x > 0 and The mean and variance of this Gamma + Inv distribution are E[?s] = (1 – P)A and Under this framework, the extension of the simple bivariate gamma leads to nonsynonymous rates ?s being sampled from gi(?s; , μ?, P).
Using these models, the joint density for (s, ?s) depends on either three or four parameters—the shape parameter for synonymous rates μ, both shape and scale parameters for nonsynonymous rates μ? and , and in the second setting the proportion of invariant sites P—and can be written as either h(s, ?s; μ, μ?, ) = g(s; μ)g(?s; , μ?) or h(s, ?s; μ, μ?, , P) = g(s; μ)gi(?s; , μ?, P) as appropriate. Using the discretization procedure described in the Supplementary Material online to convert the continuous bivariate distribution to a discrete one, we are able to fit our models to a sequence alignment as described previously.
Independent Discrete Distributions for Synonymous and Nonsynonymous Rates
In general, there is no compelling reason to believe that substitution rates follow a gamma (or any other simple) distribution. The choice of the gamma distribution in Yang (1993) was justified largely by the flexibility of its parametric family and by the simplicity of estimating a single shape parameter. Given a fixed computational complexity (i.e., the number of rate classes D), the most general parametric approach to fitting a rate distribution is to use a "general discrete" distribution (GDD) with D bins. Such a distribution is specified by enumerating its values di, i = 1, ..., D (without loss of generality we may assume that di < dj, for all i < j) and the probabilities of observing each value pi = Pr{X = di}, subject to A GDD distribution with D bins will include 2D – 1 free parameters to be estimated by maximum likelihood. When modeling variation in synonymous rates the mean of the distribution must be restricted to 1 (cf. UGDD distribution in Kosakovsky Pond and Frost 2005a), reducing the number of free parameters to 2D – 2.
GDDs are parameter rich, and care must be taken not to include too many rate classes because it can easily lead to estimation problems such as slow convergence, multiple local maxima, and rapid performance degradation. For all but the largest alignments the value of D should not exceed 5 (unpublished data and Kosakovsky Pond and Frost 2005a).
Model Extensions and Variations
The previous section described the basic modeling approach we have used for investigating site-to-site heterogeneity of synonymous substitution rates. We are interested in improvements in model fit, in the robustness of parameter estimation (particularly of the nonsynonymous:synonymous rate ratio ) to model choice, and in exploration of the extent to which synonymous rates vary from site to site within gene sequences. To facilitate these investigations, we need to describe a series of variations on the basic model designed to allow comparisons with the published literature and to address biological hypotheses of interest. Each specific model consists of four components: the rate heterogeneity model, the "core" rate matrix, the rate distribution used, and the discretization scheme, and each component is described below.
The Core Rate Matrix
Entries in the instantaneous rate matrix QMG94 take the form In the simplest case we had only two possibilities, for synonymous substitutions and for nonsynonymous changes. However, one might want to include other common substitution biases, such as differences in transition and transversion rates. One can think of combining the basic MG94 model that accomodates separate synonymous and nonsynonymous rates with a nucleotide-based model to capture nucleotide-level factors. It is this combination that we dub the "core" rate matrix. We can create a core matrix in this manner using any time reversible 4 x 4 nucleotide model, but in the current work we will limit ourselves to the following core rate matrices.
MG94 x REV.
In these models the Muse and Gaut (1994) model is enhanced with a separate substitution parameter for each nucleotide substitution type. The rate matrix elements describing the substitution of codon x with codon y take the form:
To ensure reversibility, ij = ji. Because we can only estimate the products of rates and times, one of the parameters ij cannot be identified, and without loss of generality we set AG = 1. This rate matrix is the most general one used in this study.
MG94.Set all ij = 1.
MG94 x HKY85.
Set AG = CT = 1 and AC = AT = CG = GT.
GY94.
The Goldman and Yang (1994) model does not quite follow the general form described in this work, differing slightly in the modeling of equilibrium frequencies. We include this core matrix so that we can use it for direct comparisons with published results. The GY94 differs from MG94 x HKY85 in that each nonzero rate in the matrix is proportional to the frequency of the target "codon" y instead of the frequency of the target nucleotide. This feature leads to small (but typically negligible) differences from the MG94 x HKY85 model.
The Rate Heterogeneity Model
We described above a basic model for allowing site-to-site rate heterogenetiy of both synonymous and nonsynonymous rates in which rates were described by independent random variables. For purposes of exploring the utility of allowing heterogeneity of synonymous rates, and specifically to allow us to test various hypotheses about the relationship between synonymous and nonsynonymous rates, we now introduce several modifications of that scheme.
Constant: constant rates model.
This model is the simplest used in this work. It sets s = 1 and ?s = for every site s in the alignment. The original MG94 and GY94 models used this approach of having no site-to-site rate variation.
Nonsynonymous: variable nonsynonymous rates model.
This rate heterogeneity model was used by Yang et al. (2000) and described in the "Existing Codon Models" above. All sites have s = 1, while the ?s values are sampled from a given rate distribution (e.g., gamma or general discrete). "Nonsynonymous" assumes that synonymous rates do not vary over sites. However, the nonsynonymous-to-synonymous rate ratio does vary over sites as a result of the nonsynonymous rate variation.
Proportional: proportional variable rates model.
"Proportional" is perhaps the simplest approach for allowing both types of rate heterogeneity. The basic framework is similar toYang et al. (2000). Unlike that work, however, the synonymous rates are not constant over sites. Instead, the nonsynonymous rate is proportional to the synonymous rate, with the proportionality constant shared by all sites. The s values are drawn from a given unit mean rate distribution, and the nonsynonymous rates are then given by ?s = s, where is the ratio of nonsynonymous to synonymous rates and is shared by all sites. Such a model might be effective if site-specific mutation rates were the driving factor in creating site-to-site heterogeneity (Kosakovsky Pond and Frost 2005a).
Dual: dual variable rates model.
"Dual" is the primary rate heterogeneity model described in the "Modifications for variable synonymous substitution rates" above. The s and ?s values are sampled from two given rate distributions, with s restricted to have mean 1. Dual assumes that synonymous and nonsynonymous rates are independent.
Lineage Dual: dual variable rates model, branch corrections.
Model "Lineage Dual" is an extension of Dual that allows for lineage-specific adjustments to the ratio, , of nonsynonymous to synonymous rates. The values of s and an ancillary quantity s are sampled from two (independent) unit mean distributions and ?s = b x s; b denotes the E[nonsynonymous]/E[synonymous] ratio for branch b. The branch correction parameters, b, are estimated separately for every branch in the phylogeny. They can be thought of as correction factors to the "average" value of defined in Dual, with larger values of b indicative of branches with values higher than the tree average. These values are useful for identifying specific lineages that may have undergone adaptive evolution events (Yang 1998; Kosakovsky Pond and Frost 2005a). Dual is a particular case of Lineage Dual and can be used to test the null hypothesis b = for all branches b. Lineage Dual is the most general model considered in this study. A diagram illustrating the hierarchy of the models is found in figure 1.
FIG. 1.— Hierarchical presentation of rate heterogeneity models used in this work.
The Discretization Method
Having selected a rate heterogeneity model based on gamma-distributed random variables, for practicality we must now convert those continuous distributions into discrete approximations. Using the schemes discussed in Supplementary Material online, discretization based on M 2 synonymous rate classes and N 2 nonsynonymous rate classes is denoted by M x N. The inclusion of models Nonsynonymous and Proportional, which have only one random rate, as well as the use of distributions with an invariant category for Dual and Lineage Dual, require us to define some rules for discretization. For Nonsynonymous and Proportional, we discretize the single distribution into MN equiprobable rate classes. For Dual and Lineage Dual, we discretize the distribution of synonymous rates into M equiprobable classes and the distribution of nonsynonymous rates into N equiprobable classes. If an invariable class is present for nonsynonymous rates, we discretize the distribution of synonymous rates into M equiprobable classes, the continuous portion of the mixture of nonsynonymous rates into N – 1 equiprobable classes, and add the invariable class at 0 to obtain N total nonsynonymous classes. Under this discretization strategy, all models have MN rate classes and equivalent computational costs.
When GDDs are used, no discretization is needed and one only need specify the number of rate classes.
With the three components in place, we now have a wide selection of models from which to choose. To completely specify a model, we list its rate variation model, core rate matrix, rate heterogeneity method, rate distributions used, and the discretization method. The model MG94 x REV Dual Gamma 5 x 7 would use the MG94 core matrix composed with the REV nucleotide model, have separate independent gamma distributions for synonymous and nonsynonymous rates, discretized into five synonymous categories and seven nonsynonymous categories.
A Simple Test for Presence of Synonymous Rate Variation
The Dual and Nonsynonymous models are nested if GDD distributions are used to model rate variation, making it possible to assess significance using standard likelihood ratio tests. If N nonsynonymous rate categories are used both in Dual and in Nonsynonymous and if M synonymous rate categories are employed in Dual, then there are 2M – 2 more estimable parameters in Dual than in Nonsynonymous. Dual can be constrained so that Nonsynonymous is obtained. When enough alignment columns are available, the asymptotic 2 distribution with 2M – 2 degrees of freedom can be used to compute P values. For smaller data sets a parametric bootstrap P value can be found.
Alignments, Trees, and Programs
To investigate the utility of the new models in exploring site-to-site rate heterogeneity, we made use of 10 previously studied codon alignments (table 1) of varying sizes and levels of divergence. The archive of alignments and phylogenetic trees in NEXUS format can be downloaded from http://www.hyphy.org/pubs/2rates.tgz. We examine the mammalian ?-globin data set in great detail but for brevity focus only on tests for the presence of synonymous rate variation in other data sets.
Table 1 Data Sets Analyzed for Presence of Synonymous Rate Variation
All analyses were performed using "HyPhy" version .99? (Kosakovsky Pond, Frost, and Muse 2004) for Mac OS X on a dual G5/2Ghz workstation or a 16-processor Linux MPI cluster. Both the program and the batch files for comparing the models discussed herein and result processing are included as standard analyses with the HyPhy distribution (http://www.hyphy.org). The chart generation option used for creating some of the graphics in this paper is presently only available in Mac OS and Windows.
Results
Nucleotide Substitution Biases
In all data sets analyzed for this paper, the MG94 x REV model of codon substitution was preferred over both MG94 x HKY85 and traditional MG94 when assessed by the appropriate likelihood ratio test, and it was also chosen over GY94 using Akaike Information Criterion (AIC) (Supplementary Material online). This finding held even when different rate variation models were considered. For the remainder of this article, we consider only the MG94 x REV core model unless specified otherwise.
Choice of Rate Distribution
For every choice of rate variation models, GDD tends to have similar or better AIC scores than a discretized gamma (+Inv) distribution (Supplementary Material online). Fewer rate categories are needed by GDD to achieve good fits when compared to continuous discretized distributions, and consequently the analyses require less computer time to run. In general, a uniformly discretized gamma distribution may be a poor choice to capture the underlying pattern of rate variation, even if the true distribution of substitution rates is a gamma (Kosakovsky Pond and Frost 2005c). Adaptive discretization schemes or GDDs may be a better general choice for modeling distributions of rates. We chose to focus on the results derived from using GDDs of rates, although similar results have been obtained with a variety of other distributions (results not shown).
Heterogeneity of Synonymous Rates Is Not Negligible
Each of the ten data sets was analyzed for evidence of synonymous rate variation (table 1). The MG94 x REV model, using GDD was fit to each data set under two rate variation hypotheses:
Null—No synonymous rate variation: Nonsynonymous with three rate categories.
Alternative—Presence of synonymous rate variation: Dual, with three rate categories for s and three for ?s.
Because the models are nested, a standard likelihood ratio test using the asymptotic distribution can be used to assess significance. Additionally, we employed the AIC to quantify the relative improvement of fit. The computational difficulty of the Dual model was increased roughly by a factor of 3 compared to the Nonsynonymous model; however, if run in parallel on a cluster of computers with at least 10 nodes (Kosakovsky Pond and Frost 2005c) the models are roughly equivalent in terms of overall run time. The choice of three (or 3 x 3) rate categories was partly historical, e.g., the often cited M3 model of Yang et al. (2000) used a GDD with three rate categories and partly because finite sample sizes do not, in general, permit reliable inference of more than a few rate classes (Kosakovsky Pond and Frost 2005c).
In 9 of 10 cases the hypothesis of constant synonymous rates can be rejected at the 0.01 level. The Drosophila adh data set is the sole exception where the test failed to reject the assumption of constant synonymous rates. Dual was preferred by AIC in the same nine cases, with score improvements ranging from 12.43 to 291.24, with larger data sets yielding more pronounced AIC decreases.
Coefficients of variation (CV: defined as the standard deviation of a distribution divided by its mean) can be used to quantify and compare the variability of different distributions. The CV for synonymous rate distributions ranged from 0.18 to 1.47 and in many cases were comparable to, or even greater than, the CV of the corresponding nonsynonymous distributions (table 2). Note that while for most data sets both models yielded similar estimates of overall tree lengths (table 1) and CV of the nonsynonymous rate distributions, there were some notable exceptions (COXI, ?-globin, and flavivirus NS5 data sets) for which there was a pronounced effect on the estimate of overall sequence divergence. This effect could be important, for example, in estimation of divergence times and definitely merits further exploration.
Table 2 Descriptive Statistics for the Inferred Distributions of Synonymous (s) and Nonsynonymous (?s) Substitution Rates
Detailed Analysis of the ?-Globin Data Set
In the following paragraphs, we illustrate potential applications of the models developed above through a series of analyses of the ?-globin data set. We examine the fits of various models and investigate the inferred rate heterogeneity distributions.
Model Selection for the ?-Globin Data Set
A more thorough analysis of the ?-globin data is presented in table 3. Introduction of branch-specific rate correction factors in Lineage Dual appears to lead to a significant fit improvement over Dual model as measured by the likelihood ratio test (asymptotic P value of 0.005). Inferred branch selection coefficients b ranged from 0 to 3.7 (median 0.31). The Lineage Dual model can serve as a simple test for heterogeneity in average selection pressure among lineages when both substitution rates are allowed to vary among sites, complementing existing branch-site models (Yang 1998; Yang and Nielsen 2002). It is likely (Kosakovsky Pond and Frost 2005a) that a model where each branch has an adjustable b is too parameter rich and that a number of models with only a few classes of b will fit the data as well, a possibility reinforced by the fact that the best model using AIC as a criterion is Dual. Note that for this particular data set, estimates of the various parameters regulating site-to-site rate heterogeneity appear quite insensitive to model selection, although the results from tables 1 and 2 indicate that this behavior is not universal.
Table 3 MG94 GDD 3 x 3 Applied to the ?-Globin Alignment
Examining Rate Patterns at Sites in the ?-Globin Data Set
Once the MLEs, , of the entire collection of unknown parameters, , have been obtained, one can ask questions about rate patterns among sites in the data set. One of the results produced by a likelihood analysis is a table of conditional probabilities comprised of the entries s = 1, ..., S, j = 1, ..., M (number of synonymous rate classes), k = 1, ..., N (number of nonsynonymous rate classes): Given we can compute the empirical Bayes posterior probability of site s belonging to rate class (j, k) by applying the elementary Bayes rule:
The posterior distribution of rates at a site is useful in determining the rate of evolution at a given site. For instance, one can simply assign site s to the rate class (j, k) that has the maximal posterior probability. Note that if all rate classes are equiprobable, the same results can be obtained simply by selecting the rate class with the largest conditional posterior probability (table 4). In figure 2, we show examples of sites assigned to different rate classes by various models.
Table 4 Rate Class Assignments Under MG94 x REV Dual GDD 3 x 3 for ?-Globin
FIG. 2.— Examples of rate class assignments and positive selection identification for ?-globin under MG94 x REV Dual GDD 3 x 3, MG94 x REV Nonsynonymous GDD 3, and M3 in (Yang et al. 2000).
Additional indicators of the rate pattern present at site s are the expected posterior synonymous and nonsynonymous rates at that site, denoted by and (fig. 3):
Note that both and shown in figure 3 show substantial variation (as suggested by their respective CV in table 2) but with little obvious spatial correlation.
FIG. 3.— Expected posterior rates (E[S|i] for synonymous rates and E[NS|i] for nonsynonymous rates) for ?-globin under MG94 x REV Dual GDD 3 x 3.
Selective Pressure
For the Dual and Lineage Dual models, one can examine the conditional distribution of s = ?s/s (or Ms = ?s – s) over codon sites and attempt to identify sites under selective pressure. Because the distributions used for s and ?s are always discrete (or discretized), the inferred distribution of s (or Ms) is also discrete and can be obtained in a straightforward way. Following the empirical Bayes methodology of Nielsen and Yang (1998), given the posterior probabilities as defined above and a cutoff level 0 C 1, we label a codon site s as a site under selective pressure if
One could also employ Bayes factors to identify sites that are under selective pressure. A Bayes factor BF(s) for the hypothesis (or event) at site s is defined as the ratio of posterior odds of the event and its prior odds:
A large value of BF(s), for instance BF(s) > 100 ("decisive" evidence, see Kass and Raftery 1995) indicates that the effect of the data in column s on the prior event of positive selection is amplifying and thus suggestive of the presence of selective pressure at site s. One of the advantages of using Bayes factors as opposed to a posterior probability cutoff is that Bayes factors properly account for the effect of the prior distribution. For example, first suppose that Pr{Ws} = 0.7 and It is easy to check that BF(s) = 8.14. Next, consider Pr{Ws} = 0.05 and Now, BF(s) = 361. Both situations would be treated as equally strong evidence for positive selection with the posterior cutoff but vastly different under Bayes factors. It seems reasonable to consider the "relative" effect of the data upon the prior as measured by BF(s), if for no other reason than to make the conclusions of positive selection analyses based on different priors directly comparable.
Under certain scenarios (Kosakovsky Pond and Frost 2005b), the reciprocals of Bayes factors serve as a useful approximate measure of the rates of false positives of the test for selection at a site. In practice, however, there seems to be little difference in inferring the location of sites under selection, whether posterior P values or Bayes factors are used (Kosakovsky Pond and Frost 2005b; unpublished simulation results). Empirical Bayes inference can be strongly influenced by errors in MLE. Recently, several methods to correct for this effect have been suggested. One can either use a hierarchical Bayes model (Huelsenbeck and Dyer 2004) or employ Bayes Empirical Bayes inference (Yang, Wong, and Nielsen 2005). Because our Dual models include many estimated parameters, we caution that estimation errors can exert strong influence on the empirical Bayes analysis, especially for smaller data sets. While we base our initial analyses on a simplistic application of empirical Bayes inference, both approaches to dealing with errors in MLEs can be adapted to include synonymous rate variation. Furthermore, new approaches (e.g., sampling importance resampling) could be brought to bear on handling errors in parameter estimates, and we plan to extend our methodology to account for the effect of such estimation errors.
Comparison with Existing Methods
In Yang et al. (2000), model M3 was found to have the best fit for the ?-globin data (AIC = 7,408.12). Note that M3 of Yang et al. (2000) can be denoted as GY94 Nonsynonymous GDD 3 using the conventions of this paper. GY94 Nonsynonymous GDD 3 and MG94 x REV Dual GDD 3 x 3 (AIC = 7,388.23) differ in three aspects: (1) modeling of rate variation, (2) compensating for nucleotide substitution biases, and (3) treatment of equilibrium frequencies. In order to focus specifically on the effect of failing to allow for synonymous rate variation we also included results from MG94 x REV Nonsynonymous GDD 3 (AIC = 7,400.66). The AIC scores for the three models suggest that site-to-site synonymous variation is present and that it is important to model. In figure 2, we illustrate examples of sites whose rates can be categorized as high synonymous/low nonsynonymous (codon 55), high synonymous/high nonsynonymous (codon 85), low synonymous/high nonsynonymous (codon 48), and moderate synonymous/high nonsynonymous (codon 7). The presence of this variety of site categories suggests that it might be important to include both types of site-to-site rate variation.
While the model fit is important, we focus now on the specific task of identifying sites under positive selection. A few sites illustrate important points with regard to the suitability of models for this task. The most crucial difference between methods is found at codon 85, a position with high synonymous and nonsynonymous diversity. Because of the high level of nonsynonymous diversity, models that assume equal synonymous rates identify such a site positively selected with high statistical significance (note the Bayes factors/posterior probabilities of 5,545 and 0.998 in fig. 2). However, an argument can be made that in reality site 85 is evolving neutrally or even under purifying selection. Using MG94 x REV Dual GDD 3 x 3, the estimated ratio of nonsynonymous to synonymous rates at site 85 is 0.108. The model assigns high synonymous and nonsynonymous rates to such variable sites, leading to a negligible positive selection signal. A look at figure 3 suggests that while evolution at most sites in the alignment can be explained well by a single synonymous rate, there are a number of prominent exceptions where synonymous rate estimates are substantially elevated.
Figure 4 illustrates the effects of including (or excluding) site-to-site variation in synonymous rates when searching for sites undergoing positive selection. We used models MG94 x REV Nonsynonymous GDD 3 and MG94 x REV Dual GDD 3 x 3 in order to focus solely on the performance differences arising from the inclusion or exclusion of site-to-site synonymous rate variation. The scatterplots of posterior expected s and posterior probabilities for positive selection (?s > s) should form a line along the y = x axis if the models lead to the same inferences. Predictably, the models agree on a large subset of the sites. However, the most important feature of the plots is the tendency for some sites (7, 67, 85, and 123) with large support for positive selection under the Nonsynonymous model to have considerably weaker support when site-to-site synonymous variation is introduced using the Dual model. This behavior could lead to an elevated rate of false positives when analyzing data sets that include synonymous rate variation with Nonsynonymous-type models (for an example, see Kosakovsky Pond and Frost 2005b).
FIG. 4.— Comparison of site-by-site selection inference using MG94 x REV Dual GDD 3 x 3 and MG94 x REV Nonsynonymous GDD. Shaded areas in each plot show codons that are classified as positively selected by the Nonsynonymous model only (cutoff of 0.9 used for the posterior probability plot).
As a final investigation of the relationship between model choice and identification of positively selected sites, in table 5, we display results for every site that had a Bayes factor for positive selection greater than 100 (which also translates to posterior probabilities of s > 1 being over 0.9 in all cases) using either MG94 x REV Dual GDD 3 x 3 or M3. For most sites the two models are in general agreement. However, large discrepancies are found at sites 7, 67, and 85, and to a lesser degree at site 123. In each of these cases the evidence for positive selection disappears when synonymous rate variation is introduced.
Table 5 Evidence of Positive Selection with and Without Synonymous Rate Heterogeneity
Conclusions
The results above clearly indicate that the inclusion of potential site-to-site variation in synonymous rates when studying the molecular evolution of protein-coding regions is desirable. From the statistical perspective, these models allow for rigorous testing of parameters regulating the process of sequence evolution. Applications of the new models and associated inference tools on ten data sets indicate that variation in synonymous substitution rates is a real phenomenon. This observation, if it turns out to be widespread, will no doubt be one of considerable interest: investigations using estimates of synonymous rates for molecular clock dating could be brought into question; synonymous rate variation might be used to explain the overdispersed clock; positive selection studies might require second looks. At the most basic level, it raises questions as to the biological sources of the heterogeneity. Selective arguments, perhaps hinging on localized codon usage patterns or avoidance of transcription factor–binding site motifs, could be crafted. An explanation drawing on variation of mutation rates at different dinucleotides has some appeal and has been described in Hepatitis C (Tuplin et al. 2002). One could also imagine an explanation based on properties of DNA winding on histones. In any event, the model provides for the first time the tools to carefully investigate any of these phenomena that might actually be present in nature, and our analyses suggest that the phenomenon is ubiquitous enough to merit close attention.
Supplementary Material
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgements
The authors would like to thank Joseph Watkins, Simon D. W. Frost, Andy J. Leigh Brown, and Jesse Taylor for their comments and suggestions.
This research was supported in part by the National Science Foundation (award number 0331654), National Institutes of Health (AI47745, AI43638, and AI57167), the University of California Universitywide AIDS Research Program (grant number IS02-SD-701), and by a University of California, San Diego Center for AIDS Research/National Institute of Allergy and Infectious Diseases Developmental Award (AI36214).
References
Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716–723.
Anisimova, M., and Z. Yang. 2004. Molecular evolution of the hepatitis delta virus antigen gene: recombination or positive selection? J. Mol. Evol. 59:815–826.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93–104.
Fitch, W. 1971. Towards defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406–416.
Gaut, B. S., B. R. Morton, B. McCaig, and M. T. Clegg. 1996. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcl. Proc. Natl. Acad. Sci. USA 93:10274–10279.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.
Hanada, K., Y. Suzuki, and T. Gojobori. 2004. A large variation in the rates of synonymous substitution for RNA viruses and its relationship to a diversity of viral infection and transmission modes. Mol. Biol. Evol. 21:1074–1080.
Harmsen, M., R. Ruuls, I. Nijman, T. Niewold, Franken, L. G., and B. de Geus. 2000. Llama heavy-chain V regions consist of at least four distinct subfamilies revealing novel sequence features. Mol. Immunol. 37:579–590.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
Huelsenbeck, J., and K. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466.
Huelsenbeck, J. P., and K. A. Dyer. 2004. Bayesian estimation of positively selected sites. J. Mol. Evol. 58:661–672.
Hurst, L., and C. Pal. 2001. Evidence for purifying selection acting on silent sites in BRCA1. Trends Genet. 17:62–65.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. M. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kass, R., and A. Raftery. 1995. Bayes factors. J. Am. Stat. Assoc. 90:773–795.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
Kosakovsky Pond, S. L., and S. D. Frost. 2005a. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol. Biol. Evol. 22:478–485.
———. 2005b. Not so different after all: a comparison of methods for detecting amino-acid sites under selection. Mol. Biol. Evol. 22:1208–1222.
———. 2005c. A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol. 22:223–234.
Kosakovsky Pond, S. L., S. D. W. Frost, and S. V. Muse. 2004. Hyphy: hypothesis testing using phylogenies. Bioinformatics 21:676–679.
Kosakovsky Pond, S. L., and S. V. Muse. 2004. Column sorting: rapid calculation of the phylogenetic likelihood function. Syst. Biol. 53:685–692.
Larget, B., and D. Simon. 1998. Faster likelihood calculations on trees. Technical Report 98-02. Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pa.
Miyata, T., and T. Yasunaga. 1980. Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application. J. Mol. Evol. 16:23–36.
Moler, C., and C. V. Loan. 1978. Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev. 20:801–836.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positive selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936.
Pedersen, A. M. K., C. Wiuf, and F. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069–1081.
Sch?niger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240–247.
Seo, T.-K., H. Kishino, and J. Thorne. 2004. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol. Biol. Evol. 21:1201–1213.
Tuplin, A., J. Wood, D. Evans, A. Patel, and P. Simmonds. 2002. Thermodynamic and phylogenetic prediction of RNA secondary structures in the coding region of hepatitis C virus. RNA 82:824–841.
Uzzel, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089–1096.
Wolfe, K. H., W.-H. Li, and P. M. Sharp. 1987. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc. Natl. Acad. Sci. USA 84:9054–9058.
Wolfe, K. H., P. M. Sharp, and W.-H. Li. 1989. Rates of synonymous substitution in plant nuclear genes. J. Mol. Evol. 29:208–211.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.
———. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:105–111.
———. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568–573.
———. 2000. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51:423–432.
Yang, Z., and W. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49–57.
Yang, Z., W. S. Wong, and R. Nielsen. 2005. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22:1107–1118.
Yang, Z. H., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908–917.
Yang, Z. H., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.(Sergei Kosakovsky Pond* a)
E-mail: muse@stat.ncsu.edu.
Abstract
We develop a new model for studying the molecular evolution of protein-coding DNA sequences. In contrast to existing models, we incorporate the potential for site-to-site heterogeneity of both synonymous and nonsynonymous substitution rates. We demonstrate that within-gene heterogeneity of synonymous substitution rates appears to be common. Using the new family of models, we investigate the utility of a variety of new statistical inference procedures, and we pay particular attention to issues surrounding the detection of sites undergoing positive selection. We discuss how failure to model synonymous rate variation in the model can lead to misidentification of sites as positively selected.
Key Words: synonymous substitution rate ? positive selection ? molecular evolution ? codon model ? model selection
Introduction
One of the most fundamental goals of molecular evolutionary studies is the identification and quantification of sources of heterogeneity in rates and patterns of nucleotide substitution. To facilitate these studies, an ever-growing list of stochastic models of sequence evolution has appeared in the literature. Beginning with the simple model of Jukes and Cantor (1969), which assumed that all changes at all sites occurred at equal rates, researchers have incorporated such effects as unequal base frequencies (Felsenstein 1981), transition/transversion bias (Kimura 1980; Hasegawa, Kishino, and Yano 1985), site-to-site rate heterogeneity (Uzzel and Corbin 1971; Yang 1993), differences in synonymous and nonsynonymous substitution rates (Goldman and Yang 1994; Muse and Gaut 1994), and correlated rates or patterns of evolution among sites (Sch?niger and von Haeseler 1994; Felsenstein and Churchill 1996). The development of these models allowed biologists to formally estimate the parameters governing the process of sequence evolution and to make statistically rigorous inferences about them. With rapid growth in molecular sequence databases coupled with continued improvements in the accessibility of likelihood methods in molecular evolution (Huelsenbeck and Crandall 1997), having a variety of biologically realistic models to use for data analysis is more important than ever.
Models describing the evolution of protein-coding DNA sequences are a relatively new addition to the molecular evolution landscape. In 1994, the first tractable models of codon evolution were introduced (Goldman and Yang 1994; Muse and Gaut 1994). In the decade since then, only a handful of modifications of these original models have been published (e.g., Nielsen and Yang 1998; Pedersen, Wiuf, and Christiansen 1998; Yang et al. 2000). Of particular note are models that allow for variation in substitution rates among sites within a gene. Nielsen and his colleagues have published a fascinating series of papers exploiting codon models to identify lineages and codon sites that are undergoing adaptive or positive selection (e.g., Nielsen and Yang 1998; Yang and Nielsen 2002). The key element of these models is the ratio of nonsynonymous to synonymous substitution rates, . By allowing to vary across sites using methods similar to those developed for nucleotide models (Yang 1993, 1994), individual sites with > 1—a hallmark of positive selection—may be identified. One important technical choice was made in the formulation of all of these models: the rate of synonymous substitution was assumed to be constant across sites, implying that site-to-site variation in is solely the result of corresponding variation in nonsynonymous rates. If this assumption is incorrect, variation in is actually related to variation in both synonymous and nonsynonymous rates, and subsequent inferences about may be flawed. We would like to test this assumption and develop appropriate inference procedures for cases where tests suggest that synonymous rates are variable. While there have been many studies that showed large variation in synonymous substitution rates between different organisms, genomic regions, and individual genes (see, for example, Miyata and Yasunaga 1980; Wolfe, Li, and Sharp 1987; Wolfe, Sharp, and Li 1989; Gaut et al. 1996; Hanada, Suzuki, and Gojobori 2004) and while some evidence for variable silent substitution rates within a gene has been presented (Hurst and Pal 2001; Tuplin et al. 2002), to our knowledge there are no published results on modeling the site-to-site variation of synonymous rates in a "single" gene. Toward this end, we develop and study a class of stochastic models that allow for simultaneous variation in both categories of rates in coding sequences. We then provide a series of statistical inference procedures, including methods for identifying and quantifying amounts of site-to-site heterogeneity, locating sites that are targets of positive selection, and estimation of evolutionary distances.
Materials and Methods
A Model for Codon Evolution with Variable Rates of Synonymous and Nonsynonymous Substitution
Existing Codon Models
We describe our model as a modification of the MG94 codon model (Muse and Gaut 1994), although the principles can be easily applied to any substitution model containing two or more substitution rates (such as HKY85 or GY94). The MG94 model is defined by the instantaneous rate matrix whose entry reflects the rate of change from codon x to codon y. Formally,
where refers to the frequency of the "target" nucleotide in codon y at codon position m. For instance, the target nucleotide for the synonymous substitution is T in the third codon position, and its corresponding rate is Under this model, the equilibrium frequency of codon ijk is the product of the constituent nucleotide frequencies, scaled to account for the absence of stop codons. For sequences obeying the universal genetic code these frequencies are
(1)
Other genetic codes can be accommodated by making the obvious modifications. Finally, note that the original version of the MG94 model assumed that nucleotide frequencies were shared by all three codon positions.
For a given evolutionary lineage, the estimable substitution parameters when using the MG94 model are t and ?t. Recognizing this natural confounding of evolutionary rates and times (Felsenstein 1981), we can arbitrarily set = 1. This formulation is similar to that of Goldman and Yang (1994), and it has the advantage of expressing the substitution rates in terms of the ratio of nonsynonymous and synonymous rates, = ?/, which provides an indication of the magnitude of selective pressure. Originally, was treated as a deterministic model parameter; however, one can easily envision varying across sites. Nielsen and Yang (1998) and Yang et al. (2000) studied models with codon-to-codon variation of described by a variety of probability distributions. Approaches of this type allow us to relate patterns of sequence evolution to various biological hypotheses of interest. For completeness and consistency of notation, we include details of how to define and evaluate the phylogenetic likelihood in this context as Supplementary Material online (originally discussed in Nielsen and Yang 1998; Yang et al. 2000).
Modifications for Variable Synonymous Substitution Rates
The basic modeling approach described above allows for site-to-site variation in by treating the synonymous rate as constant across sites and introducing variable nonsynonymous rates. Using the notation of the original MG94 model, is shared by all sites and ? varies over sites. Our extension of this approach allows rates of both synonymous and nonsynonymous substitutions to vary across sites. The revised rate matrix can be written in the same form as the original MG94 model, but with the substitution rates at codon s considered as random quantities (s and ?s) rather than deterministic model parameters. We describe (s, ?s) as a random vector with a bivariate density h(s, ?s;) parameterized by , which may also be a vector. Each site in a sequence is modeled with its own (unknown) value of (s, ?s), leading to variation in s = ?s/s.
Let our data consist of a set of aligned sequences related by a phylogenetic tree , with the sth column of the alignment denoted as s, s = 1, ..., S. Computing the likelihood function requires integrating over the positive quarter-plane
A closed form expression is unavailable and it is not practical to perform numerical integration, so we restrict ourselves to considering discrete (or discretized) distributions as Yang (1994) introduced for nucleotide models. For such distributions the likelihood can be computed by carrying out a double summation:
where (rij, qij), i = 1, ..., M, j = 1,..., N denote the possible values for (s, ?s). If s and ?s are sampled from independent distributions, this expression can be rewritten as:
The conditional probability can be computed efficiently using Felsenstein's pruning algorithm (Felsenstein 1981), with numerical matrix exponentiation routines (Moler and Loan 1978) used to evaluate the transition probabilities needed for the site likelihoods (c.f. Goldman and Yang 1994; Muse and Gaut 1994). Note that the S sites produce only M x N sets of transition probabilities, eliminating the need for repeated calculations at each site. Additional computational savings in evaluating the likelihood function are described in Larget and Simon (1998) and Kosakovsky Pond and Muse (2004). Maximum likelihood estimates (MLEs) of and the branch lengths in can then be obtained numerically. Likelihood function evaluation can be distributed across multiple processors using the algorithm described in Kosakovsky Pond and Frost (2005c).
Although the notation has grown cumbersome, the basic idea is simple: we have provided M rate categories for synonymous substitution rates, N rate categories for nonsynonymous substitution rates, and a discrete probability distribution for the M x N possible combinations of these categories. The model posits that each codon falls into one of the M x N categories, and the likelihood function is calculated by taking a weighted sum over all possible assignments of codons to rate categories.
As is typical for studies of substitution rates, only products of rates and times can be estimated (Felsenstein 1981), and we can restrict one of the distributions to have a fixed mean. We choose to restrict the distribution of s to have unit mean because this leads to a natural extension of current models, which set s = 1 for all sites. This restriction is analogous to current rate heterogeneity models (e.g., Yang 1993) that use one-parameter gamma distributions with unit mean instead of the more general two-parameter gamma.
Independent Gamma Distributions for Synonymous and Nonsynonymous Rates
For purposes of illustrating the modeling approach, we will assume that both synonymous rates s and nonsynonymous rates ?s are sampled independently from gamma distributions, with separate means and variances for the two types of substitutions. We define the gamma density with mean A and variance A2/B. The synonymous rates s are drawn from a unit mean gamma density g(s; 1, μ). In the simplest case, we assume that nonsynonymous rates are drawn from a gamma density g(?s; , μ?).
It is often desirable to include a category of invariable sites (Fitch 1971). To capture such behavior, we also introduce a model where nonsynonymous rates are described by a mixture of a gamma distribution with a point mass at 0. If a fraction P of the sites are invariable at the amino acid level, the density is given by gi (x; A, B, P) = P0(x) + (1 – P)g(x; A, B), where 0(x) denotes the point mass distribution at 0, i.e., 0(x) = 0 for x > 0 and The mean and variance of this Gamma + Inv distribution are E[?s] = (1 – P)A and Under this framework, the extension of the simple bivariate gamma leads to nonsynonymous rates ?s being sampled from gi(?s; , μ?, P).
Using these models, the joint density for (s, ?s) depends on either three or four parameters—the shape parameter for synonymous rates μ, both shape and scale parameters for nonsynonymous rates μ? and , and in the second setting the proportion of invariant sites P—and can be written as either h(s, ?s; μ, μ?, ) = g(s; μ)g(?s; , μ?) or h(s, ?s; μ, μ?, , P) = g(s; μ)gi(?s; , μ?, P) as appropriate. Using the discretization procedure described in the Supplementary Material online to convert the continuous bivariate distribution to a discrete one, we are able to fit our models to a sequence alignment as described previously.
Independent Discrete Distributions for Synonymous and Nonsynonymous Rates
In general, there is no compelling reason to believe that substitution rates follow a gamma (or any other simple) distribution. The choice of the gamma distribution in Yang (1993) was justified largely by the flexibility of its parametric family and by the simplicity of estimating a single shape parameter. Given a fixed computational complexity (i.e., the number of rate classes D), the most general parametric approach to fitting a rate distribution is to use a "general discrete" distribution (GDD) with D bins. Such a distribution is specified by enumerating its values di, i = 1, ..., D (without loss of generality we may assume that di < dj, for all i < j) and the probabilities of observing each value pi = Pr{X = di}, subject to A GDD distribution with D bins will include 2D – 1 free parameters to be estimated by maximum likelihood. When modeling variation in synonymous rates the mean of the distribution must be restricted to 1 (cf. UGDD distribution in Kosakovsky Pond and Frost 2005a), reducing the number of free parameters to 2D – 2.
GDDs are parameter rich, and care must be taken not to include too many rate classes because it can easily lead to estimation problems such as slow convergence, multiple local maxima, and rapid performance degradation. For all but the largest alignments the value of D should not exceed 5 (unpublished data and Kosakovsky Pond and Frost 2005a).
Model Extensions and Variations
The previous section described the basic modeling approach we have used for investigating site-to-site heterogeneity of synonymous substitution rates. We are interested in improvements in model fit, in the robustness of parameter estimation (particularly of the nonsynonymous:synonymous rate ratio ) to model choice, and in exploration of the extent to which synonymous rates vary from site to site within gene sequences. To facilitate these investigations, we need to describe a series of variations on the basic model designed to allow comparisons with the published literature and to address biological hypotheses of interest. Each specific model consists of four components: the rate heterogeneity model, the "core" rate matrix, the rate distribution used, and the discretization scheme, and each component is described below.
The Core Rate Matrix
Entries in the instantaneous rate matrix QMG94 take the form In the simplest case we had only two possibilities, for synonymous substitutions and for nonsynonymous changes. However, one might want to include other common substitution biases, such as differences in transition and transversion rates. One can think of combining the basic MG94 model that accomodates separate synonymous and nonsynonymous rates with a nucleotide-based model to capture nucleotide-level factors. It is this combination that we dub the "core" rate matrix. We can create a core matrix in this manner using any time reversible 4 x 4 nucleotide model, but in the current work we will limit ourselves to the following core rate matrices.
MG94 x REV.
In these models the Muse and Gaut (1994) model is enhanced with a separate substitution parameter for each nucleotide substitution type. The rate matrix elements describing the substitution of codon x with codon y take the form:
To ensure reversibility, ij = ji. Because we can only estimate the products of rates and times, one of the parameters ij cannot be identified, and without loss of generality we set AG = 1. This rate matrix is the most general one used in this study.
MG94.Set all ij = 1.
MG94 x HKY85.
Set AG = CT = 1 and AC = AT = CG = GT.
GY94.
The Goldman and Yang (1994) model does not quite follow the general form described in this work, differing slightly in the modeling of equilibrium frequencies. We include this core matrix so that we can use it for direct comparisons with published results. The GY94 differs from MG94 x HKY85 in that each nonzero rate in the matrix is proportional to the frequency of the target "codon" y instead of the frequency of the target nucleotide. This feature leads to small (but typically negligible) differences from the MG94 x HKY85 model.
The Rate Heterogeneity Model
We described above a basic model for allowing site-to-site rate heterogenetiy of both synonymous and nonsynonymous rates in which rates were described by independent random variables. For purposes of exploring the utility of allowing heterogeneity of synonymous rates, and specifically to allow us to test various hypotheses about the relationship between synonymous and nonsynonymous rates, we now introduce several modifications of that scheme.
Constant: constant rates model.
This model is the simplest used in this work. It sets s = 1 and ?s = for every site s in the alignment. The original MG94 and GY94 models used this approach of having no site-to-site rate variation.
Nonsynonymous: variable nonsynonymous rates model.
This rate heterogeneity model was used by Yang et al. (2000) and described in the "Existing Codon Models" above. All sites have s = 1, while the ?s values are sampled from a given rate distribution (e.g., gamma or general discrete). "Nonsynonymous" assumes that synonymous rates do not vary over sites. However, the nonsynonymous-to-synonymous rate ratio does vary over sites as a result of the nonsynonymous rate variation.
Proportional: proportional variable rates model.
"Proportional" is perhaps the simplest approach for allowing both types of rate heterogeneity. The basic framework is similar toYang et al. (2000). Unlike that work, however, the synonymous rates are not constant over sites. Instead, the nonsynonymous rate is proportional to the synonymous rate, with the proportionality constant shared by all sites. The s values are drawn from a given unit mean rate distribution, and the nonsynonymous rates are then given by ?s = s, where is the ratio of nonsynonymous to synonymous rates and is shared by all sites. Such a model might be effective if site-specific mutation rates were the driving factor in creating site-to-site heterogeneity (Kosakovsky Pond and Frost 2005a).
Dual: dual variable rates model.
"Dual" is the primary rate heterogeneity model described in the "Modifications for variable synonymous substitution rates" above. The s and ?s values are sampled from two given rate distributions, with s restricted to have mean 1. Dual assumes that synonymous and nonsynonymous rates are independent.
Lineage Dual: dual variable rates model, branch corrections.
Model "Lineage Dual" is an extension of Dual that allows for lineage-specific adjustments to the ratio, , of nonsynonymous to synonymous rates. The values of s and an ancillary quantity s are sampled from two (independent) unit mean distributions and ?s = b x s; b denotes the E[nonsynonymous]/E[synonymous] ratio for branch b. The branch correction parameters, b, are estimated separately for every branch in the phylogeny. They can be thought of as correction factors to the "average" value of defined in Dual, with larger values of b indicative of branches with values higher than the tree average. These values are useful for identifying specific lineages that may have undergone adaptive evolution events (Yang 1998; Kosakovsky Pond and Frost 2005a). Dual is a particular case of Lineage Dual and can be used to test the null hypothesis b = for all branches b. Lineage Dual is the most general model considered in this study. A diagram illustrating the hierarchy of the models is found in figure 1.
FIG. 1.— Hierarchical presentation of rate heterogeneity models used in this work.
The Discretization Method
Having selected a rate heterogeneity model based on gamma-distributed random variables, for practicality we must now convert those continuous distributions into discrete approximations. Using the schemes discussed in Supplementary Material online, discretization based on M 2 synonymous rate classes and N 2 nonsynonymous rate classes is denoted by M x N. The inclusion of models Nonsynonymous and Proportional, which have only one random rate, as well as the use of distributions with an invariant category for Dual and Lineage Dual, require us to define some rules for discretization. For Nonsynonymous and Proportional, we discretize the single distribution into MN equiprobable rate classes. For Dual and Lineage Dual, we discretize the distribution of synonymous rates into M equiprobable classes and the distribution of nonsynonymous rates into N equiprobable classes. If an invariable class is present for nonsynonymous rates, we discretize the distribution of synonymous rates into M equiprobable classes, the continuous portion of the mixture of nonsynonymous rates into N – 1 equiprobable classes, and add the invariable class at 0 to obtain N total nonsynonymous classes. Under this discretization strategy, all models have MN rate classes and equivalent computational costs.
When GDDs are used, no discretization is needed and one only need specify the number of rate classes.
With the three components in place, we now have a wide selection of models from which to choose. To completely specify a model, we list its rate variation model, core rate matrix, rate heterogeneity method, rate distributions used, and the discretization method. The model MG94 x REV Dual Gamma 5 x 7 would use the MG94 core matrix composed with the REV nucleotide model, have separate independent gamma distributions for synonymous and nonsynonymous rates, discretized into five synonymous categories and seven nonsynonymous categories.
A Simple Test for Presence of Synonymous Rate Variation
The Dual and Nonsynonymous models are nested if GDD distributions are used to model rate variation, making it possible to assess significance using standard likelihood ratio tests. If N nonsynonymous rate categories are used both in Dual and in Nonsynonymous and if M synonymous rate categories are employed in Dual, then there are 2M – 2 more estimable parameters in Dual than in Nonsynonymous. Dual can be constrained so that Nonsynonymous is obtained. When enough alignment columns are available, the asymptotic 2 distribution with 2M – 2 degrees of freedom can be used to compute P values. For smaller data sets a parametric bootstrap P value can be found.
Alignments, Trees, and Programs
To investigate the utility of the new models in exploring site-to-site rate heterogeneity, we made use of 10 previously studied codon alignments (table 1) of varying sizes and levels of divergence. The archive of alignments and phylogenetic trees in NEXUS format can be downloaded from http://www.hyphy.org/pubs/2rates.tgz. We examine the mammalian ?-globin data set in great detail but for brevity focus only on tests for the presence of synonymous rate variation in other data sets.
Table 1 Data Sets Analyzed for Presence of Synonymous Rate Variation
All analyses were performed using "HyPhy" version .99? (Kosakovsky Pond, Frost, and Muse 2004) for Mac OS X on a dual G5/2Ghz workstation or a 16-processor Linux MPI cluster. Both the program and the batch files for comparing the models discussed herein and result processing are included as standard analyses with the HyPhy distribution (http://www.hyphy.org). The chart generation option used for creating some of the graphics in this paper is presently only available in Mac OS and Windows.
Results
Nucleotide Substitution Biases
In all data sets analyzed for this paper, the MG94 x REV model of codon substitution was preferred over both MG94 x HKY85 and traditional MG94 when assessed by the appropriate likelihood ratio test, and it was also chosen over GY94 using Akaike Information Criterion (AIC) (Supplementary Material online). This finding held even when different rate variation models were considered. For the remainder of this article, we consider only the MG94 x REV core model unless specified otherwise.
Choice of Rate Distribution
For every choice of rate variation models, GDD tends to have similar or better AIC scores than a discretized gamma (+Inv) distribution (Supplementary Material online). Fewer rate categories are needed by GDD to achieve good fits when compared to continuous discretized distributions, and consequently the analyses require less computer time to run. In general, a uniformly discretized gamma distribution may be a poor choice to capture the underlying pattern of rate variation, even if the true distribution of substitution rates is a gamma (Kosakovsky Pond and Frost 2005c). Adaptive discretization schemes or GDDs may be a better general choice for modeling distributions of rates. We chose to focus on the results derived from using GDDs of rates, although similar results have been obtained with a variety of other distributions (results not shown).
Heterogeneity of Synonymous Rates Is Not Negligible
Each of the ten data sets was analyzed for evidence of synonymous rate variation (table 1). The MG94 x REV model, using GDD was fit to each data set under two rate variation hypotheses:
Null—No synonymous rate variation: Nonsynonymous with three rate categories.
Alternative—Presence of synonymous rate variation: Dual, with three rate categories for s and three for ?s.
Because the models are nested, a standard likelihood ratio test using the asymptotic distribution can be used to assess significance. Additionally, we employed the AIC to quantify the relative improvement of fit. The computational difficulty of the Dual model was increased roughly by a factor of 3 compared to the Nonsynonymous model; however, if run in parallel on a cluster of computers with at least 10 nodes (Kosakovsky Pond and Frost 2005c) the models are roughly equivalent in terms of overall run time. The choice of three (or 3 x 3) rate categories was partly historical, e.g., the often cited M3 model of Yang et al. (2000) used a GDD with three rate categories and partly because finite sample sizes do not, in general, permit reliable inference of more than a few rate classes (Kosakovsky Pond and Frost 2005c).
In 9 of 10 cases the hypothesis of constant synonymous rates can be rejected at the 0.01 level. The Drosophila adh data set is the sole exception where the test failed to reject the assumption of constant synonymous rates. Dual was preferred by AIC in the same nine cases, with score improvements ranging from 12.43 to 291.24, with larger data sets yielding more pronounced AIC decreases.
Coefficients of variation (CV: defined as the standard deviation of a distribution divided by its mean) can be used to quantify and compare the variability of different distributions. The CV for synonymous rate distributions ranged from 0.18 to 1.47 and in many cases were comparable to, or even greater than, the CV of the corresponding nonsynonymous distributions (table 2). Note that while for most data sets both models yielded similar estimates of overall tree lengths (table 1) and CV of the nonsynonymous rate distributions, there were some notable exceptions (COXI, ?-globin, and flavivirus NS5 data sets) for which there was a pronounced effect on the estimate of overall sequence divergence. This effect could be important, for example, in estimation of divergence times and definitely merits further exploration.
Table 2 Descriptive Statistics for the Inferred Distributions of Synonymous (s) and Nonsynonymous (?s) Substitution Rates
Detailed Analysis of the ?-Globin Data Set
In the following paragraphs, we illustrate potential applications of the models developed above through a series of analyses of the ?-globin data set. We examine the fits of various models and investigate the inferred rate heterogeneity distributions.
Model Selection for the ?-Globin Data Set
A more thorough analysis of the ?-globin data is presented in table 3. Introduction of branch-specific rate correction factors in Lineage Dual appears to lead to a significant fit improvement over Dual model as measured by the likelihood ratio test (asymptotic P value of 0.005). Inferred branch selection coefficients b ranged from 0 to 3.7 (median 0.31). The Lineage Dual model can serve as a simple test for heterogeneity in average selection pressure among lineages when both substitution rates are allowed to vary among sites, complementing existing branch-site models (Yang 1998; Yang and Nielsen 2002). It is likely (Kosakovsky Pond and Frost 2005a) that a model where each branch has an adjustable b is too parameter rich and that a number of models with only a few classes of b will fit the data as well, a possibility reinforced by the fact that the best model using AIC as a criterion is Dual. Note that for this particular data set, estimates of the various parameters regulating site-to-site rate heterogeneity appear quite insensitive to model selection, although the results from tables 1 and 2 indicate that this behavior is not universal.
Table 3 MG94 GDD 3 x 3 Applied to the ?-Globin Alignment
Examining Rate Patterns at Sites in the ?-Globin Data Set
Once the MLEs, , of the entire collection of unknown parameters, , have been obtained, one can ask questions about rate patterns among sites in the data set. One of the results produced by a likelihood analysis is a table of conditional probabilities comprised of the entries s = 1, ..., S, j = 1, ..., M (number of synonymous rate classes), k = 1, ..., N (number of nonsynonymous rate classes): Given we can compute the empirical Bayes posterior probability of site s belonging to rate class (j, k) by applying the elementary Bayes rule:
The posterior distribution of rates at a site is useful in determining the rate of evolution at a given site. For instance, one can simply assign site s to the rate class (j, k) that has the maximal posterior probability. Note that if all rate classes are equiprobable, the same results can be obtained simply by selecting the rate class with the largest conditional posterior probability (table 4). In figure 2, we show examples of sites assigned to different rate classes by various models.
Table 4 Rate Class Assignments Under MG94 x REV Dual GDD 3 x 3 for ?-Globin
FIG. 2.— Examples of rate class assignments and positive selection identification for ?-globin under MG94 x REV Dual GDD 3 x 3, MG94 x REV Nonsynonymous GDD 3, and M3 in (Yang et al. 2000).
Additional indicators of the rate pattern present at site s are the expected posterior synonymous and nonsynonymous rates at that site, denoted by and (fig. 3):
Note that both and shown in figure 3 show substantial variation (as suggested by their respective CV in table 2) but with little obvious spatial correlation.
FIG. 3.— Expected posterior rates (E[S|i] for synonymous rates and E[NS|i] for nonsynonymous rates) for ?-globin under MG94 x REV Dual GDD 3 x 3.
Selective Pressure
For the Dual and Lineage Dual models, one can examine the conditional distribution of s = ?s/s (or Ms = ?s – s) over codon sites and attempt to identify sites under selective pressure. Because the distributions used for s and ?s are always discrete (or discretized), the inferred distribution of s (or Ms) is also discrete and can be obtained in a straightforward way. Following the empirical Bayes methodology of Nielsen and Yang (1998), given the posterior probabilities as defined above and a cutoff level 0 C 1, we label a codon site s as a site under selective pressure if
One could also employ Bayes factors to identify sites that are under selective pressure. A Bayes factor BF(s) for the hypothesis (or event) at site s is defined as the ratio of posterior odds of the event and its prior odds:
A large value of BF(s), for instance BF(s) > 100 ("decisive" evidence, see Kass and Raftery 1995) indicates that the effect of the data in column s on the prior event of positive selection is amplifying and thus suggestive of the presence of selective pressure at site s. One of the advantages of using Bayes factors as opposed to a posterior probability cutoff is that Bayes factors properly account for the effect of the prior distribution. For example, first suppose that Pr{Ws} = 0.7 and It is easy to check that BF(s) = 8.14. Next, consider Pr{Ws} = 0.05 and Now, BF(s) = 361. Both situations would be treated as equally strong evidence for positive selection with the posterior cutoff but vastly different under Bayes factors. It seems reasonable to consider the "relative" effect of the data upon the prior as measured by BF(s), if for no other reason than to make the conclusions of positive selection analyses based on different priors directly comparable.
Under certain scenarios (Kosakovsky Pond and Frost 2005b), the reciprocals of Bayes factors serve as a useful approximate measure of the rates of false positives of the test for selection at a site. In practice, however, there seems to be little difference in inferring the location of sites under selection, whether posterior P values or Bayes factors are used (Kosakovsky Pond and Frost 2005b; unpublished simulation results). Empirical Bayes inference can be strongly influenced by errors in MLE. Recently, several methods to correct for this effect have been suggested. One can either use a hierarchical Bayes model (Huelsenbeck and Dyer 2004) or employ Bayes Empirical Bayes inference (Yang, Wong, and Nielsen 2005). Because our Dual models include many estimated parameters, we caution that estimation errors can exert strong influence on the empirical Bayes analysis, especially for smaller data sets. While we base our initial analyses on a simplistic application of empirical Bayes inference, both approaches to dealing with errors in MLEs can be adapted to include synonymous rate variation. Furthermore, new approaches (e.g., sampling importance resampling) could be brought to bear on handling errors in parameter estimates, and we plan to extend our methodology to account for the effect of such estimation errors.
Comparison with Existing Methods
In Yang et al. (2000), model M3 was found to have the best fit for the ?-globin data (AIC = 7,408.12). Note that M3 of Yang et al. (2000) can be denoted as GY94 Nonsynonymous GDD 3 using the conventions of this paper. GY94 Nonsynonymous GDD 3 and MG94 x REV Dual GDD 3 x 3 (AIC = 7,388.23) differ in three aspects: (1) modeling of rate variation, (2) compensating for nucleotide substitution biases, and (3) treatment of equilibrium frequencies. In order to focus specifically on the effect of failing to allow for synonymous rate variation we also included results from MG94 x REV Nonsynonymous GDD 3 (AIC = 7,400.66). The AIC scores for the three models suggest that site-to-site synonymous variation is present and that it is important to model. In figure 2, we illustrate examples of sites whose rates can be categorized as high synonymous/low nonsynonymous (codon 55), high synonymous/high nonsynonymous (codon 85), low synonymous/high nonsynonymous (codon 48), and moderate synonymous/high nonsynonymous (codon 7). The presence of this variety of site categories suggests that it might be important to include both types of site-to-site rate variation.
While the model fit is important, we focus now on the specific task of identifying sites under positive selection. A few sites illustrate important points with regard to the suitability of models for this task. The most crucial difference between methods is found at codon 85, a position with high synonymous and nonsynonymous diversity. Because of the high level of nonsynonymous diversity, models that assume equal synonymous rates identify such a site positively selected with high statistical significance (note the Bayes factors/posterior probabilities of 5,545 and 0.998 in fig. 2). However, an argument can be made that in reality site 85 is evolving neutrally or even under purifying selection. Using MG94 x REV Dual GDD 3 x 3, the estimated ratio of nonsynonymous to synonymous rates at site 85 is 0.108. The model assigns high synonymous and nonsynonymous rates to such variable sites, leading to a negligible positive selection signal. A look at figure 3 suggests that while evolution at most sites in the alignment can be explained well by a single synonymous rate, there are a number of prominent exceptions where synonymous rate estimates are substantially elevated.
Figure 4 illustrates the effects of including (or excluding) site-to-site variation in synonymous rates when searching for sites undergoing positive selection. We used models MG94 x REV Nonsynonymous GDD 3 and MG94 x REV Dual GDD 3 x 3 in order to focus solely on the performance differences arising from the inclusion or exclusion of site-to-site synonymous rate variation. The scatterplots of posterior expected s and posterior probabilities for positive selection (?s > s) should form a line along the y = x axis if the models lead to the same inferences. Predictably, the models agree on a large subset of the sites. However, the most important feature of the plots is the tendency for some sites (7, 67, 85, and 123) with large support for positive selection under the Nonsynonymous model to have considerably weaker support when site-to-site synonymous variation is introduced using the Dual model. This behavior could lead to an elevated rate of false positives when analyzing data sets that include synonymous rate variation with Nonsynonymous-type models (for an example, see Kosakovsky Pond and Frost 2005b).
FIG. 4.— Comparison of site-by-site selection inference using MG94 x REV Dual GDD 3 x 3 and MG94 x REV Nonsynonymous GDD. Shaded areas in each plot show codons that are classified as positively selected by the Nonsynonymous model only (cutoff of 0.9 used for the posterior probability plot).
As a final investigation of the relationship between model choice and identification of positively selected sites, in table 5, we display results for every site that had a Bayes factor for positive selection greater than 100 (which also translates to posterior probabilities of s > 1 being over 0.9 in all cases) using either MG94 x REV Dual GDD 3 x 3 or M3. For most sites the two models are in general agreement. However, large discrepancies are found at sites 7, 67, and 85, and to a lesser degree at site 123. In each of these cases the evidence for positive selection disappears when synonymous rate variation is introduced.
Table 5 Evidence of Positive Selection with and Without Synonymous Rate Heterogeneity
Conclusions
The results above clearly indicate that the inclusion of potential site-to-site variation in synonymous rates when studying the molecular evolution of protein-coding regions is desirable. From the statistical perspective, these models allow for rigorous testing of parameters regulating the process of sequence evolution. Applications of the new models and associated inference tools on ten data sets indicate that variation in synonymous substitution rates is a real phenomenon. This observation, if it turns out to be widespread, will no doubt be one of considerable interest: investigations using estimates of synonymous rates for molecular clock dating could be brought into question; synonymous rate variation might be used to explain the overdispersed clock; positive selection studies might require second looks. At the most basic level, it raises questions as to the biological sources of the heterogeneity. Selective arguments, perhaps hinging on localized codon usage patterns or avoidance of transcription factor–binding site motifs, could be crafted. An explanation drawing on variation of mutation rates at different dinucleotides has some appeal and has been described in Hepatitis C (Tuplin et al. 2002). One could also imagine an explanation based on properties of DNA winding on histones. In any event, the model provides for the first time the tools to carefully investigate any of these phenomena that might actually be present in nature, and our analyses suggest that the phenomenon is ubiquitous enough to merit close attention.
Supplementary Material
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgements
The authors would like to thank Joseph Watkins, Simon D. W. Frost, Andy J. Leigh Brown, and Jesse Taylor for their comments and suggestions.
This research was supported in part by the National Science Foundation (award number 0331654), National Institutes of Health (AI47745, AI43638, and AI57167), the University of California Universitywide AIDS Research Program (grant number IS02-SD-701), and by a University of California, San Diego Center for AIDS Research/National Institute of Allergy and Infectious Diseases Developmental Award (AI36214).
References
Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716–723.
Anisimova, M., and Z. Yang. 2004. Molecular evolution of the hepatitis delta virus antigen gene: recombination or positive selection? J. Mol. Evol. 59:815–826.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93–104.
Fitch, W. 1971. Towards defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406–416.
Gaut, B. S., B. R. Morton, B. McCaig, and M. T. Clegg. 1996. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcl. Proc. Natl. Acad. Sci. USA 93:10274–10279.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.
Hanada, K., Y. Suzuki, and T. Gojobori. 2004. A large variation in the rates of synonymous substitution for RNA viruses and its relationship to a diversity of viral infection and transmission modes. Mol. Biol. Evol. 21:1074–1080.
Harmsen, M., R. Ruuls, I. Nijman, T. Niewold, Franken, L. G., and B. de Geus. 2000. Llama heavy-chain V regions consist of at least four distinct subfamilies revealing novel sequence features. Mol. Immunol. 37:579–590.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
Huelsenbeck, J., and K. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466.
Huelsenbeck, J. P., and K. A. Dyer. 2004. Bayesian estimation of positively selected sites. J. Mol. Evol. 58:661–672.
Hurst, L., and C. Pal. 2001. Evidence for purifying selection acting on silent sites in BRCA1. Trends Genet. 17:62–65.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. M. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kass, R., and A. Raftery. 1995. Bayes factors. J. Am. Stat. Assoc. 90:773–795.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
Kosakovsky Pond, S. L., and S. D. Frost. 2005a. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol. Biol. Evol. 22:478–485.
———. 2005b. Not so different after all: a comparison of methods for detecting amino-acid sites under selection. Mol. Biol. Evol. 22:1208–1222.
———. 2005c. A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol. 22:223–234.
Kosakovsky Pond, S. L., S. D. W. Frost, and S. V. Muse. 2004. Hyphy: hypothesis testing using phylogenies. Bioinformatics 21:676–679.
Kosakovsky Pond, S. L., and S. V. Muse. 2004. Column sorting: rapid calculation of the phylogenetic likelihood function. Syst. Biol. 53:685–692.
Larget, B., and D. Simon. 1998. Faster likelihood calculations on trees. Technical Report 98-02. Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pa.
Miyata, T., and T. Yasunaga. 1980. Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application. J. Mol. Evol. 16:23–36.
Moler, C., and C. V. Loan. 1978. Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev. 20:801–836.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positive selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936.
Pedersen, A. M. K., C. Wiuf, and F. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069–1081.
Sch?niger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240–247.
Seo, T.-K., H. Kishino, and J. Thorne. 2004. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol. Biol. Evol. 21:1201–1213.
Tuplin, A., J. Wood, D. Evans, A. Patel, and P. Simmonds. 2002. Thermodynamic and phylogenetic prediction of RNA secondary structures in the coding region of hepatitis C virus. RNA 82:824–841.
Uzzel, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089–1096.
Wolfe, K. H., W.-H. Li, and P. M. Sharp. 1987. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc. Natl. Acad. Sci. USA 84:9054–9058.
Wolfe, K. H., P. M. Sharp, and W.-H. Li. 1989. Rates of synonymous substitution in plant nuclear genes. J. Mol. Evol. 29:208–211.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.
———. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:105–111.
———. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568–573.
———. 2000. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51:423–432.
Yang, Z., and W. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49–57.
Yang, Z., W. S. Wong, and R. Nielsen. 2005. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22:1107–1118.
Yang, Z. H., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908–917.
Yang, Z. H., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.(Sergei Kosakovsky Pond* a)