Estimating Absolute Rates of Synonymous and Nonsynonymous Nucleotide Substitution in Order to Characterize Natural Selection and Date Specie
http://www.100md.com
分子生物学进展 2004年第7期
Bioinformatics Research Center, Box 7566, North Carolina State University
Laboratory of Biometrics, Graduate School of Agriculture and Life Sciences, University of Tokyo, Bunkyo-ku, Tokyo, Japan
E-mail: seo@statgen.ncsu.edu
Abstract
The rate of molecular evolution can vary among lineages. Sources of this variation have differential effects on synonymous and nonsynonymous substitution rates. Changes in effective population size or patterns of natural selection will mainly alter nonsynonymous substitution rates. Changes in generation length or mutation rates are likely to have an impact on both synonymous and nonsynonymous substitution rates. By comparing changes in synonymous and nonsynonymous rates, the relative contributions of the driving forces of evolution can be better characterized. Here, we introduce a procedure for estimating the chronological rates of synonymous and nonsynonymous substitutions on the branches of an evolutionary tree. Because the widely used ratio of nonsynonymous and synonymous rates is not designed to detect simultaneous increases or simultaneous decreases in synonymous and nonsynonymous rates, the estimation of these rates rather than their ratio can improve characterization of the evolutionary process. With our Bayesian approach, we analyze cytochrome oxidase subunit I evolution in primates and infer that nonsynonymous rates have a greater tendency to change over time than do synonymous rates. Our analysis of these data also suggests that rates have been positively correlated.
Key Words: Bayesian method ? Markov chain Monte Carlo (MCMC) ? rate change model ? lognormal ? evolutionary rate ? selection ? synonymous rate ? nonsynonymous rate
Introduction
Much attention has been paid to synonymous and nonsynonymous substitution amounts and their ratio (Nei and Gojobori 1986; Goldman and Yang 1994; Muse and Gaut 1994; Ina 1995; Ohta 1995; Zhang, Rosenberg and Nei 1998; Yang 1998; Yang and Nielsen 1998, 2000, 2002; Yang and Bielawski 2000; Yang and Swanson 2002), but less emphasis has been placed on the actual values of nonsynonymous and synonymous rates. This is unfortunate because absolute rates of nonsynonymous and synonymous substitutions can provide more insight into evolutionary mechanism than can the amounts or their ratio. A change in the ratio may be due to alteration of the nonsynonymous or synonymous substitution rate or both. With the ability to estimate the actual nonsynonymous and synonymous rates, the factors involved in a change to the evolutionary process can be more readily identified.
For estimation of substitution rates in chronological time units (e.g., years, millions of years, etc.), the most difficult obstacle is that only the amount of evolution (i.e., the product of rate and time) can be inferred solely from molecular sequence data. This confounding of rates and times necessitates the addition of information external to the molecular sequence data if actual nonsynonymous and synonymous substitution rates are to be accurately determined. Conventionally, this external information has taken the form of fossil data that calibrates a "molecular clock" that is assumed to specify a common rate of evolution on all branches of a phylogeny. Recently, the sampling times of quickly evolving organisms such as RNA viruses have been exploited to calibrate the molecular clock on some genealogies (Rambaut 2000; Drummond and Rodrigo 2000). Regardless of how calibration is achieved, when the goal is to determine how substitution rates have changed over time, the molecular clock assumption of constant rates has little utility. There have been proposals for analyzing sequence data with "local clocks" that operate on only a portion of an evolutionary tree (Yoder and Yang 2000). This local clock approach is attractive, but there may be some difficulty in determining how many local clocks should be employed and in determining which branches should be governed by which local clocks.
Sanderson (1997; see also Sanderson 2002, 2003) introduced a pioneering approach that permits different rates in different lineages and that lacks the model specification difficulties of the local clock approaches. Later, Bayesian approaches for allowing the rate of evolution to change over time were developed (Thorne, Kishino, and Painter 1998; Huelsenbeck, Larget, and Swofford 2000; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002; Aris-Brosou and Yang 2002). An advantage of the Bayesian implementations is their ability to incorporate both the full likelihood and prior information. Here, we adapt previously proposed Bayesian approaches (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002) to a codon model of evolution. In this way, absolute synonymous and nonsynonymous substitution rates as well as divergence times can be estimated. In addition, the tendency for nonsynonymous rates to change over time can be compared and contrasted with the tendency for synonymous rates to change over time. We also introduce a test of whether nonsynonymous and synonymous substitution rates change in a correlated fashion over time, and we discuss the strengths and weaknesses of this test.
Our new methodology is applied to a group of cytochrome oxidase subunit I (COXI) primate sequences. Cytochrome oxidase is a protein complex that affects aerobic energy metabolism by playing an important role in the final step of the electron transport system. Cytochrome oxidase is composed of 13 subunits. Subunits 1, 2, and 3 are encoded in the mitochondrion, and the others are encoded in the nucleus. The molecular evolution of some of the subunits has been investigated in detail and, despite the different genomic origins of the subunits that have been studied, a consistent pattern among subunits is that anthropoids have experienced a higher ratio of nonsynonymous to synonymous change than have non-anthropoids (Adkins and Honeycutt 1994; Adkins Honeycutt, and Disotell 1996; Wu et al. 1997, 2000; Schmidt, Goodman, and Grossman 1999; Andrews and Easteal 2000; Grossman et al. 2001; Wildman, Goodman, and Grossman 2002). One suggested explanation of this pattern is that there has been an increase in size of the neocortex during anthropoid evolution and that the increased needs of the neocortex for energy may be responsible for a change in aerobic energy metabolism in anthropoids. Selection for an alteration to aerobic energy metabolism could have led to the higher nonsynonymous to synonymous substitution ratio of cytochrome oxidase in anthropoids (see Grossman et al. 2001).
By analyzing COXI, we obtain divergence time estimates similar to those obtained in a previous study that analyzed total mitochondrial genomes with a treatment that did not distinguish between synonymous and nonsynonymous changes (Cao et al. 2000). Our inferred pattern of nonsynonymous rate increase in the anthropoid group is consistent with previous analyses based on different methods (Andrews and Easteal 2000; Wu et al. 2000). An advantage of our Bayesian method over traditional approaches that are based on pairwise distances is that it can localize rate changes to relatively specific portions of a phylogeny. For COXI, we tentatively conclude that nonsynonymous rates have been comparatively high throughout the anthropoid group and not just on the branch that was immediately ancestral to the anthropoid group.
Methods
Codon Model
Consider a slight modification of the codon substitution model introduced by Goldman and Yang (1994). The modification allows the expected proportion of nonsynonymous substitutions to vary among branches and has been previously suggested (Yang 1998). With this modification, the instantaneous rate of substitution from codon i to codon j on branch m is
where j is the frequency of codon j, differentiates between transitions and transversions, and (m) distinguishes between nonsynonymous and synonymous substitutions for branch m. In contrast to the (m) parameter, the u(m) parameter affects the chronological rates of both synonymous and nonsynonymous substitutions on branch m. In equation (1), the i parameters and are common to all branches.
The average rate on a branch m of synonymous and nonsynonymous substitution per chronological time unit will be respectively denoted and . These rates can be defined as
and
The synonymous and nonsynonymous branch lengths for a branch m with chronological time duration t(m) are
and
The total length of branch m (i.e., the expected number of substitutions per codon) is then
Knowing + and (m) along with and the i parameters allows determination of and . This is because , the i parameters, and (m) define the / u(m) values. The / u(m) values then allow calculation of / u(m) and / u(m) via equations (2) and (3). Equation (6) can be used to solve for u(m)t(m). Finally, u(m)t(m) with / u(m) determines (eq. 4) and u(m)t(m) with / u(m) determines (eq. 5).
Likewise, knowing and along with and the i parameters allows determination of + and (m). This is because and the i parameters are sufficient to define / u(m). With equation (4), u(m)t(m) can be expressed in terms of / u(m) and . Using u(m)t(m) and equation (5), the value of / u(m) follows. Finally, the value of (m) can be found by applying equation (3).
We stress this ability to interconvert between the two sets of parameters because the (i,, + ,(m)) parameterization is more natural for calculation of likelihoods whereas the (i,,,) parameterization is more natural when considering our model of synonymous and nonsynonymous rate change. Ability to interconvert between these two parameterizations also allows the invariance property of maximum likelihood estimators (e.g., Hogg and Craig 1995, p. 265) to be exploited so that estimates under one parameterization can therefore be converted to be estimates under the other. Maximum likelihood estimates are relevant because, as described below, our Bayesian approach involves approximating the likelihood surface needed for posterior density estimation by Taylor expansion around the maximum likelihood estimates.
Model of Nonsynonymous/Synonymous Rate Change
Our model of rate evolution is similar in spirit to that of Kishino, Thorne, and Bruno (2001). When evolutionary lineages are separated by only small amounts of time, their evolutionary rates are likely to be similar. When lineages are distantly related, their rates are more likely to substantially differ. In other words, evolutionary rates are autocorrelated.
Rates of evolution are likely to undergo a nearly continuous process of change on a lineage. Because molecular sequence data contain only limited information regarding this nearly continuous process of change, the average evolutionary rate on a branch is approximated in our treatment by the mean of the rate of evolution at the nodes that begin and end the branch. If rates of evolution changed in a deterministic linear fashion between beginning and ending nodes of a branch, our approximation would be exact. However, we simply view this connection between node rates and average rates on a branch as a computationally simplifying assumption. With this approximation, the average rates of evolution on branches and the expected amounts of evolution on branches both depend completely on rates of evolution at the nodes of a tree.
To model the rates of evolution at tree nodes, a nonsynonymous rate variation parameter n and a synonymous rate variation parameter s are introduced. The case where n = s = 0 represents the biologically unrealistic case of a molecular clock. These rate variation parameters are restricted to non-negative values and, as they increase, the expected departure from a molecular clock becomes more extreme. Given the nonsynonymous rate and the synonymous rate at a node i that begins a branch of time duration t, the logarithms of the rates at the node j that ends the branch are assumed to follow a bivariate normal distribution of the form:
With this parameterization, the expected value of () is larger than (), even though the expected value of log (log ) equals log (log ). The difference gets bigger when n(s) or t is large. To eliminate such differences, Kishino, Thorne, and Bruno (2001) forced the expected value of rates at the end of branches to be equal to their values at the beginning of branches. Because the parameterization in equation (7) simplifies our proposed test of concordance between synonymous and nonsynonymous rates (see below), we do not make any attempt to eliminate such differences here.
As a further simplification, we set the a priori covariance of log and log to be zero in equation (7). Future implementations could easily permit this covariance to be nonzero a priori or might even define a prior distribution for a parameter representing the covariance. This hierarchical approach may be worthwhile because some features of the evolutionary process such as generation time or mutation rate might simultaneously affect both nonsynonymous and synonymous rates when they change. Although we model synonymous and nonsynonymous rate change as occurring independently, we describe a procedure that uses sequence data to detect correlation in changes of these rates.
We assume the topology relating the sequences is known. Following Thorne, Kishino, and Painter (1998), our procedure requires an outgroup to root the ingroup, but no attempt is made to model rate evolution in the outgroup taxa. With the model of rate evolution described by equation (7), the prior density of rates at a node is defined in terms of the rates at its parental node, and this leaves the prior distribution for the rates at the ingroup root node to be specified. In our implementation, the ingroup root node has priors for nonsynonymous and synonymous rates that are specified by two independent gamma distributions. For n and s, we specify priors that are exponential distributions.
For every rooted tree, our implementation arbitrarily selects one branch that emanates from the ingroup root node. On this branch, the nonsynonymous and synonymous rates at the end of the branch are forced to be identical to the corresponding rates at the ingroup root node. In previous work (Kishino, Thorne, and Bruno 2001), this forcing of no rate change along one branch has made Markov chain Monte Carlo (MCMC) approximation of posterior rate densities computationally tractable. With a similar motivation, the forcing of no rate change was included here to achieve computational tractability with the MCMC procedure described below. With the exception of this single branch, evolution of rates on all branches is governed by equation (7).
Prior Distribution
For the Bayesian estimation of chronological rates of nonsynonymous and synonymous substitutions, a prior distribution of node times must be specified. The details of the prior distribution for node times are identical to those described in Kishino, Thorne, and Bruno (2001), and only an outline is provided here. For the case where there are no constraints on node times due to fossil or other information, our implementation uses a gamma distribution to describe the time of the ingroup root node. Given the ingroup root time, the prior distribution for the times of all other internal nodes are jointly distributed according to a modified Dirichlet distribution (Kishino, Thorne, and Bruno 2001). Representing the node times (including the ingroup root) by a vector T, the modified Dirichlet distribution and the gamma distribution of the ingroup root time determine the prior density of the node times p(T).
Often, fossil or other evidence suggests that a certain node time may exceed some value. Likewise, evidence external to the molecular sequence data may indicate that a node is more recent than some date. Sanderson (1997) demonstrated that constraints on node times can be valuable for estimating dates of divergence. When a set C of constraints on node times is available, we condition upon these constraints and use this conditional prior p(T|C) to estimate divergence times and evolutionary rates.
Posterior Distribution and MCMC
Following Yang (1998; see also Pederson, Wiuf, and Christiansen 1998), we have explored various parameterizations of the codon frequencies in terms of frequencies of the four nucleotide types. One parameterization forces these nucleotide frequencies to be identical at all three codon positions, and another allows different nucleotide frequencies at the first, second, and third codon positions. However, for all results presented here, we estimate probabilities of the different codon types by their empirical frequencies. We then treat these estimated codon frequencies as well as the maximum likelihood estimate of as if their values were known with certainty. Because we neglect uncertainty regarding the and i parameters, we omit them from subsequent notation and focus on approximating the posterior distribution p(T,rn,rs,n, s|X,C). We note that the values of T, rn, and rs are sufficient to determine bn and bs. In turn, bn and bs along with the known and i parameters determine (or can be determined by) and the sum bn + bs.
For given sequence data X, the posterior distribution of unknown parameters is
where T, rn, and rs respectively represent the vectors of the divergence times, the nonsynonymous rates , and the synonymous rates . Because p(X|T,rn,rs) = p(X|bn,bs), equation (8) can be rewritten as
Except for the treatment of p(X|bn,bs), the Metropolis-Hastings implementation that we use to sample from the posterior distribution p(T,rn,rs,n,s|X,C) is a straightforward extension of that described earlier for estimating divergence times and evolutionary rates with less realistic models of sequence change (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001). The p(X|bn,bs) term in equation (9) is a likelihood and can be calculated with Felsenstein's pruning algorithm (Felsenstein 1981). Because the pruning algorithm is computationally demanding, we instead approximate p(X|bn,bs) up to a proportionality constant with a Taylor expansion of the log-likelihood function around the maximum likelihood estimates of bn and bs (see Appendix). Key to this approximation is the ability, as described earlier, to interconvert between the (bn + bs, , i, ) parameterization that is convenient for calculating likelihoods and the (bn,bs,i,) parameterization used in the p(X|bn,bs) term of equation (9).
We note that the approximation of p(X|bn,bs) that is implemented here is an improvement on the multivariate normal one of earlier studies (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002). The one here is equivalent to the multivariate normal treatment when maximum likelihood branch length estimates are not close to 0. The approximation of p(X|bn,bs) in earlier studies (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002) treated maximum likelihood branch length estimates that are close to 0 in an ad hoc fashion. Although the approximation here is similar to those applied elsewhere (e.g., see Gelman et al. 1995, p. 95; also see the supplementary information of Korber et al. 2000 at http://www.santafe.edu/btk/science-paper/bette.html), we describe it in detail (see Appendix).
To approximate the likelihood function p(X|bn,bs) up to a constant of proportionality, we need to numerically evaluate the first and second derivatives with respect to branch lengths of the log-likelihood function around its maximum. Our experience has been that numerical stability of second derivative approximations is more difficult to achieve with codon models than with simpler models that have independent evolution among nucleotide positions. For this reason, we evaluated second derivatives with the strategy described in the Appendix. Our experience is that this strategy is numerically stable even when synonymous branch lengths are rather large.
Correlated Change of Synonymous and Nonsynonymous Rates: Test of Concordance
In addition to inferring the pattern of changes in chronological rates of synonymous and nonsynonymous substitutions over time, it is of particular interest to examine whether changes in the synonymous and nonsynonymous rates are correlated. Because mutation rate and generation length affect both synonymous and nonsynonymous rates, changes in either mutation rate or generation length could yield changes in both these rates. Consider the null hypothesis that the synonymous rate and the nonsynonymous rate change independently. Let and be respectively the nonsynonymous and synonymous rates at node i in the jth sample from the Markov chain used in the MCMC approximation of the posterior distribution. We define the concordance measure
where ip is the parent node of i. The concordance measure takes the value 1 if both synonymous and nonsynonymous rates increase on a branch or if both decrease on a branch. Otherwise, the concordance measure has the value 0. Then, the posterior probability of the concordance for the branch between i and ip is estimated as
where N is total number of MCMC samples. The average among branches of these estimated concordance probabilities is
where B is the total number of branches on which synonymous and nonsynonymous rates could change. If S is close to 1, synonymous and nonsynonymous rates are inferred to change in the same direction on most branches of the tree. However, a value of S close to 1 does not necessarily indicate strong evidence of a positive correlation between synonymous and nonsynonymous rate changes, because a tree consisting of relatively few branches could yield a data set where S is close to 1. With a tree that has few branches, it is plausible that synonymous and nonsynonymous rates changed in the same direction on these few branches just by chance rather than because of a general tendency for these rate changes to be positively correlated.
To evaluate the null hypothesis that concordance and discordance are equally likely, we need to approximate the null distribution of our test statistic S. If the null hypothesis is true, the expected value of S is 0.5 and the null distribution of S can be simulated. The key to our simulation strategy is that q(i) represents the posterior probability of concordance for branch i. Likewise, 1 – q(i) represents the posterior probability of discordance. When the null hypothesis is true, the observation that the posterior probability of concordance has some particular value would be exactly as surprising as the observation that the posterior probability of discordance takes that particular value. In other words, the distribution of q(i) among branches i is expected to be symmetric about 0.5 when the null hypothesis is true. Consider a procedure that randomly determines a value q(i)* by setting with probability 0.5 and by setting otherwise. The quantity
is then approximately a sample from the null distribution of S. A large number of values of S* can then be simulated. If the value of S exceeds a proportion (1 – ) of the S* realizations, then the null hypothesis can be rejected at a significance level in favor of the alternative hypothesis that synonymous and nonsynonymous rates change in a concordant fashion.
Results
We illustrate our approach by analyzing cytochrome oxidase subunit I (COXI) from 19 primates along with two rodent outgroup species. Specifically, the represented primates and their GenBank accession numbers are as follows: Homo sapiens (J01415); Pan troglodytes (D38113); Pan paniscus (D38116); Gorilla gorilla (D38114); Pongo pygmaeus pygmaeus (D38115); Pongo pygmaeus abelii (X97707); Hylobates lar (NC002082); Theropithecus gelada (AF312702); Cercopithecus aethiops (AF312703); Saimiri ustus (AF312704); Saimiri sciureus (AF312705); Tarsius bancanus (Af312706); Perodicticus potto (AF312707); Propithecus verreauxi (Af312708); Macaca sylvanus (NC002764); Cebus albifrons (NC002763); Papio hamadryas (NC001992); Nycticebus coucang (NC002765); Lemur catta (NC004025). The two rodent outgroup species are these: Mus musculus (V00711); Rattus norvegicus (X14848). Sequence data were manually edited after being aligned with the default option of the Clustal W software (Thompson, Higgins, and Gibson 1994). All gaps in the alignment were removed prior to subsequent analyses.
Maximum Likelihood Estimation of Branch Lengths
The COXI sequences were assumed to be related by topology of Purvis (1995). The i parameters were estimated with empirical codon frequencies, and other parameters (,(m),,) were estimated via the maximum likelihood method. Figure 1a depicts the nonsynonymous branch length estimates, and figure 1b shows the synonymous branch length estimates. Although some of the branches have maximum likelihood estimates that exceed one expected synonymous change per codon, the likelihoods for these maximum likelihood estimates are still substantially higher than when the synonymous branch lengths are forced to be infinite. Therefore, we do not detect any saturation for synonymous substitution amounts in these sequences.
FIG. 1. Maximum likelihood estimates of synonymous and nonsynonymous branch lengths from COXI primate data. a. Nonsynonymous lengths. b. Synonymous branch lengths
Prior Distribution
The prior distribution of node times was set to satisfy the constraint that African apes and humans diverged from Orangutans between 13 and 18 MYA (see Cao et al. 2000). Our approach for setting prior distributions on node times (Kishino, Thorne, and Bruno 2001) also needs the mean and standard deviation of a gamma distribution to be specified. This gamma distribution would represent the prior distribution of the ingroup root time if there were no constraints on node time. We selected a gamma distribution with mean 81.5 and standard deviation 5.0. This prior mean and standard deviation were intentionally designed so as to be very similar to the best estimate and 95% confidence interval of the primate root as reported in a recent study of fossil data (Tavaré et al. 2002). Because of the constraints on the Human + African ape/Orangutan divergence, our actual prior distribution for time since the common ancestor of the 19 primate species had a mean of approximately 80.1 and an approximate 95% credibility interval of (71.1, 90.0).
To set the prior distributions for the synonymous and nonsynonymous rates at the ingroup root, our practice is to calculate the estimated numbers of synonymous and nonsynonymous substitutions separating the root and each of the ingroup tips. The median among tips of the number of synonymous substitutions per codon was 4.87, and the median for nonsynonymous changes was 0.105. By dividing these median values by the prior mean of the ingroup root time, we get 4.87/81.5 0.0598 synonymous substitutions per codon per million years and 0.105/81.5 0.00129 nonsynonymous substitutions per million years. We then set the median of an exponential prior distribution for the synonymous rate at the ingroup root node to be 0.0598 and the median of an exponential prior distribution for the nonsynonymous rate to be 0.00129. Technically, we are violating the Bayesian framework by using our data to specify the prior distributions for rates. However, the impact of this violation is expected to be insubstantial because the constraints on node times help to separate rates and times and because these exponentially distributed priors for rates have relatively large standard deviations. For the prior for both n and s, we employed an exponential distribution with a mean expected increase of 0.01 in the variance of the logarithm of evolutionary rates per million years.
Metropolis-Hastings Approximation
Posterior distributions were approximated by implementing a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970). As with the implementations described for simpler models of sequence change (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001), our Metropolis-Hastings algorithm consists of cycling through various ways of changing parameter values. To check the convergence of our Metropolis-Hastings algorithm to the posterior distribution, multiple separate Metropolis-Hastings runs from different initial parameter states were performed. All Metropolis-Hastings runs consisted of 500,000 burn-in cycles followed by one sample of the Markov chain for each 1,000 cycles until a total of 10,000 samples had been obtained.
Posterior Distribution
Figure 2 displays some of the results of our Bayesian analysis. It is interesting to compare figures 1 and 2 to inspect how relatively long estimated branch lengths can result in longer time duration of branches and/or higher evolutionary rates. When the distribution of the posterior mean of concordance statistic S is simulated according to the null hypothesis of independent synonymous and nonsynonymous rate change, the frequency of S* values that exceed 0.5234 is 0.0309. Therefore, if the alternative hypothesis had been a tendency for concordant rate changes, the associated P value would be about 0.0309. Because our alternative hypothesis was, instead, that the null of independent rate change was false, we find only marginal significance and the P value for this two-tailed test to be 0.0618.
FIG. 2. Bayesian parameter estimates from COXI primate data. Posterior means of divergence times are used to depict the internal node times. a. Posterior median of . b. Posterior median of expected number of nonsynonymous substitutions per codon per million years. c. Posterior medians of expected number of synonymous substitutions per codon per million years. d. Posterior means and 95% credible intervals of divergence times
With the posterior distribution of the n and ns parameters, we can also quantify tendencies for nonsynonymous and synonymous rates to change over time. The estimated posterior mean and 95% credibility interval for n is 0.01402 and (0.005233, 0.02984) whereas s yields 0.004247 and (1.161 x 10–4, 0.01588). Because of the large overlap between the 95% credibility intervals for n and s, it is tempting to conclude that the evidence is weak that n exceeds s. However, the posterior distributions of n and s are correlated. Out of 10,000 samples from the joint posterior distribution of n and s, the value of n exceeded s 9287 times. Based on the 10,000 samples, the median of n / s is 4.500 and the 95% credibility interval is (0.6433, 115.38). In contrast, the prior median of n / s must be 1 because the priors of n and s are identical. Explicit calculation shows that the 95% prior interval for n / s is (0.0256, 39.0). Based on the difference between the prior and the estimated posterior distribution of n / s, the data indicate that nonsynonymous rates have changed more than synonymous rates, at least when rate change is assessed on a log-scale.
The large value of n seems to be mostly due to the difference of nonsynonymous rates between anthropoids (from Pan troglodytes to Tarsius bancanus in figure 2) and non-anthropoid groups. Figure 2b shows that the nonsynonymous rates of non-anthropoids are generally lower than for anthropoids, whereas there seems to be no clear difference in synonymous rates between the two groups (fig. 2c). The elevated nonsynonymous rates in the anthropoid group seem to be responsible for the elevated ratio () of nonsynonymous to synonymous rates in this group. The estimated values of are especially high in the Old World monkey group (T. gelada, P. hamadryas, M. sylvanus, and C. aethiops), as has been previously noted (Andrews and Easteal 2000).
Simulations
We have also assessed the performance of our codon-based procedure with simulations, and we report some of our findings here. Consider six ingroup taxa and one outgroup taxon that are topologically related as depicted in figure 3. In all simulation scenarios, 100 different sequence data sets with sequence lengths of 500 codons were generated.
By exponentiating both sides of equation (12), it becomes possible to approximate the likelihood function as
When no maximum likelihood estimate of branch lengths is 0, is not on the boundary of the parameter space, and the second term of equation (12) is zero; then
and this leads to the multivariate normal approximation employed in previous studies (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002). When is on the boundary of the parameter space, the first derivative may not be zero and, for this reason, we do not rely on equation (14).
To evaluate the final term of equation (13), the determination of second derivatives is required. When the log-likelihood surface is relatively flat, numerical evaluations of second derivatives can be unstable. We have found that a strategy of estimating second derivatives in terms of the first derivatives of individual codon log-likelihoods works well. The second derivative approximation is based on the relation (e.g., Hogg and Craig 1995, p. 373)
This leads to
because the left side of equation (16) estimates the left side of equation (15) multiplied by N and the right side of equation (16) estimates the right side of equation (15) multiplied by N. The right side of equation (16) is not computationally very burdensome to evaluate and it is numerically relatively stable because only first derivatives are involved (e.g., see Porter 2002). Substituting equation (16) into the third term of equation (13), produces the approximation that we used in our analysis.
Acknowledgements
We thank Masami Hasegawa, Elizabeth A. Thompson, and two anonymous reviewers for their suggestions. This work was supported by the Institute for Bioinformatics Research and Development of the Japanese Science and Technology Corporation, the Japanese Society for the Promotion of Science, and the National Science Foundation (INT 99-0934, DEB-0120635). Software written to perform these analyses is freely available from T.-K. Seo (seo@statgen.ncsu.edu).
Literature Cited
Adkins, R. M., and R. L. Honeycutt. 1994. Evolution of the primate cytochrome c oxidase subunit II gene. J. Mol. Evol. 38:215-231.
Adkins, R. M., R. L. Honeycutt, and T. R. Disotell. 1996. Evolution of eutherian cytochrome c oxidase subunit II: heterogeneous rates of protein evolution and altered interaction with cytochrome c. Mol. Biol. Evol. 13:1393-1404.
Andrews, T. D., and S. Easteal. 2000. Evolutionary rate acceleration of cytochrome c oxidase subunit I in simian primates. J. Mol. Evol. 50:562-568.
Andrews, T. D., L. S. Jermiin, and S. Easteal. 1998. Accelerated evolution of cytochrome b in simian primates: adaptive evolution in concert with other mitochondrial proteins? J. Mol. Evol. 47:249-257.
Aris-Brosou, S., and Z. Yang. 2002. Effects of models of rate evolution on estimation of divergence dates with special reference to the metazoan 18S ribosomal RNA phylogeny. Syst. Biol. 51:703-714.
Cao, Y., M. Fujiwara, M. Nikaido, N. Okada, and M. Hasegawa. 2000. Interordinal relationships and time-scale of eutherian evolution as inferred from mitochondrial genome data. Gene 259:149-158.
Chao, L., and D. E. Carr. 1993. The molecular clock and the relationship between population size and generation time. Evolution 47:688-690.
Drummond, A., and A. G. Rodrigo. 2000. Reconstructing genealogies of serial samples under the assumption of a molecular clock using serial-sample UPGMA. Mol. Biol. Evol. 17:1807-1815.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 1995. Bayesian data analysis. Chapman & Hall, 2-6 London.
Glazko, G. V., and M. Nei. 2003. Estimation of divergence times for major lineages of primate species. Mol. Biol. Evol. 20:424-434.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 17:32-43.
Grossman, L. I., T. R. Schmidt, D. E. Wildman, and M. Goodman. 2001. Molecular evolution of aerobic energy metabolism in primates. Mol. Phylogent. Evol. 18:26-36.
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
Hogg, R. V., and A. T. Craig. 1995. Introduction to mathematical statistics. Fifth edition. Prentice-Hall, Eaglewood Cliffs, N.J.
Huelsenbeck, J. P., B. S. Larget, and D. Swofford. 2000. A compound Poisson process for relaxing the molecular clock. Genetics 154:1879-1892.
Ina, Y. 1995. New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J. Mol. Evol. 40:190-226.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, U.K.
Kishino, H., J. L. Thorne, and W. J. Bruno. 2001. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol. Biol. Evol. 18:352-361.
Korber, B., M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. Hahn, S. Wolinsky, and T. Bhattacharya. 2000. Timing the ancestor of the HIV-1 pandemic strains. Science. 288:1757-1759.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machine. J. Chem. Phys. 21:1087-1091.
Muse, S., and B. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715-724.
Muse, S. V., and B. S. Weir. 1992. Testing for equality of evolutionary rates. Genetics 132:269-276.
Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.
Ohta, T. 1972. Population size and rate of evolution. J. Mol. Evol. 1:305-314.
Ohta, T. 1987. Very slightly deleterious mutations and the molecular clock. J. Mol. Evol. 26:1-6.
Ohta, T. 1995. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol. 40:56-63.
Ohta, T., and H. Tachida. 1990. Theoretical study of near neutrality. I. Heterozygosity and rate of mutant substitution. Genetics. 126:219-229.
Pedersen, A.-M., K. C. Wiuf, and R. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069-1081.
Porter, J. 2002. Efficiency of covariance matrix estimators for maximum likelihood estimation. J. Bus. Econ. Stat. 20:431-440.
Purvis, A. 1995. A composite estimate of primate phylogeny. Phil. Trans. R. Soc. Lond. Ser. B. 348:405-421.
Rambaut, A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16:395-399.
Sanderson, M. J. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol. Biol. Evol. 14:1218-1231.
Sanderson, M. J. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol. Biol. Evol. 19:101-109.
Sanderson, M. J. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19:301-302.
Stauffer, R. L., A. Walker, O. A. Ryder, M. Lyons-Weiler, and S. B. Hedges. 2001. Human and ape molecular clocks and constraints on paleontological hypotheses. J. Hered. 92:469-474.
Schmidt, T. R., M. Goodman, and L. I. Grossman. 1999. Molecular evolution of the COX7A gene family in primates. Mol. Biol. Evol. 16:619-626.
Tavaré, S., C. R. Marshall, O. Will, C. Soligo, and R. D. Martin. 2002. Using the fossil record to estimate the age of the last common ancestor of extant primates. Nature 18:726-729.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Thorne, J. L., H. Kishino, and I. S. Painter. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647-1657.
Thorne, J. L., and H. Kishino. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. 51:689-702.
Wildman, D. E., W. Wu, M. Goodman, and L. I. Grossman. 2002. Episodic positive selection in ape cytochrome c oxidase subunit IV. Mol. Biol. Evol. 19:1812-1815.
Wu, C. I., and W.-H. Li. 1985. Evidence for higher rates of nucleotide substitution in rodents than in man. Proc. Natl. Acad. Sci. USA 82:1741-1745.
Wu, W., M. Goodman, M. I. Lomax, and L. I. Grossman. 1997. Molecular evolution of cytochrome oxidase subunit IV: evidence for positive selection in simian primates. J. Mol. Evol. 44:477-491.
Wu, W., T. R. Schmidt, M. Goodman, and L. I. Grossman. 2000. Molecular evolution of cytochrome c oxidase subunit I in primates: is there coevolution between mitochondrial and nuclear genomes? Mol. Phylogenet. Evol. 17:294-304.
Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.
Yang, Z., and P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15:496-503.
Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409-418.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917.
Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49-57.
Yoder, A. D., and Z. Yang. 2000. Estimation of primate speciation dates using local molecular clocks. Mol. Biol. Evol. 17:1081-1090.
Zhang, Z., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708-3713.(Tae-Kun Seo, Hirohisa Kis)
Laboratory of Biometrics, Graduate School of Agriculture and Life Sciences, University of Tokyo, Bunkyo-ku, Tokyo, Japan
E-mail: seo@statgen.ncsu.edu
Abstract
The rate of molecular evolution can vary among lineages. Sources of this variation have differential effects on synonymous and nonsynonymous substitution rates. Changes in effective population size or patterns of natural selection will mainly alter nonsynonymous substitution rates. Changes in generation length or mutation rates are likely to have an impact on both synonymous and nonsynonymous substitution rates. By comparing changes in synonymous and nonsynonymous rates, the relative contributions of the driving forces of evolution can be better characterized. Here, we introduce a procedure for estimating the chronological rates of synonymous and nonsynonymous substitutions on the branches of an evolutionary tree. Because the widely used ratio of nonsynonymous and synonymous rates is not designed to detect simultaneous increases or simultaneous decreases in synonymous and nonsynonymous rates, the estimation of these rates rather than their ratio can improve characterization of the evolutionary process. With our Bayesian approach, we analyze cytochrome oxidase subunit I evolution in primates and infer that nonsynonymous rates have a greater tendency to change over time than do synonymous rates. Our analysis of these data also suggests that rates have been positively correlated.
Key Words: Bayesian method ? Markov chain Monte Carlo (MCMC) ? rate change model ? lognormal ? evolutionary rate ? selection ? synonymous rate ? nonsynonymous rate
Introduction
Much attention has been paid to synonymous and nonsynonymous substitution amounts and their ratio (Nei and Gojobori 1986; Goldman and Yang 1994; Muse and Gaut 1994; Ina 1995; Ohta 1995; Zhang, Rosenberg and Nei 1998; Yang 1998; Yang and Nielsen 1998, 2000, 2002; Yang and Bielawski 2000; Yang and Swanson 2002), but less emphasis has been placed on the actual values of nonsynonymous and synonymous rates. This is unfortunate because absolute rates of nonsynonymous and synonymous substitutions can provide more insight into evolutionary mechanism than can the amounts or their ratio. A change in the ratio may be due to alteration of the nonsynonymous or synonymous substitution rate or both. With the ability to estimate the actual nonsynonymous and synonymous rates, the factors involved in a change to the evolutionary process can be more readily identified.
For estimation of substitution rates in chronological time units (e.g., years, millions of years, etc.), the most difficult obstacle is that only the amount of evolution (i.e., the product of rate and time) can be inferred solely from molecular sequence data. This confounding of rates and times necessitates the addition of information external to the molecular sequence data if actual nonsynonymous and synonymous substitution rates are to be accurately determined. Conventionally, this external information has taken the form of fossil data that calibrates a "molecular clock" that is assumed to specify a common rate of evolution on all branches of a phylogeny. Recently, the sampling times of quickly evolving organisms such as RNA viruses have been exploited to calibrate the molecular clock on some genealogies (Rambaut 2000; Drummond and Rodrigo 2000). Regardless of how calibration is achieved, when the goal is to determine how substitution rates have changed over time, the molecular clock assumption of constant rates has little utility. There have been proposals for analyzing sequence data with "local clocks" that operate on only a portion of an evolutionary tree (Yoder and Yang 2000). This local clock approach is attractive, but there may be some difficulty in determining how many local clocks should be employed and in determining which branches should be governed by which local clocks.
Sanderson (1997; see also Sanderson 2002, 2003) introduced a pioneering approach that permits different rates in different lineages and that lacks the model specification difficulties of the local clock approaches. Later, Bayesian approaches for allowing the rate of evolution to change over time were developed (Thorne, Kishino, and Painter 1998; Huelsenbeck, Larget, and Swofford 2000; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002; Aris-Brosou and Yang 2002). An advantage of the Bayesian implementations is their ability to incorporate both the full likelihood and prior information. Here, we adapt previously proposed Bayesian approaches (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002) to a codon model of evolution. In this way, absolute synonymous and nonsynonymous substitution rates as well as divergence times can be estimated. In addition, the tendency for nonsynonymous rates to change over time can be compared and contrasted with the tendency for synonymous rates to change over time. We also introduce a test of whether nonsynonymous and synonymous substitution rates change in a correlated fashion over time, and we discuss the strengths and weaknesses of this test.
Our new methodology is applied to a group of cytochrome oxidase subunit I (COXI) primate sequences. Cytochrome oxidase is a protein complex that affects aerobic energy metabolism by playing an important role in the final step of the electron transport system. Cytochrome oxidase is composed of 13 subunits. Subunits 1, 2, and 3 are encoded in the mitochondrion, and the others are encoded in the nucleus. The molecular evolution of some of the subunits has been investigated in detail and, despite the different genomic origins of the subunits that have been studied, a consistent pattern among subunits is that anthropoids have experienced a higher ratio of nonsynonymous to synonymous change than have non-anthropoids (Adkins and Honeycutt 1994; Adkins Honeycutt, and Disotell 1996; Wu et al. 1997, 2000; Schmidt, Goodman, and Grossman 1999; Andrews and Easteal 2000; Grossman et al. 2001; Wildman, Goodman, and Grossman 2002). One suggested explanation of this pattern is that there has been an increase in size of the neocortex during anthropoid evolution and that the increased needs of the neocortex for energy may be responsible for a change in aerobic energy metabolism in anthropoids. Selection for an alteration to aerobic energy metabolism could have led to the higher nonsynonymous to synonymous substitution ratio of cytochrome oxidase in anthropoids (see Grossman et al. 2001).
By analyzing COXI, we obtain divergence time estimates similar to those obtained in a previous study that analyzed total mitochondrial genomes with a treatment that did not distinguish between synonymous and nonsynonymous changes (Cao et al. 2000). Our inferred pattern of nonsynonymous rate increase in the anthropoid group is consistent with previous analyses based on different methods (Andrews and Easteal 2000; Wu et al. 2000). An advantage of our Bayesian method over traditional approaches that are based on pairwise distances is that it can localize rate changes to relatively specific portions of a phylogeny. For COXI, we tentatively conclude that nonsynonymous rates have been comparatively high throughout the anthropoid group and not just on the branch that was immediately ancestral to the anthropoid group.
Methods
Codon Model
Consider a slight modification of the codon substitution model introduced by Goldman and Yang (1994). The modification allows the expected proportion of nonsynonymous substitutions to vary among branches and has been previously suggested (Yang 1998). With this modification, the instantaneous rate of substitution from codon i to codon j on branch m is
where j is the frequency of codon j, differentiates between transitions and transversions, and (m) distinguishes between nonsynonymous and synonymous substitutions for branch m. In contrast to the (m) parameter, the u(m) parameter affects the chronological rates of both synonymous and nonsynonymous substitutions on branch m. In equation (1), the i parameters and are common to all branches.
The average rate on a branch m of synonymous and nonsynonymous substitution per chronological time unit will be respectively denoted and . These rates can be defined as
and
The synonymous and nonsynonymous branch lengths for a branch m with chronological time duration t(m) are
and
The total length of branch m (i.e., the expected number of substitutions per codon) is then
Knowing + and (m) along with and the i parameters allows determination of and . This is because , the i parameters, and (m) define the / u(m) values. The / u(m) values then allow calculation of / u(m) and / u(m) via equations (2) and (3). Equation (6) can be used to solve for u(m)t(m). Finally, u(m)t(m) with / u(m) determines (eq. 4) and u(m)t(m) with / u(m) determines (eq. 5).
Likewise, knowing and along with and the i parameters allows determination of + and (m). This is because and the i parameters are sufficient to define / u(m). With equation (4), u(m)t(m) can be expressed in terms of / u(m) and . Using u(m)t(m) and equation (5), the value of / u(m) follows. Finally, the value of (m) can be found by applying equation (3).
We stress this ability to interconvert between the two sets of parameters because the (i,, + ,(m)) parameterization is more natural for calculation of likelihoods whereas the (i,,,) parameterization is more natural when considering our model of synonymous and nonsynonymous rate change. Ability to interconvert between these two parameterizations also allows the invariance property of maximum likelihood estimators (e.g., Hogg and Craig 1995, p. 265) to be exploited so that estimates under one parameterization can therefore be converted to be estimates under the other. Maximum likelihood estimates are relevant because, as described below, our Bayesian approach involves approximating the likelihood surface needed for posterior density estimation by Taylor expansion around the maximum likelihood estimates.
Model of Nonsynonymous/Synonymous Rate Change
Our model of rate evolution is similar in spirit to that of Kishino, Thorne, and Bruno (2001). When evolutionary lineages are separated by only small amounts of time, their evolutionary rates are likely to be similar. When lineages are distantly related, their rates are more likely to substantially differ. In other words, evolutionary rates are autocorrelated.
Rates of evolution are likely to undergo a nearly continuous process of change on a lineage. Because molecular sequence data contain only limited information regarding this nearly continuous process of change, the average evolutionary rate on a branch is approximated in our treatment by the mean of the rate of evolution at the nodes that begin and end the branch. If rates of evolution changed in a deterministic linear fashion between beginning and ending nodes of a branch, our approximation would be exact. However, we simply view this connection between node rates and average rates on a branch as a computationally simplifying assumption. With this approximation, the average rates of evolution on branches and the expected amounts of evolution on branches both depend completely on rates of evolution at the nodes of a tree.
To model the rates of evolution at tree nodes, a nonsynonymous rate variation parameter n and a synonymous rate variation parameter s are introduced. The case where n = s = 0 represents the biologically unrealistic case of a molecular clock. These rate variation parameters are restricted to non-negative values and, as they increase, the expected departure from a molecular clock becomes more extreme. Given the nonsynonymous rate and the synonymous rate at a node i that begins a branch of time duration t, the logarithms of the rates at the node j that ends the branch are assumed to follow a bivariate normal distribution of the form:
With this parameterization, the expected value of () is larger than (), even though the expected value of log (log ) equals log (log ). The difference gets bigger when n(s) or t is large. To eliminate such differences, Kishino, Thorne, and Bruno (2001) forced the expected value of rates at the end of branches to be equal to their values at the beginning of branches. Because the parameterization in equation (7) simplifies our proposed test of concordance between synonymous and nonsynonymous rates (see below), we do not make any attempt to eliminate such differences here.
As a further simplification, we set the a priori covariance of log and log to be zero in equation (7). Future implementations could easily permit this covariance to be nonzero a priori or might even define a prior distribution for a parameter representing the covariance. This hierarchical approach may be worthwhile because some features of the evolutionary process such as generation time or mutation rate might simultaneously affect both nonsynonymous and synonymous rates when they change. Although we model synonymous and nonsynonymous rate change as occurring independently, we describe a procedure that uses sequence data to detect correlation in changes of these rates.
We assume the topology relating the sequences is known. Following Thorne, Kishino, and Painter (1998), our procedure requires an outgroup to root the ingroup, but no attempt is made to model rate evolution in the outgroup taxa. With the model of rate evolution described by equation (7), the prior density of rates at a node is defined in terms of the rates at its parental node, and this leaves the prior distribution for the rates at the ingroup root node to be specified. In our implementation, the ingroup root node has priors for nonsynonymous and synonymous rates that are specified by two independent gamma distributions. For n and s, we specify priors that are exponential distributions.
For every rooted tree, our implementation arbitrarily selects one branch that emanates from the ingroup root node. On this branch, the nonsynonymous and synonymous rates at the end of the branch are forced to be identical to the corresponding rates at the ingroup root node. In previous work (Kishino, Thorne, and Bruno 2001), this forcing of no rate change along one branch has made Markov chain Monte Carlo (MCMC) approximation of posterior rate densities computationally tractable. With a similar motivation, the forcing of no rate change was included here to achieve computational tractability with the MCMC procedure described below. With the exception of this single branch, evolution of rates on all branches is governed by equation (7).
Prior Distribution
For the Bayesian estimation of chronological rates of nonsynonymous and synonymous substitutions, a prior distribution of node times must be specified. The details of the prior distribution for node times are identical to those described in Kishino, Thorne, and Bruno (2001), and only an outline is provided here. For the case where there are no constraints on node times due to fossil or other information, our implementation uses a gamma distribution to describe the time of the ingroup root node. Given the ingroup root time, the prior distribution for the times of all other internal nodes are jointly distributed according to a modified Dirichlet distribution (Kishino, Thorne, and Bruno 2001). Representing the node times (including the ingroup root) by a vector T, the modified Dirichlet distribution and the gamma distribution of the ingroup root time determine the prior density of the node times p(T).
Often, fossil or other evidence suggests that a certain node time may exceed some value. Likewise, evidence external to the molecular sequence data may indicate that a node is more recent than some date. Sanderson (1997) demonstrated that constraints on node times can be valuable for estimating dates of divergence. When a set C of constraints on node times is available, we condition upon these constraints and use this conditional prior p(T|C) to estimate divergence times and evolutionary rates.
Posterior Distribution and MCMC
Following Yang (1998; see also Pederson, Wiuf, and Christiansen 1998), we have explored various parameterizations of the codon frequencies in terms of frequencies of the four nucleotide types. One parameterization forces these nucleotide frequencies to be identical at all three codon positions, and another allows different nucleotide frequencies at the first, second, and third codon positions. However, for all results presented here, we estimate probabilities of the different codon types by their empirical frequencies. We then treat these estimated codon frequencies as well as the maximum likelihood estimate of as if their values were known with certainty. Because we neglect uncertainty regarding the and i parameters, we omit them from subsequent notation and focus on approximating the posterior distribution p(T,rn,rs,n, s|X,C). We note that the values of T, rn, and rs are sufficient to determine bn and bs. In turn, bn and bs along with the known and i parameters determine (or can be determined by) and the sum bn + bs.
For given sequence data X, the posterior distribution of unknown parameters is
where T, rn, and rs respectively represent the vectors of the divergence times, the nonsynonymous rates , and the synonymous rates . Because p(X|T,rn,rs) = p(X|bn,bs), equation (8) can be rewritten as
Except for the treatment of p(X|bn,bs), the Metropolis-Hastings implementation that we use to sample from the posterior distribution p(T,rn,rs,n,s|X,C) is a straightforward extension of that described earlier for estimating divergence times and evolutionary rates with less realistic models of sequence change (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001). The p(X|bn,bs) term in equation (9) is a likelihood and can be calculated with Felsenstein's pruning algorithm (Felsenstein 1981). Because the pruning algorithm is computationally demanding, we instead approximate p(X|bn,bs) up to a proportionality constant with a Taylor expansion of the log-likelihood function around the maximum likelihood estimates of bn and bs (see Appendix). Key to this approximation is the ability, as described earlier, to interconvert between the (bn + bs, , i, ) parameterization that is convenient for calculating likelihoods and the (bn,bs,i,) parameterization used in the p(X|bn,bs) term of equation (9).
We note that the approximation of p(X|bn,bs) that is implemented here is an improvement on the multivariate normal one of earlier studies (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002). The one here is equivalent to the multivariate normal treatment when maximum likelihood branch length estimates are not close to 0. The approximation of p(X|bn,bs) in earlier studies (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002) treated maximum likelihood branch length estimates that are close to 0 in an ad hoc fashion. Although the approximation here is similar to those applied elsewhere (e.g., see Gelman et al. 1995, p. 95; also see the supplementary information of Korber et al. 2000 at http://www.santafe.edu/btk/science-paper/bette.html), we describe it in detail (see Appendix).
To approximate the likelihood function p(X|bn,bs) up to a constant of proportionality, we need to numerically evaluate the first and second derivatives with respect to branch lengths of the log-likelihood function around its maximum. Our experience has been that numerical stability of second derivative approximations is more difficult to achieve with codon models than with simpler models that have independent evolution among nucleotide positions. For this reason, we evaluated second derivatives with the strategy described in the Appendix. Our experience is that this strategy is numerically stable even when synonymous branch lengths are rather large.
Correlated Change of Synonymous and Nonsynonymous Rates: Test of Concordance
In addition to inferring the pattern of changes in chronological rates of synonymous and nonsynonymous substitutions over time, it is of particular interest to examine whether changes in the synonymous and nonsynonymous rates are correlated. Because mutation rate and generation length affect both synonymous and nonsynonymous rates, changes in either mutation rate or generation length could yield changes in both these rates. Consider the null hypothesis that the synonymous rate and the nonsynonymous rate change independently. Let and be respectively the nonsynonymous and synonymous rates at node i in the jth sample from the Markov chain used in the MCMC approximation of the posterior distribution. We define the concordance measure
where ip is the parent node of i. The concordance measure takes the value 1 if both synonymous and nonsynonymous rates increase on a branch or if both decrease on a branch. Otherwise, the concordance measure has the value 0. Then, the posterior probability of the concordance for the branch between i and ip is estimated as
where N is total number of MCMC samples. The average among branches of these estimated concordance probabilities is
where B is the total number of branches on which synonymous and nonsynonymous rates could change. If S is close to 1, synonymous and nonsynonymous rates are inferred to change in the same direction on most branches of the tree. However, a value of S close to 1 does not necessarily indicate strong evidence of a positive correlation between synonymous and nonsynonymous rate changes, because a tree consisting of relatively few branches could yield a data set where S is close to 1. With a tree that has few branches, it is plausible that synonymous and nonsynonymous rates changed in the same direction on these few branches just by chance rather than because of a general tendency for these rate changes to be positively correlated.
To evaluate the null hypothesis that concordance and discordance are equally likely, we need to approximate the null distribution of our test statistic S. If the null hypothesis is true, the expected value of S is 0.5 and the null distribution of S can be simulated. The key to our simulation strategy is that q(i) represents the posterior probability of concordance for branch i. Likewise, 1 – q(i) represents the posterior probability of discordance. When the null hypothesis is true, the observation that the posterior probability of concordance has some particular value would be exactly as surprising as the observation that the posterior probability of discordance takes that particular value. In other words, the distribution of q(i) among branches i is expected to be symmetric about 0.5 when the null hypothesis is true. Consider a procedure that randomly determines a value q(i)* by setting with probability 0.5 and by setting otherwise. The quantity
is then approximately a sample from the null distribution of S. A large number of values of S* can then be simulated. If the value of S exceeds a proportion (1 – ) of the S* realizations, then the null hypothesis can be rejected at a significance level in favor of the alternative hypothesis that synonymous and nonsynonymous rates change in a concordant fashion.
Results
We illustrate our approach by analyzing cytochrome oxidase subunit I (COXI) from 19 primates along with two rodent outgroup species. Specifically, the represented primates and their GenBank accession numbers are as follows: Homo sapiens (J01415); Pan troglodytes (D38113); Pan paniscus (D38116); Gorilla gorilla (D38114); Pongo pygmaeus pygmaeus (D38115); Pongo pygmaeus abelii (X97707); Hylobates lar (NC002082); Theropithecus gelada (AF312702); Cercopithecus aethiops (AF312703); Saimiri ustus (AF312704); Saimiri sciureus (AF312705); Tarsius bancanus (Af312706); Perodicticus potto (AF312707); Propithecus verreauxi (Af312708); Macaca sylvanus (NC002764); Cebus albifrons (NC002763); Papio hamadryas (NC001992); Nycticebus coucang (NC002765); Lemur catta (NC004025). The two rodent outgroup species are these: Mus musculus (V00711); Rattus norvegicus (X14848). Sequence data were manually edited after being aligned with the default option of the Clustal W software (Thompson, Higgins, and Gibson 1994). All gaps in the alignment were removed prior to subsequent analyses.
Maximum Likelihood Estimation of Branch Lengths
The COXI sequences were assumed to be related by topology of Purvis (1995). The i parameters were estimated with empirical codon frequencies, and other parameters (,(m),,) were estimated via the maximum likelihood method. Figure 1a depicts the nonsynonymous branch length estimates, and figure 1b shows the synonymous branch length estimates. Although some of the branches have maximum likelihood estimates that exceed one expected synonymous change per codon, the likelihoods for these maximum likelihood estimates are still substantially higher than when the synonymous branch lengths are forced to be infinite. Therefore, we do not detect any saturation for synonymous substitution amounts in these sequences.
FIG. 1. Maximum likelihood estimates of synonymous and nonsynonymous branch lengths from COXI primate data. a. Nonsynonymous lengths. b. Synonymous branch lengths
Prior Distribution
The prior distribution of node times was set to satisfy the constraint that African apes and humans diverged from Orangutans between 13 and 18 MYA (see Cao et al. 2000). Our approach for setting prior distributions on node times (Kishino, Thorne, and Bruno 2001) also needs the mean and standard deviation of a gamma distribution to be specified. This gamma distribution would represent the prior distribution of the ingroup root time if there were no constraints on node time. We selected a gamma distribution with mean 81.5 and standard deviation 5.0. This prior mean and standard deviation were intentionally designed so as to be very similar to the best estimate and 95% confidence interval of the primate root as reported in a recent study of fossil data (Tavaré et al. 2002). Because of the constraints on the Human + African ape/Orangutan divergence, our actual prior distribution for time since the common ancestor of the 19 primate species had a mean of approximately 80.1 and an approximate 95% credibility interval of (71.1, 90.0).
To set the prior distributions for the synonymous and nonsynonymous rates at the ingroup root, our practice is to calculate the estimated numbers of synonymous and nonsynonymous substitutions separating the root and each of the ingroup tips. The median among tips of the number of synonymous substitutions per codon was 4.87, and the median for nonsynonymous changes was 0.105. By dividing these median values by the prior mean of the ingroup root time, we get 4.87/81.5 0.0598 synonymous substitutions per codon per million years and 0.105/81.5 0.00129 nonsynonymous substitutions per million years. We then set the median of an exponential prior distribution for the synonymous rate at the ingroup root node to be 0.0598 and the median of an exponential prior distribution for the nonsynonymous rate to be 0.00129. Technically, we are violating the Bayesian framework by using our data to specify the prior distributions for rates. However, the impact of this violation is expected to be insubstantial because the constraints on node times help to separate rates and times and because these exponentially distributed priors for rates have relatively large standard deviations. For the prior for both n and s, we employed an exponential distribution with a mean expected increase of 0.01 in the variance of the logarithm of evolutionary rates per million years.
Metropolis-Hastings Approximation
Posterior distributions were approximated by implementing a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970). As with the implementations described for simpler models of sequence change (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001), our Metropolis-Hastings algorithm consists of cycling through various ways of changing parameter values. To check the convergence of our Metropolis-Hastings algorithm to the posterior distribution, multiple separate Metropolis-Hastings runs from different initial parameter states were performed. All Metropolis-Hastings runs consisted of 500,000 burn-in cycles followed by one sample of the Markov chain for each 1,000 cycles until a total of 10,000 samples had been obtained.
Posterior Distribution
Figure 2 displays some of the results of our Bayesian analysis. It is interesting to compare figures 1 and 2 to inspect how relatively long estimated branch lengths can result in longer time duration of branches and/or higher evolutionary rates. When the distribution of the posterior mean of concordance statistic S is simulated according to the null hypothesis of independent synonymous and nonsynonymous rate change, the frequency of S* values that exceed 0.5234 is 0.0309. Therefore, if the alternative hypothesis had been a tendency for concordant rate changes, the associated P value would be about 0.0309. Because our alternative hypothesis was, instead, that the null of independent rate change was false, we find only marginal significance and the P value for this two-tailed test to be 0.0618.
FIG. 2. Bayesian parameter estimates from COXI primate data. Posterior means of divergence times are used to depict the internal node times. a. Posterior median of . b. Posterior median of expected number of nonsynonymous substitutions per codon per million years. c. Posterior medians of expected number of synonymous substitutions per codon per million years. d. Posterior means and 95% credible intervals of divergence times
With the posterior distribution of the n and ns parameters, we can also quantify tendencies for nonsynonymous and synonymous rates to change over time. The estimated posterior mean and 95% credibility interval for n is 0.01402 and (0.005233, 0.02984) whereas s yields 0.004247 and (1.161 x 10–4, 0.01588). Because of the large overlap between the 95% credibility intervals for n and s, it is tempting to conclude that the evidence is weak that n exceeds s. However, the posterior distributions of n and s are correlated. Out of 10,000 samples from the joint posterior distribution of n and s, the value of n exceeded s 9287 times. Based on the 10,000 samples, the median of n / s is 4.500 and the 95% credibility interval is (0.6433, 115.38). In contrast, the prior median of n / s must be 1 because the priors of n and s are identical. Explicit calculation shows that the 95% prior interval for n / s is (0.0256, 39.0). Based on the difference between the prior and the estimated posterior distribution of n / s, the data indicate that nonsynonymous rates have changed more than synonymous rates, at least when rate change is assessed on a log-scale.
The large value of n seems to be mostly due to the difference of nonsynonymous rates between anthropoids (from Pan troglodytes to Tarsius bancanus in figure 2) and non-anthropoid groups. Figure 2b shows that the nonsynonymous rates of non-anthropoids are generally lower than for anthropoids, whereas there seems to be no clear difference in synonymous rates between the two groups (fig. 2c). The elevated nonsynonymous rates in the anthropoid group seem to be responsible for the elevated ratio () of nonsynonymous to synonymous rates in this group. The estimated values of are especially high in the Old World monkey group (T. gelada, P. hamadryas, M. sylvanus, and C. aethiops), as has been previously noted (Andrews and Easteal 2000).
Simulations
We have also assessed the performance of our codon-based procedure with simulations, and we report some of our findings here. Consider six ingroup taxa and one outgroup taxon that are topologically related as depicted in figure 3. In all simulation scenarios, 100 different sequence data sets with sequence lengths of 500 codons were generated.
By exponentiating both sides of equation (12), it becomes possible to approximate the likelihood function as
When no maximum likelihood estimate of branch lengths is 0, is not on the boundary of the parameter space, and the second term of equation (12) is zero; then
and this leads to the multivariate normal approximation employed in previous studies (Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Thorne and Kishino 2002). When is on the boundary of the parameter space, the first derivative may not be zero and, for this reason, we do not rely on equation (14).
To evaluate the final term of equation (13), the determination of second derivatives is required. When the log-likelihood surface is relatively flat, numerical evaluations of second derivatives can be unstable. We have found that a strategy of estimating second derivatives in terms of the first derivatives of individual codon log-likelihoods works well. The second derivative approximation is based on the relation (e.g., Hogg and Craig 1995, p. 373)
This leads to
because the left side of equation (16) estimates the left side of equation (15) multiplied by N and the right side of equation (16) estimates the right side of equation (15) multiplied by N. The right side of equation (16) is not computationally very burdensome to evaluate and it is numerically relatively stable because only first derivatives are involved (e.g., see Porter 2002). Substituting equation (16) into the third term of equation (13), produces the approximation that we used in our analysis.
Acknowledgements
We thank Masami Hasegawa, Elizabeth A. Thompson, and two anonymous reviewers for their suggestions. This work was supported by the Institute for Bioinformatics Research and Development of the Japanese Science and Technology Corporation, the Japanese Society for the Promotion of Science, and the National Science Foundation (INT 99-0934, DEB-0120635). Software written to perform these analyses is freely available from T.-K. Seo (seo@statgen.ncsu.edu).
Literature Cited
Adkins, R. M., and R. L. Honeycutt. 1994. Evolution of the primate cytochrome c oxidase subunit II gene. J. Mol. Evol. 38:215-231.
Adkins, R. M., R. L. Honeycutt, and T. R. Disotell. 1996. Evolution of eutherian cytochrome c oxidase subunit II: heterogeneous rates of protein evolution and altered interaction with cytochrome c. Mol. Biol. Evol. 13:1393-1404.
Andrews, T. D., and S. Easteal. 2000. Evolutionary rate acceleration of cytochrome c oxidase subunit I in simian primates. J. Mol. Evol. 50:562-568.
Andrews, T. D., L. S. Jermiin, and S. Easteal. 1998. Accelerated evolution of cytochrome b in simian primates: adaptive evolution in concert with other mitochondrial proteins? J. Mol. Evol. 47:249-257.
Aris-Brosou, S., and Z. Yang. 2002. Effects of models of rate evolution on estimation of divergence dates with special reference to the metazoan 18S ribosomal RNA phylogeny. Syst. Biol. 51:703-714.
Cao, Y., M. Fujiwara, M. Nikaido, N. Okada, and M. Hasegawa. 2000. Interordinal relationships and time-scale of eutherian evolution as inferred from mitochondrial genome data. Gene 259:149-158.
Chao, L., and D. E. Carr. 1993. The molecular clock and the relationship between population size and generation time. Evolution 47:688-690.
Drummond, A., and A. G. Rodrigo. 2000. Reconstructing genealogies of serial samples under the assumption of a molecular clock using serial-sample UPGMA. Mol. Biol. Evol. 17:1807-1815.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 1995. Bayesian data analysis. Chapman & Hall, 2-6 London.
Glazko, G. V., and M. Nei. 2003. Estimation of divergence times for major lineages of primate species. Mol. Biol. Evol. 20:424-434.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 17:32-43.
Grossman, L. I., T. R. Schmidt, D. E. Wildman, and M. Goodman. 2001. Molecular evolution of aerobic energy metabolism in primates. Mol. Phylogent. Evol. 18:26-36.
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
Hogg, R. V., and A. T. Craig. 1995. Introduction to mathematical statistics. Fifth edition. Prentice-Hall, Eaglewood Cliffs, N.J.
Huelsenbeck, J. P., B. S. Larget, and D. Swofford. 2000. A compound Poisson process for relaxing the molecular clock. Genetics 154:1879-1892.
Ina, Y. 1995. New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J. Mol. Evol. 40:190-226.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, U.K.
Kishino, H., J. L. Thorne, and W. J. Bruno. 2001. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol. Biol. Evol. 18:352-361.
Korber, B., M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. Hahn, S. Wolinsky, and T. Bhattacharya. 2000. Timing the ancestor of the HIV-1 pandemic strains. Science. 288:1757-1759.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machine. J. Chem. Phys. 21:1087-1091.
Muse, S., and B. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715-724.
Muse, S. V., and B. S. Weir. 1992. Testing for equality of evolutionary rates. Genetics 132:269-276.
Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.
Ohta, T. 1972. Population size and rate of evolution. J. Mol. Evol. 1:305-314.
Ohta, T. 1987. Very slightly deleterious mutations and the molecular clock. J. Mol. Evol. 26:1-6.
Ohta, T. 1995. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol. 40:56-63.
Ohta, T., and H. Tachida. 1990. Theoretical study of near neutrality. I. Heterozygosity and rate of mutant substitution. Genetics. 126:219-229.
Pedersen, A.-M., K. C. Wiuf, and R. B. Christiansen. 1998. A codon-based model designed to describe lentiviral evolution. Mol. Biol. Evol. 15:1069-1081.
Porter, J. 2002. Efficiency of covariance matrix estimators for maximum likelihood estimation. J. Bus. Econ. Stat. 20:431-440.
Purvis, A. 1995. A composite estimate of primate phylogeny. Phil. Trans. R. Soc. Lond. Ser. B. 348:405-421.
Rambaut, A. 2000. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16:395-399.
Sanderson, M. J. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol. Biol. Evol. 14:1218-1231.
Sanderson, M. J. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol. Biol. Evol. 19:101-109.
Sanderson, M. J. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19:301-302.
Stauffer, R. L., A. Walker, O. A. Ryder, M. Lyons-Weiler, and S. B. Hedges. 2001. Human and ape molecular clocks and constraints on paleontological hypotheses. J. Hered. 92:469-474.
Schmidt, T. R., M. Goodman, and L. I. Grossman. 1999. Molecular evolution of the COX7A gene family in primates. Mol. Biol. Evol. 16:619-626.
Tavaré, S., C. R. Marshall, O. Will, C. Soligo, and R. D. Martin. 2002. Using the fossil record to estimate the age of the last common ancestor of extant primates. Nature 18:726-729.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Thorne, J. L., H. Kishino, and I. S. Painter. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647-1657.
Thorne, J. L., and H. Kishino. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. 51:689-702.
Wildman, D. E., W. Wu, M. Goodman, and L. I. Grossman. 2002. Episodic positive selection in ape cytochrome c oxidase subunit IV. Mol. Biol. Evol. 19:1812-1815.
Wu, C. I., and W.-H. Li. 1985. Evidence for higher rates of nucleotide substitution in rodents than in man. Proc. Natl. Acad. Sci. USA 82:1741-1745.
Wu, W., M. Goodman, M. I. Lomax, and L. I. Grossman. 1997. Molecular evolution of cytochrome oxidase subunit IV: evidence for positive selection in simian primates. J. Mol. Evol. 44:477-491.
Wu, W., T. R. Schmidt, M. Goodman, and L. I. Grossman. 2000. Molecular evolution of cytochrome c oxidase subunit I in primates: is there coevolution between mitochondrial and nuclear genomes? Mol. Phylogenet. Evol. 17:294-304.
Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.
Yang, Z., and P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15:496-503.
Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409-418.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917.
Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49-57.
Yoder, A. D., and Z. Yang. 2000. Estimation of primate speciation dates using local molecular clocks. Mol. Biol. Evol. 17:1081-1090.
Zhang, Z., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708-3713.(Tae-Kun Seo, Hirohisa Kis)