Comparison of Three Methods for Estimating Rates of Synonymous and Nonsynonymous Nucleotide Substitutions(百拇医药)

Comparison of Three Methods for Estimating Rates of Synonymous and Nonsynonymous Nucleotide Substitutions

http://www.100md.com 分子生物学进展 2004年第12期

     * Department of Ecology and Evolution, University of Chicago; Department of Mathematics, National Tsing Hua University, Hsinchu, Taiwan; and Institute of Information Science, Academia Sinica, Taiwan

    E-mail: whli@uchicago.edu.

    Abstract

    Three frequently used methods for estimating the synonymous and nonsynonymous substitution rates (Ks and Ka) were evaluated and compared for their accuracies; these methods are denoted by LWL85, LPB93, and GY94, respectively. For this purpose, we used a codon-evolution model to obtain the expected Ka and Ks values for the above three methods and compared the values with those obtained by the three methods. We also proposed some modifications of LWL85 and LPB93 to increase their accuracies. Our computer simulations under the codon-evolution model showed that for sequences 300 codons, the performance of GY94 may not be reliable. For longer sequences, GY94 is more accurate for estimating the Ka/Ks ratio than the modified LPB93 and LWL85 in the majority of the cases studied. This is particularly so when k 3, which is the transition/transversion (mutation) rate ratio. However, when k is approximately 2 and when the sequence divergence is relatively large, the modified LWL85 performed better than GY94 and the modified LPB93. The inferiority of LPB93 to LWL85 is surprising because LPB93 was intended to improve LWL85. Also, it has been thought that the codon-based method of GY94 is better than the heuristic method of LWL85, but our simulation results showed that in many cases, the opposite was true, even though our simulation was based on the codon-evolution model.

    Key Words: synonymous substitution ? nonsynonymous substitution ? estimation methods ? Ka/Ks ratio ? accuracy of estimation

    Introduction

    The estimation of synonymous and nonsynonymous substitution rates is important for understanding the dynamics of gene sequence evolution and for reconstructing phylogenetic trees (see Kimura [1983] and Li [1997]). The simplest problem in this regard is the estimation of the numbers of substitutions per synonymous site (Ks) and per nonsynonymous site (Ka) between two protein-coding sequences. In the past 2 decades, a number of methods have been proposed for this purpose (see Li [1997] and Nei and Kumar [2000]). We focus on three methods: (1) Li, Wu, and Luo (1985) and Li (1993), (2) Pamilo and Bianchi (1993), and (3) Goldman and Yang (1994) (the codon-based maximum-likelihood method contained in the PAML package [Yang 1997]). These three methods have been frequently used in the literature. They will be denoted as LWL85, LPB93, and GY94. In addition, accurate estimation of the Ka/Ks ratio is important because this ratio is commonly used as an indicator of selective constraints on the protein-coding genes studied.

    We used computer simulation to compare the performances of the three methods. For this purpose, we simulated the evolution of many pairs of protein-coding sequences derived from common-ancestor sequences. Then we used the three different methods to calculate the Ka and Ks distances between every simulated pair of sequences. Furthermore, to know the error rates of these three methods, we also needed to compute the expected Ka and Ks values for the pairs of simulated sequences. Therefore, we also derived formulas for the expected Ka and Ks values from the model used to generate the pairs of sequences.

    After analyzing possible reasons for the bias of LWL85 and LPB93, we derived a bias correction for these two methods. Furthermore, we investigated whether the transition/transversion mutation rate ratio in the codon-evolution model can significantly affect the accuracies of the three methods. We also proposed a modified LWL85 method and studied its performance.

    Methods

    Model of Sequence Evolution

    The model we used to simulate sequence evolution is a simplified version of the model of Goldman and Yang (1994). We use the following notations:

    S = number of synonymous sites in a sequence

    N = number of nonsynonymous sites in a sequence

    Ks = number of (synonymous) substitutions per synonymous site

    Ka = number of (nonsynonymous) substitutions per nonsynonymous site

    t = time (distance), measured by the expected number of nucleotide substitutions per nucleotide site (= (Ks x S + Ka x N)/(S + N))

    w = nonsynonymous/synonymous rate ratio (= Ka/Ks)

    k = transition/transversion mutation rate ratio

    j = equilibrium frequency of codon j

    The substitution rate qij from codon i to codon j (i j) is given by the following formula:

    (1)

    The condition qij = 0 when codons i and j differ at more than one position means we allow only one nucleotide substitution to occur at one time. The equilibrium codon frequencies (ij) are from the Homo sapiens codon usage table in the Codon Usage Database (http://www.kazusa.or.jp/codon/) and the universal genetic code is used in this study.

    Simulation Procedures

    Because the Evolver program in the PAML package for simulating the evolution of coding sequences is based on the model described above, we used it in our simulation studies. For each set of parameters studied, we generated 1,000 pairs of daughter sequences from 1,000 common ancestral sequences with a specified sequence length (from 100 to 1,000 codons). We assumed that the substitution rate is constant in the entire phylogenetic tree. Thus, when the total evolutionary time between two daughter sequences, S1 and S2, is t, the times from S1 and S2 to the common ancestor, S0, are both t/2. Under this model, the relationship w = Ka/Ks, the ratio of nonsynonymous/synonymous substitution rates, holds.

    Expected Ka and Ks Values

    Based on the parameters used in the codon-evolution model, we can derive the expected Ka and Ks values between daughter sequences S1 and S2 by solving the following two equations:

    (2)

    (3)

    In these two equations, Ka and Ks are two unknown variables and t and w are two parameters specified (known) in our model. If we can estimate the values of S and N in equation (2), then after some algebraic operations, we can obtain the expected Ka and Ks values:

    (4)

    (5)

    Hence, the question now is, "How do we estimate the parameters S and N in our formulas?"

    S and N are the numbers of synonymous and nonsynonymous sites in a sequence, respectively. Although the corresponding numbers in the daughter sequences is slightly changed in every substitution step, their expected values should be the same as those of the ancestral sequence because we assume that the codon frequencies are at equilibrium. Therefore, we use the S and N in the common ancestral sequence for equations (4) and (5). However, there are several different definitions of S and N.

    Ina (1995) proposed a method to estimate S and N according to Kimura's (1983) two-parameter model. In our notation, Ina's definitions are S = (kL2/(k + 2)) + L4 and N = (2L2/(k + 2)) + L0, where L0, L2 and L4 are the corresponding numbers of nondegenerate, twofold degenerate, and fourfold degenerate sites in the common ancestral sequence, respectively. In Ina's model, k is estimated from the transitional and transversional substitution rates at the third position of codons of sequences, whereas in our codon-evolution model, k is specified before simulation of the evolution of the daughter sequences from the common ancestral sequence. Hence, we specify the value of k in our formulas of S and N instead of estimating the corresponding value from the daughter sequences. In GY94, the definitions of S and N are the same as those in Ina (1995). Thus, we use these definitions to derive the expected Ka and Ks values for the GY94 method, which are called Expected Values I.

    In LWL85, the definitions of S and N are based on Jukes and Cantor's (1969) one-parameter model and are given by S = (L2/3) + L4 and N = (2L2/3) + L0. This definition assumes k = 1, so that LWL85 may perform poorly for large k; indeed, the poor performance was revealed by our computer simulation (results not shown). For this reason, we propose to use S = ((k – 1)L2/((k – 1) + 2)) + L4 and N = (2L2/((k – 1) + 2)) + L0 for k 2 and define Ka and Ks as follows:

    (6)

    (7)

    We call this the modified LWL85 or M-LWL85. We denote the expected Ka and Ks values under these definitions of S and N as Expected Values II (i.e., S = ((k – 1)L2/((k – 1) + 2)) + L4 and N = (2L2/((k – 1) + 2)) + L0 for k 2 and S = (L2/3) + L4 and N = (2L2/3) + L0 for k 2 in equations [4] and [5]). Note that when k = 2, the modified LWL85 method is exactly the same as the LWL85 method.

    Note that because expected values I and II are obtained from equations (4) and (5), which are derived from equations (2) and (3), the expected Ka/Ks values are already fixed to be equal to the nonsynonymous/synonymous rate ratio (w), as specified in the codon-evolution model. We set w = 0.2 in our simulations.

    The calculations of the expected Ka and Ks values for LPB93 are quite different. To be free of the influence of k, LPB93 estimates Ka and Ks without estimating S and N. The estimation of Ks in LPB93 is based on the assumption that the transitional rate at twofold degenerate sites is the same as that at fourfold degenerate sites and uses the weighted average (L2A2 + L4A4)/(L2 + L4) as an estimate of the average transitional rate at twofold and fourfold degenerate sites, where L2A2 is the total number of transitional substitutions at twofold degenerate sites between the two sequences, L4A4 is the corresponding number at fourfold degenerate sites, and L2 + L4 is the sum of twofold and fourfold degenerate sites. A similar explanation applies to the Ka formula in LPB93. To estimate the expected Ka and Ks values in LPB93, we need to calculate the expected mean numbers of transitional (Ai) and transversional (Bi) substitutions per ith-type site. From our codon-evolution model, each codon has nine possible substitutional changes to another codon with one nucleotide difference, and each of them has a different relative probability. The nucleotide differences in each i-fold degenerate class can be further classified into transitional (Si) and transversional (Vi) differences (i = 0, 2, and 4). Using the normalized relative probabilities as weights of different substitutional pathways, we can obtain the weighted average of Si and Vi of each codon in the common ancestral sequence. We take the averages of the Si and Vi values over all codons in the common ancestral sequence to obtain the average Si and Vi values of each codon. After applying equations (4.12) and (4.13) in Li (1997), we can obtain the expected Ai and Bi values of each codon in the common ancestral sequence. The expected L0, L2, and L4 values used here (in equations [4.12] and [4.13] from Li [1997]) were counted from the common ancestral sequence. After multiplying the expected Ai and Bi values of each codon by the expected number of the substitutions in the whole phylogenetic tree, tL, we can obtain the expected Ai and Bi values between the two daughter sequences. Note that the expected Ai and Bi values are dependent on the codons of the common ancestral sequence. Therefore, if the common ancestral sequence is different, the corresponding expected Ai and Bi values are different. After applying the expected Ai and Bi values and expected L0, L2, and L4 values (counted from the common ancestral sequence) to the Ka and Ks formulas in LPB93, we can obtain the corresponding expected values for LPB93. We denote them as Expected Values III.

    Under the codon-evolution model, all the expected Ka and Ks values can be calculated from the common ancestral sequence. The expected values I, II, and III are all dependent on parameters t, w, and k.

    Arginine Corrections for LWL85 and LPB93

    In LWL85 and LPB93, the nucleotide differences between two sequences are classified into transitional (Si) and transversional (Vi) differences (i = 0, 2, 4, which are the folds of codon degeneracy). In the class of twofold degenerate sites, almost all transitions are synonymous and almost all transversions are nonsynonymous. There is no exception to this rule in the mammalian mitochondrial code. In the universal code, there are two exceptions: the first positions of the arginine codons CGA, CGG, AGA, and AGG, and the third positions of the three isoleucine codons (ATT, ATC, and ATA).

    At the third positions of ATT, ATC, and ATA, the exception occurs because they are threefold degenerate sites, but each is treated as a twofold degenerate site. Because threefold degenerate sites are very few in the universal codon table (only the third positions of ATT, ATC, and ATA), if we consider the threefold degenerate sites in our equations (4) and (5), it will result in larger variances of the Ka and Ks values. Hence, treating threefold degenerate sites as twofold degenerate sites is reasonable. Note that the third positions of ATT and ATC also obey the rule that transitions at twofold degenerate sites are synonymous. Therefore, only the third position of ATA is an exception—when ATA (Ile) changes to ATG (Met), it is a transition but nonsynonymous. In all these three cases, all synonymous changes are included in S2 and all nonsynonymous changes are included in V2.

    For the first positions of the arginine codons CGA, CGG, AGA, and AGG, they all belong to twofold degenerate sites. Note that when CGG changes to AGG (or AGG changes to CGG), it is a transversion but is synonymous. The same is true when CGA changes to AGA (or AGA changes to CGA). The difference between CGG and AGG is classified into V2 in LWL85 but is here classified into S2. Similarly, the difference between CGA and AGA is here also classified into S2.

    For the cases where one of the four arginine codons CGA, CGG, AGA, and AGG is compared with another codon that has two or three differences, the classifications are more complicated and are treated as follows. Consider the difference between codons CGA (Arg) and ACA (Thr). We count the difference at the first position to be 0.5 S2 and 0.5 V0 if we assume equal weights (0.5) for the following two pathways.

    In this illustration, s and v stand for "transitional" and "transversional" and syn and asy stand for "synonymous" and "nonsynonymous." The three-digit number in front of each codon indicates the folds of degeneracy for the three positions of that codon. In pathway II, the first substitution from CGA (Arg) to AGA (Arg) is synonymous but is classified into V2 in LWL85 because it is a transversion. We propose to classify it into S2 (as indicated by s) because it is synonymous. For this reason, we count the difference in the first position (between CGA and ACA) as 0.5 S2 (it comes from the first substitution in pathway II) and 0.5 V0 (it comes from the second substitution in pathway I) if we assume equal weights (0.5) for the two pathways; that is, this difference could be either synonymous (pathway II) or nonsynonymous (pathway I). In essence, our rule is that every synonymous difference between two sequences at a twofold degenerate site is counted as a transition even if it is actually a transversion. The other corrections are available upon request.

    We call the above new way of counting arginine codons the Arginine Corrections, and we call the LPB93 method with these corrections the modified LPB93 (denoted as M-LPB93). We also apply the Arginine Corrections to the M-LWL85 method (the modified LWL85 method as described in the previous section) and still use the same notation M-LWL85. In our simulation experiments, the Ka and Ks values in M-LWL85 and M-LPB93 were found to be approximately 5% more accurate than those computed by LWL85 and LPB93 when compared with their expected values. In the following, we present only the results obtained with M-LWL85 and M-LPB93 but not those obtained with LWL85 and LPB93.

    Results

    Comparison of M-LWL85, M-LPB93, and GY94

    We used computer simulation to compare the performances of M-LWL85, M-LPB93, and GY94. Each codon of the common ancestral sequences S0 was chosen randomly from the 61 sense codons according to the codon frequencies in the human codon usage table; the sequence length varies from 100 to 1,000 codons (i.e., from 300 to 3,000 nucleotide sites). (We also used the uniform codon frequencies, that is, every sense codon having the same frequency 1/61, to generate the common ancestral sequence, but it did not have strong effects on our results). In our codon-evolution model, we first set parameters w = 0.2, k = 2, and t = 0.1; for these parameters, the expected Ks and Ka are approximately 0.25 and approximately 0.05, respectively. For each common ancestral sequence, we evolved a pair of daughter sequences according to the codon-evolution model. We repeated the process to obtain 1,000 pairs of daughter sequences.

    We then use the program codeml of GY94 in PAML, M-LWL85, and M-LPB93 to calculate the Ka and Ks values and the ratio Ka/Ks for each pair of daughter sequences and took the average Ka, Ks, and Ka/Ks over the 1,000 pairs of daughter sequences. We also calculated the corresponding expected Ka and Ks values I, II, and III for all the daughter sequences. We compared the Ka and Ks values computed from the three methods with their corresponding expected values and calculated their error percentages. The error percentage is defined as

    The error percentage is the relative bias between the simulated and expected values of each method. We used this as the measure to tell which method will give better estimates of Ka, Ks, and Ka/Ks. Because Ina's (1995) definition of synonymous and nonsynonymous sites seems to be the most reasonable, we also compared the estimates of M-LWL85 with Expected Values I; this comparison is given as M-LWL85* in figures 1, 2, and 3.

    FIG. 1.— The average error percentages of simulation results for the case of t = 0.1, w = 0.2, and k = 2, where t denotes the expected number of substitutions per nucleotide site between the two sequences, w denotes the Ka/Ks ratio, and k denotes transition/transversion mutation rate ratio. The negative and positive signs on the ordinate mean, respectively, underestimation and overestimation of the expected values. (A) The average error percentages of the Ka values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values for different sequence lengths (from 100 to 1,000 codons). (B) The average error percentages of the Ks values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. (C) The average error percentages of the Ka/Ks values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. In each case, we took the averages of the Ka, Ks, and Ka/Ks values estimated from the 1,000 pairs of daughter sequences and then compared them to the corresponding expected values of different methods.

    FIG. 2.— The average error percentages of simulation results for the case of t = 0.4, w = 0.2, and k = 2. The notations are the same as those in figure 1. (A) The average error percentages of the Ka values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. (B) The average error percentages of the Ks values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. (C) The average error percentages of the Ka/Ks values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. In each case, we took the averages of the Ka, Ks, and Ka/Ks values estimated from the 1,000 pairs of daughter sequences and then compared them to the corresponding expected values of different methods.

    FIG. 3.— The average error percentages of simulation results for the case of t = 0.1, w = 0.2, and k = 1 to 10 with a fixed sequence length of 1,000 codons. (A) The average error percentages of the Ka values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. (B) The average error percentages of the Ks values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. (C) The average error percentages of the Ka/Ks values estimated by GY94, M-LWL85, M-LPB93, and M-LWL85* as compared with the corresponding expected values. In each case, we took the averages of the Ka, Ks, and Ka/Ks values estimated from the 1,000 pairs of daughter sequences and then compared them with the corresponding expected values of different methods.

    M-LWL85 and M-LPB93 tend to underestimate the Ka values, overestimate the Ks values, and, thus, underestimate the ratio Ka/Ks, whereas GY94 tends to overestimate both the Ka and Ks values, especially when the sequence is short (fig. 1). Interestingly, M-LWL85* gives the smallest error percentage for Ka and largest error percentage for the Ks values. The expected Ka/Ks ratio is equal to 0.2 (w) for both expected values I and II, so M-LWL85* has the same error percentage for Ka/Ks ratio as M-LWL85.

    For the Ka/Ks values, GY94 tends to give better estimates than do M-LWL85 and M-LPB93, and the order of average error rates is GY94 < M-LPB93 < M-LWL85 (= M-LWL85*). For the Ka values, GY94 tends to give better estimates than do M-LWL85 and M-LPB93 but poorer estimates than does M-LWL85*, and the order of average error rates is M-LWL85* < GY94 < M-LPB93 < M-LWL85. For short sequences, however, GY94 may not perform well. For Ks, surprisingly, M-LWL85 gives a better estimate than do M-LPB93, GY94, and M-LWL85*, and the order of average error rates is M-LWL85 < M-LPB93 < GY94 < M-LWL85*.

    For the cases considered in figure 1, the average error percentages (over all sequence lengths we considered) in Ka are 6.8% for M-LWL85, 5.7% for M-LPB93, 2.3% for GY94, and 0.4% for M-LWL85*. The average error percentages in Ks are 2.7% for M-LWL85, 2.8% for M-LPB93, 3.9% for GY94, and 9.9% for M-LWL85*. The average error percentages in Ka/Ks are 7.3% for M-LWL85, 6.0% for M-LPB93, 1.3% for GY94, and 7.3% for M-LWL85*. For M-LWL85 and M-LPB93, the error percentages in the ratio Ka/Ks are larger than those in Ka and Ks because the two methods tend to underestimate Ka and overestimate Ks.

    It should be emphasized that the above error percentages are the error rate compared with the expected values and that different methods (except GY94 and M-LWL85*) have different expected values. We note from table 1 that the expected ratio Ka/Ks for M-LPB93 is actually larger than the specified w (i.e., w = 0.2) in the codon-evolution model. For this reason, it is not suitable for estimating w = Ka/Ks.

    Table 1 Simulation Results for the Average Estimates of Ka, Ks, and Ka/Ks with Different Methods

    Effects of Sequence Divergence

    To see if the degree of sequence divergence affects the accuracy of the above three methods, we also simulated different divergence times (t values). Because when t is large, all three methods have large variances in Ka and Ks values. We, therefore, consider t 0.5. Note that when t > 0.5, Ks will be considerably larger than 1, so that the estimate of Ks by any of these three methods may not be reliable. (Note that when w = 0.2, Ks is about 2.5 times larger than t.) Therefore, we fixed w = 0.2 and k = 2 and let t vary from 0.1 to 0.5.

    The results show that the divergence time (t) does not have a strong influence in the order of the accuracies of Ka and Ks in M-LWL85, M-LPB93, GY94, and M-LWL85*. However, for Ka/Ks, M-LWL85 (= M-LWL85*) and M-LPB93 give better results than does GY94 when t is large and the sequences are not long. To provide some details, we show below the simulation results for the case of w = 0.2, k = 2, and t = 0.4; the expected Ks values are approximately 1.0 for M-LWL85 and GY94 and approximately 0.9 for M-LPB93.

    As can be seen from figure 2 and table 1, M-LWL85 and M-LPB93 tend to underestimate Ka and Ks and overestimate Ka/Ks in most cases. GY94 tends to overestimate Ka and Ks and underestimate Ka/Ks. M-LWL85* gives the best estimates of Ka, Ks, and Ka/Ks, and this means the results obtained from M-LWL85 are closest to the Expected Values I. For Ka values, the order of average error rates is somewhat different from the case of t = 0.1: M-LWL85* < GY94 < M-LPB93 < M-LWL85. For the Ks values, M-LWL85 and M-LPB93 give better estimates than does GY94, and the order of average error rates is somewhat different from the case of t = 0.1: M-LWL85* < M-LPB93 < M-LWL85 < GY94. The major difference is in the estimates of the ratio Ka/Ks: the order of average error rates is in the reverse order of the result obtained in the case of t = 0.1: M-LWL85 (= M-LWL85*) < M-LPB93 < GY94.

    For the cases considered in figure 2, the average errors (over all sequences we generated) in Ka are 6.2% for M-LWL85, 5.1% for M-LPB93, 3.0% for GY94, and 0.3% for M-LWL85*. The average errors in Ks are 7.0% for M-LWL85, 6.7% for M-LPB93, 12.2% for GY94, and 1.3% for M-LWL85*. The average errors in Ka/Ks are 2.3% for M-LWL85, 3.2% for M-LPB93, 3.8% for GY94, and 2.3% for M-LWL85*. We show some simulation results for cases t = 0.1 and 0.4 in table 1.

    Effects of the Transition/Transversion Rate Ratio k

    We also considered the effect of k values on the accuracies of the three methods. The parameters we used in this simulation are listed below:

    k = transition/transversion rate ratio

    t = number of substitutions per site = 0.1 and 0.4

    w = Ka/Ks = 0.2

    Sequence length = 3,000 nucleotide sites (1,000 codons)

    We let k vary from 1 to 10, and in every case we generated 1,000 pairs of daughter sequences from the common ancestral sequence with each k and compared the Ka, Ks and Ka/Ks values obtained from different methods with their corresponding expected values. The results of t = 0.1 and 0.4 are similar, so we only show the result of t = 0.1 in figure 3.

    GY94

    GY94 gives overestimates of the Ka and Ks values but almost the exact expected value of the ratio Ka/Ks. An increase in k seems to have no strong influence on the accuracy of GY94 when the sequence is long. For all k values, the average error percentages (over all sequences generated with different k values) are 2.0% in Ka, 2.5% in Ks, and 0.5% in Ka/Ks.

    M-LPB93

    The tendency of M-LPB93 with increasing k is almost parallel to that of GY94. The Ks value is overestimated when k 3 and underestimated when k 4. The Ka and Ka/Ks values are both underestimated by M-LPB93. An increase in k has an influence for only the Ks estimate but not for the Ka and Ka/Ks estimates. For all k values considered, the average error percentages (over all sequences generated with different k values) are 7.3% in Ka, 1.7% in Ks, and 6.4% in the ratio Ka/Ks.

    M-LWL85

    The M-LWL85 was proposed to modify the LWL85. Hence the results obtained from M-LWL85 are much better than those from LWL85 for large k (results not shown). The estimates of Ka obtained from M-LWL85 are better than those from M-LPB93 and GY94 when k is larger than 4. For the k values used, the average error percentages (over all sequences generated with different k values) are 1.9% in Ka, 3.5% in Ks, and 5.5% in Ka/Ks. For M-LWL85*, the average error percentages are 1.5% in Ka, 3.6% in Ks, and 5.5% in Ka/Ks.

    Discussion

    Comeron (1995) proposed a modification of LPB93. He separated the twofold degenerate sites into 2S-fold and 2V-fold degenerate sites, where 2S denotes those sites where every transition substitution is synonymous and every transversion is nonsynonymous, and 2V denotes those sites where every transition substitution is nonsynonymous and every transversion is synonymous. In comparison with our approach, he also used different estimates of the numbers of synonymous and nonsynonymous substitutions per site. In his divergence estimates for 20 genes between Drosophila melanogaster and species of the obscura group, the Ka and Ks values obtained from his method were smaller than those from LPB93 and LWL85 in most cases. In contrast, from our simulations, we found that the Ka values are overestimated and Ks values are underestimated by LWL85 and LPB93 in most the cases (compared with their expected Ka and Ks values II and III, respectively). Therefore, Comeron's method would not be suitable for estimating Ks, because it would underestimate Ks. In comparison, in the case of the Arginine Corrections for LWL85 and LPB93, the Ka values obtained from M-LWL85 and M-LPB93 are slightly smaller than those from LWL85 and LPB93, and the Ks values obtained from M-LWL85 and M-LPB93 are slightly larger than those from LWL85 and LPB93. Thus, the Ka and Ks values obtained from M-LWL85 and M-LPB93 are closer to the expected values II or III, respectively, than those obtained from LWL85 and LPB93. This is the advantage of the Arginine Corrections over Comeron's modification.

    From the simulations we conducted, we found that the Ka value obtained from M-LPB93 tends to be larger than that from M-LWL85 and the Ks value obtained from M-LPB93 tends to be smaller than that from M-LWL85. This property has been noted in Li (1993) and can be shown analytically.

    Surprisingly, when the sequence divergence (t) is relatively large, M-LWL85 gives the best estimates of Ka, Ks, and Ka/Ks when the expected Ka and Ks are defined in the sense of GY94 (see M-LWL85* in figure 2), although it looses its advantage over GY94 when the sequence length become very long. The increase in the accuracy of GY94 with increasing sequence length is because GY94 is a maximum-likelihood method, so that its accuracy increases as the amount of data increases. When the sequences are longer than 800 codons, GY94 in general performs better than M-LWL85 and M-LPB93.

    As can be seen from table 1, the expected value of Ka/Ks by M-LPB93 is larger than the value specified by the codon-evolution model (w = 0.2), so that M-LPB93 is not suitable for estimating w = Ka/Ks. In general, GY94 is a better method for estimating w than M-LWL85 when the sequence is longer than 500 codons. For shorter sequences, however, M-LWL85 is better than GY94.

    In our study of the effect of transition/transversion mutation rate ratio (k), we found that an increase in k had no obvious influence on the performances of GY94 and M-LPB93 but could have a strong effect on LWL85. M-LPB93 and GY94 have an advantage over LWL85 when k is larger than 3. This is because LWL85 is based on Jukes and Cantor's (1969) one-parameter model and is not suitable for estimating the distance between sequences with a large k. We have, therefore, proposed a modified LWL85 method (M-LWL85), which is given by equations (6) and (7). Furthermore, from our simulations, the M-LWL85 method is a good method for estimating the rates of synonymous and nonsynonymous substitutions when k is approximately 2 for all sequence lengths we considered (see figures 1 and 2). For the longer sequences and larger k, M-LWL85 also performs well, especially for the estimate of Ka (see figure 3). However, the performance of M-LWL85 is poorer when k is 1 and when sequence divergence is small (figs. 1 and 3). Our simulation (results not shown) suggested that when k < 2, replacing 2L2/3 by 2L2/(k + 1.5) and L2/3 by (k – 0.5)L2/(k + 1.5) in the denominators of equations (6) and (7) can improve the performance.

    In considering the performance of a method, one should consider both estimation bias and variance. For simplicity, in this study we have considered only estimation bias. If all methods have the same expectation, then bias is certainly a good measure of performance. The problem here is that different methods may have different expectations, and that is confusing in the literature. For this reason, we have also used the expected values of GY94 (Expected Values I) to compare the performances of GY94 and M-LWL85 (given as M-LWL85* in figures 1, 2, and 3). Interestingly, when t is relative large, the estimates of Ka, Ks, and Ka/Ks obtained from M-LWL85 are even closer to the expected values of GY94 than those obtained from GY94 (see figure 2).

    In a comparison of LWL85 and LPB93, Ina (1995) found that LPB93 gave better estimates of Ka and Ks than did LWL85. Our results are not totally consistent with this conclusion. We found that M-LWL85 did give a better estimate in some cases, especially with respect to the Ks values, when t and k are not large. Note that a comparison of methods for estimating Ka and Ks depends on how sequences are generated and how the expected values are derives. In these two respects, our approach is different from Ina's. Firstly, Ina did not generate sequences according to the codon-evolution model. Secondly, our derivation of the expected Ka and Ks values of M-LWL85 and M-LPB93 under the codon-evolution model is very different from Ina's calculations. Moreover, the advantage of LPB93 over LWL85 for large k disappears if we use the new equations in (6) and (7) for LWL85.

    Acknowledgements

    We thank Ziheng Yang for valuable suggestions. This study was supported by a fellowship to Y.H.T. from the National Center for Theoretical Science, Taiwan, by a grant to S.B.H. from the National Science Council, Taiwan, by Academia Sinica, Taiwan, and by NIH grants to W.H.L.

    References

    Comeron, J. M. 1995. A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J. Mol. Evol. 41:1152–1159.

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequence. Mol. Biol. Evol. 11:725–736.

    Ina, Y. 1995. New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J. Mol. Evol. 40:190–226.

    Jukes, K. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

    Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, England.

    Li, W.-H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36:96–99.

    ———. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

    Li, W.-H., C.-I. Wu, and C.-C. Luo. 1985. A new method for estimating synonymous and non-synonymous rates of nucleotide substitutions considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150–174.

    Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.

    Pamilo, P., and N. O. Bianchi. 1993. Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol. Biol. Evol. 10:271–281.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.(Yun-Huei Tzeng*,, Runsun )

http://www.100md.com/html/DirDu/2006/10/18/25/50/33.htm