当前位置: 首页 > 医学版 > 期刊论文 > 基础医学 > 分子生物学进展 > 2005年 > 第12期 > 正文
编号:11259225
A Bayesian Model Comparison Approach to Inferring Positive Selection
     Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Rondebosch, South Africa

    E-mail: konrad@cbio.uct.ac.za.

    Abstract

    A popular approach to detecting positive selection is to estimate the parameters of a probabilistic model of codon evolution and perform inference based on its maximum likelihood parameter values. This approach has been evaluated intensively in a number of simulation studies and found to be robust when the available data set is large. However, uncertainties in the estimated parameter values can lead to errors in the inference, especially when the data set is small or there is insufficient divergence between the sequences. We introduce a Bayesian model comparison approach to infer whether the sequence as a whole contains sites at which the rate of nonsynonymous substitution is greater than the rate of synonymous substitution. We incorporated this probabilistic model comparison into a Bayesian approach to site-specific inference of positive selection. Using simulated sequences, we compared this approach to the commonly used empirical Bayes approach and investigated the effect of tree length on the performance of both methods. We found that the Bayesian approach outperforms the empirical Bayes method when the amount of sequence divergence is small and is less prone to false-positive inference when the sequences are saturated, while the results are indistinguishable for intermediate levels of sequence divergence.

    Key Words: positive selection ? Bayesian inference ? model comparison ? phylogenetic simulation

    Introduction

    Positive selection can be inferred in protein-coding sequences when there exists a subset of codon sites in the sequence for which the rate of nonsynonymous substitution (dN) is significantly greater than the synonymous rate (dS). A popular approach to the problem is to estimate the parameters of a probabilistic model of codon evolution and perform inference based on its maximum likelihood parameter values (Nielsen and Yang 1998; Yang et al. 2000). The models of codon substitution used in this method were first proposed by Goldman and Yang (1994) and Muse and Gaut (1994) and allow for selective pressure that varies across sites in a sequence.

    This approach has been evaluated intensively in a number of simulation studies (Anisimova, Bielawski, and Yang 2001, 2002; Anisimova, Nielsen, and Yang 2003; Wong et al. 2004) and found to be robust when the available data set is large. However, uncertainties in the estimated parameter values can lead to errors (either false positives or false negatives) in the inference, especially when the data set is small or there is insufficient divergence between the sequences.

    In the context of detecting positive selection from coding sequences, two inference problems are of interest. The first problem is to decide whether the sequence as a whole contains sites with substitution rate ratio ( or dN/dS) larger than 1. This problem is usually tackled by performing a likelihood ratio test (LRT) to compare two models which differ only in that one allows sites with > 1 while the other (simpler) model is constrained such that 1. If the constrained model is rejected in favor of the model allowing > 1, we can conclude that at least one site in the sequence is evolving under positive selection.

    The second inference problem is to decide whether specific sites are evolving with > 1. This problem can be addressed by calculating the posterior probability that > 1 for each site. In the maximum likelihood framework the estimated model parameters are assumed to be correct, and the posterior probabilities are estimated by means of the naive empirical Bayes (NEB) method (Yang, Wong, and Nielsen 2005). By contrast, a Bayesian approach to the problem would integrate out the uncertainty in parameter values. Two variants of this approach have been proposed: the Bayes empirical Bayes (BEB) method (Yang, Wong, and Nielsen 2005), which uses the likelihoods calculated for a limited number of values of the most important parameters, and a Markov chain Monte Carlo (MCMC) sampling approach (Huelsenbeck and Dyer 2004), that samples across the full range of parameter values as well as tree topologies. Because the MCMC approach samples over a larger range of parameter values, it is more computationally intensive and can be expected to provide a more accurate approximation to the exact Bayesian solution. It also provides the possibility of incorporating a probabilistic model comparison approach into the site-specific inference. One aim of this study is to investigate the relative performances of these methods, in order to evaluate under which circumstances the additional computational expense is justified.

    In this paper, we introduce a Bayesian model comparison approach to investigate the whole-sequence inference problem using MCMC sampling. The calculated "Bayes factor" is used both to decide whether the sequence is under positive selection and to infer whether specific sites are evolving under positive selection. We compare the results for the Bayesian approach to those obtained with the standard LRT.

    We also use simulation to investigate the Bayesian approach for site-specific inference of positive selection, both with and without model comparison, and compare the NEB, BEB, and MCMC approaches. We investigate the effect of tree length on the performance of these algorithms, with particular attention to cases where the amount of sequence divergence, and hence the amount of information on which the inferences are based, is low. As expected, we find that the results for the BEB and MCMC approaches are very similar for more divergent sequences, but that the MCMC approach performs better for less divergent sequences. We show that the NEB performs considerably worse than the other methods under these conditions. On the other hand, when sequence divergence increases to the point of saturation, we find that the false-positive rate obtained by the NEB and BEB increases, while that obtained by the MCMC approach decreases.

    Materials and Methods

    Codon Model of Evolution

    We consider the codon model proposed by Nielsen and Yang (1998), in which substitutions between sense codons occur as a continuous-time Markov process. In this model the site-specific nonsynonymous/synonymous substitution rate ratios (h) are assumed to be distributed according to a parametric distribution with known form. A large number of such distributions have been investigated by Yang et al. (2000); here we consider the discrete mixture models M1a and M2a (Wong et al. 2004). Model M2a forces each site to belong to one of three discrete site classes, each with its own value and mixture weight p. (Due to the scaling of the rate matrices [Liò and Goldman 1998], these models are not ordinary mixture models: the p parameters are used as mixture component weights to calculate the average substitution rate over all site classes, and the rate matrices are then scaled by a common factor to ensure that branch lengths are measured in expected number of substitutions per codon site. This creates a dependency between the p parameters and the branch length parameters, with the result that the likelihood under a particular site class is a function of the p parameters, which is not permissible for a mixture model. Nevertheless, when all parameters are optimized, the p parameters do act as mixture weights as intended, so it is useful to think of them as such.) The values for the three site classes are constrained in order to model sites under purifying selection (0 – < 1), neutral drift (N = 1), and positive selection (+ > 1). Model M1a is a special case of this model, where the positive selection site class is removed: this can be achieved in model M2a by fixing its mixture weight (p+) at zero.

    Inference Under a Single Model

    MCMC Method

    The Bayesian posterior probability of positive selection, either for a particular site or for the entire sequence, is calculated under the assumption that a particular model, which allows positive selection, is correct. For each site, we are interested in the probability that it is under positive selection, given the data set along with the parameters for which we use fixed empirical estimates (for all methods discussed here, the parameters describing codon equilibrium frequencies were obtained empirically by means of the "F3 x 4" model which uses the nucleotide frequencies observed at the three codon positions [Goldman and Yang 1994]). Calculating this quantity involves integrating over all parameters that contain uncertainty: the tree topology, the branch lengths, the transition/transversion rate ratio, and the parameters describing the distribution of . The integral can be evaluated using MCMC techniques (MacKay 2003). This calculation is implemented in the MrBayes program (Huelsenbeck and Ronquist 2001) and will be referred to below as the MCMC method.

    BEB and NEB Methods

    In the above calculation, the uncertainty in the frequency parameters was disregarded by using fixed empirical values for these parameters. In a similar way, if the uncertainty in the tree topology, the branch lengths, and the transition/transversion rate ratio are disregarded, further computational saving is achieved by lowering the number of variables to be integrated out and thus the number of samples required. Instead of using MCMC sampling, the likelihood function is estimated by calculating the likelihoods at points on a fixed grid. This calculation is implemented in the codeml program as the BEB method.

    An even simpler calculation is obtained by disregarding all uncertainty in parameter estimates. This was the initial implementation in the codeml program and is currently available in this program as the NEB method.

    These methods assume the correctness of the maximum likelihood estimates of the parameters for which uncertainty is disregarded. They use a single-tree topology which is specified in advance. Unless otherwise stated, the topologies used in this study are the maximum likelihood topologies obtained under the HKY85 model (Hasegawa, Kishino, and Yano 1985).

    Inference Under Multiple Models

    The Bayesian approach allows consideration of more than one model simultaneously. If prior probabilities are assigned to the possible models, their posterior probabilities can be calculated on observing the data. Model comparison can also be performed without assigning prior probabilities to the models, by calculating the Bayes factor (MacKay 2003), which is the ratio between the marginal likelihoods under the two models being compared and is the standard quantity used in Bayesian model comparison. When expressed in decibels, it is commonly referred to as the "evidence" provided by the data in favor of one or the other model. As a rule of thumb, evidence values of 6 and 10 dB are, respectively, considered strong and very strong. Unlike the posterior model probability, the evidence is independent of our prior beliefs regarding the model probabilities; instead, it quantifies the amount by which observing the data forces us to change our prior beliefs: multiplying the prior model probabilities with the Bayes factor produces the posterior probability.

    In order to calculate the probability of positive selection when taking more than one model into account, we need to integrate the model-specific probabilities over the set of models. However, this requires calculation of the marginal model likelihoods, which are intractable. Approaches to variants of this problem for comparison of nucleotide-based phylogenetic models have been proposed, for instance, by Suchard, Weiss, and Sinsheimer (2001), who compared nested models using the Savage-Dickey ratio, and Huelsenbeck, Larget, and Alfaro (2004), who used a reversible jump MCMC approach with novel proposal functions for a large class of models with varying numbers of parameters. A simpler approach, commonly used with MCMC, is to estimate the marginal likelihoods of the models being compared by calculating the harmonic mean of the likelihoods obtained for the MCMC samples under the two models (Gelfand and Dey 1994). While this harmonic mean estimator is more stable than an estimator based on the arithmetic mean, it is nevertheless notoriously unstable (Bos 2002) and requires a large number of samples to produce accurate estimates.

    For the present problem we make use of the fact that positive selection can be evaluated by comparing a single pair of nested models, which allows us to express the Bayes factor as follows (see Supplementary Material online):

    (1)

    Here p+ = 0 is the parameter setting at which the unconstrained model M1 reduces to the constrained model. The right-hand side of this equation is known as the Savage-Dickey ratio: it is the ratio between the posterior and prior probability functions for the unconstrained model, evaluated at p+ = 0. The Savage-Dickey ratio provides an alternative means of estimating the Bayes factor: instead of estimating the likelihoods under the two models being compared, we use MCMC samples obtained for the unconstrained model to estimate the posterior probability above the line. This can be done in a number of ways: Suchard, Weiss, and Sinsheimer (2001) used a normal approximation, which is not applicable here as we are interested in the value of the posterior at the boundary of the domain. Instead, we estimate the probability that p+ is between 0 and 1/L, where L is the sequence length in codons. The choice of 1/L is based on the fact that we are interested in detecting sequences with at least one out of L codons under positive selection. A uniform Dirichlet distribution is used for the prior probability below the line (this is the standard prior for discrete distributions). A perl script that calculates the Savage-Dickey ratio from a MrBayes ".p" output file, along with an example ".nex" input file for MrBayes, can be downloaded from http://www.cbio.uct.ac.za/SavageDickey/.

    Data Sets

    We simulated data sets from trees with total branch lengths that ranged from 3 substitutions per codon (this is the same tree length that was used by Wong et al. [2004]) down to 0.1 substitutions per codon. We also investigated the opposite extreme by simulating neutral data sets that are nearly saturated (total branch length of 50) and completely saturated (total branch length of 1,000). Data sets were simulated using the evolver program in the PAML 3.14 package (Yang 1997) on a five-taxon tree (fig. 1). Two sets of simulation parameters were used to produce neutral and positively selected data sets, respectively:

    Neutral data: = 1 at all sites. This is identical to scheme 1 used by Wong et al. (2004), aimed at modeling pseudogene evolution. It can be considered to be a worst case scenario for the detection of positive selection because all sites are on the border of the positive selection domain.

    Positively selected data: = 0 with probability 0.45, = 1 with probability 0.45, and = 5 with probability 0.1. This is identical to scheme 6 used by Wong et al. (2004), aimed at modeling strong positive selection.

    FIG. 1.— Phylogenetic tree used to simulate the data. Branch lengths were scaled differently for different simulations, with the sum of branch lengths varying between 0.1 and 3. All branch lengths are equal, except for the central branch which is twice as long as the others.

    All simulations used a transition/transversion rate ratio of = 1 and stationary codon frequencies for all sense codons of 1/61. Simulated sequences were 100 codons long, and 100 replicates were simulated for each parameter setting.

    To investigate the feasibility of the MCMC approach on larger data sets, we also compared the methods on the abalone sperm lysin data set analyzed by Lee, Ota, and Vacquier (1995), Yang, Swanson, and Vacquier (2000) and Yang and Swanson (2002), which contains 25 taxa and has a sequence length of 134 codons.

    Results

    We analyzed the simulated data sets using both the codeml program in the PAML 3.14 package and the MrBayes 3.0 program. For codeml, we performed model comparison between the M1a and M2a models using the LRT. The tree topology used to obtain the results reported here was the maximum likelihood tree obtained under the HKY85 model using the PAUP* program (Swofford 2002). We found that, although using deliberately incorrect trees generated large numbers of false positives, using the maximum likelihood trees gave results that were identical to those obtained using the correct trees. This is despite the fact that, for short tree lengths, these trees were frequently different from the trees used to simulate the data. Thus, our results (Table 1 in the Supplementary Material online) support the claim by Yang et al. (2000) that codeml is not sensitive to incorrect tree topologies, provided the topologies used are sensible.

    In MrBayes, we used the "NY98" model, which is identical to the M2a model of codeml. For model comparison using the harmonic mean approach, we obtained the M1a model by adding the constraint that p+ = 0. We used the default prior distributions but used the F3 x 4 model to estimate empirical codon frequencies. (Our implementation of this in MrBayes 3.0 beta 4, along with our corrections of some other errors in the program, is available at request from the corresponding author.) We generated MCMC runs of 400,000 iterations, subsampled every 10 iterations, and discarded the first 10,000 iterations as the burn-in (satisfactory results for the site-specific analysis were obtained using 100,000 iterations, but the model comparison required a larger number of iterations).

    Mixing of the MCMC chain was monitored by inspecting the proposal acceptance and chain-swapping rates obtained in a number of runs. We also validated the program by performing a simulation in which the data were generated from the prior, similar to that performed by Yang, Wong, and Nielsen (2005). Results are shown in the Supplementary Material online.

    Model Comparison Results

    We performed Bayesian model comparison using Bayes factors obtained by both the harmonic mean estimator and the Savage-Dickey ratio. In figure 2, we illustrate the Savage-Dickey model comparison approach by showing the posterior probability functions (approximated using histograms) and prior probability functions (straight line obtained theoretically) for a number of different data sets. Because the model reduces to the constrained model when p+ = 0, the Savage-Dickey ratio measures the extent to which observing the data makes it more or less likely that this is the "correct" value of p+. For data sets supporting the positive selection hypothesis, the posterior probability at p+ = 0 is lower than the prior: hence, observing the data has made us less inclined to believe that p+ = 0. For neutral simulations, on the other hand, observing the data makes us more likely to believe that p+ = 0. In both cases, the strength of our final belief is also influenced by the amount of information contained in the data set: data sets with very little divergence tend to contain very little information, so that the posterior probabilities are close to the priors and we are left without strong evidence either for or against positive selection.

    FIG. 2.— Histograms showing the distribution of values for the p+ parameter of the unconstrained model for different data sets. The prior distribution of this parameter is shown as a straight line.

    For the HIV data set (the HIV-1 env V3 region analyzed by Yang et al. [2000]), the total tree length is close to 2 substitutions per codon and the data practically exclude the possibility of p+ = 0, which can be seen by observing that the histogram height is zero. We obtained strong evidence (26.79 dB) in favor of positive election. For the human T-cell lymphotropic virus (HTLV) data set (the tax gene of the HTLV [Suzuki and Nei 2004; Yang, Wong, and Nielsen 2005]), the total tree length is about 0.1 substitutions per codon. In this case, observing the data also decreases the probability of p+ = 0 (the histogram height is below the line at p+ = 0), but the histogram height is nonzero at p+ = 0 and the evidence in favor of positive selection is lower (11.9 dB). For the neutral simulation with very low divergence (total tree length is 0.1 substitutions per codon), the data provided almost no information so that the histogram closely approximated the prior distribution and the resulting evidence was close to zero (0.14 dB) against positive selection. The neutral simulation with higher divergence (total tree length is 3 substitutions per codon) provided data that made p+ = 0 slightly more likely than it was under the prior distribution, with a small amount of evidence (2.22 dB) against positive selection.

    We evaluated the MCMC method for the whole-sequence inference problem on simulated data sets using both the harmonic mean estimator and the Savage-Dickey ratio for model comparison and compared with results obtained using the LRT. For fair comparison, results for the harmonic mean estimator were calculated using 200,000 iterations from each of two MCMC runs, one for each model. Note that the LRT results are not directly comparable to those obtained using the Bayesian methods because the thresholds are different and the quantities being thresholded have different meanings.

    The Savage-Dickey ratio gave more stable results than the harmonic mean estimator, which tended to make liberal decisions both in favor of and against positive selection (fig. 3). The fact that this behavior worsens when less information is available is worrying because the appropriate response to a lack of information is to make more conservative assessments. For this reason, only the Savage-Dickey ratio was used for the site-specific analysis.

    FIG. 3.— Whole-sequence classifications of simulated data sets using different model comparison techniques.

    It is difficult to compare the LRT with the Bayesian approach because it is hard to find thresholds at which both methods strike the same balance between liberal and conservative behaviors. However, results of the LRT are comparable to, and perhaps slightly better than, the Savage-Dickey results. This is not surprising, given the heuristic nature of the estimator used for the posterior probability in the Savage-Dickey ratio.

    We also show the results obtained for the nearly saturated data set using the Savage-Dickey ratio and the LRT (fig. 3). In this case, the LRT introduces false positives which are absent for the Savage-Dickey ratio. Results for the completely saturated data were similar (not shown).

    Site-Specific Results

    For the NEB and BEB methods, we report site-specific results both with and without first filtering out the sequences for which the LRT fails to reject the null model at the 5% confidence level. For the MCMC method, we report results both without model comparison (results are labeled "1 model" because we are effectively using the assumption that M2a is the correct model) and after incorporating Bayesian model comparison (results are labeled "2 models"). We incorporate Bayesian model comparison by assuming that either M1a or M2a is correct, with uninformative prior probabilities of 0.5 for each model, and discounting the site-specific posterior probability as in equation (2) in the Supplementary Material online.

    The site-specific results are shown as receiver operating characteristic curves obtained on data sets containing equal proportions of the neutral and positively selected simulations (fig. 4). Each curve plots the power (true positives as a percentage of all positively selected sites) against the false-positive rate (false positives as a percentage of all nonpositively selected sites), for false-positive rates from 0% to 5%, which is the region of interest for practical use. The region with false-positive rates above 5% showed similar behavior. Results for the NEB method without model comparison are not shown in this figure, as the lowest false-positive rate observed (at a posterior probability threshold of 0.99) was above 5% in all plots.

    FIG. 4.— Receiver operating characteristic (ROC) curves for different tree lengths showing the performance of all the methods on the combined data set consisting of neutral sequences and sequences with strong positive selection, for false-positive rates between 0% and 5%. Methods were evaluated at posterior thresholds at 5% intervals between 5% and 95%, as well as at 1% and 99% (the leftmost point on the curve). The NEB method without model comparison is not shown, as the leftmost point falls at a false-positive rate above 5%. Solid lines indicate methods without model selection, while dotted lines indicate methods with model selection. Error bars indicate two standard deviations along both axes (see Supplementary Material online). For clarity, error bars are shown on only one of the curves: the sizes of the error bars on the remaining curves are similar.

    The NEB with model comparison performed consistently worse than the other methods when information was sparse, giving results that are statistically indistinguishable from those of the other methods only when the tree length reached 3. Without model comparison, the NEB performed extremely poorly and is effectively unusable.

    The BEB without model comparison can compete with the MCMC methods under sparse information but performed worse than all model comparison methods when the amount of information increased. This confirms that, despite the apparent robustness of the BEB without model comparison reported by Yang, Wong, and Nielsen (2005), it is best not to omit the LRT when using this technique. The poor performance of the BEB without model comparison is due to the fact that it has a much higher false-positive rate, which forces the use of more conservative thresholds than can be used for the other methods.

    The BEB with model comparison performed well under normal conditions, but performance was poor relative to the MCMC methods and to the BEB without model comparison when information was sparse. This is because information sparseness occasionally leaves the LRT unable to reject the null hypothesis for sequences containing a few positively selected sites even though there might be sufficient information for the BEB to detect these sites individually. This problem is avoided by Bayesian model comparison through the use of soft decisions rather than a hard threshold.

    Both MCMC methods consistently outperformed the other methods. The difference was not significant under normal conditions but became large when divergence was low. The MCMC method without model comparison outperformed the MCMC with model comparison under sparse information, and the two methods performed equally well under normal conditions.

    While the theoretical best performance of the methods is of interest for deciding which methods are optimal, it is also of interest which thresholds should be used for the posterior probabilities. We plotted the specificity (1, false-positive rate) of the different methods obtained on the neutral simulations for tree lengths of 0.1, 1, and 3 and for saturated data (fig. 5). This gives an indication of the false-positive rates at different posterior probability thresholds. For unsaturated data, methods using the LRT had high specificity at all site-specific posterior probability thresholds. This is because entire data sets on which the LRT fails to reject the neutral model at a significance level of 0.05 are filtered out. Methods that do not use the LRT started producing false positives as the posterior threshold decreased. This happened at higher thresholds for the BEB than for the MCMC methods. The NEB without LRT was unusable throughout.

    FIG. 5.— Specificity (1, false-positive rate) of the different methods of site-specific inference on neutral simulations for tree lengths of 0.1, 1, and 3, as well as for nearly saturated data. Solid lines indicate methods without model selection, while dotted lines indicate methods with model selection.

    We also considered specificity of the different methods applied to nearly saturated data (fig. 5). In this case, the BEB without LRT also produced excessive levels of false positives at all but the highest threshold. Although the false-positive rates obtained by the NEB and BEB methods after first using the LRT were not excessive, it is somewhat disconcerting that they were higher than those for unsaturated data, despite the fact that saturated data do not contain any usable information and should thus cause an inference method to become more conservative. The MCMC method, on the other hand, showed the desired decrease in false positives. Results for the completely saturated data were similar, but with even higher rates of false positives for the NEB and BEB without LRT (not shown).

    Power and specificity obtained with the standard posterior probability thresholds used in codeml are shown in table 1. At these thresholds, the MCMC methods are more conservative than the codeml methods. This should not be taken to mean that the MCMC methods produce more conservative results in general. Instead, it suggests that different thresholds should be used for different methods and that the standard thresholds may be overly conservative. The results of figure 5 indicate that if false-positive rates of up to 5% are acceptable, thresholds of about 75% for the BEB and 60% for the MCMC method are more appropriate without model comparison, while if model comparison is used the thresholds could safely be set at 50%. Further figures illustrating the performance and reliability of different thresholds on the simulated data sets are provided in the Supplementary Material online.

    Table 1 Power and Specificity of Site-Specific Inferences

    Lysin Results

    Running the analyses on the larger lysin data set required 2h38min for codeml (43 min for model M1a and 1 h 55 min for model M2a) and 16h3min for MrBayes (100,000 generations, of which 10,000 were discarded as burn-in; results were unchanged on extending the run to 400,000 generations). As could be expected, the methods produce very similar results (table 2), with the only differences in the list of positively selected sites being codon 41 (posterior probability of 0.954 using BEB and 0.948 using MrBayes) and codon 83 (posterior probability of 0.999 using BEB and 0.988 using MrBayes). BEB and NEB results gave identical classifications for all sites. Both the LRT (P < 10–16) and the Savage-Dickey ratio (21.3 dB) were strongly in favor of positive selection.

    Table 2 Results Obtained for the Lysin Data Set

    Discussion

    Our results demonstrate that, under conditions of information sparsity caused by either low or high levels of sequence divergence, MCMC methods do outperform both the NEB and BEB methods, while the BEB method outperforms the NEB method. In order to take full advantage of the improvement, however, the posterior thresholds should be lower than the commonly used levels of 95% and 99%. Users should bear in mind that these are not frequentist P values predicting the false-positive rate, but Bayesian posterior probabilities which indicate the degree of belief we should assign to a specific site being under positive selection.

    While a tree length of 0.1 substitutions per codon may seem far too short for any reliable inference, this is the tree length for the controversial data set of the tax gene of the HTLV (Suzuki and Nei 2004; Wong et al. 2004), on which different authors reached different conclusions. The fact that data sets with such low levels of divergence are deemed to be of interest means that we do need to evaluate the available methods under these circumstances.

    The tree length used for the nearly saturated data is the maximum recommended tree length in codeml. We expect that problems may also be experienced for somewhat shorter tree lengths. We did not simulate positive selection for the saturated sequences because they are so far diverged that the methods are not expected to have any power on such data, and even if a true-positive classification is made it may purely be due to chance and thus be undesirable. For this reason we reported false-positive rates but not power for the saturated data.

    We also performed our experiments for scheme 4 of Wong et al. (2004), modeling weak positive selection. However, this problem was found to be very challenging for all methods, making it hard to distinguish between the different results obtained by the different methods. For this reason these results, which were consistent with but weaker than the results we reported for the strong positive selection simulations (scheme 6 of Wong et al. [2004]), are not reported here.

    In all of the experiments discussed here, we obtained the parameters describing codon equilibrium frequencies empirically by means of the F3 x 4 model. This is implemented as the default model in the codeml program of the PAML package (Yang 1997), but has not previously been used with the MrBayes program (Huelsenbeck and Ronquist 2001). Using these empirical frequency estimates instead of sampling over the 61 codon frequencies as in the approach presented by Huelsenbeck and Dyer (2004) speeds up the convergence of the MCMC run, yet is more realistic than using equal codon frequencies or the F1x4 model. This helps to address the concern about the computational cost of this method expressed by Yang, Wong, and Nielsen (2005). We also point out that, in contrast to the claim that only models M2 and M3 are available in MrBayes (Yang, Wong, and Nielsen 2005), any model which is a special case of the implemented models Ny98, M3, and M10 can be obtained: this includes models M0, M1, M1a, M2, M2a, M7, M8 (Yang et al. 2000; Wong et al. 2004), and M8a (Swanson, Nielsen, and Yang 2003).

    The codon substitution models investigated in this paper allow rate variation over sites but not over lineages. Furthermore, only the M1a and M2a models of rate variation (Yang, Wong, and Nielsen 2005) were investigated. However, we expect that the approach and conclusions would apply equally well to other pairs of rate variation models (such as M8a and M8), as well as models that relax the basic model assumptions, including models that allow rate variation over branches (Yang and Nielsen 2002) and models that allow synonymous rate variation (Seo, Kishino, and Thorne 2004; Kosakovsky Pond and Frost 2005a, 2005b).

    The primary concern about the MCMC method is its computational cost: while the codeml analyses of the simulated sequences required about a minute per replicate, an MCMC run of 400,000 generations required about 10 h. However, because the MrBayes software can be run in parallel (Altekar et al. 2004), a cluster of 50 machines would reduce the required time to about 15 min. Thus, the method should also be feasible for data sets of realistic size, provided sufficient resources are available.

    For tree lengths of an average of three or more substitutions per codon (normal conditions), there seems to be no difference between the different methods, so we recommend the BEB method because of its computational efficiency. For trees with 1–3 substitutions per codon, we recommend that investigators should consider the MCMC method, which appears to do slightly better although the difference may not be statistically significant. For trees with less than 1 substitution per codon, we recommend the MCMC method as it clearly outperforms the BEB in this domain. We also advise using the MCMC method if the sequences are approaching saturation. When using the BEB and NEB methods, we advise performing the LRT in all cases, but especially when the sequences are approaching saturation. Our results show that the BEB without the LRT becomes particularly unreliable when the data saturate.

    The results in figure 4 were obtained using equal numbers of neutral and positively selected sequences. If it is the case that positive selection is rare in nature (i.e., it occurs with prior probability less than 50%), the excess of positively selected sequences in our data sets will tend to favor the methods that assume the positive selection model relative to those that first perform model comparison. For this reason we suggest that, although our results show no advantage in performing model comparison when using the MCMC method, it may be prudent to do so nonetheless: the model comparison step can decrease the power of the method but provides added insurance against false positives without sacrificing much except under very sparse data. However, for preliminary investigations on very sparse data sets, we suggest the MCMC method without model comparison as the most promising method for highlighting sites of potential interest.

    Supplementary Material

    Further details and Table 1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

    Acknowledgements

    This study was supported by the South African National Bioinformatics Network and by the National Institute of Allergy and Infectious Disease and the National Institutes of Health through the Centre for the AIDS Programme of Research in South Africa (grant no. 1U19AI51794). We also thank Fourie Joubert for use of the Linux cluster at the University of Pretoria, as well as Oliver Pybus and Ziheng Yang for constructive comments.

    References

    Altekar, G., S. Dwarkadas, J. P. Huelsenbeck, and F. Ronquist. 2004. Parallel metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20:407–415.

    Anisimova, M., J. P. Bielawski, and Z. Yang. 2001. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 18:1585–1592.

    ———. 2002. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol. 19:950–958.

    Anisimova, M., R. Nielsen, and Z. Yang. 2003. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164:1229–1236.

    Bos, C. S. 2002. A comparison of marginal likelihood computation methods. In W. Hardle and B. Ronz, eds. COMPSTAT 2002: Proc. Comput. Stat. 111–117.

    Gelfand, A. E., and D. K. Dey. 1994. Bayesian model choice: asymptotics and exact calculations. J. R. Stat. Soc B. 56:501–514.

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape split by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

    Huelsenbeck, J. P., and K. A. Dyer. 2004. Bayesian estimation of positively selected sites. J. Mol. Evol. 58:661–672.

    Huelsenbeck, J. P., B. Larget, and M. E. Alfaro. 2004. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol. 21:1123–1133.

    Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 17:754–755.

    Kosakovsky Pond, S. L., and S. D. W. Frost. 2005a. A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol. 22:223–234.

    ———. 2005b. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol. Biol. Evol. 22:1208–1222.

    Lee, Y., T. Ota, and V. Vacquier. 1995. Positive selection is a general phenomenon in the evolution of abalone sperm lysin. Mol. Biol. Evol. 12:231–238.

    Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233–1244.

    MacKay, D. J. C. 2003. Information theory, inference, and learning algorithms. (http://www.inference.phy.cam.ac.uk/mackay/itila/). Cambridge University Press, Cambridge, UK.

    Muse, S., and B. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.

    Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936.

    Seo, T.-K., H. Kishino, and J. L. Thorne. 2004. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol. Biol. Evol. 21:1201–1213.

    Suchard, M. A., R. E. Weiss, and J. S. Sinsheimer. 2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18:1001–1013.

    Suzuki, Y., and M. Nei. 2004. False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol. Biol. Evol. 21:914–921.

    Swanson, W. J., R. Nielsen, and Q. Yang. 2003. Pervasive adaptive evolution in mammalian fertilization proteins. Mol. Biol. Evol. 20:18–20.

    Swofford, D. 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

    Wong, W. S. W., Z. Yang, N. Goldman, and R. Nielsen. 2004. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168:1041–1051.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

    Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908–917.

    Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.

    Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49–57.

    Yang, Z., W. J. Swanson, and V. D. Vacquier. 2000. Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol. 17:1446–1455.

    Yang, Z., W. S. Wong, and R. Nielsen. 2005. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22:1107–1118.(K. Scheffler and C. Seoig)