The Comparison of the Confidence Regions in Phylogeny
Molecular Biology and Evolution, 2005, No. 11
Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
E-mail: shi@mathstat.dal.ca.
Abstract
In this paper, several different procedures for constructing confidence regions for the true evolutionary tree are evaluated both in terms of coverage and size, without considering model misspecification. The regions are constructed on the basis of tests of hypotheses using six existing tests: the Shimodaira-Hasegawa (SH), SOWH, star form of SOWH (SSOWH), approximately unbiased (AU), likelihood weight (LW), and generalized least squares (GLS) tests, plus two new tests proposed in this paper: the single distribution nonparametric bootstrap (SDNB) and single distribution parametric bootstrap (SDPB) tests. The procedures are evaluated on simulated trees with both small and large numbers of taxa. Overall, the SH, SSOWH, AU, and LW tests led to regions with higher coverage than the nominal level at the price of including large numbers of trees. Under the specified model, the SOWH test gives accurate coverage and relatively small regions. The SDNB and SDPB tests led to small regions with occasional undercoverage. These two procedures have a substantial computational advantage over the SOWH test. Finally, the cutoff levels for the SDNB test are shown to be more variable than those for the SDPB test.
Key Words: likelihood; confidence region; test; coverage; bootstrap
Introduction
In phylogenetic analysis, an important issue is to determine the confidence regions for gene trees from DNA sequences. The first step usually involves the estimation of the trees. The widely used estimation methods are based on a maximum likelihood (ML) approach (Cavalli-Sforza and Edwards 1967; Felsenstein 1981) and a distance approach (Cavalli-Sforza and Edwards 1967; Bulmer 1991). A number of the existing confidence region methods are likelihood based and look for trees that are close to the ML tree in some sense. Interestingly, many available procedures can give rather different confidence regions (e.g., Goldman, Anderson, and Rodrigo 2000; Strimmer and Rambaut 2001; Shimodaira 2002).
In this paper, our main aim is to evaluate the performance of a number of the current proposals and modifications of them to see which procedure gives the best result under different conditions. Our approach will be to construct confidence regions based on tests of the null hypothesis that the tested tree has the same topology as the true one. For simplicity, we will refer to τ0 as the tree topology being tested. The branch lengths are not specified in the null hypothesis but are estimated during the test. In most tests, the ML branch lengths, which maximize the likelihood of the given tree structure, are used. The P value is calculated as the probability, under H0, of observing a value of the test statistic at least as large as that observed. Using the general duality between testing and confidence regions, the test statistic can be used to construct confidence regions of topologies. A (1 − α) × 100% confidence region for the true topology is a set of topologies that contains the true topology with probability 1 − α. It includes τ0 if the null hypothesis is not rejected at significance level α. The probability that the confidence region includes the true tree is called the coverage.
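As a toy illustration of this duality (the tree names and P values here are hypothetical, not taken from the paper), a confidence region can be assembled by keeping every topology whose null hypothesis is not rejected:

```python
# Minimal sketch of the test/region duality: a topology enters the
# (1 - alpha) x 100% region exactly when its P value is at least alpha.
def confidence_region(p_values, alpha=0.05):
    """p_values: dict mapping topology label -> P value of its test."""
    return {tree for tree, p in p_values.items() if p >= alpha}

# Hypothetical P values for three candidate topologies.
region = confidence_region({"t1": 0.60, "t2": 0.04, "t3": 0.12}, alpha=0.05)
# t2 is rejected at the 5% level; t1 and t3 remain in the 95% region.
```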
We will examine the following test procedures: Shimodaira-Hasegawa (SH) test (Shimodaira and Hasegawa 1999), single distribution nonparametric bootstrap (SDNB) test (introduced in this paper), SOWH test (Swofford et al. 1996; see also Hillis, Mable, and Moritz 1996; Goldman, Anderson, and Rodrigo 2000), star form of SOWH (SSOWH) test (Antezana 2003), single distribution parametric bootstrap (SDPB) test (introduced in this paper), approximately unbiased (AU) test (Shimodaira 2002), likelihood weight (LW) test (Strimmer and Rambaut 2001), and generalized least squares (GLS) test (Susko 2003). Except the GLS test which is distance based, all the other tests are based on likelihood. The simulation results are compared under two criteria: the coverage and the size which is defined here as the number of trees in the confidence regions.
The rest of the paper is organized as follows. Methodology introduces the statistical methods, including a detailed description of the two newly proposed SDNB and SDPB tests and their relationships, with an emphasis on methodology. The comparison of confidence regions of gene trees according to their coverages and sizes, based on data sets simulated from trees with small and large numbers of taxa, is presented in Simulation Methods and Analysis, and the relevant discussion in Discussion. Without considering model misspecification, the SOWH test performs very well, with reasonable coverage and small confidence regions. The SDNB and SDPB tests, based on two different bootstrap methods, also have small sizes but undercover the true tree (the coverage of the true tree is lower than the nominal coverage) in some cases. The SH, LW, AU, and SSOWH tests always give conservative results in the simulations, i.e., the coverage is larger than the nominal coverage, giving large confidence regions. Although, under large-sample approximation, the coverages of the AU and GLS tests are expected to match the confidence level, this did not happen in the simulations. Their asymptotic behavior is not attained at the sequence lengths used in the simulations, and it is hard to tell how long the sequences should be to guarantee it. Nonparametric and parametric bootstraps behave differently, which is demonstrated by looking at the critical values for the SDNB and SDPB tests. Finally, the SDNB test, with the AU test as an alternative, is recommended for real data in Conclusion.
Methodology
In carrying out the tests to construct the confidence regions, the overall biological goal is the same: are the observations compatible with a specified tree topology τ0? This can be written as
H0: τ = τ0 vs. Ha: τ ≠ τ0,
which is called the canonical hypothesis in this paper. Likelihood is a popular way to measure the compatibility of the tested and the true trees. It can be calculated for each candidate tree using the estimated branch lengths. The ML tree is considered to be the best estimate of the true tree. Let li and lML denote the log likelihoods of τi and the ML tree, respectively. Their difference, δi = lML − li, measures the discrepancy of the two trees and is commonly used as the test statistic of the canonical hypothesis.
The bootstrap method is widely used for constructing the distribution of the test statistic when a closed-form expression is not available. There are two types of bootstrap: nonparametric and parametric. For DNA sequences, the combinations of nucleotide types at sites are treated as the samples in the nonparametric bootstrap. The sites are resampled with replacement to build the replicate data sets. In the standard bootstrap, all the nuisance parameters, such as the branch lengths and the base frequencies, need to be reestimated to calculate the likelihoods. The RELL (resampling estimated log likelihoods) technique gives time-saving approximations, in which the site log likelihoods are resampled. It is equivalent to calculating the likelihoods of the standard replicates with the parameters estimated from the original data set. These two techniques give similar results, with the latter being more popular. In the parametric bootstrap, the replicate data sets are simulated from a parametric substitution model and some tree. For instance, with the SSOWH test, bootstrap data are simulated from the star topology with estimated substitution model parameters and external branch lengths, while for the SOWH test, bootstrap data are simulated from the hypothesized tree with estimated branch lengths. Hence, the parametric bootstrap relies on the model specification more closely than the nonparametric bootstrap, which also makes it more powerful (Sokal and Rohlf 1995). However, this is also a limitation because some assumptions in the model may not be realistic. An asterisk (*) in this paper refers to estimates obtained from a replicate data set, and such an estimate will be referred to as a bootstrap value. For instance, lML is the log likelihood of the ML tree of the original data set, and l*ML would denote the corresponding value for a bootstrapped data set.
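The RELL idea can be sketched as follows (a minimal illustration with made-up per-site log likelihoods; in practice the sitewise values are exported once from an ML program, and the same resampled site indices are shared by all trees):

```python
import random

def rell_replicate(site_loglikes, rng):
    """One RELL bootstrap replicate.

    site_loglikes: dict mapping tree label -> list of per-site log
    likelihoods, all computed once from the original data (no
    re-estimation of branch lengths or model parameters).
    Returns the total log likelihood of each tree on the resampled sites.
    """
    n = len(next(iter(site_loglikes.values())))
    idx = [rng.randrange(n) for _ in range(n)]  # same sites for every tree
    return {tree: sum(ll[i] for i in idx) for tree, ll in site_loglikes.items()}

rng = random.Random(0)
rep = rell_replicate({"t1": [-1.0, -2.0, -1.5], "t2": [-1.2, -1.9, -1.6]}, rng)
```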
We now describe the tests in terms of the hypothesis, test statistic, and the bootstrap method. The names of the tests and the corresponding acronyms are listed in table 1.
Table 1 The Tests and Their Acronyms Compared in this Paper
The SH Test
The SH test (Shimodaira and Hasegawa 1999) is a modified version of the Kishino-Hasegawa (KH) test (Kishino and Hasegawa 1989). The KH test was designed to test any two topologies selected independently of any analysis of the data but was often inappropriately used for comparing a candidate tree with the ML tree, a fact noted by many authors (Shimodaira and Hasegawa 1999; Goldman, Anderson, and Rodrigo 2000). We will not give details and simulation results for the KH test because we are interested in more than two candidate trees. The hypothesis for the SH test is
H0: E[l1] = E[l2] = ... = E[lm],
where m is the number of candidate trees. The hypothesis indicates that we are testing a group of trees rather than each tree independently. Though the hypothesis is not canonical, the canonical test statistics are used in this test,
δi = lML − li, i = 1, ..., m,
and their distributions are estimated by nonparametric bootstrap.
Here we give an intuitive explanation and some comments on the rationale and the effects of the "centering" procedure proposed in the resampling method of the SH test. The nonparametric bootstrap approximates the true distribution whether the null hypothesis is true or not. Because the null hypothesis may not be true, it may not be approximating the distribution of the test statistic under the null hypothesis as desired. If all the trees support the observations equally well, they should have equal expected log likelihoods, denoted by E[li], i = 1, ..., m. In order to build the distribution of the test statistic under H0, the least favorable configuration (l.f.c.), i.e., E[l1] = E[l2] = ... = E[lm], is considered. The l.f.c. states zero branch length for any split that is not present in all the trees under consideration. It is least favorable in the sense that it corresponds to the specific hypothesis consistent with H0 which is the most difficult to reject. By subtracting and adding E[l1] to δi, we have
δi = lML − li (1)
   = (lML − E[l1]) − (li − E[l1]) (2)
   = maxj (lj − E[l1]) − (li − E[l1]). (3)
Under the l.f.c. assumption, E[l1] can be replaced by E[li] for all i, so that
δi = maxj (lj − E[lj]) − (li − E[li]). (4)
Let l̄i, i = 1, ..., m, denote the mean of all replicate log likelihoods of τi. The centering procedure replaces E[li] by l̄i to calculate the bootstrap value of δi in equation (4), i.e.,
δ*i = maxj (l*j − l̄j) − (l*i − l̄i). (5)
However, because the l.f.c. assumption does not necessarily hold for τ1, ..., τm, the equivalence between equations (3) and (4) does not hold for the bootstrap value of δi. In fact, we always get larger bootstrap values of δi when more trees are involved in the null hypothesis, which results in larger P values, making it harder to reject the tested trees.
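The centering computation can be sketched as follows (toy replicate log likelihoods; this assumes the standard SH form, in which each replicate log likelihood is centered by its tree's bootstrap mean and compared with the best centered value):

```python
def sh_centered_stats(boot_loglikes):
    """boot_loglikes[b][i]: bootstrap log likelihood of tree i in replicate b.

    For each replicate, returns delta*_i = max_j (l*_j - lbar_j) - (l*_i - lbar_i),
    where lbar_j is the bootstrap mean log likelihood of tree j.
    """
    B, m = len(boot_loglikes), len(boot_loglikes[0])
    lbar = [sum(rep[i] for rep in boot_loglikes) / B for i in range(m)]
    out = []
    for rep in boot_loglikes:
        centered = [rep[i] - lbar[i] for i in range(m)]
        top = max(centered)
        out.append([top - c for c in centered])
    return out

# Two trees, two replicates (toy numbers).
stats = sh_centered_stats([[0.0, 0.0], [0.0, 2.0]])
```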
The SDNB Test
The bootstrap paradigm requires us to replace the true parameters with the estimates, and the estimates with bootstrap values from the replicates. In phylogeny, the tested tree topology is the parameter we are interested in. As the estimate of the true tree, the ML tree obtained from the original data set represents the most probable evolutionary process. So, instead of centering, we replace the tested tree by the ML tree of the original data set in the bootstrap replicates, in keeping with bootstrap theory. This core idea of our proposed method conforms to the principle of the original nonparametric bootstrap method. The proposed method, the SDNB test, only requires a single distribution for all the test statistics.
The hypothesis and the test statistic are the canonical ones in the SDNB test. The test statistics are
δi = lML − li, i = 1, ..., m,
and the ML tree, τML, is recorded.
The key steps of constructing the distribution of the test statistics are:
Generate nonparametric bootstrap replicate data sets.
Reestimate any free parameters to get the maximum log likelihoods l*ML and l*ML* for each replicate. Here l* denotes a log likelihood for the bootstrap data set, and the subscripts ML and ML* denote the ML tree τML of the original data set and the ML tree τML* of the replicate, respectively.
Construct the distribution of δi = lML − li from the differences of log likelihood between the ML tree of the replicate and the ML tree of the original data set, δ* = l*ML* − l*ML. Comparing the calculation of δ* in the bootstrap to the original test statistic, the tested tree τi is replaced with τML, and τML with τML*. (We could use RELL in this step to save time. In this paper the test was implemented without RELL.)
Test whether the attained values of δi, i = 1, 2, ..., m, are plausible samples from δ*. For significance level α, if δi is less than the 100(1 − α)th percentile of δ*, τi is included in the 100(1 − α)% confidence region.
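The final steps above can be sketched as a small routine, assuming the log likelihood differences have already been computed (the function names are hypothetical):

```python
def sdnb_pvalues(deltas, boot_deltas):
    """deltas[i]: l_ML - l_i from the original data for tested tree i.
    boot_deltas[b]: l*_ML* - l*_ML from nonparametric replicate b.
    A single bootstrap distribution serves every tested tree."""
    B = len(boot_deltas)
    return [sum(1 for d in boot_deltas if d >= di) / B for di in deltas]

def sdnb_region(deltas, boot_deltas, alpha=0.05):
    """Indices of trees whose delta_i is not extreme at level alpha."""
    pvals = sdnb_pvalues(deltas, boot_deltas)
    return [i for i, p in enumerate(pvals) if p >= alpha]
```

For example, a tree whose δi lies below every bootstrap value gets a P value of 1 and is retained, while a tree whose δi exceeds all bootstrap values gets a P value of 0 and is rejected.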
We only need one bootstrap distribution, δ*, which is used for testing all trees, leading to substantial computational savings. Note that our bootstrap paradigm differs slightly from the standard bootstrap. The standard bootstrap estimates the distribution of the estimate around θ, the parameter of interest, which is not random and does not depend on the number of observations. In our case, the tree topology τi is the parameter of interest in the null hypothesis. However, the consistency of τi with the true tree is measured by lML − li, which is estimated by l*ML* − l*ML. In contrast to the standard bootstrap, li is random and depends on the sequence length.
The SOWH and SSOWH Tests
The SOWH test was named after the authors who originally described it (Swofford et al. 1996; see also Hillis, Mable, and Moritz 1996). It has the canonical hypothesis and test statistic. In this test, a parametric bootstrap is used to construct the distribution of the test statistic. The replicate data sets are simulated under the tested tree with the nuisance parameters, such as the branch lengths and the substitution model, estimated from the original observations. Treating H0 as the single null hypothesis, there are two disadvantages to this test. One is low coverage when the true tree topology is close to the star topology (Antezana 2003). The other is that the distribution of δi in the SOWH test has to be built separately for each tested tree τi from replicates simulated under τi, which is computationally intensive.
In order to solve the first problem, Antezana (2003) proposed a star version of the parametric bootstrap in which the replicates are simulated from the star topology with ML branch lengths estimated from the original observations. This is similar to the treatment of a composite null hypothesis, where the largest P value under H0 is calculated. It is motivated by the fact that the star topology can be viewed as the boundary between trees. Using the boundary tree of H0 to simulate data naturally makes the tested tree difficult to reject. Because the tree model used to generate the parametric bootstrap is the only difference between these two tests, the latter is referred to as the star version of the SOWH (SSOWH) test.
The SDPB Test
Following the same bootstrap paradigm as the SDNB test, the SDPB test reduces the computational burden by simulating the replicates from the ML tree of the original data set. The SDPB test can be considered the parametric bootstrap version of the SDNB test. The key steps of this method are the same as for the SDNB test except step (1), which is replaced by:
(1') Simulate parametric bootstrap replicates based on τML with the ML branch lengths and other parameters obtained from the original observations.
This procedure gives the SDPB test a substantial computational advantage over the SOWH test because the δi's are all evaluated using the single distribution of δ*. This eliminates the need for a separate bootstrap for each τi, which is required in the SOWH test. The SDPB test is much more feasible than the SOWH test, especially for trees with large numbers of taxa and consequently large numbers of trees to test.
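The computational gap can be made concrete with a rough count of bootstrap replicate fits (this accounting ignores per-replicate tree-search cost and is only a sketch):

```python
def ml_fits_required(m_trees, B, test):
    """Rough count of bootstrap replicate analyses.

    SOWH rebuilds the null distribution from B replicates for every tested
    tree; SDPB and SDNB reuse one set of B replicates for all tested trees.
    """
    if test == "SOWH":
        return m_trees * B
    if test in ("SDPB", "SDNB"):
        return B
    raise ValueError(test)

# Testing 100 candidate trees with 100 replicates each:
assert ml_fits_required(100, 100, "SOWH") == 10_000
assert ml_fits_required(100, 100, "SDPB") == 100
```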
Other Methods
The AU test (Shimodaira 2002) is widely used now. Its hypothesis is stated in terms of the expected log likelihood of topology τi, denoted by μi, i = 1, ..., m. The hypothesis for testing τi is
H0: μi ≥ μj for all j vs. Ha: μi < μj for some j.
The proportion of times that τi is the ML tree in the bootstrap replicates is called the bootstrap probability (BP). A multiscale nonparametric bootstrap, in which different sequence lengths are used, provides a technique for computing BP. Because BP gives only first-order accuracy for the P value (Efron, Halloran, and Holmes 1996), Shimodaira used the results from Efron and Tibshirani (1998) to correct BP to give second-order accurate P values, which means that the expected coverage of the AU test is (1 − α) + o(n^(−3/2)) (see eq. 11 and the discussion therein of Shimodaira 2002).
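The BP computation at a single scale can be sketched as follows (toy per-site log likelihoods; a RELL-style resample of round(scale × n) sites per replicate, with the AU correction across scales omitted):

```python
import random

def bootstrap_probability(site_loglikes, tree, scale=1.0, B=1000, seed=0):
    """Fraction of replicates in which `tree` attains the highest total
    log likelihood, resampling round(scale * n) of the n sites each time."""
    rng = random.Random(seed)
    n = len(next(iter(site_loglikes.values())))
    n_scaled = max(1, round(scale * n))
    hits = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n_scaled)]
        totals = {t: sum(ll[i] for i in idx) for t, ll in site_loglikes.items()}
        if max(totals, key=totals.get) == tree:
            hits += 1
    return hits / B

# t1 dominates at every site, so its BP is 1 at any scale.
bp = bootstrap_probability({"t1": [-1.0, -1.0], "t2": [-2.0, -2.0]}, "t1", B=50)
```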
The LW test is designed to construct a confidence region directly and does not involve any tests of hypotheses (Strimmer and Rambaut 2001). The LW is the likelihood of a tested tree divided by the sum of the likelihoods of all candidate trees. A nonparametric bootstrap gives the estimate of the expected LW of each tree. The weight can be viewed roughly as the posterior probability of a tree under a uniform prior on the trees. Tested trees are ordered by expected LWs and are included in the confidence region until the sum of their weights is equal to or slightly greater than the confidence level. Trees with larger expected likelihoods are more likely to be the true tree and hence will be in the confidence region.
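The region construction can be sketched as follows (for brevity this uses the weights from a single data set, whereas the test itself averages weights over bootstrap replicates; the log likelihoods are toy values):

```python
import math

def lw_region(loglikes, level=0.95):
    """Order trees by likelihood weight and add them to the region until
    the cumulative weight reaches the confidence level."""
    mx = max(loglikes.values())
    weights = {t: math.exp(l - mx) for t, l in loglikes.items()}  # stable exp
    total = sum(weights.values())
    region, cum = [], 0.0
    for t in sorted(weights, key=weights.get, reverse=True):
        region.append(t)
        cum += weights[t] / total
        if cum >= level:
            break
    return region

region = lw_region({"t1": -100.0, "t2": -101.0, "t3": -120.0}, level=0.95)
```

Note that, as discussed later for large candidate sets, trees with the lowest weights are always excluded once the cumulative weight passes the level, regardless of how close their likelihoods are to the ML tree.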
The GLS test (Susko 2003) is based on the idea of fitting ML pairwise distances using a regression model determined by the tree topology. It is computationally inexpensive relative to the ML methods, especially for trees with large numbers of taxa. All candidate trees need to be tested separately because the canonical hypothesis is used. The ML pairwise distances, which are consistent estimates of the true distances, form the dependent variable vector. They can also be expressed as sums of branch lengths based on the tree topology. Therefore, the tested tree topology is transformed into a design matrix given the arrangement of the ML pairwise distances and the branch lengths in the vectors. For a pair of taxa i and j, the entry for a branch in the design matrix is 1 if the path from i to j passes through it and 0 otherwise. The branch lengths are estimated by the generalized least squares method. Because the ML pairwise distances asymptotically have a multivariate normal distribution, the generalized sum of squared differences between the pairwise distances and the estimated distances based on the tree structure asymptotically follows a χ2 distribution with T(T − 1)/2 − (2T − 3) degrees of freedom, where T is the number of taxa.
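The degrees of freedom follow directly from this counting: T(T − 1)/2 pairwise distances regressed on the 2T − 3 branch lengths of an unrooted binary tree. A minimal sketch:

```python
def gls_degrees_of_freedom(T):
    """Degrees of freedom of the GLS chi-square statistic: the number of
    pairwise distances, T(T-1)/2, minus the number of free branch lengths
    of an unrooted binary tree on T taxa, 2T - 3."""
    return T * (T - 1) // 2 - (2 * T - 3)

# Five taxa: 10 pairwise distances, 7 branches, so 3 degrees of freedom.
df5 = gls_degrees_of_freedom(5)
```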
Simulation Methods and Analysis
Trees
In our comparison, we used data simulated from model trees by Seq-Gen, version 1.2.5 (Rambaut and Grassly 1997), so that the true tree topologies are known. The model trees with small and large number of taxa were considered separately.
Small Number of Taxa
Five-taxon trees with five external branches and two internal branches were considered as models for simulation. In our model trees, two external branches were set to be the long branches with the same length. The others were the short branches again with the same length. So, there are three possible tree shapes determined by the positions of the two long branches as shown in figure 1. The values of the factors affecting the results of phylogenetic analyses, such as sequence lengths, branch lengths, and tree shapes, are summarized as follows—sequence length: 500, 1,000; long branch length: 0.9, 1.5; short branch length: 0.03, 0.06; tree shape (in fig. 1).
FIG. 1.— The three tree shapes with a, ..., g denoting the branch lengths.
There are 12 different model trees, labeled A–L, as listed in table 2. The branch lengths in each tree are given following the order of a, ..., g, as in figure 1. Trees A–D correspond to the tree on the left in figure 1, E–H to the one in the middle, and I–L to the one on the right. The percentage of time that the maximum likelihood tree and the true tree are the same (PMLT) is also listed in the table for different sequence lengths. The smaller the PMLT, the more difficult the phylogenetic analysis. The PMLT is highly related to the coverage of the tests because the ML tree is always in the confidence region. It is useless to compare the coverages of the confidence regions if all the PMLTs are close to or higher than the confidence level. The last column shows the diameters of the trees, which are the maximum distances between any two taxa in a tree (Zwickl and Hillis 2002).
Table 2 The Branch Lengths, the PMLTs of Different Sequence Lengths (sl), and the Diameters of the Model Trees
Large Number of Taxa
The simulations were also performed for trees with large number of taxa. In order to make our work consistent with reality, one data set was simulated from a published 66-taxon tree (Murphy et al. 2001) with sequence length of 3,000. We selected three subsets of taxa of size 10, 15, and 20 from the 66 taxa and constructed the ML tree for each by PAUP*. The three ML trees serve as the model trees of simulations for large number of taxa. The tree topologies are shown in figure 2, and the detailed tree structures including branch lengths are given in Appendix.
FIG. 2.— The structures of 10-, 15-, and 20-taxon trees for simulation.
Substitution Models
In order to generate data sets with Seq-Gen, we must specify the substitution models. Nucleotide substitution is assumed to follow a Markov process specified by a rate matrix, Q, whose elements represent instantaneous substitution rates among the four nucleotides. For the small number of taxa, we used the general time reversible (GTR) model, which allows the most parameters. For likelihood-based approaches, finding the ML tree becomes a heavy computational burden for large numbers of taxa. In order to save time, we used the Hasegawa-Kishino-Yano (HKY) model (Hasegawa, Kishino, and Yano 1985) for simulation, which allows different rates for transitions and transversions as well as unequal base frequencies. Because we did not consider model misspecification, the models for simulation and analysis were consistent.
For the small number of taxa, the base frequencies were set as (πA, πC, πG, πT) = (0.18, 0.33, 0.26, 0.23). Q can be decomposed as

Q = u ×
[ −     aπC   bπG   cπT ]
[ aπA   −     dπG   eπT ]
[ bπA   dπC   −     πT  ]
[ cπA   eπC   πG    −   ]

where the diagonal entries are set so that each row sums to zero, the G-T exchange rate is fixed at 1, and u is chosen so that the expected substitution rate, −Σi πi Qii, equals 1. In our simulation, (a, b, c, d, e) were specified as (3.297, 12.55, 1.167, 2.06, 13.01), respectively.
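Constructing and normalizing such a rate matrix can be sketched as follows (a sketch assuming the conventional GTR parameterization, with exchange rates a–e assigned to the pairs AC, AG, AT, CG, CT and the GT rate fixed at 1; the pair-to-rate mapping is an assumption here):

```python
def gtr_rate_matrix(pi, rates):
    """Build a GTR rate matrix Q (rows/columns in order A, C, G, T).

    pi: base frequencies; rates: exchange rates (a, b, c, d, e) for the
    pairs AC, AG, AT, CG, CT, with the GT rate fixed at 1 (assumption).
    Q is scaled by u so that the expected substitution rate,
    -sum_i pi_i * Q_ii, equals 1.
    """
    a, b, c, d, e, f = (*rates, 1.0)
    R = [[0, a, b, c],  # symmetric exchangeability matrix
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    Q = [[R[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)
    u = 1.0 / -sum(pi[i] * Q[i][i] for i in range(4))
    return [[u * q for q in row] for row in Q]

Q = gtr_rate_matrix((0.18, 0.33, 0.26, 0.23),
                    (3.297, 12.55, 1.167, 2.06, 13.01))
```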
For the large number of taxa, the parameters of the HKY model were set as 2.93 (transition/transversion ratio) and (πA, πC, πG, πT) = (0.37, 0.24, 0.12, 0.27). These values were taken from Zwickl and Hillis (2002), where they were estimated by ML on a 66-taxon tree representing the phylogenetic relationships among eutherian mammals from Murphy et al. (2001). To avoid confounding the issues, we did not consider heterogeneous site rates.
Methods
All the likelihoods were calculated with PAUP* 4.0b8 (Swofford 2000). The SH test was implemented directly with the programs in PAUP* using 1,000 RELL bootstraps. After the sitewise log likelihoods were obtained from PAUP*, the AU test was performed using CONSEL (Shimodaira and Hasegawa 2001), which also uses RELL for bootstrapping. We used the default settings of CONSEL, with the scales in the bootstrap ranging from 0.5 to 1.4 in increments of 0.1 and with 10,000 bootstraps for each scale. The expected LWs were estimated by the mean of 1,000 replicates in the LW test. The distributions of the test statistics in the SOWH, SSOWH, SDNB, and SDPB tests were built from 100 bootstrap replicates. The GLS test used a program developed from glsdna_eig written by Susko. When the eigenvalues of the estimated covariance matrix are greater than 10^−6, this program is the test discussed in Susko (2003). Otherwise, a modification is automatically implemented to avoid difficulties with the near singularity of the covariance matrix (more details are given in the software documentation available at http://www.mathstat.dal.ca/tsusko).
All 15 possible unrooted tree topologies of five taxa were tested by all methods to determine whether or not they belong to the confidence region. When the number of taxa becomes large, it is difficult to construct a confidence region based on all possible trees. Generally, we are interested in the trees with the highest likelihoods and the ML tree. One hundred trees with the highest likelihoods were selected by a search method, such as nearest-neighbor interchange (NNI) branch swapping, to get a measure of the size of the confidence region. To evaluate coverage, we only need to check whether the true tree is in the confidence region. It should be noted that under a search method, the "ML" tree we obtained may not be the global ML tree.
The 100(1 − α)% confidence region was obtained by collecting the trees with P values greater than the significance level α in the tests. Note that the LW test uses LWs rather than P values to construct the confidence region. Because we construct confidence regions based on only 100 repeated simulations for each method, the variance of the coverage should be considered. An approximate 95% confidence interval for the nominal coverage c can be estimated as
(c − 1.645 √(c(1 − c)/100), 1).
We considered a one-sided confidence interval because only low coverage causes problems. When c is 0.95 or 0.8, the corresponding 95% confidence interval is (0.91, 1) or (0.76, 1).
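The lower bound can be computed as follows (a sketch assuming a one-sided normal approximation with z = 1.645 for 100 simulations; the paper's exact construction may round or be derived slightly differently):

```python
import math

def coverage_lower_bound(c, n=100, z=1.645):
    """One-sided normal-approximation lower bound for the observed coverage
    when the nominal coverage is c, based on n simulated data sets."""
    return c - z * math.sqrt(c * (1 - c) / n)

# For c = 0.95 and n = 100 this is about 0.91.
lb = coverage_lower_bound(0.95)
```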
Simulation Results
Small Number of Taxa
We simulated 100 data sets from each model tree in table 2 with sequence lengths of 500 and 1,000. Confidence regions were constructed for all of them based on the eight methods. The estimated coverage is the proportion of times that the true tree is in the confidence region during the simulations. Note that, technically, the coverage will always be at least the PMLT for all procedures except the GLS test. Our goal is to have a small confidence region with coverage of at least 1 − α. The size is simply the number of trees (out of 15) in the confidence region. From a testing point of view, including a tree other than the true tree in the confidence region is equivalent to making a type-II error. The mean size of the 100 confidence regions under the same simulation parameters is a good indication of the size of the confidence region.
Across the simulation results, we found that the behavior of the methods was consistent regardless of the simulation parameters. The mean sizes of the confidence regions are almost always in ascending order for SDPB, SOWH, SDNB, SSOWH, GLS, LW, AU, and SH. (The mean sizes for the SOWH and SDNB tests are close, with the latter sometimes smaller than the former.) The average mean sizes of the confidence regions based on the 12 model trees are shown on the x axis in figure 3 for both the 80% and 95% levels and sequence lengths of 500 and 1,000. The solid line parallel to the x axis shows the nominal coverage. The lower bound of the confidence interval of the nominal coverage is shown by the dotted line. Any coverage below the lower bound of the confidence interval may indicate low coverage of the method because the probability that this should happen is smaller than 0.05. For each method, the 12 "+"s indicate the coverage for each model tree. Note that in some cases the coverages are the same for different model trees, so there are fewer than 12 distinct "+"s. When the coverages are below the lower bound, "+" is replaced by the labels of the model trees. With the smallest PMLTs among the 12 model trees, trees H, J, K, and L appear frequently below the dotted lines.
FIG. 3.— The simulation results of five-taxon trees. The average number of trees in the confidence regions is calculated over all the 12 model trees. The nominal coverage line shows the nominal coverage of the confidence region. The line below it is the lower bound of 95% confidence interval of the nominal coverage. The "+" indicates the coverage of single model tree. When the coverage is less than the lower bound, the label of the tree is shown.
Large Number of Taxa
When the number of taxa becomes large, the size of the confidence region depends on the number of candidate trees and on the tree-searching method used to choose them. A single data set was simulated from each of the three model trees with large numbers of taxa in order to get a basic idea of the behavior of the methods. First, 100 trees with the highest likelihoods were selected as the candidates by a heuristic search with neighbor-joining starting trees followed by NNI branch swapping. However, they might not be the 100 best trees globally, and some of them were partially star trees (some internal branch lengths are 0). If a partially star tree is not rejected, all the binary trees nested in such a structure should be included in the confidence region as well. Although the number of these 100 trees in the confidence region is not the true size in this section, a small number of trees still indicates a small size. In table 3, the number of trees in the confidence region was counted as the number of candidate trees with P values larger than 0.05, except for the LW test. Each tree in the confidence region with zero internal branch lengths was also counted as 1. Note that the true tree was covered by the confidence regions in all three examples.
Table 3 The Number of Trees in 95% Confidence Regions of SDPB, SOWH, SDNB, SSOWH, GLS, AU, SH, and LW for 10-, 15-, and 20-Taxon Trees
Basically, the behavior of the methods follows the same pattern as for the five-taxon trees. Again, the SOWH, SDPB, and SDNB tests give the smallest confidence regions. But there were some exceptions. The number of trees in the confidence region of the SSOWH test became larger than that of the AU test in these three examples. The GLS test behaved strangely in that the number of trees included was even larger than for the SH test when the number of taxa increased.
For the SH and AU tests which involve tree comparisons in the null hypothesis, we also searched 500 trees with the highest likelihoods by NNI branch swapping for the 20-taxon example. The P values of the SH test increase with the increasing number of the tested trees, as many authors have already noticed (e.g., Strimmer and Rambaut 2001).
The AU test is much less sensitive to this variation. The P values of the true tree for the AU test in both situations are the same, 0.43. But the P value of the AU test changes when another 100 best trees were chosen as the candidates by nonparametric bootstrap instead of NNI branch swapping for the 20-taxon data set. The ML trees of the replicate data sets were recorded as the tested trees during the bootstrap until we obtained 100 distinct ones. In this case, the largest difference of log likelihoods among the 100 trees was 12.8, but it was 72.4 for NNI branch swapping, which means that the trees generated by nonparametric bootstrap are closer to the ML tree. (The ML tree picked by the two methods is the same.) The corresponding P value changed to 0.479. So for a given tree, the AU test does give different results depending on the set of candidate trees.
The likelihood weight in the LW test is no longer a true likelihood weight when only some of the possible trees are considered. The performance of the LW test was stable when the number of tested trees from NNI branch swapping was changed from 100 to 500. And the number of trees in the confidence region became relatively small in the 20-taxon example, even compared to the SDNB test. But there is a potential problem. The LW depends not only on the likelihood itself but also on the number of tested trees. When the number of tested trees is large, the trees with the lowest likelihoods are always rejected, no matter how close their likelihoods are to the ML tree. This may cause low coverage if the true tree is one of them.
Because of the issues mentioned for the SH, AU, and LW tests, we only compared the coverage of the other tests for the large trees. Again, 100 trees with the highest likelihoods found by NNI branch swapping were selected as the candidates for each original data set. Table 4 lists the simulation results. The coverages of the SOWH, SDPB, and SDNB tests were acceptable, at around 0.95. The coverage of the SSOWH test was always 1. Conversely, the coverage for the 20-taxon tree went down to 0.84 for the 95% confidence region for the GLS test. We also used a 30-taxon tree as the model tree to check the coverage of the GLS test and obtained coverage of 66% for the 95% confidence region. The fact that these two observed coverages fell below the lower bound of the 95% confidence interval of the nominal coverage suggests that the GLS test is not reliable for large numbers of taxa.
Table 4 The Coverage of the Confidence Region with Sequence Length of 3,000 for 10-, 15-, and 20-Taxon Trees
Discussion
Comparison of the Tests
The discussion is based mostly on the simulation results for the 95% confidence regions of five-taxon trees (see fig. 3), combined with the results for larger numbers of taxa. The results for the 80% confidence regions reinforce our conclusions.
The SDPB, SDNB, and SOWH tests behave similarly. Many results have shown that when the SOWH test is applied to real data, the confidence region often contains only the ML tree, even at a high confidence level (e.g., Goldman, Anderson, and Rodrigo 2000; Buckley 2002). Ignoring the problem of model misspecification, however, the performance of the SOWH test is the best. Although its coverage sometimes fell below the nominal level, the smallest coverage was 91% at confidence level 95%, which was not outside the sampling interval. However, testing many trees one by one is very time consuming. The SDPB and SDNB tests solve this problem by using one bootstrap distribution for all trees to be tested.
The SDPB test produces the smallest mean confidence region size, but there are some issues with its coverage. For instance, with a sequence length of 500, the coverage can be as low as 87% for a 95% confidence region. The SDNB test has higher coverage but a larger confidence region. Its coverage was sometimes lower than the nominal coverage but seldom below the lower bound. Compared with the SOWH and SDPB tests, the SDNB test is less dependent on the model assumptions.
The coverages of the SH and AU tests exceeded the nominal coverage in all situations for the five-taxon trees in figure 3. The confidence regions based on these two methods were the largest among the eight tests, with the SH regions noticeably larger. Because these tests were derived as ways of adjusting for multiple comparisons with more than two trees, one might expect their performance to be more comparable to the other tests when m = 2. However, this did not turn out to be the case in the simulations we considered. For example, when 500 sites were generated from tree G and the tests were applied with an alternative tree that switched taxon labels 1 and 4 (see fig. 1), the average sizes of the confidence regions were 2 and 1.89 for the SH and AU tests, respectively, and 1.58 for the SDNB test. Because the largest possible region size is 2 with two candidate trees, the differences between these sizes are not dramatic, but they still indicate the conservative behavior of the SH and AU tests. Comparison with the confidence regions when m = 15 suggests that the conservative behavior of the SH and AU tests is not due simply to the multiple-testing issue.
The LW test tended to give smaller confidence regions than the AU test when the number of taxa was small. But for large numbers of taxa, its coverage and size depend on the trees in the testing pool, as explained in Large Number of Taxa in Simulation Results, which is not a desirable feature in practice.
The SSOWH test always had coverage above the nominal level, but with much smaller confidence regions than the SH, AU, or LW tests for the five-taxon trees. Although, under the same conditions, the SSOWH regions contained on average about two fewer trees than those of the LW test, we cannot recommend it as a good choice, because its confidence regions became larger than those of the AU test for the 20-taxon tree. For a four-taxon tree, the star tree is the only structure common to the three possible unrooted trees and can be considered the boundary of the tree space. But for a five-taxon tree, ((1,2)(3,4,5)) is the true boundary of ((1,2)(3,(4,5))), ((1,2)((3,4),5)), and ((1,2)((3,5),4)): by shrinking the internal branch between (1,2) and (3,4,5) to zero, we reach the star topology. Simulating from the star tree as the boundary makes H0 hard to reject, and the problem worsens as the number of taxa increases.
As the only distance-based test, the GLS test worked reasonably well for five-taxon trees. Its confidence regions tended to be smaller than those of the AU and SH tests but larger than those of the bootstrap-based tests. Serious problems occurred with large numbers of taxa: the coverage was much less than the nominal level, and the confidence region could be extremely large. There are two possible reasons for this behavior. First, although the pairwise distances have an asymptotic multivariate normal distribution, it is unclear how long the sequences must be for the asymptotic approximation to be good with large numbers of taxa. Second, with larger numbers of taxa, the estimated covariance matrix is more likely to be close to singular. Because the GLS test statistic is sometimes undefined in such cases, a modified routine was used whenever some of the eigenvalues of the estimated covariance matrix were less than 10^-6. This modification was based on the idea that GLS can be interpreted as weighted least squares for a set of approximately independent linear transformations; it ignores those linear transformations, with estimated variance near 0, that create problems when the covariance matrix is near singular. Although this avoids difficulties with undefined GLS test statistics, some information is lost in ignoring these linearly transformed distances.
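The eigenvalue-thresholding idea can be sketched as follows. This is a minimal numpy illustration of computing a GLS-type quadratic form through an eigendecomposition while ignoring near-null directions; the function name `gls_stat` and its exact tolerance handling are our own construction, not the routine used in the paper.

```python
import numpy as np

def gls_stat(resid, cov, tol=1e-6):
    """GLS quadratic form resid' * inv(cov) * resid, computed through an
    eigendecomposition of cov.  Eigendirections with eigenvalues below tol
    are dropped: they correspond to linear combinations of the distances
    with estimated variance near 0, which make the plain inverse unstable
    or undefined when cov is near singular."""
    vals, vecs = np.linalg.eigh(cov)       # cov = vecs @ diag(vals) @ vecs.T
    keep = vals > tol                      # ignore near-null directions
    z = vecs[:, keep].T @ resid            # approximately independent transforms
    return float(np.sum(z**2 / vals[keep]))
```

With a full-rank covariance matrix this agrees with the usual inverse-based statistic; with a singular one it simply discards the degenerate directions, at the cost of the information they carried.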
Some authors have concluded that the difficulty of phylogenetic analysis increases with the diameter of the underlying tree, especially for trees with small numbers of taxa (Huelsenbeck and Hillis 1993). The diameters of the model trees in Zwickl and Hillis (2002) ranged from 0.1 to 0.45. In our five-taxon simulations, even the smallest diameter was 1.8, meaning that we considered very hard conditions for five-taxon trees. The acceptable coverages of the SDPB and SDNB tests for large numbers of taxa are partly due to the small diameters of those model trees, which are less than 0.45. The PMLT is a measure of the difficulty of phylogenetic analysis in the simulation. Even for the very large diameters of the five-taxon trees, the coverages of the SDPB and SDNB tests are reasonable when the PMLT is larger than 0.6.
When the true tree has the star topology, all candidate trees are compatible with the observations, which is the underlying assumption of the SH and SSOWH tests. All the other tests undercover to varying degrees. For instance, when the two internal branches of tree G were shrunk to 0.001, the coverages of the SDNB, SOWH, and SDPB tests went down to 87%, 85%, and 80% for the 95% confidence region. This example suggests one reason for the small SOWH confidence regions observed in practice. While the SOWH test generally performed well, this is a region of tree space where it could a priori be expected to have difficulties. Because branch lengths are bounded below by 0, the estimated branch lengths for any hypothesized tree are much more likely to be larger than the true branch lengths, so the SOWH test simulates from a tree with more "structure" than the true tree. The level of uncertainty reflected in the estimated distribution of the likelihood ratio statistic is then less than it would have been if the actual branch lengths had been used in the simulations. The consequence, as the simulations confirm, is undercoverage when the true branch lengths are close to 0. Although these are cases where the SOWH test will not perform well, phylogenies near the star phylogeny do not usually arise in real problems of biological interest.
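The boundary effect can be seen in a toy analogue (our own construction, not a phylogenetic computation): when a parameter constrained to be nonnegative is truly 0, its truncated estimator can only err upward, just as an estimated branch length cannot fall below a true length of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 2000

# Analogue of a zero internal branch length: the true value sits on the
# boundary of the parameter space, and the constrained estimate
# max(0, sample mean) can only miss on the high side.
samples = rng.normal(0.0, 1.0, size=(reps, n))
estimates = np.maximum(0.0, samples.mean(axis=1))

bias = estimates.mean()   # positive even though the true value is exactly 0
```

Simulating replicate data from such upward-biased estimates builds in more structure than the truth has, which is the mechanism behind the observed undercoverage near the star tree.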
Effects of Factors
Sequence length, tree shape, and branch lengths all influence the difficulty of phylogenetic analysis. We describe their effects based on the simulation results for small numbers of taxa.
As expected, increasing the sequence length enables us to estimate the tree more precisely and obtain smaller confidence regions. As figure 3 shows, the average mean size of the confidence regions with sequence length 1,000 is almost three trees smaller than with sequence length 500.
Tree shape is determined by the relative positions of the long and short branches. The diameters of the three trees differ even though they have the same long and short branch lengths. We found that the confidence regions for data sets simulated from the left two trees in figure 1 were smaller than those for the right one. The results also show that long-branch attraction did not occur in our simulations. For example, tree C arose in 94 of the confidence regions obtained through simulation, and trees that differed from C by a single NNI appeared 40–46 times. The trees that placed the long branches together occurred 18, 25, and 24 times, consistent with the 18–28 appearances of the other trees differing from C by more than one NNI.
The branch length gives the expected number of substitutions per site over a given time. To isolate the effect of the relative branch lengths, it is instructive to look at the results for trees I and K, which have the same diameter (1.8) and the same long branch lengths, but short branch lengths of 0.03 for I and 0.06 for K. The mean sizes of the 95% confidence regions from the SOWH test with sequence length 500 were 6.6 and 3.71, respectively. (The coverages of the SOWH test for these two trees were both larger than 95%.) Clearly, the relatively longer short branches in tree K account for the smaller size. We enlarged the short branches of the tree structure on the right in figure 1 and list the resulting PMLT values in table 5. When the tree structure became fully symmetric, with short branch lengths of 0.9, the PMLT rose to 0.99 even though the diameter increased to 3.6. When we broke the symmetry by increasing the lengths of branches a, d, e, f, and g beyond 0.9, the PMLT decreased. This is consistent with the conclusion drawn from tree shape.
Table 5 The Values of the PMLT When the Short Branch Lengths Increase for the Tree Structure on the Right in Figure 1
Parametric Versus Nonparametric Bootstrap
There are two kinds of bootstrap: parametric and nonparametric. In phylogeny, the parametric bootstrap requires a substitution model along with estimated parameters; if the model is misspecified, this may bias the bootstrap results (Buckley 2002). The nonparametric bootstrap is more popular because it is based directly on the data (here, the site likelihoods) rather than on the parametric model.
The 95th percentiles of the SDNB and SDPB bootstrap distributions of the likelihood ratio statistic estimate the 95th percentile of its actual distribution. For a given true tree, this actual 95th percentile can be approximated by the 95th percentile of the likelihood ratios obtained by repeatedly simulating data sets from the true tree. We did this for three trees at different levels of the PMLT: 0.94 (tree A), 0.63 (tree C), and 0.25 (tree L). In each case, 5,000 data sets with sequence length 500 were generated, giving actual 95th percentiles of 0.257, 1.39, and 2.565 for the three trees.
To investigate the biases and variances due to the differing forms of bootstrapping in the SDNB and SDPB tests, we selected 100 data sets from the 5,000 and applied the SDNB test (with and without RELL resampling) and the SDPB test. For each tree and each procedure, this gave 100 estimated 95th percentiles. The results are in table 6.
Table 6 The Means and the SDs of the Cutoff Values in the SDNB, SDNB with RELL, and SDPB Tests
The standard deviations (SDs) reflect two sources of variation: the bootstrap method and the number of bootstrap replicates. Because we are interested in the former, the same number (100) of bootstrap replicates was used in all three techniques to make them comparable. The table shows that the parametric bootstrap provides smaller cutoffs than the nonparametric bootstrap. As the PMLT decreases, the "true cutoff" increases. Though the cutoffs from the SDPB test increase as well, their mean underestimates the true cutoff for tree L. Conversely, the SDNB test provides a better estimate in this situation, even though it overestimates the true cutoffs for trees A and C. Whether the cutoff is overestimated or underestimated, the SD for the nonparametric bootstrap is always larger than for the parametric one; that is, the parametric bootstrap provides more stable cutoffs.
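The kind of comparison summarized in table 6 can be mimicked on a toy scale. The sketch below is a generic illustration of the two bootstrap styles applied to a sample mean, not the tree-based computation: each "original data set" yields one parametric and one nonparametric estimate of a 95% cutoff, and the spread of those estimates across data sets plays the role of the SDs in the table.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_boot, n_data = 100, 100, 100   # sample size, bootstraps, repeated data sets

par_cut, npar_cut = [], []
for _ in range(n_data):
    x = rng.normal(0.0, 1.0, n)                          # one "original data set"
    # Parametric bootstrap: simulate new data from the fitted normal model.
    sims = rng.normal(x.mean(), x.std(ddof=1), (n_boot, n))
    par_cut.append(np.percentile(sims.mean(axis=1), 95))
    # Nonparametric bootstrap: resample the observed values with replacement.
    res = rng.choice(x, size=(n_boot, n), replace=True)
    npar_cut.append(np.percentile(res.mean(axis=1), 95))

par_sd, npar_sd = np.std(par_cut), np.std(npar_cut)      # stability of the cutoffs
```

In this simple setting the two styles behave similarly; the tree-based results in table 6, where the parametric cutoffs are noticeably more stable, depend on the structure of the likelihood ratio statistic.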
Clearly, the performance of the bootstrap methods is related to the PMLT as well. The nonparametric bootstrap tends to be conservative. Note that underestimating or overestimating the true cutoff does not necessarily imply high or low coverage, because a small cutoff may be paired with a small test statistic. The results in this section only describe the behavior of the cutoffs produced by the two bootstraps.
Conclusion
In evaluating the various test methods, we considered the coverage and the size of the confidence regions based on simulations. Generally, high coverage comes with large regions, and small regions risk low coverage; a good method gives a small confidence region with reasonable coverage. Overall, the SOWH test performed best in the simulations, with reasonable coverage and relatively small regions. But in this paper we only examined the efficiency of the various tests without considering violations of the model assumptions. The sensitivity of the SOWH test to model misspecification is the major reason for its low coverage in practice (Buckley 2002), so we do not recommend it unless the substitution model can be trusted as an accurate representation of the data. The SDPB test, which is expected to have lower coverage than the SOWH test in the same situation, is not recommended for the same reason. Though the SDNB test loses some power to detect wrong trees in the confidence region because of the nonparametric bootstrap, it is less dependent on the model assumptions, and it avoids the computational burden of the SOWH test when the number of candidate trees is large. We recommend it as a good alternative, giving small confidence regions with reasonable coverage.
Because the SSOWH, LW, and GLS tests perform unstably, they are not recommended for large numbers of taxa. We do not recommend the SH and AU tests for small numbers of taxa because their large region sizes reflect a lack of power to detect wrong trees. As a popular method, the AU test is conservative and sensitive to the set of candidate trees, but it is not seriously misleading for large numbers of taxa and is safe to use when the coverage must be controlled strictly.
Note that only binary trees were considered as model trees in our simulation. With star trees as models, all candidate trees support the data equally. Except for the SH and SSOWH tests, all the other tests tend to undercover in this situation, especially the SOWH, SDNB, and SDPB tests.
The performance under model misspecification is also not discussed in this paper. In such situations, the performance of these methods depends on how far the substitution model is from reality. Model misspecification is an important issue, but it raises a whole span of other questions that take us beyond the scope of this paper.
Appendix Tree Structures in Figure 2
The trees are in NEXUS format, and can be viewed in PAUP*.
10-Taxon Tree
(1: 0.013830, 2: 0.015832, (3: 0.029746, ((4: 0.017846, 5: 0.014022): 0.032036, (6: 0.047509, (7: 0.057006, (8: 0.026139, (9: 0.021788, 10: 0.032609): 0.001234): 0.038943): 0.001426): 0.006973): 0.004049): 0.017251);
15-Taxon Tree
(1: 0.013935, 2: 0.015366, (3: 0.033192, ((4: 0.023817, 5: 0.015218): 0.038405, (6: 0.042860, (7: 0.056182, (((8: 0.023098, 9: 0.024827): 0.000432, 10: 0.023845): 0.015672, ((((11: 0.005724, 12: 0.006404): 0.001903, 13: 0.005464): 0.028134, 14: 0.037194): 0.001776, 15: 0.047241): 0.023468): 0.015238): 0.001294): 0.004894): 0.004605): 0.015000);
20-Taxon Tree
(1: 0.013823, 2: 0.015819, (3: 0.029629, ((4: 0.017817, 5: 0.014036): 0.031956, (6: 0.047519, (7: 0.057420, ((8: 0.026198, (9: 0.022117, 10: 0.032245): 0.001317): 0.015635, (((((11: 0.004742, 12: 0.007728): 0.000829, 13: 0.005867): 0.035062, (14: 0.043034, 15: 0.046981): 0.001445): 0.017608, 16: 0.059090): 0.001133, ((17: 0.072962, 18: 0.051127): 0.011258, (19: 0.020931, 20: 0.013127): 0.029135): 0.013228): 0.000914): 0.022642): 0.001226): 0.006938): 0.003981): 0.017363);
Acknowledgements
We thank two reviewers for their suggestions, which improved the presentation of the manuscript. We also thank Andrew Roger for discussions about this paper. The authors are supported by Genome Atlantic and NSERC.
References
Antezana, M. 2003. When being ‘most likely’ is not enough: examining the performance of three uses of the parametric bootstrap in phylogenetics. J. Mol. Evol. 56:198–222.
Buckley, T. 2002. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst. Biol. 51:509–523.
Bulmer, M. 1991. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. 8:868–883.
Cavalli-Sforza, L., and W. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21:550–570.
Efron, B., E. Halloran, and S. Holmes. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 93:13429–13434.
Efron, B., and R. Tibshirani. 1998. The problem of regions. Ann. Stat. 26:1687–1718.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Goldman, N., J. Anderson, and A. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. 49:652–670.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 21:160–174.
Hillis, D., B. Mable, and C. Moritz. 1996. Applications of molecular systematics: the state of the field and a look to the future. Pp. 515–543 in D. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Huelsenbeck, J., and D. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.
Murphy, W., E. Eizirik, W. Johnson, Y. Zhang, O. Ryder, and S. O'Brien. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614–618.
Rambaut, A., and N. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.
Shimodaira, H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51:492–508.
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.
———. 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17:1246–1247.
Sokal, R., and F. Rohlf. 1995. Biometry. 2nd edition. W. H. Freeman, New York.
Strimmer, K., and A. Rambaut. 2001. Inferring confidence sets of possibly misspecified gene trees. Proc. R. Soc. Lond. B 269:137–142.
Susko, E. 2003. Confidence regions and hypothesis tests for topologies using generalized least squares. Mol. Biol. Evol. 20:862–868.
Swofford, D. 2000. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4.0b4a. Sinauer Associates, Sunderland, Mass.
Swofford, D., G. Olsen, P. Waddell, and D. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Zwickl, D., and D. Hillis. 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51:588–598.

(Xiaofei Shi, Hong Gu, Edw
In this paper, our main aim is to evaluate the performance of a number of the current proposals, and modifications of them, to see which procedure gives the best result under different conditions. Our approach is to construct confidence regions based on tests of the null hypothesis that the tested tree has the same topology as the true one. For simplicity, we will refer to τ0 as the tree topology being tested. The branch lengths are not specified in the null hypothesis but are estimated during the test; in most tests, the ML branch lengths, which maximize the likelihood for the given tree structure, are used. The P value is calculated as the probability, under H0, of observing a value of the test statistic at least as large as that observed. Using the general duality between testing and confidence regions, the test statistic can be used to construct confidence regions of topologies. A (1 − α) × 100% confidence region for the true topology is a set of topologies that contains the true topology with probability 1 − α; it includes τ0 if the null hypothesis is not rejected at significance level α. The probability that the confidence region includes the true tree is called the coverage.
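The duality between testing and confidence regions can be stated as a short sketch; the helper `p_value_for` is a hypothetical stand-in for any of the tests compared in this paper.

```python
def confidence_region(trees, p_value_for, alpha=0.05):
    """Invert a family of topology tests: the (1 - alpha) confidence region
    is the set of candidate topologies whose null hypothesis is not
    rejected at significance level alpha."""
    return [t for t in trees if p_value_for(t) >= alpha]

# Toy usage with made-up P values for three candidate topologies:
p_values = {"t1": 0.60, "t2": 0.04, "t3": 0.12}
region = confidence_region(p_values, p_values.get)   # t2 is rejected at 5%
```

Every procedure examined below fits this template; they differ only in how the P value for each candidate topology is computed.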
We will examine the following test procedures: Shimodaira-Hasegawa (SH) test (Shimodaira and Hasegawa 1999), single distribution nonparametric bootstrap (SDNB) test (introduced in this paper), SOWH test (Swofford et al. 1996; see also Hillis, Mable, and Moritz 1996; Goldman, Anderson, and Rodrigo 2000), star form of SOWH (SSOWH) test (Antezana 2003), single distribution parametric bootstrap (SDPB) test (introduced in this paper), approximately unbiased (AU) test (Shimodaira 2002), likelihood weight (LW) test (Strimmer and Rambaut 2001), and generalized least squares (GLS) test (Susko 2003). Except for the GLS test, which is distance based, all the tests are likelihood based. The simulation results are compared under two criteria: the coverage and the size, defined here as the number of trees in the confidence region.
The rest of the paper is organized as follows. Methodology introduces the statistical methods, including a detailed description of the two newly proposed SDNB and SDPB tests and their relationships. The comparison of confidence regions of gene trees according to their coverages and sizes, based on data sets simulated from trees with small and large numbers of taxa, is presented in Simulation Methods and Analysis, with the relevant discussion in Discussion. Without considering model misspecification, the SOWH test performs very well, with reasonable coverage and small confidence regions. The SDNB and SDPB tests, based on two different bootstrap methods, also give small regions but undercover the true tree (the coverage is lower than the nominal coverage) in some cases. The SH, LW, AU, and SSOWH tests always give conservative results in the simulations, i.e., coverage larger than the nominal coverage, and hence large confidence regions. Although, under large-sample approximations, the coverages of the AU and GLS tests are expected to match the confidence level, this did not happen in the simulations: the sequence lengths in the simulation were not long enough for the asymptotic behavior to take hold, and it is hard to say how long the sequences would have to be. The nonparametric and parametric bootstraps behave differently, which is demonstrated by examining the critical values of the SDNB and SDPB tests. Finally, the SDNB test, with the AU test as an alternative, is recommended for real data in Conclusion.
Methodology
In carrying out the tests to construct the confidence regions, the overall biological goal is the same: are the observations compatible with a specified tree topology τ0? This can be written as

H0: τ = τ0 versus H1: τ ≠ τ0,

which is called the canonical hypothesis in this paper. Likelihood is a popular way to measure the compatibility of the tested and the true trees; it can be calculated for each candidate tree using the estimated branch lengths, and the ML tree is considered the best estimate of the true tree. Let li and lML denote the log likelihoods of τi and the ML tree, respectively. Their difference, δi = lML − li, measures the discrepancy of the two trees and is commonly used as the test statistic for the canonical hypothesis.
The bootstrap method is widely used for constructing the distribution of a test statistic when a closed-form expression is not available. There are two types of bootstrap: nonparametric and parametric. For DNA sequences, the sites (columns of nucleotides) are treated as the samples in the nonparametric bootstrap: they are resampled with replacement to build the replicate data sets. In the standard bootstrap, all the nuisance parameters, such as the branch lengths and the base frequencies, must be reestimated to calculate the likelihoods. The RELL technique gives a time-saving approximation in which the site likelihoods are resampled; this is equivalent to calculating the likelihoods of the standard replicates with the parameters estimated from the original data set. The two techniques give similar results, with the latter being more popular. In the parametric bootstrap, the replicate data sets are simulated from a parametric substitution model and some tree. For instance, with the SSOWH test, bootstrap data are simulated from the star topology with estimated substitution model parameters and external branch lengths, while for the SOWH test, bootstrap data are simulated from the hypothesized tree with estimated branch lengths. Hence, the parametric bootstrap relies on the model specification more closely than the nonparametric bootstrap, which also makes it more powerful (Sokal and Rohlf 1995). However, this is also a limitation, because some assumptions in the model may not be realistic. An asterisk (*) in this paper marks estimates obtained from a replicate data set, which will be referred to as bootstrap values. For instance, lML is the log likelihood of the ML tree of the original data set, and l*ML denotes the corresponding value for a bootstrapped data set.
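The RELL idea can be sketched with made-up numbers; in practice the matrix of per-site log likelihoods (rows = candidate trees, columns = sites) would come from a phylogenetics package, which we do not attempt to reproduce here.

```python
import numpy as np

def rell_replicates(site_loglik, n_boot, rng):
    """RELL bootstrap: site_loglik is an (m trees x n sites) array of
    per-site log likelihoods computed once from the original data.
    Each replicate resamples the n site indices with replacement and sums
    the corresponding columns, so no parameters are re-estimated."""
    m, n = site_loglik.shape
    idx = rng.integers(0, n, size=(n_boot, n))     # resampled site indices
    return site_loglik[:, idx].sum(axis=2)         # (m x n_boot) log likelihoods

rng = np.random.default_rng(0)
site_ll = rng.normal(-1.0, 0.1, size=(3, 500))     # fake site log likelihoods
reps = rell_replicates(site_ll, n_boot=200, rng=rng)
```

Because every replicate reuses the site likelihoods of the original fit, RELL trades a small approximation error for a very large saving over refitting branch lengths on each replicate.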
We now describe the tests in terms of the hypothesis, test statistic, and the bootstrap method. The names of the tests and the corresponding acronyms are listed in table 1.
Table 1 The Tests and Their Acronyms Compared in this Paper
The SH Test
The SH test (Shimodaira and Hasegawa 1999) is a modified version of the Kishino-Hasegawa (KH) test (Kishino and Hasegawa 1989). The KH test was designed to compare two topologies selected independently of any analysis of the data but was often inappropriately used to compare a candidate tree with the ML tree, a fact noted by many authors (Shimodaira and Hasegawa 1999; Goldman, Anderson, and Rodrigo 2000). We do not give details or simulation results for the KH test because we are interested in more than two candidate trees. The hypothesis for the SH test is

H0: E[li] = maxj E[lj] versus H1: E[li] < maxj E[lj], i = 1, ..., m,

where m is the number of candidate trees. The hypothesis indicates that we are testing a group of trees jointly rather than each tree independently. Though the hypothesis is not canonical, the canonical test statistics are used in this test,

δi = lML − li, i = 1, ..., m,

and their distributions are estimated by nonparametric bootstrap.
Here we give an intuitive explanation and some comments on the rationale and the effects of the "centering" procedure proposed in the resampling method of the SH test. The nonparametric bootstrap approximates the true distribution whether the null hypothesis is true or not; because the null hypothesis may not be true, it may not approximate the distribution of the test statistic under the null hypothesis, as desired. If all the trees equally support the observations, they should have equal expected log likelihoods E[li], i = 1, ..., m. In order to build the distribution of the test statistic under H0, the least favorable configuration (l.f.c.), i.e., E[l1] = E[l2] = ... = E[lm], is considered. The l.f.c. states zero branch length for any split that is not present in all the trees under consideration. It is least favorable in the sense that it corresponds to the specific hypothesis consistent with H0 that is the most difficult to reject. By subtracting and adding E[l1] to δi, we have

δi = lML − li    (1)
   = lML − li − E[l1] + E[l1]    (2)
   = (lML − E[l1]) − (li − E[l1]).    (3)

Under the l.f.c. assumption, E[l1] can be replaced by E[li] for all i, so that

δi = (lML − E[l1]) − (li − E[li]).    (4)

Let l̄i, i = 1, ..., m, denote the mean of the replicate log likelihoods of τi. The centering procedure replaces E[li] by l̄i to calculate the bootstrap value of δi in equation (4), i.e.,

δ*i = maxj (l*j − l̄j) − (l*i − l̄i).    (5)

However, because the l.f.c. assumption does not necessarily hold, the equivalence between equations (3) and (4) does not hold for the bootstrap value of δi. In fact, we always get larger bootstrap values of δi when more trees are involved in the null hypothesis, which results in larger P values, making it harder to reject the tested trees.
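The centered resampling scheme, with bootstrap statistic maxj (l*j − l̄j) − (l*i − l̄i), can be sketched as follows. The replicate log likelihoods are assumed to be computed elsewhere (e.g., by RELL); the numbers fed in are otherwise arbitrary.

```python
import numpy as np

def sh_p_values(obs_ll, boot_ll):
    """Shimodaira-Hasegawa P values with centering.
    obs_ll:  length-m array of log likelihoods on the original data.
    boot_ll: (m x B) array of replicate log likelihoods (e.g., by RELL)."""
    delta = obs_ll.max() - obs_ll                     # observed statistics delta_i
    centered = boot_ll - boot_ll.mean(axis=1, keepdims=True)   # l*_i - lbar_i
    delta_star = centered.max(axis=0) - centered      # centered replicate statistics
    # P value: fraction of replicates at least as extreme as observed.
    return (delta_star >= delta[:, None]).mean(axis=1)
```

Centering every tree toward a common mean mimics the l.f.c.; as noted above, adding more trees to the maximum can only inflate the replicate statistics, which is the source of the test's conservatism.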
The SDNB Test
The bootstrap paradigm requires us to replace the true parameters with their estimates, and the estimates with the bootstrap values from the replicates. In phylogeny, the tested tree topology is the parameter of interest. As the estimate of the true tree, the ML tree obtained from the original data set represents the most probable evolutionary process. So, instead of centering, in each bootstrap replicate we replace the tested tree by the ML tree of the original data set, in keeping with bootstrap theory. This core idea of our proposed method conforms with the principle of the original nonparametric bootstrap. The proposed method, the SDNB test, requires only a single distribution for all the test statistics.
The hypothesis and the test statistic are the canonical ones in the SDNB test. The test statistics are

δi = lML − li, i = 1, ..., m,

and the ML tree, τML, is recorded.
The key steps of constructing the distribution of the test statistics are:
Generate nonparametric bootstrap replicate data sets.
Reestimate any free parameters to get maximum log likelihood and for each replicate. l* denotes the log likelihood for the bootstrap data set. The subscripts, ML and ML*, denote the ML tree ML of the original data set and the ML tree ML* of the replicate, respectively.
Construct the distribution of δ_i = l_ML − l_i by the difference of log likelihoods between the ML tree of the replicate and the ML tree of the original data set, δ* = l*_ML* − l*_ML. Comparing the calculation of δ* in the bootstrap to the original test statistic, the tested tree τ_i is replaced with τ_ML and τ_ML with τ_ML*. (We could use RELL in this step to save time. In this paper the test was implemented without RELL.)
Test whether the attained values of δ_i, i = 1, 2, ..., m, are plausible samples from δ*. For significance level α, if δ_i is less than the 100(1 − α)th percentile of δ*, τ_i is included in the 100(1 − α)% confidence region.
We only need one bootstrap distribution, δ*, which is used for testing all trees, leading to substantial computational savings. Note that our bootstrap paradigm differs slightly from the standard bootstrap. The standard bootstrap method estimates θ, the parameter of interest, by θ̂; θ is not random and does not depend on the number of observations. In our case, the tree topology τ_i is the parameter of interest in the null hypothesis. However, the consistency of τ_i with the true tree is measured by l_ML − l_i, which is estimated by l*_ML* − l*_ML. In contrast to the standard bootstrap, l_i is random and depends on the sequence length.
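The steps above can be sketched in Python (a toy illustration, not the paper's implementation; for brevity it uses RELL resampling, which the text notes is optional, and simulated sitewise log likelihoods in place of PAUP* output):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sitewise log likelihoods for m candidate trees over n sites.
n, m, B, alpha = 200, 4, 500, 0.05
site_ll = rng.normal(-1.0, 0.3, size=(m, n))
site_ll[0] += 0.05                              # tree 0 fits best on average

l_obs = site_ll.sum(axis=1)
i_ml = int(l_obs.argmax())                      # ML tree of the original data
delta = l_obs[i_ml] - l_obs                     # delta_i = l_ML - l_i

# Steps (1)-(3): one nonparametric (RELL) bootstrap distribution for all trees;
# in each replicate the ML tree is re-found and delta* = l*_ML* - l*_ML.
idx = rng.integers(0, n, size=(B, n))
l_star = site_ll[:, idx].sum(axis=2)            # shape (m, B)
delta_star = l_star.max(axis=0) - l_star[i_ml]  # single null distribution

# Step (4): keep every tree whose delta_i is a plausible draw from delta*.
cutoff = np.quantile(delta_star, 1 - alpha)
region = [i for i in range(m) if delta[i] <= cutoff]
```

Only one resampling loop is run no matter how many trees are tested; each additional candidate tree costs just one comparison against the stored cutoff.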
The SOWH and SSOWH Tests
The SOWH test was named after the authors who originally described it (Swofford et al. 1996; see also Hillis, Mable, and Moritz 1996). It has the canonical hypothesis and test statistic. In this test, a parametric bootstrap is used to construct the distribution of the test statistic. The replicate data sets are simulated under the tested tree with the nuisance parameters, such as the branch lengths and the substitution model, estimated from the original observations. Treating H0 as the single null hypothesis, this test has two disadvantages. One is low coverage when the true tree topology is close to the star topology (Antezana 2003). The other is that the distribution of δ_i in the SOWH test has to be built separately for each tested tree τ_i, which is computationally intensive.
In order to solve the first problem, Antezana (2003) proposed a star version of the parametric bootstrap in which the replicates are simulated from the star topology with ML branch lengths estimated from the original observations. This is similar to the treatment of a composite null hypothesis, where the largest P value under H0 is calculated. It is motivated by the fact that the star topology can be viewed as the boundary between trees. Using the boundary tree of H0 to simulate data will naturally make the tested tree difficult to reject. Because the tree model used to generate the parametric bootstrap is the only difference between these two tests, the latter is referred to as the star version of the SOWH (SSOWH) test.
The SDPB Test
Following the same bootstrap paradigm as the SDNB test, the SDPB test decreases the computational intensity by simulating the replicates from the ML tree of the original data set. The SDPB test can be considered the parametric bootstrap version of the SDNB test. The key steps of this method are the same as for the SDNB test except step (1), which is replaced by:
(1') Simulate parametric bootstrap replicates based on τ_ML with the ML branch lengths and other parameters obtained from the original observations.
This procedure gives the SDPB test a substantial computational advantage over the SOWH test because the δ_i's are all evaluated using the single distribution of δ* = l*_ML* − l*_ML. This eliminates the need for a separate bootstrap for each τ_i, which the SOWH test requires. The SDPB test is much more feasible than the SOWH test, especially for trees with large numbers of taxa and consequently large numbers of trees to test.
Other Methods
The AU test (Shimodaira 2002) is widely used now. Its hypothesis is stated in terms of the expected log likelihood of topology τ_i, denoted by μ_i, i = 1, ..., m. The hypothesis for testing τ_i is H0: μ_i ≥ μ_j for all j versus H1: μ_i < μ_j for some j.
The proportion of times that τ_i is the ML tree in the bootstrap replicates is called the bootstrap probability (BP). A multiscale nonparametric bootstrap in which different sequence lengths are used provides a technique for computing BP. Because BP gives only first-order accuracy for the P value (Efron, Halloran, and Holmes 1996), Shimodaira used the results from Efron and Tibshirani (1998) to correct BP to give second-order accurate P values, which means that the expected coverage of the AU test is (1 − α) + o(n^(−3/2)) (see eq. 11 and the discussion therein of Shimodaira 2002).
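A minimal sketch of the multiscale idea in Python (our paraphrase of the published procedure, not CONSEL's code, with simulated stand-in data): BP is computed at replicate lengths n' = r·n over a grid of scales r, Φ⁻¹(1 − BP) is regressed on σ and 1/σ with σ = √(n/n'), and the fitted signed distance d and curvature c give p_AU = 1 − Φ(d − c):

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()
rng = np.random.default_rng(2)

# Hypothetical sitewise log likelihoods; we test whether tree 0 is plausible.
n, m, B = 300, 3, 2000
site_ll = rng.normal(-1.0, 0.3, size=(m, n))
site_ll[0] += 0.03

scales = np.arange(0.5, 1.45, 0.1)                # replicate lengths n' = r * n
sigma, z = [], []
for r in scales:
    n_r = int(round(n * r))
    idx = rng.integers(0, n, size=(B, n_r))
    l_star = site_ll[:, idx].sum(axis=2)
    bp = (l_star.argmax(axis=0) == 0).mean()      # BP of tree 0 at this scale
    bp = min(max(bp, 1.0 / B), 1.0 - 1.0 / B)     # keep inv_cdf finite
    sigma.append((n / n_r) ** 0.5)
    z.append(nd.inv_cdf(1.0 - bp))

# Fit z(sigma) ~ d*sigma + c/sigma; the AU-corrected P value is 1 - Phi(d - c).
X = np.column_stack([sigma, 1.0 / np.array(sigma)])
d, c = np.linalg.lstsq(X, np.array(z), rcond=None)[0]
p_au = 1.0 - nd.cdf(d - c)
```

At σ = 1 the ordinary BP is recovered; the correction uses how BP changes with replicate length to separate the distance-to-boundary term d from the curvature term c.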
The LW test is designed for constructing confidence regions directly and does not involve any tests of hypotheses (Strimmer and Rambaut 2001). The LW is the proportion of the likelihood of a tested tree over the sum of the likelihoods of all candidate trees. The nonparametric bootstrap gives an estimate of the expected LW of each tree. The weight can be viewed roughly as the posterior probability of a tree under a uniform prior on the trees. Tested trees are ordered by expected LWs and are included in the confidence region until the sum of their weights is equal to or slightly greater than the confidence level. Trees with larger expected likelihoods are more likely to be the true tree and hence will be in the confidence region.
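The LW construction can be sketched as follows (illustration only; the replicate log likelihoods are hypothetical numbers standing in for bootstrap output, and the weights are computed stably on the log scale):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical log likelihoods of m candidate trees in B bootstrap replicates;
# the offsets make tree 0 the best supported.
m, B = 5, 1000
offsets = np.array([3.0, 1.0, 0.0, -1.0, -4.0])
l_star = rng.normal(-500.0, 2.0, size=(B, m)) + offsets

# Likelihood weight of tree i in a replicate: L_i / sum_j L_j (a log-sum-exp
# softmax over the trees); the expected LW is the mean over replicates.
w = np.exp(l_star - l_star.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
elw = w.mean(axis=0)

# Add trees in decreasing order of expected LW until the cumulative weight
# reaches the confidence level.
order = np.argsort(elw)[::-1]
cum = np.cumsum(elw[order])
region = order[: int(np.searchsorted(cum, 0.95) + 1)].tolist()
```

Because the weights always sum to 1 over whatever candidate set is supplied, the region depends on which trees are in the pool, which is the sensitivity discussed later for large numbers of taxa.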
The GLS test (Susko 2003) is based on the idea of fitting ML pairwise distances using a regression model determined by the tree topology. It is computationally inexpensive relative to the ML methods, especially for trees with large numbers of taxa. All candidate trees need to be tested separately because the canonical hypothesis is used. The ML pairwise distances, which are consistent estimates of the true distances, form the dependent variable vector. They can also be expressed as sums of the branch lengths determined by the tree topology. Therefore, the tested tree topology is transformed into a design matrix given the arrangement of the ML pairwise distances and the branch lengths in the vectors. For a pair of taxa i and j, the entry for a branch is 1 in the design matrix if the path from i to j passes through it and 0 otherwise. The branch lengths are estimated by the generalized least squares method. Because the ML pairwise distances asymptotically have a multivariate normal distribution, the generalized sum of squared differences between the pairwise distances and the estimated distances based on the tree structure asymptotically follows a χ² distribution with T(T − 1)/2 − (2T − 3) degrees of freedom, where T is the number of taxa.
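The design-matrix construction and the GLS statistic can be illustrated for four taxa (a sketch under our reading of the test, with hypothetical distances and a diagonal covariance used purely for illustration):

```python
import numpy as np

# Design matrix for the unrooted 4-taxon tree ((1,2),(3,4)). Columns are the
# branches (external e1..e4, then the internal branch); rows are the taxon
# pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). An entry is 1 when the
# path between the pair passes through the branch.
X = np.array([
    [1, 1, 0, 0, 0],   # d(1,2) = e1 + e2
    [1, 0, 1, 0, 1],   # d(1,3) = e1 + e3 + internal
    [1, 0, 0, 1, 1],   # d(1,4)
    [0, 1, 1, 0, 1],   # d(2,3)
    [0, 1, 0, 1, 1],   # d(2,4)
    [0, 0, 1, 1, 0],   # d(3,4) = e3 + e4
], dtype=float)

def gls_stat(dist, X, V):
    """GLS fit of distances to the tree encoded by X; returns the residual
    quadratic form, which is compared to a chi-square distribution."""
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ dist)  # branch lengths
    resid = dist - X @ beta
    return float(resid @ Vinv @ resid)

# Hypothetical ML pairwise distances and covariance (illustration only).
dist = np.array([0.10, 0.31, 0.32, 0.30, 0.31, 0.12])
V = np.diag(np.full(6, 1e-3))
stat = gls_stat(dist, X, V)
T = 4
df = T * (T - 1) // 2 - (2 * T - 3)   # = 6 - 5 = 1 degree of freedom here
```

If the supplied distances are exactly additive on the tree, the residual quadratic form is zero; larger values signal a worse fit of the topology, and the χ² reference distribution supplies the cutoff.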
Simulation Methods and Analysis
Trees
In our comparison, we used data simulated from model trees by Seq-Gen, version 1.2.5 (Rambaut and Grassly 1997), so that the true tree topologies are known. The model trees with small and large number of taxa were considered separately.
Small Number of Taxa
Five-taxon trees with five external branches and two internal branches were considered as models for simulation. In our model trees, two external branches were set to be the long branches with the same length. The others were the short branches again with the same length. So, there are three possible tree shapes determined by the positions of the two long branches as shown in figure 1. The values of the factors affecting the results of phylogenetic analyses, such as sequence lengths, branch lengths, and tree shapes, are summarized as follows—sequence length: 500, 1,000; long branch length: 0.9, 1.5; short branch length: 0.03, 0.06; tree shape (in fig. 1).
FIG. 1.— The three tree shapes with a, ..., g denoting the branch lengths.
There are 12 different model trees labeled A–L as listed in table 2. The branch lengths in each tree are given following the order of a, ..., g, as in figure 1. Trees A–D correspond to the tree on the left in figure 1, E–H to the middle one, and I–L to the right one. The percentage of times that the maximum likelihood tree and the true tree are the same (PMLT) is also listed in the table for different sequence lengths. The smaller the PMLT, the more difficult the phylogenetic analysis. The PMLT is closely related to the coverage of the tests because the ML tree is always in the confidence region. It is useless to compare the coverages of the confidence regions if all the PMLTs are close to or higher than the confidence level. The last column shows the diameters of the trees, which are the maximum distances between any two taxa in a tree (Zwickl and Hillis 2002).
Table 2 The Branch Lengths, the PMLTs of Different Sequence Lengths (sl), and the Diameters of the Model Trees
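The diameter quoted in the last column can be computed from any weighted tree as the largest leaf-to-leaf path length; a small sketch (with a hypothetical 5-taxon topology and branch lengths echoing the long/short settings above):

```python
import itertools

# A weighted unrooted tree as an edge map (hypothetical 5-taxon example with
# long branches of 0.9 and short branches of 0.03).
edges = {
    ("t1", "u"): 0.9, ("t2", "u"): 0.03, ("u", "v"): 0.03,
    ("t3", "v"): 0.9, ("v", "w"): 0.03, ("t4", "w"): 0.03, ("t5", "w"): 0.03,
}

adj = {}
for (x, y), length in edges.items():
    adj.setdefault(x, []).append((y, length))
    adj.setdefault(y, []).append((x, length))

def dist_from(start):
    # Path lengths from start to every node (a stack-based traversal;
    # correct because a tree has no cycles).
    out, stack = {start: 0.0}, [start]
    while stack:
        x = stack.pop()
        for y, length in adj[x]:
            if y not in out:
                out[y] = out[x] + length
                stack.append(y)
    return out

leaves = [x for x in adj if len(adj[x]) == 1]
diameter = max(dist_from(a)[b] for a, b in itertools.combinations(leaves, 2))
# The two long branches lie on one path here, so diameter = 0.9 + 0.03 + 0.9.
```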
Large Number of Taxa
The simulations were also performed for trees with large number of taxa. In order to make our work consistent with reality, one data set was simulated from a published 66-taxon tree (Murphy et al. 2001) with sequence length of 3,000. We selected three subsets of taxa of size 10, 15, and 20 from the 66 taxa and constructed the ML tree for each by PAUP*. The three ML trees serve as the model trees of simulations for large number of taxa. The tree topologies are shown in figure 2, and the detailed tree structures including branch lengths are given in Appendix.
FIG. 2.— The structures of 10-, 15-, and 20-taxon trees for simulation.
Substitution Models
In order to generate data sets by Seq-Gen, we must specify the substitution models. Nucleotide substitution is assumed to follow a Markov process which is specified by a rate matrix, Q, whose elements represent instantaneous substitution rates among the four nucleotides. For the small number of taxa, we used the general time reversible (GTR) model which allows the most parameters. For likelihood-based approaches, it becomes a heavy computational burden to find the ML tree for large number of taxa. In order to save time, we use Hasegawa-Kishino-Yano (HKY) model (Hasegawa, Kishino, and Yano 1985) for simulation which allows different rate of transitions and transversions as well as unequal base frequencies. Because we did not consider model misspecification, the models for simulation and analysis were consistent.
For small numbers of taxa, the base frequencies were set as (π_a, π_c, π_g, π_t) = (0.18, 0.33, 0.26, 0.23). Q can be decomposed so that its off-diagonal elements are q_ij = u s_ij π_j, where the symmetric exchangeability parameters are (s_ac, s_ag, s_at, s_cg, s_ct, s_gt) = (a, b, c, d, e, 1), the diagonal elements make each row sum to zero, and u is chosen so that the expected substitution rate, −Σ_i π_i q_ii, equals 1. In our simulation, (a, b, c, d, e) are specified as (3.297, 12.55, 1.167, 2.06, 13.01), respectively.
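The normalized rate matrix can be assembled as follows (a sketch assuming the standard GTR form q_ij = u·s_ij·π_j with the sixth exchangeability fixed at 1, which is our reading of the decomposition above):

```python
import numpy as np

pi = np.array([0.18, 0.33, 0.26, 0.23])     # base frequencies of a, c, g, t
a, b, c, d, e = 3.297, 12.55, 1.167, 2.06, 13.01

# Symmetric exchangeabilities for (ac, ag, at, cg, ct, gt) = (a, b, c, d, e, 1).
S = np.array([
    [0.0, a,   b,   c],
    [a,   0.0, d,   e],
    [b,   d,   0.0, 1.0],
    [c,   e,   1.0, 0.0],
])

Q = S * pi                                  # off-diagonals q_ij = s_ij * pi_j
np.fill_diagonal(Q, -Q.sum(axis=1))         # rows must sum to zero
u = 1.0 / -(pi * np.diag(Q)).sum()          # u rescales the mean rate to 1
Q *= u
```

With this normalization, branch lengths are measured in expected substitutions per site, and the symmetry of π_i q_ij confirms time reversibility.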
For the large numbers of taxa, the values of the parameters in the HKY model were set as 2.93 (transition/transversion ratio) and (π_a, π_c, π_g, π_t) = (0.37, 0.24, 0.12, 0.27). These values were taken from Zwickl and Hillis (2002), where they were estimated by ML on a 66-taxon tree representing the phylogenetic relationships among Eutherian mammals from Murphy et al. (2001). To avoid confounding the issues, we did not consider heterogeneous site rates.
Methods
All the likelihoods were calculated with PAUP* 4.0b8 (Swofford 2000). The SH test was implemented directly by the programs in PAUP* with 1,000 RELL bootstraps. After the sitewise log likelihoods were obtained from PAUP*, the AU test was performed using CONSEL (Shimodaira and Hasegawa 2001), which also uses RELL for bootstrapping. We used the default settings of CONSEL with the scales in the bootstrap from 0.5 to 1.4, in increments of 0.1, and with 10,000 bootstraps for each scale. The expected LWs were estimated by the mean of 1,000 replicates in the LW test. The distributions of the test statistics in the SOWH, SSOWH, SDNB, and SDPB tests were built from 100 bootstrap replicates. The GLS test used a program developed from glsdna_eig written by Susko. When the eigenvalues of the estimated covariance matrix are greater than 10^−6, this program is the test discussed in Susko (2003). Otherwise, a modification is automatically implemented to avoid difficulties with the near singularity of the covariance matrix (more details are given in the software documentation available at http://www.mathstat.dal.ca/tsusko).
All 15 possible unrooted tree topologies of five taxa were tested by all methods to determine whether or not they belong to the confidence region. When the number of taxa becomes large, it is difficult to construct a confidence region based on all possible trees. Generally, we are interested in the trees with the highest likelihoods and the ML tree. One hundred trees with the highest likelihoods were selected by a search method, such as nearest-neighbor-interchange (NNI) branch swapping, to get a measure of the size of the confidence region. To evaluate coverage, we only need to check whether the true tree is in the confidence region. It should be noted that with a heuristic search, the "ML" tree we obtain may not be the global ML tree.
The 100(1 − α)% confidence region was obtained by collecting the trees with P values greater than the significance level α in the tests. Note that the LW test uses LWs rather than P values to construct the confidence region. Because we construct confidence regions based on only 100 repeated simulations for each method, the variance of the estimated coverage should be considered. An approximate 95% confidence interval for the nominal coverage c can be estimated as (c − z_0.05 √(c(1 − c)/100), 1).
We considered a one-sided confidence interval because only low coverage causes problems. When c is 0.95 or 0.8, the corresponding 95% confidence interval is (0.91–1) or (0.76–1).
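The lower bound can be computed with a normal approximation to the binomial coverage count (a sketch using z_0.05 = 1.645; this reproduces the 0.91 bound quoted for c = 0.95, though the exact approximation and rounding used in the paper may differ):

```python
from math import sqrt

def coverage_lower_bound(c, n=100, z=1.645):
    """One-sided 95% lower bound for observed coverage over n simulated
    data sets when the true coverage is c (normal approximation)."""
    return c - z * sqrt(c * (1.0 - c) / n)

lb95 = coverage_lower_bound(0.95)   # about 0.91
lb80 = coverage_lower_bound(0.80)
```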
Simulation Results
Small Number of Taxa
We simulated 100 data sets from each model tree in table 2 with sequence lengths of 500 and 1,000. Confidence regions were constructed for all of them based on the eight methods. The estimated coverage is the proportion of times that the true tree is in the confidence region during the simulations. Note that, except for the GLS test, the coverage will always be at least the PMLT because the ML tree is always in the confidence region. Our goal is a small confidence region with coverage of at least 1 − α. The size is simply the number of trees (out of 15) in the confidence region. From a testing point of view, including a tree other than the true tree in the confidence region is equivalent to making a type-II error. The mean size of the 100 confidence regions under the same simulation parameters is a good indication of the size of the confidence region.
Across the simulation results, we found that the behavior of the methods was consistent regardless of the simulation parameters. The mean sizes of the confidence regions are almost always in ascending order according to SDPB, SOWH, SDNB, SSOWH, GLS, LW, AU, and SH. (The mean sizes of the SOWH and SDNB tests are close; occasionally the latter is smaller than the former.) The average mean sizes of the confidence regions based on the 12 model trees are shown on the x axis in figure 3 for both the 80% and 95% levels and sequence lengths of 500 and 1,000. The solid line parallel to the x axis shows the nominal coverage. The lower bound of the confidence interval of the nominal coverage is shown by the dotted line. Any coverage below the lower bound of the confidence interval may indicate low coverage of the method because the probability that this should happen is smaller than 0.05. For each method, the 12 "+"s indicate the coverage for each model tree. Note that in some cases the coverages are the same for different model trees, so there are fewer than 12 distinct "+"s. When the coverages are below the lower bound, "+" is replaced by the labels of the model trees. With the smallest PMLTs among the 12 model trees, trees H, J, K, and L appear frequently below the dotted lines.
FIG. 3.— The simulation results of five-taxon trees. The average number of trees in the confidence regions is calculated over all the 12 model trees. The nominal coverage line shows the nominal coverage of the confidence region. The line below it is the lower bound of 95% confidence interval of the nominal coverage. The "+" indicates the coverage of single model tree. When the coverage is less than the lower bound, the label of the tree is shown.
Large Number of Taxa
When the number of taxa becomes large, the size of the confidence region depends on the number of candidate trees and on the tree-searching method of choosing them. A single data set was simulated from each of the three model trees with large number of taxa in order to get a basic idea of the behavior of the methods. At first, 100 trees with the highest likelihoods were selected by a heuristic search with neighbor-joining starting trees followed by NNI branch swapping as the candidates. However, they might not be the 100 best trees globally, and some of them were partially star trees (some internal branch lengths are 0). If a partially star tree is not rejected, all the binary trees nested in such structure should be included in the confidence region as well. Although the number of these 100 trees in the confidence region is not the true size in this section, a small number of trees still indicates a small size. In table 3, the number of trees in the confidence region was counted as the number of candidate trees with P values larger than 0.05, except for the LW test. Each tree in the confidence region with zero internal branch lengths was counted as 1 as well. Note that the true tree was covered by the confidence regions in all these three examples.
Table 3 The Number of Trees in 95% Confidence Regions of SDPB, SOWH, SDNB, SSOWH, GLS, AU, SH, and LW for 10-, 15-, and 20-Taxon Trees
Basically, the behavior of the methods follows the same pattern as for the five-taxon trees. Again, the SOWH, SDPB, and SDNB tests give the smallest confidence regions. But there were some exceptions. The number of trees in the confidence region of the SSOWH test became larger than that of the AU test in these three examples. The GLS test behaved strangely in that the number of trees included was even larger than for the SH test when the number of taxa increased.
For the SH and AU tests which involve tree comparisons in the null hypothesis, we also searched 500 trees with the highest likelihoods by NNI branch swapping for the 20-taxon example. The P values of the SH test increase with the increasing number of the tested trees, as many authors have already noticed (e.g., Strimmer and Rambaut 2001).
The AU test is much less sensitive to this variation. The P values of the true tree for the AU test in both situations are the same, 0.43. But the P value of the AU test changes when another 100 best trees were selected as the candidates by nonparametric bootstrap instead of NNI branch swapping for the 20-taxon data set. The ML trees of the replicate data sets were recorded as the tested trees during the bootstrap until we obtained 100 distinct ones. In this case, the largest difference of log likelihoods among the 100 trees is 12.8 but was 72.4 by NNI branch swapping, which means that the trees generated by nonparametric bootstrap are closer to the ML tree. (The ML tree picked by the two methods is the same.) The corresponding P value changed to 0.479. So for a given tree, the AU test does give different results depending on the set of candidate trees.
The likelihood weight in the LW test is no longer the true likelihood weight when we only consider some of the possible trees. The performance of the LW test was stable when the number of tested trees from NNI branch swapping changed from 100 to 500. And the number of trees in the confidence region became relatively small in the 20-taxon example, even compared to the SDNB test. But there is a potential problem. The LW depends not only on the likelihood itself but also on the number of tested trees. When the number of tested trees is large, the trees with the lowest likelihoods are always rejected, no matter how close their likelihoods are to that of the ML tree. This may cause low coverage if the true tree is one of them.
Because of the issues mentioned for the SH, AU, and LW tests, we only compared the coverage of the other tests for the large trees. Again, 100 trees with the highest likelihoods by NNI branch swapping were selected as the candidates for each original data set. Table 4 lists the simulation results. The coverages of the SOWH, SDPB, and SDNB tests were acceptable, at around 0.95. The coverage of the SSOWH test was always 1. Conversely, the coverage for the 20-taxon tree went down to 0.84 for the 95% confidence region of the GLS test. We also used a 30-taxon tree as the model tree to check the coverage of the GLS test and obtained coverage of 66% for the 95% confidence region. The fact that these two observed coverages fell below the lower bound of the 95% confidence interval of the nominal coverage suggests that the GLS test is not reliable for large numbers of taxa.
Table 4 The Coverage of the Confidence Region with Sequence Length of 3,000 for 10-, 15-, and 20-Taxon Trees
Discussion
Comparison of the Tests
The discussions are mostly based on the simulation results of 95% confidence region of five-taxon trees (see fig. 3), combined with the results from the large number of taxa. The results of 80% confidence region serve as reinforcement of our opinions.
The SDPB, SDNB, and SOWH tests behave similarly. Many results have shown that the confidence region only contained the ML tree when the SOWH test was applied to real data even with high confidence level (e.g., Goldman, Anderson, and Rodrigo 2000; Buckley 2002). But ignoring the problem of model misspecification, the performance of the SOWH test is the best. Although the coverages were below the nominal level sometimes, the smallest coverage was 91% for confidence level 95%, which was not outside the sampling interval. However, it is very time consuming to test many trees one by one. The SDPB and SDNB tests solve this problem by using one bootstrap distribution for all trees to be tested.
The SDPB test creates the smallest mean size confidence region, but there are some issues about the coverage. For instance, with sequence length of 500, the coverage can be as low as 87% for 95% confidence region. The SDNB test has higher coverage but with a larger size of the confidence region. Sometimes its coverage was lower than the nominal coverage but seldom below the lower bound. Compared to the SOWH and SDPB tests, the SDNB test is not as dependent on the model assumptions.
The coverages of the SH and AU tests exceed the nominal coverage for all situations of the five-taxon trees in figure 3. The sizes of the confidence regions based on these two methods were the largest among the eight tests, with the size of the SH region being noticeably larger. Because they were derived as ways of adjusting for the multiple comparisons with more than two trees, it may seem possible that their performance would be more comparable to the other tests when m = 2. However, this did not turn out to be the case in the simulations we considered. For example, when 500 sites were generated from tree G and the tests were applied with an alternative tree that switched taxon labels 1 and 4 (see fig. 1), the average sizes of the confidence regions were 2 and 1.89 for the SH and AU tests, respectively, and 1.58 for the SDNB test. Note that because the largest size of the confidence region is 2 for two candidate trees, the difference between these sizes seems not to be significant but still indicates the conservative performance of the SH and AU tests. Comparison with confidence regions when m = 15 suggests that the conservative performance of the SH and AU tests is not due simply to the multiple testing issue.
The LW test tended to give smaller confidence regions than the AU test when the number of taxa is small. But for the large numbers of taxa, its coverage and size are related to the trees in the testing pool as we have explained in Large Number of Taxa in Simulation Results, which is not a good feature for implementation.
The SSOWH test always has coverage above the nominal level but with a much smaller confidence region than the SH, AU, or LW test for the five-taxon trees. Although on average there were two fewer trees for the SSOWH test than for the LW test under the same conditions, we cannot recommend it as a good choice because its confidence region became larger than that of the AU test for the 20-taxon tree. For a four-taxon tree, the star tree is the only common structure among the three possible unrooted trees and can be considered the boundary of the tree space. But for a five-taxon tree, ((1,2)(3,4,5)) is the true boundary of ((1,2)(3,(4,5))), ((1,2)((3,4),5)), and ((1,2)((3,5),4)): only by also shortening the internal branch between (1,2) and (3,4,5) do we finally reach the star topology. Using the star tree as the boundary makes H0 hard to reject, and this gets worse with increasing numbers of taxa.
As the only test based on distance methods, the GLS test worked reasonably well for five-taxon trees. Sizes of confidence regions tended to be smaller than for the AU and SH tests but larger than for the bootstrap-based tests. Serious problems occurred with large numbers of taxa. The coverage was much less than the nominal level, and the size of the confidence region could be extremely large. There are two possible reasons for this behavior. Although the pairwise distances have an asymptotic multivariate normal distribution, it is unclear how large the sequence length has to be for the asymptotic approximation to be good with large numbers of taxa. The second reason is that with larger numbers of taxa, the estimated covariance matrix is more likely to be close to singular. Because the GLS test statistics are sometimes undefined in such cases, a modified routine was used whenever some of the eigenvalues in the estimated covariance matrix were less than 10–6. This modification was based on the idea that GLS can be interpreted as weighted least squares for a set of approximately independent linear transformations. The modification ignores those linear transformations, with estimated variance near 0, that create problems when the covariance matrix is near singular. Although this avoids difficulties with undefined GLS test statistics, some information is lost in ignoring these linearly transformed distances.
Some authors concluded that the difficulty of phylogenetic analysis increases with the increasing diameter of the underlying tree, especially for trees with small number of taxa (Huelsenbeck and Hillis 1993). The diameters of the model trees in the paper of Zwickl and Hillis (2002) ranged from 0.1 to 0.45. In our simulations of five taxa, even the smallest diameter was 1.8, which meant we considered very hard conditions for five-taxon trees. The acceptable coverages of the SDPB and SDNB tests for large number of taxa are partly due to the small diameter of the model trees, which are less than 0.45. The PMLT is a scale to reflect the difficulty of phylogenetic analysis in the simulation. Even for the very large diameters of the five-taxon trees, the coverages of the SDPB and SDNB tests are reasonable if the PMLT is larger than 0.6.
When the true tree has star topology, all the candidate trees are acceptable for the observations, which is the underlying assumption of the SH and SSOWH tests. All the other tests undercover at varying levels. For instance, the coverage of the SDNB, SOWH, and SDPB tests went down to 87%, 85%, and 80% for 95% confidence region by shrinking the two internal branches of tree G to 0.001. This example indicates a possible reason for the small confidence region of the SOWH test observed in reality. While the SOWH test generally performed well, this is a region of tree space where it could a priori be expected to have some difficulties. Because branch lengths are bounded below by 0, the estimated branch lengths for any hypothesized tree are much more likely to be larger than the true branch lengths. The SOWH test is thus simulated from a tree with more "structure" than the true tree. The level of uncertainty reflected in the estimated distribution of the likelihood ratio statistic is less than it would have been if the actual branch lengths were used in simulations. The consequence, as the simulations confirm, is undercoverage when true branch lengths are close to 0. While these are cases where the SOWH test will not perform well, phylogenies near the star phylogeny do not usually arise in real problems of biological interest.
Effects of Factors
Sequence length, tree shape, and branch lengths are the factors that influence the difficulty of the phylogenetic analysis. We present their effects based on the simulation results of small number of taxa.
As expected, increasing the sequence length enables us to estimate the tree more precisely and obtain smaller confidence regions. The average mean size of the confidence regions with sequence length of 1,000 is almost three trees less than with sequence length of 500 as in figure 3.
Tree shape is determined by the relative positions of long and short branches. The diameters of the three trees are different even though they have the same long and short branch lengths. We found that the sizes of the confidence regions of the data sets simulated from the left two trees in figure 1 were smaller than the right one. Also, it shows that long-branch attraction does not happen in our simulations. For example, tree C arose in 94 of the confidence regions obtained through simulation. Trees that differed from C by a single NNI appeared 40–46 times. The trees with long branches together occurred 18, 25, and 24 times, which was consistent with the numbers of times (18–28) that other trees differing from C by more than one NNI appeared.
The branch length gives the expected number of substitutions per site. In order to sort out the effect of the relative branch lengths, it is instructive to look at the results for trees I and K, which have the same diameter (1.8) and the same long branch lengths, but short branch lengths of 0.03 for I and 0.06 for K. The mean sizes of the 95% confidence regions from the SOWH test with sequence length of 500 were 6.6 and 3.71, respectively. (The coverages of the SOWH test for these two trees were both larger than 95%.) Obviously, the relatively longer short branches in tree K were the reason for the smaller size. We enlarged the short branches of the tree structure on the right in figure 1 and listed the values of the PMLT in table 5. When the tree structure became fully symmetric with 0.9 as the short branch length, the PMLT went up to 0.99 even though the diameter increased to 3.6. When we broke the symmetry by increasing the lengths of branches a, d, e, f, and g beyond 0.9, the PMLT decreased. This phenomenon is consistent with the conclusion drawn from tree shape.
Table 5 The Values of the PMLT When the Short Branch Lengths Increase for the Tree Structure on the Right in Figure 1
Parametric Versus Nonparametric Bootstrap
There are two kinds of bootstrap: parametric and nonparametric. In phylogeny, parametric bootstrap requires a substitution model along with the estimated parameters. If the model is misspecified, this may introduce bias into the bootstrap results (Buckley 2002). Nonparametric bootstrap is more popular because it is more directly based on the data (site likelihood here) rather than on the parametric model.
The 95th percentiles of the SDNB and SDPB bootstrap distributions of the likelihood ratio statistic estimate the 95th percentile of the actual distribution. For a given true tree, the 95th percentile of the actual distribution can be approximated as the 95th percentile of likelihood ratios obtained by repeatedly simulating data sets from the true tree. We did this for three trees at different levels of PMLT: 0.94 (tree A), 0.63 (tree C), and 0.25 (tree L). In each case, 5,000 data sets with sequence length of 500 were generated. This gave actual 95th percentiles 0.257, 1.39, and 2.565 for the three trees.
To investigate the biases and variances due to the differing forms of bootstrapping in SDNB and SDPB, we selected 100 data sets from the 5,000 and applied the procedures with RELL resampling. For each tree and both the SDNB and SDPB tests, this gave 100 estimated 95th percentiles. The results are in table 6.
Table 6 The Means and the SDs of the Cutoff Values in the SDNB, SDNB with RELL, and SDPB Tests
The standard deviations (SDs) reflect two sources of variation: the bootstrap method and the number of bootstrap replicates. Because we are interested in the former, the same number (100) of bootstrap replicates was used for all three techniques to make them comparable. The table shows that the parametric bootstrap provides smaller cutoffs than the nonparametric bootstrap. As the PMLT decreases, the "true cutoff" increases. Although the cutoffs from the SDPB test increase as well, their mean underestimates the true cutoff for tree L. Conversely, the SDNB test provides a better estimate in this situation, even though it overestimates the true cutoffs for trees A and C. Whether the cutoff is overestimated or underestimated, the SD of the nonparametric bootstrap is always larger than that of the parametric one; that is, the parametric bootstrap provides more stable cutoffs.
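The comparison behind table 6 can be mimicked in a toy numerical setting. The sketch below uses the sample mean of normal data rather than the phylogenetic likelihood ratio, and all numbers are invented; it only illustrates the pattern of estimating a 95th-percentile cutoff on each of 100 replicate data sets by both kinds of bootstrap and then summarizing the cutoffs by their mean and SD.

```python
import random
import statistics

rng = random.Random(0)

def pct95(xs):
    """Empirical 95th percentile of a list of values."""
    xs = sorted(xs)
    return xs[int(0.95 * (len(xs) - 1))]

def nonparam_cutoff(data, b=100):
    """Cutoff from resampling the observed data with replacement."""
    stats = [statistics.mean(rng.choices(data, k=len(data))) for _ in range(b)]
    return pct95(stats)

def param_cutoff(data, b=100):
    """Cutoff from simulating new samples under a fitted normal model."""
    m, s = statistics.mean(data), statistics.stdev(data)
    stats = [statistics.mean([rng.gauss(m, s) for _ in data]) for _ in range(b)]
    return pct95(stats)

np_cuts, p_cuts = [], []
for _ in range(100):                      # 100 replicate data sets
    data = [rng.gauss(0.0, 1.0) for _ in range(50)]
    np_cuts.append(nonparam_cutoff(data))
    p_cuts.append(param_cutoff(data))

# Summarize cutoff stability for each bootstrap, as table 6 does.
print("nonparametric: mean %.3f SD %.3f"
      % (statistics.mean(np_cuts), statistics.stdev(np_cuts)))
print("parametric:    mean %.3f SD %.3f"
      % (statistics.mean(p_cuts), statistics.stdev(p_cuts)))
```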
Obviously, the performance of the bootstrap methods is related to the PMLT as well. The nonparametric bootstrap tends to be conservative. Note that underestimating or overestimating the true cutoff does not necessarily imply high or low coverage: a small cutoff may be paired with correspondingly small test statistic values. The results in this section only describe the behavior of the cutoffs produced by the two bootstraps.
Conclusion
In evaluating the various test methods, we considered both the coverage and the size of the confidence regions based on simulations. Generally, high coverage comes with large regions, and small regions risk low coverage; a good method gives a small confidence region with reasonable coverage. Overall, the SOWH test performed best in the simulations, with reasonable coverage and relatively small regions. In this paper, however, we examined only the efficiency of the tests without considering violations of the model assumptions. The sensitivity of the SOWH test to model misspecification is the major reason for its lower coverage in practice (Buckley 2002), so we do not recommend it unless there is confidence that the substitution model is an accurate representation of the data. The SDPB test, which is expected to have lower coverage than the SOWH test in the same situation, is not recommended for the same reason. Although the SDNB test loses some power to detect wrong trees in the confidence region because of the nonparametric bootstrap, it is less dependent on the model assumptions, and it avoids the heavy computational burden of the SOWH test when the number of candidate trees is large. We recommend it as a good alternative, giving small confidence regions with reasonable coverage.
Because the SSOWH, LW, and GLS tests performed unstably, they are not recommended for large numbers of taxa. We do not recommend the SH and AU tests for small numbers of taxa because their large region sizes reflect a lack of power to detect wrong trees. As a popular method, the AU test is conservative and sensitive to the set of candidate trees, but it is not seriously misleading for large numbers of taxa; it is a safe choice when coverage must be controlled strictly.
Note that only binary trees were considered as model trees in our simulations. With star trees as the models, all the candidate trees are equally supported by the data. Except for the SH and SSOWH tests, all the tests tend to undercover in this situation, especially the SOWH, SDNB, and SDPB tests.
Performance under model misspecification is not discussed in this paper either. In such situations, the performance of these methods depends on how far the substitution model is from reality. Model misspecification is an important issue, but it involves a whole span of other issues that take us beyond the scope of this paper.
Appendix Tree Structures in Figure 2
The trees are given in Newick format, as used within NEXUS files, and can be viewed in PAUP*.
10-Taxon Tree
(1: 0.013830, 2: 0.015832, (3: 0.029746, ((4: 0.017846, 5: 0.014022): 0.032036, (6: 0.047509, (7: 0.057006, (8: 0.026139, (9: 0.021788, 10: 0.032609): 0.001234): 0.038943): 0.001426): 0.006973): 0.004049): 0.017251);
15-Taxon Tree
(1: 0.013935, 2: 0.015366, (3: 0.033192, ((4: 0.023817, 5: 0.015218): 0.038405, (6: 0.042860, (7: 0.056182, (((8: 0.023098, 9: 0.024827): 0.000432, 10: 0.023845): 0.015672, ((((11: 0.005724, 12: 0.006404): 0.001903, 13: 0.005464): 0.028134, 14: 0.037194): 0.001776, 15: 0.047241): 0.023468): 0.015238): 0.001294): 0.004894): 0.004605): 0.015000);
20-Taxon Tree
(1: 0.013823, 2: 0.015819, (3: 0.029629, ((4: 0.017817, 5: 0.014036): 0.031956, (6: 0.047519, (7: 0.057420, ((8: 0.026198, (9: 0.022117, 10: 0.032245): 0.001317): 0.015635, (((((11: 0.004742, 12: 0.007728): 0.000829, 13: 0.005867): 0.035062, (14: 0.043034, 15: 0.046981): 0.001445): 0.017608, 16: 0.059090): 0.001133, ((17: 0.072962, 18: 0.051127): 0.011258, (19: 0.020931, 20: 0.013127): 0.029135): 0.013228): 0.000914): 0.022642): 0.001226): 0.006938): 0.003981): 0.017363);
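As a quick sanity check on the trees above, a small script (hypothetical, not part of the paper) can extract the taxon labels and branch lengths from a Newick string; an unrooted binary tree on n taxa with a trifurcating root should contain 2n - 3 branches. The 10-taxon tree is used here as the example.

```python
import re

# The 10-taxon Newick string from the appendix.
newick_10 = ("(1: 0.013830, 2: 0.015832, (3: 0.029746, ((4: 0.017846, "
             "5: 0.014022): 0.032036, (6: 0.047509, (7: 0.057006, "
             "(8: 0.026139, (9: 0.021788, 10: 0.032609): 0.001234): "
             "0.038943): 0.001426): 0.006973): 0.004049): 0.017251);")

# Leaf labels appear as digits immediately before a colon; internal
# branches are preceded by ')', so this pattern matches only the taxa.
taxa = re.findall(r"(\d+):", newick_10)

# Every branch length follows a colon, for leaves and internal branches.
lengths = [float(x) for x in re.findall(r":\s*([\d.]+)", newick_10)]

print(len(taxa), "taxa,", len(lengths), "branches")  # expect 10 and 17
```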
Acknowledgements
We thank two reviewers for suggestions that improved the presentation of the manuscript. We also thank Andrew Roger for discussions about this paper. The authors are supported by Genome Atlantic and NSERC.
References
Antezana, M. 2003. When being ‘most likely’ is not enough: examining the performance of three uses of the parametric bootstrap in phylogenetics. J. Mol. Evol. 56:198–222.
Buckley, T. 2002. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst. Biol. 51:509–523.
Bulmer, M. 1991. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. 8:868–883.
Cavalli-Sforza, L., and W. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21:550–570.
Efron, B., E. Halloran, and S. Holmes. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 93:13429–13434.
Efron, B., and R. Tibshirani. 1998. The problem of regions. Ann. Stat. 26:1687–1718.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Goldman, N., J. Anderson, and A. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. 49:652–670.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 21:160–174.
Hillis, D., B. Mable, and C. Moritz. 1996. Applications of molecular systematics: the state of the field and a look to the future. Pp. 515–543 in D. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Huelsenbeck, J., and D. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.
Murphy, W., E. Eizirik, W. Johnson, Y. Zhang, O. Ryder, and S. O'Brien. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614–618.
Rambaut, A., and N. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.
Shimodaira, H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51:492–508.
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.
———. 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17:1246–1247.
Sokal, R., and F. Rohlf. 1995. Biometry. 2nd edition. W. H. Freeman, New York.
Strimmer, K., and A. Rambaut. 2001. Inferring confidence sets of possibly misspecified gene trees. Proc. R. Soc. Lond. B 269:137–142.
Susko, E. 2003. Confidence regions and hypothesis tests for topologies using generalized least squares. Mol. Biol. Evol. 20:862–868.
Swofford, D. 2000. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4.0b4a. Sinauer Associates, Sunderland, Mass.
Swofford, D., G. Olsen, P. Waddell, and D. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.
Zwickl, D., and D. Hillis. 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51:588–598.
(Xiaofei Shi, Hong Gu, Edw