The Effect of Single Nucleotide Polymorphism Identification Strategies on Estimates of Linkage Disequilibrium
http://www.100md.com 《分子生物学进展》 (Advances in Molecular Biology), 2003, Issue 2
     * Center for Genome Information, University of Cincinnati, Cincinnati, Ohio

    Human Genetics Center, University of Texas-Houston, Houston, Texas.

    Abstract

    At present there is tremendous interest in characterizing the magnitude and distribution of linkage disequilibrium (LD) throughout the human genome, which will provide the necessary foundation for genome-wide LD analyses and facilitate detailed evolutionary studies. To this end, a human high-density single-nucleotide polymorphism (SNP) marker map has been constructed. Many of the SNPs on this map, however, were identified by sampling a small number of chromosomes from a single population, and inferences drawn from studies using such SNPs may be influenced by ascertainment bias (AB). Through extensive simulations, we have found that AB is a potentially significant problem in estimating and comparing LD within and between populations. Specifically, the magnitude of AB is a function of the SNP discovery strategy, number of chromosomes used for SNP discovery, population genetic characteristics of the particular genomic region considered, amount of gene flow between populations, and demographic history of the populations. We demonstrate that a balanced SNP discovery strategy (where equal numbers of chromosomes are sampled from multiple subpopulations) is the optimal study design for generating broadly applicable SNP resources. Finally, we validate our theoretical predictions by comparing our results to publicly available data from ten genes sequenced in 24 African American and 23 European American individuals.

    Key Words: ascertainment bias • linkage disequilibrium • SNPs • coalescent

    Introduction

    Single nucleotide polymorphisms (SNPs) are rapidly displacing microsatellites and other classes of DNA markers in contemporary genetics research because they are abundant, mutationally stable, and amenable to high-throughput genotyping. These characteristics suggest that SNPs will be an important resource in a diverse array of applications. Arguably, the most promising application of SNPs will be in linkage disequilibrium (LD)-based studies such as disease gene mapping, pharmacogenetics, and testing hypotheses of human evolution. Before LD-based study designs can become a pragmatic genome-wide tool, however, it is necessary to understand how LD is apportioned throughout the genome. To this end, a number of studies have attempted to characterize the magnitude and distribution of "background" LD (i.e., LD between anonymous genetic markers) throughout the genome.

    The prospect of systematically determining genome-wide patterns of background LD has been greatly facilitated by the recent construction of a human high-density SNP marker map. It is important to realize, however, that many of these SNPs were identified by in silico methods that ascertained SNPs from a small number of chromosomes in a limited number of populations. Therefore, inferences drawn from studies using such SNPs may be influenced by ascertainment bias (AB). In fact, several recent studies have demonstrated that the SNP discovery process introduces bias into estimates of various population genetic parameters such as the population migration rate, the population mutation rate, and the population recombination rate.

    To date, the effect of AB on estimates of background LD has not been rigorously investigated, although the problem has been succinctly described in a recent review article. Therefore, the purpose of this article is to demonstrate how the ascertainment strategy of SNP markers affects estimates of LD.

    Methods

    Data Mining

    A Perl script was written to search all submitted SNPs deposited in dbSNP to identify the distribution of SNPs discovered in n chromosomes (where n = 2, 4, 8, 16, and >16). The results presented in this study are based on all submitted SNPs in dbSNP through the first-quarter release in 2002.

    Outline of Study Design

    There are many sources of bias that can potentially affect an estimate of LD. In this article, we restrict our analysis to AB—that is, bias attributable to the ascertainment protocol of SNP markers. Thus, we developed a study design that allows us to specifically characterize the effect of AB on estimates of background LD; it consists of four distinct steps. First, we generated the raw SNP data by simulating a genomic region from a subdivided population with two subpopulations that symmetrically exchange M migrants/generation. Second, we modeled a SNP identification experiment. Here two strategies were considered: (1) a hierarchical approach in which n1 chromosomes from subpopulation 1 are sampled and (2) a balanced design in which an equal number of chromosomes, denoted n1 = n2, are sampled from both subpopulations. Third, we calculated two LD statistics, D'all and D'id. D'all is the average D' between all pairwise SNPs produced in a particular realization of a coalescent simulation, whereas D'id is the average D' between only the SNPs identified in the SNP discovery experiment. Finally, in the fourth step, AB was measured as the mean absolute fractional error (MAFE). These steps are explained in more detail in the paragraphs that follow.

    Step 1: Coalescent Simulations

    The coalescent is a stochastic process that provides a powerful technique for rapidly simulating population genetic data. We used a coalescent model to simulate a genomic region from two subpopulations connected by gene flow whose patterns of sequence variation and LD were influenced by genetic drift, mutation, population demography, recombination, and migration among subpopulations.

    The major parameters of the coalescent simulations were: M = Nm (one-quarter the population migration rate, where N is the effective population size and m is the migration rate/generation), θ = 4Nµ (the population mutation rate, where µ is the mutation rate/locus/generation), and ρ = 4Nc (the population recombination rate, where c is the recombination rate/gamete across the entire region). Migration was modeled according to an island model in which each subpopulation symmetrically exchanges M migrants/generation.
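    The scaled parameters defined above follow directly from N, m, µ, and c. The snippet below is a minimal sketch of that arithmetic; the function name and the per-generation rates passed to it are hypothetical values chosen only so that the results fall inside the simulated range, and they are not taken from the study.

```python
def scaled_parameters(N, m, mu, c):
    """Return (M, theta, rho) for an effective population size N.

    m  : migration rate per generation
    mu : mutation rate per locus per generation
    c  : recombination rate per gamete across the whole region
    """
    M = N * m            # M = Nm
    theta = 4 * N * mu   # theta = 4N*mu, population mutation rate
    rho = 4 * N * c      # rho = 4Nc, population recombination rate
    return M, theta, rho


# Hypothetical inputs (assumptions for illustration only).
M, theta, rho = scaled_parameters(N=10_000, m=1.25e-5, mu=7.5e-5, c=7.5e-5)
print(f"M = {M:.3f}, theta = {theta:.2f}, rho = {rho:.2f}")
```

    Working in scaled units means a single simulated data set stands for every combination of N and per-generation rates that gives the same products.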

    A major objective of simulation studies is to make the study as realistic as possible by carefully selecting parameter values that are consistent with empirical and theoretical data. In many cases, however, exact parameter values are either unknown or, as in our case, vary throughout the genome. We therefore adopted the strategy of performing simulations over a wide range of plausible parameter values culled from the empirical literature. Values of θ and ρ were assumed to be of the same order of magnitude. Specifically, the values of θ and ρ considered were θ = ρ = [1, 2, ..., 12], which is in the range reported by many empirical and theoretical studies. Moreover, the values of M considered were M = [0.0125, 0.125, 1.25], which again are consistent with available empirical data. For each parameter combination, 1000 simulations were performed and analyzed as described below. For each simulation replicate, 200 chromosomes were simulated from each subpopulation.
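    The parameter sweep described above can be enumerated as follows. This is only an illustrative sketch: the constant and function names are hypothetical, and it assumes, as the text states, that θ and ρ take the same value within each combination.

```python
from itertools import product

THETA_RHO_VALUES = range(1, 13)      # theta = rho = 1, 2, ..., 12
M_VALUES = (0.0125, 0.125, 1.25)     # migrants exchanged per generation
REPLICATES = 1000                    # simulations per parameter combination
CHROMOSOMES_PER_SUBPOP = 200         # chromosomes simulated per subpopulation


def parameter_grid():
    """Yield one dict per (theta = rho, M) combination."""
    for value, M in product(THETA_RHO_VALUES, M_VALUES):
        yield {"theta": value, "rho": value, "M": M}


grid = list(parameter_grid())
print(len(grid), "parameter combinations per demographic model")
print(len(grid) * REPLICATES, "simulation replicates per demographic model")
```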

    Furthermore, we considered two demographic models: (1) a model of constant population size and (2) a model of recent population expansion where each subpopulation expanded to 1000 N at 1.1 N generations in the past. If we assume N = 10,000 individuals, this model corresponds to a population expansion approximately 11,000 years ago, and thus models the agricultural revolution.

    Step 2: Model SNP Identification Strategies

    After simulating a genomic region, we modeled a SNP identification experiment. Here we considered two strategies: (1) a hierarchical approach in which chromosomes from a single subpopulation are sampled and (2) a balanced study design in which an equal number of chromosomes from each subpopulation are sampled for SNP discovery. In the hierarchical SNP identification strategy, we sampled n1 = [2, 4, 8, 16] chromosomes from subpopulation 1, whereas in the balanced strategy we sampled n1 = n2 = [2, 4, 8] chromosomes from each subpopulation. In all cases, chromosomes were "aligned" and candidate SNPs were identified. Candidate SNPs were accepted for further analysis if the minor allele frequency was ≥2% in the total sample of 200 chromosomes. This somewhat arbitrary threshold was selected for two reasons: (1) it is consistent with previous frequency-dependent definitions of SNPs and (2) imposing a frequency threshold is necessary to avoid spuriously high estimates of D'.
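    The discovery step just described can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it assumes each chromosome is a list of 0/1 alleles of equal length, and the function names and the `panel_sizes` and `target` arguments are hypothetical.

```python
import random


def discover_snps(panel):
    """Indices of sites that are polymorphic within the discovery panel."""
    if not panel:
        return []
    n_sites = len(panel[0])
    return [s for s in range(n_sites)
            if len({chrom[s] for chrom in panel}) > 1]


def minor_allele_frequency(chromosomes, site):
    """Minor allele frequency of one site in a list of 0/1 chromosomes."""
    freq = sum(chrom[site] for chrom in chromosomes) / len(chromosomes)
    return min(freq, 1 - freq)


def ascertain(subpops, panel_sizes, target, maf_min=0.02, rng=random):
    """Model one SNP identification experiment.

    subpops     : list of subpopulations, each a list of 0/1 chromosomes
    panel_sizes : chromosomes drawn from each subpopulation for discovery,
                  e.g. (4, 0) for a hierarchical design, (2, 2) for balanced
    target      : index of the subpopulation in which candidates are genotyped
    Returns the candidate sites whose minor allele frequency in the target
    subpopulation is at least maf_min.
    """
    panel = []
    for chromosomes, n in zip(subpops, panel_sizes):
        panel.extend(rng.sample(chromosomes, n))
    return [s for s in discover_snps(panel)
            if minor_allele_frequency(subpops[target], s) >= maf_min]


# Toy data (made up): site 1 varies only in the second subpopulation.
subpop1 = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]
subpop2 = [[0, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 0]]
print(ascertain([subpop1, subpop2], (4, 0), target=1))  # hierarchical panel
print(ascertain([subpop1, subpop2], (4, 4), target=1))  # balanced panel
```

    In this toy case a panel drawn only from the first subpopulation cannot discover the site that is polymorphic only in the second, which is the mechanism behind the additional bias in subpopulation 2 under a hierarchical design.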

    Step 3: Calculate LD

    Following SNP identification, we calculated two LD statistics, D'all and D'id. Let xi and yi denote the number of total and identified SNPs in subpopulation i (where i = 1 or 2), respectively. Then, D'all in the ith subpopulation is

    $$D'_{\mathrm{all},i} = \frac{2}{x_i (x_i - 1)} \sum_{j=1}^{x_i - 1} \sum_{k=j+1}^{x_i} D'_{jk}$$

    and D'id in the ith subpopulation is

    $$D'_{\mathrm{id},i} = \frac{2}{y_i (y_i - 1)} \sum_{j=1}^{y_i - 1} \sum_{k=j+1}^{y_i} D'_{jk}$$

    where D'jk is the standardized coefficient of LD between the jth and kth loci. Specifically, denote the alleles at loci j and k as A and a and B and b, respectively. The frequencies of alleles A, a, B, and b are given by PA, Pa, PB, and Pb, and the haplotype frequencies of AB, Ab, aB, and ab gametes are denoted as PAB, PAb, PaB, and Pab, respectively. Thus, D'jk = D/Dmax, where D = PAB Pab - PAb PaB and Dmax = min {PAPb, PaPB} if D > 0 or Dmax = min {PaPb, PAPB} if D < 0.
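    The pairwise statistic defined above can be computed directly from haplotype data. The following is a minimal sketch under stated assumptions: haplotypes are lists of 0/1 alleles, allele 1 plays the role of A (site j) and B (site k), and the function names are hypothetical. The averaging function takes the absolute value of each D', which is an assumption made here so that the average lies between 0 and 1; it also returns 0 when no pair can be formed, mirroring the convention of setting D'id to 0.

```python
from itertools import combinations


def d_prime(haplotypes, j, k):
    """Standardized LD coefficient D' between sites j and k.

    Returns None if either site is monomorphic in the sample.
    """
    n = len(haplotypes)
    p_A = sum(h[j] for h in haplotypes) / n
    p_B = sum(h[k] for h in haplotypes) / n
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        return None
    p_AB = sum(1 for h in haplotypes if h[j] == 1 and h[k] == 1) / n
    D = p_AB - p_A * p_B  # algebraically equal to P_AB*P_ab - P_Ab*P_aB
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return D / d_max


def mean_abs_d_prime(haplotypes, sites):
    """Average |D'| over all pairs of the given sites (0.0 if no usable pair).

    Using |D'| rather than signed D' is an assumption of this sketch.
    """
    values = []
    for j, k in combinations(sites, 2):
        d = d_prime(haplotypes, j, k)
        if d is not None:
            values.append(abs(d))
    return sum(values) / len(values) if values else 0.0


# Made-up sample of four chromosomes typed at three sites.
sample = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0]]
print(d_prime(sample, 0, 1), mean_abs_d_prime(sample, [0, 1, 2]))
```

    Calling `mean_abs_d_prime` once with every polymorphic site and once with only the ascertained sites gives the two quantities that are contrasted in the next step.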

    In words, D'all is the average D' between all pairwise SNPs produced in the coalescent simulations, whereas D'id is the average D' between only the SNPs identified in the SNP discovery experiment. Note that D'all and D'id are calculated in the total number of chromosomes in each subpopulation (200), which eliminates any potential bias attributable to sample size. In some replicates, none of the candidate SNPs identified in a small number of chromosomes was sufficiently polymorphic when genotyped in the entire population (i.e., none had a minor allele frequency ≥2%). When this occurred, D'id was set equal to 0. Although in practice one would likely attempt to discover more SNPs in these circumstances and re-estimate D'id, setting D'id equal to 0 provides a useful theoretical approach for decomposing the total amount of AB into its component parts.

    Step 4: Measure AB

    Ascertainment bias was measured as the mean absolute fractional error (MAFE), which is in general defined as the absolute value of the difference between the theoretical (t) and observed (t̂) values divided by the theoretical value (t). Here, t = D'all and t̂ = D'id. Thus, the AB in subpopulation i is

    $$\mathrm{AB}_i = \frac{1}{n} \sum_{r=1}^{n} \frac{\left| D'_{\mathrm{all},i,r} - D'_{\mathrm{id},i,r} \right|}{D'_{\mathrm{all},i,r}}$$

    where n is the number of simulations (which is 1000 unless otherwise noted) and r indexes the simulation replicates. MAFE varies from 0 to 1, and higher values indicate greater bias. Other measures of bias, such as the mean square error, were calculated but gave essentially the same results. Finally, note that by contrasting D'all and D'id we are asking whether the LD observed between the identified SNPs is representative of this region-at-large.
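    As a small worked illustration of the MAFE definition above, the sketch below averages the fractional error over paired replicates. The function name and the input values are hypothetical, and the code assumes every replicate has D'all > 0.

```python
def mafe(d_all, d_id):
    """Mean absolute fractional error of D'id relative to D'all.

    d_all, d_id : equal-length sequences, one value per simulation replicate.
    Assumes every D'all value is greater than zero.
    """
    if len(d_all) != len(d_id):
        raise ValueError("d_all and d_id must have the same length")
    errors = [abs(t - t_hat) / t for t, t_hat in zip(d_all, d_id)]
    return sum(errors) / len(errors)


# Made-up replicates: the second one recovers D'all exactly,
# the third found no informative SNPs (D'id set to 0).
print(mafe([1.0, 0.8, 0.5], [0.5, 0.8, 0.0]))
```

    A replicate in which no informative SNP is found contributes the maximum error of 1, which is why such replicates weigh heavily on the total.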

    ρ as a Measure of LD

    The effect of AB on ρ as a measure of LD was also studied. Data were simulated and SNPs were discovered as described above. Simulation parameters were set at θ = ρ = 3, M = 0.125, n1 = [2, 4], and n1 = n2 = 2. For each parameter combination, 100 data sets were simulated.

    For each data set, ρ was estimated by a recently developed method that uses importance sampling to approximate the joint likelihood surface of θ and ρ assuming a coalescent model. Parameter estimates are obtained by specifying "driving values" that parameterize the prior distribution of θ and ρ. Here we considered four driving values and simulated 50,000 genealogies/driving value to approximate the likelihood surface of the data and estimate ρ.

    Results

    Current SNP Identification Strategies

    To better understand the current SNP identification strategies, we searched all of the submitted SNPs in dbSNP and found that 64% were discovered in four or fewer chromosomes from a single population. Thus, the possible impact of AB is of great practical importance. Note that although this distribution is only approximate (as there is considerable redundancy among submitted SNPs), it does accurately reflect the current SNP identification strategies.

    Distribution of D'all and D'id

    If AB affects estimates of LD, then the distributions of D'all and D'id will differ. Therefore, we begin by comparing the distributions of D'all and D'id under a hierarchical SNP identification strategy (n1 = 4). Figure 1 clearly demonstrates that D'all and D'id differ and that the shape of these distributions is influenced by M, θ, and ρ. For example, in subpopulation 1 when θ = ρ = 6, approximately 65% of simulations result in values of D'all between 0.90 and 1, whereas only 36% of simulations yield values of D'id in this range. The shift in distribution is even more pronounced in subpopulation 2, where the percentage of simulations in the range of 0.90 to 1 decreases from approximately 63% for D'all to 30% for D'id. A similar pattern is observed for each parameter value presented in figure 1, and thus, in general, AB causes LD to be underestimated. It is also interesting to note that when there was little gene flow between subpopulations (M = 0.125), a significant proportion of replicates resulted in D'id values of 0 in subpopulation 2 under a hierarchical SNP identification strategy. In other words, SNPs that are informative in the population in which they are discovered are not necessarily informative in other populations. This result has important implications for the utility of SNP resources in general, and those constructed by hierarchical study designs in particular (see Discussion). Although comparing the distributions of D'all and D'id provides an intuitive feel for how AB affects measures of LD, we now address this question quantitatively.

    [figure omitted]

    FIG. 1. The empirical distribution of D'all and D'id in subpopulations 1 and 2 as a function of θ, ρ, and M based on 1000 coalescent simulations

    AB Under Hierarchical SNP Identification Strategies

    The effect of AB, measured as the MAFE, under a hierarchical SNP identification strategy is shown in figure 2. Four principal results emerge from figure 2. First, AB in subpopulation 2 (panel B) is in general greater than in subpopulation 1 (panel A). Under a hierarchical study design, the observed AB in subpopulation 1 is due to the sampling variation in estimating LD from SNPs identified in a small number of chromosomes. The AB in subpopulation 2 can be interpreted as AB resulting from sampling variation plus an additional amount attributable to population substructure (i.e., the genetic differences between subpopulations 1 and 2).

    [figure omitted]

    FIG. 2. AB under a hierarchical SNP identification strategy, as a function of θ, ρ, M, and the number of chromosomes sampled from subpopulation 1 (n1 = 2, 4, 8, and 16). AB for subpopulations 1 and 2 is shown in panels A and B, respectively, assuming a constant size demographic model. Each point corresponds to the mean AB bounded by approximate 95% confidence intervals

    Second, the magnitude of AB in subpopulation 2 is inversely proportional to the amount of gene flow between subpopulations, and higher migration rates can significantly attenuate AB. For instance, when n1 = 4 and θ = ρ = 3, AB in subpopulation 2 decreases from 0.42 to 0.12 as M increases from 0.125 to 1.25. Furthermore, the amount of gene flow has a relatively minor influence on AB in subpopulation 1 (i.e., the population in which SNPs were discovered).

    Third, the magnitude of AB varies as a function of θ and ρ. As we demonstrate in the section Decomposing AB, θ and ρ contribute to AB in distinct ways. This result is important because it suggests that the magnitude of AB may vary as a function of the particular genomic locus considered, because θ and ρ likely vary across the genome.

    Fourth, as the number of chromosomes used for SNP identification increases, AB decreases in both subpopulations, although the decline is slower in subpopulation 2. In fact, under extreme cases of population differentiation AB remains uniformly high in subpopulation 2 (data not shown). Although this amount of differentiation is not, in general, characteristic of human populations, it may still be possible that certain genomic regions demonstrate such deep subdivision.

    AB Under Balanced SNP Identification Strategies

    Next, we investigated the effect of AB under a balanced SNP discovery strategy. As expected, the AB under a balanced study design is nearly identical in each subpopulation (see figure 3). For example, when θ = ρ = 1, M = 0.125, and n1 = n2 = 2, the AB in subpopulations 1 and 2 is 0.25 and 0.23, respectively. Furthermore, when comparing the magnitude of AB under hierarchical and balanced SNP identification strategies, it is clear that the latter leads to SNP resources that are broadly applicable across various subpopulations. For example, when θ = ρ = 6, M = 0.125, and four chromosomes are used for SNP identification, AB in subpopulation 2 is reduced from 0.29 under a hierarchical design (n1 = 4) to 0.06 under a balanced design (n1 = n2 = 2) (figures 2 and 3). Importantly, AB in subpopulation 1 is only marginally affected under the two different sampling schemes (hierarchical, 0.07; balanced, 0.05). Intuitively, this result is obvious, because the rationale for a balanced study design is to minimize AB from population substructure. In the absence of subdivision, sampling N chromosomes from each of two potential subpopulations is equivalent to sampling 2N chromosomes from one subpopulation. Although a balanced SNP identification strategy is preferred over hierarchical approaches, it is important to note that AB can still be strong, particularly when a small number of chromosomes are used for SNP discovery.

    [figure omitted]

    FIG. 3. AB under a balanced SNP identification strategy, as a function of θ, ρ, M, and the number of chromosomes sampled from each subpopulation (n1 = n2 = 2, 4, 8). AB for subpopulations 1 and 2 is shown in panels A and B, respectively, assuming a constant size demographic model. Each point corresponds to the mean AB bounded by approximate 95% confidence intervals

    Demographic History and AB

    We have also studied the effect of different demographic histories on the strength of AB. A demographic model incorporating recent population expansion results in considerably greater AB than observed in a constant-sized population. For example, figure 4A shows the distribution of D'all and D'id in subpopulations 1 and 2 under a hierarchical study design. Notice how large the shift in the distribution of D'id is compared to figure 1. Furthermore, figure 4B, which demonstrates how AB in subpopulation 2 varies as a function of θ, ρ, and the number of chromosomes used for SNP identification under a hierarchical study design, is qualitatively similar to that of a constant-sized model (see figure 2, M = 1.25), with AB decreasing as the number of chromosomes used for SNP identification and the values of θ and ρ all increase. However, AB is uniformly larger in demographic models incorporating population expansion.

    [figure omitted]

    FIG. 4. The effect of demographic history on AB. Here we have assumed each subpopulation has undergone a recent population expansion. A, The empirical distribution of D'all and D'id in subpopulations 1 and 2 based on 1000 coalescent simulations. B, AB under a hierarchical SNP identification strategy. The parameters of the simulation were the number of chromosomes used for SNP discovery (n1 = 2, 4, 8, and 16), M = 1.25, θ, and ρ. Each point corresponds to the mean AB bounded by approximate 95% confidence intervals

    Decomposing AB

    Up to this point we have presented a measure of AB that potentially includes multiple sources of bias. More specifically, we hypothesize that AB can affect estimates of LD via two discrete mechanisms. In this section, we attempt to understand the specific mechanisms underlying how AB influences estimates of LD by developing a heuristic model to decompose AB into its component parts. First, AB can influence the probability of identifying informative SNPs for downstream analyses. Here, we narrowly define informative as a SNP that possesses a minor allele frequency greater than some predefined threshold (i.e., 2%). Second, AB can influence the probability of sampling a set of informative SNPs whose patterns of LD are representative of SNPs in the region-at-large. Therefore, to better understand how AB influences measures of LD, we decomposed the total amount of AB into these two component parts.

    Formally, we denote the total amount of AB as ABT, the AB attributable to identifying uninformative SNPs as ABI, and the AB attributable to sampling SNPs whose patterns of LD are unrepresentative of the region-at-large as ABR. If we assume under our heuristic model that ABI and ABR are independent, ABT is simply ABT = ABI + ABR. We estimated ABT by analyzing all simulation replicates regardless of whether informative SNPs were identified (i.e., allowing D'id = 0; see Methods). We estimated ABR by analyzing only the simulation replicates in which informative markers were identified (i.e., excluding replicates where D'id = 0). Finally, we estimated ABI by subtracting ABR from ABT. In summary, ABT is the total amount of bias introduced into an estimate of LD through the process of SNP ascertainment, which can then be partitioned into two component parts ABI and ABR.
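    The decomposition described above can be sketched as follows. This is an illustration under the stated heuristic (ABT = ABI + ABR), not the authors' code: the function name is hypothetical, and the sketch assumes that a D'id of exactly 0 flags a replicate in which no informative SNPs were identified, as set in Methods.

```python
def decompose_ab(d_all, d_id):
    """Return (AB_T, AB_I, AB_R) from paired replicate values of D'all and D'id.

    AB_T : MAFE over all replicates (D'id = 0 allowed)
    AB_R : MAFE over replicates in which informative SNPs were found
    AB_I : AB_T - AB_R
    Assumes D'all > 0 in every replicate and that D'id == 0 marks a replicate
    with no informative SNPs.
    """
    pairs = list(zip(d_all, d_id))
    ab_t = sum(abs(t - t_hat) / t for t, t_hat in pairs) / len(pairs)
    informative = [(t, t_hat) for t, t_hat in pairs if t_hat != 0]
    if informative:
        ab_r = sum(abs(t - t_hat) / t for t, t_hat in informative) / len(informative)
    else:
        ab_r = 0.0
    return ab_t, ab_t - ab_r, ab_r


# Made-up replicates: the first one found no informative SNPs.
ab_t, ab_i, ab_r = decompose_ab([1.0, 1.0, 1.0, 1.0], [0.0, 0.5, 1.0, 1.0])
print(f"AB_T = {ab_t:.3f}, AB_I = {ab_i:.3f}, AB_R = {ab_r:.3f}")
```

    Because ABI is obtained by subtraction, it absorbs whatever part of the total error is not explained by the informative replicates alone.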

    Figure 5 shows the contribution of ABI and ABR to ABT over a broad range of values for θ and ρ for both constant and recently expanded demographic models. Several interesting observations emerge from these graphs. First, for both demographic models, the relationships of ABI and ABR to θ and ρ are diametrically opposed. Specifically, ABR increases and ABI decreases as θ and ρ increase. Therefore, for small values of θ and ρ, ABI dominates ABT, whereas for larger values of θ and ρ, ABR dominates ABT. Second, the magnitude of ABI, and to a lesser extent ABR, is higher for the recently expanded demographic model than for the constant size model. This observation is attributable to a higher proportion of uninformative (i.e., rare) SNPs in expanded populations (data not shown).

    [figure omitted]

    FIG. 5. Decomposing ABT into its two component parts ABR and ABI. Panels A and B correspond to subpopulations 1 and 2, respectively. The parameters were the demographic model, M = 1.25, and n1 = 4

    To better characterize the relationship of θ and ρ to ABR and ABI, we fixed one of the two parameters to 1, 4, 8, and 12 and, for each value, allowed the other to vary between 0.25, 0.5, 1, and 2 times that value. For each parameter combination, ABR and ABI were calculated, and the results are summarized in figure 6. For all ratios of the two parameters, ABR and ABI follow opposite trends. For example, consider the ratio 0.25: as the parameter values increase, the magnitude of ABR increases and that of ABI decreases. A similar pattern was observed in subpopulation 1 (data not shown). Overall, the data in figures 5 and 6 demonstrate that ρ primarily determines the magnitude of ABR, whereas θ dictates the magnitude of ABI.

    [figure omitted]

    FIG. 6. The contribution of θ and ρ to ABR and ABI in subpopulation 2. Here, we fixed one parameter to 1, 4, 8, and 12 and for each value allowed the other to vary between 0.25, 0.5, 1, and 2 times that value. For example, in the category with ratio 0.25, as the fixed parameter increases from 1, 4, 8, to 12, the other increases from 0.25, 1, 2, to 3. The remaining parameters were set to M = 1.25, constant size demographic model, and n1 = 8

    ρ as a Measure of LD

    It has not escaped our attention that when considering LD of a genomic region, simply averaging multiple pairwise D' values has some obvious shortcomings. This has also been recognized in the recent literature, and one suggestion has been to use ρ, the population recombination rate, as a measure of LD, because it provides a summary statistic that simultaneously incorporates information from multiple SNP markers. Therefore, to confirm and extend our previous results we have also investigated the effect of AB on estimates of ρ. Like the previous results using D' as a measure of LD, the effect of AB in subpopulation 2 is more pronounced under a hierarchical study design than under a balanced sampling strategy. Overall, this finding suggests that the effect of AB on LD is a general phenomenon and not specific to D'.

    [table omitted]

    Table 1 The Effect of AB on ρ as a Measure of LD.

    Empirical Validation of Simulation Results

    Finally, we tested our theoretical predictions by comparing our results to publicly available data from ten genes sequenced in 24 African American and 23 European American individuals from the SeattleSNPs project (accessed June 2002). For each gene, we modeled a hierarchical SNP discovery experiment by sampling 2 or 4 chromosomes from the European population (n1 = 2, 4) and a balanced study design by identifying SNPs in four chromosomes from each population (n1 = n2 = 4). Next, we calculated D'all for both populations by analyzing all pairwise SNPs discovered in the total sample of 94 chromosomes and D'id by analyzing only the SNPs that were identified by using either 2, 4, or 8 chromosomes for SNP discovery. For each gene, 1000 replicates were performed for each SNP discovery strategy, and the average AB (measured as MAFE) is summarized in table 2.

    [table omitted]

    Table 2 Estimates of AB Modeled in Empirical Data.

    Table 2 is broadly consistent with our simulated results. For example, AB decreases as the number of chromosomes used for SNP identification increases. Moreover, under a hierarchical study design, AB is generally lower for the population in which SNPs are discovered. Furthermore, the magnitude of AB varies considerably across these ten genes, which may reflect variation in the population genetic characteristics (i.e., θ and ρ) of their genomic regions. It is interesting to note that the average AB across the three study designs is nearly identical to the simulated data assuming M = 1.25, θ = 2, and ρ = 2. Thus, we believe that our simulations capture the essential features describing how AB of SNP markers affects estimates of LD. In summary, table 2 clearly demonstrates that discovering SNPs in these ten genes using 2 or 4 chromosomes would have provided an insufficient catalog of sequence variation from which to provide an accurate description of patterns of LD.

    Discussion

    Through extensive simulations, we have found that AB is a potential problem in attempts to estimate background LD. The magnitude of AB is a function of the SNP identification strategy, number of chromosomes used for SNP discovery, population genetic characteristics of the particular genomic region considered, amount of gene flow between populations, and demographic history of the populations. In general, because balanced study designs minimize AB due to population substructure, balanced SNP identification strategies are preferred over hierarchical approaches if the goal is to construct a broadly applicable SNP resource. However, if a small number of chromosomes are used for SNP identification, AB may still be high.

    In addition, we decomposed the total amount of AB (ABT) into two component parts: ABI and ABR, which capture different aspects of the data. Recall that ABI is the AB attributable to identifying uninformative SNPs and ABR is the AB attributable to sampling SNPs unrepresentative of the region-at-large. Therefore, our finding that the magnitude of ABI is dictated by θ is intuitive, as one would expect the level of sequence variation to mediate the probability of identifying informative SNPs. Previous studies have also suggested that a significant fraction of candidate SNPs identified in a small number of chromosomes will be uninformative, although to our knowledge we are the first to address this question in a subdivided population.

    In contrast, ABR provides an estimate of AB conditional on the SNPs being informative. Hence, it is not surprising that the magnitude of ABR is determined primarily by, and is proportional to, ρ. More specifically, ABR increases as ρ increases, because the sampling variation of LD is greater for larger values of ρ. Thus, the greater variability in LD increases the probability of sampling SNPs whose patterns of LD are not representative of the region-at-large.

    It is important to note that in our heuristic model to decompose ABT we assumed ABI and ABR to be independent. This assumption may not be completely accurate, however, as θ and ρ are positively correlated, and it warrants further investigation. Nonetheless, in the context of our heuristic model, we believe that our assumption that ABI and ABR are independent is a reasonable approximation that allows a deeper understanding of how population genetic parameters contribute to AB.

    The results presented here are subject to several limitations. For instance, we have focused on how AB affects a commonly used measure of LD, D', although other measures exist. Nevertheless, we also have demonstrated that AB affects estimates of ρ. We therefore believe that our results are general and capture the important details of how AB affects inferences of LD. In future studies, it would be interesting to investigate how the magnitude of AB varies as a function of the particular statistic used to estimate LD.

    Moreover, the simulation model is an obvious simplification of human population history. Specifically, we have assumed an island model of population structure, which posits a constant and symmetrical migration rate between subpopulations. Obviously, patterns of human migration are more complex, although it is difficult to predict systematically how these deviations affect our results. It is interesting to note, however, how well our simple model fits the empirical data.

    Furthermore, few empirical data are available regarding estimates of M between human populations. Available data from autosomal DNA and mitochondrial DNA suggest that M is approximately 1, with one study reporting a 95% confidence interval of 0.61–1.43. However, many estimates of M are based primarily on FST, which has been criticized as an unreliable method for inference that is accurate only to within a few orders of magnitude.

    Moreover, as FST = 1/(1 + 4M), it is plausible that estimates of FST may vary across the genome (for example, selection could result in regionally restricted changes in N), which may lead to a nonuniform distribution of AB. In other words, some genomic regions may be strongly affected by AB, whereas others may be minimally affected, a phenomenon that can be observed in the empirical data. The Duffy blood group locus is a particularly good example of a genomic region where AB may be particularly strong because of natural selection. The Fy*O allele is nearly fixed in sub-Saharan African populations but is rare outside Africa, leading to the largest observed FST of any allele in humans.
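    To give a feel for the island-model relation FST = 1/(1 + 4M) mentioned above, the sketch below evaluates it at the three migration rates used in the simulations and inverts it to recover M. The function names are hypothetical, and the relation is applied exactly as quoted, with no correction for the number of subpopulations.

```python
def fst_island(M):
    """Expected FST under the island model, FST = 1 / (1 + 4M)."""
    return 1.0 / (1.0 + 4.0 * M)


def m_from_fst(fst):
    """Invert the island-model relation: M = (1/FST - 1) / 4."""
    return (1.0 / fst - 1.0) / 4.0


for M in (0.0125, 0.125, 1.25):
    print(f"M = {M}: FST = {fst_island(M):.3f}")
```

    Under this relation the simulated migration rates span FST values from roughly 0.17 (M = 1.25) to roughly 0.95 (M = 0.0125), so a small change in FST near either extreme corresponds to a large change in the inferred M.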

    Recently, a series of articles has suggested that LD is arranged in blocklike structures such that limited haplotype diversity is observed within a block . Hence, it would be of tremendous interest to systematically identify haplotype blocks throughout the genome. To this end, a large-scale project is under consideration . Our results have important implications for the design of studies that construct a genome-wide haplotype map. For example, our data suggest that identifying informative SNPs from the currently available collection of markers may be problematic (see ABI in ). Consistent with this hypothesis, , in their study of a European population, remarked that most SNP markers in dbSNP proved to be insufficient for identifying haplotype blocks. Nevertheless, our results are encouraging. If haplotype blocks are determined primarily by limited recombination (as opposed to demographic history), then our data demonstrate that AB within a haplotype block may be small, because ABR is minimal for small values of ρ . However, in regions outside of haplotype blocks, ABR may be quite high, because ρ will be large.

    Although AB may complicate accurate parameter estimation with the available SNP markers, it is important to realize that approaches exist to account for and correct this bias. For example, if the SNP ascertainment strategy is known and appropriately modeled, analytical methods have been developed to accurately estimate and M . In the context of LD studies, there is a crucial need to move beyond pairwise LD measures and develop statistics to estimate the overall LD of a genomic region. Future work in this area should pay close attention to the aforementioned methods that allow AB to be modeled and corrected. As an example, demonstrated that AB could be corrected for in estimates of , which we used as an overall measure of LD .
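
    To illustrate what modeling an ascertainment strategy can involve, the sketch below computes the probability that a biallelic site appears polymorphic in a discovery panel, assuming chromosomes are drawn independently from each subpopulation. This is a textbook binomial argument, not the correction method of the studies cited above, and the allele frequencies and panel sizes are hypothetical.

```python
from typing import Sequence


def discovery_probability(freqs: Sequence[float], panel_sizes: Sequence[int]) -> float:
    """Probability that both alleles of a site are seen in a discovery panel.

    freqs[i]:       frequency of one allele in subpopulation i
    panel_sizes[i]: number of chromosomes sampled from subpopulation i
    """
    if len(freqs) != len(panel_sizes):
        raise ValueError("freqs and panel_sizes must have the same length")
    if any(n < 0 for n in panel_sizes) or sum(panel_sizes) < 1:
        raise ValueError("the panel must contain at least one chromosome")
    if any(not 0.0 <= p <= 1.0 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")

    only_first = 1.0   # every sampled chromosome carries the first allele
    only_second = 1.0  # every sampled chromosome carries the second allele
    for p, n in zip(freqs, panel_sizes):
        only_first *= p ** n
        only_second *= (1.0 - p) ** n

    # The site is discovered unless the panel is monomorphic.
    return 1.0 - only_first - only_second


# Hypothetical allele that is common in one subpopulation and rare in the other.
freqs = (0.99, 0.01)
single = discovery_probability(freqs, (0, 60))     # all 60 chromosomes from one subpopulation
balanced = discovery_probability(freqs, (30, 30))  # same total, split equally
print(f"single-population panel: {single:.3f}")
print(f"balanced panel:          {balanced:.3f}")
```

    With the same total number of chromosomes, the balanced panel is far more likely to detect alleles whose frequencies differ sharply between subpopulations, which is the intuition behind a balanced discovery strategy.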

    Conclusion

    In conclusion, our data clearly demonstrate that AB is an important problem to consider when designing and interpreting studies that estimate background LD. Although we have presented our results in the context of human populations, our data and conclusions are general and apply to the construction of SNP resources for other model organisms . We recommend that future large-scale SNP discovery efforts adopt a balanced SNP identification strategy and sample a large number of chromosomes from multiple subpopulations (estimated to be a minimum of 60 chromosomes by ). Finally, we suggest that the available SNP resources should be regarded as an exciting beginning and not an end.

    Acknowledgements

    We gratefully acknowledge John Wakeley for helpful comments related to this work and critical reading of the manuscript.

    Literature Cited

    Abecasis, G. R., E. Noguchi, A. Heinzmann, et al. (9 co-authors). 2001. Extent and distribution of linkage disequilibrium in three genomic regions. Am. J. Hum. Genet. 68:191-197.

    Akey, J., L. Jin, and M. Xiong. 2001. Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet. 9:291-300.

    Akey, J. M., K. Zhang, M. Xiong, P. Doris, and L. Jin. 2001. The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am. J. Hum. Genet. 68:1447-1456.

    Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin, L. Linton, and E. S. Lander. 2000. A SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513-516.

    Beerli, P., and J. Felsenstein. 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. USA 98:4563-4568.

    Brookes, A. J. 1999. The essence of SNPs. Gene 234:177-186.

    Cho, R. J., M. Mindrinos, D. R. Richards, et al. (18 co-authors). 1999. Genome-wide mapping with biallelic markers in Arabidopsis thaliana. Nat. Genet. 23:203-207.

    Clark, A. G., K. M. Weiss, D. A. Nickerson, et al. (11 co-authors). 1998. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet. 63:595-612.

    Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29:229-232.

    Eberle, M. A., and L. Kruglyak. 2000. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet. Epidemiol. 19:S29-S35.

    Fearnhead, P., and P. Donnelly. 2001. Estimating recombination rates from population genetic data. Genetics 159:1299-1318.

    Fu, Y. X., and W. H. Li. 1999. Coalescing into the 21st century: an overview and prospects of coalescent theory. Theor. Popul. Biol. 56:1-10.

    Goddard, K. A., P. J. Hopkins, J. M. Hall, and J. S. Witte. 2000. Linkage disequilibrium and allele-frequency distributions for 114 single-nucleotide polymorphisms in five populations. Am. J. Hum. Genet. 66:216-234.

    Hamblin, M. T., and A. Di Rienzo. 2000. Detection of the signature of natural selection in humans: evidence from the Duffy blood group locus. Am. J. Hum. Genet. 66:1669-1679.

    Hamblin, M. T., E. E. Thompson, and A. Di Rienzo. 2002. Complex signatures of natural selection at the Duffy blood group locus. Am. J. Hum. Genet. 70:369-383.

    Hudson, R. R. 1993. The how and why of generating gene genealogies. Pp. 23–36 in N. Takahata and A. G. Clark, eds. Mechanisms of molecular evolution. Japan Scientific Societies, Tokyo.

    Irizarry, K., V. Kustanovich, C. Li, N. Brown, S. Nelson, W. Wong, and C. J. Lee. 2000. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet. 26:233-236.

    Jeffreys, A. J., L. Kauppi, and R. Neumann. 2001. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29:217-222.

    Johnson, G. C., L. Esposito, B. J. Barratt, et al. (18 co-authors). 2001. Haplotype tagging for the identification of common disease genes. Nat. Genet. 29:233-237.

    Kidd, J. R., A. J. Pakstis, H. Zhao, et al. (12 co-authors). 2000. Haplotypes and linkage disequilibrium at the phenylalanine hydroxylase locus, PAH, in a global representation of populations. Am. J. Hum. Genet. 66:1882-1899.

    Kuhner, M. K., P. Beerli, J. Yamato, and J. Felsenstein. 2000. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics 156:439-447.

    Lewontin, R. C. 1964. The interaction of selection and linkage. I. General considerations: heterotic models. Genetics 49:49-67.

    Lindblad-Toh, K., E. Winchester, M. J. Daly, et al. (15 co-authors). 2000. Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse. Nat. Genet. 24:381-386.

    Long, A. D., and C. H. Langley. 1999. The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res. 9:720-731.

    Marth, G. T., I. Korf, M. D. Yandell, R. T. Yeh, Z. Gu, H. Zakeri, N. O. Stitziel, L. Hillier, P. Y. Kwok, and W. R. Gish. 1999. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23:452-456.

    Mateu, E., F. Calafell, O. Lao, B. Bonne-Tamir, J. R. Kidd, A. Pakstis, K. K. Kidd, and J. Bertranpetit. 2001. Worldwide genetic analysis of the CFTR region. Am. J. Hum. Genet. 68:103-117.

    Moffatt, M. F., J. A. Traherne, G. R. Abecasis, and W. O. Cookson. 2000. Single nucleotide polymorphism and linkage disequilibrium within the TCR alpha/delta locus. Hum. Mol. Genet. 9:1011-1019.

    Mullikin, J. C., S. E. Hunt, C. G. Cole, et al. (40 co-authors). 2000. An SNP map of human chromosome 22. Nature 407:516-520.

    Nachman, M. W. 2001. Single nucleotide polymorphisms and recombination rate in humans. Trends Genet. 17:481-485.

    Nielsen, R. 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154:931-942.

    Nordborg, M., and S. Tavaré. 2002. Linkage disequilibrium: what history has to tell us. Trends Genet. 18:83-90.

    Patil, N., A. J. Berno, D. A. Hinds, et al. (19 co-authors). 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719-1723.

    Pritchard, J. K., and M. Przeworski. 2001. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69:1-14.

    Robertson, D. 2001. Racially defined haplotype project debated. Nat. Biotechnol. 19:795-796.

    Sachidanandam, R., D. Weissman, S. C. Schmidt, et al. (38 co-authors). 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928-933.

    Santos, E. J., J. T. Epplen, and C. Epplen. 1997. Extensive gene flow in human populations as revealed by protein and microsatellite DNA markers. Hum. Hered. 47:165-172.

    Seielstad, M. T., E. Minch, and L. L. Cavalli-Sforza. 1998. Genetic evidence for a higher female migration rate in humans. Nat. Genet. 20:278-280.

    Taillon-Miller, P., I. Bauer-Sardina, N. L. Saccone, J. Putzel, T. Laitinen, A. Cao, J. Kere, G. Pilia, J. P. Rice, and P. Y. Kwok. 2000. Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nat. Genet. 25:324-328.

    Taillon-Miller, P., Z. Gu, Q. Li, L. Hillier, and P. Y. Kwok. 1998. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8:748-754.

    Templeton, A. R., A. G. Clark, K. M. Weiss, D. A. Nickerson, E. Boerwinkle, and C. F. Sing. 2000. Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am. J. Hum. Genet. 66:69-83.

    Tishkoff, S. A., E. Dietzsch, W. Speed, A. J. Pakstis, J. R. Kidd, K. Cheung, B. Bonne-Tamir, A. S. Santachiara-Benerecetti, P. Moral, and M. Krings. 1996. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271:1380-1387.

    Wakeley, J. 1999. Nonequilibrium migration in human history. Genetics 153:1863-1871.

    Wakeley, J., R. Nielsen, S. N. Liu-Cordero, and K. Ardlie. 2001. The discovery of single-nucleotide polymorphisms and inferences about human demographic history. Am. J. Hum. Genet. 69:1332-1347.

    Weiss, K. M., and A. G. Clark. 2002. Linkage disequilibrium and the mapping of complex human traits. Trends Genet. 18:19-24.

    Whitlock, M. C., and D. E. McCauley. 1999. Indirect measures of gene flow and migration: FST not equal to 1/(4Nm + 1). Heredity 82:117-125.

    Wilson, J. F., M. E. Weale, A. C. Smith, F. Gratrix, B. Fletcher, M. G. Thomas, N. Bradman, and D. B. Goldstein. 2001. Population genetic structure of variable drug response. Nat. Genet. 29:265-269.

    Yang, Z., G. K. Wong, M. A. Eberle, M. Kibukawa, D. A. Passey, W. R. Hughes, L. Kruglyak, and J. Yu. 2000. Sampling SNPs. Nat. Genet. 26:13-14.

    Yu, A., C. Zhao, Y. Fan, W. Jang, A. J. Mungall, P. Deloukas, A. Olsen, N. A. Doggett, N. Ghebranious, K. W. Broman, and J. L. Weber. 2001a. Comparison of human genetic and sequence-based physical maps. Nature 409:951-953.

    Yu, N., Z. Zhao, Y. X. Fu, N. Sambuughin, M. Ramsay, T. Jenkins, E. Leskinen, L. Patthy, L. B. Jorde, T. Kuromori, and W. H. Li. 2001b. Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol. Biol. Evol. 18:214-222.

    Zavolan, M., and T. B. Kepler. 2001. Statistical inference of sequence-dependent mutation rates. Curr. Opin. Genet. Dev. 11:612-615.

    Zollner, S., and A. von Haeseler. 2000. A coalescent approach to study linkage disequilibrium between single-nucleotide polymorphisms. Am. J. Hum. Genet. 66:615-628.

    Accepted for publication October 7, 2002. (Joshua M. Akey, Kun Zhang, Momiao Xiong, and Li Jin)